MedEmbed: finetuned embedding models for medical information retrieval
3️⃣ Three embedding models: small, base, and large
📈 Strong/SOTA performance for their size on common medical benchmarks like TRECCOVID, MedicalQARetrieval, PublicHealthQA, NFCorpus, and ArguAna
⚖️ A permissive Apache 2.0 license allowing for commercial use
Training recipe:
1. 1,000s of clinical/medical notes from PubMed Central,
2. expanded into 10,000s of synthetic pairs using LLaMA 3.1 70B,
3. expanded into 100,000s of training triplets via hard-negative mining with Jina AI's jina-embeddings-v3,
4. contrastive training with in-batch negatives, akin to MultipleNegativesRankingLoss (https://lnkd.in/e2bkUXns)
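Step 3 above, hard-negative mining, can be sketched in a few lines of NumPy. This is a minimal illustration, not the actual MedEmbed pipeline: rank corpus documents by similarity to the query, then keep the top-scoring ones that are clearly below the positive's score (the `margin` filter, which is my own assumption here, guards against mining unlabeled positives as negatives).

```python
import numpy as np

def mine_hard_negatives(query_emb, pos_emb, corpus_embs, margin=0.05, k=3):
    """Return indices of the k corpus docs most similar to the query,
    skipping any whose score is within `margin` of the positive's score
    (those are likely unlabeled positives, not true negatives)."""
    pos_score = float(query_emb @ pos_emb)
    scores = corpus_embs @ query_emb          # cosine similarity if inputs are L2-normalized
    ranked = np.argsort(scores)[::-1]         # most similar first
    hard = [int(i) for i in ranked if scores[i] < pos_score - margin]
    return hard[:k]
```

Each mined index then yields one (query, positive, hard negative) training triplet.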
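Step 4, contrastive training with in-batch negatives, boils down to a softmax cross-entropy over the batch similarity matrix: for query i, the matched passage i is the target class and every other passage in the batch acts as a negative. A minimal NumPy sketch of that loss (the `scale` factor plays the role of an inverse temperature; 20.0 is an assumed value here):

```python
import numpy as np

def in_batch_negatives_loss(query_embs, pos_embs, scale=20.0):
    """MNRL-style loss: row i of pos_embs is the positive for query i,
    all other rows in the batch serve as negatives. Inputs are
    L2-normalized (batch, dim) arrays; returns the mean cross-entropy."""
    sims = scale * (query_embs @ pos_embs.T)   # (batch, batch) scaled cosine sims
    sims -= sims.max(axis=1, keepdims=True)    # numerical stability for softmax
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs))) # target class = matched pair (diagonal)
```

Well-matched pairs drive the loss toward zero, while random pairings sit near log(batch_size), which is why larger batches give a harder, more informative contrastive signal.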
Blogpost: https://lnkd.in/eZnNMtcE
Models: https://lnkd.in/e4zPHJus
https://huggingface.co/blog/abhinand/medembed-finetuned-embedding-models-for-medical-ir