Model2Vec

https://github.com/MinishLab/model2vec

Model2Vec is a technique to turn any sentence transformer into a really small, fast model, reducing model size by 15x and making the models up to 500x faster, with a small drop in performance. See our results here, or dive in to see how it works.

Main Features

Model2Vec is:

  • Small: reduces the size of a Sentence Transformer model by a factor of 15, from 120M parameters down to 7.5M (30 MB on disk), making it the smallest model on MTEB!
  • Static, but better: smaller than GloVe and BPEmb, but much more performant, even with the same vocabulary.
  • Fast distillation: make your own model in 30 seconds.
  • Fast inference: up to 500 times faster on CPU than the original model. Go green or go home.
  • No data needed: Distillation happens directly on the token level, so no dataset is needed.
  • Simple to use: an easy-to-use interface for distillation and inference.
  • Integrated into Sentence Transformers: Model2Vec can be used directly in Sentence Transformers.
  • Bring your own model: Can be applied to any Sentence Transformer model.
  • Bring your own vocabulary: Can be applied to any vocabulary, allowing you to use your own domain-specific vocabulary. Need biomedical? Just grab a medical dictionary and a biomedical model, and distill your own.
  • Multi-lingual: Use any language. Need a French model? Pick one. Need multilingual? Here you go.
  • Tightly integrated with the HuggingFace hub: easily share and load models from the HuggingFace hub, using the familiar from_pretrained and push_to_hub (see the sketch after this list). Our own models can be found here. Feel free to share your own.
  • Easy Evaluation: evaluate your models on MTEB and some of our own tasks to measure the performance of the distilled model. Model2Vec models work out of the box on MTEB.
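
A minimal sketch of hub loading and inference, as mentioned in the list above (the model id is one of our published models; the sentences and the repo id in the comment are illustrative):

```python
from model2vec import StaticModel

# Load a distilled model from the HuggingFace hub
model = StaticModel.from_pretrained("minishlab/M2V_base_output")

# Encode some sentences; each embedding is a single static vector
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])

# Share your own distilled model on the hub (repo id is illustrative)
# model.push_to_hub("my-username/my-model2vec-model")
```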

What is Model2Vec?

Model2Vec creates a small, fast, and powerful model that outperforms other static embedding models by a large margin on all tasks we could find, while being much faster to create than traditional static embedding models such as GloVe. Like BPEmb, it can create subword embeddings, but with much better performance. Best of all, you don't need any data to distill a model using Model2Vec.

It works by passing a vocabulary through a sentence transformer model, reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using Zipf weighting. During inference, we simply take the mean of all token embeddings occurring in a sentence.
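
As a toy illustration of that inference step (the names and numbers here are made up for the sketch, not the library's internals):

```python
import numpy as np

# Hypothetical static embedding table: one PCA-reduced, Zipf-weighted
# vector per token in the vocabulary (vocab_size x dims).
embedding_table = np.random.rand(30_000, 256).astype(np.float32)

def encode(token_ids: list[int]) -> np.ndarray:
    """A sentence embedding is simply the mean of its token embeddings."""
    return embedding_table[token_ids].mean(axis=0)

sentence_embedding = encode([101, 2023, 2003, 1037, 7099, 102])  # toy token ids
```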

Model2Vec has three modes (a sketch of how to select them follows this list):

  • Output: behaves much like a real sentence transformer, i.e., it uses a subword tokenizer and simply encodes all wordpieces in its vocab. This is really quick to create (30 seconds on a CPU), very small (30 MB in float32), but might be less performant on some tasks.
  • Vocab (word level): creates a word-level tokenizer and only encodes words that are in the vocabulary. This is a bit slower to create and produces a larger model, but might be more performant on some tasks. Note that this model can go out of vocabulary, which might be a problem if your domain is very noisy.
  • Vocab (subword): a combination of the two methods above. In this mode, you can pass your own vocabulary, but it also uses the subword vocabulary to create representations for words not in the passed vocabulary.
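
Roughly, the modes map onto the distill arguments like this (a sketch; the use_subword flag is our reading of the API, so treat it as an assumption, and the model id and vocabulary file are illustrative):

```python
from model2vec.distill import distill

# One word per line, e.g. a domain-specific word list
with open("vocabulary.txt") as f:
    vocabulary = f.read().splitlines()

# Output mode: no vocabulary passed; the tokenizer's subword vocab is used
output_model = distill(model_name="BAAI/bge-base-en-v1.5")

# Vocab (word level): pass a vocabulary and turn subwords off
word_model = distill(model_name="BAAI/bge-base-en-v1.5", vocabulary=vocabulary, use_subword=False)

# Vocab (subword): pass a vocabulary and keep the subword vocab as a fallback
combined_model = distill(model_name="BAAI/bge-base-en-v1.5", vocabulary=vocabulary, use_subword=True)
```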

For a technical deep dive into Model2Vec, please refer to our blog post.

Distillation

Distilling from a Sentence Transformer
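
A minimal sketch of distilling an output-level model straight from a Sentence Transformer on the HuggingFace hub (the model id and output path are just examples):

```python
from model2vec.distill import distill

# Distill a Sentence Transformer into a small static model
m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256)

# Save the distilled model to disk
m2v_model.save_pretrained("m2v_model")
```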

Distilling from a loaded model
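
If you already have a model and tokenizer in memory, distill_from_model skips the hub download (a sketch under the same assumptions as above):

```python
from transformers import AutoModel, AutoTokenizer
from model2vec.distill import distill_from_model

model_name = "BAAI/bge-base-en-v1.5"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Distill directly from the loaded model and tokenizer
m2v_model = distill_from_model(model=model, tokenizer=tokenizer, pca_dims=256)
m2v_model.save_pretrained("m2v_model")
```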

Distilling with the Sentence Transformers library
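
Because Model2Vec is integrated into Sentence Transformers, you can also distill through its StaticEmbedding module and get back a regular SentenceTransformer (a sketch; model id and sentence are illustrative):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

# Distill into a StaticEmbedding module, then wrap it as a SentenceTransformer
static_embedding = StaticEmbedding.from_distillation("BAAI/bge-base-en-v1.5", device="cpu", pca_dims=256)
model = SentenceTransformer(modules=[static_embedding])

embeddings = model.encode(["It's dangerous to go alone!"])
```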

Distilling with a custom vocabulary
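
To bring your own (e.g. domain-specific) vocabulary, pass it as a list of strings (the vocabulary file name is illustrative):

```python
from model2vec.distill import distill

# One word per line, e.g. a medical dictionary
with open("vocabulary.txt") as f:
    vocabulary = f.read().splitlines()

# Distill with the custom vocabulary
m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", vocabulary=vocabulary)
m2v_model.save_pretrained("m2v_model")
```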

Distilling via CLI
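
Distillation is also exposed on the command line. The invocation below is an assumption mirroring the Python API, so verify the exact entry point and flag names against the CLI's --help:

```bash
# Assumed invocation; flag names mirror the Python API's parameters
python -m model2vec.distill --model-name BAAI/bge-base-en-v1.5 --save-path m2v_model
```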

If you are interested in small, fast models, also consider looking at these techniques:

  • BPEmb: GloVe embeddings trained on BPE-encoded Wikipedias. A huge inspiration for this project: multilingual and very fast. If you can't find a sentence transformer in the language you need, check this out.
  • fast-sentence-transformers: distillation using Model2Vec comes at a cost. If that cost is too steep for you, and you have access to a GPU, this package is for you. It automates the quantization and optimization of sentence transformers without loss of performance.
  • wordllama: uses the input embeddings of a Llama 2 model and then performs contrastive learning on these embeddings. As we show above, we think this is a bit overfit on MTEB, as the model is trained on MTEB datasets and only evaluated on MTEB. It provides an interesting point of comparison to Model2Vec and, fun fact, was invented at the same time.

If you find other related work, please let us know.

Blog post: https://huggingface.co/blog/Pringled/model2vec