The Language-agnostic BERT Sentence Encoder (LaBSE) is a groundbreaking model powered by BERT (Bidirectional Encoder Representations from Transformers) that has been trained to generate sentence embeddings for 109 languages. Think of it as a multilingual translator that also captures the essence of sentences, allowing you to retrieve similar meanings regardless of language.
Prerequisites
- Python installed on your local machine
- PyTorch and Hugging Face’s Transformers library
Installation
Before diving in, ensure you have the necessary libraries installed. You can do this using pip:
pip install torch transformersUsing LaBSE
import torch
from transformers import BertModel, BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('setu4993/LaBSE')
model = BertModel.from_pretrained('setu4993/LaBSE')
model = model.eval()
# English sentences
english_sentences = [
"dog",
"Puppies are nice.",
"I enjoy taking long walks along the beach with my dog.",
]
english_inputs = tokenizer(english_sentences, return_tensors='pt', padding=True)
with torch.no_grad():
english_outputs = model(**english_inputs)
# Getting the sentence embeddings
english_embeddings = english_outputs.pooler_output
From English to Other Languages
LaBSE can also process sentences in other languages seamlessly. Here’s how you can obtain embeddings for both Italian and Japanese sentences:
# Italian sentences
italian_sentences = [
"cane",
"I cuccioli sono carini.",
"Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.",
]
italian_inputs = tokenizer(italian_sentences, return_tensors='pt', padding=True)
with torch.no_grad():
italian_outputs = model(**italian_inputs)
italian_embeddings = italian_outputs.pooler_output
# Japanese sentences
japanese_sentences = ["犬", "子犬はいいです", "私は犬と一緒にビーチを散歩するのが好きです"]
japanese_inputs = tokenizer(japanese_sentences, return_tensors='pt', padding=True)
with torch.no_grad():
japanese_outputs = model(**japanese_inputs)
japanese_embeddings = japanese_outputs.pooler_output
Calculating Sentence Similarity
Now that you have the embeddings, let’s explore how to compute the similarity between the sentences. Think of embeddings as unique flavors—what we want to do is find how similar those flavors are.
import torch.nn.functional as F
def similarity(embeddings_1, embeddings_2):
normalized_embeddings_1 = F.normalize(embeddings_1, p=2)
normalized_embeddings_2 = F.normalize(embeddings_2, p=2)
return torch.matmul(normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1))
# Example usages
print(similarity(english_embeddings, italian_embeddings))
print(similarity(english_embeddings, japanese_embeddings))
print(similarity(italian_embeddings, japanese_embeddings))
Troubleshooting Common Issues
While working with LaBSE, you might encounter a few common issues. Here are some troubleshooting tips:
- Error loading model: Ensure that the transformer library is up to date. Use
pip install --upgrade transformers. - No CUDA available: If your model isn’t using GPU, check your PyTorch installation and ensure you have CUDA enabled.

