Wednesday, May 20, 2026

LaBSE Model for Multilingual Sentence Similarity

 The Language-agnostic BERT Sentence Encoder (LaBSE) is a groundbreaking model powered by BERT (Bidirectional Encoder Representations from Transformers) that has been trained to generate sentence embeddings for 109 languages. Think of it as a multilingual translator that also captures the essence of sentences, allowing you to retrieve similar meanings regardless of language.

Prerequisites

  • Python installed on your local machine
  • PyTorch and Hugging Face’s Transformers library

Installation

Before diving in, ensure you have the necessary libraries installed. You can do this using pip:

pip install torch transformers

Using LaBSE

import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('setu4993/LaBSE')
model = BertModel.from_pretrained('setu4993/LaBSE')
model = model.eval()

# English sentences
english_sentences = [
    "dog",
    "Puppies are nice.",
    "I enjoy taking long walks along the beach with my dog.",
]
english_inputs = tokenizer(english_sentences, return_tensors='pt', padding=True)
with torch.no_grad():
    english_outputs = model(**english_inputs)

# Getting the sentence embeddings
english_embeddings = english_outputs.pooler_output

From English to Other Languages

LaBSE can also process sentences in other languages seamlessly. Here’s how you can obtain embeddings for both Italian and Japanese sentences:

# Italian sentences
italian_sentences = [
    "cane",
    "I cuccioli sono carini.",
    "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.",
]
italian_inputs = tokenizer(italian_sentences, return_tensors='pt', padding=True)
with torch.no_grad():
    italian_outputs = model(**italian_inputs)
italian_embeddings = italian_outputs.pooler_output

# Japanese sentences
japanese_sentences = ["犬", "子犬はいいです", "私は犬と一緒にビーチを散歩するのが好きです"]
japanese_inputs = tokenizer(japanese_sentences, return_tensors='pt', padding=True)
with torch.no_grad():
    japanese_outputs = model(**japanese_inputs)
japanese_embeddings = japanese_outputs.pooler_output

Calculating Sentence Similarity

Now that you have the embeddings, let’s explore how to compute the similarity between the sentences. Think of embeddings as unique flavors—what we want to do is find how similar those flavors are.

import torch.nn.functional as F

def similarity(embeddings_1, embeddings_2):
    normalized_embeddings_1 = F.normalize(embeddings_1, p=2)
    normalized_embeddings_2 = F.normalize(embeddings_2, p=2)
    return torch.matmul(normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1))

# Example usages
print(similarity(english_embeddings, italian_embeddings))
print(similarity(english_embeddings, japanese_embeddings))
print(similarity(italian_embeddings, japanese_embeddings))

Troubleshooting Common Issues

While working with LaBSE, you might encounter a few common issues. Here are some troubleshooting tips:

  • Error loading model: Ensure that the transformer library is up to date. Use pip install --upgrade transformers.
  • No CUDA available: If your model isn’t using GPU, check your PyTorch installation and ensure you have CUDA enabled.