
How to use BERT for finding similar sentences or similar news?

Mar 14, 2026

Problem

I have used BERT's NextSentencePrediction head to find similar sentences or similar news, but it is very slow, even on a Tesla V100. It takes around 10 seconds to compare one query title against roughly 3,000 articles. Is there a better way to use BERT for finding similar sentences or similar news, given a corpus of news articles?

1 Fix

Optimize BERT for Faster Similar Sentence Retrieval

Medium Risk

The slow performance of BERT's NextSentencePrediction head for similarity search comes from its cross-encoder design: every (query, article) pair requires its own full BERT forward pass, so one query against 3,000 articles means 3,000 forward passes. On top of that, there is no index over the corpus, so every query re-scans all articles from scratch.
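To make the cost concrete, here is a back-of-the-envelope count of BERT forward passes per query under each approach, using the 3,000-article corpus from the question (a sketch only; real latency also depends on batch size and sequence length):

```python
n_articles = 3_000

# Cross-encoder / NextSentencePrediction: every (query, article) pair
# needs its own full BERT forward pass.
nsp_passes_per_query = n_articles

# Bi-encoder: the corpus is embedded once up front, after which each
# query costs a single forward pass plus a cheap vector search.
index_build_passes = n_articles   # one-time cost
bi_encoder_passes_per_query = 1

print(nsp_passes_per_query // bi_encoder_passes_per_query)  # 3000
```

The one-time indexing cost is the same 3,000 passes the cross-encoder spends on a single query, which is why the embedding approach amortizes so well.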


  1. Use Sentence Embeddings

    Instead of using NextSentencePredictor, utilize BERT to generate embeddings for each sentence in your corpus. This allows you to represent sentences as fixed-length vectors, which can be compared more efficiently using cosine similarity.

    python
    from sentence_transformers import SentenceTransformer

    # Encode every document once; each becomes a fixed-length float32 vector
    model = SentenceTransformer('bert-base-nli-mean-tokens')
    corpus_embeddings = model.encode(corpus, convert_to_numpy=True)
  2. Implement Efficient Similarity Search

    Use a library like FAISS (Facebook AI Similarity Search) to index the sentence embeddings. FAISS is optimized for fast nearest neighbor search, which will significantly reduce the time taken to find similar sentences.

    python
    import faiss

    # Build an exact L2 index over the (n, dim) float32 embedding matrix
    index = faiss.IndexFlatL2(corpus_embeddings.shape[1])
    index.add(corpus_embeddings)

    # encode takes a list and returns a 2-D array, as index.search requires
    query_embedding = model.encode([query])
    D, I = index.search(query_embedding, k)  # distances and indices of the k nearest
    # For cosine similarity: faiss.normalize_L2 the vectors and use IndexFlatIP
  3. Batch Processing

    Process queries in batches instead of one at a time. This can leverage the GPU more effectively and reduce the overhead of multiple calls to the model.

    python
    # One batched encode and one batched search amortize per-call GPU overhead
    query_embeddings = model.encode(queries, batch_size=64)
    D, I = index.search(query_embeddings, k)
  4. Use Mixed-Precision Inference

    If you're using PyTorch, enable mixed-precision inference with PyTorch's native AMP (NVIDIA's Apex amp has been deprecated in its favor). Running eligible ops in float16 speeds up inference and reduces memory usage on Tensor Core GPUs such as the V100.

    python
    from torch.cuda.amp import autocast

    # Eligible GPU ops run in float16; the model weights stay in float32
    with autocast():
        query_embedding = model.encode([query])
  5. Consider Distillation or Quantization

    If latency is still an issue, consider using a distilled version of BERT (like DistilBERT) or quantizing the model to reduce its size and improve inference speed.

    python
    from sentence_transformers import SentenceTransformer

    # A distilled sentence encoder drops into the same encode/index pipeline;
    # a raw transformers DistilBertModel would still need a pooling step on top
    model = SentenceTransformer('distilbert-base-nli-mean-tokens')
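The quantization route in step 5 can be sketched with PyTorch's dynamic quantization, which converts Linear layers to int8 for CPU inference. The snippet below uses a small stand-in module so it runs anywhere; the same `torch.quantization.quantize_dynamic` call applies to a loaded (Distil)BERT model.

```python
import torch
import torch.nn as nn

# Stand-in encoder: in practice you would pass the loaded (Distil)BERT here
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 256))
model.eval()

# Convert Linear layers to int8 weights with dynamically scaled activations
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = quantized(torch.randn(1, 768))
print(out.shape)  # torch.Size([1, 256])
```

Dynamic quantization helps most on CPU; on a V100, mixed precision (step 4) is usually the better lever.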

Validation

To confirm the fix worked, measure the retrieval time for the same query set before and after implementing these changes. With precomputed embeddings and a FAISS index, a single query over 3,000 articles should complete well under one second, compared to the original 10 seconds.
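A minimal sketch of that measurement, using seeded random unit vectors as stand-in embeddings and a brute-force NumPy search so it runs without a GPU or FAISS:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(3000, 768)).astype("float32")  # stand-in embeddings
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # unit-normalize
query = corpus[42:43]  # query a known document so the top hit is predictable

start = time.perf_counter()
scores = (corpus @ query.T).ravel()   # cosine similarity: vectors are unit-norm
top5 = np.argsort(-scores)[:5]        # indices of the 5 most similar documents
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"top hit: {top5[0]}, search took {elapsed_ms:.2f} ms")
```

Even this brute-force search over 3,000 vectors finishes in milliseconds; FAISS matters once the corpus grows to millions of documents.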


Submitted by Alex Chen

Tags: huggingface, transformers, ml