
HNSW + dead tuples: recall loss/usability issues

Mar 14, 2026

Problem

This issue is a follow-up on this discussion in #239.

Scenario: A user sets `ef_search` to 10, expecting to get the top-10 results back. If the top-10 results in the HNSW index all happen to be dead tuples (due to updates and deletes), the query returns 0 results. While this doesn't affect benchmarks such as ann-benchmarks, it does affect more realistic use cases where applications update and delete data.

Desired behavior: The HNSW index returns the top-10 results whose tuples aren't dead, to avoid significant recall loss. In the scenario above, recall would be 0%. Note that ivfflat already has the desired behavior.

I can see some potentially ugly workarounds:

  1. Users always perform a vacuum on every update/delete.
     * con: resource intensive
     * con: adds complexity to application logic
  2. Users perform more frequent periodic vacuums.
     * con: should theoretically minimize the problem but offers no bounded guarantees; a user can still hit the above problem between the periodic vacuums
  3. Users specify a higher `ef_search`.
     * con: how will a user know what `ef_search` to specify?
     * con: `ef_search` can be set as high as 1k, which should theoretically minimize the problem but also offers no bounded guarantees; e.g. a user can still hit the above problem when the top-1k results of a large dataset in HNSW are dead

I think these workarounds will fall apart, especially for larger datasets, so a principled fix in the HNSW index would help real-world users. Thoughts?
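The failure mode in the scenario can be reproduced with a toy model. The sketch below is not pgvector code: a brute-force nearest-neighbor scan stands in for the HNSW graph search, and the names `annCandidates`, `queryWithPostFilter`, and the `dead` flag are assumptions made purely for illustration. It shows how capping the candidate list at `ef_search` before dead tuples are discarded can leave nothing to return.

```javascript
// Toy model: 1-D "vectors"; brute force stands in for the HNSW graph search.
// All names here are hypothetical; this is not pgvector's implementation.
function annCandidates(rows, query, efSearch) {
  return [...rows]
    .sort((a, b) => Math.abs(a.value - query) - Math.abs(b.value - query))
    .slice(0, efSearch); // the index hands back at most efSearch candidates
}

// Dead tuples are dropped only after the candidate list has been capped.
function queryWithPostFilter(rows, query, efSearch, limit) {
  return annCandidates(rows, query, efSearch)
    .filter((row) => !row.dead)
    .slice(0, limit);
}

// 100 rows; the 10 closest to the query (values 0..9) were deleted or updated.
const rows = Array.from({ length: 100 }, (_, i) => ({ id: i, value: i, dead: i < 10 }));

const narrow = queryWithPostFilter(rows, 0, 10, 10); // ef_search = 10
const wide = queryWithPostFilter(rows, 0, 40, 10);   // ef_search = 40
console.log(narrow.length); // 0
console.log(wide.length);   // 10
```

Raising `ef_search` rescues this particular query, but only because fewer than 40 of the nearest rows are dead, which is the "no bounded guarantees" objection raised in workaround 3.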



Implement Dead Tuple Filtering in HNSW Index

Medium Risk

With a low `ef_search`, an HNSW index scan can come back with mostly or entirely dead tuples (rows that were deleted or superseded by an update). The search gathers the `ef_search` nearest candidates without checking whether they are still live, and the dead ones are discarded afterwards, so a query can return fewer rows than requested, or none at all, after updates or deletes.


  1. Modify HNSW Search Algorithm

    Update the HNSW search algorithm to include a validity check for each candidate result. Before returning a result, the algorithm should verify that the tuple is still live, i.e. not deleted and not superseded by an update. If a dead tuple is encountered, the algorithm should skip it and keep searching until the requested number of live results is found or all candidates are exhausted.

    javascript
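    // Sketch only: getCandidates and isValidTuple are hypothetical helpers.
    function searchHNSW(query, efSearch, k = 10, maxEf = 1000) {
      let results = [];
      for (let ef = Math.max(1, efSearch); ; ef = Math.min(ef * 2, maxEf)) {
        // Candidates are assumed to come back ordered nearest-first.
        const candidates = getCandidates(query, ef);
        results = candidates.filter(isValidTuple).slice(0, k);
        // Stop once k live tuples are found, the index is exhausted, or ef hits its cap.
        if (results.length >= k || candidates.length < ef || ef >= maxEf) break;
      }
      return results;
    }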
  2. Introduce Background Cleanup Process

    Implement a background process that periodically checks for and removes dead tuples from the HNSW index (in PostgreSQL, `VACUUM` and autovacuum are the existing mechanisms for removing dead index entries). This process should run at a configurable interval so that the index stays clean and efficient without requiring user intervention.

    javascript
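    // Sketch only: findDeadTuples and removeTupleFromIndex are hypothetical helpers.
    function cleanupHNSW() {
      const deadTuples = findDeadTuples();
      for (const tuple of deadTuples) {
        removeTupleFromIndex(tuple);
      }
      return deadTuples.length; // number of entries removed in this pass
    }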
  3. Provide Configuration for Cleanup Frequency

    Allow users to configure the frequency of the background cleanup process through a setting in the database configuration. This will enable users to balance performance and accuracy based on their specific use case and workload.

    sql
    -- Hypothetical setting shown for illustration; it is not an existing pgvector parameter.
    SET hnsw_cleanup_frequency = '1 hour';
  4. Enhance Documentation on HNSW Usage

    Update the documentation to provide clear guidelines on how to use the HNSW index effectively, including recommendations for setting `ef_search` and the importance of maintaining the index through periodic cleanups.

  5. Monitor and Log Search Performance

    Implement logging for search queries to monitor the performance and validity of results returned by the HNSW index. This will help identify any remaining issues and provide insights for future optimizations.

    javascript
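    // Logs how many results came back versus how many were requested.
    function logSearchPerformance(query, results, requested = 10) {
      const shortfall = Math.max(0, requested - results.length);
      console.log(`Query: ${query}, Results: ${results.length}, Shortfall: ${shortfall}`);
    }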
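The widening search described in step 1 can be tried out without a database. The sketch below uses in-memory stand-ins: `makeIndex`, `searchLive`, the `dead` flag, and the `maxEf` cap are all assumptions for illustration, and a real fix would live inside the index scan itself rather than in application code.

```javascript
// Hypothetical in-memory stand-in for an ANN index over 1-D values.
function makeIndex(rows) {
  return {
    // Returns up to ef candidates, nearest first (brute force).
    candidates(query, ef) {
      return [...rows]
        .sort((a, b) => Math.abs(a.value - query) - Math.abs(b.value - query))
        .slice(0, ef);
    },
  };
}

// Widen the candidate list until k live tuples are found, the index is
// exhausted, or ef reaches maxEf.
function searchLive(index, query, k, efSearch, maxEf = 1000) {
  let ef = Math.max(1, efSearch);
  for (;;) {
    const candidates = index.candidates(query, ef);
    const live = candidates.filter((row) => !row.dead).slice(0, k);
    const exhausted = candidates.length < ef;
    if (live.length >= k || exhausted || ef >= maxEf) {
      return { results: live, efUsed: ef };
    }
    ef = Math.min(ef * 2, maxEf);
  }
}

// The 10 rows nearest to the query are dead, as in the original scenario.
const rows = Array.from({ length: 100 }, (_, i) => ({ id: i, value: i, dead: i < 10 }));
const { results, efUsed } = searchLive(makeIndex(rows), 0, 10, 10);
console.log(results.map((r) => r.id), efUsed);
```

The `maxEf` cap keeps the cost of a single query bounded. The trade-off is that a query can still come back short when every one of the nearest `maxEf` candidates is dead.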

Validation

To confirm the fix worked, run a series of search queries against an HNSW index that contains a known mix of live and dead tuples. Verify that each query returns the requested number of live tuples whenever enough of them exist, and that recall measured against the live tuples does not drop after updates and deletes. Additionally, monitor the performance logs for any anomalies.
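One way to make this check concrete is to measure recall against ground truth computed over live tuples only. The sketch below assumes that both the query results and the ground truth are arrays of tuple ids ordered nearest-first; `recallAtK` is a hypothetical helper, not part of pgvector.

```javascript
// Fraction of the true top-k live neighbors that the query actually returned.
function recallAtK(returnedIds, liveGroundTruthIds, k) {
  const expected = liveGroundTruthIds.slice(0, k);
  if (expected.length === 0) return 1; // nothing to find
  const returned = new Set(returnedIds.slice(0, k));
  const hits = expected.filter((id) => returned.has(id)).length;
  return hits / expected.length;
}

// Ground truth: the 10 nearest live tuples have ids 10..19.
const truth = Array.from({ length: 10 }, (_, i) => i + 10);
console.log(recallAtK([], truth, 10));                   // 0
console.log(recallAtK(truth, truth, 10));                // 1
console.log(recallAtK([10, 11, 12, 13, 14], truth, 10)); // 0.5
```

An empty result set scores 0, which matches the 0% recall described in the original scenario.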



Submitted by Alex Chen

Tags

pgvector, embeddings, vector-search