HNSW + dead tuples: recall loss/usability issues
Problem
This issue is a follow-up on this discussion in #239.

Scenario: A user sets `ef_search` to 10, expecting to get the top-10 results back. But if the top-10 results in the HNSW index happen to be dead tuples (due to updates and deletes), the query returns 0 results. While this doesn't impact things like ann-benchmark, it will impact more realistic use cases where applications update or delete data.

Desired Behavior: The HNSW index returns the top-10 results where the tuples aren't dead, to avoid causing significant recall loss. In the scenario above, recall would be 0%.

Note that ivfflat has the desired behavior.

I can see some potentially ugly workarounds:

1. Users always perform a vacuum on every update/delete
   * con: resource intensive
   * con: complexity on application logic
2. Users perform more frequent periodic vacuums
   * con: should theoretically minimize the problem but offers no bounded guarantees; the user can still hit the above problem between the periodic vacuums
3. Users can specify a higher `ef_search`
   * con: how will a user know what `ef_search` to specify?
   * con: `ef_search` can be specified up to 1k, which should theoretically minimize the problem but also offers no bounded guarantees; e.g. the user can still hit the above problem when the top-1k results of a large dataset in HNSW are dead

And I think the ugly workarounds will fall apart, especially for larger datasets, so having a principled fix in the HNSW index would help real-world users.

Thoughts?
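The scenario can be reproduced with a toy model. This is a minimal sketch, assuming the index picks its `ef_search` nearest candidates first and dead tuples are filtered out only afterwards. `topCandidates`, `queryWithPostFilter`, and the row shape are hypothetical names for illustration, not pgvector code.

```javascript
// Toy model: a sorted list stands in for the HNSW graph walk.
// All names here are hypothetical and exist only for this illustration.
function topCandidates(rows, query, efSearch) {
  return [...rows]
    .sort((a, b) => Math.abs(a.value - query) - Math.abs(b.value - query))
    .slice(0, efSearch);
}

// The liveness check runs after the candidates have already been chosen.
function queryWithPostFilter(rows, query, efSearch, limit) {
  return topCandidates(rows, query, efSearch)
    .filter((row) => !row.dead)
    .slice(0, limit);
}

// 100 rows; the 10 nearest to the query (values 0..9) were deleted or updated.
const rows = Array.from({ length: 100 }, (_, i) => ({ id: i, value: i, dead: i < 10 }));

const found = queryWithPostFilter(rows, 0, 10, 10);
console.log(found.length); // 0: every candidate slot was taken by a dead tuple
```

Raising the candidate count to 20 in this model returns 10 live rows, which matches workaround 3: it helps only while the number of dead neighbors stays below the chosen `ef_search`.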
1 Fix
Implement Dead Tuple Filtering in HNSW Index
With a low `ef_search`, the HNSW scan can spend all of its candidate slots on dead tuples (deleted rows or old versions of updated rows). The search selects the nearest neighbors without checking whether they are still valid, so once the dead tuples are filtered out the query can return too few results, or none at all, especially after many updates or deletes.
- 1
Modify HNSW Search Algorithm
Update the HNSW search algorithm to include a validity check for each candidate result. Before returning a result, the algorithm should verify that the tuple is not marked as deleted or updated. If a dead tuple is encountered, the algorithm should continue searching for the next valid tuple until the desired number of results is found or all candidates are exhausted.
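The "continue searching" behavior could be sketched as a loop that widens the candidate set until enough live tuples are found. This is an illustration under stated assumptions, not how any particular index implements it: `fetchCandidates`, `isLive`, and `maxCandidates` are hypothetical, and the candidates are assumed to arrive nearest first with dead entries included.

```javascript
// Sketch only: `fetchCandidates(query, n)` is a hypothetical callback that
// returns the n nearest index entries, nearest first, dead ones included.
function searchLive(fetchCandidates, isLive, query, limit, maxCandidates) {
  let width = limit;
  while (true) {
    const candidates = fetchCandidates(query, width);
    const live = candidates.filter(isLive).slice(0, limit);
    const exhausted = candidates.length < width; // index has no more entries
    if (live.length >= limit || exhausted || width >= maxCandidates) {
      return live;
    }
    width = Math.min(width * 2, maxCandidates); // widen and retry
  }
}

// 100 entries, the 25 nearest of which are dead.
const entries = Array.from({ length: 100 }, (_, i) => ({ id: i, dead: i < 25 }));
const fetchFromIndex = (_query, n) => entries.slice(0, n);

const hits = searchLive(fetchFromIndex, (c) => !c.dead, null, 10, 1000);
console.log(hits.map((c) => c.id)); // ids 25 through 34
```

Re-fetching from scratch on every pass is wasteful; a real implementation would more likely resume the graph traversal where it stopped. The `maxCandidates` cap keeps the worst case bounded, at the cost of still returning fewer results when the cap is reached.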
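```javascript
// Sketch only: getCandidates and isValidTuple are hypothetical helpers.
function searchHNSW(query, efSearch, limit = 10) {
  // Candidates are assumed to arrive ordered by distance, nearest first.
  const candidates = getCandidates(query, efSearch);
  const results = [];
  for (const candidate of candidates) {
    // Skip dead tuples so they do not count toward the limit.
    if (!isValidTuple(candidate)) continue;
    results.push(candidate);
    if (results.length >= limit) break;
  }
  return results;
}
```
- 2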
Introduce Background Cleanup Process
Implement a background process that periodically checks for and removes dead tuples from the HNSW index. This process should run at a configurable interval to ensure that the index remains clean and efficient without requiring user intervention.
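Removing an entry from a graph index is more than deleting a node, because its neighbors lose a link and may become hard to reach. The sketch below shows one simple repair strategy on a toy graph. It assumes a single-layer, undirected graph stored as a `Map` from node id to a `Set` of neighbor ids; `removeDeadNodes` is a hypothetical helper, and real HNSW indexes also have multiple layers and a cap on neighbors per node, both ignored here.

```javascript
// Sketch only: single-layer, undirected toy graph (id -> Set of neighbor ids).
function removeDeadNodes(graph, isDead) {
  const deadIds = [...graph.keys()].filter(isDead);
  for (const deadId of deadIds) {
    const neighbors = [...graph.get(deadId)];
    for (const a of neighbors) {
      const linksOfA = graph.get(a);
      linksOfA.delete(deadId);
      // Reconnect the dead node's neighbors to each other so the
      // graph stays navigable after the removal.
      for (const b of neighbors) {
        if (b !== a) linksOfA.add(b);
      }
    }
    graph.delete(deadId);
  }
  return deadIds.length;
}

// Chain 1 - 2 - 3, where node 2 is dead.
const graph = new Map([
  [1, new Set([2])],
  [2, new Set([1, 3])],
  [3, new Set([2])],
]);

const removed = removeDeadNodes(graph, (id) => id === 2);
console.log(removed, graph.get(1).has(3), graph.get(3).has(1)); // 1 true true
```

Linking every pair of surviving neighbors is the simplest possible repair and can inflate the neighbor lists; it is shown only to make the cost of cleanup visible.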
```javascript
// Sketch only: findDeadTuples and removeTupleFromIndex are hypothetical helpers.
function cleanupHNSW() {
  const deadTuples = findDeadTuples();
  for (const tuple of deadTuples) {
    removeTupleFromIndex(tuple);
  }
}
```
- 3
Provide Configuration for Cleanup Frequency
Allow users to configure the frequency of the background cleanup process through a setting in the database configuration. This will enable users to balance performance and accuracy based on their specific use case and workload.
```sql
-- Proposed, hypothetical setting shown for illustration; not an existing option.
SET hnsw_cleanup_frequency = '1 hour';
```
- 4
Enhance Documentation on HNSW Usage
Update the documentation to provide clear guidelines on how to use the HNSW index effectively, including recommendations for setting `ef_search` and the importance of maintaining the index through periodic cleanups.
- 5
Monitor and Log Search Performance
Implement logging for search queries to monitor the performance and validity of results returned by the HNSW index. This will help identify any remaining issues and provide insights for future optimizations.
```javascript
function logSearchPerformance(query, results) {
  console.log(`Query: ${query}, Results: ${results.length}`);
}
```
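A result count alone does not show whether a query came back short. A slightly richer log record could flag shortfalls directly; this is a sketch in which the record shape and the field names (`requested`, `candidatesScanned`, `shortfall`) are assumptions chosen for illustration.

```javascript
// Sketch only: the record shape and field names are assumptions.
function buildSearchLogRecord(queryId, requested, results, candidatesScanned) {
  const returned = results.length;
  return {
    queryId,
    requested,
    returned,
    candidatesScanned,
    // Candidates that were scanned but not returned (dead or over the limit).
    discarded: candidatesScanned - returned,
    shortfall: returned < requested,
  };
}

// The scenario from the problem statement: 10 requested, 10 scanned, 0 returned.
const record = buildSearchLogRecord("q-1", 10, [], 10);
if (record.shortfall) {
  console.warn(JSON.stringify(record));
}
```

Counting how often `shortfall` is true between vacuums gives a rough signal of how frequently dead tuples are crowding out live results.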
Validation
To confirm the fix worked, run a series of search queries against an HNSW index that contains a known mix of live and dead tuples. Verify that only live tuples are returned and that each query returns the requested number of results whenever enough live tuples exist. Additionally, monitor the performance logs for any anomalies.
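Recall against an exact search over the live rows is one way to quantify the check. The following is a minimal sketch, assuming ids are plain numbers and that `expectedIds` comes from a brute-force search restricted to live tuples; `recallAtK` is a hypothetical helper name.

```javascript
// Sketch only: `expectedIds` is assumed to come from an exact (brute-force)
// search over live rows, `actualIds` from the index under test.
function recallAtK(expectedIds, actualIds) {
  if (expectedIds.length === 0) return 1; // nothing to find
  const actual = new Set(actualIds);
  const hits = expectedIds.filter((id) => actual.has(id)).length;
  return hits / expectedIds.length;
}

console.log(recallAtK([25, 26, 27, 28], []));           // 0
console.log(recallAtK([25, 26, 27, 28], [25, 26, 40])); // 0.5
```

Running the same measurement before and after a batch of deletes shows whether recall drops as dead tuples accumulate.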
Submitted by
Alex Chen