Iterative index scans
Problem
Hi all, I wanted to share some work on iterative index scans to get feedback.

- hnsw-streaming branch
- ivfflat-streaming branch

You can enable this functionality (naming TBD) with:

[code block]

For HNSW, the scan keeps track of discarded candidates at layer 0. When more tuples are needed, it calls `HnswSearchLayer` / Algorithm 2 with the nearest discarded candidates as entry points (in batches of `ef_search`). The scan terminates when enough tuples are found, `hnsw.ef_stream` elements are visited, or `work_mem` is exceeded.

For IVFFlat, it scans the next closest lists in groups of `ivfflat.probes`, up to `ivfflat.max_probes`.

---

One issue I'm having trouble addressing is how to terminate scans for queries with distance filters. In the query below, if only 9 records are within the distance, the scan will keep reading the index. I've tried using `xs_orderbyvals` on `IndexScanDesc`, but it doesn't seem to help.

[code block]
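To make the termination problem concrete, here is a minimal Python simulation. It is not pgvector code: it assumes a toy one-dimensional "index" whose lists are probed in batches, loosely mimicking how the IVFFlat branch scans the next closest lists in groups of `ivfflat.probes` up to `ivfflat.max_probes`. The names `iterative_scan` and `filtered_limit`, the `stats` counters, and the data are all hypothetical.

```python
from typing import Dict, Iterator, List, Tuple


def iterative_scan(lists: List[List[float]], query: float, probes: int,
                   max_probes: int, stats: Dict[str, int]
                   ) -> Iterator[Tuple[float, float]]:
    """Yield (distance, value) pairs, probing `probes` lists at a time.

    Toy stand-in for an IVFFlat-style iterative scan: lists are ranked by
    the distance from the query to their centroid (here, the list mean).
    """
    ranked = sorted(lists, key=lambda lst: abs(sum(lst) / len(lst) - query))
    ranked = ranked[:max_probes]
    for start in range(0, len(ranked), probes):
        batch = ranked[start:start + probes]
        stats["lists_probed"] += len(batch)
        # Within one batch, tuples come back ordered by distance.
        values = sorted((v for lst in batch for v in lst),
                        key=lambda v: abs(v - query))
        for v in values:
            stats["tuples_returned"] += 1
            yield abs(v - query), v


def filtered_limit(scan: Iterator[Tuple[float, float]], max_distance: float,
                   limit: int) -> List[float]:
    """Mimic a filter applied above the scan: stop only when LIMIT is met."""
    out: List[float] = []
    for dist, v in scan:
        if dist < max_distance:
            out.append(v)
            if len(out) == limit:
                break
    return out


# Ten lists of ten values each: list i holds i*10 .. i*10+9.
lists = [[float(i * 10 + j) for j in range(10)] for i in range(10)]
stats = {"lists_probed": 0, "tuples_returned": 0}

# Only 9 values (0..8) lie within distance 4.5 of the query 4.0,
# so a limit of 10 is never reached and every list gets probed.
scan = iterative_scan(lists, 4.0, probes=1, max_probes=10, stats=stats)
rows = filtered_limit(scan, max_distance=4.5, limit=10)
print(len(rows), stats)  # 9 {'lists_probed': 10, 'tuples_returned': 100}
```

In this sketch the filter sits outside the scan, so the scan has no way of knowing that further tuples cannot help; only the `max_probes` cap ends it.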
1 Fix
Implement Early Termination for Distance Filtered Queries
The current implementation of iterative index scans does not terminate early when fewer records fall within the distance filter than the query requests. The scan then keeps reading the index even though no further matches exist, which wastes work and increases query time.
- 1
Modify HNSW Search Logic
Update the `HnswSearchLayer` function to include a check for the number of valid tuples found against the required number of tuples. If the number of valid tuples meets the required count, terminate the scan early.
Pseudocode: `if (found_tuples >= required_count) { terminate_scan(); }`
- 2
Adjust IVFFlat Scanning Logic
In the IVFFlat scanning logic, implement a similar check to terminate scanning when the number of tuples found within the distance filter reaches the desired count. This will prevent unnecessary probing of additional lists.
Pseudocode: `if (found_tuples >= required_count) { break; }`
- 3
Integrate Distance Filter Check
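Incorporate a distance filter check in both the HNSW and IVFFlat scanning processes so that only tuples within the specified distance count toward the required tuple count. Because a filter can admit fewer tuples than requested (9 records in the example query), the count check alone cannot end the scan; the scan also needs to stop once no remaining candidate can fall within the distance.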
Pseudocode: `if (tuple.distance <= distance_filter) { count_valid_tuples(); }`
- 4
Test and Validate Changes
Create unit tests that simulate queries with varying distance filters and validate that the scans terminate correctly when the required number of tuples is found. Ensure that performance metrics are collected to compare against previous implementations.
Pseudocode: `assert(scan_terminates_early(query));`
- 5
Document Changes
Update the documentation to reflect the new behavior of the iterative index scans, including how early termination works with distance filters. This will help future developers understand the changes made.
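Steps 1 to 3 can be sketched together in Python. This is a simplified model, not the HNSW or IVFFlat code: it assumes the scan returns tuples in strictly nondecreasing distance order, which may not hold across batches of an approximate iterative scan. The functions `ordered_scan` and `scan_with_distance_bound` are hypothetical, and `max_visited` only stands in for a cap such as `hnsw.ef_stream`.

```python
from typing import Dict, Iterable, Iterator, List


def ordered_scan(distances: Iterable[float], max_visited: int,
                 stats: Dict[str, int]) -> Iterator[float]:
    """Yield distances in ascending order, visiting at most `max_visited`."""
    for i, d in enumerate(sorted(distances)):
        if i >= max_visited:
            return
        stats["visited"] += 1
        yield d


def scan_with_distance_bound(scan: Iterator[float], max_distance: float,
                             required_count: int) -> List[float]:
    """Count only tuples inside the filter; stop on either condition."""
    found: List[float] = []
    for d in scan:
        if d >= max_distance:
            # Assumption: output is distance-ordered, so nothing later
            # in the scan can satisfy the filter either.
            break
        found.append(d)
        if len(found) >= required_count:
            break
    return found


distances = [float(i) for i in range(1000)]

# Filter admits only 9 tuples (0..8); the scan stops at the 10th tuple.
stats_few = {"visited": 0}
few = scan_with_distance_bound(
    ordered_scan(distances, max_visited=1000, stats=stats_few),
    max_distance=9.0, required_count=10)

# Filter admits plenty of tuples; the scan stops once 10 are found.
stats_many = {"visited": 0}
many = scan_with_distance_bound(
    ordered_scan(distances, max_visited=1000, stats=stats_many),
    max_distance=50.0, required_count=10)
```

The distance bound is what ends the first query: the count check alone would run until the `max_visited` cap. If the ordering assumption is relaxed, stopping at the first out-of-range tuple could miss closer tuples that arrive in a later batch.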
Validation
Run a series of benchmark tests with both HNSW and IVFFlat implementations using queries that include distance filters. Confirm that the scans terminate early when the required number of records is found, and compare performance metrics to ensure improvements.
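A comparison along these lines could start from a small harness like the one below, which counts visited tuples rather than measuring wall-clock time. It is a hypothetical sketch over synthetic, already sorted distances; `visited_count` and the chosen radii are illustrative and do not come from either branch.

```python
from typing import List, Tuple


def visited_count(sorted_distances: List[float], max_distance: float,
                  limit: int, stop_at_bound: bool) -> Tuple[int, int]:
    """Return (matches, tuples_visited) for one simulated query."""
    matches = 0
    visited = 0
    for d in sorted_distances:
        visited += 1
        if d < max_distance:
            matches += 1
            if matches == limit:
                break
        elif stop_at_bound:
            # First tuple outside the filter on an ordered scan: stop.
            break
    return matches, visited


data = [float(i) for i in range(10_000)]
report = []
for radius in (5.0, 9.0, 50.0):
    baseline = visited_count(data, radius, limit=10, stop_at_bound=False)
    bounded = visited_count(data, radius, limit=10, stop_at_bound=True)
    report.append((radius, baseline, bounded))

for radius, baseline, bounded in report:
    print(f"radius={radius}: baseline={baseline} bounded={bounded}")
```

Both variants should return the same number of matches for every radius; only the visited count should differ, and only when the filter admits fewer tuples than the limit.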
Environment
Submitted by
Alex Chen
2450 rep