
Iterative index scans

Mar 14, 2026

Problem

Hi all, I wanted to share some work on iterative index scans to get feedback.

- hnsw-streaming branch
- ivfflat-streaming branch

You can enable this functionality (naming TBD) with:

[code block]

For HNSW, it keeps track of discarded candidates at layer 0. When more tuples are needed, it calls `HnswSearchLayer` (Algorithm 2) with the nearest discarded candidates as entry points, in batches of `ef_search`. The scan terminates when enough tuples are found, `hnsw.ef_stream` elements are visited, or `work_mem` is exceeded.

For IVFFlat, it scans the next closest lists in groups of `ivfflat.probes`, up to `ivfflat.max_probes`.

---

One issue I'm having trouble addressing is how to terminate scans for queries with distance filters. In the query below, if only 9 records are within the distance, it'll continue scanning the index. I've tried using `xs_orderbyvals` on `IndexScanDesc`, but it doesn't seem to help.

[code block]
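To make the IVFFlat behaviour described above concrete, here is a toy Python model of probing lists in groups. It is only a sketch, not pgvector's C implementation: the function `iterative_ivfflat_scan`, its parameters, and the sample data are all hypothetical, and `probes` / `max_probes` merely mirror the roles of `ivfflat.probes` and `ivfflat.max_probes`.

```python
import math

def iterative_ivfflat_scan(query, centroids, lists, limit, probes, max_probes,
                           accept=lambda v: True):
    """Toy model: probe lists nearest-first in groups of `probes`.

    Returns (results, lists_probed). All names here are illustrative.
    """
    if probes < 1:
        raise ValueError("probes must be at least 1")
    # Rank lists by the distance from the query to each list's centroid.
    order = sorted(range(len(centroids)),
                   key=lambda i: math.dist(query, centroids[i]))
    budget = min(max_probes, len(order))
    results, probed = [], 0
    # Keep probing until enough tuples passed the filter or the budget is spent.
    while len(results) < limit and probed < budget:
        group = order[probed:min(probed + probes, budget)]
        probed += len(group)
        # Sort this group's vectors by distance, then apply the filter.
        batch = sorted((v for i in group for v in lists[i]),
                       key=lambda v: math.dist(query, v))
        results.extend(v for v in batch if accept(v))
    return results[:limit], probed

query = (0.0, 0.0)
centroids = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
lists = [[(0.0, 1.0), (1.0, 0.0)], [(9.0, 0.0), (10.0, 1.0)], [(20.0, 0.0)]]

# Without a filter, two lists are enough to return three tuples.
unfiltered = iterative_ivfflat_scan(query, centroids, lists,
                                    limit=3, probes=1, max_probes=3)

# With a distance filter that only two tuples satisfy, every list is probed.
within_5 = lambda v: math.dist(query, v) < 5
filtered = iterative_ivfflat_scan(query, centroids, lists,
                                  limit=3, probes=1, max_probes=3,
                                  accept=within_5)
```

In this model the filtered call probes all three lists even though no tuple beyond the first list can pass the filter, which is the same symptom as the distance-filter issue raised above.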



Implement Early Termination for Distance Filtered Queries

Medium Risk

The current implementation of iterative index scans has no stopping condition for the case where fewer records fall within the distance filter than the query requests. The scan then keeps fetching candidates that cannot pass the filter, which wastes work and increases query time.


  1. Modify HNSW Search Logic

    Update the `HnswSearchLayer` function to include a check for the number of valid tuples found against the required number of tuples. If the number of valid tuples meets the required count, terminate the scan early.

    pseudo
    if (found_tuples >= required_count) { terminate_scan(); }
  2. Adjust IVFFlat Scanning Logic

    In the IVFFlat scanning logic, implement a similar check to terminate scanning when the number of tuples found within the distance filter reaches the desired count. This will prevent unnecessary probing of additional lists.

    pseudo
    if (found_tuples >= required_count) { break; }
  3. Integrate Distance Filter Check

    Incorporate a distance filter check in both the HNSW and IVFFlat scanning processes to ensure that only tuples within the specified distance are counted towards the required tuple count.

    pseudo
    if (tuple.distance <= distance_filter) { count_valid_tuples(); }
  4. Test and Validate Changes

    Create unit tests that simulate queries with varying distance filters and validate that the scans terminate correctly when the required number of tuples is found. Ensure that performance metrics are collected to compare against previous implementations.

    pseudo
    assert(scan_terminates_early(query));
  5. Document Changes

    Update the documentation to reflect the new behavior of the iterative index scans, including how early termination works with distance filters. This will help future developers understand the changes made.
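The count check and the distance check from the steps above can be combined into one loop. The Python sketch below is illustrative only: `filtered_scan` and its arguments are hypothetical names, and it assumes candidates arrive in non-decreasing distance order. An approximate index may not guarantee that order across batches, in which case stopping at the first out-of-range candidate could miss results.

```python
from typing import Iterable, List, Tuple

def filtered_scan(batches: Iterable[List[Tuple[int, float]]],
                  limit: int,
                  max_distance: float) -> Tuple[List[int], int]:
    """Consume (tuple_id, distance) batches; return (ids, batches_fetched).

    Assumption: distances are non-decreasing within and across batches.
    """
    found: List[int] = []
    fetched = 0
    for batch in batches:
        fetched += 1
        for tuple_id, distance in batch:
            if distance > max_distance:
                # Under the ordering assumption, no later candidate can
                # pass the filter, so the scan can stop here.
                return found, fetched
            found.append(tuple_id)
            if len(found) >= limit:
                # Enough tuples within the filter have been found.
                return found, fetched
    # The index ran out of candidates before either condition was met.
    return found, fetched

sample = [[(1, 0.1), (2, 0.2)], [(3, 0.3), (4, 0.9)], [(5, 1.5)]]

# Only three tuples are within 0.5, so the scan stops in the second batch.
short_result = filtered_scan(iter(sample), limit=10, max_distance=0.5)

# With a small limit, the count check ends the scan in the first batch.
limited_result = filtered_scan(iter(sample), limit=2, max_distance=0.5)
```

The distance check is what addresses the case from the problem statement: when fewer tuples qualify than the query asks for, the count check alone never fires.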

Validation

Run a series of benchmark tests with both HNSW and IVFFlat implementations using queries that include distance filters. Confirm that the scans terminate early when the required number of records is found, and compare performance metrics to ensure improvements.
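One simple metric for such a comparison is the number of candidates visited with and without the distance cutoff. The Python sketch below uses synthetic, already-sorted integer distances rather than a real index, and `count_visited` is a hypothetical helper; it mirrors the case from the problem statement where only 9 records fall within the filter.

```python
def count_visited(distances, limit, max_distance, stop_on_exceed):
    """Walk ascending distances; return (matches, candidates_visited)."""
    matches = 0
    visited = 0
    for d in distances:
        visited += 1
        if d <= max_distance:
            matches += 1
            if matches >= limit:
                break  # enough tuples within the filter
        elif stop_on_exceed:
            break  # first out-of-range candidate ends the scan
    return matches, visited

# 1000 synthetic candidates at distances 1, 2, ..., 1000 (already sorted).
distances = list(range(1, 1001))

# Only 9 candidates are within distance 9, but the query asks for 10.
baseline = count_visited(distances, limit=10, max_distance=9,
                         stop_on_exceed=False)
early = count_visited(distances, limit=10, max_distance=9,
                      stop_on_exceed=True)
```

Here `baseline` visits every candidate while `early` stops after the tenth, and both return the same 9 matches. A real benchmark would measure the same thing against the HNSW and IVFFlat builds.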



Submitted by Alex Chen

Tags

pgvector, embeddings, vector-search