Contribution Ideas
Problem
Here are a few places that could currently use some help:

1. Explore updating cost estimates to not use an index when a large % of rows will be filtered by a `WHERE` condition (to avoid returning no results) - #263
   - [x] `hnsw-filtering-cost` branch
2. Investigate why `l2_distance` (not just `vector_l2_squared_distance`) is called for index scans - gist
   - [x] Explanation in https://github.com/pgvector/pgvector/issues/359#issuecomment-1840786021
   - [x] See if this can be addressed in the Postgres executor
3. Investigate why the index condition isn't used for `bigint` attributes (like with `integer`) - hqann-bigint branch
   - [x] Works with casting (thread)
4. Investigate why parallel index scans aren't used when `amcanparallel` is set - parallel-index-scan3 branch
5. Explore updating cost estimates to not use an index when the limit > `hnsw.ef_search`
   - [x] `index-limit` branch
1 Fix
Optimize Cost Estimates and Index Usage for Vector Searches
Cost estimation and index usage for vector search queries have several known gaps. When a WHERE condition filters out a large percentage of rows, the planner may still choose the approximate index, and the query can then return too few rows or none at all. In addition, index conditions are not applied to bigint attributes the way they are for integer attributes, and parallel index scans are not used even when amcanparallel is set.
1. Update Cost Estimates for WHERE Conditions
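Modify the cost estimation logic to avoid using an index when a large percentage of rows will be filtered out by a WHERE condition. With an approximate index the filter is applied after the index scan, so few or no candidates may survive it. Steering the planner away from the index in that case avoids returning no results.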
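One way to reason about this step is a simplified model: the index scan yields a bounded number of candidates and the WHERE filter then keeps only a fraction of them. The sketch below is not pgvector's cost estimator. `expected_rows_after_filter` and `should_skip_index` are hypothetical helpers, and the selectivity is assumed to come from planner statistics.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: expected number of candidates that survive the filter.
 * candidates  - rows the index scan can yield (an assumed bound)
 * selectivity - fraction of rows that pass the WHERE condition, 0.0 to 1.0 */
double expected_rows_after_filter(int candidates, double selectivity)
{
    assert(candidates >= 0);
    assert(selectivity >= 0.0 && selectivity <= 1.0);
    return candidates * selectivity;
}

/* Hypothetical decision rule: skip the index when the filter is expected to
 * leave fewer rows than the query's LIMIT asks for. */
bool should_skip_index(int candidates, double selectivity, int limit)
{
    return expected_rows_after_filter(candidates, selectivity) < (double) limit;
}
```

Under this model, 40 candidates and a filter that keeps 10% of rows leave about 4 rows, so a query with LIMIT 10 would be better served by another plan.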
    -- Per-query workaround only; the planner change itself belongs in the index cost estimator (C code)
    BEGIN; SET LOCAL enable_indexscan = off; /* run the filtered query here */ COMMIT;
2. Address l2_distance Calls in Index Scans
Investigate and modify the Postgres executor to ensure that only vector_l2_squared_distance is called during index scans, thereby reducing unnecessary computation.
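The reasoning behind this step is that the square root is strictly increasing, so ranking candidates by squared distance gives the same order as ranking by true L2 distance. A minimal sketch of that property follows. The function names are hypothetical stand-ins, not pgvector's `vector_l2_squared_distance` or `l2_distance` implementations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a squared L2 distance: sum of squared differences. */
double squared_l2(const float *a, const float *b, size_t dim)
{
    double sum = 0.0;

    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < dim; i++) {
        double diff = (double) a[i] - (double) b[i];
        sum += diff * diff;
    }
    return sum;
}

/* Returns 1 if x is closer to q than y is, judged by squared distance only.
 * Taking the square root of both sides would not change the comparison,
 * so the root is only needed when the distance value itself is reported. */
int closer_by_squared(const float *q, const float *x, const float *y, size_t dim)
{
    return squared_l2(q, x, dim) < squared_l2(q, y, dim);
}
```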
    /* Review and modify the executor code in Postgres to optimize distance calculations */
3. Enable Index Conditions for bigint Attributes
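Ensure that index conditions are applied for bigint attributes in the same way as for integer attributes. The original issue notes that the condition works with casting, which points to a type-matching problem rather than a missing index. A fix may involve changing how the planner matches bigint conditions to the index.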
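    -- PostgreSQL syntax (ALTER TABLE ... ADD INDEX is MySQL); your_table and your_bigint_column are placeholders
    CREATE INDEX idx_bigint ON your_table (your_bigint_column);
4. Utilize Parallel Index Scans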
Investigate the conditions under which parallel index scans are not being used despite amcanparallel being set. Adjust configurations or code to allow parallel processing for index scans.
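    -- enable_parallel_index_scan is not a PostgreSQL setting; these planner settings influence parallel plans (example values)
    SET max_parallel_workers_per_gather = 4; SET min_parallel_index_scan_size = 0;
5. Limit Index Usage Based on ef_search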
Adjust the cost estimation logic to avoid using an index when the limit exceeds hnsw.ef_search, which can help in optimizing the performance of vector searches.
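The rule described here can be sketched as a cost function that returns a prohibitive value when the requested row count exceeds what the scan is assumed to yield. This is an illustration only, not pgvector's estimator. `index_cost_with_limit` is a hypothetical name, and the assumption that a scan yields at most `ef_search` rows is stated in the code.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical cost rule. Assumption: the index scan yields at most
 * ef_search rows, so a LIMIT above that cannot be satisfied from the index.
 * Returning DBL_MAX makes a planner that compares costs pick another path. */
double index_cost_with_limit(double base_cost, int limit, int ef_search)
{
    assert(ef_search > 0);
    if (limit > ef_search)
        return DBL_MAX;   /* effectively rules out the index path */
    return base_cost;
}
```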
    -- Per-session workaround: raise hnsw.ef_search to at least the query's LIMIT (100 is an example value)
    SET hnsw.ef_search = 100;
Validation
Run a series of vector search queries before and after implementing the changes. Measure the execution time and resource usage to confirm that the optimizations have led to improved performance. Additionally, verify that the expected results are returned without errors.
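Besides timing, one common way to check that the expected results are returned is to compare the ids from the indexed query against the ids from an exact search and compute recall. The sketch below assumes both result sets have already been fetched into arrays. `recall_at_k` is a hypothetical helper, not part of pgvector.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of the exact result ids that also appear in the approximate result.
 * The approximate result may be shorter than the exact one, which is the
 * failure mode the steps above try to avoid. Ids are assumed unique per array. */
double recall_at_k(const long *exact, size_t n_exact,
                   const long *approx, size_t n_approx)
{
    size_t hits = 0;

    assert(n_exact > 0);
    for (size_t i = 0; i < n_exact; i++) {
        for (size_t j = 0; j < n_approx; j++) {
            if (exact[i] == approx[j]) {
                hits++;
                break;
            }
        }
    }
    return (double) hits / (double) n_exact;
}
```

A recall that rises after a change, with execution time still acceptable, is evidence that the change helped. A recall of 1.0 means the indexed query returned every row the exact search did.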
Submitted by
Alex Chen
2450 rep