
High LWLock Contention During Concurrent HNSW Index Scans

Mar 14, 2026

Problem

I’ve been running into LWLock contention with HNSW indexes under concurrent workloads, and I wanted to see if anyone has insights or suggestions for improving this. The problem becomes noticeable at 32+ DB connections, where database load is dominated by LWLock:LockManager wait events. At lower concurrency, QPS scales well and lock contention is minimal, but as we add connections, contention spikes and throughput saturates.

Observations

Lock behavior: Based on the code in hnswscan.c, the search process uses `LockPage(..., HNSW_SCAN_LOCK, ShareLock)` to protect access to the adjacency graph during traversal. These locks are brief but can pile up when multiple queries hit the same graph structures.

Scaling issue: With ~32 workers (one connection per worker), QPS is good and lock contention is low. Pushing from 32 to 100+ workers, contention grows sharply and LWLock:LockManager comes to dominate database load.

What I’ve tried

Concurrency tuning: Capping at ~32 workers works best, but we’d like to scale further if possible.
Instance scaling: Larger instances don’t help much because the bottleneck is lock contention, not compute or I/O.
Yet to try: Prepared statements.

Ideas

1. Finer-grained locking: Is there a way to reduce the granularity of HNSW_SCAN_LOCK so that multiple queries traversing the graph contend less?
2. Asynchronous …
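For anyone reproducing this, one quick way to confirm that LWLock:LockManager really dominates is to sample wait events across active backends while the workload runs. This is only a diagnostic sketch; it assumes nothing beyond the standard pg_stat_activity view:

```
-- Snapshot of current wait events across active backends.
-- Run repeatedly (e.g. via \watch in psql) while the workload is active.
SELECT wait_event_type, wait_event, count(*) AS backends
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY backends DESC;
```

If the top rows are (LWLock, LockManager), the lock manager partitions are the bottleneck rather than buffer I/O or CPU.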

Implement Finer Grained Locking for HNSW Index Scans

Medium Risk

The high LWLock contention during concurrent HNSW index scans is primarily due to a single lock tag (HNSW_SCAN_LOCK) protecting access to the adjacency graph. Each LockPage call goes through the shared lock manager, so when many backends traverse the same graph structure simultaneously they hit the same lock manager hash partitions, which surfaces as LWLock:LockManager waits and reduced throughput as the number of concurrent connections rises.


  1. Analyze Locking Strategy

     Review the current locking strategy in hnswscan.c to identify opportunities for finer-grained locking. Consider breaking HNSW_SCAN_LOCK into multiple locks that each protect a smaller section of the adjacency graph, allowing concurrent access.
  2. Implement Fine-Grained Locks

     Modify the HNSW index implementation to use multiple locks instead of a single lock. This could involve locks for individual nodes or clusters within the graph, allowing multiple queries to traverse different parts of the graph simultaneously without contention.
  3. Test Locking Changes

     Run performance tests with varying levels of concurrency (32, 64, 100+ connections) to measure the impact of the new locking strategy on LWLock contention and overall throughput. Monitor the LockManager wait events to ensure they are reduced.
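While the concurrency tests run, pg_locks can show how page-level lock traffic is distributed across relations. This is a diagnostic sketch using the standard pg_locks view; interpreting the counts is workload-dependent:

```
-- Count page-level heavyweight lock entries per relation during the benchmark;
-- counts concentrated on one index suggest hot spots in the graph traversal.
SELECT relation::regclass AS relname, count(*) AS page_locks
FROM pg_locks
WHERE locktype = 'page'
GROUP BY relation
ORDER BY page_locks DESC;
```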
  4. Optimize Query Patterns

     Review and optimize query patterns so they do not repeatedly lock the same graph structures, and consider prepared statements to reduce per-execution planning and lock-acquisition overhead. Note that pgvector's HNSW indexes serve nearest-neighbor queries written as ORDER BY ... LIMIT; a bare distance-threshold WHERE clause will not use the index. The table and column names below are illustrative:

     ```
     -- PREPARE benefits from explicit parameter types; ORDER BY ... LIMIT
     -- is the query shape that can actually use an HNSW index.
     PREPARE knn (vector, int) AS
       SELECT * FROM items ORDER BY embedding <-> $1 LIMIT $2;
     EXECUTE knn('[0.1, 0.2, 0.3]', 10);
     ```
  5. Monitor and Adjust

     After deploying the changes, continuously monitor the system for any signs of contention or performance degradation. Be prepared to further adjust the locking strategy or query patterns based on real-world usage and performance metrics.

Validation

To confirm the fix worked, compare the LWLock:LockManager wait events and overall QPS before and after implementing the changes. A significant reduction in wait events and an increase in throughput under high concurrency should indicate success.
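One way to make the before/after comparison concrete is to sample LWLock waits into a table during each benchmark run and compare the counts afterwards. A sketch; the table name lock_wait_samples is made up for illustration:

```
-- Record a snapshot of LWLock waits; run periodically during each benchmark,
-- then compare LockManager counts between the before and after runs.
CREATE TABLE IF NOT EXISTS lock_wait_samples (
    sampled_at timestamptz NOT NULL DEFAULT now(),
    wait_event text,
    backends   int
);

INSERT INTO lock_wait_samples (wait_event, backends)
SELECT wait_event, count(*)::int
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
GROUP BY wait_event;
```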



Submitted by

Alex Chen (2450 rep)

Tags

pgvector, embeddings, vector-search