Proposal: Explicitly set HNSW build workers
Problem
Selecting parallel workers for HNSW follows a similar method to IVFFLAT: it leverages the PostgreSQL `plan_create_index_workers` function, which uses the number of heap (table) pages returned from `estimate_rel_size` to determine the number of parallel workers to use. This makes sense for B-tree / IVFFLAT, where most of the time is spent loading data from the table rather than on calculations. But for HNSW, we're likely underestimating the number of workers we could use to speed up builds, as the HNSW build process is much more CPU heavy. For example, review the [charts in this blog post][1], which show that inserts per second keep scaling as we increase the number of concurrent inserts.

We can see this play out in practice. Using a similar sample to the [aforementioned blog post][1] on a 64-core m7gd.16xlarge instance, I created a table with 1,000,000 128-dim vectors. With parallel builds enabled, PostgreSQL elected to spawn 4 parallel workers (+ leader).

[code block]

For the same data set, I hardcoded the build to use 8 parallel workers (+ leader) and saw a significant speedup:

[code block]

We're still likely [underestimating parallel workers due to TOAST][2], but this case is not affected by that, as a 128-dim vector is not TOAST'd. Additionally, as mentioned, HNSW index builds are CPU bound, so we'd want to maximize the number of cores we can use for the process. (There are still considerations around shared/temporary memory, but I'm seeing the brunt of the issue a
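To make the page-based heuristic described above concrete, here is a minimal Python sketch of how the worker count grows with heap size. It is a simplified model, not PostgreSQL's actual implementation: the function name `estimate_build_workers` and its parameters are hypothetical, the defaults assume 8 kB pages with `min_parallel_table_scan_size` at 8MB (1024 pages), and it ignores the `maintenance_work_mem` cap and any table-level `parallel_workers` setting. The page counts are illustrative and not taken from the post.

```python
def estimate_build_workers(heap_pages: int,
                           min_scan_pages: int = 1024,
                           max_workers: int = 8) -> int:
    """Simplified model of page-based worker selection (an assumption,
    not the real planner code): start at one worker once the table
    reaches the scan-size threshold, then add one worker each time the
    table is at least three times larger than the previous threshold."""
    if max_workers <= 0 or heap_pages < min_scan_pages:
        return 0  # table too small, or parallelism disabled

    workers = 1
    threshold = max(min_scan_pages, 1)
    while heap_pages >= threshold * 3:
        workers += 1
        threshold *= 3

    # Never exceed the configured ceiling (max_parallel_maintenance_workers).
    return min(workers, max_workers)


# Illustrative heap sizes, in 8 kB pages.
for pages in (500, 1_024, 10_000, 71_000, 1_000_000):
    print(pages, estimate_build_workers(pages))
```

In this model the count grows by one each time the table triples in size, so a moderately sized table gets only a handful of workers no matter how many cores are idle. That is reasonable for I/O-bound builds but leaves CPU on the table for HNSW.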
Solution
Hey @jkatz, since it's currently possible to set the number of parallel workers with:

[code block]

I'm hesitant to add a new option for this (as it can be set in the session used for `CREATE INDEX` without affecting other sessions).
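As a rough illustration of the session-scoped approach described in this reply, the sketch below builds the SQL statements one might run in a single session. `max_parallel_maintenance_workers` is a standard PostgreSQL setting, and `USING hnsw` with `vector_l2_ops` is pgvector's index syntax. The helper name `hnsw_build_statements`, the table `items`, and the column `embedding` are hypothetical. The sketch only produces strings and does not connect to a database.

```python
def hnsw_build_statements(table: str, column: str, workers: int,
                          opclass: str = "vector_l2_ops") -> list[str]:
    """Return SQL for a session that raises the parallel maintenance
    worker limit, builds an HNSW index, then restores the default.
    Hypothetical helper for illustration only."""
    if workers < 0:
        raise ValueError("workers must be non-negative")
    # Accept only plain identifiers so nothing unexpected is interpolated.
    for name in (table, column, opclass):
        if not name.isidentifier():
            raise ValueError(f"not a plain identifier: {name!r}")

    return [
        # Applies to the current session only; other sessions are unaffected.
        f"SET max_parallel_maintenance_workers = {workers};",
        f"CREATE INDEX ON {table} USING hnsw ({column} {opclass});",
        # Return the setting to its configured default afterwards.
        "RESET max_parallel_maintenance_workers;",
    ]


for stmt in hnsw_build_statements("items", "embedding", 8):
    print(stmt)
```

The workers requested this way are still drawn from the pool limited by `max_parallel_workers` and `max_worker_processes`, so those may need raising as well for larger values.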
Validation
Resolved in pgvector/pgvector GitHub issue #397. Community reactions: 0 upvotes.
Submitted by
Alex Chen