
Proposal: Explicitly set HNSW build workers


Mar 14, 2026 · 0 views

Confidence Score: 88%

Problem

Selecting parallel workers for HNSW follows a similar method to IVFFlat: it relies on the PostgreSQL `plan_create_index_workers` function, which uses the number of heap (table) pages returned from `estimate_rel_size` to determine how many parallel workers to use. This makes sense for B-tree and IVFFlat, where most of the time is spent loading data from the table rather than on calculations. For HNSW, though, we're likely underestimating the number of workers we could use to speed up builds, as the HNSW build process is much more CPU-heavy. For example, review the [charts in this blog post][1], which show that inserts per second keep scaling as the number of concurrent inserts increases.

We can see this play out in practice. Using a sample similar to the one in the [aforementioned blog post][1] on a 64-core m7gd.16xlarge instance, I created a table with 1,000,000 128-dim vectors. With parallel builds enabled, PostgreSQL elected to spawn 4 parallel workers (+ leader).

[code block]

For the same data set, I hardcoded the build to use 8 parallel workers (+ leader) and saw a significant speedup:

[code block]

We're still likely [underestimating parallel workers due to TOAST][2], but this case is not affected by that, as a 128-dim vector is not TOAST'd. Additionally, as mentioned, HNSW index builds are CPU bound, so we'd want to maximize the number of cores we can use for the process. (There are still considerations around shared/temporary memory, but I'm seeing the brunt of the issue a
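To make the page-based heuristic concrete, here is a minimal Python sketch of how a worker count could be derived from heap pages alone. It is a simplified model, not PostgreSQL's actual code: it assumes the default `min_parallel_table_scan_size` of 8MB (1024 pages of 8 kB), no per-table `parallel_workers` storage parameter, and it ignores the `maintenance_work_mem` cap that the real planner also applies. The function name, the page count, and the worker limit used in the example are hypothetical.

```python
def estimate_build_workers(heap_pages: int,
                           max_maintenance_workers: int,
                           min_scan_pages: int = 1024) -> int:
    """Simplified model of a page-count-driven worker estimate.

    Adds one worker each time the table is 3x larger than the previous
    threshold, then caps the result at max_maintenance_workers.
    """
    # Small tables (or a zero worker limit) get a serial build.
    if max_maintenance_workers <= 0 or heap_pages < min_scan_pages:
        return 0

    workers = 1
    threshold = min_scan_pages
    # Logarithmic growth: each extra worker needs 3x more heap pages.
    while heap_pages >= threshold * 3:
        workers += 1
        threshold *= 3

    return min(workers, max_maintenance_workers)


# Hypothetical input: roughly 14 rows of 128-dim vectors per 8 kB page
# gives about 71,429 heap pages for 1,000,000 rows (an assumed estimate).
print(estimate_build_workers(heap_pages=71_429, max_maintenance_workers=8))
```

Under these assumed inputs the model yields 4 workers, which is consistent with the count reported above. Because the estimate grows only with the logarithm of table size, it says nothing about how CPU-heavy the per-row work is, which is the gap this proposal points at.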


1 Fix

Canonical Fix

Moderate Confidence Fix

84% confidence · 100% success rate · 1 verification · Last verified Mar 14, 2026

Solution: Proposal: Explicitly set HNSW build workers

Low Risk

Hey @jkatz, since it's currently possible to set the number of parallel workers with:

[code block]

I'm hesitant to add a new option for this (as it can be set in the session used for `CREATE INDEX` without affecting other sessions).
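The snippet referenced in this reply is not preserved here. As a rough illustration only, and assuming it refers to PostgreSQL's session-level `max_parallel_maintenance_workers` setting, the Python sketch below generates the statements one might run in a single session around an HNSW build. The helper name, the `items` table, and the `embedding` column are hypothetical, and identifiers are interpolated without quoting, so this is for illustration rather than for use with untrusted input.

```python
def hnsw_build_statements(table: str, column: str, workers: int,
                          opclass: str = "vector_l2_ops") -> list[str]:
    """Return session-scoped SQL for building an HNSW index.

    Assumes max_parallel_maintenance_workers is the setting being
    discussed; the leader process participates in addition to workers.
    """
    if workers < 0:
        raise ValueError("workers must be non-negative")
    return [
        # Affects only the current session, not other connections.
        f"SET max_parallel_maintenance_workers = {workers};",
        f"CREATE INDEX ON {table} USING hnsw ({column} {opclass});",
        # Restore the server default once the build is finished.
        "RESET max_parallel_maintenance_workers;",
    ]


for stmt in hnsw_build_statements("items", "embedding", 8):
    print(stmt)
```

The number of workers actually launched is still bounded by `max_parallel_workers` and `max_worker_processes`, so raising the session setting alone may not be enough on a server with low limits.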



Validation

Resolved in pgvector/pgvector GitHub issue #397. Community reactions: 0 upvotes.

Verification Summary

Worked: 1

Last verified Mar 14, 2026

Environment

Submitted by

Alex Chen

2450 rep