Parallel index builds for HNSW
Problem
Hi all, support for in-memory, parallel index builds is now available in the hnsw-fast-build branch :tada:

A few benchmarks from my local machine with the SIFT 1M dataset (128 dimensions):

code version | processes | build time
--- | --- | ---
0.5.1 | 1 | 415 sec
master | 1 | 309 sec
branch | 2 | 184 sec
branch | 4 | 107 sec
branch | 8 | 83 sec

A few useful settings are:

[code block]

For a high number of workers, you may also need to increase `max_parallel_workers` (default is 8).

Please test it out (in a non-production environment) and share any feedback. Aiming for a release (0.5.2) at the end of January if all goes well.
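To make the scaling in the benchmark table easier to read, here is a minimal sketch that turns the reported build times into speedup and parallel efficiency. It assumes `master` with 1 process is a fair serial baseline for the branch, which the post does not state. The helper names `speedup` and `efficiency` are illustrative only.

```python
# Build times in seconds, copied from the benchmark table in the post.
build_times = {
    ("0.5.1", 1): 415,
    ("master", 1): 309,
    ("branch", 2): 184,
    ("branch", 4): 107,
    ("branch", 8): 83,
}

# Assumption: master with 1 process serves as the serial baseline.
baseline = build_times[("master", 1)]


def speedup(seconds, baseline_seconds=baseline):
    """How many times faster a build is than the baseline."""
    return baseline_seconds / seconds


def efficiency(seconds, processes, baseline_seconds=baseline):
    """Speedup divided by process count (1.0 would be perfect scaling)."""
    return speedup(seconds, baseline_seconds) / processes


for (version, processes), seconds in build_times.items():
    print(
        f"{version:>6} x{processes}: {speedup(seconds):.2f}x speedup, "
        f"{efficiency(seconds, processes):.0%} efficiency"
    )
```

Under that baseline, 8 processes come out roughly 3.7x faster than single-process `master`, so each added worker contributes less than the one before it.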
Solution: Parallel index builds for HNSW
@ankane Awesome!
- 1
I'm running a series of tests, but I wanted to share a very early result. Here is my test info:
- 2
Dataset: 10MM 1,536-dim randomly generated normalized vectors
Instance: r7gd.16xlarge (64 vCPU, 512GB RAM)
Storage: NVMe
Build parameters:
- `m`: 16
- `ef_construction`: 100
PostgreSQL configuration of relevance:
- `shared_buffers`: 128GB
- `maintenance_work_mem`: 128GB
- `max_parallel_maintenance_workers`: 63 (with leader, so this will be 64)
- `max_wal_size`: 20GB
- `wal_compression`: zstd
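As a rough illustration of how the parameters above map onto a build, the sketch below renders session-level `SET` statements and a `CREATE INDEX` statement as text. The table name `items`, column name `embedding`, and operator class `vector_l2_ops` are hypothetical, since the post does not say which table or distance function was used. This is not the command that was actually run. `shared_buffers`, `max_wal_size`, and `wal_compression` are left out because they are typically set in the server configuration rather than per session.

```python
# Values taken from the test description; everything else is hypothetical.
session_settings = {
    "maintenance_work_mem": "128GB",
    "max_parallel_maintenance_workers": 63,
}
index_options = {"m": 16, "ef_construction": 100}


def render_build_sql(table="items", column="embedding", opclass="vector_l2_ops"):
    """Return the SQL text for a parallel HNSW build (names are placeholders)."""
    lines = []
    for name, value in session_settings.items():
        # Quote string values such as '128GB'; leave integers bare.
        literal = f"'{value}'" if isinstance(value, str) else str(value)
        lines.append(f"SET {name} = {literal};")
    options = ", ".join(f"{key} = {val}" for key, val in index_options.items())
    lines.append(
        f"CREATE INDEX ON {table} USING hnsw ({column} {opclass}) WITH ({options});"
    )
    return "\n".join(lines)


print(render_build_sql())
```

The printed statements can be pasted into `psql` after replacing the placeholder names with real ones.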
- 3
[hnsw-fast-build-branch][1] completed in 25m23s (1523227.801 ms)
- When I checked in on `master`, it was about 16% completed. However, when looking at `pg_stat_progress_create_index`, [hnsw-fast-build-branch][1] was outpacing `master` by about 10x.
- [hnsw-fast-build-branch][1] was indexing at about 6,565 tps, which was more than 6x faster than the [concurrent insert method][2] on a similar data set...and the other data set had `ef_construction` at `64`!
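The throughput figure can be cross-checked from the reported numbers. This is a small sketch, assuming "10MM" means exactly 10,000,000 rows and that "tps" here means rows indexed per second:

```python
rows = 10_000_000         # "10MM" vectors from the test description (assumed exact)
elapsed_ms = 1523227.801  # reported build time in milliseconds

elapsed_s = elapsed_ms / 1000
minutes, seconds = divmod(elapsed_s, 60)
rows_per_second = rows / elapsed_s

print(f"{int(minutes)}m{int(seconds)}s, {rows_per_second:,.0f} rows/s")
# prints "25m23s, 6,565 rows/s"
```

Both values agree with the post: 25m23s of wall time and about 6,565 rows indexed per second.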
- 4
This looks really promising! There are a few more tests I plan to run:
Validation
Resolved in pgvector/pgvector GitHub issue #409. Community reactions: 7 upvotes.
Submitted by
Alex Chen