Can I help get tinyint or half branches released?
Problem
Is there more to do on the `tinyint` or `half` branches to get them released, or are they ready to be put into `0.5.1`? If there is more to do, let me know and I'll see if it's something I could take care of (e.g., code, docs, tests).
Solution: Can I help get tinyint or half branches released?
- 1
Sure, I'll add my use cases here. For context, we're doing chem/bio ML type work. Thanks for your thoughts on the below.
- 2
Our smaller datasets have about 300 million (300M) molecules in them. For those molecules there are a few different types of vectors we'd like to generate. Some of these vectors are generated via cheminformatics methods (basically a molecular hash function with certain similarity properties) and others are generated via embeddings from various ML models.
- 3
1. 300M sparse "count" vectors w/ 1024-2048 dimensions (Morgan Count Fingerprints). Mo…
2. 300M embedding vectors w/ 128-1024 dimensions where each dimension is a non-zero decimal number (these are essentially the same as the standard embeddings everyone uses for various ML tasks). We would likely be ok giving up some precision by using half-size floats, product quantization, or any other similar technique.
3. I'll also add that we'd like 300M sparse bit vectors w/ 1024-2048 dimensions (Morgan Bit Fingerprints). For these vectors, we'd like to be able to do ANN searches across them using tanimoto/jaccard distance. I recognize these are probably not going to be supported as true A
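To make the tanimoto/jaccard distance mentioned for the bit fingerprints concrete, here is a minimal Python sketch. It is not pgvector code: the helper names `bits_to_int` and `tanimoto_distance` are hypothetical, and it assumes each fingerprint is given as the set of positions of its "on" bits, packed into a Python integer.

```python
def bits_to_int(on_bits):
    """Pack an iterable of 'on' bit positions into a single Python int (hypothetical helper)."""
    value = 0
    for position in on_bits:
        value |= 1 << position
    return value


def tanimoto_distance(a, b):
    """Tanimoto/Jaccard distance between two bit vectors packed as ints.

    distance = 1 - |A AND B| / |A OR B|
    """
    union = bin(a | b).count("1")
    if union == 0:
        # Assumed convention: two all-zero fingerprints are treated as identical.
        return 0.0
    intersection = bin(a & b).count("1")
    return 1.0 - intersection / union


# Two toy fingerprints sharing 2 of 4 distinct on-bits.
fp_a = bits_to_int([1, 5, 9])
fp_b = bits_to_int([5, 9, 12])
print(tanimoto_distance(fp_a, fp_b))
```

An exact scan like this is linear in the number of fingerprints; it only illustrates the metric itself, not how an ANN index over 300M vectors would be built.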
- 4
For some higher-level context, I'm currently running Postgres via GCP Cloud SQL to store our other molecular data, and it would be nice to integrate the molecular fingerprints/counts and embeddings into Postgres as well instead of bringing in another ANN lib/service (e.g., faiss, pinecone, etc.). I ran some rough numbers on storage cost: with the current pgvector, I estimate (very hand-wavy, back-of-the-envelope) a storage cost of about $2k-$3k/yr for each set of 300M molecule fingerprint count vectors I store. Cutting that storage cost down by 50% or 75% would make it
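The 50% and 75% figures follow from element width alone. The sketch below is a rough illustration under stated assumptions: the current vector type stores 4 bytes per dimension, a `half` type would store 2, and a `tinyint` type would store 1. It counts raw payload only, ignoring per-row headers and index size, and it does not model any cloud pricing. The function names are hypothetical.

```python
# Assumed bytes per dimension for each element type (raw payload only).
BYTES_PER_ELEMENT = {"float32": 4, "half": 2, "tinyint": 1}


def raw_storage_bytes(num_vectors, dims, element_type):
    """Raw payload size in bytes, ignoring row headers and indexes."""
    return num_vectors * dims * BYTES_PER_ELEMENT[element_type]


def savings_vs_float32(element_type):
    """Fraction of raw storage saved relative to 4-byte floats."""
    return 1 - BYTES_PER_ELEMENT[element_type] / BYTES_PER_ELEMENT["float32"]


NUM_VECTORS = 300_000_000  # 300M molecules, as in the post

for dims in (1024, 2048):
    for element_type in ("float32", "half", "tinyint"):
        size_tb = raw_storage_bytes(NUM_VECTORS, dims, element_type) / 1e12
        saved = savings_vs_float32(element_type)
        print(f"{dims} dims, {element_type}: {size_tb:.2f} TB raw ({saved:.0%} saved)")
```

Under these assumptions, 300M 1024-dimension vectors need roughly 1.2 TB of raw payload at 4 bytes per dimension, so halving or quartering the element width maps directly onto the 50% and 75% reductions.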
Validation
Resolved in pgvector/pgvector GitHub issue #326. Community reactions: 3 upvotes.
Submitted by
Alex Chen
2450 rep