Can I help get tinyint or half branches released?

Question

Accepted Answer

Sure, I'll add my use cases here. For context, we're doing chem/bio ML type work. Thanks for your thoughts on the below.

Our smaller datasets have about 300 million (300M) molecules in them. For those molecules there are a few different types of vectors we'd like to generate. Some of these vectors are generated via cheminformatics methods (basically a molecular hash function with certain similarity properties) and others are generated via embeddings from various ML models.

1. 300M sparse "count" vectors w/ 1024-2048 dimensions (Morgan Count Fingerprints). Mo…
2. 300M embedding vectors w/ 128-1024 dimensions where each dimension is a non-zero decimal number (these are essentially the same as the standard embeddings everyone uses for various ML tasks). We would likely be OK giving up some precision by using half-size floats, product quantization, or any other similar technique.
3. I'll also add that we'd like 300M sparse bit vectors w/ 1024-2048 dimensions (Morgan Bit Fingerprints). For these vectors, we'd like to be able to do ANN searches across them using Tanimoto/Jaccard distance. I recognize these are probably not going to be supported as true A…

For some higher level context, I'm currently running Postgres via GCP Cloud SQL to store our other molecular data, and it would be nice to be able to integrate the molecular fingerprints/counts and embeddings into Postgres as well instead of needing to bring in another ANN lib/service (e.g., faiss, pinecone, etc.). I ran some rough numbers on storage cost and, using the current pgvector, I estimate (very hand-wavy, back of the envelope) a storage cost of about $2k-$3k/yr for each set of 300M molecule fingerprint count vectors I store. Cutting that storage cost down by 50% or 75% would make it…
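To make the 50%/75% figures and the Tanimoto/Jaccard request a bit more concrete, here is a rough Python sketch. It assumes the current vector type uses 4 bytes per dimension (float32), that a "half" element would take 2 bytes, and that a "tinyint" element would take 1 byte; those sizes are my assumptions, not something confirmed for either branch. It only counts the raw vector payload (no row headers, TOAST, or index overhead), and the function names are made up for illustration.

```python
# Assumed element sizes in bytes. The "half" and "tinyint" values are
# assumptions for illustration, not confirmed sizes from either branch.
BYTES_PER_ELEMENT = {"float32": 4, "half": 2, "tinyint": 1}


def raw_payload_gib(n_vectors: int, dims: int, element_type: str) -> float:
    """Raw vector payload in GiB, ignoring per-row and index overhead."""
    total_bytes = n_vectors * dims * BYTES_PER_ELEMENT[element_type]
    return total_bytes / 2**30


def savings_vs_float32(element_type: str) -> float:
    """Fraction of raw payload saved relative to 4-byte floats."""
    return 1.0 - BYTES_PER_ELEMENT[element_type] / BYTES_PER_ELEMENT["float32"]


def tanimoto_distance(on_bits_a: set, on_bits_b: set) -> float:
    """Tanimoto/Jaccard distance between two bit fingerprints.

    Each fingerprint is given as the set of indices of its "on" bits,
    which is a natural representation for sparse fingerprints.
    """
    union = len(on_bits_a | on_bits_b)
    if union == 0:
        # Two all-zero fingerprints: treat them as identical.
        return 0.0
    return 1.0 - len(on_bits_a & on_bits_b) / union


if __name__ == "__main__":
    n = 300_000_000
    for dims in (1024, 2048):
        for element_type in ("float32", "half", "tinyint"):
            size = raw_payload_gib(n, dims, element_type)
            saved = savings_vs_float32(element_type)
            print(f"{dims} dims, {element_type}: {size:,.0f} GiB raw ({saved:.0%} saved)")

    # Hypothetical fingerprints: 2 shared on-bits out of 4 distinct on-bits.
    print(tanimoto_distance({1, 5, 9}, {5, 9, 42}))
```

Under those assumptions the raw payload scales linearly with element size, which is where the 50% (half) and 75% (tinyint) reductions come from. Actual on-disk and index sizes would need to be measured.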
