Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Andrew Dalke
On Sep 9, 2020, at 04:00, Lewis Martin wrote:
> I'd like to keep it FOSS since it's for academic publication and hopefully to
> be re-used. Chemfp is amazing but brute-forcing 100 million by 100 million
> would surely still take a long time compared with an approximate nearest
> neighbor approach.
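For a sense of scale on the brute-force option quoted above, here is a minimal sketch (not chemfp's implementation) of Tanimoto similarity computed directly on packed fingerprint bytes, plus the raw pair count for an all-by-all comparison. The function name `tanimoto` and the constants are illustrative assumptions.

```python
def tanimoto(a: bytes, b: bytes) -> float:
    """Tanimoto similarity of two equal-length packed bit fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    ia = int.from_bytes(a, "big")
    ib = int.from_bytes(b, "big")
    common = bin(ia & ib).count("1")                  # bits set in both
    union = bin(ia).count("1") + bin(ib).count("1") - common
    return common / union if union else 0.0


# Hypothetical data set size taken from the thread: 100 million ligands.
N_MOLS = 100_000_000
ALL_PAIRS = N_MOLS * N_MOLS   # comparisons in a naive all-by-all search

example_similarity = tanimoto(b"\x0f", b"\x03")
```

The pair count comes to 10**16 before exploiting symmetry, which is the cost an approximate nearest-neighbor index is meant to avoid.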

Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Lewis Martin
OK, to sum it up: for me, writing to binary is a neat, fast, and low-storage solution for fingerprints. Example:

    o = open('fingerprints.bin', 'wb')
    gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=64)
    for smi in tqdm_notebook(df['smiles']):
        mol = Chem.MolFromSmiles(smi)
        fp
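The snippet above is cut off in the archive, so the following is only a rough sketch of the general idea it describes: fixed-width binary records written back to back, which allows any fingerprint to be read by offset without loading the whole file. It uses the standard library only; `RECORD_SIZE`, `write_records`, and `read_record` are hypothetical names, and the 8-byte width assumes the 64-bit `fpSize` shown in the snippet.

```python
import os
import tempfile

RECORD_SIZE = 8  # bytes per fingerprint, assuming 64-bit fingerprints


def write_records(path, records):
    """Write fixed-width byte records back to back; return how many were written."""
    count = 0
    with open(path, "wb") as fh:
        for rec in records:
            if len(rec) != RECORD_SIZE:
                raise ValueError("every record must be RECORD_SIZE bytes")
            fh.write(rec)
            count += 1
    return count


def read_record(path, index):
    """Read the record at position `index` by seeking straight to its offset."""
    with open(path, "rb") as fh:
        fh.seek(index * RECORD_SIZE)
        rec = fh.read(RECORD_SIZE)
    if len(rec) != RECORD_SIZE:
        raise IndexError("record index out of range")
    return rec


# Stand-in data: three dummy 8-byte "fingerprints" instead of real RDKit output.
demo = [bytes([i]) * RECORD_SIZE for i in range(3)]
demo_path = os.path.join(tempfile.mkdtemp(), "fingerprints.bin")
n_written = write_records(demo_path, demo)
```

Because every record has the same width, the file size is just the record count times `RECORD_SIZE`, and no separator or index is needed.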

Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Greg Landrum
The most efficient (easy) way to store the fingerprints is using DataStructs.BitVectToBinaryText(). That will return a 64-byte binary string for a 512-bit fingerprint. FWIW: if you haven't seen the recent blog post about similarity searching with short fingerprints: http://rdkit.blogspot.com/2020/08
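The 64-byte figure is simply 512 bits divided by 8 bits per byte. As an illustration only, the sketch below packs a list of on-bit indices into such a byte string in plain Python and works out the total storage for 100 million fingerprints. It is a stand-in for, not a reimplementation of, DataStructs.BitVectToBinaryText(); the helper names and the bit order within each byte are assumptions and may not match RDKit's layout.

```python
def pack_bits(on_bits, n_bits=512):
    """Pack the indices of set bits into n_bits // 8 bytes."""
    if n_bits % 8:
        raise ValueError("n_bits must be a multiple of 8")
    buf = bytearray(n_bits // 8)
    for i in on_bits:
        if not 0 <= i < n_bits:
            raise ValueError("bit index out of range")
        buf[i // 8] |= 1 << (i % 8)   # assumed layout: lowest bit first
    return bytes(buf)


def unpack_bits(data):
    """Recover the sorted indices of set bits from a packed byte string."""
    return [
        byte_idx * 8 + bit
        for byte_idx, byte in enumerate(data)
        for bit in range(8)
        if (byte >> bit) & 1
    ]


record = pack_bits([0, 9, 511])        # one 512-bit fingerprint
N_MOLS = 100_000_000                   # data set size from the thread
total_bytes = N_MOLS * len(record)     # raw storage with no per-record overhead
```

At 64 bytes each, 100 million fingerprints come to 6.4 GB of raw data, roughly the size of the SMILES text file mentioned in the original question.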

Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Lewis Martin
Cheers Francois - that might be the way to go actually. I'll try with 'bitstring' https://github.com/scott-griffiths/bitstring and I guess write the data as concatenated bitarrays in chunked binary files. I'd like to keep it FOSS since it's for academic publication and hopefully to be re-used. Chem

Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Francois Berenger
On 09/09/2020 09:35, Lewis Martin wrote: Hi RDKit, Looking for advice on an RDKit-adjacent problem please. Ultimately I'd like to fit an approximate nearest-neighbors index on a dataset of 100 million ligands, featurized by Morgan fingerprint. The text file of the SMILES is ~6 GB but this blows o

[Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Lewis Martin
Hi RDKit, Looking for advice on an RDKit-adjacent problem please. Ultimately I'd like to fit an approximate nearest-neighbors index on a dataset of 100 million ligands, featurized by Morgan fingerprint. The text file of the SMILES is ~6 GB but this blows out when loaded with pandas.read_csv() or f.