Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory
On Sep 9, 2020, at 04:00, Lewis Martin wrote: > I'd like to keep it FOSS since its for academic publication and hopefully to > be re-used. Chemfp is amazing but brute-forcing 100million by 100million > would surely still take a long time compared with an approximate nearest > neighbor approach. As a rough guideline, the chemfp paper Fig. 3 says that searching 100M PubChem fingerprints for the k=10 neighbors averages about 100 ms on a Haswell machine (See https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0398-8 ). That was for chemfp 3.3 using 881 bits. You have 512 bits, which is faster (0.6x time, according to my timings). Chemfp 1.6 is about the same performance as chemfp 3.4 for this case. So that's 60ms per query, on average, or 70 days for a single-processor query. (It's hard to compare directly though as the BitBound algorithm is sublinear; I found the PubChem fingerprint kNN search scales at O(n** ~0.6) while Morgan 2048-bit scales at O(n** ~0.8). Then again, the Haswell machine is several years old, so I'm leaving the numbers as-is for an estimate. This is easily parallelized yourself, or you can use chemfp's built-in parallelized NxN kNN searches. (The amount of scaleup on a single multi-core machine depends on your memory architecture.) This means that it shouldn't be hard to get a <1 week performance out of chemfp. Performance-wise, chemfp 3.4.1 (the latest version of the commercial track) is the same as chemfp 1.6.1 (the latest *no-cost*/open-source version). One caveat - chemfp 1.x has a limit of 2**31 bytes for the fingerprints, or 33.6M 512-bit fingerprints. (chemfp 2.x or later use 64-bit indexing and don't have this limitation.) To get around that, with 1.6.1, you'll need split your data into 4 chucks, and search each independently. This also reduces the effect of sublinear scaling. Or, there's no-cost academic licensing for chemfp 3.x using pre-compiled binaries which work on most Linux-based OSes. Still, I think an exact search is entirely doable using FOSS software. Switching to a smaller fingerprint size is a form of approximate nearest neighbors. If you take Greg's observation and drop the fingerprint size from 512 bits to 128 bits then you can get about a factor of 3 faster, and have everything fit into chemfp 1.6's 2GB limit. Best regards, Andrew da...@dalkescientific.com P.S. On Wed, Sep 9, 2020 at 11:29 AM Francois Berenger wrote: > There are many published methods, some open-source software (like > Dalke's chemfp) and even some commercial ones > which claim they are lightning fast (even reaching real-time search > speed!). Dalke's chemfp also claims real-time search speed. ;) It's always a claim of what size that means. Also, at least a couple of those systems use multiple threads to speed up single-query search, which interferes with multi-threaded batch search, as with NxN search on a single machine. ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory
OK to sum it up, for me writing to binary is a neat, fast, and low-storage solution for fingerprints. Example: o = open('fingerprints.bin', 'wb') gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=64) for smi in tqdm_notebook(df['smiles']): mol = Chem.MolFromSmiles(smi) fp = gen_mo.GetFingerprint(mol) bs = bitstring.BitArray(bin=fp.ToBitString()) o.write(bs.bytes) o.close() Most of the time was being taken up in creating numpy arrays. For instance: %%timeit fp = np.array(gen_mo.GetFingerprint(mol)) 351 µs ± 2.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) vs: %%timeit fp = gen_mo.GetFingerprint(mol) bs = bitstring.BitArray(bin=fp.ToBitString()) 42 µs ± 273 ns per loop (mean ± std. dev. of 7 runs, 1 loops each) so when you remove that step, 100mm ligands takes about 4 1/2 hours in serial, no need for parallelization. Cheers! PS for newbs like me, read the binary fingerprints like this: i = open('fingerprints.bin', 'rb') bits = '' for _ in range(num_fps): fp_bytes = i.read(8) bits += bitstring.BitArray(bytes=fp_bytes).bin i.close() fingerprints_concat = (np.fromstring(bits,'u1') - ord('0')).reshape(num_fps, 64) On Wed, Sep 9, 2020 at 1:28 PM Greg Landrum wrote: > The most efficient (easy) way to store the fingerprints is using > DataStructs.BitVectToBinaryText(). That will return a 64byte binary string > for a 512bit fingerprint. > > FWIW: if you haven't seen the recent blog post about similarity searching > with short fingerprints: > http://rdkit.blogspot.com/2020/08/doing-similarity-searches-with-highly.html > > -greg > > > On Wed, Sep 9, 2020 at 2:37 AM Lewis Martin > wrote: > >> Hi RDKit, >> >> Looking for advice on an rdkit-adjacent problem please. Ultimately I'd >> like to fit an approximate-nearest neighbors index on a dataset of 100 >> million ligands, featurized by morgan fingerprint. The text file of the >> smiles is ~6gb but this blows out when loaded with pandas.read_csv() or >> f.readlines() due to weird memory allocation issues. >> >> >> It would take 45hrs to process the file in serial (i.e. read line, create >> mol, fingerprint, convert to np.arr or sparse arrays) in a streaming manner >> so now I'd like to parallelize the job with joblib, which would multiply >> the memory requirements by the number of processes running at a time. >> >> So: what is the smallest possible representation for a binary >> fingerprint? Using `sys.getsizeof` on a >> rdkit.DataStructs.cDataStructs.ExplicitBitVect object tells me it is 96 >> bytes, but I'm not sure whether to believe that since, like csr_matrix, the >> size depends on accurately returning the object's data. Here's an example >> demonstrating this: >> >> from rdkit import Chem >> from rdkit.Chem import rdFingerprintGenerator >> smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1' >> mol = Chem.MolFromSmiles(smi) >> gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=512) >> fp = gen_mo.GetFingerprint(mol) >> sparse_fp = sparse.csr_matrix(fp) >> >> print('ExplicitBitVect object size:', getsizeof(fp)) >> print('Sparse matrix size (naive):', getsizeof(sparse_fp)) >> print('Sparse matrix size (real):', >> sparse_fp.data.nbytes+sparse_fp.indices.nbytes+sparse_fp.indptr.nbytes) >> print('fp.ToBinary size:', getsizeof(fp.ToBinary())) >> print('fp.ToBinary size:', getsizeof(fp.ToBase64())) >> >>> >> >> ExplicitBitVect object size: 96 >> Sparse matrix size (naive): 64 >> Sparse matrix size (real): 476 >> fp.ToBinary size: 85 >> fp.ToBinary size: 121 >> >> >> >> Note that even the smallest of these multiplied by 100 million would be >> about 8gb, still larger than the text file storing the smiles codes - not >> sure if that is to be expected or not? >> >> Thank for your time! >> Lewis >> >> >> ___ >> Rdkit-discuss mailing list >> Rdkit-discuss@lists.sourceforge.net >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss >> > ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory
The most efficient (easy) way to store the fingerprints is using DataStructs.BitVectToBinaryText(). That will return a 64byte binary string for a 512bit fingerprint. FWIW: if you haven't seen the recent blog post about similarity searching with short fingerprints: http://rdkit.blogspot.com/2020/08/doing-similarity-searches-with-highly.html -greg On Wed, Sep 9, 2020 at 2:37 AM Lewis Martin wrote: > Hi RDKit, > > Looking for advice on an rdkit-adjacent problem please. Ultimately I'd > like to fit an approximate-nearest neighbors index on a dataset of 100 > million ligands, featurized by morgan fingerprint. The text file of the > smiles is ~6gb but this blows out when loaded with pandas.read_csv() or > f.readlines() due to weird memory allocation issues. > > > It would take 45hrs to process the file in serial (i.e. read line, create > mol, fingerprint, convert to np.arr or sparse arrays) in a streaming manner > so now I'd like to parallelize the job with joblib, which would multiply > the memory requirements by the number of processes running at a time. > > So: what is the smallest possible representation for a binary fingerprint? > Using `sys.getsizeof` on a rdkit.DataStructs.cDataStructs.ExplicitBitVect > object tells me it is 96 bytes, but I'm not sure whether to believe that > since, like csr_matrix, the size depends on accurately returning the > object's data. Here's an example demonstrating this: > > from rdkit import Chem > from rdkit.Chem import rdFingerprintGenerator > smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1' > mol = Chem.MolFromSmiles(smi) > gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=512) > fp = gen_mo.GetFingerprint(mol) > sparse_fp = sparse.csr_matrix(fp) > > print('ExplicitBitVect object size:', getsizeof(fp)) > print('Sparse matrix size (naive):', getsizeof(sparse_fp)) > print('Sparse matrix size (real):', > sparse_fp.data.nbytes+sparse_fp.indices.nbytes+sparse_fp.indptr.nbytes) > print('fp.ToBinary size:', getsizeof(fp.ToBinary())) > print('fp.ToBinary size:', getsizeof(fp.ToBase64())) > >>> > > ExplicitBitVect object size: 96 > Sparse matrix size (naive): 64 > Sparse matrix size (real): 476 > fp.ToBinary size: 85 > fp.ToBinary size: 121 > > > > Note that even the smallest of these multiplied by 100 million would be > about 8gb, still larger than the text file storing the smiles codes - not > sure if that is to be expected or not? > > Thank for your time! > Lewis > > > ___ > Rdkit-discuss mailing list > Rdkit-discuss@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss > ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory
Cheers Francois - that might be the way to go actually. I'll try with 'bitstring' https://github.com/scott-griffiths/bitstring and I guess write the data as concatenated bitarrays in chunked binary files. I'd like to keep it FOSS since its for academic publication and hopefully to be re-used. Chemfp is amazing but brute-forcing 100million by 100million would surely still take a long time compared with an approximate nearest neighbor approach. Straying from RDKit so Ill leave it there - thanks! On Wed, Sep 9, 2020 at 11:29 AM Francois Berenger wrote: > On 09/09/2020 09:35, Lewis Martin wrote: > > Hi RDKit, > > > > Looking for advice on an rdkit-adjacent problem please. Ultimately I'd > > like to fit an approximate-nearest neighbors index on a dataset of 100 > > million ligands, featurized by morgan fingerprint. The text file of > > the smiles is ~6gb but this blows out when loaded with > > pandas.read_csv() or f.readlines() due to weird memory allocation > > issues. > > > > It would take 45hrs to process the file in serial (i.e. read line, > > create mol, fingerprint, convert to np.arr or sparse arrays) in a > > streaming manner so now I'd like to parallelize the job with joblib, > > which would multiply the memory requirements by the number of > > processes running at a time. > > > > So: what is the smallest possible representation for a binary > > fingerprint? Using `sys.getsizeof` on a > > rdkit.DataStructs.cDataStructs.ExplicitBitVect object tells me it is > > 96 bytes, but I'm not sure whether to believe that since, like > > csr_matrix, the size depends on accurately returning the object's > > data. Here's an example demonstrating this: > > > > from rdkit import Chem > > from rdkit.Chem import rdFingerprintGenerator > > smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1' > > mol = Chem.MolFromSmiles(smi) > > gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, > > fpSize=512) > > Obviously, if you ask for fpSize = 512, the smallest uncompressed > representation of the fingerprint will be 512 bits (64 bytes). > > 10M of such fingerprints, if there is not any overhead added by the > programming language, > would fit into 6GB of RAM. > > But, the really fun things will start when you want to search fast into > so many molecules. :) > There are many published methods, some open-source software (like > Dalke's chemfp) and even some commercial ones > which claim they are lightning fast (even reaching real-time search > speed!). > > e.g. > https://chemaxon.com/products/madfast > https://www.nextmovesoftware.com/arthor.html > > Regards, > F. > > > fp = gen_mo.GetFingerprint(mol) > > sparse_fp = sparse.csr_matrix(fp) > > > > print('ExplicitBitVect object size:', getsizeof(fp)) > > print('Sparse matrix size (naive):', getsizeof(sparse_fp)) > > print('Sparse matrix size (real):', > > sparse_fp.data.nbytes+sparse_fp.indices.nbytes+sparse_fp.indptr.nbytes) > > print('fp.ToBinary size:', getsizeof(fp.ToBinary())) > > print('fp.ToBinary size:', getsizeof(fp.ToBase64())) > > > > > ExplicitBitVect object size: 96 > > Sparse matrix size (naive): 64 > > Sparse matrix size (real): 476 > > fp.ToBinary size: 85 > > fp.ToBinary size: 121 > > > > Note that even the smallest of these multiplied by 100 million would > > be about 8gb, still larger than the text file storing the smiles codes > > - not sure if that is to be expected or not? > > > > Thank for your time! > > Lewis > > ___ > > Rdkit-discuss mailing list > > Rdkit-discuss@lists.sourceforge.net > > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss > ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory
On 09/09/2020 09:35, Lewis Martin wrote: Hi RDKit, Looking for advice on an rdkit-adjacent problem please. Ultimately I'd like to fit an approximate-nearest neighbors index on a dataset of 100 million ligands, featurized by morgan fingerprint. The text file of the smiles is ~6gb but this blows out when loaded with pandas.read_csv() or f.readlines() due to weird memory allocation issues. It would take 45hrs to process the file in serial (i.e. read line, create mol, fingerprint, convert to np.arr or sparse arrays) in a streaming manner so now I'd like to parallelize the job with joblib, which would multiply the memory requirements by the number of processes running at a time. So: what is the smallest possible representation for a binary fingerprint? Using `sys.getsizeof` on a rdkit.DataStructs.cDataStructs.ExplicitBitVect object tells me it is 96 bytes, but I'm not sure whether to believe that since, like csr_matrix, the size depends on accurately returning the object's data. Here's an example demonstrating this: from rdkit import Chem from rdkit.Chem import rdFingerprintGenerator smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1' mol = Chem.MolFromSmiles(smi) gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=512) Obviously, if you ask for fpSize = 512, the smallest uncompressed representation of the fingerprint will be 512 bits (64 bytes). 10M of such fingerprints, if there is not any overhead added by the programming language, would fit into 6GB of RAM. But, the really fun things will start when you want to search fast into so many molecules. :) There are many published methods, some open-source software (like Dalke's chemfp) and even some commercial ones which claim they are lightning fast (even reaching real-time search speed!). e.g. https://chemaxon.com/products/madfast https://www.nextmovesoftware.com/arthor.html Regards, F. fp = gen_mo.GetFingerprint(mol) sparse_fp = sparse.csr_matrix(fp) print('ExplicitBitVect object size:', getsizeof(fp)) print('Sparse matrix size (naive):', getsizeof(sparse_fp)) print('Sparse matrix size (real):', sparse_fp.data.nbytes+sparse_fp.indices.nbytes+sparse_fp.indptr.nbytes) print('fp.ToBinary size:', getsizeof(fp.ToBinary())) print('fp.ToBinary size:', getsizeof(fp.ToBase64())) ExplicitBitVect object size: 96 Sparse matrix size (naive): 64 Sparse matrix size (real): 476 fp.ToBinary size: 85 fp.ToBinary size: 121 Note that even the smallest of these multiplied by 100 million would be about 8gb, still larger than the text file storing the smiles codes - not sure if that is to be expected or not? Thank for your time! Lewis ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory
Hi RDKit, Looking for advice on an rdkit-adjacent problem please. Ultimately I'd like to fit an approximate-nearest neighbors index on a dataset of 100 million ligands, featurized by morgan fingerprint. The text file of the smiles is ~6gb but this blows out when loaded with pandas.read_csv() or f.readlines() due to weird memory allocation issues. It would take 45hrs to process the file in serial (i.e. read line, create mol, fingerprint, convert to np.arr or sparse arrays) in a streaming manner so now I'd like to parallelize the job with joblib, which would multiply the memory requirements by the number of processes running at a time. So: what is the smallest possible representation for a binary fingerprint? Using `sys.getsizeof` on a rdkit.DataStructs.cDataStructs.ExplicitBitVect object tells me it is 96 bytes, but I'm not sure whether to believe that since, like csr_matrix, the size depends on accurately returning the object's data. Here's an example demonstrating this: from rdkit import Chem from rdkit.Chem import rdFingerprintGenerator smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1' mol = Chem.MolFromSmiles(smi) gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=512) fp = gen_mo.GetFingerprint(mol) sparse_fp = sparse.csr_matrix(fp) print('ExplicitBitVect object size:', getsizeof(fp)) print('Sparse matrix size (naive):', getsizeof(sparse_fp)) print('Sparse matrix size (real):', sparse_fp.data.nbytes+sparse_fp.indices.nbytes+sparse_fp.indptr.nbytes) print('fp.ToBinary size:', getsizeof(fp.ToBinary())) print('fp.ToBinary size:', getsizeof(fp.ToBase64())) >>> ExplicitBitVect object size: 96 Sparse matrix size (naive): 64 Sparse matrix size (real): 476 fp.ToBinary size: 85 fp.ToBinary size: 121 Note that even the smallest of these multiplied by 100 million would be about 8gb, still larger than the text file storing the smiles codes - not sure if that is to be expected or not? Thank for your time! Lewis ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss