Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Andrew Dalke
On Sep 9, 2020, at 04:00, Lewis Martin  wrote:
> I'd like to keep it FOSS since its for academic publication and hopefully to 
> be re-used. Chemfp is amazing but brute-forcing 100million by 100million 
> would surely still take a long time compared with an approximate nearest 
> neighbor approach. 

As a rough guideline, the chemfp paper Fig. 3 says that searching 100M PubChem 
fingerprints for the k=10 neighbors averages about 100 ms on a Haswell machine 
(See https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0398-8 ).

That was for chemfp 3.3 using 881 bits. You have 512 bits, which is faster 
(0.6x time, according to my timings). Chemfp 1.6 is about the same performance 
as chemfp 3.4 for this case.

So that's 60ms per query, on average, or 70 days for a single-processor query.

(It's hard to compare directly though as the BitBound algorithm is sublinear; I 
found the PubChem fingerprint kNN search scales at O(n** ~0.6) while Morgan 
2048-bit scales at O(n** ~0.8). Then again, the Haswell machine is several 
years old, so I'm leaving the numbers as-is for an estimate.

This is easily parallelized yourself, or you can use chemfp's built-in 
parallelized NxN kNN searches. (The amount of scaleup on a single multi-core 
machine depends on your memory architecture.)

This means that it shouldn't be hard to get a <1 week performance out of chemfp.

Performance-wise, chemfp 3.4.1 (the latest version of the commercial track) is 
the same as chemfp 1.6.1 (the latest *no-cost*/open-source version).

One caveat - chemfp 1.x has a limit of 2**31 bytes for the fingerprints, or 
33.6M 512-bit fingerprints. (chemfp 2.x or later use 64-bit indexing and don't 
have this limitation.) To get around that, with 1.6.1, you'll need split your 
data into 4 chucks, and search each independently. This also reduces the effect 
of sublinear scaling.

Or, there's no-cost academic licensing for chemfp 3.x using pre-compiled 
binaries which work on most Linux-based OSes.

Still, I think an exact search is entirely doable using FOSS software.

Switching to a smaller fingerprint size is a form of approximate nearest 
neighbors.

If you take Greg's observation and drop the fingerprint size from 512 bits to 
128 bits then you can get about a factor of 3 faster, and have everything fit 
into chemfp 1.6's 2GB limit.

Best regards,

Andrew
da...@dalkescientific.com


P.S.

On Wed, Sep 9, 2020 at 11:29 AM Francois Berenger  wrote:
> There are many published methods, some open-source software (like 
> Dalke's chemfp) and even some commercial ones
> which claim they are lightning fast (even reaching real-time search 
> speed!).

Dalke's chemfp also claims real-time search speed. ;)

It's always a claim of what size that means. 

Also, at least a couple of those systems use multiple threads to speed up 
single-query search, which interferes with multi-threaded batch search, as with 
NxN search on a single machine.




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Lewis Martin
OK to sum it up, for me writing to binary is a neat, fast, and low-storage
solution for fingerprints. Example:
o = open('fingerprints.bin', 'wb')
gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=64)
for smi in tqdm_notebook(df['smiles']):
mol = Chem.MolFromSmiles(smi)
fp = gen_mo.GetFingerprint(mol)
bs = bitstring.BitArray(bin=fp.ToBitString())
o.write(bs.bytes)

o.close()

Most of the time was being taken up in creating numpy arrays. For instance:
%%timeit
fp = np.array(gen_mo.GetFingerprint(mol))

351 µs ± 2.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

vs:

%%timeit
fp = gen_mo.GetFingerprint(mol)
bs = bitstring.BitArray(bin=fp.ToBitString())
42 µs ± 273 ns per loop (mean ± std. dev. of 7 runs, 1 loops each)


so when you remove that step, 100mm ligands takes about 4 1/2 hours in
serial, no need for parallelization. Cheers!
PS for newbs like me, read the binary fingerprints like this:
i = open('fingerprints.bin', 'rb')
bits = ''
for _ in range(num_fps):
fp_bytes = i.read(8)
bits += bitstring.BitArray(bytes=fp_bytes).bin
i.close()
fingerprints_concat = (np.fromstring(bits,'u1') -
ord('0')).reshape(num_fps, 64)

On Wed, Sep 9, 2020 at 1:28 PM Greg Landrum  wrote:

> The most efficient (easy) way to store the fingerprints is using
> DataStructs.BitVectToBinaryText(). That will return a 64byte binary string
> for a 512bit fingerprint.
>
> FWIW: if you haven't seen the recent blog post about similarity searching
> with short fingerprints:
> http://rdkit.blogspot.com/2020/08/doing-similarity-searches-with-highly.html
>
> -greg
>
>
> On Wed, Sep 9, 2020 at 2:37 AM Lewis Martin 
> wrote:
>
>> Hi RDKit,
>>
>> Looking for advice on an rdkit-adjacent problem please. Ultimately I'd
>> like to fit an approximate-nearest neighbors index on a dataset of 100
>> million ligands, featurized by morgan fingerprint. The text file of the
>> smiles is ~6gb but this blows out when loaded with pandas.read_csv() or
>> f.readlines() due to weird memory allocation issues.
>>
>>
>> It would take 45hrs to process the file in serial (i.e. read line, create
>> mol, fingerprint, convert to np.arr or sparse arrays) in a streaming manner
>> so now I'd like to parallelize the job with joblib, which would multiply
>> the memory requirements by the number of processes running at a time.
>>
>> So: what is the smallest possible representation for a binary
>> fingerprint? Using `sys.getsizeof` on a
>> rdkit.DataStructs.cDataStructs.ExplicitBitVect object tells me it is 96
>> bytes, but I'm not sure whether to believe that since, like csr_matrix, the
>> size depends on accurately returning the object's data. Here's an example
>> demonstrating this:
>>
>> from rdkit import Chem
>> from rdkit.Chem import rdFingerprintGenerator
>> smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1'
>> mol = Chem.MolFromSmiles(smi)
>> gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=512)
>> fp = gen_mo.GetFingerprint(mol)
>> sparse_fp = sparse.csr_matrix(fp)
>>
>> print('ExplicitBitVect object size:', getsizeof(fp))
>> print('Sparse matrix size (naive):', getsizeof(sparse_fp))
>> print('Sparse matrix size (real):',
>> sparse_fp.data.nbytes+sparse_fp.indices.nbytes+sparse_fp.indptr.nbytes)
>> print('fp.ToBinary size:', getsizeof(fp.ToBinary()))
>> print('fp.ToBinary size:', getsizeof(fp.ToBase64()))
>> >>>
>>
>> ExplicitBitVect object size: 96
>> Sparse matrix size (naive): 64
>> Sparse matrix size (real): 476
>> fp.ToBinary size: 85
>> fp.ToBinary size: 121
>>
>>
>>
>> Note that even the smallest of these multiplied by 100 million would be
>> about 8gb, still larger than the text file storing the smiles codes - not
>> sure if that is to be expected or not?
>>
>> Thank for your time!
>> Lewis
>>
>>
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Greg Landrum
The most efficient (easy) way to store the fingerprints is using
DataStructs.BitVectToBinaryText(). That will return a 64byte binary string
for a 512bit fingerprint.

FWIW: if you haven't seen the recent blog post about similarity searching
with short fingerprints:
http://rdkit.blogspot.com/2020/08/doing-similarity-searches-with-highly.html

-greg


On Wed, Sep 9, 2020 at 2:37 AM Lewis Martin 
wrote:

> Hi RDKit,
>
> Looking for advice on an rdkit-adjacent problem please. Ultimately I'd
> like to fit an approximate-nearest neighbors index on a dataset of 100
> million ligands, featurized by morgan fingerprint. The text file of the
> smiles is ~6gb but this blows out when loaded with pandas.read_csv() or
> f.readlines() due to weird memory allocation issues.
>
>
> It would take 45hrs to process the file in serial (i.e. read line, create
> mol, fingerprint, convert to np.arr or sparse arrays) in a streaming manner
> so now I'd like to parallelize the job with joblib, which would multiply
> the memory requirements by the number of processes running at a time.
>
> So: what is the smallest possible representation for a binary fingerprint?
> Using `sys.getsizeof` on a rdkit.DataStructs.cDataStructs.ExplicitBitVect
> object tells me it is 96 bytes, but I'm not sure whether to believe that
> since, like csr_matrix, the size depends on accurately returning the
> object's data. Here's an example demonstrating this:
>
> from rdkit import Chem
> from rdkit.Chem import rdFingerprintGenerator
> smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1'
> mol = Chem.MolFromSmiles(smi)
> gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=512)
> fp = gen_mo.GetFingerprint(mol)
> sparse_fp = sparse.csr_matrix(fp)
>
> print('ExplicitBitVect object size:', getsizeof(fp))
> print('Sparse matrix size (naive):', getsizeof(sparse_fp))
> print('Sparse matrix size (real):',
> sparse_fp.data.nbytes+sparse_fp.indices.nbytes+sparse_fp.indptr.nbytes)
> print('fp.ToBinary size:', getsizeof(fp.ToBinary()))
> print('fp.ToBinary size:', getsizeof(fp.ToBase64()))
> >>>
>
> ExplicitBitVect object size: 96
> Sparse matrix size (naive): 64
> Sparse matrix size (real): 476
> fp.ToBinary size: 85
> fp.ToBinary size: 121
>
>
>
> Note that even the smallest of these multiplied by 100 million would be
> about 8gb, still larger than the text file storing the smiles codes - not
> sure if that is to be expected or not?
>
> Thank for your time!
> Lewis
>
>
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Lewis Martin
Cheers Francois - that might be the way to go actually. I'll try with
'bitstring' https://github.com/scott-griffiths/bitstring and I guess write
the data as concatenated bitarrays in chunked binary files.

I'd like to keep it FOSS since its for academic publication and hopefully
to be re-used. Chemfp is amazing but brute-forcing 100million by 100million
would surely still take a long time compared with an approximate nearest
neighbor approach.

Straying from RDKit so Ill leave it there - thanks!

On Wed, Sep 9, 2020 at 11:29 AM Francois Berenger  wrote:

> On 09/09/2020 09:35, Lewis Martin wrote:
> > Hi RDKit,
> >
> > Looking for advice on an rdkit-adjacent problem please. Ultimately I'd
> > like to fit an approximate-nearest neighbors index on a dataset of 100
> > million ligands, featurized by morgan fingerprint. The text file of
> > the smiles is ~6gb but this blows out when loaded with
> > pandas.read_csv() or f.readlines() due to weird memory allocation
> > issues.
> >
> > It would take 45hrs to process the file in serial (i.e. read line,
> > create mol, fingerprint, convert to np.arr or sparse arrays) in a
> > streaming manner so now I'd like to parallelize the job with joblib,
> > which would multiply the memory requirements by the number of
> > processes running at a time.
> >
> > So: what is the smallest possible representation for a binary
> > fingerprint? Using `sys.getsizeof` on a
> > rdkit.DataStructs.cDataStructs.ExplicitBitVect object tells me it is
> > 96 bytes, but I'm not sure whether to believe that since, like
> > csr_matrix, the size depends on accurately returning the object's
> > data. Here's an example demonstrating this:
> >
> > from rdkit import Chem
> > from rdkit.Chem import rdFingerprintGenerator
> > smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1'
> > mol = Chem.MolFromSmiles(smi)
> > gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2,
> > fpSize=512)
>
> Obviously, if you ask for fpSize = 512, the smallest uncompressed
> representation of the fingerprint will be 512 bits (64 bytes).
>
> 10M of such fingerprints, if there is not any overhead added by the
> programming language,
> would fit into 6GB of RAM.
>
> But, the really fun things will start when you want to search fast into
> so many molecules. :)
> There are many published methods, some open-source software (like
> Dalke's chemfp) and even some commercial ones
> which claim they are lightning fast (even reaching real-time search
> speed!).
>
> e.g.
> https://chemaxon.com/products/madfast
> https://www.nextmovesoftware.com/arthor.html
>
> Regards,
> F.
>
> > fp = gen_mo.GetFingerprint(mol)
> > sparse_fp = sparse.csr_matrix(fp)
> >
> > print('ExplicitBitVect object size:', getsizeof(fp))
> > print('Sparse matrix size (naive):', getsizeof(sparse_fp))
> > print('Sparse matrix size (real):',
> > sparse_fp.data.nbytes+sparse_fp.indices.nbytes+sparse_fp.indptr.nbytes)
> > print('fp.ToBinary size:', getsizeof(fp.ToBinary()))
> > print('fp.ToBinary size:', getsizeof(fp.ToBase64()))
> 
> >
> > ExplicitBitVect object size: 96
> > Sparse matrix size (naive): 64
> > Sparse matrix size (real): 476
> > fp.ToBinary size: 85
> > fp.ToBinary size: 121
> >
> > Note that even the smallest of these multiplied by 100 million would
> > be about 8gb, still larger than the text file storing the smiles codes
> > - not sure if that is to be expected or not?
> >
> > Thank for your time!
> > Lewis
> > ___
> > Rdkit-discuss mailing list
> > Rdkit-discuss@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Francois Berenger

On 09/09/2020 09:35, Lewis Martin wrote:

Hi RDKit,

Looking for advice on an rdkit-adjacent problem please. Ultimately I'd
like to fit an approximate-nearest neighbors index on a dataset of 100
million ligands, featurized by morgan fingerprint. The text file of
the smiles is ~6gb but this blows out when loaded with
pandas.read_csv() or f.readlines() due to weird memory allocation
issues.

It would take 45hrs to process the file in serial (i.e. read line,
create mol, fingerprint, convert to np.arr or sparse arrays) in a
streaming manner so now I'd like to parallelize the job with joblib,
which would multiply the memory requirements by the number of
processes running at a time.

So: what is the smallest possible representation for a binary
fingerprint? Using `sys.getsizeof` on a
rdkit.DataStructs.cDataStructs.ExplicitBitVect object tells me it is
96 bytes, but I'm not sure whether to believe that since, like
csr_matrix, the size depends on accurately returning the object's
data. Here's an example demonstrating this:

from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1'
mol = Chem.MolFromSmiles(smi)
gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2,
fpSize=512)


Obviously, if you ask for fpSize = 512, the smallest uncompressed
representation of the fingerprint will be 512 bits (64 bytes).

10M of such fingerprints, if there is not any overhead added by the 
programming language,

would fit into 6GB of RAM.

But, the really fun things will start when you want to search fast into 
so many molecules. :)
There are many published methods, some open-source software (like 
Dalke's chemfp) and even some commercial ones
which claim they are lightning fast (even reaching real-time search 
speed!).


e.g.
https://chemaxon.com/products/madfast
https://www.nextmovesoftware.com/arthor.html

Regards,
F.


fp = gen_mo.GetFingerprint(mol)
sparse_fp = sparse.csr_matrix(fp)

print('ExplicitBitVect object size:', getsizeof(fp))
print('Sparse matrix size (naive):', getsizeof(sparse_fp))
print('Sparse matrix size (real):',
sparse_fp.data.nbytes+sparse_fp.indices.nbytes+sparse_fp.indptr.nbytes)
print('fp.ToBinary size:', getsizeof(fp.ToBinary()))
print('fp.ToBinary size:', getsizeof(fp.ToBase64()))




ExplicitBitVect object size: 96
Sparse matrix size (naive): 64
Sparse matrix size (real): 476
fp.ToBinary size: 85
fp.ToBinary size: 121

Note that even the smallest of these multiplied by 100 million would
be about 8gb, still larger than the text file storing the smiles codes
- not sure if that is to be expected or not?

Thank for your time!
Lewis
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss



___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Lewis Martin
Hi RDKit,

Looking for advice on an rdkit-adjacent problem please. Ultimately I'd like
to fit an approximate-nearest neighbors index on a dataset of 100 million
ligands, featurized by morgan fingerprint. The text file of the smiles is
~6gb but this blows out when loaded with pandas.read_csv() or f.readlines()
due to weird memory allocation issues.


It would take 45hrs to process the file in serial (i.e. read line, create
mol, fingerprint, convert to np.arr or sparse arrays) in a streaming manner
so now I'd like to parallelize the job with joblib, which would multiply
the memory requirements by the number of processes running at a time.

So: what is the smallest possible representation for a binary fingerprint?
Using `sys.getsizeof` on a rdkit.DataStructs.cDataStructs.ExplicitBitVect
object tells me it is 96 bytes, but I'm not sure whether to believe that
since, like csr_matrix, the size depends on accurately returning the
object's data. Here's an example demonstrating this:

from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1'
mol = Chem.MolFromSmiles(smi)
gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=512)
fp = gen_mo.GetFingerprint(mol)
sparse_fp = sparse.csr_matrix(fp)

print('ExplicitBitVect object size:', getsizeof(fp))
print('Sparse matrix size (naive):', getsizeof(sparse_fp))
print('Sparse matrix size (real):',
sparse_fp.data.nbytes+sparse_fp.indices.nbytes+sparse_fp.indptr.nbytes)
print('fp.ToBinary size:', getsizeof(fp.ToBinary()))
print('fp.ToBinary size:', getsizeof(fp.ToBase64()))
>>>

ExplicitBitVect object size: 96
Sparse matrix size (naive): 64
Sparse matrix size (real): 476
fp.ToBinary size: 85
fp.ToBinary size: 121



Note that even the smallest of these multiplied by 100 million would be
about 8gb, still larger than the text file storing the smiles codes - not
sure if that is to be expected or not?

Thank for your time!
Lewis
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss