On Sep 9, 2020, at 04:00, Lewis Martin wrote:
> I'd like to keep it FOSS since it's for academic publication and hopefully to
> be re-used. Chemfp is amazing but brute-forcing 100 million by 100 million
> would surely still take a long time compared with an approximate nearest
> neighbor approach.
OK, to sum it up: for me, writing to binary is a neat, fast, and low-storage
solution for fingerprints. Example:
o = open('fingerprints.bin', 'wb')
gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=64)
for smi in tqdm_notebook(df['smiles']):
    mol = Chem.MolFromSmiles(smi)
    fp
The most efficient (easy) way to store the fingerprints is using
DataStructs.BitVectToBinaryText(). That will return a 64-byte binary string
for a 512-bit fingerprint.
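
A minimal sketch of the fixed-width file layout this suggests, assuming every record is exactly 64 bytes (one 512-bit fingerprint). To stay runnable without RDKit it uses random placeholder bytes where the output of DataStructs.BitVectToBinaryText() would go. The file name and the helpers write_records() and read_record() are hypothetical, not part of RDKit.

```python
import os
import tempfile

RECORD_SIZE = 64  # bytes per fingerprint: 512 bits / 8

def write_records(path, records):
    """Write fixed-width records back to back, with no separators."""
    with open(path, "wb") as out:
        for rec in records:
            if len(rec) != RECORD_SIZE:
                raise ValueError("every record must be exactly RECORD_SIZE bytes")
            out.write(rec)

def read_record(path, index):
    """Fetch the index-th record by seeking straight to its byte offset."""
    with open(path, "rb") as src:
        src.seek(index * RECORD_SIZE)
        return src.read(RECORD_SIZE)

# Placeholder data standing in for BitVectToBinaryText() output (assumption).
fake_fps = [os.urandom(RECORD_SIZE) for _ in range(1000)]
path = os.path.join(tempfile.mkdtemp(), "fingerprints.bin")
write_records(path, fake_fps)

file_size = os.path.getsize(path)               # 1000 records * 64 bytes each
third = read_record(path, 2)                    # same bytes as fake_fps[2]
full_size_gb = 100_000_000 * RECORD_SIZE / 1e9  # projected size for 100 million records
```

Because every record has the same width, record i always starts at byte i * 64, so no index file is needed. At this width, 100 million fingerprints come to about 6.4 GB on disk.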
FWIW: if you haven't seen the recent blog post about similarity searching
with short fingerprints:
http://rdkit.blogspot.com/2020/08
Cheers Francois - that might be the way to go actually. I'll try with
'bitstring' https://github.com/scott-griffiths/bitstring and I guess write
the data as concatenated bitarrays in chunked binary files.
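
One way the chunked layout could look, as a rough sketch only: it uses plain Python bytes rather than the bitstring package, assumes fixed 64-byte records, and fills them with random placeholder data. The chunk size, file naming scheme, and the helpers write_chunked() and iter_records() are all hypothetical.

```python
import os
import tempfile
from itertools import islice

RECORD_SIZE = 64         # bytes per fingerprint (assumed: 512 bits)
RECORDS_PER_CHUNK = 250  # tiny for illustration; far larger in practice

def write_chunked(records, out_dir, per_chunk=RECORDS_PER_CHUNK):
    """Write an iterable of fixed-width records into numbered chunk files."""
    it = iter(records)
    paths = []
    while True:
        batch = list(islice(it, per_chunk))
        if not batch:
            break
        path = os.path.join(out_dir, f"fps_{len(paths):05d}.bin")
        with open(path, "wb") as out:
            out.write(b"".join(batch))  # records concatenated, no separators
        paths.append(path)
    return paths

def iter_records(paths):
    """Stream records back in their original order, one chunk file at a time."""
    for path in paths:
        with open(path, "rb") as src:
            while True:
                rec = src.read(RECORD_SIZE)
                if not rec:
                    break
                yield rec

# Placeholder records standing in for real fingerprints (assumption).
records = [os.urandom(RECORD_SIZE) for _ in range(1000)]
chunk_paths = write_chunked(iter(records), tempfile.mkdtemp())
restored = list(iter_records(chunk_paths))
```

Only one batch is held in memory while writing, and only one record at a time while reading, so the same pattern should scale to inputs that do not fit in RAM.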
I'd like to keep it FOSS since it's for academic publication and hopefully
to be re-used. Chem
On 09/09/2020 09:35, Lewis Martin wrote:
Hi RDKit,
Looking for advice on an RDKit-adjacent problem please. Ultimately I'd
like to fit an approximate-nearest neighbors index on a dataset of 100
million ligands, featurized by Morgan fingerprint. The text file of
the SMILES is ~6 GB but this blows o
Hi RDKit,
Looking for advice on an RDKit-adjacent problem please. Ultimately I'd like
to fit an approximate-nearest neighbors index on a dataset of 100 million
ligands, featurized by Morgan fingerprint. The text file of the SMILES is
~6 GB but this blows out when loaded with pandas.read_csv() or f.