Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Andrew Dalke
On Sep 9, 2020, at 04:00, Lewis Martin wrote: > I'd like to keep it FOSS since it's for academic publication and hopefully to > be re-used. Chemfp is amazing but brute-forcing 100 million by 100 million > would surely still take a long time compared with an approximate nearest > neighbor

Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Lewis Martin
OK to sum it up, for me writing to binary is a neat, fast, and low-storage solution for fingerprints. Example: o = open('fingerprints.bin', 'wb') gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=64) for smi in tqdm_notebook(df['smiles']): mol = Chem.MolFromSmiles(smi) fp

Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Greg Landrum
The most efficient (easy) way to store the fingerprints is using DataStructs.BitVectToBinaryText(). That will return a 64-byte binary string for a 512-bit fingerprint. FWIW: if you haven't seen the recent blog post about similarity searching with short fingerprints:
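To make the size arithmetic in the message above concrete, here is a minimal standard-library sketch of packed, fixed-width fingerprint records. It does not call RDKit: the `pack_bits`, `unpack_bits` and `read_record` helpers, the toy fingerprints and the bit ordering inside each byte are all assumptions made for illustration, so the byte layout need not match what DataStructs.BitVectToBinaryText() produces.

```python
import io

N_BITS = 512                 # fingerprint length quoted in the post
RECORD_BYTES = N_BITS // 8   # 512 bits pack into 64 bytes


def pack_bits(on_bits, n_bits=N_BITS):
    """Pack a collection of 'on' bit indices into a fixed-width bytes object.

    Hypothetical layout: bit i lives in byte i // 8 at position i % 8.
    """
    buf = bytearray(n_bits // 8)
    for i in on_bits:
        buf[i // 8] |= 1 << (i % 8)
    return bytes(buf)


def unpack_bits(record):
    """Recover the sorted list of 'on' bit indices from one packed record."""
    return [8 * byte_idx + bit
            for byte_idx, byte in enumerate(record)
            for bit in range(8)
            if (byte >> bit) & 1]


def read_record(handle, index, record_bytes=RECORD_BYTES):
    """Fixed-width records allow random access by seeking to index * width."""
    handle.seek(index * record_bytes)
    return handle.read(record_bytes)


# Toy fingerprints (made-up bit indices), written back to back with no separators.
fingerprints = [{0, 7, 8, 511}, {3, 64, 200}]
store = io.BytesIO()   # stands in for a file opened with mode 'wb' / 'rb'
for fp in fingerprints:
    store.write(pack_bits(fp))

second = unpack_bits(read_record(store, 1))

# Raw storage for 100 million 512-bit fingerprints, in decimal gigabytes.
total_gb = 100_000_000 * RECORD_BYTES / 1e9
```

Because every record has the same width, the n-th fingerprint can be read without an index file, and the total size is simply the record count times 64 bytes.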

Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Lewis Martin
Cheers Francois - that might be the way to go actually. I'll try with 'bitstring' https://github.com/scott-griffiths/bitstring and I guess write the data as concatenated bitarrays in chunked binary files. I'd like to keep it FOSS since it's for academic publication and hopefully to be re-used.

Re: [Rdkit-discuss] h-bond geometry

2020-09-08 Thread Francois Berenger
On 09/09/2020 01:33, Tim Dudgeon wrote: Hi All, thanks for the suggestions. Greg, that's part of what's needed but there's also some more complex logic needed. For instance, if the atom the H is attached to is rotatable (e.g. an OH group) then it is more complex than if it is fixed (e.g. an N in a

Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Francois Berenger
On 09/09/2020 09:35, Lewis Martin wrote: Hi RDKit, Looking for advice on an RDKit-adjacent problem please. Ultimately I'd like to fit an approximate-nearest-neighbors index on a dataset of 100 million ligands, featurized by Morgan fingerprint. The text file of the SMILES is ~6 GB but this blows

[Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Lewis Martin
Hi RDKit, Looking for advice on an RDKit-adjacent problem please. Ultimately I'd like to fit an approximate-nearest-neighbors index on a dataset of 100 million ligands, featurized by Morgan fingerprint. The text file of the SMILES is ~6 GB but this blows out when loaded with pandas.read_csv() or

Re: [Rdkit-discuss] Rdkit-discuss] MACCS keys - revisited

2020-09-08 Thread Andrew Dalke
On Sep 8, 2020, at 14:30, Mike Mazanetz wrote: > Does anyone know whether it’s possible to obtain not just the fingerprint keys > for MACCS (binary values) but also the number of occurrences of the keys, > particularly these details: The SMARTS patterns for most of the MACCS keys are available by:

Re: [Rdkit-discuss] Rdkit-discuss] MACCS keys - revisited

2020-09-08 Thread Paolo Tosco
Hi Mike, I put together a gist that might help: https://gist.github.com/ptosco/7bbad9e6441724e9638bc4093f48e31b This is basically a modification of the MACCSkeys._pyGenMACCSKeys() RDKit Python function, combined with a function I wrote some time ago to count non-overlapping matches in a
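As a rough illustration of what "counting non-overlapping matches" can mean, here is a minimal greedy sketch. It is not the code from the gist linked above: the function name `count_non_overlapping` is hypothetical, and the input is assumed to be a sequence of atom-index tuples of the kind a substructure search returns. A greedy pass is simple but is not guaranteed to find the largest possible set of non-overlapping matches.

```python
def count_non_overlapping(matches):
    """Count matches greedily, skipping any match that reuses an atom.

    `matches` is assumed to be an iterable of tuples of atom indices.
    """
    used = set()   # atom indices already claimed by an accepted match
    count = 0
    for match in matches:
        if used.isdisjoint(match):
            used.update(match)
            count += 1
    return count


# Made-up matches along a four-atom chain: (1, 2) shares atoms with both
# neighbours, so only (0, 1) and (2, 3) are counted.
chain_matches = [(0, 1), (1, 2), (2, 3)]
n_chain = count_non_overlapping(chain_matches)
```

Counting this way avoids the double counting that arises when every overlapping match is tallied separately.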

[Rdkit-discuss] Rdkit-discuss] MACCS keys - revisited

2020-09-08 Thread Mike Mazanetz
Hi, On second thoughts, the KNIME node does a lot of double counting for the RDKit Substructure Counter, so it's not a useful tool for counting MACCS keys. Anyone got any better ideas? Cheers, mike From: Mike Mazanetz Sent: 08 September 2020 18:42 To:

Re: [Rdkit-discuss] MACCS keys

2020-09-08 Thread Mike Mazanetz
Hi folks, I found that I can always use the KNIME nodes to count these, so no need to reply. Best, mike From: Mike Mazanetz Sent: 08 September 2020 13:30 To: rdkit-discuss@lists.sourceforge.net Subject: [Rdkit-discuss] MACCS keys Hello Forum, Does anyone know whether it's

Re: [Rdkit-discuss] h-bond geometry

2020-09-08 Thread Tim Dudgeon
Hi All, thanks for the suggestions. Greg, that's part of what's needed but there's also some more complex logic needed. For instance, if the atom the H is attached to is rotatable (e.g. an OH group) then it is more complex than if it is fixed (e.g. an N in a ring). I was wondering whether anyone had

[Rdkit-discuss] MACCS keys

2020-09-08 Thread Mike Mazanetz
Hello Forum, Does anyone know whether it's possible to obtain not just the fingerprint keys for MACCS (binary values) but also the number of occurrences of the keys, particularly these details: Thanks, mike 1: #isotopes 2: #atoms with atomic number > 103 3: #group IVA, VA and VIA periods 4-6 4:

Re: [Rdkit-discuss] h-bond geometry

2020-09-08 Thread Greg Landrum
Hi Tim, Assuming that you already have the indices of the atoms that you're interested in looking at, it's pretty easy to calculate the angle between three arbitrary atoms. Here's an example: In [3]: m = Chem.AddHs(Chem.MolFromSmiles('COCO')) In [4]: AllChem.EmbedMolecule(m) Out[4]: 0 In
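The geometry behind an angle between three atoms can be sketched without RDKit. The example below is a minimal standard-library version that works on plain (x, y, z) tuples; the function name `angle_deg` and the coordinates are made up for illustration and are not taken from the truncated example above. For real molecules, RDKit's rdMolTransforms.GetAngleDeg() computes the same quantity from a conformer and three atom indices.

```python
import math


def angle_deg(a, b, c):
    """Return the angle a-b-c in degrees, with b as the vertex.

    Each argument is an (x, y, z) coordinate tuple.
    """
    v1 = [p - q for p, q in zip(a, b)]   # vector from b to a
    v2 = [p - q for p, q in zip(c, b)]   # vector from b to c
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_theta))


# Hypothetical coordinates: donor heavy atom, hydrogen, acceptor on one line.
donor, hydrogen, acceptor = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.9, 0.0, 0.0)
linear = angle_deg(donor, hydrogen, acceptor)
right = angle_deg((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```

For an h-bond check, the vertex would typically be the hydrogen, with the donor and acceptor heavy atoms as the two end points.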

Re: [Rdkit-discuss] h-bond geometry

2020-09-08 Thread Tosstorff, Andreas via Rdkit-discuss
Hi Tim, also not a solution within RDKit, but maybe of help: The CSD Python API has a lot of functions around hbonds: https://downloads.ccdc.cam.ac.uk/documentation/API/modules/molecule_api.html?highlight=hbond#ccdc.molecule.Molecule.hbonds Hope this helps, Andy On Mon, Sep 7, 2020 at 3:07

Re: [Rdkit-discuss] h-bond geometry

2020-09-08 Thread David Cosgrove
Hi Tim, I don’t have any code, but if you go to https://github.com/harryjubb/arpeggio and look in config.py there are SMARTS definitions for various interaction types with geometric tests that might help. If you already have a suitable complex, you could just use arpeggio.py to pull out the