Re: [Rdkit-discuss] Question about ECFP fingerprints when using multiprocessing and chirality

2020-05-20 Thread Hao
Thanks a bunch, Greg, for the very helpful explanation! Things make more
sense now.

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Question about ECFP fingerprints when using multiprocessing and chirality

2020-05-19 Thread Greg Landrum
Hi Hao,

Good question! I had to do a bit of digging to figure that out.

Here's what's going on:
1. The Morgan fingerprint code uses CIP codes when you set useChirality=True.
2. Atomic CIP codes are stored as an atomic property.
3. When you use the multiprocessing module everything ends up being pickled
   and sent to the individual workers in the pool.
4. By default, when you pickle RDKit molecules the properties (things you
   access via GetProp()) are not preserved.
5. So when you call a function using multiprocessing, the CIP information
   doesn't make it through to the function you call and you don't see any
   difference between different stereoisomers.
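
The pickling step can be checked directly, without a pool. The following is
a minimal sketch, assuming an RDKit build that exposes
Chem.SetDefaultPickleProperties, Chem.GetDefaultPickleProperties and
Chem.PropertyPickleOptions. The helper names cip_codes and roundtrip are made
up for illustration, and the pickle options are set explicitly rather than
relying on whatever the default happens to be in a given release.

```python
import pickle

from rdkit import Chem


def cip_codes(mol):
    """Return {atom index: CIP label} for atoms that carry the _CIPCode property."""
    return {
        atom.GetIdx(): atom.GetProp("_CIPCode")
        for atom in mol.GetAtoms()
        if atom.HasProp("_CIPCode")
    }


def roundtrip(mol):
    """Pickle and unpickle a molecule, as multiprocessing does when dispatching work."""
    return pickle.loads(pickle.dumps(mol))


mol = Chem.MolFromSmiles("N[C@@H](C)C(=O)O")
original = cip_codes(mol)

# Remember the current setting so it can be restored afterwards.
previous = Chem.GetDefaultPickleProperties()

# Pickle without any properties: the CIP labels are dropped.
Chem.SetDefaultPickleProperties(Chem.PropertyPickleOptions.NoProps)
without_props = cip_codes(roundtrip(mol))

# Pickle with all properties: the CIP labels survive the round trip.
Chem.SetDefaultPickleProperties(Chem.PropertyPickleOptions.AllProps)
with_props = cip_codes(roundtrip(mol))

Chem.SetDefaultPickleProperties(previous)

print("original:     ", original)
print("without props:", without_props)
print("with props:   ", with_props)
```

With NoProps the round-tripped molecule should carry no _CIPCode entries,
while with AllProps it should match the original. Note that the setting is
process-wide, so in a real script it would have to be made before the
molecules are handed to the pool.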

The fix to #1993 (https://github.com/rdkit/rdkit/issues/1993), which was
part of the 2018.09 release, modified the Morgan fingerprinting code so
that it re-assigns stereochemistry when that information is not present
already.
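
For releases older than that fix, the same idea can be applied by hand inside
the function that the workers run. This is a rough sketch of an assumed
workaround, not the code from the fix itself: it calls
Chem.AssignStereochemistry with cleanIt=True and force=True to regenerate the
CIP labels on the molecule that arrives in the worker. chiral_on_bits is a
hypothetical helper, and a pickle round trip stands in for the trip to a pool
worker.

```python
import pickle

from rdkit import Chem
from rdkit.Chem import AllChem


def chiral_on_bits(mol, n_bits=2048):
    """Recompute stereo labels on the received molecule, then fingerprint it."""
    # force=True recomputes the labels even if stereo is flagged as perceived.
    Chem.AssignStereochemistry(mol, cleanIt=True, force=True)
    fp = AllChem.GetMorganFingerprintAsBitVect(
        mol, radius=2, nBits=n_bits, useChirality=True)
    return list(fp.GetOnBits())


smiles = ["N[C@@H](C)C(=O)O", "N[C@H](C)C(=O)O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# Simulate what multiprocessing does: pickle each molecule and unpickle it.
received = [pickle.loads(pickle.dumps(m)) for m in mols]
bits = [chiral_on_bits(m) for m in received]

# The two enantiomers should no longer collapse onto the same fingerprint.
print(bits[0] != bits[1])
```

Another option along the same lines is to send SMILES strings to the workers
and build the molecules there, so that no RDKit molecule has to be pickled on
the way in.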

Best,
-greg




[Rdkit-discuss] Question about ECFP fingerprints when using multiprocessing and chirality

2020-05-19 Thread Hao
Hello,

I ran into a very strange bug. I was getting inconsistent fingerprints from
GetMorganFingerprintAsBitVect with useChirality=True when I used
multiprocessing versus when I ran serially, on RDKit 2017.09.1 and 2018.03.2.
It seems to have been fixed in the latest version. Woo! I was just wondering
if anyone has any insight into what was causing this before, because I was
stumped for the longest time. Example:

from multiprocessing import Pool
from rdkit import Chem
from rdkit.Chem import AllChem

def compute_ecfp_bitvect(mol, ecfp_power=11):
    # Print the isomeric SMILES and the on bits so both runs can be compared.
    print(Chem.MolToSmiles(mol, isomericSmiles=True))
    fp = AllChem.GetMorganFingerprintAsBitVect(
        mol, radius=2, nBits=2 ** ecfp_power, useChirality=True)
    print(list(fp.GetOnBits()))
    return fp

smiles = ["N[C@@H](C)C(=O)O", "N[C@H](C)C(=O)O"]

if __name__ == "__main__":  # guard needed where workers are spawned, not forked
    mol1 = Chem.MolFromSmiles(smiles[0])
    mol2 = Chem.MolFromSmiles(smiles[1])
    print("with pool")
    with Pool(1) as pool:
        # The molecules are pickled on their way to the worker process.
        jobs = pool.imap(compute_ecfp_bitvect, [mol1, mol2])
        list(jobs)
    print("without pool")
    for m in [mol1, mol2]:
        compute_ecfp_bitvect(m)

= Output =
with pool
C[C@H](N)C(=O)O
[1, 283, 389, 537, 650, 786, 807, 1057, 1119, 1171, 1844, 1917]
C[C@@H](N)C(=O)O
[1, 283, 389, 537, 650, 786, 807, 1057, 1119, 1171, 1844, 1917]
without pool
C[C@H](N)C(=O)O
[1, 283, 389, 650, 786, 807, 1057, 1112, 1171, 1187, 1844, 1917]
C[C@@H](N)C(=O)O
[1, 46, 283, 389, 650, 786, 807, 1057, 1113, 1171, 1844, 1917]

Thanks and hope everyone is staying healthy!
Hao