Re: [Rdkit-discuss] Question about ECFP fingerprints when using multiprocessing and chirality
Thanks a bunch Greg for the very helpful explanation! Things make more sense now.

On Wed, May 20, 2020 at 12:51 AM Greg Landrum wrote:

> Hi Hao,
>
> Good question! I had to do a bit of digging to figure that out.
>
> Here's what's going on:
> - The Morgan fingerprint code uses CIP codes when you set useChirality=True.
> - Atomic CIP codes are stored as an atomic property.
> - When you use the multiprocessing module everything ends up being pickled
>   and sent to the individual workers in the pool.
> - By default, when you pickle RDKit molecules the properties (things you
>   access via GetProp()) are not preserved.
> - So when you call a function using multiprocessing, the CIP information
>   doesn't make it through to the function you call and you don't see any
>   difference between different stereoisomers.
>
> The fix to #1993 (https://github.com/rdkit/rdkit/issues/1993), which was
> part of the 2018.09 release, modified the Morgan fingerprinting code so
> that it re-assigns stereochemistry when that information is not present
> already.
>
> Best,
> -greg

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Question about ECFP fingerprints when using multiprocessing and chirality
Hi Hao,

Good question! I had to do a bit of digging to figure that out.

Here's what's going on:
- The Morgan fingerprint code uses CIP codes when you set useChirality=True.
- Atomic CIP codes are stored as an atomic property.
- When you use the multiprocessing module everything ends up being pickled
  and sent to the individual workers in the pool.
- By default, when you pickle RDKit molecules the properties (things you
  access via GetProp()) are not preserved.
- So when you call a function using multiprocessing, the CIP information
  doesn't make it through to the function you call and you don't see any
  difference between different stereoisomers.

The fix to #1993 (https://github.com/rdkit/rdkit/issues/1993), which was
part of the 2018.09 release, modified the Morgan fingerprinting code so
that it re-assigns stereochemistry when that information is not present
already.

Best,
-greg

On Tue, May 19, 2020 at 11:53 PM Hao wrote:

> Hello,
>
> This was a very strange bug that I saw. I was getting inconsistent
> fingerprints using GetMorganFingerprint with useChirality=True, when I used
> multiprocessing vs when I ran serially on rdkit 2017.09.1 and 2018.03.2. It
> seems to have been fixed in the latest version. Woo! I was just wondering
> if anyone has any insights on what was causing this before because I was
> stumped for the longest time.
>
> Example:
>
> from multiprocessing import Pool
> from rdkit import Chem
> from rdkit.Chem import AllChem
>
> def compute_ecfp_bitvect(mol, ecfp_power = 11):
>     print(Chem.MolToSmiles(mol, isomericSmiles=True))
>     print(list(Chem.AllChem.GetMorganFingerprintAsBitVect(mol, radius=2,
>         nBits=2 ** ecfp_power, useChirality=True).GetOnBits()))
>     return Chem.AllChem.GetMorganFingerprintAsBitVect(mol, radius=2,
>         nBits=2 ** ecfp_power, useChirality=True)
>
> smiles = ["N[C@@H](C)C(=O)O", "N[C@H](C)C(=O)O"]
>
> mol1 = Chem.MolFromSmiles(smiles[0])
> mol2 = Chem.MolFromSmiles(smiles[1])
> print("with pool")
> with Pool(1) as pool:
>     jobs = pool.imap(compute_ecfp_bitvect, [mol1,mol2])
>     list(jobs)
> print("without pool")
> [compute_ecfp_bitvect(m) for m in [mol1,mol2]]
>
> = Output =
> with pool
> C[C@H](N)C(=O)O
> [1, 283, 389, 537, 650, 786, 807, 1057, 1119, 1171, 1844, 1917]
> C[C@@H](N)C(=O)O
> [1, 283, 389, 537, 650, 786, 807, 1057, 1119, 1171, 1844, 1917]
> without pool
> C[C@H](N)C(=O)O
> [1, 283, 389, 650, 786, 807, 1057, 1112, 1171, 1187, 1844, 1917]
> C[C@@H](N)C(=O)O
> [1, 46, 283, 389, 650, 786, 807, 1057, 1113, 1171, 1844, 1917]
>
> Thanks and hope everyone is staying healthy!
> Hao
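The mechanism Greg describes can be illustrated without RDKit. The block below is a toy model using only the standard library, not RDKit's actual pickling code: `ToyMol`, `assign_labels`, `fingerprint_old`, `fingerprint_fixed` and `through_pickle` are all hypothetical names made up for this sketch. A direct `pickle` round trip stands in for the trip to a pool worker, since multiprocessing pickles the arguments it sends.

```python
import pickle


class ToyMol:
    """Hypothetical stand-in for a molecule (not an RDKit class)."""

    def __init__(self, name, chiral_tag):
        self.name = name
        self.chiral_tag = chiral_tag  # core structure: always serialized
        self.props = {}               # derived labels: dropped when pickled

    def __getstate__(self):
        # Mimic a serializer that ships the structure but not the properties.
        return {"name": self.name, "chiral_tag": self.chiral_tag}

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.props = {}


def assign_labels(mol):
    # Stand-in for stereochemistry perception: derive a label from the structure.
    mol.props["label"] = mol.chiral_tag


def fingerprint_old(mol):
    # Pre-fix behaviour in this toy: trust whatever label is cached, if any.
    return (mol.name, mol.props.get("label"))


def fingerprint_fixed(mol):
    # Post-fix behaviour in this toy: re-derive the label when it is missing.
    if "label" not in mol.props:
        assign_labels(mol)
    return fingerprint_old(mol)


def through_pickle(obj):
    # What happens to an argument on its way to a pool worker.
    return pickle.loads(pickle.dumps(obj))


mol_a = ToyMol("alanine", "@@")
mol_b = ToyMol("alanine", "@")
for m in (mol_a, mol_b):
    assign_labels(m)  # labels exist in the parent process

serial_differ = fingerprint_old(mol_a) != fingerprint_old(mol_b)
pooled_differ_old = (fingerprint_old(through_pickle(mol_a))
                     != fingerprint_old(through_pickle(mol_b)))
pooled_differ_fixed = (fingerprint_fixed(through_pickle(mol_a))
                       != fingerprint_fixed(through_pickle(mol_b)))

print(serial_differ, pooled_differ_old, pooled_differ_fixed)  # True False True
```

In this toy the two stereoisomers are distinguishable in the parent process, become indistinguishable once the labels are lost in pickling, and are distinguishable again when the fingerprint function re-derives missing labels. On an RDKit version that predates the fix, one way to sidestep the issue is to send SMILES strings to the workers and parse them there, so that no molecule object is pickled at all.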
[Rdkit-discuss] Question about ECFP fingerprints when using multiprocessing and chirality
Hello,

This was a very strange bug that I saw. I was getting inconsistent
fingerprints using GetMorganFingerprint with useChirality=True, when I used
multiprocessing vs when I ran serially on rdkit 2017.09.1 and 2018.03.2. It
seems to have been fixed in the latest version. Woo! I was just wondering
if anyone has any insights on what was causing this before because I was
stumped for the longest time.

Example:

from multiprocessing import Pool
from rdkit import Chem
from rdkit.Chem import AllChem

def compute_ecfp_bitvect(mol, ecfp_power = 11):
    print(Chem.MolToSmiles(mol, isomericSmiles=True))
    print(list(Chem.AllChem.GetMorganFingerprintAsBitVect(mol, radius=2,
        nBits=2 ** ecfp_power, useChirality=True).GetOnBits()))
    return Chem.AllChem.GetMorganFingerprintAsBitVect(mol, radius=2,
        nBits=2 ** ecfp_power, useChirality=True)

smiles = ["N[C@@H](C)C(=O)O", "N[C@H](C)C(=O)O"]

mol1 = Chem.MolFromSmiles(smiles[0])
mol2 = Chem.MolFromSmiles(smiles[1])
print("with pool")
with Pool(1) as pool:
    jobs = pool.imap(compute_ecfp_bitvect, [mol1,mol2])
    list(jobs)
print("without pool")
[compute_ecfp_bitvect(m) for m in [mol1,mol2]]

= Output =
with pool
C[C@H](N)C(=O)O
[1, 283, 389, 537, 650, 786, 807, 1057, 1119, 1171, 1844, 1917]
C[C@@H](N)C(=O)O
[1, 283, 389, 537, 650, 786, 807, 1057, 1119, 1171, 1844, 1917]
without pool
C[C@H](N)C(=O)O
[1, 283, 389, 650, 786, 807, 1057, 1112, 1171, 1187, 1844, 1917]
C[C@@H](N)C(=O)O
[1, 46, 283, 389, 650, 786, 807, 1057, 1113, 1171, 1844, 1917]

Thanks and hope everyone is staying healthy!
Hao