Re: [Rdkit-discuss] Question about ECFP fingerprints when using multiprocessing and chiralty

2020-05-19 Thread Greg Landrum
Hi Hao,

Good question! I had to do a bit of digging to figure that out.

Here's what's going on:
1. The Morgan fingerprint code uses CIP codes when you set useChirality=True.
2. Atomic CIP codes are stored as an atomic property.
3. When you use the multiprocessing module, everything ends up being pickled
   and sent to the individual workers in the pool.
4. By default, when you pickle RDKit molecules, the properties (things you
   access via GetProp()) are not preserved.
5. So when you call a function using multiprocessing, the CIP information
   doesn't make it through to the function you call, and you don't see any
   difference between different stereoisomers.

The fix to #1993 (https://github.com/rdkit/rdkit/issues/1993), which was
part of the 2018.09 release, modified the Morgan fingerprinting code so
that it re-assigns stereochemistry when that information is not present
already.
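
The mechanism described above can be illustrated with a toy sketch that does not use RDKit at all. ToyMol, its props dictionary, and cip_label() are hypothetical stand-ins invented for this illustration; the only assumption carried over from the explanation is that the pickled form keeps the structure and drops the properties. pickle.dumps()/pickle.loads() is called by hand because that is how multiprocessing ships arguments to workers.

```python
import pickle


class ToyMol:
    """Hypothetical stand-in for a molecule; NOT an RDKit class."""

    def __init__(self, smiles):
        self.smiles = smiles
        self.props = {}  # plays the role of the GetProp()/SetProp() storage

    def __getstate__(self):
        # Mimic a pickle format that keeps the structure but omits properties.
        return {"smiles": self.smiles}

    def __setstate__(self, state):
        self.smiles = state["smiles"]
        self.props = {}  # properties come back empty after the round trip


def cip_label(mol):
    """Return the cached stereo label, or None if it did not survive."""
    return mol.props.get("_CIPCode")


mol = ToyMol("N[C@@H](C)C(=O)O")
mol.props["_CIPCode"] = "S"  # label set by hand, purely for illustration

# multiprocessing pickles arguments before sending them to a worker;
# the same round trip is done explicitly here.
received = pickle.loads(pickle.dumps(mol))

print(cip_label(mol))       # prints S: the label is present in the parent
print(cip_label(received))  # prints None: the worker-side copy has lost it
```

Any code in the worker that relies on the cached label therefore sees two stereoisomers as identical unless it recomputes the label itself, which is what the fix mentioned above does. RDKit also exposes Chem.SetDefaultPickleProperties() to control which properties are pickled; whether that is enough to carry CIP codes through on a given release is worth checking rather than assuming.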

Best,
-greg


On Tue, May 19, 2020 at 11:53 PM Hao  wrote:

> Hello,
>
> This was a very strange bug that I saw. I was getting inconsistent
> fingerprints using GetMorganFingerprint with useChirality=True, when I used
> multiprocessing vs when I ran serially on rdkit 2017.09.1 and 2018.03.2. It
> seems to have been fixed in the latest version. Woo! I was just wondering
> if anyone has any insights on what was causing this before because I was
> stumped for the longest time. Example:
>
> from multiprocessing import Pool
> from rdkit import Chem
> from rdkit.Chem import AllChem
>
> def compute_ecfp_bitvect(mol, ecfp_power = 11):
>     print(Chem.MolToSmiles(mol, isomericSmiles=True))
>     print(list(Chem.AllChem.GetMorganFingerprintAsBitVect(mol, radius=2,
>         nBits=2 ** ecfp_power, useChirality=True).GetOnBits()))
>     return Chem.AllChem.GetMorganFingerprintAsBitVect(mol, radius=2,
>         nBits=2 ** ecfp_power, useChirality=True)
>
> smiles = ["N[C@@H](C)C(=O)O", "N[C@H](C)C(=O)O"]
>
> mol1 = Chem.MolFromSmiles(smiles[0])
> mol2 = Chem.MolFromSmiles(smiles[1])
> print("with pool")
> with Pool(1) as pool:
>     jobs = pool.imap(compute_ecfp_bitvect, [mol1,mol2])
>     list(jobs)
> print("without pool")
> [compute_ecfp_bitvect(m) for m in [mol1,mol2]]
>
> = Output =
> with pool
> C[C@H](N)C(=O)O
> [1, 283, 389, 537, 650, 786, 807, 1057, 1119, 1171, 1844, 1917]
> C[C@@H](N)C(=O)O
> [1, 283, 389, 537, 650, 786, 807, 1057, 1119, 1171, 1844, 1917]
> without  pool
> C[C@H](N)C(=O)O
> [1, 283, 389, 650, 786, 807, 1057, 1112, 1171, 1187, 1844, 1917]
> C[C@@H](N)C(=O)O
> [1, 46, 283, 389, 650, 786, 807, 1057, 1113, 1171, 1844, 1917]
>
> Thanks and hope everyone is staying healthy!
> Hao
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Comparing sets of comformers

2020-05-19 Thread Othman Al Bahri
Hello,

I have two sets of conformers (ca. 2000 each) of a molecule. I’d like to find the 
relative complement of set A in set B (i.e., unique conformers in set B that 
are not in set A).

I’m thinking of calculating the distance matrix of each conformer, then looping 
through all conformers to find the relative complement. However, this doesn’t 
seem like an elegant solution.

If you have any ideas, I’d be very grateful.

Regards,

Othman
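
The approach outlined in this message (one interatomic distance matrix per conformer, then a loop to find the relative complement) could be sketched as follows. This is only an illustration under stated assumptions: conformers are plain lists of (x, y, z) tuples with identical atom ordering, and the helper names and the 0.1 tolerance are hypothetical. Extracting coordinates from RDKit conformers is left out.

```python
import math
from itertools import combinations


def distance_matrix(coords):
    """Flattened upper triangle of the interatomic distance matrix."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def dm_rms(dm_a, dm_b):
    """Root-mean-square difference between two flattened distance matrices."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(dm_a, dm_b)) / len(dm_a))


def relative_complement(set_b, set_a, tol=0.1):
    """Indices of conformers in set_b that match nothing in set_a within tol."""
    dms_a = [distance_matrix(c) for c in set_a]  # computed once, then reused
    unique = []
    for i, conf in enumerate(set_b):
        dm_b = distance_matrix(conf)
        if all(dm_rms(dm_b, dm_a) > tol for dm_a in dms_a):
            unique.append(i)
    return unique


# Three-atom toy "conformers" with the same atom order.
bent = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
linear = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
bent_shifted = [(x + 5.0, y, z) for x, y, z in bent]  # same shape, translated

print(relative_complement([bent_shifted, linear], [bent]))  # prints [1]
```

Distance matrices are unchanged by translation and rotation, so no alignment step is needed, but they also cannot tell mirror-image conformers apart. For two sets of about 2000 conformers the double loop performs roughly four million comparisons; an aligned RMSD could be swapped in for dm_rms() if that metric is preferred.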


[Rdkit-discuss] Question about ECFP fingerprints when using multiprocessing and chiralty

2020-05-19 Thread Hao
Hello,

This was a very strange bug that I saw. I was getting inconsistent
fingerprints using GetMorganFingerprint with useChirality=True, when I used
multiprocessing vs when I ran serially on rdkit 2017.09.1 and 2018.03.2. It
seems to have been fixed in the latest version. Woo! I was just wondering
if anyone has any insights on what was causing this before because I was
stumped for the longest time. Example:

from multiprocessing import Pool
from rdkit import Chem
from rdkit.Chem import AllChem

def compute_ecfp_bitvect(mol, ecfp_power = 11):
    print(Chem.MolToSmiles(mol, isomericSmiles=True))
    print(list(Chem.AllChem.GetMorganFingerprintAsBitVect(mol, radius=2,
        nBits=2 ** ecfp_power, useChirality=True).GetOnBits()))
    return Chem.AllChem.GetMorganFingerprintAsBitVect(mol, radius=2,
        nBits=2 ** ecfp_power, useChirality=True)

smiles = ["N[C@@H](C)C(=O)O", "N[C@H](C)C(=O)O"]

mol1 = Chem.MolFromSmiles(smiles[0])
mol2 = Chem.MolFromSmiles(smiles[1])
print("with pool")
with Pool(1) as pool:
    jobs = pool.imap(compute_ecfp_bitvect, [mol1,mol2])
    list(jobs)
print("without pool")
[compute_ecfp_bitvect(m) for m in [mol1,mol2]]

= Output =
with pool
C[C@H](N)C(=O)O
[1, 283, 389, 537, 650, 786, 807, 1057, 1119, 1171, 1844, 1917]
C[C@@H](N)C(=O)O
[1, 283, 389, 537, 650, 786, 807, 1057, 1119, 1171, 1844, 1917]
without  pool
C[C@H](N)C(=O)O
[1, 283, 389, 650, 786, 807, 1057, 1112, 1171, 1187, 1844, 1917]
C[C@@H](N)C(=O)O
[1, 46, 283, 389, 650, 786, 807, 1057, 1113, 1171, 1844, 1917]

Thanks and hope everyone is staying healthy!
Hao


Re: [Rdkit-discuss] Substructure search issue with aliphatic/aromatic bonds

2020-05-19 Thread Paolo Tosco

Hi Theo,

I don't think the RDKit version should make a difference; did you notice 
that rdmolops.AdjustQueryProperties() does not modify the molecule in 
place, but rather returns a modified copy?


pattern_generic_bonds = Chem.AdjustQueryProperties(pattern, query_params)

That might be the reason. Also, only pattern_generic_bonds will have 
UNSPECIFIED bonds; the mols themselves will still have SINGLE and DOUBLE bonds.


Feel free to contact me off-list if you need help with the above.

Cheers,
p.



Re: [Rdkit-discuss] Substructure search issue with aliphatic/aromatic bonds

2020-05-19 Thread theozh
Hi Paolo,

thank you very much for your detailed answer.
I tried to reproduce your last suggestion (but I don't have Jupyter Notebook).
However, my bonds are still SINGLE and DOUBLE instead of UNSPECIFIED.
Could this depend on the RDKit version? I have 2019.03...

Maybe I should update and investigate further.
Theo.


Am 19.05.2020 um 16:44 schrieb Paolo Tosco:
> Hi Theo,
>
> the lack of match is due to different aromaticity flags on atoms and bonds in 
> the larger molecule.
>
> This gist provides some explanation and a possible solution:
>
> https://gist.github.com/ptosco/e410e45278b94e8f047ff224193d7788
>
> Cheers,
> p.
>
> On 19/05/2020 14:13, theozh wrote:
>> Dear RDKit-users,
>>
>> I would like to do a very simple substructure search.
>> Chapter 3.5 "Substructure Searching" in the RDKit documentation (2019.09.1) 
>> is pretty short and doesn't point to a solution. So far, I've learned that 
>> you can create your search pattern via Chem.MolFromSmiles() or 
>> Chem.MolFromSmarts().
>>
>> In the copy-and-paste minimal example below, I want to use the first SMILES 
>> in the list as the search pattern. I expect 2 matches but I get either 1 or 0 
>> matches. So I'm doing something wrong. What am I missing?
>> Is it about implicit/explicit aromatic and aliphatic bonds or some 
>> explicit/implicit hydrogens?
>> How can I find the first structure in both SMILES?
>>
>> thank you for any hints,
>> Theo.
>>
>> ### simple substructure search (but doesn't find what is expected)
>> from rdkit import Chem
>>
>> smiles_strings = '''
>> C12=CC=CN1NCCC2
>> C12=CC=CC(C=C3)=C1N3NCC2
>> '''
>> smiles_list = smiles_strings.splitlines()[1:]
>> print(smiles_list)
>>
>> pattern = Chem.MolFromSmiles(smiles_list[0])  # MolFromSmiles
>> matches = [x for x in smiles_list if 
>> Chem.MolFromSmiles(x).HasSubstructMatch(pattern)]
>> print(len(matches))   # result: 1, why not 2?
>>
>> pattern = Chem.MolFromSmarts(smiles_list[0])  # MolFromSmarts
>> matches = [x for x in smiles_list if 
>> Chem.MolFromSmiles(x).HasSubstructMatch(pattern)]
>> print(len(matches))   # result: 0, why not 2?
>> ### end of code


Re: [Rdkit-discuss] Substructure search issue with aliphatic/aromatic bonds

2020-05-19 Thread Paolo Tosco

Hi Theo,

the lack of match is due to different aromaticity flags on atoms and 
bonds in the larger molecule.


This gist provides some explanation and a possible solution:

https://gist.github.com/ptosco/e410e45278b94e8f047ff224193d7788

Cheers,
p.



Re: [Rdkit-discuss] performance issues with PandasTools LoadSDF

2020-05-19 Thread Mario Lovrić
The original file is around 25MB.

I changed the content in the dataframe (from sdf) and wrote:

PandasTools.WriteSDF(dataframe, 'path',
                     properties=dataframe.columns,
                     idName='ID')


Thanks, Mario

On Tue, May 19, 2020 at 9:13 AM Greg Landrum  wrote:

> Hi Mario,
>
> how big is the file?
> did you *add* properties to it or just modify existing values?
>
> -greg
>
>
> On Fri, May 15, 2020 at 11:34 AM Mario Lovrić 
> wrote:
>
>> Dear all,
>>
>>
>> I have loaded an SDF file (let's call it file1) with PandasTools, corrected
>> some properties and wrote it with PandasTools to SDF again (file2).
>> I have two issues there:
>>
>> 1st)
>> file2 had additional integers added to properties tags:
>>
>> >(2)
>> >(2)
>> >(2)
>> >(2)
>>
>> 2nd)
>> file2 is impossible to load again, it is eating up my CPU and RAM like
>> crazy
>> file1 is usually loaded within 30 seconds
>>
>> Any recommendations there? Is there something I didn't consider?
>> Thanks
>> --
>> Mario Lovrić

-- 
Mario Lovrić


[Rdkit-discuss] Substructure search issue with aliphatic/aromatic bonds

2020-05-19 Thread theozh
Dear RDKit-users,

I would like to do a very simple substructure search.
Chapter 3.5 "Substructure Searching" in the RDKit documentation (2019.09.1) is 
pretty short and doesn't point to a solution. So far, I've learned that you can 
create your search pattern via Chem.MolFromSmiles() or Chem.MolFromSmarts().

In the copy-and-paste minimal example below, I want to use the first SMILES in the 
list as the search pattern. I expect 2 matches but I get either 1 or 0 matches. So 
I'm doing something wrong. What am I missing?
Is it about implicit/explicit aromatic and aliphatic bonds or some 
explicit/implicit hydrogens?
How can I find the first structure in both SMILES?

thank you for any hints,
Theo.

### simple substructure search (but doesn't find what is expected)
from rdkit import Chem

smiles_strings = '''
C12=CC=CN1NCCC2
C12=CC=CC(C=C3)=C1N3NCC2
'''
smiles_list = smiles_strings.splitlines()[1:]
print(smiles_list)

pattern = Chem.MolFromSmiles(smiles_list[0])  # MolFromSmiles
matches = [x for x in smiles_list if 
Chem.MolFromSmiles(x).HasSubstructMatch(pattern)]
print(len(matches))   # result: 1, why not 2?

pattern = Chem.MolFromSmarts(smiles_list[0])  # MolFromSmarts
matches = [x for x in smiles_list if 
Chem.MolFromSmiles(x).HasSubstructMatch(pattern)]
print(len(matches))   # result: 0, why not 2?
### end of code




Re: [Rdkit-discuss] performance issues with PandasTools LoadSDF

2020-05-19 Thread Greg Landrum
Hi Mario,

how big is the file?
did you *add* properties to it or just modify existing values?

-greg


On Fri, May 15, 2020 at 11:34 AM Mario Lovrić 
wrote:

> Dear all,
>
>
> I have loaded an SDF file (let's call it file1) with PandasTools, corrected
> some properties and wrote it with PandasTools to SDF again (file2).
> I have two issues there:
>
> 1st)
> file2 had additional integers added to properties tags:
>
> >(2)
> >(2)
> >(2)
> >(2)
>
> 2nd)
> file2 is impossible to load again, it is eating up my CPU and RAM like
> crazy
> file1 is usually loaded within 30 seconds
>
> Any recommendations there? Is there something I didn't consider?
> Thanks
> --
> Mario Lovrić