Dear RDKit-experts,

I'm using RDKit to search for substructures in molecular structures.
I used Chem.MolFromSmiles() to build my search pattern and was wondering why 
the substructure was not found in some structures.
On Chemistry.StackExchange I got a helpful hint, and now, I guess, I better 
understand the difference between SMILES and SMARTS.

Here is an example. I guess I cannot attach images here, so for a 
visualization please see
https://chemistry.stackexchange.com/q/128440/81125
The first SMILES is used as the search pattern for all structures in the list. 
It is found in molecules 2 and 4, but not in 3 and 5.

Code:

### substructure search with RDKit
from rdkit import Chem

smiles_list = ['C12=CC=CC=C1C3=CC=CC4=C3C2=CC=C4', 
'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=CC=C6', 
'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C=C4', 
'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=C7CCCCC7=C6', 
'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=C7CC8=CC=CC=C8CC7=C6']

pattern = Chem.MolFromSmiles(smiles_list[0])
for idx,smiles in enumerate(smiles_list):
    m = Chem.MolFromSmiles(smiles)
    print("Structure {}: pattern found {}".format(idx+1, m.HasSubstructMatch(pattern)))
### end of code

Result:
Structure 1: pattern found True
Structure 2: pattern found True
Structure 3: pattern found False
Structure 4: pattern found True
Structure 5: pattern found False


The solution I have come up with so far is the following (see also 
https://chemistry.stackexchange.com/a/128453/81125):

Basically, if you convert the search SMILES 'C12=CC=CC=C1C3=CC=CC4=C3C2=CC=C4' 
to a mol and convert this mol back to SMILES via Chem.MolToSmiles(), you get 
c1ccc2c(c1)-c1cccc3cccc-2c13. If you create your search pattern from this 
string via Chem.MolFromSmarts(), you will still not find structures 3 and 5, 
probably because of the explicitly defined single bonds. However, if you 
replace - (single bond) by ~ (any bond), you get 
smiles_1b: c1ccc2c(c1)~c1cccc3cccc~2c13. With this pattern, you will also find 
structures 3 and 5.

Code: (I also added Benzene as structure 6 to have a non-match)

### substructure search with RDKit
from rdkit import Chem

smiles_list = ['C12=CC=CC=C1C3=CC=CC4=C3C2=CC=C4', 
'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=CC=C6', 
'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C=C4', 
'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=C7CCCCC7=C6', 
'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=C7CC8=CC=CC=C8CC7=C6','c1ccccc1']

def search_structure(pattern):
    for idx,smiles in enumerate(smiles_list):
        m = Chem.MolFromSmiles(smiles)
        print("Structure {}: pattern found {}".format(idx+1, m.HasSubstructMatch(pattern)))

smiles_1a  = smiles_list[0]
pattern_1a = Chem.MolFromSmiles(smiles_1a)
smiles_1b  = Chem.MolToSmiles(pattern_1a).replace('-','~')   # replace bonds
pattern_1b = Chem.MolFromSmarts(smiles_1b)

print("\nSMILES 1a: {}".format(smiles_1a))
search_structure(pattern_1a)
print("\nSMILES 1b: {}".format(smiles_1b))
search_structure(pattern_1b)
### end of code

Result:

SMILES 1a: C12=CC=CC=C1C3=CC=CC4=C3C2=CC=C4
Structure 1: pattern found True
Structure 2: pattern found True
Structure 3: pattern found False
Structure 4: pattern found True
Structure 5: pattern found False
Structure 6: pattern found False

SMILES 1b: c1ccc2c(c1)~c1cccc3cccc~2c13
Structure 1: pattern found True
Structure 2: pattern found True
Structure 3: pattern found True
Structure 4: pattern found True
Structure 5: pattern found True
Structure 6: pattern found False
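
One possible surprise with the plain .replace('-','~') used above: inside 
square brackets a '-' is a negative charge and not a bond, so a SMILES such as 
'[O-]c1ccccc1-c1ccccc1' would be turned into '[O~]c1ccccc1~c1ccccc1'. A minimal 
sketch that only touches '-' outside of brackets could look like this (the 
function name relax_single_bonds is just an illustration, not an RDKit 
function):

```python
def relax_single_bonds(smiles):
    """Replace '-' by '~' only outside of square brackets."""
    out = []
    in_bracket = False
    for ch in smiles:
        if ch == '[':
            in_bracket = True
        elif ch == ']':
            in_bracket = False
        if ch == '-' and not in_bracket:
            out.append('~')   # explicit single bond -> any bond
        else:
            out.append(ch)    # keep charges like [O-] untouched
    return ''.join(out)

print(relax_single_bonds('c1ccc2c(c1)-c1cccc3cccc-2c13'))
print(relax_single_bonds('[O-]c1ccccc1-c1ccccc1'))
```

For the pattern used in this post, the result is the same as with the plain 
replace, because the string contains no bracket atoms.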


My question now is: is this the way to go, or could this lead to other 
surprises or unexpected results?
Please excuse me for posting the same question in two places, but I guess 
this list is the primary place to ask.

Best regards,
Theo.


_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
