Re: [Rdkit-discuss] Is there a way to init the conformations of smiles supplier to improve the performance for substructure matching.

2016-11-01 Thread 杨弘宾






Hi Brian,

The first point you mentioned was actually what I guessed, and it is
deprecated in my context, I think.

Thanks for the second suggestion. I tried this and the performance improved:

suppl = AllChem.SmilesMolSupplier('allmoleculenew.smi',delimiter='\t')
l = len(suppl)  # This line is crucial
suppl = list(suppl)

And the types of suppl are respectively: , ,
So, although the second suppl (after len(suppl)) can be indexed, it is not
actually a list. It is amazing that all the molecules were instantiated by the
`list` call.
: )
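
For readers who want to try this, here is a minimal Python 3 sketch of the approach described above. It is only an illustration: the inline SMILES table, the SMARTS patterns, and the variable names are assumptions, and Chem.SmilesMolSupplierFromText is used in place of the file-based supplier so that the example needs no file on disk.

```python
from rdkit import Chem

# Hypothetical stand-in for the contents of allmoleculenew.smi (SMILES<TAB>name).
SMILES_TABLE = "CCO\tethanol\nc1ccccc1O\tphenol\nCC(=O)O\tacetic_acid\n"

suppl = Chem.SmilesMolSupplierFromText(
    SMILES_TABLE, delimiter="\t", titleLine=False
)

# Parse every record once; records that fail to parse come back as None.
mols = [m for m in suppl if m is not None]

# Illustrative queries (assumed for this sketch), compiled once up front.
queries = [Chem.MolFromSmarts(p) for p in ("[OX2H]", "c1ccccc1")]

# Queries in the outer loop, as in the original use case. Nothing is re-parsed
# here because `mols` already holds the molecule objects.
match_counts = [
    sum(1 for mol in mols if mol.HasSubstructMatch(query)) for query in queries
]

# The supplier itself is not a list; only the materialized copy is.
supplier_is_list = isinstance(suppl, list)
mols_is_list = isinstance(mols, list)
```

The trade-off is the one Brian mentions: every molecule stays in memory for as long as `mols` is alive.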

Hongbin Yang

From: Brian Kelley
Date: 2016-11-01 19:56
To: 杨弘宾
CC: rdkit-discuss
Subject: Re: [Rdkit-discuss] Is there a way to init the conformations of smiles supplier to improve the performance for substructure matching.

I'll make two more points (thanks to Greg Landrum for pointing this out):
1) In your code, each call to suppl[i] makes a new molecule, so calling it twice
in a row is twice as slow. This explains your last result.
2) In my example, I was assuming that the queries were already in a Python list
and not coming from a supplier. If they are being read from a supplier, you can
easily keep them all in memory with:
queries = list(query_supplier)

Note that for large files, this can take up a lot of memory.
Thanks for the clarification, Greg.
Brian Kelley

--
Developer Access Program for Intel Xeon Phi Processors
Access to Intel Xeon Phi processor-based developer platforms.
With one year of Intel Parallel Studio XE.
Training and support from Colfax.
Order your platform today. http://sdm.link/xeonphi
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss



Re: [Rdkit-discuss] Is there a way to init the conformations of smiles supplier to improve the performance for substructure matching.

2016-11-01 Thread Brian Kelley
I'll make two more points (thanks to Greg Landrum for pointing this out):

1) In your code, each call to suppl[i] makes a new molecule, so calling it twice
in a row is twice as slow. This explains your last result.

2) In my example, I was assuming that the queries were already in a Python list
and not coming from a supplier. If they are being read from a supplier, you can
easily keep them all in memory with:

queries = list(query_supplier)

Note that for large files, this can take up a lot of memory.
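
To make point 1 concrete without depending on RDKit, here is a small pure-Python sketch. CountingSupplier is a hypothetical stand-in that rebuilds its item on every index access, as described above for suppl[i]. It is not RDKit's implementation, and the "parse" step is only a placeholder.

```python
class CountingSupplier:
    """Hypothetical stand-in for a lazy supplier: every access re-parses."""

    def __init__(self, records):
        self._records = list(records)
        self.parse_calls = 0

    def __len__(self):
        return len(self._records)

    def __getitem__(self, index):
        record = self._records[index]  # raises IndexError past the end
        self.parse_calls += 1          # count one "parse" per successful access
        return record.upper()          # placeholder for SMILES -> molecule


records = ["cco", "ccn", "ccc"]

# Pattern A: index twice per record, as in the timing test from the question.
lazy = CountingSupplier(records)
for i in range(len(lazy)):
    _ = lazy[i]
    _ = lazy[i]
calls_when_indexing_twice = lazy.parse_calls

# Pattern B: build each item once with list(), then reuse the cached objects.
cached_source = CountingSupplier(records)
cached = list(cached_source)
for item in cached:
    _ = item
    _ = item
calls_when_cached = cached_source.parse_calls

print(calls_when_indexing_twice, calls_when_cached)
```

Under these assumptions the indexed loop parses each record twice, while the list-based loop parses each record once, which is consistent with the roughly 2x difference reported in the question.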

Thanks for the clarification, Greg.

Brian Kelley



Re: [Rdkit-discuss] Is there a way to init the conformations of smiles supplier to improve the performance for substructure matching.

2016-11-01 Thread Brian Kelley
A supplier is random access, so your call to suppl[i] here is probably quite
expensive:

suppl = AllChem.SmilesMolSupplier('allmoleculenew.smi',delimiter='\t')
l = len(suppl)
for j in range(ll):  # I have to make substructures in the first loop.
    for i in range(l):
        suppl[i].GetSubstructMatches(s[j])

I highly suggest using Python iteration rather than indexing, for example:

for mol in suppl:
    for pat in s:
        mol.GetSubstructMatches(pat)

I expect this will help quite a bit. You may also consider using the
FilterCatalog, which is designed to handle larger data sets and may help in
your case.
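
A runnable Python 3 sketch of the iteration pattern suggested above. The inline SMILES data, the SMARTS patterns, and the names `patterns` and `hits` are assumptions for illustration; in the original code the molecules come from allmoleculenew.smi and the queries from `s`.

```python
from rdkit import Chem

# Hypothetical sample data in the same SMILES<TAB>name layout as the .smi file.
SMILES_TABLE = "CCO\tethanol\nc1ccccc1O\tphenol\nCCN\tethylamine\n"
suppl = Chem.SmilesMolSupplierFromText(
    SMILES_TABLE, delimiter="\t", titleLine=False
)

# Illustrative queries (assumed), compiled once before the loop.
patterns = {
    "hydroxyl": Chem.MolFromSmarts("[OX2H]"),
    "benzene": Chem.MolFromSmarts("c1ccccc1"),
}

hits = {}
for mol in suppl:  # single pass: each record is parsed once
    if mol is None:  # skip records that failed to parse
        continue
    name = mol.GetProp("_Name")  # taken from the name column
    hits[name] = {
        label: len(mol.GetSubstructMatches(pat))
        for label, pat in patterns.items()
    }
```

Because the molecules are in the outer loop, nothing has to be kept in memory beyond the current molecule. If the queries must be in the outer loop instead, the molecules need to be materialized first.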



[Rdkit-discuss] Is there a way to init the conformations of smiles supplier to improve the performance for substructure matching.

2016-11-01 Thread 杨弘宾






Hi,

Suppose I'd like to match 100 substructures against 1000 compounds
represented as SMILES. What I did is:

suppl = AllChem.SmilesMolSupplier('allmoleculenew.smi',delimiter='\t')
l = len(suppl)
for j in range(ll):  # I have to make substructures in the first loop.
    for i in range(l):
        suppl[i].GetSubstructMatches(s[j])

and I found that the performance is not good.

Then I did a comparison and found that this was because the conformations of
the compounds were not initialized. If I use MolFromSmiles, the performance
improves a lot.

start = time.clock()
suppl = AllChem.SmilesMolSupplier('allmoleculenew.smi',delimiter='\t')
l = len(suppl)
print time.clock()-start   # >>> 0.0373735355168, indicating that the molecules were not initialized.
for i in range(l):
    suppl[i].GetSubstructMatches(sa)
    suppl[i].GetSubstructMatches(sa2)
print time.clock()-start   # >>> 11.1884715172

start = time.clock()
f = open('allmoleculenew.smi')
for i in range(l):
    mol = Chem.MolFromSmiles(f.next().split('\t')[0])
    mol.GetSubstructMatches(sa)
    mol.GetSubstructMatches(sa2)
print time.clock()-start # >>> 5.44030582111

The second method was twice as fast as the first, indicating that the "init"
is more time-consuming than the matching. I think SmilesMolSupplier is a good
API for loading multiple compounds, but it does not parse the SMILES
immediately, which adds overhead to any later processing. So is it possible to
initialize the compounds manually?
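
A note for anyone rerunning this measurement today: the timing code above is Python 2 (print statement, f.next()) and relies on time.clock(), which was removed in Python 3.8. Below is a hedged Python 3 sketch of the same kind of measurement using time.perf_counter(). The names `timed` and `toy_workload` are hypothetical, and the workload is only a placeholder for the substructure-matching loops.

```python
import time


def timed(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def toy_workload(n):
    # Placeholder for a loop such as the suppl[i].GetSubstructMatches(...) calls.
    return sum(i * i for i in range(n))


result, elapsed = timed(toy_workload, 1000)
print(f"result={result}, elapsed={elapsed:.6f} s")
```

In Python 3, f.next() would likewise be written as next(f).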


Hongbin Yang 杨弘宾

Research: Toxicophore and Chemoinformatics
Pharmaceutical Science, School of Pharmacy

East China University of Science and Technology

