Re: [Rdkit-discuss] Is there a way to init the conformations of smiles supplier to improve the performance for substructure matching.
Hi Brian,

The first point you mentioned was actually what I had guessed, and I think it does not apply in my context. Thanks for the second suggestion. I tried it and the performance improved:

    suppl = AllChem.SmilesMolSupplier('allmoleculenew.smi', delimiter='\t')
    l = len(suppl)  # This line is crucial
    suppl = list(suppl)

And the types of suppl are, respectively: , ,
So, although the second suppl (after len(suppl)) is indexable, it is not actually a list. It is amazing that all the molecules were instantiated by the `list` call. : )

Hongbin Yang

From: Brian Kelley
Date: 2016-11-01 19:56
To: 杨弘宾
CC: rdkit-discuss
Subject: Re: [Rdkit-discuss] Is there a way to init the conformations of smiles supplier to improve the performance for substructure matching.

> 1) In your code each call to suppl[i] makes a new molecule, calling it
> twice in a row is twice as slow. This explains your last result.
>
> 2) [...] If they are being read from a supplier, you can easily keep
> them all in memory with:
>
>     queries = list(query_supplier)
>
> Note that for large files, this can take up a lot of memory.

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
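A minimal sketch of the list() approach described in this message, assuming a recent RDKit on Python 3. The three-record file and the two SMARTS patterns are invented for illustration and are not from the thread. titleLine=False is passed because the sample file has no header row, and None entries (records that fail to parse) are dropped.

```python
import os
import tempfile

from rdkit import Chem

# Hypothetical sample data: tab-separated SMILES and name, no header row.
records = "CCO\tethanol\nc1ccccc1\tbenzene\nCC(=O)O\tacetic_acid\n"
with tempfile.NamedTemporaryFile("w", suffix=".smi", delete=False) as handle:
    handle.write(records)
    path = handle.name

suppl = Chem.SmilesMolSupplier(path, delimiter="\t", titleLine=False)

# list() walks the supplier once, so each record is parsed a single time.
# Records that cannot be parsed come back as None and are filtered out.
mols = [mol for mol in list(suppl) if mol is not None]

# Illustrative query patterns (assumptions, not the poster's queries).
patterns = {
    "hydroxyl": Chem.MolFromSmarts("[OX2H]"),
    "benzene_ring": Chem.MolFromSmarts("c1ccccc1"),
}

# Every pattern reuses the already-parsed molecules.
hit_counts = {
    name: sum(1 for mol in mols if mol.HasSubstructMatch(pattern))
    for name, pattern in patterns.items()
}
print(hit_counts)

# Release the supplier before deleting the temporary file.
del suppl
os.remove(path)
```

As noted elsewhere in the thread, holding every molecule in a list trades memory for speed, so this suits files that fit comfortably in RAM.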
Re: [Rdkit-discuss] Is there a way to init the conformations of smiles supplier to improve the performance for substructure matching.
I'll make two more points (thanks to Greg Landrum for pointing this out):

1) In your code, each call to suppl[i] makes a new molecule, so calling it twice in a row is twice as slow. This explains your last result.

2) In my example, I was assuming that the queries were already in a Python list and not coming from a supplier. If they are being read from a supplier, you can easily keep them all in memory with:

    queries = list(query_supplier)

Note that for large files, this can take up a lot of memory.

Thanks for the clarification, Greg.

Brian Kelley

> On Nov 1, 2016, at 4:22 AM, 杨弘宾 wrote:
>
> [...]
> for i in range(l):
>     suppl[i].GetSubstructMatches(sa)
>     suppl[i].GetSubstructMatches(sa2)
> print time.clock()-start # >>> 11.1884715172
> [...]
> The second method was double faster than the first, indicating that the
> "init" is more time consuming compared to matching.
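The cost described in point 1 can be illustrated with a toy model that counts how often a record is "parsed". This is only a sketch: CountingSupplier is a hypothetical stand-in written in plain Python, not RDKit's supplier, and upper() merely simulates the parsing work.

```python
class CountingSupplier:
    """Hypothetical stand-in for a lazy supplier that re-parses on every index."""

    def __init__(self, records):
        self._records = list(records)
        self.parse_count = 0

    def __len__(self):
        return len(self._records)

    def __getitem__(self, index):
        record = self._records[index]  # raises IndexError past the end
        self.parse_count += 1          # count only successful "parses"
        return record.upper()          # simulated parsing work


records = ["cco", "ccn", "ccc"]
queries = ["q1", "q2"]

# Strategy 1: index inside the query loop, as in the original code.
indexed = CountingSupplier(records)
for _query in queries:
    for i in range(len(indexed)):
        indexed[i]
print(indexed.parse_count)  # 6: one parse per (query, record) pair

# Strategy 2: materialize once with list(), then reuse the parsed objects.
cached_source = CountingSupplier(records)
cached = list(cached_source)
for _query in queries:
    for _mol in cached:
        pass
print(cached_source.parse_count)  # 3: one parse per record
```

With 100 queries and 1000 compounds, the same arithmetic gives 100,000 parses for the indexed loop against 1,000 for the cached list.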
Re: [Rdkit-discuss] Is there a way to init the conformations of smiles supplier to improve the performance for substructure matching.
A supplier is random access, so your call to suppl[i] here is probably quite expensive:

    suppl = AllChem.SmilesMolSupplier('allmoleculenew.smi', delimiter='\t')
    l = len(suppl)
    for j in range(ll):  # I have to make substructures in the first loop.
        for i in range(l):
            suppl[i].GetSubstructMatches(s[j])

I highly suggest using Python iteration instead of indexing, such as:

    for mol in suppl:
        for pat in s:
            mol.GetSubstructMatches(pat)

I expect this will help quite a bit. You may also consider using the FilterCatalog, which is designed to handle larger data sets and may help in your case.

On Tue, Nov 1, 2016 at 4:22 AM, 杨弘宾 wrote:
> [...]
> So is it possible to manually initiate the compounds?
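The iteration pattern suggested above might look like the following on Python 3, assuming a recent RDKit. The inline records and the SMARTS patterns are made up for illustration; Chem.SmilesMolSupplierFromText is used only so the sketch needs no file on disk.

```python
from rdkit import Chem

# Hypothetical sample data: tab-separated SMILES and name, no header row.
text = "CCO\tethanol\nc1ccccc1\tbenzene\nCC(=O)O\tacetic_acid\n"
suppl = Chem.SmilesMolSupplierFromText(text, delimiter="\t", titleLine=False)

# Illustrative query patterns (assumptions, not the poster's queries).
patterns = [Chem.MolFromSmarts("[OX2H]"), Chem.MolFromSmarts("C=O")]

match_counts = {}
# Molecules go in the outer loop, so each record is parsed once and can be
# discarded after all patterns have been tried against it.
for mol in suppl:
    if mol is None:  # skip records that failed to parse
        continue
    name = mol.GetProp("_Name")
    match_counts[name] = [len(mol.GetSubstructMatches(pat)) for pat in patterns]

print(match_counts)
```

Unlike building a list of all molecules, this keeps only one parsed molecule alive at a time, which matters for large files.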
[Rdkit-discuss] Is there a way to init the conformations of smiles supplier to improve the performance for substructure matching.
Hi,

Suppose I'd like to match 100 substructures against 1000 compounds represented as SMILES. What I did is:

    suppl = AllChem.SmilesMolSupplier('allmoleculenew.smi', delimiter='\t')
    l = len(suppl)
    for j in range(ll):  # I have to make substructures in the first loop.
        for i in range(l):
            suppl[i].GetSubstructMatches(s[j])

and found that the performance is not good.

Then I did a comparison and found that it was because the conformations of the compounds were not initialized. If I use MolFromSmiles, the performance improves a lot.

    start = time.clock()
    suppl = AllChem.SmilesMolSupplier('allmoleculenew.smi', delimiter='\t')
    l = len(suppl)
    print time.clock()-start  # >>> 0.0373735355168, indicating that the molecules were not initialized.
    for i in range(l):
        suppl[i].GetSubstructMatches(sa)
        suppl[i].GetSubstructMatches(sa2)
    print time.clock()-start  # >>> 11.1884715172

    start = time.clock()
    f = open('allmoleculenew.smi')
    for i in range(l):
        mol = Chem.MolFromSmiles(f.next().split('\t')[0])
        mol.GetSubstructMatches(sa)
        mol.GetSubstructMatches(sa2)
    print time.clock()-start  # >>> 5.44030582111

The second method was twice as fast as the first, indicating that the "init" is more time-consuming than the matching. I think SmilesMolSupplier is a good API for loading multiple compounds, but it does not parse the SMILES immediately, which adds time to any later processing. So is it possible to manually initialize the compounds?

Hongbin Yang 杨弘宾
Research: Toxicophore and Chemoinformatics
Pharmaceutical Science, School of Pharmacy
East China University of Science and Technology
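The benchmark above uses time.clock(), which was removed in Python 3.8. A small timing helper built on time.perf_counter() is sketched below for readers repeating the comparison on Python 3. The timed helper, the labels, and the arithmetic workloads are placeholders; the RDKit loops from the message would go inside the with blocks.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Placeholder workloads standing in for the two loops being compared.
with timed("supplier_indexing", timings):
    sum(i * i for i in range(10_000))

with timed("mol_from_smiles", timings):
    sum(i * i for i in range(10_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.6f} s")
```

Each block gets its own start time here, so the two measurements stay independent of each other.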