Re: [Rdkit-discuss] Use fingerprint do Clustering a large dataset of molecules
Hi Wandré,

The easiest way to avoid recalculating the fingerprints is to keep the FPS file around. The rdkit2fps program calculates the AtomPair fingerprint and converts the resulting DataStructs fingerprint object into a hex-encoded fingerprint, which is stored as text in the FPS file.

One difference that I just realized, however, is that it uses "GetHashedAtomPairFingerprintAsBitVect" while you use "GetAtomPairFingerprint". The example I gave generates a dense fingerprint rather than a sparse one. The difference is probably small, but it may make my proposed solution unusable for your needs.

Regarding a database, I didn't realize you were using one. Your original email showed a script that didn't make use of a database. The details of how to import/export data from a database are database- and schema-specific. I don't have any experience with the RDKit Postgres cartridge, so I can't offer any advice if that's what you are using.

Chemfp includes a programming API, with documentation at http://chemfp.readthedocs.io/en/chemfp-1.3/ , which may help with any data import/export. Depending on your needs, you may find that the FPS file by itself is enough.

RDKit also provides functions to convert between a dense bit vector and the hex-encoded fingerprint used in the FPS format:

  http://www.rdkit.org/docs-beta/api/rdkit.DataStructs.cDataStructs-module.html#BitVectToFPSText
  http://www.rdkit.org/docs-beta/api/rdkit.DataStructs.cDataStructs-module.html#CreateFromFPSText

Again, note that this is for an ExplicitBitVect and not an IntSparseIntVect.

Best regards,

Andrew
da...@dalkescientific.com

> On Jan 11, 2018, at 18:49, Wandré wrote:
>
> Thanks Andrew, I will try these steps.
> So, to avoid recalculating the fingerprints, how can I calculate them and
> store them in a database?
> When I calculate the AtomPair fingerprint, it returns a
> rdkit.DataStructs.cDataStructs.IntSparseIntVect object.
> How can I store this RDKit Python object in a database, and how can I read
> it back again?
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
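The idea of keeping the hex-encoded fingerprint as text carries over to a database column. The following is only a sketch under stated assumptions: it uses Python's built-in sqlite3 as a stand-in for whatever database is actually in use, the table and column names are hypothetical, and the short hex strings are made-up placeholders rather than real AtomPair fingerprints. With RDKit, the text would come from BitVectToFPSText and could be turned back into an ExplicitBitVect with CreateFromFPSText.

```python
import sqlite3


def save_fingerprints(conn, records):
    """Store (identifier, hex fingerprint) pairs.

    The table and column names are hypothetical, chosen for illustration.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fingerprints "
        "(id TEXT PRIMARY KEY, fp_hex TEXT NOT NULL)"
    )
    conn.executemany("INSERT OR REPLACE INTO fingerprints VALUES (?, ?)", records)
    conn.commit()


def load_fingerprints(conn):
    """Read the fingerprints back as a dict of identifier -> bytes."""
    rows = conn.execute("SELECT id, fp_hex FROM fingerprints ORDER BY id")
    return {mol_id: bytes.fromhex(fp_hex) for mol_id, fp_hex in rows}


# Placeholder 32-bit fingerprints, made up for this example.
records = [("phenol", "1100a0f0"), ("benzene", "0100a0f0")]

conn = sqlite3.connect(":memory:")
save_fingerprints(conn, records)
fps = load_fingerprints(conn)
```

Storing the hex text keeps the database independent of RDKit's Python objects, so the fingerprints only need to be computed once and can be reloaded by any later clustering run.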
Re: [Rdkit-discuss] Use fingerprint do Clustering a large dataset of molecules
Thanks Andrew, I will try these steps.

So, to avoid recalculating the fingerprints, how can I calculate them and store them in a database?
When I calculate the AtomPair fingerprint, it returns a rdkit.DataStructs.cDataStructs.IntSparseIntVect object.
How can I store this RDKit Python object in a database, and how can I read it back again?

--
Wandré Nunes de Pinho Veloso
Assistant Professor - Unifei - Campus Avançado de Itabira-MG
PhD candidate in Bioinformatics - Universidade Federal de Minas Gerais - UFMG
Researcher at INSILICO - Grupo Interdisciplinar em Simulação e Inteligência Computacional - UNIFEI
Member of the research group Assinaturas Biológicas, FIOCRUZ
Member of the research group Bioinformática Estrutural, UFMG
Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG

2018-01-11 12:46 GMT-02:00 Andrew Dalke:
> Chemfp works with fingerprint files, in your case, chemfp's text-based
> "FPS" format. You'll need to use 'rdkit2fps' to convert your InChI
> structures into a fingerprint.
> [...]
Re: [Rdkit-discuss] Use fingerprint do Clustering a large dataset of molecules
On Jan 11, 2018, at 12:04, Wandré wrote:
> Thanks for the link. It is very interesting. I will read very carefully.
> So, as input on ChemFP, I have to put a file with all molecules in 1 SDF?

Chemfp works with fingerprint files, in your case, chemfp's text-based "FPS" format. You'll need to use 'rdkit2fps' to convert your InChI structures into a fingerprint.

Here's an example file, where I follow the Open Babel convention of allowing an identifier after the InChI string:

  % cat examples.inchi
  InChI=1S/C6H6O/c7-6-4-2-1-3-5-6/h1-5,7H phenol
  InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H benzene
  InChI=1S/CH4/h1H4/i1D4 deuterated methane

You could also use an SDF or SMILES file.

Next, I generate AtomPair fingerprints. The output goes to "examples.fps", which I'll then display.

  % rdkit2fps --pairs examples.inchi -o examples.fps
  % cat examples.fps
  #FPS1
  #num_bits=2048
  #type=RDKit-AtomPair/2 fpSize=2048 minLength=1 maxLength=30
  #software=RDKit/2016.09.3 chemfp/3.1
  #source=examples.inchi
  #date=2018-01-11T14:38:57
  11310370030300073000 phenol
  00300077 benzene
  7070 deuterated methane

Finally, I run the clustering program, with a low threshold so it does something other than the trivial output of three clusters.

  % python taylor_butina.py -t 0.3 examples.fps
  0 true singletons
  =>

  1 false singletons
  => deuterated methane

  1 clusters
  phenol has 1 other members
  => benzene

This output format is rather ad hoc. I need to figure out what format people want from a clustering tool; preferably one that other tools can import without further conversion.

I'll be glad to hear any suggestions.

Cheers,

Andrew
da...@dalkescientific.com
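For readers who want to see what the FPS text shown above contains, here is a rough pure-Python sketch of reading it and comparing two fingerprints. It assumes the layout visible in the example (header lines starting with "#", then one record per line with the hex fingerprint first and the identifier after it) and assumes the two fields are separated by a tab. The function names and the tiny 16-bit fingerprints are made up for illustration; this is not chemfp's API, and chemfp's own readers and search functions are the better choice for real data.

```python
def parse_fps(text):
    """Return (metadata dict, list of (identifier, fingerprint bytes))."""
    metadata, records = {}, []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            # Header lines such as "#num_bits=2048"; "#FPS1" has no "=" part.
            key, sep, value = line[1:].partition("=")
            if sep:
                metadata[key] = value
            continue
        # Split on the first tab only, so identifiers may contain spaces.
        fp_hex, _, identifier = line.partition("\t")
        records.append((identifier, bytes.fromhex(fp_hex)))
    return metadata, records


def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte strings."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0


# Made-up 16-bit fingerprints, not real AtomPair output.
example = "#FPS1\n#num_bits=16\nf00f\tmol A\nf000\tmol B\n0000\tmol C\n"

metadata, records = parse_fps(example)
sim_ab = tanimoto(records[0][1], records[1][1])
sim_ac = tanimoto(records[0][1], records[2][1])
```

Here "mol A" has 8 bits set and shares 4 of them with "mol B", so their similarity is 4/8.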
Re: [Rdkit-discuss] Use fingerprint do Clustering a large dataset of molecules
Hi Andrew,

Thanks for the link. It is very interesting. I will read it very carefully.
So, as input to chemfp, do I have to put all the molecules in a single SDF file?

--
Wandré Nunes de Pinho Veloso

2018-01-11 6:59 GMT-02:00 Andrew Dalke:
> Hi Wandré,
>
> You may want to look at chemfp for this sort of clustering.
> [...]
Re: [Rdkit-discuss] Use fingerprint do Clustering a large dataset of molecules
Hi Wandré,

You may want to look at chemfp for this sort of clustering.

Last year Chris Swain reviewed a few different ways to do clustering, at https://www.macinchem.org/reviews/clustering/clustering.php . His data set had 4.4M fingerprints and it took 10 hours to cluster at a 0.8 similarity threshold.

Chemfp doesn't include the Taylor-Butina algorithm as part of the distribution. That will likely be included in the next release.

Instead, I worked with Chris to develop a version he could use for testing. It looks like the copy from his web page is not available (the download URL redirects to itself, producing an infinite loop).

I have put a copy at http://dalkescientific.com/writings/taylor_butina.py , if you want to try it out.

Best regards,

Andrew
da...@dalkescientific.com

> On Jan 11, 2018, at 09:27, Wandré wrote:
>
> Hi,
> (first of all, sorry for my poor English...)
> I'm trying to cluster a large dataset of molecules but, on a server with
> 64 GB of RAM and 32 cores, all of the RAM and cache are occupied and, after
> 10 hours, the clustering still has not finished.
> My set of molecules has more than 1 million hits. I'm using the atompair
> fingerprint and the clusterFPS Butina algorithm to do the clustering.
> What can I do?
> I thought about calculating all the fingerprints, storing them in my
> relational PostgreSQL database (not the cartridge), and storing the result of
> BulkTanimotoSimilarity (the all-against-all distance matrix), to use less RAM
> and to allow new clusterings to run in less time (I spend 30 minutes just
> calculating all the fingerprints).
> How can I store these values (the fingerprints and the BulkTanimotoSimilarity
> results)?
> Here is a part of my code:
>
> for i in range(0, len(tb_hit_data)):
>     try:
>         # I want to save the result of this step to use less CPU time (run it just once)
>         mol = Chem.MolFromInchi(tb_hit_data[i][1])
>         fps.append(Pairs.GetAtomPairFingerprint(mol))
>         ids.append(tb_hit_data[i][0])
>     except:
>         print "in mol", tb_hit_data[i][0], "AtomPair cannot be generated"
> clusters = self.clusterfps(fps, cutoff_value)
>
>
> def clusterfps(cls, fps, cutoff=0.99):
>     """Method that clusters all the data passed in fps, with a specific cutoff
>     """
>     from rdkit.ML.Cluster import Butina
>
>     # first generate the distance matrix:
>     dists = []
>     nfps = len(fps)
>     for i in range(1, nfps):
>         # This is the other step whose result I want to store in the database (run it just once)
>         sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
>         dists.extend([1 - x for x in sims])
>
>     # now cluster the data:
>     cluster_data = Butina.ClusterData(dists, nfps, cutoff, isDistData=True)
>     return cluster_data
> # End def clusterfps
>
> Thanks!
> --
> Wandré Nunes de Pinho Veloso
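A note on why the quoted script runs out of memory: the list of pairwise distances holds n(n-1)/2 values, and for n = 1,000,000 that is 499,999,500,000 values, roughly 4 TB even at 8 bytes per value, far more than 64 GB. The Taylor-Butina method only needs to know which pairs are within the threshold, so the full matrix does not have to be kept. The sketch below illustrates that idea in plain Python. It is not the code in taylor_butina.py and not RDKit's Butina.ClusterData; the function name and the toy similarity function are made up for the example. It still compares every pair, so it reduces memory use rather than running time.

```python
def butina_clusters(n, similarity, threshold):
    """Greedy Taylor-Butina clustering over items 0..n-1.

    similarity(i, j) returns a value in [0, 1]. Only pairs with
    similarity >= threshold are remembered, so no full distance
    matrix is ever stored.
    """
    neighbors = [set() for _ in range(n)]
    for i in range(1, n):
        for j in range(i):
            if similarity(i, j) >= threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)

    # Visit candidates from most to fewest neighbors (ties broken by index).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        # The centroid comes first, followed by its still-unassigned neighbors.
        members = [centroid] + sorted(neighbors[centroid] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Toy data: points on a line, with similarity = 1 - distance.
points = [0.0, 0.1, 0.2, 0.9, 1.0]


def toy_similarity(i, j):
    return 1.0 - abs(points[i] - points[j])


clusters = butina_clusters(len(points), toy_similarity, threshold=0.85)
```

With this toy data the result is two clusters: item 1 as a centroid with items 0 and 2, and item 3 with item 4. In a real run the similarity function would compare stored fingerprints instead of points on a line.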