Re: [Rdkit-discuss] Use fingerprint do Clustering a large dataset of molecules

2018-01-11 Thread Andrew Dalke
Hi Wandré,

  The easiest way to avoid recalculating the fingerprints is to keep the FPS 
file around. The rdkit2fps program calculates the AtomPair fingerprint and 
converts the resulting DataStructs fingerprint object into a hex-encoded 
fingerprint, which is stored as text in the FPS file.
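
To give an idea of how little is needed to read such a file back, here is a minimal sketch of an FPS reader in plain Python. It is not chemfp's own parser; it only assumes that header lines start with "#" and that each record is the hex fingerprint, a tab, and the identifier. The function name and the tiny 16-bit records are made up for illustration.

```python
def read_fps(lines):
    """Return (metadata dict, list of (identifier, fingerprint bytes))."""
    metadata = {}
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header lines look like "#key=value"; "#FPS1" has no "=".
            key, _, value = line[1:].partition("=")
            metadata[key] = value
            continue
        # Record line: hex-encoded fingerprint, a tab, then the identifier.
        hex_fp, _, identifier = line.partition("\t")
        records.append((identifier, bytes.fromhex(hex_fp)))
    return metadata, records


# Hypothetical 16-bit fingerprints, just to exercise the reader.
example = [
    "#FPS1\n",
    "#num_bits=16\n",
    "0300\tmade_up_id_1\n",
    "0180\tmade_up_id_2\n",
]
metadata, records = read_fps(example)
print(metadata["num_bits"], records)
```

With a real file you would pass the open file object instead of the list, since iterating over it yields the same kind of lines.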

One difference that I just noticed, however, is that rdkit2fps uses 
"GetHashedAtomPairFingerprintAsBitVect" while you use "GetAtomPairFingerprint". 
The example I gave therefore generates a dense bit fingerprint rather than a 
sparse one. The difference is probably small, but it may make my proposed 
solution unusable for your needs.
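
To make the dense/sparse distinction concrete, here is a toy sketch in plain Python. The sparse fingerprint is modelled as a dict of feature id to count, and the dense one as a set of on-bit positions obtained by folding ids modulo the fingerprint size. That folding is my simplification and is not RDKit's hashing scheme, and the feature ids and counts are made up. The point is only that the two representations can give different Tanimoto values for the same pair.

```python
def fold_to_bits(sparse_counts, num_bits=2048):
    """Fold {feature_id: count} into a set of on-bit positions (counts are dropped)."""
    return {fid % num_bits for fid, count in sparse_counts.items() if count > 0}


def count_tanimoto(a, b):
    """Tanimoto on count vectors: sum of shared minimum counts over the rest."""
    common = sum(min(a[k], b[k]) for k in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values()) - common
    return common / total if total else 0.0


def bit_tanimoto(a_bits, b_bits):
    """Tanimoto on sets of on-bits."""
    union = len(a_bits | b_bits)
    return len(a_bits & b_bits) / union if union else 0.0


# Made-up feature ids and counts for two hypothetical molecules.
mol_a = {10: 6, 20: 3, 30: 1}
mol_b = {10: 6, 20: 1}

sparse_score = count_tanimoto(mol_a, mol_b)
dense_score = bit_tanimoto(fold_to_bits(mol_a), fold_to_bits(mol_b))
print(sparse_score, dense_score)
```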

Regarding a database, I didn't realize you were using a database. Your original 
email showed a script that didn't make use of a database.

The details of how to import/export data from a database are database- and 
schema-specific. I don't have any experience with the RDKit Postgres cartridge 
to be able to offer any advice, if that's what you are using.

Chemfp includes a programming API, with documentation at 
http://chemfp.readthedocs.io/en/chemfp-1.3/ , which may help with any data 
import/export. Depending on your needs, you may find that the FPS file by 
itself is enough.
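
If you do want the fingerprints in a relational table, one possible approach is to store the decoded bytes in a binary column next to the molecule id. The sketch below uses Python's built-in sqlite3 purely as a stand-in, since I can't test against your PostgreSQL schema. The table name, column names, and records are made up; in PostgreSQL a BYTEA column would play the role of the BLOB.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fingerprints (id TEXT PRIMARY KEY, fp BLOB NOT NULL)")


def store_fps(conn, records):
    """Store (identifier, hex fingerprint) pairs as raw bytes."""
    rows = [(identifier, bytes.fromhex(hex_fp)) for identifier, hex_fp in records]
    conn.executemany("INSERT INTO fingerprints (id, fp) VALUES (?, ?)", rows)
    conn.commit()


def load_fps(conn):
    """Read the fingerprints back as (identifier, hex fingerprint) pairs."""
    cursor = conn.execute("SELECT id, fp FROM fingerprints ORDER BY id")
    return [(identifier, bytes(fp).hex()) for identifier, fp in cursor]


# Hypothetical 16-bit fingerprints, hex-encoded as in an FPS file.
store_fps(conn, [("made_up_id_1", "0300"), ("made_up_id_2", "0180")])
loaded = load_fps(conn)
print(loaded)
```

Storing bytes rather than the hex text halves the size of the column, and the hex form can always be regenerated on the way out.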

RDKit also supports converting between the hex-encoded fingerprint used in the 
FPS format and a dense bit vector, using:

http://www.rdkit.org/docs-beta/api/rdkit.DataStructs.cDataStructs-module.html#BitVectToFPSText
http://www.rdkit.org/docs-beta/api/rdkit.DataStructs.cDataStructs-module.html#CreateFromFPSText

Again, note that this is for an ExplicitBitVect and not an IntSparseIntVect.
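
If RDKit is not available where the data is read back, the same conversion can be sketched in plain Python. This assumes the FPS convention that bit 0 is the least significant bit of the first byte; the helper names are mine, not RDKit's or chemfp's, and the 16-bit example value is made up.

```python
def hex_to_on_bits(hex_fp):
    """Return the sorted list of on-bit positions in a hex-encoded fingerprint."""
    data = bytes.fromhex(hex_fp)
    return [
        byte_index * 8 + bit
        for byte_index, byte in enumerate(data)
        for bit in range(8)
        if byte & (1 << bit)
    ]


def on_bits_to_hex(on_bits, num_bits):
    """Build the hex-encoded fingerprint for the given on-bit positions."""
    data = bytearray((num_bits + 7) // 8)
    for bit in on_bits:
        data[bit // 8] |= 1 << (bit % 8)
    return data.hex()


bits = hex_to_on_bits("0180")           # hypothetical 16-bit fingerprint
round_trip = on_bits_to_hex(bits, 16)   # back to the same hex text
print(bits, round_trip)
```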

Best regards,


Andrew
da...@dalkescientific.com


> On Jan 11, 2018, at 18:49, Wandré  wrote:
> 
> Thanks Andrew, I will try these steps.
> So, to avoid recalculating the fingerprints, how can I calculate them and 
> store them in a database?
> When I calculate the AtomPair fingerprint, it returns a 
> rdkit.DataStructs.cDataStructs.IntSparseIntVect object.
> How can I store this RDKit Python object in a database and read it back 
> again?
> 
> --
> Wandré Nunes de Pinho Veloso
> Professor Assistente - Unifei - Campus Avançado de Itabira-MG
> Doutorando em Bioinformática - Universidade Federal de Minas Gerais - UFMG
> Pesquisador do INSILICO - Grupo Interdisciplinar em Simulação e Inteligência 
> Computacional - UNIFEI
> Membro do Grupo de Pesquisa Assinaturas Biológicas da FIOCRUZ
> Membro do Grupo de Pesquisa Bioinformática Estrutural da UFMG
> Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG


--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Use fingerprint do Clustering a large dataset of molecules

2018-01-11 Thread Wandré
Thanks Andrew, I will try these steps.
So, to avoid recalculating the fingerprints, how can I calculate them and
store them in a database?
When I calculate the AtomPair fingerprint, it returns
a rdkit.DataStructs.cDataStructs.IntSparseIntVect object.
How can I store this RDKit Python object in a database and read it back
again?

--
Wandré Nunes de Pinho Veloso
Professor Assistente - Unifei - Campus Avançado de Itabira-MG
Doutorando em Bioinformática - Universidade Federal de Minas Gerais - UFMG
Pesquisador do INSILICO - Grupo Interdisciplinar em Simulação e
Inteligência Computacional - UNIFEI
Membro do Grupo de Pesquisa Assinaturas Biológicas da FIOCRUZ
Membro do Grupo de Pesquisa Bioinformática Estrutural da UFMG
Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG

2018-01-11 12:46 GMT-02:00 Andrew Dalke :

> On Jan 11, 2018, at 12:04, Wandré  wrote:
> > Thanks for the link. It is very interesting. I will read it very carefully.
> > So, as input to chemfp, do I have to provide a single SDF file with all
> > the molecules?
>
> Chemfp works with fingerprint files, in your case, chemfp's text-based
> "FPS" format. You'll need to use 'rdkit2fps' to convert your InChI
> structures into a fingerprint.
>
> Here's an example file, where I follow the Open Babel convention of
> allowing an identifier after the InChI string:
>
> % cat examples.inchi
> InChI=1S/C6H6O/c7-6-4-2-1-3-5-6/h1-5,7H phenol
> InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H benzene
> InChI=1S/CH4/h1H4/i1D4 deuterated methane
>
> You could also use an SDF or SMILES file.
>
> Next, I generate AtomPair fingerprints. The output goes to "examples.fps",
> which I'll then display.
>
> % rdkit2fps --pairs examples.inchi -o examples.fps
> % cat examples.fps
> #FPS1
> #num_bits=2048
> #type=RDKit-AtomPair/2 fpSize=2048 minLength=1 maxLength=30
> #software=RDKit/2016.09.3 chemfp/3.1
> #source=examples.inchi
> #date=2018-01-11T14:38:57
> 1100
> 00310300
> 00700303
> 00073000
> phenol
> 0030
> 0070
> 0007
> benzene
> 7000
> 0070
> deuterated methane
>
>
> Finally, I run the clustering program, with a low threshold so it does
> something other than the trivial output of three clusters.
>
> % python taylor_butina.py -t 0.3 examples.fps
> 0 true singletons
> =>
>
> 1 false singletons
> => deuterated methane
>
> 1 clusters
> phenol has 1 other members
> => benzene
>
> This output format is rather ad hoc. I need to figure out what format
> people want from a clustering tool; preferably one that other tools can
> import without further conversion.
>
> I'll be glad to hear any suggestions.
>
> Cheers,
>
>
> Andrew
> da...@dalkescientific.com
>
>
>


Re: [Rdkit-discuss] Use fingerprint do Clustering a large dataset of molecules

2018-01-11 Thread Andrew Dalke
On Jan 11, 2018, at 12:04, Wandré  wrote:
> Thanks for the link. It is very interesting. I will read it very carefully.
> So, as input to chemfp, do I have to provide a single SDF file with all the 
> molecules?

Chemfp works with fingerprint files, in your case, chemfp's text-based "FPS" 
format. You'll need to use 'rdkit2fps' to convert your InChI structures into a 
fingerprint.

Here's an example file, where I follow the Open Babel convention of allowing an 
identifier after the InChI string:

% cat examples.inchi
InChI=1S/C6H6O/c7-6-4-2-1-3-5-6/h1-5,7H phenol
InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H benzene
InChI=1S/CH4/h1H4/i1D4 deuterated methane

You could also use an SDF or SMILES file.

Next, I generate AtomPair fingerprints. The output goes to "examples.fps", 
which I'll then display.

% rdkit2fps --pairs examples.inchi -o examples.fps
% cat examples.fps
#FPS1
#num_bits=2048
#type=RDKit-AtomPair/2 fpSize=2048 minLength=1 maxLength=30
#software=RDKit/2016.09.3 chemfp/3.1
#source=examples.inchi
#date=2018-01-11T14:38:57
11310370030300073000
phenol
00300077
benzene
7070
deuterated methane


Finally, I run the clustering program, with a low threshold so it does 
something other than the trivial output of three clusters.

% python taylor_butina.py -t 0.3 examples.fps
0 true singletons
=>

1 false singletons
=> deuterated methane

1 clusters
phenol has 1 other members
=> benzene

This output format is rather ad hoc. I need to figure out what format people 
want from a clustering tool; preferably one that other tools can import without 
further conversion.

I'll be glad to hear any suggestions.
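
In case it helps to see what the clustering step is doing, here is a toy sketch of the Taylor-Butina idea in plain Python. It is not the taylor_butina.py script and does not use chemfp. Fingerprints are plain sets of on-bits, the ids and bit sets are made up, and the all-pairs neighbor search is only sensible for a handful of records.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def taylor_butina(fps, threshold):
    """fps maps id -> set of on-bits.

    Returns (true_singletons, false_singletons, clusters), where each
    cluster is (centroid id, list of member ids).
    """
    ids = list(fps)
    # Step 1: for each record, find its neighbors at or above the threshold.
    neighbors = {
        i: [j for j in ids if j != i and tanimoto(fps[i], fps[j]) >= threshold]
        for i in ids
    }
    # Step 2: visit records with the most neighbors first (ties keep input order).
    order = sorted(ids, key=lambda i: len(neighbors[i]), reverse=True)

    assigned = set()
    true_singletons, false_singletons, clusters = [], [], []
    for centroid in order:
        if centroid in assigned:
            continue
        assigned.add(centroid)
        if not neighbors[centroid]:
            # No neighbors at all at this threshold.
            true_singletons.append(centroid)
            continue
        members = [j for j in neighbors[centroid] if j not in assigned]
        if not members:
            # Had neighbors, but other clusters already claimed all of them.
            false_singletons.append(centroid)
            continue
        assigned.update(members)
        clusters.append((centroid, members))
    return true_singletons, false_singletons, clusters


# Made-up fingerprints: "e" is only similar to "c", which "a" claims first.
toy = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3},
    "c": {3, 4, 9, 10},
    "e": {9, 10, 11},
    "d": {20},
}
result = taylor_butina(toy, 0.3)
print(result)
```

The true/false singleton distinction is the same one the script reports above: a false singleton had neighbors, but they ended up in someone else's cluster.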

Cheers,


Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Use fingerprint do Clustering a large dataset of molecules

2018-01-11 Thread Wandré
Hi Andrew,

Thanks for the link. It is very interesting. I will read it very carefully.
So, as input to chemfp, do I have to provide a single SDF file with all the
molecules?

--
Wandré Nunes de Pinho Veloso
Professor Assistente - Unifei - Campus Avançado de Itabira-MG
Doutorando em Bioinformática - Universidade Federal de Minas Gerais - UFMG
Pesquisador do INSILICO - Grupo Interdisciplinar em Simulação e
Inteligência Computacional - UNIFEI
Membro do Grupo de Pesquisa Assinaturas Biológicas da FIOCRUZ
Membro do Grupo de Pesquisa Bioinformática Estrutural da UFMG
Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG

2018-01-11 6:59 GMT-02:00 Andrew Dalke :

> Hi Wandré,
>
>   You may want to look at chemfp for this sort of clustering.
>
> Last year Chris Swain reviewed a few different ways to do clustering, at
> https://www.macinchem.org/reviews/clustering/clustering.php . His data
> set had 4.4M fingerprints and it took 10 hours to cluster at 0.8 similarity
> threshold.
>
> Chemfp doesn't include the Taylor-Butina algorithm as part of the
> distribution. That will likely be included in the next release.
>
> Instead, I worked with Chris to develop a version he could use for
> testing. It looks like the copy from his web page is not available (the
> download URL redirects to itself, producing an infinite loop).
>
> I have put a copy at http://dalkescientific.com/writings/taylor_butina.py
> , if you want to try it out.
>
> Best regards,
>
> Andrew
> da...@dalkescientific.com
>
>
> > On Jan 11, 2018, at 09:27, Wandré  wrote:
> >
> > Hi,
> > (first of all, sorry for my poor English...)
> > I'm trying to cluster a large dataset of molecules, but on a server with
> > 64GB of RAM and 32 cores, all the RAM and cache are used up and, after 10
> > hours, the clustering has still not finished.
> > My set of molecules has more than 1 million hits. I'm using the atom pair
> > fingerprint and the Butina algorithm (clusterfps in the code below) for
> > the clustering.
> > What can I do?
> > I thought about calculating all the fingerprints, storing them in my
> > relational PostgreSQL database (not the cartridge), and also storing the
> > result of BulkTanimotoSimilarity (the all-against-all distance matrix),
> > to use less RAM and to let new clusterings run in less time (I spend 30
> > minutes just calculating all the fingerprints).
> > How should I store these values (fingerprints and BulkTanimotoSimilarity
> > results)?
> > Here is part of my code:
> >
> > for i in range(0, len(tb_hit_data)):
> >     try:
> >         # This step I want to save to use less CPU time (just run once)
> >         mol = Chem.MolFromInchi(tb_hit_data[i][1])
> >         fps.append(Pairs.GetAtomPairFingerprint(mol))
> >         ids.append(tb_hit_data[i][0])
> >     except:
> >         print "in mol", tb_hit_data[i][0], "AtomPair cannot be generated"
> > clusters = self.clusterfps(fps, cutoff_value)
> >
> >
> > def clusterfps(cls, fps, cutoff=0.99):
> >     """Cluster all the fingerprints passed in fps, using a specific cutoff."""
> >     from rdkit.ML.Cluster import Butina
> >
> >     # first generate the distance matrix:
> >     dists = []
> >     nfps = len(fps)
> >     for i in range(1, nfps):
> >         # This is the other step I want to store in the database (just run once)
> >         sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
> >         dists.extend([1 - x for x in sims])
> >
> >     # now cluster the data:
> >     cluster_data = Butina.ClusterData(dists, nfps, cutoff, isDistData=True)
> >     return cluster_data
> > # End def clusterfps
> >
> > Thanks!
> > --
> > Wandré Nunes de Pinho Veloso
> > Professor Assistente - Unifei - Campus Avançado de Itabira-MG
> > Doutorando em Bioinformática - Universidade Federal de Minas Gerais - UFMG
> > Pesquisador do INSILICO - Grupo Interdisciplinar em Simulação e
> > Inteligência Computacional - UNIFEI
> > Membro do Grupo de Pesquisa Assinaturas Biológicas da FIOCRUZ
> > Membro do Grupo de Pesquisa Bioinformática Estrutural da UFMG
> > Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG
>
>
>


Re: [Rdkit-discuss] Use fingerprint do Clustering a large dataset of molecules

2018-01-11 Thread Andrew Dalke
Hi Wandré,

  You may want to look at chemfp for this sort of clustering.

Last year Chris Swain reviewed a few different ways to do clustering, at 
https://www.macinchem.org/reviews/clustering/clustering.php . His data set had 
4.4M fingerprints and it took 10 hours to cluster at 0.8 similarity threshold.
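
For a sense of scale, here is the arithmetic behind why an all-against-all distance list cannot fit in memory for a data set of your size. The 1 million molecule count and the 64GB figure come from your email. The 8 bytes per distance is my assumption of the most compact practical storage (a packed array of doubles); a plain Python list of floats needs more than that.

```python
def pair_count(n):
    """Number of unordered pairs among n items: n*(n-1)/2."""
    return n * (n - 1) // 2


n_molecules = 1_000_000
n_pairs = pair_count(n_molecules)

bytes_needed = n_pairs * 8      # assumed 8 bytes per stored distance
ram_bytes = 64 * 1024**3        # the 64GB server

print(n_pairs)                  # 499999500000 distances
print(bytes_needed / ram_bytes) # roughly 58 times the available RAM
```

So even before the clustering itself starts, building the full list of distances is far beyond what the machine can hold.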

Chemfp doesn't include the Taylor-Butina algorithm as part of the distribution. 
That will likely be included in the next release.

Instead, I worked with Chris to develop a version he could use for testing. It 
looks like the copy from his web page is not available (the download URL 
redirects to itself, producing an infinite loop).

I have put a copy at http://dalkescientific.com/writings/taylor_butina.py , if 
you want to try it out.

Best regards,

Andrew
da...@dalkescientific.com


> On Jan 11, 2018, at 09:27, Wandré  wrote:
> 
> Hi,
> (first of all, sorry for my poor English...)
> I'm trying to cluster a large dataset of molecules, but on a server with 
> 64GB of RAM and 32 cores, all the RAM and cache are used up and, after 10 
> hours, the clustering has still not finished.
> My set of molecules has more than 1 million hits. I'm using the atom pair 
> fingerprint and the Butina algorithm (clusterfps in the code below) for the 
> clustering.
> What can I do?
> I thought about calculating all the fingerprints, storing them in my 
> relational PostgreSQL database (not the cartridge), and also storing the 
> result of BulkTanimotoSimilarity (the all-against-all distance matrix), to 
> use less RAM and to let new clusterings run in less time (I spend 30 minutes 
> just calculating all the fingerprints).
> How should I store these values (fingerprints and BulkTanimotoSimilarity 
> results)?
> Here is part of my code:
> 
> for i in range(0, len(tb_hit_data)):
>     try:
>         # This step I want to save to use less CPU time (just run once)
>         mol = Chem.MolFromInchi(tb_hit_data[i][1])
>         fps.append(Pairs.GetAtomPairFingerprint(mol))
>         ids.append(tb_hit_data[i][0])
>     except:
>         print "in mol", tb_hit_data[i][0], "AtomPair cannot be generated"
> clusters = self.clusterfps(fps, cutoff_value)
> 
> 
> def clusterfps(cls, fps, cutoff=0.99):
>     """Cluster all the fingerprints passed in fps, using a specific cutoff."""
>     from rdkit.ML.Cluster import Butina
> 
>     # first generate the distance matrix:
>     dists = []
>     nfps = len(fps)
>     for i in range(1, nfps):
>         # This is the other step I want to store in the database (just run once)
>         sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
>         dists.extend([1 - x for x in sims])
> 
>     # now cluster the data:
>     cluster_data = Butina.ClusterData(dists, nfps, cutoff, isDistData=True)
>     return cluster_data
> # End def clusterfps
> 
> Thanks!
> --
> Wandré Nunes de Pinho Veloso
> Professor Assistente - Unifei - Campus Avançado de Itabira-MG
> Doutorando em Bioinformática - Universidade Federal de Minas Gerais - UFMG
> Pesquisador do INSILICO - Grupo Interdisciplinar em Simulação e Inteligência 
> Computacional - UNIFEI
> Membro do Grupo de Pesquisa Assinaturas Biológicas da FIOCRUZ
> Membro do Grupo de Pesquisa Bioinformática Estrutural da UFMG
> Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG


