Re: [Rdkit-discuss] Calculating MorganFingerprint Counts for large number of Molecules

2018-05-18 Thread Andrew Dalke
On May 18, 2018, at 17:48, Jennifer Hemmerich wrote:
> I really liked the idea and I implemented it as follows: 


> df = pd.DataFrame(columns=counts.keys())
> for i, fp in enumerate(allfps):
>     logger.debug('appending %s', str(i))
>     df.append(fp, ignore_index=True)


While this saves a bit over creating an (N=100,000 x 2**32) data frame, it 
still creates a dataframe with (N x #unique keys).

This is probably where your memory is going.
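
To see why, a back-of-envelope estimate helps. The 500,000-unique-keys figure below is a hypothetical illustration, not a number from this thread:

```python
# Rough size of a dense (N x #unique keys) float64 DataFrame.
n_mols = 100_000          # molecules in the dataset
n_unique_keys = 500_000   # hypothetical count of distinct feature ids
bytes_per_cell = 8        # float64, pandas' default for appended rows

total_gb = n_mols * n_unique_keys * bytes_per_cell / 10**9
print(f"{total_gb:.0f} GB")  # -> 400 GB
```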

Your next operation keeps only the columns whose total count reaches the cutoff, and throws away the rest of the data.

> df = df[df.columns[df.sum(axis=0) >= cut]]

What you can do instead is to select only those keys with at least cut 
elements, before making the DataFrame.

One way to do that is to use the collections.Counter class, which is like a 
collections.defaultdict(int) with the addition that it implements a 
"most_common()" method.

When called with no arguments, it returns the list of (key, count) pairs, 
sorted from largest count to smallest.
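
For example, on a toy Counter:

```python
from collections import Counter

# Count letter occurrences; most_common() sorts by count, descending.
# Ties keep first-insertion order ('b' was seen before 'r').
counts = Counter("abracadabra")
print(counts.most_common())
# -> [('a', 5), ('b', 2), ('r', 2), ('c', 1), ('d', 1)]
```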

That means you can use a loop like:

selected_keys = []
for idx, count in counts.most_common():
    if count < cut:
        break
    selected_keys.append(idx)

to select only the keys which are at or above the cut value.

Once you have those keys, you can create the rows which are then passed to the 
DataFrame.

The following seems to do what you are looking for. At the very least, it gets 
you more of the way there.




from rdkit.Chem import AllChem

from collections import Counter
import pandas as pd

def get_ecfp(mols, n=3, cut=10):
    allfps = []
    counts = Counter()
    for i, mol in enumerate(mols):
        fp = AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements()
        allfps.append(fp)
        for idx, count in fp.items():
            # use "+= count" for the most frequent feature
            counts[idx] += count
            # use "+= 1" for the feature in the most fingerprints
            #counts[idx] += 1

    selected_keys = []
    for idx, count in counts.most_common():
        if count < cut:
            break
        selected_keys.append(idx)

    rows = []
    for fp in allfps:
        rows.append([fp.get(key, 0) for key in selected_keys])

    df = pd.DataFrame(
        rows,
        columns=selected_keys)

    return df

if __name__ == "__main__":
    mols = []
    for smi in ("c1ccccc1O", "CNO", "C#N"):
        mol = AllChem.MolFromSmiles(smi)
        mols.append(mol)

    print(get_ecfp(mols, cut=2))



Note that I construct a list of rows which I then pass to the DataFrame at 
once, rather than append() one row at a time as your code does.

This is because the Pandas documentation for DataFrame.append
(https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html)
says: "Iteratively appending rows to a DataFrame can be more computationally
intensive than a single concatenate. A better solution is to append those rows
to a list and then concatenate the list with the original DataFrame all at once."
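
The same batch-construction pattern, sketched with plain dicts and lists so it runs without RDKit or pandas (the fingerprint values and keys are made up):

```python
# Made-up sparse fingerprints: {feature_id: count}
allfps = [{10: 4, 42: 1}, {10: 1, 99: 8}]
selected_keys = [10, 42, 99]

# Build every row up front; a real script would then pass `rows`
# to pd.DataFrame(rows, columns=selected_keys) in one call.
rows = [[fp.get(key, 0) for key in selected_keys] for fp in allfps]
print(rows)  # -> [[4, 1, 0], [1, 0, 8]]
```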


Cheers,


Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Calculating MorganFingerprint Counts for large number of Molecules

2018-05-18 Thread Jennifer Hemmerich

Hi Greg,

thank you for the quick reply. I really liked the idea and I implemented 
it as follows:


def get_ecfp(mols, n=3, cut=10):
    allfps = []
    counts = defaultdict(int)
    for i, mol in enumerate(mols):
        fp = AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements()
        logger.debug('Getting fp %s', str(i))
        allfps.append(fp)
        for idx, count in fp.items():
            counts[idx] += count
    df = pd.DataFrame(columns=counts.keys())
    for i, fp in enumerate(allfps):
        logger.debug('appending %s', str(i))
        df.append(fp, ignore_index=True)

    df = df[df.columns[df.sum(axis=0) >= cut]]
    return df

Not sure if I got your idea right, but this way it made sense to me. 
Unfortunately the process gets killed somewhere while appending to the 
dataframe. I did not have time to look into it fully; I just wanted to 
check whether I got your idea right. I'll play around with this and the 
scipy sparse matrices more at the weekend.


Thanks!
Jennifer



On 2018-05-18 11:00, Greg Landrum wrote:

Hi Jennifer,

I think what you're going to want to do is start by getting the 
indices and counts that are present across the entire dataset and then 
using those to decide what the columns in your pandas table will be. 
Here's some code that may help get you started:


from collections import defaultdict

def get_ecfp(mols, n=3, cut=10):
    allfps = []
    counts = defaultdict(int)
    for i, mol in enumerate(mols):
        fp = AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements()
        allfps.append(fp)
        for idx, count in fp.items():
            counts[idx] += count
    print(counts)

If that's not enough, let me know and I will try to do a full version 
tomorrow.


-greg




On Fri, May 18, 2018 at 9:41 AM Jennifer Hemmerich wrote:


Hi all,

I am trying to calculate Morgan Fingerprints for approximately
100.000 Molecules. I do not want to use the folded FPs. I want to
use the counts for the bits which are on and afterwards drop all
the infrequent features. So I would ultimately need a dataframe
with the counts for each molecule assigned to the respective bit
of the FP like this:

Molecule     1    2...   6...   ...n
Structure1   0    0      4      1
Structure2   1    0      0      8

The function I am currently using is:

def get_ecfp(mols, n=3, cut=10):

    df = []
    all_dfs = pd.DataFrame()
    for i, mol in enumerate(mols):
        d = pd.DataFrame(AllChem.GetMorganFingerprint(mol,
            n).GetNonzeroElements(), index=[mols.index[i]], dtype='uint8')
        df.append(d)
        logger.debug('Calculating %s', str(i))

        # append all collected dataframes in between to prevent memory issue
        if i%2 == 0 or i == len(mols):
            logger.debug('Concatenation %s', str(i))
            part = pd.concat(df)
            all_dfs = pd.concat([part, all_dfs])
            logger.debug('Datatypes %s', str(all_dfs.dtypes))
            del part
            df = []

    all_dfs = all_dfs[all_dfs.columns[all_dfs.sum(axis=0) >= cut]]  # drop columns where count < 10

    return df

But the concatenation is awfully slow. Without the intermediate
concatenation I am quickly running out of Memory trying to
concatenate all dataframes, although using a Machine with 128GB of
RAM.

I found the possibility to convert the fingerprint to a numpy
array. That needs me to assign a numpy array with a certain length
which is impossible, as I do not know how long the final array has
to be. Assigning it to an array without predefining the length
just never finishes the computation. If I check for the length of
the fp with fp.GetLength() I get 4294967295 which just seems to be
the maximum number of a 32bit int. This means that converting all
of the FPs to such long numpy Arrays also is not really an option.

Is there any way which I did not see to get the desired DataFrame
or ndarray out of RDKit directly? Or any better conversion? I
assume that the keys of the dict I get with  GetNonzeroElements()
are the set bits of the 4294967295 bit long vector?

Thanks in advance!

Jennifer






Re: [Rdkit-discuss] Calculating MorganFingerprint Counts for large number of Molecules

2018-05-18 Thread Greg Landrum
Hi Jennifer,

I think what you're going to want to do is start by getting the indices and
counts that are present across the entire dataset and then using those to
decide what the columns in your pandas table will be. Here's some code that
may help get you started:

from collections import defaultdict

def get_ecfp(mols, n=3, cut=10):
    allfps = []
    counts = defaultdict(int)
    for i, mol in enumerate(mols):
        fp = AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements()
        allfps.append(fp)
        for idx, count in fp.items():
            counts[idx] += count
    print(counts)

If that's not enough, let me know and I will try to do a full version
tomorrow.

-greg




On Fri, May 18, 2018 at 9:41 AM Jennifer Hemmerich <
jennyhemmeric...@gmail.com> wrote:

> Hi all,
>
> I am trying to calculate Morgan Fingerprints for approximately 100.000
> Molecules. I do not want to use the folded FPs. I want to use the counts
> for the bits which are on and afterwards drop all the infrequent features.
> So I would ultimately need a dataframe with the counts for each molecule
> assigned to the respective bit of the FP like this:
> Molecule     1    2...   6...   ...n
> Structure1   0    0      4      1
> Structure2   1    0      0      8
>
> The function I am currently using is:
>
> def get_ecfp(mols, n=3, cut=10):
>
>     df = []
>     all_dfs = pd.DataFrame()
>     for i, mol in enumerate(mols):
>         d = pd.DataFrame(AllChem.GetMorganFingerprint(mol,
>             n).GetNonzeroElements(), index=[mols.index[i]], dtype='uint8')
>         df.append(d)
>         logger.debug('Calculating %s', str(i))
>
>         # append all collected dataframes in between to prevent memory issue
>         if i%2 == 0 or i == len(mols):
>             logger.debug('Concatenation %s', str(i))
>             part = pd.concat(df)
>             all_dfs = pd.concat([part, all_dfs])
>             logger.debug('Datatypes %s', str(all_dfs.dtypes))
>             del part
>             df = []
>
>     all_dfs = all_dfs[all_dfs.columns[all_dfs.sum(axis=0) >= cut]]  # drop columns where count < 10
>
>     return df
>
> But the concatenation is awfully slow. Without the intermediate
> concatenation I am quickly running out of Memory trying to concatenate all
> dataframes, although using a Machine with 128GB of RAM.
>
> I found the possibility to convert the fingerprint to a numpy array. That
> needs me to assign a numpy array with a certain length which is impossible,
> as I do not know how long the final array has to be. Assigning it to an
> array without predefining the length just never finishes the computation.
> If I check for the length of the fp with fp.GetLength() I get 4294967295
> which just seems to be the maximum number of a 32bit int. This means that
> converting all of the FPs to such long numpy Arrays also is not really an
> option.
>
> Is there any way which I did not see to get the desired DataFrame or
> ndarray out of RDKit directly? Or any better conversion? I assume that the
> keys of the dict I get with  GetNonzeroElements() are the set bits of the
> 4294967295 bit long vector?
>
> Thanks in advance!
>
> Jennifer
>
>


[Rdkit-discuss] Calculating MorganFingerprint Counts for large number of Molecules

2018-05-18 Thread Jennifer Hemmerich

Hi all,

I am trying to calculate Morgan Fingerprints for approximately 100.000 
Molecules. I do not want to use the folded FPs. I want to use the counts 
for the bits which are on and afterwards drop all the infrequent 
features. So I would ultimately need a dataframe with the counts for 
each molecule assigned to the respective bit of the FP like this:


Molecule     1    2...   6...   ...n
Structure1   0    0      4      1
Structure2   1    0      0      8

The function I am currently using is:

def get_ecfp(mols, n=3, cut=10):

    df = []
    all_dfs = pd.DataFrame()
    for i, mol in enumerate(mols):
        d = pd.DataFrame(AllChem.GetMorganFingerprint(mol,
            n).GetNonzeroElements(), index=[mols.index[i]], dtype='uint8')
        df.append(d)
        logger.debug('Calculating %s', str(i))

        # append all collected dataframes in between to prevent memory issue
        if i%2 == 0 or i == len(mols):
            logger.debug('Concatenation %s', str(i))
            part = pd.concat(df)
            all_dfs = pd.concat([part, all_dfs])
            logger.debug('Datatypes %s', str(all_dfs.dtypes))
            del part
            df = []

    all_dfs = all_dfs[all_dfs.columns[all_dfs.sum(axis=0) >= cut]]  # drop columns where count < 10

    return df

But the concatenation is awfully slow. Without the intermediate 
concatenation I am quickly running out of Memory trying to concatenate 
all dataframes, although using a Machine with 128GB of RAM.


I found the possibility to convert the fingerprint to a numpy array. 
That needs me to assign a numpy array with a certain length which is 
impossible, as I do not know how long the final array has to be. 
Assigning it to an array without predefining the length just never 
finishes the computation. If I check for the length of the fp with 
fp.GetLength() I get 4294967295 which just seems to be the maximum 
number of a 32bit int. This means that converting all of the FPs to such 
long numpy Arrays also is not really an option.


Is there any way which I did not see to get the desired DataFrame or 
ndarray out of RDKit directly? Or any better conversion? I assume that 
the keys of the dict I get with GetNonzeroElements() are the set bits of 
the 4294967295 bit long vector?
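
(That reading matches a sparse representation: the dict maps a feature id, i.e. a position in the conceptual 2**32-wide count vector, to its nonzero count. A pure-Python sketch with made-up ids, not real RDKit output:)

```python
# Made-up sparse fingerprint: feature id -> count (not real RDKit ids)
fp = {2246728737: 4, 864662311: 1}

WIDTH = 2**32  # conceptual length of the dense count vector

def dense_value(fp, idx):
    """Value the dense 2**32-long vector would hold at position idx."""
    assert 0 <= idx < WIDTH
    return fp.get(idx, 0)

print(dense_value(fp, 2246728737), dense_value(fp, 12345))  # -> 4 0
```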


Thanks in advance!

Jennifer
