Re: [Rdkit-discuss] Hankering after faster builds

2016-12-02 Thread Gianluca Sforna
On Fri, Dec 2, 2016 at 6:29 PM, Tim Dudgeon  wrote:
> But since I've been working with the Release_2016_09_2 release my Docker
> image builds on Docker Hub [1] are timing out as they sometimes exceed
> the 2 hour limit. If I try at a quiet time I can sometimes get them to
> complete, but I suppose the situation is only going to get worse.
>

I feel your pain. The RPM build has to compile everything TWICE, once
against python2 and once against python3...

--
Check out the vibrant tech community on one of the world's most 
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread George Papadatos
:)

George. 

Sent from my giPhone

> On 2 Dec 2016, at 22:11, Dimitri Maziuk  wrote:
> 
> On 12/02/2016 03:12 PM, George Papadatos wrote:
>> Here's a pragmatic idea:
>> ... would it not be safe to
>> assume that *any* word containing more than 4 'C' or 'c' characters would
>> only be a SMILES string?
> 
> pneumonoultramicroscopicsilicovolcanoconiosis
> 
> 
> -- 
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
> 



Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Dimitri Maziuk
On 12/02/2016 03:12 PM, George Papadatos wrote:
> Here's a pragmatic idea:
> ... would it not be safe to
> assume that *any* word containing more than 4 'C' or 'c' characters would
> only be a SMILES string?

pneumonoultramicroscopicsilicovolcanoconiosis


-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Andrew Dalke
On Dec 2, 2016, at 10:05 PM, Brian Kelley wrote:
> Here is a very old version of Andrew's parser in code form: ... It was fairly 
> well tested on the Sigma catalog back in the day.  It might be fun to 
> resurrect and use it in some form.

There's also my OpenSMILES parser written for Ragel:

  https://bitbucket.org/dalke/opensmiles-ragel

Taking that path goes more along the lines of what NextMove has done.

BTW, upon consideration,

>>   [^]]*   # ignore anything up to the ']'

should be more restrictive and exclude '[', ' ', newline ... or really, only 
allow those characters which are valid after the element (+, -, 0-9, @, :, T, 
H, and a few others).

The exercise is left for the students. ;)
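As a rough sketch of that stricter bracket-atom pattern (the element alternation is abbreviated here, and the allowed-tail character set is an assumption for illustration, not the full OpenSMILES grammar):

```python
import re

# Bracket-atom pattern with a restricted tail: instead of accepting
# anything up to ']' via [^]]*, only allow characters that can legally
# follow the element symbol (H-count, digits, charge, chirality, the
# atom-map colon). The element list is abbreviated for illustration.
bracket_atom = re.compile(r"""
    \[ \d*                               # optional isotope
    ( Cl? | Br? | [NOSPFI] | [cnosp] )   # abbreviated element alternation
    [H0-9+\-@:]*                         # restricted tail instead of [^]]*
    \]
""", re.VERBOSE)

assert bracket_atom.fullmatch("[NH4+]")
assert bracket_atom.fullmatch("[13CH3]")
assert not bracket_atom.fullmatch("[N something]")   # space now rejected
```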


Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Andrew Dalke
On Dec 2, 2016, at 10:12 PM, George Papadatos wrote:
> If Alexis wants to search for valid SMILES strings representing typical 
> organic molecules among text of plain English words, would it not be safe to 
> assume that any word containing more than 4 'C' or 'c' characters would only 
> be a SMILES string?

Maybe. It depends on the text. That's the problem with any sort of text 
extraction.

If it contains entries like:

  The combination of phenol (c1ccccc1O) and 
or
  The SMILES for phenol is c1ccccc1O.


then my code will extract the 'c1ccccc1O', even though the whitespace-delimited 
words "(c1ccccc1O)" and "c1ccccc1O." cause RDKit to complain with a parse 
error.

I implemented your heuristic as:

def find_possible_smiles(text):
    return [(0, 0, term) for term in text.split()
            if term.count("C") + term.count("c") >= 4]

Here are some of the matches:

/Users/dalke/talks/ICCS_2014_paper.txt:0:0 'CACTVS-specific'
/Users/dalke/talks/ICCS_2014_paper.txt:0:0 'CC12CCC3C(CCC4=CC(O)CCC34C)C1CCC2'
/Users/dalke/talks/ICCS_2014_paper2.txt:0:0 '[http://www.dalkescientific.com/writings/diary/archive/2005/03/02/faster_fingerprint_substructure_tests.html]'
/Users/dalke/talks/Sheffield2013.txt:0:0 '"C1=CC=CC=C1"'
/Users/dalke/talks/Sheffield2013.txt:0:0 '"c1ccccc1",'
/Users/dalke/talks/bugs.txt:0:0 'http://localhost:8080/files?responder=%3Cscript%3Ealert%28%22hi!%22%29%3C/script%3E'
/Users/dalke/talks/garfield.txt:0:0 'http://www.chemheritage.org/discover/collections/oral-histories/details/henderson-madeline-m.aspx'

You can see it grabbed a trailing comma for a SMILES, as well as a bunch of 
URLs. Those could, of course, be easily post-filtered. But why not use a regexp?
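Such a post-filter could be sketched roughly like this (a naive version; the exact stripping rules and the URL test are assumptions for illustration, not a tested design):

```python
# Naive post-filter sketch for the heuristic's matches: strip common
# surrounding punctuation and drop URL-like terms. The stripping rules
# here are assumptions for illustration only.
def post_filter(terms):
    kept = []
    for term in terms:
        term = term.strip('.,;:"\'')          # shed surrounding punctuation
        if term.startswith(("http://", "https://", "[http")):
            continue                          # drop URL-like matches
        kept.append(term)
    return kept

print(post_filter(['"C1=CC=CC=C1"', 'http://localhost:8080/x', 'CC(=O)[O-]']))
# → ['C1=CC=CC=C1', 'CC(=O)[O-]']
```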


Of course, another level on top of this would be de-hyphenation.

This is a well-trodden path, but not an easy one.


BTW, I tested how many missed structures there might be using your heuristic:

>>> sum(1 for line in open("/Users/dalke/databases/pubchem.smi") if line.count("C") >= 4)
68228954

% wc -l /Users/dalke/databases/pubchem.smi
 68413797 /Users/dalke/databases/pubchem.smi

I inverted the logic, so that's
  68413797 - 68228954 = 184843, i.e. about 0.3% missed.




Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread George Papadatos
Here's a pragmatic idea:

If Alexis wants to search for valid SMILES strings representing
typical *organic* molecules among text of plain English words, would it not be
safe to assume that *any* word containing more than 4 'C' or 'c' characters
would only be a SMILES string?
This simple filter (word.lower().count('c')>=4) would quickly eliminate all
normal English words, leaving only SMILES to parse. No need for regexes,
unless you really care for ISIS or IOPS molecules. :)

George

On 2 December 2016 at 19:36, Andrew Dalke  wrote:

> On Dec 2, 2016, at 11:11 AM, Greg Landrum wrote:
> > An initial start on some regexps that match SMILES is here:
> https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb
> >
> > that may also be useful
>
>
> I've put together a more gnarly regular expression to find possible SMILES
> strings. It's configured for at least 4 atom terms, but that's easy to
> change (there's a "{3,}" which can be changed as desired.)
>
> It follows the SMILES specification a bit more closely, which means
> there should be fewer false positives than the regular expression Greg
> pointed out.
>
> The file which constructs the regular expression, and an example driver,
> is attached. Here's what the output looks like:
>
>
>
>
> % python detect_smiles.py ~/talks/*.txt
> /Users/dalke/talks/ICCS_2014_paper.txt:528:532 'IOPS'
> /Users/dalke/talks/ICCS_2014_paper.txt:30150:30183
> 'CC12CCC3C(CCC4=CC(O)CCC34C)C1CCC2'
> /Users/dalke/talks/ICCS_2014_paper2.txt:3270:3274 'CBCC'
> /Users/dalke/talks/ICCS_2014_paper2.txt:10229:10239 'CC(=O)[O-]'
> /Users/dalke/talks/ICCS_2014_paper2.txt:32766:32770 'ISIS'
> /Users/dalke/talks/Sheffield2013.txt:25002:25013 'C1=CC=CC=C1'
> /Users/dalke/talks/Sheffield2013.txt:25039:25047 'c1ccccc1'
> /Users/dalke/talks/Sheffield_2016.txt:2767:2771 'CBCC'
> /Users/dalke/talks/Sheffield_2016.txt:10295:10301 'O0'
> /Users/dalke/talks/Sheffield_2016_talk.txt:7302:7306 'CBCC'
> /Users/dalke/talks/Sheffield_2016_talk.txt:7564:7568 'CBCC'
> /Users/dalke/talks/Sheffield_2016_talk.txt:7716:7720 'CBCC'
> /Users/dalke/talks/Sheffield_2016_v2.txt:2874:2878 'soon'
> /Users/dalke/talks/Sheffield_2016_v2.txt:7312:7317 'O'
> /Users/dalke/talks/Sheffield_2016_v2.txt:22770:22774 'ICCS'
> /Users/dalke/talks/Sheffield_2016_v3.txt:2982:2986 'soon'
> /Users/dalke/talks/Sheffield_2016_v3.txt:7627:7632 'O'
> /Users/dalke/talks/Sheffield_2016_v3.txt:24546:24550 'ICCS'
> /Users/dalke/talks/tdd_part_2.txt:7547:7551 'scop'
>
> You can also modify the code for line-by-line processing rather than an
> entire block of text like I did.
>
>
> As others have pointed out, this is a well-trodden path. Follow their
> warnings and advice.
>
> Also, I didn't fully test it.
>
>
>
> Andrew
> da...@dalkescientific.com
>
>
> P.S.
>
> Here's the regular expression:
>
> (?<!\w)   # this isn't a SMILES if there are letters or numbers before the term
>
> (
>
> (
> (
>  Cl? | # Cl and Br are part of the organic subset
>  Br? |
>  [NOSPFIbcnosp*] |  # as are these single-letter elements
>
>  # bracket atom
>  \[\d*  # optional atomic mass
>(# valid element names
> C[laroudsemf]? |
> Os?|N[eaibdpos]? |
> S[icernbmg]? |
> P[drmtboau]? |
> H[eofgas]? |
> c|n|o|s|p |
> A[lrsgutcm] |
> B[eraik]? |
> Dy|E[urs] |
> F[erm]? |
> G[aed] |
> I[nr]? |
> Kr? |
> L[iaur] |
> M[gnodt] |
> R[buhenaf] |
> T[icebmalh] |
> U|V|W|Xe |
> Yb?|Z[nr]
>)
>[^]]*   # ignore anything up to the ']'
> \]
> )
># allow 0 or more closures directly after any atom
> (
>   [-=#$/\\]?  # optional bond type
>   (
> [0-9] |# single digit closure
> (%[0-9][0-9])  # two digit closure
>   )
> ) *
> )
>
> (
>
> (
>  (
>   \( [-=#$/\\]?   # a '(', which can have an optional bond (no dot)
>  ) | (
>\)*   # any number of close parens, followed by
>(
>  ( \( [-=#$/\\]? ) |  # an open parens and optional bond (no dot)
>  [.-=#$/\\]?  # or a dot disconnect or bond
>)
>  )
> )
> ?
>
> (
> (
>  Cl? | # Cl and Br are part of the organic subset
>  Br? |
>  [NOSPFIbcnosp*] |  # as are these single-letter elements
>
>  # bracket atom
>  \[\d*  # optional atomic mass
>(# valid element names
> C[laroudsemf]? |
> Os?|N[eaibdpos]? |
> S[icernbmg]? |
> P[drmtboau]? |
> H[eofgas]? |
> c|n|o|s|p |
> A[lrsgutcm] |
> B[eraik]? |
> Dy|E[urs] |
> F[erm]? |
> G[aed] |
> I[nr]? |
> Kr? |
> L[iaur] |
> M[gnodt] |
> R[buhenaf] |
> T[icebmalh] |
> U|V|W|Xe |
> Yb?|Z[nr]
>)
>[^]]*   # ignore anything up to the ']'
> \]
> )
># allow 0 or more closures directly after any atom
> (
>   [-=#$/\\]?  # optional bond type
>   (
> [0-9] |# single digit closure
> (%[0-9][0-9])  # two digit closure
>   )
> ) *
> )
>
> 

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Brian Kelley
Here is a very old version of Andrew's parser in code form:

http://frowns.cvs.sourceforge.net/viewvc/frowns/frowns/smiles_parsers/Smiles.py?revision=1.1.1.1&content-type=text%2Fplain

that I used in frowns more than a decade ago.  It was fairly well tested on the 
Sigma catalog back in the day.  It might be fun to resurrect and use it in some 
form.


Brian Kelley

> On Dec 2, 2016, at 2:36 PM, Andrew Dalke  wrote:
> 
>> On Dec 2, 2016, at 11:11 AM, Greg Landrum wrote:
>> An initial start on some regexps that match SMILES is here: 
>> https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb
>> 
>> that may also be useful
> 
> 
> I've put together a more gnarly regular expression to find possible SMILES 
> strings. It's configured for at least 4 atom terms, but that's easy to change 
> (there's a "{3,}" which can be changed as desired.)
> 
> It follows the SMILES specification a bit more closely, which means there 
> should be fewer false positives than the regular expression Greg pointed out.
> 
> The file which constructs the regular expression, and an example driver, is 
> attached. Here's what the output looks like:
> 
> 
> 
> 
> % python detect_smiles.py ~/talks/*.txt
> /Users/dalke/talks/ICCS_2014_paper.txt:528:532 'IOPS'
> /Users/dalke/talks/ICCS_2014_paper.txt:30150:30183 
> 'CC12CCC3C(CCC4=CC(O)CCC34C)C1CCC2'
> /Users/dalke/talks/ICCS_2014_paper2.txt:3270:3274 'CBCC'
> /Users/dalke/talks/ICCS_2014_paper2.txt:10229:10239 'CC(=O)[O-]'
> /Users/dalke/talks/ICCS_2014_paper2.txt:32766:32770 'ISIS'
> /Users/dalke/talks/Sheffield2013.txt:25002:25013 'C1=CC=CC=C1'
> /Users/dalke/talks/Sheffield2013.txt:25039:25047 'c1ccccc1'
> /Users/dalke/talks/Sheffield_2016.txt:2767:2771 'CBCC'
> /Users/dalke/talks/Sheffield_2016.txt:10295:10301 'O0'
> /Users/dalke/talks/Sheffield_2016_talk.txt:7302:7306 'CBCC'
> /Users/dalke/talks/Sheffield_2016_talk.txt:7564:7568 'CBCC'
> /Users/dalke/talks/Sheffield_2016_talk.txt:7716:7720 'CBCC'
> /Users/dalke/talks/Sheffield_2016_v2.txt:2874:2878 'soon'
> /Users/dalke/talks/Sheffield_2016_v2.txt:7312:7317 'O'
> /Users/dalke/talks/Sheffield_2016_v2.txt:22770:22774 'ICCS'
> /Users/dalke/talks/Sheffield_2016_v3.txt:2982:2986 'soon'
> /Users/dalke/talks/Sheffield_2016_v3.txt:7627:7632 'O'
> /Users/dalke/talks/Sheffield_2016_v3.txt:24546:24550 'ICCS'
> /Users/dalke/talks/tdd_part_2.txt:7547:7551 'scop'
> 
> You can also modify the code for line-by-line processing rather than an 
> entire block of text like I did.
> 
> 
> As others have pointed out, this is a well-trodden path. Follow their 
> warnings and advice.
> 
> Also, I didn't fully test it.
> 
> 
> 
>Andrew
>da...@dalkescientific.com
> 
> 
> P.S.
> 
> Here's the regular expression:
> 
> (?<!\w)   # this isn't a SMILES if there are letters or numbers before the term
> 
> (
> 
> (
> (
> Cl? | # Cl and Br are part of the organic subset
> Br? |
> [NOSPFIbcnosp*] |  # as are these single-letter elements
> 
> # bracket atom
> \[\d*  # optional atomic mass
>   (# valid element names
>C[laroudsemf]? |
>Os?|N[eaibdpos]? |
>S[icernbmg]? |
>P[drmtboau]? |
>H[eofgas]? |
>c|n|o|s|p |
>A[lrsgutcm] |
>B[eraik]? |
>Dy|E[urs] |
>F[erm]? |
>G[aed] |
>I[nr]? |
>Kr? |
>L[iaur] |
>M[gnodt] |
>R[buhenaf] |
>T[icebmalh] |
>U|V|W|Xe |
>Yb?|Z[nr]
>   )
>   [^]]*   # ignore anything up to the ']'
> \]
> )
>   # allow 0 or more closures directly after any atom
> (
>  [-=#$/\\]?  # optional bond type
>  (
>[0-9] |# single digit closure
>(%[0-9][0-9])  # two digit closure
>  )
> ) *
> )
> 
> (
> 
> (
> (
>  \( [-=#$/\\]?   # a '(', which can have an optional bond (no dot)
> ) | (
>   \)*   # any number of close parens, followed by
>   (
> ( \( [-=#$/\\]? ) |  # an open parens and optional bond (no dot)
> [.-=#$/\\]?  # or a dot disconnect or bond
>   )
> )
> )
> ?
> 
> (
> (
> Cl? | # Cl and Br are part of the organic subset
> Br? |
> [NOSPFIbcnosp*] |  # as are these single-letter elements
> 
> # bracket atom
> \[\d*  # optional atomic mass
>   (# valid element names
>C[laroudsemf]? |
>Os?|N[eaibdpos]? |
>S[icernbmg]? |
>P[drmtboau]? |
>H[eofgas]? |
>c|n|o|s|p |
>A[lrsgutcm] |
>B[eraik]? |
>Dy|E[urs] |
>F[erm]? |
>G[aed] |
>I[nr]? |
>Kr? |
>L[iaur] |
>M[gnodt] |
>R[buhenaf] |
>T[icebmalh] |
>U|V|W|Xe |
>Yb?|Z[nr]
>   )
>   [^]]*   # ignore anything up to the ']'
> \]
> )
>   # allow 0 or more closures directly after any atom
> (
>  [-=#$/\\]?  # optional bond type
>  (
>[0-9] |# single digit closure
>(%[0-9][0-9])  # two digit closure
>  )
> ) *
> )
> 
> ){3,}  # must have at least 4 atoms
> 
> (?!\w)   # this isn't a SMILES if there are letters or numbers after the term
> 
> )
> 

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Andrew Dalke
On Dec 2, 2016, at 11:11 AM, Greg Landrum wrote:
> An initial start on some regexps that match SMILES is here: 
> https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb
> 
> that may also be useful


I've put together a more gnarly regular expression to find possible SMILES 
strings. It's configured for at least 4 atom terms, but that's easy to change 
(there's a "{3,}" which can be changed as desired.)

It follows the SMILES specification a bit more closely, which means there 
should be fewer false positives than the regular expression Greg pointed out.

The file which constructs the regular expression, and an example driver, is 
attached. Here's what the output looks like:

from __future__ import print_function

# Identify likely SMILES strings embedded in a block of text.

# This constructs a rather complicated regular expression which does a
# decent job of matching the OpenSMILES grammar. It assumes that a
# SMILES string looks like:
#
#   atom ( connector? atom)*
#
# where the "atom" is the organic subset or a 'bracket' atom term from
# the SMILES followed by any optional closures, and where the "connector"
# is the possible combinations of open/close parentheses, dot disconnect,
# or bond.

# It does not attempt to balance parentheses, ensure matching ring
# closures, or handle aromaticity. Those cannot be done with a regular
# expression.

# Written in 2016 by Andrew Dalke 

import re

# Match the atom term and any closures
atom_re = r"""
(
(
 Cl? | # Cl and Br are part of the organic subset
 Br? |
 [NOSPFIbcnosp*] |  # as are these single-letter elements

 # bracket atom
 \[\d*  # optional atomic mass
   (# valid element names
C[laroudsemf]? |
Os?|N[eaibdpos]? |
S[icernbmg]? |
P[drmtboau]? |
H[eofgas]? |
c|n|o|s|p |
A[lrsgutcm] |
B[eraik]? |
Dy|E[urs] |
F[erm]? |
G[aed] |
I[nr]? |
Kr? |
L[iaur] |
M[gnodt] |
R[buhenaf] |
T[icebmalh] |
U|V|W|Xe |
Yb?|Z[nr]
   )
   [^]]*   # ignore anything up to the ']'
\]
)
   # allow 0 or more closures directly after any atom
(
  [-=#$/\\]?  # optional bond type
  (
[0-9] |# single digit closure
(%[0-9][0-9])  # two digit closure
  )
) *
)
"""

# Things that can go between atoms. This is complicated. Some
# of the patterns are:
#   C))C
#   C(C
#   C(=C)
#   C))C
#   C)=C
#   C=C
#   C.C

connection_re = r"""
(
 (
  \( [-=#$/\\]?   # a '(', which can have an optional bond (no dot)
 ) | (
   \)*   # any number of close parens, followed by
   (
 ( \( [-=#$/\\]? ) |  # an open parens and optional bond (no dot)
 [.-=#$/\\]?  # or a dot disconnect or bond
   )
 )
)
"""
# The full regular expression. Use a zero-width assertion to ensure that
# the putative SMILES is not inside of a larger "word".
smiles_re = r"""
(?<!\w)   # this isn't a SMILES if there are letters or numbers before the term
(
""" + atom_re + r"""
(
""" + connection_re + r"""
?
""" + atom_re + r"""
){3,}  # must have at least 4 atoms

(?!\w)   # this isn't a SMILES if there are letters or numbers after the term
)
"""
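The driver portion of the attached file is truncated in the archive. A minimal sketch that produces the same filename:start:end report format might look like the following (a simplified stand-in pattern is used so the sketch is self-contained; the real smiles_re constructed above would be compiled with re.VERBOSE instead):

```python
import re

# Stand-in pattern: any run of 4+ SMILES-ish characters not embedded in
# a larger word. The full smiles_re built above would be used in
# practice, compiled with re.VERBOSE.
toy_pattern = re.compile(r"(?<!\w)[A-Za-z0-9@+\[\]()=#$%./\\-]{4,}(?!\w)")

def report_matches(filename, text, pattern=toy_pattern):
    # Emit one "filename:start:end 'match'" line per hit.
    return ["%s:%d:%d %r" % (filename, m.start(), m.end(), m.group())
            for m in pattern.finditer(text)]

for line in report_matches("example.txt", "foo CCO bar CC(=O)O"):
    print(line)
# prints: example.txt:12:19 'CC(=O)O'
```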

% python detect_smiles.py ~/talks/*.txt
/Users/dalke/talks/ICCS_2014_paper.txt:528:532 'IOPS'
/Users/dalke/talks/ICCS_2014_paper.txt:30150:30183 
'CC12CCC3C(CCC4=CC(O)CCC34C)C1CCC2'
/Users/dalke/talks/ICCS_2014_paper2.txt:3270:3274 'CBCC'
/Users/dalke/talks/ICCS_2014_paper2.txt:10229:10239 'CC(=O)[O-]'
/Users/dalke/talks/ICCS_2014_paper2.txt:32766:32770 'ISIS'
/Users/dalke/talks/Sheffield2013.txt:25002:25013 'C1=CC=CC=C1'
/Users/dalke/talks/Sheffield2013.txt:25039:25047 'c1ccccc1'
/Users/dalke/talks/Sheffield_2016.txt:2767:2771 'CBCC'
/Users/dalke/talks/Sheffield_2016.txt:10295:10301 'O0'
/Users/dalke/talks/Sheffield_2016_talk.txt:7302:7306 'CBCC'
/Users/dalke/talks/Sheffield_2016_talk.txt:7564:7568 'CBCC'
/Users/dalke/talks/Sheffield_2016_talk.txt:7716:7720 'CBCC'
/Users/dalke/talks/Sheffield_2016_v2.txt:2874:2878 'soon'
/Users/dalke/talks/Sheffield_2016_v2.txt:7312:7317 'O'
/Users/dalke/talks/Sheffield_2016_v2.txt:22770:22774 'ICCS'
/Users/dalke/talks/Sheffield_2016_v3.txt:2982:2986 'soon'
/Users/dalke/talks/Sheffield_2016_v3.txt:7627:7632 'O'
/Users/dalke/talks/Sheffield_2016_v3.txt:24546:24550 'ICCS'
/Users/dalke/talks/tdd_part_2.txt:7547:7551 'scop'

You can also modify the code for line-by-line processing rather than an entire 
block of text like I did.


As others have pointed out, this is a well-trodden path. Follow their warnings 
and advice.

Also, I didn't fully test it.



Andrew
da...@dalkescientific.com


P.S.

Here's the regular expression:

(?


Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Brian Kelley
George,
  My point was that actually parsing the words as IUPAC/SMILES is surprisingly 
effective, as opposed to an AI or rule-based system.  Without sanitization, 
RDKit parses about 60,000 SMILES/second on my laptop.  It is much 
faster still when not making molecules, but I don't have the number handy.

I expect it to be even faster when rejecting non-SMILES.  This should be 
sufficient for document scanning, I think.
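A sketch of that kind of cheap validity check (assuming RDKit is installed; `sanitize=False` skips the expensive sanitization step):

```python
from rdkit import Chem
from rdkit import RDLogger

# Silence RDKit's per-word parse-error logging, which would otherwise
# flood the console when most words are not SMILES.
RDLogger.DisableLog("rdApp.error")

def looks_like_smiles(word):
    # Parse without sanitization: much cheaper, and enough to reject
    # words that are not syntactically valid SMILES.
    return Chem.MolFromSmiles(word, sanitize=False) is not None
```

Words that survive this filter would still need a full sanitized parse before being trusted as structures.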


Brian Kelley

> On Dec 2, 2016, at 1:28 PM, George Papadatos  wrote:
> 
> I think Alexis was referring to converting actual SMILES strings found in 
> random text. Chemical entity recognition and name to structure conversion is 
> another story altogether and nowadays one can quickly go a long way with open 
> tools such as OSCAR + OPSIN in KNIME or with something like this: 
> http://chemdataextractor.org/docs/intro
> 
> George
> 
>> On 2 December 2016 at 17:35, Brian Kelley  wrote:
>> This was why they started using the dictionary lookup as I recall :). The 
>> IUPAC system they ended up using was Roger's, from when he was at OpenEye.
>> 
>> 
>> Brian Kelley
>> 
>>> On Dec 2, 2016, at 12:33 PM, Igor Filippov  
>>> wrote:
>>> 
>>> I could be wrong but I believe IBM system had a preprocessing step which 
>>> removed all known dictionary words - which would get rid of "submarine" etc.
>>> I also believe this problem has been solved multiple times in the past, 
>>> NextMove software comes to mind, chemical tagger - 
>>> http://chemicaltagger.ch.cam.ac.uk/, etc.
>>> 
>>> my 2 cents,
>>> Igor
>>> 
>>> 
>>> 
>>> 
 On Fri, Dec 2, 2016 at 11:46 AM, Brian Kelley  wrote:
 I hacked a version of RDKit's smiles parser to compute heavy atom count, 
 perhaps some version of this could be used to check smiles validity 
 without making the actual molecule.
 
 From a fun historical perspective:  IBM had an expert system to find IUPAC 
 names in documents.  They ended up finding things like "submarine" which 
 was amusing.  It turned out that just parsing all words with the IUPAC 
 parser was by far the fastest and best solution.  I expect the same will 
 be true for finding smiles.
 
 It would be interesting to put the common OCR errors into the parser as 
 well (l's and 1's are hard for instance).
 
 
> On Fri, Dec 2, 2016 at 10:46 AM, Peter Gedeck  
> wrote:
> Hello Alexis,
> 
> Depending on the size of your documents, you could consider limiting storage 
> of already-tested strings by word length and only memoizing shorter words. 
> SMILES tend to be longer, so everything above a given number of 
> characters has a higher probability of being a SMILES. Long words 
> probably also contain a lot of chemical names. Chemical names often contain 
> commas (,), so they are easy to remove quickly. 
> 
> Best,
> 
> Peter
> 
> 
>> On Fri, Dec 2, 2016 at 5:43 AM Alexis Parenty 
>>  wrote:
>> Dear Pavel And Greg,
>> 
>>  
>> 
>> Thanks Greg for the regexps link. I’ll use that too.
>> 
>> 
>> 
>> Pavel, I need to track on which document the SMILES are coming from, but 
>> I will indeed make a set of unique word for each document before 
>> looping. Thanks!
>> 
>> Best,
>> 
>> Alexis
>> 
>> 
>> On 2 December 2016 at 11:21, Pavel  wrote:
>> Hi, Alexis,
>> 
>>   If you do not need to track which document each SMILES comes from, you may just 
>> combine all words from all documents into one list, take only the unique words, and 
>> try to test them. Then you would not need to store and check for 
>> valid/non-valid strings. That would reduce the problem complexity as well.
>> 
>> Pavel.
>>> On 12/02/2016 11:11 AM, Greg Landrum wrote:
>>> An initial start on some regexps that match SMILES is here: 
>>> https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb
>>> 
>>> that may also be useful
>>> 
>>> On Fri, Dec 2, 2016 at 11:07 AM, Alexis   Parenty 
>>>  wrote:
>>> Hi Markus,
>>> 
>>> 
>>> Yes! I might discover novel compounds that way!! Would be interesting 
>>> to see what they look like…
>>> 
>>> 
>>> Good suggestion to also store the words that were correctly identified 
>>> as SMILES. I’ll add that to the script.
>>> 
>>> 
>>> I also like your “distribution of word” idea. I could safely skip any 
>>> words that occur more than 1% of the time and could try to play around 
>>> with the threshold to find an optimum.
>>> 
>>> 
>>> I will try every suggestion and will time it to see what is best. I’ll 
>>> keep everyone in the loop and will share the script and results.
>>> 
>>> 
>>> 
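The memoization idea that recurs in this thread can be sketched with functools (the validity test below is a stand-in heuristic for illustration, not the real parser call):

```python
from functools import lru_cache

# Cache the verdict per word so a word repeated across documents is only
# tested once. The test itself is a stand-in heuristic; in practice it
# would call the SMILES parser.
@lru_cache(maxsize=None)
def is_candidate(word):
    return word.lower().count("c") >= 4

def scan(words):
    return [w for w in words if is_candidate(w)]

print(scan(["CC1=CC=CC=C1", "concise", "CC1=CC=CC=C1"]))
# → ['CC1=CC=CC=C1', 'CC1=CC=CC=C1']
```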

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread George Papadatos
I think Alexis was referring to converting actual SMILES strings found in
random text. Chemical entity recognition and name to structure conversion
is another story altogether and nowadays one can quickly go a long way with
open tools such as OSCAR + OPSIN in KNIME or with something like this:
http://chemdataextractor.org/docs/intro

George

On 2 December 2016 at 17:35, Brian Kelley  wrote:

> This was why they started using the dictionary lookup as I recall :). The
> IUPAC system they ended up using was Roger's, from when he was at OpenEye.
>
> 
> Brian Kelley
>
> On Dec 2, 2016, at 12:33 PM, Igor Filippov 
> wrote:
>
> I could be wrong but I believe IBM system had a preprocessing step which
> removed all known dictionary words - which would get rid of "submarine" etc.
> I also believe this problem has been solved multiple times in the past,
> NextMove software comes to mind, chemical tagger -
> http://chemicaltagger.ch.cam.ac.uk/, etc.
>
> my 2 cents,
> Igor
>
>
>
>
> On Fri, Dec 2, 2016 at 11:46 AM, Brian Kelley 
> wrote:
>
>> I hacked a version of RDKit's smiles parser to compute heavy atom count,
>> perhaps some version of this could be used to check smiles validity without
>> making the actual molecule.
>>
>> From a fun historical perspective:  IBM had an expert system to find
>> IUPAC names in documents.  They ended up finding things like "submarine"
>> which was amusing.  It turned out that just parsing all words with the
>> IUPAC parser was by far the fastest and best solution.  I expect the same
>> will be true for finding smiles.
>>
>> It would be interesting to put the common OCR errors into the parser as
>> well (l's and 1's are hard for instance).
>>
>>
>> On Fri, Dec 2, 2016 at 10:46 AM, Peter Gedeck 
>> wrote:
>>
>>> Hello Alexis,
>>>
>>> Depending on the size of your documents, you could consider limiting storage
>>> of already-tested strings by word length and only memoizing shorter words.
>>> SMILES tend to be longer, so everything above a given number of characters
>>> has a higher probability of being a SMILES. Long words probably also
>>> contain a lot of chemical names. Chemical names often contain commas (,), so
>>> they are easy to remove quickly.
>>>
>>> Best,
>>>
>>> Peter
>>>
>>>
>>> On Fri, Dec 2, 2016 at 5:43 AM Alexis Parenty <
>>> alexis.parenty.h...@gmail.com> wrote:
>>>
 Dear Pavel And Greg,



 Thanks Greg for the regexps link. I’ll use that too.


 Pavel, I need to track on which document the SMILES are coming from,
 but I will indeed make a set of unique word for each document before
 looping. Thanks!

 Best,

 Alexis

 On 2 December 2016 at 11:21, Pavel  wrote:

 Hi, Alexis,

  If you do not need to track which document each SMILES comes from, you may just
 combine all words from all documents into one list, take only the unique words, and
 try to test them. Then you would not need to store and check for valid/non-valid
 strings. That would reduce the problem complexity as well.

 Pavel.
 On 12/02/2016 11:11 AM, Greg Landrum wrote:

 An initial start on some regexps that match SMILES is here:
 https://gist.github.com/lsauer/1312860/264ae813c2bd2c2
 7a769d261c8c6b38da34e22fb

 that may also be useful

 On Fri, Dec 2, 2016 at 11:07 AM, Alexis Parenty <
 alexis.parenty.h...@gmail.com> wrote:

 Hi Markus,


 Yes! I might discover novel compounds that way!! Would be interesting
 to see what they look like…


 Good suggestion to also store the words that were correctly identified
 as SMILES. I’ll add that to the script.


 I also like your “distribution of word” idea. I could safely skip any
 words that occur more than 1% of the time and could try to play around with
 the threshold to find an optimum.


 I will try every suggestion and will time it to see what is best. I’ll
 keep everyone in the loop and will share the script and results.


 Thanks,


 Alexis

 On 2 December 2016 at 10:47, Markus Sitzmann  wrote:

 Hi Alexis,

 you may also find some "novel" compounds by this approach :-).

 Whether your tuple solution improves performance strongly depends on
 the content of your text documents and how often they repeat the same words
 again - but my guess would be it will help. Probably the best way is even
 to look at the distribution of words before you feed them to RDKit. You
 should also "memoize" the ones that successfully generated a structure;
 it doesn't make sense to test them again.

 Markus

 On Fri, Dec 2, 2016 at 10:21 AM, Maciek Wójcikowski <
 mac...@wojcikowski.pl> wrote:

 Hi Alexis,

 You may want to filter with some regex 

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Brian Kelley
This was why they started using the dictionary lookup as I recall :). The IUPAC 
system they ended up using was Roger's, from when he was at OpenEye.


Brian Kelley

> On Dec 2, 2016, at 12:33 PM, Igor Filippov  wrote:
> 
> I could be wrong but I believe IBM system had a preprocessing step which 
> removed all known dictionary words - which would get rid of "submarine" etc.
> I also believe this problem has been solved multiple times in the past, 
> NextMove software comes to mind, chemical tagger - 
> http://chemicaltagger.ch.cam.ac.uk/, etc.
> 
> my 2 cents,
> Igor
> 
> 
> 
> 
>> On Fri, Dec 2, 2016 at 11:46 AM, Brian Kelley  wrote:
>> I hacked a version of RDKit's smiles parser to compute heavy atom count, 
>> perhaps some version of this could be used to check smiles validity without 
>> making the actual molecule.
>> 
>> From a fun historical perspective:  IBM had an expert system to find IUPAC 
>> names in documents.  They ended up finding things like "submarine" which was 
>> amusing.  It turned out that just parsing all words with the IUPAC parser 
>> was by far the fastest and best solution.  I expect the same will be true 
>> for finding smiles.
>> 
>> It would be interesting to put the common OCR errors into the parser as well 
>> (l's and 1's are hard for instance).
>> 
>> 
>>> On Fri, Dec 2, 2016 at 10:46 AM, Peter Gedeck  
>>> wrote:
>>> Hello Alexis,
>>> 
>>> Depending on the size of your documents, you could consider limiting storage 
>>> of already-tested strings by word length and only memoizing shorter words. 
>>> SMILES tend to be longer, so everything above a given number of characters 
>>> has a higher probability of being a SMILES. Long words probably also 
>>> contain a lot of chemical names. Chemical names often contain commas (,), so 
>>> they are easy to remove quickly. 
>>> 
>>> Best,
>>> 
>>> Peter
>>> 
>>> 
 On Fri, Dec 2, 2016 at 5:43 AM Alexis Parenty 
  wrote:
 Dear Pavel And Greg,
 
  
 
 Thanks Greg for the regexps link. I’ll use that too.
 
 
 
 Pavel, I need to track on which document the SMILES are coming from, but I 
 will indeed make a set of unique word for each document before looping. 
 Thanks!
 
 Best,
 
 Alexis
 
 
 On 2 December 2016 at 11:21, Pavel  wrote:
 Hi, Alexis,
 
  If you do not need to track which document each SMILES comes from, you may just 
 combine all words from all documents into one list, take only the unique words, and 
 try to test them. Then you would not need to store and check for valid/non-valid 
 strings. That would reduce the problem complexity as well.
 
 Pavel.
> On 12/02/2016 11:11 AM, Greg Landrum wrote:
> An initial start on some regexps that match SMILES is here: 
> https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb
> 
> that may also be useful
> 
> On Fri, Dec 2, 2016 at 11:07 AM, Alexis Parenty 
>  wrote:
> Hi Markus,
> 
> 
> Yes! I might discover novel compounds that way!! Would be 
> interesting to see how they look like…
> 
> 
> Good suggestion to also store the words that were correctly identified as 
> SMILES. I’ll add that to the script.
> 
> 
> I also like your “distribution of word” idea. I could safely skip any 
> words that occur more than 1% of the time and could try to play around 
> with the threshold to find an optimum.
> 
> 
> I will try every suggestions and will time it to see what is best. I’ll 
> keep everyone in the loop and will share the script and results.
> 
> 
> Thanks,
> 
> 
> Alexis
> 
> 
> On 2 December 2016 at 10:47, Markus Sitzmann  
> wrote:
> Hi Alexis,
> 
> you may find also so some "novel" compounds by this approach :-).
> 
> Whether your tuple solution improves performance strongly depends on the 
> content of your text documents and how often they repeat the same words 
> again - but my guess would be it will help. Probably the best way is even 
> to look at the distribution of words before you feed them to RDKit. You 
> should also "memorize" those ones that successfully generated a 
> structure, doesn't make sense to do it again, then.
> 
> Markus
> 
> On Fri, Dec 2, 2016 at 10:21 AM, Maciek Wójcikowski 
>  wrote:
> Hi Alexis,
> 
> You may want to filter with some regex strings containing not valid 
> characters (i.e. there is small subset of atoms that may be without 
> brackets). See "Atoms" section: 
> http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html 
> 
> The set might grow pretty quick and may be inefficient, so I'd parse all 

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Igor Filippov
I could be wrong, but I believe the IBM system had a preprocessing step which
removed all known dictionary words - which would get rid of "submarine" etc.
I also believe this problem has been solved multiple times in the past:
NextMove Software comes to mind, as does ChemicalTagger -
http://chemicaltagger.ch.cam.ac.uk/ - etc.

my 2 cents,
Igor
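The preprocessing step described above is easy to sketch. The tiny word set below is a stand-in for a real spell-checker dictionary, which you would load from a file in practice:

```python
# Sketch of the dictionary-filtering step described above. ENGLISH_WORDS is
# a stand-in; real code would load a proper word list (e.g. a spell-checker
# dictionary file).
ENGLISH_WORDS = {"submarine", "the", "reaction", "with", "acid"}

def strip_dictionary_words(words):
    """Drop tokens that are known natural-language words (case-insensitive)."""
    return [w for w in words if w.lower() not in ENGLISH_WORDS]

candidates = strip_dictionary_words(["Submarine", "CCO", "acid", "c1ccccc1"])
# Only the non-dictionary tokens survive and go on to the SMILES parser.
```

Anything that survives the filter would then be handed to the SMILES parser as before.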




On Fri, Dec 2, 2016 at 11:46 AM, Brian Kelley  wrote:

> I hacked a version of RDKit's smiles parser to compute heavy atom count,
> perhaps some version of this could be used to check smiles validity without
> making the actual molecule.
>
> From a fun historical perspective:  IBM had an expert system to find IUPAC
> names in documents.  They ended up finding things like "submarine" which
> was amusing.  It turned out that just parsing all words with the IUPAC
> parser was by far the fastest and best solution.  I expect the same will be
> true for finding smiles.
>
> It would be interesting to put the common OCR errors into the parser as
> well (l's and 1's are hard for instance).
>
>
> On Fri, Dec 2, 2016 at 10:46 AM, Peter Gedeck 
> wrote:
>
>> Hello Alexis,
>>
>> Depending on the size of your document, you could consider limit storing
>> the already tested strings by word length and only memoize shorter words.
>> SMILES tend to be longer, so everything above a given number of characters
>> has a higher probability of being a SMILES. Large words probably also
>> contain a lot of chemical names. They often contain commas (,), so they are
>> easy to remove quickly.
>>
>> Best,
>>
>> Peter
>>
>>
>> On Fri, Dec 2, 2016 at 5:43 AM Alexis Parenty <
>> alexis.parenty.h...@gmail.com> wrote:
>>
>>> Dear Pavel And Greg,
>>>
>>>
>>>
>>> Thanks Greg for the regexps link. I’ll use that too.
>>>
>>>
>>> Pavel, I need to track on which document the SMILES are coming from, but
>>> I will indeed make a set of unique word for each document before looping.
>>> Thanks!
>>>
>>> Best,
>>>
>>> Alexis
>>>
>>> On 2 December 2016 at 11:21, Pavel  wrote:
>>>
>>> Hi, Alexis,
>>>
>>>   if you should not track from which document SMILES come, you may just
>>> combine all words from all document in a list, take only unique words and
>>> try to test them. Thus, you should not store and check for valid/non-valid
>>> strings. That would reduce problem complexity as well.
>>>
>>> Pavel.
>>> On 12/02/2016 11:11 AM, Greg Landrum wrote:
>>>
>>> An initial start on some regexps that match SMILES is here:
>>> https://gist.github.com/lsauer/1312860/264ae813c2bd2c2
>>> 7a769d261c8c6b38da34e22fb
>>>
>>> that may also be useful
>>>
>>> On Fri, Dec 2, 2016 at 11:07 AM, Alexis Parenty <
>>> alexis.parenty.h...@gmail.com> wrote:
>>>
>>> Hi Markus,
>>>
>>>
>>> Yes! I might discover novel compounds that way!! Would be interesting to
>>> see how they look like…
>>>
>>>
>>> Good suggestion to also store the words that were correctly identified
>>> as SMILES. I’ll add that to the script.
>>>
>>>
>>> I also like your “distribution of word” idea. I could safely skip any
>>> words that occur more than 1% of the time and could try to play around with
>>> the threshold to find an optimum.
>>>
>>>
>>> I will try every suggestions and will time it to see what is best. I’ll
>>> keep everyone in the loop and will share the script and results.
>>>
>>>
>>> Thanks,
>>>
>>>
>>> Alexis
>>>
>>> On 2 December 2016 at 10:47, Markus Sitzmann 
>>> wrote:
>>>
>>> Hi Alexis,
>>>
>>> you may find also so some "novel" compounds by this approach :-).
>>>
>>> Whether your tuple solution improves performance strongly depends on
>>> the content of your text documents and how often they repeat the same words
>>> again - but my guess would be it will help. Probably the best way is even
>>> to look at the distribution of words before you feed them to RDKit. You
>>> should also "memorize" those ones that successfully generated a structure,
>>> doesn't make sense to do it again, then.
>>>
>>> Markus
>>>
>>> On Fri, Dec 2, 2016 at 10:21 AM, Maciek Wójcikowski <
>>> mac...@wojcikowski.pl> wrote:
>>>
>>> Hi Alexis,
>>>
>>> You may want to filter with some regex strings containing not valid
>>> characters (i.e. there is small subset of atoms that may be without
>>> brackets). See "Atoms" section: http://www.daylight.com/dayhtm
>>> l/doc/theory/theory.smiles.html
>>>
>>> The set might grow pretty quick and may be inefficient, so I'd parse all
>>> strings passing above filter. Although there will be some false positives
>>> like "CC" which may occur in text (emails especially).
>>>
>>> 
>>> Pozdrawiam,  |  Best regards,
>>> Maciek Wójcikowski
>>> mac...@wojcikowski.pl
>>>
>>> 2016-12-02 10:11 GMT+01:00 Alexis Parenty >> >:
>>>
>>> Dear all,
>>>
>>>
>>> I am looking for a way to extract SMILES scattered in many text
>>> documents (thousands documents of several pages each).
>>>
>>> At the moment, I am thinking to scan each words 

[Rdkit-discuss] Hankering after faster builds

2016-12-02 Thread Tim Dudgeon
Of course builds from source are never fast enough, and the RDKit one is 
pretty big.
So far I've lived with this and made cups of coffee.
But since I've been working with the Release_2016_09_2 release, my Docker 
image builds on Docker Hub [1] are timing out, as they sometimes exceed 
the 2-hour limit. If I try at a quiet time I can sometimes get them to 
complete, but I suppose the situation is only going to get worse.

I've tried breaking the build into 2 steps, the first to prepare the 
base image [2] and the second to build RDKit [3], and that's helped a 
bit, but I don't think there's more mileage here, as nearly all the time 
is spent in the 'make' command.

Anticipating things getting worse, does anyone have suggestions for 
speeding this up? I hanker after a --fast mode, but I suppose if there 
were one it would already be the default.

Any ideas?

In case anyone wonders, I'm building from source as I also need to create 
a Java-enabled version and one for the Postgres cartridge, and I want to 
use the same approach to generate them all.

Tim

[1] https://hub.docker.com/r/informaticsmatters/rdkit/
[2] https://github.com/InformaticsMatters/rdkit_debian_base
[3] https://github.com/InformaticsMatters/rdkit





--
Check out the vibrant tech community on one of the world's most 
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Brian Kelley
I hacked a version of RDKit's SMILES parser to compute heavy atom count;
perhaps some version of this could be used to check SMILES validity without
making the actual molecule.

From a fun historical perspective: IBM had an expert system to find IUPAC
names in documents. They ended up finding things like "submarine", which
was amusing. It turned out that just parsing all words with the IUPAC
parser was by far the fastest and best solution. I expect the same will be
true for finding SMILES.

It would be interesting to put common OCR errors into the parser as
well (l's and 1's are hard to distinguish, for instance).
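The OCR point could also be prototyped outside the parser: generate the l/1 (and O/0) substitution variants of a token and try parsing each one. A minimal sketch; the confusion table and the cap on ambiguous positions are illustrative assumptions, not anything RDKit provides:

```python
from itertools import product

# Common OCR confusions; the table and the cap on ambiguous positions are
# illustrative assumptions, not part of RDKit.
CONFUSABLE = {"l": ("l", "1"), "1": ("1", "l"),
              "O": ("O", "0"), "0": ("0", "O")}

def ocr_variants(token, max_ambiguous=6):
    """Yield every l/1 and O/0 substitution variant of a token."""
    choices = [CONFUSABLE.get(ch, (ch,)) for ch in token]
    if sum(len(c) > 1 for c in choices) > max_ambiguous:
        yield token  # too many ambiguous positions: give up, try as-is only
        return
    for combo in product(*choices):
        yield "".join(combo)

# "C1CC1" mis-OCRed as "ClCCl" still yields the intended ring among its variants.
variants = set(ocr_variants("ClCCl"))
```

Each variant would then be run through the normal SMILES parse, keeping whichever one succeeds.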



Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Peter Gedeck
Hello Alexis,

Depending on the size of your documents, you could consider limiting storage
of already-tested strings by word length and only memoizing shorter words.
SMILES tend to be longer, so everything above a given number of characters
has a higher probability of being a SMILES. Long words are also likely to
include a lot of chemical names; names often contain commas (,), so they are
easy to remove quickly.

Best,

Peter
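These heuristics might look like the following in code. The 12-character memoization threshold is an illustrative guess, not a tuned value:

```python
# Pre-filter sketch following the heuristics above. MEMO_MAX_LEN is an
# illustrative guess: only failures for words this short get memoized,
# since long words are rarer and caching them buys little.
MEMO_MAX_LEN = 12
failed_short_words = set()

def worth_parsing(word):
    """Cheap checks before handing a word to the SMILES parser."""
    if "," in word:                 # chemical names often contain commas
        return False
    if word in failed_short_words:  # already known not to be a SMILES
        return False
    return True

def record_failure(word):
    """Remember a failed parse, but only for shorter words."""
    if len(word) <= MEMO_MAX_LEN:
        failed_short_words.add(word)

record_failure("hello")     # short: memoized, skipped next time
record_failure("x" * 20)    # long: not memoized
```

The main loop would call worth_parsing() before Chem.MolFromSmiles() and record_failure() whenever the parse returns None.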


Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Alexis Parenty
Dear Pavel And Greg,



Thanks Greg for the regexps link. I’ll use that too.


Pavel, I need to track which document each SMILES comes from, but I
will indeed make a set of unique words for each document before looping.
Thanks!

Best,

Alexis


Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Pavel

Hi, Alexis,

  if you do not need to track which document each SMILES comes from, you may 
just combine the words from all documents into one list, keep only the unique 
words, and test those. Then you do not need to store and check 
valid/non-valid strings at all. That would reduce the problem's complexity as well.


Pavel.


Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Greg Landrum
An initial start on some regexps that match SMILES is here:
https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb

that may also be useful
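A much cruder pre-filter than the linked gist can also help: a token containing any character outside the rough SMILES alphabet can be rejected before it ever reaches the parser. The character class below is an approximation of that alphabet, not a validator; survivors still need a real Chem.MolFromSmiles() parse:

```python
import re

# Rough SMILES alphabet: element letters, ring-closure digits, charges,
# brackets, branches, bonds, stereo markers. This is a coarse pre-filter,
# NOT a validator; survivors still need a real Chem.MolFromSmiles() parse.
SMILES_CHARS = re.compile(r"^[A-Za-z0-9@+\-\[\]\(\)=#$%/\\.:*]+$")

def could_be_smiles(token):
    """Reject tokens containing characters that never occur in SMILES."""
    return bool(SMILES_CHARS.match(token))
```

Ordinary words made only of letters still pass, of course; the regex just cheaply removes punctuation-bearing tokens before the expensive parse.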


Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Alexis Parenty
Hi Markus,


Yes! I might discover novel compounds that way!! It would be interesting to
see what they look like…


Good suggestion to also store the words that were correctly identified as
SMILES. I’ll add that to the script.


I also like your “distribution of word” idea. I could safely skip any words
that occur more than 1% of the time and could try to play around with the
threshold to find an optimum.


I will try every suggestion and will time them to see what is best. I'll
keep everyone in the loop and will share the script and results.


Thanks,


Alexis
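The 1% cut-off mentioned above could be sketched with collections.Counter; the threshold is the one floated in the message, not a tuned value:

```python
from collections import Counter

def frequent_words(all_words, threshold=0.01):
    """Words making up more than `threshold` of the corpus; such common
    tokens are assumed to be ordinary text, not SMILES."""
    counts = Counter(all_words)
    total = len(all_words)
    return {w for w, n in counts.items() if n / total > threshold}

# Toy corpus: very common words are skipped, rare tokens go to the parser.
words = ["the"] * 50 + ["CCO"] + ["reaction"] * 49
skip = frequent_words(words)
```

Any word in the skip set would be excluded before calling Chem.MolFromSmiles().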

On 2 December 2016 at 10:47, Markus Sitzmann 
wrote:

> Hi Alexis,
>
> you may also find some "novel" compounds this way :-).
>
> Whether your tuple solution improves performance strongly depends on the
> content of your text documents and how often the same words repeat
> again - but my guess is that it will help. Probably the best approach is
> to look at the distribution of words before you feed them to RDKit. You
> should also "memorize" the ones that successfully generated a structure;
> it doesn't make sense to parse them again.
>
> Markus
>
> On Fri, Dec 2, 2016 at 10:21 AM, Maciek Wójcikowski  > wrote:
>
>> Hi Alexis,
>>
>> You may want to filter out, with a regex, strings containing invalid
>> characters (note that there is only a small subset of atoms that may
>> appear without brackets). See the "Atoms" section:
>> http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
>>
>> The set might grow pretty quickly and become inefficient, so I'd just
>> parse all strings passing the above filter, although there will be some
>> false positives like "CC", which may occur in text (emails especially).
>>
>> 
>> Pozdrawiam,  |  Best regards,
>> Maciek Wójcikowski
>> mac...@wojcikowski.pl
>>
>>> 2016-12-02 10:11 GMT+01:00 Alexis Parenty:
>>
>>> Dear all,
>>>
>>>
>>> I am looking for a way to extract SMILES scattered in many text
>>> documents (thousands of documents, several pages each).
>>>
>>> At the moment, I am planning to scan each word in the text, try to
>>> make a mol object from it using Chem.MolFromSmiles(), and store the
>>> word if it returns a mol object that is not None.
>>>
>>> Can anyone think of a better/quicker way?
>>>
>>>
>>> Would it be worth storing in a tuple any word that do not return a mol
>>> object from Chem.MolFromSmiles() and exclude them from subsequent search?
>>>
>>>
>>> Something along those lines
>>>
>>>
>>> excluded_set = set()
>>>
>>> smiles_list = []
>>>
>>> For each_word in text:
>>>
>>> If each_word not in excluded_set:
>>>
>>> each_word_mol =  Chem.MolFromSmiles(each_word)
>>>
>>> if each_word_mol is not None:
>>>
>>> smiles_list.append(each_word)
>>>
>>>  else:
>>>
>>>  excluded_set.add(each_word_mol)
>>>
>>>
>>> Would not searching into that growing tuple take actually more time than
>>> trying to blindly make a mol object for every word?
>>>
>>>
>>>
>>> Any suggestion?
>>>
>>>
>>> Many thanks and regards,
>>>
>>>
>>> Alexis
>>>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Markus Sitzmann
Hi Alexis,

you may also find some "novel" compounds by this approach :-).

Whether your tuple solution improves performance strongly depends on the
content of your text documents and how often they repeat the same words
again - but my guess is that it will help. Probably the best approach is to
look at the distribution of words before you feed them to RDKit. You should
also "memorize" the words that successfully generated a structure; it makes
no sense to parse them again.

Markus

On Fri, Dec 2, 2016 at 10:21 AM, Maciek Wójcikowski wrote:

> Hi Alexis,
>
> You may want to filter with some regex strings containing not valid
> characters (i.e. there is small subset of atoms that may be without
> brackets). See "Atoms" section: http://www.daylight.com/
> dayhtml/doc/theory/theory.smiles.html
>
> The set might grow pretty quick and may be inefficient, so I'd parse all
> strings passing above filter. Although there will be some false positives
> like "CC" which may occur in text (emails especially).
>
> 
> Pozdrawiam,  |  Best regards,
> Maciek Wójcikowski
> mac...@wojcikowski.pl
>
> 2016-12-02 10:11 GMT+01:00 Alexis Parenty :
>
>> Dear all,
>>
>>
>> I am looking for a way to extract SMILES scattered in many text documents
>> (thousands documents of several pages each).
>>
>> At the moment, I am thinking to scan each words from the text and try to
>> make a mol object from them using Chem.MolFromSmiles() then store the words
>> if they return a mol object that is not None.
>>
>> Can anyone think of a better/quicker way?
>>
>>
>> Would it be worth storing in a tuple any word that do not return a mol
>> object from Chem.MolFromSmiles() and exclude them from subsequent search?
>>
>>
>> Something along those lines
>>
>>
>> excluded_set = set()
>>
>> smiles_list = []
>>
>> For each_word in text:
>>
>> If each_word not in excluded_set:
>>
>> each_word_mol =  Chem.MolFromSmiles(each_word)
>>
>> if each_word_mol is not None:
>>
>> smiles_list.append(each_word)
>>
>>  else:
>>
>>  excluded_set.add(each_word_mol)
>>
>>
>> Would not searching into that growing tuple take actually more time than
>> trying to blindly make a mol object for every word?
>>
>>
>>
>> Any suggestion?
>>
>>
>> Many thanks and regards,
>>
>>
>> Alexis
>>


Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Alexis Parenty
Hi Maciek,


Thanks for your quick response. Excellent suggestions. I could filter out a
lot of crap that way... Maybe I could also add a filter on word length to
avoid having a lot of Ethane and Iodide false positives!


This also made me think that I could turn the text into a set of unique words
to avoid scanning all the duplicates. I can filter out quite a lot before
embarking on making mols.
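
Those two prefilters might look like this (a minimal sketch; `MIN_LEN = 5` is an arbitrary illustrative threshold, not a recommendation):

```python
MIN_LEN = 5  # arbitrary cut-off to drop short false positives like "CC" or "I"

def candidate_words(text):
    # A set keeps each distinct word once, so duplicates are checked at most once.
    return {w for w in text.split() if len(w) >= MIN_LEN}
```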


Thanks,


Alexis

On 2 December 2016 at 10:21, Maciek Wójcikowski wrote:

> Hi Alexis,
>
> You may want to filter with some regex strings containing not valid
> characters (i.e. there is small subset of atoms that may be without
> brackets). See "Atoms" section: http://www.daylight.com/
> dayhtml/doc/theory/theory.smiles.html
>
> The set might grow pretty quick and may be inefficient, so I'd parse all
> strings passing above filter. Although there will be some false positives
> like "CC" which may occur in text (emails especially).
>
> 
> Pozdrawiam,  |  Best regards,
> Maciek Wójcikowski
> mac...@wojcikowski.pl
>
> 2016-12-02 10:11 GMT+01:00 Alexis Parenty :
>
>> Dear all,
>>
>>
>> I am looking for a way to extract SMILES scattered in many text documents
>> (thousands documents of several pages each).
>>
>> At the moment, I am thinking to scan each words from the text and try to
>> make a mol object from them using Chem.MolFromSmiles() then store the words
>> if they return a mol object that is not None.
>>
>> Can anyone think of a better/quicker way?
>>
>>
>> Would it be worth storing in a tuple any word that do not return a mol
>> object from Chem.MolFromSmiles() and exclude them from subsequent search?
>>
>>
>> Something along those lines
>>
>>
>> excluded_set = set()
>>
>> smiles_list = []
>>
>> For each_word in text:
>>
>> If each_word not in excluded_set:
>>
>> each_word_mol =  Chem.MolFromSmiles(each_word)
>>
>> if each_word_mol is not None:
>>
>> smiles_list.append(each_word)
>>
>>  else:
>>
>>  excluded_set.add(each_word_mol)
>>
>>
>> Would not searching into that growing tuple take actually more time than
>> trying to blindly make a mol object for every word?
>>
>>
>>
>> Any suggestion?
>>
>>
>> Many thanks and regards,
>>
>>
>> Alexis
>>


Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Maciek Wójcikowski
Hi Alexis,

You may want to use a regex to filter out strings containing invalid
characters (i.e. there is a small subset of atoms that may appear without
brackets). See the "Atoms" section:
http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html

The set might grow pretty quickly and become inefficient, so I'd parse all
strings that pass the above filter, although there will be some false
positives like "CC", which may occur in ordinary text (emails especially).
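
A sketch of that kind of regex prefilter (the character class below is a rough approximation of the SMILES alphabet, not a validity check - words that pass it still need to go through Chem.MolFromSmiles, and ordinary words like "hello" still get through):

```python
import re

# Approximate SMILES alphabet: element letters, ring-closure digits, bond
# symbols, brackets, charges, and stereo markers. Any word containing a
# character outside this set is rejected before calling RDKit at all.
SMILES_CHARS = re.compile(r'^[A-Za-z0-9@+\-\[\]()=#$%.:*/\\]+$')

def might_be_smiles(word):
    return bool(SMILES_CHARS.match(word))
```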


Pozdrawiam,  |  Best regards,
Maciek Wójcikowski
mac...@wojcikowski.pl

2016-12-02 10:11 GMT+01:00 Alexis Parenty :

> Dear all,
>
>
> I am looking for a way to extract SMILES scattered in many text documents
> (thousands documents of several pages each).
>
> At the moment, I am thinking to scan each words from the text and try to
> make a mol object from them using Chem.MolFromSmiles() then store the words
> if they return a mol object that is not None.
>
> Can anyone think of a better/quicker way?
>
>
> Would it be worth storing in a tuple any word that do not return a mol
> object from Chem.MolFromSmiles() and exclude them from subsequent search?
>
>
> Something along those lines
>
>
> excluded_set = set()
>
> smiles_list = []
>
> For each_word in text:
>
> If each_word not in excluded_set:
>
> each_word_mol =  Chem.MolFromSmiles(each_word)
>
> if each_word_mol is not None:
>
> smiles_list.append(each_word)
>
>  else:
>
>  excluded_set.add(each_word_mol)
>
>
> Would not searching into that growing tuple take actually more time than
> trying to blindly make a mol object for every word?
>
>
>
> Any suggestion?
>
>
> Many thanks and regards,
>
>
> Alexis
>


[Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Alexis Parenty
Dear all,


I am looking for a way to extract SMILES strings scattered across many text
documents (thousands of documents, several pages each).

At the moment, I am thinking of scanning each word of the text, trying to
make a mol object from it with Chem.MolFromSmiles(), and storing the word if
it returns a mol object that is not None.

Can anyone think of a better/quicker way?


Would it be worth storing in a set any word that does not return a mol
object from Chem.MolFromSmiles() and excluding it from subsequent searches?


Something along these lines:


excluded_set = set()
smiles_list = []

for each_word in text:
    if each_word not in excluded_set:
        each_word_mol = Chem.MolFromSmiles(each_word)
        if each_word_mol is not None:
            smiles_list.append(each_word)
        else:
            excluded_set.add(each_word)  # add the word itself, not the (None) mol


Wouldn't searching that growing set actually take more time than blindly
trying to make a mol object for every word?



Any suggestion?


Many thanks and regards,


Alexis