Over the past couple of days I've spent some time tuning the RDKit's
SMILES parser.

I made a few minor changes here and there and saw some improvement,
but the big win came from a change to the YACC grammar used to
generate the parser. That change makes the parser source a bit more
difficult to read, but it has a pretty significant impact on
performance.

To measure just the performance of the SMILES parser, I ran a
benchmark on ~560K molecules from ZINC, building each molecule from
its SMILES without any sanitization.
Here are the timings on my Linux box for that benchmark:

RDKit_2011_06_1: 50.6s
RDKit_2012_03_1: 49.6s
RDKit_2012_06_1: 57.6s  [ <- I'm not sure I understand this outlier]
svn: 30.6s

I'm pretty pleased about that last number: it's about 1.6x faster than
the 2012_03_1 release. :-)
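
For anyone who wants to try a similar measurement, here is a minimal
sketch of a timing harness. It is not the script used for the numbers
above. The helper names (time_parser, make_rdkit_parser, read_smiles),
the file name, and the one-SMILES-per-line input format are all
assumptions made for illustration; the only RDKit call used is
Chem.MolFromSmiles with sanitize=False.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def time_parser(parse: Callable[[str], Optional[object]],
                smiles: Iterable[str]) -> Tuple[int, int, float]:
    """Parse every SMILES once; return (n_ok, n_failed, elapsed_seconds)."""
    n_ok = n_failed = 0
    start = time.perf_counter()
    for smi in smiles:
        if parse(smi) is None:  # a None result is counted as a parse failure
            n_failed += 1
        else:
            n_ok += 1
    return n_ok, n_failed, time.perf_counter() - start


def make_rdkit_parser() -> Callable[[str], Optional[object]]:
    """Build a parser that skips sanitization (needs RDKit installed)."""
    from rdkit import Chem  # imported once here, outside the timed loop
    return lambda smi: Chem.MolFromSmiles(smi, sanitize=False)


def read_smiles(path: str) -> List[str]:
    """Assumed input format: one record per line, SMILES in the first column."""
    with open(path) as handle:
        return [line.split()[0] for line in handle if line.strip()]


# Example usage (hypothetical file name):
#   smiles = read_smiles("zinc.smi")  # load first so file I/O is not timed
#   n_ok, n_failed, secs = time_parser(make_rdkit_parser(), smiles)
#   print(f"{n_ok} parsed, {n_failed} failed, {secs:.1f}s")
```

Reading the whole file into a list before starting the clock keeps
disk access out of the measurement, so the timing mostly reflects the
parser itself plus molecule construction.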

For those who are interested, here's the commit:
https://sourceforge.net/p/rdkit/code/2159/
and the specific grammar changes that made the difference:
https://sourceforge.net/p/rdkit/code/2159/tree//trunk/Code/GraphMol/SmilesParse/smiles.yy?diff=502dda6571b75b41b4b10063:2158

-greg

_______________________________________________
Rdkit-devel mailing list
Rdkit-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-devel
