Re: [Rdkit-discuss] SDF properties in case of error

2015-05-04 Thread Dimitri Maziuk
On 2015-05-03 15:06, Michael Reutlinger wrote:

 Well... I think my proposal should enable us to put more strict, robust
 QC in place, but I guess you are missing this point.

My definition of strict and robust is: if the input is bad, what comes 
out is an out-of-band error signal, such that there is no way it 
can possibly be mistaken for any kind of output other than the error signal.

Dimitri



--
One dashboard for servers and applications across Physical-Virtual-Cloud 
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] SDF properties in case of error

2015-05-03 Thread Markus Sitzmann
No, cutting out a chunk of lines from a file might be simple, but
can become an expensive operation if you want to deal with thousands
of files and millions of records. That is one of the reasons why I
(unfortunately) couldn't consider rdkit any further for one of my
projects a few years ago. So, I support Michael's idea :-)

On Sat, May 2, 2015 at 12:17 AM, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote:
 On 04/30/2015 05:01 PM, Michael Reutlinger wrote:

 However, in some cases this does not help. E.g. when an unknown atom (most
 of the time this is X) is found in the MolBlock the import fails with a
 Post-condition Violation and None is yielded. This is fine to detect the
 problem BUT it is impossible to get any information about the molecule
 which failed.

 I'd say the best you can do is skip over to the next molecule and report
 that the molecule in lines X to Y is corrupt. Cutting out a chunk of lines
 from a file is trivial, and if you're reading from a stream rather than a
 file then, well, don't. Without a valid mol block you don't have a
 molecule and you shouldn't be making one up. As in: be conservative in
 what you produce.

 --
 Dimitri Maziuk
 Programmer/sysadmin
 BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu




Re: [Rdkit-discuss] SDF properties in case of error

2015-05-02 Thread Greg Landrum
Hi Michael,

What you request is certainly possible, but it is a pretty fundamental
change in the way the supplier (and mol file parser) works, so it would
need some thought.

One concern that immediately occurs to me is that you will not be able to
tell which molecules from the input file were actually empty in the input
and which were just empty because there was a problem parsing an input
molecule.

A possible alternative, more general and somewhat lighter weight, would be
to ensure that you can always get the text of the last item parsed from a
ForwardSDMolSupplier (a method like: suppl.GetLastItemText()); this would
allow you to do whatever special error handling you are interested in doing.
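
A rough sketch of how such a "last item text" hook could behave, written in plain Python rather than inside the supplier itself. It assumes records are terminated by `$$$$` lines; `RecordReader`, `last_item_text` and the `parse` callable are hypothetical names used for illustration only and are not part of RDKit.

```python
from typing import Any, Callable, Iterator, List, Optional, TextIO


class RecordReader:
    """Iterate over SD records, remembering the raw text of the last one."""

    def __init__(self, handle: TextIO, parse: Callable[[str], Optional[Any]]):
        self._handle = handle
        self._parse = parse            # hypothetical: any text -> object-or-None parser
        self.last_item_text = ""       # raw text of the most recently read record

    def __iter__(self) -> Iterator[Optional[Any]]:
        lines: List[str] = []
        for line in self._handle:
            if line.rstrip("\r\n") == "$$$$":
                yield self._emit(lines)
                lines = []
            else:
                lines.append(line)
        # A final record without a trailing $$$$ is still reported.
        if any(item.strip() for item in lines):
            yield self._emit(lines)

    def _emit(self, lines: List[str]) -> Optional[Any]:
        # Store the text before parsing so it is available even on failure.
        self.last_item_text = "".join(lines)
        try:
            return self._parse(self.last_item_text)
        except Exception:
            # A parser exception is treated like a failed parse (None).
            return None
```

Inside a loop over the reader, a caller that receives None can then look at `reader.last_item_text` and pull out the title or data fields itself.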

-greg





Re: [Rdkit-discuss] SDF properties in case of error

2015-05-02 Thread Michael Reutlinger
Hi Greg,

thanks for your answer. I agree that the lighter-weight solution is
certainly also a possibility and would clearly solve my (and possibly
others') problem. Maybe a suppl.GetLastItemError() would then also be handy
to get the error messages that usually are only visible in the log.

But maybe something like an ErrorMol (as described in more detail by Andrew
Dalke) could potentially be more versatile. If an ErrorMol class is
inherited from Mol it could be processed in a standard way but one could
clearly differentiate this vehicle from an empty molecule. By having
different handlers, it would also be possible to add Exceptions in the
future, if people prefer having this behaviour :-)
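
To make the distinction concrete, here is a minimal sketch of the idea. The `Mol` class below is a hypothetical stand-in, not RDKit's `Chem.Mol`, and `ErrorMol`, `error` and `summarize` are illustrative names only; nothing like this exists in RDKit as far as this thread says.

```python
class Mol:
    """Hypothetical stand-in for a molecule class (not RDKit's Chem.Mol)."""

    def __init__(self, num_atoms=0, props=None):
        self.num_atoms = num_atoms
        self.props = dict(props or {})


class ErrorMol(Mol):
    """Placeholder for a record that failed to parse; always has zero atoms."""

    def __init__(self, error, props=None):
        super().__init__(0, props)
        self.error = error  # the parser's message, kept with the record


def summarize(mols):
    """Count parsed, genuinely empty, and failed records."""
    counts = {"ok": 0, "empty": 0, "failed": 0}
    for mol in mols:
        if isinstance(mol, ErrorMol):
            counts["failed"] += 1      # parse failure, not an empty input
        elif mol.num_atoms == 0:
            counts["empty"] += 1       # the input really had no atoms
        else:
            counts["ok"] += 1
    return counts
```

Code that only cares about atoms can treat both kinds alike, while code that cares about provenance can check the type.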

However, both implementations would be a big improvement and could help to
avoid dealing with special cases somewhere else in the workflow, leading to
more robust workflows and eventually fewer errors.

Have a nice weekend,
Michael







Re: [Rdkit-discuss] SDF properties in case of error

2015-05-01 Thread Dimitri Maziuk
On 04/30/2015 05:01 PM, Michael Reutlinger wrote:

 However, in some cases this does not help. E.g. when an unknown atom (most
 of the time this is X) is found in the MolBlock the import fails with a
 Post-condition Violation and None is yielded. This is fine to detect the
 problem BUT it is impossible to get any information about the molecule
 which failed.

I'd say the best you can do is skip over to the next molecule and report
that the molecule in lines X to Y is corrupt. Cutting out a chunk of lines
from a file is trivial, and if you're reading from a stream rather than a
file then, well, don't. Without a valid mol block you don't have a
molecule and you shouldn't be making one up. As in: be conservative in
what you produce.
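
The line bookkeeping could look roughly like the sketch below, assuming every record ends with a `$$$$` line and the first line of each record is the title. `record_spans` is a made-up helper name, not an existing function in any library.

```python
from typing import Iterable, Iterator, Tuple


def record_spans(lines: Iterable[str]) -> Iterator[Tuple[int, int, str]]:
    """Yield (first_line, last_line, title) for each record.

    Line numbers are 1-based; last_line is the $$$$ delimiter itself.
    """
    start = 1
    title = ""
    lineno = 0
    pending = False  # True once the current record has non-blank content
    for lineno, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if text == "$$$$":
            yield start, lineno, title
            start = lineno + 1
            title = ""
            pending = False
            continue
        if lineno == start:
            title = text  # first line of a mol block is the title line
        pending = pending or bool(text.strip())
    if pending:
        # Final record that is missing its $$$$ terminator.
        yield start, lineno, title
```

If the supplier yields None for record number i, the i-th span says which lines to report as corrupt.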

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





[Rdkit-discuss] SDF properties in case of error

2015-04-30 Thread Michael Reutlinger
Hi all,

I am currently working on a program which needs to process libraries of
large SDF files. One requirement is to always produce a valid output
including the molecule title/name or a specified property for referencing.

Specifying sanitize=False with ForwardSDMolSupplier and calling
Chem.SanitizeMol afterwards with appropriate exception handling helps in
most cases to get the SD file properties and still detect errors in the
molecules to avoid importing rubbish.
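
For reference, the pattern looks roughly like the sketch below. The supplier and the sanitize function are passed in so the loop itself has no hard dependency; `check_molecules` and the file name `library.sdf` are illustrative assumptions, while `HasProp`/`GetProp` follow the RDKit molecule interface.

```python
def check_molecules(supplier, sanitize):
    """Yield (index, name, mol, error) for every record of `supplier`.

    `supplier` yields molecule objects or None; `sanitize` raises on bad input.
    """
    for index, mol in enumerate(supplier):
        if mol is None:
            # Parsing failed outright: no name or properties are available.
            yield index, None, None, "parse failure"
            continue
        name = mol.GetProp("_Name") if mol.HasProp("_Name") else ""
        try:
            sanitize(mol)
        except Exception as exc:
            # Properties are still readable even though sanitization failed.
            yield index, name, mol, str(exc)
        else:
            yield index, name, mol, None


# Assumed RDKit wiring (not executed here):
#   from rdkit import Chem
#   supplier = Chem.ForwardSDMolSupplier(open("library.sdf", "rb"), sanitize=False)
#   for index, name, mol, error in check_molecules(supplier, Chem.SanitizeMol):
#       ...
```

The first branch is exactly the case described next: when None comes back, there is nothing left to read the name from.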

However, in some cases this does not help. E.g. when an unknown atom (most
of the time this is X) is found in the MolBlock the import fails with a
Post-condition Violation and None is yielded. This is fine to detect the
problem BUT it is impossible to get any information about the molecule
which failed.

My question is whether there is a way to get at the data even in those cases.
The files tend to be very big, so re-parsing a file line by line in Python
to get the name for a specific molecule number (found by enumerating the
supplier) is not really an option.

A good solution, in my opinion, would be to create an empty molecule
with all SD properties, including _Name, in case of an error instead of
None. The actual error could then also be communicated into Python via an
'_Error' property. With this it would still be possible to continue
processing of the file in a for loop, in contrast to raising an Exception,
and it is easy to check if the molecule is empty.
Maybe this behaviour could be activated via an option and the default would
be to return None, to not break any existing code.
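
To illustrate what such a placeholder would carry, here is a minimal sketch that collects the title and data fields from the raw text of one record without touching the atom block. It assumes the usual SD layout (title on the first line, `> <TAG>` headers after `M  END`, a blank line ending each value); `placeholder_properties` is a hypothetical helper, not an RDKit function.

```python
import re

# Data header lines look like ">  <TAG>"; capture the tag name.
_TAG = re.compile(r"^>[^<]*<([^>]+)>")


def placeholder_properties(record_text, error):
    """Return the title, data fields and error message of one SD record."""
    lines = record_text.splitlines()
    props = {"_Name": lines[0].strip() if lines else "", "_Error": error}
    # Start after "M  END" when present; otherwise just skip the title line.
    start = next((i + 1 for i, line in enumerate(lines)
                  if line.startswith("M  END")), 1)
    tag, values = None, []
    for line in lines[start:]:
        match = _TAG.match(line)
        if match:
            if tag is not None:
                props[tag] = "\n".join(values)
            tag, values = match.group(1), []
        elif tag is not None:
            if line.strip() == "":
                props[tag] = "\n".join(values)  # blank line ends the value
                tag = None
            else:
                values.append(line)
    if tag is not None:
        props[tag] = "\n".join(values)
    return props
```

The returned dictionary is what the proposed empty molecule would expose as its properties.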

I am very keen on getting your view on this issue.

Best regards,
Michael