Re: [Rdkit-discuss] SDF properties in case of error
On 2015-05-03 15:06, Michael Reutlinger wrote:
> Well... I think my proposal should enable us to put more strict, robust QC in place, but I guess you are missing this point.

My definition of strict and robust is that if the input is bad, what comes out is an out-of-band error signal, such that there is no way it can possibly be mistaken for any kind of output other than the error signal.

Dimitri

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] SDF properties in case of error
No, cutting out a chunk of lines from a file might be simple, but it can become an expensive operation if you want to deal with thousands of files and millions of records. That is one of the reasons why I (unfortunately) couldn't consider rdkit any further for one of my projects a few years ago. So, I support Michael's idea :-)

On Sat, May 2, 2015 at 12:17 AM, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote:
> On 04/30/2015 05:01 PM, Michael Reutlinger wrote:
>> However, in some cases this does not help. E.g. when an unknown atom (most of the time this is X) is found in the MolBlock, the import fails with a Post-condition Violation and None is yielded. This is fine to detect the problem BUT it is impossible to get any information about the molecule which failed.
>
> I'd say the best you can do is skip over to the next molecule and report "molecule in lines X to Y is corrupt". Cutting out a chunk of lines from a file is trivial, and if you're reading from a stream rather than a file then, well, don't.
>
> Without a valid mol block you don't have a molecule and you shouldn't be making one up. As in "conservative in what you produce".
Re: [Rdkit-discuss] SDF properties in case of error
Hi Michael,

What you request is certainly possible, but it is a pretty fundamental change in the way the supplier (and mol file parser) works, so it would need some thought. One concern that immediately occurs to me is that you would not be able to tell which molecules were actually empty in the input file and which are empty only because there was a problem parsing them.

A possible alternative, more general and somewhat lighter weight, would be to ensure that you can always get the text of the last item parsed from a ForwardSDMolSupplier (a method like suppl.GetLastItemText()); this would allow you to do whatever special error handling you are interested in doing.

-greg
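To make the "text of the last item" idea concrete, here is a minimal pure-Python sketch. It is not RDKit code: RawSDRecordReader, last_item_text and title_and_fields are hypothetical names standing in for the proposed suppl.GetLastItemText(). It only assumes the usual SD layout, where a record ends with a "$$$$" line, the first line is the title, and data fields start with a "> <NAME>" header.

```python
import io
import re


class RawSDRecordReader:
    """Iterate over SD records as raw text, remembering the last one read.

    Hypothetical stand-in for the proposed GetLastItemText(); not RDKit API.
    """

    def __init__(self, stream):
        self._stream = stream
        self.last_item_text = ""

    def __iter__(self):
        lines = []
        for line in self._stream:
            if line.rstrip("\r\n") == "$$$$":  # record terminator
                self.last_item_text = "".join(lines)
                lines = []
                yield self.last_item_text
            else:
                lines.append(line)
        if any(l.strip() for l in lines):  # trailing record without "$$$$"
            self.last_item_text = "".join(lines)
            yield self.last_item_text


_FIELD_HEADER = re.compile(r"^>.*<([^>]+)>")


def title_and_fields(record_text):
    """Return (title, {field: value}) from raw text, without parsing any atoms."""
    lines = record_text.splitlines()
    title = lines[0].strip() if lines else ""
    # Data fields follow "M  END"; if the block is too broken to have one,
    # fall back to scanning everything after the title line.
    start = next((i + 1 for i, l in enumerate(lines) if l.startswith("M  END")), 1)
    fields, name, value = {}, None, []
    for line in lines[start:]:
        match = _FIELD_HEADER.match(line)
        if match:
            if name is not None:
                fields[name] = "\n".join(value)
            name, value = match.group(1), []
        elif name is not None:
            if line.strip() == "":  # a blank line ends the current field
                fields[name] = "\n".join(value)
                name = None
            else:
                value.append(line)
    if name is not None:
        fields[name] = "\n".join(value)
    return title, fields
```

With something like this, a caller that gets None from the molecule parser could still pull the title and properties out of the raw text of the failing record and write them to the output.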
Re: [Rdkit-discuss] SDF properties in case of error
Hi Greg,

thanks for your answer. I agree that the lighter-weight solution is certainly also a possibility and would clearly solve my (and possibly others') problem. Maybe a suppl.GetLastItemError() would then also be handy to get the error messages that usually are only visible in the log.

But maybe something like an ErrorMol (as described in more detail by Andrew Dalke) could potentially be more versatile. If an ErrorMol class inherits from Mol, it could be processed in a standard way, but one could clearly differentiate this vehicle from an empty molecule. By having different handlers, it would also be possible to add Exceptions in the future, if people prefer having this behaviour :-)

However, both implementations would be a big improvement and could help to avoid dealing with special cases somewhere else in the workflow, leading to more robust workflows and eventually fewer errors.

Have a nice weekend,
Michael
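A rough sketch of how the ErrorMol idea could look from the caller's side. Everything here is hypothetical: Mol is a toy stand-in (not RDKit's Chem.Mol), and ErrorMol, props, error and summarize are names invented for illustration. The point is only that a subclass lets a loop keep running while still telling a failed record apart from a molecule that was genuinely empty in the input.

```python
class Mol:
    """Toy stand-in for a molecule class; NOT RDKit's Chem.Mol."""

    def __init__(self, props=None, num_atoms=0):
        self.props = dict(props or {})
        self.num_atoms = num_atoms


class ErrorMol(Mol):
    """A record that failed to parse: no atoms, but SD properties and error kept."""

    def __init__(self, props=None, error=""):
        super().__init__(props=props, num_atoms=0)
        self.error = error


def summarize(mols):
    """Return (names of usable molecules, [(name, error)] for failed records)."""
    ok, failed = [], []
    for mol in mols:
        name = mol.props.get("_Name", "")
        if isinstance(mol, ErrorMol):
            # The type, not the atom count, marks the failure, so an
            # intentionally empty molecule is not mistaken for an error.
            failed.append((name, mol.error))
        else:
            ok.append(name)
    return ok, failed
```

Code that does not care about failures could treat an ErrorMol like any other (empty) Mol, while stricter code checks the type and logs or raises.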
Re: [Rdkit-discuss] SDF properties in case of error
On 04/30/2015 05:01 PM, Michael Reutlinger wrote:
> However, in some cases this does not help. E.g. when an unknown atom (most of the time this is X) is found in the MolBlock, the import fails with a Post-condition Violation and None is yielded. This is fine to detect the problem BUT it is impossible to get any information about the molecule which failed.

I'd say the best you can do is skip over to the next molecule and report "molecule in lines X to Y is corrupt". Cutting out a chunk of lines from a file is trivial, and if you're reading from a stream rather than a file then, well, don't.

Without a valid mol block you don't have a molecule and you shouldn't be making one up. As in "conservative in what you produce".

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
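The "skip and report lines X to Y" approach can be sketched in a few lines of plain Python. This is an illustration under assumptions, not part of RDKit: iter_records_with_lines and report_corrupt are made-up names, and is_valid is whatever check the caller supplies (for example, a wrapper around a real mol block parser).

```python
import io


def iter_records_with_lines(stream):
    """Yield (first_line, last_line, text) per SD record; line numbers are 1-based."""
    lines, first = [], 1
    for number, line in enumerate(stream, start=1):
        lines.append(line)
        if line.rstrip("\r\n") == "$$$$":  # record terminator
            yield first, number, "".join(lines)
            lines, first = [], number + 1
    if any(l.strip() for l in lines):  # trailing record without "$$$$"
        yield first, first + len(lines) - 1, "".join(lines)


def report_corrupt(stream, is_valid):
    """Return one message per record that the caller-supplied is_valid rejects."""
    return [
        f"molecule in lines {first} to {last} is corrupt"
        for first, last, text in iter_records_with_lines(stream)
        if not is_valid(text)
    ]
```

The line range is enough to cut the offending record out of the file afterwards; nothing is fabricated for records that fail.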
[Rdkit-discuss] SDF properties in case of error
Hi all,

I am currently working on a program which needs to process libraries of large SDF files. One requirement is to always produce a valid output including the molecule title/name or a specified property for referencing.

Specifying sanitize=False with ForwardSDMolSupplier and calling Chem.SanitizeMol afterwards with appropriate exception handling helps in most cases to get the SD file properties and still detect errors in the molecules, to avoid importing rubbish.

However, in some cases this does not help. E.g. when an unknown atom (most of the time this is X) is found in the MolBlock, the import fails with a Post-condition Violation and None is yielded. This is fine to detect the problem BUT it is impossible to get any information about the molecule which failed.

My question is whether there is a way to get to the data even in those cases. The files tend to be very big, so re-parsing the file line by line in Python to get the name for a specific molecule number (found by enumerating the supplier) is not really an option.

What would be a good solution, in my opinion, is to create an empty molecule with all SD properties, including _Name, in case of an error instead of None. The actual error could then also be communicated into Python via an '_Error' property. With this it would still be possible to continue processing the file in a for loop, in contrast to raising an Exception, and it is easy to check whether the molecule is empty. Maybe this behaviour could be activated via an option, with the default being to return None, so as not to break any existing code.

I am very keen on getting your view on this issue.

Best regards,
Michael