Re: [Rdkit-discuss] SDF tags and -

2015-04-30 Thread Dimitri Maziuk
On 2015-04-29 23:08, Greg Landrum wrote:
> Here are my thoughts on this:
> The RDKit is usually strict while parsing molecules from SDF, SMILES, or
> other formats.

My point was that given
'''
>  <my_property2>
1234

>  <my_property3>
'''
a lexer shouldn't have a problem recognizing the 2 tags. A lenient
parser would return the stuff in between as the value: "1234\n\n"

> There are exceptions to this: the RDKit ignores the limit on line length
> while reading SDFs: there's no chance of confusion here, so I believe
> it's safe to do so.

Similarly, a lenient parser could ignore the line length and value
length limits.

> I still need to put some thought into patching the SDWriter so that it
> can recognize things like consecutive line endings in property values.
> The big question is what it should do when it encounters such a case. Is
> that an error? Should it just write the output up to the blank line?

A conservative writer should never write out "1234\n\n". Squash the
multiple newlines. And/or give it a "strict" flag that makes it error
out instead.

I'm sure Andrew's seen a lot of badly broken SDFs. It doesn't mean you 
can't handle the ones you can unambiguously parse.

Dimitri


--
One dashboard for servers and applications across Physical-Virtual-Cloud 
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] SDF tags and -

2015-04-29 Thread Nicholas Firth
Ahh ok… Interesting way to format a file! Got to love ChemAxon...

Best,
Nick

Nicholas C. Firth | PhD Student | Cancer Therapeutics
The Institute of Cancer Research | 15 Cotswold Road | Belmont | Sutton | Surrey | SM2 5NG
T 020 8722 4033 | E nicholas.fi...@icr.ac.uk | W www.icr.ac.uk | Twitter @ICRnews
Facebook www.facebook.com/theinstituteofcancerresearch
Making the discoveries that defeat cancer

On 29 Apr 2015, at 12:23, Paolo Tosco paolo.to...@unito.it wrote:

Hi Nick,

newlines in data properties are fine, but they should not include blank lines 
(i.e., multiple consecutive newlines).
For example, in:

>  <my_property1>
1

2

3

4

>  <my_property2>
1234

>  <my_property3>
5678

my_property1 will be truncated to just 1. Based on the specifications, if you 
want to include a blank line, it should actually contain either a " " or a "\t", 
rather than being completely empty.
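
A minimal sketch of that advice, applied to a value before it is stored as a property. The helper name pad_blank_lines and its filler parameter are made up for illustration and are not part of the RDKit; whether a particular reader really treats a whitespace-only line as non-blank is an assumption worth checking against the tools in your workflow.

```python
def pad_blank_lines(value, filler=" "):
    """Replace empty lines inside a property value with a whitespace-only line.

    Hypothetical helper: leading and trailing newlines are dropped, and every
    remaining empty line becomes `filler`, so the value no longer contains
    two consecutive newline characters.
    """
    # Normalize Windows and old Mac line endings first.
    text = value.replace("\r\n", "\n").replace("\r", "\n").strip("\n")
    if not text:
        return text
    return "\n".join(line if line else filler for line in text.split("\n"))


if __name__ == "__main__":
    print(repr(pad_blank_lines("1\n\n2\n\n3\n\n4")))
```

The padded string could then be passed to SetProp() in place of the original value. Note that this changes the data: a reader gets back lines holding a single space where the blank lines used to be.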

Cheers,
Paolo

On 04/29/15 12:16, Nicholas Firth wrote:
I use SD files with new lines in the properties quite frequently (inherited 
from Pipeline Pilot's merge function) and I've never had a problem reading 
them. I've attached an SD file that works fine for me.

In [2]: suppl = Chem.SDMolSupplier('/Volumes/nfirth/tempf.sdf')

In [3]: m = suppl[0]

In [4]: t = m.GetProp('genNum')

In [5]: print t
1
2
3
4

In [6]: print t.split('\n')
['1', '2', '3', '4']


So I guess the problem is in the writer?

Best,
Nick

The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company 
Limited by Guarantee, Registered in England under Company No. 534147 with its 
Registered Office at 123 Old Brompton Road, London SW7 3RP.

This e-mail message is confidential and for use by the addressee only. If the 
message is received by anyone other than the addressee, please return the 
message to the sender by replying to it and then delete the message from your 
computer and network.





Re: [Rdkit-discuss] SDF tags and -

2015-04-29 Thread Paolo Tosco

Dear all,

Indeed, as Riccardo mentions, according to the specifications in 
CTfile.pdf a property should be truncated after the first blank line. 
This is also what other SDF parsers I have tried actually do. What I 
noticed is that other SDF parsers are tolerant of spurious lines not 
starting with a '>', either blank or containing characters. Currently 
the RDKit tolerates them on write, but not on read.
I think the easiest solution is to make the SDF parser more tolerant in 
such cases, printing a warning rather than throwing an exception. I have 
just submitted a pull request about it - feel free to ignore it if you 
do not agree with me!


Cheers,
Paolo



Re: [Rdkit-discuss] SDF tags and -

2015-04-29 Thread Andrew Dalke
Riccardo Vianello:
> I suppose that if the correctness of the parser is confirmed, then a change 
> could be suggested for the writer, consisting in raising an error if blank 
> lines are present inside the data item.


Yes, the SD tag data is not a general purpose data field. It's not possible, 
for example, to embed the contents of an SD file as a value.

According to the spec:

   A [Data] value may extend over multiple lines containing up to 200
   characters each. A blank line terminates each data item.

The failure cases are: a data value that starts with a newline, or ends with a 
newline, or contains two or more successive newline characters. Some file 
readers, like Python's universal mode, will normalize \r\n to \n on 
input, so \r\n\r\n and variants are also problematic.
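
As a rough illustration, the three failure cases can be checked before a value ever reaches a writer. This is only a sketch: sd_value_problems is a made-up name, not an RDKit function, and it looks only at the newline rules above, not at line length or other characters.

```python
from typing import List


def sd_value_problems(value: str) -> List[str]:
    """Return the reasons a string cannot be stored as SD tag data as-is."""
    # Treat "\r\n" and "\r" like "\n", since readers may normalize them.
    text = value.replace("\r\n", "\n").replace("\r", "\n")
    problems = []
    if text.startswith("\n"):
        problems.append("starts with a newline")
    if text.endswith("\n"):
        problems.append("ends with a newline")
    if "\n\n" in text:
        problems.append("contains a blank line")
    return problems


if __name__ == "__main__":
    print(sd_value_problems("1\n2\n3\n4"))
    print(sd_value_problems("x\r\n\r\ny"))
```

An empty list means the value passes these three checks; it says nothing about how a particular non-compliant parser will treat it.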


However, experience shows that there are many incorrectly written SD parsers. 
Some think that the data extends until the next line that starts with '>' or 
line that is '$$$$'. For example, one organization decided to use '$', '$$', 
'$$$', and '$$$$' as rough estimates of the cost of a compound. It might have 
looked like this:

 
  M  END
  > <cost>
  $$$$
  
  > <MW>
  123.45

   ...

Some non-compliant parsers interpret the '$$$$' as the end of the compound 
record, rather than the data value it's supposed to be.

In practice, enough SD parsers are broken in this regard that the best 
practice for a workflow is to avoid those byte sequences if at all possible.

A common solution to store free-form data is to use base64 encoding.

  >>> data = "\nHello\n\n> <and>\ngoodbye"*10
  >>> print(data.encode("base64"))
  CkhlbGxvCgo+IDxhbmQ+Cmdvb2RieWUKSGVsbG8KCj4gPGFuZD4KZ29vZGJ5ZQpIZWxsbwoKPiA8
  YW5kPgpnb29kYnllCkhlbGxvCgo+IDxhbmQ+Cmdvb2RieWUKSGVsbG8KCj4gPGFuZD4KZ29vZGJ5
  ZQpIZWxsbwoKPiA8YW5kPgpnb29kYnllCkhlbGxvCgo+IDxhbmQ+Cmdvb2RieWUKSGVsbG8KCj4g
  PGFuZD4KZ29vZGJ5ZQpIZWxsbwoKPiA8YW5kPgpnb29kYnllCkhlbGxvCgo+IDxhbmQ+Cmdvb2Ri
  eWU=

This also ensures that the line length never exceeds 200 characters. (I have no 
experience on what sort of errors might occur should the line length exceed 200 
characters.)

The SD format has no way to indicate the encoding, so the information about how 
tag data is encoded must be passed through other means. This is unfortunate.
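
The session above relies on the Python 2 "base64" string codec. A sketch of the same round trip with the standard base64 module, which also runs on Python 3, might look like this. The function names are invented for the example, and UTF-8 is an assumption, since the SD format itself says nothing about the encoding.

```python
import base64


def encode_tag_value(text: str) -> str:
    """Encode free-form text as base64 lines of at most 76 characters."""
    raw = text.encode("utf-8")  # assumption: UTF-8 is the agreed encoding
    # encodebytes() wraps the output and ends it with a newline,
    # which is stripped so the value does not end with "\n".
    return base64.encodebytes(raw).decode("ascii").rstrip("\n")


def decode_tag_value(encoded: str) -> str:
    """Reverse encode_tag_value(); embedded newlines are ignored."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    data = "\nHello\n\n> <and>\ngoodbye" * 10
    encoded = encode_tag_value(data)
    print(encoded)
    assert decode_tag_value(encoded) == data
```

Because the base64 alphabet has no '>' or '$', and the wrapped output has no empty lines, the encoded value avoids the problematic sequences discussed here. The cost is that the tag is no longer readable by eye or by tools that do not know about the convention.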

And while I'm here, it's also a bad idea to have a \0 (NUL) in the data, as 
some tools use C's string functions on the assumption that a \0 does not 
exist. RDKit will write the \0:

  >>> from rdkit import Chem
  >>> mol = Chem.MolFromSmiles("C")
  >>> mol.SetProp("abc", "x\0z")
  >>> mol.GetProp("abc")
  'x\x00z'
  >>> writer = Chem.SDWriter("tmp.sdf")
  >>> writer.write(mol)
  >>> writer.close()
  >>> content = open("tmp.sdf").read()
  >>> "\0" in content
  True


On Apr 29, 2015, at 12:27 PM, Tuomo Kalliokoski wrote:
> I suppose the current functionality of RDKit, irrespective of the SDF file 
> format specifications, is a bit odd: SDWriter produces a file that 
> SDMolSupplier can't handle.


All of the toolkits I've used have the same behavior. They trust that the user 
of the API knows not to pass arbitrary data as the value.

ChemAxon's toolkit, as you saw, can produce invalid SD files. You might see 
what happens if you add "x\n\n> <data2>\nvalue" as the value; I suspect you'll 
end up with a new "data2" tag.

I know that Open Babel and OEChem will also forward the value unchanged.

I can see the argument that RDKit should check for \n\n in the data. What 
should it do? It's built on the GetProp/SetProp mechanism, which allows 
arbitrary string data, and it's reasonable to SetProp() a value containing a 
\n\n for purposes other than writing to an SD file, so it has to be done in 
the reader or writer layer:

Here are some possibilities for changing the writer:

  - stop writing to the file, with an error
  - skip records which contain tags with forbidden values
  - skip tags which contain forbidden values
  - convert multiple newlines into one (including the edge cases)
  - also enforce the 200 character restriction
  - also enforce a check for well-known legal but ill-advised
  character sequences like a line starting with '>', or
  starting with '$$$$', or containing a "\0".
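
Two of these options can be combined in a small pre-write step. The sketch below is not how the RDKit's SDWriter works; prepare_sd_value, its strict flag, and the ForbiddenSDValue exception are hypothetical names used only to show the trade-off between squashing newlines and refusing the value.

```python
import re


class ForbiddenSDValue(ValueError):
    """Raised when a value cannot be written as SD tag data (hypothetical)."""


def prepare_sd_value(value, strict=False, max_line_length=200):
    """Return a value that is safe to write, or raise ForbiddenSDValue.

    strict=False squashes runs of newlines into one and trims the edges.
    strict=True raises instead of silently changing the value.
    """
    text = value.replace("\r\n", "\n").replace("\r", "\n")
    if "\0" in text:
        raise ForbiddenSDValue("value contains a NUL character")
    squashed = re.sub(r"\n{2,}", "\n", text).strip("\n")
    if strict and squashed != text:
        raise ForbiddenSDValue(
            "value has leading, trailing, or consecutive newlines")
    for line in squashed.split("\n"):
        if len(line) > max_line_length:
            raise ForbiddenSDValue(
                "line longer than %d characters" % max_line_length)
    return squashed


if __name__ == "__main__":
    print(repr(prepare_sd_value("X\n\n\nY")))
```

Squashing is itself a form of data loss, since the reader never sees the original blank lines, which is one argument for making the strict behavior the default.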

Paolo's suggestion is to change the reader to be more lenient:

-throw FileParseException("Problems encountered parsing data fields");
+BOOST_LOG(rdWarningLog)
+  << "Ignoring spurious lines encountered parsing data fields"
+  << std::endl;

While it's true that this would be lenient, it wouldn't handle Tuomo's problem. 
Tuomo has the following:

  > <data1>
  X
  
  Y
  
  > <data2>
  Whatever

Tuomo wants data1 to become X\n\n\nY.

Paolo's patch will set data1 to X, and generate a warning for the spurious Y 
but otherwise ignore it. I do not believe this is any better for Tuomo than 
what RDKit does now -- actually, it's worse because it's data loss and people 
tend to ignore warnings.

I think the problem is, what should the writer do if given data which cannot be 
represented as an SD data value? Suppose that one of Tuomo's 

Re: [Rdkit-discuss] SDF tags and -

2015-04-29 Thread Tuomo Kalliokoski
Hello Riccardo,

That sounds like a very reasonable solution to the issue.

[I replied to rdkit-discuss to bring this thread on the list back again]

Best regards,
Tuomo

From: riccardo.viane...@gmail.com
Date: Wed, 29 Apr 2015 12:08:48 +0200
Subject: Re: [Rdkit-discuss] SDF tags and -
To: tkall...@live.com

Hi Tuomo,

yes, I agree the behavior seems a bit inconsistent. I suppose that if the 
correctness of the parser is confirmed, then a change could be suggested for 
the writer, consisting in raising an error if blank lines are present inside 
the data item.

[but once again, I didn't notice the default reply-to settings of rdkit-discuss 
and accidentally brought the thread off-list, sorry.]

Regards,
Riccardo



On Wed, Apr 29, 2015 at 11:46 AM, Tuomo Kalliokoski tkall...@live.com wrote:




Hello Riccardo,

Thanks for the swift reply! Indeed, it is the extra line-feed, not the -. It 
was just around the same line where I had the issue, so it got me confused. 
I suppose the current functionality of RDKit, irrespective of the SDF file 
format specifications, is a bit odd: SDWriter produces a file that SDMolSupplier 
can't handle.

Best regards,
Tuomo

From: riccardo.viane...@gmail.com
Date: Wed, 29 Apr 2015 11:33:14 +0200
Subject: Re: [Rdkit-discuss] SDF tags and -
To: tkall...@live.com

Hi Tuomo,

On Wed, Apr 29, 2015 at 10:47 AM, Tuomo Kalliokoski tkall...@live.com wrote:

> I have got a bunch of SDF-files with molecules and some long descriptions in 
> SDF-tags on them that include stuff like - inside. 
> These files have been produced by ChemAxon's software and are handled fine by 
> their software.
> Such files can be written out also from RDKit 2014_09_02, but they fail when 
> you try to read them in. 

I suspect the parse error could be independent from the -, but due to the 
blank line (\n\n) that appears inside the TESTFIELD data:

  mol.SetProp("TESTFIELD", "This should not work - Let's see\n\nI guess this is not visible\n")

and that is interpreted as the data item terminator. IIRC this interpretation 
is compliant with the specifications for the SDF file format, but I could be 
mistaken.
 
Best regards,
Riccardo

  



Re: [Rdkit-discuss] SDF tags and -

2015-04-29 Thread Dimitri Maziuk
On 04/29/2015 01:47 PM, Andrew Dalke wrote:

> Postel's Robustness principle is a mistake.
> 
> See RFC 3117 for elaboration, 
...
> Or from 
> http://cacm.acm.org/magazines/2011/8/114933-the-robustness-principle-reconsidered/fulltext
> :

There is a difference between ACM members writing network protocols and
domain people writing junk. XML in this example

> Or http://www.tbray.org/ongoing/When/200x/2004/01/11/PostelPilgrim for
> an example with XML.

is written by a ball street wanker. Much of xml is. Similarly, MOL/SDF
is written by chemists.

> On the pro-ish side, which
> recommends a patch to the law, see  
> http://langsec.org/papers/postel-patch.pdf .

I've spent enough time looking for definitive documentation on any
number of file formats to know: domain people don't do that. With one
or two exceptions to reinforce the rule. Again, c.f. to computer
scientists: every RFC starts with the definitions of "may", "must", etc.

>>  - if a record contains forbidden values, stop writing to the file,
>> with an error.
> 
> Yes, I agree with this. What constitutes "forbidden"?

Simply put, the ones that the lexer will match as "not values".

> If there is an error, does the writer generate a partial record,

My interpretation of "conservative" is "wipe out the file then crash and
burn". With a useful error message.

> For what it's worth, those values are acceptable. The following is legal,
> according to the specification:
...

If you define your lexical tokens properly, no problem. The problem is
when the lexer can't decide what's what.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [Rdkit-discuss] SDF tags and -

2015-04-29 Thread Dimitri Maziuk
On 04/29/2015 07:54 AM, Andrew Dalke wrote:

> I don't have a good solution. Were it me, I would have the writer 
> fail should any unsupported value be present in the output,
> including those which are allowed by the SD specification but will
> cause problems in practice, like embedded \0 and leading .

Based on "be liberal in what you accept and conservative in what you
produce", the writer should

> - convert multiple newlines into one (including the edge cases)
> - also enforce the 200 character restriction
> - also enforce a check for well-known legal but ill-advised
>   character sequences

   - if a record contains forbidden values, stop writing to the file,
with an error.

With the reader it looks like you can't help it if someone makes a value
like  55 or . With that caveat, you should be able to find tags
and read everything in between as a value.
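
A minimal sketch of that kind of reader, treating data header lines as the only delimiters. This is an illustration of the idea and not RDKit code: the function name and the regular expression are assumptions, the input is taken to be the data block of a single record with the record terminator already removed, and the caveat above still applies, because a value line that itself looks like a header is taken for one.

```python
import re

# Assumed header shape: a line starting with '>' that contains <name>.
DATA_HEADER = re.compile(r"^>.*<([^<>\n]+)>.*$", re.MULTILINE)


def read_data_items_leniently(data_block):
    """Map each tag name to everything between its header and the next one."""
    items = {}
    headers = list(DATA_HEADER.finditer(data_block))
    for i, header in enumerate(headers):
        start = header.end() + 1  # skip the newline that ends the header line
        if i + 1 < len(headers):
            end = headers[i + 1].start()
        else:
            end = len(data_block)
        items[header.group(1)] = data_block[start:end]
    return items


if __name__ == "__main__":
    block = ">  <my_property2>\n1234\n\n>  <my_property3>\n5678\n"
    print(read_data_items_leniently(block))
```

The values come back with their trailing newlines intact, so a caller would still have to decide how much of that whitespace belongs to the value.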

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [Rdkit-discuss] SDF tags and -

2015-04-29 Thread Dimitri Maziuk
On 04/29/2015 05:32 PM, Andrew Dalke wrote:
> On Apr 29, 2015, at 9:19 PM, Dimitri Maziuk wrote:
>> There is a difference between ACM members writing network protocols and
>> domain people writing junk.
> 
> I think that you are saying that the MDL connection table
> file formats are junk. I do not disagree. But it's something
> we have to deal with so my personal views matter little.
> 
> The MDL file formats are definitely not network protocols,
> but as you brought up Postel's Robustness Principle I
> thought you were suggesting that the principle applies
> more broadly than just network protocols.
> 
> And for what it's worth, I used to be an ACM member.

Me too ;)

No, what I was suggesting is that something as well-defined as an RFC'ed
protocol should not need Postel's principle in the first place. No, it
should be applied to the stuff we have to deal with: that way we'll
generate fewer bad files and the users will be happier when it doesn't
crash on whatever stuff they have to deal with. Or at least not on every
input file.

> If the output is to a stream then there is no file to wipe.

Yeah, there's that I suppose...

> P.S.
>>  XML in this example ... is written by a ball street wanker.
> 
> This slur is both gratuitous and wrong.

You misunderstood: I was just rephrasing Tim's

"Then it is clearly not OK to guess that someone just forgot the
</amount> and </trade> but didn’t also drop a trailing zero or two. A
programmer in a position of responsibility who did this would be spanked
and maybe fired. A manager who mandated or authorized such an
implementation would be spanked, maybe fired, and maybe subject to legal
action."

The problem with that argument, though, is that XML is well defined and
"Anyone who can’t make a syndication feed that’s well-formed XML is an
incompetent fool" (ibid). Blaming Postel for the incompetence of fools is
like blaming Jesus for the Salem witch trials.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [Rdkit-discuss] SDF tags and -

2015-04-29 Thread Jan Holst Jensen
Actually, you want to send your loving thoughts to MDL (now: Biovia). 
They defined the SDF format :-).


Cheers
-- Jan

On 2015-04-29 13:26, Nicholas Firth wrote:

Ahh ok… Interesting way to format a file! Got to love ChemAxon...

Best,
Nick



Re: [Rdkit-discuss] SDF tags and -

2015-04-29 Thread Greg Landrum
Here are my thoughts on this:
The RDKit is usually strict while parsing molecules from SDF, SMILES, or
other formats. This is done for one simple reason: it tends to be
difficult/impossible to recover from syntax errors in input in a way that
doesn't result in a significant chance of producing a result that is
different from what the original writer intended. In this case, as Andrew
pointed out elsewhere on the thread, if Paolo's suggested patch is applied,
the molecule will be loaded with the TESTFIELD property present, but
different from what it was in the input. Since people ignore warning
messages (again quoting Andrew) this difference is not going to be noticed
most of the time.

There are exceptions to this: the RDKit ignores the limit on line length
while reading SDFs: there's no chance of confusion here, so I believe it's
safe to do so.

I'm planning on accepting Paolo's patch, but after it has been modified to
only accept the extra blank lines if the SDMolSupplier is not in strict
mode. This will allow these files to be parsed if the client/user indicates
that they are willing to take the risk of incorrect data.
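
To make the difference between the two modes concrete, here is a toy reader for the data block of one record. It is a sketch under assumptions and not the SDMolSupplier implementation: read_data_items and its strict argument are invented for illustration, and the input is assumed to be a list of lines with the line endings already removed.

```python
import warnings


def read_data_items(lines, strict=True):
    """Spec-style reading: a blank line ends each data item.

    Lines that appear after an item has ended, but before the next header,
    raise an error in strict mode and are dropped with a warning otherwise.
    """
    items = {}
    name = None  # tag currently being read, or None between items
    for line in lines:
        if line.startswith(">"):
            # Header such as ">  <data1>": the name sits between < and >.
            name = line[line.index("<") + 1:line.rindex(">")]
            items[name] = []
        elif line == "":
            name = None  # a blank line terminates the current item
        elif name is not None:
            items[name].append(line)
        elif strict:
            raise ValueError("Problems encountered parsing data fields")
        else:
            warnings.warn("Ignoring spurious line: %r" % line)
    return {tag: "\n".join(value) for tag, value in items.items()}


if __name__ == "__main__":
    record = [">  <data1>", "X", "", "Y", "", ">  <data2>", "Whatever", ""]
    print(read_data_items(record, strict=False))
```

In the non-strict case data1 comes back as "X" and the "Y" line is lost apart from the warning, which is exactly the risk of incorrect data that the caller has to accept.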

I still need to put some thought into patching the SDWriter so that it can
recognize things like consecutive line endings in property values. The big
question is what it should do when it encounters such a case. Is that an
error? Should it just write the output up to the blank line?

-greg


On Wed, Apr 29, 2015 at 10:47 AM, Tuomo Kalliokoski tkall...@live.com
wrote:

> Hello all,
>
> I have got a bunch of SDF-files with molecules and some long descriptions
> in SDF-tags on them that include stuff like - inside.
> These files have been produced by ChemAxon's software and are handled fine
> by their software.
> Such files can be written out also from RDKit 2014_09_02, but they fail
> when you try to read them in.
>
> Here is an example code:
>
> 1. Generate t.sdf in Python:
>
>   from rdkit import Chem
>   mol = Chem.MolFromSmiles("CC")
>   mol.SetProp("TESTFIELD", "This should not work - Let's see\n\nI guess this is not visible\n")
>   mol.SetProp("TESTFIELD2", "Beep")
>   mol2 = Chem.MolFromSmiles("CCC")
>   mol2.SetProp("TESTFIELD", "Added another molecule - Here the same thing\n\nI guess this is not visible\n")
>   mol2.SetProp("TESTFIELD2", "Beep")
>   w = Chem.SDWriter("t.sdf")
>   w.write(mol)
>   w.write(mol2)
>   w.close()
>
> 2. Trying to read the file in Python fails:
>
>   from rdkit import Chem
>   s = Chem.SDMolSupplier("t.sdf")
>   for mol in s:
>       print(mol.GetProp("TESTFIELD"))
>       # The TESTFIELD text is cropped and TESTFIELD2 is skipped completely,
>       # so the line below will fail:
>       # print(mol.GetProp("TESTFIELD2"))
>
> [10:29:43] ERROR: Problems encountered parsing data fields
> [10:29:43] ERROR: moving to the begining of the next molecule
>
> I guess in this case I will do some pre-processing for the files before
> reading them in SDMolSupplier, but I just wanted to point out this special
> case. Apologies if this was old news, but at least I was unable to find it
> after a quick look.
>
> Best regards,
> Tuomo




