Re: [Rdkit-discuss] mol properties in SDWriter

2023-09-29 Thread Ling Chan
Thank you Andrew for the information. It is good to know that this is part
of the standard. So I don't need to worry now. And I like the safety
checking part of your code.

Dan, I wrote my email because I did not see any mention of this in the SD
file definition documents that I could find. I may have overlooked it. But
if it really were not part of the definition, I/O problems would always be
possible. We have run into several similar situations with non-conforming
files and non-conforming parsers, and each time I had to check the format
definition to decide which vendor's customer support (writer side or reader
side) to write to. This is why I am careful now. Updating the software you
use would not solve it, because as far as the parsing software is concerned
it is not a bug.

Ling

___
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] mol properties in SDWriter

2023-09-29 Thread Dan Nealschneider
I'd also be curious how the index is causing you problems. All SD reading
code that I know about ignores those suffixes. If you're not using RDKit to
read the SD file, maybe it would be best to update whatever it is you *are*
using to parse the file.

dan nealschneider | senior staff developer

*he/him/his*
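
A minimal sketch of why a reader can ignore the suffix: the field name sits between "<" and ">" on the data header line, so a parser that extracts only that part never looks at a trailing "(1)". The helper name data_header_field_name is hypothetical and not part of RDKit or any other toolkit; real parsers may handle more header variants than this.

```python
import re

# Hypothetical helper, not an RDKit API: the field name is whatever
# sits between "<" and ">" on a line that starts with ">".
_FIELD_NAME = re.compile(r"<([^<>]+)>")


def data_header_field_name(line: str):
    """Return the field name from an SD data header line, or None."""
    if not line.startswith(">"):
        return None  # not a data header line
    match = _FIELD_NAME.search(line, 1)  # skip the leading ">"
    return match.group(1) if match else None


# The registry number after the field name does not affect the result.
print(data_header_field_name(">  <pKa>  (1)"))
print(data_header_field_name(">  <pKa>"))
```

Both calls print the same field name, with or without the parenthesized number.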



Re: [Rdkit-discuss] mol properties in SDWriter

2023-09-29 Thread Andrew Dalke
On Sep 26, 2023, at 01:17, Ling Chan  wrote:
> >  <pKa>  (1)
> 4.099
  ..
> Just wonder what was the rationale behind this extra "(1)" on the property 
> field lines (pKa and logP in the above example)?
> 
> And is there a way to get rid of these? I am not sure if this extra "(1)" is 
> part of the standard sd format.

RDKit uses the increasing value as a sort of per-file registry number.

This follows the part of the standard which says "External registry numbers 
must be enclosed in parentheses."

The relevant code is in Code/GraphMol/FileParsers/SDWriter.cpp :

  if (d_molid >= 0) {
    (*dp_ostream) << "(" << d_molid + 1 << ") ";
  }

There is no way to suppress this output. Not only is there no direct way to 
change the d_molid, but d_molid cannot be negative, as 
Code/GraphMol/FileParsers/MolWriters.h declares it as:

  unsigned int d_molid;  // the number of the molecules we wrote so far
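
As a rough illustration of the effect, here is a pure-Python sketch of a writer that numbers data headers with a per-file counter. It only mimics the logic quoted above. The function name data_header and the exact spacing of the header line are assumptions for illustration, not RDKit's API or guaranteed output.

```python
def data_header(name: str, mol_index: int) -> str:
    """Build a data header line for the record at mol_index (0-based).

    Hypothetical helper: mol_index plays the role of d_molid, the number
    of molecules written so far. The spacing is an assumption.
    """
    header = f">  <{name}>  "
    # d_molid is unsigned in the C++ code, so the ">= 0" test there is
    # always true and the "(N)" part is always written.
    header += f"({mol_index + 1}) "
    return header


# Every data item of the first record gets "(1)", of the second "(2)".
for mol_index in range(2):
    for name in ("pKa", "logP"):
        print(data_header(name, mol_index))
```

The number identifies the record within the file, which is why both properties of one molecule carry the same value.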


Wim suggested a post-processing approach. Another is to write the SD data items 
yourself, that is, use MolToMolBlock() to generate the connection table/molfile 
as a string, then iterate through the properties and generate the data items.


import sys
from rdkit import Chem

def MolToSDFRecord(
        mol,
        includeStereo: bool = True,
        confId: int = -1,
        kekulize: bool = True,
        forceV3000: bool = False):
    # Generate the connection table (molfile) part of the record.
    mol_block = Chem.MolToMolBlock(mol, includeStereo, confId, kekulize,
                                   forceV3000)

    lines = []
    for prop_name in mol.GetPropNames():
        # These characters would corrupt the "> <name>" header line.
        if "\n" in prop_name or ">" in prop_name or "<" in prop_name:
            sys.stderr.write(
                f"WARNING: Skipping property {prop_name!r} because the "
                "name includes an unsupported character.\n")
            continue

        prop_value = mol.GetProp(prop_name)
        if "\n" in prop_value:
            # A blank line ends a data item, so it cannot appear in a value.
            if "\n\n" in prop_value or "\r\n\r\n" in prop_value:
                sys.stderr.write(
                    f"WARNING: Skipping property {prop_name!r} because the "
                    "value includes an embedded blank line.\n")
                continue
            # Drop one trailing newline; the data item below adds its own.
            if prop_value.endswith("\r\n"):
                prop_value = prop_value[:-2]
            elif prop_value.endswith("\n"):
                prop_value = prop_value[:-1]

        # Header line without a registry number, the value, a blank line.
        lines.append(f"> <{prop_name}>\n{prop_value}\n\n")

    # Every SD record ends with the $$$$ delimiter line.
    lines.append("$$$$\n")

    return mol_block + "".join(lines)

mol = Chem.MolFromSmiles("CCO")
mol.SetProp("pKa", "3.3\r\n")
print(MolToSDFRecord(mol))


Andrew


Re: [Rdkit-discuss] mol properties in SDWriter

2023-09-28 Thread Ling Chan
Thank you Wim. I shall post-process the SDF as you suggested.
Ling




Re: [Rdkit-discuss] mol properties in SDWriter

2023-09-25 Thread Wim Dehaen
I don't know why there is a counter in parentheses there, but if there is no
option to remove it, you could strip it yourself with a regex that removes
anything between parentheses on a line that starts with ">",
for example:

from rdkit import Chem
import re
from io import StringIO
m = Chem.MolFromSmiles("CCC")
m.SetProp("pKa","3.3")
sio = StringIO()
with Chem.SDWriter(sio) as o:
    o.write(m)
sio.seek(0)
with open("temp3.sdf", "w") as f:
    for line in sio.readlines():
        f.write(re.sub(r'^>(.*?)\((.*?)\)', r'>\1', line))

best wishes
wim
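
One caveat with the regex above: it removes the first parenthesized group on the line, so a property name that itself contains parentheses, such as "logP (calc)", would be altered. A slightly stricter sketch is shown here; it only removes a parenthesized group that comes after the closing ">" of the field name at the end of the line. The name strip_registry_number is hypothetical, and the sketch assumes the header carries nothing else after the field name.

```python
import re

# Match "(...)" only when it is the last thing on a data header line,
# after the closing ">" of the field name.
_REGISTRY_SUFFIX = re.compile(r"^(>.*>)\s*\([^()]*\)\s*$")


def strip_registry_number(line: str) -> str:
    """Remove a trailing "(N)" from an SD data header line.

    Lines that are not data headers, or headers without a trailing
    parenthesized group, are returned unchanged.
    """
    body = line.rstrip("\r\n")
    ending = line[len(body):]  # keep the original line ending
    return _REGISTRY_SUFFIX.sub(r"\1", body) + ending


print(repr(strip_registry_number(">  <pKa>  (1) \n")))
print(repr(strip_registry_number(">  <logP (calc)>  (2) \n")))
print(repr(strip_registry_number("(S)-ibuprofen\n")))
```

It can be dropped into the loop above in place of the re.sub() call, applied to each line before writing.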

On Tue, Sep 26, 2023 at 1:20 AM Ling Chan  wrote:

> Dear Colleagues,
>
> I noticed that when writing out molecules using SDWriter() , the
> properties fields are followed by something like "(1)" , "(2)". I mean, the
> sdf looks like:
>
> propane
>  RDKit  3D
>
>   3  2  0  0  0  0  0  0  0  0999 V2000
>     0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>     1.4280    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>     1.9040    1.3000   -0.3480 C   0  0  0  0  0  0  0  0  0  0  0  0
>   1  2  1  0
>   2  3  1  0
> M  END
> >  <pKa>  (1)
> 4.099
>
> >  <logP>  (1)
> 2
>
> $$$$
>
> Just wonder what was the rationale behind this extra "(1)" on the property
> field lines (pKa and logP in the above example)?
>
> And is there a way to get rid of these? I am not sure if this extra "(1)"
> is part of the standard sd format.
>
> Thank you!
>
> Regards,
> Ling
>
>
> ---
>
> To create an sdf, you can do something like:
>
> >>> from rdkit import Chem
> >>> m = Chem.MolFromSmiles("CCC")
> >>> m.SetProp("pKa","3.3")
> >>> with Chem.SDWriter("temp3.sdf") as o:
> ...   o.write(m)
>
> Or use Chem.SDMolSupplier() to get mols from another sdf.