Re: [Open Babel] What is the time complexity of generating a canonical SMILES

2018-11-07 Thread Noel O'Boyle
Sorry: typo "...quickly eliminate some duplicates using non-canonical
SMILES..."

On Wed, 7 Nov 2018 at 09:32, Noel O'Boyle  wrote:

> I don't believe that there is a better way to eliminate duplicates apart
> from using canonical forms (e.g. InChI or canonical SMILES, though InChI
> also changes the structure so use with care). You can quickly eliminate
> some duplicates using canonical SMILES or a matrix comparison as you
> suggest, but to actually be sure you will still have to use a canonical
> form.
>
> I'm not sure exactly what you're doing but another idea would be to avoid
> generating some structures in the first place using graph symmetry
> (available in OB). For example, don't apply the same reaction to symmetry
> equivalent atoms.
>
> - Noel
>
> On Fri, 2 Nov 2018 at 18:08, Xianghai Sheng  wrote:
>
>> Hi Andrew and Noel,
>>
>> Thank you for the inspiring answers!
>>
>> Andrew: Sorry I didn't make it clear, but 1/2 of the time is only for a
>> random job I tested and I don't know the scaling of it. Its cost may grow
>> faster than the searching algorithm. I think it is beneficial to know the
>> formal time complexity in this sense. But I agree that it is mostly useful
>> when thinking about large N. I may have to just measure the actual scaling
>> and deal with it.
>>
>> Noel: That's a brilliant idea! I do think most of the time is wasted on
>> canonicalizing the same graph in my code. I will give it a try. On another
>> note, the purpose of canonicalization in my code is to eliminate duplicates
>> in my reaction path network. Is there a better way to do it? Molecules in
>> my code are represented by bond-electron matrix. A naive approach would be
>> to sort rows in both matrices and compare them row by row. (I think this
>> takes O(N^2*logN))
>>
>> Best,
>> Xianghai
>>
>> On Fri, Nov 2, 2018 at 3:31 AM Noel O'Boyle  wrote:
>>
>>> First of all, I'd recommend using the development code rather than the
>>> latest release. Also, ensure that hydrogens are suppressed. You may also
>>> consider trying to avoid multiple canonicalisations of the same graph by
>>> just writing out a non-canonical SMILES, and only at the end collating them
>>> and canonicalising.
>>>
>>> While I agree with Andrew, it is also true that Open Babel's
>>> canonicalisation implementation is slow. This is partly due to the fact
>>> that we calculate all of the graph invariants up-front rather than lazily
>>> to split ties. In particular, we calculate the pairwise distance of every
>>> atom from every other atom (as far as I remember). This is O(N**2).
>>>
>>>
>>> Regards,
>>> - Noel
>>>
>>> On Fri, 2 Nov 2018 at 08:02, Andrew Dalke 
>>> wrote:
>>>
 On Nov 2, 2018, at 01:33, Xianghai Sheng  wrote:
 > I am trying to figure out the time complexity (Big O) of converting
 from OBMol to canonical smiles.

 I don't think that's a useful way to think about your problem.

 Graph canonicalization is "at least as computationally hard as the
 graph isomorphism problem."
 https://en.wikipedia.org/wiki/Graph_canonization , and graph
 isomorphism is believed to be quasi-polynomial.

 That said, I believe the actual algorithm Open Babel uses may have
 exponential time for some graphs.

 However, most molecular graphs don't reach the worst-time behavior.

 Furthermore, time complexity is mostly useful when thinking about large
 N. Molecules typically have small N, and there may be lower-order terms
 which dominate the times for small N.

 If you are interested in the topic of graph canonicalization algorithms
 in general, you might start with the nauty papers; see
 http://pallini.di.uniroma1.it/

 > I want to know this because getting canonical smiles takes up half of
 the running time of my reaction path searching program.

 How would knowing the time complexity help you?

 While this may be obvious, if canonicalization takes 1/2 the time, then
 you'll only get 2x speedup even if that time goes to 0. How much work are
 you willing to do for a 2x speedup?

 Might it be possible to do about the same amount of work to parallelize
 your algorithm? For example, if your reaction path search generates a set
 of nodes which must all be canonicalized in order to evaluate them, then
 you can do all of those canonicalizations in parallel.

 Cheers,

 Andrew
 da...@dalkescientific.com




 ___
 OpenBabel-discuss mailing list
 OpenBabel-discuss@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/openbabel-discuss

>>>
___
OpenBabel-discuss mailing list
OpenBabel-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss


Re: [Open Babel] What is the time complexity of generating a canonical SMILES

2018-11-07 Thread Noel O'Boyle
I don't believe that there is a better way to eliminate duplicates apart
from using canonical forms (e.g. InChI or canonical SMILES, though InChI
also changes the structure so use with care). You can quickly eliminate
some duplicates using canonical SMILES or a matrix comparison as you
suggest, but to actually be sure you will still have to use a canonical
form.

I'm not sure exactly what you're doing but another idea would be to avoid
generating some structures in the first place using graph symmetry
(available in OB). For example, don't apply the same reaction to symmetry
equivalent atoms.

- Noel

On Fri, 2 Nov 2018 at 18:08, Xianghai Sheng  wrote:

> Hi Andrew and Noel,
>
> Thank you for the inspiring answers!
>
> Andrew: Sorry I didn't make it clear, but 1/2 of the time is only for a
> random job I tested and I don't know the scaling of it. Its cost may grow
> faster than the searching algorithm. I think it is beneficial to know the
> formal time complexity in this sense. But I agree that it is mostly useful
> when thinking about large N. I may have to just measure the actual scaling
> and deal with it.
>
> Noel: That's a brilliant idea! I do think most of the time is wasted on
> canonicalizing the same graph in my code. I will give it a try. On another
> note, the purpose of canonicalization in my code is to eliminate duplicates
> in my reaction path network. Is there a better way to do it? Molecules in
> my code are represented by bond-electron matrix. A naive approach would be
> to sort rows in both matrices and compare them row by row. (I think this
> takes O(N^2*logN))
>
> Best,
> Xianghai
>
> On Fri, Nov 2, 2018 at 3:31 AM Noel O'Boyle  wrote:
>
>> First of all, I'd recommend using the development code rather than the
>> latest release. Also, ensure that hydrogens are suppressed. You may also
>> consider trying to avoid multiple canonicalisations of the same graph by
>> just writing out a non-canonical SMILES, and only at the end collating them
>> and canonicalising.
>>
>> While I agree with Andrew, it is also true that Open Babel's
>> canonicalisation implementation is slow. This is partly due to the fact
>> that we calculate all of the graph invariants up-front rather than lazily
>> to split ties. In particular, we calculate the pairwise distance of every
>> atom from every other atom (as far as I remember). This is O(N**2).
>>
>>
>> Regards,
>> - Noel
>>
>> On Fri, 2 Nov 2018 at 08:02, Andrew Dalke 
>> wrote:
>>
>>> On Nov 2, 2018, at 01:33, Xianghai Sheng  wrote:
>>> > I am trying to figure out the time complexity (Big O) of converting
>>> from OBMol to canonical smiles.
>>>
>>> I don't think that's a useful way to think about your problem.
>>>
>>> Graph canonicalization is "at least as computationally hard as the graph
>>> isomorphism problem." https://en.wikipedia.org/wiki/Graph_canonization
>>> , and graph isomorphism is believed to be quasi-polynomial.
>>>
>>> That said, I believe the actual algorithm Open Babel uses may have
>>> exponential time for some graphs.
>>>
>>> However, most molecular graphs don't reach the worst-time behavior.
>>>
>>> Furthermore, time complexity is mostly useful when thinking about large
>>> N. Molecules typically have small N, and there may be lower-order terms
>>> which dominate the times for small N.
>>>
>>> If you are interested in the topic of graph canonicalization algorithms
>>> in general, you might start with the nauty papers; see
>>> http://pallini.di.uniroma1.it/
>>>
>>> > I want to know this because getting canonical smiles takes up half of
>>> the running time of my reaction path searching program.
>>>
>>> How would knowing the time complexity help you?
>>>
>>> While this may be obvious, if canonicalization takes 1/2 the time, then
>>> you'll only get 2x speedup even if that time goes to 0. How much work are
>>> you willing to do for a 2x speedup?
>>>
>>> Might it be possible to do about the same amount of work to parallelize
>>> your algorithm? For example, if your reaction path search generates a set
>>> of nodes which must all be canonicalized in order to evaluate them, then
>>> you can do all of those canonicalizations in parallel.
>>>
>>> Cheers,
>>>
>>> Andrew
>>> da...@dalkescientific.com
>>>
>>>
>>>
>>>
>>> ___
>>> OpenBabel-discuss mailing list
>>> OpenBabel-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/openbabel-discuss
>>>
>>
___
OpenBabel-discuss mailing list
OpenBabel-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss


Re: [Open Babel] What is the time complexity of generating a canonical SMILES

2018-11-02 Thread Xianghai Sheng
Hi Andrew and Noel,

Thank you for the inspiring answers!

Andrew: Sorry I didn't make it clear, but 1/2 of the time is only for a random 
job I tested and I don't know the scaling of it. Its cost may grow faster than 
the searching algorithm. I think it is beneficial to know the formal time 
complexity in this sense. But I agree that it is mostly useful when thinking 
about large N. I may have to just measure the actual scaling and deal with it.

Noel: That's a brilliant idea! I do think most of the time is wasted on 
canonicalizing the same graph in my code. I will give it a try. On another 
note, the purpose of canonicalization in my code is to eliminate duplicates in 
my reaction path network. Is there a better way to do it? Molecules in my code 
are represented by bond-electron matrix. A naive approach would be to sort rows 
in both matrices and compare them row by row. (I think this takes O(N^2*logN))

Best,
Xianghai

On Fri, Nov 2, 2018 at 3:31 AM Noel O'Boyle 
mailto:baoille...@gmail.com>> wrote:
First of all, I'd recommend using the development code rather than the latest 
release. Also, ensure that hydrogens are suppressed. You may also consider 
trying to avoid multiple canonicalisations of the same graph by just writing 
out a non-canonical SMILES, and only at the end collating them and 
canonicalising.

While I agree with Andrew, it is also true that Open Babel's canonicalisation 
implementation is slow. This is partly due to the fact that we calculate all of 
the graph invariants up-front rather than lazily to split ties. In particular, 
we calculate the pairwise distance of every atom from every other atom (as far 
as I remember). This is O(N**2).


Regards,
- Noel

On Fri, 2 Nov 2018 at 08:02, Andrew Dalke 
mailto:da...@dalkescientific.com>> wrote:
On Nov 2, 2018, at 01:33, Xianghai Sheng 
mailto:xsh...@ucmerced.edu>> wrote:
> I am trying to figure out the time complexity (Big O) of converting from 
> OBMol to canonical smiles.

I don't think that's a useful way to think about your problem.

Graph canonicalization is "at least as computationally hard as the graph 
isomorphism problem." https://en.wikipedia.org/wiki/Graph_canonization , and 
graph isomorphism is believed to be quasi-polynomial.

That said, I believe the actual algorithm Open Babel uses may have exponential 
time for some graphs.

However, most molecular graphs don't reach the worst-time behavior.

Furthermore, time complexity is mostly useful when thinking about large N. 
Molecules typically have small N, and there may be lower-order terms which 
dominate the times for small N.

If you are interested in the topic of graph canonicalization algorithms in 
general, you might start with the nauty papers; see 
http://pallini.di.uniroma1.it/

> I want to know this because getting canonical smiles takes up half of the 
> running time of my reaction path searching program.

How would knowing the time complexity help you?

While this may be obvious, if canonicalization takes 1/2 the time, then you'll 
only get 2x speedup even if that time goes to 0. How much work are you willing 
to do for a 2x speedup?

Might it be possible to do about the same amount of work to parallelize your 
algorithm? For example, if your reaction path search generates a set of nodes 
which must all be canonicalized in order to evaluate them, then you can do all 
of those canonicalizations in parallel.

Cheers,

Andrew

da...@dalkescientific.com




___
OpenBabel-discuss mailing list
OpenBabel-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss
___
OpenBabel-discuss mailing list
OpenBabel-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss


Re: [Open Babel] What is the time complexity of generating a canonical SMILES

2018-11-02 Thread Noel O'Boyle
First of all, I'd recommend using the development code rather than the
latest release. Also, ensure that hydrogens are suppressed. You may also
consider trying to avoid multiple canonicalisations of the same graph by
just writing out a non-canonical SMILES, and only at the end collating them
and canonicalising.

While I agree with Andrew, it is also true that Open Babel's
canonicalisation implementation is slow. This is partly due to the fact
that we calculate all of the graph invariants up-front rather than lazily
to split ties. In particular, we calculate the pairwise distance of every
atom from every other atom (as far as I remember). This is O(N**2).


Regards,
- Noel

On Fri, 2 Nov 2018 at 08:02, Andrew Dalke  wrote:

> On Nov 2, 2018, at 01:33, Xianghai Sheng  wrote:
> > I am trying to figure out the time complexity (Big O) of converting from
> OBMol to canonical smiles.
>
> I don't think that's a useful way to think about your problem.
>
> Graph canonicalization is "at least as computationally hard as the graph
> isomorphism problem." https://en.wikipedia.org/wiki/Graph_canonization ,
> and graph isomorphism is believed to be quasi-polynomial.
>
> That said, I believe the actual algorithm Open Babel uses may have
> exponential time for some graphs.
>
> However, most molecular graphs don't reach the worst-time behavior.
>
> Furthermore, time complexity is mostly useful when thinking about large N.
> Molecules typically have small N, and there may be lower-order terms which
> dominate the times for small N.
>
> If you are interested in the topic of graph canonicalization algorithms in
> general, you might start with the nauty papers; see
> http://pallini.di.uniroma1.it/
>
> > I want to know this because getting canonical smiles takes up half of
> the running time of my reaction path searching program.
>
> How would knowing the time complexity help you?
>
> While this may be obvious, if canonicalization takes 1/2 the time, then
> you'll only get 2x speedup even if that time goes to 0. How much work are
> you willing to do for a 2x speedup?
>
> Might it be possible to do about the same amount of work to parallelize
> your algorithm? For example, if your reaction path search generates a set
> of nodes which must all be canonicalized in order to evaluate them, then
> you can do all of those canonicalizations in parallel.
>
> Cheers,
>
> Andrew
> da...@dalkescientific.com
>
>
>
>
> ___
> OpenBabel-discuss mailing list
> OpenBabel-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/openbabel-discuss
>
___
OpenBabel-discuss mailing list
OpenBabel-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss


Re: [Open Babel] What is the time complexity of generating a canonical SMILES

2018-11-02 Thread Andrew Dalke
On Nov 2, 2018, at 01:33, Xianghai Sheng  wrote:
> I am trying to figure out the time complexity (Big O) of converting from 
> OBMol to canonical smiles.

I don't think that's a useful way to think about your problem.

Graph canonicalization is "at least as computationally hard as the graph 
isomorphism problem." https://en.wikipedia.org/wiki/Graph_canonization , and 
graph isomorphism is believed to be quasi-polynomial.

That said, I believe the actual algorithm Open Babel uses may have exponential 
time for some graphs.

However, most molecular graphs don't reach the worst-time behavior.

Furthermore, time complexity is mostly useful when thinking about large N. 
Molecules typically have small N, and there may be lower-order terms which 
dominate the times for small N.

If you are interested in the topic of graph canonicalization algorithms in 
general, you might start with the nauty papers; see 
http://pallini.di.uniroma1.it/ 

> I want to know this because getting canonical smiles takes up half of the 
> running time of my reaction path searching program.

How would knowing the time complexity help you?

While this may be obvious, if canonicalization takes 1/2 the time, then you'll 
only get 2x speedup even if that time goes to 0. How much work are you willing 
to do for a 2x speedup?

Might it be possible to do about the same amount of work to parallelize your 
algorithm? For example, if your reaction path search generates a set of nodes 
which must all be canonicalized in order to evaluate them, then you can do all 
of those canonicalizations in parallel.

Cheers,

Andrew
da...@dalkescientific.com




___
OpenBabel-discuss mailing list
OpenBabel-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss