Re: [Rdkit-discuss] What is RDkit

2007-09-14 Thread Greg Landrum
Hi Geoff,

On 9/14/07, Geoffrey Hutchison wrote:
> Hi Greg and company! I think we talked at some point about YAeHMOP
> during my postdoc at Cornell.

yes! I knew your name was familiar beyond seeing it associated with
OpenBabel. That's it.

> Browsing through the thread, I see your comment that RDKit stuff
> isn't nearly as optimized or "battle-tested" as Daylight (or
> presumably Open Babel).

I didn't include OpenBabel there because I honestly don't know enough
about it to be able to realistically assess its performance or
robustness.

> You also seem worried that if the project will have more users,
> you'll spend lots of time answering questions.
>
> From my perspective, neither of these have been problems with Open
> Babel. Of course at this point, we also have more users and
> developers, so the burden of support is spread over more people.
> (Noel, for example, has been very active, as you've probably
> noticed. :-)

This is something that's nice to hear. I'd be happy to have a popular
open-source project that works as it's supposed to (contributions from
the community in terms of either code, examples, documentation, help
for other users, etc.) but I know one needs to be very lucky to have
that happen.

> What I'm curious about is the question of collaboration/merging
> efforts. I haven't had enough time to sit down and pore through your
> code, but it looks like much of RDKit is complementary to Open Babel.
> That is, what we do well (arbitrary generic data, file formats,
> "battle-tested," etc.) is not something already in RDKit. We've
> worked out installation/building on Mac, Linux, Windows, etc. and
> automatic Python, Ruby, etc. interfaces using SWIG. We do have a
> force field framework, and it has at least partial implementations of
> MM2, MMFF94, and Ghemical (i.e., Tripos-5.2).
>
> I also think we have a nice framework for adding additional features
> like force fields, fingerprints, file formats. It's all plugins that
> can be dynamically loaded.
>
> OTOH, we've been discussing an effort for an "Open Babel 3.0" where
> we break backwards compatibility and clean up some of the core code.
> This would obviously be a fairly significant undertaking. Many of the
> features and changes requested are along the lines of RDKit's codebase.
>
> Fortunately, the weekend is coming up. Do you think that:
> a) There's some complementary overlap between RDKit and Open Babel
> b) Collaboration and/or merging might be good for both projects?
>   (e.g., we'd certainly help improve the documentation, etc.)

There's a lot here to discuss and think about. I am certainly in favor
of collaboration and cooperation, but talking about integration is
scary. :-) Let's definitely talk though.

> Obviously, the combined project would have some GPL bits and some BSD
> bits.

To be completely up front, the GPL bit bothers me and presents a
problem for any tight integration. We're already stuck with some "GPL
contamination" for the GUI components of the RDKit due to our use of
Qt, but I have tried to keep that as localized as I possibly can.

Out of curiosity: is the GPL use in OpenBabel due to historical
reasons (OB is derived from the old OELib, right?) or philosophical?

> Is this just a crazy idea?

I don't think so. I'm sure that we can find *something* productive and
useful to do so that the two toolkits aren't competing/completely
disjoint. :-)

-greg



Re: [Rdkit-discuss] What is RDkit

2007-09-14 Thread Geoffrey Hutchison
I think Noel is a few hours ahead of me, so I'll have to play
catch-up. :-)


Hi Greg and company! I think we talked at some point about YAeHMOP
during my postdoc at Cornell.


Browsing through the thread, I see your comment that RDKit stuff  
isn't nearly as optimized or "battle-tested" as Daylight (or  
presumably Open Babel).


You also seem worried that if the project will have more users,  
you'll spend lots of time answering questions.


From my perspective, neither of these have been problems with Open  
Babel. Of course at this point, we also have more users and  
developers, so the burden of support is spread over more people.  
(Noel, for example, has been very active, as you've probably  
noticed. :-)


What I'm curious about is the question of collaboration/merging  
efforts. I haven't had enough time to sit down and pore through your
code, but it looks like much of RDKit is complementary to Open Babel.  
That is, what we do well (arbitrary generic data, file formats,  
"battle-tested," etc.) is not something already in RDKit. We've  
worked out installation/building on Mac, Linux, Windows, etc. and  
automatic Python, Ruby, etc. interfaces using SWIG. We do have a  
force field framework, and it has at least partial implementations of  
MM2, MMFF94, and Ghemical (i.e., Tripos-5.2).


I also think we have a nice framework for adding additional features  
like force fields, fingerprints, file formats. It's all plugins that  
can be dynamically loaded.


OTOH, we've been discussing an effort for an "Open Babel 3.0" where  
we break backwards compatibility and clean up some of the core code.  
This would obviously be a fairly significant undertaking. Many of the  
features and changes requested are along the lines of RDKit's codebase.


Fortunately, the weekend is coming up. Do you think that:
a) There's some complementary overlap between RDKit and Open Babel
b) Collaboration and/or merging might be good for both projects?
 (e.g., we'd certainly help improve the documentation, etc.)

Obviously, the combined project would have some GPL bits and some BSD  
bits.


Is this just a crazy idea?

Cheers,
-Geoff



Re: [Rdkit-discuss] What is RDkit?

2007-09-14 Thread Noel O'Boyle
> Well...don't worry about the publicisation...you can leave that to me
> (I feel a blog post coming on :-) )

As promised, here's a blog post:

http://baoilleach.blogspot.com/2007/09/rdkit-not-just-yet-another.html

Noel



Re: [Rdkit-discuss] What is RDkit?

2007-09-13 Thread Noel O'Boyle
On 13/09/2007, Greg Landrum wrote:
> On 9/13/07, Noel O'Boyle wrote:
> > So on to the questions:
> > (1) Can I believe my eyes? Is this really open source? A lot of the
> > Python code has a very restrictive copyright statement right at the
> > start (see Windows release, AllChem.py for example)
>
> The "All rights reserved" is, in my opinion, superseded by the
> license.txt file (which is BSD, except for the GUI components, which
> are GPL due to Qt license restrictions). It's really open source and
> it's as open as we could make it without going public domain (I
> consider the BSD license to be far more open than the GPL, which is
> quite restrictive IMO).

I would definitely encourage you to replace the "All rights reserved"
with something more friendly. I'll check out some codes
I'm involved with and suggest a change.

> > (4) Why haven't you publicised RDKit, if you don't mind me asking? For
> > example, there is an excellent (if I do say so myself) website called
> > Linux4Chemistry which lists the excellent (if you do say so yourself)
> > YaEHMOP. Also there's the CCL mailing list. I only found RDKit because
> > of trawling through the SF software map. Is this, um, shyness,
> > intentional?
>
> There are many components to the answer to this question. Some are:
>  1) Promotion isn't something I enjoy or am particularly good at.
>  2) I'm kind of afraid of having more users. I do a lot of this as a
> free-time project and I'm afraid of spending all my time answering
> questions. This is, of course, a bit stupid because if the whole open
> source thing works then other people will pitch in and help with those
> questions. For that to happen I need those other people as users,
> which requires that I find them, which... it's a Catch 22

Well...don't worry about the publicisation...you can leave that to me
(I feel a blog post coming on :-) ) As regards too much time
answering questions, cheminformatics toolkits are a pretty niche
interest. Also, I try to avoid answering any question twice; i.e. I
update the documentation if someone doesn't know how to do something,
and you can always send them over to OpenBabel if they don't behave.
But it'd be wrong to think that people will pitch in to answer
questions - they don't. They are more forthcoming finding bugs though,
which is useful too.

> > (5) You may/not be aware but Numeric is deprecated to the extent that
> > it is not available for Python 2.5 on Windows. I had to replace a
> > couple of "import Numeric"s with "from numpy import oldnumeric as
> > Numeric", but this is only a temporary solution.
>
> The Numeric thing is a definite problem (though it works fine for me
> with Python2.5 under windows). I made an attempt a while ago to port
> the code to use numpy, but was immediately frustrated by the lack of
> documentation available (unless you buy the book) and the very
> aggressive response of the community when I complained about this.

I'll help if I can, but I'd be relying on the unittests to ensure correctness.

> > (6) It'd be nice to have an installer for the Python stuff...I've done
> > this for OpenBabel. It's pretty easy.
>
> If you care to share how you did this, I'd be happy to learn. It's a nice
> idea.

Will do. In fact, the one major flaw with promoting your toolkit is
that it's not clear how it's installed on Windows or on Linux. You
might want to consider writing this up.

> > (12) Interested in easily converting a ROMol to an OBMol and vice
> > versa? I am. It'd be trivial to do this at the Python level. We could
> > coordinate a bit to make the methods somewhat symmetrical. It would
> > make it easy to unittest shared algorithms against each other, e.g.
> > LogP calculation, SMILES, or whatever.
>
> It would be an interesting exercise. I'm not convinced that it would
> be trivial to get it right.  There's a lot of devil in the details of
> things like aromaticity handling and general sanitization problems
> (the RDKit is *very* picky about molecules being "clean").

Well, as you can imagine, there are several levels of information in a
chemical structure. I was initially thinking of just sharing
coordinates, and allowing each program to work the rest out from
there. Do you have bond perception? Well, anyway, we will sort this
out later.
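
A minimal sketch of the coordinates-only exchange described above, assuming the plain XYZ text layout (atom count, comment line, then one "symbol x y z" line per atom) as the neutral format. It deliberately uses neither toolkit's API; `to_xyz` and `from_xyz` are hypothetical helper names, and bonds, aromaticity, and charges are left for each program to perceive.

```python
from typing import List, Tuple

Atom = Tuple[str, float, float, float]  # element symbol, x, y, z in angstroms


def to_xyz(atoms: List[Atom], comment: str = "") -> str:
    """Serialize atoms to XYZ text: count line, comment line, atom lines."""
    lines = [str(len(atoms)), comment]
    for symbol, x, y, z in atoms:
        lines.append(f"{symbol} {x:.4f} {y:.4f} {z:.4f}")
    return "\n".join(lines) + "\n"


def from_xyz(text: str) -> List[Atom]:
    """Parse XYZ text back into (symbol, x, y, z) tuples."""
    lines = text.splitlines()
    count = int(lines[0])
    atoms = []
    # Skip the count and comment lines, then read exactly `count` atoms.
    for line in lines[2:2 + count]:
        symbol, x, y, z = line.split()
        atoms.append((symbol, float(x), float(y), float(z)))
    return atoms
```

A round trip through these two functions gives each side a simple fixture for testing shared algorithms against each other.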

Thanks for all the answers,

   Noel



Re: [Rdkit-discuss] What is RDkit?

2007-09-13 Thread Greg Landrum
Hi Noel,

Let me provide a more thorough answer than I did previously. I'll move
it back on list too so that the answers are out there and archived.

On 9/13/07, Noel O'Boyle wrote:
> (off-list due to inquisitive nature of some questions, although feel
> free to move back onto-list)
>
> First of all, Greg, this is pretty expletive impressive. In fact, it's
> unbelievable. Not only have you matched the OpenBabel or Daylight
> toolkits, in many ways (perhaps all?) you have surpassed them. I can't

Thanks for the compliments, I appreciate them and I'm sure that
Santosh (the other developer) does as well, but I do want to temper
the enthusiasm with a bit of reality: the RDKit stuff isn't nearly as
highly optimized or thoroughly battle tested as Daylight.

> believe this code has been out for more than a year. You've got 2D and
> 3D coordinate generation, and everything! Something the open source
> world has been crying out for for the last few years. If this code had
> been around when I first heard about the Python interface to OB
> (almost 2y ago now), I probably wouldn't have been involved. (To
> clarify, I'm involved at the Python end of OpenBabel)
>
> So on to the questions:
> (1) Can I believe my eyes? Is this really open source? A lot of the
> Python code has a very restrictive copyright statement right at the
> start (see Windows release, AllChem.py for example)

The "All rights reserved" is, in my opinion, superseded by the
license.txt file (which is BSD, except for the GUI components, which
are GPL due to Qt license restrictions). It's really open source and
it's as open as we could make it without going public domain (I
consider the BSD license to be far more open than the GPL, which is
quite restrictive IMO).

> (2) How does, if at all, being the product of a company affect this
> toolkit? I guess what I'm saying is, are you willing to engage with
> the OS community. Is the code likely to be taken back in house (which
> happened with OEChem, for example)?

The company is no more, so no worries about that.

> (3) What's the story with version numbers and backwards-incompatible
> API changes? That is, do you try to maintain the API across releases?

Yes. API changes would break the unit tests, which would require a lot
of work to fix, so pure developer laziness dictates API stability. We
also put a lot of time into making sure that the various binary
formats used can always be parsed backwards (e.g. if a file format
change happens newer versions can still read old files).
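
A minimal sketch of how a reader can stay backward compatible when a binary layout changes. This is not the RDKit's actual format; the `DEMO` signature, the version numbers, and the two fields are invented for illustration.

```python
import struct

MAGIC = b"DEMO"        # hypothetical file signature
CURRENT_VERSION = 2    # hypothetical current layout version


def dump(n_atoms: int, charge: int) -> bytes:
    """Write the version 2 layout: magic, version, atom count, charge."""
    return MAGIC + struct.pack("<HIi", CURRENT_VERSION, n_atoms, charge)


def load(blob: bytes) -> dict:
    """Read any known layout version, filling defaults for older ones."""
    if blob[:4] != MAGIC:
        raise ValueError("not a DEMO blob")
    (version,) = struct.unpack_from("<H", blob, 4)
    if version == 1:
        # Version 1 stored only the atom count; default the missing field.
        (n_atoms,) = struct.unpack_from("<I", blob, 6)
        return {"n_atoms": n_atoms, "charge": 0}
    if version == 2:
        n_atoms, charge = struct.unpack_from("<Ii", blob, 6)
        return {"n_atoms": n_atoms, "charge": charge}
    raise ValueError(f"unsupported version {version}")
```

The design choice is that the writer only ever emits the newest layout, while the reader keeps one branch per historical version so old files remain readable.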

> (4) Why haven't you publicised RDKit, if you don't mind me asking? For
> example, there is an excellent (if I do say so myself) website called
> Linux4Chemistry which lists the excellent (if you do say so yourself)
> YaEHMOP. Also there's the CCL mailing list. I only found RDKit because
> of trawling through the SF software map. Is this, um, shyness,
> intentional?

There are many components to the answer to this question. Some are:
 1) Promotion isn't something I enjoy or am particularly good at.
 2) I'm kind of afraid of having more users. I do a lot of this as a
free-time project and I'm afraid of spending all my time answering
questions. This is, of course, a bit stupid because if the whole open
source thing works then other people will pitch in and help with those
questions. For that to happen I need those other people as users,
which requires that I find them, which... it's a Catch 22

> (5) You may/not be aware but Numeric is deprecated to the extent that
> it is not available for Python 2.5 on Windows. I had to replace a
> couple of "import Numeric"s with "from numpy import oldnumeric as
> Numeric", but this is only a temporary solution.

The Numeric thing is a definite problem (though it works fine for me
with Python2.5 under windows). I made an attempt a while ago to port
the code to use numpy, but was immediately frustrated by the lack of
documentation available (unless you buy the book) and the very
aggressive response of the community when I complained about this.
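
A rough sketch of what a direct port from Numeric to numpy looks like for a few common calls. The legacy spellings in the comments are recalled from the old Numeric API and should be checked against the code being ported; the variable names are illustrative only. Note that the `numpy.oldnumeric` compatibility layer mentioned in the question was later removed from NumPy, so a direct port is the durable route.

```python
import numpy as np

# Numeric (legacy):  Numeric.zeros((2, 3), Numeric.Float)
coords = np.zeros((2, 3), dtype=float)

# Numeric (legacy):  Numeric.array([1, 2, 3], Numeric.Int)
counts = np.array([1, 2, 3], dtype=int)

# Numeric (legacy):  Numeric.matrixmultiply(a, b)
product = np.dot(np.eye(3), np.eye(3))
```

Unit tests that compare results before and after each replacement are the safest way to confirm that a port of this kind has not changed behaviour.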

> (6) It'd be nice to have an installer for the Python stuff...I've done
> this for OpenBabel. It's pretty easy.

If you care to share how you did this, I'd be happy to learn. It's a nice idea.

> (7) Conceptually is this a C++ toolkit, or a Python toolkit with a C++
> backend? It seems that a lot of the work is done in Python...

It's both. The core data structures and algorithms are almost entirely
in C++, a lot of the "end-user" functionality is written in Python.
The model has been that new algorithms get coded first in Python and
then ported into C++ if it's needed for speed. The two APIs are
similar enough (to me at least) that this usually ends up being fairly
straightforward.

> (8) Are there any particular reasons you didn't base your code on OpenBabel?

Again, a complicated question. The short answer is:
1) at the time we started the RDKit development OpenBabel was still
OELib (or close to it) and didn't do what we wanted.
2) we were doing this

Re: [Rdkit-discuss] What is RDkit?

2007-09-12 Thread Greg Landrum
On 9/12/07, Noel O'Boyle wrote:
>
> Just a note: you can point the SF home page to go to whatever website you 
> want.

yep; I just had never updated that after registering the rdkit.org
domain. It's fixed(ish) now.

> > My last look at blue obelisk was, admittedly, a while ago, but I
> > somehow got the impression at the time that it was pretty
> > Java-centric. Is this true?
>
> I can understand your impression. I'm not into Java though. The other
> side is the C++ code of OpenBabel, and Python code for various things
> (e.g. GaussSum, cclib for comp chem, and Python bindings for
> OpenBabel). Also, the BO is involved in sharing chemical data, e.g.
> names of elements, atomic weights and so on, so that we don't keep
> having to re-enter this information in different software. (Check out
> Blue Obelisk Data Repository on google). Interoperability is also one
> of our goals.

I'll take a fresh look at what's on the website (particularly the data
pages); thanks for the pointer.

> In the end, the BO is a loosely coupled bunch of chemists/computer
> scientists with largely the same goals but lots of different ideas
> about getting there. In short, I recommend subscribing to the mailing
> list, or the RSS feed, checking out the wiki, and keeping an eye on
> things. There's a couple of very interesting blogs too, if I do say so
> myself..

Nice ad. :-)
I'll keep an eye on things and we can collaboratively see if there's
useful overlap.

-greg



Re: [Rdkit-discuss] What is RDkit?

2007-09-12 Thread Noel O'Boyle
On 12/09/2007, Greg Landrum wrote:
> Hi Noel,
>
> Sorry for the very slow posting of your message and reply; I was on
> vacation until yesterday and needed to approve your posting since you
> aren't subscribed to the discuss list; that shouldn't be a problem
> from now on.

Don't worry about it. Hope you had a good holiday.

> On 8/27/07, Noel O'Boyle wrote:
> > Any chance of more info on what RDkit contains? E.g. an API, or a
> > website. Although Open Source projects are infamous for incomplete
> > documentation, you seem to have taken this to an extreme :-)
>
> ah hah! that's where you're wrong! :-)
> We actually do have some documentation, we just don't have any
> obvious links to it from sourceforge. This is an oversight on my part
> and something I will clear up.
>
> There are some useful links here:
> http://www.rdkit.org/
> specifically to the overview PDF:
> http://www.rdkit.org/RDKit_Overview.pdf
> and some (automatically generated and somewhat out of date) API documentation:
> http://www.rdkit.org/C++_Docs

Just a note: you can point the SF home page to go to whatever website you want.

> There's also an introduction to using the code from Python on
> sourceforge that provides something of an overview of the
> functionality:
> http://downloads.sourceforge.net/rdkit/GettingStartedInPython.pdf

Great! I'll read up on it.

> > I'm involved in other open source cheminformatics packages (via the
> > BlueObelisk, with varying degrees of documentation it must be
> > admitted) so it would be good to know to what extent there is overlap,
> > and whether we could share code, etc...
>
> My last look at blue obelisk was, admittedly, a while ago, but I
> somehow got the impression at the time that it was pretty
> Java-centric. Is this true?

I can understand your impression. I'm not into Java though. The other
side is the C++ code of OpenBabel, and Python code for various things
(e.g. GaussSum, cclib for comp chem, and Python bindings for
OpenBabel). Also, the BO is involved in sharing chemical data, e.g.
names of elements, atomic weights and so on, so that we don't keep
having to re-enter this information in different software. (Check out
Blue Obelisk Data Repository on google). Interoperability is also one
of our goals.

In the end, the BO is a loosely coupled bunch of chemists/computer
scientists with largely the same goals but lots of different ideas
about getting there. In short, I recommend subscribing to the mailing
list, or the RSS feed, checking out the wiki, and keeping an eye on
things. There's a couple of very interesting blogs too, if I do say so
myself..

Noel

> Regards,
> -greg
>



Re: [Rdkit-discuss] What is RDkit?

2007-09-12 Thread Greg Landrum
Hi Noel,

Sorry for the very slow posting of your message and reply; I was on
vacation until yesterday and needed to approve your posting since you
aren't subscribed to the discuss list; that shouldn't be a problem
from now on.

On 8/27/07, Noel O'Boyle wrote:
> Any chance of more info on what RDkit contains? E.g. an API, or a
> website. Although Open Source projects are infamous for incomplete
> documentation, you seem to have taken this to an extreme :-)

ah hah! that's where you're wrong! :-)
We actually do have some documentation, we just don't have any
obvious links to it from sourceforge. This is an oversight on my part
and something I will clear up.

There are some useful links here:
http://www.rdkit.org/
specifically to the overview PDF:
http://www.rdkit.org/RDKit_Overview.pdf
and some (automatically generated and somewhat out of date) API documentation:
http://www.rdkit.org/C++_Docs

There's also an introduction to using the code from Python on
sourceforge that provides something of an overview of the
functionality:
http://downloads.sourceforge.net/rdkit/GettingStartedInPython.pdf

> I'm involved in other open source cheminformatics packages (via the
> BlueObelisk, with varying degrees of documentation it must be
> admitted) so it would be good to know to what extent there is overlap,
> and whether we could share code, etc...

My last look at blue obelisk was, admittedly, a while ago, but I
somehow got the impression at the time that it was pretty
Java-centric. Is this true?

Regards,
-greg