Re: [Rdkit-discuss] What is RDkit
Hi Geoff,

On 9/14/07, Geoffrey Hutchison wrote:
> Hi Greg and company! I think we talked at some point about YAeHMOP
> during my postdoc at Cornell.

Yes! I knew your name was familiar beyond seeing it associated with OpenBabel. That's it.

> Browsing through the thread, I see your comment that RDKit stuff
> isn't nearly as optimized or "battle-tested" as Daylight (or
> presumably Open Babel).

I didn't include OpenBabel there because I honestly don't know enough about it to be able to realistically assess its performance or robustness.

> You also seem worried that if the project will have more users,
> you'll spend lots of time answering questions.
>
> From my perspective, neither of these have been problems with Open
> Babel. Of course at this point, we also have more users and
> developers, so the burden of support is spread over more people.
> (Noel, for example, has been very active, as you've probably
> noticed. :-)

This is something that's nice to hear. I'd be happy to have a popular open-source project that works as it's supposed to (contributions from the community in terms of either code, examples, documentation, help for other users, etc.) but I know one needs to be very lucky to have that happen.

> What I'm curious about is the question of collaboration/merging
> efforts. I haven't had enough time to sit down and pore through your
> code, but it looks like much of RDKit is complementary to Open Babel.
> That is, what we do well (arbitrary generic data, file formats,
> "battle-tested," etc.) is not something already in RDKit. We've
> worked out installation/building on Mac, Linux, Windows, etc. and
> automatic Python, Ruby, etc. interfaces using SWIG. We do have a
> force field framework, and it has at least partial implementations of
> MM2, MMFF94, and Ghemical (i.e., Tripos-5.2).
>
> I also think we have a nice framework for adding additional features
> like force fields, fingerprints, file formats. It's all plugins that
> can be dynamically loaded.
>
> OTOH, we've been discussing an effort for an "Open Babel 3.0" where
> we break backwards compatibility and clean up some of the core code.
> This would obviously be a fairly significant undertaking. Many of the
> features and changes requested are along the lines of RDKit's codebase.
>
> Fortunately, the weekend is coming up. Do you think that:
> a) There's some complementary overlap between RDKit and Open Babel
> b) Collaboration and/or merging might be good for both projects?
> (e.g., we'd certainly help improve the documentation, etc.)

There's a lot here to discuss and think about. I am certainly in favor of collaboration and cooperation, but talking about integration is scary. :-) Let's definitely talk though.

> Obviously, the combined project would have some GPL bits and some BSD
> bits.

To be completely up front, the GPL bit bothers me and presents a problem for any tight integration. We're already stuck with some "GPL contamination" for the GUI components of the RDKit due to our use of Qt, but I have tried to keep that as localized as I possibly can. Out of curiosity: is the GPL use in OpenBabel due to historical reasons (OB is derived from the old OELib, right?) or philosophical?

> Is this just a crazy idea?

I don't think so. I'm sure that we can find *something* productive and useful to do so that the two toolkits aren't competing/completely disjoint. :-)

-greg
Re: [Rdkit-discuss] What is RDkit
I think Noel is a few hours ahead of me, so I'll have to play catch-up. :-)

Hi Greg and company! I think we talked at some point about YAeHMOP during my postdoc at Cornell.

Browsing through the thread, I see your comment that RDKit stuff isn't nearly as optimized or "battle-tested" as Daylight (or presumably Open Babel). You also seem worried that if the project will have more users, you'll spend lots of time answering questions.

From my perspective, neither of these have been problems with Open Babel. Of course at this point, we also have more users and developers, so the burden of support is spread over more people. (Noel, for example, has been very active, as you've probably noticed. :-)

What I'm curious about is the question of collaboration/merging efforts. I haven't had enough time to sit down and pore through your code, but it looks like much of RDKit is complementary to Open Babel. That is, what we do well (arbitrary generic data, file formats, "battle-tested," etc.) is not something already in RDKit. We've worked out installation/building on Mac, Linux, Windows, etc. and automatic Python, Ruby, etc. interfaces using SWIG. We do have a force field framework, and it has at least partial implementations of MM2, MMFF94, and Ghemical (i.e., Tripos-5.2).

I also think we have a nice framework for adding additional features like force fields, fingerprints, file formats. It's all plugins that can be dynamically loaded.

OTOH, we've been discussing an effort for an "Open Babel 3.0" where we break backwards compatibility and clean up some of the core code. This would obviously be a fairly significant undertaking. Many of the features and changes requested are along the lines of RDKit's codebase.

Fortunately, the weekend is coming up. Do you think that:
a) There's some complementary overlap between RDKit and Open Babel
b) Collaboration and/or merging might be good for both projects?
(e.g., we'd certainly help improve the documentation, etc.)

Obviously, the combined project would have some GPL bits and some BSD bits.

Is this just a crazy idea?

Cheers,
-Geoff
Re: [Rdkit-discuss] What is RDkit?
> Well...don't worry about the publicisation...you can leave that to me
> (I feel a blog post coming on :-) )

As promised, here's a blog post:
http://baoilleach.blogspot.com/2007/09/rdkit-not-just-yet-another.html

Noel
Re: [Rdkit-discuss] What is RDkit?
On 13/09/2007, Greg Landrum wrote:
> On 9/13/07, Noel O'Boyle wrote:
> > So on to the questions:
> > (1) Can I believe my eyes? Is this really open source? A lot of the
> > Python code has a very restrictive copyright statement right at the
> > start (see Windows release, AllChem.py for example)
>
> The "All rights reserved" is, in my opinion, superseded by the
> license.txt file (which is BSD, except for the GUI components, which
> are GPL due to Qt license restrictions). It's really open source and
> it's as open as we could make it without going public domain (I
> consider the BSD license to be far more open than the GPL, which is
> quite restrictive IMO).

I would definitely encourage you to replace the "All rights reserved" with something more friendly. I'll check out some codes I'm involved with and suggest a change.

> > (4) Why haven't you publicised RDKit, if you don't mind me asking? For
> > example, there is an excellent (if I do say so myself) website called
> > Linux4Chemistry which lists the excellent (if you do say so yourself)
> > YAeHMOP. Also there's the CCL mailing list. I only found RDKit because
> > of trawling through the SF software map. Is this, um, shyness,
> > intentional?
>
> There are many components to the answer to this question. Some are:
> 1) Promotion isn't something I enjoy or am particularly good at.
> 2) I'm kind of afraid of having more users. I do a lot of this as a
> free-time project and I'm afraid of spending all my time answering
> questions. This is, of course, a bit stupid because if the whole open
> source thing works then other people will pitch in and help with those
> questions. For that to happen I need those other people as users,
> which requires that I find them, which... it's a Catch 22

Well...don't worry about the publicisation...you can leave that to me (I feel a blog post coming on :-) )

As regards too much time answering questions, cheminformatics toolkits are a pretty niche interest.
Also, I try to avoid answering any question twice; i.e. I update the documentation if someone doesn't know how to do something, and you can always send them over to OpenBabel if they don't behave. But it'd be wrong to think that people will pitch in to answer questions - they don't. They are more forthcoming finding bugs though, which is useful too.

> > (5) You may or may not be aware but Numeric is deprecated to the extent that
> > it is not available for Python 2.5 on Windows. I had to replace a
> > couple of "import Numeric"s with "from numpy import oldnumeric as
> > Numeric", but this is only a temporary solution.
>
> The Numeric thing is a definite problem (though it works fine for me
> with Python 2.5 under Windows). I made an attempt a while ago to port
> the code to use numpy, but was immediately frustrated by the lack of
> documentation available (unless you buy the book) and the very
> aggressive response of the community when I complained about this.

I'll help if I can, but I'd be relying on the unit tests to ensure correctness.

> > (6) It'd be nice to have an installer for the Python stuff...I've done
> > this for OpenBabel. It's pretty easy.
>
> If you care to share how you did this, I'd be happy to learn. It's a nice
> idea.

Will do. In fact, the one major flaw with promoting your toolkit is that it's not clear how it's installed on Windows or on Linux. You might want to consider writing this up.

> > (12) Interested in easily converting a ROMol to an OBMol and vice
> > versa? I am. It'd be trivial to do this at the Python level. We could
> > coordinate a bit to make the methods somewhat symmetrical. It would
> > make it easy to unit-test shared algorithms against each other, e.g.
> > LogP calculation, SMILES, or whatever.
>
> It would be an interesting exercise. I'm not convinced that it would
> be trivial to get it right. There's a lot of devil in the details of
> things like aromaticity handling and general sanitization problems
> (the RDKit is *very* picky about molecules being "clean").

Well, as you can imagine, there are several levels of information in a chemical structure. I was initially thinking of just sharing coordinates, and allowing each program to work the rest out from there. Do you have bond perception? Well, anyway, we will sort this out later.

Thanks for all the answers,
Noel
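
The coordinates-only exchange suggested above can be sketched without either toolkit: pass an XYZ-style text block (element symbol plus Cartesian coordinates) and let the receiving program perceive bonds itself. The following is a minimal plain-Python sketch under that assumption; the `Atom` class and the `to_xyz`/`from_xyz` helpers are hypothetical names for illustration and are not RDKit or OpenBabel API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Atom:
    """Hypothetical minimal atom record: element symbol plus 3D position."""
    symbol: str
    x: float
    y: float
    z: float


def to_xyz(atoms: List[Atom], comment: str = "") -> str:
    """Serialize as an XYZ block: atom count, comment line, one atom per line."""
    lines = [str(len(atoms)), comment]
    for a in atoms:
        lines.append(f"{a.symbol} {a.x:.6f} {a.y:.6f} {a.z:.6f}")
    return "\n".join(lines) + "\n"


def from_xyz(block: str) -> List[Atom]:
    """Parse an XYZ block back into atoms; bonds are deliberately not carried."""
    lines = block.splitlines()
    count = int(lines[0].strip())
    atoms = []
    for line in lines[2:2 + count]:
        symbol, x, y, z = line.split()
        atoms.append(Atom(symbol, float(x), float(y), float(z)))
    if len(atoms) != count:
        raise ValueError("atom count does not match header")
    return atoms


water = [
    Atom("O", 0.0, 0.0, 0.117),
    Atom("H", 0.0, 0.757, -0.469),
    Atom("H", 0.0, -0.757, -0.469),
]
# Round trip: what one side writes, the other side can read back unchanged.
assert from_xyz(to_xyz(water, "water")) == water
```

This sidesteps the aromaticity and sanitization details raised in the quoted reply, at the cost of dropping bond orders, formal charges, and anything else that coordinates alone do not encode.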
Re: [Rdkit-discuss] What is RDkit?
Hi Noel,

Let me provide a more thorough answer than I did previously; I'll move it back on-list too so that the answers are out there and archived.

On 9/13/07, Noel O'Boyle wrote:
> (off-list due to inquisitive nature of some questions, although feel
> free to move back onto-list)
>
> First of all, Greg, this is pretty expletive impressive. In fact, it's
> unbelievable. Not only have you matched the OpenBabel or Daylight
> toolkits, in many ways (perhaps all?) you have surpassed them. I can't

Thanks for the compliments. I appreciate them, and I'm sure that Santosh (the other developer) does as well, but I do want to temper the enthusiasm with a bit of reality: the RDKit stuff isn't nearly as highly optimized or thoroughly battle tested as Daylight.

> believe this code has been out for more than a year. You've got 2D and
> 3D coordinate generation, and everything! Something the open source
> world has been crying out for for the last few years. If this code had
> been around when I first heard about the Python interface to OB
> (almost 2y ago now), I probably wouldn't have been involved. (To
> clarify, I'm involved at the Python end of OpenBabel)
>
> So on to the questions:
> (1) Can I believe my eyes? Is this really open source? A lot of the
> Python code has a very restrictive copyright statement right at the
> start (see Windows release, AllChem.py for example)

The "All rights reserved" is, in my opinion, superseded by the license.txt file (which is BSD, except for the GUI components, which are GPL due to Qt license restrictions). It's really open source and it's as open as we could make it without going public domain (I consider the BSD license to be far more open than the GPL, which is quite restrictive IMO).

> (2) How does, if at all, being the product of a company affect this
> toolkit? I guess what I'm saying is, are you willing to engage with
> the OS community. Is the code likely to be taken back in house (which
> happened with OEChem, for example)?

The company is no more, so no worries about that.

> (3) What's the story with version numbers and backwards-incompatible
> API changes? That is, do you try to maintain the API across releases?

Yes. API changes would break the unit tests, which would require a lot of work to fix, so pure developer laziness dictates API stability. We also put a lot of time into making sure that the various binary formats used can always be parsed backwards (e.g. if a file format change happens, newer versions can still read old files).

> (4) Why haven't you publicised RDKit, if you don't mind me asking? For
> example, there is an excellent (if I do say so myself) website called
> Linux4Chemistry which lists the excellent (if you do say so yourself)
> YAeHMOP. Also there's the CCL mailing list. I only found RDKit because
> of trawling through the SF software map. Is this, um, shyness,
> intentional?

There are many components to the answer to this question. Some are:
1) Promotion isn't something I enjoy or am particularly good at.
2) I'm kind of afraid of having more users. I do a lot of this as a free-time project and I'm afraid of spending all my time answering questions. This is, of course, a bit stupid because if the whole open source thing works then other people will pitch in and help with those questions. For that to happen I need those other people as users, which requires that I find them, which... it's a Catch 22

> (5) You may or may not be aware but Numeric is deprecated to the extent that
> it is not available for Python 2.5 on Windows. I had to replace a
> couple of "import Numeric"s with "from numpy import oldnumeric as
> Numeric", but this is only a temporary solution.

The Numeric thing is a definite problem (though it works fine for me with Python 2.5 under Windows).
I made an attempt a while ago to port the code to use numpy, but was immediately frustrated by the lack of documentation available (unless you buy the book) and the very aggressive response of the community when I complained about this.

> (6) It'd be nice to have an installer for the Python stuff...I've done
> this for OpenBabel. It's pretty easy.

If you care to share how you did this, I'd be happy to learn. It's a nice idea.

> (7) Conceptually is this a C++ toolkit, or a Python toolkit with a C++
> backend? It seems that a lot of the work is done in Python...

It's both. The core data structures and algorithms are almost entirely in C++; a lot of the "end-user" functionality is written in Python. The model has been that new algorithms get coded first in Python and then ported into C++ if it's needed for speed. The two APIs are similar enough (to me at least) that this usually ends up being fairly straightforward.

> (8) Are there any particular reasons you didn't base your code on OpenBabel?

Again, a complicated question. The short answer is:
1) at the time we started the RDKit development OpenBabel was still OELib (or close to it) and didn't do what we wanted.
2) we were doing this
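
On the backwards-readable binary formats mentioned under question (3) in the message above: one common way to get that property is to store a version number in the file header and keep a reader for every older layout. The sketch below is a generic illustration using Python's standard `struct` module. The `DEMO` signature, the version numbers, and the record fields are all invented for the example; this is not a description of RDKit's actual binary format.

```python
import struct

MAGIC = b"DEMO"  # hypothetical file signature
CURRENT_VERSION = 2


def write_record(num_atoms: int, charge: int) -> bytes:
    """Write the current (v2) layout: magic, version, atom count, charge."""
    return MAGIC + struct.pack("<HIi", CURRENT_VERSION, num_atoms, charge)


def read_record(data: bytes) -> dict:
    """Read any known version, filling defaults for fields old files lack."""
    if data[:4] != MAGIC:
        raise ValueError("not a DEMO record")
    (version,) = struct.unpack_from("<H", data, 4)
    if version == 1:
        # v1 files stored only the atom count; charge defaults to 0.
        (num_atoms,) = struct.unpack_from("<I", data, 6)
        return {"num_atoms": num_atoms, "charge": 0}
    if version == 2:
        num_atoms, charge = struct.unpack_from("<Ii", data, 6)
        return {"num_atoms": num_atoms, "charge": charge}
    raise ValueError(f"unsupported version {version}")


# A file written by an "old" release is still readable by the new reader.
old_file = MAGIC + struct.pack("<HI", 1, 12)
assert read_record(old_file) == {"num_atoms": 12, "charge": 0}
assert read_record(write_record(12, -1)) == {"num_atoms": 12, "charge": -1}
```

Old readers are never deleted in this scheme, only added to, which is what lets newer releases keep opening older files.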
Re: [Rdkit-discuss] What is RDkit?
On 9/12/07, Noel O'Boyle wrote:
>
> Just a note: you can point the SF home page to go to whatever website you
> want.

Yep; I just had never updated that after registering the rdkit.org domain. It's fixed(ish) now.

> > My last look at Blue Obelisk was, admittedly, a while ago, but I
> > somehow got the impression at the time that it was pretty
> > Java-centric. Is this true?
>
> I can understand your impression. I'm not into Java though. The other
> side is the C++ code of OpenBabel, and Python code for various things
> (e.g. GaussSum, cclib for comp chem, and Python bindings for
> OpenBabel). Also, the BO is involved in sharing chemical data, e.g.
> names of elements, atomic weights and so on, so that we don't keep
> having to re-enter this information in different software. (Check out
> Blue Obelisk Data Repository on Google). Interoperability is also one
> of our goals.

I'll take a fresh look at what's on the website (particularly the data pages); thanks for the pointer.

> In the end, the BO is a loosely coupled bunch of chemists/computer
> scientists with largely the same goals but lots of different ideas
> about getting there. In short, I recommend subscribing to the mailing
> list, or the RSS feed, checking out the wiki, and keeping an eye on
> things. There's a couple of very interesting blogs too, if I do say so
> myself..

Nice ad. :-) I'll keep an eye on things and we can collaboratively see if there's useful overlap.

-greg
Re: [Rdkit-discuss] What is RDkit?
On 12/09/2007, Greg Landrum wrote:
> Hi Noel,
>
> Sorry for the very slow posting of your message and reply; I was on
> vacation until yesterday and needed to approve your posting since you
> aren't subscribed to the discuss list; that shouldn't be a problem
> from now on.

Don't worry about it. Hope you had a good holiday.

> On 8/27/07, Noel O'Boyle wrote:
> > Any chance of more info on what RDkit contains? E.g. an API, or a
> > website. Although Open Source projects are infamous for incomplete
> > documentation, you seem to have taken this to an extreme :-)
>
> ah hah! that's where you're wrong! :-)
> We actually do have some documentation, we just don't have any
> obvious links to it from SourceForge. This is an oversight on my part
> and something I will clear up.
>
> There are some useful links here:
> http://www.rdkit.org/
> specifically to the overview PDF:
> http://www.rdkit.org/RDKit_Overview.pdf
> and some (automatically generated and somewhat out of date) API documentation:
> http://www.rdkit.org/C++_Docs

Just a note: you can point the SF home page to go to whatever website you want.

> There's also an introduction to using the code from Python on
> SourceForge that provides something of an overview of the
> functionality:
> http://downloads.sourceforge.net/rdkit/GettingStartedInPython.pdf

Great! I'll read up on it.

> > I'm involved in other open source cheminformatics packages (via the
> > BlueObelisk, with varying degrees of documentation it must be
> > admitted) so it would be good to know to what extent there is overlap,
> > and whether we could share code, etc...
>
> My last look at Blue Obelisk was, admittedly, a while ago, but I
> somehow got the impression at the time that it was pretty
> Java-centric. Is this true?

I can understand your impression. I'm not into Java though. The other side is the C++ code of OpenBabel, and Python code for various things (e.g. GaussSum, cclib for comp chem, and Python bindings for OpenBabel).
Also, the BO is involved in sharing chemical data, e.g. names of elements, atomic weights and so on, so that we don't keep having to re-enter this information in different software. (Check out Blue Obelisk Data Repository on Google). Interoperability is also one of our goals.

In the end, the BO is a loosely coupled bunch of chemists/computer scientists with largely the same goals but lots of different ideas about getting there. In short, I recommend subscribing to the mailing list, or the RSS feed, checking out the wiki, and keeping an eye on things. There's a couple of very interesting blogs too, if I do say so myself.

Noel

> Regards,
> -greg
Re: [Rdkit-discuss] What is RDkit?
Hi Noel,

Sorry for the very slow posting of your message and reply; I was on vacation until yesterday and needed to approve your posting since you aren't subscribed to the discuss list; that shouldn't be a problem from now on.

On 8/27/07, Noel O'Boyle wrote:
> Any chance of more info on what RDkit contains? E.g. an API, or a
> website. Although Open Source projects are infamous for incomplete
> documentation, you seem to have taken this to an extreme :-)

ah hah! that's where you're wrong! :-)
We actually do have some documentation, we just don't have any obvious links to it from SourceForge. This is an oversight on my part and something I will clear up.

There are some useful links here:
http://www.rdkit.org/
specifically to the overview PDF:
http://www.rdkit.org/RDKit_Overview.pdf
and some (automatically generated and somewhat out of date) API documentation:
http://www.rdkit.org/C++_Docs

There's also an introduction to using the code from Python on SourceForge that provides something of an overview of the functionality:
http://downloads.sourceforge.net/rdkit/GettingStartedInPython.pdf

> I'm involved in other open source cheminformatics packages (via the
> BlueObelisk, with varying degrees of documentation it must be
> admitted) so it would be good to know to what extent there is overlap,
> and whether we could share code, etc...

My last look at Blue Obelisk was, admittedly, a while ago, but I somehow got the impression at the time that it was pretty Java-centric. Is this true?

Regards,
-greg