Re: [cellml-discussion] Using CellML to represent huge CellML models: Has anyone worked on this already?
> I am working on developing a CellML model (using external code) of transcriptional control in yeast which is 23 MB in size. I hope to eventually do a similar thing for organisms which have much more complicated sets of interactions, in which case this size may grow substantially.

So you have 23MB of XML? Cool! Even combining all my models I have less than 7MB, and even then I'm sure that figure includes some simulation results. I guess an interesting test would be uploading it to the model repository to see how that handles such a large model (presuming you have a CellML 1.0 model).

> If anyone on this list is interested in similar problems (I presume similar issues come up in a range of systems biology problems, whether you are working with CellML or SBML), I would welcome your feedback and suggestions, and perhaps we could collaborate.

I really have no idea what a transcriptional control in yeast model looks like, but my initial thought would be to abstract out any similar math and import common declarations - I'm guessing you have already done this if it's possible.

> This creates some unique issues for CellML processing tools:
> 1) Just parsing the CellML model (especially with a DOM-type parser which stores all the nodes into a tree, but probably with any type of parser) is very slow.

It might be interesting to look at doing some simple task to check the performance of DOM vs SAX based tools? I have found in the past that with 500MB FieldML files the SAX parser used in CMGUI was quite fast at parsing the file - especially if you go from a gzip compressed file.

> 2) The CellML model might not all fit in memory at the same time, especially if the model gets to be multi-gigabyte. It might be possible to make use of swap to deal with this, but if the algorithms don't have explicit control over when things are swapped in and out, it will be hard to work with such a model.

I think if you have a model getting that large then there needs to be some serious thinking about how to handle such models... but generally can't you just let the OS worry about swapping in and out as required? Or would you expect a customised scheme for a particular application to be more efficient?

> C) Another leaner, read-only CellML API (perhaps based off the same IDLs, but with certain functionality, like the ability to modify the model, or set mutation event listeners, unavailable). We could add a SAX-style event dispatcher instead, to allow users to save any information they do want from extension elements, which will also not be kept in the model. Comments, white-space, and so on would all be stripped, unlike in the current CellML API implementation. Tools which are currently using the full CellML API but only require read-only access (e.g. the CCGS) might be able to just 'flick the switch' and benefit from the leaner API.

This would probably be beneficial even for those of us without such large models - especially if it is as easy as flicking a switch to swap between the complete and restricted implementations.

Andre.

___
cellml-discussion mailing list
cellml-discussion@cellml.org
http://www.cellml.org/mailman/listinfo/cellml-discussion
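As a rough illustration of the "simple task" idea for comparing SAX-style and DOM-style tools, here is a minimal Python sketch that streams a gzip-compressed CellML file through the standard-library SAX parser and counts the input variables of each component, without ever building a tree. It is only a sketch under assumptions: the handler class, the function names and the tiny sample document are made up for illustration, and it assumes unprefixed CellML element names (a default namespace). It is not how CMGUI or the CellML API actually do it.

```python
import gzip
import io
import xml.sax


class ComponentInputCounter(xml.sax.ContentHandler):
    """Count public_interface="in" variables per component while streaming."""

    def __init__(self):
        super().__init__()
        self.inputs = {}      # component name -> number of input variables
        self._current = None  # name of the component currently being read

    def startElement(self, name, attrs):
        # Assumption: element names carry no namespace prefix.
        if name == "component":
            self._current = attrs.get("name")
            self.inputs[self._current] = 0
        elif name == "variable" and self._current is not None:
            if attrs.get("public_interface") == "in":
                self.inputs[self._current] += 1

    def endElement(self, name):
        if name == "component":
            self._current = None


def count_inputs(stream):
    """Parse a binary file-like object; memory use stays roughly constant."""
    handler = ComponentInputCounter()
    xml.sax.parse(stream, handler)
    return handler.inputs


def count_inputs_gzip(path):
    """Same task, reading straight from a gzip-compressed file (hypothetical path)."""
    with gzip.open(path, "rb") as stream:
        return count_inputs(stream)


# Tiny hypothetical document, just enough to exercise the handler.
SAMPLE = b"""<model xmlns="http://www.cellml.org/cellml/1.0#" name="interactions">
  <component name="PAU8">
    <variable name="sig_PAU8" initial_value="0" units="signal_level" public_interface="out"/>
    <variable name="sig_SUT1" units="signal_level" public_interface="in"/>
    <variable name="sig_STE12" units="signal_level" public_interface="in"/>
  </component>
</model>"""

if __name__ == "__main__":
    compressed = io.BytesIO(gzip.compress(SAMPLE))
    with gzip.open(compressed, "rb") as stream:
        print(count_inputs(stream))
```

Timing this against the same count done with a tree-building parser (for example xml.dom.minidom) on the real 23 MB file would give a first idea of how much of the slowness comes from building the tree.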
Re: [cellml-discussion] Using CellML to represent huge CellML models: Has anyone worked on this already?
David Nickerson wrote:
>> I am working on developing a CellML model (using external code) of transcriptional control in yeast which is 23 MB in size. I hope to eventually do a similar thing for organisms which have much more complicated sets of interactions, in which case this size may grow substantially.
>
> So you have 23MB of XML? Cool! Even combining all my models I have less than 7MB, and even then I'm sure that figure includes some simulation results.

My model is entirely generated from experimental data; none of it is written by hand (aside from a one-page script used to generate CellML from the relational database).

> I guess an interesting test would be uploading it to the model repository to see how that handles such a large model (presuming you have a CellML 1.0 model).

It is currently a CellML 1.0 model. I'm not sure I want to break the live Plone, however. I'm also not sure it is much use to anyone else at this stage.

>> If anyone on this list is interested in similar problems (I presume similar issues come up in a range of systems biology problems, whether you are working with CellML or SBML), I would welcome your feedback and suggestions, and perhaps we could collaborate.
>
> I really have no idea what a transcriptional control in yeast model looks like, but my initial thought would be to abstract out any similar math and import common declarations - I'm guessing you have already done this if it's possible.

My model only has machine-learning external code in it; it doesn't have any equations at the moment. Just to give you an idea of what it looks like...

<model xmlns="http://www.cellml.org/cellml/1.0#" name="interactions">
  <component name="PAU8">
    <variable name="sig_PAU8" initial_value="0" units="signal_level" public_interface="out"/>
    <variable name="sig_SUT1" units="signal_level" public_interface="in"/>
    <variable name="sig_STE12" units="signal_level" public_interface="in"/>
    <variable name="sig_ADR1" units="signal_level" public_interface="in"/>
    <variable name="sig_YAP5" units="signal_level" public_interface="in"/>
    <variable name="sig_RME1" units="signal_level" public_interface="in"/>
    <variable name="sig_TEC1" units="signal_level" public_interface="in"/>
    <variable name="sig_SWI5" units="signal_level" public_interface="in"/>
    <variable name="sig_ARR1" units="signal_level" public_interface="in"/>
    <variable name="sig_MET31" units="signal_level" public_interface="in"/>
    <variable name="sig_RLM1" units="signal_level" public_interface="in"/>
    <variable name="sig_INO4" units="signal_level" public_interface="in"/>
    <variable name="sig_RAP1" units="signal_level" public_interface="in"/>
    <variable name="sig_MOT3" units="signal_level" public_interface="in"/>
    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <apply><eq/>
        <ci>sig_PAU8</ci>
        <apply>
          <csymbol definitionURL="http://www.bioeng.auckland.ac.nz/people/miller/black_box/k-nearest-neighbours">blackbox</csymbol>
          <ci>sig_SUT1</ci>
          <ci>sig_STE12</ci>
          <ci>sig_ADR1</ci>
          <ci>sig_YAP5</ci>
          <ci>sig_RME1</ci>
          <ci>sig_TEC1</ci>
          <ci>sig_SWI5</ci>
          <ci>sig_ARR1</ci>
          <ci>sig_MET31</ci>
          <ci>sig_RLM1</ci>
          <ci>sig_INO4</ci>
          <ci>sig_RAP1</ci>
          <ci>sig_MOT3</ci>
        </apply>
      </apply>
    </math>
  </component>
  <component name="YAL067W_A">
    <variable name="sig_YAL067W_A" initial_value="0" units="signal_level" public_interface="out"/>
    <variable name="sig_SPT23" units="signal_level" public_interface="in"/>
    <variable name="sig_STE12" units="signal_level" public_interface="in"/>
    <variable name="sig_DAL80" units="signal_level" public_interface="in"/>
    <variable name="sig_YAP5" units="signal_level" public_interface="in"/>
    <variable name="sig_BAS1" units="signal_level" public_interface="in"/>
    <variable name="sig_DIG1" units="signal_level" public_interface="in"/>
    <variable name="sig_PHO2" units="signal_level" public_interface="in"/>
    <variable name="sig_HAP2" units="signal_level" public_interface="in"/>
    <variable name="sig_PHD1" units="signal_level" public_interface="in"/>
    <variable name="sig_GLN3" units="signal_level" public_interface="in"/>
    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <apply><eq/>
        <ci>sig_YAL067W_A</ci>
        <apply>
          <csymbol definitionURL="http://www.bioeng.auckland.ac.nz/people/miller/black_box/k-nearest-neighbours">blackbox</csymbol>
          <ci>sig_SPT23</ci>
          <ci>sig_STE12</ci>
          <ci>sig_DAL80</ci>
          <ci>sig_YAP5</ci>
          <ci>sig_BAS1</ci>
          <ci>sig_DIG1</ci>
          <ci>sig_PHO2</ci>
          <ci>sig_HAP2</ci>
          <ci>sig_PHD1</ci>
          <ci>sig_GLN3</ci>
        </apply>
      </apply>
    </math>
  </component>
  ...

Note that the initial_value=0 is a place-holder, I could abstract out my blackbox function calls based on the number of parameters (it is variable, from 1 through to 41, in this case, although there is no theoretical limit on how many putative transcription factors could affect a signal). However, I suspect that this would not solve the performance problems (it takes