David Nickerson wrote:
>> I am working on developing a CellML model (using external code) of
>> transcriptional control in yeast which is 23 MB in size. I hope to
>> eventually do a similar thing for organisms which have much more
>> complicated sets of interactions, in which case this size may grow
>> substantially.
>>
>
> so you have 23MB of XML? Cool! Even combining all my models I have less
> than 7MB, and even then I'm sure that figure includes some simulation
> results.
>
My model is entirely generated from experimental data, none of it is
written by hand (aside from a one-page script used to generate CellML
from the relational database).
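
To give a feel for that generation step, it boils down to something like
the following (the schema here - an "interaction" table with "target"
and "regulator" columns - is invented for illustration, and the real
script emits full MathML rather than a comment):

```python
# Hypothetical sketch of generating CellML 1.0 from a relational
# database of regulatory interactions. The table and column names
# are invented; the emitted math is abbreviated to a comment.
import sqlite3
from xml.sax.saxutils import escape, quoteattr

def write_cellml(conn, out):
    rows = conn.execute(
        "SELECT target, regulator FROM interaction ORDER BY target, regulator")
    by_target = {}
    for target, regulator in rows:
        by_target.setdefault(target, []).append(regulator)
    out.write('<model xmlns="http://www.cellml.org/cellml/1.0#"'
              ' name="interactions">\n')
    out.write('  <component name="interactions">\n')
    for target, regulators in sorted(by_target.items()):
        # initial_value="0" is a place-holder, as in the real model.
        out.write('    <variable name=%s initial_value="0"/>\n'
                  % quoteattr(target))
        out.write('    <!-- %s = blackbox(%s) -->\n'
                  % (escape(target), escape(", ".join(regulators))))
    out.write('  </component>\n</model>\n')
```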
> I guess an interesting test would be uploading it to the model
> repository to see how that handles such a large model (presuming you
> have a CellML 1.0 model).
>
It is currently a CellML 1.0 model. I'm not sure I want to break the
live Plone, though, and I'm not sure the model is much use to anyone
else at this stage anyway.
>
>> If anyone on this list is interested in similar problems (I presume
>> similar issues come up in a range of systems biology problems, whether
>> you are working with CellML or SBML), I would welcome your feedback and
>> suggestions, and perhaps we could collaborate.
>>
>
> I really have no idea what a model of transcriptional control in yeast
> looks like, but my initial thought would be to abstract out any similar
> math and import common declarations - I'm guessing you have already done
> this if it's possible.
>
My model only contains machine-learning external code; it doesn't have
any conventional equations at the moment. Just to give you an idea of
what it looks like...
<component xmlns="http://www.cellml.org/cellml/1.0#" name="interactions">
  <math xmlns="http://www.w3.org/1998/Math/MathML">
    <apply><eq/>
      <ci>sig_PAU8</ci>
      <apply>
        <csymbol definitionURL="http://www.bioeng.auckland.ac.nz/people/miller/black_box/k-nearest-neighbours">blackbox</csymbol>
        <ci>sig_SUT1</ci>
        <ci>sig_STE12</ci>
        <ci>sig_ADR1</ci>
        <ci>sig_YAP5</ci>
        <ci>sig_RME1</ci>
        <ci>sig_TEC1</ci>
        <ci>sig_SWI5</ci>
        <ci>sig_ARR1</ci>
        <ci>sig_MET31</ci>
        <ci>sig_RLM1</ci>
        <ci>sig_INO4</ci>
        <ci>sig_RAP1</ci>
        <ci>sig_MOT3</ci>
      </apply>
    </apply>
  </math>
  <math xmlns="http://www.w3.org/1998/Math/MathML">
    <apply><eq/>
      <ci>sig_YAL067W_A</ci>
      <apply>
        <csymbol definitionURL="http://www.bioeng.auckland.ac.nz/people/miller/black_box/k-nearest-neighbours">blackbox</csymbol>
        <ci>sig_SPT23</ci>
        <ci>sig_STE12</ci>
        <ci>sig_DAL80</ci>
        <ci>sig_YAP5</ci>
        <ci>sig_BAS1</ci>
        <ci>sig_DIG1</ci>
        <ci>sig_PHO2</ci>
        <ci>sig_HAP2</ci>
        <ci>sig_PHD1</ci>
        <ci>sig_GLN3</ci>
      </apply>
    </apply>
  </math>
  ...
</component>
Note that the initial_value="0" is a place-holder.
I could abstract out my blackbox function calls based on the number of
parameters (it varies, from 1 through to 41 in this case, although
there is no theoretical limit on how many putative transcription factors
could affect a signal). However, I suspect that this would not solve the
performance problems: it already takes 8 seconds to load the model into
the CellML API, and it would be much costlier if the API had to load
literally thousands of copies of the same imported file. This could be
optimised using a cache, but I still don't think it would help very much.
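
The cache idea is straightforward to sketch, for what it's worth. This
is not the CellML API's actual import resolver, just an illustration of
the principle using Python's stdlib DOM:

```python
# Sketch of the import-cache idea: parse each imported file once and
# reuse the resulting DOM for every later import of the same URL/path.
# This is an illustration, not the actual CellML API resolver.
import functools
import xml.dom.minidom

@functools.lru_cache(maxsize=None)
def load_import(url):
    # A real resolver would fetch http:// URLs; we assume local
    # file paths here for simplicity.
    return xml.dom.minidom.parse(url)
```

With this, thousands of imports of the same blackbox component cost one
parse plus cheap dictionary lookups; but the remaining work of
instantiating each import into the model is exactly what a cache can't
remove, which is why I doubt it would help much.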
>
>> This creates some unique issues for CellML processing tools:
>> 1) Just parsing the CellML model (especially with a DOM-type parser
>> which stores all the nodes into a tree, but probably with any type of
>> parser) is very slow.
>>
>
> it might be interesting to try some simple task to compare the
> performance of DOM- vs SAX-based tools? I have found in the past that
> the SAX parser used in CMGUI was quite fast at parsing 500MB "fieldML"
> files - especially if you read straight from a gzip-compressed file.
>
Possibly, but the current CellML API implementation needs an underlying
DOM representation, and while 8 seconds to parse the file is a long
time, it is probably one of the smaller issues compared to the time
taken to actually do things with the model.
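
That said, if anyone wants to try the comparison, a minimal harness
using Python's stdlib parsers (not the CellML API's DOM) might look
like this:

```python
# Rough harness for comparing DOM vs SAX parse times, including
# reading straight from a gzip-compressed file as suggested above.
import gzip
import time
import xml.dom.minidom
import xml.sax

class ElementCounter(xml.sax.ContentHandler):
    """Counts start tags so the SAX pass does a token of real work."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1

def time_dom(path):
    start = time.perf_counter()
    doc = xml.dom.minidom.parse(path)   # builds the whole tree in memory
    return time.perf_counter() - start, doc

def time_sax(path):
    handler = ElementCounter()
    opener = gzip.open if path.endswith(".gz") else open
    start = time.perf_counter()
    with opener(path, "rb") as f:
        xml.sax.parse(f, handler)       # streaming; constant memory here
    return time.perf_counter() - start, handler.count
```

The DOM figure includes tree construction, which is the part that
dominates for a 23MB file; the SAX figure is close to pure tokenising
cost.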
>
>> 2) The CellML model might not all fit in memory at the same time,
>> especially if the model gets to be multi-gigabyte. It might be possible
>> to make use of swap to deal with this, but if the algorithms don't have
>> explicit control over when things are swapped in and out, it will be
>> hard to work with such a model.
>>
>
> I think if you have a model getting that large then there needs to be
> some serious thinking about how to handle such models...but generally
> can't you just let the OS worry about swapping in and out as required?
> Or would you expect a customised scheme for a particular application to
> be more efficient?
>
It probably depends on the algorithm. However, if you have a huge amount
of data, you might be better off building some sort of index, and to do this
efficiently, it is generally better to k