Re: couchdb for genome data

Tom Sante Thu, 04 Mar 2010 01:09:57 -0800

Indeed, using "type" could be the alternative to MongoDB collections. But my question is: if I have billions of documents in the DB, would this make view generation very slow and take up a lot of disk space just to be able to search for all probes with a certain experiment_id? Like I said, the data is structured in experiments, so almost all queries and changes to the data will be within an experiment, with no need to act on the huge number of probes from the other experiments.
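
For reference, a minimal sketch of what "search for all probes with a certain experiment_id" could look like as a view request. The database name "arraycgh", the design document "probes" and the view "by_experiment" are hypothetical names, not something from this thread; only the `key` and `include_docs` query parameters are standard CouchDB view options.

```javascript
// Hypothetical names: database "arraycgh", design document "probes",
// view "by_experiment". None of these appear in the thread.
function viewUrl(base, experimentId) {
  // CouchDB expects view keys as JSON, so the value is JSON-encoded
  // and then URL-escaped.
  const key = encodeURIComponent(JSON.stringify(experimentId));
  return base + "/arraycgh/_design/probes/_view/by_experiment" +
         "?key=" + key + "&include_docs=true";
}

const url = viewUrl("http://localhost:5984", 1234567890);
console.log(url);
```

The request only reads the index rows for that one key, so the other experiments' probes are not touched at query time.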

Thanks,
Tom

On 4 Mar 2010, at 08:35, km <[email protected]> wrote:

Hi,
You could have an additional key in the document identifying it as a probe,
eg "type" (key) with value "probe", like this:

{
      "type" : "probe",
      "probe_id" : 1234567890,
      "experiment_id" : 1234567890,
      "raw_value" : 0.43524,
      "analysis": { "cbs" : 0.436, "CBS+GLAD" : 0.4356 }
}

So all your probe documents would contain a key called "type" set to
"probe". You can identify only these documents with this key.
Now when you design a view to search probe documents alone, you could use a
simple filter statement like this:
if (doc.type == 'probe') { do something ... }
This will only search/index probe type documents.
NOTE: "type" is not a special key; it is user defined just like any other
key, so you can use any other name for it!
You might have other types of documents, for which the value of "type" will
differ accordingly.
Here there is no need to explicitly define a collection as in MongoDB.
All JSON documents could be stored in a single database.
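
A minimal sketch of the filter above as a complete map function, keyed on experiment_id so that probes can be looked up per experiment. The function name, the choice of emitted value, and the small `emit` stand-in are assumptions made so the example runs under Node.js; inside CouchDB the map function is stored anonymously in a design document and `emit` is provided by the server.

```javascript
// Map function as it could be stored in a design document.
// CouchDB calls it once per document and supplies emit().
function mapProbesByExperiment(doc) {
  if (doc.type === 'probe') {
    // Key: experiment_id, value: probe_id (an illustrative choice)
    emit(doc.experiment_id, doc.probe_id);
  }
}

// Tiny stand-in for CouchDB's indexer so the sketch runs locally
const rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Made-up sample documents; the last one is not a probe and is skipped
const docs = [
  { type: "probe", probe_id: 1, experiment_id: 10, raw_value: 0.43524 },
  { type: "probe", probe_id: 2, experiment_id: 11, raw_value: 0.5 },
  { type: "experiment", experiment_id: 10 }
];
docs.forEach(mapProbesByExperiment);
console.log(rows);
```

Documents without "type" set to "probe" produce no rows, so they take no space in this particular index.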

HTH,
Krishna
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On Thu, Mar 4, 2010 at 7:21 AM, Tom Sante <[email protected]> wrote:
Hi
The data is now stored in a MySQL table with about a billion (1000 million)
rows.
These rows are the data of a genetic test (arrayCGH) and are built up like
this:
Every experiment (a few thousand of them in total) contains measurements of
about 180000 genetic probes. This raw data will be analyzed and the values
run through different algorithms, so every probe needs to store more than 1
value after the analysis is done. The values of the different analyses are now
stored in columns in that table, making it a pain if we have to add an
analysis that is not yet part of the existing columns. This is why a
schema-free document-based DB is probably a better fit.
The initial idea was to give each probe a separate document, and when the
original value is transformed into another value, store this in the same
document.

{
      "probe_id" : 1234567890,
      "experiment_id" : 1234567890,
      "raw_value" : 0.43524,
      "analysis": { "cbs" : 0.436, "CBS+GLAD" : 0.4356 }
}
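
A minimal sketch of the "store this in the same document" idea: a new analysis result is added as one more key under "analysis", with no schema change. The helper name `addAnalysis`, the analysis name "new_algorithm" and its value are hypothetical and only serve as illustration.

```javascript
// Hypothetical helper: returns a copy of a probe document with one more
// analysis result, leaving the existing results untouched.
function addAnalysis(doc, name, value) {
  const analysis = Object.assign({}, doc.analysis);
  analysis[name] = value;
  return Object.assign({}, doc, { analysis: analysis });
}

const probe = {
  probe_id: 1234567890,
  experiment_id: 1234567890,
  raw_value: 0.43524,
  analysis: { "cbs": 0.436, "CBS+GLAD": 0.4356 }
};

// "new_algorithm" and 0.44 are made up for illustration
const updated = addAnalysis(probe, "new_algorithm", 0.44);
console.log(JSON.stringify(updated.analysis));
```

In CouchDB the updated document would then be written back together with its current _rev.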
Once added to the database, almost all changes to the data will be contained
within an experiment.

MongoDB has something like collections that would be an appropriate
abstraction ~ experiment. But in CouchDB I would have to add all these probe
documents in 1 big database without collections. So if I only make changes
to probes within an experiment, this would influence the views of all the
other billions of documents in the db. Because of the large number of
documents it would be good to know beforehand what the implications of this
are performance-wise.

Any suggestions are welcome.

Tom
