Re: couchdb for genome data

Metin Akat Thu, 04 Mar 2010 01:31:17 -0800

I wouldn't recommend storing billions of documents in a single database.
Think about partitioning the data across lots of databases.
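
Since almost all queries stay within one experiment, one natural partition key is experiment_id, with one database per experiment. A rough sketch of the routing follows. The helper names and the "experiment_" prefix are made up for illustration and are not a CouchDB convention:

```javascript
// Hypothetical helper: map an experiment_id to its own CouchDB database name.
// The "experiment_" prefix is an assumption chosen for this example.
function dbNameForExperiment(experimentId) {
  if (!Number.isInteger(experimentId) || experimentId < 0) {
    throw new TypeError("experiment_id must be a non-negative integer");
  }
  // CouchDB database names must start with a lowercase letter and may contain
  // only lowercase letters, digits, and the characters _ $ ( ) + - /
  return "experiment_" + String(experimentId);
}

// Hypothetical helper: group probe documents by target database so that each
// batch can be posted to that database's _bulk_docs endpoint.
function groupByDatabase(probes) {
  const batches = new Map();
  for (const probe of probes) {
    const name = dbNameForExperiment(probe.experiment_id);
    if (!batches.has(name)) {
      batches.set(name, []);
    }
    batches.get(name).push(probe);
  }
  return batches;
}
```

With this layout, a view inside one experiment's database is only rebuilt when that experiment's probes change. The trade-off is that queries spanning experiments have to be fanned out over many databases by the client.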


On Thu, Mar 4, 2010 at 11:08 AM, Tom Sante <[email protected]> wrote:
> Indeed, using "type" could be an alternative to MongoDB collections. But my
> question is: if I have billions of documents in the DB, would this make view
> generation very slow and take up a lot of disk space just to be able to search
> for all probes with a certain experiment_id? As I said, the data is
> structured in experiments, so almost all queries and changes to the data will
> be within an experiment, with no need to act on the huge number of probes
> from the other experiments.
> Thanks,
> Tom
>
> On 4 Mar 2010, at 08:35, km <[email protected]> wrote:
>>
>> Hi,
>>
>> You could have an additional key in the document identifying it as probe -
>> eg "type" (key) with value  "probe" like this:
>>
>> {
>>      "type" : "probe",
>>      "probe_id" : 1234567890,
>>      "experiment_id" : 1234567890,
>>      "raw_value" : 0.43524,
>>      "analysis": { "cbs" : 0.436, "CBS+GLAD" : 0.4356 }
>> }
>>
>> So all your probe documents would contain a key called "type" set to
>> "probe". You can identify only these documents with this key.
>> Now when you design a view to search probe documents alone, you could use a
>> simple filter statement like this:
>> if(doc.type=='probe'){ do something ...}
>> This will only search/index probe type documents.
>>
>> NOTE: "type" is not a reserved key; it is a user-defined key just like any
>> other key, so you can use any other name for it!
>>
>> You might have other types of documents, for which the value of the type
>> key will differ accordingly.
>> Here there is no need to explicitly define a collection as in MongoDB.
>> All JSON documents could be stored in a single database.
>>
>> HTH,
>> Krishna
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> On Thu, Mar 4, 2010 at 7:21 AM, Tom Sante <[email protected]> wrote:
>>
>>> Hi
>>>
>>> The data is now stored in a MySQL table with about a billion (1000
>>> million) rows.
>>> These rows are the data of a genetic test (arrayCGH) and are built up
>>> like this:
>>>
>>> Every experiment (a few thousand of them in total) contains measurements
>>> of about 180000 genetic probes. This raw data will be analyzed and the
>>> values run through different algorithms, so every probe needs to store
>>> more than one value after the analysis is done. The values of the
>>> different analyses are now stored in columns in that table, making it a
>>> pain to add an analysis that is not yet part of the existing columns.
>>> This is why a schema-free, document-based DB is probably a better fit.
>>> The initial idea was to give each probe a separate document, and when the
>>> original value is transformed into another value, store this in the same
>>> document.
>>>
>>> {
>>>      "probe_id" : 1234567890,
>>>      "experiment_id" : 1234567890,
>>>      "raw_value" : 0.43524,
>>>      "analysis": { "cbs" : 0.436, "CBS+GLAD" : 0.4356 }
>>> }
>>>
>>> Once added to the database, almost all changes to the data will be
>>> contained within an experiment.
>>>
>>> MongoDB has something like collections, which would be an appropriate
>>> abstraction for an experiment. But in CouchDB I would have to add all
>>> these probe documents to one big database without collections. So if I
>>> only make changes to probes within an experiment, this would influence
>>> the views of all the other billions of documents in the DB. Because of
>>> the large number of documents, it would be good to know beforehand what
>>> the performance implications of this are.
>>>
>>> Any suggestions are welcome.
>>>
>>> Tom
>>>
>
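
For reference, the type filter quoted above from Krishna's reply could be written out as a complete map function keyed by experiment_id. This is only a sketch: the function name and the choice of probe_id as the emitted value are assumptions, and emit is stubbed here so the snippet runs outside CouchDB (inside CouchDB the view server provides it):

```javascript
// Stand-in for the emit() that CouchDB's view server provides to map functions.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function: index only probe documents, keyed by experiment_id.
// In a design document this would be stored as an anonymous function string.
function probesByExperiment(doc) {
  if (doc.type === 'probe') {
    emit(doc.experiment_id, doc.probe_id);
  }
}

// Local dry run over two sample documents; the second is not a probe.
probesByExperiment({
  type: 'probe',
  probe_id: 1234567890,
  experiment_id: 1234567890,
  raw_value: 0.43524
});
probesByExperiment({ type: 'experiment', experiment_id: 1234567890 });
```

Querying such a view with ?key=1234567890 would return only the probes of that experiment, but the index itself still covers every probe document in the database it lives in.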
