Re: couchdb for genome data

Tom Sante Thu, 04 Mar 2010 01:47:37 -0800

There is going to be some partitioning of the data, and using a faster view server might help too. The only issue I have left is that I can't use too many views, because storing the generated views for that many documents will take lots of disk space. Then again, disks are cheap and easy to add, and that is an acceptable trade-off for fast queries. And if I use a key like doc.type + doc.experiment + doc.genome_position, then that could also limit the need for more than one view.
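
A minimal sketch of such a compound key, written as a CouchDB-style map function that emits an array key. It assumes each probe document has a genome_position field (not shown in the sample document in this thread) and uses experiment_id for the experiment part of the key. The emit() stand-in and the sample documents are made up so the snippet runs under plain Node, and the sort only approximates CouchDB's key collation for this simple case.

```javascript
// Stand-in for CouchDB's emit(): collects rows so the sketch runs under plain Node.
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}

// Map function as it might appear in a design document.
// ASSUMPTION: every probe document carries a numeric genome_position field.
const mapProbe = function (doc) {
  if (doc.type === 'probe') {
    emit([doc.type, doc.experiment_id, doc.genome_position], doc.raw_value);
  }
};

// Hypothetical sample documents.
const docs = [
  { type: 'probe', experiment_id: 2, genome_position: 500, raw_value: 0.12 },
  { type: 'probe', experiment_id: 1, genome_position: 900, raw_value: 0.43524 },
  { type: 'experiment', experiment_id: 1 },
  { type: 'probe', experiment_id: 1, genome_position: 100, raw_value: 0.2 },
];
docs.forEach(mapProbe);

// Simplified ordering: compare array keys element by element.
// Real CouchDB collation covers more types; here it is strings and numbers only.
rows.sort((a, b) => {
  for (let i = 0; i < a.key.length; i++) {
    if (a.key[i] < b.key[i]) return -1;
    if (a.key[i] > b.key[i]) return 1;
  }
  return 0;
});

// All rows of one experiment sit next to each other in the sorted index.
const experiment1 = rows.filter((row) => row.key[1] === 1);
console.log(experiment1.map((row) => row.key));
```

Against a real database, the same slice could be requested with a key range such as startkey=["probe",1] and endkey=["probe",1,{}], since an object sorts after numbers in view collation. One view could then serve queries per type, per experiment, and per position range within an experiment.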

Tom

On 4 Mar 2010, at 10:33, km <[email protected]> wrote:

Hi,

No, it would be fast.
All the documents are indexed according to the views in the database.
Temporary views have to scan each and every document in the database on every request, but permanent (saved) views only have to do that the first time. That first time, CouchDB goes through all the docs and indexes them according to the view definition.
Once indexed, accessing the same view retrieves results almost instantly.
(This first-time indexing would take a while if your database has billions of docs. You could probably also partition them into different databases according to category.)

It also updates the view indexes automatically when documents are added or removed, without you changing the views.

It's like having static views with dynamic data.

Krishna
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
On Thu, Mar 4, 2010 at 6:08 PM, Tom Sante <[email protected]> wrote:
Indeed, using "type" could be the alternative to MongoDB collections. But my question is: if I have billions of documents in the DB, would this make view generation very slow and take up a lot of disk space, just to be able to search for all probes with a certain experiment_id? Like I said, the data is structured in experiments, so almost all queries and changes to the data will be within an experiment, with no need to act on the huge number of probes from the other experiments.
Thanks,
Tom
On 4 Mar 2010, at 08:35, km <[email protected]> wrote:

Hi,
You could have an additional key in the document identifying it as a probe, e.g. "type" (key) with the value "probe", like this:

{
    "type": "probe",
    "probe_id" : 1234567890,
    "experiment_id" : 1234567890,
    "raw_value" : 0.43524,
    "analysis": { "cbs" : 0.436, "CBS+GLAD" : 0.4356 }
}
So all your probe documents would contain a key called "type" set to "probe". You can identify just these documents by this key.
Now when you design a view to search probe documents alone, you could use a simple filter statement like this:
if (doc.type == 'probe') { do something ... }
This will only index probe-type documents.
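
As a rough sketch of where that filter could live: a map function that indexes only probe documents by experiment_id, placed in a design document. The names "_design/probes" and "by_experiment" are invented for illustration, and emit() is replaced by a small stand-in so the snippet runs under plain Node rather than inside CouchDB.

```javascript
// Rows collected by the stand-in emit() below; inside CouchDB the view server provides emit().
const emitted = [];
function emit(key, value) {
  emitted.push({ key, value });
}

// Map function: skip everything that is not a probe, index probes by experiment.
const probesByExperiment = function (doc) {
  if (doc.type === 'probe') {
    emit(doc.experiment_id, null);
  }
};

// Design document holding the view; the map function is stored as a string.
// "_design/probes" and "by_experiment" are hypothetical names.
const designDoc = {
  _id: '_design/probes',
  language: 'javascript',
  views: {
    by_experiment: { map: probesByExperiment.toString() },
  },
};

// Simulate the indexing pass: feed every document to the map function.
const sampleDocs = [
  { _id: 'a', type: 'probe', probe_id: 1, experiment_id: 10, raw_value: 0.43524 },
  { _id: 'b', type: 'experiment', experiment_id: 10 },
  { _id: 'c', type: 'probe', probe_id: 2, experiment_id: 11, raw_value: 0.1 },
];
sampleDocs.forEach(probesByExperiment);

console.log(emitted.map((row) => row.key));
console.log(JSON.stringify(designDoc, null, 2));
```

Only the two probe documents produce index rows; the "experiment" document is ignored. Querying such a view with key=10 would then return just the probes of experiment 10.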
NOTE: "type" is not a reserved key; it is a user-defined key just like any other, so you can use any other name for it!
You might have other types of documents, for which the value of the type key will differ accordingly.
Here there is no need to explicitly define a collection as in MongoDB.
All JSON documents can be stored in a single database.

HTH,
Krishna
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On Thu, Mar 4, 2010 at 7:21 AM, Tom Sante <[email protected]> wrote:
Hi
The data is now stored in a MySQL table with about a billion (1000 million) rows.
These rows are the data of a genetic test (arrayCGH) and are built up like this:
Every experiment (a few thousand of them in total) contains measurements of about 180000 genetic probes. This raw data will be analyzed and the values run through different algorithms, so every probe needs to store more than one value after the analysis is done. The values of the different analyses are now stored in columns in that table, making it a pain if we have to add an analysis that is not yet part of the existing columns. This is why a schema-free, document-based DB is probably a better fit.
The initial idea was to give each probe a separate document, and when the original value is transformed into another value, to store this in the same document.

{
    "probe_id" : 1234567890,
    "experiment_id" : 1234567890,
    "raw_value" : 0.43524,
    "analysis": { "cbs" : 0.436, "CBS+GLAD" : 0.4356 }
}

Once added to the database, almost all changes to the data will be contained within an experiment.

MongoDB has something like collections, which would be an appropriate abstraction for an experiment. But in CouchDB I would have to add all these probe documents to one big database without collections. So if I only make changes to probes within an experiment, this would influence the views of all the other billions of documents in the DB. Because of the large number of documents, it would be good to know beforehand what the performance implications of this are.

Any suggestions are welcome.

Tom
