I wouldn't recommend storing billions of documents in a single database. Think about partitioning the data across lots of databases.
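
A minimal sketch of what such partitioning could look like, assuming one database per experiment. That split is an assumption here, not something stated above, and the helper names dbNameForExperiment and groupByDatabase and the "experiment_" prefix are hypothetical:

```javascript
// Hypothetical helper: derive a per-experiment database name.
// CouchDB database names must start with a lowercase letter,
// hence the "experiment_" prefix in front of the numeric id.
function dbNameForExperiment(experimentId) {
  return "experiment_" + String(experimentId);
}

// Group probe documents by the database they would be written to.
function groupByDatabase(docs) {
  const groups = {};
  for (const doc of docs) {
    const name = dbNameForExperiment(doc.experiment_id);
    if (!groups[name]) {
      groups[name] = [];
    }
    groups[name].push(doc);
  }
  return groups;
}
```

Each group could then be written to its own database. Views are built per database, so a change to the probes of one experiment would only touch that experiment's index.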
On Thu, Mar 4, 2010 at 11:08 AM, Tom Sante <[email protected]> wrote:
> Indeed, using "type" could be the alternative to MongoDB collections. But my
> question is: if I have billions of documents in the DB, would this make view
> generation very slow and take up a lot of disk space, just to be able to
> search for all probes with a certain experiment_id? Like I said, the data is
> structured in experiments, so almost all queries and changes to the data will
> be within an experiment, with no need to act on the huge number of probes
> from the other experiments.
>
> Thanks,
> Tom
>
> On 4 Mar 2010, at 08:35, km <[email protected]> wrote:
>>
>> Hi,
>>
>> You could have an additional key in the document identifying it as a probe,
>> e.g. "type" (key) with value "probe", like this:
>>
>> {
>>   "type": "probe",
>>   "probe_id": 1234567890,
>>   "experiment_id": 1234567890,
>>   "raw_value": 0.43524,
>>   "analysis": { "cbs": 0.436, "CBS+GLAD": 0.4356 }
>> }
>>
>> So all your probe documents would contain a key called "type" set to
>> "probe", and you can identify only these documents with this key.
>> Now when you design a view to search probe documents alone, you could use a
>> simple filter statement like this:
>>
>>   if (doc.type == 'probe') { do something ... }
>>
>> This will only search/index probe-type documents.
>>
>> NOTE: "type" is not a reserved key; it is a user-defined key just like any
>> other, so you can use any other name for it!
>>
>> You might have other types of documents, for which the value of "type" will
>> differ accordingly.
>> Here there is no need to explicitly define a collection as in MongoDB.
>> All JSON documents could be stored in a single database.
>>
>> HTH,
>> Krishna
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> On Thu, Mar 4, 2010 at 7:21 AM, Tom Sante <[email protected]> wrote:
>>
>>> Hi
>>>
>>> The data is now stored in a MySQL table with about a billion (1000
>>> million) rows.
>>> These rows are the data of a genetic test (arrayCGH) and are structured
>>> like this:
>>>
>>> Every experiment (a few thousand of them in total) contains measurements
>>> of about 180000 genetic probes. This raw data will be analyzed and the
>>> values run through different algorithms, so every probe needs to store
>>> more than one value after the analysis is done. The values of the
>>> different analyses are now stored in columns in that table, making it a
>>> pain if we have to add an analysis that is not yet part of the existing
>>> columns. This is why a schema-free, document-based DB is probably a
>>> better fit.
>>> The initial idea was to give each probe a separate document, and when the
>>> original value is transformed into another value, to store this in the
>>> same document.
>>>
>>> {
>>>   "probe_id": 1234567890,
>>>   "experiment_id": 1234567890,
>>>   "raw_value": 0.43524,
>>>   "analysis": { "cbs": 0.436, "CBS+GLAD": 0.4356 }
>>> }
>>>
>>> Once added to the database, almost all changes to the data will be
>>> contained within an experiment.
>>>
>>> MongoDB has something like collections, which would be an appropriate
>>> abstraction for an experiment. But in CouchDB I would have to add all
>>> these probe documents to one big database without collections. So if I
>>> only make changes to probes within an experiment, this would influence
>>> the views of all the other billions of documents in the DB. Because of
>>> the large number of documents, it would be good to know beforehand what
>>> the performance implications of this are.
>>>
>>> Any suggestions are welcome.
>>>
>>> Tom
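
For reference, the "type" filter described in the quoted message could be sketched as a map function along these lines. This is only an illustration: the function name probesByExperiment and the second example document are hypothetical, and the emit() defined here is a stand-in so the sketch runs outside CouchDB, where emit() is provided by the view server.

```javascript
// Stand-in for CouchDB's emit(), so the sketch runs outside the view server.
const emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Map function: index only probe documents, keyed by experiment_id.
function probesByExperiment(doc) {
  if (doc.type === "probe") {
    emit(doc.experiment_id, null);
  }
}

// Example documents; the second uses a hypothetical "experiment" type
// and is skipped by the filter.
const docs = [
  { type: "probe", probe_id: 1234567890, experiment_id: 1234567890, raw_value: 0.43524 },
  { type: "experiment", experiment_id: 1234567890 },
];
docs.forEach(probesByExperiment);
```

In a real design document the map function would be stored as an anonymous function(doc) { ... }. Querying the view with a given experiment_id as the key would then return only the probes of that experiment.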
