Hi Xueling,
Here's a general outline:
My guess is that your position-of-match field is bounded (perhaps by the
number of base pairs in the human genome?). Given this, you can probably
write a very simple Partitioner implementation that divides this field into
ranges, each with an approximately
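The range-partitioning idea above might look roughly like the following. This is a minimal, hypothetical sketch: the class name, the `MAX_POSITION` bound, and the assumption that match positions are non-negative longs are all illustrative, and in a real Hadoop job this logic would live inside a `Partitioner.getPartition()` override rather than a standalone class.

```java
// Hypothetical sketch of dividing a bounded position field into
// contiguous, roughly equal ranges. Not part of the Hadoop API.
public class RangePartitioner {

    // Assumed upper bound on the position-of-match field, e.g. roughly
    // the number of base pairs in the human genome (~3.2 billion).
    static final long MAX_POSITION = 3_200_000_000L;

    // Maps a position to one of numPartitions contiguous ranges.
    static int getPartition(long position, int numPartitions) {
        // Ceiling division so every position up to MAX_POSITION fits.
        long rangeSize = (MAX_POSITION + numPartitions - 1) / numPartitions;
        int partition = (int) (position / rangeSize);
        // Clamp the top edge in case of rounding at the boundary.
        return Math.min(partition, numPartitions - 1);
    }

    public static void main(String[] args) {
        System.out.println(getPartition(0L, 24));               // first range
        System.out.println(getPartition(MAX_POSITION - 1, 24)); // last range
    }
}
```

Because each reducer then receives one contiguous slice of the key space, the output files are globally range-ordered, which makes later range queries against a given position cheap.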
Hi Todd:
After finishing some other tasks, I have finally gotten back to HDFS testing.
One question about your last reply to this thread: are there any code examples
close to your second and third recommendations? Or which APIs should I start
with for my testing?
Thanks.
Xueling
On Sat, Dec 12, 2009 at 1:01 PM,
Hi Todd:
Thank you for your reply.
The datasets won't be updated often, but queries against a dataset are
frequent. The quicker the query, the better. For example, we have done
testing on a MySQL database (5 billion records randomly scattered across 24
tables), and the slowest query against the
You might also consider HBase, particularly if you find that your data is
being updated with some regularity, and especially if the updates are randomly
distributed over the data set. See
http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk for
+1 for hbase
On Sat, Dec 12, 2009 at 2:56 PM, Xueling Shu x...@systemsbiology.org wrote:
Great information! Thank you for your help, Todd.
Xueling
On Sat, Dec 12, 2009 at 1:01 PM, Todd Lipcon t...@cloudera.com wrote:
Hi Xueling,
In that case, I would recommend the following:
1)
Hi Xueling,
One important question that can really change the answer:
How often does the dataset change? Can the changes be merged in bulk
every once in a while, or do you need to actually update them
randomly and very often?
Also, how fast is "quick"? Do you mean 1 minute, 10 seconds, 1 second,