Re: Which Hadoop product is more appropriate for a quick query on a large data set?

2010-01-06 Thread Todd Lipcon
Hi Xueling, Here's a general outline: My guess is that your "position of match" field is bounded (perhaps by the number of base pairs in the human genome?). Given this, you can probably write a very simple Partitioner implementation that divides this field into ranges, each with an approximately
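
The range-partitioning idea in the snippet above can be sketched in a few lines. This is only an illustration, not Todd's code and not the Hadoop API: the function name range_partition, the upper bound of 3,000,000,000 positions, and the choice of equal-width ranges are all assumptions, since the message is cut off before it says how the ranges should be balanced. In a real Hadoop job the same arithmetic would sit inside a Partitioner's getPartition method.

```python
# Hypothetical sketch of range partitioning on a bounded integer field.
# MAX_POSITION is an assumed upper bound, chosen only for illustration.
MAX_POSITION = 3_000_000_000


def range_partition(position: int, num_partitions: int,
                    max_position: int = MAX_POSITION) -> int:
    """Map a position in [0, max_position) to a partition in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    if not 0 <= position < max_position:
        raise ValueError("position is outside the assumed bounds")
    # Integer arithmetic splits the key space into equal-width, contiguous
    # ranges, so partition i holds only positions smaller than those in i + 1.
    return position * num_partitions // max_position


if __name__ == "__main__":
    for pos in (0, 299_999_999, 300_000_000, 2_999_999_999):
        print(pos, "->", range_partition(pos, 10))
```

Because the ranges are contiguous and ordered, a lookup for a given position only has to consult the output of one partition. Equal-width ranges balance the load only if the positions are spread roughly evenly; skewed data would need boundaries taken from a sample instead.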

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

2010-01-05 Thread Xueling Shu
Hi Todd: After finishing some tasks I finally got back to HDFS testing. One question about your last reply to this thread: Are there any code examples close to your second and third recommendations? Or which APIs should I start with for my testing? Thanks. Xueling On Sat, Dec 12, 2009 at 1:01 PM,

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

2009-12-12 Thread Xueling Shu
Hi Todd: Thank you for your reply. The datasets won't be updated often, but queries against a data set are frequent. The quicker the query, the better. For example we have done testing on a MySQL database (5 billion records randomly scattered into 24 tables) and the slowest query against the

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

2009-12-12 Thread stack
You might also consider HBase, particularly if you find that your data is being updated with some regularity, and especially if the updates are randomly distributed over the data set. See http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulkfor

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

2009-12-12 Thread Fred Zappert
+1 for HBase On Sat, Dec 12, 2009 at 2:56 PM, Xueling Shu x...@systemsbiology.org wrote: Great information! Thank you for your help, Todd. Xueling On Sat, Dec 12, 2009 at 1:01 PM, Todd Lipcon t...@cloudera.com wrote: Hi Xueling, In that case, I would recommend the following: 1)

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

2009-12-11 Thread Todd Lipcon
Hi Xueling, One important question that can really change the answer: How often does the dataset change? Can the changes be merged in in bulk every once in a while, or do you need to actually update them randomly very often? Also, how fast is quick? Do you mean 1 minute, 10 seconds, 1 second,