Re: Sample data set of HBase

2011-06-04 Thread Stack
You could start with ycsb, http://hbase.apache.org/book.html#d470e4911, Jason? St.Ack On Fri, Jun 3, 2011 at 5:31 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: I'm looking for a sample data set to benchmark the Lucene FST, specifically the keys.  I'm guessing a common key type for

Re: prefix compression

2011-06-04 Thread Jason Rutherglen
I varied the ms increment randomly between 1-20, then created 10 mil dates. The FST was then 58,481,582 bytes, i.e., about 55.8 MB. Guess it's not perfect! 19,739,994 bytes, i.e., 18.8 MB for random 1-5 increments. I think that's still pretty good. I need to try varying the long value stored alongside to
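For anyone wanting to reproduce this kind of test input, here is a minimal sketch of generating monotonically increasing millisecond timestamps with a random step. The function name, the seed, and the starting value are assumptions made for illustration; the message does not say how the dates were encoded as keys or how they were fed to the FST, so that part is left out.

```python
import random

def generate_timestamps(count, max_increment_ms, start_ms=0, seed=None):
    """Return `count` increasing millisecond timestamps.

    Each value is the previous one plus a random step in
    [1, max_increment_ms]. All names here are hypothetical.
    """
    rng = random.Random(seed)
    current = start_ms
    timestamps = []
    for _ in range(count):
        current += rng.randint(1, max_increment_ms)  # inclusive bounds
        timestamps.append(current)
    return timestamps

if __name__ == "__main__":
    # Small run for illustration; the message used 10 million dates.
    sample = generate_timestamps(5, 20, start_ms=1307145600000, seed=1)
    print(sample)
```

Raising `max_increment_ms` (20, 1000, 10,000) widens the gaps between consecutive keys, which is the variable being changed in the size figures quoted in this thread.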

Re: HDFS-1599 status? (HDFS tickets to improve HBase)

2011-06-04 Thread Andrew Purtell
From: Todd Lipcon t...@cloudera.com Not to be too mean and discouraging to everyone passing around patches against CDH3 and/or 0.20-append, but just an FYI: there is no chance that these things will get committed to an 0.20 branch without first going through trunk. Sharing patches and testing

Re: prefix compression

2011-06-04 Thread Jason Rutherglen
Here's some more data for the 10 mil dates: 68.1 MB random increment up to 1000 87.1 MB random increment up to 10,000 162.1 MB total not using the FST On Fri, Jun 3, 2011 at 10:57 PM, Stack st...@duboce.net wrote: That can't be true?  (smile)  How would you search a 'key' in the FST? St.Ack
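As a quick arithmetic check on the figures in this message, the sketch below expresses each FST size as a fraction of the 162.1 MB reported for the same 10 mil dates without the FST. The variable and function names are my own, and only the numbers quoted in the message are used.

```python
# Sizes in MB as quoted in the message above.
BASELINE_MB = 162.1  # total not using the FST

fst_sizes_mb = {
    "random increment up to 1000": 68.1,
    "random increment up to 10,000": 87.1,
}

def fraction_of_baseline(size_mb, baseline_mb=BASELINE_MB):
    """Return the FST size as a fraction of the non-FST size."""
    return size_mb / baseline_mb

if __name__ == "__main__":
    for label, size in fst_sizes_mb.items():
        print(f"{label}: {fraction_of_baseline(size):.0%} of baseline")
```

So the FST takes roughly 42% of the baseline with increments up to 1000, and roughly 54% with increments up to 10,000.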

Re: prefix compression

2011-06-04 Thread Stack
On Fri, Jun 3, 2011 at 7:03 PM, Matt Corgan mcor...@hotpads.com wrote: Pluggable formats would help here so you could tune for mem vs cpu. More history. At the time of KV and hfile incubation, we thought about making these building blocks pluggable but it was thought that there would be a

Pluggable block index

2011-06-04 Thread Jason Rutherglen
I want to take a wh/hack at creating a pluggable block index. Is there an open issue for this? I looked and couldn't find one.

Re: Pluggable block index

2011-06-04 Thread Jason Rutherglen
You'd have to change how the Scanner code works, etc. You'll find out. Nice! Sounds fun. On Sat, Jun 4, 2011 at 3:27 PM, Ryan Rawson ryano...@gmail.com wrote: What are the specs/goals of a pluggable block index?  Right now the block index is fairly tied deep in how HFile works. You'd have

Re: Pluggable block index

2011-06-04 Thread Ryan Rawson
Also, don't break it :-) Part of the goal of HFile was to build something quick and reliable. It can be hard to know you have all the corner cases down and you won't find out in 6 months that every single piece of data you have put in HBase is corrupt. Keeping it simple is one strategy. I have

Re: Pluggable block index

2011-06-04 Thread Jason Rutherglen
It can be hard to know you have all the corner cases down and you won't find out in 6 months that every single piece of data you have put in HBase is corrupt. Keeping it simple is one strategy. Isn't the block index separate from the actual data? So corruption in that case is unlikely. I

Re: Pluggable block index

2011-06-04 Thread Ryan Rawson
Oh BTW, you can't mmap anything in HBase unless you copy it to local disk first. HDFS = no mmap. just thought you'd like to know. On Sat, Jun 4, 2011 at 3:41 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: It can be hard to know you have all the corner cases down and you won't find out

Re: Pluggable block index

2011-06-04 Thread Jason Rutherglen
Oh BTW, you can't mmap anything in HBase unless you copy it to local disk first. HDFS = no mmap. Right. I know that! Once the block index is pluggable, the FST would be an in heap byte[]. On Sat, Jun 4, 2011 at 3:49 PM, Ryan Rawson ryano...@gmail.com wrote: Oh BTW, you can't mmap anything

CPU cache effectiveness

2011-06-04 Thread Matt Corgan
I mentioned a bunch of stuff in that prefix compression email about cache lines, prefetching, trie node sizes, etc... The gist of it all is that memory has become relatively slow to the point where you need to start thinking of it in similar ways as we think of disk/network. I dug up and cleaned