Hmm... So I tried in HBase (current trunk). I created 100 rows with 200.000 columns each (using your oldMakeTable). The creation took a bit, but scanning finished in 1.8s. (HBase in pseudo distributed mode - with your oldScanTable).
-- Lars

----- Original Message -----
From: lars hofhansl <[email protected]>
To: "[email protected]" <[email protected]>
Cc:
Sent: Monday, August 20, 2012 7:50 PM
Subject: Re: Slow full-table scans

Thanks Gurjeet,

I'll (hopefully) have a look tomorrow.

-- Lars

----- Original Message -----
From: Gurjeet Singh <[email protected]>
To: [email protected]; lars hofhansl <[email protected]>
Cc:
Sent: Monday, August 20, 2012 7:42 PM
Subject: Re: Slow full-table scans

Hi Lars,

Here is a testcase:

https://gist.github.com/3410948

Benchmarking code:

https://gist.github.com/3410952

Try running it with numRows = 100, numCols = 200000, segmentSize = 1000

Gurjeet

On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh <[email protected]> wrote:
> Sure - I can create a minimal testcase and send it along.
>
> Gurjeet
>
> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl <[email protected]> wrote:
>> That's interesting.
>> Could you share your old and new schema. I would like to track down the
>> performance problems you saw.
>> (If you had a demo program that populates your rows with 200.000 columns in
>> a way where you saw the performance issues, that'd be even better, but not
>> necessary).
>>
>> -- Lars
>>
>> ________________________________
>> From: Gurjeet Singh <[email protected]>
>> To: [email protected]; lars hofhansl <[email protected]>
>> Sent: Thursday, August 16, 2012 11:26 AM
>> Subject: Re: Slow full-table scans
>>
>> Sorry for the delay guys.
>>
>> Here are a few results:
>>
>> 1. Regions in the table = 11
>> 2. The region servers don't appear to be very busy with the query ~5%
>> CPU (but with parallelization, they are all busy)
>>
>> Finally, I changed the format of my data, such that each cell in HBase
>> contains a chunk of a row instead of the single value it had. So,
>> stuffing each Hbase cell with 500 columns of a row, gave me a
>> performance boost of 1000x. It seems that the underlying issue was IO
>> overhead per byte of actual data stored.
>>
>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl <[email protected]> wrote:
>>> Yeah... It looks OK.
>>> Maybe 2G of heap is a bit low when dealing with 200.000 column rows.
>>>
>>> If you can I'd like to know how busy your regionservers are during these
>>> operations. That would be an indication on whether the parallelization is
>>> good or not.
>>>
>>> -- Lars
>>>
>>> ----- Original Message -----
>>> From: Stack <[email protected]>
>>> To: [email protected]
>>> Cc:
>>> Sent: Wednesday, August 15, 2012 3:13 PM
>>> Subject: Re: Slow full-table scans
>>>
>>> On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <[email protected]> wrote:
>>>> I am beginning to think that this is a configuration issue on my
>>>> cluster. Do the following configuration files seem sane ?
>>>>
>>>> hbase-env.sh    https://gist.github.com/3345338
>>>>
>>>
>>> Nothing wrong w/ this (Remove the -ea, you don't want asserts in
>>> production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
>>>
>>>> hbase-site.xml    https://gist.github.com/3345356
>>>>
>>>
>>> This is all defaults effectively. I don't see any of the configs.
>>> recommended by the performance section of the reference guide and/or
>>> those suggested by the GBIF blog.
>>>
>>> You don't answer LarsH's query about where you see the 4% difference.
>>>
>>> How many regions in your table? Whats the HBase Master UI look like
>>> when this scan is running?
>>> St.Ack
>>>
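
For readers following the thread: the schema change Gurjeet describes (packing a chunk of a row's values into one cell instead of storing one value per cell) can be sketched roughly as below. This is only an illustration of the layout idea in plain Python, not the code from the gists and not HBase client code. The function names, the "seg" qualifier naming and the choice of 8-byte doubles are all assumptions made for the example; numRows, numCols and segmentSize are the values quoted in the thread.

```python
import struct

def pack_row(values, segment_size):
    """Split one logical row into segments and serialize each segment into
    a single bytes blob (one blob would be stored as one cell).
    Qualifier names ("seg000000", ...) are hypothetical."""
    cells = {}
    for start in range(0, len(values), segment_size):
        chunk = values[start:start + segment_size]
        # Zero-padded index so byte-wise sorted qualifiers keep segment order.
        qualifier = f"seg{start // segment_size:06d}".encode()
        # Assumption: every value is a big-endian 8-byte double.
        cells[qualifier] = struct.pack(f">{len(chunk)}d", *chunk)
    return cells

def unpack_row(cells):
    """Rebuild the logical row from its packed segments."""
    values = []
    for qualifier in sorted(cells):
        blob = cells[qualifier]
        values.extend(struct.unpack(f">{len(blob) // 8}d", blob))
    return values

# Parameters quoted in the thread.
num_rows, num_cols, segment_size = 100, 200_000, 1000

row = [float(i) for i in range(num_cols)]
cells = pack_row(row, segment_size)

cells_one_value_per_cell = num_rows * num_cols   # 20,000,000 cells
cells_packed = num_rows * len(cells)             # 20,000 cells
```

With these numbers the table goes from 20,000,000 cells to 20,000, so whatever fixed cost is paid per cell is spread over 1000 values instead of one. The sketch does not measure how much of the reported speedup that accounts for.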
