Hmm... so I tried this in HBase (current trunk).
I created 100 rows with 200,000 columns each (using your oldMakeTable). The
creation took a while, but a full scan finished in 1.8s. (HBase in pseudo
distributed mode, with your oldScanTable.)

-- Lars



----- Original Message -----
From: lars hofhansl <[email protected]>
To: "[email protected]" <[email protected]>
Cc: 
Sent: Monday, August 20, 2012 7:50 PM
Subject: Re: Slow full-table scans

Thanks Gurjeet,

I'll (hopefully) have a look tomorrow.

-- Lars



----- Original Message -----
From: Gurjeet Singh <[email protected]>
To: [email protected]; lars hofhansl <[email protected]>
Cc: 
Sent: Monday, August 20, 2012 7:42 PM
Subject: Re: Slow full-table scans

Hi Lars,

Here is a testcase:

https://gist.github.com/3410948

Benchmarking code:

https://gist.github.com/3410952

Try running it with numRows = 100, numCols = 200000, segmentSize = 1000

Gurjeet


On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh <[email protected]> wrote:
> Sure - I can create a minimal testcase and send it along.
>
> Gurjeet
>
> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl <[email protected]> wrote:
>> That's interesting.
>> Could you share your old and new schemas? I would like to track down the 
>> performance problems you saw.
>> (If you have a demo program that populates your rows with 200,000 columns in 
>> a way that shows the performance issues, that'd be even better, but it's not 
>> necessary.)
>>
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>  From: Gurjeet Singh <[email protected]>
>> To: [email protected]; lars hofhansl <[email protected]>
>> Sent: Thursday, August 16, 2012 11:26 AM
>> Subject: Re: Slow full-table scans
>>
>> Sorry for the delay guys.
>>
>> Here are a few results:
>>
>> 1. Regions in the table = 11
>> 2. The region servers don't appear very busy during the query (~5%
>> CPU), but with parallelization they are all busy
>>
>> Finally, I changed the format of my data so that each cell in HBase
>> contains a chunk of a row instead of a single value. Stuffing each
>> HBase cell with 500 columns of a row gave me a performance boost of
>> about 1000x. It seems that the underlying issue was I/O overhead per
>> byte of actual data stored.
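
The cell-packing approach described above can be sketched as follows. This is a hypothetical illustration (the class and method names are mine, not from the linked gists), assuming fixed-width double values: a segment of 500 logical columns is serialized into a single byte array that would be stored as one HBase cell value, so the per-KeyValue overhead is paid once per segment instead of once per column.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch of the cell-packing idea: serialize a segment of
// fixed-width column values into one byte array (stored as a single HBase
// cell), cutting per-cell overhead by the segment size.
public class CellPacker {

    // Pack one segment of a logical row into a single cell value.
    static byte[] pack(double[] segment) {
        ByteBuffer buf = ByteBuffer.allocate(segment.length * 8);
        for (double d : segment) {
            buf.putDouble(d);
        }
        return buf.array();
    }

    // Recover the individual column values from a packed cell value.
    static double[] unpack(byte[] cellValue) {
        ByteBuffer buf = ByteBuffer.wrap(cellValue);
        double[] out = new double[cellValue.length / 8];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getDouble();
        }
        return out;
    }

    public static void main(String[] args) {
        double[] segment = new double[500];
        for (int i = 0; i < segment.length; i++) {
            segment[i] = i * 0.5;
        }
        byte[] cell = pack(segment); // one 4000-byte cell instead of 500 cells
        System.out.println(Arrays.equals(unpack(cell), segment)); // prints "true"
    }
}
```

On the read path, each scanned cell then decodes into 500 columns instead of arriving as 500 separate KeyValues, which is consistent with the reported drop in per-byte I/O overhead.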
>>
>>
>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl <[email protected]> wrote:
>>> Yeah... It looks OK.
>>> Maybe 2G of heap is a bit low when dealing with 200,000-column rows.
>>>
>>>
>>> If you can I'd like to know how busy your regionservers are during these 
>>> operations. That would be an indication on whether the parallelization is 
>>> good or not.
>>>
>>> -- Lars
>>>
>>>
>>> ----- Original Message -----
>>> From: Stack <[email protected]>
>>> To: [email protected]
>>> Cc:
>>> Sent: Wednesday, August 15, 2012 3:13 PM
>>> Subject: Re: Slow full-table scans
>>>
>>> On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <[email protected]> wrote:
>>>> I am beginning to think that this is a configuration issue on my
>>>> cluster. Do the following configuration files seem sane ?
>>>>
>>>> hbase-env.sh    https://gist.github.com/3345338
>>>>
>>>
>>> Nothing wrong with this. (Remove -ea, since you don't want assertions
>>> enabled in production, and remove the -XX:+CMSIncrementalMode flag if
>>> you have >= 2 cores.)
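
For concreteness, the suggested hbase-env.sh change might look like the fragment below. This is an illustrative assumption, not the actual file from the gist; only the removal of -ea and -XX:+CMSIncrementalMode comes from the advice above.

```shell
# hbase-env.sh (illustrative fragment, not the actual file from the gist)
# Before: export HBASE_OPTS="-ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
# After: drop -ea (no assertions in production) and -XX:+CMSIncrementalMode
#        (not needed with >= 2 cores):
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"
```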
>>>
>>>
>>>> hbase-site.xml    https://gist.github.com/3345356
>>>>
>>>
>>> This is effectively all defaults.   I don't see any of the configs
>>> recommended by the performance section of the reference guide or
>>> those suggested by the GBIF blog.
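
As a hedged example, the kind of non-default settings the reference guide's performance section covers might look like the hbase-site.xml fragment below; the values shown are illustrative, not recommendations for this particular cluster.

```xml
<!-- hbase-site.xml: illustrative fragment; tune values for your workload -->
<property>
  <name>hbase.client.scanner.caching</name>
  <!-- rows fetched per scanner RPC; the old default of 1 makes full scans slow -->
  <value>100</value>
</property>
<property>
  <name>hbase.regionserver.handler.count</name>
  <!-- RPC handler threads per region server -->
  <value>30</value>
</property>
```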
>>>
>>> You don't answer LarsH's query about where you see the 4% difference.
>>>
>>> How many regions are in your table?  What does the HBase Master UI look
>>> like while this scan is running?
>>> St.Ack
>>>
