Okay, I just ran extensive tests with my minimal test case and you are
correct, the old and the new version do the scans in about the same
amount of time (although puts are MUCH faster in the packed scheme).

I guess my test case is too minimal. I will try to make a better
test case, since in my production code there is still a 500x
difference.
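
To make the packed scheme concrete, here is a minimal sketch in plain
Python. It is not the code from the gists and does not use the HBase
client. The names pack_row and unpack_row, the big-endian double
encoding, and the use of the segment index as the column qualifier are
all assumptions made for illustration only.

```python
import struct


def pack_row(values, segment_size):
    """Split one logical row into segments; each segment becomes one cell.

    Returns a dict mapping a 4-byte big-endian segment index (a
    hypothetical column qualifier) to the packed bytes of that segment.
    """
    cells = {}
    for start in range(0, len(values), segment_size):
        chunk = values[start:start + segment_size]
        qualifier = struct.pack(">I", start // segment_size)
        # Each value is stored as an 8-byte big-endian double (assumed type).
        cells[qualifier] = struct.pack(">%dd" % len(chunk), *chunk)
    return cells


def unpack_row(cells):
    """Rebuild the logical row from its packed cells, in segment order."""
    values = []
    # Big-endian qualifiers sort bytewise in numeric segment order.
    for qualifier in sorted(cells):
        payload = cells[qualifier]
        values.extend(struct.unpack(">%dd" % (len(payload) // 8), payload))
    return values


row = [float(i) for i in range(200000)]
cells = pack_row(row, 500)
```

With 200,000 columns and a segment size of 500, each row is stored as
400 cells instead of 200,000, so the per-cell overhead (key, timestamp,
and so on) is paid far less often per byte of actual data.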

Gurjeet

On Tue, Aug 21, 2012 at 10:00 PM, J Mohamed Zahoor <[email protected]> wrote:
> Try a quick TestDFSIO to see if things are okay.
>
> ./zahoor
>
> On Wed, Aug 22, 2012 at 6:26 AM, Mohit Anchlia <[email protected]> wrote:
>
>> It's possible that there is a bad or slower disk on Gurjeet's machine. I
>> think details of iostat and cpu would clear things up.
>>
>> On Tue, Aug 21, 2012 at 4:33 PM, lars hofhansl <[email protected]>
>> wrote:
>>
>> > I get roughly the same (~1.8s) - 100 rows, 200.000 columns, segment size
>> > 100
>> >
>> >
>> >
>> > ________________________________
>> >  From: Gurjeet Singh <[email protected]>
>> > To: [email protected]; lars hofhansl <[email protected]>
>> > Sent: Tuesday, August 21, 2012 11:31 AM
>> >  Subject: Re: Slow full-table scans
>> >
>> > How does that compare with the newScanTable on your build ?
>> >
>> > Gurjeet
>> >
>> > On Tue, Aug 21, 2012 at 11:18 AM, lars hofhansl <[email protected]>
>> > wrote:
>> > > Hmm... So I tried in HBase (current trunk).
>> > > I created 100 rows with 200.000 columns each (using your oldMakeTable).
>> > > The creation took a bit, but scanning finished in 1.8s (HBase in
>> > > pseudo-distributed mode, with your oldScanTable).
>> > >
>> > > -- Lars
>> > >
>> > >
>> > >
>> > > ----- Original Message -----
>> > > From: lars hofhansl <[email protected]>
>> > > To: "[email protected]" <[email protected]>
>> > > Cc:
>> > > Sent: Monday, August 20, 2012 7:50 PM
>> > > Subject: Re: Slow full-table scans
>> > >
>> > > Thanks Gurjeet,
>> > >
>> > > I'll (hopefully) have a look tomorrow.
>> > >
>> > > -- Lars
>> > >
>> > >
>> > >
>> > > ----- Original Message -----
>> > > From: Gurjeet Singh <[email protected]>
>> > > To: [email protected]; lars hofhansl <[email protected]>
>> > > Cc:
>> > > Sent: Monday, August 20, 2012 7:42 PM
>> > > Subject: Re: Slow full-table scans
>> > >
>> > > Hi Lars,
>> > >
>> > > Here is a testcase:
>> > >
>> > > https://gist.github.com/3410948
>> > >
>> > > Benchmarking code:
>> > >
>> > > https://gist.github.com/3410952
>> > >
>> > > Try running it with numRows = 100, numCols = 200000, segmentSize = 1000
>> > >
>> > > Gurjeet
>> > >
>> > >
>> > > On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh <[email protected]> wrote:
>> > >> Sure - I can create a minimal testcase and send it along.
>> > >>
>> > >> Gurjeet
>> > >>
>> > >> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl <[email protected]> wrote:
>> > >>> That's interesting.
>> > >>> Could you share your old and new schema? I would like to track down
>> > >>> the performance problems you saw.
>> > >>> (If you had a demo program that populates your rows with 200.000
>> > >>> columns in a way where you saw the performance issues, that'd be even
>> > >>> better, but it's not necessary.)
>> > >>>
>> > >>>
>> > >>> -- Lars
>> > >>>
>> > >>>
>> > >>>
>> > >>> ________________________________
>> > >>>  From: Gurjeet Singh <[email protected]>
>> > >>> To: [email protected]; lars hofhansl <[email protected]>
>> > >>> Sent: Thursday, August 16, 2012 11:26 AM
>> > >>> Subject: Re: Slow full-table scans
>> > >>>
>> > >>> Sorry for the delay guys.
>> > >>>
>> > >>> Here are a few results:
>> > >>>
>> > >>> 1. Regions in the table = 11
>> > >>> 2. The region servers don't appear to be very busy with the query
>> > >>> (~5% CPU), but with parallelization they are all busy.
>> > >>>
>> > >>> Finally, I changed the format of my data so that each cell in HBase
>> > >>> contains a chunk of a row instead of the single value it had.
>> > >>> Stuffing each HBase cell with 500 columns of a row gave me a
>> > >>> performance boost of 1000x. It seems that the underlying issue was IO
>> > >>> overhead per byte of actual data stored.
>> > >>>
>> > >>>
>> > >>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl <[email protected]> wrote:
>> > >>>> Yeah... It looks OK.
>> > >>>> Maybe 2G of heap is a bit low when dealing with 200.000 column rows.
>> > >>>>
>> > >>>>
>> > >>>> If you can, I'd like to know how busy your regionservers are during
>> > >>>> these operations. That would be an indication of whether the
>> > >>>> parallelization is good or not.
>> > >>>>
>> > >>>> -- Lars
>> > >>>>
>> > >>>>
>> > >>>> ----- Original Message -----
>> > >>>> From: Stack <[email protected]>
>> > >>>> To: [email protected]
>> > >>>> Cc:
>> > >>>> Sent: Wednesday, August 15, 2012 3:13 PM
>> > >>>> Subject: Re: Slow full-table scans
>> > >>>>
>> > >>>> On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <[email protected]> wrote:
>> > >>>>> I am beginning to think that this is a configuration issue on my
>> > >>>>> cluster. Do the following configuration files seem sane ?
>> > >>>>>
>> > >>>>> hbase-env.sh    https://gist.github.com/3345338
>> > >>>>>
>> > >>>>
>> > >>>> Nothing wrong w/ this (Remove the -ea, you don't want asserts in
>> > >>>> production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
>> > >>>>
>> > >>>>
>> > >>>>> hbase-site.xml    https://gist.github.com/3345356
>> > >>>>>
>> > >>>>
>> > >>>> This is all defaults, effectively. I don't see any of the configs
>> > >>>> recommended by the performance section of the reference guide and/or
>> > >>>> those suggested by the GBIF blog.
>> > >>>>
>> > >>>> You don't answer LarsH's query about where you see the 4%
>> > >>>> difference.
>> > >>>>
>> > >>>> How many regions in your table? What does the HBase Master UI look
>> > >>>> like when this scan is running?
>> > >>>> St.Ack
>> > >>>>
>> >
>>
