Re: Slow full-table scans

Lars H Wed, 22 Aug 2012 18:38:41 -0700

Your puts are much faster in the packed scheme because in the old case you
were doing a Put per column, rather than per row.
That's the first thing I changed in your sample code (but since this was about
scan performance I did not mention it).
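
To illustrate the difference (a minimal sketch, not the code from the gist):
the Java below counts write calls for the two patterns. CountingTable and
PutBatchingSketch are hypothetical stand-ins that only tally calls; they are
not part of the HBase client API, and the row and column counts in main are
arbitrary.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a table: it only counts write calls and cells.
// None of these names come from the HBase client API.
final class CountingTable {
    long putCalls = 0;      // number of write operations issued
    long cellsWritten = 0;  // total number of cells carried by those writes

    // One "Put" carrying any number of cells for a single row.
    void put(String rowKey, List<String> qualifiers) {
        putCalls++;
        cellsWritten += qualifiers.size();
    }
}

final class PutBatchingSketch {
    // Old pattern: a separate Put for every column of every row.
    static CountingTable putPerColumn(int numRows, int numCols) {
        CountingTable table = new CountingTable();
        for (int r = 0; r < numRows; r++) {
            for (int c = 0; c < numCols; c++) {
                table.put("row" + r, Collections.singletonList("col" + c));
            }
        }
        return table;
    }

    // Changed pattern: collect all columns of a row, then issue one Put.
    static CountingTable putPerRow(int numRows, int numCols) {
        CountingTable table = new CountingTable();
        for (int r = 0; r < numRows; r++) {
            List<String> qualifiers = new ArrayList<>(numCols);
            for (int c = 0; c < numCols; c++) {
                qualifiers.add("col" + c);
            }
            table.put("row" + r, qualifiers);
        }
        return table;
    }

    public static void main(String[] args) {
        CountingTable perColumn = putPerColumn(10, 2000);
        CountingTable perRow = putPerRow(10, 2000);
        System.out.println("per-column puts: " + perColumn.putCalls);
        System.out.println("per-row puts:    " + perRow.putCalls);
    }
}
```

Both loops write the same number of cells; the per-column loop issues
numRows * numCols write calls, the per-row loop only numRows.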


I'm still interested in tracking this down if it is an actual performance 
problem.

-- Lars

Gurjeet Singh <[email protected]> wrote:

>Okay, I just ran extensive tests with my minimal test case and you are
>correct, the old and the new version do the scans in about the same
>amount of time (although puts are MUCH faster in the packed scheme).
>
>I guess my test case is too minimal. I will try to make a better
>testcase since in my production code, there is still a 500x
>difference.
>
>Gurjeet
>
>On Tue, Aug 21, 2012 at 10:00 PM, J Mohamed Zahoor <[email protected]> wrote:
>> Try a quick TestDFSIO to see if things are okay.
>>
>> ./zahoor
>>
>> On Wed, Aug 22, 2012 at 6:26 AM, Mohit Anchlia <[email protected]> wrote:
>>
>>> It's possible that there is a bad or slower disk on Gurjeet's machine. I
>>> think details of iostat and cpu would clear things up.
>>>
>>> On Tue, Aug 21, 2012 at 4:33 PM, lars hofhansl <[email protected]>
>>> wrote:
>>>
>>> > I get roughly the same (~1.8s) - 100 rows, 200,000 columns, segment size
>>> > 100
>>> >
>>> >
>>> >
>>> > ________________________________
>>> >  From: Gurjeet Singh <[email protected]>
>>> > To: [email protected]; lars hofhansl <[email protected]>
>>> > Sent: Tuesday, August 21, 2012 11:31 AM
>>> >  Subject: Re: Slow full-table scans
>>> >
>>> > How does that compare with the newScanTable on your build ?
>>> >
>>> > Gurjeet
>>> >
>>> > On Tue, Aug 21, 2012 at 11:18 AM, lars hofhansl <[email protected]>
>>> > wrote:
>>> > > Hmm... So I tried in HBase (current trunk).
>>> > > I created 100 rows with 200,000 columns each (using your oldMakeTable).
>>> > > The creation took a bit, but scanning finished in 1.8s (HBase in
>>> > > pseudo-distributed mode, with your oldScanTable).
>>> > >
>>> > > -- Lars
>>> > >
>>> > >
>>> > >
>>> > > ----- Original Message -----
>>> > > From: lars hofhansl <[email protected]>
>>> > > To: "[email protected]" <[email protected]>
>>> > > Cc:
>>> > > Sent: Monday, August 20, 2012 7:50 PM
>>> > > Subject: Re: Slow full-table scans
>>> > >
>>> > > Thanks Gurjeet,
>>> > >
>>> > > I'll (hopefully) have a look tomorrow.
>>> > >
>>> > > -- Lars
>>> > >
>>> > >
>>> > >
>>> > > ----- Original Message -----
>>> > > From: Gurjeet Singh <[email protected]>
>>> > > To: [email protected]; lars hofhansl <[email protected]>
>>> > > Cc:
>>> > > Sent: Monday, August 20, 2012 7:42 PM
>>> > > Subject: Re: Slow full-table scans
>>> > >
>>> > > Hi Lars,
>>> > >
>>> > > Here is a testcase:
>>> > >
>>> > > https://gist.github.com/3410948
>>> > >
>>> > > Benchmarking code:
>>> > >
>>> > > https://gist.github.com/3410952
>>> > >
>>> > > Try running it with numRows = 100, numCols = 200000, segmentSize = 1000
>>> > >
>>> > > Gurjeet
>>> > >
>>> > >
>>> > > On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh <[email protected]> wrote:
>>> > >> Sure - I can create a minimal testcase and send it along.
>>> > >>
>>> > >> Gurjeet
>>> > >>
>>> > >> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl <[email protected]> wrote:
>>> > >>> That's interesting.
>>> > >>> Could you share your old and new schema? I would like to track down
>>> > >>> the performance problems you saw.
>>> > >>> (If you had a demo program that populates your rows with 200,000
>>> > >>> columns in a way where you saw the performance issues, that'd be even
>>> > >>> better, but not necessary.)
>>> > >>>
>>> > >>>
>>> > >>> -- Lars
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>> ________________________________
>>> > >>>  From: Gurjeet Singh <[email protected]>
>>> > >>> To: [email protected]; lars hofhansl <[email protected]>
>>> > >>> Sent: Thursday, August 16, 2012 11:26 AM
>>> > >>> Subject: Re: Slow full-table scans
>>> > >>>
>>> > >>> Sorry for the delay guys.
>>> > >>>
>>> > >>> Here are a few results:
>>> > >>>
>>> > >>> 1. Regions in the table = 11
>>> > >>> 2. The region servers don't appear to be very busy with the query (~5%
>>> > >>> CPU), but with parallelization they are all busy.
>>> > >>>
>>> > >>> Finally, I changed the format of my data so that each cell in HBase
>>> > >>> contains a chunk of a row instead of the single value it had before.
>>> > >>> Stuffing each HBase cell with 500 columns of a row gave me a
>>> > >>> performance boost of 1000x. It seems that the underlying issue was IO
>>> > >>> overhead per byte of actual data stored.
>>> > >>>
>>> > >>>
>>> > >>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl <[email protected]> wrote:
>>> > >>>> Yeah... It looks OK.
>>> > >>>> Maybe 2G of heap is a bit low when dealing with 200,000-column rows.
>>> > >>>>
>>> > >>>>
>>> > >>>> If you can, I'd like to know how busy your regionservers are during
>>> > >>>> these operations. That would be an indication of whether the
>>> > >>>> parallelization is good or not.
>>> > >>>>
>>> > >>>> -- Lars
>>> > >>>>
>>> > >>>>
>>> > >>>> ----- Original Message -----
>>> > >>>> From: Stack <[email protected]>
>>> > >>>> To: [email protected]
>>> > >>>> Cc:
>>> > >>>> Sent: Wednesday, August 15, 2012 3:13 PM
>>> > >>>> Subject: Re: Slow full-table scans
>>> > >>>>
>>> > >>>> On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <[email protected]> wrote:
>>> > >>>>> I am beginning to think that this is a configuration issue on my
>>> > >>>>> cluster. Do the following configuration files seem sane ?
>>> > >>>>>
>>> > >>>>> hbase-env.sh    https://gist.github.com/3345338
>>> > >>>>>
>>> > >>>>
>>> > >>>> Nothing wrong w/ this (Remove the -ea, you don't want asserts in
>>> > >>>> production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
>>> > >>>>
>>> > >>>>
>>> > >>>>> hbase-site.xml    https://gist.github.com/3345356
>>> > >>>>>
>>> > >>>>
>>> > >>>> This is effectively all defaults. I don't see any of the configs
>>> > >>>> recommended by the performance section of the reference guide and/or
>>> > >>>> those suggested by the GBIF blog.
>>> > >>>>
>>> > >>>> You don't answer LarsH's query about where you see the 4% difference.
>>> > >>>>
>>> > >>>> How many regions are in your table? What does the HBase Master UI look
>>> > >>>> like when this scan is running?
>>> > >>>> St.Ack
>>> > >>>>
>>> >
>>>
