Re: Slow full-table scans

Gurjeet Singh Wed, 22 Aug 2012 19:01:36 -0700

Lars,

Can you send me the modified ingestion code? I am trying to track
down the problem as well and will keep you posted.


Thanks for your help!
Gurjeet

On Wed, Aug 22, 2012 at 6:38 PM, Lars H <[email protected]> wrote:
> Your puts are much faster because in the old case you're doing a Put per 
> column, rather than per row.
> That's the first thing I changed in your sample code (but since this was about
> scan performance I did not mention that).
>
> I'm still interested in tracking this down if it is an actual performance 
> problem.
>
> -- Lars
>
> Gurjeet Singh <[email protected]> wrote:
>
>>Okay, I just ran extensive tests with my minimal test case and you are
>>correct, the old and the new version do the scans in about the same
>>amount of time (although puts are MUCH faster in the packed scheme).
>>
>>I guess my test case is too minimal. I will try to make a better
>>testcase since in my production code, there is still a 500x
>>difference.
>>
>>Gurjeet
>>
>>On Tue, Aug 21, 2012 at 10:00 PM, J Mohamed Zahoor <[email protected]> wrote:
>>> Try a quick TestDFSIO to see if things are okay.
>>>
>>> ./zahoor
>>>
>>> On Wed, Aug 22, 2012 at 6:26 AM, Mohit Anchlia
>>> <[email protected]> wrote:
>>>
>>>> It's possible that there is a bad or slower disk on Gurjeet's machine. I
>>>> think details of iostat and cpu would clear things up.
>>>>
>>>> On Tue, Aug 21, 2012 at 4:33 PM, lars hofhansl <[email protected]>
>>>> wrote:
>>>>
>>>> > I get roughly the same (~1.8s) - 100 rows, 200,000 columns, segment size
>>>> > 100
>>>> >
>>>> >
>>>> >
>>>> > ________________________________
>>>> >  From: Gurjeet Singh <[email protected]>
>>>> > To: [email protected]; lars hofhansl <[email protected]>
>>>> > Sent: Tuesday, August 21, 2012 11:31 AM
>>>> >  Subject: Re: Slow full-table scans
>>>> >
>>>> > How does that compare with the newScanTable on your build ?
>>>> >
>>>> > Gurjeet
>>>> >
>>>> > On Tue, Aug 21, 2012 at 11:18 AM, lars hofhansl <[email protected]>
>>>> > wrote:
>>>> > > Hmm... So I tried in HBase (current trunk).
>>>> > > I created 100 rows with 200,000 columns each (using your oldMakeTable).
>>>> > > The creation took a bit, but scanning finished in 1.8s (HBase in
>>>> > > pseudo-distributed mode, with your oldScanTable).
>>>> > >
>>>> > > -- Lars
>>>> > >
>>>> > >
>>>> > >
>>>> > > ----- Original Message -----
>>>> > > From: lars hofhansl <[email protected]>
>>>> > > To: "[email protected]" <[email protected]>
>>>> > > Cc:
>>>> > > Sent: Monday, August 20, 2012 7:50 PM
>>>> > > Subject: Re: Slow full-table scans
>>>> > >
>>>> > > Thanks Gurjeet,
>>>> > >
>>>> > > I'll (hopefully) have a look tomorrow.
>>>> > >
>>>> > > -- Lars
>>>> > >
>>>> > >
>>>> > >
>>>> > > ----- Original Message -----
>>>> > > From: Gurjeet Singh <[email protected]>
>>>> > > To: [email protected]; lars hofhansl <[email protected]>
>>>> > > Cc:
>>>> > > Sent: Monday, August 20, 2012 7:42 PM
>>>> > > Subject: Re: Slow full-table scans
>>>> > >
>>>> > > Hi Lars,
>>>> > >
>>>> > > Here is a testcase:
>>>> > >
>>>> > > https://gist.github.com/3410948
>>>> > >
>>>> > > Benchmarking code:
>>>> > >
>>>> > > https://gist.github.com/3410952
>>>> > >
>>>> > > Try running it with numRows = 100, numCols = 200000, segmentSize = 1000
>>>> > >
>>>> > > Gurjeet
>>>> > >
>>>> > >
>>>> > > On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh <[email protected]>
>>>> > > wrote:
>>>> > >> Sure - I can create a minimal testcase and send it along.
>>>> > >>
>>>> > >> Gurjeet
>>>> > >>
>>>> > >> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl <[email protected]>
>>>> > >> wrote:
>>>> > >>> That's interesting.
>>>> > >>> Could you share your old and new schema? I would like to track down
>>>> > >>> the performance problems you saw.
>>>> > >>> (If you had a demo program that populates your rows with 200,000
>>>> > >>> columns in a way where you saw the performance issues, that'd be even
>>>> > >>> better, but it's not necessary.)
>>>> > >>>
>>>> > >>>
>>>> > >>> -- Lars
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> ________________________________
>>>> > >>>  From: Gurjeet Singh <[email protected]>
>>>> > >>> To: [email protected]; lars hofhansl <[email protected]>
>>>> > >>> Sent: Thursday, August 16, 2012 11:26 AM
>>>> > >>> Subject: Re: Slow full-table scans
>>>> > >>>
>>>> > >>> Sorry for the delay guys.
>>>> > >>>
>>>> > >>> Here are a few results:
>>>> > >>>
>>>> > >>> 1. Regions in the table = 11
>>>> > >>> 2. The region servers don't appear to be very busy with the query
>>>> > >>> (~5% CPU), but with parallelization they are all busy
>>>> > >>>
>>>> > >>> Finally, I changed the format of my data, such that each cell in
>>>> > >>> HBase contains a chunk of a row instead of the single value it had.
>>>> > >>> So, stuffing each HBase cell with 500 columns of a row gave me a
>>>> > >>> performance boost of 1000x. It seems that the underlying issue was IO
>>>> > >>> overhead per byte of actual data stored.
>>>> > >>>
>>>> > >>>
>>>> > >>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl <[email protected]>
>>>> > >>> wrote:
>>>> > >>>> Yeah... It looks OK.
>>>> > >>>> Maybe 2G of heap is a bit low when dealing with 200,000-column rows.
>>>> > >>>>
>>>> > >>>>
>>>> > >>>> If you can, I'd like to know how busy your regionservers are during
>>>> > >>>> these operations. That would be an indication of whether the
>>>> > >>>> parallelization is good or not.
>>>> > >>>>
>>>> > >>>> -- Lars
>>>> > >>>>
>>>> > >>>>
>>>> > >>>> ----- Original Message -----
>>>> > >>>> From: Stack <[email protected]>
>>>> > >>>> To: [email protected]
>>>> > >>>> Cc:
>>>> > >>>> Sent: Wednesday, August 15, 2012 3:13 PM
>>>> > >>>> Subject: Re: Slow full-table scans
>>>> > >>>>
>>>> > >>>> On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <[email protected]>
>>>> > >>>> wrote:
>>>> > >>>>> I am beginning to think that this is a configuration issue on my
>>>> > >>>>> cluster. Do the following configuration files seem sane ?
>>>> > >>>>>
>>>> > >>>>> hbase-env.sh    https://gist.github.com/3345338
>>>> > >>>>>
>>>> > >>>>
>>>> > >>>> Nothing wrong w/ this (Remove the -ea, you don't want asserts in
>>>> > >>>> production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
>>>> > >>>>
>>>> > >>>>
>>>> > >>>>> hbase-site.xml    https://gist.github.com/3345356
>>>> > >>>>>
>>>> > >>>>
>>>> > >>>> This is effectively all defaults. I don't see any of the configs
>>>> > >>>> recommended by the performance section of the reference guide and/or
>>>> > >>>> those suggested by the GBIF blog.
>>>> > >>>>
>>>> > >>>> You don't answer LarsH's query about where you see the 4%
>>>> > >>>> difference.
>>>> > >>>>
>>>> > >>>> How many regions in your table? What does the HBase Master UI look
>>>> > >>>> like when this scan is running?
>>>> > >>>> St.Ack
>>>> > >>>>
>>>> >
>>>>
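
The packed scheme discussed in this thread (each HBase cell holding a
segment of a row's columns instead of a single value) can be sketched
roughly as follows. This is a minimal sketch, not the code from the linked
gists: the class name SegmentPacker, the use of double values, and the
big-endian byte layout are all assumptions made for illustration, and no
HBase API is involved.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: packs one logical row of values into fixed-size segments. */
final class SegmentPacker {
    private SegmentPacker() {}

    /** Splits values into runs of at most segmentSize and serializes each run to bytes. */
    static List<byte[]> pack(double[] values, int segmentSize) {
        if (segmentSize <= 0) {
            throw new IllegalArgumentException("segmentSize must be positive");
        }
        List<byte[]> cells = new ArrayList<>();
        for (int start = 0; start < values.length; start += segmentSize) {
            int end = Math.min(start + segmentSize, values.length);
            // One buffer per segment; each segment would become one cell value.
            ByteBuffer buf = ByteBuffer.allocate((end - start) * Double.BYTES);
            for (int i = start; i < end; i++) {
                buf.putDouble(values[i]);
            }
            cells.add(buf.array());
        }
        return cells;
    }

    /** Reverses pack(): concatenates the decoded segments back into one row. */
    static double[] unpack(List<byte[]> cells) {
        int total = 0;
        for (byte[] cell : cells) {
            total += cell.length / Double.BYTES;
        }
        double[] out = new double[total];
        int pos = 0;
        for (byte[] cell : cells) {
            ByteBuffer buf = ByteBuffer.wrap(cell);
            while (buf.hasRemaining()) {
                out[pos++] = buf.getDouble();
            }
        }
        return out;
    }
}
```

With 200,000 columns per row and a segment size of 500, pack() yields 400
cells per row instead of 200,000, which is the kind of reduction in
per-cell overhead that the thread suggests may explain the difference.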
