Lars,

Can you send me the modified ingestion code? I am trying to track down the problem as well and will keep you posted.
Thanks for your help!
Gurjeet

On Wed, Aug 22, 2012 at 6:38 PM, Lars H <[email protected]> wrote:
> Your puts are much faster because in the old case you're doing a Put per
> column, rather than per row.
> That's the first thing I changed in your sample code (but since this was about
> scan performance I did not mention that).
>
> I'm still interested in tracking this down if it is an actual performance
> problem.
>
> -- Lars
>
> Gurjeet Singh <[email protected]> wrote:
>
>>Okay, I just ran extensive tests with my minimal test case and you are
>>correct, the old and the new version do the scans in about the same
>>amount of time (although puts are MUCH faster in the packed scheme).
>>
>>I guess my test case is too minimal. I will try to make a better
>>testcase since in my production code, there is still a 500x
>>difference.
>>
>>Gurjeet
>>
>>On Tue, Aug 21, 2012 at 10:00 PM, J Mohamed Zahoor <[email protected]> wrote:
>>> Try a quick TestDFSIO to see if things are okay.
>>>
>>> ./zahoor
>>>
>>> On Wed, Aug 22, 2012 at 6:26 AM, Mohit Anchlia
>>> <[email protected]> wrote:
>>>
>>>> It's possible that there is a bad or slower disk on Gurjeet's machine. I
>>>> think details of iostat and cpu would clear things up.
>>>>
>>>> On Tue, Aug 21, 2012 at 4:33 PM, lars hofhansl <[email protected]>
>>>> wrote:
>>>>
>>>> > I get roughly the same (~1.8s) - 100 rows, 200.000 columns, segment size
>>>> > 100
>>>> >
>>>> > ________________________________
>>>> > From: Gurjeet Singh <[email protected]>
>>>> > To: [email protected]; lars hofhansl <[email protected]>
>>>> > Sent: Tuesday, August 21, 2012 11:31 AM
>>>> > Subject: Re: Slow full-table scans
>>>> >
>>>> > How does that compare with the newScanTable on your build?
>>>> >
>>>> > Gurjeet
>>>> >
>>>> > On Tue, Aug 21, 2012 at 11:18 AM, lars hofhansl <[email protected]>
>>>> > wrote:
>>>> > > Hmm... So I tried in HBase (current trunk).
>>>> > > I created 100 rows with 200.000 columns each (using your oldMakeTable).
>>>> > The creation took a bit, but scanning finished in 1.8s. (HBase in pseudo
>>>> > distributed mode - with your oldScanTable).
>>>> > >
>>>> > > -- Lars
>>>> > >
>>>> > > ----- Original Message -----
>>>> > > From: lars hofhansl <[email protected]>
>>>> > > To: "[email protected]" <[email protected]>
>>>> > > Cc:
>>>> > > Sent: Monday, August 20, 2012 7:50 PM
>>>> > > Subject: Re: Slow full-table scans
>>>> > >
>>>> > > Thanks Gurjeet,
>>>> > >
>>>> > > I'll (hopefully) have a look tomorrow.
>>>> > >
>>>> > > -- Lars
>>>> > >
>>>> > > ----- Original Message -----
>>>> > > From: Gurjeet Singh <[email protected]>
>>>> > > To: [email protected]; lars hofhansl <[email protected]>
>>>> > > Cc:
>>>> > > Sent: Monday, August 20, 2012 7:42 PM
>>>> > > Subject: Re: Slow full-table scans
>>>> > >
>>>> > > Hi Lars,
>>>> > >
>>>> > > Here is a testcase:
>>>> > >
>>>> > > https://gist.github.com/3410948
>>>> > >
>>>> > > Benchmarking code:
>>>> > >
>>>> > > https://gist.github.com/3410952
>>>> > >
>>>> > > Try running it with numRows = 100, numCols = 200000, segmentSize = 1000
>>>> > >
>>>> > > Gurjeet
>>>> > >
>>>> > > On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh <[email protected]>
>>>> > wrote:
>>>> > >> Sure - I can create a minimal testcase and send it along.
>>>> > >>
>>>> > >> Gurjeet
>>>> > >>
>>>> > >> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl <[email protected]>
>>>> > wrote:
>>>> > >>> That's interesting.
>>>> > >>> Could you share your old and new schema? I would like to track down
>>>> > the performance problems you saw.
>>>> > >>> (If you had a demo program that populates your rows with 200.000
>>>> > columns in a way where you saw the performance issues, that'd be even
>>>> > better, but not necessary).
>>>> > >>>
>>>> > >>> -- Lars
>>>> > >>>
>>>> > >>> ________________________________
>>>> > >>> From: Gurjeet Singh <[email protected]>
>>>> > >>> To: [email protected]; lars hofhansl <[email protected]>
>>>> > >>> Sent: Thursday, August 16, 2012 11:26 AM
>>>> > >>> Subject: Re: Slow full-table scans
>>>> > >>>
>>>> > >>> Sorry for the delay guys.
>>>> > >>>
>>>> > >>> Here are a few results:
>>>> > >>>
>>>> > >>> 1. Regions in the table = 11
>>>> > >>> 2. The region servers don't appear to be very busy with the query ~5%
>>>> > >>> CPU (but with parallelization, they are all busy)
>>>> > >>>
>>>> > >>> Finally, I changed the format of my data, such that each cell in
>>>> HBase
>>>> > >>> contains a chunk of a row instead of the single value it had. So,
>>>> > >>> stuffing each HBase cell with 500 columns of a row gave me a
>>>> > >>> performance boost of 1000x. It seems that the underlying issue was IO
>>>> > >>> overhead per byte of actual data stored.
>>>> > >>>
>>>> > >>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl <[email protected]>
>>>> > wrote:
>>>> > >>>> Yeah... It looks OK.
>>>> > >>>> Maybe 2G of heap is a bit low when dealing with 200.000 column rows.
>>>> > >>>>
>>>> > >>>> If you can I'd like to know how busy your regionservers are during
>>>> > these operations. That would be an indication on whether the
>>>> > parallelization is good or not.
>>>> > >>>>
>>>> > >>>> -- Lars
>>>> > >>>>
>>>> > >>>> ----- Original Message -----
>>>> > >>>> From: Stack <[email protected]>
>>>> > >>>> To: [email protected]
>>>> > >>>> Cc:
>>>> > >>>> Sent: Wednesday, August 15, 2012 3:13 PM
>>>> > >>>> Subject: Re: Slow full-table scans
>>>> > >>>>
>>>> > >>>> On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <[email protected]>
>>>> > wrote:
>>>> > >>>>> I am beginning to think that this is a configuration issue on my
>>>> > >>>>> cluster. Do the following configuration files seem sane?
>>>> > >>>>>
>>>> > >>>>> hbase-env.sh https://gist.github.com/3345338
>>>> > >>>>>
>>>> > >>>>
>>>> > >>>> Nothing wrong w/ this (Remove the -ea, you don't want asserts in
>>>> > >>>> production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
>>>> > >>>>
>>>> > >>>>> hbase-site.xml https://gist.github.com/3345356
>>>> > >>>>>
>>>> > >>>>
>>>> > >>>> This is all defaults effectively. I don't see any of the configs
>>>> > >>>> recommended by the performance section of the reference guide and/or
>>>> > >>>> those suggested by the GBIF blog.
>>>> > >>>>
>>>> > >>>> You don't answer LarsH's query about where you see the 4%
>>>> difference.
>>>> > >>>>
>>>> > >>>> How many regions in your table? What's the HBase Master UI look like
>>>> > >>>> when this scan is running?
>>>> > >>>> St.Ack
>>>> > >>>>
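The "packed scheme" discussed in this thread stores a chunk of a row in each cell instead of a single value. A minimal sketch of that packing, in plain Python without the HBase client: the names `pack_row` and `unpack_row`, the segment-index qualifiers, and the 8-byte double encoding are assumptions for illustration and are not taken from the linked gists.

```python
import struct


def pack_row(values, segment_size):
    """Split one logical row into segments; each segment becomes one cell.

    Returns a dict mapping a hypothetical column qualifier (the segment
    index as a string) to the packed bytes of up to segment_size values.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    cells = {}
    for start in range(0, len(values), segment_size):
        chunk = values[start:start + segment_size]
        # Encode the chunk as big-endian 8-byte doubles, back to back.
        cells[str(start // segment_size)] = struct.pack(">%dd" % len(chunk), *chunk)
    return cells


def unpack_row(cells):
    """Rebuild the logical row by reading segments in numeric order."""
    values = []
    for qualifier in sorted(cells, key=int):
        blob = cells[qualifier]
        values.extend(struct.unpack(">%dd" % (len(blob) // 8), blob))
    return values


# One logical row of 200000 values, as in the test case above.
row = [float(i) for i in range(200000)]
packed = pack_row(row, 500)
cells_old_scheme = len(row)     # one cell per value
cells_packed_scheme = len(packed)
```

With 200000 values per row and 500 values per cell, the packed layout stores 400 cells per row instead of 200000. Since HBase stores the row key, column family, qualifier and timestamp with every cell, that fixed per-cell cost is paid far less often, which is consistent with the "IO overhead per byte of actual data stored" explanation given in the thread.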
