Re: Simple Pig query returns inaccurate result size for HBase tables of 1.8m+ rows

Dmitriy Ryaboy Sat, 08 Jan 2011 16:05:17 -0800

Ian, I looked through the code and I don't see how this could be happening.
Just to make sure this isn't an HBase issue, can you run an equivalent
Java MR program to count the rows? The shell count is sequential and doesn't
use all the MapReduce machinery.


The job you want to run is org.apache.hadoop.hbase.mapreduce.RowCounter in
the hbase jar, I believe.
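
Once RowCounter reports a number, it can be checked against what Pig wrote.
Below is a minimal sketch, assuming the Pig script stored its CSV into a
local directory of part-* files. The function names and the directory layout
are illustrative assumptions, not something taken from the gist.

```python
import glob
import os


def count_pig_output_rows(output_dir):
    """Count non-empty lines across the part-* files in a Pig output directory."""
    total = 0
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path, "r") as handle:
            # Each non-empty line is assumed to be one stored tuple.
            total += sum(1 for line in handle if line.strip())
    return total


def report_discrepancy(expected_rows, actual_rows):
    """Return a one-line summary comparing the expected and observed counts."""
    missing = expected_rows - actual_rows
    if missing == 0:
        return "counts match: %d rows" % actual_rows
    return "mismatch: expected %d, got %d (%d missing)" % (
        expected_rows, actual_rows, missing)
```

Passing the RowCounter total as expected_rows and the count from the Pig
output directory as actual_rows shows whether both MapReduce paths lose the
same rows or only the Pig one does.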

On Thu, Jan 6, 2011 at 10:53 AM, Ian Stevens <[email protected]> wrote:

> The regionserver.out is empty. The regionserver.log contains only the
> following for the relevant time period:
>
> Thu Jan  6 12:19:57 EST 2011 Starting regionserver on
> istevens.syncapse.local
> ulimit -n 256
> 2011-01-06 12:19:59,588 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Not starting a distinct
> region server because hbase.cluster.distributed is false
>
> Ian.
>
> On 2011-01-06, at 1:32 PM, Dmitriy Ryaboy wrote:
>
> > Do you happen to have the region server logs as well?
> > The .out as well as .log
> >
> > D
> >
> > On Thu, Jan 6, 2011 at 9:49 AM, Ian Stevens <[email protected]> wrote:
> >
> >> On 2011-01-05, at 5:23 PM, Dmitriy Ryaboy wrote:
> >>
> >>> That certainly sounds like a bug. I wonder if there is anything
> >>> interesting in the HBase logs when you run the job that gets the wrong
> >>> result?
> >>
> >> Hi Dmitriy. I've posted the corresponding master.log and zookeeper.log
> >> from about the time of the failed query. I restarted HBase before making
> >> the query, so there might be noise in the log associated with a restart.
> >>
> >> master.log: http://pastebin.com/VwiXZ9BB
> >> zookeeper.log: http://pastebin.com/CnFVyFT2
> >>
> >> I believe logging level is set to DEBUG for both logs.
> >>
> >> Let me know if you need further logging.
> >>
> >> thanks,
> >> Ian.
> >>
> >>
> >>> On Wed, Jan 5, 2011 at 1:14 PM, Ian Stevens <[email protected]> wrote:
> >>>
> >>>> Hi everyone. In considering Pig for our HBase querying needs, I've run
> >>>> into a discrepancy between the size of Pig's result set and the size of
> >>>> the table being queried. I hope this is due to a misunderstanding of
> >>>> HBase and Pig on my part. The test case which generates the discrepancy
> >>>> is fairly simple, however.
> >>>>
> >>>> The link below contains a Jython script which populates an HBase table
> >>>> with data in two column families. A corresponding Pig query retrieves
> >>>> data for one column and saves it to a CSV:
> >>>>
> >>>> https://gist.github.com/766929
> >>>>
> >>>> The Jython script has the following usage:
> >>>>
> >>>>> jython hbase_test.py [table] [column count] [row count] [batch count]
> >>>>
> >>>> This will populate a table named [table] with two column families. The
> >>>> first contains static data. The second contains the given number of
> >>>> columns, populated with data.
> >>>>
> >>>> The Pig query will return an inaccurate number of results for certain
> >>>> table sizes and configurations, most notably with tables exceeding 1.8
> >>>> million rows in length and with more than 2 columns in the queried
> >>>> column family, e.g.:
> >>>>
> >>>>> jython hbase_test.py test 3 1800000 100000
> >>>>
> >>>> For instance, if I execute the above command and the corresponding Pig
> >>>> query, the results number 905914. Note that if the table is re-populated
> >>>> and queried a second time, a different number results. If I run the
> >>>> query again without re-populating the table, I get the same number of
> >>>> results. The HBase shell returns an accurate row count.
> >>>>
> >>>> Some notes on reproducing this issue (or not):
> >>>>
> >>>> * If the Jython script doesn't populate the meta column family, the
> >>>>   issue goes away with the same query.
> >>>> * If the Jython script populates 2 columns instead of 3, the issue goes
> >>>>   away with the same query.
> >>>> * The size of the column key or its value may influence whether the
> >>>>   issue occurs. For instance, if I change the script to store 'value_%d'
> >>>>   instead of 'value_%d_%d', retaining the random int, the issue goes
> >>>>   away with the same query.
> >>>>
> >>>> I am using Pig 0.8.0 and HBase 0.20.6 on a MacBook running Snow Leopard
> >>>> using the stock Java that came with the OS. Attached is a log of the Pig
> >>>> console output. The error logs contain nothing of import.
> >>>>
> >>>> Am I doing anything incorrectly? Is there a way I can work around this
> >>>> issue without compromising the column family being queried?
> >>>>
> >>>> This appears to be a fairly simple case of Pig/HBase usage. Can anyone
> >>>> else reproduce the issue?
> >>>>
> >>>> thanks,
> >>>> Ian.
> >>>>
> >>>>
> >>
> >>
>
>
