Do you happen to have the region server logs as well? The .out files as well as the .log files.
D

On Thu, Jan 6, 2011 at 9:49 AM, Ian Stevens <[email protected]> wrote:
> On 2011-01-05, at 5:23 PM, Dmitriy Ryaboy wrote:
>
> > That certainly sounds like a bug. I wonder if there is anything
> > interesting in the HBase logs when you run the job that gets the
> > wrong result?
>
> Hi Dmitriy. I've posted the corresponding master.log and zookeeper.log
> from about the time of the failed query. I restarted HBase before making
> the query, so there might be noise in the log associated with a restart.
>
> master.log: http://pastebin.com/VwiXZ9BB
> zookeeper.log: http://pastebin.com/CnFVyFT2
>
> I believe the logging level is set to DEBUG for both logs.
>
> Let me know if you need further logging.
>
> thanks,
> Ian.
>
> > On Wed, Jan 5, 2011 at 1:14 PM, Ian Stevens <[email protected]> wrote:
> >
> >> Hi everyone. In considering Pig for our HBase querying needs, I've run
> >> into a discrepancy between the size of Pig's result set and the size of
> >> the table being queried. I hope this is due to a misunderstanding of
> >> HBase and Pig on my part. The test case which generates the discrepancy
> >> is fairly simple, however.
> >>
> >> The link below contains a Jython script which populates an HBase table
> >> with data in two column families. A corresponding Pig query retrieves
> >> data for one column and saves it to a CSV:
> >>
> >> https://gist.github.com/766929
> >>
> >> The Jython script has the following usage:
> >>
> >>> jython hbase_test.py [table] [column count] [row count] [batch count]
> >>
> >> This will populate a table named [table] with two column families. The
> >> first contains static data. The second contains the given number of
> >> columns, populated with data.
> >>
> >> The Pig query will return an inaccurate number of results for certain
> >> table sizes and configurations, most notably with tables exceeding 1.8
> >> million rows in length and with more than 2 columns in the queried
> >> column family, e.g.
> >>
> >>> jython hbase_test.py test 3 1800000 100000
> >>
> >> For instance, if I execute the above command and the corresponding Pig
> >> query, the results number 905914. Note that if the table is re-populated
> >> and queried a second time, a different number results. If I run the
> >> query again without re-populating the table, I get the same number of
> >> results. The HBase shell returns an accurate row count.
> >>
> >> Some notes on reproducing this issue (or not):
> >>
> >> * If the Jython script doesn't populate the meta column family, the
> >>   issue goes away with the same query.
> >> * If the Jython script populates 2 columns instead of 3, the issue goes
> >>   away with the same query.
> >> * The size of the column key or its value may influence whether the
> >>   issue occurs. For instance, if I change the script to store 'value_%d'
> >>   instead of 'value_%d_%d', retaining the random int, the issue goes
> >>   away with the same query.
> >>
> >> I am using Pig 0.8.0 and HBase 0.20.6 on a MacBook running Snow Leopard
> >> using the stock Java that came with the OS. Attached is a log of the Pig
> >> console output. The error logs contain nothing of import.
> >>
> >> Am I doing anything incorrectly? Is there a way I can work around this
> >> issue without compromising the column family being queried?
> >>
> >> This appears to be a fairly simple case of Pig/HBase usage. Can anyone
> >> else reproduce the issue?
> >>
> >> thanks,
> >> Ian.
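For anyone trying to reproduce the quoted test case without pulling apart the gist, the Pig side of it amounts to a single HBaseStorage load followed by a store. A rough sketch under Pig 0.8 would look something like the following; the table name, column family, and column name here are placeholders, not necessarily what the gist actually uses:

    -- load one column from the data family of the 'test' table
    rows = LOAD 'hbase://test'
           USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('data:col_0')
           AS (val:chararray);
    -- write the results out as comma-separated values
    STORE rows INTO 'test_out' USING PigStorage(',');

Comparing the record count of the output of a script like this against the HBase shell's own row count (e.g. count 'test') is presumably how the discrepancy described above shows up.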
