Re: Simple Pig query returns inaccurate result size for HBase tables of 1.8m+ rows

Ian Stevens Thu, 06 Jan 2011 09:49:54 -0800

On 2011-01-05, at 5:23 PM, Dmitriy Ryaboy wrote:

> That certainly sounds like a bug. I wonder if there is anything interesting
> in the HBase logs when you run the job that gets the wrong result?


Hi Dmitriy. I've posted the corresponding master.log and zookeeper.log from 
about the time of the failed query. I restarted HBase before making the query, 
so there might be noise in the log associated with a restart.

master.log: http://pastebin.com/VwiXZ9BB
zookeeper.log: http://pastebin.com/CnFVyFT2

I believe the logging level is set to DEBUG for both logs.

Let me know if you need further logging.

thanks,
Ian.


> On Wed, Jan 5, 2011 at 1:14 PM, Ian Stevens <[email protected]> wrote:
> 
>> Hi everyone. In considering Pig for our HBase querying needs, I've run into
>> a discrepancy between the size of Pig's result set and the size of the table
>> being queried. I hope this is due to a misunderstanding of HBase and Pig on
>> my part. The test case which generates the discrepancy is fairly simple,
>> however.
>> 
>> The link below contains a Jython script which populates an HBase table with
>> data in two column families. A corresponding Pig query retrieves data for one
>> column and saves it to a CSV:
>> 
>> https://gist.github.com/766929
>> 
>> The Jython script has the following usage:
>> 
>>> jython hbase_test.py [table] [column count] [row count] [batch count]
>> 
>> This will populate a table named [table] with two column families. The
>> first contains static data. The second contains the given number of columns,
>> populated with data.
>> 
>> The Pig query will return an inaccurate number of results for certain table
>> sizes and configurations, most notably with tables exceeding 1.8 million
>> rows in length and with more than 2 columns in the queried column family,
>> e.g.
>> 
>>> jython hbase_test.py test 3 1800000 100000
>> 
>> For instance, if I execute the above command and the corresponding Pig
>> query, I get 905914 results instead of the expected 1800000. If the table
>> is re-populated and queried a second time, the number of results differs.
>> If I run the query again without re-populating the table, I get the same
>> number of results. The HBase shell returns an accurate row count.
>> 
>> Some notes on reproducing this issue (or not):
>> 
>> * If the Jython script doesn't populate the meta column family, the issue
>> goes away with the same query.
>> * If the Jython script populates 2 columns instead of 3, the issue goes
>> away with the same query.
>> * The size of the column key or its value may influence whether the issue
>> occurs. For instance, if I change the script to store 'value_%d' instead
>> of 'value_%d_%d', retaining the random int, the issue goes away with the
>> same query.
>> 
>> I am using Pig 0.8.0 and HBase 0.20.6 on a MacBook running Snow Leopard
>> using the stock Java that came with the OS. Attached is a log of the Pig
>> console output. The error logs contain nothing of import.
>> 
>> Am I doing anything incorrectly? Is there a way I can work around this
>> issue without compromising the column family being queried?
>> 
>> This appears to be a fairly simple case of Pig/HBase usage. Can anyone else
>> reproduce the issue?
>> 
>> thanks,
>> Ian.
>> 
>>
