Hi everyone. In considering Pig for our HBase querying needs, I've run into a 
discrepancy between the size of Pig's result set and the size of the table 
being queried. I hope this is due to a misunderstanding of HBase and Pig on my 
part. The test case which generates the discrepancy is fairly simple, however.

The link below contains a Jython script which populates an HBase table with 
data in two column families. A corresponding Pig query retrieves data for one 
column and saves it to a CSV:

https://gist.github.com/766929

The Jython script has the following usage:

> jython hbase_test.py [table] [column count] [row count] [batch count]

This will populate a table named [table] with two column families. The first 
contains static data. The second contains the given number of columns, 
populated with data.

The Pig query returns an inaccurate number of results for certain table 
sizes and configurations, most notably for tables exceeding 1.8 million rows 
with more than 2 columns in the queried column family, e.g.:

> jython hbase_test.py test 3 1800000 100000

For instance, if I execute the above command and the corresponding Pig query, 
the query returns 905914 results instead of the expected 1800000. If the table 
is re-populated and queried a second time, the result count is different. If I 
run the query again without re-populating the table, I get the same count as 
before. The HBase shell returns an accurate row count.
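
To make the discrepancy easy to check, here is a minimal Python sketch that 
counts the records Pig stored and compares the total with the expected row 
count. It assumes the query stores its output as a directory of part-* files 
(the usual layout for Pig's STORE); the function names and the directory name 
are hypothetical and not part of the gist.

```python
import glob
import os


def count_pig_records(output_dir):
    """Count non-empty lines across the part-* files in output_dir.

    Assumes one record per line, as written by a text-based store function.
    """
    total = 0
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path) as handle:
            total += sum(1 for line in handle if line.strip())
    return total


def missing_rows(output_dir, expected_rows):
    """Return how many rows are missing relative to the expected row count."""
    return expected_rows - count_pig_records(output_dir)
```

For the example above, calling missing_rows("results", 1800000) on a 
hypothetical output directory named "results" would give the size of the 
shortfall; a value of 0 means the result set matches the table.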

Some notes on reproducing this issue (or not):

* If the Jython script doesn't populate the meta column family, the issue goes 
away with the same query.
* If the Jython script populates 2 columns instead of 3, the issue goes away 
with the same query.
* The size of the column key or its value may influence whether the issue 
occurs. For instance, if I change the script to store 'value_%d' instead of 
'value_%d_%d', retaining the random int, the issue goes away with the same 
query.

I am using Pig 0.8.0 and HBase 0.20.6 on a MacBook running Snow Leopard using 
the stock Java that came with the OS. Attached is a log of the Pig console 
output. The error logs contain nothing of import.

Am I doing anything incorrectly? Is there a way I can work around this issue 
without compromising the column family being queried?

This appears to be a fairly simple case of Pig/HBase usage. Can anyone else 
reproduce the issue?

thanks,
Ian.
