That certainly sounds like a bug. I wonder if there is anything interesting in the HBase logs when you run the job that gets the wrong result?
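
In case it helps with checking the logs, here is a minimal sketch of a filter that prints only the lines worth a look. It assumes log4j-style lines where the level (WARN, ERROR, FATAL) appears as a standalone token, and it also flags any line mentioning an exception. The function name and the log path in the usage comment are hypothetical; adjust them to your install.

```python
import re
import sys

# Assumption: log4j-style lines in which the level is a standalone token.
LEVEL_RE = re.compile(r"\b(WARN|ERROR|FATAL)\b")


def interesting_lines(lines):
    """Yield (line_number, text) for lines at WARN or worse, or mentioning an exception."""
    for number, line in enumerate(lines, start=1):
        if LEVEL_RE.search(line) or "Exception" in line:
            yield number, line.rstrip("\n")


if __name__ == "__main__":
    # Usage (hypothetical path): python scan_log.py /path/to/hbase/logs/regionserver.log
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for number, text in interesting_lines(handle):
                print(f"{path}:{number}: {text}")
```

Running it over the master and region server logs covering the time window of the Pig job should narrow things down quickly, if there is anything there at all.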

On Wed, Jan 5, 2011 at 1:14 PM, Ian Stevens <[email protected]> wrote:
> Hi everyone. In considering Pig for our HBase querying needs, I've run into
> a discrepancy between the size of Pig's result set and the size of the table
> being queried. I hope this is due to a misunderstanding of HBase and Pig on
> my part. The test case which generates the discrepancy is fairly simple,
> however.
>
> The link below contains a Jython script which populates an HBase table with
> data in two column families. A corresponding Pig query retrieves data for one
> column and saves it to a CSV:
>
> https://gist.github.com/766929
>
> The Jython script has the following usage:
>
> > jython hbase_test.py [table] [column count] [row count] [batch count]
>
> This will populate a table named [table] with two column families. The
> first contains static data. The second contains the given number of columns,
> populated with data.
>
> The Pig query will return an inaccurate number of results for certain table
> sizes and configurations, most notably with tables exceeding 1.8 million
> rows in length and with more than 2 columns in the queried column family,
> e.g.
>
> > jython hbase_test.py test 3 1800000 100000
>
> For instance, if I execute the above command and the corresponding Pig
> query, the results number 905914. Note that if the table is re-populated and
> queried a second time, a different number results. If I run the query again
> without re-populating the table, I get the same number of results. The HBase
> shell returns an accurate row count.
>
> Some notes on reproducing this issue (or not):
>
> * If the Jython script doesn't populate the meta column family, the issue
> goes away with the same query.
> * If the Jython script populates 2 columns instead of 3, the issue goes
> away with the same query.
> * The size of the column key or its value may influence whether the issue
> occurs. For instance, if I change the script to store 'value_%d' instead of
> 'value_%d_%d', retaining the random int, the issue goes away with the same
> query.
>
> I am using Pig 0.8.0 and HBase 0.20.6 on a MacBook running Snow Leopard
> using the stock Java that came with the OS. Attached is a log of the Pig
> console output. The error logs contain nothing of import.
>
> Am I doing anything incorrectly? Is there a way I can work around this
> issue without compromising the column family being queried?
>
> This appears to be a fairly simple case of Pig/HBase usage. Can anyone else
> reproduce the issue?
>
> thanks,
> Ian.
