Do you happen to have the region server logs as well? The .out files as well as the .log files.
D

On Thu, Jan 6, 2011 at 9:49 AM, Ian Stevens <[email protected]> wrote:
> On 2011-01-05, at 5:23 PM, Dmitriy Ryaboy wrote:
>
> > That certainly sounds like a bug. I wonder if there is anything
> > interesting in the HBase logs when you run the job that gets the
> > wrong result?
>
> Hi Dmitriy. I've posted the corresponding master.log and zookeeper.log
> from about the time of the failed query. I restarted HBase before making
> the query, so there might be noise in the log associated with a restart.
>
> master.log: http://pastebin.com/VwiXZ9BB
> zookeeper.log: http://pastebin.com/CnFVyFT2
>
> I believe the logging level is set to DEBUG for both logs.
>
> Let me know if you need further logging.
>
> thanks,
> Ian.
>
> > On Wed, Jan 5, 2011 at 1:14 PM, Ian Stevens <[email protected]> wrote:
> >
> >> Hi everyone. In considering Pig for our HBase querying needs, I've run
> >> into a discrepancy between the size of Pig's result set and the size of
> >> the table being queried. I hope this is due to a misunderstanding of
> >> HBase and Pig on my part. The test case which generates the discrepancy
> >> is fairly simple, however.
> >>
> >> The link below contains a Jython script which populates an HBase table
> >> with data in two column families. A corresponding Pig query retrieves
> >> data for one column and saves it to a CSV:
> >>
> >> https://gist.github.com/766929
> >>
> >> The Jython script has the following usage:
> >>
> >>> jython hbase_test.py [table] [column count] [row count] [batch count]
> >>
> >> This will populate a table named [table] with two column families. The
> >> first contains static data. The second contains the given number of
> >> columns, populated with data.
> >>
> >> The Pig query will return an inaccurate number of results for certain
> >> table sizes and configurations, most notably with tables exceeding 1.8
> >> million rows in length and with more than 2 columns in the queried
> >> column family, e.g.
> >>
> >>> jython hbase_test.py test 3 1800000 100000
> >>
> >> For instance, if I execute the above command and the corresponding Pig
> >> query, the results number 905914. Note that if the table is re-populated
> >> and queried a second time, a different number results. If I run the
> >> query again without re-populating the table, I get the same number of
> >> results. The HBase shell returns an accurate row count.
> >>
> >> Some notes on reproducing this issue (or not):
> >>
> >> * If the Jython script doesn't populate the meta column family, the
> >>   issue goes away with the same query.
> >> * If the Jython script populates 2 columns instead of 3, the issue goes
> >>   away with the same query.
> >> * The size of the column key or its value may influence whether the
> >>   issue occurs. For instance, if I change the script to store 'value_%d'
> >>   instead of 'value_%d_%d', retaining the random int, the issue goes
> >>   away with the same query.
> >>
> >> I am using Pig 0.8.0 and HBase 0.20.6 on a MacBook running Snow Leopard
> >> using the stock Java that came with the OS. Attached is a log of the Pig
> >> console output. The error logs contain nothing of import.
> >>
> >> Am I doing anything incorrectly? Is there a way I can work around this
> >> issue without compromising the column family being queried?
> >>
> >> This appears to be a fairly simple case of Pig/HBase usage. Can anyone
> >> else reproduce the issue?
> >>
> >> thanks,
> >> Ian.
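For anyone trying to reproduce the quoted test case without pulling apart the gist, the Pig side of it amounts to a single HBaseStorage load followed by a store. A rough sketch under Pig 0.8 would look something like the following; the table name, column family, and column name here are placeholders, not necessarily what the gist actually uses:

    -- load one column from the data family of the 'test' table
    rows = LOAD 'hbase://test'
           USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('data:col_0')
           AS (val:chararray);
    -- write the results out as comma-separated values
    STORE rows INTO 'test_out' USING PigStorage(',');

Comparing the record count of the output of a script like this against the HBase shell's own row count (e.g. count 'test') is presumably how the discrepancy described above shows up.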
