Re: Simple Pig query returns inaccurate result size for HBase tables of 1.8m+ rows

Ian Stevens Thu, 06 Jan 2011 10:54:26 -0800

The regionserver.out is empty. The regionserver.log contains only the following 
for the relevant time period:


Thu Jan  6 12:19:57 EST 2011 Starting regionserver on istevens.syncapse.local
ulimit -n 256
2011-01-06 12:19:59,588 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Not starting a distinct region server because hbase.cluster.distributed is false

Ian.

On 2011-01-06, at 1:32 PM, Dmitriy Ryaboy wrote:

> Do you happen to have the region server logs as well?
> The .out as well as .log
> 
> D
> 
> On Thu, Jan 6, 2011 at 9:49 AM, Ian Stevens <[email protected]> wrote:
> 
>> On 2011-01-05, at 5:23 PM, Dmitriy Ryaboy wrote:
>> 
>>> That certainly sounds like a bug. I wonder if there is anything interesting
>>> in the HBase logs when you run the job that gets the wrong result?
>> 
>> Hi Dmitriy. I've posted the corresponding master.log and zookeeper.log from
>> about the time of the failed query. I restarted HBase before making the
>> query, so there might be noise in the log associated with a restart.
>> 
>> master.log: http://pastebin.com/VwiXZ9BB
>> zookeeper.log: http://pastebin.com/CnFVyFT2
>> 
>> I believe logging level is set to DEBUG for both logs.
>> 
>> Let me know if you need further logging.
>> 
>> thanks,
>> Ian.
>> 
>> 
>>> On Wed, Jan 5, 2011 at 1:14 PM, Ian Stevens <[email protected]> wrote:
>>> 
>>>> Hi everyone. In considering Pig for our HBase querying needs, I've run into
>>>> a discrepancy between the size of Pig's result set and the size of the table
>>>> being queried. I hope this is due to a misunderstanding of HBase and Pig on
>>>> my part. The test case which generates the discrepancy is fairly simple,
>>>> however.
>>>> 
>>>> The link below contains a Jython script which populates an HBase table with
>>>> data in two column families. A corresponding Pig query retrieves data for one
>>>> column and saves it to a CSV:
>>>> 
>>>> https://gist.github.com/766929
>>>> 
>>>> The Jython script has the following usage:
>>>> 
>>>>> jython hbase_test.py [table] [column count] [row count] [batch count]
>>>> 
>>>> This will populate a table named [table] with two column families. The
>>>> first contains static data. The second contains the given number of columns,
>>>> populated with data.
>>>> 
>>>> The Pig query will return an inaccurate number of results for certain table
>>>> sizes and configurations, most notably with tables exceeding 1.8 million
>>>> rows in length and with more than 2 columns in the queried column family,
>>>> e.g.
>>>> 
>>>>> jython hbase_test.py test 3 1800000 100000
>>>> 
>>>> For instance, if I execute the above command and the corresponding Pig
>>>> query, the results number 905914. Note that if the table is re-populated and
>>>> queried a second time, a different number results. If I run the query again
>>>> without re-populating the table, I get the same number of results. The HBase
>>>> shell returns an accurate row count.
>>>> 
>>>> Some notes on reproducing this issue (or not):
>>>> 
>>>> * If the Jython script doesn't populate the meta column family, the issue
>>>>   goes away with the same query.
>>>> * If the Jython script populates 2 columns instead of 3, the issue goes
>>>>   away with the same query.
>>>> * The size of the column key or its value may influence whether the issue
>>>>   occurs. For instance, if I change the script to store 'value_%d' instead of
>>>>   'value_%d_%d', retaining the random int, the issue goes away with the same
>>>>   query.
>>>> 
>>>> I am using Pig 0.8.0 and HBase 0.20.6 on a MacBook running Snow Leopard
>>>> using the stock Java that came with the OS. Attached is a log of the Pig
>>>> console output. The error logs contain nothing of import.
>>>> 
>>>> Am I doing anything incorrectly? Is there a way I can work around this
>>>> issue without compromising the column family being queried?
>>>> 
>>>> This appears to be a fairly simple case of Pig/HBase usage. Can anyone else
>>>> reproduce the issue?
>>>> 
>>>> thanks,
>>>> Ian.
>>>> 
>>>> 
>> 
>>
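
For anyone trying to reproduce the discrepancy described in the original message, the comparison between the expected row count and the size of Pig's stored result can be scripted. The following is only a minimal sketch and not part of the linked gist: the function names and the output directory are hypothetical, and it assumes the Pig STORE output is available on the local filesystem as part-* files with one record per line (no embedded newlines in values).

```python
import glob
import os


def count_records(output_dir):
    """Count lines across all part-* files in output_dir.

    Assumption: each stored record occupies exactly one line.
    Files that do not match part-* (such as log or marker files) are ignored.
    """
    total = 0
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path, "rb") as handle:
            total += sum(1 for _ in handle)
    return total


def missing_rows(expected_rows, output_dir):
    """Return how many records the stored result falls short of expected_rows."""
    return expected_rows - count_records(output_dir)
```

With the figures reported in the original message, an output holding 905914 records checked against the 1800000 rows populated by the Jython script would give a shortfall of 894086.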
