Ok… 

Silly question time… so just humor me for a second.

1) What do you mean by saying you have a partitioned HBase table? (Regions 
and partitions are not the same) 

2) There’s a question of the isolation level during the scan. What happens when 
there is a compaction running or there’s row-level locking (RLL) taking place? 

Does your scan get locked/blocked? Does it skip the row? 
(This should be documented.) 
Do you count the number of rows scanned when building the list of rows that 
need to be processed further? 





> On Feb 25, 2015, at 4:46 PM, Stephen Durfey <sjdur...@gmail.com> wrote:

> 
>> 
>> Are you writing any Deletes? Are you writing any duplicates?
> 
> 
> No physical deletes are occurring in my data, and there is a very real
> possibility of duplicates.
> 
> How is the partitioning done?
>> 
> 
> The key structure would be /partition_id/person_id .... I'm dealing with
> clinical data, with a data source identified by the partition, and the
> person data is associated with that particular partition at load time.
> 
> Are you doing the column filtering with a custom filter or one of the
>> prepackaged ones?
>> 
> 
> They appear to be all prepackaged filters:  FamilyFilter, KeyOnlyFilter,
> QualifierFilter, and ColumnPrefixFilter are used under various conditions,
> depending upon what is requested on the Scan object.
> 
> 
> On Wed, Feb 25, 2015 at 4:35 PM, Sean Busbey <bus...@cloudera.com> wrote:
> 
>> Are you writing any Deletes? Are you writing any duplicates?
>> 
>> How is the partitioning done?
>> 
>> What does the entire key structure look like?
>> 
>> Are you doing the column filtering with a custom filter or one of the
>> prepackaged ones?
>> 
>> On Wed, Feb 25, 2015 at 12:57 PM, Stephen Durfey <sjdur...@gmail.com>
>> wrote:
>> 
>>>> 
>>>> What's the TTL setting for your table ?
>>>> 
>>>> Which hbase release are you using ?
>>>> 
>>>> Was there compaction in between the scans ?
>>>> 
>>>> Thanks
>>>> 
>>> 
>>> The TTL is set to the max. The HBase version is 0.94.6-cdh4.4.0. I don’t
>>> want to say compactions aren’t a factor, but the jobs are short lived
>> (4-5
>>> minutes), and I have run them frequently over the last couple of days
>>> trying to gather stats around what was being extracted, and trying to
>> find
>>> the difference and intersection in row keys before job runs.
>>> 
>>> These numbers have varied wildly, from being off by 2-3 between
>>> subsequent scans to 40-row increases, followed by a drop of 70 rows.
>>>> When you say there is a variation in the number of rows retrieved - the
>>> 40
>>>> rows that got increased - are those rows in the expected time range? Or
>>> is
>>>> the system retrieving some rows which are not in the specified time
>>> range?
>>>> 
>>>> And when the rows drop by 70, did any row that should have been
>>>> retrieved get missed?
>>>> 
>>> 
>>> The best I can tell, if there is an increase in counts, those rows are
>> not
>>> coming from outside of the time range. In the job, I am maintaining a
>> list
>>> of rows that have a timestamp outside of my provided time range, and then
>>> writing those out to hdfs at the end of the map task. So far, nothing has
>>> been written out.
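[Editor's sketch: the out-of-range bookkeeping described above boils down to a timestamp bounds check. A minimal plain-Java version, with no HBase dependency; note that HBase's own TimeRange (as set by Scan.setTimeRange) is min-inclusive and max-exclusive, so an off-by-one at either bound can masquerade as row-count drift.]

```java
// Sketch of the "timestamp outside my provided time range" check described
// above. Plain Java, no HBase classes; mirrors Scan.setTimeRange semantics,
// where the range [min, max) is min-inclusive and max-exclusive.
public class TimeRangeCheck {
    static boolean inRange(long ts, long min, long max) {
        return ts >= min && ts < max; // same bounds as HBase's TimeRange
    }

    public static void main(String[] args) {
        long begin = 1000L, end = 2000L;
        System.out.println(inRange(1000L, begin, end)); // min is inclusive -> true
        System.out.println(inRange(2000L, begin, end)); // max is exclusive -> false
    }
}
```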
>>> 
>>> Any filters in your scan?
>>>> 
>>> 
>>>> 
>>> Regards
>>>> Ram
>>>> 
>>> 
>>> There are some column filters. There is an API abstraction on top of
>> HBase
>>> that I am using to allow users to easily extract data from columns that
>>> start with a provided column prefix. So, the column filters are in place
>> to
>>> ensure I am only getting back data from columns that start with the
>>> provided prefix.
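[Editor's sketch: the prefix match those filters perform is just a byte-wise comparison on the column qualifier. A plain-Java version of that logic, not the actual org.apache.hadoop.hbase.filter.ColumnPrefixFilter class; the "med_" prefix is a hypothetical example, not from the thread.]

```java
import java.util.Arrays;

// Plain-Java sketch of what a column-prefix filter does per qualifier:
// keep the cell only if the qualifier's leading bytes equal the prefix.
// (In real code this is org.apache.hadoop.hbase.filter.ColumnPrefixFilter.)
public class PrefixMatch {
    static boolean qualifierHasPrefix(byte[] qualifier, byte[] prefix) {
        if (qualifier.length < prefix.length) {
            return false; // qualifier too short to carry the prefix
        }
        return Arrays.equals(Arrays.copyOf(qualifier, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] prefix = "med_".getBytes(); // hypothetical prefix
        System.out.println(qualifierHasPrefix("med_code".getBytes(), prefix)); // true
        System.out.println(qualifierHasPrefix("lab_code".getBytes(), prefix)); // false
    }
}
```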
>>> 
>>> To add a little more detail, my row keys are separated out by partition.
>> At
>>> periodic times (through oozie), data is loaded from a source into the
>>> appropriate partition. I ran some scans against a partition that hadn't
>>> been updated in almost a year (with a scan range around the times of the
>>> 2nd to last load into the table), and the row key counts were consistent
>>> across multiple scans. I chose another partition that is actively being
>>> updated once a day. I chose a scan time around the 4th most recent load,
>>> and the results were inconsistent from scan to scan (fluctuating up and
>>> down). Setting the begin time to 4 days in the past and the end time on
>>> the scan range to 'right now', using System.currentTimeMillis() (with the time
>> being
>>> after the daily load), the results also fluctuated up and down. So, it
>> kind
>>> of seems like there is some sort of temporal recency that is causing the
>>> counts to fluctuate.
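[Editor's sketch: the moving-window setup described above, in plain Java. The 4-day width is the example from the paragraph; a window whose end is "right now" is re-evaluated on every run, which is one reason back-to-back scans are not comparing the same range.]

```java
// Sketch of the scan window described above: begin = 4 days ago, end = now.
// Each call to window() produces a different range, since "now" moves.
public class ScanWindow {
    static final long FOUR_DAYS_MS = 4L * 24 * 60 * 60 * 1000;

    // Returns {begin, end} for a window ending at nowMs.
    static long[] window(long nowMs) {
        return new long[] { nowMs - FOUR_DAYS_MS, nowMs };
    }

    public static void main(String[] args) {
        long[] w = window(System.currentTimeMillis());
        System.out.println("window width ms = " + (w[1] - w[0])); // 345600000
    }
}
```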
>>> 
>>> 
>>> 
>>> On Feb 24, 2015, at 10:20 PM, ramkrishna vasudevan <
>>> ramkrishna.s.vasude...@gmail.com> wrote:
>>> 
>>> These numbers have varied wildly, from being off by 2-3 between
>>> subsequent scans to 40-row increases, followed by a drop of 70 rows.
>>> When you say there is a variation in the number of rows retrieved - the
>> 40
>>> rows that got increased - are those rows in the expected time range? Or
>> is
>>> the system retrieving some rows which are not in the specified time
>> range?
>>> 
>>> And when the rows drop by 70, did any row that should have been
>>> retrieved get missed?
>>> 
>>> Any filters in your scan?
>>> 
>>> Regards
>>> Ram
>>> 
>>> On Wed, Feb 25, 2015 at 8:31 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>> 
>>> What's the TTL setting for your table ?
>>> 
>>> Which hbase release are you using ?
>>> 
>>> Was there compaction in between the scans ?
>>> 
>>> Thanks
>>> 
>>> 
>>> On Feb 24, 2015, at 2:32 PM, Stephen Durfey <sjdur...@gmail.com> wrote:
>>> 
>>> I have some code that accepts a time range and looks for data written to
>>> 
>>> an HBase table during that range. If anything has been written for that
>> row
>>> during that range, the row key is saved off, and sometime later in the
>>> pipeline those row keys are used to extract the entire row. I’m testing
>>> against a fixed time range, at some point in the past. This is being done
>>> as part of a Map/Reduce job (using Apache Crunch). I have some job
>> counters
>>> setup to keep track of the number of rows extracted. Since the time range
>>> is fixed, I would expect the scan to return the same number of rows with
>>> data in the provided time range. However, I am seeing this number vary
>> from
>>> scan to scan (bouncing between increasing and decreasing).
>>> 
>>> 
>>> I’ve eliminated the possibility that data is being pulled in from
>>> 
>>> outside the time range. I did this by scanning for one column qualifier
>>> (using it alone to decide whether a row had data in the time range),
>>> getting the timestamp on the cell for each returned row, and comparing it
>>> against the begin and end times for the scan; I didn’t find any outside
>>> the range. I’ve observed some row keys show up in
>>> the 1st scan, then drop out in the 2nd scan, only to show back up again
>> in
>>> the 3rd scan (all with the exact same Scan object). These numbers have
>>> varied wildly, from being off by 2-3 between subsequent scans to 40-row
>>> increases, followed by a drop of 70 rows.
>>> 
>>> 
>>> I’m kind of looking for ideas to try to track down what could be causing
>>> 
>>> this to happen. The code itself is pretty simple: it creates a Scan
>> object,
>>> scans the table, and then in the map phase, extracts the row key, and
>> at
>>> the end, it dumps them to a directory in hdfs.
>>> 
>> 
>> 
>> 
>> --
>> Sean
>> 

The opinions expressed here are mine, while they may reflect a cognitive 
thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com




