Ok… Silly question time… so just humor me for a second.
1) What do you mean by saying you have a partitioned HBase table? (Regions and partitions are not the same thing.)

2) There's a question of the isolation level during the scan. What happens when a compaction is running, or row-level locking (RLL) is taking place? Does your scan get locked/blocked? Does it skip the row? (This should be documented.) Do you count the number of rows scanned when building the list of rows that need to be processed further?

> On Feb 25, 2015, at 4:46 PM, Stephen Durfey <sjdur...@gmail.com> wrote:
>
>> Are you writing any Deletes? Are you writing any duplicates?
>
> No physical deletes are occurring in my data, and there is a very real
> possibility of duplicates.
>
>> How is the partitioning done?
>
> The key structure would be /partition_id/person_id .... I'm dealing with
> clinical data, with a data source identified by the partition, and the
> person data is associated with that particular partition at load time.
>
>> Are you doing the column filtering with a custom filter or one of the
>> prepackaged ones?
>
> They appear to all be prepackaged filters: FamilyFilter, KeyOnlyFilter,
> QualifierFilter, and ColumnPrefixFilter are used under various conditions,
> depending upon what is requested on the Scan object.
>
> On Wed, Feb 25, 2015 at 4:35 PM, Sean Busbey <bus...@cloudera.com> wrote:
>
>> Are you writing any Deletes? Are you writing any duplicates?
>>
>> How is the partitioning done?
>>
>> What does the entire key structure look like?
>>
>> Are you doing the column filtering with a custom filter or one of the
>> prepackaged ones?
>>
>> On Wed, Feb 25, 2015 at 12:57 PM, Stephen Durfey <sjdur...@gmail.com>
>> wrote:
>>
>>>> What's the TTL setting for your table ?
>>>>
>>>> Which hbase release are you using ?
>>>>
>>>> Was there compaction in between the scans ?
>>>>
>>>> Thanks
>>>
>>> The TTL is set to the max. The HBase version is 0.94.6-cdh4.4.0.
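For reference on the time-range question running through this thread: an HBase Scan's time range is half-open — the lower timestamp is inclusive and the upper exclusive. A minimal plain-Java sketch of that check (the class and method names here are illustrative, not HBase's own code):

```java
// Sketch of the half-open [minStamp, maxStamp) semantics that an HBase
// Scan's time range applies to each cell timestamp. Illustrative names only.
public class TimeRangeCheck {
    // Returns true if a cell with timestamp ts would match the scan's range.
    static boolean withinTimeRange(long minStamp, long maxStamp, long ts) {
        return ts >= minStamp && ts < maxStamp;
    }

    public static void main(String[] args) {
        long begin = 1_000L, end = 2_000L;
        System.out.println(withinTimeRange(begin, end, 1_000L)); // lower bound included
        System.out.println(withinTimeRange(begin, end, 2_000L)); // upper bound excluded
    }
}
```

The exclusive upper bound matters when the end of the range is set to "now": writes landing at or after that instant are legitimately outside the scan.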
>>> I don't want to say compactions aren't a factor, but the jobs are
>>> short-lived (4-5 minutes), and I have run them frequently over the last
>>> couple of days trying to gather stats around what was being extracted,
>>> and trying to find the difference and intersection in row keys between
>>> job runs.
>>>
>>>> These numbers have varied wildly, from being off by 2-3 between
>>>> subsequent scans to 40 row increases, followed by a drop of 70 rows.
>>>
>>>> When you say there is a variation in the number of rows retrieved - the
>>>> 40 rows that got increased - are those rows in the expected time range?
>>>> Or is the system retrieving some rows which are not in the specified
>>>> time range?
>>>>
>>>> And when the rows drop by 70, are you saying some row which needed to
>>>> be retrieved got missed out?
>>>
>>> The best I can tell, if there is an increase in counts, those rows are
>>> not coming from outside of the time range. In the job, I am maintaining
>>> a list of rows that have a timestamp outside of my provided time range,
>>> and then writing those out to HDFS at the end of the map task. So far,
>>> nothing has been written out.
>>>
>>>> Any filters in your scan?
>>>>
>>>> Regards
>>>> Ram
>>>
>>> There are some column filters. There is an API abstraction on top of
>>> HBase that I am using to allow users to easily extract data from columns
>>> that start with a provided column prefix. So, the column filters are in
>>> place to ensure I am only getting back data from columns that start with
>>> the provided prefix.
>>>
>>> To add a little more detail, my row keys are separated out by partition.
>>> At periodic times (through Oozie), data is loaded from a source into the
>>> appropriate partition.
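The prefix-based column extraction described above is, in effect, what ColumnPrefixFilter does server-side: keep only cells whose qualifier starts with the given prefix. A plain-Java sketch of that semantics (the real filter compares qualifier bytes, and all names and column values here are made up, not from the abstraction in the thread):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of ColumnPrefixFilter's matching rule: retain only the column
// qualifiers that start with the requested prefix. Illustrative names only.
public class PrefixFilterSketch {
    static List<String> keepColumns(List<String> qualifiers, String prefix) {
        List<String> kept = new ArrayList<>();
        for (String q : qualifiers) {
            if (q.startsWith(prefix)) {   // HBase compares the raw qualifier bytes
                kept.add(q);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> quals = Arrays.asList("lab_result_a", "lab_result_b", "med_order");
        System.out.println(keepColumns(quals, "lab_")); // only the lab_* columns survive
    }
}
```

One point worth noting for the counting question: a prefix filter changes which cells come back, not which timestamps exist on the row, so it should not by itself explain counts that drift between identical scans.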
>>> I ran some scans against a partition that hadn't been updated in almost
>>> a year (with a scan range around the times of the 2nd-to-last load into
>>> the table), and the row key counts were consistent across multiple
>>> scans. I chose another partition that is actively being updated once a
>>> day. I chose a scan time around the 4th most recent load, and the
>>> results were inconsistent from scan to scan (fluctuating up and down).
>>> Setting the begin time to 4 days in the past and the end time on the
>>> scan range to 'right now', using System.currentTimeMillis() (with the
>>> time being after the daily load), the results also fluctuated up and
>>> down. So, it kind of seems like there is some sort of temporal recency
>>> that is causing the counts to fluctuate.
>>>
>>> On Feb 24, 2015, at 10:20 PM, ramkrishna vasudevan <
>>> ramkrishna.s.vasude...@gmail.com> wrote:
>>>
>>> These numbers have varied wildly, from being off by 2-3 between
>>> subsequent scans to 40 row increases, followed by a drop of 70 rows.
>>>
>>> When you say there is a variation in the number of rows retrieved - the
>>> 40 rows that got increased - are those rows in the expected time range?
>>> Or is the system retrieving some rows which are not in the specified
>>> time range?
>>>
>>> And when the rows drop by 70, are you saying some row which needed to be
>>> retrieved got missed out?
>>>
>>> Any filters in your scan?
>>>
>>> Regards
>>> Ram
>>>
>>> On Wed, Feb 25, 2015 at 8:31 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>> What's the TTL setting for your table ?
>>>
>>> Which hbase release are you using ?
>>>
>>> Was there compaction in between the scans ?
>>>
>>> Thanks
>>>
>>> On Feb 24, 2015, at 2:32 PM, Stephen Durfey <sjdur...@gmail.com> wrote:
>>>
>>> I have some code that accepts a time range and looks for data written
>>> to an HBase table during that range.
>>> If anything has been written for that row during that range, the row
>>> key is saved off, and sometime later in the pipeline those row keys are
>>> used to extract the entire row. I'm testing against a fixed time range,
>>> at some point in the past. This is being done as part of a Map/Reduce
>>> job (using Apache Crunch). I have some job counters set up to keep
>>> track of the number of rows extracted. Since the time range is fixed, I
>>> would expect the scan to return the same number of rows with data in
>>> the provided time range. However, I am seeing this number vary from
>>> scan to scan (bouncing between increasing and decreasing).
>>>
>>> I've eliminated the possibility that data is being pulled in from
>>> outside the time range. I did this by scanning for one column qualifier
>>> (using only that qualifier to decide whether a row had data in the time
>>> range), getting the timestamp on the cell for each returned row, and
>>> comparing it against the begin and end times for the scan; I didn't
>>> find any that fell outside the range. I've observed some row keys show
>>> up in the 1st scan, then drop out in the 2nd scan, only to show back up
>>> again in the 3rd scan (all with the exact same Scan object). These
>>> numbers have varied wildly, from being off by 2-3 between subsequent
>>> scans to 40 row increases, followed by a drop of 70 rows.
>>>
>>> I'm kind of looking for ideas to try to track down what could be
>>> causing this to happen. The code itself is pretty simple: it creates a
>>> Scan object, scans the table, and then in the map phase extracts the
>>> row key, and at the end dumps the keys to a directory in HDFS.
>>
>> --
>> Sean

The opinions expressed here are mine; while they may reflect a cognitive thought, that is purely accidental. Use at your own risk.

Michael Segel
michael_segel (AT) hotmail.com
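One way to pin down the flickering described above is to diff the row-key sets from consecutive runs of the same Scan and inspect exactly which keys appear and disappear. A minimal sketch, assuming the keys have already been dumped to HDFS as text (the key values and names below are made up for illustration):

```java
import java.util.Arrays;
import java.util.TreeSet;

// Sketch: compare the row-key sets produced by two runs of the same Scan
// to isolate which keys flicker in and out between runs.
public class ScanDiff {
    // Keys present in the first run but absent from the second.
    static TreeSet<String> missingFrom(TreeSet<String> first, TreeSet<String> second) {
        TreeSet<String> missing = new TreeSet<>(first);
        missing.removeAll(second);
        return missing;
    }

    public static void main(String[] args) {
        TreeSet<String> run1 = new TreeSet<>(Arrays.asList(
            "/p1/person_001", "/p1/person_002", "/p1/person_003"));
        TreeSet<String> run2 = new TreeSet<>(Arrays.asList(
            "/p1/person_001", "/p1/person_003", "/p1/person_004"));
        System.out.println(missingFrom(run1, run2)); // dropped between runs
        System.out.println(missingFrom(run2, run1)); // newly appeared
    }
}
```

Fetching the full cell history (all versions, no filters) for just the handful of flickering keys, on the region server that hosts them, would then show whether the timestamps sit near a range boundary, near the TTL, or near a compaction.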