The maxVersions field of the Scan object is 1 by default: private int maxVersions = 1;
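A minimal sketch of the two Scan configurations being compared in this thread (the column family and timestamps are made up, and the 0.94 client API is assumed):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeScanSketch {

    // Hypothetical fixed historical window the job scans over.
    static final long BEGIN_TS = 1420070400000L; // 2015-01-01
    static final long END_TS   = 1422748800000L; // 2015-02-01

    public static Scan buildScan(boolean allVersions) throws IOException {
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("d"));   // made-up column family
        scan.setTimeRange(BEGIN_TS, END_TS);  // half-open range [BEGIN_TS, END_TS)

        if (allVersions) {
            // Without this, the Scan keeps its default of a single version per
            // column, so a newer write can mask an older version that falls
            // inside the window. Requesting all versions is the change that
            // made the counts stable, as described below.
            scan.setMaxVersions();
        }
        return scan;
    }
}

buildScan(true) corresponds to the change described below in the thread; buildScan(false) keeps the default, one-version-per-column behaviour.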
Cheers

On Thu, Feb 26, 2015 at 12:31 PM, Stephen Durfey <[email protected]> wrote:

> 1) What do you mean by saying you have a partitioned HBase table?
> (Regions and partitions are not the same)

By partitions, I just mean logical partitions, using the row key to keep
data from separate data sources apart from each other.

I think the issue may be resolved now, but it isn't obvious to me why the
change works. The table is set to save the max number of versions, but the
number of versions is not specified in the Scan object. Once I changed the
Scan to request the max number of versions, the counts remained the same
across all subsequent job runs. Can anyone provide some insight as to why
this is the case?

On Thu, Feb 26, 2015 at 8:35 AM, Michael Segel <[email protected]> wrote:

Ok… Silly question time… so just humor me for a second.

1) What do you mean by saying you have a partitioned HBase table?
(Regions and partitions are not the same)

2) There's a question of the isolation level during the scan. What happens
when there is a compaction running or there's RLL taking place?

Does your scan get locked/blocked? Does it skip the row?
(This should be documented.)
Do you count the number of rows scanned when building the list of rows that
need to be processed further?

On Feb 25, 2015, at 4:46 PM, Stephen Durfey <[email protected]> wrote:

> Are you writing any Deletes? Are you writing any duplicates?

No physical deletes are occurring in my data, and there is a very real
possibility of duplicates.

> How is the partitioning done?

The key structure would be /partition_id/person_id ... I'm dealing with
clinical data, with a data source identified by the partition, and the
person data is associated with that particular partition at load time.

> Are you doing the column filtering with a custom filter or one of the
> prepackaged ones?

They appear to be all prepackaged filters: FamilyFilter, KeyOnlyFilter,
QualifierFilter, and ColumnPrefixFilter are used under various conditions,
depending upon what is requested on the Scan object.

On Wed, Feb 25, 2015 at 4:35 PM, Sean Busbey <[email protected]> wrote:

Are you writing any Deletes? Are you writing any duplicates?

How is the partitioning done?

What does the entire key structure look like?

Are you doing the column filtering with a custom filter or one of the
prepackaged ones?

On Wed, Feb 25, 2015 at 12:57 PM, Stephen Durfey <[email protected]> wrote:

> What's the TTL setting for your table?
>
> Which hbase release are you using?
>
> Was there compaction in between the scans?
>
> Thanks

The TTL is set to the max. The HBase version is 0.94.6-cdh4.4.0. I don't
want to say compactions aren't a factor, but the jobs are short-lived (4-5
minutes), and I have run them frequently over the last couple of days
trying to gather stats around what was being extracted, and trying to find
the difference and intersection in row keys between job runs.
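(As a rough illustration of the logical-partition layout and the row-key comparison mentioned above: the table name, partition id, and use of the plain 0.94 HTable client are all assumptions here.)

import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PartitionRowKeys {

    // Collects the row keys of one logical partition (one /partition_id/ prefix)
    // so the results of two runs can be diffed/intersected afterwards.
    public static Set<String> rowKeysForPartition(String partitionId) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "clinical_data");   // made-up table name

        byte[] startRow = Bytes.toBytes("/" + partitionId + "/");
        byte[] stopRow  = Bytes.toBytes("/" + partitionId + "0"); // '0' sorts right after '/'

        Scan scan = new Scan();
        scan.setStartRow(startRow);
        scan.setStopRow(stopRow);

        Set<String> keys = new TreeSet<String>();
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                keys.add(Bytes.toString(r.getRow()));
            }
        } finally {
            scanner.close();
            table.close();
        }
        return keys;
    }
}

Collecting the keys from two consecutive runs and taking the set difference and intersection is essentially the comparison described above.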
> > These numbers have varied wildly, from being off by 2-3 between
> > subsequent scans to 40 row increases, followed by a drop of 70 rows.
>
> When you say there is a variation in the number of rows retrieved - the 40
> rows that got increased - are those rows in the expected time range? Or is
> the system retrieving some rows which are not in the specified time range?
>
> And when the rows drop by 70, did any row that needed to be retrieved get
> missed?

The best I can tell, if there is an increase in counts, those rows are not
coming from outside of the time range. In the job, I am maintaining a list
of rows that have a timestamp outside of my provided time range, and then
writing those out to hdfs at the end of the map task. So far, nothing has
been written out.

> Any filters in your scan?
>
> Regards
> Ram

There are some column filters. There is an API abstraction on top of hbase
that I am using to allow users to easily extract data from columns that
start with a provided column prefix. So, the column filters are in place to
ensure I am only getting back data from columns that start with the
provided prefix.

To add a little more detail, my row keys are separated out by partition. At
periodic times (through oozie), data is loaded from a source into the
appropriate partition. I ran some scans against a partition that hadn't
been updated in almost a year (with a scan range around the times of the
2nd to last load into the table), and the row key counts were consistent
across multiple scans. I then chose another partition that is actively being
updated once a day. I chose a scan time around the 4th most recent load, and
the results were inconsistent from scan to scan (fluctuating up and down).
Setting the begin time to 4 days in the past and the end time on the scan
range to 'right now', using System.currentTimeMillis() (with the time being
after the daily load), the results also fluctuated up and down. So, it kind
of seems like there is some sort of temporal recency that is causing the
counts to fluctuate.

On Feb 24, 2015, at 10:20 PM, ramkrishna vasudevan <[email protected]> wrote:

> These numbers have varied wildly, from being off by 2-3 between
> subsequent scans to 40 row increases, followed by a drop of 70 rows.

When you say there is a variation in the number of rows retrieved - the 40
rows that got increased - are those rows in the expected time range? Or is
the system retrieving some rows which are not in the specified time range?

And when the rows drop by 70, did any row that needed to be retrieved get
missed?

Any filters in your scan?

Regards
Ram

On Wed, Feb 25, 2015 at 8:31 AM, Ted Yu <[email protected]> wrote:

What's the TTL setting for your table?

Which hbase release are you using?

Was there compaction in between the scans?
Thanks

On Feb 24, 2015, at 2:32 PM, Stephen Durfey <[email protected]> wrote:

I have some code that accepts a time range and looks for data written to an
HBase table during that range. If anything has been written for that row
during that range, the row key is saved off, and sometime later in the
pipeline those row keys are used to extract the entire row. I'm testing
against a fixed time range, at some point in the past. This is being done
as part of a Map/Reduce job (using Apache Crunch). I have some job counters
set up to keep track of the number of rows extracted. Since the time range
is fixed, I would expect the scan to return the same number of rows with
data in the provided time range. However, I am seeing this number vary from
scan to scan (bouncing between increasing and decreasing).

I've eliminated the possibility that data is being pulled in from outside
the time range. I did this by scanning for one column qualifier (and only
using that qualifier to decide whether a row had data in the time range),
getting the timestamp on the cell for each returned row, and comparing it
against the begin and end times for the scan, and I didn't find any that
satisfied that criteria. I've observed some row keys show up in the 1st
scan, then drop out in the 2nd scan, only to show back up again in the 3rd
scan (all with the exact same Scan object). These numbers have varied
wildly, from being off by 2-3 between subsequent scans to 40 row increases,
followed by a drop of 70 rows.

I'm kind of looking for ideas to try to track down what could be causing
this to happen. The code itself is pretty simple: it creates a Scan object,
scans the table, then in the map phase extracts the row key, and at the
end, it dumps them to a directory in hdfs.

--
Sean

The opinions expressed here are mine, while they may reflect a cognitive
thought, that is purely accidental. Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com
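For reference, a bare-bones sketch of the kind of job described in the original message at the bottom of this thread. It uses the plain HBase MapReduce API rather than Crunch, and the table name, column family, column prefix, and time window are all made up:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RowKeyExtractJob {

    // Emits the row key of every row that had at least one matching cell in
    // the scan's time range, and bumps a job counter per row seen.
    public static class RowKeyMapper extends TableMapper<Text, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context)
                throws IOException, InterruptedException {
            context.getCounter("extract", "rows").increment(1);
            context.write(new Text(Bytes.toString(key.get())), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "row-key-extract");
        job.setJarByClass(RowKeyExtractJob.class);

        Scan scan = new Scan();
        scan.setTimeRange(1420070400000L, 1422748800000L); // made-up fixed window
        scan.setMaxVersions();          // consider all versions, not just the newest
        scan.setCaching(500);
        scan.setCacheBlocks(false);     // usual setting for full-table MR scans

        // Mirrors the prefix-only column filtering mentioned earlier in the thread;
        // KeyOnlyFilter drops the values since only row keys are needed at this stage.
        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
        filters.addFilter(new ColumnPrefixFilter(Bytes.toBytes("obs_"))); // made-up prefix
        filters.addFilter(new KeyOnlyFilter());
        scan.setFilter(filters);

        TableMapReduceUtil.initTableMapperJob("clinical_data", scan,      // made-up table name
                RowKeyMapper.class, Text.class, NullWritable.class, job);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}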
