> Are you writing any Deletes? Are you writing any duplicates?

No physical deletes are occurring in my data, and there is a very real
possibility of duplicates.

> How is the partitioning done? What does the entire key structure look
> like?

The key structure would be /partition_id/person_id .... I'm dealing with
clinical data, with a data source identified by the partition, and the
person data is associated with that particular partition at load time.

> Are you doing the column filtering with a custom filter or one of the
> prepackaged ones?

They appear to all be prepackaged filters: FamilyFilter, KeyOnlyFilter,
QualifierFilter, and ColumnPrefixFilter are used under various conditions,
depending upon what is requested on the Scan object.
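For concreteness, the scan that ends up being issued looks roughly like
the sketch below, using the stock 0.94 client API (only the prefix-filter
case is shown, and the table name, partition value, and column prefix are
placeholders rather than our actual values):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PrefixScanSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical table name; the real one comes from the
            // API abstraction.
            HTable table = new HTable(conf, "clinical_data");
            try {
                Scan scan = new Scan();
                // Confine the scan to one partition via the
                // /partition_id/person_id key structure. The stop row
                // assumes person_id is ASCII alphanumeric ('~' sorts
                // after those bytes).
                scan.setStartRow(Bytes.toBytes("/partition_42/"));
                scan.setStopRow(Bytes.toBytes("/partition_42/~"));
                // Only return columns starting with the caller's prefix.
                scan.setFilter(new ColumnPrefixFilter(Bytes.toBytes("obs_")));
                ResultScanner scanner = table.getScanner(scan);
                try {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toStringBinary(r.getRow()));
                    }
                } finally {
                    scanner.close();
                }
            } finally {
                table.close();
            }
        }
    }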
On Wed, Feb 25, 2015 at 4:35 PM, Sean Busbey <[email protected]> wrote:

> Are you writing any Deletes? Are you writing any duplicates?
>
> How is the partitioning done?
>
> What does the entire key structure look like?
>
> Are you doing the column filtering with a custom filter or one of the
> prepackaged ones?
>
> On Wed, Feb 25, 2015 at 12:57 PM, Stephen Durfey <[email protected]>
> wrote:
>
> > > What's the TTL setting for your table ?
> > >
> > > Which hbase release are you using ?
> > >
> > > Was there compaction in between the scans ?
> > >
> > > Thanks
> >
> > The TTL is set to the max. The HBase version is 0.94.6-cdh4.4.0. I
> > don't want to say compactions aren't a factor, but the jobs are
> > short-lived (4-5 minutes), and I have run them frequently over the
> > last couple of days, trying to gather stats around what was being
> > extracted and trying to find the difference and intersection in row
> > keys between job runs.
> >
> > > > These numbers have varied wildly, from being off by 2-3 between
> > > > subsequent scans to 40-row increases, followed by a drop of 70
> > > > rows.
> > >
> > > When you say there is a variation in the number of rows retrieved -
> > > the 40 rows that got increased - are those rows in the expected
> > > time range? Or is the system retrieving some rows which are not in
> > > the specified time range?
> > >
> > > And when the rows drop by 70, did any row that should have been
> > > retrieved get missed?
> >
> > The best I can tell, if there is an increase in counts, those rows
> > are not coming from outside of the time range. In the job, I am
> > maintaining a list of rows that have a timestamp outside of my
> > provided time range, and then writing those out to HDFS at the end
> > of the map task. So far, nothing has been written out.
> >
> > > Any filters in your scan?
> > >
> > > Regards
> > > Ram
> >
> > There are some column filters. There is an API abstraction on top of
> > HBase that I am using to allow users to easily extract data from
> > columns that start with a provided column prefix. So, the column
> > filters are in place to ensure I am only getting back data from
> > columns that start with the provided prefix.
> >
> > To add a little more detail, my row keys are separated out by
> > partition. At periodic times (through Oozie), data is loaded from a
> > source into the appropriate partition. I ran some scans against a
> > partition that hadn't been updated in almost a year (with a scan
> > range around the times of the 2nd-to-last load into the table), and
> > the row key counts were consistent across multiple scans. I chose
> > another partition that is actively being updated once a day. I chose
> > a scan time around the 4th most recent load, and the results were
> > inconsistent from scan to scan (fluctuating up and down).
> > Setting the begin time to 4 days in the past and the end time on the
> > scan range to 'right now', using System.currentTimeMillis() (with
> > the time being after the daily load), the results also fluctuated up
> > and down. So, it kind of seems like there is some sort of temporal
> > recency that is causing the counts to fluctuate.
> >
> > On Feb 24, 2015, at 2:32 PM, Stephen Durfey <[email protected]>
> > wrote:
> >
> > > I have some code that accepts a time range and looks for data
> > > written to an HBase table during that range. If anything has been
> > > written for a row during that range, the row key is saved off, and
> > > sometime later in the pipeline those row keys are used to extract
> > > the entire row. I'm testing against a fixed time range at some
> > > point in the past. This is being done as part of a Map/Reduce job
> > > (using Apache Crunch). I have some job counters set up to keep
> > > track of the number of rows extracted. Since the time range is
> > > fixed, I would expect the scan to return the same number of rows
> > > with data in the provided time range. However, I am seeing this
> > > number vary from scan to scan (bouncing between increasing and
> > > decreasing).
> > >
> > > I've eliminated the possibility that data is being pulled in from
> > > outside the time range. I did this by scanning for one column
> > > qualifier (using only that qualifier to decide whether a row had
> > > data in the time range), getting the timestamp on the cell for
> > > each returned row, and comparing it against the begin and end
> > > times for the scan; I didn't find any that fell outside the range.
> > > I've observed some row keys show up in the 1st scan, drop out in
> > > the 2nd scan, only to show back up again in the 3rd scan (all with
> > > the exact same Scan object). These numbers have varied wildly,
> > > from being off by 2-3 between subsequent scans to 40-row
> > > increases, followed by a drop of 70 rows.
> > >
> > > I'm kind of looking for ideas to track down what could be causing
> > > this to happen. The code itself is pretty simple: it creates a
> > > Scan object, scans the table, extracts the row key in the map
> > > phase, and at the end dumps them to a directory in HDFS.
>
> --
> Sean
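For anyone who wants to poke at this, two sketches of the pieces described
in the thread. First, the time-range setup from the "4 days in the past"
experiment (names here are made up). Note that Scan.setTimeRange treats
the range as half-open, and an end bound taken from
System.currentTimeMillis() moves between runs, so back-to-back jobs over
an actively loaded partition are never scanning exactly the same range:

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Scan;

    public class TimeRangeSketch {
        // Builds a scan covering the last four days, up to "right now".
        // A sketch only; the real job builds its Scan inside the Crunch
        // pipeline.
        public static Scan lastFourDays() throws IOException {
            long end = System.currentTimeMillis();
            long begin = end - 4L * 24 * 60 * 60 * 1000;
            Scan scan = new Scan();
            // The range is half-open [begin, end): a cell with
            // timestamp == end is excluded.
            scan.setTimeRange(begin, end);
            return scan;
        }
    }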
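Second, a minimal version of the timestamp audit from the original post:
take one sentinel qualifier per row, compare its cell timestamp against
the scan bounds, and set aside any row key that falls outside them. The
class name and the family/qualifier parameters are hypothetical:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimestampAudit {
        // Row keys whose sentinel cell fell outside [begin, end); the
        // real job writes these to HDFS at the end of the map task.
        private final List<String> outOfRange = new ArrayList<String>();

        public void audit(Result result, byte[] family, byte[] qualifier,
                          long begin, long end) {
            // One column qualifier acts as the sentinel for "this row has
            // data in the range", as described in the original post.
            KeyValue kv = result.getColumnLatest(family, qualifier);
            if (kv == null) {
                return;
            }
            long ts = kv.getTimestamp();
            if (ts < begin || ts >= end) {
                outOfRange.add(Bytes.toStringBinary(result.getRow()));
            }
        }

        public List<String> getOutOfRange() {
            return outOfRange;
        }
    }

If nothing ever lands in outOfRange while the row counts still move
between runs, the fluctuation is coming from rows appearing and
disappearing inside the range, not from range leakage.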
