Are you writing any Deletes? Are you writing any duplicates? How is the
partitioning done? What does the entire key structure look like? Are you
doing the column filtering with a custom filter or one of the prepackaged
ones?

On Wed, Feb 25, 2015 at 12:57 PM, Stephen Durfey <[email protected]> wrote:

> > What's the TTL setting for your table ?
> >
> > Which hbase release are you using ?
> >
> > Was there compaction in between the scans ?
> >
> > Thanks
>
> The TTL is set to the max. The HBase version is 0.94.6-cdh4.4.0. I don't
> want to say compactions aren't a factor, but the jobs are short-lived
> (4-5 minutes), and I have run them frequently over the last couple of
> days, trying to gather stats around what was being extracted and trying
> to find the difference and intersection in row keys between job runs.
>
> > > These numbers have varied wildly, from being off by 2-3 between
> > > subsequent scans to 40 row increases, followed by a drop of 70 rows.
> >
> > When you say there is a variation in the number of rows retrieved - the
> > 40 rows that got increased - are those rows in the expected time range?
> > Or is the system retrieving some rows which are not in the specified
> > time range?
> >
> > And when the rows drop by 70, did any row that needed to be retrieved
> > get missed out?
>
> The best I can tell, if there is an increase in counts, those rows are
> not coming from outside of the time range. In the job, I am maintaining
> a list of rows that have a timestamp outside of my provided time range,
> and then writing those out to HDFS at the end of the map task. So far,
> nothing has been written out.
>
> > Any filters in your scan?
> >
> > Regards
> > Ram
>
> There are some column filters. There is an API abstraction on top of
> HBase that I am using to allow users to easily extract data from columns
> that start with a provided column prefix. So, the column filters are in
> place to ensure I am only getting back data from columns that start with
> the provided prefix.
> To add a little more detail, my row keys are separated out by partition.
> At periodic times (through Oozie), data is loaded from a source into the
> appropriate partition. I ran some scans against a partition that hadn't
> been updated in almost a year (with a scan range around the times of the
> 2nd-to-last load into the table), and the row key counts were consistent
> across multiple scans. I chose another partition that is actively being
> updated once a day. I chose a scan time around the 4th most recent load,
> and the results were inconsistent from scan to scan (fluctuating up and
> down). Setting the begin time to 4 days in the past and the end time on
> the scan range to 'right now', using System.currentTimeMillis() (with
> the time being after the daily load), the results also fluctuated up and
> down. So, it kind of seems like there is some sort of temporal recency
> that is causing the counts to fluctuate.
>
> > On Feb 24, 2015, at 10:20 PM, ramkrishna vasudevan <
> > [email protected]> wrote:
> >
> > > These numbers have varied wildly, from being off by 2-3 between
> > > subsequent scans to 40 row increases, followed by a drop of 70 rows.
> >
> > When you say there is a variation in the number of rows retrieved - the
> > 40 rows that got increased - are those rows in the expected time range?
> > Or is the system retrieving some rows which are not in the specified
> > time range?
> >
> > And when the rows drop by 70, did any row that needed to be retrieved
> > get missed out?
> >
> > Any filters in your scan?
> >
> > Regards
> > Ram
> >
> > On Wed, Feb 25, 2015 at 8:31 AM, Ted Yu <[email protected]> wrote:
> >
> > > What's the TTL setting for your table ?
> > >
> > > Which hbase release are you using ?
> > >
> > > Was there compaction in between the scans ?
> > >
> > > Thanks
> > >
> > > On Feb 24, 2015, at 2:32 PM, Stephen Durfey <[email protected]>
> > > wrote:
> > >
> > > > I have some code that accepts a time range and looks for data
> > > > written to an HBase table during that range. If anything has been
> > > > written for that row during that range, the row key is saved off,
> > > > and sometime later in the pipeline those row keys are used to
> > > > extract the entire row. I'm testing against a fixed time range, at
> > > > some point in the past. This is being done as part of a Map/Reduce
> > > > job (using Apache Crunch). I have some job counters set up to keep
> > > > track of the number of rows extracted. Since the time range is
> > > > fixed, I would expect the scan to return the same number of rows
> > > > with data in the provided time range. However, I am seeing this
> > > > number vary from scan to scan (bouncing between increasing and
> > > > decreasing).
> > > >
> > > > I've eliminated the possibility that data is being pulled in from
> > > > outside the time range. I did this by scanning for one column
> > > > qualifier (and only using this as the qualifier for whether a row
> > > > had data in the time range), getting the timestamp on the cell for
> > > > each returned row, and comparing it against the begin and end
> > > > times for the scan, and I didn't find any that fell outside the
> > > > range. I've observed some row keys show up in the 1st scan, then
> > > > drop out in the 2nd scan, only to show back up again in the 3rd
> > > > scan (all with the exact same Scan object). These numbers have
> > > > varied wildly, from being off by 2-3 between subsequent scans to
> > > > 40 row increases, followed by a drop of 70 rows.
> > > >
> > > > I'm kind of looking for ideas to try to track down what could be
> > > > causing this to happen. The code itself is pretty simple: it
> > > > creates a Scan object, scans the table, then in the map phase
> > > > extracts the row key, and at the end it dumps them to a directory
> > > > in HDFS.

-- 
Sean
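A detail worth ruling out when checking cell timestamps by hand, as the original post describes: HBase's Scan.setTimeRange(minStamp, maxStamp) treats the range as half-open, [minStamp, maxStamp), so the begin time is inclusive and the end time is exclusive. A minimal sketch of that check in plain Java (the timestamp values are hypothetical placeholders, not values from this thread):

```java
public class TimeRangeCheck {
    // Mirrors the semantics of HBase's TimeRange: inclusive begin, exclusive end.
    static boolean inRange(long ts, long begin, long end) {
        return ts >= begin && ts < end;
    }

    public static void main(String[] args) {
        long begin = 1_424_000_000_000L; // hypothetical scan begin (epoch ms)
        long end   = 1_424_800_000_000L; // hypothetical scan end (epoch ms)

        System.out.println(inRange(begin, begin, end));   // true: begin is inclusive
        System.out.println(inRange(end - 1, begin, end)); // true: last millisecond in range
        System.out.println(inRange(end, begin, end));     // false: end is exclusive
    }
}
```

If the map-side validation compares with an inclusive end time, a cell stamped exactly at the end time would pass the hand check while being filtered by the server, so keeping both checks half-open avoids chasing a phantom discrepancy.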

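The row-key bookkeeping described earlier in the thread, capturing the keys from each run and then taking their difference and intersection before the next run, can be sketched with plain Java sets. The key format and values below are hypothetical placeholders, not the poster's actual schema:

```java
import java.util.Set;
import java.util.TreeSet;

public class RowKeyDiff {
    // Row keys present in run a but absent from run b.
    static Set<String> difference(Set<String> a, Set<String> b) {
        Set<String> out = new TreeSet<>(a);
        out.removeAll(b);
        return out;
    }

    // Row keys present in both runs.
    static Set<String> intersection(Set<String> a, Set<String> b) {
        Set<String> out = new TreeSet<>(a);
        out.retainAll(b);
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical row keys from two scans over the same fixed time range.
        Set<String> scan1 = Set.of("part1|row1", "part1|row2", "part1|row3");
        Set<String> scan2 = Set.of("part1|row2", "part1|row3", "part1|row4");

        System.out.println("dropped out: " + difference(scan1, scan2)); // [part1|row1]
        System.out.println("newly seen:  " + difference(scan2, scan1)); // [part1|row4]
        System.out.println("stable keys: " + intersection(scan1, scan2));
    }
}
```

Returning sorted sets (TreeSet) makes it easier to eyeball whether the unstable keys cluster in one partition or near one end of the scanned range.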