Are you writing any Deletes? Are you writing any duplicates?

How is the partitioning done?

What does the entire key structure look like?

Are you doing the column filtering with a custom filter or one of the
prepackaged ones?

On Wed, Feb 25, 2015 at 12:57 PM, Stephen Durfey <[email protected]> wrote:

> >
> > What's the TTL setting for your table ?
> >
> > Which hbase release are you using ?
> >
> > Was there compaction in between the scans ?
> >
> > Thanks
> >
>
> The TTL is set to the max. The HBase version is 0.94.6-cdh4.4.0. I don’t
> want to say compactions aren’t a factor, but the jobs are short-lived (4-5
> minutes), and I have run them frequently over the last couple of days,
> trying to gather stats around what was being extracted and trying to find
> the difference and intersection in row keys before job runs.
>
> These numbers have varied wildly, from being off by 2-3 between
> subsequent scans to 40 row increases, followed by a drop of 70 rows.
>
> > When you say there is a variation in the number of rows retrieved - the
> > 40 rows that got increased - are those rows in the expected time range?
> > Or is the system retrieving some rows which are not in the specified
> > time range?
> >
> > And when the rows drop by 70, did any row that needed to be retrieved
> > get missed?
> >
>
> As best I can tell, if there is an increase in counts, those rows are not
> coming from outside of the time range. In the job, I am maintaining a list
> of rows that have a timestamp outside of my provided time range, and then
> writing those out to HDFS at the end of the map task. So far, nothing has
> been written out.
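For reference, the in-range check that audit relies on has an easy off-by-one: HBase's Scan.setTimeRange(min, max) treats the minimum as inclusive and the maximum as exclusive. A minimal standalone sketch of the check, using plain longs in place of cell timestamps (no HBase types, just the semantics):

```java
// Sketch of the "timestamp outside my provided time range" audit.
// Scan.setTimeRange(min, max) matches min <= ts < max, so the
// outside-the-range test must mirror that boundary behavior.
public class TimeRangeCheck {

    // Returns true when a cell timestamp falls outside [begin, end).
    static boolean outsideRange(long ts, long begin, long end) {
        return ts < begin || ts >= end;
    }

    public static void main(String[] args) {
        long begin = 1000L, end = 2000L;
        System.out.println(outsideRange(999L, begin, end));  // true: below the range
        System.out.println(outsideRange(1000L, begin, end)); // false: min is inclusive
        System.out.println(outsideRange(2000L, begin, end)); // true: max is exclusive
    }
}
```

If the audit instead tested `ts > end`, rows with timestamps exactly at the scan's end time would be silently treated as in-range.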
>
> > Any filters in your scan?
> >
> > Regards
> > Ram
> >
>
> There are some column filters. There is an API abstraction on top of hbase
> that I am using to allow users to easily extract data from columns that
> start with a provided column prefix. So, the column filters are in place to
> ensure I am only getting back data from columns that start with the
> provided prefix.
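For what it's worth, the prepackaged ColumnPrefixFilter applies exactly that rule per cell. A standalone sketch of the matching semantics (the qualifiers below are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Standalone sketch of the rule a column-prefix filter applies:
// keep a cell only when its qualifier starts with the given prefix.
public class PrefixMatch {

    static List<String> filterByPrefix(List<String> qualifiers, String prefix) {
        return qualifiers.stream()
                .filter(q -> q.startsWith(prefix))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical qualifiers, just to show the semantics.
        List<String> quals = Arrays.asList("obs:temp", "obs:pulse", "meta:src");
        System.out.println(filterByPrefix(quals, "obs:")); // [obs:temp, obs:pulse]
    }
}
```

If the abstraction layer uses a custom filter instead of the stock one, it would be worth confirming it behaves identically at region boundaries.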
>
> To add a little more detail, my row keys are separated out by partition. At
> periodic times (through oozie), data is loaded from a source into the
> appropriate partition. I ran some scans against a partition that hadn't
> been updated in almost a year (with a scan range around the times of the
> 2nd to last load into the table), and the row key counts were consistent
> across multiple scans. I chose another partition that is actively being
> updated once a day. I chose a scan time around the 4th most recent load,
> and the results were inconsistent from scan to scan (fluctuating up and
> down). Setting the begin time to 4 days in the past and the end time on the
> scan range to 'right now', using System.currentTimeMillis() (with the time
> being after the daily load), the results also fluctuated up and down. So it
> kind of seems like some sort of temporal recency is causing the counts to
> fluctuate.
>
>
>
> On Feb 24, 2015, at 10:20 PM, ramkrishna vasudevan <
> [email protected]> wrote:
>
> These numbers have varied wildly, from being off by 2-3 between
> subsequent scans to 40 row increases, followed by a drop of 70 rows.
> When you say there is a variation in the number of rows retrieved - the 40
> rows that got increased - are those rows in the expected time range? Or is
> the system retrieving some rows which are not in the specified time range?
>
> And when the rows drop by 70, did any row that needed to be retrieved get
> missed?
>
> Any filters in your scan?
>
> Regards
> Ram
>
> On Wed, Feb 25, 2015 at 8:31 AM, Ted Yu <[email protected]> wrote:
>
> What's the TTL setting for your table ?
>
> Which hbase release are you using ?
>
> Was there compaction in between the scans ?
>
> Thanks
>
>
> On Feb 24, 2015, at 2:32 PM, Stephen Durfey <[email protected]> wrote:
>
> I have some code that accepts a time range and looks for data written to
> an HBase table during that range. If anything has been written for a row
> during that range, the row key is saved off, and sometime later in the
> during that range, the row key is saved off, and sometime later in the
> pipeline those row keys are used to extract the entire row. I’m testing
> against a fixed time range, at some point in the past. This is being done
> as part of a Map/Reduce job (using Apache Crunch). I have some job counters
> setup to keep track of the number of rows extracted. Since the time range
> is fixed, I would expect the scan to return the same number of rows with
> data in the provided time range. However, I am seeing this number vary from
> scan to scan (bouncing between increasing and decreasing).
>
>
> I’ve eliminated the possibility that data is being pulled in from
> outside the time range. I did this by scanning for one column qualifier
> (using only that qualifier to decide whether a row had data in the time
> range), getting the timestamp on the cell for each returned row, and
> comparing it against the begin and end times for the scan; I didn’t find
> any outside that range. I’ve observed some row keys show up in
> the 1st scan, then drop out in the 2nd scan, only to show back up again in
> the 3rd scan (all with the exact same Scan object). These numbers have
> varied wildly, from being off by 2-3 between subsequent scans to 40 row
> increases, followed by a drop of 70 rows.
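One way to pin down exactly which keys flap between runs is a set difference over the saved row keys from consecutive scans, rather than just comparing counts. A sketch with made-up keys:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of a row-key diff between two scan runs: which keys vanished
// from the second scan and which newly appeared. Keys are made up.
public class ScanDiff {

    // Elements of a not present in b.
    static Set<String> minus(Set<String> a, Set<String> b) {
        Set<String> out = new HashSet<>(a);
        out.removeAll(b);
        return out;
    }

    public static void main(String[] args) {
        Set<String> scan1 = new HashSet<>(Arrays.asList("r1", "r2", "r3"));
        Set<String> scan2 = new HashSet<>(Arrays.asList("r1", "r3", "r4"));
        System.out.println(minus(scan1, scan2)); // dropped between scans: [r2]
        System.out.println(minus(scan2, scan1)); // appeared in 2nd scan:  [r4]
    }
}
```

Knowing which specific keys disappear and reappear would let you check whether they cluster on one region or regionserver, which narrows down whether this is a splitting/compaction issue or a client-side one.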
>
>
> I’m kind of looking for ideas to try to track down what could be causing
> this to happen. The code itself is pretty simple: it creates a Scan object,
> scans the table, extracts the row key in the map phase, and at the end
> dumps the keys to a directory in HDFS.
>



-- 
Sean
