Hi Chris,

Thanks for your response; sorry it's taken me so long to process it.
I guess there needs to be some sort of relationship between the number of shards and the number of tablet servers, right? Would you typically set numShards to the greatest number of tablet servers you'd anticipate needing, and then maintain a mapping on the client side that says, "OK, right now my optimal range set is ([shard0_mindate...shard15_maxdate], [shard16_mindate...shard32_maxdate], ...)"? Do you see what I'm getting at? The alternative seems to be re-encoding all your row IDs whenever the number of shards changes, and I'd *really* like my row IDs to be immutable.

Right now we're using a distributed service very similar to Snowflake [1] for row ID generation. It's not exactly monotonically increasing, but nearly so.

-Russ

[1]: https://github.com/twitter/snowflake/

On Sun, Apr 6, 2014 at 7:48 AM, Christopher <[email protected]> wrote:
> You could try sharding:
>
> If your RowID is the ingest date (to achieve the ability to scan over
> recently ingested data, as you describe), you could use a RowID of
> "ShardID_IngestDate" instead, where:
>
>     ShardID = hash(row) % numShards
>
> This will result in numShards rows for each IngestDate, where
> numShards is a value you choose to suit your cluster. You can
> pre-split your table on each ShardID for better ingest and query
> performance.
>
> As for AccumuloInputFormat: it uses a regular Scanner internally, but
> it supports multiple ranges, just like the BatchScanner, creating a
> separate mapper for each one. All you need to do is query numShards
> ranges.
>
> Note: it sounds like you're currently using a 1-up increasing value
> for the current RowID. You may want to consider using IngestDate as
> described above (to whatever degree of precision you need, as in
> YYYYMMddHHmmss...) or similar. This lets you avoid maintaining a
> counter synchronized across your ingest, and could help you scale
> your ingest by parallelizing it with fewer concurrency issues.
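The "ShardID_IngestDate" scheme quoted above might look roughly like this; the 16-shard count, two-digit padding, hash choice, and timestamp precision are all illustrative assumptions, not anything Accumulo prescribes:

```java
import java.util.Arrays;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch of the "ShardID_IngestDate" RowID scheme described above.
// The shard count, padding, and hash are assumptions for illustration.
public class ShardedRowId {
    static final int NUM_SHARDS = 16; // pick a value suited to your cluster
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // ShardID = hash(row) % numShards; floorMod avoids negative shards,
    // and zero-padding keeps row IDs sorting by shard first.
    static String rowId(byte[] rowData, LocalDateTime ingestTime) {
        int shard = Math.floorMod(Arrays.hashCode(rowData), NUM_SHARDS);
        return String.format("%02d_%s", shard, FMT.format(ingestTime));
    }
}
```

Because the shard is the high-order part of the key, each shard's rows stay contiguous, so a pre-split table spreads ingest across tablet servers while keeping per-shard date scans cheap.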
> It also gives you the ability to analyze your ingest performance over
> time, and to run queries like "data processed in the last day".
> However, you'll lose the ability to do "last 100 rows processed". The
> sharding approach works with either, though.
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Sun, Apr 6, 2014 at 3:16 AM, Russ Weeks <[email protected]>
> wrote:
> > Hi,
> >
> > I'm looking for advice on the best way to structure my row IDs.
> > Monotonically increasing IDs have the very appealing property that I
> > can quickly scan all recently ingested, unprocessed rows,
> > particularly because I maintain a "checkpoint" of the most recently
> > processed row.
> >
> > Of course, the problem with increasing IDs is that the lowest-order
> > bits are the ones changing, which (I think?) makes them less than
> > optimal for distributing data across my cluster. I guess the ways
> > around this are either to reverse the ID or to define partitions and
> > use the partition ID as the high-order bits of the row ID? Reversing
> > the ID would destroy the property I described above; using
> > partitions might preserve it as long as I use a BatchScanner, but
> > would a BatchScanner play nicely with AccumuloInputFormat? So many
> > questions.
> >
> > Anyway, there's a pretty good chance I'm missing something obvious
> > in this analysis. For instance, if it's easy to "rebalance" the data
> > across my tablet servers periodically, then I'd probably just stick
> > with increasing IDs.
> >
> > Very interested to hear your advice, or the pros and cons of any of
> > these approaches.
> >
> > Thanks,
> > -Russ
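The client-side range set discussed above, one range per shard for a given ingest window, could be sketched like this. The `ranges` helper and the two-digit padding are illustrative assumptions; in practice each (start, end) pair would be wrapped in an Accumulo `Range` and passed to `AccumuloInputFormat.setRanges`, and the padding and shard count must match whatever was used at ingest time:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: build one scan range per shard covering the same ingest-date
// window, so a single query touches every shard's slice of that window.
public class ShardRanges {
    static List<String[]> ranges(int numShards, String minDate, String maxDate) {
        List<String[]> out = new ArrayList<>();
        for (int shard = 0; shard < numShards; shard++) {
            // Padding here must match the padding used when writing row IDs.
            String prefix = String.format("%02d_", shard);
            out.add(new String[] { prefix + minDate, prefix + maxDate });
        }
        return out;
    }
}
```

If numShards is fixed up front at the largest value you ever expect to need, the row IDs never have to be re-encoded; only this client-side range construction changes as the cluster grows.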
