Hi Chris,

Thanks for your response; sorry it's taken me so long to process it.
I guess there needs to be some sort of relationship between the number of shards and the number of tablet servers, right? Would you typically set numShards to the greatest number of tablet servers you'd anticipate needing, and then maintain a mapping on the client side that says, "OK, right now my optimal range set is ([shard0_mindate...shard15_maxdate], [shard16_mindate...shard32_maxdate], ...)"? Do you see what I'm getting at? The alternative seems to be re-encoding all your row IDs whenever the number of shards changes, and I'd *really* like my row IDs to be immutable.

Right now we're using a distributed service very similar to Snowflake [1] for row ID generation. It's not exactly monotonically increasing, but nearly so.

-Russ

[1]: https://github.com/twitter/snowflake/

On Sun, Apr 6, 2014 at 7:48 AM, Christopher <[email protected]> wrote:
> You could try sharding:
>
> If your RowID is the ingest date (to achieve the ability to scan over
> recently ingested data, as you describe), you could use a RowID of
> "ShardID_IngestDate" instead, where:
>
>     ShardID = hash(row) % numShards
>
> This will result in numShards rows for each IngestDate, where
> numShards is a value you choose to suit your cluster. You can
> pre-split your table on each ShardID for better ingest and query
> performance.
>
> As for AccumuloInputFormat: it uses a regular Scanner internally, but
> it supports multiple ranges, just like the BatchScanner, creating a
> separate mapper for each one. All you need to do is query numShards
> ranges.
>
> Note: it sounds like you're currently using a 1-up increasing value
> for the current RowID. You may want to consider using IngestDate as
> described above (to whatever degree of precision you need, as in
> YYYYMMddHHmmss...) or similar. This lets you avoid maintaining a
> counter synchronized across your ingest, and could help you scale
> your ingest by parallelizing it with fewer concurrency issues.
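The "ShardID_IngestDate" scheme quoted above might look roughly like this; the 16-shard count, two-digit padding, hash choice, and timestamp precision are all illustrative assumptions, not anything Accumulo prescribes:

```java
import java.util.Arrays;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch of the "ShardID_IngestDate" RowID scheme described above.
// The shard count, padding, and hash are assumptions for illustration.
public class ShardedRowId {
    static final int NUM_SHARDS = 16; // pick a value suited to your cluster
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // ShardID = hash(row) % numShards; floorMod avoids negative shards,
    // and zero-padding keeps row IDs sorting by shard first.
    static String rowId(byte[] rowData, LocalDateTime ingestTime) {
        int shard = Math.floorMod(Arrays.hashCode(rowData), NUM_SHARDS);
        return String.format("%02d_%s", shard, FMT.format(ingestTime));
    }
}
```

Because the shard is the high-order part of the key, each shard's rows stay contiguous, so a pre-split table spreads ingest across tablet servers while keeping per-shard date scans cheap.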
> It also gives you the ability to analyze your ingest performance over
> time, and to run queries like "data processed in the last day".
> However, you'll lose the ability to do "last 100 rows processed". The
> sharding approach works with either, though.
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Sun, Apr 6, 2014 at 3:16 AM, Russ Weeks <[email protected]>
> wrote:
> > Hi,
> >
> > I'm looking for advice on the best way to structure my row IDs.
> > Monotonically increasing IDs have the very appealing property that I
> > can quickly scan all recently ingested, unprocessed rows,
> > particularly because I maintain a "checkpoint" of the most recently
> > processed row.
> >
> > Of course, the problem with increasing IDs is that the lowest-order
> > bits are the ones changing, which (I think?) makes them less than
> > optimal for distributing data across my cluster. I guess the ways
> > around this are either to reverse the ID or to define partitions and
> > use the partition ID as the high-order bits of the row ID? Reversing
> > the ID would destroy the property I described above; using
> > partitions might preserve it as long as I use a BatchScanner, but
> > would a BatchScanner play nicely with AccumuloInputFormat? So many
> > questions.
> >
> > Anyway, there's a pretty good chance I'm missing something obvious
> > in this analysis. For instance, if it's easy to "rebalance" the data
> > across my tablet servers periodically, then I'd probably just stick
> > with increasing IDs.
> >
> > Very interested to hear your advice, or the pros and cons of any of
> > these approaches.
> >
> > Thanks,
> > -Russ
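The client-side range set discussed above, one range per shard for a given ingest window, could be sketched like this. The `ranges` helper and the two-digit padding are illustrative assumptions; in practice each (start, end) pair would be wrapped in an Accumulo `Range` and passed to `AccumuloInputFormat.setRanges`, and the padding and shard count must match whatever was used at ingest time:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: build one scan range per shard covering the same ingest-date
// window, so a single query touches every shard's slice of that window.
public class ShardRanges {
    static List<String[]> ranges(int numShards, String minDate, String maxDate) {
        List<String[]> out = new ArrayList<>();
        for (int shard = 0; shard < numShards; shard++) {
            // Padding here must match the padding used when writing row IDs.
            String prefix = String.format("%02d_", shard);
            out.add(new String[] { prefix + minDate, prefix + maxDate });
        }
        return out;
    }
}
```

If numShards is fixed up front at the largest value you ever expect to need, the row IDs never have to be re-encoded; only this client-side range construction changes as the cluster grows.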
