Be careful when an iterator changes row values, especially to rows outside of the range of the tablet being read, as I believe it can cause the data to be dropped or rejected.
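
In case it helps, here is a rough sketch of the multi-level summation discussed in the thread quoted below, written as plain Java with no Accumulo API calls at all. Everything in it is an assumption for illustration: the Entry class, the method names, and the way the sample rows are divided between two tablets are all made up. In a real table the per-tablet step would be done by the iterator stack on the tablet servers, and the client would only add up the handful of partial sums the scan returns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MultiLevelSum {
    // Hypothetical stand-in for one key/value pair (row, family:qualifier, value).
    static final class Entry {
        final String row;
        final String column;
        final String value;

        Entry(String row, String column, String value) {
            this.row = row;
            this.column = column;
            this.value = value;
        }
    }

    // Level 1 (simulated server side): sum meta:size over the entries of ONE tablet.
    static long partialSum(List<Entry> tablet) {
        long sum = 0;
        for (Entry e : tablet) {
            if ("meta:size".equals(e.column)) {
                sum += Long.parseLong(e.value);
            }
        }
        return sum;
    }

    // Level 2 (client side): add up the partial sums, one per tablet.
    static long clientSum(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed split: the first two rows live in one tablet, the rest in another.
        List<Entry> tabletA = Arrays.asList(
            new Entry("000200001ccaac30", "meta:size", "1807"),
            new Entry("000200001ccaac30", "meta:source", "data2"),
            new Entry("000200001cdaac30", "meta:filename", "doc02985453"),
            new Entry("000200001cdaac30", "meta:size", "656"),
            new Entry("000200001cdaac30", "meta:source", "data2"));
        List<Entry> tabletB = Arrays.asList(
            new Entry("000200001cfaac30", "meta:filename", "doc04484522"),
            new Entry("000200001cfaac30", "meta:size", "565"),
            new Entry("000200001cfaac30", "meta:source", "data2"),
            new Entry("000200001dcaac30", "meta:filename", "doc03342958"));

        List<Long> partials = new ArrayList<>();
        partials.add(partialSum(tabletA));
        partials.add(partialSum(tabletB));

        System.out.println("partial sums: " + partials);
        System.out.println("total: " + clientSum(partials));
    }
}
```

With the sample rows and the assumed split, the two partial sums are 2463 and 565, and the client adds them to get 3028. The point is only that the client handles one number per tablet rather than one per row.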

On Wed, Mar 19, 2014 at 6:40 PM, Russ Weeks <[email protected]> wrote:

> Hi, Josh,
>
> Thanks very much for your response. I think I get what you're saying, but
> it's kind of blowing my mind.
>
> Are you saying that if I first set up an iterator that took my key/value
> pairs like,
>
> 000200001ccaac30 meta:size [] 1807
> 000200001ccaac30 meta:source [] data2
> 000200001cdaac30 meta:filename [] doc02985453
> 000200001cdaac30 meta:size [] 656
> 000200001cdaac30 meta:source [] data2
> 000200001cfaac30 meta:filename [] doc04484522
> 000200001cfaac30 meta:size [] 565
> 000200001cfaac30 meta:source [] data2
> 000200001dcaac30 meta:filename [] doc03342958
>
> And emitted something like,
>
> 0 meta:size [] 1807
> 0 meta:size [] 656
> 0 meta:size [] 565
>
> And then applied a SummingCombiner at a lower priority than that iterator,
> then... it should work, right?
>
> I'll give it a try.
>
> Regards,
> -Russ
>
>
> On Wed, Mar 19, 2014 at 3:33 PM, Josh Elser <[email protected]> wrote:
>
>> Russ,
>>
>> Remember about the distribution of data across multiple nodes in your
>> cluster by tablet.
>>
>> A tablet, at the very minimum, will contain one row. Another way to say
>> the same thing is that a row will never be split across multiple tablets.
>> The only guarantee you get from Accumulo here is that you can use a
>> combiner to do your combination across one row.
>>
>> However, when you combine (pun not intended) another SKVI with the
>> Combiner, you can do more merging of that intermediate "combined value"
>> from each row before returning back to the client. You can think of this
>> approach as doing a multi-level summation.
>>
>> This still requires one final sum on the client side, but you should get
>> quite the reduction with this approach over doing the entire sum client
>> side. You sum the meta:size column in parallel across parts of the table
>> (server-side) and then client-side you sum the sums from each part.
>>
>> I can sketch this out in more detail if it's not clear. HTH
>>
>>
>> On 3/19/14, 6:18 PM, Russ Weeks wrote:
>>
>>> The accumulo manual states that combiners can be applied to values which
>>> share the same rowID, column family, and column qualifier. Is there any
>>> way to adjust this behaviour? I have rows that look like,
>>>
>>> 000200001ccaac30 meta:size [] 1807
>>> 000200001ccaac30 meta:source [] data2
>>> 000200001cdaac30 meta:filename [] doc02985453
>>> 000200001cdaac30 meta:size [] 656
>>> 000200001cdaac30 meta:source [] data2
>>> 000200001cfaac30 meta:filename [] doc04484522
>>> 000200001cfaac30 meta:size [] 565
>>> 000200001cfaac30 meta:source [] data2
>>> 000200001dcaac30 meta:filename [] doc03342958
>>>
>>> and I'd like to sum up all the values of meta:size across all rows. I
>>> know I can scan the sizes and sum them on the client side, but I was
>>> hoping there would be a way to do this inside my cluster. Is mapreduce
>>> my only option here?
>>>
>>> Thanks,
>>> -Russ
