Hi, Josh,

Thanks very much for your response. I think I get what you're saying, but it's kind of blowing my mind.
Are you saying that if I first set up an iterator that took my key/value pairs like,

000200001ccaac30 meta:size [] 1807
000200001ccaac30 meta:source [] data2
000200001cdaac30 meta:filename [] doc02985453
000200001cdaac30 meta:size [] 656
000200001cdaac30 meta:source [] data2
000200001cfaac30 meta:filename [] doc04484522
000200001cfaac30 meta:size [] 565
000200001cfaac30 meta:source [] data2
000200001dcaac30 meta:filename [] doc03342958

and emitted something like,

0 meta:size [] 1807
0 meta:size [] 656
0 meta:size [] 565

and then applied a SummingCombiner at a lower priority than that iterator, then... it should work, right?

I'll give it a try.

Regards,
-Russ

On Wed, Mar 19, 2014 at 3:33 PM, Josh Elser <[email protected]> wrote:

> Russ,
>
> Remember about the distribution of data across multiple nodes in your
> cluster by tablet.
>
> A tablet, at the very minimum, will contain one row. Another way to say
> the same thing is that a row will never be split across multiple tablets.
> The only guarantee you get from Accumulo here is that you can use a
> combiner to do your combination across one row.
>
> However, when you combine (pun not intended) another SKVI with the
> Combiner, you can do more merging of that intermediate "combined value"
> from each row before returning back to the client. You can think of this
> approach as doing a multi-level summation.
>
> This still requires one final sum on the client side, but you should get
> quite the reduction with this approach over doing the entire sum client
> side. You sum the meta:size column in parallel across parts of the table
> (server-side) and then client-side you sum the sums from each part.
>
> I can sketch this out in more detail if it's not clear. HTH
>
>
> On 3/19/14, 6:18 PM, Russ Weeks wrote:
>
>> The accumulo manual states that combiners can be applied to values which
>> share the same rowID, column family, and column qualifier. Is there any
>> way to adjust this behaviour?
>> I have rows that look like,
>>
>> 000200001ccaac30 meta:size [] 1807
>> 000200001ccaac30 meta:source [] data2
>> 000200001cdaac30 meta:filename [] doc02985453
>> 000200001cdaac30 meta:size [] 656
>> 000200001cdaac30 meta:source [] data2
>> 000200001cfaac30 meta:filename [] doc04484522
>> 000200001cfaac30 meta:size [] 565
>> 000200001cfaac30 meta:source [] data2
>> 000200001dcaac30 meta:filename [] doc03342958
>>
>> and I'd like to sum up all the values of meta:size across all rows. I
>> know I can scan the sizes and sum them on the client side, but I was
>> hoping there would be a way to do this inside my cluster. Is mapreduce
>> my only option here?
>>
>> Thanks,
>> -Russ
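For reference, the multi-level summation described in this thread can be sketched in plain Python. This is only a simulation of the idea and does not use the Accumulo API: the split point, the `tablet_for` helper, and the function names are all hypothetical, and a real deployment would do the first stage in a server-side iterator stack rather than in client code.

```python
from collections import defaultdict

# Entries from the thread as (row, column, value); visibility is omitted.
ENTRIES = [
    ("000200001ccaac30", "meta:size", "1807"),
    ("000200001ccaac30", "meta:source", "data2"),
    ("000200001cdaac30", "meta:filename", "doc02985453"),
    ("000200001cdaac30", "meta:size", "656"),
    ("000200001cdaac30", "meta:source", "data2"),
    ("000200001cfaac30", "meta:filename", "doc04484522"),
    ("000200001cfaac30", "meta:size", "565"),
    ("000200001cfaac30", "meta:source", "data2"),
    ("000200001dcaac30", "meta:filename", "doc03342958"),
]


def tablet_for(row, split_points):
    """Hypothetical tablet lookup: index of the first split point >= row.

    Split points are treated as inclusive end rows, so a row is never
    divided between two tablets.
    """
    for i, split in enumerate(split_points):
        if row <= split:
            return i
    return len(split_points)


def server_side_partial_sums(entries, split_points, column="meta:size"):
    """Stage 1 (stands in for the server side): one partial sum per tablet."""
    partial = defaultdict(int)
    for row, col, value in entries:
        if col == column:
            partial[tablet_for(row, split_points)] += int(value)
    return dict(partial)


def client_side_total(partial_sums):
    """Stage 2 (client side): sum the per-tablet sums."""
    return sum(partial_sums.values())


SPLITS = ["000200001cdaac30"]  # assumed split point, for illustration only
partials = server_side_partial_sums(ENTRIES, SPLITS)
total = client_side_total(partials)
print(partials, total)
```

With the assumed split point, the client receives one value per tablet instead of one per row, and the final sum is the same as a full client-side scan would give.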
