A possible reason: it seems there is an upper limit of 10,001 records per reduce input group. (Or is there a setting that controls this?)

If I output one million rows with the same key, I get:

    Map output records:   1,000,000
    Reduce input groups:  1
    Reduce input records: 10,001

If I output one million rows with 20 different keys, I get:

    Map output records:   1,000,000
    Reduce input groups:  20
    Reduce input records: 200,020

If I output one million rows with unique keys, I get:

    Map output records:   1,000,000
    Reduce input groups:  1,000,000
    Reduce input records: 1,000,000

(A reducer-side counter sketch for double-checking how many values actually reach reduce() is appended after the quoted thread at the bottom of this message.)

Btw, I am running on 5 nodes with a total map task capacity of 10 and a total reduce task capacity of 10.

Thanks,
Vyacheslav

On Aug 17, 2011, at 7:18 PM, Scott Carey wrote:

> On 8/17/11 5:02 AM, "Vyacheslav Zholudev" <[email protected]>
> wrote:
>
>> Btw,
>>
>> I was thinking to try it with Utf8 objects instead of Strings, and I wanted
>> to reuse the same Utf8 object instead of creating a new one from a String
>> upon each map() call.
>> Why doesn't the Utf8 class have a method for setting its bytes via a String
>> object?
>
>
> We could add that, but it won't help performance much in this case, since the
> performance improvement from reuse has more to do with the underlying byte[]
> than with the Utf8 object.
> The expensive part of String is the conversion from an underlying char[] to a
> byte[] (Utf8.getBytesFor()), so this would not help much. It would probably
> be faster to use String directly rather than wrap it with Utf8 each time.
>
> Rather than a static method like the one below, I would propose an instance
> method that does the same thing, something like
>
>     public void setValue(String val) {
>       // gets bytes, replaces the private byte array, replaces the cached
>       // string; no System.arraycopy needed.
>     }
>
> which would be much more efficient.
>
>
>>
>> I created the following code snippet:
>>
>>     public static Utf8 reuseUtf8Object(Utf8 container, String strToReuse) {
>>         byte[] strBytes = Utf8.getBytesFor(strToReuse);
>>         container.setByteLength(strBytes.length);
>>         System.arraycopy(strBytes, 0, container.getBytes(), 0,
>>             strBytes.length);
>>         return container;
>>     }
>>
>> Would that be useful if this code is encapsulated into the Utf8 class?
>>
>> Best,
>> Vyacheslav
>>
>> On Aug 17, 2011, at 3:56 AM, Scott Carey wrote:
>>
>>> On 8/16/11 3:56 PM, "Vyacheslav Zholudev" <[email protected]>
>>> wrote:
>>>
>>>> Hi Scott,
>>>>
>>>> thanks for your reply.
>>>>
>>>>> What Avro version is this happening with? What JVM version?
>>>>
>>>> We are using Avro 1.5.1 and Sun JDK 6, but the exact version I will have
>>>> to look up.
>>>>
>>>>> On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM args
>>>>> if it is Sun and JRE 6u21 or later? (Some issues in loop predicates
>>>>> affect Java 6 too, just not as many as in the recent news about Java 7.)
>>>>>
>>>>> Otherwise, it may likely be the same thing as AVRO-782. Any extra
>>>>> information related to that issue would be welcome.
>>>>
>>>> I will have to collect it. In the meantime, do you have any reasonable
>>>> explanation of the issue besides it being something like AVRO-782?
>>>
>>> What is your key type (map output schema, first type argument of Pair)?
>>> Is your key a Utf8 or a String? I don't have a reasonable explanation at
>>> this point; I haven't looked into it in depth with a good reproducible
>>> case. I have my suspicions about how recycling of the key works, since
>>> Utf8 is mutable and its backing byte[] can end up shared.
>>>
>>>> Thanks a lot,
>>>> Vyacheslav
>>>>
>>>>> Thanks!
>>>>> -Scott
>>>>>
>>>>> On 8/16/11 8:39 AM, "Vyacheslav Zholudev" <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have multiple Hadoop jobs that use the Avro mapred API.
>>>>>> Only in one of the jobs do I see a visible mismatch between the number
>>>>>> of map output records and reducer input records.
>>>>>>
>>>>>> Has anybody encountered such behavior? Can anybody think of possible
>>>>>> explanations for this phenomenon?
>>>>>>
>>>>>> Any pointers/thoughts are highly appreciated!
>>>>>>
>>>>>> Best,
>>>>>> Vyacheslav
>>>>
>>>> Best,
>>>> Vyacheslav
>>
>> Best,
>> Vyacheslav
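
To double-check whether the records beyond 10,001 are really not handed to reduce(), or are only being mis-reported by the framework counter, the values delivered to each reduce call can be counted explicitly with a user counter and compared against "Reduce input records". Below is a minimal sketch against the Avro 1.5 mapred API; the class name, the counter names, and the Utf8/Long schema types are illustrative assumptions, not details taken from the job discussed above.

    import java.io.IOException;

    import org.apache.avro.mapred.AvroCollector;
    import org.apache.avro.mapred.AvroReducer;
    import org.apache.avro.util.Utf8;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical reducer that counts how many values each group actually
    // delivers, independently of the framework's "Reduce input records"
    // counter. Utf8 keys and Long values are placeholder schema types.
    public class GroupSizeReducer extends AvroReducer<Utf8, Long, Long> {
      @Override
      public void reduce(Utf8 key, Iterable<Long> values,
                         AvroCollector<Long> collector, Reporter reporter)
          throws IOException {
        long seen = 0;
        for (Long ignored : values) {
          seen++;
          reporter.incrCounter("debug", "values-seen-in-reduce", 1);
        }
        collector.collect(seen);  // emit the observed group size for inspection
      }
    }

If the "debug / values-seen-in-reduce" counter also stops near 10,001 for the single-key job, the values are genuinely being dropped before reduce(); if it reaches 1,000,000, only the framework counter is misleading.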

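On the Utf8 reuse question in the quoted exchange above: with only the public API available in Avro 1.5 (Utf8.getBytesFor, setByteLength, getBytes), the snippet from the thread can be wrapped in a small helper. This is just a sketch; as Scott points out, the expensive step is still the String to byte[] conversion inside getBytesFor, so reuse mainly avoids allocating a new Utf8 wrapper, not the encoding work. The class and method names here are made up for illustration.

    import org.apache.avro.util.Utf8;

    // Sketch of the reuse helper discussed in the thread, built only on Utf8's
    // public API. Assumes setByteLength grows the backing byte[] when the new
    // content is longer than the current capacity.
    public final class Utf8Reuse {
      private Utf8Reuse() {}

      public static Utf8 setValue(Utf8 target, String value) {
        byte[] encoded = Utf8.getBytesFor(value);  // String -> UTF-8 bytes (the costly part)
        target.setByteLength(encoded.length);      // resizes the backing array if needed
        System.arraycopy(encoded, 0, target.getBytes(), 0, encoded.length);
        return target;
      }
    }

A mapper could then keep a single Utf8 field and call Utf8Reuse.setValue(key, str) on every map() call instead of constructing a new Utf8(str) each time.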