That is very interesting. I don't see how Avro could affect that. Does anyone else have any ideas about how Avro might cause the behavior below?
-Scott

On 8/17/11 3:59 PM, "Vyacheslav Zholudev" <[email protected]> wrote:

> There is a possible reason:
> It seems that there is an upper limit of 10,001 records per reduce input
> group (or is there a setting for this?).
>
> If I output one million rows with the same key, I get:
> Map output records: 1,000,000
> Reduce input groups: 1
> Reduce input records: 10,001
>
> If I output one million rows with 20 different keys, I get:
> Map output records: 1,000,000
> Reduce input groups: 20
> Reduce input records: 200,020
>
> If I output one million rows with unique keys, I get:
> Map output records: 1,000,000
> Reduce input groups: 1,000,000
> Reduce input records: 1,000,000
>
> Btw., I am running on 5 nodes with a total map task capacity of 10 and a
> total reduce task capacity of 10.
>
> Thanks,
> Vyacheslav
>
> On Aug 17, 2011, at 7:18 PM, Scott Carey wrote:
>
>> On 8/17/11 5:02 AM, "Vyacheslav Zholudev" <[email protected]>
>> wrote:
>>
>>> Btw,
>>>
>>> I was thinking of trying it with Utf8 objects instead of Strings, and I
>>> wanted to reuse the same Utf8 object instead of creating a new one from
>>> a String on each map() call.
>>> Why doesn't the Utf8 class have a method for setting its bytes via a
>>> String object?
>>
>> We could add that, but it won't help performance much in this case, since
>> the performance improvement from reuse has more to do with the underlying
>> byte[] than with the Utf8 object.
>> The expensive part of String is the conversion from the underlying char[]
>> to a byte[] (Utf8.getBytesFor()), so this would not help much. It would
>> probably be faster to use String directly rather than wrap it in a Utf8
>> each time.
>>
>> Rather than a static method like the one below, I would propose an
>> instance method that does the same thing, something like
>>
>> public void setValue(String val) {
>>   // gets bytes, replaces the private byte array, replaces the cached
>>   // string; no System.arraycopy needed.
>> }
>>
>> which would be much more efficient.
>>
>>> I created the following code snippet:
>>>
>>> public static Utf8 reuseUtf8Object(Utf8 container, String strToReuse) {
>>>   // Encode the String as UTF-8 bytes.
>>>   byte[] strBytes = Utf8.getBytesFor(strToReuse);
>>>   // Grow the container's backing array if needed and set the new length.
>>>   container.setByteLength(strBytes.length);
>>>   // Copy the encoded bytes into the container's backing array.
>>>   System.arraycopy(strBytes, 0, container.getBytes(), 0,
>>>       strBytes.length);
>>>   return container;
>>> }
>>>
>>> Would that be useful if this code were encapsulated in the Utf8 class?
>>>
>>> Best,
>>> Vyacheslav
>>>
>>> On Aug 17, 2011, at 3:56 AM, Scott Carey wrote:
>>>
>>>> On 8/16/11 3:56 PM, "Vyacheslav Zholudev" <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Scott,
>>>>>
>>>>> thanks for your reply.
>>>>>
>>>>>> What Avro version is this happening with? What JVM version?
>>>>>
>>>>> We are using Avro 1.5.1 and Sun JDK 6, but the exact version I will
>>>>> have to look up.
>>>>>
>>>>>> On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM
>>>>>> args if it is Sun and JRE 6u21 or later? (Some loop-predicate issues
>>>>>> affect Java 6 too, just not as many as in the recent news about
>>>>>> Java 7.)
>>>>>>
>>>>>> Otherwise, it may well be the same thing as AVRO-782. Any extra
>>>>>> information related to that issue would be welcome.
>>>>>
>>>>> I will have to collect it. In the meanwhile, do you have any
>>>>> reasonable explanation of the issue besides it being something like
>>>>> AVRO-782?
>>>>
>>>> What is your key type (the map output schema, i.e. the first type
>>>> argument of Pair)? Is your key a Utf8 or a String?
>>>> I don't have a reasonable explanation at this point; I haven't looked
>>>> into it in depth with a good reproducible case. I have my suspicions
>>>> about how recycling of the key works, since Utf8 is mutable and its
>>>> backing byte[] can end up shared.
>>>>
>>>>> Thanks a lot,
>>>>> Vyacheslav
>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> -Scott
>>>>>>
>>>>>> On 8/16/11 8:39 AM, "Vyacheslav Zholudev"
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have multiple Hadoop jobs that use the Avro mapred API.
>>>>>>> In only one of the jobs is there a visible mismatch between the
>>>>>>> number of map output records and the number of reducer input
>>>>>>> records.
>>>>>>>
>>>>>>> Has anybody encountered such behavior? Can anybody think of a
>>>>>>> possible explanation for this phenomenon?
>>>>>>>
>>>>>>> Any pointers/thoughts are highly appreciated!
>>>>>>>
>>>>>>> Best,
>>>>>>> Vyacheslav
>>>>>
>>>>> Best,
>>>>> Vyacheslav
>
> Best,
> Vyacheslav
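For reference, the instance method Scott proposes above might look roughly like the sketch below. This is a hypothetical illustration, not code from any Avro release: the private field names (bytes, length, string) are assumptions about Utf8's internals.

  // Hypothetical sketch of the proposed Utf8.setValue(String).
  // The fields referenced here (bytes, length, string) are assumed names
  // for Utf8's private internals and may differ between Avro versions.
  public void setValue(String val) {
    // Encode once; this char[] -> byte[] conversion is the unavoidable cost.
    byte[] encoded = getBytesFor(val);
    // Adopt the freshly encoded array directly: no System.arraycopy needed.
    this.bytes = encoded;
    this.length = encoded.length;
    // Cache the source String so a later toString() can return it directly.
    this.string = val;
  }

Because the encoded array is adopted rather than copied, this avoids both the capacity check in setByteLength() and the arraycopy in the static helper above.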
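Scott's suspicion above is that Utf8 is mutable and that object recycling can leave two logical keys sharing one backing byte[]. The following self-contained sketch (hypothetical code, not from this thread; the class and method names are invented for illustration) shows the defensive copy that avoids retaining a recycled key:

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.avro.util.Utf8;

  public class KeyCopyExample {
    private final List<Utf8> seenKeys = new ArrayList<Utf8>();

    // Hypothetical reduce-side hook, called once per key.
    public void observeKey(Utf8 key) {
      // Unsafe: seenKeys.add(key) would retain the recycled instance,
      // whose backing byte[] may be overwritten on the next call.
      // Safe: deep-copy the key before retaining it.
      seenKeys.add(new Utf8(key.toString()));
    }
  }

An equivalent copy that skips re-encoding is new Utf8(Arrays.copyOf(key.getBytes(), key.getByteLength())); either way the bytes are duplicated, so later mutation of the original key cannot corrupt what was stored.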
