Hi Scott,

The problem is found: in the reduce job, when there were too many values for some key, I stopped reading values from the iterator, so apparently the rest of the values were never counted. I had thought that, as with sequence files, unread values would be counted in any case; that's why I didn't suspect this from the very beginning.
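For anyone hitting the same counter mismatch, below is a minimal sketch of the pattern that caused it, written against the org.apache.avro.mapred API discussed in this thread. The reducer class and the per-key cap are hypothetical, for illustration only; the point is that breaking out of the value iterator early means Hadoop never pulls, and therefore never counts, the remaining values for that key.

    import java.io.IOException;
    import org.apache.avro.mapred.AvroCollector;
    import org.apache.avro.mapred.AvroReducer;
    import org.apache.avro.util.Utf8;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical reducer reproducing the mismatch: once the loop breaks,
    // the remaining values for this key are never read from the iterator,
    // so they never show up in the "Reduce input records" counter.
    public class CappedReducer extends AvroReducer<Utf8, Long, Long> {
      private static final int MAX_VALUES_PER_KEY = 10000; // illustrative cap

      @Override
      public void reduce(Utf8 key, Iterable<Long> values,
                         AvroCollector<Long> collector, Reporter reporter)
          throws IOException {
        long seen = 0;
        for (Long value : values) {
          if (++seen > MAX_VALUES_PER_KEY) {
            break; // BUG: values left on the iterator are skipped, not counted
          }
        }
        // Fix: consume the iterator to the end (even if the extra values are
        // ignored) so that every value is counted, e.g.
        //   for (Long ignored : values) { }
        collector.collect(seen);
      }
    }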
Thanks for the support,
Vyacheslav

On Aug 18, 2011, at 1:47 AM, Scott Carey wrote:

> That is very interesting… I don't see how Avro could affect that.
>
> Does anyone else have any ideas how Avro might cause the below?
>
> -Scott
>
> On 8/17/11 3:59 PM, "Vyacheslav Zholudev" <[email protected]>
> wrote:
>
>> There is a possible reason:
>> It seems that there is an upper limit of 10,001 records per reduce input
>> group. (Or is there a setting?)
>>
>> If I output one million rows with the same key, I get:
>> Map output records: 1,000,000
>> Reduce input groups: 1
>> Reduce input records: 10,001
>>
>> If I output one million rows with 20 different keys, I get:
>> Map output records: 1,000,000
>> Reduce input groups: 20
>> Reduce input records: 200,020
>>
>> If I output one million rows with unique keys, I get:
>> Map output records: 1,000,000
>> Reduce input groups: 1,000,000
>> Reduce input records: 1,000,000
>>
>> Btw., I am running on 5 nodes with a total map task capacity of 10 and a
>> total reduce task capacity of 10.
>>
>> Thanks,
>> Vyacheslav
>>
>> On Aug 17, 2011, at 7:18 PM, Scott Carey wrote:
>>
>>> On 8/17/11 5:02 AM, "Vyacheslav Zholudev" <[email protected]>
>>> wrote:
>>>
>>>> Btw,
>>>>
>>>> I was thinking of trying it with Utf8 objects instead of Strings, and I
>>>> wanted to reuse the same Utf8 object instead of creating a new one from
>>>> a String upon each map() call.
>>>> Why doesn't the Utf8 class have a method for setting bytes via a String
>>>> object?
>>>
>>> We could add that, but it won't help performance much in this case, since
>>> the performance improvement from reuse has more to do with the underlying
>>> byte[] than with the Utf8 object.
>>> The expensive part of String is the conversion from the underlying char[]
>>> to a byte[] (Utf8.getBytesFor()), so this would not help much. It would
>>> probably be faster to use String directly rather than wrap it with Utf8
>>> each time.
>>>
>>> Rather than have a static method like the below, I would propose an
>>> instance method that does the same thing, something like
>>>
>>> public void setValue(String val) {
>>>   // gets bytes, replaces the private byte array, replaces the cached
>>>   // string; no System.arraycopy.
>>> }
>>>
>>> which would be much more efficient.
>>>
>>>> I created the following code snippet:
>>>>
>>>> public static Utf8 reuseUtf8Object(Utf8 container, String strToReuse) {
>>>>   byte[] strBytes = Utf8.getBytesFor(strToReuse);
>>>>   container.setByteLength(strBytes.length);
>>>>   System.arraycopy(strBytes, 0, container.getBytes(), 0, strBytes.length);
>>>>   return container;
>>>> }
>>>>
>>>> Would that be useful if this code were encapsulated in the Utf8 class?
>>>>
>>>> Best,
>>>> Vyacheslav
>>>>
>>>> On Aug 17, 2011, at 3:56 AM, Scott Carey wrote:
>>>>
>>>>> On 8/16/11 3:56 PM, "Vyacheslav Zholudev" <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Scott,
>>>>>>
>>>>>> thanks for your reply.
>>>>>>
>>>>>>> What Avro version is this happening with? What JVM version?
>>>>>>
>>>>>> We are using Avro 1.5.1 and Sun JDK 6, but the exact version I will
>>>>>> have to look up.
>>>>>>
>>>>>>> On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM
>>>>>>> args if it is Sun and JRE 6u21 or later? (Some issues in loop
>>>>>>> predicates affect Java 6 too, just not as many as the recent news on
>>>>>>> Java 7.)
>>>>>>>
>>>>>>> Otherwise, it may likely be the same thing as AVRO-782. Any extra
>>>>>>> information related to that issue would be welcome.
>>>>>>
>>>>>> I will have to collect it. In the meanwhile, do you have any
>>>>>> reasonable explanation of the issue besides it being something like
>>>>>> AVRO-782?
>>>>>
>>>>> What is your key type (map output schema, first type argument of Pair)?
>>>>> Is your key a Utf8 or a String? I don't have a reasonable explanation
>>>>> at this point; I haven't looked into it in depth with a good
>>>>> reproducible case. I have my suspicions about how recycling of the key
>>>>> works, since Utf8 is mutable and its backing byte[] can end up shared.
>>>>>
>>>>>> Thanks a lot,
>>>>>> Vyacheslav
>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> -Scott
>>>>>>>
>>>>>>> On 8/16/11 8:39 AM, "Vyacheslav Zholudev"
>>>>>>> <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have multiple Hadoop jobs that use the Avro mapred API.
>>>>>>>> Only in one of the jobs do I see a visible mismatch between the
>>>>>>>> number of map output records and the number of reducer input
>>>>>>>> records.
>>>>>>>>
>>>>>>>> Has anybody encountered such behavior? Can anybody think of possible
>>>>>>>> explanations for this phenomenon?
>>>>>>>>
>>>>>>>> Any pointers/thoughts are highly appreciated!
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Vyacheslav
>>>>>>
>>>>>> Best,
>>>>>> Vyacheslav
>>
>> Best,
>> Vyacheslav
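For reference, a minimal sketch of the setValue(String) instance method Scott proposes in the thread above. It assumes Utf8 internals along the lines of the 1.5-era class (a private byte[] buffer, an int byte length, and a cached String); the field names here are assumptions for illustration, not actual Avro code:

    // Sketch only: adopts the String's encoded bytes as the new backing
    // array, so no copy into a pre-sized buffer is needed.
    public Utf8 setValue(String value) {
      this.bytes = getBytesFor(value);  // one UTF-8 encoding pass
      this.length = this.bytes.length;  // byte length, not char count
      this.string = value;              // cache so toString() is free
      return this;
    }

Unlike the static reuseUtf8Object() helper quoted above, this adopts the freshly encoded byte[] as the backing array instead of copying it into the container's existing buffer, which is what saves the System.arraycopy.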
