That is very interesting. I don't see how Avro could affect that. Does anyone else have any ideas about how Avro might cause the behavior below?
-Scott

On 8/17/11 3:59 PM, "Vyacheslav Zholudev" <[email protected]> wrote:

> There is a possible reason:
> It seems that there is an upper limit of 10,001 records per reduce input
> group (or is there a setting for this?).
>
> If I output one million rows with the same key, I get:
> Map output records: 1,000,000
> Reduce input groups: 1
> Reduce input records: 10,001
>
> If I output one million rows with 20 different keys, I get:
> Map output records: 1,000,000
> Reduce input groups: 20
> Reduce input records: 200,020
>
> If I output one million rows with unique keys, I get:
> Map output records: 1,000,000
> Reduce input groups: 1,000,000
> Reduce input records: 1,000,000
>
> Btw., I am running on 5 nodes with a total map task capacity of 10 and a
> total reduce task capacity of 10.
>
> Thanks,
> Vyacheslav
>
> On Aug 17, 2011, at 7:18 PM, Scott Carey wrote:
>
>> On 8/17/11 5:02 AM, "Vyacheslav Zholudev" <[email protected]>
>> wrote:
>>
>>> Btw,
>>>
>>> I was thinking of trying it with Utf8 objects instead of Strings, and I
>>> wanted to reuse the same Utf8 object instead of creating a new one from
>>> a String on each map() call.
>>> Why doesn't the Utf8 class have a method for setting its bytes via a
>>> String object?
>>
>> We could add that, but it won't help performance much in this case, since
>> the performance improvement from reuse has more to do with the underlying
>> byte[] than with the Utf8 object.
>> The expensive part of String is the conversion from the underlying char[]
>> to a byte[] (Utf8.getBytesFor()), so this would not help much. It would
>> probably be faster to use String directly rather than wrap it in a Utf8
>> each time.
>>
>> Rather than a static method like the one below, I would propose an
>> instance method that does the same thing, something like
>>
>> public void setValue(String val) {
>>   // gets bytes, replaces the private byte array, replaces the cached
>>   // string; no System.arraycopy needed.
>> }
>>
>> which would be much more efficient.
>>
>>> I created the following code snippet:
>>>
>>> public static Utf8 reuseUtf8Object(Utf8 container, String strToReuse) {
>>>   // Encode the String as UTF-8 bytes.
>>>   byte[] strBytes = Utf8.getBytesFor(strToReuse);
>>>   // Grow the container's backing array if needed and set the new length.
>>>   container.setByteLength(strBytes.length);
>>>   // Copy the encoded bytes into the container's backing array.
>>>   System.arraycopy(strBytes, 0, container.getBytes(), 0,
>>>       strBytes.length);
>>>   return container;
>>> }
>>>
>>> Would that be useful if this code were encapsulated in the Utf8 class?
>>>
>>> Best,
>>> Vyacheslav
>>>
>>> On Aug 17, 2011, at 3:56 AM, Scott Carey wrote:
>>>
>>>> On 8/16/11 3:56 PM, "Vyacheslav Zholudev" <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Scott,
>>>>>
>>>>> thanks for your reply.
>>>>>
>>>>>> What Avro version is this happening with? What JVM version?
>>>>>
>>>>> We are using Avro 1.5.1 and Sun JDK 6, but the exact version I will
>>>>> have to look up.
>>>>>
>>>>>> On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM
>>>>>> args if it is Sun and JRE 6u21 or later? (Some loop-predicate issues
>>>>>> affect Java 6 too, just not as many as in the recent news about
>>>>>> Java 7.)
>>>>>>
>>>>>> Otherwise, it may well be the same thing as AVRO-782. Any extra
>>>>>> information related to that issue would be welcome.
>>>>>
>>>>> I will have to collect it. In the meanwhile, do you have any
>>>>> reasonable explanation of the issue besides it being something like
>>>>> AVRO-782?
>>>>
>>>> What is your key type (the map output schema, i.e. the first type
>>>> argument of Pair)? Is your key a Utf8 or a String?
>>>> I don't have a reasonable explanation at this point; I haven't looked
>>>> into it in depth with a good reproducible case. I have my suspicions
>>>> about how recycling of the key works, since Utf8 is mutable and its
>>>> backing byte[] can end up shared.
>>>>
>>>>> Thanks a lot,
>>>>> Vyacheslav
>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> -Scott
>>>>>>
>>>>>> On 8/16/11 8:39 AM, "Vyacheslav Zholudev"
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have multiple Hadoop jobs that use the Avro mapred API.
>>>>>>> In only one of the jobs is there a visible mismatch between the
>>>>>>> number of map output records and the number of reducer input
>>>>>>> records.
>>>>>>>
>>>>>>> Has anybody encountered such behavior? Can anybody think of a
>>>>>>> possible explanation for this phenomenon?
>>>>>>>
>>>>>>> Any pointers/thoughts are highly appreciated!
>>>>>>>
>>>>>>> Best,
>>>>>>> Vyacheslav
>>>>>
>>>>> Best,
>>>>> Vyacheslav
>
> Best,
> Vyacheslav
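For reference, the instance method Scott proposes above might look roughly like the sketch below. This is a hypothetical illustration, not code from any Avro release: the private field names (bytes, length, string) are assumptions about Utf8's internals.

  // Hypothetical sketch of the proposed Utf8.setValue(String).
  // The fields referenced here (bytes, length, string) are assumed names
  // for Utf8's private internals and may differ between Avro versions.
  public void setValue(String val) {
    // Encode once; this char[] -> byte[] conversion is the unavoidable cost.
    byte[] encoded = getBytesFor(val);
    // Adopt the freshly encoded array directly: no System.arraycopy needed.
    this.bytes = encoded;
    this.length = encoded.length;
    // Cache the source String so a later toString() can return it directly.
    this.string = val;
  }

Because the encoded array is adopted rather than copied, this avoids both the capacity check in setByteLength() and the arraycopy in the static helper above.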
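Scott's suspicion above is that Utf8 is mutable and that object recycling can leave two logical keys sharing one backing byte[]. The following self-contained sketch (hypothetical code, not from this thread; the class and method names are invented for illustration) shows the defensive copy that avoids retaining a recycled key:

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.avro.util.Utf8;

  public class KeyCopyExample {
    private final List<Utf8> seenKeys = new ArrayList<Utf8>();

    // Hypothetical reduce-side hook, called once per key.
    public void observeKey(Utf8 key) {
      // Unsafe: seenKeys.add(key) would retain the recycled instance,
      // whose backing byte[] may be overwritten on the next call.
      // Safe: deep-copy the key before retaining it.
      seenKeys.add(new Utf8(key.toString()));
    }
  }

An equivalent copy that skips re-encoding is new Utf8(Arrays.copyOf(key.getBytes(), key.getByteLength())); either way the bytes are duplicated, so later mutation of the original key cannot corrupt what was stored.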
