Hi Scott,

The problem is found. In the reduce job, when there were too many values for 
some key, I stopped reading values from the iterator, so the rest of the 
values were never counted. I had assumed that, as with sequence files, unread 
values would be counted anyway; that's why I didn't suspect it from the very 
beginning.
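
For the record, a minimal sketch of the pattern that caused it (hypothetical 
class name and type parameters, org.apache.avro.mapred API; the framework 
only counts values that are actually pulled from the iterator):

    import java.io.IOException;
    import org.apache.avro.mapred.AvroCollector;
    import org.apache.avro.mapred.AvroReducer;
    import org.apache.avro.util.Utf8;
    import org.apache.hadoop.mapred.Reporter;

    public class CappedCountReducer extends AvroReducer<Utf8, Long, Long> {
        private static final int CAP = 10000; // hypothetical per-key cap

        @Override
        public void reduce(Utf8 key, Iterable<Long> values,
                           AvroCollector<Long> collector, Reporter reporter)
                throws IOException {
            long seen = 0;
            for (Long v : values) {
                seen++;
                if (seen >= CAP) {
                    break; // values left on the iterator are silently skipped
                }          // and never reach "Reduce input records"
            }
            collector.collect(seen);
        }
    }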

Thanks for the support,
Vyacheslav


On Aug 18, 2011, at 1:47 AM, Scott Carey wrote:

> That is very interesting… I don't see how Avro could affect that.  
> 
> Does anyone else have any ideas how Avro might cause the below?
> 
> -Scott
> 
> On 8/17/11 3:59 PM, "Vyacheslav Zholudev" <[email protected]> 
> wrote:
> 
>> Here is a possible reason:
>> It seems that there is an upper limit of 10,001 records per reduce input 
>> group (or is there a setting for this?).
>> 
>> 
>> If I output one million rows with the same key, I get:
>> Map output records: 1,000,000
>> Reduce input groups: 1
>> Reduce input records: 10,001
>> 
>> If I output one million rows with 20 different keys, I get:
>> Map output records: 1,000,000
>> Reduce input groups: 20
>> Reduce input records: 200,020
>> 
>> If I output one million rows with unique keys, I get:
>> Map output records: 1,000,000
>> Reduce input groups: 1,000,000
>> Reduce input records: 1,000,000
>> 
>> 
>> Btw, I am running on 5 nodes with a total map task capacity of 10 and a 
>> total reduce task capacity of 10.
>> 
>> Thanks,
>> Vyacheslav
>> 
>> On Aug 17, 2011, at 7:18 PM, Scott Carey wrote:
>> 
>>> On 8/17/11 5:02 AM, "Vyacheslav Zholudev" <[email protected]> 
>>> wrote:
>>> 
>>>> btw,
>>>> 
>>>> I was thinking of trying it with Utf8 objects instead of Strings, and I 
>>>> wanted to reuse the same Utf8 object instead of creating a new one from a 
>>>> String upon each map() call.
>>>> Why doesn't the Utf8 class have a method for setting its bytes via a 
>>>> String object?
>>> 
>>> 
>>> We could add that, but it won't help performance much in this case since 
>>> the performance improvement from reuse has more to do with the underlying 
>>> byte[] than the Utf8 object.
>>> The expensive part of String is the conversion from an underlying char[] to 
>>> a byte[] (Utf8.getBytesFor()), so this would not help much.  It would 
>>> probably be faster to use String directly rather than wrap it with Utf8 
>>> each time.
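>>> 
>>> (A tiny sketch, assuming the Avro 1.5 API: both paths below pay the same 
>>> char[]-to-byte[] conversion, and the wrapper only adds an object on top.)
>>> 
>>>     String s = "example";              // the value for this map() call
>>>     Utf8 wrapped = new Utf8(s);        // calls Utf8.getBytesFor(s) internally
>>>     byte[] raw = Utf8.getBytesFor(s);  // the conversion cost by itself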
>>> 
>>> Rather than have a static method like the below, I would propose that an 
>>> instance method be made that does the same thing, something like 
>>> 
>>> public void setValue(String val) {
>>>    // gets bytes, replaces the private byte array, replaces the cached
>>>    // string; no System.arraycopy needed since the new array is adopted
>>>    this.bytes = getBytesFor(val);
>>>    this.length = this.bytes.length;
>>>    this.string = val;
>>> }
>>> 
>>> which would be much more efficient.  
>>> 
>>> 
>>>> 
>>>> I created the following code snippet: 
>>>> 
>>>>     public static Utf8 reuseUtf8Object(Utf8 container, String strToReuse) {
>>>>         byte[] strBytes = Utf8.getBytesFor(strToReuse);
>>>>         // setByteLength grows the backing array if needed, so the copy
>>>>         // below stays in bounds
>>>>         container.setByteLength(strBytes.length);
>>>>         System.arraycopy(strBytes, 0, container.getBytes(), 0, strBytes.length);
>>>>         return container;
>>>>     }
>>>> 
>>>> Would that be useful if this code is encapsulated into the Utf8 class?
>>>> 
>>>> Best,
>>>> Vyacheslav
>>>> 
>>>> On Aug 17, 2011, at 3:56 AM, Scott Carey wrote:
>>>> 
>>>>> On 8/16/11 3:56 PM, "Vyacheslav Zholudev" <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Hi, Scott,
>>>>>> 
>>>>>> thanks for your reply.
>>>>>> 
>>>>>>> What Avro version is this happening with? What JVM version?
>>>>>> 
>>>>>> We are using Avro 1.5.1 and Sun JDK 6, but I will have to look up the
>>>>>> exact version.
>>>>>> 
>>>>>>> 
>>>>>>> On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM args
>>>>>>> if it is Sun and JRE 6u21 or later? (Some loop-predicate issues affect
>>>>>>> Java 6 too, just not as many as in the recent news about Java 7.)
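>>>>>>> 
>>>>>>> (A sketch of one way to pass the flag to the task JVMs; the property
>>>>>>> name is from the old mapred API and could also go in mapred-site.xml:)
>>>>>>> 
>>>>>>>     JobConf conf = new JobConf();  // org.apache.hadoop.mapred.JobConf
>>>>>>>     conf.set("mapred.child.java.opts", "-XX:-UseLoopPredicate");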
>>>>>>> 
>>>>>>> Otherwise, it may well be the same thing as AVRO-782.  Any extra
>>>>>>> information related to that issue would be welcome.
>>>>>> 
>>>>>> I will have to collect it. In the meantime, do you have any reasonable
>>>>>> explanation of the issue besides it being something like AVRO-782?
>>>>> 
>>>>> What is your key type (map output schema, first type argument of Pair)?
>>>>> Is your key a Utf8 or a String?  I don't have a reasonable explanation at
>>>>> this point; I haven't looked into it in depth with a good reproducible
>>>>> case.  I have my suspicions about how recycling of the key works, since
>>>>> Utf8 is mutable and its backing byte[] can end up shared.
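>>>>> 
>>>>> (A contrived illustration, not from a real reproduction: anything that
>>>>> holds a reference to a reused Utf8, or to its backing array, sees the
>>>>> contents change under it.)
>>>>> 
>>>>>     byte[] beta = Utf8.getBytesFor("beta");
>>>>>     Utf8 reused = new Utf8("alpha");
>>>>>     Utf8 alias = reused;  // e.g. stashed as a "key" with no deep copy
>>>>>     reused.setByteLength(beta.length);
>>>>>     System.arraycopy(beta, 0, reused.getBytes(), 0, beta.length);
>>>>>     System.out.println(alias);  // prints "beta": the stored key changed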
>>>>> 
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Thanks a lot,
>>>>>> Vyacheslav
>>>>>> 
>>>>>>> 
>>>>>>> Thanks!
>>>>>>> 
>>>>>>> -Scott
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 8/16/11 8:39 AM, "Vyacheslav Zholudev"
>>>>>>> <[email protected]>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I have multiple Hadoop jobs that use the Avro mapred API.
>>>>>>>> In only one of the jobs do I see a visible mismatch between the number
>>>>>>>> of map output records and reducer input records.
>>>>>>>> 
>>>>>>>> Has anybody encountered such behavior? Can anybody think of possible
>>>>>>>> explanations for this phenomenon?
>>>>>>>> 
>>>>>>>> Any pointers/thoughts are highly appreciated!
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Vyacheslav
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> Best,
>>>>>> Vyacheslav
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>> 
>> Best,
>> Vyacheslav
>> 
>> 
>> 
