Hi Scott, 

Thanks for all the suggestions, I really appreciate your support.

Unfortunately, I have not been able to solve the problem so far.

Here is what I have tried:

1. Switched to Utf8 everywhere, including changing the interface to 
<Utf8, SomeSpecificJavaClass>.
2. Always generated new instances before collecting (new Utf8(theString) for 
the key, a clone for the value); see the sketch below.
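
For reference, the mapper now looks roughly like this (MyRecord, getId() and 
copyOf() are placeholders for the real record class and our deep-copy helper, 
not the actual job code):

    // imports: org.apache.avro.mapred.{AvroMapper, AvroCollector, Pair},
    //          org.apache.avro.util.Utf8, org.apache.hadoop.mapred.Reporter,
    //          java.io.IOException
    public class MyMapper extends AvroMapper<MyRecord, Pair<Utf8, MyRecord>> {
      @Override
      public void map(MyRecord in, AvroCollector<Pair<Utf8, MyRecord>> collector,
                      Reporter reporter) throws IOException {
        // Fresh objects on every call -- nothing is recycled across collects.
        Utf8 key = new Utf8(in.getId().toString()); // new Utf8 from a String
        MyRecord value = copyOf(in);                // deep copy of the value
        collector.collect(new Pair<Utf8, MyRecord>(key, value));
      }
    }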

The problem persists - records seem to get lost between mapper and reducer.

Interestingly, it is only reproducible with large datasets: on a relatively 
small set of 6 million input rows I do not see any difference, but on a 
10-million-row input dataset the discrepancy shows up:
Map input records: 10,000,000
Map input bytes: 11,458,340,172
Map output bytes: 30,420,106,592
Map output records: 28,196,842
Reduce input records: 28,053,314
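
That is a loss of 143,528 records, i.e. roughly 0.5% of the map output.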


I'm trying to simplify the job further. 

Do you have any further ideas?

Thanks,
Vyacheslav 




On Aug 17, 2011, at 7:18 PM, Scott Carey wrote:

> On 8/17/11 5:02 AM, "Vyacheslav Zholudev" <[email protected]> 
> wrote:
> 
>> btw,
>> 
>> I was thinking of trying it with Utf8 objects instead of Strings, and I 
>> wanted to reuse the same Utf8 object instead of creating a new one from a 
>> String upon each map() call.
>> Why doesn't the Utf8 class have a method for setting its bytes from a 
>> String object?
> 
> 
> We could add that, but it won't help performance much in this case, since 
> the improvement from reuse has more to do with the underlying byte[] than 
> with the Utf8 object itself.
> The expensive part of String is the conversion from the underlying char[] to 
> a byte[] (Utf8.getBytesFor()), which reuse cannot avoid.  It would probably 
> be faster to use String directly rather than wrap it in a Utf8 each time.
> 
> Rather than a static method like the one below, I would propose an instance 
> method that does the same thing, something like 
> 
> public void setValue(String val) {
>   // Sketch against Utf8's internals (field names illustrative): encode
>   // once, replace the private byte array and the cached String -- no
>   // System.arraycopy needed.
>   this.bytes = getBytesFor(val);
>   this.length = this.bytes.length;
>   this.string = val;
> }
> 
> which would be much more efficient.  
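> 
> A caller could then recycle a single instance per task, something like (a 
> sketch, assuming the proposed setValue() existed; MyValue is a placeholder):
> 
>     private final Utf8 reusedKey = new Utf8();  // one instance, reused
> 
>     // inside map():
>     reusedKey.setValue(currentString);  // re-encode in place, no allocation
>     collector.collect(new Pair<Utf8, MyValue>(reusedKey, value));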
> 
> 
>> 
>> I created the following code snippet: 
>> 
>>     public static Utf8 reuseUtf8Object(Utf8 container, String strToReuse) {
>>         byte[] strBytes = Utf8.getBytesFor(strToReuse);  // encode once
>>         container.setByteLength(strBytes.length);  // resize backing array
>>         System.arraycopy(strBytes, 0, container.getBytes(), 0, strBytes.length);
>>         return container;
>>     }
>> 
>> Would that be useful if this code is encapsulated into the Utf8 class?
>> 
>> Best,
>> Vyacheslav
>> 
>> On Aug 17, 2011, at 3:56 AM, Scott Carey wrote:
>> 
>>> On 8/16/11 3:56 PM, "Vyacheslav Zholudev" <[email protected]>
>>> wrote:
>>> 
>>>> Hi, Scott,
>>>> 
>>>> thanks for your reply.
>>>> 
>>>>> What Avro version is this happening with? What JVM version?
>>>> 
>>>> We are using Avro 1.5.1 and Sun JDK 6; I will have to look up the exact
>>>> JDK version.
>>>> 
>>>>> 
>>>>> On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM args
>>>>> if it is Sun and JRE 6u21 or later? (Some loop-predicate issues affect
>>>>> Java 6 too, just not as many as in the recent news about Java 7.)
>>>>> 
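>>>>> For a Hadoop job you can pass that to the task JVMs via the child opts,
>>>>> e.g. (a sketch; MyJob and the heap size are just examples, property name
>>>>> per Hadoop 0.20.x):
>>>>> 
>>>>>     JobConf conf = new JobConf(MyJob.class);
>>>>>     conf.set("mapred.child.java.opts", "-Xmx1024m -XX:-UseLoopPredicate");
>>>>> 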
>>>>> Otherwise, it may well be the same thing as AVRO-782.  Any extra
>>>>> information related to that issue would be welcome.
>>>> 
>>>> I will have to collect it. In the meantime, do you have any plausible
>>>> explanations of the issue besides it being something like AVRO-782?
>>> 
>>> What is your key type (map output schema, first type argument of Pair)?
>>> Is your key a Utf8 or a String?  I don't have a reasonable explanation at
>>> this point; I haven't looked into it in depth with a good reproducible
>>> case.  I have my suspicions about how recycling of the key works, since
>>> Utf8 is mutable and its backing byte[] can end up shared.
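>>> 
>>> For example, something along these lines would corrupt a key that a
>>> buffer still holds a reference to (a sketch against the 1.5 API; whether
>>> the framework actually retains the array this way is exactly what I
>>> haven't verified):
>>> 
>>>     Utf8 key = new Utf8("alpha");
>>>     byte[] held = key.getBytes();        // something keeps a reference
>>>     byte[] next = Utf8.getBytesFor("bravo");
>>>     key.setByteLength(next.length);      // same length, so no realloc
>>>     System.arraycopy(next, 0, key.getBytes(), 0, next.length);
>>>     // "held" now reads "bravo" -- the earlier key's bytes changed
>>>     // underneath it.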
>>> 
>>> 
>>> 
>>>> 
>>>> Thanks a lot,
>>>> Vyacheslav
>>>> 
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> -Scott
>>>>> 
>>>>> 
>>>>> 
>>>>> On 8/16/11 8:39 AM, "Vyacheslav Zholudev"
>>>>> <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I have multiple Hadoop jobs that use the Avro mapred API.
>>>>>> In only one of the jobs do I see a visible mismatch between the number
>>>>>> of map output records and the number of reduce input records.
>>>>>> 
>>>>>> Has anybody encountered such behavior? Can anybody think of possible
>>>>>> explanations for this phenomenon?
>>>>>> 
>>>>>> Any pointers/thoughts are highly appreciated!
>>>>>> 
>>>>>> Best,
>>>>>> Vyacheslav
>>>>> 
>>>>> 
>>>> 
>>>> Best,
>>>> Vyacheslav
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
