I would recommend that you look at the Yahoo tutorial for more information. Here is the part we are discussing: http://developer.yahoo.com/hadoop/tutorial/module5.html#writable-comparator
Regards,

Bertrand

On Thu, Aug 9, 2012 at 5:03 PM, Björn-Elmar Macek <[email protected]> wrote:

> Hi Bertrand,
>
> I am using RawComparator because it was used in a tutorial by a
> well-known Hadoop author describing how to sort the input for the
> reducer. Is there an easier alternative?
>
> On 09.08.2012 16:57, Bertrand Dechoux wrote:
>
> I am just curious, but are you using Writable? If so, there is a
> WritableComparator... If you are going to interpret every byte (you
> create a String, so you do), there is no clear reason for choosing
> such a low-level API.
>
> Regards
>
> Bertrand
>
> On Thu, Aug 9, 2012 at 4:47 PM, Björn-Elmar Macek <[email protected]> wrote:
>
>> Hi again,
>>
>> this is a direct response to my previous posting titled "Logs cannot
>> be created", where logs could not be created (Spill failed). I got
>> the hint that I should check privileges, but that was not the
>> problem, because I own the folders that were used for this.
>>
>> I finally found an important hint in a log saying:
>>
>> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task output http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout
>> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task output http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr
>> 12/08/09 15:34:34 INFO mapred.JobClient: Task Id : attempt_201208091516_0001_m_000055_0, Status : FAILED
>> java.io.IOException: Spill failed
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
>>     at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:592)
>>     at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:26)
>>     at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:12)
>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
>>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
>> Caused by: java.lang.NumberFormatException: For input string: ""
>>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>     at java.lang.Integer.parseInt(Integer.java:468)
>>     at java.lang.Integer.parseInt(Integer.java:497)
>>     at uni.kassel.macek.rtprep.Tweet.getRT(Tweet.java:126)
>>     at uni.kassel.macek.rtprep.TwitterValueGroupingComparator.compare(TwitterValueGroupingComparator.java:47)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
>>     at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:95)
>>     at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)
>>
>> corresponding to the following lines of code within the class
>> TwitterValueGroupingComparator:
>>
>> public class TwitterValueGroupingComparator implements RawComparator<Text> {
>>     ...
>>     public int compare(byte[] text1, int start1, int length1,
>>                        byte[] text2, int start2, int length2) {
>>
>>         byte[] tweet1 = new byte[length1]; // length1-1 (???)
>>         byte[] tweet2 = new byte[length2]; // length1-1 (???)
>>
>>         System.arraycopy(text1, start1, tweet1, 0, length1); // start1+1 (???)
>>         System.arraycopy(text2, start2, tweet2, 0, length2); // start2+1 (???)
>>
>>         Tweet atweet1 = new Tweet(new String(tweet1));
>>         Tweet atweet2 = new Tweet(new String(tweet2));
>>
>>         String key1 = atweet1.getAuthor();
>>         String key2 = atweet2.getAuthor();
>>
>>         // THE FOLLOWING LINE IS THE ONE MENTIONED IN THE LOG (line 47)
>>         if (atweet1.getRT() > 0 && !atweet1.getMention().equals(""))
>>             key1 = atweet1.getMention();
>>         if (atweet2.getRT() > 0 && !atweet2.getMention().equals(""))
>>             key2 = atweet2.getMention();
>>
>>         int realKeyCompare = key1.compareTo(key2);
>>         return realKeyCompare;
>>     }
>> }
>>
>> Since I take the incoming bytes and interpret them as Tweets by
>> recreating the corresponding CSV strings and tokenizing them, I was
>> fairly sure that the problem was the leading bytes that Hadoop puts
>> in front of the data being compared. Since I never really understood
>> what Hadoop does to the strings when they are sent to the
>> KeyComparator, I simply appended all the strings to a file in order
>> to see for myself.
>> You can see the results here:
>>
>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , , http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , , http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>> I2009-06-12 04:33:19, ntmp, tsukunep, , , , 1, 0, ??????????????????, null
>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , , http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>> ^2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored out of my mind Watching food network, null
>> b2009-06-12 04:33:20, djnewera, adoremii369, , , , 1, 0, LOL WORDUP ANT NOTHING LIKE THE HOOD, null
>>
>> As you can see, there are different leading characters: sometimes it
>> is "??", other times "b" or "^", etc.
>>
>> My question is now: how many bytes do I have to cut off so that I
>> get back, as a String, the original Text that I put into the key
>> position of my mapper output? What are the concepts behind this?
>>
>> Thanks for your help in advance!
>>
>> Best regards,
>> Elmar Macek
>
> --
> Bertrand Dechoux

--
Bertrand Dechoux
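The leading characters in the dump above come from Hadoop's Text serialization: Text.write() first emits the byte length of the UTF-8 payload as a variable-length integer (WritableUtils.writeVInt), so the raw buffer handed to a RawComparator starts with a short length prefix rather than with the string itself. Lengths up to 127 fit in one prefix byte (hence the single "b", "^", "I" characters), while longer records need a marker byte plus length bytes (hence the two-character "??" on the long tweets). The sketch below re-implements that vint decoding in plain Java, without Hadoop on the classpath; the class and method names are illustrative, and the logic mirrors my understanding of WritableUtils.decodeVIntSize/readVInt rather than calling them directly:

```java
import java.nio.charset.StandardCharsets;

public class VIntSketch {
    // Total size in bytes of a Hadoop-style vint, judged from its first
    // byte (mirrors WritableUtils.decodeVIntSize).
    static int vintSize(byte first) {
        if (first >= -112) return 1;            // value stored in the byte itself
        if (first < -120) return -119 - first;  // negative values: 2..9 bytes total
        return -111 - first;                    // positive values: 2..9 bytes total
    }

    // Reads a non-negative vint (the only kind Text writes as a length):
    // after the marker byte, the value follows big-endian.
    static int readVInt(byte[] buf, int offset) {
        byte first = buf[offset];
        int size = vintSize(first);
        if (size == 1) return first;
        int value = 0;
        for (int i = 1; i < size; i++) {
            value = (value << 8) | (buf[offset + i] & 0xff);
        }
        return value;
    }

    // What a fixed raw comparator would do with each buffer: skip the
    // length prefix, then decode the remaining bytes as UTF-8.
    static String decodeRawText(byte[] buf, int start, int length) {
        int skip = vintSize(buf[start]);
        return new String(buf, start + skip, length - skip, StandardCharsets.UTF_8);
    }
}
```

In the compare() method above, the Tweet could then be built from decodeRawText(text1, start1, length1) instead of copying the buffer verbatim. Extending WritableComparator is the easier alternative being suggested: its byte-level helpers (e.g. WritableComparator.readVInt) do this prefix bookkeeping for you.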
