OutputValueGroupingComparator gets strange inputs (topic changed from "Logs cannot be created")

Björn-Elmar Macek Thu, 09 Aug 2012 07:48:30 -0700

Hi again,

this is a direct response to my previous posting titled "Logs cannot be created", where logs could not be created (Spill failed). I got the hint that I should check privileges, but that was not the problem, because I own the folders that were used for this.


I finally found an important hint in a log saying:

12/08/09 15:30:49 WARN mapred.JobClient: Error reading task outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout
12/08/09 15:30:49 WARN mapred.JobClient: Error reading task outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr
12/08/09 15:34:34 INFO mapred.JobClient: Task Id : attempt_201208091516_0001_m_000055_0, Status : FAILED

java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
        at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:592)
        at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:26)
        at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:12)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.NumberFormatException: For input string: ""
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Integer.parseInt(Integer.java:468)
        at java.lang.Integer.parseInt(Integer.java:497)
        at uni.kassel.macek.rtprep.Tweet.getRT(Tweet.java:126)
        at uni.kassel.macek.rtprep.TwitterValueGroupingComparator.compare(TwitterValueGroupingComparator.java:47)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
        at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:95)
        at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)

corresponding to the following lines of code within the class TwitterValueGroupingComparator:


public class TwitterValueGroupingComparator implements RawComparator<Text> {
...
    public int compare(byte[] text1, int start1, int length1, byte[] text2,
            int start2, int length2) {

        byte[] tweet1 = new byte[length1];// length1-1 (???)
        byte[] tweet2 = new byte[length2];// length2-1 (???)

        System.arraycopy(text1, start1, tweet1, 0, length1);// start1+1 (???)
        System.arraycopy(text2, start2, tweet2, 0, length2);// start2+1 (???)

        Tweet atweet1 = new Tweet(new String(tweet1));
        Tweet atweet2 = new Tweet(new String(tweet2));

        String key1 = atweet1.getAuthor();
        String key2 = atweet2.getAuthor();
        ////////////////////////////////////////////////////////////////
        //THE FOLLOWING LINE IS THE ONE MENTIONED IN THE LOG (47)
        /////////////////////////////////////////////////////////////////
        if (atweet1.getRT() > 0 && !atweet1.getMention().equals(""))
            key1 = atweet1.getMention();
        if (atweet2.getRT() > 0 && !atweet2.getMention().equals(""))
            key2 = atweet2.getMention();

        int realKeyCompare = key1.compareTo(key2);
        return realKeyCompare;
    }

}

As I am taking the incoming bytes and interpreting them as Tweets by recreating the appropriate CSV strings and tokenizing them, I was fairly sure that the problem is somehow the leading bytes that Hadoop puts in front of the data being compared. Since I never really understood what Hadoop does to the strings when they are sent to the KeyComparator, I simply appended all strings to a file in order to see for myself.


You can see the results here:

??2009-06-12 07:32:47, davedilbeck, tampabaycom, , , http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
??2009-06-12 07:32:47, davedilbeck, tampabaycom, , , http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
I2009-06-12 04:33:19, ntmp, tsukunep, , , , 1, 0, ??????????????????, null
??2009-06-12 07:32:47, davedilbeck, tampabaycom, , , http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
^2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored out of my mind Watching food network, null
b2009-06-12 04:33:20, djnewera, adoremii369, , , , 1, 0, LOL WORDUP ANT NOTHING LIKE THE HOOD, null

As you can see, there are different leading characters: sometimes it's "??", other times it's "b" or "^", etc.
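
A hypothesis that would fit these characters (an assumption, not something confirmed in this thread): Text writes the byte length of the string as a variable-length integer in front of the UTF-8 data, so the "leading character" is just that length shown as text. The two sample records used below are 94 and 98 bytes long, and '^' and 'b' are ASCII 94 and 98; records longer than 127 bytes would need a two-byte prefix, which could be what shows up as "??". The following sketch runs without Hadoop on the classpath. The class VIntPrefixSketch and all of its methods are hypothetical helpers; vintSize is meant to mirror the rule of WritableUtils.decodeVIntSize as far as I understand it.

```java
import java.nio.charset.StandardCharsets;

class VIntPrefixSketch {

    // Size in bytes of a variable-length integer, judged from its first byte.
    // Assumption: same rule as Hadoop's WritableUtils.decodeVIntSize.
    static int vintSize(byte first) {
        if (first >= -112) {
            return 1;                 // small values are stored in the byte itself
        } else if (first < -120) {
            return -119 - first;      // marker byte of a negative multi-byte value
        }
        return -111 - first;          // marker byte of a positive multi-byte value
    }

    // Reads the (non-negative) length stored in the prefix at buf[start].
    static int readLength(byte[] buf, int start) {
        int size = vintSize(buf[start]);
        if (size == 1) {
            return buf[start];
        }
        int value = 0;
        for (int i = 1; i < size; i++) {
            value = (value << 8) | (buf[start + i] & 0xFF);
        }
        return value;
    }

    // Skips the prefix and decodes the remaining bytes as UTF-8.
    static String payload(byte[] buf, int start, int length) {
        int skip = vintSize(buf[start]);
        return new String(buf, start + skip, length - skip, StandardCharsets.UTF_8);
    }

    // Demo-only stand-in for the serialized key: prefix followed by UTF-8 bytes.
    // Only lengths up to 255 are covered (one or two prefix bytes).
    static byte[] serialize(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 255) {
            throw new IllegalArgumentException("demo only covers lengths up to 255");
        }
        byte[] prefix = utf8.length <= 127
                ? new byte[] {(byte) utf8.length}
                : new byte[] {(byte) -113, (byte) utf8.length};
        byte[] out = new byte[prefix.length + utf8.length];
        System.arraycopy(prefix, 0, out, 0, prefix.length);
        System.arraycopy(utf8, 0, out, prefix.length, utf8.length);
        return out;
    }

    public static void main(String[] args) {
        String[] samples = {
            "2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored out of my mind Watching food network, null",
            "2009-06-12 04:33:20, djnewera, adoremii369, , , , 1, 0, LOL WORDUP ANT NOTHING LIKE THE HOOD, null",
            new String(new char[200]).replace('\0', 'x')   // synthetic record longer than 127 bytes
        };
        for (String s : samples) {
            byte[] raw = serialize(s);
            System.out.printf("prefix bytes=%d, first byte=%d, decoded length=%d, payload intact=%b%n",
                    vintSize(raw[0]), raw[0], readLength(raw, 0),
                    payload(raw, 0, raw.length).equals(s));
        }
    }
}
```

If the hypothesis holds, the comparator could call WritableUtils.decodeVIntSize(text1[start1]) to learn how many bytes to skip and then decode from start1 plus that size, instead of guessing at start1+1. Passing an explicit charset to new String also avoids depending on the platform default encoding.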


My question is now:

How many bytes do I have to cut off so that I get the original Text as a String, as I put it into the key position of my mapper output? What are the concepts behind this?


Thanks for your help in advance!

Best regards,
Elmar Macek
