Re: OutputValueGroupingComparator gets strange inputs (topic changed from "Logs cannot be created")

Björn-Elmar Macek Thu, 09 Aug 2012 08:14:33 -0700

Ah, OK, I get the idea: I can use the abstract class instead of the low-level interface, though I am not sure how to use it. It would be nice if more complex mechanics like sorting had an up-to-date tutorial with some example code. If I find the time, I will write one, since I want to give a presentation on Hadoop anyway.


Thanks for your help! I will try to use the abstract class.



On 09.08.2012 17:03, Björn-Elmar Macek wrote:

Hi Bertrand,

I am using RawComparator because it was used in a tutorial by a well-known Hadoop developer describing how to sort the input for the reducer. Is there an easier alternative?



On 09.08.2012 16:57, Bertrand Dechoux wrote:

I am just curious, but are you using Writable? If so, there is a WritableComparator... If you are going to interpret every byte (you create a String, so you do), there is no clear reason for choosing such a low-level API.


Regards

Bertrand

On Thu, Aug 9, 2012 at 4:47 PM, Björn-Elmar Macek <[email protected]> wrote:


    Hi again,

    this is a direct response to my previous posting titled
    "Logs cannot be created", where logs could not be created (Spill
    failed). I got the hint that I should check privileges, but that
    was not the problem, because I own the folders that were used for
    this.

    I finally found an important hint in a log saying:
    12/08/09 15:30:49 WARN mapred.JobClient: Error reading task outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout
    12/08/09 15:30:49 WARN mapred.JobClient: Error reading task outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr
    12/08/09 15:34:34 INFO mapred.JobClient: Task Id : attempt_201208091516_0001_m_000055_0, Status : FAILED
    java.io.IOException: Spill failed
            at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
            at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:592)
            at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:26)
            at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:12)
            at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
            at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
            at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:396)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
            at org.apache.hadoop.mapred.Child.main(Child.java:249)
    Caused by: java.lang.NumberFormatException: For input string: ""
            at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
            at java.lang.Integer.parseInt(Integer.java:468)
            at java.lang.Integer.parseInt(Integer.java:497)
            at uni.kassel.macek.rtprep.Tweet.getRT(Tweet.java:126)
            at uni.kassel.macek.rtprep.TwitterValueGroupingComparator.compare(TwitterValueGroupingComparator.java:47)
            at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
            at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:95)
            at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
            at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
            at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
            at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)



    corresponding to the following lines of code within the class
    TwitterValueGroupingComparator:

    public class TwitterValueGroupingComparator implements RawComparator<Text> {
    ...
        public int compare(byte[] text1, int start1, int length1,
                byte[] text2, int start2, int length2) {

            byte[] tweet1 = new byte[length1]; // length1-1 (???)
            byte[] tweet2 = new byte[length2]; // length2-1 (???)

            System.arraycopy(text1, start1, tweet1, 0, length1); // start1+1 (???)
            System.arraycopy(text2, start2, tweet2, 0, length2); // start2+1 (???)

            Tweet atweet1 = new Tweet(new String(tweet1));
            Tweet atweet2 = new Tweet(new String(tweet2));

            String key1 = atweet1.getAuthor();
            String key2 = atweet2.getAuthor();
            ////////////////////////////////////////////////////////////////
            // THE FOLLOWING LINE IS THE ONE MENTIONED IN THE LOG (47)
            ////////////////////////////////////////////////////////////////
            if (atweet1.getRT() > 0 && !atweet1.getMention().equals(""))
                key1 = atweet1.getMention();
            if (atweet2.getRT() > 0 && !atweet2.getMention().equals(""))
                key2 = atweet2.getMention();

            int realKeyCompare = key1.compareTo(key2);
            return realKeyCompare;
        }

    }
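
The "Caused by" part of the trace above shows Integer.parseInt being handed an empty string inside Tweet.getRT, and because that call happens inside the comparator, the exception surfaces as a failed spill. The source of Tweet is not shown, so the following is only a sketch of one possible guard: a hypothetical helper (its name and its fallback policy are assumptions) that getRT could delegate to, so that an empty or malformed field yields a default value instead of aborting the sort.

```java
/**
 * Hypothetical helper, not part of the original code: parses an integer
 * field leniently so that a comparator never throws on bad input.
 */
final class LenientInts {

    private LenientInts() {}

    /** Returns the parsed value, or fallback for null, blank, or non-numeric input. */
    static int parseOrDefault(String field, int fallback) {
        if (field == null) {
            return fallback;
        }
        String trimmed = field.trim();
        if (trimmed.isEmpty()) {
            return fallback;
        }
        try {
            return Integer.parseInt(trimmed);
        } catch (NumberFormatException e) {
            // Malformed field: fall back rather than fail the whole task.
            return fallback;
        }
    }
}
```

Whether a silent default is acceptable depends on the data; counting or logging such records may be preferable to hiding them.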

    As I take the incoming bytes and interpret them as Tweets by
    recreating the corresponding CSV strings and tokenizing them, I was
    fairly sure that the problem is the leading bytes that Hadoop
    puts in front of the data being compared. Since I never really
    understood what Hadoop does to the strings when they are sent to
    the KeyComparator, I simply appended all strings to a file in
    order to see for myself.

    You can see the results here:

    ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
    http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
    it's mostly bullshit   Alex Sink June Cleaver or Joan Crawford, null
    ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
    http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
    it's mostly bullshit   Alex Sink June Cleaver or Joan Crawford, null
    I2009-06-12 04:33:19, ntmp, tsukunep, , , , 1, 0,
    ??????????????????, null
    ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
    http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
    it's mostly bullshit   Alex Sink June Cleaver or Joan Crawford, null
    ^2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored out of my
    mind Watching food network, null
    b2009-06-12 04:33:20, djnewera, adoremii369, , , , 1, 0, LOL
    WORDUP ANT NOTHING LIKE THE HOOD, null


    As you can see, there are different leading characters: sometimes
    it is "??", other times it is "b" or "^", etc.

    My question now is:
    How many bytes do I have to cut off to get the original Text as
    the String that I put into the key position of my mapper output?
    What are the concepts behind this?

    Thanks for your help in advance!

    Best regards,
    Elmar Macek
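
On the question of the leading characters: Hadoop's Text serializes itself as a variable-length integer (the byte length of the string) followed by the UTF-8 bytes, and a RawComparator receives exactly those serialized bytes. A length up to 127 fits in one prefix byte, which is why short records start with a single printable character ('^' is byte 94 and 'b' is byte 98, which matches the character counts of those two records if the wrapped lines stand for single spaces). Longer records need a marker byte plus one or more length bytes, which would explain the two-character "??" prefix. The following is a minimal, JDK-only sketch of that decoding, written from the vint layout rather than calling Hadoop; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/**
 * Sketch: skip the variable-length length prefix in front of a serialized
 * Text key and decode the remaining UTF-8 bytes. Names are hypothetical.
 */
final class TextKeyBytes {

    private TextKeyBytes() {}

    /** Number of bytes the vint occupies, judged from its first byte. */
    static int vintSize(byte first) {
        if (first >= -112) {
            return 1;               // value is stored in this single byte
        }
        if (first < -120) {
            return -119 - first;    // marker for a negative multi-byte value
        }
        return -111 - first;        // marker for a positive multi-byte value
    }

    /** Decodes the vint that starts at buf[start]. */
    static int readVInt(byte[] buf, int start) {
        byte first = buf[start];
        int size = vintSize(first);
        if (size == 1) {
            return first;
        }
        long value = 0;
        for (int i = 1; i < size; i++) {
            value = (value << 8) | (buf[start + i] & 0xFF);
        }
        boolean negative = first < -120;
        return (int) (negative ? ~value : value);
    }

    /** Returns the string stored at buf[start], without its length prefix. */
    static String decodeText(byte[] buf, int start) {
        int prefix = vintSize(buf[start]);
        int length = readVInt(buf, start);
        return new String(buf, start + prefix, length, StandardCharsets.UTF_8);
    }

    /** Compares two serialized keys by their decoded string content. */
    static int compareKeys(byte[] b1, int s1, byte[] b2, int s2) {
        return decodeText(b1, s1).compareTo(decodeText(b2, s2));
    }
}
```

Inside a real comparator, Hadoop's own helpers (WritableUtils.decodeVIntSize, or a WritableComparator subclass that lets the framework deserialize the Text objects before compare is called) would likely be preferable to re-implementing the decoding by hand.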






--
Bertrand Dechoux
