I would recommend that you look at the Yahoo tutorial for more information. Here is the part we are discussing: http://developer.yahoo.com/hadoop/tutorial/module5.html#writable-comparator
Regards,

Bertrand

On Thu, Aug 9, 2012 at 5:03 PM, Björn-Elmar Macek <[email protected]> wrote:

> Hi Bertrand,
>
> I am using RawComparator because it was used in a tutorial by a
> well-known Hadoop author describing how to sort the input for the
> reducer. Is there an easier alternative?
>
> On 09.08.2012 16:57, Bertrand Dechoux wrote:
>
> I am just curious, but are you using Writable? If so, there is a
> WritableComparator... If you are going to interpret every byte (you
> create a String, so you do), there is no clear reason for choosing
> such a low-level API.
>
> Regards
>
> Bertrand
>
> On Thu, Aug 9, 2012 at 4:47 PM, Björn-Elmar Macek <[email protected]> wrote:
>
>> Hi again,
>>
>> this is a direct response to my previous posting titled "Logs cannot
>> be created", where logs could not be created (Spill failed). I got
>> the hint that I should check privileges, but that was not the
>> problem, because I own the folders that were used for this.
>>
>> I finally found an important hint in a log saying:
>>
>> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task output http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout
>> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task output http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr
>> 12/08/09 15:34:34 INFO mapred.JobClient: Task Id : attempt_201208091516_0001_m_000055_0, Status : FAILED
>> java.io.IOException: Spill failed
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
>>     at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:592)
>>     at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:26)
>>     at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:12)
>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
>>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
>> Caused by: java.lang.NumberFormatException: For input string: ""
>>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>     at java.lang.Integer.parseInt(Integer.java:468)
>>     at java.lang.Integer.parseInt(Integer.java:497)
>>     at uni.kassel.macek.rtprep.Tweet.getRT(Tweet.java:126)
>>     at uni.kassel.macek.rtprep.TwitterValueGroupingComparator.compare(TwitterValueGroupingComparator.java:47)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
>>     at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:95)
>>     at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)
>>
>> corresponding to the following lines of code within the class
>> TwitterValueGroupingComparator:
>>
>> public class TwitterValueGroupingComparator implements RawComparator<Text> {
>>     ...
>>     public int compare(byte[] text1, int start1, int length1,
>>                        byte[] text2, int start2, int length2) {
>>
>>         byte[] tweet1 = new byte[length1]; // length1-1 (???)
>>         byte[] tweet2 = new byte[length2]; // length1-1 (???)
>>
>>         System.arraycopy(text1, start1, tweet1, 0, length1); // start1+1 (???)
>>         System.arraycopy(text2, start2, tweet2, 0, length2); // start2+1 (???)
>>
>>         Tweet atweet1 = new Tweet(new String(tweet1));
>>         Tweet atweet2 = new Tweet(new String(tweet2));
>>
>>         String key1 = atweet1.getAuthor();
>>         String key2 = atweet2.getAuthor();
>>
>>         // THE FOLLOWING LINE IS THE ONE MENTIONED IN THE LOG (line 47)
>>         if (atweet1.getRT() > 0 && !atweet1.getMention().equals(""))
>>             key1 = atweet1.getMention();
>>         if (atweet2.getRT() > 0 && !atweet2.getMention().equals(""))
>>             key2 = atweet2.getMention();
>>
>>         int realKeyCompare = key1.compareTo(key2);
>>         return realKeyCompare;
>>     }
>> }
>>
>> Since I take the incoming bytes and interpret them as Tweets by
>> recreating the corresponding CSV strings and tokenizing them, I was
>> fairly sure that the problem was the leading bytes that Hadoop puts
>> in front of the data being compared. Since I never really understood
>> what Hadoop does to the strings when they are sent to the
>> KeyComparator, I simply appended all the strings to a file in order
>> to see for myself.
>> You can see the results here:
>>
>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , , http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , , http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>> I2009-06-12 04:33:19, ntmp, tsukunep, , , , 1, 0, ??????????????????, null
>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , , http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>> ^2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored out of my mind Watching food network, null
>> b2009-06-12 04:33:20, djnewera, adoremii369, , , , 1, 0, LOL WORDUP ANT NOTHING LIKE THE HOOD, null
>>
>> As you can see, there are different leading characters: sometimes it
>> is "??", other times "b" or "^", etc.
>>
>> My question is now: how many bytes do I have to cut off so that I
>> get back, as a String, the original Text that I put into the key
>> position of my mapper output? What are the concepts behind this?
>>
>> Thanks for your help in advance!
>>
>> Best regards,
>> Elmar Macek
>
> --
> Bertrand Dechoux

--
Bertrand Dechoux
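The leading characters in the dump above come from Hadoop's Text serialization: Text.write() first emits the byte length of the UTF-8 payload as a variable-length integer (WritableUtils.writeVInt), so the raw buffer handed to a RawComparator starts with a short length prefix rather than with the string itself. Lengths up to 127 fit in one prefix byte (hence the single "b", "^", "I" characters), while longer records need a marker byte plus length bytes (hence the two-character "??" on the long tweets). The sketch below re-implements that vint decoding in plain Java, without Hadoop on the classpath; the class and method names are illustrative, and the logic mirrors my understanding of WritableUtils.decodeVIntSize/readVInt rather than calling them directly:

```java
import java.nio.charset.StandardCharsets;

public class VIntSketch {
    // Total size in bytes of a Hadoop-style vint, judged from its first
    // byte (mirrors WritableUtils.decodeVIntSize).
    static int vintSize(byte first) {
        if (first >= -112) return 1;            // value stored in the byte itself
        if (first < -120) return -119 - first;  // negative values: 2..9 bytes total
        return -111 - first;                    // positive values: 2..9 bytes total
    }

    // Reads a non-negative vint (the only kind Text writes as a length):
    // after the marker byte, the value follows big-endian.
    static int readVInt(byte[] buf, int offset) {
        byte first = buf[offset];
        int size = vintSize(first);
        if (size == 1) return first;
        int value = 0;
        for (int i = 1; i < size; i++) {
            value = (value << 8) | (buf[offset + i] & 0xff);
        }
        return value;
    }

    // What a fixed raw comparator would do with each buffer: skip the
    // length prefix, then decode the remaining bytes as UTF-8.
    static String decodeRawText(byte[] buf, int start, int length) {
        int skip = vintSize(buf[start]);
        return new String(buf, start + skip, length - skip, StandardCharsets.UTF_8);
    }
}
```

In the compare() method above, the Tweet could then be built from decodeRawText(text1, start1, length1) instead of copying the buffer verbatim. Extending WritableComparator is the easier alternative being suggested: its byte-level helpers (e.g. WritableComparator.readVInt) do this prefix bookkeeping for you.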
