I am just curious, but are you using Writable? If so, there is a WritableComparator... If you are going to interpret every byte (you create a String, so you do), there is no clear reason for choosing such a low-level API.
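For what it's worth, the leading characters in the quoted dump look like the length prefix that Text writes in front of its UTF-8 bytes: a variable-length integer (VInt) that takes one byte for lengths up to 127 and several bytes beyond that. '^' is 0x5E (94) and 'b' is 0x62 (98), which would be the byte lengths of those two records, and "??" would be the two-byte prefix of a record longer than 127 bytes. Below is a minimal sketch of decoding that prefix by hand, assuming the key really is a serialized Text. The class TextKeyDecoder and its method names are made up for illustration and are not Hadoop API; encodeText exists only to produce sample input.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop: it mirrors the VInt layout that
// Text uses for its length prefix so the key bytes can be decoded by hand.
final class TextKeyDecoder {

    // Number of bytes occupied by the VInt whose first byte is 'first'.
    static int vIntSize(byte first) {
        if (first >= -112) return 1;            // value stored directly in one byte
        if (first < -120) return -119 - first;  // negative multi-byte value
        return -111 - first;                    // positive multi-byte value
    }

    // Decode the VInt that starts at bytes[start].
    static int readVInt(byte[] bytes, int start) {
        int first = bytes[start];
        if (first >= -112) return first;
        boolean negative = first < -120;
        int count = negative ? -(first + 120) : -(first + 112);
        long value = 0;
        for (int i = 0; i < count; i++) {
            value = (value << 8) | (bytes[start + 1 + i] & 0xFF);
        }
        return (int) (negative ? ~value : value);
    }

    // Return the string payload of a serialized Text stored in
    // bytes[start .. start + length), skipping the length prefix.
    static String decodeText(byte[] bytes, int start, int length) {
        int prefix = vIntSize(bytes[start]);
        int payload = readVInt(bytes, start);
        if (payload < 0 || prefix + payload > length) {
            throw new IllegalArgumentException("length prefix does not fit the slice");
        }
        return new String(bytes, start + prefix, payload, StandardCharsets.UTF_8);
    }

    // Sample-input helper: VInt length prefix followed by the UTF-8 bytes.
    static byte[] encodeText(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        int n = utf8.length;
        byte[] prefix;
        if (n <= 127) {
            prefix = new byte[] {(byte) n};
        } else {
            int count = 0;
            for (int tmp = n; tmp != 0; tmp >>= 8) count++;
            prefix = new byte[1 + count];
            prefix[0] = (byte) (-112 - count);  // marker byte encodes the byte count
            for (int i = 0; i < count; i++) {
                prefix[1 + i] = (byte) (n >> (8 * (count - 1 - i)));  // big-endian
            }
        }
        byte[] out = new byte[prefix.length + n];
        System.arraycopy(prefix, 0, out, 0, prefix.length);
        System.arraycopy(utf8, 0, out, prefix.length, n);
        return out;
    }

    public static void main(String[] args) {
        byte[] raw = encodeText("2009-06-12 04:33:20, aclouatre");
        System.out.println(decodeText(raw, 0, raw.length));
    }
}
```

In a real job there should be no need to hand-roll this. As far as I know, WritableUtils.decodeVIntSize and WritableComparator.readVInt do the same prefix decoding, and a comparator that extends WritableComparator for Text.class can override compare(WritableComparable, WritableComparable) and work on deserialized Text objects directly.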
Regards

Bertrand

On Thu, Aug 9, 2012 at 4:47 PM, Björn-Elmar Macek <[email protected]> wrote:

> Hi again,
>
> this is a direct response to my previous posting with the title "Logs
> cannot be created", where logs could not be created (Spill failed). I got
> the hint that I should check privileges, but that was not the problem,
> because I own the folders that were used for this.
>
> I finally found an important hint in a log saying:
>
> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout
> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr
> 12/08/09 15:34:34 INFO mapred.JobClient: Task Id : attempt_201208091516_0001_m_000055_0, Status : FAILED
> java.io.IOException: Spill failed
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
>     at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:592)
>     at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:26)
>     at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:12)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.lang.NumberFormatException: For input string: ""
>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>     at java.lang.Integer.parseInt(Integer.java:468)
>     at java.lang.Integer.parseInt(Integer.java:497)
>     at uni.kassel.macek.rtprep.Tweet.getRT(Tweet.java:126)
>     at uni.kassel.macek.rtprep.TwitterValueGroupingComparator.compare(TwitterValueGroupingComparator.java:47)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
>     at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:95)
>     at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)
>
> corresponding to the following lines of code within the class
> TwitterValueGroupingComparator:
>
> public class TwitterValueGroupingComparator implements RawComparator<Text> {
>     ...
>     public int compare(byte[] text1, int start1, int length1,
>                        byte[] text2, int start2, int length2) {
>
>         byte[] tweet1 = new byte[length1]; // length1-1 (???)
>         byte[] tweet2 = new byte[length2]; // length1-1 (???)
>
>         System.arraycopy(text1, start1, tweet1, 0, length1); // start1+1 (???)
>         System.arraycopy(text2, start2, tweet2, 0, length2); // start2+1 (???)
>
>         Tweet atweet1 = new Tweet(new String(tweet1));
>         Tweet atweet2 = new Tweet(new String(tweet2));
>
>         String key1 = atweet1.getAuthor();
>         String key2 = atweet2.getAuthor();
>         ////////////////////////////////////////////////////////////////
>         // THE FOLLOWING LINE IS THE ONE MENTIONED IN THE LOG (47)
>         ////////////////////////////////////////////////////////////////
>         if (atweet1.getRT() > 0 && !atweet1.getMention().equals(""))
>             key1 = atweet1.getMention();
>         if (atweet2.getRT() > 0 && !atweet2.getMention().equals(""))
>             key2 = atweet2.getMention();
>
>         int realKeyCompare = key1.compareTo(key2);
>         return realKeyCompare;
>     }
> }
>
> As I am taking the incoming bytes and interpreting them as Tweets by
> recreating the appropriate CSV strings and tokenizing them, I was fairly
> sure that the problem is somehow the leading bytes that Hadoop puts in
> front of the data being compared. Since I never really understood what
> Hadoop does to the strings when they are sent to the KeyComparator, I
> simply appended all strings to a file in order to see for myself.
>
> You can see the results here:
>
> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , , http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , , http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
> I2009-06-12 04:33:19, ntmp, tsukunep, , , , 1, 0, ??????????????????, null
> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , , http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
> ^2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored out of my mind Watching food network, null
> b2009-06-12 04:33:20, djnewera, adoremii369, , , , 1, 0, LOL WORDUP ANT NOTHING LIKE THE HOOD, null
>
> As you can see, there are different leading characters: sometimes it's "??",
> other times it's "b" or "^", etc.
>
> My question is now:
> How many bits do I have to cut off so that I get the original Text as a
> String that I put into the key position of my mapper output? What are the
> concepts behind this?
>
> Thanks for your help in advance!
>
> Best regards,
> Elmar Macek

--
Bertrand Dechoux
