Hi,

the log below shows an issue that started to occur just "recently" (I hadn't run tests with this somewhat larger dataset (320K documents) for some time, and when I did today, I got this "all of a sudden"). I am using mahout 0.9-cdh5.2.0-SNAPSHOT (yes, it's Cloudera, but as far as I can tell that's vanilla Mahout in the community edition I use).

As far as I can tell, it's happening in the middle of seq2sparse, and all three parts - the input, the output and the MR job - are generated by Mahout; none of my own code is involved.

It would be cool if anyone could point me to the source of this error.

thanks and kind regards
reinis.

SETTINGS OF SEQ2SPARSE
----------------------------------------------

{"--analyzerName", "com.myproj.quantify.ticket.text.TicketTextAnalyzer",
              "--chunkSize", "200",
              "--output", finalDir,
              "--input", ticketTextsOutput.toString,
              "--minSupport", "2",
              "--minDF", "2",
              "--maxDFPercent", "85",
              "--weight", "tfidf",
              "--minLLR", "50",
              "--maxNGramSize", "3",
              "--norm", "2",
              "--namedVector", "--sequentialAccessVector", "--overwrite"}


LOG
-----------------------------------------------------

14/07/12 16:46:16 INFO vectorizer.SparseVectorsFromSequenceFiles: Creating Term Frequency Vectors
14/07/12 16:46:16 INFO vectorizer.DictionaryVectorizer: Creating dictionary from /quantify/ticket/text/final/tokenized-documents and saving at /quantify/ticket/text/final/wordcount
14/07/12 16:46:16 INFO client.RMProxy: Connecting to ResourceManager at hadoop1
14/07/12 16:46:17 INFO input.FileInputFormat: Total input paths to process : 1
14/07/12 16:46:17 INFO mapreduce.JobSubmitter: number of splits:2
14/07/12 16:46:17 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1404888747437_0074
14/07/12 16:46:17 INFO impl.YarnClientImpl: Submitted application application_1404888747437_0074
14/07/12 16:46:17 INFO mapreduce.Job: The url to track the job: http://hadoop1:8088/proxy/application_1404888747437_0074/
14/07/12 16:46:17 INFO mapreduce.Job: Running job: job_1404888747437_0074
14/07/12 16:46:30 INFO mapreduce.Job: Job job_1404888747437_0074 running in uber mode : false
14/07/12 16:46:30 INFO mapreduce.Job:  map 0% reduce 0%
14/07/12 16:46:41 INFO mapreduce.Job:  map 6% reduce 0%
14/07/12 16:46:44 INFO mapreduce.Job:  map 10% reduce 0%
14/07/12 16:46:47 INFO mapreduce.Job:  map 11% reduce 0%
14/07/12 16:46:48 INFO mapreduce.Job:  map 14% reduce 0%
14/07/12 16:46:50 INFO mapreduce.Job:  map 15% reduce 0%
14/07/12 16:46:51 INFO mapreduce.Job:  map 19% reduce 0%
14/07/12 16:46:53 INFO mapreduce.Job:  map 20% reduce 0%
14/07/12 16:46:54 INFO mapreduce.Job:  map 23% reduce 0%
14/07/12 16:46:57 INFO mapreduce.Job:  map 26% reduce 0%
14/07/12 16:47:00 INFO mapreduce.Job:  map 29% reduce 0%
14/07/12 16:47:01 INFO mapreduce.Job: Task Id : attempt_1404888747437_0074_m_000000_0, Status : FAILED
Error: java.lang.IllegalStateException: java.io.IOException: Spill failed
        at org.apache.mahout.vectorizer.collocations.llr.CollocMapper$1.apply(CollocMapper.java:140)
        at org.apache.mahout.vectorizer.collocations.llr.CollocMapper$1.apply(CollocMapper.java:115)
        at org.apache.mahout.math.map.OpenObjectIntHashMap.forEachPair(OpenObjectIntHashMap.java:185)
        at org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:115)
        at org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1535)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$300(MapTask.java:853)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1349)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at org.apache.mahout.vectorizer.collocations.llr.GramKey.write(GramKey.java:91)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:98)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:82)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1126)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:692)
        at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
        at org.apache.mahout.vectorizer.collocations.llr.CollocMapper$1.apply(CollocMapper.java:131)
        ... 12 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1836016430
        at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:144)
        at java.io.DataInputStream.readByte(DataInputStream.java:265)
        at org.apache.mahout.math.Varint.readUnsignedVarInt(Varint.java:159)
        at org.apache.mahout.vectorizer.collocations.llr.GramKey.readFields(GramKey.java:78)
        at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:132)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1245)
        at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:105)
        at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:63)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1575)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:853)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1505)
