Ah, it looks like that config can be set in Hadoop's core-site.xml, but if you're running Mahout in local mode that shouldn't help.
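If you do want to try bumping it anyway, the usual approach is something
like the following in core-site.xml (or mapred-site.xml); the 512 here is
just an illustration, and the buffer has to be comfortably larger than
your biggest record:

    <property>
      <name>io.sort.mb</name>
      <value>512</value>
    </property>

Since seq2sparse runs through ToolRunner (you can see it in your stack
trace), it should also accept the generic -D option per job, e.g.
`mahout seq2sparse -Dio.sort.mb=512 -i ... -o ...`, though I haven't
verified that the setting propagates in local mode.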
Can you try this with local mode off, in other words on a running
Hadoop/Spark cluster?

You can look for empty lines with a command like `grep -rn "^$"
input-file-directory`; if any files contain blank lines, the matches will
be printed before your next prompt.

On Wed, Feb 3, 2016 at 8:30 PM, Alok Tanna <tannaa...@gmail.com> wrote:

> Thank you, Andrew, for the quick response. I have around 300 input
> files, so it would take a while for me to go through each one. I will
> look into that, but I did successfully generate the sequence file using
> mahout seqdirectory for the same dataset. How can I find which Mahout
> release I am on? Also, how can I increase io.sort.mb (currently 100)
> when I have Mahout running in local mode?
>
> In the earlier attached file you can see it says:
>
> 16/02/03 22:59:04 INFO mapred.MapTask: Record too large for in-memory
> buffer: 99614722 bytes
>
> How can I increase the in-memory buffer for Mahout in local mode?
>
> I hope this has nothing to do with this error.
>
> Thanks,
> Alok Tanna
>
> On Wed, Feb 3, 2016 at 10:50 PM, Andrew Musselman <
> andrew.mussel...@gmail.com> wrote:
>
>> Is it possible you have any empty lines or extra whitespace at the end
>> or in the middle of any of your input files? I don't know for sure,
>> but that's where I'd start looking.
>>
>> Are you on the most recent release?
>>
>> On Wed, Feb 3, 2016 at 7:33 PM, Alok Tanna <tannaa...@gmail.com> wrote:
>>
>> > Mahout in local mode
>> >
>> > I am able to run the command below successfully on a smaller data
>> > set, but when I run it on a large data set I get the error below. It
>> > looks like I need to increase some parameter, but I am not sure which
>> > one. It fails with java.io.EOFException while creating the
>> > dictionary-0 file.
>> >
>> > Please find the attached file for more details.
>> >
>> > command: mahout seq2sparse -i /home/ubuntu/AT/AT-Seq/ -o
>> > /home/ubuntu/AT/AT-vectors/ -lnorm -nv -wt tfidf
>> >
>> > Main error:
>> >
>> > 16/02/03 23:02:06 INFO mapred.LocalJobRunner: reduce > reduce
>> > 16/02/03 23:02:17 INFO mapred.LocalJobRunner: reduce > reduce
>> > 16/02/03 23:02:18 WARN mapred.LocalJobRunner: job_local1308764206_0003
>> > java.io.EOFException
>> >     at java.io.DataInputStream.readByte(DataInputStream.java:267)
>> >     at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:299)
>> >     at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:320)
>> >     at org.apache.hadoop.io.Text.readFields(Text.java:263)
>> >     at org.apache.mahout.common.StringTuple.readFields(StringTuple.java:142)
>> >     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>> >     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>> >     at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:117)
>> >     at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>> >     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>> >     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>> >     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>> > 16/02/03 23:02:18 INFO mapred.JobClient: Job complete: job_local1308764206_0003
>> > 16/02/03 23:02:18 INFO mapred.JobClient: Counters: 20
>> > 16/02/03 23:02:18 INFO mapred.JobClient:   File Output Format Counters
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Bytes Written=14923244
>> > 16/02/03 23:02:18 INFO mapred.JobClient:   FileSystemCounters
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     FILE_BYTES_READ=1412144036729
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=323876626568
>> > 16/02/03 23:02:18 INFO mapred.JobClient:   File Input Format Counters
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Bytes Read=11885543289
>> > 16/02/03 23:02:18 INFO mapred.JobClient:   Map-Reduce Framework
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce input groups=223
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output materialized bytes=2214020551
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Combine output records=0
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map input records=223
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce shuffle bytes=0
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce output records=222
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Spilled Records=638
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output bytes=2214019100
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     CPU time spent (ms)=0
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Total committed heap usage (bytes)=735978192896
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Combine input records=0
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output records=223
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     SPLIT_RAW_BYTES=9100
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce input records=222
>> > Exception in thread "main" java.lang.IllegalStateException: Job failed!
>> >     at org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
>> >     at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
>> >     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:274)
>> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>> >     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:56)
>> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> >     at java.lang.reflect.Method.invoke(Method.java:606)
>> >     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>> >     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>> >     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>> > .
>> > .
>> >
>> > --
>> > Thanks & Regards,
>> >
>> > Alok R. Tanna
>>
>
> --
> Thanks & Regards,
>
> Alok R. Tanna