Ah, it looks like that config can be set in Hadoop's core-site.xml, but if you're running Mahout in local mode that shouldn't help.
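If you do want to try bumping it anyway, the usual approach is something
like the following in core-site.xml (or mapred-site.xml); the 512 here is
just an illustration, and the buffer has to be comfortably larger than
your biggest record:

    <property>
      <name>io.sort.mb</name>
      <value>512</value>
    </property>

Since seq2sparse runs through ToolRunner (you can see it in your stack
trace), it should also accept the generic -D option per job, e.g.
`mahout seq2sparse -Dio.sort.mb=512 -i ... -o ...`, though I haven't
verified that the setting propagates in local mode.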
Can you try this with local mode off, in other words on a running
Hadoop/Spark cluster?

You can look for empty lines with a command like `grep -rn "^$"
input-file-directory`; if any files contain blank lines, the matches will
be printed before your next prompt.

On Wed, Feb 3, 2016 at 8:30 PM, Alok Tanna <tannaa...@gmail.com> wrote:

> Thank you, Andrew, for the quick response. I have around 300 input
> files, so it would take a while for me to go through each one. I will
> look into that, but I did successfully generate the sequence file using
> mahout seqdirectory for the same dataset. How can I find which Mahout
> release I am on? Also, how can I increase io.sort.mb (currently 100)
> when I have Mahout running in local mode?
>
> In the earlier attached file you can see it says:
>
> 16/02/03 22:59:04 INFO mapred.MapTask: Record too large for in-memory
> buffer: 99614722 bytes
>
> How can I increase the in-memory buffer for Mahout in local mode?
>
> I hope this has nothing to do with this error.
>
> Thanks,
> Alok Tanna
>
> On Wed, Feb 3, 2016 at 10:50 PM, Andrew Musselman <
> andrew.mussel...@gmail.com> wrote:
>
>> Is it possible you have any empty lines or extra whitespace at the end
>> or in the middle of any of your input files? I don't know for sure,
>> but that's where I'd start looking.
>>
>> Are you on the most recent release?
>>
>> On Wed, Feb 3, 2016 at 7:33 PM, Alok Tanna <tannaa...@gmail.com> wrote:
>>
>> > Mahout in local mode
>> >
>> > I am able to run the command below successfully on a smaller data
>> > set, but when I run it on a large data set I get the error below. It
>> > looks like I need to increase some parameter, but I am not sure which
>> > one. It fails with java.io.EOFException while creating the
>> > dictionary-0 file.
>> >
>> > Please find the attached file for more details.
>> >
>> > command: mahout seq2sparse -i /home/ubuntu/AT/AT-Seq/ -o
>> > /home/ubuntu/AT/AT-vectors/ -lnorm -nv -wt tfidf
>> >
>> > Main error:
>> >
>> > 16/02/03 23:02:06 INFO mapred.LocalJobRunner: reduce > reduce
>> > 16/02/03 23:02:17 INFO mapred.LocalJobRunner: reduce > reduce
>> > 16/02/03 23:02:18 WARN mapred.LocalJobRunner: job_local1308764206_0003
>> > java.io.EOFException
>> >     at java.io.DataInputStream.readByte(DataInputStream.java:267)
>> >     at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:299)
>> >     at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:320)
>> >     at org.apache.hadoop.io.Text.readFields(Text.java:263)
>> >     at org.apache.mahout.common.StringTuple.readFields(StringTuple.java:142)
>> >     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>> >     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>> >     at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:117)
>> >     at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>> >     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>> >     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>> >     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>> > 16/02/03 23:02:18 INFO mapred.JobClient: Job complete: job_local1308764206_0003
>> > 16/02/03 23:02:18 INFO mapred.JobClient: Counters: 20
>> > 16/02/03 23:02:18 INFO mapred.JobClient:   File Output Format Counters
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Bytes Written=14923244
>> > 16/02/03 23:02:18 INFO mapred.JobClient:   FileSystemCounters
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     FILE_BYTES_READ=1412144036729
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=323876626568
>> > 16/02/03 23:02:18 INFO mapred.JobClient:   File Input Format Counters
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Bytes Read=11885543289
>> > 16/02/03 23:02:18 INFO mapred.JobClient:   Map-Reduce Framework
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce input groups=223
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output materialized bytes=2214020551
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Combine output records=0
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map input records=223
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce shuffle bytes=0
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce output records=222
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Spilled Records=638
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output bytes=2214019100
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     CPU time spent (ms)=0
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Total committed heap usage (bytes)=735978192896
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Combine input records=0
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output records=223
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     SPLIT_RAW_BYTES=9100
>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce input records=222
>> > Exception in thread "main" java.lang.IllegalStateException: Job failed!
>> >     at org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
>> >     at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
>> >     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:274)
>> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>> >     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:56)
>> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> >     at java.lang.reflect.Method.invoke(Method.java:606)
>> >     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>> >     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>> >     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>> > .
>> > .
>> >
>> > --
>> > Thanks & Regards,
>> >
>> > Alok R. Tanna
>>
>
> --
> Thanks & Regards,
>
> Alok R. Tanna