$ for i in `ls input-directory`; do sed -i '/^$/d' input-directory/$i; done
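One caveat: looping over `ls` output splits file names on whitespace. A find-based variant is safe with arbitrary names across all ~300 files and also catches whitespace-only lines; a minimal sketch, assuming GNU sed for in-place editing and `input-directory` standing in for the real path:

  # delete empty and whitespace-only lines in place in every regular file
  $ find input-directory -type f -exec sed -i '/^[[:space:]]*$/d' {} +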
On Wed, Feb 3, 2016 at 9:08 PM, Alok Tanna <tannaa...@gmail.com> wrote:

> This command works, thank you. Yes, I am seeing a lot of empty lines in my
> input files; any magic command to remove these lines would save a lot of
> time. I will re-run this once I have removed the empty lines.
>
> It would be great if I could get this working in local mode, or else I
> will have to spend a few days getting it working on a Hadoop/Spark
> cluster.
>
> Thanks,
> Alok Tanna
>
> On Wed, Feb 3, 2016 at 11:38 PM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:
>
>> Ah; looks like that config can be set in Hadoop's core-site.xml, but if
>> you're running Mahout in local mode that shouldn't help.
>>
>> Can you try this with local mode off, in other words on a running
>> Hadoop/Spark cluster?
>>
>> Checking for empty lines could be done with a command like `grep -r "^$"
>> input-file-directory`; any blank lines will show up before your next
>> prompt.
>>
>> On Wed, Feb 3, 2016 at 8:30 PM, Alok Tanna <tannaa...@gmail.com> wrote:
>>
>>> Thank you Andrew for the quick response. I have around 300 input files,
>>> so it would take a while for me to go through each one. I will try to
>>> look into that, but I had successfully generated the sequence files with
>>> mahout seqdirectory for the same dataset. How can I find which Mahout
>>> release I am on? Also, can you let me know how to increase io.sort.mb
>>> (currently 100) when I have Mahout running in local mode?
>>>
>>> In the earlier attached file you can see it says: 16/02/03 22:59:04 INFO
>>> mapred.MapTask: Record too large for in-memory buffer: 99614722 bytes
>>>
>>> How can I increase the in-memory buffer for Mahout in local mode?
>>>
>>> I hope this has nothing to do with this error.
>>>
>>> Thanks,
>>> Alok Tanna
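On the io.sort.mb question above: seq2sparse runs through Hadoop's ToolRunner (it appears in the stack trace below), so a generic -D option on the command line should reach the job configuration even in local mode; and in local mode the whole job runs inside the client JVM, whose heap the stock bin/mahout script sizes from MAHOUT_HEAPSIZE (in MB). A sketch with illustrative numbers, not verified against every Mahout release:

  # enlarge the client JVM heap, then raise the map-side sort buffer (MB)
  $ export MAHOUT_HEAPSIZE=4096
  $ mahout seq2sparse -Dio.sort.mb=512 -i /home/ubuntu/AT/AT-Seq/ -o /home/ubuntu/AT/AT-vectors/ -lnorm -nv -wt tfidf

As for finding the release in use, the version number is normally embedded in the jar names shipped with the distribution, e.g. visible via `ls $MAHOUT_HOME/mahout-*.jar`.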
>>> On Wed, Feb 3, 2016 at 10:50 PM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:
>>>
>>>> Is it possible you have any empty lines or extra whitespace at the end
>>>> or in the middle of any of your input files? I don't know for sure, but
>>>> that's where I'd start looking.
>>>>
>>>> Are you on the most recent release?
>>>>
>>>> On Wed, Feb 3, 2016 at 7:33 PM, Alok Tanna <tannaa...@gmail.com> wrote:
>>>>
>>>> > Mahout in local mode
>>>> >
>>>> > I am able to run the command below successfully on a smaller data
>>>> > set, but when I run it on a large data set I get the error below. It
>>>> > looks like I need to increase the size of some parameter, but I am
>>>> > not sure which one. It is failing with java.io.EOFException while
>>>> > creating the dictionary-0 file.
>>>> >
>>>> > Please find the attached file for more details.
>>>> >
>>>> > command: mahout seq2sparse -i /home/ubuntu/AT/AT-Seq/ -o /home/ubuntu/AT/AT-vectors/ -lnorm -nv -wt tfidf
>>>> >
>>>> > Main error:
>>>> >
>>>> > 16/02/03 23:02:06 INFO mapred.LocalJobRunner: reduce > reduce
>>>> > 16/02/03 23:02:17 INFO mapred.LocalJobRunner: reduce > reduce
>>>> > 16/02/03 23:02:18 WARN mapred.LocalJobRunner: job_local1308764206_0003
>>>> > java.io.EOFException
>>>> >     at java.io.DataInputStream.readByte(DataInputStream.java:267)
>>>> >     at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:299)
>>>> >     at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:320)
>>>> >     at org.apache.hadoop.io.Text.readFields(Text.java:263)
>>>> >     at org.apache.mahout.common.StringTuple.readFields(StringTuple.java:142)
>>>> >     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>>> >     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>>> >     at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:117)
>>>> >     at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>>>> >     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>>> >     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>>>> >     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>>>> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient: Job complete: job_local1308764206_0003
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient: Counters: 20
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   File Output Format Counters
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Bytes Written=14923244
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   FileSystemCounters
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     FILE_BYTES_READ=1412144036729
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=323876626568
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   File Input Format Counters
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Bytes Read=11885543289
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:   Map-Reduce Framework
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce input groups=223
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output materialized bytes=2214020551
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Combine output records=0
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map input records=223
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce shuffle bytes=0
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce output records=222
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Spilled Records=638
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output bytes=2214019100
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     CPU time spent (ms)=0
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Total committed heap usage (bytes)=735978192896
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Combine input records=0
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Map output records=223
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     SPLIT_RAW_BYTES=9100
>>>> > 16/02/03 23:02:18 INFO mapred.JobClient:     Reduce input records=222
>>>> > Exception in thread "main" java.lang.IllegalStateException: Job failed!
>>>> >     at org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
>>>> >     at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
>>>> >     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:274)
>>>> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>> >     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:56)
>>>> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> >     at java.lang.reflect.Method.invoke(Method.java:606)
>>>> >     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>> >     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>> >     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>> > .
>>>> > .
>>>> >
>>>> > --
>>>> > Thanks & Regards,
>>>> >
>>>> > Alok R. Tanna
>>>
>>> --
>>> Thanks & Regards,
>>>
>>> Alok R. Tanna
>
> --
> Thanks & Regards,
>
> Alok R. Tanna
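For the eventual re-run with local mode off, a sketch of the clean-then-regenerate sequence. The raw-text directory /home/ubuntu/AT/AT-input/ and the /user/ubuntu/... HDFS paths are hypothetical stand-ins, and the MAHOUT_LOCAL behavior assumes the stock bin/mahout script, which forces local execution only while that variable is set:

  # leave local mode and point at the cluster configuration
  $ unset MAHOUT_LOCAL
  $ export HADOOP_CONF_DIR=/etc/hadoop/conf   # adjust to your cluster
  # copy the cleaned text to HDFS, rebuild the sequence files, vectorize;
  # -ow overwrites output left over from the failed run
  $ hadoop fs -put /home/ubuntu/AT/AT-input /user/ubuntu/AT-input
  $ mahout seqdirectory -i /user/ubuntu/AT-input -o /user/ubuntu/AT-Seq -ow
  $ mahout seq2sparse -i /user/ubuntu/AT-Seq -o /user/ubuntu/AT-vectors -lnorm -nv -wt tfidf -ow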