Here is the MR job status for the partial-vector-0 creation job:

13/05/14 12:37:08 INFO mapred.JobClient: Map-Reduce Framework
13/05/14 12:37:08 INFO mapred.JobClient: Map output materialized bytes=5907
13/05/14 12:37:08 INFO mapred.JobClient: Map input records=83
13/05/14 12:37:08 INFO mapred.JobClient: Reduce shuffle bytes=0
13/05/14 12:37:08 INFO mapred.JobClient: Spilled Records=166
13/05/14 12:37:08 INFO mapred.JobClient: Map output bytes=5729
13/05/14 12:37:08 INFO mapred.JobClient: CPU time spent (ms)=1340
13/05/14 12:37:08 INFO mapred.JobClient: Total committed heap usage (bytes)=206700544
13/05/14 12:37:08 INFO mapred.JobClient: Combine input records=0
13/05/14 12:37:08 INFO mapred.JobClient: SPLIT_RAW_BYTES=156
13/05/14 12:37:08 INFO mapred.JobClient: Reduce input records=83
13/05/14 12:37:08 INFO mapred.JobClient: Reduce input groups=2
13/05/14 12:37:08 INFO mapred.JobClient: Combine output records=0
13/05/14 12:37:08 INFO mapred.JobClient: Physical memory (bytes) snapshot=266747904
13/05/14 12:37:08 INFO mapred.JobClient: Reduce output records=2
13/05/14 12:37:08 INFO mapred.JobClient: Virtual memory (bytes) snapshot=4225028096
13/05/14 12:37:08 INFO mapred.JobClient: Map output records=83

Here it is clear that the reduce step outputs only 2 records.

Thanks
Stuti Awasthi

From: Stuti Awasthi
Sent: Tuesday, May 14, 2013 12:19 PM
To: [email protected]
Subject: Incorrect output vectors generation using seq2sparse

Hi All,

I am trying the Mahout Naïve Bayes algorithm for classification. I have created a custom sequence file from \t-separated data, with the label as the key and the text string as the value. To convert it to vectors, I used the seq2sparse utility and found that the vectors are not being generated correctly. I debugged each step, and below are my findings:

1. My training data contains 83 records in sequence file format.
2. The outputs of the wordcount, tokenized-document and dictionary.file-0 steps are generated correctly.
3. Then comes the partial-vector-0 generation step. In this step the MR job outputs only 2 records.
4. Since this step is incorrect, the tf-vectors, frequency.file-0, df-count and tfidf-vectors outputs are also incorrect. The final vector file, tfidf-vectors, contains only 2 vectorized documents, and those are not correct either.

Output of tfidf-vectors:

Key: /Irrelevant/: Value: {50:1.0,83:1.0}
Key: /Relevant/: Value: {62:1.0,128:1.0,329:1.0,289:1.0}

The command used for seq2sparse is:

bin/mahout seq2sparse -i /data-seq1 -o /data-vectors

Please help me figure out how I can fix this. To my understanding, the vectorized output should also have 83 records.

Thanks
Stuti Awasthi
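
The counters reported for the partial-vector-0 job (Reduce input records=83, Reduce input groups=2, Reduce output records=2) are what one would expect if the reducer emits one record per distinct key and all 83 documents share just two keys. The Python sketch below only imitates that group-by-key behaviour; it is not Mahout code. The document texts, the alternating label assignment and the docN key suffix are made up for illustration, and whether seq2sparse needs keys of that shape is an assumption.

```python
from collections import defaultdict

# The two keys that appear in the tfidf-vectors dump.
LABELS = ["/Relevant/", "/Irrelevant/"]


def label_only_records(n):
    """Hypothetical input: n documents whose key is just the label."""
    return [(LABELS[i % 2], f"text of document {i}") for i in range(n)]


def unique_key_records(n):
    """Same documents, with a made-up per-document id appended to the key."""
    return [(f"{LABELS[i % 2]}doc{i}", f"text of document {i}") for i in range(n)]


def count_reduce_groups(records):
    """Group (key, value) pairs by key, as a MapReduce shuffle does,
    and return the number of distinct groups handed to the reducer."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return len(groups)


if __name__ == "__main__":
    print(count_reduce_groups(label_only_records(83)))   # 2
    print(count_reduce_groups(unique_key_records(83)))   # 83
```

With label-only keys the 83 input records collapse into 2 groups, matching the counters; with one distinct key per document the group count stays at 83.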
