RE: Incorrect output vectors generation using seq2sparse

Stuti Awasthi Tue, 14 May 2013 00:15:29 -0700

Here are the important MapReduce job counters for the partial-vector-0 creation job:



13/05/14 12:37:08 INFO mapred.JobClient:   Map-Reduce Framework
13/05/14 12:37:08 INFO mapred.JobClient:     Map output materialized bytes=5907
13/05/14 12:37:08 INFO mapred.JobClient:     Map input records=83
13/05/14 12:37:08 INFO mapred.JobClient:     Reduce shuffle bytes=0
13/05/14 12:37:08 INFO mapred.JobClient:     Spilled Records=166
13/05/14 12:37:08 INFO mapred.JobClient:     Map output bytes=5729
13/05/14 12:37:08 INFO mapred.JobClient:     CPU time spent (ms)=1340
13/05/14 12:37:08 INFO mapred.JobClient:     Total committed heap usage (bytes)=206700544
13/05/14 12:37:08 INFO mapred.JobClient:     Combine input records=0
13/05/14 12:37:08 INFO mapred.JobClient:     SPLIT_RAW_BYTES=156
13/05/14 12:37:08 INFO mapred.JobClient:     Reduce input records=83
13/05/14 12:37:08 INFO mapred.JobClient:     Reduce input groups=2
13/05/14 12:37:08 INFO mapred.JobClient:     Combine output records=0
13/05/14 12:37:08 INFO mapred.JobClient:     Physical memory (bytes) snapshot=266747904
13/05/14 12:37:08 INFO mapred.JobClient:     Reduce output records=2
13/05/14 12:37:08 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=4225028096
13/05/14 12:37:08 INFO mapred.JobClient:     Map output records=83



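Here it's clear that the reduce phase outputs only 2 records: the job reads 83 
map input records, but reports Reduce input groups=2 and Reduce output records=2.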
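
To make the comparison explicit, the relevant counters can be pulled out of the log text programmatically. The following is only a minimal sketch: the sample lines are copied from the log in this message, while the regular expression and the function name `parse_counters` are illustrative assumptions, not part of Hadoop or Mahout.

```python
import re

# Counter lines copied from the job log in this message.
LOG = """\
13/05/14 12:37:08 INFO mapred.JobClient:     Map input records=83
13/05/14 12:37:08 INFO mapred.JobClient:     Map output records=83
13/05/14 12:37:08 INFO mapred.JobClient:     Reduce input records=83
13/05/14 12:37:08 INFO mapred.JobClient:     Reduce input groups=2
13/05/14 12:37:08 INFO mapred.JobClient:     Reduce output records=2
"""

# Hypothetical pattern: matches "mapred.JobClient:   <name>=<integer>".
COUNTER_RE = re.compile(r"mapred\.JobClient:\s+(.+?)=(\d+)\s*$")


def parse_counters(log_text):
    """Return {counter name: int value} for each 'name=value' JobClient line."""
    counters = {}
    for line in log_text.splitlines():
        match = COUNTER_RE.search(line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters


counters = parse_counters(LOG)
records_in = counters["Map input records"]
groups = counters["Reduce input groups"]
records_out = counters["Reduce output records"]

# The reducer is invoked once per group, so the number of groups bounds
# how many distinct keys the job saw.
print(f"{records_in} map input records, {groups} reduce groups, "
      f"{records_out} reduce output records")
```

Since "Reduce input groups" counts the distinct keys that reach the reducer, a value of 2 against 83 input records suggests that the 83 records share only 2 distinct keys.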



Thanks

Stuti Awasthi


From: Stuti Awasthi
Sent: Tuesday, May 14, 2013 12:19 PM
To: [email protected]
Subject: Incorrect output vectors generation using seq2sparse

Hi All,

I am trying the Mahout Naïve Bayes algorithm for classification. I have created 
a custom sequence file from \t-separated data, with the label as the key and the 
text string as the value.
To convert it to vectors, I used the seq2sparse utility and found that the 
vectors are not generated correctly.

I debugged each step, and my findings are below:


1.       My training data contains 83 records in sequence file format.

2.       The outputs of the wordcount, tokenized-document and dictionary.file-0 
steps are generated correctly.

3.       Next comes the partial-vector-0 generation step. In this step the MR 
job outputs only 2 records.

4.       Since this step is incorrect, the tf-vectors, frequency.file-0, 
df-count and tfidf-vector outputs are incorrect as well.
The final vector file, tfidf-vectors, contains only 2 vectorized documents, and 
even those are not correct.

Output of tfidf-vectors:
Key: /Irrelevant/: Value: {50:1.0,83:1.0}
Key: /Relevant/: Value: {62:1.0,128:1.0,329:1.0,289:1.0}
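
The two keys above are exactly the two labels, which matches Reduce input groups=2 in the job counters. One possible reading, not confirmed anywhere in this thread, is that records sharing a key are grouped together in the reduce step, so label-only keys cannot yield more than one output per label. The sketch below simulates only that grouping behaviour on synthetic data. It is not Mahout code, and the `/label/doc-N` key layout is a hypothetical way of making keys unique per document.

```python
from collections import defaultdict


def group_by_key(records):
    """Mimic the shuffle: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return dict(groups)


def make_records(n_docs, unique_keys):
    """Build synthetic (key, text) pairs; the labels alternate per document."""
    records = []
    for doc_id in range(n_docs):
        label = "Relevant" if doc_id % 2 == 0 else "Irrelevant"
        if unique_keys:
            key = f"/{label}/doc-{doc_id}"  # hypothetical per-document key
        else:
            key = f"/{label}/"              # label-only key, as in the output above
        records.append((key, f"text of document {doc_id}"))
    return records


label_only = group_by_key(make_records(83, unique_keys=False))
per_document = group_by_key(make_records(83, unique_keys=True))

# Number of groups a reducer would see in each case.
print(len(label_only), len(per_document))
```

With label-only keys the 83 synthetic records fall into 2 groups, while per-document keys keep 83 groups. Whether this is what happens inside seq2sparse for my data still needs to be confirmed.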

The command used for seq2sparse is:
bin/mahout seq2sparse -i /data-seq1 -o /data-vectors

Please help me figure out how I can fix this. To my understanding, the 
vectorized output should also contain 83 records, one per input document.

Thanks
Stuti Awasthi



