Hi All,

I have a use case where I would like to group documents by topic.
I decided to try Mahout CVB for this, so that I get both the topics and
the documents associated with each topic.
The algorithm runs fine and generates output, but I am having difficulty
interpreting it correctly for further use.

I have created 10 Topics.

Topic|Term Dump:

Command:
/mahout-distribution-0.7/bin/mahout vectordump \
    -i /stuti/ClusteringOutput/cvb/CVBoutput -dt sequencefile \
    -d /stuti/ClusteringOutput/data-vectors/dictionary.file-0 \
    -o ~/vectordump --printKey TRUE -u true

Output

9       
{1:0.01817006075375475,10:3.136941380958122E-5,100:9.273658949434378E-6,2:0.002578156629756592,3:0.010230706274774652,4500:7.18113775256184E-4,6000:7.823039625689614E-
 .....}
8       
{1:0.0050419126069912,10:0.00419466385663254,100:3.2355607898053236E-5,2:0.0035546709954573028,3:4.932871944602668E-....}

Document|Topic Dump:

Command:
~/mahout-distribution-0.7/bin/mahout vectordump \
    -i /stuti/ClusteringOutput/cvb/doc_topic_output/part-m-00000 -dt sequencefile \
    -d /stuti/ClusteringOutput/data-vectors/dictionary.file-0 \
    -o ~/vectordump --printKey TRUE -u true

Output
docID_0 
{1:0.24003566642926688,10:0.07902409378706234,100:0.27642347654563704,2:0.09328563693950895,3:0.07601011469606367,4500:0.030149937033033494,6000:0.016651437809998382,6500:0.08626329023442939,802.1q:0.031797124464879,aaa:0.0703592220601207}
docID_1 
{1:0.09930948851237775,10:0.036922991425322335,100:0.013837888829815237,2:0.2853167624741968,3:0.00586058239550104,4500:0.28977064603716723,6000:0.0840069314177161,6500:0.12960407482548578,802.1q:0.017209604276159373,aaa:0.03816102980625812}


As far as I can tell, the output is fine. My questions are:
1. How do I get the top terms for each topic from the Topic|Term output?
2. How do I infer which document belongs to which topic from the
Document|Topic output?
3. Is there a built-in utility, like clusterdump, that can show the documents
of the same topic together under one cluster?

My goal is to group documents related to the same topic together.
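
To make the goal concrete, here is a rough sketch of the post-processing I have
in mind, written in plain Python rather than with any Mahout utility. It assumes
each dumped vector is a single {key:weight,...} string, and the helper names
(parse_vector, top_entries, group_by_dominant_key) are my own, not part of
Mahout. It also assumes that assigning each document to its highest-weight entry
is a reasonable grouping rule. Note that the keys in the Document|Topic dump look
like dictionary terms, presumably because -d was passed, so they may only be
labels standing in for topic positions.

```python
from collections import defaultdict


def parse_vector(text):
    """Turn a dumped vector such as '{aaa:0.07,802.1q:0.03}' into a dict."""
    weights = {}
    for entry in text.strip().strip("{}").split(","):
        if not entry:
            continue
        # Split on the last colon so that a key containing ':' stays intact.
        key, _, value = entry.rpartition(":")
        weights[key] = float(value)
    return weights


def top_entries(weights, n):
    """Return the n (key, weight) pairs with the largest weights."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n]


def group_by_dominant_key(doc_vectors):
    """Group document IDs under the key that carries their largest weight."""
    groups = defaultdict(list)
    for doc_id, weights in doc_vectors.items():
        groups[max(weights, key=weights.get)].append(doc_id)
    return dict(groups)


# The two document vectors shown above, copied verbatim from the dump.
docs = {
    "docID_0": parse_vector(
        "{1:0.24003566642926688,10:0.07902409378706234,"
        "100:0.27642347654563704,2:0.09328563693950895,"
        "3:0.07601011469606367,4500:0.030149937033033494,"
        "6000:0.016651437809998382,6500:0.08626329023442939,"
        "802.1q:0.031797124464879,aaa:0.0703592220601207}"
    ),
    "docID_1": parse_vector(
        "{1:0.09930948851237775,10:0.036922991425322335,"
        "100:0.013837888829815237,2:0.2853167624741968,"
        "3:0.00586058239550104,4500:0.28977064603716723,"
        "6000:0.0840069314177161,6500:0.12960407482548578,"
        "802.1q:0.017209604276159373,aaa:0.03816102980625812}"
    ),
}

# Documents grouped by their dominant entry (assumed to identify the topic).
groups = group_by_dominant_key(docs)
print(groups)

# The three heaviest entries of one vector.
print(top_entries(docs["docID_0"], 3))
```

The same top_entries call could be applied to each Topic|Term vector to list its
heaviest terms, if that is the right way to read that dump.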

Please help me understand this output.

Thanks
Stuti Awasthi