Suneel,

I'm going to redo the similarity part of the tour. My laptop went to sleep in the middle of the rowsimilarity job run, and maybe the job is sensitive to that. :( Normally a server would not go to sleep, nor would it run in local mode.
Sorry that I didn't think of that sooner. Will let you know my outcome.

I am planning to redo it by deleting the folder titled "reuters-similarity" and its contents. Please let me know if that is not good enough.

Thanks again.

SCott

On 12/19/13 11:53 AM, "Suneel Marthi" <[email protected]> wrote:

>What you are seeing is the output matrix of the RowSimilarity job. You
>are right, there should be only 21578 documents in the Reuters corpus.
>
>a) How many documents do you have in your docIndex? DocIndex is one of
>the artifacts of the RowIDJob, which should have been executed prior to
>the RowSimilarity job. You can run seqdumper on docIndex to see the output.
>
>b) Also, what was the message at the end of the RowId job? It should read
>something like 'Wrote out matrix with 21578 rows and 19515 columns to
>reuters-matrix/matrix'.
>
>
>On Thursday, December 19, 2013 12:14 PM, Scott C. Cote
><[email protected]> wrote:
>
>All,
>
>I am a newbie Mahout user and am trying to use the "Quick tour of text
>analysis using the Mahout command line". Thank you to whoever contributed
>to that page.
>
>> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
>
>I went all the way from the beginning to the end of the page with
>"seemingly" no hiccups. At the very end of the "tour", I became confused
>because the command:
>
>> mahout seqdumper -i reuters-matrix/matrix | more
>
>allowed me to see this output (snippet):
>
>> Key: 1: Value: /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,4403:0.22792379999043863,5405:0.0964390139170019,5997:0.030023608542497426,10108:0.12628552842745744,13043:0.14709923014699935,13653:0.07372109235301716,13750:0.1888955967611108,15886:0.1543819831189062,15901:0.10756083643096839,15969:0.36601581899071867,16138:0.12548750176412274,16553:0.11490460601515046,17734:0.10869648237816114,17978:0.11932381316475806,18019:0.1051527785317777,22224:0.12309146422711122,22456:0.1371221887995933,22837:0.19295627853659875,25480:0.061693610076373216,25958:0.09251293588851367,26105:0.10304941346400417,26507:0.12327184002913602,28332:0.1794774670703689,28335:0.10843140748339948,28480:0.08018737549811794,29541:0.11169278315306423,30534:0.18480378614987836,30921:0.1987470224449987,31071:0.17024007142554856,31386:0.22792379999043863,31433:0.1478802530196623,31815:0.06001469365693789,32099:0.1284458798636675,32334:0.10973793576935256,32385:0.12143572490835457,34782:0.030407287755940444,35425:0.035819767691229826,37264:0.20518922008525398,37355:0.2879544482952078,37818:0.10819820350102567,39273:0.10347873039101099,39831:0.08810699655751153,39979:0.09528250026282217,40427:0.18975048184863322,41154:0.06582064373931332,}
>
>Reading through that snippet of data made me think that there exists a
>document with row id 41154 and a cosine value of ~0.0658 (the last
>element in the snippet).
>
>The problem is that the folder
>
>> /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted
>
>only has 21578 files in it. Indeed, my dictionary file (the command used
>is shown below)
>
>> mahout seqdumper -i reuters-matrix/docIndex | tail
>
>has a max key of
>
>> Key: 21576: Value: /reut2-021.sgm-98.txt
>> Key: 21577: Value: /reut2-021.sgm-99.txt
>> Count: 21578
>
>So I cannot find the document with key value 41154. What does the 41154
>relate to?
>
>Obviously I have misunderstood something that I did or need to do in the
>tour. Can someone please shine a light on where I strayed? I have
>scripted every step that I took and can share the scripts here if desired
>(I noticed that some of the output file names have changed since the page
>was written, so I made adjustments).
>
>Regards,
>
>SCott
>
>PS Thanks TD for helping me earlier
