Hello,
I'm trying to generate TFIDF values from a document collection stored as a
SequenceFile<Text,Text> using the seq2sparse job. However, several of the terms
are somehow being filtered or pruned from the TFIDF vectors. I have pasted
sample output below: the TFIDF vector and the TF vector for the same key. As
you can see, the TF vector has many more non-zero values than the TFIDF vector.
What happened to all of the other term values in the TFIDF vector? Is there a
parameter I am missing to make sure all terms are represented in the TFIDF
vectors? I have tried setting maxDFPercent to 100 but saw the same result. I am
using mahout-distribution-0.6. I appreciate any help you can provide!
Here is the command I am using:
./mahout seq2sparse -i input/docs/ -o seq2sparse/ -wt tfidf --minDF 2 \
  --minSupport 2 -seq -a org.apache.lucene.analysis.SimpleAnalyzer
The VectorWritable value from the tfidf-vectors directory:
Key: 10175389: Value:
{13716:8.130898475646973,14116:6.991464614868164,24713:7.265901565551758,25344:5.853631496429443,39719:5.699481010437012,41716:5.656463623046875,58073:5.74443244934082}
The VectorWritable value from the tf-vectors directory:
Key: 10175389: Value:
{0:3.0,587:1.0,673:1.0,1080:3.0,1085:1.0,1186:1.0,1666:1.0,1886:1.0,2372:1.0,2459:7.0,2827:1.0,3023:1.0,3222:1.0,3322:1.0,3392:2.0,3498:1.0,5522:3.0,5550:1.0,5595:1.0,6046:1.0,6937:2.0,7313:1.0,7731:2.0,8446:1.0,9329:1.0,9746:1.0,9749:1.0,10251:1.0,11118:1.0,11128:1.0,11369:1.0,11370:1.0,11855:1.0,12088:4.0,12213:1.0,12214:1.0,12282:1.0,13259:1.0,13716:1.0,14116:1.0,14566:1.0,14896:1.0,15338:1.0,15550:1.0,15559:1.0,16305:1.0,17424:1.0,17843:1.0,17977:2.0,18536:1.0,18952:1.0,19013:2.0,19037:1.0,19129:1.0,19477:2.0,19595:1.0,19626:1.0,19686:2.0,22889:1.0,23353:1.0,23666:1.0,24176:1.0,24622:9.0,24713:1.0,24823:2.0,25077:1.0,25344:1.0,25613:2.0,26049:4.0,26290:2.0,26650:1.0,26700:1.0,27148:1.0,28789:1.0,29900:1.0,29975:2.0,30724:1.0,32705:1.0,32892:2.0,33067:1.0,33681:1.0,34342:3.0,36135:1.0,36263:1.0,36427:1.0,36905:1.0,36986:2.0,37582:17.0,37870:1.0,38299:1.0,38339:1.0,38570:1.0,38703:1.0,39208:1.0,39406:1.0,
39609:4.0,39719:1.0,40389:1.0,40637:2.0,40667:1.0,41326:1.0,41494:1.0,41538:2.0,41716:1.0,41905:1.0,42003:1.0,42020:2.0,42652:2.0,43373:1.0,44200:1.0,44562:1.0,45843:1.0,45980:4.0,47376:1.0,47398:1.0,47511:1.0,47753:2.0,48636:2.0,48851:1.0,49803:1.0,49968:1.0,49970:2.0,50170:1.0,50586:1.0,50850:1.0,51041:1.0,52359:1.0,53183:1.0,53197:18.0,53222:1.0,53352:2.0,53717:3.0,54486:1.0,55368:1.0,55607:1.0,57483:1.0,57586:1.0,57714:2.0,57795:1.0,57835:1.0,57950:1.0,58073:1.0,58089:1.0,58194:1.0,58212:1.0,58327:1.0,58780:4.0}
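To make the difference concrete, the two dumps can be compared index by index. The following is a minimal sketch in Python and is not part of Mahout: `parse_vector` is a hypothetical helper written for this post, it assumes the dump is in the `{index:value,...}` form shown above, and the TF string is only an excerpt of the full TF vector (the values are copied from it unchanged).

```python
def parse_vector(dump):
    """Parse a '{index:value,...}' vector dump into a dict of int -> float.

    Hypothetical helper for this post; assumes the format pasted above.
    """
    body = dump.strip().strip("{}")
    entries = {}
    for item in body.split(","):
        item = item.strip()  # tolerate line breaks inside the pasted dump
        if not item:
            continue
        index, value = item.split(":")
        entries[int(index)] = float(value)
    return entries


# Full TFIDF vector for key 10175389, as pasted above.
tfidf_dump = (
    "{13716:8.130898475646973,14116:6.991464614868164,"
    "24713:7.265901565551758,25344:5.853631496429443,"
    "39719:5.699481010437012,41716:5.656463623046875,"
    "58073:5.74443244934082}"
)

# Excerpt only (not the full TF vector) for the same key.
tf_dump = (
    "{0:3.0,587:1.0,13716:1.0,14116:1.0,24622:9.0,24713:1.0,"
    "25344:1.0,37582:17.0,39719:1.0,41716:1.0,53197:18.0,"
    "58073:1.0,58780:4.0}"
)

tfidf = parse_vector(tfidf_dump)
tf = parse_vector(tf_dump)

# Indices present in both vectors, and indices that appear only in TF.
kept = sorted(set(tf) & set(tfidf))
dropped = sorted(set(tf) - set(tfidf))

print("kept in TFIDF:   ", kept)
print("missing in TFIDF:", dropped)
```

In this excerpt, every index in the TFIDF vector has a TF count of 1.0, while the indices with the highest counts (for example 37582 and 53197) are among those missing.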