Whenever you ask for term frequency rather than tfidf as the weighting (-wt tf) and also use maxDFSigma rather than maxDFPercent (--maxDFSigma 3), the term vectors are never created (as shown in the code below).
For example, the following command line reproduces the situation:

bin/mahout seq2sparse -i /Users/me/Documents/workspace/mahoutStuff/seq -o /Users/me/Documents/workspace/mahoutStuff/termvecs -wt tf --minSupport 2 --minDF 2 --maxDFSigma 3 -seq

Thanks,
John

On Sun, Jan 22, 2012 at 3:00 PM, Grant Ingersoll <[email protected]> wrote:

> What were the command/options you were passing in?
>
> On Jan 18, 2012, at 4:26 PM, John Conwell wrote:
>
> > I got latest from Trunk and built it, and when running
> > SparseVectorsFromSequenceFiles I noticed what I think is a bug.
> > SparseVectorsFromSequenceFiles throws an exception when you want term
> > frequency vectors output with the maxDFSigma filtering option.
> >
> > Basically, the if / else if section shown below will skip calling
> > DictionaryVectorizer.createTermFrequencyVectors when you have that
> > combination. The condition will create vectors when you want tf vectors
> > without maxDFSigma filtering, or tfidf vectors with maxDFSigma
> > filtering, but if you want tf vectors with maxDFSigma filtering, it
> > skips the call to createTermFrequencyVectors entirely and later throws
> > an exception because the vector input path doesn't exist.
> >
> > Is this a known issue? I'm assuming that's not the way it's supposed to
> > work, correct? If so, I think some sort of validation should break the
> > user out before they start processing anything.
> >
> > // at line ~267 in trunk
> > if (!processIdf && !shouldPrune) {
> >   DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
> >       outputDir, tfDirName, conf, minSupport, maxNGramSize,
> >       minLLRValue, norm, logNormalize, reduceTasks, chunkSize,
> >       sequentialAccessOutput, namedVectors);
> > } else if (processIdf) {
> >   DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
> >       outputDir, tfDirName, conf, minSupport, maxNGramSize,
> >       minLLRValue, -1.0f, false, reduceTasks, chunkSize,
> >       sequentialAccessOutput, namedVectors);
> > }
> >
> > --
> > Thanks,
> > John C
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com

--
Thanks,
John C
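
To make the gap in the quoted if / else if easier to see, here is a minimal stand-alone sketch that only mirrors the shape of that condition. It is not Mahout code: the class TfBranchSketch and both of its methods are hypothetical names made up for illustration, and it assumes, as the report implies, that shouldPrune is true when --maxDFSigma is given and processIdf is false when -wt tf is given.

```java
// Hypothetical stand-alone sketch, not Mahout code. Only the two boolean
// names (processIdf, shouldPrune) and the if / else if shape are taken
// from the snippet quoted in this thread.
final class TfBranchSketch {

    // Mirrors the quoted if / else if and reports which branch would run.
    static String branchTaken(boolean processIdf, boolean shouldPrune) {
        if (!processIdf && !shouldPrune) {
            return "tf vectors created with the caller's norm settings";
        } else if (processIdf) {
            return "tf vectors created with norm -1.0f for the later tfidf step";
        }
        // Neither branch matched, so createTermFrequencyVectors is never called.
        return "none";
    }

    // Hypothetical up-front check of the kind suggested in the report:
    // true when the option combination would leave the tf vectors uncreated.
    static boolean skipsTermFrequencyVectors(boolean processIdf, boolean shouldPrune) {
        return !processIdf && shouldPrune;
    }

    public static void main(String[] args) {
        boolean[] flags = {false, true};
        for (boolean processIdf : flags) {
            for (boolean shouldPrune : flags) {
                System.out.printf("processIdf=%b shouldPrune=%b -> %s%n",
                        processIdf, shouldPrune, branchTaken(processIdf, shouldPrune));
            }
        }
    }
}
```

Of the four combinations, only processIdf=false with shouldPrune=true falls through both branches, which matches the -wt tf plus --maxDFSigma case described above. A check like skipsTermFrequencyVectors could run before any processing starts, either to reject that combination early or to decide that a third branch is needed.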
