Re: term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

John Conwell Mon, 23 Jan 2012 07:49:53 -0800

Any time you ask for term frequency rather than tfidf weighting (-wt tf)
and also use maxDFSigma rather than maxDFPercent (--maxDFSigma 3), the
term vectors are not created (see the code quoted below).


For example, the following command line reproduces the problem:

bin/mahout seq2sparse -i /Users/me/Documents/workspace/mahoutStuff/seq -o
/Users/me/Documents/workspace/mahoutStuff/termvecs -wt tf --minSupport 2
--minDF 2 --maxDFSigma 3 -seq
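
To make the gap concrete, here is a minimal Java sketch of the condition
quoted further down in this thread. It is not Mahout code: the class and
method names (TfBranchSketch, createsTfVectors, validate) are made up for
illustration, and I am assuming processIdf is true for "-wt tfidf" and
shouldPrune is true when --maxDFSigma is given. The validate method is only
one possible shape for the early check suggested in my original mail.

```java
public class TfBranchSketch {

    /**
     * Mirrors the shape of the quoted if / else-if. Returns true when one of
     * the two branches would call createTermFrequencyVectors.
     */
    static boolean createsTfVectors(boolean processIdf, boolean shouldPrune) {
        if (!processIdf && !shouldPrune) {
            return true;   // tf weighting, no pruning
        } else if (processIdf) {
            return true;   // tfidf weighting, with or without pruning
        }
        return false;      // tf weighting with pruning: neither branch runs
    }

    /** Hypothetical fail-fast check; an assumption, not part of Mahout. */
    static void validate(boolean processIdf, boolean shouldPrune) {
        if (!createsTfVectors(processIdf, shouldPrune)) {
            throw new IllegalArgumentException(
                "tf weighting combined with maxDFSigma pruning is not handled");
        }
    }

    public static void main(String[] args) {
        boolean[] flags = {false, true};
        // Walk all four combinations to show which one falls through.
        for (boolean processIdf : flags) {
            for (boolean shouldPrune : flags) {
                System.out.printf(
                    "processIdf=%b shouldPrune=%b createsTfVectors=%b%n",
                    processIdf, shouldPrune,
                    createsTfVectors(processIdf, shouldPrune));
            }
        }
    }
}
```

Of the four combinations, only processIdf=false with shouldPrune=true
reaches neither branch, which matches the command line above.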

Thanks,
John

On Sun, Jan 22, 2012 at 3:00 PM, Grant Ingersoll <[email protected]> wrote:

> What were the command/options you were passing in?
>
>
> On Jan 18, 2012, at 4:26 PM, John Conwell wrote:
>
> > I got the latest from trunk and built it, and when running
> > SparseVectorsFromSequenceFiles I noticed what I think is a bug.
> > SparseVectorsFromSequenceFiles throws an exception when you want term
> > frequency vectors output together with the maxDFSigma filtering option.
> >
> > Basically, the if / else if section shown below skips the call to
> > DictionaryVectorizer.createTermFrequencyVectors when you have that
> > combination.  The condition creates vectors when you want tf vectors
> > without maxDFSigma filtering, or tfidf vectors with maxDFSigma filtering,
> > but if you want tf vectors with maxDFSigma filtering, it skips the call
> > to createTermFrequencyVectors entirely, and later on throws an exception
> > because the vector input path doesn't exist.
> >
> > Is this a known issue?  I'm assuming that's not the way it's supposed to
> > work, correct?  If so, I think some sort of validation should stop the
> > user before any processing starts.
> >
> > // at line ~267 in trunk
> >
> > if (!processIdf && !shouldPrune) {
> >   DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
> >       outputDir, tfDirName, conf, minSupport, maxNGramSize,
> >       minLLRValue, norm, logNormalize, reduceTasks, chunkSize,
> >       sequentialAccessOutput, namedVectors);
> > } else if (processIdf) {
> >   DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
> >       outputDir, tfDirName, conf, minSupport, maxNGramSize,
> >       minLLRValue, -1.0f, false, reduceTasks, chunkSize,
> >       sequentialAccessOutput, namedVectors);
> > }
> >
> > --
> >
> > Thanks,
> > John C
> >
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
>
>
>


-- 

Thanks,
John C
