On Sep 20, 2012, at 1:55 PM, Dave Byrne wrote:

> In TFIDFPartialVectorReducer.java:
>
> If docFreq > maxDocFreq then the vector at that index is not set (ignored)
> If docFreq < minDocFreq then the vector at that index is set to the TfIdf
> calculation using minDocFreq instead of the actual document frequency.
>
> Should minDocFreq not be treated the same as maxDocFreq by skipping setting
> the vector at that index?

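For reference, the behaviour described in the quoted message can be sketched roughly as below. This is only an illustration, not the actual TFIDFPartialVectorReducer source: the class and method names are hypothetical, the weight is a textbook tf * ln(N / df) rather than whatever formula the library uses, and maxDocFreq is compared directly as a count, as the message states it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; names and formula are assumptions for illustration only.
class TfIdfClampSketch {

    // Textbook TF-IDF (an assumption, not necessarily the library's formula).
    // Requires df >= 1, which holds below as long as minDocFreq >= 1.
    static double tfIdf(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    // termFreqs: term index -> term frequency in one document
    // docFreqs:  term index -> number of documents containing the term
    static Map<Integer, Double> weigh(Map<Integer, Integer> termFreqs,
                                      Map<Integer, Integer> docFreqs,
                                      int minDocFreq, int maxDocFreq, int numDocs) {
        Map<Integer, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> e : termFreqs.entrySet()) {
            Integer docFreq = docFreqs.get(e.getKey());
            if (docFreq == null) {
                continue; // term not in the dictionary
            }
            if (docFreq > maxDocFreq) {
                continue; // too common: the index is left unset
            }
            // Too rare: the document frequency is rounded up to minDocFreq
            // instead of the term being dropped.
            int effectiveDocFreq = Math.max(docFreq, minDocFreq);
            weights.put(e.getKey(), tfIdf(e.getValue(), effectiveDocFreq, numDocs));
        }
        return weights;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> tf = new LinkedHashMap<>();
        tf.put(0, 3); // rare term
        tf.put(1, 2); // ordinary term
        tf.put(2, 5); // very common term
        Map<Integer, Integer> df = new LinkedHashMap<>();
        df.put(0, 1);
        df.put(1, 10);
        df.put(2, 90);
        // Term 0 is weighted as if its docFreq were 5; term 2 is skipped.
        System.out.println(weigh(tf, df, 5, 50, 100));
    }
}
```

In this sketch the rare term keeps a weight, only a smaller one than its true document frequency would give, while the overly common term gets no entry at all. That asymmetry is the one the question is about.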
I think the idea is that the document frequency is being rounded up to provide some minimum level of input. It's always a bit of a hedge with these rare terms. Sometimes they are just garbage; other times they are valuable. My leaning would be towards keeping it as is.

> In both cases, the vector length remains the same and these settings have no
> effect on pruning the vector length / term reduction?

--------------------------------------------
Grant Ingersoll
http://www.lucidworks.com
