Re: tf/idf similarity with modified document similarity

2014-03-07 Thread Jack Krupansky
Do you expect to have relatively large or relatively small result sets? For the former, are you willing to accept slow performance? I mean, your logic will have to scan all of the documents and fetch and check their term frequencies to count up df for each desired term. Maybe at least some of t
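
The scan described above can be sketched as follows. This is a minimal illustration, not Lucene code: `result_set_df` and the in-memory token lists are hypothetical names, assuming each document in the result set is already available as a list of terms. It shows why the cost grows with the size of the result set, since every document has to be visited once.

```python
from collections import Counter


def result_set_df(result_docs, terms):
    """Count, for each requested term, how many result-set documents contain it.

    result_docs: iterable of token lists, one per document (hypothetical input).
    terms: the terms whose result-set document frequency is wanted.
    """
    wanted = set(terms)
    df = Counter()
    for tokens in result_docs:
        # A document contributes at most 1 per term, however often the term occurs.
        for term in wanted.intersection(tokens):
            df[term] += 1
    # Terms that never occur are reported with a count of 0.
    return {term: df[term] for term in terms}


docs = [
    ["lucene", "index", "lucene"],
    ["solr", "index"],
    ["lucene", "solr"],
]
counts = result_set_df(docs, ["lucene", "index", "mahout"])
```

With the three sample documents, `counts` maps `lucene` to 2, `index` to 2 and `mahout` to 0. The loop is linear in the number of documents scanned, which is the performance concern for large result sets.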

Re: [blog post] Comparing Document Classification Functions of Lucene and Mahout

2014-03-07 Thread Koji Sekiguchi
Hi Tommaso, Thank you for your reply and tweet! > Some useful points / suggestions come out of it, let's see if we can follow > up :) Let's look at a simple one first. :-) Why don't we consider adding an Analyzer parameter to assignClass()? koji (14/03/07 17:18), Tommaso Teofili wrote: cool Koji, tha

Re: Square of Idf

2014-03-07 Thread Yonik Seeley
On Thu, Mar 6, 2014 at 6:28 PM, Furkan KAMACI wrote: > Hi; > > The Tf-Idf explanation says that: > > *idf(t)* appears for *t* in both the query and the document, hence it is > squared in the equation. > > DefaultSimilarity does not square it. What is the explanation for it? I think you explained it
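
One way to see where the square comes from is the sketch below. It is a simplified illustration based on the classic TF-IDF scoring formula, not a copy of Lucene's implementation and not a reconstruction of the truncated reply: the function names are hypothetical, and boosts, queryNorm, coord and field norms are left out. The point is that the `idf` function itself returns an unsquared value, and the square only appears when the query-side weight and the document-side weight are multiplied together.

```python
import math


def idf(doc_freq, num_docs):
    # Classic idf: 1 + ln(numDocs / (docFreq + 1)). Note: not squared here.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def tf(freq):
    # Classic tf: square root of the within-document term frequency.
    return math.sqrt(freq)


def term_score(freq, doc_freq, num_docs):
    """Score contribution of one term (boosts and norms omitted for brevity)."""
    i = idf(doc_freq, num_docs)
    query_weight = i             # idf enters once on the query side
    field_weight = tf(freq) * i  # and once more on the document side
    return query_weight * field_weight  # tf * idf^2 overall


score = term_score(freq=4, doc_freq=9, num_docs=100)
```

Under these assumptions the product equals `tf(freq) * idf(...) ** 2`, so a similarity class can expose a plain `idf` and the scoring formula still contains idf squared.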

Re: codec mismatch

2014-03-07 Thread Michael McCandless
Thanks for bringing closure Jason! Mike McCandless http://blog.mikemccandless.com On Fri, Mar 7, 2014 at 12:30 AM, Jason Wee wrote: > Hello Mike, > > Thank you and you were right in your first comment, the expected field, > Lucene46FieldInfos is within the file _0.cfs. We have taken a closer l

Re: [blog post] Comparing Document Classification Functions of Lucene and Mahout

2014-03-07 Thread Tommaso Teofili
cool Koji, thanks a lot for sharing. Some useful points / suggestions come out of it, let's see if we can follow up :) Regards, Tommaso 2014-03-07 3:30 GMT+01:00 Koji Sekiguchi : > Hello, > > I just posted an article on Comparing Document Classification Functions > of Lucene and Mahout. > > > h