I found the solution:
https://wiki.apache.org/solr/TermVectorComponent. I did not know about
this component before, but it is exactly what I need.
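
For anyone finding this thread later, here is a minimal sketch of how such
a request might be built. It is not tested against a live index. The
handler path /tvrh, the core name, the "id" field and the helper names
term_vector_url and tf_idf are all assumptions for illustration; the tv.*
parameters are the ones documented on the wiki page above. As far as I can
tell, tv.tf_idf returns tf divided by df, so the logarithmic textbook
variant still has to be computed on the client from tv.tf and tv.df.

```python
import math
from urllib.parse import urlencode


def term_vector_url(base_url, core, doc_id, field):
    """Build a TermVectorComponent request URL for one document.

    Assumes a request handler registered at /tvrh and a unique key
    field called "id"; adjust both to the actual Solr configuration.
    """
    params = {
        "q": f"id:{doc_id}",
        "tv": "true",         # switch the component on
        "tv.fl": field,       # field(s) to return term vectors for
        "tv.tf": "true",      # term frequency within the document
        "tv.df": "true",      # document frequency within the index
        "tv.tf_idf": "true",  # Solr's own tf/df ratio
        "wt": "json",
    }
    return f"{base_url}/{core}/tvrh?{urlencode(params)}"


def tf_idf(tf, df, num_docs):
    """Textbook tf-idf: tf * ln(N / df), computed on the client."""
    return tf * math.log(num_docs / df)
```

The field also has to be indexed with termVectors="true" in the schema,
otherwise the component has nothing to return.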

Regards,
Péter

2016-01-29 16:09 GMT+01:00 Péter Király <kirun...@gmail.com>:
> Dear all,
>
> I am working on a research project in which I create an open source tool
> which tries to detect "bad" and "good" records in a metadata collection
> (such as a library catalog, museum database etc. -- you can find more
> info here http://pkiraly.github.io/). This is not the first project of
> that kind; there are some scientific articles on the topic, and there
> are some established metrics as well. One of the metrics is
> "Conformance to expectation" which is more or less a variation of the
> tf-idf calculation (https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
>
> The process in my case is to index the database, then iterate over the
> records and calculate the tf-idf of the important fields. Since I haven't
> found a method with which I can simply retrieve this from the Solr index, I
> followed this method:
>
> 1. take a field value
> 2. use the /analysis/field handler to extract the terms from the
>    original value
> 3. use /terms with terms.limit=1, terms.sort=index, and the terms.fl and
>    terms.prefix parameters to retrieve the document frequency of each
>    term
> 4. do the calculations based on those input variables
>
> My question is: is there any more direct way to extract this
> information from the Solr index either in Solr, or with the Lucene
> API?
>
> Thank you very much in advance!
> Péter
>
> --
> Péter Király
> software developer
> GWDG, Göttingen - Europeana - eXtensible Catalog - The Code4Lib Journal
> http://linkedin.com/in/peterkiraly



-- 
Péter Király
software developer
GWDG, Göttingen - Europeana - eXtensible Catalog - The Code4Lib Journal
http://linkedin.com/in/peterkiraly
