I found the solution: https://wiki.apache.org/solr/TermVectorComponent. I did not know about this component before, but it is exactly what I need.
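For anyone finding this thread later, here is a rough Python sketch of how I expect to use it. Several things are assumptions rather than facts about any particular setup: the base URL and core name are made up, the component is assumed to be registered on a handler named /tvrh (as in the Solr example configuration), the field is assumed to be indexed with termVectors="true", and `term_stats` is a hypothetical, already-parsed dictionary rather than the raw response format. The log-scaled formula is just one common tf-idf variant.

```python
import math
from urllib.parse import urlencode


def term_vector_url(base_url, doc_id, field):
    """Build a request URL asking the TermVectorComponent for tf and df.

    Assumes the component is attached to a handler named /tvrh.
    """
    params = {
        "q": f'id:"{doc_id}"',  # select a single record by its id
        "fl": "id",
        "tv": "true",           # switch the component on
        "tv.fl": field,         # field whose term vectors we want
        "tv.tf": "true",        # term frequency within the document
        "tv.df": "true",        # document frequency within the index
        "wt": "json",
    }
    return f"{base_url}/tvrh?{urlencode(params)}"


def tf_idf(tf, df, num_docs):
    """One common tf-idf variant: tf * ln(N / df)."""
    return tf * math.log(num_docs / df)


def score_terms(term_stats, num_docs):
    """term_stats is a hypothetical parsed form: {term: {"tf": int, "df": int}}."""
    return {
        term: tf_idf(stats["tf"], stats["df"], num_docs)
        for term, stats in term_stats.items()
    }


if __name__ == "__main__":
    # Hypothetical core name and record id, for illustration only.
    print(term_vector_url("http://localhost:8983/solr/mycore", "rec-1", "title"))
    print(score_terms({"bible": {"tf": 2, "df": 10}}, num_docs=1000))
```

This replaces the /analysis/field plus /terms round trips with one request per record. Mapping the actual response into `term_stats` is left out because the JSON layout depends on the response writer settings.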
Regards,
Péter

2016-01-29 16:09 GMT+01:00 Péter Király <kirun...@gmail.com>:
> Dear all,
>
> I am working on a research project in which I am creating an OS tool that
> tries to detect "bad" and "good" records in a metadata collection
> (such as a library catalog, museum database etc. -- you can find more
> info here http://pkiraly.github.io/). This is not the first project of
> its kind: there are some scientific articles on the topic, and there
> are some established metrics as well. One of the metrics is
> "Conformance to expectation", which is more or less a variation of the
> tf-idf calculation (https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
>
> The process in my case is to index the database, then iterate over the
> records and calculate the tf-idf of the important fields. Since I haven't
> found a method with which I can simply retrieve this from the Solr index, I
> followed this method:
>
> take a field value
> use the /analysis/field handler to extract the terms from the original value
> use /terms with the terms.limit=1, terms.sort=index, terms.fl and terms.prefix parameters to retrieve the document frequency of each term
> do the calculations based on those input variables
>
> My question is: is there a more direct way to extract this
> information from the Solr index, either in Solr or with the Lucene
> API?
>
> Thank you very much in advance!
> Péter
>
> --
> Péter Király
> software developer
> GWDG, Göttingen - Europeana - eXtensible Catalog - The Code4Lib Journal
> http://linkedin.com/in/peterkiraly