Re: codec: accessing term dictionary

2017-03-10 Thread Jürgen Jakobitsch
dawid, thanks for your input.. initially i was hoping to be able to use FST somehow in this process, but my knowledge in this area is fairly limited.. i will give it a second thought anyway... ;-) krj

Re: codec: accessing term dictionary

2017-03-10 Thread Jürgen Jakobitsch
michael, thanks for your input.. i already extended the default codec to return the BlockTreeOrdsPostingsFormat for testing and this works nicely and i can access terms via ordinal. speed is not really the issue (some things simply take a while... ;-)). i also don't want to index shingles, becau

Re: codec: accessing term dictionary

2017-03-10 Thread Dawid Weiss
Or you could encode those term/ngram frequencies in one FST and then reuse it. This would be memory-saving and fairly fast (~comparable to a hash table). Dawid On Fri, Mar 10, 2017 at 11:41 AM, Michael McCandless wrote: > Yes, this is a reasonable way to use Lucene (to see terms statistics across

Re: codec: accessing term dictionary

2017-03-10 Thread Michael McCandless
Yes, this is a reasonable way to use Lucene (to see term statistics across the corpus) but it may not be performant enough for your needs. E.g. spending memory on a giant hash table for one-time or periodic corpus analysis may be faster. If you are looking for word n-gram stats, you could
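
For illustration only, the "giant hash table" idea for word n-gram statistics might look roughly like the sketch below. It is not code from this thread and uses no Lucene API. The class name NgramCounter, the method count, and the space-joined key format are all hypothetical choices, and the tokens are assumed to be already analyzed and held in memory.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Lucene): counts word n-grams in a plain in-memory hash table.
class NgramCounter {

    /** Counts every run of n consecutive tokens, keyed by the space-joined n-gram. */
    static Map<String, Integer> count(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        Map<String, Integer> counts = new HashMap<>();
        // Slide a window of width n over the token list.
        for (int i = 0; i + n <= tokens.size(); i++) {
            String gram = String.join(" ", tokens.subList(i, i + n));
            counts.merge(gram, 1, Integer::sum); // insert 1 or increment the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed input: tokens that were already lowercased and split elsewhere.
        List<String> tokens = Arrays.asList("new", "york", "is", "not", "new", "jersey");
        System.out.println(count(tokens, 2));
    }
}
```

A table like this trades memory for speed: every distinct n-gram becomes a String key, which is the cost the message above alludes to.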

codec: accessing term dictionary

2017-03-09 Thread Jürgen Jakobitsch
hi, i'd like to ask users for their experiences with the fastest way to access the term dictionary. what i want to do is to implement some algorithms to find phrases (e.g. mutual rank ratio [1]) (and other statistics on term distribution, generally: corpus related stuff) the idea would be to do