dawid, thanks for your input.
initially i was hoping to be able to use an FST somehow in this process, but
my knowledge in this area is fairly limited.
i will give it a second thought anyway... ;-)
krj
*Jürgen Jakobitsch*
Innovation Director
Semantic Web Company GmbH
EU: +43-1-4021235-0
michael, thanks for your input.
i already extended the default codec to return the
BlockTreeOrdsPostingsFormat for testing; this works nicely and i can
access terms via ordinal.
speed is not really the issue (some things simply take a while... ;-) ).
i also don't want to index shingles, becau
Or you could encode those term/n-gram frequencies in one FST and then
reuse it. This would be memory-efficient and fairly fast (roughly
comparable to a hash table).
Dawid
On Fri, Mar 10, 2017 at 11:41 AM, Michael McCandless wrote:
> Yes, this is a reasonable way to use Lucene (to see terms statistics across
Yes, this is a reasonable way to use Lucene (to see term statistics across
the corpus), but it may not be performant enough for your needs.
E.g. spending memory on a giant hash table for one-time or periodic
corpus analysis may be faster.
If you are looking for word N gram stats, you could
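
For illustration, the hash-table approach mentioned above could look roughly like the sketch below. It assumes the corpus is already tokenized and small enough to keep all n-gram counts on the heap. `NgramCounter` and `countNgrams` are hypothetical names invented for this example and are not part of Lucene.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Lucene class): counts word n-grams in memory.
class NgramCounter {

    // Counts every contiguous n-gram in each tokenized document.
    // Tokens of an n-gram are joined with a single space to form the map key.
    static Map<String, Integer> countNgrams(List<List<String>> docs, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> tokens : docs) {
            // Slide a window of n tokens over the document.
            for (int i = 0; i + n <= tokens.size(); i++) {
                String gram = String.join(" ", tokens.subList(i, i + n));
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("semantic", "web", "company"),
            Arrays.asList("the", "semantic", "web"));
        Map<String, Integer> bigrams = countNgrams(docs, 2);
        System.out.println(bigrams.get("semantic web")); // prints 2
    }
}
```

Whether this beats walking the term dictionary depends on corpus size and available heap; the map trades memory for constant-time lookups, which is the trade-off described above.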
hi,
i'd like to ask users for their experiences with the fastest way to access
the term dictionary.
what i want to do is implement some algorithms to find phrases (e.g.
mutual rank ratio [1])
and other statistics on term distribution (generally: corpus-related stuff).
the idea would be to do