When I run your code, as is except for using RAMDirectory and setting
up an IndexWriter using StandardAnalyzer
RAMDirectory dir = new RAMDirectory();
Analyzer anl = new StandardAnalyzer(Version.LUCENE_40);
IndexWriterConfig iwcfg = new IndexWriterConfig(Version.LUCENE_40, anl);
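(Presumably followed by opening the writer on that directory; a minimal guess at the continuation, not the original code:)

IndexWriter writer = new IndexWriter(dir, iwcfg);
// ... addDocument() calls, then writer.commit() / writer.close()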
There's no way to set such a limit within Lucene that I know of. If
you really need this you could implement something outside Lucene to
monitor the index directory and do something (what???) when the limit
was exceeded.
Don't forget that disk usage will vary over time as segments are
merged, rea
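Just to illustrate the "monitor the index directory" idea, here is a minimal
sketch; the class name, the size limit, and what to do when it is exceeded are
all made up for the example and are not Lucene APIs:

import java.io.File;

// Hypothetical watchdog, not part of Lucene: sums the file sizes in the index
// directory and lets the caller decide what to do when a limit is exceeded.
public class IndexSizeWatcher {
  public static long indexSizeBytes(File indexDir) {
    long total = 0;
    File[] files = indexDir.listFiles();
    if (files != null) {
      for (File f : files) {
        if (f.isFile()) {
          total += f.length();
        }
      }
    }
    return total;
  }

  public static boolean overLimit(File indexDir, long limitBytes) {
    return indexSizeBytes(indexDir) > limitBytes;
  }
}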
Thanks very much for your reply, Ian.
I am using SlowCompositeReaderWrapper because I am also retrieving the
term frequency statistics for the corpus (at the end of the day, I am
doing some machine learning/document clustering). Despite its name and
the documentation's warning not to use it, SlowComposi
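(For reference, wrapping the composite reader in 4.x looks roughly like this,
assuming dir is the index Directory and imports from org.apache.lucene.index:)

DirectoryReader dirReader = DirectoryReader.open(dir);
// Merges all segments into a single atomic view; convenient but slow, as the name warns.
AtomicReader atomicReader = SlowCompositeReaderWrapper.wrap(dirReader);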
In our case (very similar to the "Netflix movie titles" use case) the
AnalyzingSuggester's FST grows by a factor of ~5 when we generate the
token graph.
Looking up and joining individual "postings lists" for the individual
tokens would certainly work, but it is more work than injecting a
to
Which statistics in particular (which methods)?
On Thu, Jan 17, 2013 at 5:10 AM, Jon Stewart
wrote:
> Thanks very much for your reply, Ian.
>
> I am using SlowCompositeReaderWrapper because I am also retrieving the
> term frequency statistics for the corpus (at the end of the day, I am
> doing so
I don't recall that the RawTermFilter was required. The following code
should also work in 4.x:
Filter parentsFilter = new CachingWrapperFilter(new
    QueryWrapperFilter(new TermQuery(new Term("type", "T1"))));
Martijn
On 16 January 2013 16:51, kiwi clive wrote:
> Hi Guys,
>
> Apologies if this has
Right, RawTermFilter is no longer needed (because we changed how
deleted docs are handled in 4.0).
Mike McCandless
http://blog.mikemccandless.com
On Thu, Jan 17, 2013 at 9:39 AM, Martijn v Groningen
wrote:
> I don't recall that the RawTermFilter was required. The following code
> should also wo
On Thu, Jan 17, 2013 at 9:08 AM, Robert Muir wrote:
> Which statistics in particular (which methods)?
I'd like to know the frequency of each term in each document. Those
term counts for the most frequent terms in the corpus will make it
into the document vectors for clustering.
Looking at Terms
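One way to read those per-document counts in 4.x is to walk the postings of each
term and call DocsEnum.freq(); a rough sketch, assuming an AtomicReader (e.g. via
SlowCompositeReaderWrapper), a field named "body" indexed with frequencies, and the
usual org.apache.lucene.index/util imports:

Terms terms = atomicReader.terms("body");
if (terms != null) {
  TermsEnum termsEnum = terms.iterator(null);
  BytesRef term;
  while ((term = termsEnum.next()) != null) {
    // FLAG_FREQS asks the postings for per-document frequencies
    DocsEnum docsEnum = termsEnum.docs(null, null, DocsEnum.FLAG_FREQS);
    int doc;
    while ((doc = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
      int freqInDoc = docsEnum.freq(); // count of this term in document "doc"
      // ... accumulate into the vector for document "doc"
    }
  }
}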
Typo time. You need doc2.add(...), not two doc.add(...) statements.
--
Ian.
On Thu, Jan 17, 2013 at 2:49 PM, Jon Stewart
wrote:
> On Thu, Jan 17, 2013 at 9:08 AM, Robert Muir wrote:
>> Which statistics in particular (which methods)?
>
> I'd like to know the frequency of each term in each docume
D'oh! Thanks!
Does TermsEnum.totalTermFreq() return the per-doc frequencies? It
looks that way empirically, but the documentation refers to corpus
usage, not document.field usage.
Jon
On Thu, Jan 17, 2013 at 10:00 AM, Ian Lea wrote:
> typo time. You need doc2.add(...) not 2 doc.add(...) stat
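For what it's worth, the place where totalTermFreq() does behave per-document is a
term vector, because the Terms returned for a single document is scoped to just that
document. A sketch, assuming term vectors were stored for the field:

Terms vector = reader.getTermVector(docId, "body");
if (vector != null) {
  TermsEnum te = vector.iterator(null);
  BytesRef term;
  while ((term = te.next()) != null) {
    // Scoped to one document, so this is the within-document frequency
    long freqInThisDoc = te.totalTermFreq();
  }
}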
Hi,
I am looking to get a combination of multiple subqueries.
What I want to do is to have two queries which have to be near one another.
As an example:
Query1: (A AND (B OR C))
Query2: D
Then I want to use something like a SpanNearQuery to combine both (slop 5):
Both would then have to matc
Hi,
On a 256 GB RAM machine, we have half of our IT system running.
Part of it are two Lucene applications, each managing an approximately 100 GB
index.
These applications are used to index logging events, and every night there is a
purge, followed by a forceMergeDeletes to reclaim disk space (and
You need to express the "boolean" query solely in terms of SpanOrQuery and
SpanNearQuery. If you can't, ... then it probably can't be done, but you
should be able to.
How about starting with a plain English description of the problem you are
trying to solve?
-- Jack Krupansky
The problem I would like to solve is to have two queries that I will
get from the query parser (this could include wildcard queries and
phrase queries).
Both of these queries would have to match the document, and as an
additional restriction I would like to add that a matching term from
the first quer
I've had to do something exactly like this. My approach was to turn AND queries
into a SpanNearQuery with a slop of Integer.MAX_VALUE and inOrder false, and to
turn OR queries into a SpanOrQuery. It's a bit hacky, but is much simpler than
creating your own Query class to implement this.
-Michae
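A rough sketch of that rewrite for the earlier example, (A AND (B OR C)) near D
within a slop of 5, assuming a single field named "body" and the usual
org.apache.lucene.search.spans imports:

SpanQuery a = new SpanTermQuery(new Term("body", "a"));
SpanQuery b = new SpanTermQuery(new Term("body", "b"));
SpanQuery c = new SpanTermQuery(new Term("body", "c"));
// OR  -> SpanOrQuery
SpanQuery bOrC = new SpanOrQuery(b, c);
// AND -> SpanNearQuery with unlimited slop, unordered
SpanQuery query1 = new SpanNearQuery(new SpanQuery[] { a, bOrC }, Integer.MAX_VALUE, false);
SpanQuery query2 = new SpanTermQuery(new Term("body", "d"));
// Finally require the two subqueries to occur within 5 positions of each other
SpanQuery combined = new SpanNearQuery(new SpanQuery[] { query1, query2 }, 5, false);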
Currently there isn't. SpanNearQuery can take only other SpanQuery objects
(other span queries, span terms, and span-wrapped multi-term queries such as
wildcard and fuzzy queries), but not Boolean queries.
But it does sound like a good feature request.
There is SpanNotQuery, so you can exclu
The answer is simple: It loads the whole index into RAM because it has the RAM
available for free use. If there were apps actually using that RAM,
this would not happen.
There is no difference if you use another Lucene directory implementation;
merging is a heavy IO operation, so it wi
Hi,
Maybe https://github.com/sematext/ActionGenerator could be of help?
We use it to produce query load for Solr and ElasticSearch and the whole thing
is extensible, so you could easily add support for talking directly to Lucene.
Oh, and there is the benchmark in Lucene:
http://lucene.apache.or