Re: Document term vectors in Lucene 4

2013-01-17 Thread Ian Lea
When I run your code, as is except for using RAMDirectory and setting up an IndexWriter using StandardAnalyzer RAMDirectory dir = new RAMDirectory(); Analyzer anl = new StandardAnalyzer(Version.LUCENE_40); IndexWriterConfig iwcfg = new IndexWriterConfig(Version.LUCENE_40, a

Re: How to control the lucence index storage size?

2013-01-17 Thread Ian Lea
There's no way to set such a limit within lucene that I know of. If you really need this you could implement something outside lucene to monitor the index directory and do something (what???) when the limit was exceeded. Don't forget that disk usage will vary over time as segments are merged, rea
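
The "monitor the index directory from outside Lucene" idea could look roughly like the sketch below. It uses only the JDK (no Lucene calls); the class name IndexSizeMonitor, its methods, and the choice of limit are assumptions made for illustration, and what to do once the limit is exceeded is left open, as in the message above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper, not part of Lucene: sums the size of all files under an index directory.
class IndexSizeMonitor {

    // Total size in bytes of all regular files under dir (recursive).
    static long directorySize(Path dir) {
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths.filter(Files::isRegularFile)
                        .mapToLong(IndexSizeMonitor::sizeOrZero)
                        .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Segment files can disappear while a merge is running, so a missing file counts as 0 bytes.
    private static long sizeOrZero(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            return 0L;
        }
    }

    // True when the directory currently uses more than limitBytes.
    static boolean exceedsLimit(Path dir, long limitBytes) {
        return directorySize(dir) > limitBytes;
    }

    public static void main(String[] args) {
        // Assumed usage: java IndexSizeMonitor <indexDir> <limitBytes>
        Path dir = Paths.get(args[0]);
        long limit = Long.parseLong(args[1]);
        System.out.println(exceedsLimit(dir, limit) ? "limit exceeded" : "within limit");
    }
}
```

Because disk usage rises temporarily during merges, a single reading above the limit is not necessarily a problem; a real monitor would probably want to sample repeatedly before reacting.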

Re: Document term vectors in Lucene 4

2013-01-17 Thread Jon Stewart
Thanks very much for your reply, Ian. I am using SlowCompositeReaderWrapper because I am also retrieving the term frequency statistics for the corpus (at the end of the day, I am doing some machine learning/document clustering). Despite its name and warning documentation not to use it, SlowComposi

RE: Suggesters: circumfix suggestions

2013-01-17 Thread Oliver Christ
In our case (very similar to the "Netflix movie titles" use case) the AnalyzingSuggester's FST grows by a factor of ~5 when we generate the token graph. Looking up and joining individual "postings lists" for the individual tokens would certainly work, but is certainly more work than injecting a to

Re: Document term vectors in Lucene 4

2013-01-17 Thread Robert Muir
Which statistics in particular (which methods)? On Thu, Jan 17, 2013 at 5:10 AM, Jon Stewart wrote: > Thanks very much for your reply, Ian. > > I am using SlowCompositeReaderWrapper because I am also retrieving the > term frequency statistics for the corpus (at the end of the day, I am > doing so

Re: BlockJoin and RawTermFilter (lucene 4.0.0)

2013-01-17 Thread Martijn v Groningen
I don't recall that the RawTermFilter was required. The following code should also work in 4.x: Filter parentsFilter = new CachingWrapperFilter(new QueryWrapperFilter(new TermQuery(new Term("type", "T1")))); Martijn On 16 January 2013 16:51, kiwi clive wrote: > Hi Guys, > > Apologies if this has

Re: BlockJoin and RawTermFilter (lucene 4.0.0)

2013-01-17 Thread Michael McCandless
Right, RawTermFilter is no longer needed (because we changed how deleted docs are handled in 4.0). Mike McCandless http://blog.mikemccandless.com On Thu, Jan 17, 2013 at 9:39 AM, Martijn v Groningen wrote: > I don't recall that the RawTermFilter was required. The following code > should also wo

Re: Document term vectors in Lucene 4

2013-01-17 Thread Jon Stewart
On Thu, Jan 17, 2013 at 9:08 AM, Robert Muir wrote: > Which statistics in particular (which methods)? I'd like to know the frequency of each term in each document. Those term counts for the most frequent terms in the corpus will make it into the document vectors for clustering. Looking at Terms

Re: Document term vectors in Lucene 4

2013-01-17 Thread Ian Lea
typo time. You need doc2.add(...) not 2 doc.add(...) statements. -- Ian. On Thu, Jan 17, 2013 at 2:49 PM, Jon Stewart wrote: > On Thu, Jan 17, 2013 at 9:08 AM, Robert Muir wrote: >> Which statistics in particular (which methods)? > > I'd like to know the frequency of each term in each docume

Re: Document term vectors in Lucene 4

2013-01-17 Thread Jon Stewart
D'oh Thanks! Does TermsEnum.totalTermFreq() return the per-doc frequencies? It looks like it empirically, but the documentation refers to corpus usage, not document.field usage. Jon On Thu, Jan 17, 2013 at 10:00 AM, Ian Lea wrote: > typo time. You need doc2.add(...) not 2 doc.add(...) stat

Combine two BooleanQueries by a SpanNearQuery.

2013-01-17 Thread Michel Conrad
Hi, I am looking to get a combination of multiple subqueries. What I want to do is to have two queries which have to be near one another. As an example: Query1: (A AND (B OR C)) Query2: D Then I want to use something like a SpanNearQuery to combine both (slop 5): Both would then have to matc

mmap loads the entire index into memory during forceMergeDeletes/forceMerge(int)

2013-01-17 Thread v . sevel
Hi, On a 256 GB RAM machine, we have half of our IT system running. Part of it is two Lucene applications, each managing an index of approximately 100 GB. These applications are used to index logging events, and every night there is a purge, followed by a forceMergeDeletes to reclaim disk space (and

Re: Combine two BooleanQueries by a SpanNearQuery.

2013-01-17 Thread Jack Krupansky
You need to express the "boolean" query solely in terms of SpanOrQuery and SpanNearQuery. If you can't, ... then it probably can't be done, but you should be able to. How about starting with a plain English description of the problem you are trying to solve? -- Jack Krupansky -Original M

Re: Combine two BooleanQueries by a SpanNearQuery.

2013-01-17 Thread Michel Conrad
The problem I would like to solve is to have two queries that I will get from the query parser (this could include wildcardqueries and phrasequeries). Both of these queries would have to match the document and as an additional restriction I would like to add that a matching term from the first quer

RE: Combine two BooleanQueries by a SpanNearQuery.

2013-01-17 Thread Michael Ryan
I've had to do something exactly like this. My approach was to turn AND queries into a SpanNearQuery with a slop of Integer.MAX_VALUE and inOrder false, and to turn OR queries into a SpanOrQuery. It's a bit hacky, but is much simpler than creating your own Query class to implement this. -Michae

Re: Combine two BooleanQueries by a SpanNearQuery.

2013-01-17 Thread Jack Krupansky
Currently there isn't. SpanNearQuery can take only other SpanQuery objects, which includes other spans, span terms, and span wrapped multi-term queries (e.g., wildcard, fuzzy query), but not Boolean queries. But it does sound like a good feature request. There is SpanNotQuery, so you can exclu

RE: mmap loads the entire index into memory during forceMergeDeletes/forceMerge(int)

2013-01-17 Thread Uwe Schindler
The answer is simple: It loads the whole index into RAM because it has the RAM available for free use. If other apps were actually using that RAM, this would not happen. It makes no difference if you use another Lucene directory implementation; merging is a heavy IO operation, so it wi

Re: Any benchmark corps to evaluate performance of specified query?

2013-01-17 Thread Otis Gospodnetic
Hi, Maybe https://github.com/sematext/ActionGenerator could be of help? We use it to produce query load for Solr and ElasticSearch and the whole thing is extensible, so you could easily add support for talking directly to Lucene. Oh, and there is the benchmark in Lucene: http://lucene.apache.or