OR harry)* +olfaithfull:stillhere))
typo? (type:1 81)
> I would really think to do this all in one Query. Is this even possible?
How would you want to combine the results?
Regards,
Paul Elschot
in the current version that some scoring is done ahead
for each clause into an unordered buffer.
This helps for top level OR queries, but loses for OR queries that are
subqueries of AND.
The svn version does not score ahead. It reli
quot; and I rewrote the query as:
> c AND (a OR b)
> Would the query run faster?
Exchanging the operands of AND would not make a noticeable difference
in speed. Queries are evaluated by iterating the inverted term index entries
for all query terms in p
On Saturday 19 February 2005 11:02, Erik Hatcher wrote:
>
> On Feb 19, 2005, at 3:52 AM, Paul Elschot wrote:
> >>> By lowercasing the querytext and searching in title_lc ?
> >>
> >> Well sure, but how about this query:
> >>
> >
Erik,
On Saturday 19 February 2005 01:33, Erik Hatcher wrote:
>
> On Feb 18, 2005, at 6:37 PM, Paul Elschot wrote:
>
> > On Friday 18 February 2005 21:55, Erik Hatcher wrote:
> >>
> >> On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:
> >>
> >
On Friday 18 February 2005 21:55, Erik Hatcher wrote:
>
> On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:
>
> > Erik,
> >
> > Just curious: it would seem easier to use multiple fields for the
> > original case and lowercase searching. Is there any parti
Erik,
Just curious: it would seem easier to use multiple fields for the
original case and lowercase searching. Is there any particular reason
you analyzed the documents to multiple indexes instead of multiple fields?
Regards,
Paul Elschot
AND like queries will match in the indexed field.
A gap is implemented by providing a token stream from the analyzer
that has a position increment that equals the gap for the first token in the
stream.
For the first field instance with the same name the gap is not needed.
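For illustration, a rough sketch of such an analyzer wrapper (names
hypothetical; assumes the 1.4 analysis API, where Token.setPositionIncrement()
is available):

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Wrap the real analyzer for the second and later instances of the field;
// the first instance needs no gap.
public class GapAnalyzer extends Analyzer {
  private final Analyzer delegate;
  private final int gap;

  public GapAnalyzer(Analyzer delegate, int gap) {
    this.delegate = delegate;
    this.gap = gap;
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new TokenFilter(delegate.tokenStream(fieldName, reader)) {
      private boolean first = true;
      public Token next() throws IOException {
        Token token = input.next();
        if (token != null && first) {
          // add the gap to the position increment of the first token only
          token.setPositionIncrement(token.getPositionIncrement() + gap);
          first = false;
        }
        return token;
      }
    };
  }
}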
Regards,
Paul Elschot
>
e in IndexSearcher.search().
A profiler might tell you whether that is a bottleneck for your queries.
If it is, there is some code in development that might help
.
In case it turns out that the memory occupied by the BitSet of the filter
is a bottleneck, please check the (very) recen
32 required/prohibited clauses in query".
In the development version this restriction has gone.
The limitation of the maximum clause count (default 1024,
configurable) is still there.
Regards,
Paul Elschot
;(Ljava/io/File;Z)Lorg/apache/lucene/store/FSDirectory;': Interpreting
> >>method.
> >> Please report this error in detail to
> >>http://java.sun.com/cgi-bin/bugreport.cgi
IIRC Java 1.1 had a switch to turn off JIT compilation. It did slow things
down.
On Friday 04 February 2005 17:29, Bill Tschumy wrote:
>
> On Feb 4, 2005, at 10:19 AM, Bill Tschumy wrote:
>
> >
> > On Feb 3, 2005, at 2:04 PM, Paul Elschot wrote:
> >
> >> On Thursday 03 February 2005 20:18, Bill Tschumy wrote:
> >>> Is
ining the field
names of all (other) indexed fields in the document.
Assuming there is always a primary key field the query is then:
+fieldnames:primarykeyfield -fieldnames:specificfield
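Programmatically that is (a sketch, using the
BooleanQuery.add(query, required, prohibited) form):

BooleanQuery query = new BooleanQuery();
// required: matches every document via the always-indexed primary key field name
query.add(new TermQuery(new Term("fieldnames", "primarykeyfield")), true, false);
// prohibited: excludes documents that do have the specific field
query.add(new TermQuery(new Term("fieldnames", "specificfield")), false, true);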
Regards,
Paul Elschot
it also be
> broken?
No.
Currently, the "old" constructor for BooleanClause does not carry the
old state forward.
The "new" constructor does carry the new state backward.
I'll post a fix in bugzilla later.
Thanks,
Paul Elschot.
--
ion 151042.
So much for the few minutes instead of hours,
Paul Elschot.
empty beforehand):
cvs -d :pserver:[EMAIL PROTECTED]:/home/cvspublic checkout -r lucene_1_4_3 -d
lucene-1.4.3 jakarta_lucene
In there you can correct the build.xml file and do:
ant compile
to compile the source code.
Regards,
Paul Elschot
On Wednesday 02 February 2005 20:55, Helen Butler wrote:
>
ilt
has a 1.5 version number because of an incorrect version number
in the 1.4.3 build.xml.
You need to correct the version property in the build.xml file:
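That is, something like the following line (the property name is my
assumption about the 1.4.3 build.xml):

<property name="version" value="1.4.3"/>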
Regards,
Paul Elschot.
On Friday 28 January 2005 22:30, Andy Goodell wrote:
> You should be fine.
For search performance, yes. But the extra field data does slow down
optimization of a modified index because all the field (and index) data
is read and written for that. When the extra data gets bulky, it's normally
better
C1 synC2 ...)
the development version of BooleanQuery might be a bit faster
than the current one.
For an interesting twist in the use of idf please search
for "fuzzy scoring changes" on lucene-dev at the end of 2004.
Regards,
Paul Elschot
> Setting bit on
> Setting bit on
> Setting bit on
> Setting bit on
> Leaving AccountFilter...
> Leaving AccountFilter...
> Leaving AccountFilter...
I don't see any recursion in your code, but this output
suggests nesting three deep
r to the way the MySQL index cache works...
It would be possible to add another level of indexing to the terms.
No one has done this yet, so I guess it's preferred to buy RAM instead...
Regards,
Paul Elschot
ne-dev?
Regards,
Paul Elschot
Sorry for the duplicate on lucene-dev; it should have gone to lucene-user
directly:
A bit more:
On Thursday 06 January 2005 10:22, Paul Elschot wrote:
> On Thursday 06 January 2005 02:17, Andrew Cunningham wrote:
> > Hi all,
> >
> > I'm currently doing a quer
ndexing all word combinations that you're interested in.
Regards,
Paul Elschot
nt or use batches of several documents,
> but you cannot escape the need to serialize the writes.
And while this updating is going on, you can keep another reader open for
searching; it will not be affected by the updates.
After all updates are done, close that reader and reopen
another one to s
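A sketch of that cycle (path hypothetical):

IndexReader reader = IndexReader.open("/path/to/index"); // serves searches during the updates
// ... the updating code runs meanwhile ...
// after all updates are done:
reader.close();
reader = IndexReader.open("/path/to/index"); // this reader sees the updates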
ore (assuming no hits in
the changed field text).
Finally, a change in document score only influences the document
ordering in the search results when another document has a score
that is within the range of the change.
Regards,
Paul Elschot.
d one).
Then inherit from BooleanQuery.BooleanWeight to return the above
Scorer.
Then inherit from BooleanQuery to use the above Weight in createWeight().
Then inherit from QueryParser to use the above Query in getBooleanQuery().
Finally us
s and no further information from the matching
documents, you may consider using your own HitCollector on the
lower level search methods.
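For example, a minimal HitCollector that only counts matches (a sketch;
searcher and query assumed given):

final int[] count = new int[1];
searcher.search(query, new HitCollector() {
  public void collect(int doc, float score) {
    count[0]++; // only the doc number and score are seen; no document is loaded
  }
});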
Regards,
Paul Elschot
>
>
> Question :
>
> When on Search Process, How to Display that this relevant Document Id
> Originated from Which MRG???
>
> [ Some thing like this : - Search word 'ISBN12345' is avalible from
> "MRGx" ]
I thi
ot an error
back from the file system. After that it has put the name of that segment in
the deletable file, so it can try later to delete that segment.
This is known behaviour on FAT file systems. These randomly take some time
for themselves to finish closing a file after it has been corr
a score that sorts by the
number of matching clauses. Higher powers as above can come
a long way, though.
Regards,
Paul Elschot
> Thanks,
> Gururaja
>
> Mike Snare <[EMAIL PROTECTED]> wrote:
> I'm still new to Lucene, but wouldn't that be the coord()? My
> understan
> approach where they could end up laboriously marking
> the entire index as True?
The filter is checked only for search results on the query
over the whole index.
The bit filters generally work well, except when you need
a lot of very sparse filters and memory is a concern.
Regards,
Pau
p Filter could be the first time a user from
a group queries an index after it is opened.
Filters can be cached; see the recent discussion on CachingWrapperFilter
and friends.
Regards,
Paul Elschot
of the primary
key field can serve as the constant value.
Regards,
Paul Elschot
> -Original Message-
> From: Aviran [mailto:[EMAIL PROTECTED]
> Sent: Thursday, December 09, 2004 2:08 PM
> To: 'Lucene Users List'
> Subject: RE: Retrieving all docs in the index
>
a filter does reduce the search space.
A filter might also be used to reduce the I/O for searching, but Lucene
doesn't do that now, probably because there was little to gain.
Regards,
Paul Elschot.
P.S. The code doing the filte
Paul,
On Friday 03 December 2004 23:31, you wrote:
> Hi,
> how would you restrict the search results for a certain user? I'm
One way to restrict results is by using a Filter.
> indexing all the existing data in my application but there are certain
> access levels so some users should see more r
On Friday 03 December 2004 08:43, Paul Elschot wrote:
> On Friday 03 December 2004 07:50, Chris Hostetter wrote:
...
> > So, If I'm understanding you (and the javadocs) correctly, the real key
> > here is maxMergeDocs. It seems like addDocument will never merge a
> > s
ergeFactor at 10, the
1000th added document will create a segment of size 1000.
With maxMergeDocs at a lower value than 1000, the last merge (of the 10
segments with 100 docs each) will not be done.
optimize() uses minMergeDocs for its final merges, but it ignores
maxMergeDocs.
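In code, using the public tuning fields of the 1.4 IndexWriter (analyzer
assumed given; values as in the example above):

IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);
writer.mergeFactor = 10;   // the default: merge every 10 segments of the same size
writer.maxMergeDocs = 999; // keeps the ten 100-doc segments from merging into 1000
writer.minMergeDocs = 10;  // the default; optimize() uses this for its final merges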
Regards,
Pau
ocs/api/org/apache/lucene/search/Similarity.html
See also the DefaultSimilarity.
Regards,
Paul Elschot
and/or Refactoring
on how to get rid of the parallel class hierarchy. That could also
involve some sort of accrual scorer and Lucene's Similarity.
Regards,
Paul Elschot
> -Ken
>
> On Sat, 13 Nov 2004 12:07:05 +0100, Paul Elschot <[EMAIL PROTECTED]>
wrote:
> > On Frida
ut that will probably not make a difference.
Adding the documents can be done with multiple threads.
Last time I checked that, there was a moderate speed up
using three threads instead of one on a single CPU machine.
Tuning the values of minMergeDocs and maxMergeDocs
may also help t
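A sketch of such threading (nextDocument() is a hypothetical shared document
source; IndexWriter.addDocument() synchronizes internally):

final IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);
Thread[] threads = new Thread[3];
for (int i = 0; i < threads.length; i++) {
  threads[i] = new Thread() {
    public void run() {
      Document doc;
      while ((doc = nextDocument()) != null) {
        try {
          writer.addDocument(doc);
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }
    }
  };
  threads[i].start();
}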
Chris,
On Tuesday 23 November 2004 03:25, Hoss wrote:
> (NOTE: numbers in [] indicate Footnotes)
>
> I'm rather new to Lucene (and this list), so if I'm grossly
> misunderstanding things, forgive me.
>
> One of my main needs as I investigate Search technologies is to restrict
> results based on
On Monday 22 November 2004 05:02, Kauler, Leto S wrote:
> Hi Lucene list,
>
> We have the need for analysed and 'not analysed/not tokenised' clauses
> within one query. Imagine an unparsed query like:
>
> +title:"Hello World" +path:Resources\Live\1
>
> In the above example we would want the fir
On Thursday 18 November 2004 16:57, Rupinder Singh Mazara wrote:
> hi all
>
> I needed some help in solving the following problem
> a user executes query1 and query2
>
> both the queries( not result sets ) get stored, over time the user
> wants to find
> which documents from query1 are commo
> and with first 2 columns documents will be displayed in a 2D-space.
> Does anyone work on a project like this?
I don't know. Is there a good SVD package for Java?
Regards,
Paul Elschot
On Wednesday 17 November 2004 07:10, Karthik N S wrote:
> Hi guy's
>
>
> Apologies.
>
>
> So A Merged Index is again a Single [ addition of subIndexes... ),
>
> If that case , If One of the Field Types is of type 'Field.Keyword'
> which is Unique across the subIndexes [Before Mergin
On Wednesday 17 November 2004 01:20, Edwin Tang wrote:
> Hello,
>
> I have been using DateFilter to limit my search results to a certain date
> range. I am now asked to replace this filter with one where my search
results
> have document IDs greater than a given document ID. This document ID is
>
g dates into day and time components.
Once you approach 1000 days, you'll get the same problem again,
so you might want to use a filter for the dates.
See DateFilter and the archives on MMDD.
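A sketch of the split at indexing time (field names hypothetical):

SimpleDateFormat dayFormat = new SimpleDateFormat("yyyyMMdd");
SimpleDateFormat timeFormat = new SimpleDateFormat("HHmmss");
doc.add(Field.Keyword("day", dayFormat.format(date)));
doc.add(Field.Keyword("time", timeFormat.format(date)));
// a range over days now expands to one term per day at most:
Query query = new RangeQuery(new Term("day", "20040101"), new Term("day", "20041231"), true);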
Regards,
Paul Elschot.
ncations?
To have only longer matches one can also use queries with
multiple ? characters, each matching exactly one character.
I think it would be better to encourage the users to use longer
and maybe also more prefixes. This gives m
On Friday 12 November 2004 22:56, Chuck Williams wrote:
> I had a similar need and wrote MaxDisjunctionQuery and
> MaxDisjunctionScorer. Unfortunately these are not available as a patch
> but I've included the original message below that has the code (modulo
> line breaks added by simple text emai
to return that weight.
- override QueryParser.getBooleanQuery() to return that query
in the cases you want, that is when all clauses are optional.
"replace" usually means "inherit from" in new code.
When you need more info on this, try lucene-dev.
Regards,
Paul Elschot.
e actual tradeoff depends on the user
requirements and the time and memory available on the server,
so the users get what they pay for.
Imposing a minimum prefix length can be done by overriding the method
in QueryParser that provides a prefix query.
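For example (the minimum length of 3 is arbitrary; assumes the protected
getPrefixQuery() hook of the 1.4 QueryParser):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class MinPrefixQueryParser extends QueryParser {
  public MinPrefixQueryParser(String field, Analyzer analyzer) {
    super(field, analyzer);
  }
  protected Query getPrefixQuery(String field, String termStr) throws ParseException {
    if (termStr.length() < 3) { // reject prefixes shorter than 3 characters
      throw new ParseException("Prefix too short: " + termStr);
    }
    return super.getPrefixQuery(field, termStr);
  }
}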
Regards,
Paul Elschot
word 2 or "word3 word4"~4)"~2
SpanQueries can also enforce an order on the matching subqueries,
but that is difficult to express in the current query syntax.
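In code the order is just the third constructor argument (terms hypothetical):

SpanQuery[] clauses = new SpanQuery[] {
  new SpanTermQuery(new Term("body", "word1")),
  new SpanTermQuery(new Term("body", "word2"))
};
Query query = new SpanNearQuery(clauses, 4, true); // slop 4; true = in order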
Regards,
Paul Elschot
On Tuesday 09 November 2004 23:14, Luke Francl wrote:
> On Tue, 2004-11-09 at 16:00, Paul Elschot wrote:
>
> > Lucene has no provision for matching by being prohibited only. This can
> > be achieved by indexing something for each document that can be
> > used in queries t
replace the value efficiently.
The only updates available are on the field norms.
Regards,
Paul Elschot
g not having + or - prefix is optional and only influences the score.
In case there is nothing required by a + prefix, at least one of the things
without prefix is required.
Regards,
Paul Elschot.
analyzer.
It's an unusual solution, though.
Regards,
Paul Elschot
rms in the phrases.
Another way is to avoid using the term positions by querying for words
instead of phrases.
In case you have hardware/resources there are more options
like using faster disks and/or using RAM for critical parts of the index.
Lucene can use extra RAM in
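For example, the whole index can be loaded into a RAMDirectory and searched
there (a sketch, path hypothetical):

RAMDirectory ramDir = new RAMDirectory("/path/to/index"); // copies the index into RAM
IndexSearcher searcher = new IndexSearcher(ramDir);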
. Just retrieve what you
need from the Lucene index in the order of the docId's.
Try and store as little data per document as possible.
> about updating the database when the documentID is created?
To know the docId use an indexed primary key in lucene and search
for it using IndexReader.termD
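Completing that as a sketch (field name "pk" hypothetical; reader assumed
given):

TermDocs termDocs = reader.termDocs(new Term("pk", primaryKeyValue));
int docId = -1;
if (termDocs.next()) {
  docId = termDocs.doc(); // a unique key matches at most one document
}
termDocs.close();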
ad, 10-15% iirc. More threads
were of no use for me in that case.
Regards,
Paul Elschot
> Otis
>
> --- Chris Fraschetti <[EMAIL PROTECTED]> wrote:
> > if i have four threads all trying to call my index function, will
> > lucene do what is necessary for each threa
or each term,
and one scorer to combine the other two to provide the search results,
usually a BooleanScorer or a ConjunctionScorer.
For proximity queries, other scorers are used.
Regards,
Paul Elschot
On Tuesday 12 October 2004 19:27, Paul Elschot wrote:
>
> IndexReader.open(indexName).termDocs(new Term(field,
> term)).skipTo(documentNr)
>
> returns the boolean indicating that.
Well, almost. When it returns true one still needs to check the TermDocs
for being at the documentNr
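So the complete check is, as a sketch:

TermDocs termDocs = IndexReader.open(indexName).termDocs(new Term(field, term));
boolean inDocument = termDocs.skipTo(documentNr)
                  && termDocs.doc() == documentNr; // skipTo() may land beyond documentNr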
erm is
in a field of a document:
IndexReader.open(indexName).termDocs(new Term(field, term)).skipTo(documentNr)
returns the boolean indicating that.
What do you need the {0,1} values for?
Regards,
Paul Elschot.
ay to find out number of indexed terms for each
>
> document?
By default, the stored norm is the inverse square root of
the number of indexed terms of an indexed document field.
The encoding/decoding is somewhat rough, though.
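Through the 1.4 Similarity API that looks like (a sketch; numTerms assumed
given):

float norm = new DefaultSimilarity().lengthNorm("body", numTerms); // 1/sqrt(numTerms)
byte encoded = Similarity.encodeNorm(norm); // stored as a single byte, hence the roughness
float decoded = Similarity.decodeNorm(encoded);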
Regards,
Paul Elschot
order of the search results much.
Taking the square roots of the query term weights would have
the query weights directly applied to the query term density in the document field,
whereas now the weights seem to be applied to the square root of the density.
The density value is an approximation
ed the 1.000.000 results?
Regards,
Paul Elschot
dcard. As each clause ends up using
some buffer memory internally, a maximum was introduced to
avoid running out of memory.
You can change the maximum number of added clauses using
BooleanQuery.setMaxClauseCount() but then it is advisable
to monitor memory usage, and evt. i
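For example (the limit value is arbitrary; assumes the 1.4 API, where the
overflow surfaces as BooleanQuery.TooManyClauses):

BooleanQuery.setMaxClauseCount(4096); // the default is 1024
try {
  Hits hits = searcher.search(query);
} catch (BooleanQuery.TooManyClauses e) {
  // still too many expanded terms: use a longer prefix or a Filter instead
}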
an IndexReader for this
> particular search, where all other searches use the pool. Suggestions?
You could use a map from the IndexSearcher back to the IndexReader that was
used to create it. (It's a bit of a waste because the IndexSearcher has a reader
attribute inte
On Monday 20 September 2004 20:54, Shawn Konopinsky wrote:
> Hey Paul,
>
> Thanks for the quick reply. Excuse my ignorance, but what do I do with the
> generated BitSet?
You can return it in the bits() method of the object implementing your
org.apache.lucene.search.Filter (http://jakarta.apache
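A minimal sketch (class name hypothetical; the BitSet must use the document
numbers of the reader the search runs on):

import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;

public class PrecomputedFilter extends Filter {
  private final BitSet bits;
  public PrecomputedFilter(BitSet bits) {
    this.bits = bits;
  }
  public BitSet bits(IndexReader reader) {
    return bits; // one bit per document number; a set bit lets the document through
  }
}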
oots?
This would allow a more straightforward comprehension
of the term weights as directly weighing the term densities.
Section 5 of the reference above has the full weighted
p-Norm formulas. The OR p-Norm there is very close
to the Lucene formula without coord().
Regards,
Paul Elschot
on the latest version to see the code).
and iterate over your doc ids instead of over dates.
This will give you a filter for the doc ids you want to query.
Regards,
Paul Elschot
Kevin,
On Sunday 05 September 2004 23:13, Kevin A. Burton wrote:
> Paul Elschot wrote:
> >Kevin,
> >
> >On Sunday 05 September 2004 10:16, Kevin A. Burton wrote:
> >>I want to sort a result set but perform a group by as well... IE remove
> >>duplicate
case you can define another field that defines what is a duplicate
by having the same value for duplicates, you can use it as one of the
SortField's for sorting.
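Sketched (the field name is hypothetical):

Sort sort = new Sort(new SortField[] {
  new SortField("dupKey"), // same value in this field means duplicate
  SortField.FIELD_SCORE
});
Hits hits = searcher.search(query, sort);
// duplicates now come out adjacent and can be skipped while iterating the hits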
Regards,
Paul Elschot
.4.1 is out, but it's not available there yet.
In case you want that version please ask on lucene-dev.
Regards,
Paul Elschot
l
> be indexing about 400.000 messages per month.
To easily keep the primary keys in sync between the SQL db and Lucene,
I'd start by keeping the images and the full text only in the SQL db.
Lucene optimisations (needed after adding/deleting docs) copy all
's and then use the 2nd index to qualify
> the full text search over the document table. The reason I want to do
> this is to reduce the numbers of documents that the full text query will
> run.
Regards,
Paul Elschot
e outgoing URL's. Crawlers also keep track
of multiple host names resolving to the same IP address.
In case you need to crawl and index an intranet or more, have a look
at Nutch.
Regards,
Paul Elschot
e
web site.
You can then see the total disk size of, for example, the stored fields.
Regards,
Paul Elschot
Kevin,
On Thursday 05 August 2004 23:32, Kevin A. Burton wrote:
> I'm trying to compute a filter to match documents in our index by a set
> of terms.
>
> For example some documents have a given field 'category' so I need to
> compute a filter with multiple categories.
>
> The problem is that our
On Wednesday 04 August 2004 18:22, John Z wrote:
> Hi
>
> I had a question related to number of fields in a document. Is there any
> limit to the number of fields you can have in an index.
>
> We have around 25-30 fields per document at present, about 6 are keywords,
> Around 6 stored, but not ind
On Monday 26 July 2004 21:41, John Patterson wrote:
> Is there any way to cache TermDocs? Is this a good idea?
Lucene does this internally by buffering
up to 32 document numbers in advance for a query Term.
You can view the details here in case you're interested:
http://cvs.apache.org/viewcvs.cg
ussions on creating new
> query parsers (one size doesn't fit all, I don't think) and what syntax
> should be used.
>
> Paul Elschot created a "surround" query parser that he posted about to
> the list in April.
>
> Erik
Here is a bit about the syn
On Thursday 11 March 2004 06:15, Tomcat Programmer wrote:
> I have a situation where I need to be able to find
> incomplete word matches, for example a search for the
> string 'ape' would return matches for 'grapes'
> 'naples' 'staples' etc. I have been searching the
> archives of this user list a