Hi all,
I've got an indexing issue I think other folks might be interested in
hearing about and I wanted to get feedback before I went ahead and
implemented a new method.
Currently, the way we update indices is by sending individual delete/add
document requests to all our search boxes individuall
Hi all,
I'm working on upgrading to Lucene 2.4.0 from 2.3.2 and was trying to
integrate the new DocIdSet changes since the o.a.l.search.Filter#bits() method
is now deprecated. For our app we actually rely heavily on bits from the
Filter to do post-query filtering (I explain why below).
For example,
c 9, 2008 at 1:47 AM, Michael McCandless <
[EMAIL PROTECTED]> wrote:
>
> This use case sounds a lot like faceted navigation, which Solr provides.
>
> Mike
>
>
> Michael Stoppelman wrote:
>
> Hi all,
>>
>> I'm working on upgrading to Lucene 2.4.0 fr
On Sat, Nov 29, 2008 at 11:11 AM, Yonik Seeley wrote:
> On Sat, Nov 29, 2008 at 12:45 PM, Michael Stoppelman
> wrote:
> > Hi all,
> >
> > I've got an indexing issue I think other folks might be interested in
> > hearing about and I wanted to get feedback befo
I've got a question from Doug's original email about replication (
http://www.mail-archive.com/lucene-u...@jakarta.apache.org/msg12709.html):
"1. On the index master, periodically checkpoint the index. Every minute or
so the IndexWriter is closed and a 'cp -lr index index.DATE' command is
executed
Hi Yonik,
Thanks for the response.
reply inline.
On Tue, Dec 16, 2008 at 6:44 AM, Yonik Seeley wrote:
> On Tue, Dec 16, 2008 at 1:04 AM, Michael Stoppelman
> wrote:
> > I've got a question from Doug's original email about replication (
> > http://w
Hi all,
My search backends are only able to eke out 13-15 qps even with the entire
index in memory (this makes it very expensive to scale). According to my
YourKit profiler 80% of the program's time ends up in highlighting. With
highlighting disabled my backend gets about 45-50 qps (cheaper scalin
On Tue, Feb 3, 2009 at 7:26 AM, John Byrne wrote:
> Hi,
>
> I've got a weird problem with a lucene index, using 2.3.1. The index
> contains 6660 files. I don't know how this happened. Maybe someone can tell me
> something about the files themselves? (examples below)
>
> On one day, between 10 and 4
a little more detail; I'm not exactly sure what you
mean.
> Cheers
> Mark
>
>
>
> - Original Message
> From: Michael Stoppelman
> To: java-user@lucene.apache.org
> Sent: Tuesday, 3 February, 2009 7:24:06
> Subject: Poor QPS with highlighting
>
>
Thanks Mark for the explanation. I think your solution would definitely
change the tf-idf scoring for documents since your field is now split up
over multiple docs. One option to get around the changing scoring would be
to run a completely separate index for highlighting (with the overlapping
d
On Thu, Feb 5, 2009 at 9:05 AM, Jason Rutherglen wrote:
> Google uses dedicated highlighting servers. Maybe this architecture would
> work for you.
>
What's your reference? I used to work at Google.
>
> On Mon, Feb 2, 2009 at 11:24 PM, Michael Stoppelman >wrote:
>
On Thu, Feb 5, 2009 at 12:47 PM, Michael Stoppelman wrote:
>
>
> On Thu, Feb 5, 2009 at 9:05 AM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> Google uses dedicated highlighting servers. Maybe this architecture would
>> work for you.
>>
>
Fuzzy search tends to be super heavy on CPU because of the Levenshtein
distance algo. We use it for a small index (60MB) for spell correcting and our
QPS suffers as a result.
There was recently a discussion of a new fuzzy algorithm:
https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian
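
To make the CPU cost mentioned above concrete, here is a minimal sketch of the textbook dynamic-programming Levenshtein distance. It is not Lucene's own fuzzy-matching code; the class and method names (EditDistance, levenshtein) are made up for illustration. Each comparison costs O(m*n) for terms of length m and n, so running it against many candidate terms per query adds up quickly.

```java
public class EditDistance {
    // Textbook Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions needed to turn a into b.
    // Only two rows of the DP table are kept, so memory is O(b.length()).
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // i deletions to reach the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Swap rows so the row just filled becomes "prev" for the next i.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
    }
}
```

The nested loop is the whole story: every character of one term is compared against every character of the other, which is why a spell-correction index that tries many candidate terms per query can dominate CPU time.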
Hi Ken,
I found this post on the Lucene documentation page:
http://wiki.apache.org/lucene-java/LuceneFAQ#head-912c1f237bb00259185353182948e5935f0c2f03
In practice you sometimes need to have a cut-off or boost factor post tf-idf
scoring. The way I've been going about it is by picking values and se
If another thread is executing a query with the handle to one of readers[i]
you're going to kill it since the IndexReader is now closed.
Just don't call the IndexReader#close() method. If nothing is pointing at
the readers they should be garbage collected. Also, you might
want to warm up your new I
ime.
M
On Wed, Feb 25, 2009 at 10:48 PM, Michael Stoppelman wrote:
> Hi Ken,
>
> I found this post on the Lucene documentation page:
> http://wiki.apache.org/lucene-java/LuceneFAQ#head-912c1f237bb00259185353182948e5935f0c2f03
>
> In practice you sometimes need to have a cut-off
I guess I don't really understand this comment in the similarity java doc
then:
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
*queryNorm(q)* is a normalizing factor used to make scores between queries
comparable.
:/.
M
On Fri, Feb 27, 2009
On Sat, Jul 7, 2007 at 8:19 PM, Chun Wei Ho wrote:
> We are currently running a search service with a single Lucene index
> of about 10 GB. We would like to find out:
>
> (a) What is the usual index size of everyone else? How large have
> Lucene indexes gone in production environments, and is there
Another potential idea would be to break up the index into N indices
such that each index is small enough to fit two in memory and then you
can swap them.
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/index/MultiReader.html
This is just an idea, I haven't tri
Hi all,
I was wondering why the InstantiatedIndex gets very slow as the number of
documents increases in the index. I've been looking at the source and have
only found comments saying "it's slow" when the index is big but not why. Do
folks just run out of memory or something deeper?
Thanks for th
Is this jar going to be in the next release of lucene? Also, are these the
same as the changes in the following patch:
https://issues.apache.org/jira/secure/attachment/12362653/spanhighlighter10.patch
-M
On 6/27/07, Mark Miller <[EMAIL PROTECTED]> wrote:
>
>
> > I have not looked at any highlight
ault Lucene Query syntax.
>
> Whether it is included soon or not, the code works well and I will
> continue to support it.
>
> - Mark
>
> Michael Stoppelman wrote:
> > Is this jar going to be in the next release of lucene? Also, are these
> the
> > same as th
Kalvir,
Have you tried giving the name field a boost? E.g. name:(John Smith)^10
alias:(John Smith)
-M
On 8/31/07, Kalvir Sandhu <[EMAIL PROTECTED]> wrote:
>
> Hi all.
>
> I am working on building a lucene index to search names of people. I want
> to
> be able to score things differently. Here i
Most of the time the highlighter uses is spent getting the next token from the
analyzer (tokenStream.next()). I'm wondering how I can access the tokens
that are stored in Lucene (or store another copy of the TokenStream separately)
and send a pre-tokenized TokenStream to the highlighter so next() is
Hi all,
Would this approach be recommended for stemmed words as well? For example,
let's say the original word is 'mower'. I want matches on 'mow', 'mowing' and
'mowers', but the most relevant would obviously be matches for 'mower'.
Should I index my documents unstemmed and then stem at the
query wor
I'm surprised they aren't keeping *any* logs or so they claim. Seems foolish
to me from a data-mining perspective.
"A Wikia employee told me today that people were already asking what the
most popular search terms were. He said there was no way of finding out as
no logs are kept." [1]
[1]
http://r
Hi all,
I've been tracking down a problem happening in our production environment.
When we switch an index after doing deletes & adds, running some searches,
and finally changing the pointer from the old index to the new one, all the
threads start stacking up, all waiting on
isDeleted(). The threads seem to fin
BTW, I'm using Lucene 2.2.0.
-M
p.s. Congrats on the 2.3.0 release!
On Jan 24, 2008 7:42 PM, Michael Stoppelman <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I've been tracking down a problem happening in our production environment.
> When we switch an index after doing
the threads at the start are building the same cache multiple
times?
-M
On Jan 25, 2008 2:01 AM, Michael Stoppelman <[EMAIL PROTECTED]> wrote:
> BTW, I'm using Lucene 2.2.0.
>
> -M
>
> p.s. Congrats on the 2.3.0 release!
>
>
> On Jan 24, 2008 7:42 PM, Michael S
u kill -QUIT right after you fire those 20-30
> concurrent queries? This could tell you/us where those threads are
> blocking, if they are blocking, or what they are all doing.
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> - Or
I've created a mapping of query terms to clusters with corresponding
strength values that I want to integrate into lucene
scoring so I can boost documents that match the clusters. I would like to
give a boost based on the normalized score.
In my setup, each document has a field with the clusters th
/FuzzyLikeThisQuery.java
-M
On Feb 3, 2008 8:21 PM, Michael Stoppelman <[EMAIL PROTECTED]> wrote:
> I've created a mapping of query terms to clusters with corresponding
> strength values that I want to integrate into lucene
> scoring so I can boost documents that match the clusters. I would li
.
> Or you could add a clause with the unstemmed version boosted. Or
> something like that. Note that whether you add the $ to the stemmed
> or unstemmed version is up to you...
>
> Watch what analyzer you use to be sure it doesn't strip out the special
> symbol
>
Hi all,
I've got an index with tokens that are stemmed. Sometimes I really need to
boost the unstemmed
version of a query word to get the most relevant documents.
Example:
Query: [olives].
I don't want to match documents with the words: oliver, oliver's, etc...
Since I'm stemming when creating t
Did your index size increase drastically?
As a first step I would recommend optimizing your index if you haven't
already.
-M
On Feb 12, 2008 7:42 PM, Cesar Ronchese <[EMAIL PROTECTED]> wrote:
>
> I was doing normal queries happily, seeing the results statistics come in
> about 0.02 seconds.
>
>
To add to what Mark is saying, it's very important that you watch out for the
first N results effect. If you showed a user a random set of documents with
crap relevance I'll bet you that a good number will click on the first result
(call it user laziness or the Google "I'm feeling lucky" effect :)). Yo
On Tue, Feb 26, 2008 at 10:18 AM, Jamie <[EMAIL PROTECTED]> wrote:
> Hi
>
> I am looking for a way to improve the search performance of my
> application. I've followed every suggestion in the Lucene Wiki but the
> search is still too slow with large indexes. I was wondering whether
Did you optim
ted, based on date, search only those
> indexes that fall between specified dates. I've run my code through the
> YourKit profiler. The time appears to be consumed by Lucene itself and
> not by my code.
>
> Any other ideas?
>
>
> Michael Stoppelman wrote:
> > On Tu
Sumit,
The class you'll end up subclassing from would be:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html or
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/DefaultSimilarity.html
On an IndexSearcher
Hi all,
I've been doing some performance testing and found that using
QueryWrapperFilter for a location field restriction I have to do brings my
search times down to 5-10ms. This was surprising.
Before, the performance was between 50ms and 100ms.
The queries from before the optimization look like
n Wed, Apr 16, 2008 at 6:43 PM, Karl Wettin <[EMAIL PROTECTED]> wrote:
> Michael Stoppelman skrev:
>
> Hi all,
> > I've been doing some performance testing and found that using
> > QueryWrapperFilter for a location field
> > restriction I have to do allows my
Stephane,
Could you describe how you set up the spatial area? Having a BooleanQuery with
200 terms in it definitely slows things down (I'm not sure exactly why yet
-- it seems like it shouldn't be "that" slow). If you can describe your
spatial area in fewer terms you can get much better performance.
Hi all,
I've got a document that contains a bunch of separate posts about one topic
(a message board thread); all the posts are concatenated together in the
indexed Lucene document.
I would like to create highlights and know where the highlight came from,
meaning if the text fragment came from
Hi all,
My index is being zeroed out by the new lucene core jar.
Here's the deal:
I've got an old index from lucene-core-2.0.0 jar. I start up my service with
the new lucene 2.2.0 jar and everything
is fine. When I add a document to the index everything is still fine.
Yet when I shut down my
Seems like Lucene 2.0.0 created a file /segments. In
2.2.0 the new segments file has the following convention
/segments_. Our codebase had some logic that depended on
this file being named consistently.
It seems like the bug was on my end, my apologies.
-M
On 6/21/07, Michael Stoppelman
Hi all,
I was tracking down slowness in the contrib highlighter code and it seems
the seemingly simple tokenStream.next() is the culprit.
I've seen multiple posts about this being a possible cause. Has anyone
looked into how to speed up StandardTokenizer? For my
documents it's taking about 70ms p
ing up
query term offset information in the index. For larger documents this
can be much faster than using the standard contrib Highlighter, even if
you're using TokenSources. LUCENE-644 has a much flatter curve than the
contrib Highlighter as document size goes up.
- Mark
Michael Stoppelman wrote:
>
d
to be the same as the tokenizer for indexing so I can make the highlighting
tokenizer
much simpler. Everything will be fast and happy soon.
-M
- Mark
Michael Stoppelman wrote:
> Might be nice to add a line of documentation to the highlighter on the
> possible
> perform
A couple of thoughts here...
You could hash (e.g. md5) all the documents in your index and eliminate
duplicates that way. Just pick one of the docs in the hash bucket as
the non-dup document and then delete the other dups. This could be run as a
batch job to eliminate the duplicates in an off-line p
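
The hashing idea above can be sketched roughly as follows. This is only an illustration under some assumptions: documents are represented as a plain map from a document id to its text rather than as Lucene documents, the first id seen for each hash is the one kept, and the names DedupByHash, md5Hex and duplicateIds are hypothetical. Actually deleting the returned ids from the index is left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupByHash {
    // Hex-encoded MD5 of the document text (UTF-8 bytes).
    public static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Returns the ids that should be deleted: every document whose hash
    // was already seen on an earlier document in iteration order.
    public static List<String> duplicateIds(Map<String, String> docsById) {
        Map<String, String> keeperByHash = new HashMap<>();
        List<String> duplicates = new ArrayList<>();
        for (Map.Entry<String, String> entry : docsById.entrySet()) {
            String hash = md5Hex(entry.getValue());
            // putIfAbsent returns the existing keeper id if this hash bucket is taken.
            if (keeperByHash.putIfAbsent(hash, entry.getKey()) != null) {
                duplicates.add(entry.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1", "hello");
        docs.put("doc2", "world");
        docs.put("doc3", "hello");
        System.out.println(duplicateIds(docs)); // prints [doc3]
    }
}
```

Note that this only catches byte-for-byte identical text; near-duplicates (different whitespace, markup or casing) would need the text normalized before hashing.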
49 matches