indexing issue

2008-11-29 Thread Michael Stoppelman
Hi all, I've got an indexing issue I think other folks might be interested in hearing about and I wanted to get feedback before I went ahead and implemented a new method. Currently, the way we update indices is by sending individual delete/add document requests to all our search boxes individuall

Issue upgrading from lucene 2.3.2 to 2.4 (moving from bitset to docidset)

2008-12-08 Thread Michael Stoppelman
Hi all, I'm working on upgrading to Lucene 2.4.0 from 2.3.2 and was trying to integrate the new DocIdSet changes since the o.a.l.search.Filter#bits() method is now deprecated. For our app we actually heavily rely on bits from the Filter to do post-query filtering (I explain why below). For example,

Re: Issue upgrading from lucene 2.3.2 to 2.4 (moving from bitset to docidset)

2008-12-09 Thread Michael Stoppelman
c 9, 2008 at 1:47 AM, Michael McCandless < [EMAIL PROTECTED]> wrote: > > This use case sounds a lot like faceted navigation, which Solr provides. > > Mike > > > Michael Stoppelman wrote: > > Hi all, >> >> I'm working on upgrading to Lucene 2.4.0 fr

Re: indexing issue

2008-12-14 Thread Michael Stoppelman
On Sat, Nov 29, 2008 at 11:11 AM, Yonik Seeley wrote: > On Sat, Nov 29, 2008 at 12:45 PM, Michael Stoppelman > wrote: > > Hi all, > > > > I've got an indexing issue I think other folks might be interested in > > hearing about and I wanted to get feedback befo

replication question

2008-12-15 Thread Michael Stoppelman
I've got a question from Doug's original email about replication ( http://www.mail-archive.com/lucene-u...@jakarta.apache.org/msg12709.html): "1. On the index master, periodically checkpoint the index. Every minute or so the IndexWriter is closed and a 'cp -lr index index.DATE' command is executed

Re: replication question

2008-12-16 Thread Michael Stoppelman
Hi Yonik, Thanks for the response. reply inline. On Tue, Dec 16, 2008 at 6:44 AM, Yonik Seeley wrote: > On Tue, Dec 16, 2008 at 1:04 AM, Michael Stoppelman > wrote: > > I've got a question from Doug's original email about replication ( > > http://w

Poor QPS with highlighting

2009-02-02 Thread Michael Stoppelman
Hi all, My search backends are only able to eke out 13-15 qps even with the entire index in memory (this makes it very expensive to scale). According to my YourKit profiler 80% of the program's time ends up in highlighting. With highlighting disabled my backend gets about 45-50 qps (cheaper scalin

Re: waaaay too many files in the index!

2009-02-03 Thread Michael Stoppelman
On Tue, Feb 3, 2009 at 7:26 AM, John Byrne wrote: > Hi, > > I've got a weird problem with a lucene index, using 2.3.1. The index > contains 6660 files. I don't know how this happened. Maybe someone can tell me > something about the files themselves? (examples below) > > On one day, between 10 and 4

Re: Poor QPS with highlighting

2009-02-03 Thread Michael Stoppelman
a little more detail; I'm not exactly sure what you mean. > Cheers > Mark > > > > - Original Message > From: Michael Stoppelman > To: java-user@lucene.apache.org > Sent: Tuesday, 3 February, 2009 7:24:06 > Subject: Poor QPS with highlighting > >

Re: Poor QPS with highlighting

2009-02-04 Thread Michael Stoppelman
Thanks Mark for the explanation. I think your solution would definitely change the tf-idf scoring for documents since your field is now split up over multiple docs. One option to get around the changing scoring would be to run a completely separate index for highlighting (with the overlapping d

Re: Poor QPS with highlighting

2009-02-05 Thread Michael Stoppelman
On Thu, Feb 5, 2009 at 9:05 AM, Jason Rutherglen wrote: > Google uses dedicated highlighting servers. Maybe this architecture would > work for you. > What's your reference? I used to work at Google. > > On Mon, Feb 2, 2009 at 11:24 PM, Michael Stoppelman >wrote: &

Re: Poor QPS with highlighting

2009-02-05 Thread Michael Stoppelman
On Thu, Feb 5, 2009 at 12:47 PM, Michael Stoppelman wrote: > > > On Thu, Feb 5, 2009 at 9:05 AM, Jason Rutherglen < > jason.rutherg...@gmail.com> wrote: > >> Google uses dedicated highlighting servers. Maybe this architecture would >> work for you. >> >

Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers

2009-02-18 Thread Michael Stoppelman
Fuzzy search tends to be super heavy on CPU because of the Levenshtein distance algorithm. We use it for a small index (60MB) for spell correcting and our QPS suffers as a result. There was recently a discussion of a new fuzzy algorithm: https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian
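
To illustrate why Levenshtein-based fuzzy matching is expensive, here is a minimal sketch of the textbook dynamic-programming edit distance. It is not Lucene's FuzzyQuery code; the class and method names are hypothetical. Each comparison costs time proportional to the product of the two string lengths, and a naive fuzzy search repeats that for every candidate term it enumerates.

```java
// Hypothetical illustration, not Lucene's implementation.
final class LevenshteinSketch {

    // Classic two-row dynamic programming: O(|a| * |b|) time, O(|b|) space.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete all i characters of a's prefix
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Swap rows so prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));
    }
}
```

With a spell-correction dictionary of many thousands of terms, running this inner loop per candidate term per query adds up quickly, which is consistent with the QPS drop described above.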

Re: Confidence scores at search time

2009-02-25 Thread Michael Stoppelman
Hi Ken, I found this post on the Lucene documentation page: http://wiki.apache.org/lucene-java/LuceneFAQ#head-912c1f237bb00259185353182948e5935f0c2f03 In practice you sometimes need to have a cut-off or boost factor post tf-idf scoring. The way I've been going about it is by picking values and se

Re: Faceted Search using Lucene

2009-02-25 Thread Michael Stoppelman
If another thread is executing a query with the handle to one of readers[i] you're going to kill it since the IndexReader is now closed. Just don't call the IndexReader#close() method. If nothing is pointing at the readers they should be garbage collected. Also, you might want to warm up your new I

Re: Confidence scores at search time

2009-02-28 Thread Michael Stoppelman
ime. M On Wed, Feb 25, 2009 at 10:48 PM, Michael Stoppelman wrote: > Hi Ken, > > I found this post on the Lucene documentation page: > http://wiki.apache.org/lucene-java/LuceneFAQ#head-912c1f237bb00259185353182948e5935f0c2f03 > > In practice you sometimes need to have a cut-off

Re: queryNorm affect on score

2009-02-28 Thread Michael Stoppelman
I guess I don't really understand this comment in the similarity java doc then: http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm *queryNorm(q) * is a normalizing factor used to make scores between queries comparable. :/. M On Fri, Feb 27, 2009

Re: Lucene index sizes and performance

2009-04-16 Thread Michael Stoppelman
On Sat, Jul 7, 2007 at 8:19 PM, Chun Wei Ho wrote: > We are currently running a search service with a single Lucene index > of about 10 GB. We would like to find out: > > (a) What is the usual index size of everyone else? How large have > Lucene index gone in production environments, and is there

Re: Reloading RAM Directory from updated FS Directory

2009-06-10 Thread Michael Stoppelman
Another potential idea would be to break up the index into N indices such that each index is small enough to fit two in memory and then you can swap them. http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/index/MultiReader.html This is just an idea, I haven't tri

InstantiatedIndex performance

2010-03-31 Thread Michael Stoppelman
Hi all, I was wondering why the InstantiatedIndex gets very slow as the number of documents increases in the index. I've been looking at the source and have only found comments saying "it's slow" when the index is big but not why. Do folks just run out of memory or something deeper? Thanks for th

Re: Highlighter that works with phrase and span queries

2007-08-27 Thread Michael Stoppelman
Is this jar going to be in the next release of lucene? Also, are these the same as the changes in the following patch: https://issues.apache.org/jira/secure/attachment/12362653/spanhighlighter10.patch -M On 6/27/07, Mark Miller <[EMAIL PROTECTED]> wrote: > > > > I have not looked at any highlight

Re: Highlighter that works with phrase and span queries

2007-08-27 Thread Michael Stoppelman
ault Lucene Query syntax. > > Whether it is included soon or not, the code works well and I will > continue to support it. > > - Mark > > Michael Stoppelman wrote: > > Is this jar going to be in the next release of lucene? Also, are these > the > > same as th

Re: Weighting Issue

2007-08-31 Thread Michael Stoppelman
Kalvir, Have you tried giving the name field a boost? E.g. name:(John Smith)^10 alias:(John Smith) -M On 8/31/07, Kalvir Sandhu <[EMAIL PROTECTED]> wrote: > > Hi all. > > I am working on building a lucene index to search names of people. I want > to > be able to score things differently. Here i

Speeding up highlighting by storing a cached TokenStream

2007-10-25 Thread Michael Stoppelman
Most of the time highlighting takes is spent getting the next token from the analyzer (tokenStream.next()). I'm wondering how I can access the tokens that are stored in lucene (or store another copy of the TokenStream separately) and send a pre-tokenized TokenStream to the highlighter so next() is

Re: Synonyms and Ranking

2008-01-03 Thread Michael Stoppelman
Hi all, Would this approach be recommended for stemmed words as well? For example let's say the original word is 'mower', I want matches on 'mow', 'mowing' and 'mowers' but the most relevance would obviously be matches for 'mower'. Should I index my documents unstemmed and then stem at the query wor

Re: Wikia search goes live today

2008-01-08 Thread Michael Stoppelman
I'm surprised they aren't keeping *any* logs or so they claim. Seems foolish to me from a data-mining perspective. "A Wikia employee told me today that people were already asking what the most popular search terms were. He said there was no way of finding out as no logs are kept." [1] [1] http://r

Threads blocking on isDeleted when swapping indices for a very long time...

2008-01-24 Thread Michael Stoppelman
Hi all, I've been tracking down a problem happening in our production environment. When we switch an index after doing deletes & adds, running some searches, and finally changing the pointer from old index to new all the threads start stacking up all waiting on isDeleted(). The threads seem to fin

Re: Threads blocking on isDeleted when swapping indices for a very long time...

2008-01-25 Thread Michael Stoppelman
BTW, I'm using Lucene 2.2.0. -M p.s. Congrats on the 2.3.0 release! On Jan 24, 2008 7:42 PM, Michael Stoppelman <[EMAIL PROTECTED]> wrote: > Hi all, > > I've been tracking down a problem happening in our production environment. > When we switch an index after doing

Re: Threads blocking on isDeleted when swapping indices for a very long time...

2008-01-25 Thread Michael Stoppelman
the threads at the start are building the same cache multiple times? -M On Jan 25, 2008 2:01 AM, Michael Stoppelman <[EMAIL PROTECTED]> wrote: > BTW, I'm using Lucene 2.2.0. > > -M > > p.s. Congrats on the 2.3.0 release! > > > On Jan 24, 2008 7:42 PM, Michael S

Re: Threads blocking on isDeleted when swapping indices for a very long time...

2008-01-27 Thread Michael Stoppelman
u kill -QUIT right after you fire those 20-30 > concurrent queries? This could tell you/us where those threads are > blocking, if they are blocking, or what they are all doing. > > Thanks, > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > - Or

Boosting using an external data source

2008-02-03 Thread Michael Stoppelman
I've created a mapping of query terms to clusters with corresponding strength values that I want to integrate into lucene scoring so I can boost documents that match the clusters. I would like to give a boost based on the normalized score. In my setup, each document has a field with the clusters th

Re: Boosting using an external data source

2008-02-04 Thread Michael Stoppelman
/FuzzyLikeThisQuery.java -M On Feb 3, 2008 8:21 PM, Michael Stoppelman <[EMAIL PROTECTED]> wrote: > I've created a mapping of query terms to clusters with corresponding > strength values that I want to integrate into lucene > scoring so I can boost documents that match the clusters. I would li

Re: How to promote an unstemmed match over a stemmed match in an index that's stemmed...

2008-02-11 Thread Michael Stoppelman
. > Or you could add a clause with the unstemmed version boosted. Or > something like that Note that whether you add the $ to the stemmed > or unstemmed version is up to you... > > Watch what analyzer you use to be sure it doesn't strip out the special > symbol >

How to promote an unstemmed match over a stemmed match in an index that's stemmed...

2008-02-11 Thread Michael Stoppelman
Hi all, I've got an index with tokens that are stemmed. Sometimes I really need to boost the unstemmed version of a query word to get the most relevant documents. Example: Query: [olives]. I don't want to match documents with the words: oliver, oliver's, etc... Since I'm stemming when creating t

Re: Lucene multiple field search performance

2008-02-12 Thread Michael Stoppelman
Did your index size increase drastically? As a first step I would recommend optimizing your index if you haven't already. -M On Feb 12, 2008 7:42 PM, Cesar Ronchese <[EMAIL PROTECTED]> wrote: > > I was doing normal queries happily, seeing the results statistics come in > about 0.02 seconds. > >

Re: Which file in the lucene package is used to manipulate results..

2008-02-20 Thread Michael Stoppelman
To add to what Mark is saying, it's very important that you watch out for the first N results effect. If you showed a user a random set of documents with crap relevance I'll bet you that a good number will click on the first result (call it user laziness or the Google "I'm feeling lucky" effect :)). Yo

Re: Lucene Search Performance

2008-02-26 Thread Michael Stoppelman
On Tue, Feb 26, 2008 at 10:18 AM, Jamie <[EMAIL PROTECTED]> wrote: > Hi > > I am looking for a way to improve the search performance of my > application. I've followed every suggestion in the Lucene Wiki but the > search is still too slow with large indexes. I was wondering whether Did you optim

Re: Lucene Search Performance

2008-02-26 Thread Michael Stoppelman
ted, based on date, search only those > indexes that fall between specified dates. I've run my code through the > YourKit profiler. The time appears to be consumed by Lucene itself and > not by my code. > > Any other ideas? > > > Michael Stoppelman wrote: > > On Tu

Re: changing scoring formula

2008-03-05 Thread Michael Stoppelman
Sumit, The class you'll end up subclassing from would be: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html or http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/DefaultSimilarity.html On an IndexSearcher

QueryWrapperFilter question...

2008-04-16 Thread Michael Stoppelman
Hi all, I've been doing some performance testing and found that using QueryWrapperFilter for a location field restriction I have to do allows my search results to approach 5-10ms. This was surprising. Before the performance was between 50ms-100ms. The queries from before the optimization look like

Re: QueryWrapperFilter question...

2008-04-16 Thread Michael Stoppelman
n Wed, Apr 16, 2008 at 6:43 PM, Karl Wettin <[EMAIL PROTECTED]> wrote: > Michael Stoppelman skrev: > > Hi all, > > I've been doing some performance testing and found that using > > QueryWrapperFilter for a location field > > restriction I have to do allows my

Re: hybrid query (lucene + db)

2008-05-01 Thread Michael Stoppelman
Stephane, Could you describe how you set up the spatial area? Having a BooleanQuery with 200 terms in it definitely slows things down (I'm not sure exactly why yet -- it seems like it shouldn't be "that" slow). If you can describe your spatial area in fewer terms you can get much better performance.

Highlighting for a message board thread

2007-06-21 Thread Michael Stoppelman
Hi all, I've got a document that contains a bunch of separate posts about one topic (a message board thread); all the posts are concatenated together in the indexed lucene document. I would like to create highlights and know where the highlight came from, meaning if the text fragment came from

Lucene 2.0.0 index being zeroed out by Lucene 2.2.0.

2007-06-21 Thread Michael Stoppelman
Hi all, My index is being zeroed out by the new lucene core jar. Here's the deal: I've got an old index from lucene-core-2.0.0 jar. I start up my service with the new lucene 2.2.0 jar and everything is fine. When I add a document to the index everything is still fine. Yet when I shut down my

Re: Lucene 2.0.0 index being zeroed out by Lucene 2.2.0.

2007-06-21 Thread Michael Stoppelman
Seems like lucene 2.0.0 created a file /segments. In 2.2.0 the new segments file has the following convention /segments_. Our codebase had some logic that depended on this file being named consistently. It seems like the bug was on my end, my apologies. -M On 6/21/07, Michael Stoppelman

StandardTokenizer is slowing down highlighting a lot

2007-07-18 Thread Michael Stoppelman
Hi all, I was tracking down slowness in the contrib highlighter code and it seems the seemingly simple tokenStream.next() is the culprit. I've seen multiple posts about this being a possible cause. Has anyone looked into how to speed up StandardTokenizer? For my documents it's taking about 70ms p

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-18 Thread Michael Stoppelman
ing up query term offset information in the index. For larger documents this can be much faster than using the standard contrib Highlighter, even if you're using TokenSources. LUCENE-644 has a much flatter curve than the contrib Highlighter as document size goes up. - Mark Michael Stoppelman wrote: &g

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-19 Thread Michael Stoppelman
d to be the same as the tokenizer for indexing so I can make the highlighting tokenizer much simpler. Everything will be fast and happy soon. -M - Mark Michael Stoppelman wrote: > Might be nice to add a line of documentation to the highlighter on the > possible > perform

Re: Detection of index dublicates in Lucene

2007-07-30 Thread Michael Stoppelman
A couple of thoughts here... You could hash (e.g. md5) all the documents in your index and eliminate duplicates that way. Just pick one of the docs in the hash bucket as the non-dup document and then delete the other dups. This could be run as a batch job to eliminate the duplicates in an off-line p
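
The hashing idea above can be sketched roughly as follows. This is a minimal illustration using only the JDK, not a Lucene API: the class name, the method names, and the in-memory map of document ids to text are all assumptions made for the example. In a real batch job the text would come from the index, and the returned ids would be handed to whatever delete mechanism the application uses.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration of hash-based exact-duplicate detection.
final class DedupSketch {

    // Hex-encoded MD5 of the document text (UTF-8 bytes).
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Keeps the lowest id in each hash bucket and returns the ids of the rest.
    static List<Integer> duplicateIds(Map<Integer, String> docs) {
        Set<String> seen = new HashSet<>();
        List<Integer> duplicates = new ArrayList<>();
        // TreeMap gives a deterministic ascending-id iteration order.
        for (Map.Entry<Integer, String> e : new TreeMap<>(docs).entrySet()) {
            if (!seen.add(md5Hex(e.getValue()))) {
                duplicates.add(e.getKey());
            }
        }
        return duplicates;
    }
}
```

Note that this only catches byte-identical text; near-duplicates (different whitespace, boilerplate, or casing) would need normalization before hashing or a different technique altogether.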