Re: Snowball and accents filter...?
On Fri, 27 Apr 2007 at 16:59 -0700, Chris Hostetter wrote:
> : In order to do this, we tried subclassing the SnowballAnalyzer... it
> : doesn't work yet, though. Here is the code of our custom class:
>
> At first glance, what you've got seems fine. Can you elaborate on what you
> mean by "it doesn't work"?
>
> Perhaps the issue is that the SnowballStemmer can't handle the accented
> characters, and you should strip them first, then stem?
>
>   public TokenStream tokenStream(String fieldName, Reader reader) {
>     TokenStream result = new StandardTokenizer(reader);
>     result = new StandardFilter(result);
>     result = new LowerCaseFilter(result);
>     if (stopSet != null)
>       result = new StopFilter(result, stopSet);
>     result = new ISOLatin1AccentFilter(result);
>     result = new SnowballFilter(result, name);
>     return result;
>   }

Thanks for your answer, Chris. It doesn't work for the opposite reason: it requires words to be spelled correctly, including accents, in order to stem them. So, for example, "civilización" and its plural, "civilizaciones", are stemmed correctly, but the accentless version, "civilizacion", doesn't get stemmed at all.

So if someone misspells the word in the search query by omitting the accent - a likely scenario - the only hits they get are identical misspellings in the documents, if such things exist. But we need stemming of both accented and unaccented versions of the word. Stemming misspellings may sound inherently evil, I suppose, but it seems to be our best bet. We're currently trying to modify the SpanishStemmer to do this, but haven't quite gotten it yet.

Another option that I imagine might work, though less well, would be to maintain two indexes simultaneously: one of correctly stemmed words generated without the accents filter, and another of unstemmed words with the accents stripped, and query both indexes when searching. Yet another possibility would be to silently use a dictionary to correct spellings in queries before searching.
A few Google queries show that they do things sort of the way we're trying to, though perhaps not quite...

Thanks again,
Andrew

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
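Chris's suggestion (strip accents before stemming) and Andrew's dictionary idea both hinge on producing an accentless form of a term. A minimal sketch of that normalization step in plain Java, using java.text.Normalizer rather than Lucene's ISOLatin1AccentFilter; the class name is hypothetical:

```java
import java.text.Normalizer;

// Hypothetical helper mirroring what ISOLatin1AccentFilter does:
// decompose accented characters (NFD) and drop the combining marks,
// so "civilización" and "civilizacion" normalize to the same term.
public class AccentStripper {
    public static String strip(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches Unicode combining marks (the accents)
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("civilización")); // civilizacion
    }
}
```

Applied before the SnowballFilter this only helps if the stemmer accepts accentless input, which is exactly the problem Andrew describes; applied to both the index and the query it at least makes the two spellings collide on the same terms.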
Re: Search for docs containing only a certain word in a specified field?
On 28 Apr 2007, at 07.52, Kun Hong wrote:
> karl wettin wrote:
>> On 27 Apr 2007, at 14.11, Erik Hatcher wrote:
>>> On Apr 27, 2007, at 6:39 AM, karl wettin wrote:
>>>> On 27 Apr 2007, at 12.36, Erik Hatcher wrote:
>>>>> Unless someone has some other tricks I'm not aware of, that is.
>>>> I guess it would be possible to add start/stop-tokens such as ^ and $ to the indexed text: "^ the $" and place a phrase query with 0 slop.
>>> True true. That'd work too.
>
> Thanks for the replies and discussion. I think I didn't express my problem correctly. The problem is that I want to find documents containing only the "the" token in the title field, but not necessarily with only one occurrence. For example, if the query is "the", I want to find documents whose title is "the", "the the" or "the the the".

I'm not sure if you mean that it should treat all repetitive tokens as only one token? Then you are better off using a filter when analyzing the text you insert into the index: rather than creating one token for each "the" in "the the the the the the", you only create one. You might also want to use this filter when parsing user queries. (It will be hard to find the band 'the the'.)

If not, and what you write above is all you want to match - nothing more, nothing less - then you could do something like this (dry coded and untested):

  int n = 3; // the; the the; the the the
  String field = "title";
  String token = "the";
  BooleanQuery bq = new BooleanQuery();
  for (int i = 0; i < n; i++)
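Karl's loop body is cut off in the archive, but the condition his boolean query of zero-slop phrase queries encodes can be stated directly: a title matches when it consists of one to n repetitions of the token and nothing else. A plain-Java sketch of that predicate (hypothetical helper, no Lucene dependency):

```java
// Hypothetical predicate capturing the matching condition from the thread:
// the title is one or more repetitions of a single token
// ("the", "the the", "the the the", ...), up to maxRepeats.
public class OnlyToken {
    public static boolean matches(String title, String token, int maxRepeats) {
        String[] parts = title.trim().split("\\s+");
        if (parts.length == 0 || parts.length > maxRepeats) return false;
        for (String p : parts) {
            if (!p.equalsIgnoreCase(token)) return false;
        }
        return true;
    }
}
```

In Lucene terms, each repetition count becomes one SHOULD clause of the BooleanQuery: a PhraseQuery over "^", the token repeated i+1 times, and "$", with slop 0, so only exact titles of that shape match.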
Re: Index sync up
I don't understand why you think HitCollector caches lots of data. All it does is provide a place for you to decide whether you want a doc or not. There's no fetching of the doc, or anything else except the score and the doc ID; there's nothing else you have to do in the HitCollector.collect method. TopDocs ends up with an array of doc IDs and scores, which is probably what you want: just skip to the Nth document and read off the next X documents. In neither case is there very much storage involved. You've got to score all the documents anyway if you want the most relevant ones, and TopDocs has a long and a float for each scoring document. Still not a huge amount of data.

Anyway, best of luck however it works out,
Erick

On 4/27/07, Tony Qian <[EMAIL PROTECTED]> wrote:

  Erick,

  Thanks for your explanation. I thought about using HitCollector. The search interface we are facing is actually pretty simple. One of the searches requires a maximum of 500 results with a page size of 500 (basically return the first 500). The second requires a max of 250 and a page size of 25. At this time, we are OK even if we have to run the query several times. I see one problem with HitCollector, which is that it caches a huge amount of data if the documents are very large. The best implementation (I think) is for the client to pass a page number and page size to the search method, and for Lucene to return the documents on that page instead of always returning the first 100 documents. I haven't looked at the Lucene code yet and don't know how hard that would be to implement.

  Tony

  > From: "Erick Erickson" <[EMAIL PROTECTED]>
  > Reply-To: java-user@lucene.apache.org
  > To: java-user@lucene.apache.org
  > Subject: Re: Index sync up
  > Date: Fri, 27 Apr 2007 13:12:16 -0400
  >
  > <4> is also easy.
  >
  > From the javadoc:
  > "*Caution:* Iterate only over the hits needed. Iterating over all hits is
  > generally not desirable and may be the source of performance issues."
  >
  > So an iterator should be fine for all documents, even those > 100.
  > But do be aware that the entire query gets re-executed every 100 docs or so, so yes, there is a performance issue. You'll pay a price; how big depends on a lot of variables. But let's say the query takes 2 seconds to run. You'll spend two seconds searching before returning document 0, two more seconds between documents 100 and 101, two seconds between 200 and 201, etc., *even if you just throw them away*, if you use an iterator.
  >
  > So, getting hits 10,000 through 10,100 will spend a LOT of time processing queries. You're better off using a HitCollector, or perhaps TopDocs, etc.
  >
  > On the other hand, if your query takes 10 ms and you never really expect to fetch more than, say, 500 documents, who cares? Do it as simply as possible.
  >
  > But now that I'm thinking about it, it's unclear to me what happens if you just ask for Hits.doc(401) as your first call to get any document from the Hits object. I took a quick look at the Hits code and it *looks* like, for fetching an arbitrary 100 documents, the maximum number of searches you'll make is two. Again, it's a quick look, but it seems like the following
  >
  >   Hits hits = search();
  >   Document doc = hits.doc(401);
  >
  > will execute the search twice: first to get the first 100 docs, then to get documents 400-800. At least I think that's what's happening. That said, I think you'd still be ahead by implementing your own HitCollector if you expect to fetch thousands of documents. The "fetch twice as many documents as the one we're asked for" algorithm seems tailored for relatively small data sets, which shouldn't be any surprise.
  >
  > Erick
  >
  > On 4/27/07, Tony Qian <[EMAIL PROTECTED]> wrote:
  >>
  >> All,
  >>
  >> After playing around with Lucene, we decided to replace our old full-text search engine with Lucene. I got "Lucene in Action" a week ago and finished reading most of the book. I have several questions.
  >>
  >> 1) Since the book was written two years ago and Lucene has made a lot of changes, is there any plan for a 2nd edition? (I guess this question is for Otis and Erik; btw, it is a great book.)
  >>
  >> 2) I have two processes for indexing. One runs every 5 minutes to add new contents to an existing index. The other runs daily to rebuild the entire index, which also handles removing old contents. After the rebuild process finishes indexing, we'd like to replace the index built by the first process (every 5 minutes) with the index built by the second. How do I do it safely and also avoid duplicating or missing documents? (It is possible that the first process is still adding documents to the index when we try to replace it with the second one.)
  >> NOTE: both processes retrieve data from the same database.
  >>
  >> 3) We are doing indexing on a master server and pushing index data to slave servers. In order to make new data visible to clients, we have to close the IndexSearcher and reopen it after the new data is copied over. We use a web-based application (servlet) as the search interface, creating an IndexSearcher as an instance variable
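The "skip to the Nth document and read off the next X documents" step Erick describes over a TopDocs-style result is just array slicing over the already-scored, already-sorted hits. A minimal sketch with hypothetical names (plain Java; in real code the array would be the doc IDs pulled from TopDocs.scoreDocs):

```java
import java.util.Arrays;

// Hypothetical paging helper: given the scored doc ids returned by a
// TopDocs-style search (already sorted by score), return one page.
public class Pager {
    public static int[] page(int[] scoredDocIds, int pageNumber, int pageSize) {
        int from = pageNumber * pageSize;                 // first hit on this page
        if (from >= scoredDocIds.length) return new int[0];
        int to = Math.min(from + pageSize, scoredDocIds.length);
        return Arrays.copyOfRange(scoredDocIds, from, to);
    }
}
```

The point of the thread is that the expensive part, scoring, happens once; slicing any page out of the collected results afterwards is cheap, unlike iterating Hits, which re-runs the query every 100 documents.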
Re: Snowball and accents filter...?
You actually wouldn't have to maintain two versions. You could, instead, inject the accentless (stemmed) terms into your single index as synonyms (see Lucene in Action). This is easier to search and maintain. But it also bloats your index by some factor, since you're storing two words for every accented word in your corpus. And it gives you headaches if there is more than one accent in the word (do you then store all 4 possibilities for two accents? 8 for 3? etc.).

I think your notion of running the search terms through a dictionary is a very good one. That way, your searcher doesn't have to care about all this nonsense, and can assume correctly accented characters.

Erick

On 4/28/07, Andrew Green <[EMAIL PROTECTED]> wrote:
> [snip]
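Erick's synonym suggestion boils down to: whenever a token contains accents, index both the original and the stripped form. A sketch of the term-expansion step in plain Java (hypothetical class name; in Lucene this would live in a TokenFilter that emits the extra token with a position increment of 0 so both forms sit at the same position):

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical expansion step for Erick's synonym-injection idea:
// return the token itself, plus its accent-stripped twin when different.
public class SynonymExpander {
    public static List<String> expand(String token) {
        List<String> out = new ArrayList<String>();
        out.add(token);
        String stripped = Normalizer.normalize(token, Normalizer.Form.NFD)
                                    .replaceAll("\\p{M}", "");
        if (!stripped.equals(token)) out.add(stripped);
        return out;
    }
}
```

Note that this sidesteps the combinatorial worry in the reply above: stripping all accents at once yields exactly one extra form per token, not 2^n variants, though queries with a partially accented misspelling still won't match either form.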
Sort in Lucene 1.4.3
Hi,

I encountered one problem in Lucene 1.4.3. I called:

  Searcher.search(..., new Sort("myfiled"));

In "myfiled", most values look like numbers ("123456" or something similar), but one field contains the value "Just a TRY", and then I got this error:

  java.lang.ClassCastException
    at org.apache.lucene.search.FieldDocSortedHitQueue.lessThan(FieldDocSortedHitQueue.java:129)

It seems that Lucene judged this field to be numeric, so it cannot cast one particular value as a String? To me, if the client did not specify the sorting field type, shouldn't we treat it just as a String?

Thanks very much for the help, and best regards,
Lisheng
Re: Index sync up
Hi Tony,

----- Original Message -----

> All,
> After playing around with Lucene, we decided to replace our old full-text search engine with Lucene. I got "Lucene in Action" a week ago and finished reading most of the book. I have several questions.
>
> 1) Since the book was written two years ago and Lucene has made a lot of changes, is there any plan for a 2nd edition? (I guess this question is for Otis and Erik; btw, it is a great book.)

OG: Thanks. Yes, there are plans for LIA2. At this point in time they are still just plans. We started preparing for the second edition some months ago, but then Lucene got some fresh blood and started developing and changing so rapidly that we decided to wait a little longer. Plus, both Erik and I are quite busy these days (see my signature).

> 2) I have two processes for indexing. One runs every 5 minutes to add new contents to an existing index. The other runs daily to rebuild the entire index, which also handles removing old contents. After the rebuild process finishes indexing, we'd like to replace the index built by the first process (every 5 minutes) with the index built by the second. How do I do it safely and also avoid duplicating or missing documents? (It is possible that the first process is still adding documents to the index when we try to replace it with the second one.)
> NOTE: both processes retrieve data from the same database.

OG: You'll need to make those two processes communicate somehow. If they run on the same servers, the easiest way might be using files - if file X exists, stop updating the index. Or, if file Y exists, that means the first process is still updating, so hold off on the index swap. If this is running under UNIX, you might be able to just do:

  rm -rf index

The files won't *really* be removed at this point, so searching against this index will still work.
  mv newIndex index

and then reopen the IndexSearcher. You could also play with symlinks. Normally you'd have:

  index -> index-built-on-20070428

When you build a new index the following night, you call it index-built-on-20070429 and point index at it:

  index -> index-built-on-20070429

and reopen the IndexSearcher.

> 3) We are doing indexing on a master server and pushing index data to slave servers. In order to make new data visible to clients, we have to close the IndexSearcher and reopen it after the new data is copied over. We use a web-based application (servlet) as the search interface, creating an IndexSearcher as an instance variable for all clients. My question is, what will happen to clients if I close the IndexSearcher while they are still doing a search? How do I safely update the index while clients are searching?

OG: The clients using the IndexSearcher when you close it will get an exception - an IOException, most likely. But you don't *have* to close the old IndexSearcher. You could just open a new one and let the old one get GCed. Or, if you really want to close the old one, you could always come up with a simple mechanism that implements "oh, this IndexSearcher needs to be closed soon - OK, let's give all clients who are using it 60 seconds to finish up and then we are closing this IS". Or you could keep a count of the clients using it; I believe Solr does this. You'll also want to warm up the new IndexSearcher with a query before exposing it to real clients, especially if your index is big.

> 4) Lucene caches the first 100 hits in memory. We decided to re-run the query to return search results to clients. For the first 100 documents, I can iterate through "Hits". Do I have to use doc(n) to retrieve documents beyond 100? Any performance issues?

OG: For hits > 100 you still use the same API as for hits < 100. However, if your application or its users need to go deep into the results, you might want to look at the IndexSearcher search() method that returns TopDocs.

Otis
Lucene Consulting - http://lucene-consulting.com/
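Otis's "just open a new one and let the old one get GCed" pattern amounts to swapping a shared reference: clients always read the current searcher through a holder, the indexing side swaps in a fresh one after reopening, and in-flight searches keep the old reference until they finish. A minimal sketch with a hypothetical generic holder (the real code would hold an IndexSearcher and warm it up with a query before the swap):

```java
// Hypothetical holder for the swap-instead-of-close pattern.
// volatile ensures clients see the new searcher promptly; the old
// one is garbage-collected once no in-flight search still holds it.
public class SearcherHolder<T> {
    private volatile T current;

    public SearcherHolder(T initial) { current = initial; }

    public T get() { return current; }       // clients call this per search

    public void swap(T fresh) { current = fresh; }  // indexer calls after reopen
}
```

Explicitly closing the old searcher instead requires the reference-counting scheme Otis mentions, so that close only happens after the last active search releases it.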
Re: Sort in Lucene 1.4.3
Lisheng,

Have a look at the javadoc for the Sort object:

  Valid Types of Values
  There are three possible kinds of term values which may be put into sorting fields: Integers, Floats, or Strings. Unless SortField objects are specified, the type of value in the field is determined by parsing the first term in the field.

Thus, if you know what type of value your field has, use SortField and set the type explicitly. Also, instantiate Sort and SortField only once instead of in each call to search().

Otis

Lucene Consulting - http://lucene-consulting.com/

----- Original Message -----
From: "Zhang, Lisheng" <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Saturday, April 28, 2007 9:04:23 PM
Subject: Sort in Lucene 1.4.3

> [snip]
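The auto-detection the javadoc describes guesses the sort type from the first term alone, which is exactly why one later non-numeric value ("Just a TRY") blows up with a ClassCastException. A toy illustration of that guessing step (hypothetical helper, not Lucene's actual code); the fix Otis describes is to pass an explicit type, something like `new Sort(new SortField("myfiled", SortField.STRING))` in that era's API (check your version's javadoc for the exact constant):

```java
// Hypothetical illustration of the auto-detection pitfall: the sort type
// is guessed from the *first* term only, so an index where the first term
// parses as an integer is sorted as INT, and any later non-numeric term
// fails the cast at sort time.
public class SortTypeGuess {
    public static String guessType(String firstTerm) {
        try {
            Integer.parseInt(firstTerm);
            return "INT";
        } catch (NumberFormatException e) {
            return "STRING";
        }
    }
}
```

With "123456" as the first term the whole field is treated as INT, and "Just a TRY" can never be cast to match; declaring the field STRING up front sorts everything lexicographically and avoids the exception.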