RE: Null or no analyzer
Aviran writes: You can use WhitespaceAnalyzer

Can he? If "Elections 2004" is one token in the subject field (keyword), this will fail, since WhitespaceAnalyzer will tokenize it to `Elections' and `2004'. So I guess he has to write an identity analyzer himself unless one is provided (which doesn't seem to be the case). The only alternatives are not using the query parser, or extending the query parser with a keyword syntax, as far as I can see.

Morus

-Original Message-
From: Rupinder Singh Mazara [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 19, 2004 11:23 AM
To: Lucene Users List
Subject: Null or no analyzer

Hi All

I have a question regarding the selection of Analyzers during query parsing. I have three fields in my index: db_id, full_text, subject. All three are indexed; however, while indexing I told Lucene to index db_id and subject but not to tokenize them.

I want to provide a single search box in my application to enable searching for documents. A query can look like: motor cross rally. This will get fed to QueryParser to do the relevant parsing. However, if the user enters John Kerry subject:"Elections 2004", I want to make sure that no analyzer is used for the subject field. How can that be done? This is because I expect the users to know the subject from a list of controlled vocabularies, and also because I am searching for documents that have the exact subject.

I tried using the PerFieldAnalyzerWrapper, but how do I get hold of an Analyzer that does nothing but pass the text through to the Searcher?

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Range Query
Hi Jonathan

"When searching I also pad the query term" ??? When exactly are you handling this: during indexing as well, or only while searching? Can you please be specific? [If time permits, could you please send me sample code for the same?] :) Thx in advance

-Original Message-
From: Jonathan Hager [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 3:31 AM
To: Lucene Users List
Subject: Re: Range Query

That is exactly right. It is searching the ASCII. To solve it I pad my price using a method like this:

/**
 * Pads the price so that all prices are the same number of characters and
 * can be compared lexicographically.
 */
public static String formatPriceAsString(Double price) {
  if (price == null) {
    return null;
  }
  return PRICE_FORMATTER.format(price.doubleValue());
}

where PRICE_FORMATTER contains enough digits for your largest number:

private static final DecimalFormat PRICE_FORMATTER = new DecimalFormat("000.00");

When searching I also pad the query term. I looked into hooking into QueryParser, but since the lower/upper prices for my application are different inputs, I chose to handle them without hooking into the QueryParser.

Jonathan

On Tue, 19 Oct 2004 12:35:06 +0530, Karthik N S [EMAIL PROTECTED] wrote:

Hi Guys

Apologies. I have a field of type Text, 'ItemPrice', using it to store the price in numeric form such as 10, 25.25, 50.00. If I am supposed to find the range between 2 prices, e.g. Contents:shoes +ItemPrice:[10.00 TO 50.60], I get results outside the range. [This may be due to the query comparing the ASCII values instead of the numeric values.] Am I missing something in the query syntax, or is this the wrong way to construct the query? Please somebody advise me ASAP.
:( Thx in advance

WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK]
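Jonathan's padding scheme can be tried end-to-end as a self-contained program. The class name and the Locale.US pinning are additions for illustration (his original relies on the default locale, which can use ',' as the decimal separator):

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class PricePadding {
    // Enough digits for the largest expected price; note the quoted pattern.
    // Locale.US keeps the decimal separator a '.' regardless of platform locale.
    private static final DecimalFormat PRICE_FORMATTER =
        new DecimalFormat("000.00", new DecimalFormatSymbols(Locale.US));

    public static String formatPriceAsString(Double price) {
        if (price == null) {
            return null;
        }
        return PRICE_FORMATTER.format(price.doubleValue());
    }

    public static void main(String[] args) {
        // Padded strings now compare correctly as text:
        // "010.00" < "025.25" < "050.60"
        System.out.println(formatPriceAsString(10.0));   // 010.00
        System.out.println(formatPriceAsString(25.25));  // 025.25
        System.out.println(formatPriceAsString(50.6));   // 050.60
    }
}
```

The same padding must be applied to both the indexed values and the range endpoints in the query, otherwise the lexicographic comparison breaks again.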
RE: Downloading Full Copies of Web Pages
Hi

Try Nutch [ http://www.nutch.org/docs/en/about.html ]; underneath it uses Lucene :)

-Original Message-
From: Luciano Barbosa [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 3:06 AM
To: [EMAIL PROTECTED]
Subject: Downloading Full Copies of Web Pages

Hi folks, I want to download full copies of web pages and store them locally, as well as the hyperlink structures, as local directories. I tried to use Lucene, but I've realized that it doesn't have a crawler. Does anyone know software that does this? Thanks,
Re: Downloading Full Copies of Web Pages
wget does this. Little point in reinventing the wheel.

Luciano Barbosa wrote: Hi folks, I want to download full copies of web pages and store them locally, as well as the hyperlink structures, as local directories. I tried to use Lucene, but I've realized that it doesn't have a crawler. Does anyone know software that does this? Thanks,
TestRangeQuery.java
Hi

Does anybody have trouble compiling TestRangeQuery.java in the Eclipse 3.0 IDE? [ http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/test/org/apache/lucene/search ]

Seems there is an error:

doc.add(new Field("id", "id" + docCount, Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("content", content, Field.Store.NO, Field.Index.TOKENIZED));

The compiler error with Lucene 1.4.1 on Windows is: Field.Store.YES is not found.

Thx in advance

WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK]
RE: Null or no analyzer
AFAIK, if the term Election 2004 is between quotation marks, this should work fine.

Aviran
http://aviran.mordos.com

-Original Message-
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 2:25 AM
To: Lucene Users List
Subject: RE: Null or no analyzer

Aviran writes: You can use WhitespaceAnalyzer

Can he? If "Elections 2004" is one token in the subject field (keyword), this will fail, since WhitespaceAnalyzer will tokenize it to `Elections' and `2004'. So I guess he has to write an identity analyzer himself unless one is provided (which doesn't seem to be the case). The only alternatives are not using the query parser, or extending the query parser with a keyword syntax, as far as I can see.
RE: Range Query
Karthik,

It is all spelled out in a Lucene HowTo here: http://wiki.apache.org/jakarta-lucene/SearchNumericalFields

Have fun with it,
Chuck

-Original Message-
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 12:15 AM
To: Lucene Users List; Jonathan Hager
Subject: RE: Range Query

Hi Jonathan

"When searching I also pad the query term" ??? When exactly are you handling this: during indexing as well, or only while searching? Can you please be specific? [If time permits, could you please send me sample code for the same?] :) Thx in advance

-Original Message-
From: Jonathan Hager [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 3:31 AM
To: Lucene Users List
Subject: Re: Range Query

That is exactly right. It is searching the ASCII. To solve it I pad my price using a method like this:

/**
 * Pads the price so that all prices are the same number of characters and
 * can be compared lexicographically.
 */
public static String formatPriceAsString(Double price) {
  if (price == null) {
    return null;
  }
  return PRICE_FORMATTER.format(price.doubleValue());
}

where PRICE_FORMATTER contains enough digits for your largest number:

private static final DecimalFormat PRICE_FORMATTER = new DecimalFormat("000.00");

When searching I also pad the query term. I looked into hooking into QueryParser, but since the lower/upper prices for my application are different inputs, I chose to handle them without hooking into the QueryParser.

Jonathan

On Tue, 19 Oct 2004 12:35:06 +0530, Karthik N S [EMAIL PROTECTED] wrote:

Hi Guys

Apologies.
I have a field of type Text, 'ItemPrice', using it to store the price in numeric form such as 10, 25.25, 50.00. If I am supposed to find the range between 2 prices, e.g. Contents:shoes +ItemPrice:[10.00 TO 50.60], I get results outside the range. [This may be due to the query comparing the ASCII values instead of the numeric values.] Am I missing something in the query syntax, or is this the wrong way to construct the query? Please somebody advise me ASAP. :( Thx in advance

WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK]
RE: Spell checker
Where can I download it?

Thanks, Lynn

-Original Message-
From: Nicolas Maisonneuve [mailto:[EMAIL PROTECTED]
Sent: Monday, October 11, 2004 1:26 PM
To: Lucene Users List
Subject: Spell checker

Hi Lucene users, I developed a spell checker for Lucene, inspired by the David Spencer code. See the wiki doc: http://wiki.apache.org/jakarta-lucene/SpellChecker

Nicolas Maisonneuve
Re: Null or no analyzer
On Oct 20, 2004, at 9:55 AM, Aviran wrote: AFAIK, if the term Election 2004 is between quotation marks, this should work fine.

No, it won't. The Analyzer will analyze it, and the WhitespaceAnalyzer would split it into two tokens, [Election] and [2004]. This is a tricky situation with no clear *best* way to do this sort of thing. However, given what I've seen of this thread so far, I'd recommend using the PerFieldAnalyzerWrapper and associating the fields indexed as Field.Keyword with a KeywordAnalyzer. There have been some variants of this posted on the list - it is not included in the API, but perhaps it should be. Or perhaps there are other options to solve this recurring dilemma folks have with Field.Keyword indexed fields and QueryParser?

Erik

Aviran
http://aviran.mordos.com

-Original Message-
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 2:25 AM
To: Lucene Users List
Subject: RE: Null or no analyzer

Aviran writes: You can use WhitespaceAnalyzer

Can he? If "Elections 2004" is one token in the subject field (keyword), this will fail, since WhitespaceAnalyzer will tokenize it to `Elections' and `2004'. So I guess he has to write an identity analyzer himself unless one is provided (which doesn't seem to be the case). The only alternatives are not using the query parser, or extending the query parser with a keyword syntax, as far as I can see.
RE: Null or no analyzer
Hi Erik

I think the best solution is to have a NullAnalyzer class that allows a simple pass-through. The query parser can then be given a PerFieldAnalyzerWrapper that knows when to select the NullAnalyzer or some other analyzer based on the Field:data ... Field2:pp format; this is something the query parser is already geared up to do.

regards
Rupinder

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: 20 October 2004 16:29
To: Lucene Users List
Subject: Re: Null or no analyzer

On Oct 20, 2004, at 9:55 AM, Aviran wrote: AFAIK, if the term Election 2004 is between quotation marks, this should work fine.

No, it won't. The Analyzer will analyze it, and the WhitespaceAnalyzer would split it into two tokens, [Election] and [2004]. This is a tricky situation with no clear *best* way to do this sort of thing. However, given what I've seen of this thread so far, I'd recommend using the PerFieldAnalyzerWrapper and associating the fields indexed as Field.Keyword with a KeywordAnalyzer. There have been some variants of this posted on the list - it is not included in the API, but perhaps it should be. Or perhaps there are other options to solve this recurring dilemma folks have with Field.Keyword indexed fields and QueryParser?

Erik

Aviran
http://aviran.mordos.com

-Original Message-
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 2:25 AM
To: Lucene Users List
Subject: RE: Null or no analyzer

Aviran writes: You can use WhitespaceAnalyzer

Can he? If "Elections 2004" is one token in the subject field (keyword), this will fail, since WhitespaceAnalyzer will tokenize it to `Elections' and `2004'. So I guess he has to write an identity analyzer himself unless one is provided (which doesn't seem to be the case). The only alternatives are not using the query parser, or extending the query parser with a keyword syntax, as far as I can see.
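To see why the quotation marks alone don't help, it can be useful to compare whitespace tokenization with the pass-through (identity) behavior a NullAnalyzer/KeywordAnalyzer would give. The sketch below is independent of Lucene's API; the class and method names are made up for illustration:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizerSketch {
    // Whitespace tokenization: splits the input on runs of whitespace,
    // which is what WhitespaceAnalyzer does to the quoted phrase.
    static List<String> whitespaceTokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Keyword ("identity") tokenization: the whole input becomes a single
    // token, matching what an untokenized Field.Keyword needs at query time.
    static List<String> keywordTokenize(String text) {
        return Arrays.asList(text);
    }

    public static void main(String[] args) {
        System.out.println(whitespaceTokenize("Elections 2004")); // [Elections, 2004]
        System.out.println(keywordTokenize("Elections 2004"));    // [Elections 2004]
    }
}
```

A keyword-style analyzer wired into PerFieldAnalyzerWrapper for the subject field would give the second behavior, so the query term matches the single indexed token exactly.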
Re: Null or no analyzer
Erik Hatcher wrote: On Oct 20, 2004, at 9:55 AM, Aviran wrote: AFAIK, if the term Election 2004 is between quotation marks, this should work fine.

No, it won't. The Analyzer will analyze it, and the WhitespaceAnalyzer would split it into two tokens, [Election] and [2004]. This is a tricky situation with no clear *best* way to do this sort of thing. However, given what I've seen of this thread so far, I'd recommend using the PerFieldAnalyzerWrapper and associating the fields indexed as Field.Keyword with a KeywordAnalyzer. There have been some variants of this posted on the list - it is not included in the API, but perhaps it should be. Or perhaps there are other options to solve this recurring dilemma folks have with Field.Keyword indexed fields and QueryParser?

Erik

I still don't understand what is wrong with the idea of indexing the title in a separate field and searching with a phrase query +title:"Elections 2004". I think that the real problem is that the title is not tokenized and that the title contains more than Elections 2004. I think this solution is worth a try. Or maybe I don't understand the problem correctly ...

All the best,
Sergiu

Aviran
http://aviran.mordos.com

-Original Message-
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 2:25 AM
To: Lucene Users List
Subject: RE: Null or no analyzer

Aviran writes: You can use WhitespaceAnalyzer

Can he? If "Elections 2004" is one token in the subject field (keyword), this will fail, since WhitespaceAnalyzer will tokenize it to `Elections' and `2004'. So I guess he has to write an identity analyzer himself unless one is provided (which doesn't seem to be the case). The only alternatives are not using the query parser, or extending the query parser with a keyword syntax, as far as I can see.
RE: Null or no analyzer
hi

The basic problem here is that there are data sources which contain a) id, b) text, c) title, d) authors, and e) subject heading. Text, title and authors need to be tokenized; the subject heading can be one or more words. Anyone searching such a data source is expected to know the subject headings. If the user is trying to find all articles that contain the phrases John Kerry and George Bush and that are classified as Election 2004, it is possible that there are other documents that are classified as National Service Records or Tax Returns etc. So the object is to find documents that have the above-mentioned phrases as well as one of the subject classifiers, so as to pull out the most meaningful documents.

The subject classifiers pertain to domain knowledge, and it is possible that 2 or more subject classification headings are composed of the same set of words, but the sequence in which they appear can drastically alter the meaning; hence tokenizing the subject field is not exactly a healthy solution. Also, such search tools are meant for people who know / understand this classification system. The taxonomy of animals can be taken as one such example. Hope this helps define the problem.

I still don't understand what is wrong with the idea of indexing the title in a separate field and searching with a phrase query +title:"Elections 2004". I think that the real problem is that the title is not tokenized and that the title contains more than Elections 2004. I think this solution is worth a try. Or maybe I don't understand the problem correctly ...

All the best,
Sergiu

Aviran
http://aviran.mordos.com

-Original Message-
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 2:25 AM
To: Lucene Users List
Subject: RE: Null or no analyzer

Aviran writes: You can use WhitespaceAnalyzer

Can he? If "Elections 2004" is one token in the subject field (keyword), this will fail, since WhitespaceAnalyzer will tokenize it to `Elections' and `2004'.
So I guess he has to write an identity analyzer himself unless one is provided (which doesn't seem to be the case). The only alternatives are not using the query parser, or extending the query parser with a keyword syntax, as far as I can see.
how to find coherent terms
Hello,

I have to implement a function in my project and I hope I can find someone who can help me. The idea is about searching for coherent (co-occurring) terms. My imagination:

1. search for a specific term_a
2. result: hits from Lucene, result list:
   term_a term_b term_c term_d
   term_b term_a term_e
   term_e term_a term_b term_f
3. now I can see that term_a is in a special relation to term_b, but how can I check this with Lucene? Is this supported by any function of Lucene, or does any other API exist for it?

thx
miro
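Lucene has no built-in co-occurrence function, but the relation Miro describes can be computed over the hit list after the search. A minimal sketch, with hypothetical data and method names (it counts, for each term, how often it appears in hits that also contain the query term):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class CoOccurrence {
    // Counts how often each term appears in documents that also contain `target`.
    static Map<String, Integer> coOccurringTerms(String[] docs, String target) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            String[] terms = doc.split("\\s+");
            if (!Arrays.asList(terms).contains(target)) continue;
            for (String t : terms) {
                if (!t.equals(target)) {
                    counts.merge(t, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The three hits from the example result list.
        String[] hits = {
            "term_a term_b term_c term_d",
            "term_b term_a term_e",
            "term_e term_a term_b term_f"
        };
        // term_b co-occurs with term_a in all three hits.
        System.out.println(coOccurringTerms(hits, "term_a").get("term_b")); // 3
    }
}
```

In a real application the document strings would come from the stored fields of the Hits object, and the counts could be normalized (e.g. by document frequency) to rank the relations.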
RE: Spell checker
Here: http://issues.apache.org/bugzilla/showattachment.cgi?attach_id=13009

Aviran
http://aviran.mordos.com

-Original Message-
From: Lynn Li [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 10:52 AM
To: 'Lucene Users List'
Subject: RE: Spell checker

Where can I download it?

Thanks, Lynn

-Original Message-
From: Nicolas Maisonneuve [mailto:[EMAIL PROTECTED]
Sent: Monday, October 11, 2004 1:26 PM
To: Lucene Users List
Subject: Spell checker

Hi Lucene users, I developed a spell checker for Lucene, inspired by the David Spencer code. See the wiki doc: http://wiki.apache.org/jakarta-lucene/SpellChecker

Nicolas Maisonneuve
Re: Null or no analyzer
Rupinder Singh Mazara wrote: hi the basic problem here is that there are data sources which contain a) id, b) text, c) title, d) authors, and e) subject heading. Text, title and authors need to be tokenized; the subject heading can be one or more words,

The subject must also be tokenized, otherwise you cannot get any results that don't match the term exactly. So, for example, let's assume you have the following titles: "George Trash Elections" and "George Trash". If you search for George Trash and your title is not tokenized, you will get just the second document (I hope I'm not making any mistake when I say that; anyway, it can be easily tested).

anyone searching such a data source is expected to know the subject headings; if the user is trying to find all articles that have the phrases John Kerry and George Bush as well as being classified as Election 2004, it is possible that there are other documents that are classified as National Service Records or Tax Returns etc...

How is this represented in the GUI? As a select box, or an input field? If it is a select box, and you have a unique domain concept, you can use a not-tokenized string, or even a numerical representation, but I think that is not your case. In the case of input fields, again I suggest you tokenize the string.

so the object is to find documents that have the above-mentioned phrases as well as one of the subject classifiers, so as to pull out the most meaningful documents

No problem ... once again, use +subject:"my searched subject"

the subject classifiers pertain to domain knowledge, and it is possible that 2 or more subject classification headings are composed of the same set of words, but the sequence in which they appear can drastically alter the meaning; hence tokenizing the subject field is not exactly a healthy solution.
The tokenization doesn't change the word order; if you use a PhraseQuery you will get the correct results. +title:"George Bush" doesn't return documents with the title Bush George.

also such search tools are meant for people who know / understand this classification system

:)) This is a general truth: the results are better when the people know what they are searching for :)

Taxonomy of animals can be taken as one such example, hope this helps define the problem

I cannot see anything special in your problem. Before starting to implement a complex solution, it is probably better to give the simple one a chance ... I assure you that you won't lose anything, and even if you decide to implement a complex solution you will have a lot of reusable code. So ... have fun,

Sergiu

PS: if you can provide an example with a false positive, please provide us the case.

I still don't understand what is wrong with the idea of indexing the title in a separate field and searching with a phrase query +title:"Elections 2004". I think that the real problem is that the title is not tokenized and that the title contains more than Elections 2004. I think this solution is worth a try. Or maybe I don't understand the problem correctly ...

All the best,
Sergiu

Aviran
http://aviran.mordos.com

-Original Message-
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 2:25 AM
To: Lucene Users List
Subject: RE: Null or no analyzer

Aviran writes: You can use WhitespaceAnalyzer

Can he? If "Elections 2004" is one token in the subject field (keyword), this will fail, since WhitespaceAnalyzer will tokenize it to `Elections' and `2004'. So I guess he has to write an identity analyzer himself unless one is provided (which doesn't seem to be the case). The only alternatives are not using the query parser, or extending the query parser with a keyword syntax, as far as I can see.
Re: Spell checker
I investigated how the algorithm implemented in this spell checker compares with my simple implementation of a spell checker. First, here is what my implementation looks like:

// Each word becomes a single Lucene Document
// To find suggestions:
FuzzyQuery fquery = new FuzzyQuery(new Term("word", word));
Hits dicthits = dictionarySearcher.search(fquery);

For a simple test I misspelled brown as follows: bronw, bruwn, brownz. To validate my test cases I checked whether Microsoft Word and Google had any idea what I was trying to spell. Google suggested brown, brown, browns, respectively. Word's suggestions were:

bronw => brown, brow
bruwn => brown, brawn, bruin
brownz => browns, brown

The suggestions using David Spencer/Nicolas Maisonneuve's algorithm against my index were:

bronw => jaron, brooks, citron, brookline
bruwn => brush
brownz => bronze, brooks, brooke, brookline

The suggestions using my real simple algorithm against my index were:

bronw => brown, brwn, brush
bruwn => brown, brwn, brush
brownz => brown, bronze

It appears that David Spencer/Nicolas Maisonneuve's spell checking algorithm returns a broader result set than most commercial algorithms or a real simple algorithm. I will be the first to say that this is just anecdotal evidence and not a rigorous test of either algorithm. But until extensive testing has been done, I'm going to stick with my real simple dictionary lookup.

Jonathan

On Wed, 20 Oct 2004 12:56:39 -0400, Aviran [EMAIL PROTECTED] wrote: Here: http://issues.apache.org/bugzilla/showattachment.cgi?attach_id=13009

Aviran
http://aviran.mordos.com

-Original Message-
From: Lynn Li [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 10:52 AM
To: 'Lucene Users List'
Subject: RE: Spell checker

Where can I download it?
Thanks, Lynn

-Original Message-
From: Nicolas Maisonneuve [mailto:[EMAIL PROTECTED]
Sent: Monday, October 11, 2004 1:26 PM
To: Lucene Users List
Subject: Spell checker

Hi Lucene users, I developed a spell checker for Lucene, inspired by the David Spencer code. See the wiki doc: http://wiki.apache.org/jakarta-lucene/SpellChecker

Nicolas Maisonneuve
RE: Spell checker
If you look at the FuzzyQuery code, it is based on computing the Levenshtein distance between the original term and every term in the index, and keeping the terms that are within the specified relative distance of the original term. This would explain why FuzzyQuery may work well for small indexes, but for large indexes (I have ~5 million terms in mine) it is impossibly slow.

What n-gram based (or any other secondary-index based) spell checkers try to do is select a limited number of candidate terms in a very quick manner and then apply the distance algorithm only to them. If you use the same cutoff rules as the FuzzyQuery, you will get a very similar result set. Secondary index-based spell checkers also give you a lot more control over how many similar terms to bring back and in what order.

Regards,
Alexey

-Original Message-
From: Jonathan Hager [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 6:48 PM
To: Lucene Users List
Subject: Re: Spell checker

I investigated how the algorithm implemented in this spell checker compares with my simple implementation of a spell checker. First, here is what my implementation looks like:

// Each word becomes a single Lucene Document
// To find suggestions:
FuzzyQuery fquery = new FuzzyQuery(new Term("word", word));
Hits dicthits = dictionarySearcher.search(fquery);

For a simple test I misspelled brown as follows: bronw, bruwn, brownz. To validate my test cases I checked whether Microsoft Word and Google had any idea what I was trying to spell. Google suggested brown, brown, browns, respectively.
Word's suggestions were:

bronw => brown, brow
bruwn => brown, brawn, bruin
brownz => browns, brown

The suggestions using David Spencer/Nicolas Maisonneuve's algorithm against my index were:

bronw => jaron, brooks, citron, brookline
bruwn => brush
brownz => bronze, brooks, brooke, brookline

The suggestions using my real simple algorithm against my index were:

bronw => brown, brwn, brush
bruwn => brown, brwn, brush
brownz => brown, bronze

It appears that David Spencer/Nicolas Maisonneuve's spell checking algorithm returns a broader result set than most commercial algorithms or a real simple algorithm. I will be the first to say that this is just anecdotal evidence and not a rigorous test of either algorithm. But until extensive testing has been done, I'm going to stick with my real simple dictionary lookup.

Jonathan

On Wed, 20 Oct 2004 12:56:39 -0400, Aviran [EMAIL PROTECTED] wrote: Here: http://issues.apache.org/bugzilla/showattachment.cgi?attach_id=13009

Aviran
http://aviran.mordos.com

-Original Message-
From: Lynn Li [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 10:52 AM
To: 'Lucene Users List'
Subject: RE: Spell checker

Where can I download it?

Thanks, Lynn

-Original Message-
From: Nicolas Maisonneuve [mailto:[EMAIL PROTECTED]
Sent: Monday, October 11, 2004 1:26 PM
To: Lucene Users List
Subject: Spell checker

Hi Lucene users, I developed a spell checker for Lucene, inspired by the David Spencer code. See the wiki doc: http://wiki.apache.org/jakarta-lucene/SpellChecker

Nicolas Maisonneuve
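The two-stage scheme Alexey describes (quick candidate selection via shared n-grams, then exact Levenshtein ranking) can be sketched without Lucene at all. The trigram size, cutoff, dictionary, and class name below are illustrative, not taken from the contrib code:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TwoStageSpell {
    // Character n-grams of a word, used as a cheap similarity signal.
    static Set<String> ngrams(String w, int n) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= w.length(); i++) out.add(w.substring(i, i + n));
        return out;
    }

    // Classic dynamic-programming Levenshtein (edit) distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }

    // Stage 1: keep dictionary words sharing at least one trigram with the query.
    // Stage 2: filter and rank the survivors by exact edit distance.
    static List<String> suggest(String word, List<String> dictionary, int maxDistance) {
        Set<String> q = ngrams(word, 3);
        List<String> out = new ArrayList<>();
        for (String cand : dictionary) {
            Set<String> shared = ngrams(cand, 3);
            shared.retainAll(q);
            if (!shared.isEmpty() && levenshtein(word, cand) <= maxDistance) out.add(cand);
        }
        out.sort((x, y) -> Integer.compare(levenshtein(word, x), levenshtein(word, y)));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(suggest("brownz",
                java.util.Arrays.asList("brown", "brawn", "bronze"), 2)); // [brown, bronze]
    }
}
```

The point of stage 1 is that with a secondary index of n-grams, candidate lookup is a handful of term queries instead of a distance computation against all ~5 million terms; only the few survivors pay the Levenshtein cost.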
Re: TestRangeQuery.java
Hi,

If the tests compile and run fine outside Eclipse, then it is the Eclipse project setup that needs adjusting :-)

Good luck, Vladimir.

On Wed, 20 Oct 2004 19:10:45 +0530 Karthik N S [EMAIL PROTECTED] wrote:

Hi

Does anybody have trouble compiling TestRangeQuery.java in the Eclipse 3.0 IDE? [ http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/test/org/apache/lucene/search ]

Seems there is an error:

doc.add(new Field("id", "id" + docCount, Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("content", content, Field.Store.NO, Field.Index.TOKENIZED));

The compiler error with Lucene 1.4.1 on Windows is: Field.Store.YES is not found.

Thx in advance

WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK]