RE: Fatal error on Windows
If my understanding is correct, unless you are using JNI you should never be able to crash the JVM from Java code alone. We've had a lot of crash problems with Sun's JVM, especially in server mode (on Linux, not Windows). We don't have any JNI code (only the JVM itself and the database driver). We eventually switched to BEA JRockit and haven't had a crash since.

Hope this helps,
Alexey

-----Original Message-----
From: Steve Rajavuori [mailto:[EMAIL PROTECTED]
Sent: Monday, January 03, 2005 11:59 AM
To: 'Lucene Users List'
Subject: RE: Fatal error on Windows

No, I didn't change the source code at all. Has anyone ever seen this error with Lucene 1.4.3? I am unsure how to troubleshoot further, since the error occurs within the call to search().

Steve

-----Original Message-----
From: Vikas Gupta [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 30, 2004 4:47 PM
To: Lucene Users List
Subject: Re: Fatal error on Windows

A similar message appeared for me on Linux. If you changed the source code, I would recommend doing:

ant clean (or something similar)
ant jar
ant war

Did you change your source code?

-Vikas

----- Original Message -----
From: Steve Rajavuori [EMAIL PROTECTED]
To: 'Lucene Users List' lucene-user@jakarta.apache.org
Sent: Thursday, December 30, 2004 4:47 PM
Subject: Fatal error on Windows

I am getting a fatal exception on Windows 2000 Server when performing a search. Upon calling IndexSearcher.search() with a large query I see this error from the JVM:

Unexpected Signal : EXCEPTION_FLT_STACK_CHECK (0xc092) occurred at PC=0xA2D416
Function=[Unknown.]
Library=(N/A)

NOTE: We are unable to locate the function name symbol for the error just occurred. Please refer to release documentation for possible reason and solutions.

I am using Lucene 1.4.3 and JRE 1.4.2_06. Has anyone had an experience like this? Any suggestions to work around or troubleshoot?
Steve Rajavuori

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Unexpected TermEnum behavior
My application needs to enumerate all terms for a specific field. To do that I get a TermEnum using the following code:

TermEnum terms = reader.terms(new Term(fieldName, ""));

I noticed that the TermEnum is initially positioned at the first term; in other words, I don't have to call terms.next() before calling terms.term(). This is different from the behavior of Iterator, Enumeration and ResultSet, whose initial position is before the first result. I wonder whether this is by design. If it is, what is the defined TermEnum behavior when there are no terms for the field name in question? Will the call to terms.term() return null? Or will it be positioned at the first term of whatever field name sorts after the provided one? And what if there are no field names after it? In any case, some javadoc describing the behavior would be extremely useful. Being used to Iterators and ResultSets, I automatically wrote the code the same way, calling next() first. Fortunately, I had a field with only two terms, which is how I noticed I was missing the first element.

Thanks,
Alexey
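The cursor semantics described above call for a do/while loop rather than the usual while(next()) idiom. Here is a minimal sketch of the pattern using a plain-Java stand-in for TermEnum (the real Lucene class reads terms from the index; this only illustrates the positioning behavior):

```java
import java.util.*;

// A minimal stand-in for Lucene 1.4's TermEnum: unlike java.util.Iterator,
// the cursor starts ON the first element, so next() is called AFTER term().
class TermCursor {
    private final List<String> terms;
    private int pos = 0;
    TermCursor(List<String> terms) { this.terms = terms; }
    String term() { return pos < terms.size() ? terms.get(pos) : null; }
    boolean next() { pos++; return pos < terms.size(); }
}

class TermEnumIdiom {
    // Correct idiom: consume the current term first, then advance.
    // A while(next()) loop would silently skip the first term.
    static List<String> collect(TermCursor e) {
        List<String> out = new ArrayList<>();
        if (e.term() == null) return out;   // empty enumeration
        do {
            out.add(e.term());
        } while (e.next());
        return out;
    }

    public static void main(String[] args) {
        TermCursor e = new TermCursor(Arrays.asList("apple", "banana", "cherry"));
        System.out.println(collect(e)); // [apple, banana, cherry]
    }
}
```

With the real API the loop body looks the same, only with TermEnum in place of TermCursor.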
RE: Spell checker
If you look at the FuzzyQuery code, it is based on computing the Levenshtein distance between the original term and every term in the index, keeping the terms that are within the specified relative distance of the original term. This would explain why FuzzyQuery may work well for small indexes, but for large indexes (I have ~5 million terms in mine) it is impossibly slow. What n-gram based (or any other secondary-index based) spell checkers try to do is select a limited number of candidate terms very quickly and then apply the distance algorithm only to them. If you use the same cutoff rules as FuzzyQuery, you will get a very similar result set. Secondary-index based spell checkers also give you a lot more control over how many similar terms to bring back and in what order.

Regards,
Alexey

-----Original Message-----
From: Jonathan Hager [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 6:48 PM
To: Lucene Users List
Subject: Re: Spell checker

I investigated how the algorithm implemented in this spell checker compares with my simple implementation of a spell checker. First, here is what my implementation looks like:

// Each word becomes a single Lucene Document
// To find suggestions:
FuzzyQuery fquery = new FuzzyQuery(new Term(word, word));
Hits dicthits = dictionarySearcher.search(fquery);

For a simple test I misspelled "brown" as follows:

* bronw
* bruwn
* brownz

To validate my test cases I checked whether Microsoft Word and Google had any idea what I was trying to spell. Google suggested brown, brown, browns, respectively.
Word's suggestions were:

bronw == brown, brow
bruwn == brown, brawn, bruin
brownz == browns, brown

The suggestions using David Spencer/Nicolas Maisonneuve's algorithm against my index were:

bronw == jaron, brooks, citron, brookline
bruwn == brush
brownz == bronze, brooks, brooke, brookline

The suggestions using my real simple algorithm against my index were:

bronw == brown, brwn, brush
bruwn == brown, brwn, brush
brownz == brown, bronze

It appears that David Spencer/Nicolas Maisonneuve's spell checking algorithm returns a broader result set than most commercial algorithms or a real simple algorithm. I will be the first to say that this is just anecdotal evidence and not a rigorous test of either algorithm. But until extensive testing has been done I'm going to stick with my real simple dictionary lookup.

Jonathan

On Wed, 20 Oct 2004 12:56:39 -0400, Aviran [EMAIL PROTECTED] wrote:

Here: http://issues.apache.org/bugzilla/showattachment.cgi?attach_id=13009

Aviran
http://aviran.mordos.com

-----Original Message-----
From: Lynn Li [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 10:52 AM
To: 'Lucene Users List'
Subject: RE: Spell checker

Where can I download it?

Thanks,
Lynn

-----Original Message-----
From: Nicolas Maisonneuve [mailto:[EMAIL PROTECTED]
Sent: Monday, October 11, 2004 1:26 PM
To: Lucene Users List
Subject: Spell checker

Hi Lucene users,
I developed a spell checker for Lucene inspired by the David Spencer code. See the wiki doc: http://wiki.apache.org/jakarta-lucene/SpellChecker

Nicolas Maisonneuve

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
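The two-stage approach Alexey describes in this thread (cheap n-gram candidate selection, then exact edit-distance ranking) can be sketched in plain Java. This is a self-contained illustration, not the actual spell checker code; the dictionary, trigram size, and cutoff are arbitrary choices:

```java
import java.util.*;

public class NGramSpellSketch {
    // Classic dynamic-programming Levenshtein distance, as FuzzyQuery uses.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }

    // Break a word into character n-grams; a secondary index would map each
    // n-gram to the dictionary words containing it.
    static Set<String> ngrams(String word, int n) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= word.length(); i++) out.add(word.substring(i, i + n));
        return out;
    }

    // Stage 1: keep only candidates sharing at least one trigram with the input.
    // Stage 2: apply the expensive distance cutoff to the few survivors.
    static List<String> suggest(String word, Collection<String> dictionary, int maxDistance) {
        Set<String> query = ngrams(word, 3);
        List<String> result = new ArrayList<>();
        for (String cand : dictionary) {
            Set<String> shared = new HashSet<>(ngrams(cand, 3));
            shared.retainAll(query);
            if (!shared.isEmpty() && distance(word, cand) <= maxDistance) result.add(cand);
        }
        result.sort(Comparator.comparingInt(c -> distance(word, c)));
        return result;
    }

    public static void main(String[] args) {
        List<String> dict = Arrays.asList("brown", "browns", "bronze", "brawn", "tupperware");
        System.out.println(suggest("brownz", dict, 2)); // [brown, browns, bronze]
    }
}
```

In a real index, stage 1 is a Lucene query against the n-gram field rather than a scan, which is what makes it fast on millions of terms.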
RE: n-gram indexing for generating spell suggestions
You can also store a phonetic key for each term to find sounds-like matches. I use the Double Metaphone algorithm, which appears to be English-specific; I'm not sure whether there is something out there for Dutch. For the length issue, I use a relative distance cutoff (distance/length) in addition to the absolute cutoff, which doesn't work very well for short words (as you mentioned).

Alexey

-----Original Message-----
From: Aad Nales [mailto:[EMAIL PROTECTED]
Sent: Monday, October 18, 2004 11:59 AM
To: [EMAIL PROTECTED]
Subject: n-gram indexing for generating spell suggestions

...
2. Often-occurring misspellings of Dutch words between 4 and 5 characters were missed. E.g. 'fiets' was never suggested as a possible spell suggestion for 'feits', since no matching 3-gram exists between the two. The same held true for misspellings based on 'ch' and 'g', which sound the same in Dutch but are written differently.
3. Words that could never be part of a suggestion were added based on a single matching n-gram. (E.g. if I ask for suggestions on 'per', then 'tupperware' is also suggested, even though on length alone it is clear that it has a minimal distance of 7.)
...

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
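The relative distance cutoff Alexey mentions is easy to state in code. A minimal sketch; the threshold value used below is an arbitrary illustration, not a number from the original posts:

```java
public class RelativeCutoff {
    // Accept a candidate only if its edit distance is small relative to the
    // word's length. An absolute cutoff alone over-accepts for short words
    // (distance 2 on a 4-letter word changes half the word) and over-rejects
    // for long ones.
    static boolean accept(int editDistance, int wordLength, double maxRelative) {
        return (double) editDistance / wordLength <= maxRelative;
    }

    public static void main(String[] args) {
        System.out.println(accept(2, 4, 0.4));  // false: 2/4 = 0.5 is too large
        System.out.println(accept(2, 10, 0.4)); // true: 2/10 = 0.2 is minor
    }
}
```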
RE: Field Tokenization
You can do it using PerFieldAnalyzerWrapper. See
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html
for details.

Alexey

-----Original Message-----
From: Brandon Lee [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 17, 2004 3:51 PM
To: Lucene Users List
Subject: Field Tokenization

Hi. I would like to tokenize different fields in a document with different analyzers, but it doesn't seem possible because analyzers are associated with documents, not fields. Is there a reason for this? For example, I'd like:

Document:
  Field=Text   - Porter stemmer with stop words
  Field=Author - whitespace, lower-cased

If I add Field=Author as a Keyword field (non-tokenized), the documentation states that it will be added as a single word (I want separate words, just not through the Porter stemmer). I know the query side would be more complicated, but I'm willing to code around that. Thanks for any enlightenment.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
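PerFieldAnalyzerWrapper simply dispatches to a per-field analyzer, falling back to a default for fields with no explicit mapping. The idea can be illustrated with a self-contained stand-in (the real class wraps Lucene Analyzer instances; the tokenizer functions here are simplified placeholders, and the field names are just Brandon's example):

```java
import java.util.*;
import java.util.function.Function;

public class PerFieldSketch {
    // Maps a field name to its tokenizer; unknown fields use the default.
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();
    private final Function<String, List<String>> fallback;

    PerFieldSketch(Function<String, List<String>> fallback) { this.fallback = fallback; }

    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    List<String> tokenize(String field, String text) {
        return perField.getOrDefault(field, fallback).apply(text);
    }

    public static void main(String[] args) {
        // Default: whitespace split, lower-cased (like the Author field).
        Function<String, List<String>> whitespaceLower =
            s -> Arrays.asList(s.toLowerCase().split("\\s+"));
        // Crude suffix-stripping stand-in for a stemming analyzer.
        Function<String, List<String>> crudeStem = s -> {
            List<String> out = new ArrayList<>();
            for (String t : s.toLowerCase().split("\\s+"))
                out.add(t.endsWith("ing") ? t.substring(0, t.length() - 3) : t);
            return out;
        };
        PerFieldSketch wrapper = new PerFieldSketch(whitespaceLower);
        wrapper.addAnalyzer("Text", crudeStem);
        System.out.println(wrapper.tokenize("Text", "Running Fast")); // [runn, fast]
        System.out.println(wrapper.tokenize("Author", "John Smith")); // [john, smith]
    }
}
```

The real class is used the same way: construct it with a default Analyzer, call addAnalyzer("fieldName", analyzer) for the exceptions, and pass the wrapper to both IndexWriter and the query parser so indexing and querying stay consistent.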