RE: Fatal error on Windows

2005-01-08 Thread Alexey Lef
If my understanding is correct, unless you are using JNI you should never
be able to crash the JVM using only Java code. We've had a lot of crash
problems with Sun's JVM, especially in server mode (on Linux, not Windows),
and we don't have any JNI code (only the JVM itself and the database driver).
We finally switched to BEA JRockit and haven't had a crash since.

Hope this helps,

Alexey 

-Original Message-
From: Steve Rajavuori [mailto:[EMAIL PROTECTED] 
Sent: Monday, January 03, 2005 11:59 AM
To: 'Lucene Users List'
Subject: RE: Fatal error on Windows


No, I didn't change the source code at all. Has anyone ever seen this error
with Lucene 1.4.3? I am unsure how to troubleshoot further, since the error
occurs within the call to search().

Steve

-Original Message-
From: Vikas Gupta [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 30, 2004 4:47 PM
To: Lucene Users List
Subject: Re: Fatal error on Windows


A similar message appeared for me on Linux. I would recommend doing a

ant clean (or something similar)
ant jar
ant war

if you are changing the source code. Did you change your source code?

-Vikas
- Original Message - 
From: Steve Rajavuori [EMAIL PROTECTED]
To: 'Lucene Users List' lucene-user@jakarta.apache.org
Sent: Thursday, December 30, 2004 4:47 PM
Subject: Fatal error on Windows


 I am getting a fatal exception on Windows 2000 Server when performing a
 search. Upon call to IndexSearcher.search( ) with a large query I see this
 error from the JVM:

 Unexpected Signal : EXCEPTION_FLT_STACK_CHECK (0xc0000092) occurred at
 PC=0xA2D416
 Function=[Unknown.]
 Library=(N/A)

 NOTE: We are unable to locate the function name symbol for the error
   just occurred. Please refer to release documentation for possible
   reason and solutions.

 I am using Lucene 1.4.3 and JRE 1.4.2_06. Has anyone had an experience
 like this? Any suggestions to work around or troubleshoot?

 Steve Rajavuori




Unexpected TermEnum behavior

2004-12-08 Thread Alexey Lef
My application needs to enumerate all terms for a specific field. To do that
I get the TermEnum using the following code:

TermEnum terms = reader.terms(new Term(fieldName, ""));

I noticed that initially the TermEnum is positioned at the first term. In other
words, I don't have to call terms.next() before calling terms.term(). This is
different from the behavior of Iterator, Enumeration, and ResultSet, whose
initial position is before the first result. I wonder whether this is by
design.

If it is by design, what is the defined TermEnum behavior if there are no
terms for the field name in question? Will the call to terms.term() return
null? Or will the enum be positioned at the first term whose field name sorts
after the requested one? And what if there are no field names after it?
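
For reference, here is the defensive loop I ended up with. It is only a
sketch; it assumes terms.term() returns null once the enumeration is
exhausted and that terms are ordered by field name first:

   TermEnum terms = reader.terms(new Term(fieldName, ""));
   try {
       do {
           Term t = terms.term();
           if (t == null || !t.field().equals(fieldName)) {
               break; // no more terms, or we have moved past this field
           }
           System.out.println(t.text()); // process the term
       } while (terms.next());
   } finally {
       terms.close();
   }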

In any case, some javadoc describing the behavior would be extremely useful.
Being used to Iterators and ResultSets, I automatically wrote the code the
same way, calling next() first. Fortunately, I had a field with only two
terms, which is how I noticed I was missing the first element.

Thanks,

Alexey


RE: Spell checker

2004-10-20 Thread Alexey Lef
If you look at the FuzzyQuery code, it is based on computing the Levenshtein
distance between the original term and every term in the index, keeping the
terms that are within the specified relative distance of the original term.
This explains why FuzzyQuery may work well for small indexes but is impossibly
slow for large ones (mine has ~5 million terms).
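
For reference, a minimal sketch of that distance computation, in the standard
two-row dynamic-programming form (variable names are mine, not Lucene's):

   // O(m*n) per pair; running this against every one of ~5 million
   // index terms is what makes FuzzyQuery slow on a large index.
   static int levenshtein(String a, String b) {
       int[] prev = new int[b.length() + 1];
       int[] curr = new int[b.length() + 1];
       for (int j = 0; j <= b.length(); j++) prev[j] = j;
       for (int i = 1; i <= a.length(); i++) {
           curr[0] = i;
           for (int j = 1; j <= b.length(); j++) {
               int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
               curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
           }
           int[] tmp = prev; prev = curr; curr = tmp;
       }
       return prev[b.length()];
   }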

What n-gram based (or any other secondary-index based) spell checkers try to
do is select a limited number of candidate terms very quickly and then apply
the distance algorithm only to those. If you use the same cutoff rules as
FuzzyQuery, you will get a very similar result set. Secondary-index based
spell checkers also give you a lot more control over how many similar terms
to bring back and in what order.
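
The candidate selection itself can be as simple as indexing each term's
character n-grams and querying those grams; a sketch (n=3, purely
illustrative):

   // ngrams("brown", 3) -> { "bro", "row", "own" }. A misspelling like
   // "bronw" still shares "bro", so "brown" is retrieved as a candidate.
   static String[] ngrams(String term, int n) {
       if (term.length() < n) return new String[] { term };
       String[] grams = new String[term.length() - n + 1];
       for (int i = 0; i < grams.length; i++) {
           grams[i] = term.substring(i, i + n);
       }
       return grams;
   }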

Regards,

Alexey


-Original Message-
From: Jonathan Hager [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 20, 2004 6:48 PM
To: Lucene Users List
Subject: Re: Spell checker


I investigated how the algorithm implemented in this spell checker
compares with my simple implementation of a spell checker.

First here is what my implementation looks like:

//Each word becomes a single Lucene Document

//To find suggestions:
   FuzzyQuery fquery = new FuzzyQuery(new Term("word", word));
   Hits dicthits = dictionarySearcher.search(fquery);
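
For context, the index behind this is nothing more than one Document per
dictionary word. Roughly (a sketch against the Lucene 1.4 API; the "word"
field name and the path are just illustrations):

   void buildDictionary(String[] words) throws IOException {
       IndexWriter writer =
           new IndexWriter("/tmp/dictionary", new WhitespaceAnalyzer(), true);
       for (int i = 0; i < words.length; i++) {
           Document doc = new Document();
           doc.add(Field.Keyword("word", words[i])); // one untokenized term
           writer.addDocument(doc);
       }
       writer.optimize();
       writer.close();
   }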

For a simple test I misspelled brown, as follows:
 * bronw
 * bruwn
 * brownz

To validate my test cases I checked whether Microsoft Word and Google had any
idea what I was trying to spell. Google suggested brown, brown, and browns,
respectively.

Word's suggestions were:

bronw  => brown, brow
bruwn  => brown, brawn, bruin
brownz => browns, brown

The suggestions using David Spencer/Nicolas Maisonneuve's algorithm
against my index were:

bronw  => jaron, brooks, citron, brookline
bruwn  => brush
brownz => bronze, brooks, brooke, brookline


The suggestions using my real simple algorithm against my index were:

bronw  => brown, brwn, brush
bruwn  => brown, brwn, brush
brownz => brown, bronze

It appears that David Spencer/Nicolas Maisonneuve's spell checking algorithm
returns a broader result set than most commercial algorithms or a real simple
algorithm. I will be the first to say that this is just anecdotal evidence and
not a rigorous test of either algorithm. But until extensive testing has been
done, I'm going to stick with my real simple dictionary lookup.

Jonathan

On Wed, 20 Oct 2004 12:56:39 -0400, Aviran [EMAIL PROTECTED] wrote:
 Here http://issues.apache.org/bugzilla/showattachment.cgi?attach_id=13009
 
 Aviran
 http://aviran.mordos.com
 
 
 
 -Original Message-
 From: Lynn Li [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, October 20, 2004 10:52 AM
 To: 'Lucene Users List'
 Subject: RE: Spell checker
 
 Where can I download it?
 
 Thanks,
 Lynn
 
 -Original Message-
 From: Nicolas Maisonneuve [mailto:[EMAIL PROTECTED]
 Sent: Monday, October 11, 2004 1:26 PM
 To: Lucene Users List
 Subject: Spell checker
 
 Hi Lucene users,
 I developed a spell checker for Lucene, inspired by David Spencer's code.
 
 see the wiki doc: http://wiki.apache.org/jakarta-lucene/SpellChecker
 
 Nicolas Maisonneuve
 



RE: n-gram indexing for generating spell suggestions

2004-10-18 Thread Alexey Lef
You can also store a phonetic key for each term to find sounds-like matches. I
use the Double Metaphone algorithm, which appears to be English-specific; I'm
not sure whether there is something comparable for Dutch.

For the length issue, I use a relative distance cutoff (distance/length) in
addition to the absolute cutoff, which (as you mentioned) doesn't work very
well for short words.
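
Roughly, the combination looks like this. This is only a sketch; it assumes
Jakarta Commons Codec (org.apache.commons.codec.language.DoubleMetaphone) and
Commons Lang (org.apache.commons.lang.StringUtils) are on the classpath, and
the 0.4 threshold is purely illustrative:

   // Phonetic key, stored alongside each term at index time:
   DoubleMetaphone dm = new DoubleMetaphone();
   String key = dm.doubleMetaphone(term);

   // Candidate filtering at query time:
   boolean soundsLike = dm.isDoubleMetaphoneEqual(query, candidate);
   int d = StringUtils.getLevenshteinDistance(query, candidate);
   int len = Math.max(query.length(), candidate.length());
   boolean keep = soundsLike || (float) d / len <= 0.4f; // relative cutoff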

Alexey

-Original Message-
From: Aad Nales [mailto:[EMAIL PROTECTED] 
Sent: Monday, October 18, 2004 11:59 AM
To: [EMAIL PROTECTED]
Subject: n-gram indexing for generating spell suggestions

...

2. Often-used misspellings in Dutch words of 4 or 5 characters were missed.
E.g. 'fiets' was not suggested as a possible spell suggestion for 'feits',
since no matching 3-gram exists between the two. The same held true for
misspellings based on 'ch' and 'g', which sound the same in Dutch but are
written differently.

3. Words that could never be part of a suggestion were added based on a
single matching n-gram. (E.g. if I ask for suggestions on 'per', then
'tupperware' is also suggested, but based solely on length it is clear that
it has a minimal distance of 7.)
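
(A cheap guard for that last case: the length difference between two strings
is a lower bound on their edit distance, so a pre-filter sketch might be:

   // |len(a) - len(b)| <= levenshtein(a, b), so "tupperware" (10 letters)
   // can never be within a small cutoff of "per" (3 letters).
   if (Math.abs(query.length() - candidate.length()) > maxDistance) {
       continue; // skip without computing the full distance
   }

where maxDistance is whatever absolute cutoff is in use.)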

...




RE: Field Tokenization

2004-03-17 Thread Alexey Lef
You can do it using PerFieldAnalyzerWrapper.
See
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html
for details.
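
Roughly (a sketch against the Lucene 1.4 API; the lower-casing whitespace
analyzer is a hypothetical custom class, since Lucene doesn't ship one, and
the field names match your example below):

   // Whitespace tokenization plus lower-casing for the Author field:
   class LowerCaseWhitespaceAnalyzer extends Analyzer {
       public TokenStream tokenStream(String fieldName, Reader reader) {
           return new LowerCaseFilter(new WhitespaceTokenizer(reader));
       }
   }

   PerFieldAnalyzerWrapper analyzer =
       new PerFieldAnalyzerWrapper(new StandardAnalyzer());
   analyzer.addAnalyzer("Author", new LowerCaseWhitespaceAnalyzer());
   // Pass the same wrapper to both IndexWriter and QueryParser so that
   // indexing and searching agree on the per-field analysis.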

Alexey

-Original Message-
From: Brandon Lee [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 17, 2004 3:51 PM
To: Lucene Users List
Subject: Field Tokenization


Hi.  I would like to tokenize different fields in a document w/
different analyzers but it doesn't seem possible because analyzers are
associated w/ documents but not Fields.  Is there a reason for this?

For example, I'd like:

  Document : Field=Text   - porter w/ stop words analyzer
 Field=Author - whitespace lower-cased analyzer

If I add Field=Author as a Keyword field (non-tokenized), the documentation
states that it will be added as a single word (I want separate words, just
not run through the Porter stemmer).

I know that query would be more complicated but I'm willing to code
around that.

Thanks for any enlightenment.
