Re: ThaiAnalyzer for Lucene

2006-02-21 Thread Otis Gospodnetic
Hi Samphan, Please create an "issue" in JIRA, and attach your code to it. We can put the analyzers in the contrib section of Lucene. I hope DictionaryBasedBreakIterator is not a compile-time dependency, because we probably can't distribute ICU4J due to the license. Otis - Original Message

Re: Lucene 1.9 RC1 release available

2006-02-21 Thread Terry Steichen
1) Having a simple way to match singular and plural forms of a term with a single wildcard expression is quite useful. 2) The trailing '?' behavior has been present since that wildcard was first introduced. Why not provide a flag to allow the original behavior to optionally be preserved? 3) The

ThaiAnalyzer for Lucene

2006-02-21 Thread Samphan Raruenrom
Hi, I've written an alpha version of ThaiAnalyzer to enable Thai in Lucene full text search. Thai has no spaces between words (same for Lao and Khmer), so you need a dictionary-based word breaker to break words. I use ICU4J DictionaryBasedBreakIterator for this job. I want to contribute the code us

Re: Lucene 1.9 RC1 release available

2006-02-21 Thread Chris Hostetter
: In either case, what I'm arguing is that the current behavior makes more : sense in the real world of query expressions (that is, makes the most : common query expressions simpler), so why not continue it? I disagree with that statement. People familiar with shell globbing are going to be confus

Re: Lucene 1.9 RC1 release available

2006-02-21 Thread Terry Steichen
Hoss, Whether the previous behavior (which I believe has been present in Lucene from the outset) was a "bug" or a "feature" is kind of academic. My point is that this behavior has value that's not countered by any argument that any significant value is added by eliminating it. As to your ri

Re: Lucene 1.9 RC1 release available

2006-02-21 Thread Terry Steichen
Marvin, While a stemming analyzer can work well for general purpose queries, if you're seeking a decent level of precision/recall, stemming often severely limits you. Moreover, unless the user is very familiar with the behavior of the stemmer used, some of the returned results can be quite s

Re: Lucene 1.9 RC1 release available

2006-02-21 Thread Chris Hostetter
: of query). Under the previous versions of QueryParser, I could simply : specify 'riot???' and capture all of those variants. I don't have a strong opinion on this issue, but it seems clear to me that this was a bug in 1.4.3 not a change in the originally intended behavior. queryparsersyntax.h

Re: Lucene 1.9 RC1 release available

2006-02-21 Thread Marvin Humphrey
Terry, Is there a reason you wouldn't use a stemming analyzer of some kind, which would match cat and cats but not cater, catches, etc? http://snowball.tartarus.org/demo.php Marvin Humphrey Rectangular Research http://www.rectangular.com/ On Feb 21, 2006, at 3:13 PM, Terry Steichen wrote:

Re: Lucene 1.9 RC1 release available

2006-02-21 Thread Yonik Seeley
Terry, I think most of the examples you provide are normally handled via stemming. Using wildcarding for stemming will normally be less accurate. The current behavior is also consistent with the way file globbing works. -Yonik On 2/21/06, Terry Steichen <[EMAIL PROTECTED]> wrote: > Yonik, > >

XML based Query Parser

2006-02-21 Thread markharw00d
Further to our discussions some time ago I've had some time to put together an XML-based query parser with support for many "advanced" query types not supported in the current Query parser. More details and code here: http://www.inperspective.com/lucene/LXQuery2.htm Cheers Mark

Re: Lucene 1.9 RC1 release available

2006-02-21 Thread Terry Steichen
Yonik, No, I don't think that the riot* option would work for many queries. Let's take a simple case where you want a singular or plural form, like either cat or cats (which would be very common). With 1.4.x, you can use cat? to retrieve such matches. With the new change, you need to use (

Re: Lucene 1.9 RC1 release available

2006-02-21 Thread Yonik Seeley
On 2/21/06, Terry Steichen <[EMAIL PROTECTED]> wrote: > For example, let's say that I'm interested in docs with terms 'riot', > 'riots', 'rioting' and 'rioters' (which, I think, is a reasonable kind > of query). Under the previous versions of QueryParser, I could simply > specify 'riot???' and cap

Re: Lucene 1.9 RC1 release available

2006-02-21 Thread Terry Steichen
In reviewing the latest changes incorporated into release 1.9 RC1, I noticed a change responding to JIRA item LUCENE-306. According to the writeup, the new change forces the wildcard pattern 'cat??' to exactly match the length of the term (in this case, a five-letter term starting with 'cat').
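
The two wildcard behaviors debated in this thread can be compared with a small sketch. It is not Lucene code: `wildcard_to_regex` and `matches` are hypothetical helpers, and the pre-1.9 behavior is modeled as "each trailing '?' matches zero or one character", an assumption based only on the examples given in the thread ('cat?' matching cat and cats, 'riot???' matching riot, riots, rioting and rioters).

```python
import re


def wildcard_to_regex(pattern, trailing_optional=False):
    """Translate a wildcard pattern into a compiled regex.

    '?' stands for exactly one character and '*' for any run of characters.
    With trailing_optional=True, each '?' at the END of the pattern may also
    match nothing; this imitates the older behavior as described in the
    thread (an assumption, not taken from Lucene's source).
    """
    body = pattern
    trailing = 0
    if trailing_optional:
        body = pattern.rstrip("?")
        trailing = len(pattern) - len(body)

    parts = []
    for ch in body:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    # Each trailing '?' becomes an optional single character.
    parts.append(".?" * trailing)
    return re.compile("".join(parts))


def matches(pattern, term, trailing_optional=False):
    """Return True if the whole term matches the wildcard pattern."""
    return wildcard_to_regex(pattern, trailing_optional).fullmatch(term) is not None


if __name__ == "__main__":
    terms = ["riot", "riots", "rioting", "rioters"]
    print("strict :", [t for t in terms if matches("riot???", t)])
    print("lenient:", [t for t in terms if matches("riot???", t, True)])
```

Under the strict (glob-like) reading, 'riot???' only matches seven-letter terms starting with 'riot'; under the lenient reading it also matches the shorter variants, which is the convenience Terry is arguing for.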

[jira] Commented: (LUCENE-301) Index Writer constructor flags unclear - and annoying in certain cases

2006-02-21 Thread Dan Armbrust (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-301?page=comments#action_12367239 ] Dan Armbrust commented on LUCENE-301: - I'm perfectly happy with either new constructor approach - as long as there is a better constructor than what is currently available

Re: Lucene 1.9 RC1 release available

2006-02-21 Thread Doug Cutting
Doug Cutting wrote: Release 1.9 RC1 of Lucene is now available from: http://www.apache.org/dyn/closer.cgi/lucene/java/ I will send this announcement to user list tomorrow if no major issues are identified. If things still look good next week, I will promote this release to 1.9-final. Once

[jira] Commented: (LUCENE-301) Index Writer constructor flags unclear - and annoying in certain cases

2006-02-21 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-301?page=comments#action_12367227 ] Doug Cutting commented on LUCENE-301: - I agree with Hoss that we need to be very careful about compatibility. Why not add a new constructor, IndexWriter(Directory, Analyze

[jira] Resolved: (LUCENE-435) [PATCH] BufferedIndexOutput - optimized writeBytes() method

2006-02-21 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-435?page=all ] Doug Cutting resolved LUCENE-435: - Resolution: Fixed I just committed this. > [PATCH] BufferedIndexOutput - optimized writeBytes() method > ---

Re: 1.9 RC1

2006-02-21 Thread Doug Cutting
Maxim Patramanskij wrote: Doug, what about including optimization of BufferedIndexOutput.writeBytes() method: [ http://issues.apache.org/jira/browse/LUCENE-435?page=all ] made by Lukas Zapletal, into 1.9? I just committed this to trunk. If no issues arise with it there then perhaps we can

Re: 1.9 RC1

2006-02-21 Thread Doug Cutting
Chris Hostetter wrote: I think moving forward the query parser and file format docs should be moved into docfile directories within the java source, so they are revved/tagged with the individual releases. That way when people have questions about the file format of their index built with 1.9 they

Lucene 1.9 RC1 release available

2006-02-21 Thread Doug Cutting
Release 1.9 RC1 of Lucene is now available from: http://www.apache.org/dyn/closer.cgi/lucene/java/ This release candidate has many improvements since release 1.4.3, including new features, performance improvements, bug fixes, etc. For details, see: http://svn.apache.org/viewcvs.cgi/*checkou

Re: TermVector usage

2006-02-21 Thread Marvin Humphrey
On Feb 20, 2006, at 9:47 PM, Otis Gospodnetic wrote: As far as I can tell, most people use TermVectors for "more like this" queries (see MoreLikeThis class in contrib/ somewhere) On Feb 21, 2006, at 5:39 AM, Erik Hatcher wrote: I use term vectors for "more like this" queries, such as the lin

Re: TermVector usage

2006-02-21 Thread Erik Hatcher
I use term vectors for "more like this" queries, such as the links you'll see here: I am using the MoreLikeThis class. Erik On Feb 21, 2006, at 12:47 AM, Otis Gospodnetic wrot

Re: Implementing new scoring algorithms in lucene

2006-02-21 Thread Paul Elschot
On Tuesday 21 February 2006 05:34, Shailesh Kochhar wrote: ... > > I have a question about the sumOfSquaredWeights method. As I > understand it, it computes the square of the idf for a given term that > is used to normalize the weight of individual terms in the query. > > In implementing a differ
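
The normalization the question describes can be illustrated with a rough sketch. It assumes each term's raw query weight is idf times boost and that the normalization factor is one over the square root of the summed squares; the function names and the idf values are hypothetical and only stand in for the general idea, not for Lucene's actual classes.

```python
import math


def sum_of_squared_weights(idfs, boosts=None):
    """Sum (idf * boost)^2 over all query terms (boost defaults to 1.0)."""
    if boosts is None:
        boosts = [1.0] * len(idfs)
    return sum((idf * boost) ** 2 for idf, boost in zip(idfs, boosts))


def query_norm(ssw):
    """Normalization factor 1 / sqrt(sum of squared weights)."""
    return 1.0 / math.sqrt(ssw) if ssw > 0 else 1.0


def normalized_weights(idfs, boosts=None):
    """Scale each raw term weight so the weight vector has unit length."""
    if boosts is None:
        boosts = [1.0] * len(idfs)
    norm = query_norm(sum_of_squared_weights(idfs, boosts))
    return [idf * boost * norm for idf, boost in zip(idfs, boosts)]


if __name__ == "__main__":
    # Two hypothetical terms with made-up idf values.
    print(normalized_weights([3.0, 4.0]))
```

After this scaling the squared weights sum to one, so scores from queries with different numbers of terms become roughly comparable; a different scoring scheme would need its own equivalent of this step.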