Benchmarking on GOV2

2006-05-29 Thread Sebastiano Vigna
Dear Lucene developers, I'd be interested in doing some benchmarking on (at least) Lucene, Egothor and MG4J. There is no actual data around on publicly available collections, and it would be nice to have some more objective data on efficiency for a significantly large collection. We have GOV2 (25M

Re: Benchmarking on GOV2

2006-05-29 Thread Dave Kor
Hi, On 5/29/06, Sebastiano Vigna <[EMAIL PROTECTED]> wrote: Dear Lucene developers, I'd be interested in doing some benchmarking on (at least) Lucene, Egothor and MG4J. There is no actual data around on publicly available collections, and it would be nice to have some more objective data on effi

Re: Benchmarking on GOV2

2006-05-29 Thread Murat . Yakici
Hi, We have been doing such a benchmark over all TREC collections and TREC queries. Our participation in TREC in recent years gives us the opportunity to work on the collections. Lucene is one of the systems that we look at. The measurements are based on two functionalities: indexing and querying. W

[jira] Commented: (LUCENE-503) Contrib: ThaiAnalyzer to enable Thai full-text search in Lucene

2006-05-29 Thread Arthit Suriyawongkul (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-503?page=comments#action_12413695 ] Arthit Suriyawongkul commented on LUCENE-503: - related projects/implementations: SansarnLook based on Lucene, with additional ThaiAnalyzer ref: http://sansarn.c

Re: How To find which field has the search term in Hit?

2006-05-29 Thread N
Thanks for the reply, but I couldn't get your point. Could you elaborate further? For instance, we have FirstName (= Martin), LastName (= Spaniol), Company (= Mark Co.) and we search for "Mar*", which will be found in FirstName and Company. So how can I retrieve this info that it is foun

Re: Benchmarking on GOV2

2006-05-29 Thread eks dev
That would be great to see! There are a million enhancements and ideas that could come up as a result of this comparison. For example, I would not be surprised to see mg4j "perfect skipping" become an interesting optimization for Lucene, Trie based Lexicon could make some regex queries signi

Re: Benchmarking on GOV2

2006-05-29 Thread Sebastiano Vigna
On Mon, 2006-05-29 at 17:33 +0800, Dave Kor wrote: > I was wondering if you have seen the TREC 2004 paper by Giuseppe > Attardi, Andrea Esuli and Chirag Pate from the University of Pisa, > Italy, titled "Using Clustering and Blade Clusters in the TeraByte > task"? http://trec.nist.gov/pubs/trec13/

Re: New lucene contrib project - GData Server

2006-05-29 Thread Yonik Seeley
On 5/29/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: Regarding the part where you describe how indexing/fields could be configured via XML descriptors, you may want to have a look at lucene/java/trunk/contrib/xml-query-parser . And Solr's schema.xml :-) http://svn.apache.org/viewvc/incuba

Re: Changing Lucene scoring?

2006-05-29 Thread Grant Ingersoll
Otis wrote: Short answer: no. Damn are those scoring classes hard to follow... I have looked at these classes several times, including stepping through them and I still find them confusing. Perhaps someone (Doug? pretty please?) could illuminate us? I think the part I always found tr

Re: New lucene contrib project - GData Server

2006-05-29 Thread Simon Willnauer
This is great guys, I will definitely have a look at it. :) Thanks!! I'm thinking of configuring the indexer via xml as well. As the atom format allows foreign namespaces the indexing component has to be very flexible. I might configure all the elements in the atom namespace globally and will off

Re: Lucene and Java 1.5

2006-05-29 Thread Bill Janssen
Boy, I'd sure like to see at least one bug-fix release for 2.0 maintain java 1.4 compatibility. Would that be 2.1? Bill > This sounds reasonable to me. I feel bad about Andi and PyLucene, but it > sounds like GCJ(X) will soon be up-to-date (the link Andi sent was from early > February). Disc

Re: Lucene and Java 1.5

2006-05-29 Thread Simon Willnauer
I guess this discussion isn't over... I would like to know if anybody would feel uncomfortable with a 1.5 dependent contrib project like the GData Server? I'm not sure whether it is worth thinking about a 2.0 / 2.1 (tiger) branch. That would be a lot more work but far less fight ;) simon On 5/2

Re: Lucene and Java 1.5

2006-05-29 Thread Otis Gospodnetic
Could be 2.0.*. I think that is what Hoss was saying, too. Otis - Original Message From: Bill Janssen <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org; Otis Gospodnetic <[EMAIL PROTECTED]> Sent: Monday, May 29, 2006 11:17:43 AM Subject: Re: Lucene and Java 1.5 Boy, I'd sure like to

Re: Benchmarking on GOV2

2006-05-29 Thread Andrzej Bialecki
Dave Kor wrote: Hi, On 5/29/06, Sebastiano Vigna <[EMAIL PROTECTED]> wrote: Dear Lucene developers, I'd be interested in doing some benchmarking on (at least) Lucene, Egothor and MG4J. There is no actual data around on publicly available collections, and it would be nice to have some more objec

Re: Benchmarking on GOV2

2006-05-29 Thread Otis Gospodnetic
Hi, - Original Message From: Andrzej Bialecki <[EMAIL PROTECTED]> Dave Kor wrote: > Hi, > > On 5/29/06, Sebastiano Vigna <[EMAIL PROTECTED]> wrote: >> Dear Lucene developers, >> I'd be interested in doing some benchmarking on (at least) Lucene, >> Egothor and MG4J. There is no actual dat

Re: How To find which field has the search term in Hit?

2006-05-29 Thread jian chen
Hi, Noon, Sorry I did not initially understand the detailed problem you have. This sounds like a prefix match problem. You can create an index for each field and then do a prefix match for these fields. By the way, I think your question could be better served by posting to the lucene user group. Ch
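
The per-field prefix match suggested above can be sketched outside Lucene as follows. This is only a minimal illustration, not Lucene API code: the function name `fields_matching_prefix` is hypothetical, the record reuses the FirstName/LastName/Company example from the earlier message in this thread, and whitespace tokenization with case-insensitive comparison is an assumption standing in for whatever analyzer the index actually uses.

```python
def fields_matching_prefix(record, prefix):
    """Return the names of the fields containing a token that starts with prefix.

    Hypothetical helper for illustration only; it does not use Lucene.
    """
    prefix = prefix.lower()
    matched = []
    for field, value in record.items():
        # Assumption: tokens are whitespace-separated and compared case-insensitively.
        if any(token.lower().startswith(prefix) for token in value.split()):
            matched.append(field)
    return matched


record = {"FirstName": "Martin", "LastName": "Spaniol", "Company": "Mark Co."}
print(fields_matching_prefix(record, "Mar"))
```

Checking each field separately is what makes it possible to report where the term was found; a single query over a merged "all fields" index would lose that information.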

Re: Benchmarking on GOV2

2006-05-29 Thread Andrzej Bialecki
Otis Gospodnetic wrote: OG: But Andrzej, you already wrote that indexing benchmark tool (which we never put anywhere in SVN, I'm afraid) that works on some freely available Reuters corpus, I believe. Why couldn't that be adapted for testing Lucene, Egothor, and MG4J? Hmm, yes, indeed I h

Re: Benchmarking on GOV2

2006-05-29 Thread Marvin Humphrey
On May 29, 2006, at 10:34 AM, Andrzej Bialecki wrote: It could use the Reuters corpus Has anyone used existing categorization data associated with the Reuters corpus to build a benchmarker that measured IR precision and/ or recall? Marvin Humphrey Rectangular Research http://www.rectang

Re: Lucene and Java 1.5

2006-05-29 Thread Erik Hatcher
To weigh in with my take, all environments I develop and deploy to are at JDK/JRE 1.5. Solr is exclusively for 1.5+ and it has top billing in my architecture. GData server at 1.5 is perfectly fine by me. I'd use it, and I'm very interested in Solr collaboration as well. Erik On May 29

Re: Benchmarking on GOV2

2006-05-29 Thread Andrzej Bialecki
Marvin Humphrey wrote: On May 29, 2006, at 10:34 AM, Andrzej Bialecki wrote: It could use the Reuters corpus Has anyone used existing categorization data associated with the Reuters corpus to build a benchmarker that measured IR precision and/or recall? That would be RCV1 or RCV2, right

Re: Benchmarking on GOV2

2006-05-29 Thread Marvin Humphrey
On May 29, 2006, at 10:58 AM, Andrzej Bialecki wrote: Has anyone used existing categorization data associated with the Reuters corpus to build a benchmarker that measured IR precision and/or recall? That would be RCV1 or RCV2, right? AFAIK the Reuters-21578 has no such information ... Th

Re: Lucene and Java 1.5

2006-05-29 Thread Simon Willnauer
So I guess it would be totally alright to build the gdata server 1.5 dependent. Does anyone feel comfortable with that? simon On 5/29/06, Erik Hatcher <[EMAIL PROTECTED]> wrote: To weigh in with my take, all environments I develop and deploy to are at JDK/JRE 1.5. Solr is exclusively for 1.5

Gdata Server - Feed / Entry representation

2006-05-29 Thread Simon Willnauer
Hello everyone, today I reconsidered the internal representation of the feed / entries. I had a closer look at the Google Data Client API which is supposed to be the other end to the server. This API is distributed under the Apache License, i.e. open source. It already provides the Object representatio

Re: Lucene and Java 1.5

2006-05-29 Thread Chris Hostetter
: Boy, I'd sure like to see at least one bug-fix release for 2.0 : maintain java 1.4 compatibility. Would that be 2.1? : Could be 2.0.*. I think that is what Hoss was saying, too. Yes, that was my point ... as far as I can tell, Lucene bug fix releases have historically been at the "third leve

Re: Benchmarking on GOV2

2006-05-29 Thread Chuck Williams
Sebastiano Vigna wrote on 05/28/2006 10:39 PM: > but we will certainly need > some help to configure Lucene so that it works at its best. > > We would like to measure indexing time and query answer time > I'm not sure what form you would like that help to take, but here are a couple high-level

[jira] Updated: (LUCENE-503) Contrib: ThaiAnalyzer to enable Thai full-text search in Lucene

2006-05-29 Thread Samphan Raruenrom (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-503?page=all ] Samphan Raruenrom updated LUCENE-503: - Attachment: TestThaiAnalyzer.java Add TestThaiAnalyzer junit test, modified from TestFrenchAnalyzer. The Thai words are picked so that changing the d

[jira] Commented: (LUCENE-503) Contrib: ThaiAnalyzer to enable Thai full-text search in Lucene

2006-05-29 Thread Samphan Raruenrom (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-503?page=comments#action_12413756 ] Samphan Raruenrom commented on LUCENE-503: -- All the code has been tested with Lucene 2.0.0. Thanks Art for the info/URL. I've never known about Pichai's work before I

Re: Lucene and Java 1.5

2006-05-29 Thread Bill Janssen
My concern is really with the use of GCJ with Lucene. I'd hate to see Lucene core releases that couldn't be used with the latest "stable" release of GCJ. Unfortunately, it's very hard to know what that means. What's the latest version of GCJ? What Java language features are supported in it? It

Re: Benchmarking on GOV2

2006-05-29 Thread Sebastiano Vigna
On Mon, 2006-05-29 at 14:35 -1000, Chuck Williams wrote: > I'm not sure what form you would like that help to take, but here are a > couple high-level points imho: Help in configuring Lucene so that it uses all resources available, and so that the results returned are identical to all other engin