Lucene Taglib

2004-03-08 Thread Iskandar Salim
Hi,

I've worked a bit on the taglib and added an index and a field tag for
basic indexing capability, though I don't think it's really useful apart
from, in my case, quick prototyping of web applications. What do you guys
think? I'm new to Lucene and taglibs, so I may have missed lots of
things.

For the curious, you can see the 'in progress' examples and docs at
http://www.javaxp.net/lucene-examples/ and http://www.javaxp.net/lucene-doc/
respectively,
or download the distribution
http://www.javaxp.net/lucene-taglib/lucene-taglib.zip or
http://www.javaxp.net/lucene-taglib/lucene-taglib.tar.gz

Erik, are there any requirements for the Java package names? E.g. should
they be named org.apache.lucene.taglib etc.?
BTW, I've included the ASL 2.0 license in the source files.

Regards,
Iskandar

- Original Message -
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Sunday, March 07, 2004 11:18 AM
Subject: Re: Lucene Search Taglib


 I, too, gave up on the sandbox taglib.  I apologize for even committing
 it without giving it more of a workout.  I gave a good effort to fix it
 up a couple of months ago, but there was more work to do than I was
 willing to put in.

 I have not heard from the original contributor, and I specifically
 asked on the list for assistance with getting it cleaned up.  I would
 gladly throw away what is in the sandbox for your code.

 If your code is designated as ASL 2.0 on all the files per the Apache
 licensing guidelines and you wish to donate it to the sandbox, just say
 the word.

 Erik


 On Mar 6, 2004, at 9:32 PM, Iskandar Salim wrote:

  Hi,
 
  I've written a taglib for querying lucene indices and have uploaded the
  taglib at http://blog.javaxp.net/files/lucene-taglib.zip for anyone
  wanting
  to check it out. It's a hefty 903kb as it includes the Lucene
  libraries and a sample index :P . There's a demo at
  http://www.javaxp.net/lucene-taglib/
 
  Anyway, I could not get the current lucene taglib from the cvs to work
  as
  expected and gave up trying to modify it and get it to work, so I
  wrote
  a new one, my very first taglib :P, with ideas and code
  borrowed/copied from
  the JSTL taglib.
 
  I've tested the taglib on Tomcat 4.1.18 and Tomcat 5.1.19 on JRE 1.4.2
 
  I'll be making a few enhancements/cleanups/docs over the next few days and would
  greatly appreciate any feedback/ideas on features that the taglib
  should
  have
  and whether the taglib is done right at all.
 
  Thanks & Regards,
  Iskandar Salim
 





RE: Sys properties Was: java.io.tmpdir as lock dir .... once again

2004-03-08 Thread hui




Hi,

Here are the indexing performance test results for the two index formats.


1000 MHz Intel Pentium III (2 installed)
32 KB primary memory cache
256 KB secondary memory cache

SCSI hard drive: 145.45 GB
RAM: 3 GB

Windows 2000 Advanced Server, Service Pack 2

JDK 1.4.0
JVM memory: 512 MB

Indexed files: 66,100 local text files, around 400 MB

Index time: 
compound format is 89 seconds slower.

compound format:
1389507 total milliseconds
non-compound format:
1300534 total milliseconds

The index size is 85 MB with only 4 fields. The files are stored in the index.
The compound format has only 3 files and the other has 13 files.

Search time (only the top 10 hits retrieved, no indexing running at the same
time, a single search thread, indices optimized and already opened):
I don't see much consistent difference for this simple case.

compound format:
Query: iraq - 4275 total hits, 110 ms
Query: war - 5728 total hits, 0 ms
Query: iraq AND war - 3182 total hits, 16 ms

non-compound format:
Query: war - 5728 total hits, 125 ms
Query: iraq war - 6821 total hits, 31 ms
Query: iraq AND war - 3182 total hits, 0 ms
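
For reference, the two runs differed only in the compound-file flag on the
writer. A minimal sketch of the setup (assuming the setUseCompoundFile()
setter on IndexWriter; the analyzer, field names and document below are
placeholders, not the actual test code):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class FormatBenchmark {

  // Build the same index twice, toggling only the compound-file flag.
  static long buildIndex(String indexPath, boolean useCompound) throws Exception {
    long start = System.currentTimeMillis();

    IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);
    writer.setUseCompoundFile(useCompound);  // the flag being benchmarked

    Document doc = new Document();           // placeholder document
    doc.add(Field.Keyword("path", "some/local/file.txt"));
    doc.add(Field.Text("contents", "sample text"));
    writer.addDocument(doc);

    writer.optimize();
    writer.close();
    return System.currentTimeMillis() - start;  // total milliseconds, as reported above
  }
}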



-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 04, 2004 11:54 AM
To: Lucene Users List
Subject: Re: Sys properties Was: java.io.tmpdir as lock dir  once again

hui wrote:
 Not yet. For the compound file format, when the files get bigger, if I add
 a few new files frequently, the bigger files have to be updated. Will that
 affect search a lot and produce heavier disk I/O compared with the
 traditional index format? It seems the OS cache makes quite a difference
 when the files are not changed.

The compound format slows indexing performance slightly, but should not 
affect search performance much.  It radically reduces the number of file 
handles used when searching, by a factor of eight or more, depending on 
how many indexed fields you have.

Perhaps the compound format should be the default format in 1.4.  Can 
folks provide any benchmarks for how it affects performance?

Doug




Re: Sys properties Was: java.io.tmpdir as lock dir .... once again

2004-03-08 Thread Andrzej Bialecki
hui wrote:



Hi,

Here is the indexing performance testing result for the two index formats.
A shameless plug: you can use Luke (http://www.getopt.org/luke) to 
convert the same index between the compound and non-compound formats, which 
could be useful to rule out any possible differences in the 
indexing/inserting process between the runs. Luke also provides a 
simple time measurement for query execution. Just FYI.

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)
-


RE: Sys properties Was: java.io.tmpdir as lock dir .... once again

2004-03-08 Thread hui
Thank you, the conversion option in Luke is really helpful for migrating
existing user indexes.
Regards,
Hui

-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 08, 2004 10:57 AM
To: Lucene Users List
Subject: Re: Sys properties Was: java.io.tmpdir as lock dir  once again

hui wrote:

 
 
 
 Hi,
 
 Here is the indexing performance testing result for the two index formats.

A shameless plug: you can use Luke (http://www.getopt.org/luke) to 
convert the same index between the compound and non-compound formats, which 
could be useful to rule out any possible differences in the 
indexing/inserting process between the runs. Luke also provides a 
simple time measurement for query execution. Just FYI.

-- 
Best regards,
Andrzej Bialecki

-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)





Re: Storing numbers

2004-03-08 Thread Doug Cutting
Erik Hatcher wrote:
  private static final DecimalFormat formatter =
      new DecimalFormat("0"); // make this as wide as you need
For ints, ten digits is probably safest.  Since Lucene uses prefix 
compression on the term dictionary, you don't pay a penalty at search 
time for long shared prefixes.
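
For example (a sketch; the field name is made up, and you can make the
pattern wider if you need more than ints):

import java.text.DecimalFormat;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class PaddedNumberField {

  // Ten digits is enough for any non-negative Java int (max 2147483647).
  private static final DecimalFormat FORMATTER = new DecimalFormat("0000000000");

  // Store the number as a fixed-width keyword so that lexicographic term
  // order matches numeric order (which range queries and sorting rely on).
  public static void addPrice(Document doc, int price) {
    doc.add(Field.Keyword("price", FORMATTER.format(price)));
  }
}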

Doug



Re: Sys properties Was: java.io.tmpdir as lock dir .... once again

2004-03-08 Thread Doug Cutting
hui wrote:
Index time: 
compound format is 89 seconds slower.

compound format:
1389507 total milliseconds
non-compound format:
1300534 total milliseconds
The index size is 85m with 4 fields only. The files are stored in the index.
The compound format has only 3 files and the other has 13 files. 
Thanks for performing this benchmark!

It looks like the compound format is around 7% slower when indexing.  To 
my thinking that's acceptable, given the dramatic reduction in file 
handles.  If folks really need maximal indexing performance, then they 
can explicitly disable the compound format.

Would anyone object to making compound format the default for Lucene 
1.4?  This is an incompatible change, but I don't think it should break 
applications.

Doug



Caching and paging search results

2004-03-08 Thread Clandes Tino
Hi all, 
could someone describe their experience
implementing caching, sorting and paging of search
results?
Is a stateful session bean appropriate for this?
My wish is to obtain all search hits only in the first
call and, after that, to iterate through the hit
collection and display cached results.
I have checked the SearchBean in the contributions section, but
it does not provide real caching and paging.
 
Regards and thanx in advance!
Milan









Re: Caching and paging search results

2004-03-08 Thread Erik Hatcher
In the RealWorld... many applications actually just re-run a search and 
jump to the appropriate page within the hits; searching is generally 
plenty fast enough to alleviate concerns about caching.

However, if you need to cache Hits, you need to be sure to keep around 
the originating IndexSearcher as well.

A stateful session bean could be used, but I'd opt for a much simpler 
solution as a first pass, such as the first point of just re-running a 
search from scratch.
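
In code, the re-run-and-slice approach is only a few lines (a rough sketch;
the index path, default field, title field and page arithmetic are just
examples, not anything from your application):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class PagedSearch {

  // Re-run the query on every request and just slice out the page we need.
  public static void showPage(String indexPath, String queryString,
                              int page, int pageSize) throws Exception {
    // In a real webapp you would keep a single IndexSearcher around instead
    // of opening and closing one per request.
    IndexSearcher searcher = new IndexSearcher(indexPath);
    Query query = QueryParser.parse(queryString, "contents", new StandardAnalyzer());
    Hits hits = searcher.search(query);

    int start = page * pageSize;
    int end = Math.min(start + pageSize, hits.length());
    for (int i = start; i < end; i++) {
      System.out.println(hits.score(i) + "\t" + hits.doc(i).get("title"));
    }
    searcher.close();
  }
}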

	Erik

On Mar 8, 2004, at 2:14 PM, Clandes Tino wrote:

Hi all,
could someone describe his experience in
implementation of caching, sorting and paging search
results.
Is Stateful Session bean appropriate for this?
My wish is to obtain all search hits only in first
call, and after that, to iterate through Hit
Collection and display cached results.
I have checked SearchBean in contribution section, but
it does not provide real caching and paging.
Regards and thanx in advance!
Milan









Re: Sys properties Was: java.io.tmpdir as lock dir .... once again

2004-03-08 Thread Terry Steichen
I tend to agree (but with the same uncertainty as to why I feel that way).

Regards,

Terry
- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, March 08, 2004 2:34 PM
Subject: Re: Sys properties Was: java.io.tmpdir as lock dir  once again


 I can't explain why, but I feel like the old index format should stay
 by default.  I feel like I'd rather have a (slightly) faster index, and
 switch to the compound one when/IF I encounter problems, than have a
 safer, but slower index, and never realize that there is a faster
 option available.
 
 Weak argument, I know, but some instinct in me thinks that the current
 mode should remain.
 
 Otis
 
 
 --- Doug Cutting [EMAIL PROTECTED] wrote:
  hui wrote:
   Index time: 
   compound format is 89 seconds slower.
   
   compound format:
   1389507 total milliseconds
   non-compound format:
   1300534 total milliseconds
   
   The index size is 85m with 4 fields only. The files are stored in
  the index.
   The compound format has only 3 files and the other has 13 files. 
  
  Thanks for performing this benchmark!
  
  It looks like the compound format is around 7% slower when indexing. 
  To 
  my thinking that's acceptable, given the dramatic reduction in file 
  handles.  If folks really need maximal indexing performance, then
  they 
  can explicitly disable the compound format.
  
  Would anyone object to making compound format the default for Lucene 
  1.4?  This is an incompatible change, but I don't think it should
  break 
  applications.
  
  Doug
  



Filtering out duplicate documents...

2004-03-08 Thread Michael Giles
I'm looking for a way to filter out duplicate documents from an index 
(either while indexing, or after the fact).  It seems like there should be 
an approach based on comparing the terms of two documents, but I'm wondering if 
any other folks (e.g. Nutch) have come up with a solution to this problem.

Obviously you can compute the Levenshtein distance on the text, but that is 
way too computationally intensive to scale.  So the goal is to find 
something that would be workable in a production system.  For example, a 
given NYT article and its printer-friendly version should be deemed to be 
the same.

-Mike





RE: Filtering out duplicate documents...

2004-03-08 Thread Chong, Herb
That kind of fuzzy equality is an area of open research. You need to define what an 
acceptable rate of Type I and Type II errors is before you can think about 
implementations that scale better. Approaches range from identifying document 
vocabulary and statistics to raw hashing of the input text.
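
At the trivial end of that spectrum, raw hashing is just a few lines of plain
JDK code, and it only catches documents that are identical after whatever
normalization you apply (a sketch, nothing Lucene-specific):

import java.security.MessageDigest;

public class ContentHash {

  // Hash of lower-cased, whitespace-collapsed text. Two documents are treated
  // as duplicates only if their extracted text is identical after this
  // normalization; anything fuzzier needs one of the statistical approaches.
  public static String digest(String text) throws Exception {
    String normalized = text.toLowerCase().replaceAll("\\s+", " ").trim();
    byte[] hash = MessageDigest.getInstance("MD5").digest(normalized.getBytes("UTF-8"));
    StringBuffer hex = new StringBuffer();
    for (int i = 0; i < hash.length; i++) {
      hex.append(Integer.toHexString((hash[i] & 0xff) | 0x100).substring(1));
    }
    return hex.toString();
  }
}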

Herb...

-Original Message-
From: Michael Giles [mailto:[EMAIL PROTECTED]
Sent: Monday, March 08, 2004 4:38 PM
To: Lucene Users List
Subject: Filtering out duplicate documents...


Obviously you can compute the Levenshtein distance on the text, but that is 
way too computationally intensive to scale.  So the goal is to find 
something that would be workable in a production system.  For example, a 
given NYT article, and its printer friendly version should be deemed to be 
the same.




Re: Filtering out duplicate documents...

2004-03-08 Thread Erik Hatcher
My impression is the new term vector support should at least make this 
type of comparison feasible in some manner.  I'd be interested to see 
what you come up with if you give this a try.  You will need the latest 
CVS codebase.
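
A rough sketch of the kind of thing I mean (this assumes the field was indexed
with term vectors enabled and uses the new getTermFreqVector() call; the field
name is made up):

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class TermVectorOverlap {

  // Crude similarity: the fraction of terms in doc A's vector that also
  // appear in doc B's vector.  Near-duplicates should score close to 1.0.
  public static float overlap(IndexReader reader, int docA, int docB)
      throws IOException {
    TermFreqVector va = reader.getTermFreqVector(docA, "contents");
    TermFreqVector vb = reader.getTermFreqVector(docB, "contents");
    if (va == null || vb == null) {
      return 0.0f;  // no term vector stored for one of the documents
    }

    Set termsB = new HashSet(Arrays.asList(vb.getTerms()));
    String[] termsA = va.getTerms();
    int shared = 0;
    for (int i = 0; i < termsA.length; i++) {
      if (termsB.contains(termsA[i])) {
        shared++;
      }
    }
    return termsA.length == 0 ? 0.0f : (float) shared / termsA.length;
  }
}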

	Erik

On Mar 8, 2004, at 4:37 PM, Michael Giles wrote:

I'm looking for a way to filter out duplicate documents from an index 
(either while indexing, or after the fact).  It seems like there 
should be an approach of comparing the terms for two documents, but 
I'm wondering if any other folks (i.e. nutch) have come up with a 
solution to this problem.

Obviously you can compute the Levenshtein distance on the text, but 
that is way too computationally intensive to scale.  So the goal is to 
find something that would be workable in a production system.  For 
example, a given NYT article, and its printer friendly version should 
be deemed to be the same.

-Mike





which query matched in a Boolean query

2004-03-08 Thread Supun Edirisinghe
I have a BooleanQuery made up of 3 TermQueries,

for example (title:colombo OR txt:colombo OR city:colombo).
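
In code it's roughly the following (a sketch using the two-boolean add() in
the current API; the field names are from the example above):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ColomboQuery {

  // Three clauses OR'd together: required=false, prohibited=false.
  public static Query build() {
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("title", "colombo")), false, false);
    query.add(new TermQuery(new Term("txt", "colombo")), false, false);
    query.add(new TermQuery(new Term("city", "colombo")), false, false);
    return query;
  }
}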

I would like to mark hits that match in the title field in red on
display, txt matches in blue, and city matches in green, and maybe those that
match in 2 fields in another color.

Is this possible?

thanks





Re: Caching and paging search results

2004-03-08 Thread Tatu Saloranta
On Monday 08 March 2004 12:34, Erik Hatcher wrote:
 In the RealWorld... many applications actually just re-run a search and
 jump to the appropriate page within the hits; searching is generally
 plenty fast enough to alleviate concerns about caching.

 However, if you need to cache Hits, you need to be sure to keep around
 the originating IndexSearcher as well.

Further, oftentimes the search index only contains a key to the actual content 
indexed (which itself is stored as a file, in a database, or so)... so it's 
enough to cache just the set of such ids, not the actual search result objects.
And assuming the ids are simple (an int id, a short String), such information 
can be stored in, say, the user session.
In the system I'm working on, we store up to 500 hits, keeping only the 
document id (int) and hit quality (byte) in the session.
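
Concretely, something along these lines per search (just a sketch; the 500-hit
cap and the byte quantization mirror what we do, the class and field names are
made up):

import java.io.IOException;

import org.apache.lucene.search.Hits;

public class CompactHitCache {

  public final int[] docIds;
  public final byte[] qualities;

  // Keep only a compact (document id, quality) pair per hit, capped at
  // maxHits, instead of holding Hits/Document objects in the session.
  public CompactHitCache(Hits hits, int maxHits) throws IOException {
    int n = Math.min(hits.length(), maxHits);
    docIds = new int[n];
    qualities = new byte[n];
    for (int i = 0; i < n; i++) {
      docIds[i] = hits.id(i);                      // Lucene document number
      qualities[i] = (byte) (hits.score(i) * 100); // score squeezed into a byte
    }
  }
}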

-+ Tatu +-


 A stateful session bean could be used, but I'd opt for a much simpler
 solution as a first pass, such as the first point of just re-running a
 search from scratch.

   Erik

 On Mar 8, 2004, at 2:14 PM, Clandes Tino wrote:
  Hi all,
  could someone describe his experience in
  implementation of caching, sorting and paging search
  results.
  Is Stateful Session bean appropriate for this?
  My wish is to obtain all search hits only in first
  call, and after that, to iterate through Hit
  Collection and display cached results.
  I have checked SearchBean in contribution section, but
  it does not provide real caching and paging.
 
  Regards and thanx in advance!
  Milan
 
 
 
 
 
 



DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-08 Thread Kevin A. Burton
I'm looking at StopFilter.java right now...

I did a kill -3 java and a number of my threads were blocked here:

ksa-task-thread-34 prio=1 tid=0xad89fbe8 nid=0x1c6e waiting for monitor entry [b9bff000..b9bff8d0]
   at java.util.Hashtable.get(Hashtable.java:332)
   - waiting to lock 0x61569720 (a java.util.Hashtable)
   at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:94)
   at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:170)
   at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:111)
   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
   at ksa.index.AdvancedIndexWriter.addDocument(AdvancedIndexWriter.java:136)
   at ksa.robot.FeedTaskParserListener.onItemEnd(FeedTaskParserListener.java:331)

Is there ANY reason to keep this as a Hashtable?  It's just preventing 
inversion across multiple threads.  They all have to lock on this hashtable.

Note that this guy is initialized ONCE and no more puts take place, so I 
don't see why not.  It's read-only after the StopFilter is created.

I think this might really end up speeding up indexing a bit.  No hard 
benchmarks yet, though.  Either way, right now it's just an inefficiency that 
should be removed.
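
The core of the change is tiny. Roughly this, as a sketch of the HashSet
variant mentioned in the comment in the attached file (the attachment itself
still uses a HashMap; the class name here is just for illustration):

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public final class UnsynchronizedStopFilter extends TokenFilter {

  // Built once in the constructor and never modified afterwards,
  // so lookups need no synchronization at all.
  private final Set stopSet;

  public UnsynchronizedStopFilter(TokenStream in, String[] stopWords) {
    super(in);
    stopSet = new HashSet(Arrays.asList(stopWords));
  }

  // Skip tokens whose text is in the stop set; plain unsynchronized contains().
  public final Token next() throws IOException {
    for (Token t = input.next(); t != null; t = input.next()) {
      if (!stopSet.contains(t.termText())) {
        return t;
      }
    }
    return null;
  }
}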

I've attached a quick implementation. 

Kevin

--

Please reply using PGP:

   http://peerfear.org/pubkey.asc

   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

package org.apache.lucene.analysis;

/* 
 * The Apache Software License, Version 1.1
 *
 * Copyright (c) 2001 The Apache Software Foundation.  All rights
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *notice, this list of conditions and the following disclaimer in
 *the documentation and/or other materials provided with the
 *distribution.
 *
 * 3. The end-user documentation included with the redistribution,
 *if any, must include the following acknowledgment:
 *   This product includes software developed by the
 *Apache Software Foundation (http://www.apache.org/).
 *Alternately, this acknowledgment may appear in the software itself,
 *if and wherever such third-party acknowledgments normally appear.
 *
 * 4. The names Apache and Apache Software Foundation and
 *Apache Lucene must not be used to endorse or promote products
 *derived from this software without prior written permission. For
 *written permission, please contact [EMAIL PROTECTED]
 *
 * 5. Products derived from this software may not be called Apache,
 *Apache Lucene, nor may Apache appear in their name, without
 *prior written permission of the Apache Software Foundation.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 * 
 *
 * This software consists of voluntary contributions made by many
 * individuals on behalf of the Apache Software Foundation.  For more
 * information on the Apache Software Foundation, please see
 * http://www.apache.org/.
 */

import java.io.IOException;
import java.util.*;

/** Removes stop words from a token stream. */

public final class StopFilter extends TokenFilter {

  //Note: this could migrate to using a HashSet
  private HashMap table;

  /** Constructs a filter which removes words from the input
TokenStream that are named in the array of words. */
  public StopFilter(TokenStream in, String[] stopWords) {
super(in);
table = makeStopTable(stopWords);
  }

  /** Constructs a filter which removes words from the input

Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-08 Thread Erik Hatcher
I don't see any reason for this to be a Hashtable.

It seems an acceptable alternative to not share analyzer/filter  
instances across threads - they don't really take up much space, so is  
there a reason to share them?  Or I'm guessing you're sharing it  
implicitly through an IndexWriter, huh?

I'll await further feedback before committing this change, but it seems  
reasonable to me.

	Erik

On Mar 8, 2004, at 8:50 PM, Kevin A. Burton wrote:
I'm looking at StopFilter.java right now...

I did a kill -3 java and a number of my threads were blocked here:

ksa-task-thread-34 prio=1 tid=0xad89fbe8 nid=0x1c6e waiting for monitor entry [b9bff000..b9bff8d0]
   at java.util.Hashtable.get(Hashtable.java:332)
   - waiting to lock 0x61569720 (a java.util.Hashtable)
   at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:94)
   at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:170)
   at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:111)
   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
   at ksa.index.AdvancedIndexWriter.addDocument(AdvancedIndexWriter.java:136)
   at ksa.robot.FeedTaskParserListener.onItemEnd(FeedTaskParserListener.java:331)

Is there ANY reason to keep this as a Hashtable?  It's just preventing  
inversion across multiple threads.  They all have to lock on this  
hashtable.

Note that this guy is initialized ONCE and no more puts take place so  
I don't see why not.  It's readonly after the StopFilter is created.

I think this might really end up speeding up indexing a bit.  No hard  
benchmarks yet though.  Right now though it's just an inefficiency  
that should be removed.

I've attached a quick implementation.
Kevin
--

Please reply using PGP:

   http://peerfear.org/pubkey.asc
   NewsMonster - http://www.newsmonster.org/
   Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

Re: Lucene Taglib

2004-03-08 Thread Iskandar Salim
Thanks for the tips and comments.

Regards,
Iskandar

- Original Message - 
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, March 08, 2004 7:48 PM
Subject: Re: Lucene Taglib


 On Mar 8, 2004, at 3:46 AM, Iskandar Salim wrote:
  I've worked a bit on the taglib and added an index and field 
  tag for
  basic indexing capability, though I don't think it's really useful, 
  apart
  from, in my case quick prototyping of web applications. What do you 
  guys
  think? I'm new to Lucene and taglibs so I may have missed out lots of
  things.
 
 I don't think a taglib is a useful place to put indexing code.  Your 
 mileage may vary, but there are so many flags to control (field type, 
 analyzer, boost, etc) that it is more cleanly done directly with the 
 Lucene API.
 
  For the curious, you see the 'in progress' examples and docs at
  http://www.javaxp.net/lucene-examples/ and 
  http://www.javaxp.net/lucene-doc/
 
 
 Nice work fleshing out documentation!
 
  Erik, is there any requirements for the java package names? e.g. ... 
  to be
  named as org.apache.lucene.taglib etc.
 
 Yes, that package name is the best one probably.
 
  BTW, I've included the ASL 2.0 license in the source files.
 
 Thanks!
 
 A few comments/suggestions:
 
 - What if I wanted an index to live in a RAMDirectory and have it live 
 in application scope?  My suggestion here is instead of using a path 
 for the index, use a Directory.  This allows greater freedom for the 
 developer, and it should be pretty easy to craft a JSTL expression to 
 wrap a string path into an FSDirectory (I don't know JSTL, but if it 
 cannot do this then I'm disappointed - I'm in the Tapestry/OGNL world 
 myself, where it would be trivial).
 
 - Or, perhaps you may want a long-lived IndexSearcher so that a 
 Directory is only needed to construct the IndexSearcher?
 
 - I haven't looked at your code, but is 'keywords' passed directly to 
 QueryParser?  If so, perhaps that should be renamed 'query' instead 
 since keywords is more domain-specific and has sort of a special 
 meaning in Lucene as a Field.Keyword
 
 - What about allowing specification of an Analyzer?  Look at how this 
 is done in the sandbox contributions/ant area in IndexTask.  I allowed 
 the user to specify high level strings like 'whitespace', 'stop', 
 'standard', etc. as well as a fully-qualified classname.  I can only 
 assume you have it hardcoded to use a particular analyzer, which is not 
 going to be generally useful.
 
 - It would also be nice if you allowed for an optional filter to be 
 specified - in this case I think it would probably suffice to just 
 allow a Filter instance to be passed in rather than the taglib itself 
 constructing one.  This allows capabilities like search-within-search 
 and more.
 
 - What is the 'content' attribute for the search tag?  Is that the 
  default field?  If so, again, I think it would be best to name it 
  similarly to the Lucene terminology - just call it 'field', or 
 'defaultField'.
 
 - SortedMap?  What are you sorting on?  Is count necessary since you 
 can just ask the map what its size is?
 
 In general it looks fine though, although I cringe seeing the amount of 
 code your examples have in it with all the scriptlet junk.  It seems 
 quite yucky to me given that I'm now in the elegant Tapestry world 
 where I could hide the *entire* tag in an HTML template with something 
 like this:
 
  <table jwcid="results"/>
 
 and no, I'm not kidding, and yes, there would be more behind the scenes 
 but separate from the view.  And the example includes all the paging 
 controls.
 
 Erik
 
 



Re: Lucene Taglib

2004-03-08 Thread Erik Hatcher
On Mar 8, 2004, at 10:21 PM, Iskandar Salim wrote:
Thanks for the tips and comments.
Also, there was a big smiley implicit in my JSP taglib rantings below.  
Certainly no offense intended.  I've paid my Struts/taglib dues and am 
now deep into a completely different web development paradigm that I 
find quite enjoyable and refreshing.

Your taglib is nicely done.

	Erik



Regards,
Iskandar
- Original Message -
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, March 08, 2004 7:48 PM
Subject: Re: Lucene Taglib

On Mar 8, 2004, at 3:46 AM, Iskandar Salim wrote:
I've worked a bit on the taglib and added an index and field
tag for
basic indexing capability, though I don't think it's really useful,
apart
from, in my case quick prototyping of web applications. What do you
guys
think? I'm new to Lucene and taglibs so I may have missed out lots of
things.
I don't think a taglib is a useful place to put indexing code.  Your
mileage may vary, but there are so many flags to control (field type,
analyzer, boost, etc) that it is more cleanly done directly with the
Lucene API.
For the curious, you see the 'in progress' examples and docs at
http://www.javaxp.net/lucene-examples/ and
http://www.javaxp.net/lucene-doc/
Nice work fleshing out documentation!

Erik, is there any requirements for the java package names? e.g. ...
to be
named as org.apache.lucene.taglib etc.
Yes, that package name is the best one probably.

BTW, I've included the ASL 2.0 license in the source files.
Thanks!

A few comments/suggestions:

- What if I wanted an index to live in a RAMDirectory and have it live
in application scope?  My suggestion here is instead of using a path
for the index, use a Directory.  This allows greater freedom for the
developer, and it should be pretty easy to craft a JSTL expression to
wrap a string path into an FSDirectory (I don't know JSTL, but if it
cannot do this then I'm disappointed - I'm in the Tapestry/OGNL world
myself, where it would be trivial).
- Or, perhaps you may want a long-lived IndexSearcher so that a
Directory is only needed to construct the IndexSearcher?
- I haven't looked at your code, but is 'keywords' passed directly to
QueryParser?  If so, perhaps that should be renamed 'query' instead
since keywords is more domain-specific and has sort of a special
meaning in Lucene as a Field.Keyword
- What about allowing specification of an Analyzer?  Look at how this
is done in the sandbox contributions/ant area in IndexTask.  I allowed
the user to specify high level strings like 'whitespace', 'stop',
'standard', etc. as well as a fully-qualified classname.  I can only
assume you have it hardcoded to use a particular analyzer, which is 
not
going to be generally useful.

- It would also be nice if you allowed for an optional filter to be
specified - in this case I think it would probably suffice to just
allow a Filter instance to be passed in rather than the taglib itself
constructing one.  This allows capabilities like search-within-search
and more.
- What is the 'content' attribute for the search tag?  Is that the
default field?  If so, again, I think it would be best to name it
similarly to the Lucene terminology - just call it 'field', or
'defaultField'.
- SortedMap?  What are you sorting on?  Is count necessary since you
can just ask the map what its size is?
In general it looks fine though, although I cringe seeing the amount 
of
code your examples have in it with all the scriptlet junk.  It seems
quite yucky to me given that I'm now in the elegant Tapestry world
where I could hide the *entire* tag in an HTML template with something
like this:

<table jwcid="results"/>

and no, I'm not kidding, and yes, there would be more behind the 
scenes
but separate from the view.  And the example includes all the paging
controls.

Erik



Re: Lucene Taglib

2004-03-08 Thread Iskandar Salim
- Original Message - 
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, March 09, 2004 11:51 AM
Subject: Re: Lucene Taglib


 Also, there was a big smiley implicit in my JSP taglib rantings below.  
 Certainly no offense intended.

None taken. :)

  I've paid my Struts/taglib dues and am 
 now deep into a completely different web development paradigm that I 
 find quite enjoyable and refreshing.

I've heard too many good things about Tapestry. I'll have to learn it some day ;)

Regards,
Iskandar

