Re: ValueListHandler pattern with Lucene

2004-04-11 Thread lucene
On Saturday 10 April 2004 20:40, Erik Hatcher wrote:
> That's the beauty: it is up to you to load the doc iff you want it.

As I want all of them I don't see why this should be faster at all...

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: ValueListHandler pattern with Lucene

2004-04-11 Thread Erik Hatcher
On Apr 11, 2004, at 5:25 AM, [EMAIL PROTECTED] wrote:
> On Saturday 10 April 2004 20:40, Erik Hatcher wrote:
>> That's the beauty: it is up to you to load the doc iff you want it.
>
> As I want all of them I don't see why this should be faster at all...

Then have a look at the Hits class.  It does more work for caching
and keeps a most-recently-used collection of documents around.  By
using a HitCollector you are bypassing those mechanisms.  Whether it is
measurably faster would depend on several other factors.
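
A minimal sketch of the kind of caching described above, assuming a plain JDK
model rather than Lucene itself: documents are loaded lazily on first access
and a bounded most-recently-used set is kept around. LazyDocCache, its loader
function and loadCount() are hypothetical names for illustration; this is not
how the Hits class is actually implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical lazy document cache; an illustration only, not part of Lucene. */
final class LazyDocCache<D> {
    private final IntFunction<D> loader;   // stands in for an expensive document fetch
    private final Map<Integer, D> cache;
    private int loads = 0;                 // number of times the loader was actually called

    LazyDocCache(IntFunction<D> loader, final int capacity) {
        this.loader = loader;
        // accessOrder = true: the eldest entry is the least recently used one
        this.cache = new LinkedHashMap<Integer, D>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, D> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns the document, loading it only if it is not cached. */
    D get(int docId) {
        D doc = cache.get(docId);
        if (doc == null) {
            doc = loader.apply(docId);
            loads++;
            cache.put(docId, doc);
        }
        return doc;
    }

    int loadCount() {
        return loads;
    }
}
```

If every document is read exactly once, as in the case discussed here, such a
cache only adds bookkeeping, which is consistent with the remark that the gain
from bypassing it depends on other factors.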

	Erik



Re: ValueListHandler pattern with Lucene

2004-04-11 Thread lucene
On Sunday 11 April 2004 13:40, Erik Hatcher wrote:
> using a HitCollector you are bypassing those mechanisms.  Whether it is
> measurably faster would depend on several other factors.

Well, it is hardly faster, so this is no real solution :-\




Re: ValueListHandler pattern with Lucene

2004-04-11 Thread lucene
On Saturday 10 April 2004 20:40, Erik Hatcher wrote:
> That's the beauty: it is up to you to load the doc iff you want it.

Well, there's another problem with HitCollector: the list I build is not 
sorted by score :-(
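
One way around the ordering problem, sketched under assumptions: keep the score
together with each document id as it is collected, and sort once collection is
finished. The collect(int doc, float score) shape mirrors Lucene's
HitCollector callback, but SortingCollector and ScoredDoc are hypothetical
stand-ins written against the plain JDK so the example is self-contained.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** A (document id, score) pair as delivered to a collect(doc, score) style callback. */
final class ScoredDoc {
    final int doc;
    final float score;

    ScoredDoc(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

/** Hypothetical collector: gathers hits in arrival order and sorts them on demand. */
final class SortingCollector {
    private final List<ScoredDoc> hits = new ArrayList<>();

    /** Called once per matching document, in no particular score order. */
    void collect(int doc, float score) {
        hits.add(new ScoredDoc(doc, score));
    }

    /** Returns a copy sorted by descending score; ties are broken by ascending doc id. */
    List<ScoredDoc> sortedByScore() {
        List<ScoredDoc> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((ScoredDoc h) -> h.score)
                .reversed()
                .thenComparingInt(h -> h.doc));
        return sorted;
    }
}
```

Sorting after the fact costs O(n log n) over all hits, so this only restores
the ordering; it does not make the collection itself any faster.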




Re: ValueListHandler pattern with Lucene

2004-04-11 Thread Erik Hatcher
On Apr 11, 2004, at 9:32 AM, [EMAIL PROTECTED] wrote:
> On Saturday 10 April 2004 20:40, Erik Hatcher wrote:
>> That's the beauty: it is up to you to load the doc iff you want it.
>
> Well, there's another problem with HitCollector: the list I build is
> not sorted by score :-(

HitCollector was just an option - and apparently not the right one for
your use.

	Erik



Re: ValueListHandler pattern with Lucene

2004-04-11 Thread Erik Hatcher
On Apr 11, 2004, at 10:00 AM, [EMAIL PROTECTED] wrote:
> On Sunday 11 April 2004 15:56, Erik Hatcher wrote:
>> HitCollector was just an option - and apparently not the right one for
>> your use.
>
> So, any other option? :-)

Well, yes: the one we already discussed.  Let your presentation tier
talk directly to Hits, so you are as efficient as possible with access
to documents, and only fetch what you need.
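
"Only fetch what you need" can be sketched as fetching just the window of
results that one page displays. This is an illustration under assumptions:
Pager and its fetch function are hypothetical, and fetch.apply(i) merely
stands in for whatever call loads the i-th hit from the result set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

/** Hypothetical paging helper: loads only the hits that fall on the requested page. */
final class Pager {
    /**
     * @param totalHits  total number of hits in the result set
     * @param pageNumber zero-based page index
     * @param pageSize   number of hits shown per page
     * @param fetch      loads the hit at a given rank; called only for ranks on this page
     */
    static <T> List<T> page(int totalHits, int pageNumber, int pageSize, IntFunction<T> fetch) {
        if (totalHits < 0 || pageNumber < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("invalid paging arguments");
        }
        long first = (long) pageNumber * pageSize;            // long avoids int overflow
        int start = (int) Math.min(first, (long) totalHits);
        int end = (int) Math.min(first + pageSize, (long) totalHits);
        List<T> page = new ArrayList<>(end - start);
        for (int i = start; i < end; i++) {
            page.add(fetch.apply(i));
        }
        return page;
    }
}
```

With 25 hits and a page size of 10, page 2 triggers five fetches (ranks 20 to
24) and nothing else is loaded.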

Again, don't let patterns get in your way.

	Erik



Re: ValueListHandler pattern with Lucene

2004-04-11 Thread lucene
On Sunday 11 April 2004 17:16, Erik Hatcher wrote:
> Well, yes the one we already discussed.  Let your presentation tier
> talk directly to Hits, so you are as efficient as possible with access
> to documents, and only fetch what you need.
>
> Again, don't let patterns get in your way.

Well, the point of tiers and (BTW: language-independent) patterns is to
modularize software and make things exchangeable. This way
neither the presentation tier nor the search engine is exchangeable.

The actual problem is that VLH is designed to hold a static list of VOs. VLH
needs to evolve to support something like a data provider that may add data
dynamically. The problem so far is that an Iterator must throw a
ConcurrentModificationException if the backing data is modified, but since data
in a VLH is never removed, only added, this should be possible to implement.
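
A rough sketch of that idea, under assumptions: the fail-fast behaviour comes
from the java.util collection implementations rather than from the Iterator
interface itself, so an append-only list can hand out an index-based iterator
that simply sees elements added after it was created. AppendOnlyValueList is a
hypothetical name, not part of any VLH implementation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical append-only value list: elements are added, never removed or replaced. */
final class AppendOnlyValueList<T> implements Iterable<T> {
    private final List<T> items = new ArrayList<>();

    synchronized void add(T item) {
        items.add(item);
    }

    synchronized int size() {
        return items.size();
    }

    private synchronized T get(int index) {
        return items.get(index);
    }

    /** The iterator tracks only a position, so later appends never invalidate it. */
    @Override
    public Iterator<T> iterator() {
        return new Iterator<T>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < size();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return get(next++);
            }
        };
    }
}
```

Because nothing is ever removed, an index that was valid once stays valid, which
is what makes the position-based iterator safe here. A data provider could keep
calling add() while the presentation tier iterates.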

Timo




Re: ValueListHandler pattern with Lucene

2004-04-11 Thread Erik Hatcher
On Apr 11, 2004, at 11:28 AM, [EMAIL PROTECTED] wrote:
> On Sunday 11 April 2004 17:16, Erik Hatcher wrote:
>> Well, yes the one we already discussed.  Let your presentation tier
>> talk directly to Hits, so you are as efficient as possible with access
>> to documents, and only fetch what you need.
>>
>> Again, don't let patterns get in your way.
>
> Well, the point of tiers and (BTW: language-independent) patterns is to
> modularize software and make things exchangeable. This way
> neither the presentation tier nor the search engine is exchangeable.
>
> The actual problem is that VLH is designed to hold a static list of VOs.
> VLH needs to evolve to support something like a data provider that may
> add data dynamically. The problem so far is that an Iterator must throw
> a ConcurrentModificationException if the backing data is modified, but
> since data in a VLH is never removed, only added, this should be
> possible to implement.

In other words, you need to invent your own pattern here?!  :)

The benefit of agility is to know that any decision you make now is not 
something that prohibits you from change later.  Do you really think 
you're going to plug-and-play with search engines?  Or will you be 
sticking with Lucene for the foreseeable future?  Are you trying to 
plan for a future without Lucene when there is no use-case for doing 
so?  If you code with coupling to Lucene, do you see that as making 
life harder in the future, or are you smart enough and flexible enough 
to change your software as times change?

Throw your patterns away when they don't solve the problem.  Be 
pragmatic _and_ agile.

	Erik



Stemming options

2004-04-11 Thread Boris Goldowsky
Has anyone on the list implemented a dictionary-based English stemmer
with Lucene?  Perhaps based on the freely-available ispell dictionaries
or something like that?  The Porter and Snowball stemmers have not
worked that well for our application, but it is a bit daunting to start
from scratch in developing an alternate stemmer.

Alternatively, is there an algorithmic stemmer that anyone has used
which is a little less aggressive than the Porter algorithm?  We've been
having problems with searches for "conversion" returning "converse" and
"conversational", and "animal" returning "animate".  Yes, these are
morphologically related, but in our particular application it would be
better to stick with removing simple inflections.
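
As an illustration of what "removing simple inflections" could look like, here
is a minimal sketch of a plural-only stemmer. The three suffix rules are an
assumption modelled on the "S-stemmer" usually attributed to Harman; SStemmer
is a hypothetical class, it handles only plural endings (not -ing or -ed), and
it is a starting point rather than a recommendation for any particular index.

```java
import java.util.Locale;

/** Hypothetical plural-only stemmer; only the first matching suffix rule is applied. */
final class SStemmer {
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        int n = w.length();
        if (n < 4 || !w.endsWith("s")) {
            return w;                                  // too short, or no plural ending
        }
        if (w.endsWith("ies")) {
            if (w.endsWith("eies") || w.endsWith("aies")) {
                return w;                              // exception: leave unchanged
            }
            return w.substring(0, n - 3) + "y";        // "ponies" -> "pony"
        }
        if (w.endsWith("es")) {
            if (w.endsWith("aes") || w.endsWith("ees") || w.endsWith("oes")) {
                return w;                              // exception: leave unchanged
            }
            return w.substring(0, n - 1);              // "horses" -> "horse"
        }
        if (w.endsWith("us") || w.endsWith("ss")) {
            return w;                                  // "class" stays as it is
        }
        return w.substring(0, n - 1);                  // "animals" -> "animal"
    }
}
```

Under these rules "conversion", "converse", "conversational", "animal" and
"animate" all stay distinct, while "conversions" and "animals" reduce to their
singular forms.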

Thanks for any pointers --

Boris






Carrot 2 (was: Re: clustering results)

2004-04-11 Thread Otis Gospodnetic
Carrot (2):

  http://www.cs.put.poznan.pl/dweiss/carrot/xml/index.xml?lang=en

Otis

--- [EMAIL PROTECTED] wrote:
> I got all excited reading the subject line "clustering results" but
> this isn't really clustering, is it?  This is more sorting.  Does anyone
> know of any work within Lucene (or another indexer) to do actual subject
> clustering (i.e. like Vivisimo @ http://vivisimo.com/ or Kartoo @
> http://www.kartoo.com/)?  It would be pretty awesome if Lucene had such
> ability; I know there aren't a whole lot of clustering options, and the
> commercial products are very expensive.  Anyhow, just curious.
>
> A brief definition of clustering: automatically organizing search or
> database query results into meaningful hierarchical folders ...
> transforming long lists of search results into categorized information
> without any clumsy pre-processing of the source documents.
>
> I'm not sure how it would be done...?  Based off of top Term
> Frequencies for a document?
>
> -K
>
> Quoting Michael A. Schoen [EMAIL PROTECTED]:
>
> > So as Venu pointed out, sorting doesn't seem to help the problem.
> > If we have to walk the result set, access docs and dedupe using
> > brute force, we're better off w/ the standard order by relevance.
> >
> > If you've got an example of this type of clustering done in a more
> > efficient way, that'd be great.
> >
> > Any other ideas?
> >
> > ----- Original Message -----
> > From: Erik Hatcher [EMAIL PROTECTED]
> > To: Lucene Users List [EMAIL PROTECTED]
> > Sent: Saturday, April 10, 2004 12:35 AM
> > Subject: Re: clustering results
> >
> > > On Apr 9, 2004, at 8:16 PM, Michael A. Schoen wrote:
> > > > I have an index of urls, and need to display the top 10 results
> > > > for a given query, but want to display only 1 result per domain.
> > > > It seems that using either Hits or a HitCollector, I'll need to
> > > > access the doc, grab the domain field (I'll have it parsed ahead
> > > > of time) and only take/display documents that are unique.
> > > >
> > > > A significant percentage of the time I expect I may have to
> > > > access thousands of results before I find 10 in unique domains.
> > > > Is there a faster approach that won't require accessing
> > > > thousands of documents?
> > >
> > > I have examples of this that I can post when I have more time,
> > > but a quick pointer... check out the overloaded
> > > IndexSearcher.search() methods which accept a Sort.  You can do
> > > really really interesting slicing and dicing, I think, using it.
> > > Try this one on for size:
> > >
> > >     example.displayHits(allBooks,
> > >         new Sort(new SortField[]{
> > >           new SortField("category"),
> > >           SortField.FIELD_SCORE,
> > >           new SortField("pubmonth", SortField.INT, true)
> > >         }));
> > >
> > > Be clever indexing the piece you want to group on - I think you
> > > may find this the solution you're looking for.
> > >
> > > Erik
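
For reference, the brute-force approach described in the quoted question can be
sketched as below: walk the hits in rank order and keep the first hit per
domain until enough unique domains are found. DomainDeduper and the key
function are hypothetical and use only the JDK; this does not avoid touching
the skipped results, it only stops as early as possible.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical helper: keeps the best-ranked hit for each key (e.g. a domain). */
final class DomainDeduper {
    /**
     * @param rankedHits hits in descending relevance order
     * @param keyOf      extracts the grouping key, e.g. the stored domain field
     * @param limit      maximum number of hits to keep
     */
    static <T> List<T> firstPerKey(Iterable<T> rankedHits, Function<T, String> keyOf, int limit) {
        List<T> kept = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (T hit : rankedHits) {
            if (kept.size() >= limit) {
                break;                          // stop walking as soon as the page is full
            }
            if (seen.add(keyOf.apply(hit))) {   // add() returns false for a repeated key
                kept.add(hit);
            }
        }
        return kept;
    }
}
```

If the iterable loads hits lazily, the number of documents accessed is bounded
by the rank of the last kept hit rather than by the size of the result set.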
  
  
  





AW: clustering results

2004-04-11 Thread Karsten Konrad

Hi (danger: shameless advertising below),

our partner, Brox It Solutions, is using our (XtraMind Technologies GmbH) clustering 
to implement meta-search clustering of search results à la Vivisimo. Check out:

http://www.anyfinder.de/

The clustering is done on the snippets coming from search engines, but the original 
version, which we still use in our own products, is based on modified Lucene indexes, 
as these can efficiently handle lots of information on texts and terms. Our clustering 
engine not only clusters search results but also performs trend recognition for 
competitive intelligence and similar tasks, though not many people require such 
specialized features.

Brox' price models for this engine may be interesting for those who find other 
products too expensive; it also works with all existing search engines, not only 
Lucene. 

--

Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab

XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
[EMAIL PROTECTED]
www.xtramind.com







-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Sunday, April 11, 2004 19:03
To: Lucene Users List
Subject: Re: clustering results


I got all excited reading the subject line "clustering results" but this isn't 
really clustering, is it?  This is more sorting.  Does anyone know of any work 
within Lucene (or another indexer) to do actual subject clustering (i.e. like 
Vivisimo @ http://vivisimo.com/ or Kartoo @ http://www.kartoo.com/)?  It would 
be pretty awesome if Lucene had such ability, I know there aren't a whole lot 
of clustering options, and the commercial products are very expensive.  
Anyhow, just curious.

A brief definition of clustering: automatically organizing search or database 
query results into meaningful hierarchical folders ... transforming long lists 
of search results into categorized information without any clumsy pre-processing of 
the source documents.

I'm not sure how it would be done...?  Based off of top Term Frequencies for a 
document?

-K

Quoting Michael A. Schoen [EMAIL PROTECTED]:

> So as Venu pointed out, sorting doesn't seem to help the problem. If
> we have to walk the result set, access docs and dedupe using brute
> force, we're better off w/ the standard order by relevance.
>
> If you've got an example of this type of clustering done in a more
> efficient way, that'd be great.
>
> Any other ideas?
>
> ----- Original Message -----
> From: Erik Hatcher [EMAIL PROTECTED]
> To: Lucene Users List [EMAIL PROTECTED]
> Sent: Saturday, April 10, 2004 12:35 AM
> Subject: Re: clustering results
>
> > On Apr 9, 2004, at 8:16 PM, Michael A. Schoen wrote:
> > > I have an index of urls, and need to display the top 10 results
> > > for a given query, but want to display only 1 result per domain.
> > > It seems that using either Hits or a HitCollector, I'll need to
> > > access the doc, grab the domain field (I'll have it parsed ahead of
> > > time) and only take/display documents that are unique.
> > >
> > > A significant percentage of the time I expect I may have to access
> > > thousands of results before I find 10 in unique domains. Is there
> > > a faster approach that won't require accessing thousands of
> > > documents?
> >
> > I have examples of this that I can post when I have more time, but a
> > quick pointer... check out the overloaded IndexSearcher.search()
> > methods which accept a Sort.  You can do really really interesting
> > slicing and dicing, I think, using it.  Try this one on for size:
> >
> >     example.displayHits(allBooks,
> >         new Sort(new SortField[]{
> >           new SortField("category"),
> >           SortField.FIELD_SCORE,
> >           new SortField("pubmonth", SortField.INT, true)
> >         }));
> >
> > Be clever indexing the piece you want to group on - I think you may
> > find this the solution you're looking for.
> >
> > Erik
  
 
 





