Re: knowing which field contributed the search result

2005-02-22 Thread David Spencer
John Wang wrote:
Hi David:
Can you further explain which calls specifically would solve my problem?
Not in depth but anyway:
Examine the output of Explanation.toHtml() and/or 
Explanation.toString(). Does it contain the info you want? If so, call 
the other Explanation methods and/or dig into the src if necessary. 
getValue() is the score, so all that's missing is the name of the field, 
and I'm not sure if that's directly returned or not.
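
For reference, a minimal sketch of the kind of post-processing meant here (the index location and the stored "url" field are made up, and scanning the Explanation text for "field:" is just a crude heuristic, not an official API):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;

public class ExplainFields {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index"); // hypothetical location
        Query q = QueryParser.parse(
            "contents1:\"brown fox\" contents2:\"black bear\"", "contents1", new StandardAnalyzer());
        Hits hits = searcher.search(q);
        for (int i = 0; i < hits.length(); i++) {
            // The explanation text mentions field:term pairs, so looking for the
            // field name in it is a rough way to see which field matched.
            String text = searcher.explain(q, hits.id(i)).toString();
            boolean f1 = text.indexOf("contents1:") >= 0;
            boolean f2 = text.indexOf("contents2:") >= 0;
            System.out.println(hits.doc(i).get("url") + " matched via "
                + (f1 && f2 ? "both" : (f1 ? "contents1" : "contents2")));
        }
        searcher.close();
    }
}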


Thanks
-John
On Mon, 21 Feb 2005 12:20:15 -0800, David Spencer
[EMAIL PROTECTED] wrote:
John Wang wrote:

Anyone has any thoughts on this?
Does this help?
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Searchable.html#explain(org.apache.lucene.search.Query,%20int)
Thanks
-John
On Wed, 16 Feb 2005 14:39:52 -0800, John Wang [EMAIL PROTECTED] wrote:

Hi:
 Is there a way, given a hit from a search, to find out which
fields contributed to the hit?
e.g.
If my search is:
contents1="brown fox" OR contents2="black bear"
can the document found by this query also have information on
whether it was found via contents1 or contents2 or both?
Thanks
-John



Re: Handling Synonyms

2005-02-21 Thread David Spencer
Luke Shannon wrote:
Hello;
Does anyone see a problem with the following approach?
No, no problem with it and it's in fact what my Wordnet Query 
Expansion sandbox module does.

The nice thing about Lucene is you at least have the option of doing 
things the other way - you can write a custom Analyzer that puts all 
synonyms at the same token offset so they appear to be in the same place 
in the token stream. Thinking about it...this approach, with the 
Analyzer, lets users search for phrases which would match a synonym, so, 
using your example below, the text "bright red engine" would be matched 
by either phrase "bright red" or "bright colour". Doing the query 
expansion is trickier if you allow phrases.
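
For what it's worth, a bare-bones sketch of that analyzer-side approach, assuming a simple in-memory word -> synonyms Map (the WordNet sandbox code does the real lookup; this only shows the position-increment trick):

import java.io.IOException;
import java.util.*;
import org.apache.lucene.analysis.*;

// Injects synonyms at the same position as the original token so that
// phrase queries can match either the original word or a synonym.
public class SynonymFilter extends TokenFilter {
    private final Map synonyms;                    // word -> List of synonyms (made-up map)
    private final LinkedList pending = new LinkedList();

    public SynonymFilter(TokenStream in, Map synonyms) {
        super(in);
        this.synonyms = synonyms;
    }

    public Token next() throws IOException {
        if (!pending.isEmpty()) return (Token) pending.removeFirst();
        Token t = input.next();
        if (t == null) return null;
        List syns = (List) synonyms.get(t.termText());
        if (syns != null) {
            for (Iterator it = syns.iterator(); it.hasNext();) {
                Token syn = new Token((String) it.next(), t.startOffset(), t.endOffset());
                syn.setPositionIncrement(0);       // same position as the original token
                pending.add(syn);
            }
        }
        return t;
    }
}

An Analyzer would just wrap its normal token stream with this filter, and the same analyzer has to be used at index time for the phrase behaviour described above.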

For synonyms, rather than putting them in the index, I put the original term
and all the synonyms in the query.
Every time I create a query, I check if the term has any synonyms. If it
does, I create Boolean Query OR'ing one Query object for each synonym.
So if I have a synonym list:
red = colour, primary, stop
And someone wants to search the desc field for red, I would end up with
something like:
( (desc:*red*) (desc:*colour*) (desc:*stop*) ).
I don't like that bit about substring terms, but if it's right for you 
ok - if you insist on loosening things I'd consider fuzzy terms 
(desc:red~ ...etc).


Now the synonyms wouldn't be in the index; the Query would account for all
the possible synonym terms.
Luke
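
For illustration, a rough sketch of that query-side expansion (plain TermQuerys per synonym; swap in FuzzyQuery or wildcard terms if you prefer the looser matching discussed above). The synonymMap lookup is a made-up placeholder:

import java.util.*;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class SynonymQueryBuilder {
    // Builds e.g. ( desc:red desc:colour desc:primary desc:stop ), all clauses optional.
    public static Query expand(String field, String word, Map synonymMap) {
        BooleanQuery bq = new BooleanQuery();
        bq.add(new TermQuery(new Term(field, word)), false, false);   // the original term
        List syns = (List) synonymMap.get(word);                      // hypothetical lookup
        if (syns != null) {
            for (Iterator it = syns.iterator(); it.hasNext();) {
                bq.add(new TermQuery(new Term(field, (String) it.next())), false, false);
            }
        }
        return bq;
    }
}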



Re: knowing which field contributed the search result

2005-02-21 Thread David Spencer
John Wang wrote:
Anyone has any thoughts on this?
Does this help?
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Searchable.html#explain(org.apache.lucene.search.Query,%20int)
Thanks
-John
On Wed, 16 Feb 2005 14:39:52 -0800, John Wang [EMAIL PROTECTED] wrote:
Hi:
  Is there a way, given a hit from a search, to find out which
fields contributed to the hit?
e.g.
If my search is:
contents1="brown fox" OR contents2="black bear"
can the document found by this query also have information on
whether it was found via contents1 or contents2 or both?
Thanks
-John



Re: Search Performance

2005-02-18 Thread David Spencer
No one has mentioned JVM options yet.
[a] -server
[b] -XX:CompileThreshold=1000
[c] Raise the -Xms value if you haven't done so (-Xms...)
I think by default the VM runs with -client but -server makes more 
sense for web containers (Tomcat etc).
[b] tells the hotspot compiler to compile methods sooner - you can lower 
the 1000 to, say, 2, which makes it compile methods after they've executed 2 
times - I had trouble once lowering this to 1 for some reason


Also, even though you're not supposed to need to do this, I've found it 
helpful to force gc() periodically e.g. every minute via this idiom:

public static long gc()
{
long bef = mem();
System.gc();
sleep( 100);
System.runFinalization();
sleep( 100);
System.gc();
long aft = mem();
return aft - bef; // approx bytes freed
}

// simple helpers assumed by the idiom above
private static long mem() { Runtime r = Runtime.getRuntime(); return r.totalMemory() - r.freeMemory(); }
private static void sleep( long ms) { try { Thread.sleep( ms); } catch ( InterruptedException ie) { } }
Michael Celona wrote:
What is single-handedly the best way to improve search performance?  I have
an index in the 2G range stored on the local file system of the searcher.
Under a load test of 5 simultaneous users my average search time is ~4700
ms.  Under a load test of 10 simultaneous users my average search time is
~1 ms.  I have given the JVM 2G of memory and am using dual 3GHz
Xeons.  Any ideas?

 

Michael




Re: Search Performance

2005-02-18 Thread David Spencer
Are you using the highlighter or doing anything non-trivial in 
displaying the results?

Are the pages being compressed (mod_gzip or some servlet equivalent)? 
This definitely helps, though to see the effect you may have to make 
sure your simulated users are remote.

Also consider caching search results if it's reasonable to assume users 
may search for the same things.

I made some measurements on caching on my site:
http://www.searchmorph.com/weblog/index.php?id=41
http://www.searchmorph.com/weblog/index.php?id=40
And I use OSCache:
http://www.searchmorph.com/weblog/index.php?id=38
http://www.opensymphony.com/oscache/
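
As a bare illustration of the result-caching idea (not OSCache, just the gist): an LRU map keyed by the raw query string, holding whatever gets rendered per query, and cleared whenever the index is reopened.

import java.util.LinkedHashMap;
import java.util.Map;

// Access-ordered LRU cache: query string -> rendered results (or doc ids, etc).
public class ResultCache extends LinkedHashMap {
    private final int max;
    public ResultCache(int max) { super(16, 0.75f, true); this.max = max; }
    protected boolean removeEldestEntry(Map.Entry eldest) { return size() > max; }
}

Wrap it with Collections.synchronizedMap() if several request threads share it.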


Michael Celona wrote:
What is single-handedly the best way to improve search performance?  I have
an index in the 2G range stored on the local file system of the searcher.
Under a load test of 5 simultaneous users my average search time is ~4700
ms.  Under a load test of 10 simultaneous users my average search time is
~1 ms.  I have given the JVM 2G of memory and am using dual 3GHz
Xeons.  Any ideas?

 

Michael




Re: Search Performance

2005-02-18 Thread David Spencer
Michael Celona wrote:
Just tried that... works like a charm... thanks...
Could you clarify what the problem was - just the overhead of opening 
IndexSearchers?
Michael
-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 18, 2005 4:42 PM
To: Lucene Users List; Chris Lamprecht
Subject: Re: Search Performance

Or you could just open a new IndexSearcher, forget the old one, and
have GC collect it when everyone is done with it.
Otis
--- Chris Lamprecht [EMAIL PROTECTED] wrote:

I should have mentioned, the reason for not doing this the obvious,
simple way (just close the Searcher and reopen it if a new version is
available) is because some threads could be in the middle of
iterating
through the search Hits.  If you close the Searcher they get a "Bad
file descriptor" IOException.  As I found out the hard way :)
On Fri, 18 Feb 2005 15:03:29 -0600, Chris Lamprecht
[EMAIL PROTECTED] wrote:
I recently dealt with the issue of re-using a Searcher with an
index
that changes often.  I wrote a class that allows my searching
classes
to check out a lucene Searcher, perform a search, and then return
the Searcher.  It's similar to a database connection pool, except
that
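
For reference, a minimal sketch of the "just swap in a new IndexSearcher" idea Otis describes - the old searcher is simply not handed out any more, so threads still iterating Hits keep working (index path and class name are made up):

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;

public class SearcherHolder {
    private final String indexPath;
    private volatile IndexSearcher current;

    public SearcherHolder(String indexPath) throws IOException {
        this.indexPath = indexPath;
        this.current = new IndexSearcher(indexPath);
    }

    public IndexSearcher getSearcher() { return current; }

    // Call after the writer commits new documents; the old searcher is left
    // for GC rather than closed, so in-flight searches don't hit a closed index.
    public synchronized void reopen() throws IOException {
        current = new IndexSearcher(indexPath);
    }
}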


Re: Document comparison

2005-02-18 Thread David Spencer
Otis Gospodnetic wrote:
Matt,
Erik and I have some code for this in Lucene in Action, but David
Spencer did this since the book was published:
  http://www.lucenebook.com/blog/announcements/more_like_this.html

If you want an informal way of doing it you're right, just feed the 
words of the source doc to a query. The doc for the code is at this 
easy-to-remember URL:
http://searchmorph.com/pub/jakarta-lucene-sandbox/contributions/similarity/build/docs/api/org/apache/lucene/search/similar/SimilarityQueries.html#formSimilarQuery(java.lang.String,%20org.apache.lucene.analysis.Analyzer,%20java.lang.String,%20java.util.Set)

Follow Otis's link above to my weblog for the code.
The MoreLikeThis stuff is similar but more sophisticated.
Also if you want the IR way I think you'd do a cosine measure. I know 
carrot2 has the code - this might be it:

http://www.searchmorph.com/pub/carrot2/jd/com/chilang/carrot/filter/cluster/rough/measure/CosineCoefficient.html
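
A small usage sketch of the formSimilarQuery() call mentioned above (the index location, the "contents" and "title" field names, and the empty stop-word set are all illustrative):

import java.util.HashSet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.*;
import org.apache.lucene.search.similar.SimilarityQueries;

public class SimilarDocs {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");   // hypothetical
        String body = "entire text of the source document goes here";
        Query like = SimilarityQueries.formSimilarQuery(
            body, new StandardAnalyzer(), "contents", new HashSet() /* stop words */);
        Hits hits = searcher.search(like);
        for (int i = 0; i < Math.min(10, hits.length()); i++)
            System.out.println(hits.score(i) + "\t" + hits.doc(i).get("title"));
        searcher.close();
    }
}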
Otis
--- Matt Chaput [EMAIL PROTECTED] wrote:

Is there a simple, efficient way to compute similarity of documents 
indexed with Lucene?

My first, naive idea is to use the entire contents of one document as
a 
query to the second document, and use the score as a similarity 
measurement. But I think I'm probably way off base with that.

Can any IR pros set me straight? Thanks very much.
Matt
--
Matt Chaput
Word Monkey
Side Effects Software Inc.
A goddamned ray of sunshine all the goddamned time
-- Sparkle Hayter


Re: Lucene vs. in-DB-full-text-searching

2005-02-18 Thread David Spencer
Otis Gospodnetic wrote:
The most obvious answer is that the full-text indexing features of
RDBMS's are not as good (as fast) as Lucene.  MySQL, PostgreSQL,
Oracle, MS SQL Server etc. all have full-text indexing/searching
features, 
but I always hear people complaining about the speed. 
Yeah, but in theory, in the ideal world :), it shouldn't be any slower - 
there's no magic Lucene has that DB's don't.  And the big advantage of 
it being embedded in the DB is the index can always be up to date, just 
as if you had Lucene updating the index based on a trigger. You don't 
need any separate cron job to periodically update the index.

But this brings up - has anyone run Lucene off a database trigger or are 
 triggers known to be slow and bad for this use?

A
person from a well-known online bookseller told me recently that Lucene
was about 10x faster than MySQL for full-text searching, and I am
currently helping someone get away from MySQL and into Lucene for
performance reasons.
Otis

--- Steven J. Owens [EMAIL PROTECTED] wrote:

Hi,
I was rambling to some friends about an idea to build a
cache-aware JDBC driver wrapper, to make it easier to keep a lucene
index of a database up to date.
They asked me a question that I have to take seriously, which is
that most RDBMSes provide some built-in fulltext searching -
postgres,
mysql, even oracle - why not use that instead of adding another layer
of caching?
I have to take this question seriously, especially since it
reminds me a lot of what Doug has often said to folks contemplating
doing similar things (caching query results, etc) with Lucene.
Has anybody done some serious investigation into this, and could
summarize the pros and cons?
--
Steven J. Owens
[EMAIL PROTECTED]
I'm going to make broad, sweeping generalizations and strong,
declarative statements, because otherwise I'll be here all night and
this document will be four times longer and much less fun to read.
Take it all with a grain of salt. - http://darksleep.com/notablog


Re: Lucene vs. in-DB-full-text-searching

2005-02-18 Thread David Spencer
markharw00d wrote:
 But this brings up - has anyone run Lucene off a database trigger or 
are  triggers known to be slow and bad for this use?

I suspect the tricky bit would be knowing when to balance the calls to 
Reader/Writer closes, opens and optimizes.
Record updates are the usual fun and games involving a reader.delete and 
a document.write.
I agree this is the usual tricky/fun thing.
In similar situations I have:
- batched the updates in, well, sort of a queue
- flushed the queue after t seconds or n documents (e.g. t=60sec, 
n=1000 docs)

Part of the trick is a document that changes multiple times during one 
of these periods - if you have an add queue and a delete queue then 
you'll probably end up with the wrong index, with the doc appearing either 
zero times or more than once - not impossible to cover, just something to keep in mind.
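
As a rough sketch of that batching idea (thresholds, the "pk" key field and class names are made up; keying the queue by primary key is one way to dodge the double-add/zero-add problem just mentioned, since a later change simply overwrites the earlier queue entry):

import java.util.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class BatchedUpdater {
    private final Map pending = new HashMap();   // pk -> Document (null means delete only)
    private final String indexPath;
    private long lastFlush = System.currentTimeMillis();

    public BatchedUpdater(String indexPath) { this.indexPath = indexPath; }

    public synchronized void update(String pk, Document doc) {
        pending.put(pk, doc);                    // last write wins within a window
        if (pending.size() >= 1000 || System.currentTimeMillis() - lastFlush > 60 * 1000L)
            flush();
    }

    public synchronized void flush() {
        try {
            IndexReader reader = IndexReader.open(indexPath);
            for (Iterator it = pending.keySet().iterator(); it.hasNext();)
                reader.delete(new Term("pk", (String) it.next()));   // drop any old copy
            reader.close();

            IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
            for (Iterator it = pending.values().iterator(); it.hasNext();) {
                Document d = (Document) it.next();
                if (d != null) writer.addDocument(d);
            }
            writer.close();
            pending.clear();
            lastFlush = System.currentTimeMillis();
        } catch (java.io.IOException e) {
            throw new RuntimeException("flush failed: " + e);
        }
    }
}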

- Dave



Re: Document Clustering

2005-02-08 Thread David Spencer
Owen Densmore wrote:
I would like to be able to analyze my document collection (~1200 
documents) and discover good buckets of categories for them.  I'm 
pretty sure this is termed Document Clustering .. finding the emergent 
clumps the documents fall naturally into judging from their term vectors.

Looking at the discussion that flared roughly a year ago (last message 
2003-11-12) with the subject Document Clustering, it seems Lucene should 
be able to help with this.  Has anyone had success with this recently?

Last year it was suggested Carrot2 could help, and it would even produce 
good labels for the clusters.  Has this proven to be true?  Our goal is 
to use clustering to build a nifty graphic interface, probably using Flash.
Carrot2 seems to work nicely.
Demo here...
Search for something like artificial intelligence in my Wikipedia 
Search engine:

http://www.searchmorph.com/kat/wikipedia.jsp?s=artificial+intelligence
Then click on the "see clustered results..." link to go here:
http://www.searchmorph.com/kat/wikipedia-cluster.jsp?s=artificial%20intelligence
And voilà, what seem like decent clusters.
I'm not sure what the complexity of the algorithm is, but for me ~100 
docs works ok, maybe 200, but beyond 200 you need lots more CPU and RAM.

I suggest: try it w/ ~100 docs, and if you like what you see, keep 
increasing the # of docs you give it. You might have to wait a while w/ 
all 1,200 docs...

- Dave



Thanks for any pointers.
Owen


Configurable indexing of an RDBMS, has it been done before?

2005-02-07 Thread David Spencer
Many times I've written ad-hoc code that pulls in data from an RDBMS and 
builds a Lucene index. The use case is a typical database-driven dynamic 
website which would be a hassle to spider (say, due to tricky 
authentication).

I had a feeling this had been done in a general manner but didn't see 
any code in the sandbox, nor did any searches turn it up.

I've spent a few mins thinking this thru - what I'd expect to be able 
to configure is:

1. JDBC Driver + conn params
2. Query to do a 1 time full index
3. Query to show new records
4. Query to show changed records
5. Query to show deleted records
6. Query columns to Lucene Field name mapping
7. Type of each field name (e.g. the equivalent of the args to the 
Field ctr)

So a simple example, taking item 2 is
query: select url, name, body from foo
(now the column to field mapping)
col 1 = url
col 2 = title
col 3 = contents
(now the field types for each named field)
url = Field( ...store=true, index=false)
title = Field( ...store=true, index=true)
contents = Field( ...store=false, index=true)

And voilà, nice, elegant, data-driven indexing.
Does it exist?
Should it? :)
PS
 I know in the more general form, query needs to be replaced by 
queries above, and the updated query may need some time stamp variable 
expansion, and possibly the queries need paging to deal w/ lamo DBs 
like mysql that don't have cursors for large result sets...
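
To make the idea concrete, a stripped-down sketch of the batch case (item 2) - driver class, connection params, query and mapping are illustrative config values, not an existing module:

import java.sql.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class RdbmsIndexer {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");                            // item 1: driver
        Connection con = DriverManager.getConnection(
            "jdbc:mysql://localhost/foo", "user", "pass");                 // item 1: conn params
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);

        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("select url, name, body from foo"); // item 2
        while (rs.next()) {
            Document doc = new Document();
            doc.add(Field.UnIndexed("url", rs.getString(1)));       // store=true, index=false
            doc.add(Field.Text("title", rs.getString(2)));          // store=true, index=true
            doc.add(Field.UnStored("contents", rs.getString(3)));   // store=false, index=true
            writer.addDocument(doc);
        }
        rs.close(); st.close(); con.close();
        writer.optimize();
        writer.close();
    }
}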




Re: Configurable indexing of an RDBMS, has it been done before?

2005-02-07 Thread David Spencer
Nice, very similar to what I was thinking of, where the most significant 
difference is probably just that I was thinking of a batch indexer, not 
one embedded in a web container. Probably a worthwhile contribution to 
the sandbox.


Aad Nales wrote:
Yep,
This is how we do it.
We have a search.xml that maps database fields to search fields and a 
parameter part that describes the 'click for detailed result url' and 
the parameter names (based on the search fields). In this xml we also 
describe how the different fields should be stored; for instance, we have 
a number of large text fields for which we use the unstored option.

The framework that we have built around it has an element that we call a 
detailer. This detailer creates a Lucene Document with the fields as 
specified in the search.xml.

To illustrate here is the code that specifies the detailer for a forum.
-- XML -----------------------------------------------------------------
<documenttype id="FORUM" index="general" defaultfield="body">
  <fields>
    <field property="messageid" searchfield="messageid" type="unindexed" key="true"/>
    <field property="instanceid" searchfield="instanceid" type="unindexed"/>
    <field property="subject" searchfield="title" type="split" maxwords="8"/>
    <field property="body" searchfield="default" type="split" maxwords="20"/>
    <field property="aka_username" searchfield="username" type="keyword"/>
    <field property="modifiedDateAsDate" searchfield="modifieddate" type="keyword"/>
  </fields>
  <action uri="/forum/viewMessage.do"
          image="/htmlarea/images/cops_insert_threadlink.gif">
    <parameter property="messageid" name="messageid"/>
    <parameter property="instanceid" name="instanceid"/>
  </action>
  <analyzer classname="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
</documenttype>
---- END XML -----------------------------------------------------------

Please note:
Messageid is the key field here; when we search the index we use a 
combined TYPE + KEY id to filter out double hits on the same document 
(not unusual in, for instance, a long forum thread).

Per type of document we also specify what picture to show in the 
result (image), and we specify in what index the result should be 
written and what the general search field is (if the user submits a 
query without search all and without a field specified).

We have added the 'split' keyword which makes it possible to search a 
long text but only store a bit in the resulting hit.

The reindex is pretty straightforward: we build a series of detailers for 
all possible document types and we run through the database and call the 
right detailer from a HashMap.

We have not included the JDBC stuff since the application is always 
running in Tomcat-Struts and since we cache most of the database reads. 
(a completely different story).

Queries on new and changed records seem to only make sense if asked in a 
context of time. (Right?) We have not needed it yet. The mapping can be 
queried from a singleton Java class (SearchConfiguration).

We are currently adding functionality to store 'user structured data' 
best imagined as user defined input forms that are described in XML and 
are then stored as XML in the database. We query these documents using 
Lucene. These documents end up in the same index but this is quite 
manageable by using specialized detailers. For these documents the type 
is more important than for the 'normally' stored documents. For this 
latter situation the search logic assumes that the query is 
appropriately configured by the application.

I am not sure if this is the kind of solution that you are looking for, 
but everything we produce is 100% open source.

Cheers,
Aad
David Spencer wrote:
Many times I've written ad-hoc code that pulls in data from an RDBMS 
and builds a Lucene index. The use case is a typical database-driven 
dynamic website which would be a hassle to spider (say, due to tricky 
authentication).

I had a feeling this had been done in a general manner but didn't see 
any code in the sandbox, nor did any searches turn it up.

I've spent a few mins thinking this thru - what I'd expect to be 
able to configure is:

1. JDBC Driver + conn params
2. Query to do a 1 time full index
3. Query to show new records
4. Query to show changed records
5. Query to show deleted records
6. Query columns to Lucene Field name mapping
7. Type of each field name (e.g. the equivalent of the args to the 
Field ctr)

So a simple example, taking item 2 is
query: select url, name, body from foo
(now the column to field mapping)
col 1 = url
col 2 = title
col 3 = contents
(now the field types for each named field)
url = Field( ...store=true, index=false)
title = Field( ...store=true, index=true)
contents = Field( ...store=false, index=true)

And voilà, nice, elegant, data-driven indexing.
Does it exist?
Should it? :)
PS
I know in the more general form, query needs to be replaced by 
queries above, and the updated query may need some time stamp 
variable expansion, and possibly the queries need paging to deal w/ 
lamo DBs like mysql that don't have cursors for large result

competition - Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-02-01 Thread David Spencer
I wasn't sure where in this thread to reply so I'm replying to myself :)
What search appliances exist now?
I only found 3:
[1] Google
[2] Thunderstone
http://www.thunderstone.com/texis/site/pages/Appliance.html
[3] IndexEngines (not out yet)
http://www.indexengines.com/

--
Also, out of curiosity, do people have appliance h/w vendors they like?
These guys seem like they have nice options for pretty colors:
http://www.mbx.com/oem/index.cfm
http://www.mbx.com/oem/options/

David Spencer wrote:
This reminds me, has anyone ever discussed something similar:
- rackmount server ( or for coolness factor, that mini mac)
- web i/f for config/control
- of course the server would have the following s/w:
-- web server
-- lucene / nutch
Part of the work here I think is having a decent web i/f to configure 
the thing and to customize the L&F of the search results.


jian chen wrote:
Hi,
I was searching using google and just found that there was a new
feature called google mini. Initially I thought it was another free
service for small companies. Then I realized that it costs quite some
money ($4,995) for the hardware and software. (I guess the proprietary
software costs a whole lot more than actual hardware.)
The nice feature is that, you can only index up to 50,000 documents
with this price. If you need to index more, sorry, send in the
check...
It seems to me that any small biz will be ripped off if they install
this google mini thing, compared to using Lucene to implement an easy
to use search software, which could search up to whatever number of
documents you could imagine.
I hope the lucene project could get exposed more to the enterprise so
that people know that they have not only cheaper but more importantly,
BETTER alternatives.
Jian


Re: carrot2 question too - Re: Fun with the Wikipedia

2005-01-31 Thread David Spencer
Otis Gospodnetic wrote:
Adam,
Dawid posted some code that lets you use Carrot2 locally with Lucene,
see embedded zip url here for carrot2/lucene code - it may also be in 
the carrot2 cvs tree too - this is what I used in the wikipedia/cluster 
stuff as the basis

http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html
without the componentized pipe line system described on Carrot2 site.


Otis
--- Adam Saltiel [EMAIL PROTECTED] wrote:

David, Hi,
Would you be able to comment on the coincidentally recent thread "RE:
Grouping Search Results by Clustering Snippets"?
Also, when I looked at Carrot2 the pipeline is implemented over http. I
wonder how efficient that is, or can it be changed, for instance for an all
local implementation?
Has Carrot2 been integrated with Lucene, and has it been used as the basis
for a recommender system (could it be?)?
TIA.
Adam

-Original Message-
From: Dawid Weiss [mailto:[EMAIL PROTECTED]
Sent: Monday, January 31, 2005 4:12 PM
To: Lucene Users List
Subject: Re: carrot2 question too - Re: Fun with the Wikipedia
Hi.
Coming up with answers... a little belated, but hope you're still
on:
we have been experimenting with carrot2 and are very pleased so
far,
only one issue: there is no release not even an alpha one and the
dependencies seemed to be patched (jama)
Yes, there is no official release. We just don't feel the need
to tag
the sources with an official label because Carrot is not a
stand-alone
product (rather a library... or a framework). It does not imply
that the
project is in alpha stage... quite the contrary, in fact -- it has
been
out there for a while and it seems to do a good job for most
people.
is there any intentions to have any releases in the near future?
I could tag a release even today if it makes you happy ;) But I
hope I
made the status of the project clear above.
D.



Re: query term frequency

2005-01-27 Thread David Spencer
Jonathan Lasko wrote:
What do I call to get the term frequencies for terms in the Query?  I 
can't seem to find it in the Javadoc...
Do you mean the # of docs that have a term?
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#docFreq(org.apache.lucene.index.Term)
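
If that's what you're after, a tiny sketch along those lines (index location and field name are made up; for a parsed Query you'd walk its clauses to collect the terms first):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class TermFreqs {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        String[] words = { "lucene", "query", "frequency" };
        for (int i = 0; i < words.length; i++) {
            Term t = new Term("contents", words[i]);
            System.out.println(words[i] + ": appears in " + reader.docFreq(t) + " docs");
        }
        reader.close();
    }
}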
Thanks.
Jonathan


rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-01-27 Thread David Spencer
This reminds me, has anyone ever discussed something similar:
- rackmount server ( or for coolness factor, that mini mac)
- web i/f for config/control
- of course the server would have the following s/w:
-- web server
-- lucene / nutch
Part of the work here I think is having a decent web i/f to configure 
the thing and to customize the L&F of the search results.


jian chen wrote:
Hi,
I was searching using google and just found that there was a new
feature called google mini. Initially I thought it was another free
service for small companies. Then I realized that it costs quite some
money ($4,995) for the hardware and software. (I guess the proprietary
software costs a whole lot more than actual hardware.)
The nice feature is that, you can only index up to 50,000 documents
with this price. If you need to index more, sorry, send in the
check...
It seems to me that any small biz will be ripped off if they install
this google mini thing, compared to using Lucene to implement an easy
to use search software, which could search up to whatever number of
documents you could imagine.
I hope the lucene project could get exposed more to the enterprise so
that people know that they have not only cheaper but more importantly,
BETTER alternatives.
Jian


Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-01-27 Thread David Spencer
Jason Polites wrote:
I think everyone agrees that this would be a very neat application of 
opensource technology like Lucene... however (opens drawer, pulls out 
devil's advocate hat, places on head)... there are several complexities 
here not addressed by Lucene (et. al).  Not because Lucene isn't damn 
fantastic, just because it's not its job.

One of the big ones is security.  Enterprise search is no good if it 
doesn't match up with the authentication and authorization paradigms 
existing in the organisation.  How useful is it to return a whole bunch 
of search results for documents to which you don't have access? Not to 
mention the issues around whether you are even authorized to know it 
exists.
I was gonna mention this - you beat me to the punch.  I suspect that 
LDAP/JNDI integration is a start, but you need hooks for an arbitrary 
auth plugin. And once we address this it might be the case that a user 
has to *log in* to the search server.  We have Verity where I work and 
this is all the case, along w/ the fact that a sale seems to involve 
mandatory consulting work (not that that's bad, but if you're trying to 
ship a shrink wrapped search engine in a box then this is an issue).

The other prickly one is file types.  It's all well and good to index 
HTML, XML and text but when you start looking at PDF, MS Office (OLE 
docs, PSTs, Outlook MSG files, MS Project files etc), Lotus Notes 
databases etc etc, things begin to look less simple and far less elegant 
than a nice clean lucene rackmount.  Sure there are great projects like 
Apache POI but they still have a bit of a way to go before they 
mature to a point of really solving these problems.  After which time 
Microsoft will probably be rolling out Longhorn and everyone may need to 
start from scratch.
Also need http://jcifs.samba.org/ so you can spider windows file shares.
This is not to say that it's not a great idea, but as with most great 
ideas the challenge is not the formation of the idea, but its 
implementation.
Indeed.
I think a great first step would be to start developing good, reliable, 
opensource extensions to Lucene which strive to solve some of these issues.

end rant.
- Original Message - From: Otis Gospodnetic 
[EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Friday, January 28, 2005 12:40 PM
Subject: Re: rackmount lucene/nutch - Re: google mini? who needs it when 
Lucene is there


I discuss this with myself a lot inside my head... :)
Seriously, I agree with Erik.  I think this is a business opportunity.
How many people are hating me now and going shh?  Raise your
hands!
Otis
--- David Spencer [EMAIL PROTECTED] wrote:
This reminds me, has anyone every discussed something similar:
- rackmount server ( or for coolness factor, that mini mac)
- web i/f for config/control
- of course the server would have the following s/w:
-- web server
-- lucene / nutch
Part of the work here I think is having a decent web i/f to configure
the thing and to customize the LF of the search results.

jian chen wrote:
 Hi,

 I was searching using google and just found that there was a new
 feature called google mini. Initially I thought it was another
free
 service for small companies. Then I realized that it costs quite
some
 money ($4,995) for the hardware and software. (I guess the
proprietary
 software costs a whole lot more than actual hardware.)

 The nice feature is that, you can only index up to 50,000
documents
 with this price. If you need to index more, sorry, send in the
 check...

 It seems to me that any small biz will be ripped off if they
install
 this google mini thing, compared to using Lucene to implement a
easy
 to use search software, which could search up to whatever number of
 documents you could image.

 I hope the lucene project could get exposed more to the enterprise
so
 that people know that they have not only cheaper but more
importantly,
 BETTER alternatives.

 Jian




Re: google mini? who needs it when Lucene is there

2005-01-27 Thread David Spencer
Xiaohong Yang (Sharon) wrote:
Hi,
 
I agree that Google mini is quite expensive.  It might be similar to the desktop version in quality.  Anyone knows google's ratio of index to text?   Is it true that Lucene's index is about 500 times the original text size (not including image size)?  I don't have one installed, so I cannot measure.
500:1 for Lucene?  I don't think so.
In my wikipedia search engine the data in the MySQL DB I index from is 
approx 1.0 GB (sum of lengths of title and body), while the Lucene index 
of just these 2 fields is 250MB, thus in this case the Lucene index is 
25% of the corpus size.


 
Best,
 
Sharon

jian chen [EMAIL PROTECTED] wrote:
Hi,
I was searching using google and just found that there was a new
feature called google mini. Initially I thought it was another free
service for small companies. Then I realized that it costs quite some
money ($4,995) for the hardware and software. (I guess the proprietary
software costs a whole lot more than actual hardware.)
The nice feature is that, you can only index up to 50,000 documents
with this price. If you need to index more, sorry, send in the
check...
It seems to me that any small biz will be ripped off if they install
this google mini thing, compared to using Lucene to implement a easy
to use search software, which could search up to whatever number of
documents you could image.
I hope the lucene project could get exposed more to the enterprise so
that people know that they have not only cheaper but more importantly,
BETTER alternatives.
Jian


lucenebook.com -- Re: Search on heterogenous index

2005-01-26 Thread David Spencer
Erik Hatcher wrote:
On Jan 26, 2005, at 5:44 AM, Simeon Koptelov wrote:
Heterogenous Documents/indices are OK - check out the second hit:
  http://www.lucenebook.com/search?query=heterogenous+different

Thanks, I'll consider buying Lucene in Action.

Our master plan is working!  :)   Just kidding. I have on my TODO 
list to aggregate more Lucene-related content (like the javadocs, 
Would be nice if we could have up to date sandbox javadoc somewhere.
I've linked to some local copies of it from my page here:
http://www.searchmorph.com/pub/
Also useful would be to use the -linksource tag to javadoc so the 
htmlized source code is available too; that way you have a source code search.

Maybe I should release my javadoc Analyzer which I use on 
searchmorph.com - it tries to do intelligent tokenization of java so 
that a word like HashMap becomes 3 tokens at the same offset 
('hashmap', 'hash', 'map') and which might be useful for you.

I do like the way you provide snippets from the book - nicely done.
Lucene's own documentation, perhaps a crawl of the wiki and the Lucene 
resources) into our search engine so that it becomes a richer resource 
and seems less than a marketing ploy.  Though the highlighted snippets 
do have enough information to be useful in some cases, which is nice.  I 
will start dedicating a few minutes a day to blog some useful content.

By all means, if you have other suggestions for our site, let us know at 
[EMAIL PROTECTED]

Erik


Re: WordNet code updated, now with query expansion -- Re: SYNONYM + GOOGLE

2005-01-24 Thread David Spencer
Pierrick Brihaye wrote:
Hi,
David Spencer a écrit :
One example of expansion with the synonym boost set to 0.9 is the 
query big dog expands to:

Interesting.
Do you plan to add expansion on other Wordnet relationships ? Hypernyms 
and hyponyms would be a good start point for thesaurus-like search, 
wouldn't it ?
Good point, I hadn't considered this - but how would it work - just 
consider these 2 relationships as synonyms (thus easier to use) or make 
them separate (too academic?)
However, I'm afraid that this kind of feature would require refactoring, 
probably based on WordNet-dedicated libraries. JWNL 
(http://jwordnet.sourceforge.net/) may be a good candidate for this.
Good point, should leverage existing code.

Thank you for your work.
thx,
 Dave
Cheers,



Re: carrot2 question too - Re: Fun with the Wikipedia

2005-01-17 Thread David Spencer
Dawid Weiss wrote:
Hi David,
I apologize for the delay in answering this one, Lucene is a busy 
mailing list and I had a hectic last week... Again, sorry for the belated 
answer, hope you still find it useful.
Oh no problem, and yes carrot2 is useful and fun.  It's a rich package 
so it takes a while to understand all that it can do.

That is awesome and very inspirational!

Yes, I admit what you've done with Wikipedia is quite interesting and 
looks very good. I'm also glad you spent some time working out Carrot 
integration with Lucene. It works quite nicely.
Thanks but I just took code that I think you wrote(!) and made minor 
mods to it - here's one link:
http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html

I'd like to do more w/ Carrot2- that's where things get harder.

Carrot2 looks very interesting. Wondering if anybody has a list of 
all the

Technically I don't think carrot2 uses lucene per-se- it's just that 
you can integrate the two, and ditto for Nutch - it has code that uses 
Carrot2.

Yes, this is true. Carrot2 doesn't use all of Lucene's potential -- it 
merely takes the output from a query (titles, urls and snippets) and 
attempts to cluster them into some sensible groups. I think many things 
could be improved, the most important of them is fast snippet retrieval 
  from Lucene because right now it takes 50% of the time of the 
clustering; I've seen a post a while ago describing a faster snippet 
generation technique, I'm sure that would give clustering a huge boost 
speed-wise.

And here's my question. I reread the Carrot2-Lucene code, esp 
Demo.java, and there's this fragment:

// warm-up round (stemmer tables must be read etc).
List clusters = clusterer.clusterHits(docs);
long clusteringStartTime = System.currentTimeMillis();
clusters = clusterer.clusterHits(docs);
long clusteringEndTime = System.currentTimeMillis();
Thus it calls clusterHits() twice.
I don't really understand how to use Carrot2 - but I think the above 
is just for the sake of benchmarking clusterHits() w/o the effect of 
1-time initialization - and that there's no benefit of repeatedly 
calling clusterHits (where a benefit might be that it can find nested 
clusters or whatever) - is that right (that there's no benefit)?

No, there is absolutely no benefit from it. It was merely to show people 
that the clustering needs to be warmed up a bit. I should not have put 
it in the code knowing people would be confused by it. You can safely 
use clusterHits just once. It will just have a small delay at the first 
invocation.

Thanks for experimenting. Please BCC me if you have any urgent projects 
-- I read Lucene's list in batches and my personal e-mail I try to keep 
up to date with.

Dawid
thx,
 Dave


Re: Question about Analyzer and words spelled in different languages

2005-01-17 Thread David Spencer
Mariella Di Giacomo wrote:
Hi ALL,
We are trying to index scientific articles written in English, but whose 
authors can be spelled in any language (depending on the author's 
nationality)

E.g.
Schäffer
In the XML document that we provide to Lucene the author name is written 
in the following way (using HTML ENTITIES)

Sch&auml;ffer
So in practice that is the name that would be given to a Lucene 
analyzer/filter

Is there any already written analyzer that would take that name 
(Sch&auml;ffer or any other name that has entities) so that the
Lucene index could be searched (once the field has been indexed) for the 
real version of the name, which is

Schäffer
and the english spelled version of the name which is
Schaffer
Thanks a lot in advance for your help,
If I understand the question then I think there are 2 ways of doing it.
[1] Write a custom analyzer that uses Token.setPositionIncrement(0) to 
put alternate spellings at the same place in the token stream. This way 
phrase matches work right (so the queries "Jonathan Schaffer" and 
"Jonathan Schäffer" will match the same phrase in the doc).

[2] Do not use a special analyzer - instead do query expansion, so if 
they search for Schaffer then the generated query is (Schaffer Schäffer).

I've used both techniques before - I use #1 w/ a JavadocAnalyzer on 
searchmorph.com so that if you search for hash you'll see matches for 
HashMap, as HashMap is tokenized into 3 tokens at the same location 
( 'hash', 'map, 'hashmap').  Writing this kind of an analyzer can be a 
bit of a hassle and the position increment of 0 might affect 
highlighting code or other (say, summarizing) code that uses the Analyzer.

For an example of #2 see my Wordnet/Synonym query expansion example in 
the lucene sandbox. You prebuild an index of synonyms (or in your case 
maybe just rules are fine). Then you need query expansion code that 
takes "Schaffer" and expands it to something like "Schaffer 
Schäffer^0.9" (if you want to assume the user probably spells the name 
right). Simple enough to code, only hassle then is if you want to use 
the standard QueryParser...
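
A tiny sketch of that #2 expansion, with the 0.9 boost on the alternate spelling (the field name and how the alternate spelling is looked up are left out / made up):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SpellingExpander {
    // e.g. expand("author", "schaffer", "schäffer") -> ( author:schaffer author:schäffer^0.9 )
    public static Query expand(String field, String typed, String alternate) {
        BooleanQuery bq = new BooleanQuery();
        bq.add(new TermQuery(new Term(field, typed)), false, false);
        TermQuery alt = new TermQuery(new Term(field, alternate));
        alt.setBoost(0.9f);             // assume the user probably spelled it right
        bq.add(alt, false, false);
        return bq;
    }
}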

thx,
 Dave



Mariella



MoreLikeThis and other similarity query generators checked in + online demo

2005-01-17 Thread David Spencer
Based on mail from Doug I wrote a more like this query generator, 
named, well, MoreLikeThis. Bruce Ritchie and Mark Harwood made changes 
to it (esp term vector support) and bug fixes. Thanks to everyone.

I've checked in the code to the sandbox under contributions/similarity.
The package it ends up at is org.apache.lucene.search.similar -- hope 
that makes sense.

I also created a class, SimilarityQueries, to hold other methods of 
similarity query generation. The 2 methods in there are dumber 
variations that use the entire source of the target doc to from a large 
query.

Javadoc is here:
http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/similarity/build/docs/api/org/apache/lucene/search/similar/package-summary.html
Online demo here - this page below compares the 3 variations on 
detecting similar docs. The timing info (3 numbers w/ (ms)) may be 
suspect. Also note if you scroll to the bottom you can see the queries 
that were generated.

Here's a page showing docs similar to the entry for Iraq:
http://www.searchmorph.com/kat/wikipedia-compare.jsp?s=Iraq
And here's one for docs similar to the one on Garry Kasparov (he knows 
how to play chess :) ):

http://www.searchmorph.com/kat/wikipedia-compare.jsp?s=Garry_Kasparov
To get to it you start here:
http://www.searchmorph.com/kat/wikipedia.jsp
And search for something - on the search results page follow a cmp link
http://www.searchmorph.com/kat/wikipedia.jsp?s=iraq
Make sense? Useful? Has anyone done any other variations (e.g. cosine 
measure)?

- Dave


stop words and index size

2005-01-13 Thread David Spencer
Does anyone know how much stop words are supposed to affect the index size?
I did an experiment of building an index once with, and once without, 
stop words.

The corpus is the English Wikipedia, and I indexed the title and body of 
the articles. I used a list of 525 stop words.

With stopwords removed the index is 227MB.
With stopwords kept the index is 331MB.
Thus, the index grows by 45% in this case, which I found surprising, as I 
expected it to not grow as much. I haven't dug into the details of the 
Lucene file formats but thought compression (field/term vector/sparse 
lists/vints) would negate the effect of stopwords to a large extent.

Some more details + a link to my stopword list are here:
http://www.searchmorph.com/weblog/index.php?id=36
-- Dave


Re: full text as input ?

2005-01-13 Thread David Spencer
Hunter Peress wrote:
is it efficient and feasible to use lucene to do full text
comparisons, e.g. take an entire text that's reasonably large (e.g.
more than 10 words) and find the result set within the lucene search
index that is statistically similar to all the text.
I do this kind of stuff all the time, no problem.
I think this came up a month ago - probably appears monthly.
For another variation search for MoreLikeThis in the list - it's code 
I mailed in that I haven't, yet, checked in.

Anyway, if you want to search for docs that are similar to a source 
document, you can call this method to generate a similarity query.

'srch' is the source doc
'a' is your analyzer
'field' is the field that stores the body e.g. contents
'stop' is an optional Set of stop words to ignore as an optimization - it's 
not needed if the Analyzer ignores stop words, but if you keep stop 
words you might still want to ignore them in this kind of query as they 
probably won't help

	public static Query formSimilarQuery( String srch,
			Analyzer a, String field, Set stop)
		throws org.apache.lucene.queryParser.ParseException, IOException
	{
		TokenStream ts = a.tokenStream( field, new StringReader( srch));
		org.apache.lucene.analysis.Token t;
		BooleanQuery tmp = new BooleanQuery();
		Set already = new HashSet(); // ignore duplicate words from the source doc
		while ( (t = ts.next()) != null)
		{
			String word = t.termText();
			if ( stop != null && stop.contains( word)) continue;
			if ( ! already.add( word)) continue;
			TermQuery tq = new TermQuery( new Term( field, word));
			tmp.add( tq, false, false); // optional clause, not required/prohibited
		}

		// tbd, from lucene in action book
		// https://secure.manning.com/catalog/view.php?book=hatcher2&item=source
		// exclude myself
		//likeThisQuery.add(new TermQuery(
		//	new Term("isbn", doc.get("isbn"))), false, true);
		return tmp;
	}


Re: How do you handle dynamic html pages?

2005-01-10 Thread David Spencer
Kevin L. Cobb wrote:
I don't like to periodically re-index everything because 1) you can't be
confident that your searches are as up to date as they could be, and 2)
you are wasting cycles either checking for documents that may or may not
need to be updated, or re-indexing documents that don't need updating. 
And what I've noticed is that typically systems with dynamic content 
that comes from, say, a database, do not implement the HTTP HEAD verb 
nor If-Modified-Since, which a smart spider might try to use to be 
more efficient. Thus an incremental spider run can be just as 
expensive as the first one.

Ideally, I think that you want an event driven system where the content
management system or the like indicates to your search engine when a
page/document gets updated. That way, you know that documents are as up
to date as possible in terms of searches, and you know that you aren't
doing unnecessary work. 

 

-Original Message-
From: Luke Francl [mailto:[EMAIL PROTECTED] 
Sent: Monday, January 10, 2005 11:09 AM
To: Lucene Users List
Subject: Re: How do you handle dynamic html pages?

On Mon, 2005-01-10 at 10:03, Jim Lynch wrote:
How is anyone managing reindexing of pages that change?  Just 
periodically reindex everything or do you try to determine the frequency of
changes to each page and/or site? 

If you are using a CMS, your best bet is to integrate Lucene with the
CMS's content update mechanism. That way, your index will always be
up-to-date.
Otherwise, I would say reindexing everything is easiest, provided it
doesn't take too long. If it's ~15 minutes or less, you could schedule a
processes to do it at a low activity period (2 AM or whenever) every day
and that would probably handle your needs.
Regards,
Luke Francl


Re: SYNONYM + GOOGLE

2005-01-10 Thread David Spencer
Erik Hatcher wrote:
Karthik,
Thanks for that info.  I knew I was behind the times with WordNet using  
the sandbox code, but it was good enough for my purposes at the time.   
I will definitely try out the latest WordNet offerings in the future  
Hi...I wrote the WordNet sandbox code - but I'm not sure if I understand 
this thread. Are we saying that it does not work w/ the new WordNet 
data, or that the code in Erik's book is better/more up to date etc?

If needed I can update the sandbox code..
thx,
 Dave

though.
Erik
On Jan 10, 2005, at 7:37 AM, Karthik N S wrote:
Hi Erik
Apologies...
I may be a little offline from this forum, but I may help you with the next
version of Lucene In Action.
 I was working on the Java WordNet Library; on fiddling with the APIs, I
found
something interesting:

 the code attached to this gets more synonyms than the WordNet
indexed
format available from the Lucene in Action zip file


1) It needs WordNet 2.0's dictionary installed
2) jwnl.jar from SourceForge
[ http://sourceforge.net/project/showfiles.php?group_id=33824&package_id=33975&release_id=196864 ]

After successful compilation
Type for watch
ORIGINAL  : watch OR analog_watch OR digital_watch OR hunter OR
hunting_watch OR pendulum_watch OR
pocket_watch OR stem-winder OR wristwatch OR  
wrist_watch

FORMATTED : watch OR analog watch OR digital watch OR hunter OR
hunting watch OR pendulum watch OR pocket watch
Check this out, maybe you will come up with brilliant ideas

with regards
Karthik
-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Monday, January 10, 2005 5:19 PM
To: Lucene Users List
Subject: Re: SYNONYM + GOOGLE

On Jan 10, 2005, at 5:33 AM, Karthik N S wrote:
If you search Google using '~shoes', it returns hits based on the
synonyms
[ I know there is a Synonym Wordnet  based Lucene Package in the
sandbox
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/
contributions/WordN
et/   ]
Can this be achieved in Lucene ,If so How ???

Yes, it can be achieved.  Not quite synonyms, but various forms of the
same word can be found in this example, like this search for similar
(see the highlighted variations):
http://www.lucenebook.com/search?query=similar
This is accomplished using the Snowball stemmer filter found in the
sandbox.   For synonyms, you have lots of options.  In Lucene in Action
I demonstrate custom analyzers that inject synonyms using the WordNet
database (from the sandbox).  From the source code distribution of LIA:
% ant SynonymAnalyzerViewer
Buildfile: build.xml
SynonymAnalyzerViewer:
  [echo]
  [echo]   Using a custom SynonymAnalyzer, two fixed strings  are
  [echo]   analyzed with the results displayed.  Synonyms, from
the
  [echo]   WordNet database, are injected into the same  
positions
  [echo]   as the original words.
  [echo]
  [echo]   See the Analysis chapter for more on synonym
injection and
  [echo]   position increments.  The Tools and extensions
chapter covers
  [echo]   the WordNet feature found in the Lucene sandbox.
  [echo]
 [input] Press return to continue...

  [echo] Running lia.analysis.synonym.SynonymAnalyzerViewer...
  [java] 1: [quick] [warm] [straightaway] [spry] [speedy] [ready]
[quickly] [promptly] [prompt] [nimble] [immediate] [flying] [fast]
[agile]
  [java] 2: [brown] [brownness] [brownish]
  [java] 3: [fox] [trick] [throw] [slyboots] [fuddle] [fob]  [dodger]
[discombobulate] [confuse] [confound] [befuddle] [bedevil]
  [java] 4: [jumps]
  [java] 5: [over] [o] [across]
  [java] 6: [lazy] [faineant] [indolent] [otiose] [slothful]
  [java] 7: [dogs]
...
The phrase analyzed was The quick brown fox jumps over the lazy dogs.
  Why no synonyms for jumps and dogs?  WordNet has synonyms for
jump and dog, but not the plural forms.  Stemming would be a
necessary step in achieving full synonym look-up, though this would
need to be done carefully as the stem of a word is not necessarily a
real word itself - so you'd probably want to stem the synonym database
also to ensure accurate lookup.
Also notice the semantically incorrect synonyms that appear for the
animal fox (confuse, for example).  Be careful!  :)
Erik


Re: Quick question about highlighting.

2005-01-07 Thread David Spencer
Jim Lynch wrote:
I've read as much as I could find on the highlighting that is now in the 
sandbox.  I didn't find the javadocs.
I have a copy here:
http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/highlighter/build/docs/api/overview-summary.html
  I found a link to them, but it
redirected me to a cvs tree.
Do I assume that you have to store the content of the document for the 
highlighting to work?  
Not per se, but you do need access to the contents to pass to 
Highlighter.getBestFragments(). You can store the contents in the index, 
or you can have them in a cache or DB, or you can refetch the doc...

You need to know what Analyzer you used too to get the tokenStream via:
TokenStream tokenStream = analyzer.tokenStream( field, new 
StringReader(body));
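
For reference, roughly how those calls fit together with the sandbox highlighter (field name and query are illustrative; exact signatures may differ a little between versions):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class HighlightDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        Query query = QueryParser.parse("brown fox", "contents", analyzer);
        String body = "... the document text, from the index, a cache, or refetched ...";

        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        TokenStream tokenStream = analyzer.tokenStream("contents", new StringReader(body));
        String snippet = highlighter.getBestFragments(tokenStream, body, 3, "...");
        System.out.println(snippet);
    }
}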


Otherwise I don't see how it could work.
Thanks,
Jim.


google suggest / incremental search - Re: Lucene appreciation

2004-12-17 Thread David Spencer
Rony Kahan wrote:
Thanks for feedback.
PA - Since rss readers usually visit at least once per day, we only show
jobs from past few days. This allows us to use a smaller, faster index for
traffic intensive rss searching.
Ben  Praveen - Thanks for the UI suggestions. Hope to have that %3A  %22
cleared up shortly. I saw in the Lucene sandbox a Javascript Query
Constructor. Does anyone have any experience using this to make a google
I've done 2 versions, this one closer to google:
http://www.searchmorph.com/kat/isearch2.jsp
And this v1 w/ frames, thus not the same but a step:
http://www.searchmorph.com/kat/isearch.html
SearchMorph is mainly a lucene index of javadoc-generated pages from OSS 
projects..


like advanced search page? Regarding open source technology - we also use
Carrot: http://sourceforge.net/projects/carrot2/
Kind Regards,
Rony

-Original Message-
From: petite_abeille [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 16, 2004 4:29 PM
To: Lucene Users List
Subject: Re: Lucene appreciation

On Dec 16, 2004, at 17:26, Rony Kahan wrote:

If you are interested in Lucene work you can set up an rss feed or
email alert from here: 
http://www.indeed.com/search?q=lucenesort=date
Looks great :)
One thing though, the web search returns 14 hits for the above query.
Using the RSS feed only returns 4 of them. What gives?
Cheers,
PA.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: TFIDF Implementation

2004-12-15 Thread David Spencer
Christoph Kiefer wrote:
David, Bruce, Otis,
Thank you all for the quick replies. I looked through the BooksLikeThis
example. I also agree, it's a very good and effective way to find
similar docs in the index. Nevertheless, what I need is really a
similarity matrix holding all TF*IDF values. For illustration I quick
and dirty wrote a class to perform that task. It uses the Jama.Matrix
class to represent the similarity matrix at the moment. For show and
tell I attached it to this email.
Unfortunately it doesn't perform very well. My index stores about 25000
docs with a total of 75000 terms. The similarity matrix is very sparse
but nevertheless needs about 1'875'000'000 entries!!! I think this
current implementation will not be useable in this way. I also think I
switch to JMP (http://www.math.uib.no/~bjornoh/mtj/) for that reason.
What do you think?
I don't have any deep thoughts, just a few questions/ideas...
[1] TFIDFMatrix, FeatureVectorSimilarityMeasure, and CosineMeasure are 
your classes, right? They are not in the mail, but presumably the source 
isn't needed.

[2] Does the problem boil down to this line and the memory usage?
double [][] TFIDFMatrix = new double[numberOfTerms][numberOfDocuments];
Thus using a sparse matrix would be a win, and so would using floats 
instead of doubles? (A rough sketch of the sparse idea follows below, after [6].)

[3] Prob minor, but in getTFIDFMatrix() you might be able to ignore stop 
words, as you do so later in getSimilarity().

[4] You can also consider using Colt or possibly even JUNG:
http://www-itg.lbl.gov/~hoschek/colt/api/cern/colt/matrix/impl/SparseDoubleMatrix2D.html
http://jung.sourceforge.net/doc/api/index.html
[5]
Related to #2, can you precalc the matrix and store it on disk, or is 
your index too dynamic?

[6] Also, in similar kinds of calculations I've seen code that filters 
out low frequency terms e.g. ignore all terms that don't occur in at 
least 5 docs.
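
Re [2] and [6], a rough sketch of what I mean - not your code, just an 
illustration with an arbitrary field name and doc-freq cutoff - holding the 
weights as one small map per document instead of a full terms-x-docs array:

// rough sketch only: per-document maps instead of a full
// double[numberOfTerms][numberOfDocuments] array; "field" and minDocFreq
// are illustrative (needs java.util.* and org.apache.lucene.index.*)
Map[] buildSparseWeights(IndexReader r, String field, int minDocFreq) throws IOException {
    int numDocs = r.numDocs();
    Map[] weights = new HashMap[r.maxDoc()];   // docId -> (term text -> Float weight)
    TermEnum te = r.terms(new Term(field, ""));
    try {
        do {
            Term t = te.term();
            if (t == null || !t.field().equals(field)) break;
            int df = te.docFreq();
            if (df < minDocFreq) continue;     // skip rare terms, as in [6]
            float idf = (float) Math.log((double) numDocs / df);
            TermDocs td = r.termDocs(t);
            while (td.next()) {
                int doc = td.doc();
                if (weights[doc] == null) weights[doc] = new HashMap();
                weights[doc].put(t.text(), new Float(td.freq() * idf));  // floats, not doubles
            }
            td.close();
        } while (te.next());
    } finally {
        te.close();
    }
    return weights;
}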

-- Dave

Best,
Christoph


/*
 * Created on Dec 14, 2004
 */
package ch.unizh.ifi.ddis.simpack.measure.featurevectors;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import Jama.Matrix;
/**
 * @author Christoph Kiefer
 */
public class TFIDF_Lucene extends FeatureVectorSimilarityMeasure {

private File indexDir = null;
private File dataDir = null;
private String target = "";
private String query = "";
private int targetDocumentNumber = -1;
private final String ME = this.getClass().getName();
private int fileCounter = 0;

public TFIDF_Lucene( String indexDir, String dataDir, String target, 
String query ) {
this.indexDir = new File(indexDir);
this.dataDir = new File(dataDir);
this.target = target;
this.query = query;
}

public String getName() {
return "TFIDF_Lucene_Similarity_Measure";
}

private void makeIndex() {
try {
IndexWriter writer = new IndexWriter(indexDir, new 
SnowballAnalyzer( "English", StopAnalyzer.ENGLISH_STOP_WORDS ), false);
indexDirectory(writer, dataDir);
writer.optimize();
writer.close();
} catch (Exception ex) {
ex.printStackTrace();
}
}

private void indexDirectory(IndexWriter writer, File dir) {
File[] files = dir.listFiles();
for (int i=0; i < files.length; i++) {
File f = files[i];
if (f.isDirectory()) {
indexDirectory(writer, f);  // recurse
} else if (f.getName().endsWith(".txt")) {
indexFile(writer, f);
}
}
}

private void indexFile(IndexWriter writer, File f) {
try {
System.out.println( "Indexing " + f.getName() + ", " + 
(fileCounter++) );
String name = f.getCanonicalPath();
//System.out.println(name);
Document doc = new Document();
doc.add( Field.Text( "contents", new FileReader(f), 
true ) );
writer.addDocument( doc );

 

Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote:
Christoph,
I'm not entirely certain if this is what you want, but a while back David Spencer did code up a 'More Like This' class which can be used for generating similarities between documents. I can't seem to find this class in the sandbox 
Uh oh, sorry, I'll try to get this checked in soonish. For me it's 
always one thing to do a prelim version of a piece of code, but another 
matter to get it correctly packaged.

so I've attached it here. Just repackage and test.

An alternate approach to find similar docs is to use all (possibly 
unique) tokens in the  source doc to form a large query. This is code I use:

'srch' is the entire untokenized text of the source doc
'a' is the analyzer you want to use
'field' is the field you want to search on e.g. contents or body
'stop' is an opt set of stop words to ignore
It returns a query, which you then use to search for similar docs, and 
then in the return result you need to make sure you ignore the source 
doc, which will prob come back 1st. You can use stemming, synonyms, or 
fuzzy expansion for each term too.

public static Query formSimilarQuery( 
String srch,			Analyzer a,	String field,	Set stop)
throws org.apache.lucene.queryParser.ParseException, IOException
{	
	TokenStream ts = a.tokenStream( "foo", new StringReader( srch));
	org.apache.lucene.analysis.Token t;
	BooleanQuery tmp = new BooleanQuery();
	Set already = new HashSet();
	while ( (t = ts.next()) != null)
	{
		String word = t.termText();
		if ( stop != null &&
			 stop.contains( word)) continue;
		if ( ! already.add( word)) continue;
		TermQuery tq = new TermQuery( new Term( field, word));
		tmp.add( tq, false, false);
	}
	return tmp;

}

Regards,
Bruce Ritchie
http://www.jivesoftware.com/   


-Original Message-
From: Christoph Kiefer [mailto:[EMAIL PROTECTED] 
Sent: December 14, 2004 11:45 AM
To: Lucene Users List
Subject: TFIDF Implementation

Hi,
My current task/problem is the following: I need to implement 
TFIDF document term ranking using Jakarta Lucene to compute a 
similarity rank between arbitrary documents in the constructed index.
I saw from the API that there are similar functions already 
implemented in the class Similarity and DefaultSimilarity but 
I don't know exactly how to use them. At the time my index 
has about 25000 (small) documents and there are about 75000 
terms stored in total.
Now, my question is simple. Does anybody has done this before 
or could point me to another location for help?

Thanks for any help in advance.
Christoph 

--
Christoph Kiefer
Department of Informatics, University of Zurich
Office: Uni Irchel 27-K-32
Phone:  +41 (0) 44 / 635 67 26
Email:  [EMAIL PROTECTED]
Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





Re: [RFE] IndexWriter.updateDocument()

2004-12-14 Thread David Spencer
petite_abeille wrote:
Well, the subject says it all...
If there is one thing which is overly cumbersome in Lucene, it's 
updating documents, therefore this Request For Enhancement:

Please consider enhancing the IndexWriter API to include an 
updateDocument(...) method to take care of all the gory details involved 
in such operation.
I agree, this is always a hassle to do right due to having to use 
IndexWriter and IndexReader and properly opening/closing them.

I have a prelim version of a batched index writer that I use. The code 
is kinda messy, but for discussion here's what it does:

Briefly the methods are:
// [1]
// the ctr has parameters:
//'batch size # docs' e.g. it will flush pending updates every 100 docs
//'batch freq' e.g. auto flush every 60 sec
// [2]
// queue a document to be added to the index
// 'key' is the primary key name e.g. url
// 'val' is the primary key val e.g. "http://www.tropo.com/"
// 'doc' is the doc to be added
update( String key, String val, Document doc)
// [3]
//  queue a document for removal
// 'key' and 'val' are the params, as from [2]
remove( String key, String val)
// [4]
// periodic flush, called automatically or on demand, two stages (deletes, then adds):
// 1. call IndexReader.delete() on all pending (key,val) pairs
// 2. close IndexReader
// 3. call IndexWriter.addDocument() on all pending documents
// 4. optionally call optimize()
// 5. close IndexWriter
flush()

//
So in normal usage you just keep calling update() and it periodically 
flushes the pending updates to the index. By its nature this uses memory, 
however it's tunable as to how many documents it'll queue in memory.

Does the algorithm above, esp flush(), sound correct? It seems to work 
right for me and I can post this if people want to see it...
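
For concreteness, the guts of flush() are roughly this (a sketch with 
made-up names - 'indexDir', 'analyzer' and 'optimizeOnFlush' are assumed 
fields - not the actual messy code):

// sketch only: pendingDeletes holds Terms built from (key,val) pairs,
// pendingAdds holds the queued Documents
private final Set pendingDeletes = new HashSet();
private final List pendingAdds = new ArrayList();

public synchronized void flush() throws IOException {
    // stage 1: delete old versions via an IndexReader
    IndexReader reader = IndexReader.open(indexDir);
    for (Iterator it = pendingDeletes.iterator(); it.hasNext();) {
        reader.delete((Term) it.next());
    }
    reader.close();   // saves the deletions

    // stage 2: add the new versions via an IndexWriter
    IndexWriter writer = new IndexWriter(indexDir, analyzer, false);
    for (Iterator it = pendingAdds.iterator(); it.hasNext();) {
        writer.addDocument((Document) it.next());
    }
    if (optimizeOnFlush) writer.optimize();   // optional
    writer.close();

    pendingDeletes.clear();
    pendingAdds.clear();
}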

- Dave

Thanks in advance.
Cheers,
PA.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Otis Gospodnetic wrote:
You can also see 'Books like this' example from here
https://secure.manning.com/catalog/view.php?book=hatcher2item=source
Well done, uses a term vector, instead of reparsing the orig doc, to 
form the similarity query. Also I like the way you exclude the source 
doc in the query, I didn't think of doing that in my code.

I don't trust calling vector.size() and vector.getTerms() within the 
loop but I haven't looked at the code to see if it calculates  the 
results each time or caches them...
Otis
--- Bruce Ritchie [EMAIL PROTECTED] wrote:

Christoph,
I'm not entirely certain if this is what you want, but a while back
David Spencer did code up a 'More Like This' class which can be used
for generating similarities between documents. I can't seem to find
this class in the sandbox so I've attached it here. Just repackage
and test.
Regards,
Bruce Ritchie
http://www.jivesoftware.com/   


-Original Message-
From: Christoph Kiefer [mailto:[EMAIL PROTECTED] 
Sent: December 14, 2004 11:45 AM
To: Lucene Users List
Subject: TFIDF Implementation

Hi,
My current task/problem is the following: I need to implement 
TFIDF document term ranking using Jakarta Lucene to compute a 
similarity rank between arbitrary documents in the constructed
index.
I saw from the API that there are similar functions already 
implemented in the class Similarity and DefaultSimilarity but 
I don't know exactly how to use them. At the time my index 
has about 25000 (small) documents and there are about 75000 
terms stored in total.
Now, my question is simple. Does anybody has done this before 
or could point me to another location for help?

Thanks for any help in advance.
Christoph 

--
Christoph Kiefer
Department of Informatics, University of Zurich
Office: Uni Irchel 27-K-32
Phone:  +41 (0) 44 / 635 67 26
Email:  [EMAIL PROTECTED]
Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]



Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote:
From the code I looked at, those calls don't recalculate on 
every call. 

I was referring to this fragment below from BooksLikeThis.docsLike(), 
and was mentioning it as the javadoc 
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html 
does not say that the values returned by size() and getTerms() are 
cached, and while the impl may cache them (haven't checked) it's not 
guaranteed, thus it's safer to put the size() and getTerms() call 
outside the loop.

 for (int j = 0; j < vector.size(); j++) {
  TermQuery tq = new TermQuery(
  new Term("subject", vector.getTerms()[j]));

I agree on your overall point that it's probably best to put those calls outside of the loop, I was just saying that I did look at the implementation and the calls do not recalculate anything. I'm sorry I didn't explain myself clearly enough.
Oh oh oh, sorry, 10-4, no prob.

Regards,
Bruce Ritchie
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote:
 

You can also see 'Books like this' example from here 

https://secure.manning.com/catalog/view.php?book=hatcher2item=source
Well done, uses a term vector, instead of reparsing the orig 
doc, to form the similarity query. Also I like the way you 
exclude the source doc in the query, I didn't think of doing 
that in my code.

I agree, it's a good way to exclude the source doc.
 

I don't trust calling vector.size() and vector.getTerms() 
within the loop but I haven't looked at the code to see if it 
calculates  the results each time or caches them...

From the code I looked at, those calls don't recalculate on every call. 
I was referring to this fragment below from BooksLikeThis.docsLike(), 
and was mentioning it as the javadoc 
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html 
does not say that the values returned by size() and getTerms() are 
cached, and while the impl may cache them (haven't checked) it's not 
guaranteed, thus it's safer to put the size() and getTerms() call 
outside the loop.

 for (int j = 0; j < vector.size(); j++) {
  TermQuery tq = new TermQuery(
  new Term("subject", vector.getTerms()[j]));
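
E.g. hoisted, it would look roughly like this (just a reshuffle for 
illustration, not the actual BooksLikeThis code):

 String[] terms = vector.getTerms();   // fetched once, outside the loop
 int size = vector.size();
 for (int j = 0; j < size; j++) {
  TermQuery tq = new TermQuery(new Term("subject", terms[j]));
  // ... add tq to the query as before
 }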

Regards,
Bruce Ritchie
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: LIMO problems

2004-12-13 Thread David Spencer
Daniel Cortes wrote:
Hi, I want to know what library do you use for search in PPT files?
I use this (native code):
http://chicago.sourceforge.net/xlhtml
POI support this?
thanks
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Incremental Search experiment with Lucene, sort of like the new Google Suggestion page

2004-12-12 Thread David Spencer
Chris Lamprecht wrote:
Very cool, thanks for posting this!  

Google's feature doesn't seem to do a search on every keystroke
necessarily.  Instead, it waits until you haven't typed a character
for a short period (I'm guessing about 100 or 150 milliseconds).  So
Thx again for the tip, I updated my experiment 
(http://www.searchmorph.com/kat/isearch.html, as per below) to use a 
150ms delay to avoid some needless searches...TBD is more intelligent 
guidance or suggestions for the user.

if you type fast, it doesn't hit the server until you pause.  There
are some more detailed postings on slashdot about how it works.
On Fri, 10 Dec 2004 16:36:27 -0800, David Spencer
[EMAIL PROTECTED] wrote:
Google just came out with a page that gives you feedback as to how many
pages will match your query and variations on it:
http://www.google.com/webhp?complete=1&hl=en
I had an unexposed experiment I had done with Lucene a few months ago
that this has inspired me to expose - it's not the same, but it's
similar in that as you type in a query you're given *immediate* feedback
as to how many pages match.
Try it here: http://www.searchmorph.com/kat/isearch.html
This is my SearchMorph site which has an index of ~90k pages of open
source javadoc packages.
As you type in a query, on every keystroke it does at least one Lucene
search to show results in the bottom part of the page.
It also gives spelling corrections (using my NGramSpeller
contribution) and also suggests popular tokens that start the same way
as your search query.
For one way to see corrections in action, type in rollback character
by character (don't do a cut and paste).
Note that:
-- this is not how the Google page works - just similar to it
-- I do single word suggestions while google does the more useful whole
phrase suggestions (TBD I'll try to copy them)
-- They do lots of javascript magic, whereas I use old school frames mostly
-- this is relatively expensive, as it does 1 query per character, and
when it's doing spelling correction there is even more work going on
-- this is just an experiment and the page may be unstable as I fool w/ it
What's nice is when you get used to immediate results, going back to the
batch way of searching seems backward, slow, and old fashioned.
There are too many idle CPUs in the world - this is one way to keep them
busier :)
-- Dave
PS Weblog entry updated too:
http://www.searchmorph.com/weblog/index.php?id=26
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Incremental Search experiment with Lucene, sort of like the new Google Suggestion page

2004-12-11 Thread David Spencer
Chris Lamprecht wrote:
Very cool, thanks for posting this!  

Google's feature doesn't seem to do a search on every keystroke
necessarily.  Instead, it waits until you haven't typed a character
for a short period (I'm guessing about 100 or 150 milliseconds). 
Ohh, good point - I was wondering how to cancel a URL - this is the 
right way. I'll try to code that in.

I also realized they're prob not doing searches at all - instead they're 
going off a DB of query popularity - I wanted to code up something 
generic, based just on term frequency but I don't think it'll be useful 
e.g. let's say
in my index (index of javadoc-generated documentation) the user types in
"hash" - well a human might guess that they intend "hash map" or 
"hashmap" or "hash tree" but I'm sure other terms are more frequent in 
my index than "map" and "tree"...I'm sure "hash java" occurs more 
frequently than "hash map" - or any other freq, non-stop word, and it's 
dubious that "hash java" is a useful suggestion...

So
if you type fast, it doesn't hit the server until you pause.  There
are some more detailed postings on slashdot about how it works.
On Fri, 10 Dec 2004 16:36:27 -0800, David Spencer
[EMAIL PROTECTED] wrote:
Google just came out with a page that gives you feedback as to how many
pages will match your query and variations on it:
http://www.google.com/webhp?complete=1&hl=en
I had an unexposed experiment I had done with Lucene a few months ago
that this has inspired me to expose - it's not the same, but it's
similar in that as you type in a query you're given *immediate* feedback
as to how many pages match.
Try it here: http://www.searchmorph.com/kat/isearch.html
This is my SearchMorph site which has an index of ~90k pages of open
source javadoc packages.
As you type in a query, on every keystroke it does at least one Lucene
search to show results in the bottom part of the page.
It also gives spelling corrections (using my NGramSpeller
contribution) and also suggests popular tokens that start the same way
as your search query.
For one way to see corrections in action, type in rollback character
by character (don't do a cut and paste).
Note that:
-- this is not how the Google page works - just similar to it
-- I do single word suggestions while google does the more useful whole
phrase suggestions (TBD I'll try to copy them)
-- They do lots of javascript magic, whereas I use old school frames mostly
-- this is relatively expensive, as it does 1 query per character, and
when it's doing spelling correction there is even more work going on
-- this is just an experiment and the page may be unstable as I fool w/ it
What's nice is when you get used to immediate results, going back to the
batch way of searching seems backward, slow, and old fashioned.
There are too many idle CPUs in the world - this is one way to keep them
busier :)
-- Dave
PS Weblog entry updated too:
http://www.searchmorph.com/weblog/index.php?id=26
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Incremental Search experiment with Lucene, sort of like the new Google Suggestion page

2004-12-10 Thread David Spencer
Google just came out with a page that gives you feedback as to how many 
pages will match your query and variations on it:

http://www.google.com/webhp?complete=1&hl=en
I had an unexposed experiment I had done with Lucene a few months ago 
that this has inspired me to expose - it's not the same, but it's 
similar in that as you type in a query you're given *immediate* feedback 
as to how many pages match.

Try it here: http://www.searchmorph.com/kat/isearch.html
This is my SearchMorph site which has an index of ~90k pages of open 
source javadoc packages.

As you type in a query, on every keystroke it does at least one Lucene 
search to show results in the bottom part of the page.

It also gives spelling corrections (using my NGramSpeller 
contribution) and also suggests popular tokens that start the same way 
as your search query.

For one way to see corrections in action, type in rollback character 
by character (don't do a cut and paste).

Note that:
-- this is not how the Google page works - just similar to it
-- I do single word suggestions while google does the more useful whole 
phrase suggestions (TBD I'll try to copy them)
-- They do lots of javascript magic, whereas I use old school frames mostly
-- this is relatively expensive, as it does 1 query per character, and 
when it's doing spelling correction there is even more work going on
-- this is just an experiment and the page may be unstable as I fool w/ it

What's nice is when you get used to immediate results, going back to the 
batch way of searching seems backward, slow, and old fashioned.

There are too many idle CPUs in the world - this is one way to keep them 
busier :)

-- Dave
PS Weblog entry updated too: 
http://www.searchmorph.com/weblog/index.php?id=26



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Single Digit Indexing

2004-12-06 Thread David Spencer
Otis Gospodnetic wrote:
Hm, if you can index 11, you should be able to index 8 as well.  In any
case, you most likely want to make sure that your Analyzer is not just
In theory you could have a  length filter tossing out tokens that are 
too short or too long, and maybe you're getting rid of all tokens less 
than 2 chars...


throwing your numbers out.  This may still be up to date:
http://www.jguru.com/faq/view.jsp?EID=538308
See also: http://wiki.apache.org/jakarta-lucene/HowTo
Otis
--- Bill von Ofenheim (LaRC) [EMAIL PROTECTED] wrote:

How can I get Lucene to index single digits (e.g. 8 as in Gemini
8)?
I am able to index numbers with two or more digits (e.g. 11 as in
Apollo 11).
Thanks,
Bill von Ofenheim

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Looking for consulting help on project

2004-10-27 Thread David Spencer
Suggestions
[a]
Try invoking the VM w/ an option like -XX:CompileThreshold=100 or even 
a smaller number. This encourages the hotspot VM to compile methods 
sooner, thus the app will take less time to warm up.

http://java.sun.com/docs/hotspot/VMOptions.html#additional
You might want to search the web for refs to this, esp how things like 
Eclipse are brought up, as I think their invocation script sets other 
obscure options to guide GC too.

[b]
Any time I've worked w/ a hard core java server I've always found it 
helpful to have a loop explicitly trying to force gc - this is the idiom 
I use (i.e. you may have to do more than just System.gc()), and my 
suggestion is to try calling this every 15-60 secs so that memory use 
never jumps. I know that in theory you should never need to, but it may 
help.

public static long gc()
{
long bef = mem();
System.gc();
sleep( 100);
System.runFinalization();
sleep( 100);
System.gc();
long aft= mem();
return aft-bef;
}

// assumed helpers (not shown in the original mail): mem() is the used heap,
// sleep() is an interrupt-safe Thread.sleep()
private static long mem()
{
Runtime r = Runtime.getRuntime();
return r.totalMemory() - r.freeMemory();
}
private static void sleep( long ms)
{
try { Thread.sleep( ms); } catch ( InterruptedException ie) {}
}
Gordon Riggs wrote:
Hi,
 
I am working on a web development project using PHP and mySQL. The team has
implemented full text search with mySQL, but is now researching Lucene to
help with performance/scalability issues. The team is looking for a
developer who has experience working with Lucene and can assist with
integrating into our environment. What follows is a brief overview of the
problems that we're working to address. If you have the experience with
using Lucene with large amounts of data (we have roughly 16 million records)
where search time is critical (needs to be under .2 seconds), then please
respond.
 
Thanks,
Gordon Riggs
[EMAIL PROTECTED]
 
1. Loading index into memory using Lucene's RAMDirectory
Why is the Java heap 2.9GB for a 1.4GB index?
Why can we not load an index over 1.4GB in size?  We receive
'java.lang.OutOfMemoryError' even with the -mx flag set to as high as '10g'.
We're using a dedicated test machine which has dual AMD Opteron processors
and 12GB of memory.  The OS is SuSE Linux Enterprise Server 9 (x86_64).  The
java version is: Java(TM) 2 Runtime Environment, Standard Edition (build
Blackdown-1.4.2) Java HotSpot(TM) 64-Bit Server VM (build
Blackdown-1.4.2-fcs, mixed mode)
We also get similar results with: Java(TM) 2 Runtime Environment, Standard
Edition (build 1.4.2_03-b02) Java HotSpot(TM) Client VM (build 1.4.2_03-b02,
mixed mode)

2. How to keep Lucene and Java in memory, to improve performance
The idea is to have a Lucene daemon that loads the index into memory once
on startup. It then listens for connections and performs search requests for
clients using that single index instance.
Do you foresee any problems (other than the ones stated above) with this
approach?
Garbage collection and/or memory leaks?  Performance issues?  
Concurrency issues with multiple searches coming in at once?
What's involved in writing the daemon?
Assuming that we need the daemon, we need to find out how big a job it is to
develop, what requirements need to be specified, etc.

3. How to interface our PHP web application with Java
Our web application is written in PHP so we need a communication interface
for performing search queries that is both PHP and Java friendly.
What do you think would be a good solution?  XML-RPC?
What's involved in developing the solution?
4. How to tune Lucene
Are there ways to tune Lucene in order to improve performance? We already
plan on moving the index into memory.
What else can be done to improve the search times? Can the way the index is
built affect performance?
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Thesaurus ...

2004-10-19 Thread David Spencer
Erik Hatcher wrote:
Have a look at the WordNet contribution in the Lucene sandbox 
repository.  It could be leveraged for part of a solution.
It's something I contributed.
Relevant links are:
http://jakarta.apache.org/lucene/docs/lucene-sandbox/
http://www.tropo.com/techno/java/lucene/wordnet.html
Basically it uses the Lucene index as a kind of associative array to map 
words to their synonyms using the thesaurus from WordNet, so a key like, 
say, "fast" will have mappings to "quick" and "rapid". This can then be 
used for query expansion.

An example of this expansion in use is here:
http://www.hostmon.com/rfc/advanced.jsp
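
The lookup side of the expansion looks roughly like this (a sketch only - 
the "word"/"syn" field names and the boost value are from memory, so check 
the contribution's source):

// sketch: expand one user term into (term OR its synonyms); "word" and
// "syn" are the assumed field names of the synonym index
public static Query expand(String term, Searcher synSearcher, String field, float synBoost)
        throws IOException {
    BooleanQuery result = new BooleanQuery();
    result.add(new TermQuery(new Term(field, term)), false, false);   // the original term
    Hits hits = synSearcher.search(new TermQuery(new Term("word", term)));
    for (int i = 0; i < hits.length(); i++) {
        String[] syns = hits.doc(i).getValues("syn");
        if (syns == null) continue;
        for (int j = 0; j < syns.length; j++) {
            TermQuery tq = new TermQuery(new Term(field, syns[j]));
            tq.setBoost(synBoost);          // typically < 1 so synonyms count for less
            result.add(tq, false, false);   // optional, not required
        }
    }
    return result;
}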

Erik
On Oct 19, 2004, at 12:40 PM, Patricio Galeas wrote:
Hello,
I'm a new user of Lucene, and I would like to use it to create a 
thesaurus.
Do you have any idea to do this?  Thanks!

kind regards
P.Galeas


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Efficient search on lucene mailing archives

2004-10-14 Thread David Spencer
sam s wrote:
Hi Folks,
Is there any place where I can do a better search on the lucene mailing 
archives?
I tried JGuru and it looks like their search is paid.
The Apache-maintained archives lack efficient searching.
Of course one of the ironies is, shouldn't we be able to use Lucene to 
search the mailing list archives and even apache.org?
Thanks in advance,
s s
_
Don't just search. Find. Check out the new MSN Search! 
http://search.msn.com/

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Highlighting PDF file after the search

2004-09-20 Thread David Spencer
[EMAIL PROTECTED] wrote:

Hello,
I can successfully index and search the PDF documents, however i am not
able to highlight the searched text in my original PDF file (ie: like
dtSearch
highlights on original file)
I took a look at the highlighter in sandbox, compiled it and have it
ready.  I am wondering if this highlighter is for highlighting indexed
documents or
can it be used for PDF Files as is !  Please enlighten !
I did this a few weeks ago.
There are two ways, and they both revolve around the same thing: you need 
the tokenized PDF text available.

[a] Store the tokenized PDF text in the index, or in some other file on 
disk i.e. a cache ( but cache is a misleading term, as you can't have 
a cache miss unless you can do [b]).

[b] Tokenize it on the fly when you call getBestFragments() - the 1st 
arg, the TokenStream, should be one that takes a PDF file as input and 
tokenizes it.

http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/highlighter/build/docs/api/org/apache/lucene/search/highlight/Highlighter.html#getBestFragments(org.apache.lucene.analysis.TokenStream,%20java.lang.String,%20int,%20java.lang.String)
Thanks,
Vijay Balasubramanian
DPRA Inc.,
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


IndexReader.close() semantics and optimize -- Re: problem with locks when updating the data of a previous stored document

2004-09-16 Thread David Spencer
Crump, Michael wrote:
You have to close the IndexReader after doing the delete, before opening the 
IndexWriter for the addition.  See information at this link:
http://wiki.apache.org/jakarta-lucene/UpdatingAnIndex
Recently I thought I observed that if I use this batch update idiom (1st 
delete the changed docs, then add them), it seems that 
IndexReader.close() does not flush/commit the deletions - rather 
IndexWriter.optimize() does.

I may have been confused and should retest this, but regardless, the 
javadoc seems unclear. close() says it *saves* deletions to disk. What 
does it mean to save a deletion? Save a pending one, or commit it 
(commit - really delete it)?

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#close()
Also optimize doesn't mention deletions.
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#optimize()
Suggestion: could the word save in the close() jdoc be elaborated on, 
and possibly could optimize() get another comment wrt its effect on 
deletions?

thx,
 Dave

Regards,
Michael
-Original Message-
From:   Paul Williams [mailto:[EMAIL PROTECTED]
Sent:   Thu 9/16/2004 5:39 AM
To: 'Lucene Users List'
Cc: 
Subject:problem with locks when updating the data of a previous stored document
Hi,
Using lucene-1.4.1.jar on WinXP  

I am having trouble with locking and updating an existing Lucene document. I
delete the old document from the index and then add the new document to the
index writer. I am using the minMerge docs set to 100 (much quicker!!) and
close the writer once the batch is done, so the documents are flushed to the
filesystem
The problem i am having is I can't delete the old version of the document
(after the first document has been added) using reader.delete because there
is a lock on the index due to the IndexWriter being open.
Am I doing this wrong or is there a simple way round this?
Regards,
Paul
Code snippets of the update code (I have just cut and pasted the relevant
line from my app to get an idea)
reader = IndexReader.open(location);
// Delete old doc/term if present
if (reader.docFreq(docNumberTerm) > 0) {
reader.delete(docNumberTerm);
.
.
.
IndexWriter writer = null;
// get the writer from the hash table so last few are cached and don't
have to be restarted
synchronized(IndexWriterCache) {
   String dbstring = "" + ldb;
   writer = (IndexWriter)IndexWriterCache.get(dbstring);
   if (writer == null) {
   //Not in cache so create one and add to cache for next time
   writer = new IndexWriter(location, new StandardAnalyzer(),
new_index);
   writer.setUseCompoundFile(true);
   // Set the maximum number of entries per field. Default is
10,000
   writer.maxFieldLength = MaxFieldCount;
   // Set how many docs will be stored in memory before being
saved to disk
   writer.minMergeDocs = (int) DocsInMemory;
   IndexWriterCache.remove(dbstring);
   IndexWriterCache.put(dbstring, writer);
}
.
.
.
  
// Add the documents to the Lucene index
writer.addDocument(doc);


.
. Some time later after a batch of docs been added
 
	   writer.close();





Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-16 Thread David Spencer
Morus Walter wrote:
Hi David,
Based on this mail I wrote a ngram speller for Lucene. It runs in 2 
phases. First you build a fast lookup index as mentioned above. Then 
to correct a word you do a query in this index based on the ngrams in 
the misspelled word.

Let's see.
[1] Source is attached and I'd like to contribute it to the sandbox, esp 
if someone can validate that what it's doing is reasonable and useful.

great :-)
[4] Here's source in HTML:
http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
could you put the current version of your code on that website as a java
Weblog entry updated:
http://searchmorph.com/weblog/index.php?id=23
To link to source code:
http://www.searchmorph.com/pub/ngramspeller/NGramSpeller.java
source also? At least until it's in the lucene sandbox.
I created an ngram index on one of my indexes and think I found an issue
in the indexing code:
There is an option -f to specify the field on which the ngram index will
be created. 
However there is no code to restrict the term enumeration on this field.

So instead of 
		final TermEnum te = r.terms();
i'd suggest
		final TermEnum te = r.terms(new Term(field, ""));
and a check within the loop over the terms if the enumerated term
still has fieldname field, e.g.
			Term t = te.term();
			if ( !t.field().equals(field) ) {
			break;
			}

otherwise you loop over all terms in all fields.
Great suggestion and thanks for that idiom - I should know such things 
by now. To clarify the issue, it's just a performance one, not other 
functionality...anyway I put in the code - and to be scientific I 
benchmarked it two times before the change and two times after - and the 
results were surprisingly the same both times (1:45 to 1:50 with an index 
that takes up 200MB). Probably there are cases where this will run 
faster, and the code seems more correct now so it's in.



An interesting application of this might be an ngram-Index enhanced version
of the FuzzyQuery. While this introduces more complexity on the indexing
side, it might be a large speedup for fuzzy searches.
I'm also thinking of reviewing the list to see if anyone had done a Jaro 
Winkler fuzzy query yet and doing that



Thanks,
 Dave
Morus
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Aad Nales wrote:
By trying: if you type const you will find that it returns 216 hits. The
third sports 'const' as a term (space separated and all). I would expect
'conts' to return with const as well. But again I might be mistaken. I
am now trying to figure what the problem might be: 

1. my expectations (most likely ;-)
2. something in the code..

Good question.
If I use the form at the bottom of the page and ask for more results, 
the suggestion of const does eventually show up - 99th however(!).

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=3&max=4&maxd=5&maxr=1000&bstart=2.0&bend=1.0
Even boosting the prefix match from 2.0 to 10.0 only changes the ranking 
a few slots.
http://www.searchmorph.com/kat/spell.jsp?s=conts&min=3&max=4&maxd=5&maxr=1000&bstart=10.0&bend=1.0

To restate the question for a second.
The misspelled word is: conts.
The suggestion expected is const, which seems reasonable enough as 
it's just a transposition away, thus the string distance is low.

But - I guess the problem w/ the algorithm is that for short words like 
this, with transpositions, the two words won't share many ngrams.

Just looking at 3grams...
conts - con ont nts
const - con ons nst
Thus they just share 1 3gram, thus this is why it scores so low. This is 
an interesting issue, how to tune the algorithm so that it might return 
words this close higher.
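
For the record, a throwaway sketch (not part of NGramSpeller) of how that 
overlap is being counted - plain letter n-grams, no boundary markers:

// throwaway sketch, e.g. ngramOverlap("conts", "const", 3) == 1
static int ngramOverlap(String a, String b, int n) {
    Set shared = ngrams(a, n);
    shared.retainAll(ngrams(b, n));
    return shared.size();
}

static Set ngrams(String word, int n) {
    Set grams = new HashSet();
    for (int i = 0; i + n <= word.length(); i++)
        grams.add(word.substring(i, i + n));
    return grams;
}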

I guess one way is to add all simple transpositions to the lookup table 
(the ngram index) so that these could easily be found, with the 
heuristic that a frequent way of misspelling words is to transpose two 
adjacent letters.

Based on other mails I'll make some additions to the code and will 
report back if anything of interest changes here.



-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 15 September, 2004 12:23
To: Lucene Users List
Subject: Re: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene

Aad Nales wrote:

David,
Perhaps I misunderstand something so please correct me if I do. I used

http://www.searchmorph.com/kat/spell.jsp to look for conts without 
changing any of the default values. What I got as results did not 
include 'const' which has quite a high frequency in your index and

??? how do you know that? Remember, this is an index of _Java_docs, and 
const is not a Java keyword.


should have a pretty low levenshtein distance. Any idea what causes 
this behavior?



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Andrzej Bialecki wrote:
Aad Nales wrote:
David,
Perhaps I misunderstand something so please correct me if I do. I used
http://www.searchmorph.com/kat/spell.jsp to look for conts without
changing any of the default values. What I got as results did not
include 'const' which has quite a high frequency in your index and

??? how do you know that? Remember, this is an index of _Java_docs, and 
const is not a Java keyword.
I added a line of output to the right column under the 'details' box. 
const appears 216 times in the index (out of 96k docs), thus it is 
indeed kinda rare.

http://www.searchmorph.com/kat/spell.jsp?s=const

should have a pretty low levenshtein distance. Any idea what causes this
behavior?



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Aad Nales wrote:
By trying: if you type const you will find that it returns 216 hits. The
third sports 'const' as a term (space separated and all). I would expect
'conts' to return with const as well. But again I might be mistaken. I
am now trying to figure what the problem might be: 

1. my expectations (most likely ;-)
2. something in the code..

I enhanced the code to store simple transpositions also and I 
regenerated my site w/ ngrams from 2 to 5 chars. If you set the 
transposition boost up to 10 then const is returned 2nd...

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=10.0&popular=1
-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 15 September, 2004 12:23
To: Lucene Users List
Subject: Re: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene

Aad Nales wrote:

David,
Perhaps I misunderstand something so please correct me if I do. I used

http://www.searchmorph.com/kat/spell.jsp to look for conts without 
changing any of the default values. What I got as results did not 
include 'const' which has quite a high frequency in your index and

??? how do you know that? Remember, this is an index of _Java_docs, and 
const is not a Java keyword.


should have a pretty low levenshtein distance. Any idea what causes 
this behavior?



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Andrzej Bialecki wrote:
David Spencer wrote:
To restate the question for a second.
The misspelled word is: conts.
The suggestion expected is const, which seems reasonable enough as 
it's just a transposition away, thus the string distance is low.

But - I guess the problem w/ the algorithm is that for short words 
like this, with transpositions, the two words won't share many ngrams.

Just looking at 3grams...
conts - con ont nts
const - con ons nst
Thus they just share 1 3gram, thus this is why it scores so low. This 
is an interesting issue, how to tune the algorithm so that it might 
return words this close higher.

If you added 2-grams, then it would look like this (constructing also 
special start/end grams):
Oh cute trick to indicate prefixes and suffixes.
Anyway, as per prev post I reformed index w/ ngrams from length 2 to 5, 
and also store transpositions, and w/ appropriate boosts :) then const 
is returned 2nd.

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=10.0&popular=1
conts - _c co on nt ts s_
const - _c co on ns st t_
which gives 50% of overlap.
In another system that I designed we were using a combination of 2-4 
grams, albeit for a slightly different purpose, so in this case it would 
be:

conts:
_c co on nt ts s_, _co con ont nts ts_, _con cont onts nts_
const:
_c co on ns st t_, _co con ons nst st_, _con cons onst nst_
and the overlap is 40%.

I guess one way is to add all simple transpositions to the lookup 
table (the ngram index) so that these could easily be found, with 
the heuristic that a frequent way of misspelling words is to 
transpose two adjacent letters.

Yes, sounds like a good idea. Even though it increases the size of the 
lookup index, it still beats using the linear search...


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Doug Cutting wrote:
David Spencer wrote:
[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a 
term in the index, but the next 2 are. So it ignores the last 2 terms 
(recursive and descent) and suggests alternatives to 
recursize...thus if any term is in the index, regardless of 
frequency,  it is left as-is.

I guess you're saying that, if the user enters a term that appears in 
the index and thus is sort of spelled correctly ( as it exists in some 
doc), then we use the heuristic that any sufficiently large doc 
collection will have tons of misspellings, so we assume that rare 
terms in the query might be misspelled (i.e. not what the user 
intended) and we suggest alternatives to these words too (in addition 
to the words in the query that are not in the index at all).

Almost.
If the user enters "a recursize purser", then: "a", which is in, say, 
50% of the documents, is probably spelled correctly and "recursize", 
which is in zero documents, is probably misspelled.  But what about 
"purser"?  If we run the spell check algorithm on "purser" and generate 
"parser", should we show it to the user?  If "purser" occurs in 1% of 
documents and "parser" occurs in 5%, then we probably should, since 
"parser" is a more common word than "purser".  But if "parser" only 
occurs in 1% of the documents and "purser" occurs in 5%, then we probably 
shouldn't bother suggesting "parser".

If you wanted to get really fancy then you could check how frequently 
combinations of query terms occur, i.e., does "purser" or "parser" occur 
more frequently near "descent".  But that gets expensive.
I updated the code to have an optional popularity filter - if true then 
it only returns matches more popular (frequent) than the word that is 
passed in for spelling correction.
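
The filter itself is trivial - roughly this (illustrative names, not the 
exact code):

// illustrative: only keep a suggestion that occurs in more docs than the word typed in
int origFreq = reader.docFreq(new Term(field, searchedWord));
int suggFreq = reader.docFreq(new Term(field, suggestion));
if (morePopular && suggFreq <= origFreq) {
    // skip this suggestion
}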

If true (default) then for common words like remove, no results are 
returned now, as expected:

http://www.searchmorph.com/kat/spell.jsp?s=remove
But if you set it to false (bottom slot in the form at the bottom of the 
page) then the algorithm happily looks for alternatives:

http://www.searchmorph.com/kat/spell.jsp?s=remove&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=1.0&popular=0
TBD I need to update the javadoc & repost the code I guess. Also as per 
earlier post I also store simple transpositions for words in the 
ngram-index.

-- Dave
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: PorterStemfilter

2004-09-14 Thread David Spencer
Honey George wrote:
Hi,
 This might be more of a question related to the
PorterStemmer algorithm rather than with lucene, but
if anyone has the knowledge please share.
You might want to also try the Snowball stemmer:
http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/
And KStem:
http://ciir.cs.umass.edu/downloads/
I am using the PorterStemFilter that comes with lucene
and it turns out that searching for the word 'printer'
does not return a document containing the text
'print'. To narrow down the problem, I have tested the
PorterStemFilter in a standalone program and it turns
out that the stem of printer is 'printer' and not
'print'. That is 'printer' is not equal to 'print' +
'er', the whole of the word is the stem. Can somebody
explain the behavior.
Thanks  Regards,
   George



___ALL-NEW Yahoo! Messenger - 
all new features - even more fun!  http://uk.messenger.yahoo.com
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Andrzej Bialecki wrote:
David Spencer wrote:
I can/should send the code out. The logic is that for any terms in a 
query that have zero matches, go thru all the terms(!) and calculate 
the Levenshtein string distance, and return the best matches. A more 
intelligent way of doing this is to instead look for terms that also 
match on the 1st n (prob 3) chars.

...or prepare in advance a fast lookup index - split all existing terms 
to bi- or trigrams, create a separate lookup index, and then simply for 
each term ask a phrase query (phrase = all n-grams from an input term), 
with a slop > 0, to get similar existing terms. This should be fast, and 
you could provide a did you mean function too...

Based on this mail I wrote a ngram speller for Lucene. It runs in 2 
phases. First you build a fast lookup index as mentioned above. Then 
to correct a word you do a query in this index based on the ngrams in 
the misspelled word.

Let's see.
[1] Source is attached and I'd like to contribute it to the sandbox, esp 
if someone can validate that what it's doing is reasonable and useful.

[2] Here's a demo page. I built an ngram index for ngrams of length 3 
and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like recursixe 
or whatnot to see what suggestions it returns. Note this is not a normal 
search index query -- rather this is a test page for spelling corrections.

http://www.searchmorph.com/kat/spell.jsp
[3] Here's the javadoc:
http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html
[4] Here's source in HTML:
http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
[5] A few more details:
Based on a subsequent mail in this thread I set boosts for the words in 
the ngram index. The background is each word (er..term for a given 
field) in the orig index is a separate Document in the ngram index. This 
Doc contains all ngrams (in my test case, like #2 above, of length 3 and 
4) of the word. I also set a boost of log(word_freq)/log(num_docs) so 
that more frequent words will tend to be suggested more often.

I think in plain English then the way a word is suggested as a 
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher
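
In code, one entry of the lookup index is built roughly like this (a 
sketch; the field names and the boost just mirror the description above - 
the attached source is the real thing):

// sketch: one Document per term of the original index; "word", "gram3",
// "gram4" are illustrative field names
Document makeNGramDoc(String word, int wordFreq, int numDocs) {
    Document doc = new Document();
    doc.add(Field.Keyword("word", word));   // the suggestion to hand back
    addGrams(doc, "gram3", word, 3);
    addGrams(doc, "gram4", word, 4);
    doc.setBoost((float) (Math.log(wordFreq) / Math.log(numDocs)));  // frequent words score higher
    return doc;
}

void addGrams(Document doc, String fieldName, String word, int n) {
    for (int i = 0; i + n <= word.length(); i++)
        doc.add(Field.Keyword(fieldName, word.substring(i, i + n)));
}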

[6]
If people want to vote me in as a committer to the sandbox then I can 
check this code in - though again, I'd appreciate feedback.

thx,
 Dave

package org.apache.lucene.spell;

/* 
 * The Apache Software License, Version 1.1
 *
 * Copyright (c) 2001-2003 The Apache Software Foundation.  All rights
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *notice, this list of conditions and the following disclaimer in
 *the documentation and/or other materials provided with the
 *distribution.
 *
 * 3. The end-user documentation included with the redistribution,
 *if any, must include the following acknowledgment:
 *   This product includes software developed by the
 *Apache Software Foundation (http://www.apache.org/).
 *Alternately, this acknowledgment may appear in the software itself,
 *if and wherever such third-party acknowledgments normally appear.
 *
 * 4. The names Apache and Apache Software Foundation and
 *Apache Lucene must not be used to endorse or promote products
 *derived from this software without prior written permission. For
 *written permission, please contact [EMAIL PROTECTED]
 *
 * 5. Products derived from this software may not be called Apache,
 *Apache Lucene, nor may Apache appear in their name, without
 *prior written permission of the Apache Software Foundation.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Tate Avery wrote:
I get a NullPointerException shown (via Apache) when I try to access http://www.searchmorph.com/kat/spell.jsp
How embarrassing!
Sorry!
Fixed!


T
-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 3:23 PM
To: Lucene Users List
Subject: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene
Andrzej Bialecki wrote:

David Spencer wrote:

I can/should send the code out. The logic is that for any terms in a 
query that have zero matches, go thru all the terms(!) and calculate 
the Levenshtein string distance, and return the best matches. A more 
intelligent way of doing this is to instead look for terms that also 
match on the 1st n (prob 3) chars.

...or prepare in advance a fast lookup index - split all existing terms 
to bi- or trigrams, create a separate lookup index, and then simply for 
each term ask a phrase query (phrase = all n-grams from an input term), 
with a slop  0, to get similar existing terms. This should be fast, and 
you could provide a did you mean function too...


Based on this mail I wrote a ngram speller for Lucene. It runs in 2 
phases. First you build a fast lookup index as mentioned above. Then 
to correct a word you do a query in this index based on the ngrams in 
the misspelled word.

Let's see.
[1] Source is attached and I'd like to contribute it to the sandbox, esp 
if someone can validate that what it's doing is reasonable and useful.

[2] Here's a demo page. I built an ngram index for ngrams of length 3 
and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like recursixe 
or whatnot to see what suggestions it returns. Note this is not a normal 
search index query -- rather this is a test page for spelling corrections.

http://www.searchmorph.com/kat/spell.jsp
[3] Here's the javadoc:
http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html
[4] Here's source in HTML:
http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
[5] A few more details:
Based on a subsequent mail in this thread I set boosts for the words in 
the ngram index. The background is each word (er..term for a given 
field) in the orig index is a separate Document in the ngram index. This 
Doc contains all ngrams (in my test case, like #2 above, of length 3 and 
4) of the word. I also set a boost of log(word_freq)/log(num_docs) so 
that more frequent words will tend to be suggested more often.

I think in plain English then the way a word is suggested as a 
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher
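
For anyone who just wants the shape of the indexing phase, here's a rough 
sketch - not the actual NGramSpeller source (see the javadoc/source links 
above for that) - with made-up field names ("word", "gram3", "start3") and 
an arbitrary analyzer choice:

import java.io.IOException;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class NGramIndexSketch {

    /** One Document per term of 'field' in the orig index, holding that term's 3-grams. */
    public static void build(IndexReader orig, String field, String ngramIndexPath)
            throws IOException {
        IndexWriter writer = new IndexWriter(ngramIndexPath, new WhitespaceAnalyzer(), true);
        int numDocs = orig.numDocs();
        TermEnum terms = orig.terms();
        while (terms.next()) {
            Term t = terms.term();
            if (!t.field().equals(field)) continue;       // only the field we care about
            String word = t.text();
            if (word.length() < 3) continue;              // too short to have any 3-grams
            Document doc = new Document();
            doc.add(Field.Keyword("word", word));         // the suggestion we want handed back
            for (int i = 0; i + 3 <= word.length(); i++)
                doc.add(Field.Keyword("gram3", word.substring(i, i + 3)));
            doc.add(Field.Keyword("start3", word.substring(0, 3)));  // prefix, handy for boosting
            // frequent words get a higher boost (a real version would guard against
            // docFreq() == 1, which makes this boost zero)
            doc.setBoost((float) (Math.log(terms.docFreq()) / Math.log(numDocs)));
            writer.addDocument(doc);
        }
        terms.close();
        writer.optimize();
        writer.close();
    }
}

The lookup side then just ORs the ngrams of the misspelled word against 
this index.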

[6]
If people want to vote me in as a committer to the sandbox then I can 
check this code in - though again, I'd appreciate feedback.

thx,
  Dave

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Andrzej Bialecki wrote:
David Spencer wrote:
...or prepare in advance a fast lookup index - split all existing
terms to bi- or trigrams, create a separate lookup index, and then
simply for each term ask a phrase query (phrase = all n-grams from
an input term), with a slop  0, to get similar existing terms.
This should be fast, and you could provide a did you mean
function too...
Based on this mail I wrote a ngram speller for Lucene. It runs in 2
phases. First you build a fast lookup index as mentioned above.
Then to correct a word you do a query in this index based on the
ngrams in the misspelled word.
The background for this suggestion was that I was playing some time ago 
with a Luke plugin that builds various sorts of ancillary indexes, but 
then I never finished it... Kudos for actually making it work ;-)
Sure, it was a fun little edge project. For the most part the code was 
done last week right after this thread appeared, but it always takes a 
while to get it from 95% to 100%.


[1] Source is attached and I'd like to contribute it to the sandbox,
esp if someone can validate that what it's doing is reasonable and
useful.

There have been many requests for this or similar functionality in the 
past, I believe it should go into sandbox.

I was wondering about the way you build the n-gram queries. You 
basically don't care about their position in the input term. Originally 
I thought about using PhraseQuery with a slop - however, after checking 
the source of PhraseQuery I realized that this probably wouldn't be that 
fast... You use BooleanQuery and start/end boosts instead, which may 
give similar results in the end but much cheaper.

I also wonder how this algorithm would behave for smaller values of 
Sure, I'll try to rebuild the demo w/ lengths 2-5 (and then the query 
page can test any contiguous combo).

start/end lengths (e.g. 2,3,4). In a sense, the smaller the n-gram 
length, the more fuzziness you introduce, which may or may not be 
desirable (increased recall at the cost of precision - for small indexes 
this may be useful from the user's perspective because you will always 
get a plausible hit, for huge indexes it's a loss).

[2] Here's a demo page. I built an ngram index for ngrams of length 3
 and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like
"recursixe" or whatnot to see what suggestions it returns. Note this
is not a normal search index query -- rather this is a test page for
spelling corrections.

http://www.searchmorph.com/kat/spell.jsp

Very nice demo! 
Thanks, kinda designed for ngram-nerds if you know what I mean :)
I bet it's running way faster than the linear search 
Indeed, this is almost zero time, whereas the simple and dumb linear 
search was taking me 10sec. I will have to redo the site's main search 
page so it uses this new code, TBD, prob tomorrow.

over terms :-), even though you have to build the index in advance. But 
if you work with static or mostly static indexes this doesn't matter.

Based on a subsequent mail in this thread I set boosts for the words
in the ngram index. The background is each word (er..term for a given
 field) in the orig index is a separate Document in the ngram index.
This Doc contains all ngrams (in my test case, like #2 above, of
length 3 and 4) of the word. I also set a boost of
log(word_freq)/log(num_docs) so that more frequent words will tend
to be suggested more often.

You may want to experiment with 2 <= n <= 5. Some n-gram based 
Yep, will do prob tomorrow.
techniques use all lengths together, some others use just single length, 
results also vary depending on the language...

I think in plain English then the way a word is suggested as a 
spelling correction is: - frequently occurring words score higher -
words that share more ngrams with the orig word score higher - words
that share rare ngrams with the orig word score higher

I think this is a reasonable heuristics. Reading the code I would 
present it this way:
ok, thx, will update
- words that share more ngrams with the orig word score higher, and
  words that share rare ngrams with the orig word score higher
  (as a natural consequence of using BooleanQuery),
- and, frequently occurring words score higher (as a consequence of using
  per-Document boosts),
- from reading the source code I see that you use Levenshtein distance
  to prune the resultset of too long/too short results,
I think also that because you don't use the positional information about 
 the input n-grams you may be getting some really weird hits.
Good point, though I haven't seen this yet. Might be due to the prefix 
boost and maybe some Markov chain magic tending to only show reasonable 
words.

You could 
prune them by simply checking if you find a (threshold) of input ngrams 
in the right sequence in the found terms. This shouldn't be too costly 
Good point, I'll try to add that in as an optional parameter.
because you operate on a small result set

Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Doug Cutting wrote:
David Spencer wrote:
[1] The user enters a query like:
"recursize descent parser"
[2] The search code parses this and sees that the 1st word is not a 
term in the index, but the next 2 are. So it ignores the last 2 terms 
("descent" and "parser") and suggests alternatives to 
"recursize"...thus if any term is in the index, regardless of 
frequency, it is left as-is.

I guess you're saying that, if the user enters a term that appears in 
the index and thus is sort of spelled correctly ( as it exists in some 
doc), then we use the heuristic that any sufficiently large doc 
collection will have tons of misspellings, so we assume that rare 
terms in the query might be misspelled (i.e. not what the user 
intended) and we suggest alternatives to these words too (in addition 
to the words in the query that are not in the index at all).

Almost.
If the user enters "a recursize purser", then: "a", which is in, say, 
50% of the documents, is probably spelled correctly and "recursize", 
which is in zero documents, is probably misspelled.  But what about 
"purser"?  If we run the spell check algorithm on "purser" and generate 
"parser", should we show it to the user?  If "purser" occurs in 1% of 
documents and "parser" occurs in 5%, then we probably should, since 
"parser" is a more common word than "purser".  But if "parser" only 
occurs in 1% of the documents and "purser" occurs in 5%, then we probably 
shouldn't bother suggesting "parser".
OK, sure, got it.
I'll give it a think and try to add this option to my just submitted 
spelling code.
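
Roughly what I have in mind is a check like this (just a sketch - the field 
name is made up, and "reader" is an IndexReader on the original index):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class SuggestGate {
    /** Only offer 'suggestion' if it occurs in more documents than what the user typed. */
    public static boolean worthSuggesting(IndexReader reader, String field,
                                          String userWord, String suggestion)
            throws IOException {
        int userFreq = reader.docFreq(new Term(field, userWord));    // e.g. "purser"
        int suggFreq = reader.docFreq(new Term(field, suggestion));  // e.g. "parser"
        return suggFreq > userFreq;
    }
}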


If you wanted to get really fancy then you could check how frequently 
combinations of query terms occur, i.e., does "purser" or "parser" occur 
more frequently near "descent".  But that gets expensive.
Yeah, expensive for a large scale search engine, but probably 
appropriate for a desktop engine.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
Ji Kuhn wrote:
Thanks for the bug's id, it seems like my problem and I have a stand-alone code with 
main().
What about slow garbage collector? This looks for me as wrong suggestion.

I've seen this written up before (javaworld?) as a way to probably 
force GC instead of just a System.gc() call. I think the 2nd gc() call 
is supposed to clean up junk from the runFinalization() call...

System.gc();
Thread.sleep( 100);
System.runFinalization();
Thread.sleep( 100);
System.gc();
Let's change the code once again:
...
public static void main(String[] args) throws IOException, InterruptedException
{
Directory directory = create_index();
for (int i = 1; i < 100; i++) {
System.err.println("loop " + i + ", index version: " + 
IndexReader.getCurrentVersion(directory));
search_index(directory);
add_to_index(directory, i);
System.gc();
Thread.sleep(1000);// whatever value you want
}
}
...
and in the 4th iteration java.lang.OutOfMemoryError appears again.
Jiri.
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 4:53 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
you can close the index, but the Garbage Collector still needs to 
reclaim the memory and it may be taking longer than your loop to do so.

John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


OptimizeIt -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
Ji Kuhn wrote:
This doesn't work either!
You're right.
I'm running under JDK1.5 and trying larger values for -Xmx and it still 
fails.

Running under (Borland's) OptimizeIt shows the number of Terms and 
Terminfos (both in org.apache.lucene.index) increase every time thru the 
loop, by several hundred instances each.

I can trace thru some Term instances on the reference graph of 
OptimizeIt but it's unclear to me what's right. One *guess* is that 
maybe the WeakHashMap in either SegmentReader or FieldCacheImpl is the 
problem.



Let's concentrate on the first version of my code. I believe that the code should 
run endlessly (I have said it before: in version 1.4 final it does).
Jiri.
-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 5:34 PM
To: Lucene Users List
Subject: force gc idiom - Re: OutOfMemory example
Ji Kuhn wrote:

Thanks for the bug's id, it seems like my problem and I have a stand-alone code with 
main().
What about slow garbage collector? This looks for me as wrong suggestion.

I've seen this written up before (javaworld?) as a way to probably 
force GC instead of just a System.gc() call. I think the 2nd gc() call 
is supposed to clean up junk from the runFinalization() call...

System.gc();
Thread.sleep( 100);
System.runFinalization();
Thread.sleep( 100);
System.gc();

Let's change the code once again:
...
   public static void main(String[] args) throws IOException, InterruptedException
   {
   Directory directory = create_index();
   for (int i = 1; i < 100; i++) {
   System.err.println("loop " + i + ", index version: " + 
IndexReader.getCurrentVersion(directory));
   search_index(directory);
   add_to_index(directory, i);
   System.gc();
   Thread.sleep(1000);// whatever value you want
   }
   }
...
and in the 4th iteration java.lang.OutOfMemoryError appears again.
Jiri.
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 4:53 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
you can close the index, but the Garbage Collector still needs to 
reclaim the memory and it may be taking longer than your loop to do so.

John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


FieldSortedHitQueue.Comparators -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
Just noticed something else suspicious.
FieldSortedHitQueue has a field called Comparators and it seems like 
things are never removed from it

Ji Kuhn wrote:
This doesn't work either!
Let's concentrate on the first version of my code. I believe that the code should run 
endlessly (I have said it before: in version 1.4 final it does).
Jiri.
-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 5:34 PM
To: Lucene Users List
Subject: force gc idiom - Re: OutOfMemory example
Ji Kuhn wrote:

Thanks for the bug's id, it seems like my problem and I have a stand-alone code with 
main().
What about slow garbage collector? This looks for me as wrong suggestion.

I've seen this written up before (javaworld?) as a way to probably 
force GC instead of just a System.gc() call. I think the 2nd gc() call 
is supposed to clean up junk from the runFinalization() call...

System.gc();
Thread.sleep( 100);
System.runFinalization();
Thread.sleep( 100);
System.gc();

Let's change the code once again:
...
   public static void main(String[] args) throws IOException, InterruptedException
   {
   Directory directory = create_index();
   for (int i = 1; i < 100; i++) {
   System.err.println("loop " + i + ", index version: " + 
IndexReader.getCurrentVersion(directory));
   search_index(directory);
   add_to_index(directory, i);
   System.gc();
   Thread.sleep(1000);// whatever value you want
   }
   }
...
and in the 4th iteration java.lang.OutOfMemoryError appears again.
Jiri.
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 4:53 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
you can close the index, but the Garbage Collector still needs to 
reclaim the memory and it may be taking longer than your loop to do so.

John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: FieldSortedHitQueue.Comparators -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
David Spencer wrote:
Just noticed something else suspicious.
FieldSortedHitQueue has a field called Comparators and it seems like 
things are never removed from it
Replying to my own postthis could be the problem.
If I put in a print statement here in FieldSortedHitQueue, recompile, 
and run w/ the new jar then I see Comparators.size() go up after every 
iteration thru ReopenTest's loop and the size() never goes down...

 static Object store (IndexReader reader, String field, int type, 
Object factory, Object value) {
FieldCacheImpl.Entry entry = (factory != null)
  ? new FieldCacheImpl.Entry (field, factory)
  : new FieldCacheImpl.Entry (field, type);
synchronized (Comparators) {
  HashMap readerCache = (HashMap)Comparators.get(reader);
  if (readerCache == null) {
readerCache = new HashMap();
Comparators.put(reader,readerCache);
		System.out.println( "*\t* NOW:" + Comparators.size());
  }
  return readerCache.put (entry, value);
}
  }

Ji Kuhn wrote:
This doesn't work either!
Let's concentrate on the first version of my code. I believe that the 
code should run endlessly (I have said it before: in version 1.4 final 
it does).

Jiri.
-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 5:34 PM
To: Lucene Users List
Subject: force gc idiom - Re: OutOfMemory example
Ji Kuhn wrote:

Thanks for the bug's id, it seems like my problem and I have a 
stand-alone code with main().

What about slow garbage collector? This looks for me as wrong 
suggestion.


I've seen this written up before (javaworld?) as a way to probably 
force GC instead of just a System.gc() call. I think the 2nd gc() 
call is supposed to clean up junk from the runFinalization() call...

System.gc();
Thread.sleep( 100);
System.runFinalization();
Thread.sleep( 100);
System.gc();

Let's change the code once again:
...
   public static void main(String[] args) throws IOException, 
InterruptedException
   {
   Directory directory = create_index();

   for (int i = 1; i < 100; i++) {
   System.err.println("loop " + i + ", index version: " + 
IndexReader.getCurrentVersion(directory));
   search_index(directory);
   add_to_index(directory, i);
   System.gc();
   Thread.sleep(1000);// whatever value you want
   }
   }
...

and in the 4th iteration java.lang.OutOfMemoryError appears again.
Jiri.
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 4:53 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
you can close the index, but the Garbage Collector still needs to 
reclaim the memory and it may be taking longer than your loop to do so.

John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


SegmentReader - Re: FieldSortedHitQueue.Comparators -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
Another clue, the SegmentReaders are piling up too, which may be why the 
 Comparator map is increasing in size, because SegmentReaders are the 
keys to Comparator...though again, I don't know enough about the Lucene 
internals to know which refs to SegmentReaders are valid and which ones 
may be causing this leak.

David Spencer wrote:
David Spencer wrote:
Just noticed something else suspicious.
FieldSortedHitQueue has a field called Comparators and it seems like 
things are never removed from it

Replying to my own postthis could be the problem.
If I put in a print statement here in FieldSortedHitQueue, recompile, 
and run w/ the new jar then I see Comparators.size() go up after every 
iteration thru ReopenTest's loop and the size() never goes down...

 static Object store (IndexReader reader, String field, int type, Object 
factory, Object value) {
FieldCacheImpl.Entry entry = (factory != null)
  ? new FieldCacheImpl.Entry (field, factory)
  : new FieldCacheImpl.Entry (field, type);
synchronized (Comparators) {
  HashMap readerCache = (HashMap)Comparators.get(reader);
  if (readerCache == null) {
readerCache = new HashMap();
Comparators.put(reader,readerCache);
System.out.println( "*\t* NOW:" + Comparators.size());
  }
  return readerCache.put (entry, value);
}
  }


Ji Kuhn wrote:
This doesn't work either!
Let's concentrate on the first version of my code. I believe that the 
code should run endlessly (I have said it before: in version 1.4 final 
it does).

Jiri.
-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 5:34 PM
To: Lucene Users List
Subject: force gc idiom - Re: OutOfMemory example
Ji Kuhn wrote:

Thanks for the bug's id, it seems like my problem and I have a 
stand-alone code with main().

What about slow garbage collector? This looks for me as wrong 
suggestion.


I've seen this written up before (javaworld?) as a way to probably 
force GC instead of just a System.gc() call. I think the 2nd gc() 
call is supposed to clean up junk from the runFinalization() call...

System.gc();
Thread.sleep( 100);
System.runFinalization();
Thread.sleep( 100);
System.gc();

Let's change the code once again:
...
   public static void main(String[] args) throws IOException, 
InterruptedException
   {
   Directory directory = create_index();

   for (int i = 1; i < 100; i++) {
   System.err.println("loop " + i + ", index version: " + 
IndexReader.getCurrentVersion(directory));
   search_index(directory);
   add_to_index(directory, i);
   System.gc();
   Thread.sleep(1000);// whatever value you want
   }
   }
...

and in the 4th iteration java.lang.OutOfMemoryError appears again.
Jiri.
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 4:53 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
you can close the index, but the Garbage Collector still needs to 
reclaim the memory and it may be taking longer than your loop to do so.

John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: OutOfMemory example

2004-09-13 Thread David Spencer
Daniel Naber wrote:
On Monday 13 September 2004 15:06, Ji Kuhn wrote:

   I think I can reproduce memory leaking problem while reopening
an index. Lucene version tested is 1.4.1, version 1.4 final works OK. My
JVM is:

Could you try with the latest Lucene version from CVS? I cannot reproduce 
your problem with that version (Sun's Java 1.4.2_03, Linux).
I verified it w/ the latest lucene code from CVS under win xp.
Regards
 Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: combining open office spellchecker with Lucene

2004-09-10 Thread David Spencer
eks dev wrote:
Hi Doug,

Perhaps.  Are folks really better at spelling the
beginning of words?

Yes they are. There were some comprehensive empirical
studies on this topic. Winkler modification on Jaro
string distance is based on this assumption (boosting
similarity if first n, I think 4, chars match).
Jaro-Winkler is well documented and some folks think
that it is much more efficient and precise than plain
Edit distance (of course for normal language, not
numbers or so).
I will try to dig-out some references from my disk on
Good ole Citeseer finds 2 docs that seem relevant:
http://citeseer.ist.psu.edu/cs?cs=1&q=Winkler+Jaro&submit=Documents&co=Citations&cm=50&cf=Any&ao=Citations&am=20&af=Any
I have some of the ngram spelling suggestion stuff, based on earlier 
msgs in this thread, working in my dev tree. I'll try to get a test site 
up later today for people to fool around with.


this topic, if you are interested.
On another note,
I would even suggest using Jaro-Winkler distance as
default for fuzzy query. (one could configure max
prefix required = prefix query to reduce number of
distance calculations). This could speed-up fuzzy
search dramatically.
Hope this was helpful,
Eks

  





___ALL-NEW Yahoo! Messenger - 
all new features - even more fun!  http://uk.messenger.yahoo.com
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread David Spencer
Doug Cutting wrote:
Aad Nales wrote:
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
based on the spellchecker of OpenOffice. My question is: has anybody
tried this before? 

Note that a spell checker used with a search engine should use 
collection frequency information.  That's to say, only corrections 
which are more frequent in the collection than what the user entered 
should be displayed.  Frequency information can also be used when 
constructing the checker.  For example, one need never consider 
proposing terms that occur in very few documents.  

And one should not 
try correction at all for terms which occur in a large proportion of the 
collection.
I keep thinking over this one and I don't understand it. If a user 
misspells a word and the "did you mean" spelling correction algorithm 
determines that a frequent term is a good suggestion, why not suggest 
it? The very fact that it's common could mean that it's more likely that 
the user wanted this word (well, the heuristic here is that users 
frequently search for frequent terms, which is probably wrong, but 
anyway..).

I know in other contexts of IR frequent terms are penalized but in this 
context it seems that frequent terms should be fine...

-- Dave

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread David Spencer
Doug Cutting wrote:
David Spencer wrote:
Doug Cutting wrote:
And one should not try correction at all for terms which occur in a 
large proportion of the collection.

I keep thinking over this one and I don't understand it. If a user 
misspells a word and the "did you mean" spelling correction algorithm 
determines that a frequent term is a good suggestion, why not suggest 
it? The very fact that it's common could mean that it's more likely 
that the user wanted this word (well, the heuristic here is that users 
frequently search for frequent terms, which is probably wrong, but 
anyway..).

I think you misunderstood me.  What I meant to say was that if the term 
the user enters is very common then spell correction may be skipped. 
Very common words which are similar to the term the user entered should 
of course be shown.  But if the user's term is very common one need not 
even attempt to find similarly-spelled words.  Is that any better?
Yes, sure, thx, I understand now - but maybe not - the context I was 
thinking of was something like this:

[1] The user enters a query like:
"recursize descent parser"
[2] The search code parses this and sees that the 1st word is not a term 
in the index, but the next 2 are. So it ignores the last 2 terms 
("descent" and "parser") and suggests alternatives to 
"recursize"...thus if any term is in the index, regardless of frequency, 
 it is left as-is.

I guess you're saying that, if the user enters a term that appears in 
the index and thus is sort of spelled correctly ( as it exists in some 
doc), then we use the heuristic that any sufficiently large doc 
collection will have tons of misspellings, so we assume that rare terms 
in the query might be misspelled (i.e. not what the user intended) and 
we suggest alternatives to these words too (in addition to the words in 
the query that are not in the index at all).


Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Aad Nales wrote:
Hi All,
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
based on the spellchecker of OpenOffice. My question is: has anybody
tried this before? 
I did a WordNet/synonym query expander. Search for "WordNet" on this 
page. Of interest is that it stores the WordNet info in a separate Lucene 
index - at its essence, an index is just a database.

http://jakarta.apache.org/lucene/docs/lucene-sandbox/
Also, another variation is to instead spell-correct based on what terms are in 
the index, not what an external dictionary says. I've done this on my 
experimental site searchmorph.com in a dumb/inefficient way. Here's an 
example:

http://www.searchmorph.com/kat/search.jsp?s=recursivz
After you click above it takes ~10sec as it produces terms close to 
"recursivz". Oops - looking at the output, it looks like the same word 
is suggested multiple times - ouch - I must be considering all fields, not 
just the "contents" field. TBD is fixing this. (or no wonder it's so slow :))

I can/should send the code out. The logic is that for any terms in a 
query that have zero matches, go thru all the terms(!) and calculate the 
Levenshtein string distance, and return the best matches. A more 
intelligent way of doing this is to instead look for terms that also 
match on the 1st n (prob 3) chars.
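
A very rough sketch of that brute-force logic (not the real code - the names 
are made up and the edit distance is the plain dynamic-programming version):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class LinearSpellScan {

    /** Walk every term in 'field' and return the 'max' words closest to 'bad' by edit distance. */
    public static List closest(IndexReader reader, String field, String bad, int max)
            throws IOException {
        List scored = new ArrayList();        // elements are Object[] { Integer distance, String word }
        TermEnum terms = reader.terms();
        while (terms.next()) {
            Term t = terms.term();
            if (!t.field().equals(field)) continue;
            String word = t.text();
            scored.add(new Object[] { new Integer(distance(bad, word)), word });
        }
        terms.close();
        Collections.sort(scored, new Comparator() {       // smallest distance first
            public int compare(Object a, Object b) {
                return ((Integer) ((Object[]) a)[0]).compareTo((Integer) ((Object[]) b)[0]);
            }
        });
        List best = new ArrayList();
        for (int i = 0; i < Math.min(max, scored.size()); i++)
            best.add(((Object[]) scored.get(i))[1]);
        return best;
    }

    /** Plain dynamic-programming Levenshtein edit distance. */
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }
}

Walking every term like this is of course why it's so slow on a big index.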



Cheers,
Aad
--
Aad Nales
[EMAIL PROTECTED], +31-(0)6 54 207 340 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Andrzej Bialecki wrote:
David Spencer wrote:
I can/should send the code out. The logic is that for any terms in a 
query that have zero matches, go thru all the terms(!) and calculate 
the Levenshtein string distance, and return the best matches. A more 
intelligent way of doing this is to instead look for terms that also 
match on the 1st n (prob 3) chars.

...or prepare in advance a fast lookup index - split all existing terms 
to bi- or trigrams, create a separate lookup index, and then simply for 
each term ask a phrase query (phrase = all n-grams from an input term), 
with a slop > 0, to get similar existing terms. This should be fast, and
you could provide a "did you mean" function too...
Sounds interesting/fun but I'm not sure if I'm following exactly.
Let's talk thru the trigram index case.
Are you saying that for every trigram in every word there will be a 
mapping of trigram -> term?
Thus if "recursive" is in the (orig) index then we'd create entries like:

rec -> recursive
ecu -> ...
cur -> ...
urs -> ...
rsi -> ...
siv -> ...
ive -> ...
And so on for all terms in the orig index.
OK fine.
But now the user types in a query like "recursivz".
What's the algorithm - obviously I guess take all trigrams in the bad 
term and go thru the trigram index, but there will be lots of 
suggestions. Now what - use string distance to score them? I guess that 
makes sense - plz confirm if I understand. And so I guess the point 
here is we precalculate the trigram -> term mappings to avoid an expensive 
traversal of all terms in an index, but we still use string distance as 
a 2nd pass (and prob should force the matches to always match on the 1st 
n (3) chars using the heuristic that people can usually start 
spelling a word correctly).
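
In code form, I'm picturing the lookup pass as something like this rough 
sketch (field names made up, and it assumes the lookup index holds one 
Document per original term with that term's trigrams):

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TermQuery;

public class TrigramLookup {

    /** OR together every trigram of the misspelled word; Lucene's scoring ranks the candidates. */
    public static String[] candidates(Searcher ngramSearcher, String bad, int max)
            throws IOException {
        BooleanQuery q = new BooleanQuery();
        for (int i = 0; i + 3 <= bad.length(); i++)
            q.add(new TermQuery(new Term("gram3", bad.substring(i, i + 3))), false, false);
        Hits hits = ngramSearcher.search(q);
        int n = Math.min(max, hits.length());
        String[] words = new String[n];
        for (int i = 0; i < n; i++)
            words[i] = hits.doc(i).get("word");   // the 2nd pass would re-score these by string distance
        return words;
    }
}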




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Existing Parsers

2004-09-09 Thread David Spencer
Honey George wrote:
Hi,
  I know some of them.
1. PDF
 + http://www.pdfbox.org/
 + http://www.foolabs.com/xpdf/download.html
- I am using this and found it good. It even supports 
My dated experience from 2 years ago was that (the evil, native code) 
foolabs pdf parser was the best, but obviously things could have changed.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg02912.html
 various languages.
2. word
  + http://sourceforge.net/projects/wvware
3. excel
  + http://www.jguru.com/faq/view.jsp?EID=1074230
-George
 --- [EMAIL PROTECTED] wrote: 

Anyone know of any reliable parsers out there for
pdf word 
excel or powerpoint?
For powerpoint it's not easy. I've been using this and it has worked 
fine until recently and seems to sometimes go into an infinite loop now 
on some recent PPTs. Native code and a package that seems to be dormant 
but to some extent it does the job. The file ppthtml does the work.

http://chicago.sourceforge.net/xlhtml

-
To unsubscribe, e-mail:
[EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]





___ALL-NEW Yahoo! Messenger - 
all new features - even more fun!  http://uk.messenger.yahoo.com
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Doug Cutting wrote:
Aad Nales wrote:
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
based on the spellchecker of OpenOffice. My question is: has anybody
tried this before? 

Note that a spell checker used with a search engine should use 
collection frequency information.  That's to say, only corrections 
which are more frequent in the collection than what the user entered 
should be displayed.  Frequency information can also be used when 
constructing the checker.  For example, one need never consider 
proposing terms that occur in very few documents.  And one should not 
try correction at all for terms which occur in a large proportion of the 
collection.
Good heuristics but are there any more precise, standard guidelines as 
to how to balance or combine what I think are the following possible 
criteria in suggesting a better choice:

- ignore(penalize?) terms that are rare
- ignore(penalize?) terms that are common
- terms that are closer (string distance) to the term entered are better
- terms that start w/ the same 'n' chars as the users term are better


Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


IndexSearcher.close() and aborting searches in progress

2004-09-08 Thread David Spencer
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html#close()
What is the intent of IndexSearcher.close()?
I want to know how, in a web app, one can stop a search that's in 
progress - use case is a user is limited to one search at a time, and 
when one (expensive) search is running they decide it's taking too long 
so they elaborate on the query and resubmit it. Goal is for the server 
to stop the search that's in progress and to start a new one. I know how 
to deal w/ session vars and so on in a web container - but can one stop 
a search that's in progress and is that the intent of close()?

I haven't done the obvious experiment but regardless, the javadoc is 
kinda terse so I wanted to hear from the all knowing people on the list.

thx,
  Dave
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: about search sorting

2004-09-03 Thread David Spencer
Wermus Fernando wrote:
 
Luceners,
My app is creating, updating and deleting from the index and searching
too. I need some information about sorting by a field. Could anyone
send me a link related to sorting?
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Sort.html
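
A minimal example of using it (the "date" field name is made up; the field 
has to be indexed and untokenized to be sortable):

import java.io.IOException;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;

public class SortedSearch {
    /** Sort hits by a field instead of by relevance. */
    public static Hits byField(IndexSearcher searcher, Query query) throws IOException {
        return searcher.search(query, new Sort("date"));   // "date" is a made-up field name
    }
}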
 
Thanks in advance.
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Running OutOfMemory while optimizing and searching

2004-07-02 Thread David Spencer
This in theory should not help, but anyway, just in case, the idea is to 
call gc() periodically to force gc - this is the code I use which 
tries to force it...

public static long gc()
{
long bef = mem();
System.gc();
sleep( 100);
System.runFinalization();
sleep( 100);
System.gc();
long aft= mem();
return aft-bef;
}
Mark Florence wrote:
Thanks, Jim. I'm pretty sure I'm throwing OOM for real,
and not because I've run out of file handles. I can easily
recreate the latter condition, and it is always reported
accurately. I've also monitored the OOM as it occurs using
top and I can see memory usage climbing until it is
exhausted -- if you will excuse the pun!
I'm not familiar with the new compound file format. Where
can I look to find more information?
-- Mark
-Original Message-
From: James Dunn [mailto:[EMAIL PROTECTED]
Sent: Friday, July 02, 2004 01:29 pm
To: Lucene Users List
Subject: Re: Running OutOfMemory while optimizing and searching
Ah yes, I don't think I made that clear enough.  From
Mark's original post, I believe he mentioned that he
used seperate readers for each simultaneous query.
His other issue was that he was getting an OOM during
an optimize, even when he set the JVM heap to 2GB.  He
said his index was about 10.5GB spread over ~7000
files on Linux.  

My guess is that OOM might actually be a "too many
open files" error.  I have seen that type of error
being reported by the JVM as an OutOfMemory error on
Linux before.  I had the same problem but once I
switched to the new Lucene compound file format, I
haven't had that problem since.  

Mark, have you tried switching to the compound file
format?  

Jim

--- Doug Cutting [EMAIL PROTECTED] wrote:
 What do your queries look like?  The memory
required
 for a query can be computed by the following
equation:

 1 Byte * Number of fields in your query * Number
of
 docs in your index

 So if your query searches on all 50 fields of
your 3.5
 Million document index then each search would
take
 about 175MB.  If your 3-4 searches run
concurrently
 then that's about 525MB to 700MB chewed up at
once.
That's not quite right.  If you use the same
IndexSearcher (or 
IndexReader) for all of the searches, then only
175MB are used.  The 
arrays in question (the norms) are read-only and can
be shared by all 
searches.

In general, the amount of memory required is:
1 byte * Number of searchable fields in your index *
Number of docs in 
your index

plus
1k bytes * number of terms in query
plus
1k bytes * number of phrase terms in query
The latter are for i/o buffers.  There are a few
other things, but these 
are the major ones.

Doug

-
To unsubscribe, e-mail:
[EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Search Result

2004-07-02 Thread David Spencer
Hetan Shah wrote:
My search results are only displaying the top portion of the indexed 
documents. It does match the query in the later part of the document. 
Where should I look to change the code in demo3 of default 1.3 final 
distribution. In general if I want to show the block of document that 
matches with the query string which classes should I use?
Sounds like this:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH
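
If that's the problem, here's a minimal sketch of raising the cap when the 
index is built (the path and analyzer are just placeholders, and you'd have 
to re-index for it to take effect):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class BigDocIndexing {
    /** Raise the per-field token cap so text past the first 10,000 terms gets indexed too. */
    public static IndexWriter open(String path) throws IOException {
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
        writer.maxFieldLength = Integer.MAX_VALUE;   // default is DEFAULT_MAX_FIELD_LENGTH (10,000)
        return writer;
    }
}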
Thanks guys.
-H
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Visualization of Lucene search results with a treemap

2004-07-01 Thread David Spencer
Inspired by these guys who put results from Google into a treemap...
http://google.hivegroup.com/
I did up my own version running against my index of OSS/javadoc trees.
This query for "thread pool" shows it off nicely:
http://www.searchmorph.com/kat/tsearch.jsp?s=thread%20pool&side=300&goal=500
This is the empty search form:
http://www.searchmorph.com/kat/tsearch.jsp
And the weblog entry has a few more links, esp useful if you don't know 
what a treemap is:

http://searchmorph.com/weblog/index.php?id=18
Oh: As a start, a treemap is a visualization technique, not 
java.util.TreeMap. Bigger boxes show a higher score, and x,y location 
has no significance.

Enjoy,
  Dave

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: search multiple indexes

2004-07-01 Thread David Spencer
Stefan Groschupf wrote:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/ 
MultiSearcher.html

100% Right.
I personally found code samples more interesting than just javadoc.
Good point.
That's why my hint; here's the code snippet from Nutch:
But - warning - in normal use of Lucene you don't need the Similarity 
stuff..
/** Construct given a number of indexed segments. */
  public IndexSearcher(File[] segmentDirs) throws IOException {
NutchSimilarity sim = new NutchSimilarity();
Searchable[] searchables = new Searchable[segmentDirs.length];
segmentNames = new String[segmentDirs.length];
for (int i = 0; i < segmentDirs.length; i++) {
  org.apache.lucene.search.Searcher searcher =
new org.apache.lucene.search.IndexSearcher
(new File(segmentDirs[i], "index").toString());
  searcher.setSimilarity(sim);
  searchables[i] = searcher;
  segmentNames[i] = segmentDirs[i].getName();
}
this.luceneSearcher = new MultiSearcher(searchables);
this.luceneSearcher.setSimilarity(sim);
  }
Kent Beck said: "Monkey see, monkey do." ;-)
Cheers,
Stefan
---
enterprise information technology consulting
open technology:   http://www.media-style.com
open source:   http://www.weta-group.net
open discussion:http://www.text-mining.org
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Visualization of Lucene search results with a treemap

2004-07-01 Thread David Spencer
Stefan Groschupf wrote:
Dave,
cool stuff, think aboout to contribute that to nutch.. ;-)!
Well the code is very generic - basically 1 method that takes a 
Searcher, a Query, the # of cells to show, and the size of the diagram. 
Technically I think it would be a Lucene sandbox contribution - but - 
for my site I do want to convert the custom spider/cache to use Nutch...

Do you know:
http://websom.hut.fi/websom/comp.ai.neural-nets-new/html/root.html ?
Interesting - is there any code avail to draw the maps?
thx,
 Dave
Cheers,
Stefan
Am 01.07.2004 um 23:28 schrieb David Spencer:
Inspired by these guys who put results from Google into a treemap...
http://google.hivegroup.com/
I did up my own version running against my index of OSS/javadoc trees.
This query for "thread pool" shows it off nicely:
http://www.searchmorph.com/kat/tsearch.jsp? 
s=thread%20pool&side=300&goal=500

This is the empty search form:
http://www.searchmorph.com/kat/tsearch.jsp
And the weblog entry has a few more links, esp useful if you don't  
know what a treemap is:

http://searchmorph.com/weblog/index.php?id=18
Oh: As a start, a treemap is a visualization technique, not  
java.util.Treemap. Bigger boxes show a higher score, and x,y location  
has no significance.

Enjoy,
  Dave

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---
enterprise information technology consulting
open technology:   http://www.media-style.com
open source:   http://www.weta-group.net
open discussion:http://www.text-mining.org
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


ANN: Experimental site for searching javadoc of OSS projects

2004-06-25 Thread David Spencer
I've put together a kind of experimental site which indexes the javadoc 
of OSS java projects (well, plus the JDK).

http://www.searchmorph.com/
This is meant to solve the problem where a java developer knows 
something has been done before, but where, in what project - source 
forge? jakarta? eclipse? jboss?.

There are at least 2 somewhat unique things here. I use a custom 
analyzer (JavadocAnalyzer) which I recently mentioned on this list in 
another context. With it searches for something like thread pool will 
match tokens like SyncThreadPool or Sync_ThreadPool.

http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1731360
There's also an AIM (AOL) IM bot running. You send it a query and it 
sends you back 5 URLs of matches - web search w/o a browser.

Also inside - it does query expansion so that query terms are checked 
against multiple fields (may be similar to what nutch does).

And I also use the MoreLikeThis query expansion code I wrote - from a 
results page you can find similar URLs to the hits you see. [BTW: this 
doesn't seem to have made it into the sandbox...]

http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1353138
The about page is here:
http://www.searchmorph.com/weblog/index.php?id=7
And the technology inside page elaborates a bit more:
http://www.searchmorph.com/weblog/index.php?id=3
I'm interested in feedback. Does it find matches you expect, and what 
other packages should I index?

thx,
Dave
PS
Surely this has been done before - what's the competition - any other 
similar specialized search engines?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


carrot2 - Re: Categorization

2004-06-23 Thread David Spencer
Otis Gospodnetic wrote:
Hello William,
Lucene does not have a categorization engine, but you may want to look
at Carrot2 (http://sourceforge.net/projects/carrot2/)
May be getting off topic - but maybe not... I can't find an example of how 
to use Carrot2. It builds easily enough, but there's no obvious example of 
what it takes as input (documents?) and what it returns as output 
(some list of clustered docs?). I want to use the local interface to 
it and hook it into Lucene.

thx,
 Dave

Otis
--- William W [EMAIL PROTECTED] wrote:
Hi,
How can I do a categorization of the results ? Is it possible with
the 
Lucene API ?
Thanks,
William.

_
Watch the online reality show Mixed Messages with a friend and enter
to win 
a trip to NY 

http://www.msnmessenger-download.click-url.com/go/onm00200497ave/direct/01/
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: carrot2 - Re: Categorization

2004-06-23 Thread David Spencer
William W wrote:
Hi,
Carrot seems to be very interesting but I didn't find a simple example :(
I will try to use it ! :)
I can't find an example either, but after going through their source I 
think the heart of it is

com.dawidweiss.carrot.filter.stc.algorithm.STCEngine
and  com.dawidweiss.carrot.filter.stc.Processor is a class that drives this.
Lucene hook - hey - I'm trying to integrate the two. I think this is how 
it would be done, get search results from Lucene then set up STCEngine a 
la how Processor does.


Thx,
william.

From: David Spencer [EMAIL PROTECTED]
Reply-To: Lucene Users List [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Subject: carrot2 - Re: Categorization
Date: Wed, 23 Jun 2004 11:50:22 -0700
Otis Gospodnetic wrote:
Hello William,
Lucene does not have a categorization engine, but you may want to look
at Carrot2 (http://sourceforge.net/projects/carrot2/)

May be getting off topic - but maybe not... I can't find an example of 
how to use Carrot2. It builds easily enough, but there's no obvious 
example of what it takes as input (documents?) and what it returns as 
output (some list of clustered docs?). I want to use the local 
interface to it and hook it into Lucene.

thx,
 Dave

Otis
--- William W [EMAIL PROTECTED] wrote:
Hi,
How can I do a categorization of the results ? Is it possible with
the Lucene API ?
Thanks,
William.
_
Watch the online reality show Mixed Messages with a friend and enter
to win a trip to NY
http://www.msnmessenger-download.click-url.com/go/onm00200497ave/direct/01/ 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
_
Get fast, reliable Internet access with MSN 9 Dial-up  now 3 months 
FREE! http://join.msn.click-url.com/go/onm00200361ave/direct/01/

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Fix for advanced tokenizers and highlighter problem

2004-06-22 Thread David Spencer
[EMAIL PROTECTED] wrote:
I think this version of the highlighter should provide a fix: http://www.inperspective.com/lucene/hilite2beta.zip
Before I update the version of the highlighter in the sandbox I'd appreciate feedback from those troubled 
with the issues to do with overlapping tokens in token streams (Erik, Dave, Bruce?)
1st pass of testing - yes, this does indeed fix the problem.
I've realized I may want to modify my Analyzer now too.
I was focusing on the Token position increment instead of the offset.
For something like the case where I broken HashMap into 3 tokens: 
Hash, Map, HashMap, I was returning the same start/end offsets for 
 all of them (thus a search on Map ends up with all of HashMap 
being highlighted). Probably more correct is to return offsets within 
the orig larger token so that you can see exactly where your term 
matched. I'll update my code and then put up a site that demonstrates this.

thx,
 Dave

I added my own test analyzer to the JUnit test that introduces synonyms into the 
token stream at the same
position as the trigger token and the new code works OK for me with that analyzer.
The fix means I needed to change the Formatter interface - this now takes a TokenGroup object instead 
of a token because that can be used to represent a single token OR a sequence of overlapping tokens.
I don't think most people have needed to create custom Formatter implementations so I don't think this
redefined interface should break too much existing code (if any).

Cheers
Mark
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread David Spencer
I've run across an amusing interaction between advanced 
Analyzers/TokenStreams and the very useful term highlighter: 
http://cvs.apache.org/viewcvs/jakarta-lucene-sandbox/contributions/highlighter/

I have a custom Analyzer I'm using to index javadoc-generated web pages.
The Analyzer in turn has a custom TokenStream which tries to more 
intelligently tokenize java-language tokens.

A naive analyzer would turn something like "SyncThreadPool" into one 
token. Mine uses the great Lucene capability of Tokens being able to 
have a 0 position increment to turn it into the token stream:

Sync   (incr = 0)
Thread (incr = 0)
Pool (incr = 0)
SyncThreadPool (incr = 1)
[As an aside maybe it should also pair up the subtokens, so "SyncThread" 
and "ThreadPool" appear too].
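
A minimal, made-up sketch of producing tokens stacked at the same position 
(just the idea, not my actual analyzer - a real TokenStream would queue 
these and hand them out one at a time from next()):

import org.apache.lucene.analysis.Token;

public class SamePositionTokens {
    /** e.g. atSamePosition("SyncThreadPool", new String[] { "Sync", "Thread", "Pool" }, 0, 14) */
    public static Token[] atSamePosition(String whole, String[] parts, int start, int end) {
        Token[] out = new Token[parts.length + 1];
        for (int i = 0; i < parts.length; i++) {
            out[i] = new Token(parts[i], start, end);
            out[i].setPositionIncrement(0);               // stacked on the same position
        }
        out[parts.length] = new Token(whole, start, end); // full token keeps the default increment of 1
        return out;
    }
}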

The point behind this is someone searching for "threadpool" probably 
would want to see a match for "SyncThreadPool" even though this is the evil 
leading-prefix case. With most other Analyzers and ways of forming a 
query this would be missed, which I think is anti-human and annoys me to 
no end.

So the analyzer/tokenizer works great, and I have a demo site about to 
come up that indexes lots of publicly avail javadoc as a kind of 
resource so you can easily find what's already been done.

The problem is as follows. In all cases I use my Analyzer to index the 
documents.
If I use my Analyzer with the Highlighter package, it doesn't look 
at the position increment of the tokens and consequently a nonsense 
stream of matches is output. If I use a different Analyzer w/ the 
highlighter (say, the StandardAnalyzer), then it doesn't show the 
matches that really matched, as it doesn't see the subtokens.

It might be the fix is for the Highlighter to look at the position 
increment of tokens and only pass by one if multiple ones have an incr 
of 0 and match one part of the query.

Has this come up before and is the issue clear?
thx,
Dave
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread David Spencer
[EMAIL PROTECTED] wrote:
Yes, this issue has come up before with other choices of analyzers.
I think it should be fixable without changing any of the highlighter APIs 
- can you email me or post here the source to your analyzer?
 

Code attached - don't make fun of it please :) - very prelim. I think it 
only uses one other file, (TRQueue) also attached (but: note, it's in a 
different package). Also any comments in the code may be inaccurate. The 
general goal is as stated in my earlier mail, examples are:

AlphaBeta ->
Alpha (incr 0)
Beta (incr 0)
AlphaBeta (incr 1)
MAX_INT ->
MAX (incr 0)
INT (incr 0)
MAX_INT (incr 1)
thx,
Dave
Cheers
Mark
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 


package com.tropo.lucene;

import org.apache.lucene.analysis.*;
import java.io.*;
import java.util.*;
import com.tropo.util.*;
import java.util.regex.*;
/**
 * Try to parse javadoc better than other analyzers.
 */
public final class JavadocAnalyzer
extends Analyzer
{

// [A-Za-z0-9._]+
// 
public final TokenStream tokenStream( String fieldName, Reader reader)
{
return new LowerCaseFilter( new JStream( fieldName, reader));
}

/**
 * Try to break up a token into subset/subtokens that might be said to occur 
in the same place.
 */
public static List breakup( String s)
{
// a -> null
// alphaBeta -> alpha, Beta
// XXAlpha -> ?, Alpha
// BIG_NUM -> BIG, NUM

List lis = new LinkedList();

Matcher m;

m = breakupPattern.matcher( s);
while (m.find())
{
String g = m.group();
if ( ! g.equals( s))
lis.add( g);
}

// hard ones
m = breakupPattern2.matcher( s);
while (m.find())
{
String g;
if ( m.groupCount() == 2) // weird XXFoo case
g = m.group( 2);
else
g = m.group();
if ( ! g.equals( s))
lis.add( g);
/*
o.println( gc:  + m.groupCount() +
   / + m.group( 0) + / + m.group( 1) + / 
+ m.group( 2));
*/
//lis.add( m.group());
}   
return lis;
}   


/**
 *
 */
private static class JStream
extends TokenStream
{
private TRQueue q = new TRQueue();
private Set already = new HashSet();
private String fieldName;
private PushbackReader pb;

private StringBuffer sb = new StringBuffer( 32);
private int offset;

// eat white
// have 
private int state = 0;


/**
 *
 */
private JStream( String fieldName, Reader reader)
{
this.fieldName = fieldName;
pb = new PushbackReader( reader);
}


/**
 *
 */
public Token next()
throws IOException
{
if ( q.size() > 0) // pre-calculated
return (Token) q.dequeue();
int c;
int start = offset;
sb.setLength( 0);
offset--;
boolean done = false;
String type = "mystery";
state = 0;

while ( ! done && 
( c = pb.read()) != -1)
{
char ch = (char) c;
offset++;
switch( state)
{
case 0:
if ( Character.isJavaIdentifierStart( ch))
{
start = offset;
sb.append( ch);
state = 1;
type = "id";
}
else if ( Character.isDigit( ch))
  

Re: amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread David Spencer
Erik Hatcher wrote:
On Jun 19, 2004, at 2:29 AM, David Spencer wrote:
A naive analyzer would turn something like "SyncThreadPool" into one 
token. Mine uses the great Lucene capability of Tokens being able to 
have a 0 position increment to turn it into the token stream:

Sync   (incr = 0)
Thread (incr = 0)
Pool (incr = 0)
SyncThreadPool (incr = 1)
[As an aside maybe it should also pair up the subtokens, so 
"SyncThread" and "ThreadPool" appear too].

The point behind this is someone searching for "threadpool" probably 
would want to see a match for "SyncThreadPool" even though this is the evil 
leading-prefix case. With most other Analyzers and ways of forming a 
query this would be missed, which I think is anti-human and annoys me 
to no end.

There are indexing/querying solutions/workarounds to the 
leading-prefix issue, such as reversing the text as you index it and 
ensuring you do the same on queries so they match.  There are some 
interesting techniques for this type of thing in the Managing 
Gigabytes book I'm currently reading, which Lucene could support with 
custom analysis and queries, I believe.
Yeah, great book. I thought my approach fit into Lucene the most 
naturally for my goals - and no doubt, things like just having the 
possibility of different pos increments is a great concept that I 
haven't seen in other search engines. I keep meaning to try an idea that 
appeared on the list some months ago, bumping up the incr between 
sentences so that it's harder for, say, a 2 word phrase to match w/ 1 
word in each sentence (makes sense to a computer, but usually not what a 
human wants).  Another side project...


The problem is as follows. In all cases I use my Analyzer to index 
the documents.
If I use my Analyzer with the Highlighter package, it doesn't 
look at the position increment of the tokens and consequently a 
nonsense stream of matches is output. If I use a different Analyzer 
w/ the highlighter (say, the StandardAnalyzer), then it doesn't show 
the matches that really matched, as it doesn't see the subtokens.

Are your subtokens marked with correct offset values?  This probably 
doesn't relate to the problem you're seeing, but I'm curious.

I think so but this is the first time I've done this kind of thing. When 
I hit the special case several of the subtokens are 1st returned w/ an 
incr of 0, then the normal token, w/ an incr of 1 - which seems to make 
sense to me at least.


It might be that the fix is for the Highlighter to look at the position 
increment of tokens and only pass by one if multiple ones have an 
incr of 0 and match one part of the query.

Has this come up before and is the issue clear?

The problem is clear, and I've identified this issue with my 
exploration of the Highlighter also.  The Highlighter works well for 
the most common scenarios, but certainly doesn't cover all the bases.  
The majority of scenarios do not use multiple tokens in a single 
position.  Also, it doesn't currently handle the new SpanQuery 
family - although Highlighting spans would be quite cool.  After 
learning how Highlighter works, I have a deep appreciation for the 
great work Mark put into it - it is well done.

As for this issue, though, I think your solution sounds reasonable, 
although I haven't thought it through completely.  Perhaps Mark can 
comment.  If you do modify it to work for your case, it
Oh sure, I'll post any changes but wait for Mark for now.
would be great to have your contribution rolled back in :)
Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Proximity Searches behavior

2004-06-09 Thread David Spencer
Erik Hatcher wrote:
On Jun 9, 2004, at 8:53 AM, Terry Steichen wrote:
3) Is there a plan for adding QueryParser support for the SpanQuery 
family?

Another important facet to Terry's question here is what syntax to use 
to express all various types of queries?  I suspect that Google stats
And others - Altavista logs I think...
show us that most folks query with 1 - 3 words and do not use any 
of the advanced features.

But with automagic query expansion these things might be done behind the 
scenes.  Nutch, for one, expands simple queries to check against 
multiple fields, with different boosts, and even gives a bonus for terms 
that are near each other.

The elegance of the query syntax is quite important, and QueryParser 
has gotten a bit hairy.  I would enjoy discussions on creating new 
query parsers (one size doesn't fit all, I don't think) and what syntax

I suggested in some email a while ago making the QueryParser extensible 
at runtime or startup time, so you can add other types of queries that 
it doesn't support - so you have a way of registering these other query 
types (SpanQuery, SubstringQuery etc) and then some syntax like 
span:foo to invoke the query expander registered w/ span on foo...

should be used.
Paul Elschot created a surround query parser that he posted about to 
the list in April.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


extensible query parser - Re: Proximity Searches behavior

2004-06-09 Thread David Spencer
Erik Hatcher wrote:
On Jun 9, 2004, at 12:21 PM, David Spencer wrote:
show us that most folks query with 1 - 3 words and do not use 
any of the advanced features.

But with automagic query expansion these things might be done behind 
the scenes.  Nutch, for one, expands simple queries to check against 
multiple fields, with different boosts, and even gives a bonus for 
terms that are near each other.

Ah yes!  Don't worry, I hadn't forgotten about Nutch.  I'm tinkering 
with its query parsing and analysis as we speak in fact.  Very clever 
indeed.

The elegance of the query syntax is quite important, and QueryParser 
has gotten a bit hairy.  I would enjoy discussions on creating new 
query parsers (one size doesn't fit all, I don't think) and what syntax

I suggested in some email a while ago making the QueryParser 
extensible at runtime or startup time, so you can add other types of 
queries that it doesn't support - so you have a way of registering 
these other query types (SpanQuery, SubstringQuery etc) and then some 
syntax like span:foo to invoke the query expander registered w/ 
span on foo...

I would be curious to see how an implementation of this played out.  
For example, could I add my own syntax such that

"some phrase" -3- "another phrase"
could be parsed into a SpanNearQuery of two SpanNearQuery's?
I like the idea of a flexible run-time grammar, but it sounds too good 
to be true in a general purpose kinda way.
My idea isn't perfect for humans, but at least lets you use queries not 
hard coded.

You have something like
[1] how you register, could be in existing QueryParser
void register( String name,  SubqueryParser qp)
[2] what you register
interface SubQueryParser
{
Query parse( String s); // parses string user enters, forms a Query...
}
[3] example of registration
register( "substring", new SubstringQP());  // instead of prefix matches 
allows term anywhere
register( "span", new SurroundQP());
register( "syn", new SynonymExpanderQP()); // expands a word to include 
synonyms

[4]  syntax
normal query parser syntax but add something else like NAME::TEXT 
(note 2 colons) so

this:  "black syn::bird"
expands to calls in the new extensible query parser,  something like
BooleanQuery bq = ...
bq.add( new TermQuery( "contents", "black"))
bq.add( SubstringParser.parse( "bird")) // really SynonymExpanderQP
return bq
behind the scenes SynonymExpanderQP expanded "bird" to the query 
equivalent of, um, "bird avian^.5 wingedanimal^.5" or whatnot.

[5] the point
Be backward compatible and natural for existing query syntax, but 
leave a hook so that if you innovate and define new query expansion code 
there's some hope of someone using it, as they can in theory drop it in 
and use it w/o coding. Right now if you create some code in this area I 
suspect there's little chance people will try it out, as there's too much 
friction.








Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Setting Similarity in IndexWriter and IndexSearcher

2004-06-07 Thread David Spencer
Does it ever make sense to set the Similarity obj in either (only one 
of..) IndexWriter or IndexSearcher? i.e. If I set it in IndexWriter can 
I avoid setting it in IndexSearcher? Also, can I avoid setting it in 
IndexWriter and only set it in IndexSearcher? I noticed Nutch sets it in 
both places and was wondering about what's going on behind the scenes...

thx,
 Dave
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


No tvx reader

2004-06-05 Thread David Spencer
Using 1.4rc3.
Running an app that indexes 50k documents (thus it just uses an 
IndexWriter).
One field has that boolean set for it to have a term vector stored for 
it, while the other 11 fields don't.

On stdout I see "No tvx file" 13 times.
Glancing thru the src it seems this comes from TermVectorReader.
The generated index seems fine.
What could be causing this and is this normal?
thx,
Dave
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


bonus for exact case match

2004-06-03 Thread David Spencer
Does anyone have any experiences with giving a bonus for exactly 
matching case in queries?

One use case is in the java world: maybe I want to see references to 
Map (java.util.Map)  but am not interested in a (geographical) map.

I believe, in the context of Lucene, one way is to have an Analyzer that 
returns a TokenStream which, in cases where a word has some upper case 
characters, returns the word twice in that position, once as-is and once 
in lower case,  using the magic of Token.getPositionIncrement(). Then 
you'll need a query expander or whatnot which, when given a query like 
Map, expands it to Map^2 map.

Thoughts/comments?
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: similarity of two texts

2004-06-02 Thread David Spencer
Terry Steichen wrote:
Erik,
Could you expand on this just a wee bit, perhaps with an example of how to
compute this vector angle?
I'm tempted to write the code to see how it works, but FYI this doc 
seems to nicely explain the concepts:

http://www.la2600.org/talks/files/20040102/Vector_Space_Search_Engine_Theory.pdf
TIA,
Terry
- Original Message - 
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, June 01, 2004 9:39 AM
Subject: Re: similarity of two texts


On Jun 1, 2004, at 9:24 AM, Grant Ingersoll wrote:
Hey Eric,
Eri*K*  :)

What did you do to calc similarity?
I computed the angle between two vectors.  The vectors are obtained
from IndexReader.getTermFreqVector(docId, field).

 I haven't had time, but was thinking of ways to add the ability to
get the similarity score (as calculated when doing a search) given a
term vector (or just a document id).
It would be quite compute-intensive to do something like this.  This
could be done through a custom sort as well, if applying it at the
scoring level doesn't work.  I haven't given any thought to how this
could work for scoring or sorting before, but does sound quite
interesting.

 Any ideas on how to approach this would be appreciated.  The scoring
in Lucene has always been a bit confusing to me, despite looking at
the code several times, especially once you get into boolean queries,
etc.
No doubt that it is confusing - to me also.  But Explanation is your
friend.
Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: similarity of two texts - another question

2004-06-02 Thread David Spencer
Gerard Sychay wrote:
Hmm, the term vector does not have to consist of only term frequencies,
does it? To give weight to rare terms, could you create a term vector of
(TF*IDF) values for each term?  Then, a distance function would measure
how many terms two vectors have in common, giving weight to how many
rare terms two vectors have in common.
Yeah, but if you're gonna do that why not just form a query with all 
words in the source document, and let the Lucene engine do the idf/tf 
calculations? I've done this and it seems to work fine.

Here's code I've used. It could be done better by avoiding QueryParser, 
and odds are it could hit that exception for too many clauses in a 
boolean expression unless you configure Lucene away from its default, but 
this is the idea. srch is the entire body of the source document.

public static Query formSimilarQuery( String srch, Analyzer a)
    throws org.apache.lucene.queryParser.ParseException, IOException
{
    StringBuffer sb = new StringBuffer();
    TokenStream ts = a.tokenStream( "foo", new StringReader( srch));
    org.apache.lucene.analysis.Token t;
    while ( (t = ts.next()) != null)
    {
        sb.append( t.termText() + " ");
    }
    return QueryParser.parse( sb.toString(), DFields.CONTENTS, a);
}


David Spencer [EMAIL PROTECTED] 06/01/04 08:25PM 
Erik Hatcher wrote:

On Jun 1, 2004, at 4:41 PM, uddam chukmol wrote:

Well, a question again, how does Lucene compute the score between a 

document and a query?

And I might add, thus, this approach to similarity gives more weight to
rare terms that match, which one might want for this kind of similarity
measure.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Page ranking

2004-06-01 Thread David Spencer
Scott Sayles wrote:
Is there anyone out there that has page ranking implemented on top of
Lucene?
 

I recently discovered JUNG which has 2 impls of PageRank:
http://jung.sourceforge.net/api/1.4.1/edu/uci/ics/jung/algorithms/importance/PageRank.html
I did a test of hooking it up to my spider and calculating pagerank of 
all pages in a javadoc tree (experimented with both 
http://jakarta.apache.org/lucene/docs/api/overview-summary.html and 
http://java.sun.com/j2se/1.4.2/docs/api/overview-summary.html). 

The basic procedure is:
[1] grab all pages to a local cache while building a table of page-page 
links
[2] using the page-page link data, calculate pageranks with JUNG and 
cache this
[3] go thru the cache and index the pages (to a Lucene index), setting each 
document's boost (Document.setBoost()) to the pagerank value

I've just got this going over the weekend. Prelim results are 
disappointing.  Pages like 
http://java.sun.com/j2se/1.4.2/docs/api/deprecated-list.html get a high 
pagerank as all kinds of pages link to it, though when I search javadoc 
I never want that page. It might be that this turns out better, however - I'm 
not doing any query expansion now, though next pass I'll auto-boost for 
title matches.

I can make available a table of pageranks (URL,pagerank pairs) for these 
runs if people want.

Just in case anyone may be thinking otherwise, when I say page ranking
I'm not referring to the ranking of results from searches.  I'm talking
about something similar to how google computes what page may be more
relevant or important (often referred to as PageRank) which is effected
in part by how many other pages reference that page.
I've been through the examples listed here:
http://www.iprcom.com/papers/pagerank/index.html
which provides information from the origianl google paper about page
ranking.  Running the examples are fairly easy, but the big question I
have is how can I practically update such data?  

I think this is a batch operation, you have to precalc it when indexing 
the entire collection.

And is there any
potential integration with Lucene? 

My thoughts are Doc.setBoost or just a plain field and store it there 
and use it to sort the results.

It would seem that one could store
the computed ranking values in the actual Lucene Document itself, but
the updates 

Unless something has changed, indexes are write-only. You really can't 
update an index other than deleting a doc and re-adding it, and to calc 
pagerank you need all links between pages.

would be fairly laborious as a few minor changes in rankings
can produce a large ripple in other related document rankings.  This, of
course, would be the same issue if the ranking information were stored
outside of Lucene.  One could potentially store this in a separate
database and then look up the ranking information for each document
found and then perform updates as an external asynchronous task.
Anyone have any experience with maintaining page rankings?
 

It might be of interest to see what Nutch does. It doesn't use pagerank 
but it does seem to care about the # of incoming links. I think the key 
file is IndexSegment ( see the src, not the jdoc).

Thanks,
Scott
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: about search and update one index simultaneously

2004-06-01 Thread David Spencer
xuemei li wrote:
Hi,all,
 

see this:
http://wiki.apache.org/jakarta-lucene/UpdatingAnIndex
Can we do search and update one index simultaneously? Does someone know
anything about it? I have done some experiments. Now the search will be blocked
when the index is being updated. The error on the search node is like this:
   caught a class java.io.IOException
   with message:Stale NFS file handle
Thanks
Xuemei Li

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: similarity of two texts - another question

2004-06-01 Thread David Spencer
Erik Hatcher wrote:
On Jun 1, 2004, at 4:41 PM, uddam chukmol wrote:
Well, a question again, how does Lucene compute the score between a  
document and a query?

And I might add, thus, this approach to similarity gives more weight to 
rare terms that match, which one might want for this kind of similarity 
measure.

Using the equation here:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/ 
Similarity.html


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


now maybe Mozlla/IMAP URLs - Re: StandardTokenizer and e-mail

2004-05-21 Thread David Spencer
This reminds me - if you have a search engine that indexes a mail store 
and you present results in a web page to a browser, you want to (of 
course...well I think this is obvious)  send back a URL that would cause 
the user's native mail client to pull up the msg.
IMAP has a URL format, and I use Mozilla on windows to browse & read 
mail, however when I've presented IMAP URLs on a results page the IMAP 
URL doesn't work - either nothing happens or the cursor changes to busy 
but still no mail comes up. Has anyone come across this? This may be 
more appropriate for a moz list but it's definitely a search issue.

This page mentions the problem:
http://www.mozilla.org/projects/security/known-vulnerabilities.html
A writeup on an IMAP indexer I did a while ago:
http://www.tropo.com/techno/java/lucene/imap.html

Albert Vila wrote:
Hi all,
I want to achieve the following: when indexing 
'[EMAIL PROTECTED]', I want to index the '[EMAIL PROTECTED]' token, then 
the 'xyz' token, the 'company' token and the 'com' token.
This way, you'll be able to find the document searching for 
'[EMAIL PROTECTED]', for 'xyz' only, or for 'company' only.

How can I achieve that?, I need to write my own tokenizer?
Thanks
Albert

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


asktog on search problems

2004-05-21 Thread David Spencer
Haven't seen this discussed here.
See 7a at the link below:
http://www.asktog.com/columns/062top10ReasonsToNotShop.html
7a talks about searching on a camera site for the Lowepro 100 AW.
He says this query works: "Lowepro 100 AW"
and this query does not work: "Lowepro 100AW"
Cross checking with google indeed shows that the 1st form is much more 
popular, however the 2nd form is used, and if you're a commerce site or 
a site that wants to make it easier for users to find things you should 
help them out.

So the discussion question is what's the best way to handle this.
I guess the somewhat general form of this is that in a query, a term 
might be split into 2 terms that are individually indexed (so 100AW is 
not indexed, but 100 and AW are).
In a way the flip side of this is that any 2 terms could be concatenated 
to form another term that was indexed (so in another universe it might 
be that passing 100 AW is not as precise as passing 100AW but how's 
the user to know).

In the context of Lucene, ways to handle this seem to be:
- automagically run a fuzzy query (so if a query doesn't work, transform 
Lowepro 100AW to Lowepro~ 100AW~)
- write a query parser that breaks apart unindexed tokens into ones that 
are indexed (so 100AW becomes 100 AW)
- write a tokenizer that inserts dummy tokens for every pair of tokens, 
so the stream Lowepro 100 AW would also have Lowepro100 and 100AW 
inserted, presumably via magic w/ TokenStream.next() (rough sketch below)

Comments on best way to handle this?



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Scoring documents by Click Count

2004-05-06 Thread David Spencer
Otis Gospodnetic wrote:

Sure.
On click, get document Id (not internal docId, but something you use as
a surrogate primary key) of the clicked document.  Retrieve the
document.  Pull out the value of 'clickCount' field.  +1 it.  Delete
the document, and re-add it (there is no 'update(Document)' method).
 

Yeah but isn't the essence of it that Lucene is really not set up for 
dynamically adjusting the *score*?
Also, above, to clarify, I think you're implying there are 2 entries for 
a given doc - one Document for the indexed content, and one for the 
clickCount, as (from memory) I didn't think you could even re-add a doc 
w/o reindexing it...

Otis

--- Centaur zeus [EMAIL PROTECTED] wrote:
 

Hi all,

I want to integrate lucene into my web app. I would like to increase
the 
score of the document when more people click on it. Could I implement
that 
in lucene ?

Thanks.

Perseus

_
MSN 8 helps eliminate e-mail viruses. Get 2 months FREE*. 
http://join.msn.com/?page=features/virus

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
   



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene index - information

2004-03-19 Thread David Spencer
Karl Koch wrote:

If I create an standard index, what does Lucene store in this index?

What should be stored in an index at least? Just a link to the file and
keywords? Or also wordnumbers? What else?
Does somebody know a paper which discusses this problem of what to put in
a good universal IR index?


Well if you want a textbook I found Managing Gigabytes to have 
excellent coverage of the internals and messy details of search/indexes.

http://www.amazon.com/exec/obidos/ASIN/1558605703/tropoA
http://www.cs.mu.oz.au/mg/


Cheers,
Karl


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: incomplete word match

2004-03-11 Thread David Spencer
SubstringQuery, my humble contribution.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg06388.html

Tomcat Programmer wrote:

I have a situation where I need to be able to find
incomplete word matches, for example a search for the
string 'ape' would return matches for 'grapes'
'naples' 'staples' etc.  I have been searching the
archives of this user list and can't seem to find any
example of someone doing this. 

At one point I recall finding someone's site (on
Google) who indicated that their search engine was
Lucene, and they offered the capability of doing this
type of matching. However I can't seem to find that
site again to save my life!  

Has anyone been successful in implementing this type
of matching with Lucene? If so, would you be able to
share some insight as to how you did it? 

Thanks in advance! 

-TP

__
Do you Yahoo!?
Yahoo! Search - Find what youre looking for faster
http://search.yahoo.com
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread David Spencer
Maybe I missed something but I always thought the stop list should be a 
Set, not a Map (or Hashtable/Dictionary). After all, all you need to 
know is existence and that's what a Set does.

Doug Cutting wrote:

Erik Hatcher wrote:

Well, one issue you didn't consider is changing a public method 
signature.  I will make this change, but leave the Hashtable signature 
method there.  I suppose we could change the signature to use a Map 
instead, but I believe there are some issues with doing something like 
this if you do not recompile your own source code against a new Lucene 
JAR so I will simply provide another signature too.


This is also a problem for folks who're implementing analyzers which use 
StopFilter.  For example:

public class MyAnalyzer extends Analyzer {

  private static Hashtable stopTable =
StopFilter.makeStopTable(stopWords);
  public TokenStream tokenStream(String field, Reader reader) {
... new StopFilter(stopTable) ...
}

This would no longer compile with the change Kevin proposes.

To make things back-compatible we must:

1. Keep but deprectate StopFilter(Hashtable) constructor;
2. Keep but deprecate StopFilter.makeStopTable(String[]);
3. Add a new constructor: StopFilter(HashMap);
4. Add a new method: StopFilter.makeStopMap(String[]);
Does that make sense?

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Database

2004-02-26 Thread David Spencer
Parminder Singh wrote:

I've a CMS application that deploys metadata to a database. Is it possible to use Lucene to search this database instead of its (Lucene's) index? If you could tell me the steps that would be involved in doing this, it'd be a great help. I'm new to Lucene.
 

I've done this extensively. Basically you create documents out of the 
database and in my case I generated a URL for each doc which would be 
fed to users after a query.  This is one of those things where Lucene 
stands apart from what seemed to be the alternatives a few years ago 
(htdig was one thing I used) -- it doesn't have to spider a web site, 
and if you have a dynamic web site (pages generated from db queries) 
then the indexing is in a way more efficient as you don't have to parse 
html to extract what may or may not be the actual text - the db will 
have your exact content so you index off the db, not from web pages.

Thank You.

Parminder Singh



In war: resolution. In defeat: defiance. In victory: magnanimity. In peace: goodwill. - Sir Winston Leonard Spencer Churchill

*
Disclaimer
This message (including any attachments) contains 
confidential information intended for a specific 
individual and purpose, and is protected by law. 
If you are not the intended recipient, you should 
delete this message and are hereby notified that 
any disclosure, copying, or distribution of this
message, or the taking of any action based on it, 
is strictly prohibited.

*
Visit us at http://www.mahindrabt.com
 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

