spans directory in the CVS version

2004-02-11 Thread Nicolas Maisonneuve
Hi,
Recently a new subdirectory, spans, appeared in the search directory. What is it
and how do I use it?

thanks in advance
nicolas maisonneuve

thanks for your mail

2004-02-11 Thread [EMAIL PROTECTED]
Received your mail we will get back to you shortly


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: how to re-index

2004-02-11 Thread Otis Gospodnetic
Update in Lucene means: delete the document and then re-add it.
This may be a FAQ.

Otis

--- Markus Brosch [EMAIL PROTECTED] wrote:
  However, I have problems with re-indexing.
  First, I index all my object contents. Then some of these objects can
  change and need to be re-indexed.

  I did it with IndexWriter(Dir, Analyzer, FALSE). With the boolean value
  false, the new document is added to the index, but the old document
  still remains in the index :-/

 Sorry for the second mail, but maybe I should say that I am looking
 for an UPDATE of the index! What I am doing at the moment is adding (see
 above) and deleting with IndexReader ...
 
 Thanks ;-)
 
 
 





Re: spans directory in the CVS version

2004-02-11 Thread Erik Hatcher
On Feb 11, 2004, at 5:00 AM, Nicolas Maisonneuve wrote:
Hi,
Recently a new subdirectory, spans, appeared in the search directory.
What is it and how do I use it?
Have a look at the test cases which use the new features, and also see 
the CHANGES file which mentions it.

	Erik



Re: commit.lock file

2004-02-11 Thread Otis Gospodnetic
If there are commit.lock files being left over, you should really
investigate why that is happening.  Something is probably dying, and you
are not catching it and cleaning up by closing things like IndexReader
or IndexWriter.
If you want to forcefully unlock the index, use isLocked and unlock
methods in IndexWriter.  Not recommended, though.

Otis


--- Supun Edirisinghe [EMAIL PROTECTED] wrote:
 Hi everybody, I'm new to the mail list.
 
 I'm also new to using Lucene.
 
 We use lucene to index some of our pages.
 
 Sometimes (for a reason unknown to us) a commit.lock file is left behind,
 and searches using the index don't work.
 
 What are some of the causes for this commit.lock file to persist?
 
 I've read in the FAQ that it is written so that access to the segments
 is synchronized correctly.
 
 What are some good strategies to make this file go away? Would it
 be a good idea to assign a program to just check the timestamp on
 that file and delete it if it has been there for a long time?
 
 all comments are welcome.
 
 thanks
 
 





Re: featues page in the Lucene web site

2004-02-11 Thread Otis Gospodnetic
Nicolas,

That SF page is out of date now.
The best way to learn about different features right now is by reading
articles about Lucene (links on the site) or browsing the Javadocs
(also linked on the site).

Erik Hatcher and I are finishing up a book about Lucene.
Once published, this will be the most comprehensive and up to date
documentation about Lucene.

Otis Gospodnetic


--- Nicolas Maisonneuve [EMAIL PROTECTED] wrote:
 Hi,
 it would be great if a page with all the features of Lucene were
 created on the Apache Lucene site!
 
 On the SourceForge website there is such a page
 (http://lucene.sourceforge.net/features.html), but is it up to date?
 
 thanks in advance
 nicolas maisonneuve





Leading Wild Card Search

2004-02-11 Thread Vipul Sagare
Lucene docs, FAQs, and other research indicate:

Note: Leading wildcards (e.g. *ook) are not supported.

Is there any workaround to implement such a feature (if one has
to implement it)?


RE: Leading Wild Card Search

2004-02-11 Thread Wesley MacDonald

Flip your text and add it as another field; when the user enters *word,
you can search that field for drow*.
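A minimal sketch of this reversed-field trick in plain Java. The class and method names are my own; only the rewrite logic is shown (index-time field handling is omitted), and query handling is simplified to a single leading `*`:

```java
// Sketch of the reversed-field trick for leading wildcards.
// Names here (ReversedField, rewriteLeadingWildcard) are illustrative.
public class ReversedField {

    // At index time: store the reversed form of each token in a parallel field.
    static String reverse(String token) {
        return new StringBuilder(token).reverse().toString();
    }

    // At query time: turn a leading-wildcard query (*word) into a
    // trailing-wildcard query (drow*) to run against the reversed field.
    static String rewriteLeadingWildcard(String query) {
        if (query.startsWith("*")) {
            return reverse(query.substring(1)) + "*";
        }
        return query; // not a leading wildcard; leave untouched
    }

    public static void main(String[] args) {
        System.out.println(rewriteLeadingWildcard("*word")); // drow*
    }
}
```

The cost is a second indexed field (roughly doubling term storage for that field), in exchange for turning an unsupported leading wildcard into an ordinary prefix query.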

Wesley

-Original Message-
From: Vipul Sagare [mailto:[EMAIL PROTECTED] 
Sent: February 11, 2004 1:54 PM
To: [EMAIL PROTECTED]
Subject: Leading Wild Card Search

Lucene docs, FAQs, and other research indicate:

Note: Leading wildcards (e.g. *ook) are not supported.

Is there any workaround to implement such a feature (if one has to
implement it)?








Re: Leading Wild Card Search

2004-02-11 Thread David Spencer
Vipul Sagare wrote:

Lucene docs, FAQs, and other research indicate:

Note: Leading wildcards (e.g. *ook) are not supported.

Is there any workaround to implement such a feature (if one has
to implement it)?

I've written a PrefixQuery and it's not hard to do - I can post it too.
The problem is that it is not integrated into the query parser (.jj), so odds
are no one will use it, and the general sentiment on this list (and
lucene-dev) is that prefix queries are evil because they are expensive:
the query code has to traverse all terms to expand the query. I would prefer
a more user-oriented view, i.e. just allow it, as sometimes it's what you
need, and the only alternative I can think of, a fuzzy query, isn't quite right.





Re: commit.lock file

2004-02-11 Thread Supun Edirisinghe
thanks Otis

You are right. Is there a way for the thread using isLocked and unlock to
know how old the lock is?

My assumption is that if it is older than a couple of seconds, it is from
something dying, or some branch where something is uncaught.

I guess I can try looking at the timestamp of the commit.lock file using
IO. I'm worried that reading the lock file will take many times the
search time. I guess I won't know until I test, and that time will only
come up in cases where the lock is set. I didn't see any methods in the API
for finding the age of the lock. Am I wrong?

thanks again
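Since the API exposes no lock age, one option is exactly what the poster suggests: read the lock file's filesystem timestamp directly. A minimal sketch; the index path is hypothetical and the 60-second staleness threshold is illustrative:

```java
import java.io.File;

// Judge whether a leftover commit.lock is stale by its filesystem timestamp.
// Path and threshold are illustrative, not part of Lucene's API.
public class LockAge {

    static boolean isStale(File lockFile, long maxAgeMillis) {
        if (!lockFile.exists()) {
            return false; // no lock present, nothing to clean up
        }
        long ageMillis = System.currentTimeMillis() - lockFile.lastModified();
        return ageMillis > maxAgeMillis;
    }

    public static void main(String[] args) {
        // Hypothetical index location; adjust to the real index directory.
        File lock = new File("/path/to/index/commit.lock");
        // Treat a lock older than 60 seconds as likely left by a dead process.
        System.out.println(isStale(lock, 60000L));
    }
}
```

Note the race inherent in this approach: a lock can be old but still legitimately held by a long-running merge, so deleting by age alone (or calling IndexReader.unlock) can corrupt a live index, which is why Otis advises fixing the crash instead.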


On Wed, 2004-02-11 at 03:33, Otis Gospodnetic wrote:
 If there are commit.lock files being left over, you should really
 investigate why that is happening.  Something is probably dying, and you
 are not catching it and cleaning up by closing things like IndexReader
 or IndexWriter.
 If you want to forcefully unlock the index, use isLocked and unlock
 methods in IndexWriter.  Not recommended, though.
 
 Otis
 
 





Re: Did you mean...

2004-02-11 Thread Matt Tucker
Timo,

We implemented that type of system using a spelling engine by Wintertree:

http://www.wintertree-software.com

There are some free Java spelling packages out there too that you could 
likely use.

Regards,
Matt
[EMAIL PROTECTED] wrote:

Hi!

Can I do things like Google's Did you mean...? correction for mistyped words 
with Lucene?

Warm Regards,
Timo


Re: ANNOUNCE: Plucene

2004-02-11 Thread lucene
Hi!

Somewhat off-topic: is there a PHP port of Lucene?

Warm regards
Timo




Re: ANNOUNCE: Plucene

2004-02-11 Thread Erik Hatcher
In this case, I'd recommend calling out to Lucene, CLucene, or
Plucene.

Sam Ruby plugged it into his Perl-based blog like this:  
http://radio.weblogs.com/0101679/stories/2002/08/13/ 
luceneSearchFromBlosxom.html

On Feb 11, 2004, at 6:23 PM, [EMAIL PROTECTED] wrote:

Hi!

Somewhat off-topic: is there a PHP port of Lucene?

Warm regards
Timo


code for more like this query expansion - was - Re: setMaxClauseCount ??

2004-02-11 Thread David Spencer
Doug Cutting wrote:

Karl Koch wrote:

Do you know of good papers about strategies for how
to select keywords effectively, beyond the scope of stopword lists and
stemming?

Using term frequencies of the document is not really possible, since Lucene
does not provide access to a document vector, is it?


Lucene does let you access the document frequency of terms, with 
IndexReader.docFreq().  Term frequencies can be computed by 
re-tokenizing the text, which, for a single document, is usually fast 
enough.  But looking up the docFreq() of every term in the document is 
probably too slow.

You can use some heuristics to prune the set of terms, to avoid 
calling docFreq() too much, or at all.  Since you're trying to 
maximize a tf*idf score, you're probably most interested in terms with 
a high tf. Choosing a tf threshold even as low as two or three will 
radically reduce the number of terms under consideration.  Another 
heuristic is that terms with a high idf (i.e., a low df) tend to be 
longer.  So you could threshold the terms by the number of characters, 
not selecting anything less than, e.g., six or seven characters.  With 
these sorts of heuristics you can usually find a small set of, e.g., ten 
or fewer terms that do a pretty good job of characterizing a document.
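The pruning heuristics above can be sketched in plain Java, independent of Lucene: count term frequencies by re-tokenizing the text, then keep only terms that clear both a frequency and a length threshold. The tokenizer here is a crude whitespace/punctuation split standing in for a real Analyzer, and the thresholds (2 and 6) are the illustrative values from the paragraph above:

```java
import java.util.HashMap;
import java.util.Map;

// Prune more-like-this candidate terms with Doug's two heuristics:
// a minimum in-document frequency (tf) and a minimum term length.
public class TermPruner {

    static Map<String, Integer> candidateTerms(String text, int minFreq, int minLen) {
        // Re-tokenize the document text and count term frequencies.
        Map<String, Integer> tf = new HashMap<String, Integer>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.length() == 0) continue;
            Integer n = tf.get(token);
            tf.put(token, n == null ? 1 : n + 1);
        }
        // Keep only terms that clear both thresholds.
        Map<String, Integer> kept = new HashMap<String, Integer>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            if (e.getValue() >= minFreq && e.getKey().length() >= minLen) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

Only the surviving terms would then be looked up with IndexReader.docFreq() to compute tf*idf, which is the expensive step the pruning is meant to avoid.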

It all depends on what you're trying to do.  If you're trying to eke 
out that last percent of precision and recall regardless of 
computational difficulty so that you can win a TREC competition, then 
the techniques I mention above are useless.  But if you're trying to 
provide a "more like this" button on a search results page that does a 
decent job and has good performance, such techniques might be useful.

An efficient, effective more-like-this query generator would be a 
great contribution, if anyone's interested.  I'd imagine that it would 
take a Reader or a String (the document's text), an Analyzer, and 
return a set of representative terms using heuristics like those 
above.  The frequency and length thresholds could be parameters, etc.


Well, I've done a preliminary implementation of the above. Maybe someone 
could proofread my code.
The code is hot off the presses and seems to work...

Questions are:
[a] is the code right?
[b] are any more (or fewer) params needed to properly genericize the 
algorithm? e.g. max words to return?
[c] I can tweak the code to be a little more usable... does it make sense 
to return, say, a Query?
[d] then the eternal question - I think these things are interesting, but 
my theory is that Queries (is-a Query impls) which are not integrated 
into the QueryParser will never really be used

Anyway:

There are two parts - the main() quick test I did, which is not set up to 
run on another system right now, but shows how the mlt routine 
(mlt = MoreLikeThis) is called:



   public static void main( String[] a)
   throws Throwable
   {
   Hashtable stopTable = StopFilter.makeStopTable( 
StopAnalyzer.ENGLISH_STOP_WORDS);
   String fn = "c:/Program Files/Apache " +
"Group/Apache/htdocs/manual/vhosts/index.html.en";
   PrintStream o = System.out;
   final IndexReader r = IndexReader.open( "localhost_index");

   String body = new com.tropo.html.HTMLTextMuncher( new 
FileInputStream( fn)).getText();
   PriorityQueue q = mlt( new StringReader( body), 
getDefAnalyzer(), r, "contents", 2, stopTable, 0, 0);

   o.println( "res... " + q.size());
   o.println();
   Object cur;
   while ( (cur = q.pop()) != null)
   {
   Object[] ar = (Object[]) cur;
   o.println( ar[ 0] + " = " + ar[ 1]);
   }
   }



And the impl which will compile with appropriate imports.

import java.io.*;
import java.util.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.document.*;
import org.apache.lucene.search.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.*;
   /**
* Find words for a more-like-this query former.
*
* @param r the reader that has the content of the document
* @param a the analyzer to parse the reader with
* @param ir the index reader to look up document frequencies in
* @param field the field of interest in the document
* @param minFreq filter out terms that occur less often than this in the 
document
* @param stop a table of stopwords to ignore
* @param minLen ignore words shorter than this length, or pass in 0 to 
not use this
* @param maxLen ignore words longer than this length, or pass in 0 
to not use this
* @return a priority queue ordered by docs with the largest score 
(tf*idf)
*/
   public static PriorityQueue mlt( Reader r,
Analyzer a,
IndexReader ir,
String field,
int minFreq,
Hashtable stop,
int minLen,
int maxLen)
   throws IOException
   {
   Similarity sim = new DefaultSimilarity(); // for 

Re: What is the status of Query Parser AND / OR ?

2004-02-11 Thread Morus Walter
Daniel B. Davis writes:
 There was a lot of correspondence during December about this.
 Is there any further resolution?
 
There's a patch and I hope it will find its way into the Lucene 
sources.

See: http://issues.apache.org/bugzilla/show_bug.cgi?id=25820

Seems I missed the mail about Otis's latest comment.
Sorry about that, I'll take a look at these issues ASAP.

Morus




Re: Did you mean...

2004-02-11 Thread lucene
On Thursday 12 February 2004 00:15, Matt Tucker wrote:
 We implemented that type of system using a spelling engine by Wintertree:

 http://www.wintertree-software.com

 There are some free Java spelling packages out there too that you could
 likely use.

But this does not ensure that the word really exists in the index. Google, 
however, only proposes words that do exist.
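One way to get index-aware suggestions is to rank only terms that actually occur in the index by edit distance to the mistyped word. This sketch uses a plain Levenshtein distance and a stand-in term set; a real implementation would enumerate the candidate terms from the index itself (e.g. via IndexReader.terms()):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// "Did you mean..." restricted to words known to exist in the index.
// The term set is a stand-in for terms enumerated from a real index.
public class DidYouMean {

    // Classic Levenshtein edit distance between two strings.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Suggest only index terms within maxDist edits of the mistyped word.
    static List<String> suggest(String word, Set<String> indexTerms, int maxDist) {
        List<String> out = new ArrayList<String>();
        for (String term : indexTerms) {
            if (editDistance(word, term) <= maxDist) out.add(term);
        }
        return out;
    }
}
```

A linear scan over all index terms is expensive for large indexes; restricting candidates by first letter or length before computing distances is a common optimization.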

Regards
Timo
