Re: Starts With x and Ends With x Queries

2005-02-06 Thread Erik Hatcher
On Feb 4, 2005, at 9:37 PM, Chris Hostetter wrote:
If you want to start doing suffix queries (ie: all names ending with
"s", or all names ending with "Smith") one approach would be to use
WildcardQuery, which as Erik mentioned, will allow you to use a query
Term that starts with a "*". ie...

   Query q3 = new WildcardQuery(new Term("name", "*s"));
   Query q4 = new WildcardQuery(new Term("name", "*Smith"));

(NOTE: Erik says you can do this, but the docs for WildcardQuery say
you can't. I'll assume the docs are wrong and Erik is correct.)

I assume you mean this comment on WildcardQuery's javadocs:

In order to prevent extremely slow WildcardQueries, a Wildcard term
must not start with one of the wildcards <code>*</code> or
<code>?</code>.

I don't read that as saying you cannot use an initial wildcard
character, but rather as "if you use a leading wildcard character you
risk performance issues".  I'm going to change "must" to "should".  And
yes, WildcardQuery itself supports a leading wildcard character exactly
as you have shown.

Which leads me to my point: if you denormalize your data so that you
store both the Term you want, and the *reverse* of the term you want,
then a Suffix query is just a Prefix query on a reversed field -- by
sacrificing space, you can get all the speed efficiencies of a
PrefixQuery when doing a SuffixQuery...

   D1 name:"Adam Smith" rname:"htimS madA" age:13 state:CA ...
   D2 name:"Joe Bob" rname:"boB eoJ" age:42 state:WA ...
   D3 name:"John Adams" rname:"smadA nhoJ" age:35 state:NV ...
   D4 name:"Sue Smith" rname:"htimS euS" age:33 state:CA ...

   // PrefixQuery takes the bare prefix, so no trailing * is needed
   Query q1 = new PrefixQuery(new Term("name", "J"));
   Query q2 = new PrefixQuery(new Term("name", "Sue"));
   Query q3 = new PrefixQuery(new Term("rname", "s"));
   Query q4 = new PrefixQuery(new Term("rname", "htimS"));

(If anyone sees a flaw in my theory, please chime in)

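The reversed-field idea above can be tried out without Lucene at all. What follows is only a minimal sketch, assuming a sorted set can stand in for the term dictionary of the rname field; the class and method names (ReversedFieldSketch, suffixMatches) are made up for illustration and are not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustration only: a TreeSet plays the role of the sorted term
// dictionary of a reversed field ("rname" in the example above).
final class ReversedFieldSketch {
    private final TreeSet<String> reversedTerms = new TreeSet<>();

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // "Index" a value by storing its reversed form.
    void add(String value) {
        reversedTerms.add(reverse(value));
    }

    // A suffix lookup becomes a prefix scan over the reversed terms:
    // seek to the reversed suffix, then walk forward while it still matches.
    List<String> suffixMatches(String suffix) {
        String prefix = reverse(suffix);
        List<String> out = new ArrayList<>();
        for (String term : reversedTerms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can match
            }
            out.add(reverse(term));
        }
        return out;
    }

    public static void main(String[] args) {
        ReversedFieldSketch idx = new ReversedFieldSketch();
        for (String n : new String[] {"Adam Smith", "Joe Bob", "John Adams", "Sue Smith"}) {
            idx.add(n);
        }
        System.out.println(idx.suffixMatches("Smith"));
        System.out.println(idx.suffixMatches("s"));
    }
}
```

The point of the sketch is that the scan touches only the contiguous run of terms sharing the reversed prefix, which is the same reason a prefix lookup is cheaper than a leading-wildcard scan over every term.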
This trick has been mentioned on this list before, and is a good one.
I'll go one step further and mention another technique I found in the
book "Managing Gigabytes", making *string* queries drastically more
efficient for searching (though also impacting index size).  Take the
term "cat".  It would be indexed with all rotated variations with an
end-of-word marker added:

cat$
at$c
t$ca
$cat
The query for *at* would be preprocessed and rotated such that the 
wildcards are collapsed at the end to search for at* as a 
PrefixQuery.  A wildcard in the middle of a string like c*t would 
become a prefix query for t$c*.
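
The rotation scheme described above can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not Lucene code: the names PermutermSketch, rotations, toPrefix and matches are hypothetical, and only patterns of the form X*, *X, X*Y and *X* are handled.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: rotated-term indexing as described above.
final class PermutermSketch {
    static final char END = '$';

    // All rotations of term + '$', e.g. cat -> cat$, at$c, t$ca, $cat.
    static List<String> rotations(String term) {
        String marked = term + END;
        List<String> out = new ArrayList<>();
        for (int i = 0; i < marked.length(); i++) {
            out.add(marked.substring(i) + marked.substring(0, i));
        }
        return out;
    }

    // Rewrite a wildcard pattern into the prefix to look up among rotations.
    static String toPrefix(String pattern) {
        if (pattern.length() > 2 && pattern.startsWith("*") && pattern.endsWith("*")
                && pattern.indexOf('*', 1) == pattern.length() - 1) {
            return pattern.substring(1, pattern.length() - 1); // *X* -> X
        }
        int star = pattern.indexOf('*');
        if (star < 0 || star != pattern.lastIndexOf('*')) {
            throw new IllegalArgumentException("unsupported pattern: " + pattern);
        }
        // X*Y: rotate X*Y$ until the '*' is last, giving Y$X*
        return pattern.substring(star + 1) + END + pattern.substring(0, star);
    }

    // A term matches if any of its rotations starts with the rewritten prefix.
    static boolean matches(String term, String pattern) {
        String prefix = toPrefix(pattern);
        for (String rotation : rotations(term)) {
            if (rotation.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(rotations("cat"));
        System.out.println(toPrefix("*at*") + " " + toPrefix("c*t"));
        System.out.println(matches("cat", "c*t") + " " + matches("cab", "c*t"));
    }
}
```

In a real index each rotation would be a separate term, so a word of n characters costs n + 1 terms; that is the index-size impact mentioned above.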

Has anyone tried this technique with Lucene?
Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: PHP-Lucene Integration

2005-02-06 Thread Maurits van Wijland
Hi Owen,
This can easily be done! Simply install Tomcat on port 8080 and create a
jk2 or proxy that points to Tomcat. Then all requests for JSPs can be
sent to Tomcat. The search engine can even be placed on a separate
server. If you give me some details on your server, I will create a
proxy script for your Apache!

regards,
Maurits
Owen Densmore wrote:
I'm building a lucene project for a client who uses php for their 
dynamic web pages.  It would be possible to add servlets to their 
environment easily enough (they use apache) but I'd like to have 
minimal impact on their IT group.

There appears to be a php java extension that lets php call back &
forth to java classes, but I thought I'd ask here if anyone has had 
success using lucene from php.

Note: I looked in the Lucene In Action search page, and yup, I bought 
the book and love it!  No examples there tho.  The list archives 
mention that using java lucene from php is the way to go, without 
saying how.  There's mention of a lucene server and a php interface to 
that.  And some similar comments.  But I'm a bit surprised there's not 
a bit more in terms of use of the official java extension to php.

Thanks for the great package!
Owen


Re: PHP-Lucene Integration

2005-02-06 Thread Kelvin Tan
How about XML-RPC/SOAP, or REST?

For REST, just have a servlet listening for HTTP GETs and respond with XML that
your PHP app can parse (for searching). For indexing, let's say you want to
index an uploaded file: construct a URL with the fields and field values, and
also pass the location of the file on the FS. Shouldn't be that difficult.

I'm guessing it's more desirable to have all your code in one place, which is an
advantage of using Java in PHP. But it feels cleaner to have the Java stuff in
one codebase and the PHP in another. May make debugging easier. No idea how
widely used the PHP-Java binding is.

k

On Sun, 6 Feb 2005 10:10:36 -0700, Owen Densmore wrote:
 I'm building a lucene project for a client who uses php for their
 dynamic web pages.  It would be possible to add servlets to their
 environment easily enough (they use apache) but I'd like to have
 minimal impact on their IT group.

 There appears to be a php java extension that lets php call back &
 forth to java classes, but I thought I'd ask here if anyone has had
 success using lucene from php.

 Note: I looked in the Lucene In Action search page, and yup, I
 bought the book and love it!  No examples there tho.  The list
 archives mention that using java lucene from php is the way to go,
 without saying how.  There's mention of a lucene server and a php
 interface to that.  And some similar comments.  But I'm a bit
 surprised there's not a bit more in terms of use of the official
 java extension to php.

 Thanks for the great package!

 Owen


 



Re: PHP-Lucene Integration

2005-02-06 Thread Erik Hatcher
Eventually you can just do PHP within the servlet container
http://www.jcp.org/en/jsr/detail?id=223
and have your cake and eat it too!  :)
Erik
On Feb 6, 2005, at 12:10 PM, Owen Densmore wrote:
I'm building a lucene project for a client who uses php for their 
dynamic web pages.  It would be possible to add servlets to their 
environment easily enough (they use apache) but I'd like to have 
minimal impact on their IT group.

There appears to be a php java extension that lets php call back &
forth to java classes, but I thought I'd ask here if anyone has had 
success using lucene from php.

Note: I looked in the Lucene In Action search page, and yup, I bought 
the book and love it!  No examples there tho.  The list archives 
mention that using java lucene from php is the way to go, without 
saying how.  There's mention of a lucene server and a php interface to 
that.  And some similar comments.  But I'm a bit surprised there's not 
a bit more in terms of use of the official java extension to php.

Thanks for the great package!
Owen


Re: Document numbers and ids

2005-02-06 Thread Chris Hostetter
:  care about their content. I only want to know a particular numeric
:  field from
:  document (id of document's category).
:  I also need to know how many docs in category were found, so I can't
:  index

: You should explore the use of IndexReader.  Index your documents with
: category id field, and use the methods on IndexReader to find all
: unique categories (TermEnum).

to expand on erik's suggestion: once you know the complete list of
categories you iterate over them and execute your search once per
category, filtering each time on the category Id (to determine the number
of results from that category).



-Hoss





Re: PHP-Lucene Integration

2005-02-06 Thread Andrzej Bialecki
Erik Hatcher wrote:
Eventually you can just do PHP within the servlet container
http://www.jcp.org/en/jsr/detail?id=223
and have your cake and eat it too!  :)
An intriguing thought occurred to me: with the recent work on PyLucene,
it should be quite possible to generate a SWIG wrapper for PHP and build
a fully native PHPLucene module using gcj.

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Starts With x and Ends With x Queries

2005-02-06 Thread Chris Hostetter

: book Managing Gigabytes, making *string* queries drastically more
: efficient for searching (though also impacting index size).  Take the
: term cat.  It would be indexed with all rotated variations with an
: end of word marker added:
...
: The query for *at* would be preprocessed and rotated such that the
: wildcards are collapsed at the end to search for at* as a
: PrefixQuery.  A wildcard in the middle of a string like c*t would
: become a prefix query for t$c*.

That's a pretty slick trick.

Considering how many Terms the index would wind up containing in order to
denormalize the data in that way, I wonder if it would be more practical
to index each of the characters as a separate term, with the word repeated
after the end of word character, making wildcard searches into phrase
searches (after doing preprocessing and rotating as you described).

Ie, index "cat" as:   "c a t $ c a t"
  search for "*at*" as a phrase search for "a t"
  search for "*at"  as a phrase search for "a t $"
  search for "c*t"  as a phrase search for "t $ c"

...i'm fairly certain that would keep the index size much smaller (the
number of terms would be much smaller, while the average term frequency
wouldn't really increase), but i'm not sure if it would actually be any
faster.  it depends on the algorithm/performance of PhraseQuery -- which is
something I haven't really looked into.  It could very well be
significantly slower.
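
To make the character-per-term idea concrete, here is a small sketch in plain Java. It is an assumption-laden illustration rather than Lucene code: a list of one-character strings stands in for the token positions of a field, a contiguous sublist search stands in for PhraseQuery, and the names CharPhraseSketch, indexTokens and phraseTokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: each character is a token, the word is repeated
// after the '$' marker, and a wildcard becomes a "phrase" of characters.
final class CharPhraseSketch {
    private static List<String> chars(String s) {
        List<String> out = new ArrayList<>();
        for (char ch : s.toCharArray()) {
            out.add(String.valueOf(ch));
        }
        return out;
    }

    // cat -> [c, a, t, $, c, a, t]
    static List<String> indexTokens(String word) {
        return chars(word + "$" + word);
    }

    // *X* -> X,  X*Y -> Y$X  (so *X -> X$ and X* -> $X), split into characters.
    static List<String> phraseTokens(String pattern) {
        int first = pattern.indexOf('*');
        int last = pattern.lastIndexOf('*');
        String rotated;
        if (first == 0 && last == pattern.length() - 1 && pattern.length() > 2
                && pattern.indexOf('*', 1) == last) {
            rotated = pattern.substring(1, last);
        } else if (first >= 0 && first == last) {
            rotated = pattern.substring(first + 1) + "$" + pattern.substring(0, first);
        } else {
            throw new IllegalArgumentException("unsupported pattern: " + pattern);
        }
        return chars(rotated);
    }

    // Stand-in for a phrase search: the tokens must appear contiguously.
    static boolean matches(String word, String pattern) {
        return Collections.indexOfSubList(indexTokens(word), phraseTokens(pattern)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(indexTokens("cat"));
        System.out.println(phraseTokens("*at") + " " + phraseTokens("c*t"));
        System.out.println(matches("cat", "c*t") + " " + matches("cab", "c*t"));
    }
}
```

One thing the sketch makes visible: because the word is repeated in full, a pattern whose two halves overlap inside the word (for example ca*at against cat) also matches, so a real implementation would probably need an extra length check.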


-Hoss





Highlighter: new support for encoding

2005-02-06 Thread markharw00d
Nicko Cadell was good enough to point out the issues involved with 
generating XHTML compliant markup with the highlighter and provided a 
patch to fix it.

The main code has now been updated in the new SVN repository here: 
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/

To encode your content simply pass an encoder to the Highlighter eg:
   import java.io.StringReader;
   import org.apache.lucene.analysis.Analyzer;
   import org.apache.lucene.analysis.TokenStream;
   import org.apache.lucene.analysis.WhitespaceAnalyzer;
   import org.apache.lucene.index.Term;
   import org.apache.lucene.search.Query;
   import org.apache.lucene.search.TermQuery;
   import org.apache.lucene.search.highlight.Highlighter;
   import org.apache.lucene.search.highlight.QueryScorer;
   import org.apache.lucene.search.highlight.SimpleHTMLEncoder;

   //create an example doc for this test
   String myDocContent = "\"Smith & sons' prices < 3 and >4\" claims article";
   //Ordinarily you'd get the doc content like this..
   //myDocContent=hits.doc(i).get(FIELD_NAME)

   //create a query - you'd normally get this from QueryParser.parse
   Query myDocQuery = new TermQuery(new Term("contents", "prices"));
   //Create a highlighter and pass a QueryScorer to provide the list of query tokens
   Highlighter highlighter = new Highlighter(new QueryScorer(myDocQuery));
   //set the choice of encoder to our simple encoder - otherwise default is no encoding
   highlighter.setEncoder(new SimpleHTMLEncoder());

   //Tokenize the document content to get the positions using an analyzer:
   Analyzer analyzer = new WhitespaceAnalyzer();
   TokenStream tokenStream = analyzer.tokenStream("contents", new StringReader(myDocContent));

   //As a faster alternative to re-analyzing doc content you can
   //use TokenSources to take advantage of any pre-tokenized content held in any term vectors:
   //TokenStream tokenStream=TokenSources.getAnyTokenStream(indexReader,docId,fieldName,analyzer);

   //Now pass the tokenStream to the highlighter to process
   String encodedSnippet = highlighter.getBestFragments(tokenStream, myDocContent, 1, "...");
   System.out.println(encodedSnippet);
   //Should print &quot;Smith &amp; sons' <B>prices</B> &lt; 3 and &gt;4&quot; claims article
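
For readers who want to see what the encoding step amounts to, here is a minimal sketch of HTML escaping in plain Java. It is not the source of SimpleHTMLEncoder; the class name HtmlEncodeSketch and the exact set of escaped characters are assumptions chosen for illustration.

```java
// Illustration only: escape the characters that would otherwise break
// XHTML markup when document text is embedded in a page.
final class HtmlEncodeSketch {
    static String encode(String plain) {
        StringBuilder out = new StringBuilder(plain.length() + 16);
        for (int i = 0; i < plain.length(); i++) {
            char ch = plain.charAt(i);
            switch (ch) {
                case '&':
                    out.append("&amp;");
                    break;
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                case '"':
                    out.append("&quot;");
                    break;
                default:
                    out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("\"Smith & sons' prices < 3 and >4\" claims article"));
    }
}
```

The highlighter has to apply this kind of escaping to the document text only, and add its own highlight tags afterwards, otherwise the tags themselves would be escaped too.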

Cheers
Mark



Re: PHP-Lucene Integration

2005-02-06 Thread [EMAIL PROTECTED]
Hi Owen
I am using Lucene with PHP. Previous replies suggested running Tomcat on
an alternate port, but for me that was not a solution.
I did not want to run too many tasks or too many servers
for various reasons (maintenance, security etc) and also needed to have
control over PHP sessions and whatnot.

The original PHP extension for Java is broken and is far from being
usable in production. Instead I have been using PHP and Lucene with a
PHP-Java-Bridge for the past 6 months or so.
It does the job very well and I can call classes and methods right out
of PHP just like you would expect with a PHP extension.

The bridge is available here: 
http://sourceforge.net/projects/php-java-bridge

Hope this helps,
-pedja


Owen Densmore said the following on 2/6/2005 12:10 PM:
I'm building a lucene project for a client who uses php for their 
dynamic web pages.  It would be possible to add servlets to their 
environment easily enough (they use apache) but I'd like to have 
minimal impact on their IT group.

There appears to be a php java extension that lets php call back &
forth to java classes, but I thought I'd ask here if anyone has had 
success using lucene from php.

Note: I looked in the Lucene In Action search page, and yup, I bought 
the book and love it!  No examples there tho.  The list archives 
mention that using java lucene from php is the way to go, without 
saying how.  There's mention of a lucene server and a php interface to 
that.  And some similar comments.  But I'm a bit surprised there's not 
a bit more in terms of use of the official java extension to php.

Thanks for the great package!
Owen


Re: Document numbers and ids

2005-02-06 Thread Simeon Koptelov
On Sunday 06 February 2005 20:00, Chris Hostetter wrote:
 :  care about their content. I only want to know a particular numeric
 :  field from
 :  document (id of document's category).
 :  I also need to know how many docs in category were found, so I can't
 :  index
 :
 : You should explore the use of IndexReader.  Index your documents with
 : category id field, and use the methods on IndexReader to find all
 : unique categories (TermEnum).

 to expand on erik's suggestion: once you know the complete list of
 categories you iterate over them and execute your search once per
 category, filtering each time on the category Id (to determine the number
 of results from that category).

Nah, I did something a little trickier that promises to be faster (I have 12K
categories now and there will be more).
I index docs' category ids as zero-padded keywords. Then I search for
documents, sorting them by category id. Then I iterate over Hits following this
scheme:
1. I have a cache that holds the ids of documents in the current category.
2. Each time I see a doc id that is not in the current category, I read that
document and reload the cache with its category data.

So if I find docs in N categories (N is usually not big), I need to read
exactly N docs from disk; the rest of the iteration through Hits is just
checking the cache (because I sort by category).
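
The scheme above can be sketched with plain collections. This is only an illustration under assumptions: maps stand in for the index and for the per-category data, a list of doc ids stands in for Hits, and the names CategoryCountSketch, countPerCategory and documentReads are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustration only: count hits per category while reading one document
// per category, relying on the hits being sorted by category id.
final class CategoryCountSketch {
    private final Map<Integer, Integer> categoryOfDoc;   // stand-in for the stored field
    private final Map<Integer, Set<Integer>> docsOfCategory = new TreeMap<>();
    int documentReads = 0;                                // simulated disk reads

    CategoryCountSketch(Map<Integer, Integer> categoryOfDoc) {
        this.categoryOfDoc = categoryOfDoc;
        for (Map.Entry<Integer, Integer> e : categoryOfDoc.entrySet()) {
            docsOfCategory.computeIfAbsent(e.getValue(), k -> new TreeSet<>()).add(e.getKey());
        }
    }

    // hitDocIds must already be ordered by category id.
    Map<Integer, Integer> countPerCategory(List<Integer> hitDocIds) {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        Set<Integer> cache = new TreeSet<>();   // doc ids of the current category
        int currentCategory = -1;
        for (int docId : hitDocIds) {
            if (!cache.contains(docId)) {
                documentReads++;                 // "read that document"
                currentCategory = categoryOfDoc.get(docId);
                cache = docsOfCategory.get(currentCategory);
            }
            counts.merge(currentCategory, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> cats = new HashMap<>();
        cats.put(1, 10);
        cats.put(2, 10);
        cats.put(3, 20);
        cats.put(5, 20);
        cats.put(6, 30);
        CategoryCountSketch sketch = new CategoryCountSketch(cats);
        System.out.println(sketch.countPerCategory(Arrays.asList(1, 2, 3, 5, 6)));
        System.out.println("document reads: " + sketch.documentReads);
    }
}
```

If the hits were not sorted by category, the cache would be reloaded every time the category changed back and forth, which is why the sort is part of the scheme.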

It's a pity Lucene doesn't have IndexSearcher.search( Query, Sort,
HitCollector ), but if I understood Hits properly, it gives me an
O( log2( doc_num ) ) performance impact per result set, which is perfectly
acceptable.




Re: Starts With x and Ends With x Queries

2005-02-06 Thread sergiu gordea
Hi Erik,

In order to prevent extremely slow WildcardQueries, a Wildcard term
must not start with one of the wildcards <code>*</code> or
<code>?</code>.

I don't read that as saying you cannot use an initial wildcard
character, but rather as "if you use a leading wildcard character you
risk performance issues".  I'm going to change "must" to "should".

Will this change be available in the next release of Lucene? How do you
plan to implement this? Will this be available as an attribute of
QueryParser?

 Best,
 Sergiu


Re: Disk space used by optimize

2005-02-06 Thread Morus Walter
Bernhard Messer writes:
 
 However, three times the space sounds a bit too much, or I make a
 mistake in the book. :)
   
 
 there already was a discussion about disk usage during index optimize.
 Please have a look at the developers list at:
 http://mail-archives.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1797569
 where i made some measurements about the disk usage within lucene.
 At that time i proposed a patch which was reducing the total used disk
 size from 3 times to a little more than 2 times of the final index size.
 Together with Christoph we implemented some improvements to the
 optimization patch and finally committed the changes.
 
Hmm. In the case that the index is in use (open reader), I doubt your patch
makes a difference. In that case the disk space used by the non-optimized
index will still be used even if the files are deleted (on unix/linux).
What happens if disk space runs out during creation of the compound index?
Will the non-compound files be a usable index?
Otherwise you risk losing the index.

Morus




DbDirectory and Berkeley DB Java Edition...

2005-02-06 Thread Kevin A. Burton
I'm reading the Lucene in Action book right now, and on page 309 they talk
about using DbDirectory, which uses Berkeley DB for maintaining your index.

Anyone ever consider a port to Berkeley DB Java Edition?
The only downside would be the license (I think it's GPL) but it could
really free up the time it takes to optimize() I think.  You could just
rehash the hashtable and then insert rows into the new table.

Would be interesting to benchmark I think though.
Thoughts?
http://www.sleepycat.com/products/je.shtml
--
Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
