Re: Starts With x and Ends With x Queries
On Feb 4, 2005, at 9:37 PM, Chris Hostetter wrote:
> If you want to start doing suffix queries (i.e. all names ending with s, or all names ending with Smith), one approach would be to use WildcardQuery, which, as Erik mentioned, will allow you to use a query Term that starts with a "*", i.e.:
>
>     Query q3 = new WildcardQuery(new Term("name", "*s"));
>     Query q4 = new WildcardQuery(new Term("name", "*Smith"));
>
> (NOTE: Erik says you can do this, but the docs for WildcardQuery say you can't. I'll assume the docs are wrong and Erik is correct.)

I assume you mean this comment in WildcardQuery's javadocs:

    In order to prevent extremely slow WildcardQueries, a Wildcard term must not start with one of the wildcards <code>*</code> or <code>?</code>.

I don't read that as saying you cannot use an initial wildcard character, but rather that if you use a leading wildcard character you risk performance issues. I'm going to change "must" to "should". And yes, WildcardQuery itself supports a leading wildcard character exactly as you have shown.

> Which leads me to my point: if you denormalize your data so that you store both the Term you want and the *reverse* of the Term you want, then a suffix query is just a prefix query on a reversed field. By sacrificing space, you get all the speed efficiencies of a PrefixQuery when doing a suffix query:
>
>     D1  name:Adam Smith  rname:htimS madA  age:13  state:CA ...
>     D2  name:Joe Bob     rname:boB eoJ     age:42  state:WA ...
>     D3  name:John Adams  rname:smadA nhoJ  age:35  state:NV ...
>     D4  name:Sue Smith   rname:htimS euS   age:33  state:CA ...
>
>     Query q1 = new PrefixQuery(new Term("name", "J"));
>     Query q2 = new PrefixQuery(new Term("name", "Sue"));
>     Query q3 = new PrefixQuery(new Term("rname", "s"));
>     Query q4 = new PrefixQuery(new Term("rname", "htimS"));
>
> (If anyone sees a flaw in my theory, please chime in)

This trick has been mentioned on this list before, and is a good one.
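As a rough illustration of the reversed-field trick quoted above, here is a minimal sketch in plain Java with no Lucene dependency. The class and method names (ReversedField, reverse, endingWith) are made up for this example and are not part of any Lucene API; in a real index the reversed value would be stored in a second field such as rname and searched with a PrefixQuery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReversedField {
    /** Reverses a field value so that a suffix match becomes a prefix match. */
    public static String reverse(String value) {
        return new StringBuilder(value).reverse().toString();
    }

    /** Returns the names whose reversed form starts with the reversed suffix. */
    public static List<String> endingWith(List<String> names, String suffix) {
        String reversedSuffix = reverse(suffix);
        List<String> matches = new ArrayList<>();
        for (String name : names) {
            // Stand-in for a prefix lookup on the hypothetical "rname" field.
            if (reverse(name).startsWith(reversedSuffix)) {
                matches.add(name);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Adam Smith", "Joe Bob", "John Adams", "Sue Smith");
        System.out.println(reverse("Joe Bob"));
        System.out.println(endingWith(names, "Smith"));
    }
}
```

The linear scan here only demonstrates the equivalence; the speed benefit discussed in the thread comes from the index being able to seek directly to a term prefix.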
I'll go one step further and mention another technique I found in the book Managing Gigabytes, which makes "*string*" queries drastically more efficient to search (though it also increases index size). Take the term "cat". It would be indexed with all rotated variations, with an end-of-word marker added:

    cat$
    at$c
    t$ca
    $cat

The query for "*at*" would be preprocessed and rotated such that the wildcards are collapsed at the end, to search for "at*" as a PrefixQuery. A wildcard in the middle of a string, like "c*t", would become a prefix query for "t$c*".

Has anyone tried this technique with Lucene?

    Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
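The rotation scheme Erik describes can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not code from the thread or from Lucene: the class RotatedTerms and its methods are hypothetical, the sketch handles only a single "*" or the "*x*" form, and the final startsWith check stands in for a PrefixQuery over the rotated terms.

```java
import java.util.ArrayList;
import java.util.List;

public class RotatedTerms {
    private static final char END = '$';

    /** All rotations of term + '$', e.g. cat -> cat$, at$c, t$ca, $cat. */
    public static List<String> rotations(String term) {
        String marked = term + END;
        List<String> result = new ArrayList<>();
        for (int i = 0; i < marked.length(); i++) {
            result.add(marked.substring(i) + marked.substring(0, i));
        }
        return result;
    }

    /** Rotates a wildcard pattern so the wildcard ends up last; returns the prefix. */
    public static String toPrefix(String pattern) {
        int len = pattern.length();
        if (len >= 2 && pattern.startsWith("*") && pattern.endsWith("*")) {
            String inner = pattern.substring(1, len - 1);
            if (inner.indexOf('*') >= 0) {
                throw new IllegalArgumentException("unsupported pattern: " + pattern);
            }
            return inner; // *at* -> at
        }
        int star = pattern.indexOf('*');
        if (star < 0 || pattern.indexOf('*', star + 1) >= 0) {
            throw new IllegalArgumentException("expected exactly one '*': " + pattern);
        }
        // c*t -> t$c, *at -> at$, ca* -> $ca
        return pattern.substring(star + 1) + END + pattern.substring(0, star);
    }

    /** True if some rotation of the term starts with the rotated pattern prefix. */
    public static boolean matches(String term, String pattern) {
        String prefix = toPrefix(pattern);
        for (String rotation : rotations(term)) {
            if (rotation.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(rotations("cat"));
        System.out.println(toPrefix("c*t"));
        System.out.println(matches("cat", "*at*"));
    }
}
```

A term of n characters produces n + 1 rotations, which is where the index-size cost mentioned above comes from.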
Re: PHP-Lucene Integration
Hi Owen,

This can easily be done! Simply install Tomcat on port 8080 and create a jk2 connector or proxy that points to Tomcat. Then all requests for JSPs can be sent to Tomcat. The search engine can even be placed on a separate server. If you give me some details on your server, I will create a proxy script for your Apache!

Regards,
Maurits

Owen Densmore wrote:
> I'm building a Lucene project for a client who uses PHP for their dynamic web pages. It would be possible to add servlets to their environment easily enough (they use Apache), but I'd like to have minimal impact on their IT group.
>
> There appears to be a PHP Java extension that lets PHP call back and forth to Java classes, but I thought I'd ask here if anyone has had success using Lucene from PHP.
>
> Note: I looked in the Lucene In Action search page, and yup, I bought the book and love it! No examples there tho. The list archives mention that using Java Lucene from PHP is the way to go, without saying how. There's mention of a Lucene server and a PHP interface to that. And some similar comments. But I'm a bit surprised there's not a bit more in terms of use of the official Java extension to PHP.
>
> Thanks for the great package!
>
> Owen
Re: PHP-Lucene Integration
How about XML-RPC/SOAP, or REST? For REST, just have a servlet listen for HTTP GETs and respond with XML that your PHP app can parse (for searching). For indexing, say you want to index an uploaded file: construct a URL with the fields and field values, and also pass the location of the file on the filesystem. Shouldn't be that difficult.

I'm guessing it's more desirable to have all your code in one place, which is an advantage of using Java in PHP. But it feels cleaner to have the Java stuff in one codebase and the PHP in another. It may make debugging easier. No idea how widely used the PHP-Java binding is.

k

On Sun, 6 Feb 2005 10:10:36 -0700, Owen Densmore wrote:
> I'm building a lucene project for a client who uses php for their dynamic web pages.
> [...]
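To make the indexing half of the REST suggestion concrete, here is a small sketch of how such a request URL might be assembled. Everything specific in it is an assumption made for illustration: the endpoint http://localhost:8080/index, the parameter name "file", and the class IndexRequestUrl do not come from the thread. A PHP client would do the equivalent with its own URL-encoding functions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class IndexRequestUrl {
    /** URL-encodes one query-string component as UTF-8. */
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Builds endpoint?field=value&...&file=path with every component encoded. */
    public static String build(String endpoint, Map<String, String> fields, String filePath) {
        StringBuilder url = new StringBuilder(endpoint).append('?');
        for (Map.Entry<String, String> field : fields.entrySet()) {
            url.append(enc(field.getKey())).append('=')
               .append(enc(field.getValue())).append('&');
        }
        // "file" is a hypothetical parameter carrying the path of the uploaded file.
        return url.append("file=").append(enc(filePath)).toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "Lucene in Action");
        System.out.println(build("http://localhost:8080/index", fields, "/tmp/a b.pdf"));
    }
}
```

Passing a filesystem path only works when the servlet and the PHP application can see the same filesystem, which fits the single-server setup Owen describes.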
Re: PHP-Lucene Integration
Eventually you can just do PHP within the servlet container

    http://www.jcp.org/en/jsr/detail?id=223

and have your cake and eat it too! :)

    Erik

On Feb 6, 2005, at 12:10 PM, Owen Densmore wrote:
> I'm building a lucene project for a client who uses php for their dynamic web pages.
> [...]
Re: Document numbers and ids
: care about their content. I only want to know a particular numeric field from
: document (id of document's category).
: I also need to know how many docs in category were found, so I can't index

: You should explore the use of IndexReader. Index your documents with
: category id field, and use the methods on IndexReader to find all
: unique categories (TermEnum).

To expand on Erik's suggestion: once you know the complete list of categories, you iterate over them and execute your search once per category, filtering each time on the category id (to determine the number of results from that category).

-Hoss
Re: PHP-Lucene Integration
Erik Hatcher wrote:
> Eventually you can just do PHP within the servlet container
> http://www.jcp.org/en/jsr/detail?id=223 and have your cake and eat it too! :)

An intriguing thought occurred to me: with the recent work on PyLucene, it should be quite possible to generate a SWIG wrapper for PHP and build a fully native PHPLucene module using gcj.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: Starts With x and Ends With x Queries
: book Managing Gigabytes, making *string* queries drastically more
: efficient for searching (though also impacting index size). Take the
: term cat. It would be indexed with all rotated variations with an
: end of word marker added:
...
: The query for *at* would be preprocessed and rotated such that the
: wildcards are collapsed at the end to search for at* as a
: PrefixQuery. A wildcard in the middle of a string like c*t would
: become a prefix query for t$c*.

That's a pretty slick trick. Considering how many Terms the index would wind up containing in order to denormalize the data in that way, I wonder if it would be more practical to index each of the characters as a separate term, with the word repeated after the end-of-word character, turning wildcard searches into phrase searches (after doing the preprocessing and rotating as you described). I.e.:

    index cat as:      c a t $ c a t
    search for *at*    as a phrase search for "a t"
    search for *at     as a phrase search for "a t $"
    search for c*t     as a phrase search for "t $ c"

I'm fairly certain that would keep the index size much smaller (the number of terms would be much smaller, while the average term frequency wouldn't really increase), but I'm not sure if it would actually be any faster. It depends on the algorithm and performance of PhraseQuery, which is something I haven't really looked into. It could very well be significantly slower.

-Hoss
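Hoss's per-character variant can be sketched the same way. This is a plain-Java illustration, not Lucene code: the class CharPhrase is hypothetical, only a single "*" or the "*x*" form is handled, and Collections.indexOfSubList stands in for what a PhraseQuery over one-character terms would check.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CharPhrase {
    private static final char END = '$';

    /** Splits a string into one-character tokens. */
    private static List<String> chars(String s) {
        List<String> out = new ArrayList<>();
        for (char c : s.toCharArray()) {
            out.add(String.valueOf(c));
        }
        return out;
    }

    /** Index form of a word: cat -> c a t $ c a t. */
    public static List<String> indexTokens(String word) {
        return chars(word + END + word);
    }

    /** Phrase for a wildcard pattern: *at* -> a t, *at -> a t $, c*t -> t $ c. */
    public static List<String> phraseFor(String pattern) {
        int len = pattern.length();
        if (len >= 2 && pattern.startsWith("*") && pattern.endsWith("*")) {
            String inner = pattern.substring(1, len - 1);
            if (inner.indexOf('*') >= 0) {
                throw new IllegalArgumentException("unsupported pattern: " + pattern);
            }
            return chars(inner);
        }
        int star = pattern.indexOf('*');
        if (star < 0 || pattern.indexOf('*', star + 1) >= 0) {
            throw new IllegalArgumentException("expected exactly one '*': " + pattern);
        }
        return chars(pattern.substring(star + 1) + END + pattern.substring(0, star));
    }

    /** True if the phrase occurs as consecutive tokens in the word's index form. */
    public static boolean matches(String word, String pattern) {
        return Collections.indexOfSubList(indexTokens(word), phraseFor(pattern)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(indexTokens("cat"));
        System.out.println(phraseFor("c*t"));
        System.out.println(matches("cat", "*at"));
    }
}
```

The term dictionary stays tiny (one term per distinct character plus the marker), at the cost of long position lists for every term, which is the trade-off Hoss is unsure about.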
Highlighter: new support for encoding
Nicko Cadell was good enough to point out the issues involved with generating XHTML-compliant markup with the highlighter, and provided a patch to fix it. The main code has now been updated in the new SVN repository here:

    http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/

To encode your content, simply pass an encoder to the Highlighter, e.g.:

    import java.io.StringReader;
    import org.apache.lucene.analysis.*;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;
    import org.apache.lucene.search.highlight.*;

    // Create an example doc for this test.
    String myDocContent = "\"Smith & sons' prices < 3 and >4\" claims article";
    // Ordinarily you'd get the doc content like this:
    // myDocContent = hits.doc(i).get(FIELD_NAME);

    // Create a query. You'd normally get this from QueryParser.parse.
    Query myDocQuery = new TermQuery(new Term("contents", "prices"));

    // Create a highlighter and pass a QueryScorer to provide the list of query tokens.
    Highlighter highlighter = new Highlighter(new QueryScorer(myDocQuery));

    // Set the choice of encoder to our simple encoder; otherwise the default is no encoding.
    highlighter.setEncoder(new SimpleHTMLEncoder());

    // Tokenize the document content to get the positions, using an analyzer.
    Analyzer analyzer = new WhitespaceAnalyzer();
    TokenStream tokenStream = analyzer.tokenStream("contents", new StringReader(myDocContent));
    // As a faster alternative to re-analyzing the doc content, you can use TokenSources
    // to take advantage of any pre-tokenized content held in term vectors:
    // TokenStream tokenStream = TokenSources.getAnyTokenStream(indexReader, docId, fieldName, analyzer);

    // Now pass the tokenStream to the highlighter to process.
    String encodedSnippet = highlighter.getBestFragments(tokenStream, myDocContent, 1, "...");
    System.out.println(encodedSnippet);
    // Should print:
    // &quot;Smith &amp; sons' <B>prices</B> &lt; 3 and &gt;4&quot; claims article

Cheers
Mark
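For readers without the contrib jar at hand, the kind of escaping implied by the expected output above can be sketched in plain Java. This is not the source of SimpleHTMLEncoder, only an assumption about its visible effect on this example: the four characters ", &, < and > are replaced by entities and everything else, including the apostrophe, passes through. The class name HtmlEscapeSketch is made up.

```java
public class HtmlEscapeSketch {
    /** Replaces the four HTML-significant characters with entities. */
    public static String encode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '"': out.append("&quot;"); break;
                case '&': out.append("&amp;");  break;
                case '<': out.append("&lt;");   break;
                case '>': out.append("&gt;");   break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("\"Smith & sons' prices < 3 and >4\" claims article"));
    }
}
```

The highlighter has to apply the encoding itself, rather than leaving it to the caller, because the <B> tags it inserts must stay unescaped while the surrounding document text is escaped.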
Re: PHP-Lucene Integration
Hi Owen,

I am using Lucene with PHP. Previous replies suggested running Tomcat on an alternate port, but for me that was not a solution. I did not want to run too many tasks or too many servers, for various reasons (maintenance, security, etc.), and I also needed to have control over PHP sessions and whatnot.

The original PHP extension for Java is broken and is far from being usable in production. Instead, I have been using PHP and Lucene with a PHP-Java bridge for the past 6 months or so. It does the job very well, and I can call classes and methods right out of PHP just like you would expect with a PHP extension.

The bridge is available here:

    http://sourceforge.net/projects/php-java-bridge

Hope this helps,

-pedja

Owen Densmore said the following on 2/6/2005 12:10 PM:
> I'm building a lucene project for a client who uses php for their dynamic web pages.
> [...]
Re: Document numbers and ids
On Sunday 06 February 2005 20:00, Chris Hostetter wrote:
> : care about their content. I only want to know a particular numeric field from
> : document (id of document's category).
> : I also need to know how many docs in category were found, so I can't index
> :
> : You should explore the use of IndexReader. Index your documents with
> : category id field, and use the methods on IndexReader to find all
> : unique categories (TermEnum).
>
> to expand on erik's suggestion: once you know the complete list of categories you iterate over them and execute your search once per category, filtering each time on the category Id (to determine the number of results from that category).

Nah, I did a slightly trickier thing, which promises to be faster (I have 12K categories now and there will be more). I index the docs' category ids as zero-padded keywords. Then I search for documents, sorting them by category id. Then I iterate over Hits following this scheme:

1. I have a cache that holds the ids of the documents in the current category.
2. Each time I see a doc id that is not in the current category, I read that document and reload the cache with its category's data.

So if I found docs in N categories (N usually is not big), I need to read exactly N docs from disk; the rest of the iteration through Hits is just checking the cache (because I sort by category).

It's a pity Lucene doesn't have IndexSearcher.search(Query, Sort, HitCollector), but if I understood Hits properly, it gives me an O(log2(doc_num)) performance impact per result set, which is perfectly acceptable.
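Two pieces of the scheme above lend themselves to a short plain-Java sketch: zero-padding ids so that string order agrees with numeric order, and counting hits per category in a single pass over results already sorted by category. This is a simplification made for illustration, not the poster's code: it compares category keys directly instead of keeping a cache of document ids, the class name CategoryCounts is made up, and the pad width of 8 is an arbitrary choice.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class CategoryCounts {
    /** Zero-pads an id so lexicographic order matches numeric order. */
    public static String pad(int categoryId) {
        return String.format(Locale.ROOT, "%08d", categoryId);
    }

    /** Counts hits per category; the keys must already be sorted by category. */
    public static Map<String, Integer> countSorted(List<String> sortedCategoryKeys) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String current = null;
        int run = 0;
        for (String key : sortedCategoryKeys) {
            if (!key.equals(current)) {
                // Category boundary: record the finished run and start a new one.
                if (current != null) {
                    counts.put(current, run);
                }
                current = key;
                run = 0;
            }
            run++;
        }
        if (current != null) {
            counts.put(current, run);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("12000".compareTo("999") < 0);    // unpadded strings sort wrongly
        System.out.println(pad(12000).compareTo(pad(999)) > 0);
        System.out.println(countSorted(Arrays.asList(pad(7), pad(7), pad(42))));
    }
}
```

Because the results are sorted, each category forms one contiguous run, so the work per hit is a single comparison and the expensive step happens once per category.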
Re: Starts With x and Ends With x Queries
Hi Erik,

> In order to prevent extremely slow WildcardQueries, a Wildcard term must not start with one of the wildcards <code>*</code> or <code>?</code>.
>
> I don't read that as saying you cannot use an initial wildcard character, but rather as if you use a leading wildcard character you risk performance issues. I'm going to change must to should.

Will this change be available in the next release of Lucene? How do you plan to implement this? Will this be available as an attribute of QueryParser?

Best,
Sergiu
Re: Disk space used by optimize
Bernhard Messer writes:
> > However, three times the space sounds a bit too much, or I make a
> > mistake in the book. :)
>
> there already was a discussion about disk usage during index optimize. Please have a look at the developers list at:
> http://mail-archives.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1797569
> where I made some measurements of the disk usage within Lucene. At that time I proposed a patch which reduced the total disk space used from 3 times to a little more than 2 times the final index size. Together with Christoph, we implemented some improvements to the optimization patch and finally committed the changes.

Hmm. In the case that the index is in use (open reader), I doubt your patch makes a difference. In that case the disk space used by the non-optimized index will still be occupied even if the files are deleted (on Unix/Linux).

What happens if disk space runs out during creation of the compound index? Will the non-compound files be a usable index? Otherwise you risk losing the index.

Morus
DbDirectory and Berkeley DB Java Edition...
I'm reading the Lucene in Action book right now, and on page 309 they talk about using the DbDirectory, which uses Berkeley DB for maintaining your index.

Has anyone ever considered a port to Berkeley DB Java Edition?

The only downside would be the license (I think it's GPL), but I think it could really cut down the time it takes to optimize(). You could just rehash the hashtable and then insert rows into the new table.

It would be interesting to benchmark, though.

Thoughts?

    http://www.sleepycat.com/products/je.shtml

-- 
Kevin A. Burton, Location - San Francisco, CA
AIM/YIM - sfburtonator, Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412