Re: Exchange/PST/Mail parsing
We had to develop VB code to convert PST to EML files. I am using mbox, works fine for me. And I am also using Aperture, but only for extracting text from non-mail files (like office etc), works fine too. On 7/2/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: Anyone have any recommendations on a decent, open (doesn't have to be Apache license, but would prefer non-GPL if possible), extractor for MS Exchange and/or PST files? The Zoe link on the FAQ [1] seems dead. For mbox, I think mstor will suffice for me and I think tropo (from the FAQ) should work for IMAP. Does anyone have experience with http://aperture.sourceforge.net/ ? [1] http://wiki.apache.org/lucene-java/LuceneFAQ#head-bcba2effabe224d5fb8c1761e4da1fedceb9800e Cheers, Grant - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Pagination
Hi, I still have no idea of how to get it done. Can you give me some details? The web application is in jsp btw. Thanks a lot. Regards, Lee Li Bin

-----Original Message-----
From: Chris Lu [mailto:[EMAIL PROTECTED]
Sent: Saturday, June 30, 2007 2:21 AM
To: java-user@lucene.apache.org
Subject: Re: Pagination

After search, you will just get an object Hits, and go through all of the documents by hits.doc(i). The pagination is controlled by you. Lucene is pre-caching the first 200 documents and lazy loading the rest by batch size 200. -- Chris Lu - Instant Scalable Full-Text Search On Any Database/Application site: http://www.dbsight.net demo: http://search.dbsight.com Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

On 6/29/07, Lee Li Bin <[EMAIL PROTECTED]> wrote:
> Hi, does anyone know how to do pagination on a jsp page using the number
> of hits returned? Or any other solutions? Do provide me with some sample
> coding if possible or a step-by-step guide. Sorry if I'm asking too much,
> I'm new to lucene. Thanks
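Chris's point that "the pagination is controlled by you" comes down to simple index arithmetic over the hit list. A minimal sketch in plain Java (class and method names are my own, not a Lucene API):

```java
public class Pagination {
    // First hit index for a page (inclusive). Pages are numbered from 1.
    static int pageStart(int page, int pageSize) {
        return (page - 1) * pageSize;
    }

    // Last hit index for a page (exclusive), clipped to the total hit count.
    static int pageEnd(int page, int pageSize, int totalHits) {
        return Math.min(page * pageSize, totalHits);
    }

    // Total number of pages needed to show all hits.
    static int pageCount(int totalHits, int pageSize) {
        return (totalHits + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        int totalHits = 23, pageSize = 10;
        for (int page = 1; page <= pageCount(totalHits, pageSize); page++) {
            System.out.println("page " + page + ": hits "
                + pageStart(page, pageSize) + ".."
                + (pageEnd(page, pageSize, totalHits) - 1));
        }
    }
}
```

In a JSP you would loop i from pageStart to pageEnd - 1 and call hits.doc(i) for each index; Lucene's batched lazy loading means only the page being displayed is actually fetched.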
Re: Pagination
The Hits class is OK but can be inefficient due to re-running the query unnecessarily. The class below illustrates how to efficiently retrieve a particular page of results and lends itself to webapps where you don't want to retain server-side state (i.e. a Hits object) for each client. It would make sense to put an upper limit on the "start" parameter (as Google etc do) to avoid consuming too much RAM per client request. Cheers, Mark

[Begin code]

package lucene.pagination;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.PriorityQueue;

/**
 * A HitCollector that retrieves a specific page of results
 * @author maharwood
 */
public class HitPageCollector extends HitCollector {

    // Demo code showing pagination
    public static void main(String[] args) throws Exception {
        IndexSearcher s = new IndexSearcher("/indexes/nasa");
        HitPageCollector hpc = new HitPageCollector(1, 10);
        Query q = new TermQuery(new Term("contents", "sea"));
        s.search(q, hpc);
        ScoreDoc[] sd = hpc.getScores();
        System.out.println("Hits " + hpc.getStart() + " - " + hpc.getEnd()
            + " of " + hpc.getTotalAvailable());
        for (int i = 0; i < sd.length; i++) {
            System.out.println(sd[i].doc);
        }
        s.close();
    }

    int nDocs;
    PriorityQueue hq;
    float minScore = 0.0f;
    int totalHits = 0;
    int start;
    int maxNumHits;
    int totalInThisPage;

    public HitPageCollector(int start, int maxNumHits) {
        this.nDocs = start + maxNumHits;
        this.start = start;
        this.maxNumHits = maxNumHits;
        hq = new HitQueue(nDocs);
    }

    public void collect(int doc, float score) {
        totalHits++;
        if ((hq.size() < nDocs) || (score >= minScore)) {
            ScoreDoc scoreDoc = new ScoreDoc(doc, score);
            hq.insert(scoreDoc);                    // update hit queue
            minScore = ((ScoreDoc) hq.top()).score; // reset minScore
        }
        totalInThisPage = hq.size();
    }

    // Just returns the number of hits required from the required start point.
    /* So, given hits: 1234567890 and a start of 2 + maxNumHits of 3
       should return: 234
       or, given hits 12 should return 2 and so on. */
    public ScoreDoc[] getScores() {
        if (start <= 0) {
            throw new IllegalArgumentException("Invalid start :" + start
                + " - start should be >=1");
        }
        int numReturned = Math.min(maxNumHits, (hq.size() - (start - 1)));
        if (numReturned <= 0) {
            return new ScoreDoc[0];
        }
        ScoreDoc[] scoreDocs = new ScoreDoc[numReturned];
        ScoreDoc scoreDoc;
        // put docs in array, working backwards from lowest count
        for (int i = hq.size() - 1; i >= 0; i--) {
            scoreDoc = (ScoreDoc) hq.pop();
            if (i < (start - 1)) {
                break; // off the beginning of the results array
            }
            if (i < (scoreDocs.length + (start - 1))) {
                scoreDocs[i - (start - 1)] = scoreDoc; // within scope of results array
            }
        }
        return scoreDocs;
    }

    public int getTotalAvailable() {
        return totalHits;
    }

    public int getStart() {
        return start;
    }

    public int getEnd() {
        return start + totalInThisPage - 1;
    }

    public class HitQueue extends PriorityQueue {
        public HitQueue(int size) {
            initialize(size);
        }

        public final boolean lessThan(Object a, Object b) {
            ScoreDoc hitA = (ScoreDoc) a;
            ScoreDoc hitB = (ScoreDoc) b;
            if (hitA.score == hitB.score)
                return hitA.doc > hitB.doc;
            else
                return hitA.score < hitB.score;
        }
    }
}

----- Original Message -----
From: Lee Li Bin <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Monday, 2 July, 2007 9:59:14 AM
Subject: RE: Pagination

Hi, I still have no idea of how to get it done. Can you give me some details? The web application is in jsp btw. Thanks a lot. Regards, Lee Li Bin

-----Original Message-----
From: Chris Lu [mailto:[EMAIL PROTECTED]
Sent: Saturday, June 30, 2007 2:21 AM
To: java-user@lucene.apache.org
Subject: Re: Pagination

After search, you will just get an object Hits, and go through all of the documents by hits.doc(i). The pagination is controlled by you. Lucene is pre-caching the first 200 documents and lazy loading the rest by batch size 200. -- Chris Lu - Instant Scalable Full-Text Search On Any Database
Re: Exchange/PST/Mail parsing
On Sun, 1 Jul 2007, Grant Ingersoll wrote: Anyone have any recommendations on a decent, open (doesn't have to be Apache license, but would prefer non-GPL if possible), extractor for MS Exchange and/or PST files? There has been an offer to contribute a PST parser to Apache POI. We're hoping that Travis will have something to go into the POI scratchpad quite soon, but we understand he's currently still working on the first version. I can only suggest you keep an eye on the poi dev list for when the code comes through. Nick
RE: Geneology, nicknames, levenstein, soundex/metaphone, etc
Thank you for the link to the previous thread, a lot of information there! *Synonym use of nicknames - that sounds quite feasible. Do you specifically mean the WordNet module in the Sandbox, or something different?

> -----Original Message-----
> From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
> Sent: Friday, June 29, 2007 12:30 PM
> To: java-user@lucene.apache.org
> Subject: Re: Geneology, nicknames, levenstein, soundex/metaphone, etc
>
> You may find this thread useful: http://www.gossamer-threads.com/lists/lucene/java-user/47824?search_string=record%20linkage;#47824
> although it doesn't answer all your ?'s
>
> > *nickname: would it be feasible to create an Analyzer that will tie
> > to an external/internal nickname datasource (datasource would vary
> > dramatically based on nationality). Usecase: Jon, John, Johnny,
> > Jonathan would have 'weight' in the relevance. Similarly 'Dick',
> > 'Chuck', and 'Charles'.
>
> Maybe you could inject these as synonyms?
Re: Geneology, nicknames, levenstein, soundex/metaphone, etc
On Jul 2, 2007, at 8:07 AM, Darren Hartford wrote: Thank you for the link to the previous thread, a lot of information there! *Synonym use of nicknames - that sounds quite feasible. Do you specifically mean the WordNet module in the Sandbox, or something different? No, I think I was thinking along the lines of the SynonymAnalyzer in Lucene in Action, whereby you add the nicknames as tokens at the same position as the original; that way searches on the nicknames would still match. Don't know that it solves your need for "weight" in the relevance, but maybe it would. [snip] -- Grant Ingersoll Center for Natural Language Processing http://www.cnlp.org/tech/lucene.asp Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ
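The "same position" trick Grant describes works because a synonym token can be emitted with a position increment of 0, so it occupies the same slot as the original term. A plain-Java sketch of the idea (not Lucene's TokenStream API; the nickname table here is purely illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NicknameExpander {
    // Hypothetical nickname table; a real one would be loaded from a datasource.
    static final Map<String, String[]> NICKNAMES = new HashMap<String, String[]>();
    static {
        NICKNAMES.put("jonathan", new String[] {"jon", "john", "johnny"});
        NICKNAMES.put("charles", new String[] {"chuck", "charlie"});
    }

    // Emit (term, position) pairs; each nickname shares its original term's position,
    // which is what lets a phrase/term search on the nickname still match.
    static List<String[]> expand(String[] terms) {
        List<String[]> out = new ArrayList<String[]>();
        for (int pos = 0; pos < terms.length; pos++) {
            String t = terms[pos].toLowerCase();
            out.add(new String[] {t, String.valueOf(pos)});
            String[] nicks = NICKNAMES.get(t);
            if (nicks != null) {
                for (String n : nicks) {
                    out.add(new String[] {n, String.valueOf(pos)}); // same position
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] tp : expand(new String[] {"Jonathan", "Smith"})) {
            System.out.println(tp[1] + ": " + tp[0]);
        }
    }
}
```

In Lucene terms, the nickname tokens would be injected by a TokenFilter that sets position increment 0, as the SynonymAnalyzer in Lucene in Action does.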
Re: Exchange/PST/Mail parsing
Hello Grant (cc-ing aperture-devel), I am one of the Aperture admins, so I can tell you a bit more about Aperture's mail facilities. Short intro: Aperture is a framework for crawling and for full-text and metadata extraction of a growing number of sources and file formats. We try to select the best of breed of the large number of open source libraries that tackle a specific source or format (e.g. PDFBox, POI, JavaMail) and write some glue code around them so that they can be invoked in a uniform way. It's currently used in a number of desktop and enterprise search applications, both research and production systems. At the moment we support a number of mail systems. We can crawl IMAP mail boxes through JavaMail. In general it seems to work well; problems are usually caused by IMAP servers not conforming to the IMAP specs. Some people have used the ImapCrawler to crawl MS Exchange as well. Some succeeded, some didn't. I don't really know whether the fault is in Aperture's code or in the Exchange configuration, but I would be happy to take a look at it when someone runs into problems. Outlook can also be crawled by connecting to a running Outlook process through jacob.dll. Others on aperture-devel can tell you more about its current status. Besides this crawler, I would also be very interested in having a crawler that directly processes .pst files, so as to stay clear of communicating with other processes outside your own control. People have been working on crawling Thunderbird mailboxes but I don't know what the current status is. Ultimately, we try to support any major mail system. In practice, effort is usually dependent on knowledge and experience as well as customer demand. We are happy to help you out with trying to get Aperture working in your domain and to look into the problems that you may encounter.
Kind regards, Chris -- Grant Ingersoll wrote: Anyone have any recommendations on a decent, open (doesn't have to be Apache license, but would prefer non-GPL if possible), extractor for MS Exchange and/or PST files? The Zoe link on the FAQ [1] seems dead. For mbox, I think mstor will suffice for me and I think tropo (from the FAQ) should work for IMAP. Does anyone have experience with http://aperture.sourceforge.net/ ? [1] http://wiki.apache.org/lucene-java/LuceneFAQ#head-bcba2effabe224d5fb8c1761e4da1fedceb9800e Cheers, Grant
Auto Slop
I just ran into an interesting problem today, and wanted to know if it was my understanding or Lucene that was out of whack -- right now I'm leaning toward a fault between the chair and the keyboard. I attempted to do a simple phrase query using the StandardAnalyzer: "United States" Against my corpus of test documents, I got no results returned, which surprised me. I know it's in there. So, I ran this same query in Luke, and it also returned no results. Luke explains:

PhraseQuery: boost=1.0, slop=0
pos[0,1]
Term 0: field='contents' text='united'
Term 1: field='contents' text='states'

Now I know Lucene handles phrases, so I tried manually setting the slop to 1, given that there were two terms: "United States"~1 ...and suddenly I got the results I was expecting! In fact, after a little trial and error with larger phrases, I always get no results unless I *manually* specify at least a slop value of the number of terms minus one. Isn't this supposed to be the default behavior if no slop is specified? Lucene's standard analyzer, which clearly knows the number of terms, should be able to deduce the minimum slop amount. Why must it be manually specified? Could I be missing some configuration setting, have a bad understanding of the query syntax, or is there a clever reason (like searching for encoding synonyms) that makes more sense as a default value for slop that I'm not seeing? Many thanks to all that unravel my confusion. -wls
Re: Auto Slop
Examine your indexes and analyzers. The default slop is 0, which means allow 0 terms between the terms in the phrase. That would be an exact match. A slop of 1 is not the default and would allow a term movement of one position to match the phrase. - Mark

Walt Stoneburner wrote: I just ran into an interesting problem today, and wanted to know if it was my understanding or Lucene that was out of whack -- right now I'm leaning toward a fault between the chair and the keyboard. I attempted to do a simple phrase query using the StandardAnalyzer: "United States" [snip]
RE: Auto Slop
> I just ran into an interesting problem today, and wanted to know if it
> was my understanding or Lucene that was out of whack -- right now I'm
> leaning toward a fault between the chair and the keyboard.
>
> I attempted to do a simple phrase query using the StandardAnalyzer:
> "United States"

And did you also analyze this particular field with the same StandardAnalyzer during indexing? It sounds like you used another analyzer when creating the index. Regards Ard

> [snip]
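For what it's worth, slop is a positional edit distance over indexed term positions, not something an analyzer could deduce from the query: slop 0 demands adjacent positions. A plain-Java sketch of the two-term case (illustrative only, not Lucene's actual PhraseScorer):

```java
public class SlopDemo {
    // For a two-term phrase, the slop needed is how far the second term's
    // indexed position deviates from (first term's position + 1), i.e. from
    // being immediately adjacent.
    static int requiredSlop(int pos1, int pos2) {
        return Math.abs(pos2 - (pos1 + 1));
    }

    static boolean matches(int pos1, int pos2, int slop) {
        return requiredSlop(pos1, pos2) <= slop;
    }

    public static void main(String[] args) {
        // "united" at position 4, "states" at position 5: exact phrase, slop 0 matches.
        System.out.println(matches(4, 5, 0)); // true
        // "united" at 4, "states" at 6 (one term in between): needs slop 1.
        System.out.println(matches(4, 6, 0)); // false
        System.out.println(matches(4, 6, 1)); // true
    }
}
```

So if a slop-0 phrase fails on text that really contains the adjacent words, the indexed positions are not what you expect -- which points back at an analyzer mismatch between index time and query time, as the replies in this thread suggest.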
highlighting phrase query
Hi All, I am developing a search tool using Lucene 2.1. I have a requirement to highlight query words in the results. Lucene-highlighter 2.1 doesn't work well in highlighting a phrase query. For example, if I have a query string "lucene java", it highlights not only occurrences of "lucene java" but occurrences of "lucene" and "java" too in the text. I think this is a known problem. Is this issue solved in Lucene 2.2? My application is almost complete and I really don't want to switch to Lucene 2.2. I was going through previous posts but I couldn't find a solution to this problem. There are some alternate highlighters but it seems they are not stable and still in an evolution phase. I am looking for a standard and stable API for this purpose. I'd appreciate any thoughts/guidance on this issue. Thanks Sandeep -- SANDEEP CHAWLA House No- 23 10th main BTM 1st Stage Bangalore Mobile: 91-9986150603
Re: Pagination
Mark, The ScoreDoc[] contains only the IDs of each Lucene document. What would be the best way of getting the entire (Lucene) Document? Should I do a new lookup with the ID retrieved by hpc.getScores() - (searcher.doc(idDoc))? Thanks. Alixandre

On 7/2/07, mark harwood <[EMAIL PROTECTED]> wrote: The Hits class is OK but can be inefficient due to re-running the query unnecessarily. The class below illustrates how to efficiently retrieve a particular page of results and lends itself to webapps where you don't want to retain server-side state (i.e. a Hits object) for each client. [snip]
Modify search results
I have managed to download and install Lucene. In addition, I have reached the point at which I am able to generate an index and run a search. The search returns a 'raw' list of the HTML pages in which my search term occurs . . . chapter17, chapter18, etc. Question: how do I go about manipulating the search results? Is it possible to "intercept" the listing of HTML pages returned by the Lucene search function and modify the report it sends to the screen? Can this be as simple as adding a line to the Lucene Java code so that instead of reporting a simple chapter number, it will report the chapter surrounded by HTML code, e.g. instead of simply seeing "Chapter 17" on the screen, I want the report to read Chapter 17, paragraph 3 . . . of course then I'll need to get that info to an HTML page . . . later . . . I suspect this will be handled by 1) modification to the Lucene source code or 2) an addition of javascript or perl-script . . . but am not at all sure. Thanks in advance for any help that might be provided. r. mullin
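No change to Lucene's source is needed for this: your application code receives each hit and decides how to render it. A hedged sketch of wrapping one hit in HTML (the chapter/paragraph values would come from stored fields on the Lucene Document; the field and file names here are made up):

```java
public class ResultFormatter {
    // Render one hit as an HTML list item linking into the chapter file.
    // In a real app, chapter and paragraph would be read from the hit's
    // stored fields, e.g. doc.get("chapter").
    static String toHtml(String chapter, int paragraph) {
        return "<li><a href=\"" + chapter + ".html#p" + paragraph + "\">"
            + chapter + ", paragraph " + paragraph + "</a></li>";
    }

    public static void main(String[] args) {
        System.out.println(toHtml("chapter17", 3));
    }
}
```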
Lucene index in memcache
Is there a way to store a Lucene index in memcache? During high traffic, search becomes very slow. :( -- Cathy www.nachofoto.com
Re: Lucene index in memcache
You can always read the current index into a RAMDirectory, but I really wonder if that will make much of a difference, as your op system should be taking care of this kind of thing for you. How big is your index? What kind of performance are you seeing? What else is running on that box? I'd do some profiling to see where things are actually slow. In particular, think about logging how long each query takes to complete, just the Lucene part. I've seen similar situations where the actual time was being taken *outside* of Lucene itself by XML manipulations, for instance. Also, are you iterating over a Hits object for more than the top 100 entries? That would be very inefficient. Are you using a collector and calling IndexReader.doc() inside the loop? I'd *very* strongly recommend that you pinpoint where the time is actually being spent before jumping to the conclusion that using a RAMDirectory would fix your problem. I can't tell you how many times I've been *sure* I knew where the bottleneck was only to find out that it's someplace completely different. You simply cannot reliably optimize performance without really understanding where the time is being spent. Trust me on this one ... Some simple timings with System.currentTimeMillis() will tell you a lot. Best Erick On 7/2/07, Cathy Murphy <[EMAIL PROTECTED]> wrote: Is there a way to store lucene index in memcache. During high traffic search becomes very slow. :( -- Cathy www.nachofoto.com
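Erick's suggestion to time just the Lucene part can be as simple as bracketing the search call. A sketch (the commented-out searcher/query call stands in for your real objects):

```java
public class QueryTimer {
    // Measure how long a piece of work takes, in milliseconds.
    static long timeMillis(Runnable work) {
        long start = System.currentTimeMillis();
        work.run();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        long elapsed = timeMillis(new Runnable() {
            public void run() {
                // In a real app this would be just the Lucene call, e.g.:
                // searcher.search(query, collector);
                for (int i = 0; i < 1000000; i++) { /* stand-in work */ }
            }
        });
        System.out.println("search took " + elapsed + " ms");
    }
}
```

Logging this per query, separately from rendering/XML time, will show quickly whether Lucene is actually the bottleneck.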
Re: highlighting phrase query
There has been a lot of Highlighter discussion lately, but just to try and sum up the state of Highlighting in the Lucene world: There are four Highlighter implementations that I know of. From what I can tell, only the original Contrib Highlighter has received sustained active development by more than one individual.

Contrib Highlighter: The Contrib Highlighter supports the widest array of analyzers and corner cases and has had the widest exposure. It is generally slower on larger documents due to the requirement that you re-analyze text and to support a wider variety of use cases -- the TokenGroup for token overlaps and inspecting every term for Fragmentation contribute to a huge performance drain on large documents. This highlighter does not support highlighting based on position, and all terms from the query will be highlighted in the text. You can avoid some of the cost of re-analyzing by using the TokenSources class to rebuild a TokenStream using stored offsets and/or positions, but this is unlikely to be faster unless you are using very large documents with a complex analyzer. Getting and sorting offsets/positions is relatively slow, and for smaller docs it is faster to just re-analyze.

LUCENE-403: I have not spent a lot of time with this approach, but it is similar to the Contrib Highlighter approach. It almost certainly does not cover as many odd corner cases as the Contrib Highlighter and the framework is lacking, but it does add some support for proper PhraseQuery highlighting by implementing some custom PhraseQuery search logic. Because LUCENE-403 is not as rigorous as the Contrib Highlighter, it may well be a bit faster. The author claims that HTML tags will not be broken when fragmenting.

LUCENE-644: This Highlighter approach requires that you have stored term offsets in the index. This Highlighter can be very fast if you are using a complicated analyzer since there is no need for re-analyzing the text (due to the stored offsets). Also, rather than scoring every term like the Contrib Highlighter, only terms from the query are effectively "handled". For smaller documents and simpler analyzers there is not much speed improvement over the Contrib Highlighter (due to the time it takes to retrieve and sort offsets), but for larger documents, especially with more complex analyzers, this Highlighter can be extremely fast. Again, positional highlighting for Phrase and Span queries is not supported. The biggest reason this implementation performs so well is that it deals with the text in much bigger chunks. The Contrib Highlighter can also avoid re-analyzing by storing offsets and positions, but then it scores the document and rebuilds the text one token at a time using the performance-draining TokenGroup (which helps cover some of those corner cases). This is very slow on very large documents.

LUCENE-794: This approach extends the Contrib Highlighter to support highlighting Span and Phrase queries. The approach used for non-position-sensitive Query clauses is the same as the Contrib Highlighter's, and if you use the latest CachingTokenFilter the speed is roughly the same. Position-sensitive Query clauses are a bit slower as a MemoryIndex is used to retrieve the correct positions to highlight. This gives exact highlighting without reimplementing search logic. Also, all of the use cases and corner cases that have been solved for the Contrib Highlighter are retained. All of the deficiencies of the Contrib Highlighter (slower on large docs) are also retained. The majority of the code for this comes from the Contrib Highlighter -- it uses the Contrib Highlighter framework. Which points out a plus for the Contrib Highlighter setup -- it allows for an extension like this, while LUCENE-644 could not easily be expanded to handle position-sensitive queries. There has been some discussion of getting Lucene to identify correct highlights as the search is processed.
I am not very optimistic that this will be fruitful, but those that are discussing it know much more about this than I do. - Mark

sandeep chawla wrote: Hi All, I am developing a search tool using lucene. I am using lucene 2.1. i have a requirement to highlight query words in the results. [snip]
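The offset-based idea behind LUCENE-644 can be illustrated without Lucene: given the stored start/end character offsets of the matched spans, wrap each span in tags. A simplified sketch (assumes the offsets are sorted and non-overlapping; Lucene's implementation is considerably more involved):

```java
public class OffsetHighlighter {
    // offsets: pairs of (start, end) character positions, sorted ascending.
    static String highlight(String text, int[][] offsets) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (int[] span : offsets) {
            sb.append(text, last, span[0]);                          // text before the match
            sb.append("<b>").append(text, span[0], span[1]).append("</b>"); // the match
            last = span[1];
        }
        sb.append(text.substring(last));                             // trailing text
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "lucene java is fun";
        // Highlighting "lucene java" as ONE phrase span (offsets 0..11), rather
        // than highlighting the two terms independently, is exactly what the
        // phrase-aware highlighters discussed above buy you.
        System.out.println(highlight(text, new int[][] {{0, 11}}));
    }
}
```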
Re: Lucene index in memcache
: Is there a way to store lucene index in memcache. During high traffic search
: becomes very slow. :(

http://people.apache.org/~hossman/#xyproblem

Your question appears to be an "XY Problem" ... that is: you are dealing with "X", you are assuming "Y" will help you, and you are asking about "Y" without giving more details about the "X" so that we can understand the full issue. Perhaps the best solution doesn't involve "Y" at all? See Also: http://www.perlmonks.org/index.pl?node_id=542341 If you provide some more info about how you are using Lucene (ie: what your code looks like) and what the concepts of "high traffic" and "slow" mean to you, we might be able to help you better. -Hoss
Reusing Document Objects (was Auto Slop)
If I create a Document object, can I pass it to multiple index writers without harm? Or does the process of being handed to an IndexWriter somehow mutate the state of the Document object, say during tokenizing, that would cause its re-use with a totally separate index to cause problems ... such as I'm seeing with slop? -wls
RE: highlighting phrase query
Mark: Thanks a million for this comprehensive analysis. This is going straight to my manager. :) --Renaud -----Original Message----- From: Mark Miller [mailto:[EMAIL PROTECTED] Sent: Monday, July 02, 2007 2:11 PM To: java-user@lucene.apache.org Subject: Re: highlighting phrase query There has been a lot of Highlighter discussion lately, but just to try and sum up the state of Highlighting in the Lucene world: There are four Highlighter implementations that I know of. From what I can tell, only the original Contrib Highlighter has received sustained active development by more than one individual. Contrib... [snip]
multi-term query weighting
I have an index with two different sources of information, one small but of high quality (call it "title"), and one large but of lower quality (call it "body"). I give boosts to certain documents related to their popularity (this is very similar to what one would do indexing the web). The problem I have is with a query like "John Bush". I translate that into " (title:John^4.0 body:John) AND (title:Bush^4.0 body:Bush) ". But the results I get are: 1. George Bush ... 4. John Kerry ... 10. John Bush The reason (looking at explain) is that George Bush is scored roughly 169 = sum(1, 168 = sum(160, 8)), and John Kerry is similar but reversed. Poor old "John Bush" only scores 72 = sum(40, 32), because his initial boost was only 1/4 of George's. The question I have is: how can I tell the searcher to care about "balance"? I really want the score over two terms to be more like (sqrt(X)+sqrt(Y))^2 or maybe even exp(log(X)+log(Y)) rather than just X+Y. Is that supported in some obvious way, or is there some other way to phrase my query to say "I want both terms, but they should both be important if possible"? Thanks, Tim - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
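To see why Tim's suggested combination changes the ranking, plug in the approximate per-clause scores quoted in his explain output (George Bush: 1 and 168; John Bush: 40 and 32). A plain sum keeps the lopsided match on top, while exp(log(X)+log(Y)), which is simply the product X*Y, rewards the balanced match. A quick check of the arithmetic in plain Java; note Lucene's BooleanQuery does not offer such a combiner out of the box, so this only demonstrates the math, not an API.

```java
public class ScoreCombination {
    // Lucene's default for a disjunction: clause scores are summed.
    static double sumScore(double x, double y) {
        return x + y;
    }

    // exp(log x + log y) == x * y: a balance-rewarding alternative.
    static double productScore(double x, double y) {
        return Math.exp(Math.log(x) + Math.log(y));
    }

    public static void main(String[] args) {
        // Approximate per-clause scores from the explain output above:
        double[] georgeBush = {1, 168}; // lopsided match
        double[] johnBush = {40, 32};   // balanced match
        System.out.println(sumScore(georgeBush[0], georgeBush[1]));     // 169.0: George wins under sum
        System.out.println(sumScore(johnBush[0], johnBush[1]));         // 72.0
        System.out.println(productScore(georgeBush[0], georgeBush[1])); // ~168: product barely rewards the lopsided match
        System.out.println(productScore(johnBush[0], johnBush[1]));     // ~1280: John Bush wins under product
    }
}
```

Getting this behavior inside Lucene would mean a custom query/scorer (or rescoring the top N results in application code), since the stock similarity sums clause scores.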
RE: Pagination
Hi, Thanks Mark! I do have the same question as Alixandre. How do I get the content of the document instead of the document id? Thanks. Regards, Lee Li Bin -Original Message- From: Alixandre Santana [mailto:[EMAIL PROTECTED] Sent: Tuesday, July 03, 2007 12:55 AM To: java-user@lucene.apache.org Subject: Re: Pagination Mark, The ScoreDoc[] contains only the IDs of each Lucene document. What would be the best way of getting the entire (Lucene) document? Should I do a new search with the ID retrieved by hpc.getScores() - (searcher.doc(idDoc))? Thanks. Alixandre On 7/2/07, mark harwood <[EMAIL PROTECTED]> wrote: > The Hits class is OK but can be inefficient due to re-running the query unnecessarily. > > The class below illustrates how to efficiently retrieve a particular page of results and lends itself to webapps where you don't want to retain server-side state (i.e. a Hits object) for each client. > It would make sense to put an upper limit on the "start" parameter (as Google etc. do) to avoid consuming too much RAM per client request. 
> 
> Cheers,
> Mark
> 
> [Begin code]
> 
> package lucene.pagination;
> 
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.HitCollector;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.ScoreDoc;
> import org.apache.lucene.search.TermQuery;
> import org.apache.lucene.util.PriorityQueue;
> 
> /**
>  * A HitCollector that retrieves a specific page of results
>  * @author maharwood
>  */
> public class HitPageCollector extends HitCollector
> {
>     // Demo code showing pagination
>     public static void main(String[] args) throws Exception
>     {
>         IndexSearcher s = new IndexSearcher("/indexes/nasa");
>         HitPageCollector hpc = new HitPageCollector(1, 10);
>         Query q = new TermQuery(new Term("contents", "sea"));
>         s.search(q, hpc);
>         ScoreDoc[] sd = hpc.getScores();
>         System.out.println("Hits " + hpc.getStart() + " - " + hpc.getEnd() + " of " + hpc.getTotalAvailable());
>         for (int i = 0; i < sd.length; i++)
>         {
>             System.out.println(sd[i].doc);
>         }
>         s.close();
>     }
> 
>     int nDocs;
>     PriorityQueue hq;
>     float minScore = 0.0f;
>     int totalHits = 0;
>     int start;
>     int maxNumHits;
>     int totalInThisPage;
> 
>     public HitPageCollector(int start, int maxNumHits)
>     {
>         this.nDocs = start + maxNumHits;
>         this.start = start;
>         this.maxNumHits = maxNumHits;
>         hq = new HitQueue(nDocs);
>     }
> 
>     public void collect(int doc, float score)
>     {
>         totalHits++;
>         if ((hq.size() < nDocs) || (score >= minScore))
>         {
>             ScoreDoc scoreDoc = new ScoreDoc(doc, score);
>             hq.insert(scoreDoc); // update hit queue
>             minScore = ((ScoreDoc) hq.top()).score; // reset minScore
>         }
>         totalInThisPage = hq.size();
>     }
> 
>     public ScoreDoc[] getScores()
>     {
>         // Returns the number of hits required from the required start point.
>         // So, given hits 1234567890, a start of 2 + maxNumHits of 3 should return 234;
>         // or, given hits 12, it should return 2; and so on.
>         if (start <= 0)
>         {
>             throw new IllegalArgumentException("Invalid start: " + start + " - start should be >= 1");
>         }
>         int numReturned = Math.min(maxNumHits, (hq.size() - (start - 1)));
>         if (numReturned <= 0)
>         {
>             return new ScoreDoc[0];
>         }
>         ScoreDoc[] scoreDocs = new ScoreDoc[numReturned];
>         ScoreDoc scoreDoc;
>         for (int i = hq.size() - 1; i >= 0; i--) // put docs in array, working backwards from lowest score
>         {
>             scoreDoc = (ScoreDoc) hq.pop();
>             if (i < (start - 1))
>             {
>                 break; // off the beginning of the results array
>             }
>             if (i < (scoreDocs.length + (start - 1)))
>             {
>                 scoreDocs[i - (start - 1)] = scoreDoc; // within scope of results array
>             }
>         }
>         return scoreDocs;
>     }
> 
>     public int getTotalAvailable()
>     {
>         return totalHits;
>     }
> 
>     public int getStart()
>     {
>         return start;
>     }
> 
>     public int getEnd()
>     {
>         return start + totalInThisPage - 1;
>     }
> 
>     public class HitQueue extends PriorityQueue
>     {
>         public HitQueue(int size)
>         {
>             initialize(size);
>         }
> 
>         public final boolean lessThan(Object a, Object b)
>         {
>             ScoreDoc hitA = (ScoreDoc) a;
>             ScoreDoc hitB = (ScoreDoc) b;
>             if (hitA.score == hitB.score)
>                 return hitA.doc > hitB.doc;
>             else
>                 return hitA.score < hitB.score;
>         }
>     }
> }
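Regarding the follow-on question about later pages: with this collector the page is selected entirely by the constructor arguments, so page N with S hits per page starts at offset (N - 1) * S + 1, e.g. new HitPageCollector(11, 10) for the second page of ten. A minimal sketch of just that arithmetic (plain Java; the helper name is illustrative):

```java
public class PageMath {
    // 1-based start offset for a given page (1-based) and page size,
    // matching HitPageCollector's (start, maxNumHits) constructor convention.
    static int pageStart(int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        return (page - 1) * pageSize + 1;
    }

    public static void main(String[] args) {
        // Page 1 of 10 hits: new HitPageCollector(1, 10)
        System.out.println(pageStart(1, 10)); // 1
        // Page 2 of 10 hits: new HitPageCollector(11, 10)
        System.out.println(pageStart(2, 10)); // 11
        System.out.println(pageStart(5, 20)); // 81
    }
}
```

Since the collector holds no state between requests, the webapp only needs to carry the page number (say, as a request parameter) and rebuild the collector with the computed start on each request.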
RE: Pagination
Hi Mark, How do I display results on the second page? I managed to display one page using your code. Regards, Lee Li Bin -Original Message- From: Alixandre Santana [mailto:[EMAIL PROTECTED] Sent: Tuesday, July 03, 2007 12:55 AM To: java-user@lucene.apache.org Subject: Re: Pagination [snip: Alixandre's question and Mark's HitPageCollector code, quoted in full in the previous message]
Re: highlighting phrase query
Thanks a lot Mark. Has anyone used LUCENE-794? How stable is it? Is it widely used in industry? These are some of my questions :) Thanks Sandeep On 03/07/07, Renaud Waldura <[EMAIL PROTECTED]> wrote: Mark: Thanks a million for this comprehensive analysis. This is going straight to my manager. :) --Renaud -Original Message- From: Mark Miller [mailto:[EMAIL PROTECTED] Sent: Monday, July 02, 2007 2:11 PM To: java-user@lucene.apache.org Subject: Re: highlighting phrase query There has been a lot of Highlighter discussion lately, but just to try and sum up the state of Highlighting in the Lucene world: There are four Highlighter implementations that I know of. From what I can tell, only the original Contrib Highlighter has received sustained active development by more than one individual. Contrib... [snip] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- SANDEEP CHAWLA House No- 23 10th main BTM 1st Stage Bangalore Mobile: 91-9986150603 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene index in memcache
Hi Erick & Chris, Thanks for your response. I have done some profiling, and it seems the response is slow when there are long queries (more than 5-6 words per query). The way I have implemented it is: I pass in the search query and Lucene returns the total number of hits, along with ids. I then fetch objects for only those ids, as required for the pagination. Also, it is a dedicated search box. Thanks, -- Cathy www.nachofoto.com On 7/2/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: [snip: "XY Problem" reply, quoted in full in the earlier message] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]