Re: Lucene "cuts" the search results ?
markharw00d wrote:
> The highlighter uses a number of "pluggable" services, one of which is the choice of "Fragmenter" implementation. This interface is for classes which decide the boundaries where to cut the original text into snippets. The default implementation simply breaks up text into evenly sized chunks. A more intelligent implementation could be made to detect sentence boundaries.

Also note that paragraph boundaries alone would help a lot and are easier to detect reliably.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
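A rough sketch of the boundary detection discussed above, in plain Java rather than against the highlighter's Fragmenter interface. The class name BoundarySplitter and its two methods are made up for illustration. Paragraphs are taken to be runs of text separated by blank lines, and sentences are found with the JDK's java.text.BreakIterator; both are assumptions about what "boundary" should mean, not something the thread specifies.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: finds paragraph and sentence boundaries in plain text. */
final class BoundarySplitter {

    /** Splits text into paragraphs, treating one or more blank lines as the separator. */
    static List<String> paragraphs(String text) {
        List<String> result = new ArrayList<>();
        // \R matches any line break; two of them with optional whitespace between is a blank line.
        for (String part : text.split("\\R\\s*\\R")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    /** Splits text into sentences using the JDK's rule-based sentence iterator. */
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "First paragraph.\n\nSecond paragraph, line one.\nSecond paragraph, line two.";
        System.out.println(paragraphs(text).size() + " paragraphs");
        System.out.println(sentences("Lucene is a search library. It indexes text."));
    }
}
```

The paragraph split is a single regular expression, which fits the point that paragraph boundaries are the easier ones to detect. The sentence iterator is rule-based and can be fooled by abbreviations such as "Mr.", so its output is best treated as a hint.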
Re: Lucene "cuts" the search results ?
Hi Pierre,

Here's the response I gave the last time this question was raised:

> The highlighter uses a number of "pluggable" services, one of which is the choice of "Fragmenter" implementation. This interface is for classes which decide the boundaries where to cut the original text into snippets. The default implementation simply breaks up text into evenly sized chunks. A more intelligent implementation could be made to detect sentence boundaries.

What you are asking for requires that the Fragmenter know where the upcoming query matches are and decide on fragment boundaries with this in mind. Having this foresight would require a preliminary pass over the TokenStream to identify the match points before calling the highlighter. Such a Fragmenter implementation does not exist, but it does not sound unachievable. I would suggest that some knowledge of sentence boundaries would probably help here too.

I don't have any plans to write such a Fragmenter now, but this is how it could be done.

Hope this helps,
Cheers,
Mark
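The two-pass idea described above can be sketched in plain Java. This is not a Fragmenter implementation and does not touch the TokenStream: the class MatchAwareFragment, its methods, and the use of simple case-sensitive substring search to stand in for "query matches" are all assumptions made to keep the example self-contained. The first pass records where the term occurs; the second pass cuts a window around the first occurrence and pulls its edges in to the nearest whitespace so no word is split.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of match-aware fragmenting on a plain string. */
final class MatchAwareFragment {

    /** First pass: start offsets of every (case-sensitive) occurrence of term. */
    static List<Integer> findMatches(String text, String term) {
        List<Integer> offsets = new ArrayList<>();
        if (term.isEmpty()) {
            return offsets; // nothing sensible to match
        }
        int from = 0;
        while (true) {
            int idx = text.indexOf(term, from);
            if (idx < 0) {
                break;
            }
            offsets.add(idx);
            from = idx + term.length();
        }
        return offsets;
    }

    /**
     * Second pass: returns at most about `size` characters centred on the
     * first match, or null when the term does not occur.
     */
    static String fragmentAround(String text, String term, int size) {
        List<Integer> matches = findMatches(text, term);
        if (matches.isEmpty()) {
            return null;
        }
        int matchStart = matches.get(0);
        int matchEnd = matchStart + term.length();
        int padding = Math.max(0, (size - term.length()) / 2);
        int start = Math.max(0, matchStart - padding);
        int end = Math.min(text.length(), matchEnd + padding);
        // Move the left edge forward until it sits at the start of a word.
        while (start > 0 && start < matchStart
                && !Character.isWhitespace(text.charAt(start - 1))) {
            start++;
        }
        // Move the right edge back until it sits at the end of a word.
        while (end < text.length() && end > matchEnd
                && !Character.isWhitespace(text.charAt(end))) {
            end--;
        }
        return text.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String text = "The council approved the new budget on Friday after a long debate.";
        System.out.println(fragmentAround(text, "budget", 30));
    }
}
```

With the sample sentence and a window of 30, the fragment is "the new budget on Friday", so the match keeps context on both sides. A real implementation would take match positions from the analyzed tokens and would probably snap to sentence boundaries rather than word boundaries.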
Re: Lucene "cuts" the search results ?
Thanks for the reply, Daniel.

But is there anything I can do to prevent this from happening?

Regards

Daniel Naber wrote:
> On Tuesday 15 February 2005 09:39, Pierre VANNIER wrote:
>> String fragment = highlighter.getBestFragment(stream, introduction);
>
> The highlighter breaks up text into same-size chunks (100 characters by default). If the matching term happens to appear right at the end or at the start of such a chunk, you'll get no context and it looks as if the text was cut off.
>
> Regards
> Daniel
Re: Lucene "cuts" the search results ?
On Tuesday 15 February 2005 09:39, Pierre VANNIER wrote:
> String fragment = highlighter.getBestFragment(stream, introduction);

The highlighter breaks up text into same-size chunks (100 characters by default). If the matching term happens to appear right at the end or at the start of such a chunk, you'll get no context and it looks as if the text was cut off.

Regards
Daniel
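To make the effect concrete, here is a small plain-Java illustration of fixed-size chunking. It is a simplification and not the highlighter's code: it cuts on raw character counts, whereas the highlighter works on tokens, and the class FixedSizeChunks, its methods, and the chunk size of 29 are chosen only so that the search term lands exactly at the start of a chunk.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of same-size chunking; not Lucene's implementation. */
final class FixedSizeChunks {

    /** Splits text into consecutive chunks of at most chunkSize characters. */
    static List<String> split(String text, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += chunkSize) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
        }
        return chunks;
    }

    /** Returns the first chunk that contains term, or null if none does. */
    static String firstChunkContaining(List<String> chunks, String term) {
        for (String chunk : chunks) {
            if (chunk.contains(term)) {
                return chunk;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "The council approved the new budget on Friday after a long debate.";
        List<String> chunks = split(text, 29);
        // The chunk holding "budget" starts with the term itself.
        System.out.println(firstChunkContaining(chunks, "budget"));
    }
}
```

Here the chunk returned for "budget" is "budget on Friday after a long": the words leading up to the term fell into the previous chunk, which is the "cut off" look described above. In the highlighter itself, one thing to try is a larger chunk size, by passing a SimpleFragmenter constructed with a bigger fragment size to Highlighter.setTextFragmenter; the exact signatures should be checked against the Javadoc of the build in use.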
Lucene "cuts" the search results ?
Hi all,

I'm quite new to Lucene, but I bought "Lucene in Action" and I'm trying to customize a few examples taken from it.

I have this sample JSP code (bad JSP, because I'm also a JSP newbie :-)).

Here's the code:

.html head body
<%
    long start = new Date().getTime();
    Iterator myIterator = vIndexDir.iterator();
    while (myIterator.hasNext()) {
        IndexSearcher searcher = new IndexSearcher((String) myIterator.next());
        Query query = new TermQuery(new Term("introduction", queryString));
        Hits hits = searcher.search(query);
        QueryScorer scorer = new QueryScorer(query);
        Highlighter highlighter = new Highlighter(scorer);
%>
<%
        out.println("NUMBER OF MATCHING NEWS FOR \"" + (String) myIterator.next() + "\" -->" + hits.length() + "");
        for (int i = 0; i < hits.length(); i++) {
            String introduction = hits.doc(i).get("introduction");
            TokenStream stream = new SimpleAnalyzer().tokenStream("introduction", new StringReader(introduction));
            String fragment = highlighter.getBestFragment(stream, introduction);
            String pubDate = hits.doc(i).get("pubDate").substring(0, hits.doc(i).get("pubDate").length() - 13);
            String link = hits.doc(i).get("link");
            float score = hits.score(i);
            String title = hits.doc(i).get("title");
%>
Scoring : <%=score%>
<%=pubDate + " link + "', 'news', 'width=760;height=600')\">" + title + "" %>
<%= fragment%>
<%}%>
<%
    }
    long end = new Date().getTime();
    long interval = end - start;
%>
System time for query : <%= interval%> milliseconds

---

The output is all right, but at the end of the result page the last "hit" is cut off. For example:

Scoring : 0.9210043 Fri, 28 Jan 2005 -

I'm running all this in Tomcat 5.0.28 with the latest nightly build of Lucene.

So, could it be a caching problem? Could this come from JSP or from Lucene?

Thanks, and I do apologise for my poor English ;-)

Pierre VANNIER