Re: Lucene "cuts" the search results ?

2005-02-15 Thread Doug Cutting
markharw00d wrote:
> The highlighter uses a number of "pluggable" services, one of which is the
> choice of "Fragmenter" implementation. This interface is for classes which
> decide the boundaries where to cut the original text into snippets. The
> default implementation used simply breaks up text into evenly sized chunks.
> A more intelligent implementation could be made to detect sentence
> boundaries.

Also note that paragraph boundaries alone would help a lot and are easier to
reliably detect.
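
A minimal sketch of paragraph detection, assuming paragraphs are separated by one or more blank lines. `ParagraphSplitter` is a hypothetical helper written for illustration; it is not part of Lucene or of the highlighter's Fragmenter interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Lucene class: splits text into paragraphs.
final class ParagraphSplitter {

    // Assumption: a paragraph boundary is a line break, optional whitespace,
    // and another line break (that is, at least one blank line).
    static List<String> split(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String part : text.split("\\R\\s*\\R")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }
}
```

Each returned paragraph could then be treated as one candidate fragment, so a snippet never starts or ends in the middle of a paragraph.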

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene "cuts" the search results ?

2005-02-15 Thread markharw00d
Hi Pierre,
Here's the response I gave the last time this question was raised:
The highlighter uses a number of "pluggable" services, one of which is the
choice of "Fragmenter" implementation. This interface is for classes which
decide the boundaries where to cut the original text into snippets. The
default implementation used simply breaks up text into evenly sized chunks.
A more intelligent implementation could be made to detect sentence boundaries.
What you are asking for requires that the Fragmenter know where the
upcoming query matches are and decide on fragment boundaries with this in
mind. To have this foresight would require a preliminary pass over the
TokenStream to identify the match points before calling the highlighter.
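
The two-pass idea can be sketched on plain strings, as below. This is only an illustration under stated assumptions: `MatchCenteredSnippet`, `findMatches` and `snippet` are hypothetical names, matching is a simple case-insensitive substring search rather than a pass over a TokenStream, and nothing here uses the highlighter API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; it does not implement Lucene's Fragmenter.
final class MatchCenteredSnippet {

    // Pass 1: record the start offset of every case-insensitive occurrence.
    static List<Integer> findMatches(String text, String term) {
        List<Integer> offsets = new ArrayList<>();
        if (term.isEmpty()) {
            return offsets;
        }
        for (int i = 0; i + term.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, term.length())) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    // Pass 2: take a window of about `size` characters centred on the first
    // match, then widen it to the nearest whitespace so no word is split.
    static String snippet(String text, String term, int size) {
        List<Integer> matches = findMatches(text, term);
        if (matches.isEmpty()) {
            // No match: fall back to the beginning of the text.
            return text.substring(0, Math.min(size, text.length()));
        }
        int centre = matches.get(0) + term.length() / 2;
        int start = Math.max(0, centre - size / 2);
        int end = Math.min(text.length(), start + size);
        while (start > 0 && !Character.isWhitespace(text.charAt(start - 1))) {
            start--;
        }
        while (end < text.length() && !Character.isWhitespace(text.charAt(end))) {
            end++;
        }
        return text.substring(start, end).trim();
    }
}
```

Because the window is chosen after the match offsets are known, the match always has context on both sides when the surrounding text allows it.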

This Fragmenter implementation does not exist, but it does not sound
unachievable. I would suggest that some knowledge of sentence boundaries
would probably help here too. I don't have any plans to write such a
Fragmenter now, but this is how it could be done.
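
For the sentence-boundary part, one possible starting point is the JDK's `java.text.BreakIterator`. The sketch below assumes English text; `SentenceSplitter` is a hypothetical helper, not an existing Fragmenter.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not a Lucene class: splits text into sentences.
final class SentenceSplitter {

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        // Assumption: English sentence rules are good enough for the content.
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }
}
```

The offsets reported by the iterator (`start` and `end` in the loop) are the positions at which a sentence-aware Fragmenter would be allowed to cut.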
Hope this helps,
Cheers,
Mark



Re: Lucene "cuts" the search results ?

2005-02-15 Thread Pierre VANNIER
Thanks for the reply, Daniel.
But is there anything I can do to prevent this from happening?
Regards
Daniel Naber wrote:
> On Tuesday 15 February 2005 09:39, Pierre VANNIER wrote:
>
> > String fragment = highlighter.getBestFragment(stream,
> > introduction);
>
> The highlighter breaks up text into same-size chunks (100 characters by
> default). If the matching term now appears just at the end or at the start of
> such a chunk you'll get no context and it looks as if text was cut off.
>
> Regards
> Daniel


Re: Lucene "cuts" the search results ?

2005-02-15 Thread Daniel Naber
On Tuesday 15 February 2005 09:39, Pierre VANNIER wrote:

>          String fragment = highlighter.getBestFragment(stream,
> introduction);

The highlighter breaks up text into same-size chunks (100 characters by 
default). If the matching term now appears just at the end or at the start of 
such a chunk you'll get no context and it looks as if text was cut off.
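
To make the effect concrete, here is a character-based simplification of fixed-size chunking. `FixedSizeChunks` is a hypothetical stand-in, not the highlighter's actual code, and the chunk size of 27 is chosen only so that the word "Lucene" lands at the very start of a chunk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for fixed-size fragmenting; not the highlighter itself.
final class FixedSizeChunks {

    // Cuts the text every `size` characters, regardless of where words
    // or matches fall.
    static List<String> chunk(String text, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += size) {
            chunks.add(text.substring(start, Math.min(text.length(), start + size)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "Tomcat serves the page and Lucene returns the hits for it.";
        // The second chunk begins with "Lucene", so a fragment built from it
        // shows nothing of what came before the match.
        for (String c : chunk(text, 27)) {
            System.out.println("[" + c + "]");
        }
    }
}
```

If larger chunks are acceptable, the highlighter package offers `SimpleFragmenter` with a fragment-size constructor, set via `highlighter.setTextFragmenter(new SimpleFragmenter(200))`; the value 200 is only an example, and the call is worth checking against the Lucene build in use.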

Regards
 Daniel




Lucene "cuts" the search results ?

2005-02-15 Thread Pierre VANNIER
Hi all,
I'm quite new to Lucene, but I bought "Lucene in Action" and I'm
trying to customize a few examples taken from it.

I have this sample JSP code (bad JSP, because I'm also a JSP newbie :-)).
Here's the code:

.html head body 
<%
long start = new Date().getTime();
Iterator myIterator = vIndexDir.iterator();
while (myIterator.hasNext())
{
    IndexSearcher searcher = new IndexSearcher((String) myIterator.next());
    Query query = new TermQuery(new Term("introduction", queryString));
    Hits hits = searcher.search(query);
    QueryScorer scorer = new QueryScorer(query);
    Highlighter highlighter = new Highlighter(scorer);
%>

<%
    out.println("NUMBER OF MATCHING NEWS FOR \"" + (String) myIterator.next() + "\" -->" + hits.length() + "");
    for (int i = 0; i < hits.length(); i++)
    {
        String introduction = hits.doc(i).get("introduction");
        TokenStream stream = new SimpleAnalyzer().tokenStream("introduction", new StringReader(introduction));
        String fragment = highlighter.getBestFragment(stream, introduction);
        String pubDate = hits.doc(i).get("pubDate").substring(0, hits.doc(i).get("pubDate").length() - 13);
        String link = hits.doc(i).get("link");
        float score = hits.score(i);
        String title = hits.doc(i).get("title");
%>

 
 Scoring : <%=score%>
 <%=pubDate +
 " 
 link + "', 'news', 'width=760;height=600')\">" +
 title +
 ""
 %>
 
 <%= fragment%>
 
 
 
<%}%>

<%
}
long end = new Date().getTime();
long interval = end - start;
%>
System time for query : <%= interval %> milliseconds



---
The output is all right, but at the end of the result page the last
"hit" is cut off. For example:

Scoring : 0.9210043
Fri, 28 Jan 2005
-
I'm running all this on Tomcat 5.0.28 with the latest nightly build of
Lucene.

Could it be a caching problem? Could this come from JSP or from Lucene?
Thanks, and I apologise for my poor English ;-)
Pierre VANNIER