Improving Readability of Hit Highlighting

2009-01-12 Thread Terence Gannon
I'm indexing text from an OCR of an old document.  Many words get read
perfectly, but they're typically embedded in a lot of junk.  I would
like the hit highlighting to show only the 'good' words, in the order
in which they appeared in the original document.  Is it possible to
use output of the filter classes as the text used in hit highlighting?
Or do you have to do all the text cleanup outside of Solr and present it
with two fields to index, one with the original text and one with the
cleaned-up text?  The objective of the hit highlighting is to give the
user a *sense* of the original context, even if it's not provided
verbatim from the original document.  Thanks in advance.

TerryG
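
The two-field idea raised above could be sketched roughly as follows: clean the OCR text before indexing and send both versions, so that highlighting can be pointed at the cleaned field. The field names (ocr_raw, ocr_clean) and the "looks like a word" heuristic are assumptions made for illustration only; nothing in this thread specifies them.

```python
# Hypothetical field names; the thread does not name the real schema fields.
RAW_FIELD = "ocr_raw"
CLEAN_FIELD = "ocr_clean"


def looks_like_word(token, max_len=12):
    """Crude heuristic (an assumption): purely alphabetic and not implausibly long."""
    return token.isalpha() and len(token) <= max_len


def build_document(doc_id, ocr_text):
    """Return a dict with the original text and a cleaned copy, preserving word order."""
    kept = [t.lower() for t in ocr_text.split() if looks_like_word(t)]
    return {"id": doc_id, RAW_FIELD: ocr_text, CLEAN_FIELD: " ".join(kept)}


doc = build_document(
    "page-1",
    "CONTRACTORINMPRIMENTAYIVE : mom Ale ACCEPT INFORMATIONON TOUR SHEET TO ea",
)
print(doc[CLEAN_FIELD])  # mom ale accept tour sheet to ea
```

A length-and-letters heuristic like this cannot tell that a short fragment such as "ea" is junk, so it only approximates the goal of showing the 'good' words.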


Re: Improving Readability of Hit Highlighting

2009-01-12 Thread Otis Gospodnetic
I'm not sure if I have a good suggestion, but I have a question. :)  What is 
considered junk?  Would it be possible to eliminate the junk before it even 
goes into the index in order to avoid GIGO (Garbage In Garbage Out)?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch






Re: Improving Readability of Hit Highlighting

2009-01-12 Thread Terence Gannon
To answer your questions specifically, here is an example of the raw OCR output:

CONTRACTORINMPRIMENTAYIVE : mom Ale ACCEPT INFORMATIONON TOUR SHEET TO ea

from which I would like to see:

mom ale access tour sheet to

in the hit highlight.  My schema for this field is pretty much
standard, as follows:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" .../>
<filter class="solr.WordDelimiterFilterFactory" .../>
<filter class="solr.LowerCaseFilterFactory" .../>
<filter class="solr.EnglishPorterFilterFactory" .../>
<filter class="solr.RemoveDuplicatesTokenFilterFactory" .../>

When I examine the effect of each of these with the Analyzer, it seems
like if I could use the output after LowerCaseFilterFactory in the hit
highlight, I would come close to achieving what I want.

I'm not averse to doing the text cleanup external to Solr before the
indexing, but only if it's *not* redundant to what the filter
factories are going to do anyway.  Thanks for your help!

TerryG


Re: Improving Readability of Hit Highlighting

2009-01-12 Thread Otis Gospodnetic
Hi,

Quick note: please include a copy of the previous email when replying, so people
can be reminded of the context.

You mentioned junk getting highlighted.  In your case, is
CONTRACTORINMPRIMENTAYIVE getting highlighted?  And is that junk?  If so, why
not augment your indexing to throw out junk tokens, if you have some rules for
what constitutes a junk token (e.g. token not in dictionary)?


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
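
The "token not in dictionary" rule suggested above might look something like the sketch below, run on the text before it is sent to the index. The function name keep_known_words and the tiny vocabulary are hypothetical stand-ins; a real setup would presumably load a proper word list and might need domain-specific terms added.

```python
def keep_known_words(text, vocabulary):
    """Return the tokens of `text` found in `vocabulary`, lowercased, in original order."""
    kept = []
    for token in text.split():
        # Strip surrounding punctuation and lowercase before the lookup.
        word = token.strip(".,;:!?\"'()").lower()
        if word in vocabulary:
            kept.append(word)
    return " ".join(kept)


# Tiny stand-in vocabulary (an assumption, for illustration only).
VOCABULARY = {"mom", "ale", "accept", "tour", "sheet", "to", "information", "contractor"}

line = "CONTRACTORINMPRIMENTAYIVE : mom Ale ACCEPT INFORMATIONON TOUR SHEET TO ea"
cleaned = keep_known_words(line, VOCABULARY)
print(cleaned)  # mom ale accept tour sheet to
```

With this rule, run-together tokens such as INFORMATIONON are dropped entirely rather than repaired, which fits the stated aim of giving a sense of the context rather than a verbatim copy.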


