Improving Readability of Hit Highlighting
I'm indexing text from an OCR of an old document. Many words get read perfectly, but they're typically embedded in a lot of junk. I would like the hit highlighting to show only the 'good' words, in the order in which they appeared in the original document. Is it possible to use the output of the filter classes as the text used in hit highlighting? Or do you have to do all the text cleanup outside of Solr and present it with two fields to index, one with the original text and one with the cleaned-up text? The objective of the hit highlighting is to give the user a *sense* of the original context, even if it's not provided verbatim from the original document. Thanks in advance. TerryG
Re: Improving Readability of Hit Highlighting
I'm not sure if I have a good suggestion, but I have a question. :) What is considered junk? Would it be possible to eliminate the junk before it even goes into the index in order to avoid GIGO (Garbage In Garbage Out)?

Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----- From: Terence Gannon butzi0...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, January 12, 2009 11:00:31 AM Subject: Improving Readability of Hit Highlighting
Re: Improving Readability of Hit Highlighting
To answer your questions specifically, here is an example of the raw OCR output:

    CONTRACTORINMPRIMENTAYIVE : mom Ale ACCEPT INFORMATIONON TOUR SHEET TO ea

from which I would like to see:

    mom ale access tour sheet to

in the hit highlight. My schema for this field is pretty much standard, as follows:

    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ... />
    <filter class="solr.WordDelimiterFilterFactory" ... />
    <filter class="solr.LowerCaseFilterFactory" ... />
    <filter class="solr.EnglishPorterFilterFactory" ... />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" ... />

When I examine the effect of each of these with the Analyzer, it seems that if I could use the output after LowerCaseFilterFactory in the hit highlight, I would come close to achieving what I want. I'm not averse to doing the text cleanup external to Solr before the indexing, but only if it's *not* redundant to what the filter factories are going to do anyway. Thanks for your help! TerryG
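As a rough illustration of what the text might look like after the LowerCaseFilterFactory stage, here is a minimal Python sketch. It is only an approximation and not Solr's actual behavior: the stop list is hypothetical, stop-word matching is assumed to ignore case, and WordDelimiterFilterFactory is replaced by a crude stand-in that merely drops punctuation-only tokens.

```python
# Hypothetical stop list; Solr reads its own list from the StopFilterFactory configuration.
STOP_WORDS = {"a", "an", "and", "the", "to", "of"}


def approximate_analysis(raw, stop_words=STOP_WORDS):
    """Roughly mimic whitespace tokenizing, stop removal, and lowercasing."""
    # Whitespace tokenization.
    tokens = raw.split()
    # Stop-word removal, assumed here to be case-insensitive.
    tokens = [t for t in tokens if t.lower() not in stop_words]
    # Crude stand-in for word-delimiter handling: drop tokens with no letters or digits.
    tokens = [t for t in tokens if any(c.isalnum() for c in t)]
    # Lowercasing.
    return [t.lower() for t in tokens]


sample = "CONTRACTORINMPRIMENTAYIVE : mom Ale ACCEPT INFORMATIONON TOUR SHEET TO ea"
print(" ".join(approximate_analysis(sample)))
```

Under these assumptions the sample line still contains tokens such as "contractorinmprimentayive" and "informationon", and loses the stop word "to", so some extra rule for recognizing junk would still be needed.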
Re: Improving Readability of Hit Highlighting
Hi, Quick note: please include a copy of the previous email when replying, so people can be reminded of the context. You mentioned junk getting highlighted. In your case, is CONTRACTORINMPRIMENTAYIVE getting highlighted? And that is junk? If so, why not augment your indexing to throw out junk tokens, if you have some rules for what constitutes a junk token (e.g. token not in dictionary)?

Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----- From: Terence Gannon butzi0...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, January 12, 2009 4:07:57 PM Subject: Re: Improving Readability of Hit Highlighting
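One way the "token not in dictionary" rule could look when done outside Solr is sketched below in Python. This is an assumption-laden illustration, not a tested recipe: the word list is a tiny hypothetical stand-in for a real dictionary, and the field names `text_raw` and `text_clean` are made up for the two-field layout mentioned in the original question (index the raw text, highlight from the cleaned text).

```python
import re

# Hypothetical word list; in practice this would be loaded from a real dictionary file.
KNOWN_WORDS = {"mom", "ale", "accept", "tour", "sheet", "to"}


def clean_ocr_text(raw, known_words=KNOWN_WORDS):
    """Keep only tokens found in known_words, lowercased, in their original order."""
    kept = []
    for token in raw.split():
        # Strip non-alphanumeric characters before the dictionary lookup.
        word = re.sub(r"[^A-Za-z0-9]", "", token).lower()
        if word and word in known_words:
            kept.append(word)
    return " ".join(kept)


def build_document(doc_id, raw):
    """Build a document with both the raw and the cleaned text.

    The field names "text_raw" and "text_clean" are illustrative assumptions.
    """
    return {"id": doc_id, "text_raw": raw, "text_clean": clean_ocr_text(raw)}


sample = "CONTRACTORINMPRIMENTAYIVE : mom Ale ACCEPT INFORMATIONON TOUR SHEET TO ea"
print(build_document("doc-1", sample)["text_clean"])
```

With this hypothetical word list, the sample line reduces to "mom ale accept tour sheet to". A dictionary lookup keeps short common words like "to", which a stop filter would otherwise remove, and that may help the snippet read more naturally.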