At 3:45 PM -0700 10/4/07, Mike Klaas wrote:
>I'm actually somewhat surprised that several people are interested in this but 
>none have been sufficiently interested to implement a solution to 
>contribute:
>
>http://issues.apache.org/jira/browse/SOLR-42

I just devised a workaround earlier in the week and was planning on posting it; 
thanks to your nudge I just did (to SOLR-42).  Hopefully it may be of use to 
someone else.

It uses a PatternTokenizerFactory with a RegEx that swallows runs of HTML- or 
XML-like tags:

  (?:\s*</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>\s*)|\s

and it treats "runs of things that look like HTML/XML open or close tags 
with optional attributes, optionally preceded or followed by spaces" 
identically to "runs of one or more spaces": both act as token delimiters 
and get swallowed up, so the preceding and following tokens keep the correct 
offsets.
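
For reference, wiring this into schema.xml looks roughly like the sketch 
below (the field type name is made up, and note that the < and " characters 
inside the pattern attribute have to be XML-escaped as &lt; and &quot;):

```xml
<!-- Sketch only: "text_tagstrip" is an illustrative name.          -->
<!-- group="-1" puts PatternTokenizerFactory in split mode, i.e.    -->
<!-- the pattern matches the delimiters, not the tokens.            -->
<fieldType name="text_tagstrip" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" group="-1"
      pattern="(?:\s*&lt;/?\w+((\s+\w+(\s*=\s*(?:&quot;.*?&quot;|'.*?'|[^'&quot;>\s]+))?)+\s*|\s*)/?>\s*)|\s"/>
  </analyzer>
</fieldType>
```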

Of course this is just a hack: it has no real understanding of HTML or XML 
syntax, so something invalid like </closing attr="x"/> will still be matched. 
On the other hand, bare < and > in text will be left alone.
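
Both behaviors are easy to check; for instance, splitting on the same 
pattern in (present-day) Python behaves like this -- a sketch only, with the 
groups made non-capturing so re.split returns just the tokens, and with the 
quoted-value alternatives assumed to be ".*?" and '.*?':

```python
import re

# Delimiter pattern from above, with all groups made non-capturing so
# that re.split returns only the tokens between delimiter matches.
TAG_OR_SPACE = re.compile(
    r"""(?:\s*</?\w+(?:(?:\s+\w+(?:\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>\s*)|\s"""
)

def tokenize(text):
    # Drop the empty strings re.split yields between adjacent delimiters.
    return [t for t in TAG_OR_SPACE.split(text) if t]

# Runs of tags and spaces collapse into a single token boundary:
print(tokenize('Hello <b class="x">world</b> and   goodbye'))
# ['Hello', 'world', 'and', 'goodbye']

# Invalid but tag-shaped markup is still swallowed...
print(tokenize('before </closing attr="x"/> after'))
# ['before', 'after']

# ...while bare < and > in text are left alone:
print(tokenize('1 < 2 and 3 > 2'))
# ['1', '<', '2', 'and', '3', '>', '2']
```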

Also note that, unlike HTMLStripReader, it doesn't decode XML or HTML numeric 
or symbolic entity references (my indexer pre-decodes the entity references 
before sending the text to Solr for indexing).
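
For illustration only (not what my indexer actually runs), that kind of 
pre-decoding is a one-liner in today's Python standard library:

```python
import html

# Decode symbolic (&eacute;, &amp;) and numeric (&#232;) entity
# references before handing the text off for tokenization.
decoded = html.unescape('caf&eacute; &amp; cr&#232;me')
print(decoded)
# café & crème
```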

So fixing HTMLStripReader and its dependent HTMLStripXXXTokenizers to do the 
right thing with offsets would still be a worthy task.  I wonder whether 
recasting HTMLStripReader using the 
org.apache.lucene.analysis.standard.CharStream interface would make sense for 
this?

(I just added the above to the Jira comment, please pardon the redundancy)

- J.J.
