At 3:45 PM -0700 10/4/07, Mike Klaas wrote:
>I'm actually somewhat surprised that several people are interested in
>this but none have been sufficiently interested to implement a solution
>to contribute:
>
>http://issues.apache.org/jira/browse/SOLR-42
I just devised a workaround earlier in the week and was planning on posting it; thanks to your nudge I just did (to SOLR-42). Hopefully it will be of use to someone else.

It uses a PatternTokenizerFactory with a regex that swallows runs of HTML- or XML-like tags:

  (?:\s*</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>\s*)|\s

It treats runs of "things that look like HTML/XML open or close tags with optional attributes, optionally preceded or followed by whitespace" identically to "runs of one or more spaces": both act as token delimiters and get swallowed, so the preceding and following tokens keep the correct offsets into the original text.

Of course this is just a hack: it has no real understanding of HTML or XML syntax, so something invalid like </closing attr="x"/> will still be matched. On the other hand, bare < and > characters in running text are left alone. Also note it doesn't decode XML or HTML numeric or symbolic entity references, as HTMLStripReader does (my indexer pre-decodes the entity references before sending the text to Solr for indexing).

So fixing HTMLStripReader and its dependent HTMLStripXXXTokenizers to do the right thing with offsets would still be a worthy task. I wonder whether recasting HTMLStripReader using the org.apache.lucene.analysis.standard.CharStream interface would make sense for this?

(I just added the above to the Jira comment, please pardon the redundancy)

- J.J.
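For anyone wanting to try it, a field type wired up to this pattern might look roughly like the sketch below (the fieldType name is made up, and the pattern has to be XML-escaped inside the attribute):

```xml
<!-- Hypothetical schema.xml fragment: the tag-swallowing pattern as a
     delimiter. group="-1" (the default) tells PatternTokenizerFactory
     to split on matches rather than emit them as tokens. Note that
     < and " inside the pattern must be escaped as &lt; and &quot;. -->
<fieldType name="text_tagstrip" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" group="-1"
      pattern="(?:\s*&lt;/?\w+((\s+\w+(\s*=\s*(?:&quot;.*?&quot;|'.*?'|[^'&quot;>\s]+))?)+\s*|\s*)/?>\s*)|\s"/>
  </analyzer>
</fieldType>
```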
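To make the behavior concrete, here is a small standalone sketch (in Python rather than Java, just for brevity; the `tokenize` helper is mine, not part of Solr) that splits on the same pattern the way the tokenizer does in split mode, keeping each token's start offset into the original markup-included text:

```python
import re

# The delimiter pattern from the SOLR-42 workaround: runs of HTML/XML-like
# tags (with optional attributes), plus any surrounding whitespace, act as
# token separators, as do single whitespace characters.
DELIM = re.compile(
    r"""(?:\s*</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>\s*)|\s"""
)

def tokenize(text):
    """Split `text` on DELIM matches, returning (token, start_offset) pairs.
    Offsets point into the ORIGINAL text, markup included, which is the
    point of the workaround."""
    tokens, pos = [], 0
    for m in DELIM.finditer(text):
        if m.start() > pos:          # skip empty gaps between delimiters
            tokens.append((text[pos:m.start()], pos))
        pos = m.end()
    if pos < len(text):              # trailing token, if any
        tokens.append((text[pos:], pos))
    return tokens

print(tokenize('Hello <b class="x">world</b> 1 < 2'))
# [('Hello', 0), ('world', 19), ('1', 29), ('<', 31), ('2', 33)]
```

Note that "world" is reported at offset 19, i.e. its true position after the `<b class="x">` tag, and the bare `<` in `1 < 2` survives as a token.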