Lance Norskog-2 wrote:
> 
> The PatternReplace and HTMLStrip tokenizers might be the right bet.
> The easiest way to go about this is to make a bunch of text fields
> with different analysis stacks and investigate them in the Schema
> Browser. You can paste an HTML document into the text box and see
> exactly how the words & markup get torn apart.
> 

Thanks Lance, I'll experiment.

For reference, for anyone else who comes across this thread -- the HTML in
my original post might have been munged on the way into or out of the list
server. It was supposed to look like this:

This is the entire content of my field, but [a
href="http://example.com/"]some of the words[/a] are a hyperlink.

(but with real HTML tags instead of the square brackets)

I am just trying to extract the words and the link target, and discard the
rest of the markup.
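
To make the goal concrete, here is a minimal sketch of that extraction done
outside Solr, using Python's standard html.parser module. It is only an
illustration of the desired result, not a Solr analysis chain; the names
LinkTextExtractor and extract_words_and_links are made up for this example.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Collects visible text and <a href> values; all other markup is dropped.

    Hypothetical helper for illustration only, not part of Solr.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []    # raw text chunks between tags
        self.link_targets = []  # href values of <a> tags, in document order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.link_targets.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def extract_words_and_links(fragment):
    """Return (plain_text, [link targets]) for an HTML fragment."""
    parser = LinkTextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any buffered trailing text
    # Collapse runs of whitespace so the text reads as one line.
    text = " ".join("".join(parser.text_parts).split())
    return text, parser.link_targets


if __name__ == "__main__":
    sample = ('This is the entire content of my field, but '
              '<a href="http://example.com/">some of the words</a> '
              'are a hyperlink.')
    words, links = extract_words_and_links(sample)
    print(words)
    print(links)
```

The two return values could then be sent to separate fields at index time
(for example one for the text and one for the link targets), assuming the
extraction is done on the client before the document reaches Solr.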

Cheers,

Andrew.

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-link-targets-in-HTML-fragments-tp874547p875503.html
Sent from the Solr - User mailing list archive at Nabble.com.
