Lance Norskog-2 wrote:
>
> The PatternReplace and HTMLStrip tokenizers might be the right bet.
> The easiest way to go about this is to make a bunch of text fields
> with different analysis stacks and investigate them in the Schema
> Browser. You can paste an HTML document into the text box and see
> exactly how the words & markup get torn apart.
>
Thanks Lance, I'll experiment.

For reference, for anyone else who comes across this thread: the HTML in my original post may have been munged on the way into or out of the list server. It was supposed to look like this:

This is the entire content of my field, but [a href="http://example.com/"]some of the words[/a] are a hyperlink.

(but with real HTML tags instead of the square brackets), and I am just trying to extract the words and the link target while discarding the rest of the markup.

Cheers,
Andrew.
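
For anyone landing on this thread who would rather do the extraction before the field reaches Solr, here is a minimal sketch in Python. It is not the analysis-chain approach Lance suggested; it assumes the fragment is pre-processed on the client side with the standard-library `html.parser`, and the names `LinkTextExtractor` and `extract_text_and_links` are made up for illustration. The sample fragment is the one from the post, written with real tags.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Collects visible text and <a href> targets; all other markup is dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []    # raw text chunks, in document order
        self.link_targets = []  # href values of <a> tags, in document order

    def handle_starttag(self, tag, attrs):
        # Only anchor tags are of interest; keep a non-empty href if present.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.link_targets.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def extract_text_and_links(fragment):
    """Return (plain_text, [link targets]) for an HTML fragment."""
    parser = LinkTextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any buffered trailing text
    # Collapse runs of whitespace left behind by removed markup.
    text = " ".join("".join(parser.text_parts).split())
    return text, parser.link_targets


fragment = (
    "This is the entire content of my field, but "
    '<a href="http://example.com/">some of the words</a> are a hyperlink.'
)
text, links = extract_text_and_links(fragment)
print(text)
print(links)
```

The two results could then be sent to separate fields (say, one text field and one multi-valued string field for the targets), though how that maps onto a given schema is left open here.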