Hopefully a quick question for someone.
I have a plugin that implements HtmlParseFilter. I want to remove certain
text from the page before indexing. I use a BufferedReader to read
"content", find the text, and run a replaceAll. Everything looks good.
Now I just need to return this updated content and continue on my way. Of
course, HtmlParseFilter returns "parseResult", and not "content". I run
this to try to get the parseResult that I return to contain my results:
Parse thisParse = parseResult.get(content.getUrl());
parseResult.put(content.getUrl(), new ParseText(result), thisParse.getData());
return parseResult;
("result" is the product of my "replaceAll" on the full-text of
"content").
And, as someone more familiar with the code can probably tell from
that, I've now replaced the parsed text with the full-text HTML of the page
(minus, nicely enough, the tagged <noindex> .* </noindex> section I'm
trying to delete), i.e., where it used to be parsed text, now I have raw
HTML in there. I think I'm mixing up my "content" and "parseResult" fields.
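For reference, the replaceAll step mentioned above amounts to roughly the
following. This is a minimal sketch, not the plugin code itself: the class
name NoIndexRegex, the method name strip, and the exact pattern are
illustrative assumptions. It works on the raw HTML string, which is why its
output is still markup rather than parsed text.

```java
import java.util.regex.Pattern;

// Hypothetical helper: names and pattern are assumptions for illustration.
final class NoIndexRegex {
    // DOTALL lets '.' span line breaks; the reluctant '.*?' stops at the
    // first closing tag instead of swallowing everything up to the last one.
    private static final Pattern NOINDEX = Pattern.compile(
            "<noindex\\b[^>]*>.*?</noindex\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the input with every <noindex>...</noindex> section removed.
    static String strip(String html) {
        return NOINDEX.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>keep</p><noindex>drop</noindex><p>also keep</p>";
        System.out.println(strip(html)); // prints <p>keep</p><p>also keep</p>
    }
}
```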
Is there another way to do this that I'm neglecting? Previously, I
modified "parse-html" and simply added this to getTextHelper:
if ("noindex".equalsIgnoreCase(nodeName)) {
walker.skipChildren();
}
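To make that traversal concrete outside of parse-html, here is a minimal
sketch using only the standard org.w3c.dom API. The class name NoIndexText
and the methods collectText and textOf are hypothetical, and the demo parses
well-formed markup with the JDK's XML parser rather than a real HTML parser.
Whether the same walk can simply be pointed at the DOM that the filter
receives is an assumption I have not verified.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

// Hypothetical helper: collects text from a DOM while skipping <noindex>.
final class NoIndexText {
    // Appends the text under 'node' to 'out', ignoring any <noindex> subtree.
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "noindex".equalsIgnoreCase(node.getNodeName())) {
            return; // do not descend: same effect as skipping the children
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, out);
        }
    }

    // Demo entry point: parses well-formed markup and returns its text.
    static String textOf(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            StringBuilder out = new StringBuilder();
            collectText(root, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf(
                "<body><p>keep </p><noindex><p>drop</p></noindex>"
                + "<p>also keep</p></body>")); // prints keep also keep
    }
}
```

Unlike a replaceAll on the raw page, this yields plain text with the markup
already gone, which is the kind of string a ParseText is meant to hold.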
I want to do this the *right* way and do this in my own plugin. Thanks in
advance for any help.