Hopefully a quick question for someone.

I have a plugin that implements HtmlParseFilter.  I want to remove certain
text from the page before indexing. I use a BufferedReader to read
"content", find the text, and run a replaceAll.  Everything looks good.
Now I just need to return this updated content and continue on my way.  Of
course, HtmlParseFilter returns a ParseResult, and not "content". I run
this to try to get the parseResult that I return to contain my results:

      Parse thisParse = parseResult.get(content.getUrl());
      parseResult.put(content.getUrl(), new ParseText(result),
          thisParse.getData());

      return parseResult;

("result" is the product of my "replaceAll" on the full text of
"content".)

And, as someone more familiar with the code can probably tell from that,
I've now replaced the parse text with the full HTML of the page (minus,
nicely enough, the tagged <noindex> .* </noindex> section I'm trying to
delete), i.e., where there used to be parsed text, I now have raw HTML in
there.  I think I'm mixing up my "content" and "parseResult" fields.

Is there another way to do this that I'm neglecting?  Previously, I
modified "parse-html" and simply added this to getTextHelper:

      if ("noindex".equalsIgnoreCase(nodeName)) {
        walker.skipChildren();
      }
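
A minimal sketch of the same skip done outside parse-html, using only the
standard org.w3c.dom API. The names NoIndexText, collectText, extractText
and parseXml are hypothetical, made up for this illustration, and the
sketch assumes the DOM handed to the filter still contains the noindex
element; it has not been checked against Nutch itself.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects text from a DOM tree
// while skipping every <noindex> subtree.
final class NoIndexText {

    /** Appends the text found under node to sb, ignoring noindex subtrees. */
    static void collectText(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "noindex".equalsIgnoreCase(node.getNodeName())) {
            return; // do not descend: same intent as walker.skipChildren()
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' '); // single space between text fragments
                }
                sb.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collectText(child, sb);
        }
    }

    /** Returns the plain text under root with noindex sections left out. */
    static String extractText(Node root) {
        StringBuilder sb = new StringBuilder();
        collectText(root, sb);
        return sb.toString();
    }

    /** Demo-only helper: parses a well-formed XML string into a DOM. */
    static Document parseXml(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseXml("<body><p>keep this</p>"
                + "<noindex><p>drop this</p></noindex>"
                + "<p>and this</p></body>");
        System.out.println(extractText(doc)); // prints "keep this and this"
    }
}
```

Inside a filter, the string from extractText would presumably take the
place of "result" in the parseResult.put(...) call shown earlier, so that
parsed text rather than raw HTML ends up in the ParseText.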

I want to do this the *right* way and do this in my own plugin.  Thanks in
advance for any help.
