Maybe an obvious question Tony, but have you tried stepping through this and debugging your code? Another thread appeared today which is basically the same problem as yours. I am struggling to see this, as the ParseFilter plugin implementations shipped with 2.x do not appear to have this behaviour. The most likely reason that no one has got back to you is that they are not getting the results you are getting, or are simply too busy. Regarding the understanding of the interface, it doesn't get any more precise than the Javadoc (http://nutch.apache.org/apidocs-2.2/index.html?org/apache/nutch/parse/ParseFilter.html): "Extension point for DOM-based parsers. Permits one to add additional metadata to parses provided by the html or tika plugins. All plugins found which implement this extension point are run sequentially on the parse."
I am keen to find out what your Utf8 text = page.getText(); object returns?

On Thursday, June 20, 2013, Tony Mullins <[email protected]> wrote:
> Hi,
>
> Did anyone get a chance to look at the pointed-out issue?
>
> Just would like to know whether this is a bug in new Nutch 2.x, or whether my
> understanding of how ParseFilter works (that it will be run after each URL
> parse job in seed.txt and will give the user the raw HTML of that *URL ONLY*)
> is wrong.
>
> Thanks,
> Tony.
>
>
> On Wed, Jun 19, 2013 at 10:23 PM, Tony Mullins <[email protected]> wrote:
>
>> Hi,
>>
>> *This is my seed.txt*
>>
>> http://www.google.nl
>> http://www.bing.com
>>
>> *This is my ParseFilter*
>>
>> public class HtmlElementSelectorFilter implements ParseFilter {
>>
>>     public static final Logger log =
>>             LoggerFactory.getLogger("HtmlElementSelectorFilter");
>>     private Configuration conf = null;
>>
>>     public HtmlElementSelectorFilter() {}
>>
>>     @Override
>>     public void setConf(Configuration conf) {
>>         this.conf = conf;
>>     }
>>
>>     @Override
>>     public Configuration getConf() {
>>         return conf;
>>     }
>>
>>     @Override
>>     public Collection<WebPage.Field> getFields() {
>>         return new HashSet<WebPage.Field>();
>>     }
>>
>>     @Override
>>     public Parse filter(String s, WebPage page, Parse parse,
>>             HTMLMetaTags htmlMetaTags, DocumentFragment documentFragment) {
>>
>>         StringBuffer sb = new StringBuffer();
>>
>>         sb.append("baseUrl:\t" + page.getBaseUrl()).append("\n");
>>         sb.append("status:\t").append(page.getStatus()).append(" (")
>>                 .append(CrawlStatus.getName((byte) page.getStatus()))
>>                 .append(")\n");
>>         sb.append("fetchTime:\t" + page.getFetchTime()).append("\n");
>>         sb.append("prevFetchTime:\t" + page.getPrevFetchTime()).append("\n");
>>         sb.append("fetchInterval:\t" + page.getFetchInterval()).append("\n");
>>         sb.append("retriesSinceFetch:\t"
>>                 + page.getRetriesSinceFetch()).append("\n");
>>         sb.append("modifiedTime:\t" + page.getModifiedTime()).append("\n");
>>         sb.append("prevModifiedTime:\t"
>>                 + page.getPrevModifiedTime()).append("\n");
>>         sb.append("protocolStatus:\t"
>>                 + ProtocolStatusUtils.toString(page.getProtocolStatus()))
>>                 .append("\n");
>>
>>         ByteBuffer content = page.getContent();
>>         if (content != null) {
>>             sb.append("contentType:\t" + page.getContentType()).append("\n");
>>             sb.append("content:start:\n");
>>             sb.append(Bytes.toString(content.array()));
>>             sb.append("\ncontent:end:\n");
>>         }
>>
>>         Utf8 text = page.getText();
>>         if (text != null) {
>>             sb.append("text:start:\n");
>>             sb.append(text.toString());
>>             sb.append("\ntext:end:\n");
>>         }
>>
>>         log.info("My Log is " + sb.toString());
>>         return parse;
>>     }
>> }
>>
>> *And this is my log file, and as you can see, for each URL in
>> seed.txt it is returning the HTML of both pages (Bing & Google)*
>>
>> https://docs.google.com/file/d/0B9DKVnl1zAbSb0wtN2JS

--
*Lewis*
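One thing worth checking in the quoted filter (this is a guess on my part, not a confirmed diagnosis): `Bytes.toString(content.array())` decodes the ByteBuffer's *entire backing array*, not just the valid region between position and limit. If the buffer you get from page.getContent() is backed by a larger or reused array, you would log leftover bytes from a previously processed page, which could look exactly like "both pages' HTML" in a single log entry. A minimal, Nutch-free sketch of the difference (the class and method names here are mine, purely for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ContentDecode {

    // Decode only the buffer's valid region (position..limit),
    // rather than the whole backing array.
    static String decode(ByteBuffer content) {
        byte[] valid = new byte[content.remaining()];
        // duplicate() so the caller's position is left untouched
        content.duplicate().get(valid);
        return new String(valid, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate a buffer whose backing array is larger than its content,
        // as can happen when arrays are reused between records.
        byte[] backing = new byte[16];
        byte[] html = "<html/>".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(html, 0, backing, 0, html.length);
        ByteBuffer buf = ByteBuffer.wrap(backing, 0, html.length);

        // Decoding the valid region yields just this page's content:
        System.out.println(decode(buf)); // prints "<html/>"

        // Decoding the whole backing array picks up the trailing
        // stale/zero bytes as well (16 chars instead of 7):
        System.out.println(new String(buf.array(), StandardCharsets.UTF_8).length()); // 16
    }
}
```

If the whole-array decode is indeed the culprit, switching the filter to read only `content.remaining()` bytes from the buffer would tell you quickly.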

