Hi,

Did anyone get a chance to look at the issue pointed out below?
I would just like to know whether this is a bug in the new Nutch 2.x, or whether my understanding of how ParseFilter works (that it is run after each URL in seed.txt is parsed, and gives the user the raw HTML of that *URL ONLY*) is wrong.

Thanks,
Tony.

On Wed, Jun 19, 2013 at 10:23 PM, Tony Mullins <[email protected]> wrote:

> Hi,
>
> This is my seed.txt:
>
> http://www.google.nl
> http://www.bing.com
>
> This is my ParseFilter:
>
> public class HtmlElementSelectorFilter implements ParseFilter {
>
>     public static final Logger log =
>             LoggerFactory.getLogger("HtmlElementSelectorFilter");
>     private Configuration conf = null;
>
>     public HtmlElementSelectorFilter() {}
>
>     @Override
>     public void setConf(Configuration conf) {
>         this.conf = conf;
>     }
>
>     @Override
>     public Configuration getConf() {
>         return conf;
>     }
>
>     @Override
>     public Collection<WebPage.Field> getFields() {
>         return new HashSet<WebPage.Field>();
>     }
>
>     @Override
>     public Parse filter(String s, WebPage page, Parse parse,
>             HTMLMetaTags htmlMetaTags, DocumentFragment documentFragment) {
>
>         StringBuffer sb = new StringBuffer();
>
>         sb.append("baseUrl:\t").append(page.getBaseUrl()).append("\n");
>         sb.append("status:\t").append(page.getStatus()).append(" (")
>                 .append(CrawlStatus.getName((byte) page.getStatus()))
>                 .append(")\n");
>         sb.append("fetchTime:\t").append(page.getFetchTime()).append("\n");
>         sb.append("prevFetchTime:\t").append(page.getPrevFetchTime()).append("\n");
>         sb.append("fetchInterval:\t").append(page.getFetchInterval()).append("\n");
>         sb.append("retriesSinceFetch:\t").append(page.getRetriesSinceFetch()).append("\n");
>         sb.append("modifiedTime:\t").append(page.getModifiedTime()).append("\n");
>         sb.append("prevModifiedTime:\t").append(page.getPrevModifiedTime()).append("\n");
>         sb.append("protocolStatus:\t")
>                 .append(ProtocolStatusUtils.toString(page.getProtocolStatus()))
>                 .append("\n");
>
>         ByteBuffer content = page.getContent();
>         if (content != null) {
>             sb.append("contentType:\t").append(page.getContentType()).append("\n");
>             sb.append("content:start:\n");
>             sb.append(Bytes.toString(content.array()));
>             sb.append("\ncontent:end:\n");
>         }
>
>         Utf8 text = page.getText();
>         if (text != null) {
>             sb.append("text:start:\n");
>             sb.append(text.toString());
>             sb.append("\ntext:end:\n");
>         }
>
>         log.info("My Log is " + sb.toString());
>         return parse;
>     }
> }
>
> And this is my log file; as you can see, for each URL in seed.txt it is
> returning the HTML of both pages (Bing & Google):
>
> https://docs.google.com/file/d/0B9DKVnl1zAbSb0wtN2JSVE4zWjg/edit?usp=sharing
>
> Could anyone please help me here? I really need to understand what I am
> doing wrong and why I am not getting the HTML of the page which is
> currently being processed by the ParseFilter (i.e. the page shown by
> page.getBaseUrl()).
>
> Thanks.
> Tony.
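One thing that may be worth ruling out, independent of Nutch itself: the filter above reads the content with `Bytes.toString(content.array())`, and `ByteBuffer.array()` returns the *entire backing array*, not just the slice between the buffer's position and limit. If the framework reuses one large backing array for several pages, dumping `array()` can show other pages' bytes even though the buffer itself is scoped to one page. Below is a minimal, Nutch-free sketch of that effect; the class name, helper method, and sample bytes are purely illustrative, not taken from the crawl or from the Nutch API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferSliceDemo {

    // Extract only the valid region of a ByteBuffer, honoring its
    // position and remaining() -- unlike new String(buf.array()),
    // which dumps the whole backing array.
    static String toStringSafe(ByteBuffer buf) {
        byte[] bytes = new byte[buf.remaining()];
        buf.duplicate().get(bytes);  // duplicate() so the caller's position is untouched
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate a reused backing array holding two pages' bytes back to back:
        byte[] backing = "GOOGLE-HTML...BING-HTML...".getBytes(StandardCharsets.UTF_8);

        // A buffer that is supposed to expose only the first page (14 bytes):
        ByteBuffer page = ByteBuffer.wrap(backing, 0, 14);

        // array() leaks everything in the backing array, both "pages":
        System.out.println(new String(page.array(), StandardCharsets.UTF_8));

        // Reading via position/remaining yields only the intended slice:
        System.out.println(toStringSafe(page));
    }
}
```

If the symptom in the log disappears when the content is read this way, the problem is the `array()` call rather than ParseFilter's per-URL behavior.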

