Hi,

*This is my seed.txt*
http://www.google.nl
http://www.bing.com

*This is my ParseFilter*

    public class HtmlElementSelectorFilter implements ParseFilter {

        public static final Logger log = LoggerFactory.getLogger("HtmlElementSelectorFilter");

        private Configuration conf = null;

        public HtmlElementSelectorFilter() {}

        @Override
        public void setConf(Configuration conf) {
            this.conf = conf;
        }

        @Override
        public Configuration getConf() {
            return conf;
        }

        @Override
        public Collection<WebPage.Field> getFields() {
            return new HashSet<WebPage.Field>();
        }

        @Override
        public Parse filter(String s, WebPage page, Parse parse,
                            HTMLMetaTags htmlMetaTags, DocumentFragment documentFragment) {
            StringBuffer sb = new StringBuffer();
            sb.append("baseUrl:\t" + page.getBaseUrl()).append("\n");
            sb.append("status:\t").append(page.getStatus()).append(" (")
              .append(CrawlStatus.getName((byte) page.getStatus())).append(")\n");
            sb.append("fetchTime:\t" + page.getFetchTime()).append("\n");
            sb.append("prevFetchTime:\t" + page.getPrevFetchTime()).append("\n");
            sb.append("fetchInterval:\t" + page.getFetchInterval()).append("\n");
            sb.append("retriesSinceFetch:\t" + page.getRetriesSinceFetch()).append("\n");
            sb.append("modifiedTime:\t" + page.getModifiedTime()).append("\n");
            sb.append("prevModifiedTime:\t" + page.getPrevModifiedTime()).append("\n");
            sb.append("protocolStatus:\t"
                + ProtocolStatusUtils.toString(page.getProtocolStatus())).append("\n");

            ByteBuffer content = page.getContent();
            if (content != null) {
                sb.append("contentType:\t" + page.getContentType()).append("\n");
                sb.append("content:start:\n");
                sb.append(Bytes.toString(content.array()));
                sb.append("\ncontent:end:\n");
            }

            Utf8 text = page.getText();
            if (text != null) {
                sb.append("text:start:\n");
                sb.append(text.toString());
                sb.append("\ntext:end:\n");
            }

            log.info("My Log is " + sb.toString());
            return parse;
        }
    }

*And this is my log file. As you can see, for each URL in seed.txt it returns the HTML of both pages (Bing and Google):*
https://docs.google.com/file/d/0B9DKVnl1zAbSb0wtN2JSVE4zWjg/edit?usp=sharing

Could anyone please help me here? I really need to understand what I am doing wrong, and why I am not getting the HTML of the page that is currently being processed by the ParseFilter (i.e. the page shown by page.getBaseUrl()).

Thanks,
Tony
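
One thing that may be worth ruling out in the filter above: `ByteBuffer.array()` returns the buffer's entire backing array and ignores its position and limit. If the buffer returned by `page.getContent()` is only a window into a larger, reused array (an assumption here, not something confirmed for this setup), then decoding `content.array()` would also print bytes that belong to another page. The following is a minimal standalone sketch of that effect using only `java.nio`; the class name, method names, and sample data are hypothetical and not part of Nutch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BufferWindowDemo {

    /** Decodes the whole backing array, ignoring the buffer's position and limit. */
    static String wholeArrayToString(ByteBuffer buf) {
        return new String(buf.array(), StandardCharsets.UTF_8);
    }

    /** Decodes only the bytes between the buffer's position and limit. */
    static String windowToString(ByteBuffer buf) {
        return new String(buf.array(),
                buf.arrayOffset() + buf.position(),
                buf.remaining(),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical data: two "pages" stored back to back in one shared array.
        String first = "<html>google</html>";
        String second = "<html>bing</html>";
        byte[] shared = (first + second).getBytes(StandardCharsets.UTF_8);

        // A buffer that exposes only the second page as its readable window.
        ByteBuffer secondPage = ByteBuffer.wrap(shared, first.length(), second.length());

        System.out.println("array():  " + wholeArrayToString(secondPage));
        System.out.println("window:   " + windowToString(secondPage));
    }
}
```

If this turns out to be the cause, restricting the decode to the buffer's own window (as `windowToString` does) in place of `Bytes.toString(content.array())` would be the change to try.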

