And the rest of the WebPage fields too, actually. Are you getting multiple values for each field, or is it only for content?
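
One thing that might be worth ruling out, and this is only a guess rather than a confirmed diagnosis: the filter quoted below logs the content with `content.array()`. On a `java.nio.ByteBuffer`, `array()` returns the entire backing array and ignores the buffer's position and limit, so if the buffer handed to the filter happens to share a backing array with other data, the log would show more than this page's bytes. The sketch below uses only the JDK; the class name, the helper `remainingToString`, and the sample HTML strings are made up for illustration and are not part of Nutch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferContentDemo {

    // Decode only the bytes between position and limit.
    // duplicate() gives an independent position/limit, so the caller's buffer is not advanced.
    static String remainingToString(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate();
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for one backing array that holds two pages.
        String first = "<html>google</html>";
        String second = "<html>bing</html>";
        byte[] shared = (first + second).getBytes(StandardCharsets.UTF_8);

        // Two buffers over the same array, each limited to its own page.
        ByteBuffer firstBuf = ByteBuffer.wrap(shared, 0, first.length());
        ByteBuffer secondBuf = ByteBuffer.wrap(shared, first.length(), second.length());

        // array() ignores position and limit: both pages are decoded.
        System.out.println(new String(secondBuf.array(), StandardCharsets.UTF_8));

        // Reading only the remaining bytes yields one page per buffer.
        System.out.println(remainingToString(firstBuf));
        System.out.println(remainingToString(secondBuf));
    }
}
```

If the logged content shrinks to a single page when it is read this way inside the filter, the extra HTML was coming from the backing array rather than from the ParseFilter being invoked with the wrong page.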

On Thursday, June 20, 2013, Tony Mullins <[email protected]> wrote:
> Hi,
>
> Did anyone get a chance to look at the issue pointed out below?
>
> I would just like to know whether this is a bug in the new Nutch 2.x, or
> whether my understanding of how ParseFilter works (that it is run after
> each URL parse job in seed.txt and gives the user the raw HTML of that
> *URL ONLY*) is wrong.
>
> Thanks,
> Tony.
>
>
> On Wed, Jun 19, 2013 at 10:23 PM, Tony Mullins <[email protected]> wrote:
>
>> Hi,
>>
>> This is my seed.txt:
>>
>> http://www.google.nl
>> http://www.bing.com
>>
>> This is my ParseFilter:
>>
>> public class HtmlElementSelectorFilter implements ParseFilter {
>>
>>     public static final Logger log =
>>         LoggerFactory.getLogger("HtmlElementSelectorFilter");
>>     private Configuration conf = null;
>>
>>     public HtmlElementSelectorFilter() {}
>>
>>     @Override
>>     public void setConf(Configuration conf) {
>>         this.conf = conf;
>>     }
>>
>>     @Override
>>     public Configuration getConf() {
>>         return conf;
>>     }
>>
>>     @Override
>>     public Collection<WebPage.Field> getFields() {
>>         return new HashSet<WebPage.Field>();
>>     }
>>
>>     @Override
>>     public Parse filter(String s, WebPage page, Parse parse,
>>             HTMLMetaTags htmlMetaTags, DocumentFragment documentFragment) {
>>
>>         StringBuffer sb = new StringBuffer();
>>
>>         sb.append("baseUrl:\t" + page.getBaseUrl()).append("\n");
>>         sb.append("status:\t").append(page.getStatus()).append(" (").append(
>>             CrawlStatus.getName((byte) page.getStatus())).append(")\n");
>>         sb.append("fetchTime:\t" + page.getFetchTime()).append("\n");
>>         sb.append("prevFetchTime:\t" + page.getPrevFetchTime()).append("\n");
>>         sb.append("fetchInterval:\t" + page.getFetchInterval()).append("\n");
>>         sb.append("retriesSinceFetch:\t" +
>>             page.getRetriesSinceFetch()).append("\n");
>>         sb.append("modifiedTime:\t" + page.getModifiedTime()).append("\n");
>>         sb.append("prevModifiedTime:\t" +
>>             page.getPrevModifiedTime()).append("\n");
>>         sb.append("protocolStatus:\t" +
>>             ProtocolStatusUtils.toString(page.getProtocolStatus())).append("\n");
>>
>>         ByteBuffer content = page.getContent();
>>         if (content != null) {
>>             sb.append("contentType:\t" + page.getContentType()).append("\n");
>>             sb.append("content:start:\n");
>>             sb.append(Bytes.toString(content.array()));
>>             sb.append("\ncontent:end:\n");
>>         }
>>         Utf8 text = page.getText();
>>         if (text != null) {
>>             sb.append("text:start:\n");
>>             sb.append(text.toString());
>>             sb.append("\ntext:end:\n");
>>         }
>>
>>         log.info("My Log is " + sb.toString());
>>         return parse;
>>     }
>> }
>>
>> And this is my log file. As you can see, for each URL in seed.txt it is
>> returning the HTML of both pages (bing & google):
>>
>> https://docs.google.com/file/d/0B9DKVnl1zAbSb0wtN2JS

--
*Lewis*

