Lewis, I have debugged my ParseFilter code many times, and in the debugger I get the same results that I see in my log file.
I am getting null for page.getText() and page.getTitle(), and page.getContent().array() contains the HTML of all the URLs present in seed.txt: if there is one seed, it has the HTML of one page; if there are two seeds, the HTML of those two pages. I have now tried this code on a fresh CentOS 6.4 VM and I get the same result. I really don't know what else to do here! Could you please try any simple ParseFilter with the latest Nutch 2.x?

Thanks,
Tony

On Fri, Jun 21, 2013 at 12:36 AM, Lewis John Mcgibbney <[email protected]> wrote:

> And the rest of the webpage fields actually.
> Are you getting multiple values for each field, or is it just for content?
>
> On Thursday, June 20, 2013, Tony Mullins <[email protected]> wrote:
> > Hi,
> >
> > Did anyone get a chance to look at the issue pointed out below?
> >
> > I would just like to know whether this is a bug in the new Nutch 2.x, or
> > whether my understanding of how ParseFilter works (that it is run after
> > the parse job for each URL in seed.txt and gives the user the raw HTML of
> > that *URL ONLY*) is wrong.
> >
> > Thanks,
> > Tony.
> > On Wed, Jun 19, 2013 at 10:23 PM, Tony Mullins <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> This is my seed.txt:
> >>
> >> http://www.google.nl
> >> http://www.bing.com
> >>
> >> This is my ParseFilter:
> >>
> >> public class HtmlElementSelectorFilter implements ParseFilter {
> >>
> >>     public static final Logger log =
> >>             LoggerFactory.getLogger("HtmlElementSelectorFilter");
> >>     private Configuration conf = null;
> >>
> >>     public HtmlElementSelectorFilter() {}
> >>
> >>     @Override
> >>     public void setConf(Configuration conf) {
> >>         this.conf = conf;
> >>     }
> >>
> >>     @Override
> >>     public Configuration getConf() {
> >>         return conf;
> >>     }
> >>
> >>     @Override
> >>     public Collection<WebPage.Field> getFields() {
> >>         return new HashSet<WebPage.Field>();
> >>     }
> >>
> >>     @Override
> >>     public Parse filter(String s, WebPage page, Parse parse,
> >>             HTMLMetaTags htmlMetaTags, DocumentFragment documentFragment) {
> >>
> >>         StringBuilder sb = new StringBuilder();
> >>
> >>         sb.append("baseUrl:\t").append(page.getBaseUrl()).append("\n");
> >>         sb.append("status:\t").append(page.getStatus()).append(" (")
> >>                 .append(CrawlStatus.getName((byte) page.getStatus()))
> >>                 .append(")\n");
> >>         sb.append("fetchTime:\t").append(page.getFetchTime()).append("\n");
> >>         sb.append("prevFetchTime:\t").append(page.getPrevFetchTime()).append("\n");
> >>         sb.append("fetchInterval:\t").append(page.getFetchInterval()).append("\n");
> >>         sb.append("retriesSinceFetch:\t").append(page.getRetriesSinceFetch()).append("\n");
> >>         sb.append("modifiedTime:\t").append(page.getModifiedTime()).append("\n");
> >>         sb.append("prevModifiedTime:\t").append(page.getPrevModifiedTime()).append("\n");
> >>         sb.append("protocolStatus:\t")
> >>                 .append(ProtocolStatusUtils.toString(page.getProtocolStatus()))
> >>                 .append("\n");
> >>
> >>         ByteBuffer content = page.getContent();
> >>         if (content != null) {
> >>             sb.append("contentType:\t").append(page.getContentType()).append("\n");
> >>             sb.append("content:start:\n");
> >>             sb.append(Bytes.toString(content.array()));
> >>             sb.append("\ncontent:end:\n");
> >>         }
> >>
> >>         Utf8 text = page.getText();
> >>         if (text != null) {
> >>             sb.append("text:start:\n");
> >>             sb.append(text.toString());
> >>             sb.append("\ntext:end:\n");
> >>         }
> >>
> >>         log.info("My Log is " + sb.toString());
> >>         return parse;
> >>     }
> >> }
> >>
> >> And this is my log file; as you can see, for each URL in seed.txt it is
> >> returning the HTML of both pages (bing and google):
> >>
> >> https://docs.google.com/file/d/0B9DKVnl1zAbSb0wtN2JS
>
> --
> *Lewis*
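One thing worth checking here (an assumption on my part, not something confirmed in the thread): `ByteBuffer.array()` returns the *entire* backing array and ignores `arrayOffset()`, `position()` and `limit()`. If the buffers that Gora hands back are views over a shared or reused backing array, then `Bytes.toString(content.array())` would print other pages' HTML as well, even though each buffer's view is correct. A minimal JDK-only sketch of the pitfall and a safer read (the class and helper names are hypothetical, for illustration only):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferSlice {

    // Read only THIS buffer's view: honour arrayOffset(), position()
    // and limit() instead of dumping the whole backing array.
    static String viewAsString(ByteBuffer buf) {
        return new String(buf.array(),
                          buf.arrayOffset() + buf.position(),
                          buf.remaining(),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate two pages' content sharing one backing array.
        byte[] backing = "<html>google</html><html>bing</html>"
                .getBytes(StandardCharsets.UTF_8);

        // A buffer that views only the first page's 19 bytes:
        ByteBuffer page1 = ByteBuffer.wrap(backing, 0, 19);

        // array() exposes everything in the backing array:
        String wrong = new String(page1.array(), StandardCharsets.UTF_8);
        // viewAsString() exposes only the bytes this buffer covers:
        String right = viewAsString(page1);

        System.out.println("array():   " + wrong); // both pages' HTML
        System.out.println("view only: " + right); // "<html>google</html>"
    }
}
```

If this is what is happening, replacing `Bytes.toString(content.array())` in the filter with a bounds-respecting read like the one above should show each page's own HTML only.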

