Tony,

The plugins directory contains quite a few examples of parse filters, e.g.
http://svn.apache.org/viewvc/nutch/branches/2.1/src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java?view=markup

I don't use 2.x and don't know how many people use Cassandra as a backend in Gora, but it may be worth trying your code with HBase+Gora to check whether the problem is related to the backend.

Julien

On 21 June 2013 07:15, Tony Mullins <[email protected]> wrote:

> Lewis,
>
> I have debugged my ParseFilter code many times, and in the debugger I get
> the same results as in my log file.
>
> I am getting null for page.getText() and page.getTitle().
> And page.getContent().array() contains the HTML of all URLs present in
> seed.txt. If there is one seed, it has the HTML of one page; if there are
> 2 seeds, it has the HTML of both pages.
>
> I have now tried this code on a new CentOS 6.4 VM and I am getting the
> same result.
>
> I really don't know what else to do here!
>
> Could you please try any simple ParseFilter with the latest Nutch 2.x?
>
> Thanks,
> Tony
>
>
> On Fri, Jun 21, 2013 at 12:36 AM, Lewis John Mcgibbney <[email protected]> wrote:
>
> > And the rest of the webpage fields actually.
> > Are you getting multiple values for each field or is it just for content?
> >
> > On Thursday, June 20, 2013, Tony Mullins <[email protected]> wrote:
> > > Hi,
> > >
> > > Did anyone get a chance to look at the issue I pointed out?
> > >
> > > I would just like to know whether this is a bug in the new Nutch 2.x,
> > > or whether my understanding of how ParseFilter works (that it will be
> > > run after each URL parse job in seed.txt and will give the user the
> > > raw HTML of that *URL ONLY*) is wrong.
> > >
> > > Thanks,
> > > Tony.
> > >
> > > On Wed, Jun 19, 2013 at 10:23 PM, Tony Mullins <[email protected]> wrote:
> > >
> > >> Hi,
> > >>
> > >> *This is my seed.txt*
> > >>
> > >> http://www.google.nl
> > >> http://www.bing.com
> > >>
> > >> *This is my ParseFilter*
> > >>
> > >> public class HtmlElementSelectorFilter implements ParseFilter {
> > >>
> > >>     public static final Logger log =
> > >>             LoggerFactory.getLogger("HtmlElementSelectorFilter");
> > >>     private Configuration conf = null;
> > >>
> > >>     public HtmlElementSelectorFilter() {}
> > >>
> > >>     @Override
> > >>     public void setConf(Configuration conf) {
> > >>         this.conf = conf;
> > >>     }
> > >>
> > >>     @Override
> > >>     public Configuration getConf() {
> > >>         return conf;
> > >>     }
> > >>
> > >>     @Override
> > >>     public Collection<WebPage.Field> getFields() {
> > >>         return new HashSet<WebPage.Field>();
> > >>     }
> > >>
> > >>     @Override
> > >>     public Parse filter(String s, WebPage page, Parse parse,
> > >>             HTMLMetaTags htmlMetaTags, DocumentFragment documentFragment) {
> > >>
> > >>         StringBuffer sb = new StringBuffer();
> > >>
> > >>         sb.append("baseUrl:\t" + page.getBaseUrl()).append("\n");
> > >>         sb.append("status:\t").append(page.getStatus()).append(" (").append(
> > >>                 CrawlStatus.getName((byte) page.getStatus())).append(")\n");
> > >>         sb.append("fetchTime:\t" + page.getFetchTime()).append("\n");
> > >>         sb.append("prevFetchTime:\t" + page.getPrevFetchTime()).append("\n");
> > >>         sb.append("fetchInterval:\t" + page.getFetchInterval()).append("\n");
> > >>         sb.append("retriesSinceFetch:\t" + page.getRetriesSinceFetch()).append("\n");
> > >>         sb.append("modifiedTime:\t" + page.getModifiedTime()).append("\n");
> > >>         sb.append("prevModifiedTime:\t" + page.getPrevModifiedTime()).append("\n");
> > >>         sb.append("protocolStatus:\t" +
> > >>                 ProtocolStatusUtils.toString(page.getProtocolStatus())).append("\n");
> > >>
> > >>         ByteBuffer content = page.getContent();
> > >>         if (content != null ) {
> > >>             sb.append("contentType:\t" + page.getContentType()).append("\n");
> > >>             sb.append("content:start:\n");
> > >>             sb.append(Bytes.toString(content.array()));
> > >>             sb.append("\ncontent:end:\n");
> > >>         }
> > >>         Utf8 text = page.getText();
> > >>         if (text != null ) {
> > >>             sb.append("text:start:\n");
> > >>             sb.append(text.toString());
> > >>             sb.append("\ntext:end:\n");
> > >>         }
> > >>
> > >>         log.info("My Log is " + sb.toString());
> > >>         return parse;
> > >>     }
> > >> }
> > >>
> > >> *And this is my log file and as you can see that for each url in
> > >> seed.txt, it is returning the html of both pages ( bing & google )*
> > >>
> > >> https://docs.google.com/file/d/0B9DKVnl1zAbSb0wtN2JS
> >
> > --
> > *Lewis*
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
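
A note on the symptom reported in the quoted messages. One possible explanation, which nobody in this thread has confirmed, is that the backend hands out ByteBuffers that share a single backing array. `ByteBuffer.array()` returns the whole backing array and ignores the buffer's position and limit, so it would then show the HTML of every page stored in that array. The sketch below uses only `java.nio`; the class name, the helper `decodeWindow` and the sample data are hypothetical, and the assumption that Nutch/Gora lays out content this way is mine and not something taken from the thread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Nutch or Gora.
final class ByteBufferWindowDemo {

    /** Decodes only the bytes between position and limit; the caller's buffer is not modified. */
    public static String decodeWindow(ByteBuffer buffer) {
        // duplicate() shares the bytes but has its own position and limit.
        ByteBuffer view = buffer.duplicate();
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed layout: two pages packed into one shared array.
        String pageOne = "<html>google</html>";
        String pageTwo = "<html>bing</html>";
        byte[] shared = (pageOne + pageTwo).getBytes(StandardCharsets.UTF_8);

        // Each buffer is a window onto the same array (ASCII data, so chars == bytes).
        ByteBuffer first = ByteBuffer.wrap(shared, 0, pageOne.length());
        ByteBuffer second = ByteBuffer.wrap(shared, pageOne.length(), pageTwo.length());

        // array() ignores the window and exposes both pages.
        System.out.println(new String(first.array(), StandardCharsets.UTF_8));

        // Reading through position/limit yields one page per buffer.
        System.out.println(decodeWindow(first));
        System.out.println(decodeWindow(second));
    }
}
```

If this turns out to be the cause, replacing `Bytes.toString(content.array())` in the filter with a read bounded by `position()` and `remaining()` would be the thing to try. The null `page.getText()` and `page.getTitle()` values are a separate question that this sketch does not address.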

