Yes, using HBase has solved this issue for me. Now I am getting the correct HTML in the ParseFilter plugin.
Thanks, guys, for the help.

Tony.

On Fri, Jun 21, 2013 at 8:31 PM, Lewis John Mcgibbney <[email protected]> wrote:

> In short yes I think it is gora-cassandra that is the problem here.
> This is precisely the reason that I'm using the Cassandra backend, to try
> and root these bugs out.
>
> On Friday, June 21, 2013, Jamshaid Ashraf <[email protected]> wrote:
> > Hi,
> >
> > I'm also facing the same issue with the Cassandra backend.
> >
> > Do you think that Cassandra is the reason for returning repeated HTML in
> > the parse job for the ParseFilter plugin?
> >
> > Regards,
> > Jamshaid
> >
> >
> > On Fri, Jun 21, 2013 at 1:18 PM, Julien Nioche <[email protected]> wrote:
> >
> >> Tony,
> >>
> >> The plugins directory contains quite a few examples of parsefilters e.g.
> >>
> >> http://svn.apache.org/viewvc/nutch/branches/2.1/src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java?view=markup
> >>
> >> I don't use 2.x and don't know how many people use Cassandra as a backend
> >> in GORA but maybe it would be worth trying your code with HBase+GORA to
> >> check whether it could be related to the backend.
> >>
> >> Julien
> >>
> >>
> >>
> >> On 21 June 2013 07:15, Tony Mullins <[email protected]> wrote:
> >>
> >> > Lewis,
> >> > I have debugged my ParseFilter code many times, and in debug too I get
> >> > the same results that I get in my log file.
> >> >
> >> > I am getting null for page.getText() and page.getTitle().
> >> > And page.getContent().array() contains the HTML of all URLs present in
> >> > seed.txt. If there is one seed then it has the HTML of one page; if there
> >> > are 2 seeds then the HTML of these 2 pages.
> >> >
> >> > I have now tried this code on a new CentOS 6.4 VM and I am getting the
> >> > same result.
> >> >
> >> > I really don't know what else to do here!
> >> >
> >> > Could you please try any simple ParseFilter with the latest Nutch 2.x?
> >> >
> >> > Thanks,
> >> > Tony
> >> >
> >> >
> >> > On Fri, Jun 21, 2013 at 12:36 AM, Lewis John Mcgibbney <[email protected]> wrote:
> >> >
> >> > > And the rest of the webpage fields actually.
> >> > > Are you getting multiple values for each field or is it just for content?
> >> > >
> >> > > On Thursday, June 20, 2013, Tony Mullins <[email protected]> wrote:
> >> > > > Hi,
> >> > > >
> >> > > > Did anyone get a chance to look at the issue I pointed out?
> >> > > >
> >> > > > I would just like to know whether this is a bug in the new Nutch 2.x,
> >> > > > or whether my understanding of how ParseFilter works (that it will be
> >> > > > run after each URL parse job in seed.txt and will give the user the
> >> > > > raw HTML of that *URL ONLY*) is wrong.
> >> > > >
> >> > > > Thanks,
> >> > > > Tony.
> >> > > >
> >> > > >
> >> > > > On Wed, Jun 19, 2013 at 10:23 PM, Tony Mullins <[email protected]> wrote:
> >> > > >
> >> > > >> Hi,
> >> > > >>
> >> > > >> This is my seed.txt
> >> > > >>
> >> > > >> http://www.google.nl
> >> > > >> http://www.bing.com
> >> > > >>
> >> > > >> This is my ParseFilter
> >> > > >>
> >> > > >> public class HtmlElementSelectorFilter implements ParseFilter {
> >> > > >>
> >> > > >>     public static final Logger log =
> >> > > >>         LoggerFactory.getLogger("HtmlElementSelectorFilter");
> >> > > >>     private Configuration conf = null;
> >> > > >>
> >> > > >>     public HtmlElementSelectorFilter() {}
> >> > > >>
> >> > > >>     @Override
> >> > > >>     public void setConf(Configuration conf) {
> >> > > >>         this.conf = conf;
> >> > > >>     }
> >> > > >>     @Override
> >> > > >
> >
> --
> *Lewis*
>
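
A note on the page.getContent().array() call discussed above. This is only a possible contributing factor and is not confirmed anywhere in this thread: assuming page.getContent() returns a java.nio.ByteBuffer (as the .array() call suggests), array() hands back the buffer's entire backing array and ignores its position and limit. If a backend gives out buffers that share one larger array, array() would expose the bytes of every page in it. The sketch below uses a hypothetical helper, remainingBytes, and made-up page content to show how to read only the bytes that belong to the buffer; it does not use any Nutch or Gora API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ContentBytes {

    // Hypothetical helper: copies only the bytes between position and limit.
    // duplicate() shares the content but has its own position, so the
    // caller's buffer is left untouched.
    static byte[] remainingBytes(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate();
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        // Made-up data: two "pages" stored back to back in one array.
        byte[] backing =
            "<html>A</html><html>B</html>".getBytes(StandardCharsets.UTF_8);

        // A buffer that covers only the second page (offset 14, length 14)
        // but still shares the larger backing array.
        ByteBuffer second = ByteBuffer.wrap(backing, 14, 14);

        // array() ignores position/limit and returns both pages.
        System.out.println(new String(second.array(), StandardCharsets.UTF_8));

        // Reading the remaining bytes returns only the second page.
        System.out.println(
            new String(remainingBytes(second), StandardCharsets.UTF_8));
    }
}
```

If the content buffer in a ParseFilter behaves like `second` here, reading it through a helper of this kind instead of array() would be one thing to try before switching backends.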

