Yes, switching to HBase has solved this issue for me. I am now getting the
correct HTML in my ParseFilter plugin.
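
A note for anyone who hits the same symptom: one possible cause, which is an
assumption and was not confirmed anywhere in this thread, is that the backend
returns a ByteBuffer that is only a window onto a larger shared array. In that
case java.nio.ByteBuffer.array() returns the whole backing array rather than
just the buffer's own bytes. The sketch below uses only the JDK; the class
ContentBytes and the helper toBytes are hypothetical names, not Nutch or Gora
API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ContentBytes {

    /**
     * Hypothetical helper: copies only the bytes between position and limit.
     * It reads through a duplicate so the caller's position is not changed.
     */
    static byte[] toBytes(ByteBuffer buf) {
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out);
        return out;
    }

    public static void main(String[] args) {
        // Two pages stored back to back in one shared array (made-up data).
        byte[] shared = "<html>A</html><html>B</html>"
                .getBytes(StandardCharsets.UTF_8);

        // A buffer that covers only the second page (offset 14, length 14).
        ByteBuffer second = ByteBuffer.wrap(shared, 14, 14);

        // array() ignores position and limit, so both pages come back.
        System.out.println(new String(second.array(), StandardCharsets.UTF_8));

        // Copying the readable window yields only the second page.
        System.out.println(new String(toBytes(second), StandardCharsets.UTF_8));
    }
}
```

If this is what is happening, reading the content through a helper like
toBytes(page.getContent()) instead of page.getContent().array() would be worth
trying on the Cassandra backend.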

Thanks, guys, for the help.
Tony.


On Fri, Jun 21, 2013 at 8:31 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> In short yes I think it is gora-cassandra that is the problem here.
> This is precisely the reason that I'm using the Cassandra backend, to try
> and root these bugs out.
>
> On Friday, June 21, 2013, Jamshaid Ashraf <[email protected]> wrote:
> > Hi,
> >
> > I'm also facing the same issue with cassandra backend.
> >
> > Do you think that cassandra is the reason for returning repeated html in
> > parse job for parsefilter plugin?
> >
> > Regards,
> > Jamshaid
> >
> >
> > On Fri, Jun 21, 2013 at 1:18 PM, Julien Nioche <
> > [email protected]> wrote:
> >
> >> Tony,
> >>
> >> The plugins directory contains quite a few examples of parsefilters e.g.
> >>
> >>
>
> http://svn.apache.org/viewvc/nutch/branches/2.1/src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java?view=markup
> >>
> >> I don't use 2.x and don't know how many people use Cassandra as a backend
> >> in GORA, but maybe it would be worth trying your code with HBase+GORA to
> >> check whether it could be related to the backend.
> >>
> >> Julien
> >>
> >>
> >>
> >> On 21 June 2013 07:15, Tony Mullins <[email protected]> wrote:
> >>
> >> > Lewis,
> >> > I have debugged my ParseFilter code many times, and in the debugger I
> >> > get the same results that I see in my log file.
> >> >
> >> > I am getting null for page.getText() and page.getTitle().
> >> > And page.getContent().array() contains the HTML of all URLs present in
> >> > seed.txt. If there is one seed, it has the HTML of one page; if there are
> >> > 2 seeds, it has the HTML of both pages.
> >> >
> >> > I have now tried this code on a new CentOS 6.4 VM and I am getting the
> >> > same result.
> >> >
> >> > I really don't know what else to do here!
> >> >
> >> > Could you please try any simple ParseFilter with the latest Nutch 2.x?
> >> >
> >> > Thanks,
> >> > Tony
> >> >
> >> >
> >> > On Fri, Jun 21, 2013 at 12:36 AM, Lewis John Mcgibbney <
> >> > [email protected]> wrote:
> >> >
> >> > > And the rest of the webpage fields actually.
> >> > > Are you getting multiple values for each field or is it just for content?
> >> > >
> >> > > On Thursday, June 20, 2013, Tony Mullins <[email protected]>
> >> > wrote:
> >> > > > Hi,
> >> > > >
> >> > > > Did anyone get a chance to look at the issue I pointed out?
> >> > > >
> >> > > > I would just like to know whether this is a bug in the new Nutch 2.x,
> >> > > > or whether my understanding of how ParseFilter works (that it runs
> >> > > > after the parse job for each URL in seed.txt and gives the user the
> >> > > > raw HTML of that *URL ONLY*) is wrong.
> >> > > >
> >> > > > Thanks,
> >> > > > Tony.
> >> > > >
> >> > > >
> >> > > > On Wed, Jun 19, 2013 at 10:23 PM, Tony Mullins <
> >> > [email protected]
> >> > > >wrote:
> >> > > >
> >> > > >> Hi,
> >> > > >>
> >> > > >> *This is my seed.txt*
> >> > > >>
> >> > > >> http://www.google.nl
> >> > > >> http://www.bing.com
> >> > > >>
> >> > > >> *This is my ParseFilter*
> >> > > >>
> >> > > >> public class HtmlElementSelectorFilter implements ParseFilter {
> >> > > >>
> >> > > >>   public static final Logger log =
> >> > > >>       LoggerFactory.getLogger("HtmlElementSelectorFilter");
> >> > > >>   private Configuration conf = null;
> >> > > >>
> >> > > >>   public HtmlElementSelectorFilter() {}
> >> > > >>
> >> > > >>   @Override
> >> > > >>   public void setConf(Configuration conf) {
> >> > > >>     this.conf = conf;
> >> > > >>   }
> >> > > >>   @Override
> >> > > >
>
> --
> *Lewis*
>
