In short, yes: I think gora-cassandra is the problem here.
This is precisely why I'm using the Cassandra backend, to try
and root these bugs out.

On Friday, June 21, 2013, Jamshaid Ashraf <[email protected]> wrote:
> Hi,
>
> I'm also facing the same issue with the Cassandra backend.
>
> Do you think that Cassandra is the reason repeated HTML is returned in
> the parse job for the parsefilter plugin?
>
> Regards,
> Jamshaid
>
>
> On Fri, Jun 21, 2013 at 1:18 PM, Julien Nioche <
> [email protected]> wrote:
>
>> Tony,
>>
>> The plugins directory contains quite a few examples of parsefilters e.g.
>>
>>
>> http://svn.apache.org/viewvc/nutch/branches/2.1/src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java?view=markup
>>
>> I don't use 2.x and don't know how many people use Cassandra as a backend
>> in GORA but maybe it would be worth trying your code with HBase+GORA to
>> check whether it could be related to the backend.
>>
>> Julien
>>
>>
>>
>> On 21 June 2013 07:15, Tony Mullins <[email protected]> wrote:
>>
>> > Lewis,
>> > I have debugged my ParseFilter code many times, and in the debugger I get
>> > the same results that I get in my log file.
>> >
>> > I am getting null for page.getText() and page.getTitle().
>> > And page.getContent().array() contains the HTML of all URLs present in
>> > seed.txt. If there is one seed, it has the HTML of one page; if there are
>> > 2 seeds, it has the HTML of those 2 pages.
>> >
>> > I have now tried this code on a new CentOS 6.4 VM and I am getting the same
>> > result.
>> >
>> > I really don't know what else to do here!
>> >
>> > Could you please try any simple ParseFilter with the latest Nutch 2.x?
>> >
>> > Thanks,
>> > Tony
>> >
>> >
>> > On Fri, Jun 21, 2013 at 12:36 AM, Lewis John Mcgibbney <
>> > [email protected]> wrote:
>> >
>> > > And the rest of the webpage fields actually.
>> > > Are you getting multiple values for each field, or is it just for
>> > > content?
>> > >
>> > > On Thursday, June 20, 2013, Tony Mullins <[email protected]> wrote:
>> > > > Hi,
>> > > >
>> > > > Did anyone get a chance to look at the issue I pointed out?
>> > > >
>> > > > I would just like to know whether this is a bug in the new Nutch 2.x, or
>> > > > whether my understanding of how ParseFilter works (that it will be run
>> > > > after each URL parse job in seed.txt and will give the user the raw HTML
>> > > > of that *URL ONLY*) is wrong.
>> > > >
>> > > > Thanks,
>> > > > Tony.
>> > > >
>> > > >
>> > > > On Wed, Jun 19, 2013 at 10:23 PM, Tony Mullins <[email protected]> wrote:
>> > > >
>> > > >> Hi,
>> > > >>
>> > > >> *This is my seed.txt*
>> > > >>
>> > > >> http://www.google.nl
>> > > >> http://www.bing.com
>> > > >>
>> > > >> *This is my ParseFilter*
>> > > >>
>> > > >> public class HtmlElementSelectorFilter implements ParseFilter {
>> > > >>
>> > > >>   public static final Logger log =
>> > > >>       LoggerFactory.getLogger("HtmlElementSelectorFilter");
>> > > >>   private Configuration conf = null;
>> > > >>
>> > > >>   public HtmlElementSelectorFilter() {}
>> > > >>
>> > > >>   @Override
>> > > >>   public void setConf(Configuration conf) {
>> > > >>     this.conf = conf;
>> > > >>   }
>> > > >>   @Override
>> > > >

-- 
*Lewis*
