Lewis, I have debugged my ParseFilter code many times, and in the debugger I get the same results that I see in my log file.
I am getting null for page.getText() and page.getTitle(), and page.getContent().array() contains the HTML of all the URLs present in seed.txt: if there is one seed, it has the HTML of one page; if there are two seeds, the HTML of those two pages. I have now tried this code on a fresh CentOS 6.4 VM and I get the same result. I really don't know what else to do here! Could you please try any simple ParseFilter with the latest Nutch 2.x?

Thanks,
Tony

On Fri, Jun 21, 2013 at 12:36 AM, Lewis John Mcgibbney <[email protected]> wrote:

> And the rest of the webpage fields actually.
> Are you getting multiple values for each field, or is it just for content?
>
> On Thursday, June 20, 2013, Tony Mullins <[email protected]> wrote:
> > Hi,
> >
> > Did anyone get a chance to look at the issue pointed out below?
> >
> > I would just like to know whether this is a bug in the new Nutch 2.x, or
> > whether my understanding of how ParseFilter works (that it is run after
> > the parse job for each URL in seed.txt and gives the user the raw HTML of
> > that *URL ONLY*) is wrong.
> >
> > Thanks,
> > Tony.
> > On Wed, Jun 19, 2013 at 10:23 PM, Tony Mullins <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> This is my seed.txt:
> >>
> >> http://www.google.nl
> >> http://www.bing.com
> >>
> >> This is my ParseFilter:
> >>
> >> public class HtmlElementSelectorFilter implements ParseFilter {
> >>
> >>     public static final Logger log =
> >>             LoggerFactory.getLogger("HtmlElementSelectorFilter");
> >>     private Configuration conf = null;
> >>
> >>     public HtmlElementSelectorFilter() {}
> >>
> >>     @Override
> >>     public void setConf(Configuration conf) {
> >>         this.conf = conf;
> >>     }
> >>
> >>     @Override
> >>     public Configuration getConf() {
> >>         return conf;
> >>     }
> >>
> >>     @Override
> >>     public Collection<WebPage.Field> getFields() {
> >>         return new HashSet<WebPage.Field>();
> >>     }
> >>
> >>     @Override
> >>     public Parse filter(String s, WebPage page, Parse parse,
> >>             HTMLMetaTags htmlMetaTags, DocumentFragment documentFragment) {
> >>
> >>         StringBuilder sb = new StringBuilder();
> >>
> >>         sb.append("baseUrl:\t").append(page.getBaseUrl()).append("\n");
> >>         sb.append("status:\t").append(page.getStatus()).append(" (")
> >>                 .append(CrawlStatus.getName((byte) page.getStatus()))
> >>                 .append(")\n");
> >>         sb.append("fetchTime:\t").append(page.getFetchTime()).append("\n");
> >>         sb.append("prevFetchTime:\t").append(page.getPrevFetchTime()).append("\n");
> >>         sb.append("fetchInterval:\t").append(page.getFetchInterval()).append("\n");
> >>         sb.append("retriesSinceFetch:\t").append(page.getRetriesSinceFetch()).append("\n");
> >>         sb.append("modifiedTime:\t").append(page.getModifiedTime()).append("\n");
> >>         sb.append("prevModifiedTime:\t").append(page.getPrevModifiedTime()).append("\n");
> >>         sb.append("protocolStatus:\t")
> >>                 .append(ProtocolStatusUtils.toString(page.getProtocolStatus()))
> >>                 .append("\n");
> >>
> >>         ByteBuffer content = page.getContent();
> >>         if (content != null) {
> >>             sb.append("contentType:\t").append(page.getContentType()).append("\n");
> >>             sb.append("content:start:\n");
> >>             sb.append(Bytes.toString(content.array()));
> >>             sb.append("\ncontent:end:\n");
> >>         }
> >>
> >>         Utf8 text = page.getText();
> >>         if (text != null) {
> >>             sb.append("text:start:\n");
> >>             sb.append(text.toString());
> >>             sb.append("\ntext:end:\n");
> >>         }
> >>
> >>         log.info("My Log is " + sb.toString());
> >>         return parse;
> >>     }
> >> }
> >>
> >> And this is my log file; as you can see, for each URL in seed.txt it is
> >> returning the HTML of both pages (bing and google):
> >>
> >> https://docs.google.com/file/d/0B9DKVnl1zAbSb0wtN2JS
>
> --
> *Lewis*
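One thing worth checking here (an assumption on my part, not something confirmed in the thread): `ByteBuffer.array()` returns the *entire* backing array and ignores `arrayOffset()`, `position()` and `limit()`. If the buffers that Gora hands back are views over a shared or reused backing array, then `Bytes.toString(content.array())` would print other pages' HTML as well, even though each buffer's view is correct. A minimal JDK-only sketch of the pitfall and a safer read (the class and helper names are hypothetical, for illustration only):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferSlice {

    // Read only THIS buffer's view: honour arrayOffset(), position()
    // and limit() instead of dumping the whole backing array.
    static String viewAsString(ByteBuffer buf) {
        return new String(buf.array(),
                          buf.arrayOffset() + buf.position(),
                          buf.remaining(),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate two pages' content sharing one backing array.
        byte[] backing = "<html>google</html><html>bing</html>"
                .getBytes(StandardCharsets.UTF_8);

        // A buffer that views only the first page's 19 bytes:
        ByteBuffer page1 = ByteBuffer.wrap(backing, 0, 19);

        // array() exposes everything in the backing array:
        String wrong = new String(page1.array(), StandardCharsets.UTF_8);
        // viewAsString() exposes only the bytes this buffer covers:
        String right = viewAsString(page1);

        System.out.println("array():   " + wrong); // both pages' HTML
        System.out.println("view only: " + right); // "<html>google</html>"
    }
}
```

If this is what is happening, replacing `Bytes.toString(content.array())` in the filter with a bounds-respecting read like the one above should show each page's own HTML only.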

