Tony,

The plugins directory contains quite a few examples of parse filters, e.g.
http://svn.apache.org/viewvc/nutch/branches/2.1/src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java?view=markup

I don't use 2.x and don't know how many people use Cassandra as a backend
in GORA, but maybe it would be worth trying your code with HBase+GORA to
check whether the problem could be related to the backend.

Julien



On 21 June 2013 07:15, Tony Mullins <[email protected]> wrote:

> Lewis,
> I have debugged my ParseFilter code many times, and in the debugger I get the same
> results that I see in my log file.
>
> I am getting null for page.getText() and page.getTitle().
> And page.getContent().array() contains the HTML of all the URLs present in
> seed.txt: if there is one seed it has the HTML of one page; if there are
> two seeds, the HTML of both pages.
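>
> (As a side note on the dump itself, and purely as a guess: java.nio.ByteBuffer.array()
> returns the whole backing array, not just the bytes between position and limit, so if the
> storage layer hands back a view into a larger shared buffer, the dump would show more than
> the current page's content. A minimal sketch of a helper that decodes only the valid
> region, in plain Java, with a hypothetical name, meant to sit inside the filter class:
>
> import java.nio.ByteBuffer;
> import java.nio.charset.Charset;
>
> // Hypothetical helper: decode only the bytes between position and limit.
> static String contentAsString(ByteBuffer content) {
>     byte[] bytes = new byte[content.remaining()];
>     content.duplicate().get(bytes); // duplicate() leaves the original buffer's position untouched
>     return new String(bytes, Charset.forName("UTF-8"));
> }
>
> If a dump produced this way still shows both pages, the extra HTML really is in the valid
> region; if not, it was just leftover bytes in a reused buffer.)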
>
> I have now tried this code on a fresh CentOS 6.4 VM and I am getting the same
> result.
>
> I really don't know what else to do here!
>
> Could you please try a simple ParseFilter with the latest Nutch 2.x?
>
> Thanks,
> Tony
>
>
> On Fri, Jun 21, 2013 at 12:36 AM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
> > And the rest of the webpage fields actually.
> > Are you getting multiple values for each field or is it just for content?
> >
> > On Thursday, June 20, 2013, Tony Mullins <[email protected]> wrote:
> > > Hi,
> > >
> > > Did anyone get a chance to look at the issue I pointed out below?
> > >
> > > I would just like to know whether this is a bug in the new Nutch 2.x, or whether my
> > > understanding of how ParseFilter works (that it is run after the parse of each
> > > URL in seed.txt and gives the user the raw HTML of that *URL ONLY*)
> > > is wrong.
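> > >
> > > (One way to make that expectation visible, sketched against the filter() signature and
> > > the 'log' field of the class further down this thread, would be to log one line per
> > > invocation, e.g.:
> > >
> > > @Override
> > > public Parse filter(String url, WebPage page, Parse parse,
> > >         HTMLMetaTags htmlMetaTags, DocumentFragment documentFragment) {
> > >     // If the filter really runs once per parsed URL, each invocation should log
> > >     // exactly one URL together with only that page's content size.
> > >     int validBytes = (page.getContent() == null) ? 0 : page.getContent().remaining();
> > >     log.info("filter() called for " + url + " with " + validBytes + " valid content bytes");
> > >     return parse;
> > > }
> > >
> > > If every URL in seed.txt shows up in its own log line with a plausible size, the
> > > per-URL behaviour itself is fine and the question narrows to what the content dump
> > > actually includes.)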
> > >
> > > Thanks,
> > > Tony.
> > >
> > >
> > > On Wed, Jun 19, 2013 at 10:23 PM, Tony Mullins <[email protected]> wrote:
> > >
> > >> Hi,
> > >>
> > >> *This is my seed.txt*
> > >>
> > >> http://www.google.nl
> > >> http://www.bing.com
> > >>
> > >> *This is my ParseFilter *
> > >>
> > >> import java.nio.ByteBuffer;
> > >> import java.util.Collection;
> > >> import java.util.HashSet;
> > >>
> > >> import org.apache.avro.util.Utf8;
> > >> import org.apache.hadoop.conf.Configuration;
> > >> import org.apache.nutch.crawl.CrawlStatus;
> > >> import org.apache.nutch.parse.HTMLMetaTags;
> > >> import org.apache.nutch.parse.Parse;
> > >> import org.apache.nutch.parse.ParseFilter;
> > >> import org.apache.nutch.protocol.ProtocolStatusUtils;
> > >> import org.apache.nutch.storage.WebPage;
> > >> import org.apache.nutch.util.Bytes;
> > >> import org.slf4j.Logger;
> > >> import org.slf4j.LoggerFactory;
> > >> import org.w3c.dom.DocumentFragment;
> > >>
> > >> public class HtmlElementSelectorFilter implements ParseFilter {
> > >>
> > >>     public static final Logger log =
> > >>         LoggerFactory.getLogger("HtmlElementSelectorFilter");
> > >>     private Configuration conf = null;
> > >>
> > >>     public HtmlElementSelectorFilter() {}
> > >>
> > >>     @Override
> > >>     public void setConf(Configuration conf) {
> > >>         this.conf = conf;
> > >>     }
> > >>
> > >>     @Override
> > >>     public Configuration getConf() {
> > >>         return conf;
> > >>     }
> > >>
> > >>     @Override
> > >>     public Collection<WebPage.Field> getFields() {
> > >>         return new HashSet<WebPage.Field>();
> > >>     }
> > >>
> > >>     @Override
> > >>     public Parse filter(String url, WebPage page, Parse parse,
> > >>             HTMLMetaTags htmlMetaTags, DocumentFragment documentFragment) {
> > >>
> > >>         // Dump the WebPage fields seen by this filter invocation.
> > >>         StringBuffer sb = new StringBuffer();
> > >>
> > >>         sb.append("baseUrl:\t" + page.getBaseUrl()).append("\n");
> > >>         sb.append("status:\t").append(page.getStatus()).append(" (").append(
> > >>             CrawlStatus.getName((byte) page.getStatus())).append(")\n");
> > >>         sb.append("fetchTime:\t" + page.getFetchTime()).append("\n");
> > >>         sb.append("prevFetchTime:\t" + page.getPrevFetchTime()).append("\n");
> > >>         sb.append("fetchInterval:\t" + page.getFetchInterval()).append("\n");
> > >>         sb.append("retriesSinceFetch:\t" + page.getRetriesSinceFetch()).append("\n");
> > >>         sb.append("modifiedTime:\t" + page.getModifiedTime()).append("\n");
> > >>         sb.append("prevModifiedTime:\t" + page.getPrevModifiedTime()).append("\n");
> > >>         sb.append("protocolStatus:\t" +
> > >>             ProtocolStatusUtils.toString(page.getProtocolStatus())).append("\n");
> > >>
> > >>         ByteBuffer content = page.getContent();
> > >>         if (content != null) {
> > >>             sb.append("contentType:\t" + page.getContentType()).append("\n");
> > >>             sb.append("content:start:\n");
> > >>             sb.append(Bytes.toString(content.array()));
> > >>             sb.append("\ncontent:end:\n");
> > >>         }
> > >>         Utf8 text = page.getText();
> > >>         if (text != null) {
> > >>             sb.append("text:start:\n");
> > >>             sb.append(text.toString());
> > >>             sb.append("\ntext:end:\n");
> > >>         }
> > >>
> > >>         log.info("My Log is " + sb.toString());
> > >>         return parse;
> > >>     }
> > >> }
> > >>
> > >> *And this is my log file, and as you can see, for each URL in
> > >> seed.txt it is returning the HTML of both pages (bing & google):*
> > >>
> > >>
> > >> https://docs.google.com/file/d/0B9DKVnl1zAbSb0wtN2JS
> >
> > --
> > *Lewis*
> >
>



-- 
*Open Source Solutions for Text Engineering*

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
