Re: Why webPage.getContent().array() is returning html of all pages in seed.txt ?

Tony Mullins Thu, 20 Jun 2013 11:06:08 -0700

Hi,

Did anyone get a chance to look at the issue described below?


I would just like to know whether this is a bug in the new Nutch 2.x, or
whether my understanding of how ParseFilter works is wrong. I assumed it
runs after each URL from seed.txt is parsed and gives the user the raw
HTML of *that URL only*.

Thanks,
Tony.


On Wed, Jun 19, 2013 at 10:23 PM, Tony Mullins <[email protected]> wrote:

> Hi,
>
> This is my seed.txt:
>
> http://www.google.nl
> http://www.bing.com
>
> This is my ParseFilter:
>
> public class HtmlElementSelectorFilter implements ParseFilter {
>
>   public static final Logger log =
>       LoggerFactory.getLogger("HtmlElementSelectorFilter");
>   private Configuration conf = null;
>
>   public HtmlElementSelectorFilter() {}
>
>   @Override
>   public void setConf(Configuration conf) {
>     this.conf = conf;
>   }
>
>   @Override
>   public Configuration getConf() {
>     return conf;
>   }
>
>   @Override
>   public Collection<WebPage.Field> getFields() {
>     return new HashSet<WebPage.Field>();
>   }
>
>   @Override
>   public Parse filter(String s, WebPage page, Parse parse,
>       HTMLMetaTags htmlMetaTags, DocumentFragment documentFragment) {
>
>     StringBuffer sb = new StringBuffer();
>
>     sb.append("baseUrl:\t" + page.getBaseUrl()).append("\n");
>     sb.append("status:\t").append(page.getStatus()).append(" (")
>         .append(CrawlStatus.getName((byte) page.getStatus()))
>         .append(")\n");
>     sb.append("fetchTime:\t" + page.getFetchTime()).append("\n");
>     sb.append("prevFetchTime:\t" + page.getPrevFetchTime())
>         .append("\n");
>     sb.append("fetchInterval:\t" + page.getFetchInterval())
>         .append("\n");
>     sb.append("retriesSinceFetch:\t" + page.getRetriesSinceFetch())
>         .append("\n");
>     sb.append("modifiedTime:\t" + page.getModifiedTime())
>         .append("\n");
>     sb.append("prevModifiedTime:\t" + page.getPrevModifiedTime())
>         .append("\n");
>     sb.append("protocolStatus:\t"
>         + ProtocolStatusUtils.toString(page.getProtocolStatus()))
>         .append("\n");
>
>     ByteBuffer content = page.getContent();
>     if (content != null) {
>       sb.append("contentType:\t" + page.getContentType())
>           .append("\n");
>       sb.append("content:start:\n");
>       sb.append(Bytes.toString(content.array()));
>       sb.append("\ncontent:end:\n");
>     }
>     Utf8 text = page.getText();
>     if (text != null) {
>       sb.append("text:start:\n");
>       sb.append(text.toString());
>       sb.append("\ntext:end:\n");
>     }
>
>     log.info("My Log is " + sb.toString());
>     return parse;
>   }
> }
>
> And this is my log file. As you can see, for each URL in seed.txt it
> returns the HTML of both pages (Bing and Google):
>
>
> https://docs.google.com/file/d/0B9DKVnl1zAbSb0wtN2JSVE4zWjg/edit?usp=sharing
>
> Could anyone please help me here? I really need to understand what I
> am doing wrong and why I am not getting the HTML of only the page that
> is currently being processed by ParseFilter (i.e. the page shown by
> page.getBaseUrl()).
>
> Thanks.
> Tony.
>
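
One possible explanation, which this thread does not confirm, is that the ByteBuffer returned by page.getContent() is a window onto a larger backing array shared by several records. java.nio.ByteBuffer.array() always returns the whole backing array and ignores position() and limit(), so decoding it would show every page stored there. The sketch below uses plain JDK classes only. The class name BufferSlice, the helper visibleText and the sample HTML strings are hypothetical, and whether Nutch 2.x actually shares arrays this way is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BufferSlice {

    /** Decodes only the bytes between position() and limit() of the buffer. */
    public static String visibleText(ByteBuffer buf) {
        // duplicate() shares the bytes but has its own position and limit,
        // so reading from the copy leaves the caller's buffer untouched.
        ByteBuffer view = buf.duplicate();
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for two fetched pages stored in one array.
        String google = "<html>google</html>";
        String bing = "<html>bing</html>";
        byte[] shared = (google + bing).getBytes(StandardCharsets.UTF_8);

        // Each buffer exposes only its own page through position/limit.
        ByteBuffer first = ByteBuffer.wrap(shared, 0, google.length());
        ByteBuffer second = ByteBuffer.wrap(shared, google.length(), bing.length());

        // array() ignores position/limit and hands back both pages.
        System.out.println(new String(first.array(), StandardCharsets.UTF_8));

        // Reading through the buffer's own window yields one page each.
        System.out.println(visibleText(first));
        System.out.println(visibleText(second));
    }
}
```

If that assumption holds for the filter above, decoding only the remaining bytes of content, rather than content.array(), would be the thing to try.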
