Re: Why webPage.getContent().array() is returning html of all pages in seed.txt ?

Tony Mullins Wed, 19 Jun 2013 10:24:03 -0700

Hi,

This is my seed.txt:


http://www.google.nl
http://www.bing.com

This is my ParseFilter:

public class HtmlElementSelectorFilter implements ParseFilter {

  public static final Logger log =
      LoggerFactory.getLogger("HtmlElementSelectorFilter");
  private Configuration conf = null;

  public HtmlElementSelectorFilter() {}

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public Collection<WebPage.Field> getFields() {
    return new HashSet<WebPage.Field>();
  }

  @Override
  public Parse filter(String s, WebPage page, Parse parse,
      HTMLMetaTags htmlMetaTags, DocumentFragment documentFragment) {

    StringBuffer sb = new StringBuffer();

    sb.append("baseUrl:\t" + page.getBaseUrl()).append("\n");
    sb.append("status:\t").append(page.getStatus()).append(" (").append(
        CrawlStatus.getName((byte) page.getStatus())).append(")\n");
    sb.append("fetchTime:\t" + page.getFetchTime()).append("\n");
    sb.append("prevFetchTime:\t" + page.getPrevFetchTime()).append("\n");
    sb.append("fetchInterval:\t" + page.getFetchInterval()).append("\n");
    sb.append("retriesSinceFetch:\t" + page.getRetriesSinceFetch()).append("\n");
    sb.append("modifiedTime:\t" + page.getModifiedTime()).append("\n");
    sb.append("prevModifiedTime:\t" + page.getPrevModifiedTime()).append("\n");
    sb.append("protocolStatus:\t" +
        ProtocolStatusUtils.toString(page.getProtocolStatus())).append("\n");

    ByteBuffer content = page.getContent();
    if (content != null) {
      sb.append("contentType:\t" + page.getContentType()).append("\n");
      sb.append("content:start:\n");
      sb.append(Bytes.toString(content.array()));
      sb.append("\ncontent:end:\n");
    }
    Utf8 text = page.getText();
    if (text != null) {
      sb.append("text:start:\n");
      sb.append(text.toString());
      sb.append("\ntext:end:\n");
    }

    log.info("My Log is " + sb.toString());
    return parse;
  }
}
And this is my log file. As you can see, for each URL in seed.txt it
returns the HTML of both pages (Bing and Google):

https://docs.google.com/file/d/0B9DKVnl1zAbSb0wtN2JSVE4zWjg/edit?usp=sharing

Could anyone please help me here? I really need to understand what I am
doing wrong, and why I am not getting only the HTML of the page currently
being processed by the ParseFilter (i.e. the page shown by page.getBaseUrl()).
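
One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: java.nio.ByteBuffer.array() returns the whole backing array and ignores the buffer's position and limit. If the buffer returned by page.getContent() happens to be a window onto a larger array shared between pages, array() would expose more than the current page. The standalone sketch below uses only java.nio and no Nutch classes; the class name BufferSliceDemo, the helper remainingAsString, and the sample HTML strings are hypothetical and exist only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BufferSliceDemo {

    // Copies only the bytes between position() and limit().
    // Works on a duplicate so the caller's buffer position is not moved.
    static String remainingAsString(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate();
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical shared backing array holding two pages back to back.
        String pageA = "<html>google</html>";
        String pageB = "<html>bing</html>";
        byte[] shared = (pageA + pageB).getBytes(StandardCharsets.UTF_8);

        // Two buffers that are windows onto the same array.
        ByteBuffer first = ByteBuffer.wrap(shared, 0, pageA.length());
        ByteBuffer second = ByteBuffer.wrap(shared, pageA.length(), pageB.length());

        // array() ignores position/limit, so this shows both pages.
        System.out.println(new String(first.array(), StandardCharsets.UTF_8));

        // Reading only the remaining bytes shows one page per buffer.
        System.out.println(remainingAsString(first));
        System.out.println(remainingAsString(second));
    }
}
```

If the same pattern applies to the Nutch buffer, logging the result of a helper like remainingAsString(content) instead of Bytes.toString(content.array()) should show whether each buffer really covers a single page.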

Thanks.
Tony.
