Re: Why webPage.getContent().array() is returning html of all pages in seed.txt ?

Lewis John Mcgibbney Thu, 20 Jun 2013 12:37:54 -0700

And the rest of the webpage fields actually.
Are you getting multiple values for each field or is it just for content?
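
One thing that may be worth ruling out, independent of Nutch itself: `ByteBuffer.array()` returns the buffer's entire backing array and ignores its position and limit. If the data store hands back buffers that share one larger array (an assumption on my part, not something confirmed in this thread), decoding `content.array()` would also show bytes that belong to other pages. Below is a minimal sketch using only `java.nio`; the class name `ByteBufferContent` and the helper `toStringBounded` are hypothetical, made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferContent {

    /** Decodes only the bytes between position and limit, not the whole backing array. */
    static String toStringBounded(ByteBuffer buf) {
        // duplicate() shares the bytes but keeps its own position/limit,
        // so reading from the view leaves the caller's buffer untouched.
        ByteBuffer view = buf.duplicate();
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical data: one array holding two pages back to back.
        String google = "<html>google</html>";
        String bing = "<html>bing</html>";
        byte[] backing = (google + bing).getBytes(StandardCharsets.UTF_8);

        // A buffer that logically covers only the first page.
        ByteBuffer first = ByteBuffer.wrap(backing, 0, google.length());

        // array() exposes the whole backing array, so both pages appear.
        System.out.println(new String(first.array(), StandardCharsets.UTF_8));

        // Respecting position and limit yields only the first page.
        System.out.println(toStringBounded(first));
    }
}
```

If this does turn out to be the cause, the same idea should apply inside the filter: decode from the buffer's current position for `remaining()` bytes rather than passing `content.array()` straight to a string conversion.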


On Thursday, June 20, 2013, Tony Mullins <[email protected]> wrote:
> Hi,
>
> Did anyone get a chance to look at the issue I pointed out?
>
> I would just like to know whether this is a bug in the new Nutch 2.x, or
> whether my understanding of how ParseFilter works (that it runs after each
> URL in seed.txt is parsed and gives the user the raw HTML of that *URL ONLY*)
> is wrong.
>
> Thanks,
> Tony.
>
>
> On Wed, Jun 19, 2013 at 10:23 PM, Tony Mullins <[email protected]> wrote:
>
>> Hi,
>>
>> *This is my seed.txt*
>>
>> http://www.google.nl
>> http://www.bing.com
>>
>> *This is my ParseFilter*
>>
>> public class HtmlElementSelectorFilter implements ParseFilter {
>>
>>   public static final Logger log =
>>       LoggerFactory.getLogger("HtmlElementSelectorFilter");
>>   private Configuration conf = null;
>>
>>   public HtmlElementSelectorFilter() {}
>>
>>   @Override
>>   public void setConf(Configuration conf) {
>>     this.conf = conf;
>>   }
>>
>>   @Override
>>   public Configuration getConf() {
>>     return conf;
>>   }
>>
>>   @Override
>>   public Collection<WebPage.Field> getFields() {
>>     return new HashSet<WebPage.Field>();
>>   }
>>
>>   @Override
>>   public Parse filter(String s, WebPage page, Parse parse,
>>       HTMLMetaTags htmlMetaTags, DocumentFragment documentFragment) {
>>
>>     StringBuffer sb = new StringBuffer();
>>
>>     sb.append("baseUrl:\t" + page.getBaseUrl()).append("\n");
>>     sb.append("status:\t").append(page.getStatus()).append(" (").append(
>>         CrawlStatus.getName((byte) page.getStatus())).append(")\n");
>>     sb.append("fetchTime:\t" + page.getFetchTime()).append("\n");
>>     sb.append("prevFetchTime:\t" + page.getPrevFetchTime()).append("\n");
>>     sb.append("fetchInterval:\t" + page.getFetchInterval()).append("\n");
>>     sb.append("retriesSinceFetch:\t" + page.getRetriesSinceFetch()).append("\n");
>>     sb.append("modifiedTime:\t" + page.getModifiedTime()).append("\n");
>>     sb.append("prevModifiedTime:\t" + page.getPrevModifiedTime()).append("\n");
>>     sb.append("protocolStatus:\t"
>>         + ProtocolStatusUtils.toString(page.getProtocolStatus())).append("\n");
>>
>>     ByteBuffer content = page.getContent();
>>     if (content != null) {
>>       sb.append("contentType:\t" + page.getContentType()).append("\n");
>>       sb.append("content:start:\n");
>>       sb.append(Bytes.toString(content.array()));
>>       sb.append("\ncontent:end:\n");
>>     }
>>     Utf8 text = page.getText();
>>     if (text != null) {
>>       sb.append("text:start:\n");
>>       sb.append(text.toString());
>>       sb.append("\ntext:end:\n");
>>     }
>>
>>     log.info("My Log is " + sb.toString());
>>     return parse;
>>   }
>> }
>>
>> *And this is my log file. As you can see, for each URL in seed.txt it
>> returns the HTML of both pages (Bing & Google).*
>>
>>
>> https://docs.google.com/file/d/0B9DKVnl1zAbSb0wtN2JS

-- 
*Lewis*
