Maybe an obvious question Tony, but have you tried stepping through this and debugging your code? Another thread appeared today which is basically the same problem as yours. I am struggling to see this, as the ParseFilter plugin implementations shipped with 2.x do not appear to have this behaviour. The most likely reason that no one has got back to you is that they are not getting the results you are getting, or are simply too busy. Regarding the understanding of the interface, it doesn't get any more precise than the Javadoc (http://nutch.apache.org/apidocs-2.2/index.html?org/apache/nutch/parse/ParseFilter.html): "Extension point for DOM-based parsers. Permits one to add additional metadata to parses provided by the html or tika plugins. All plugins found which implement this extension point are run sequentially on the parse."
I am keen to find out what your Utf8 text = page.getText(); object returns?

On Thursday, June 20, 2013, Tony Mullins <[email protected]> wrote:
> Hi,
>
> Did anyone get a chance to look at the pointed-out issue?
>
> Just would like to know whether this is a bug in new Nutch 2.x, or whether my
> understanding of how ParseFilter works (that it will be run after each URL
> parse job in seed.txt and will give the user the raw HTML of that *URL ONLY*)
> is wrong.
>
> Thanks,
> Tony.
>
>
> On Wed, Jun 19, 2013 at 10:23 PM, Tony Mullins <[email protected]> wrote:
>
>> Hi,
>>
>> *This is my seed.txt*
>>
>> http://www.google.nl
>> http://www.bing.com
>>
>> *This is my ParseFilter*
>>
>> public class HtmlElementSelectorFilter implements ParseFilter {
>>
>>     public static final Logger log =
>>             LoggerFactory.getLogger("HtmlElementSelectorFilter");
>>     private Configuration conf = null;
>>
>>     public HtmlElementSelectorFilter() {}
>>
>>     @Override
>>     public void setConf(Configuration conf) {
>>         this.conf = conf;
>>     }
>>
>>     @Override
>>     public Configuration getConf() {
>>         return conf;
>>     }
>>
>>     @Override
>>     public Collection<WebPage.Field> getFields() {
>>         return new HashSet<WebPage.Field>();
>>     }
>>
>>     @Override
>>     public Parse filter(String s, WebPage page, Parse parse,
>>             HTMLMetaTags htmlMetaTags, DocumentFragment documentFragment) {
>>
>>         StringBuffer sb = new StringBuffer();
>>
>>         sb.append("baseUrl:\t" + page.getBaseUrl()).append("\n");
>>         sb.append("status:\t").append(page.getStatus()).append(" (")
>>                 .append(CrawlStatus.getName((byte) page.getStatus()))
>>                 .append(")\n");
>>         sb.append("fetchTime:\t" + page.getFetchTime()).append("\n");
>>         sb.append("prevFetchTime:\t" + page.getPrevFetchTime()).append("\n");
>>         sb.append("fetchInterval:\t" + page.getFetchInterval()).append("\n");
>>         sb.append("retriesSinceFetch:\t"
>>                 + page.getRetriesSinceFetch()).append("\n");
>>         sb.append("modifiedTime:\t" + page.getModifiedTime()).append("\n");
>>         sb.append("prevModifiedTime:\t"
>>                 + page.getPrevModifiedTime()).append("\n");
>>         sb.append("protocolStatus:\t"
>>                 + ProtocolStatusUtils.toString(page.getProtocolStatus()))
>>                 .append("\n");
>>
>>         ByteBuffer content = page.getContent();
>>         if (content != null) {
>>             sb.append("contentType:\t" + page.getContentType()).append("\n");
>>             sb.append("content:start:\n");
>>             sb.append(Bytes.toString(content.array()));
>>             sb.append("\ncontent:end:\n");
>>         }
>>
>>         Utf8 text = page.getText();
>>         if (text != null) {
>>             sb.append("text:start:\n");
>>             sb.append(text.toString());
>>             sb.append("\ntext:end:\n");
>>         }
>>
>>         log.info("My Log is " + sb.toString());
>>         return parse;
>>     }
>> }
>>
>> *And this is my log file, and as you can see, for each URL in
>> seed.txt it is returning the HTML of both pages (Bing & Google)*
>>
>> https://docs.google.com/file/d/0B9DKVnl1zAbSb0wtN2JS

--
*Lewis*
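One thing worth checking in the quoted filter (this is a guess on my part, not a confirmed diagnosis): `Bytes.toString(content.array())` decodes the ByteBuffer's *entire backing array*, not just the valid region between position and limit. If the buffer you get from page.getContent() is backed by a larger or reused array, you would log leftover bytes from a previously processed page, which could look exactly like "both pages' HTML" in a single log entry. A minimal, Nutch-free sketch of the difference (the class and method names here are mine, purely for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ContentDecode {

    // Decode only the buffer's valid region (position..limit),
    // rather than the whole backing array.
    static String decode(ByteBuffer content) {
        byte[] valid = new byte[content.remaining()];
        // duplicate() so the caller's position is left untouched
        content.duplicate().get(valid);
        return new String(valid, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate a buffer whose backing array is larger than its content,
        // as can happen when arrays are reused between records.
        byte[] backing = new byte[16];
        byte[] html = "<html/>".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(html, 0, backing, 0, html.length);
        ByteBuffer buf = ByteBuffer.wrap(backing, 0, html.length);

        // Decoding the valid region yields just this page's content:
        System.out.println(decode(buf)); // prints "<html/>"

        // Decoding the whole backing array picks up the trailing
        // stale/zero bytes as well (16 chars instead of 7):
        System.out.println(new String(buf.array(), StandardCharsets.UTF_8).length()); // 16
    }
}
```

If the whole-array decode is indeed the culprit, switching the filter to read only `content.remaining()` bytes from the buffer would tell you quickly.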

