Re: Need to crawl the site that requires flash to be enabled

Alexis Hope Thu, 05 Feb 2015 04:43:14 -0800

Hi Kartik,

The site you're trying to crawl is a Flash website. Unfortunately, that will
be a problem for Nutch.
Nutch doesn't render the page; it only fetches it. It won't load Flash, CSS, or
JS that are included in the page.

To limit the crawl to the domain http://museums.bankofamerica.com, try
using the regex-urlfilter plugin. I believe adding a new line with
+^http://museums.bankofamerica.com to conf/regex-urlfilter.txt should
limit it. The depth only sets how many link levels from the root page are
crawled; as I understand it, this will still include outlinks.
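
To make the filter idea concrete, here is a rough Python sketch of how a
rule list like the one in conf/regex-urlfilter.txt can be evaluated. It is
a simplified model, not Nutch's code: it assumes each rule line is '+' or
'-' followed by a regex, that the first rule found in the URL decides, and
that a URL matching no rule is dropped. The rule text and the helper names
(parse_rules, accepts) are made up for illustration.

```python
import re

# Hypothetical rule file contents (an assumption, not the poster's actual file).
RULES_TEXT = r"""
# accept URLs on the museum host only
+^http://museums\.bankofamerica\.com
# reject everything else
-.
"""

def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Simplification: any line not starting with '+' is treated as a reject rule.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(url, rules):
    """The first rule whose pattern is found in the URL decides; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

RULES = parse_rules(RULES_TEXT)
```

With these rules, accepts("http://museums.bankofamerica.com/mobile", RULES)
is true and accepts("http://www.bankofamerica.com/", RULES) is false. The
order matters in this model: a catch-all accept rule placed above the host
rule would let every URL through.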

This is from the Nutch wiki: '-depth depth indicates the link depth from the
root page that should be crawled.' A depth of 1 will still include outlinks.
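
As a rough illustration of link depth, here is a toy Python model of a
round-based crawl. It assumes each depth level is one fetch round, and that
outlinks found in a round are recorded as candidates for the next one. The
link graph and the crawl function are invented for the example and say
nothing about the real site or about Nutch internals.

```python
# Hypothetical link graph: page -> outlinks (made-up paths for illustration).
LINKS = {
    "/": ["/mobile", "/about"],
    "/mobile": ["/mobile/exhibits"],
    "/about": [],
    "/mobile/exhibits": [],
}

def crawl(seed, depth, links):
    """Run `depth` fetch rounds; return (fetched pages, recorded but unfetched pages)."""
    fetched = []
    frontier = [seed]  # pages due for fetching in the current round
    known = {seed}     # every URL seen so far, to avoid queueing duplicates
    for _ in range(depth):
        next_frontier = []
        for page in frontier:
            fetched.append(page)
            for out in links.get(page, []):
                if out not in known:
                    known.add(out)
                    next_frontier.append(out)
        frontier = next_frontier
    return fetched, frontier
```

In this model, crawl("/", 1, LINKS) fetches only "/" but still records
"/mobile" and "/about" as outlinks, and a depth of 2 would fetch those as
well. Depth alone does not keep a crawl on one host, which is why the URL
filter is the setting to use for that.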

Hope this helps.
Lex

On Thu, Feb 5, 2015 at 2:09 AM, Krishnanand, Kartik <
[email protected]> wrote:

> Hi,
>
> We are crawling http://museums.bankofamerica.com as a test for setting up
> Nutch. After the crawl is complete, we see the following entry in Solr:
>
> " You will need the current version of Flash to view this website
> properly."
>
> When we load http://museums.bankofamerica.com in the browser, I am
> redirected to the http://museums.bankofamerica.com/mobile website.
>
> We are not interested in crawling the outlinks, so we have set the crawl
> depth to 1. We just want to crawl the content of this webpage.
>
> Any help would be gratefully appreciated.
>
> Thanks,
>
> Kartik
>
> ----------------------------------------------------------------------
> This message, and any attachments, is for the intended recipient(s) only,
> may contain information that is privileged, confidential and/or proprietary
> and subject to important terms and conditions available at
> http://www.bankofamerica.com/emaildisclaimer.   If you are not the
> intended recipient, please delete this message.
>
