Re: Need to crawl the site that requires flash to be enabled

Lewis John Mcgibbney Sun, 08 Feb 2015 01:40:20 -0800

Hi Kartik and Alexis,

On Fri, Feb 6, 2015 at 5:19 AM, <[email protected]> wrote:


>
> The site you're trying to crawl is a Flash website. Unfortunately, that will
> be a problem for Nutch.
> Nutch doesn't render the page; it only fetches it. It won't load Flash, CSS or
> JS that are included in the page.
>

There is a patch which you guys can use:
https://issues.apache.org/jira/browse/NUTCH-1933
This will allow you to get all of the page content. It is a WIP and not
thoroughly tested, but it will achieve what you want right now. If you have
comments, please put them on the Jira ticket.


>
> To limit the crawl to the domain http://museums.bankofamerica.com, try
> using the regex-urlfilter plugin. I believe adding a new line with
> +^http://museums.bankofamerica.com
> to conf/regex-urlfilter.txt should limit it. The depth only controls how
> far to crawl from the root; as I understand it, this will include outlinks.
>

Correct. Just make sure to add a final -. line after it, so that nothing
from outside of the domain is fetched.
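
As a rough sketch only, the relevant tail of conf/regex-urlfilter.txt might
end up looking something like the fragment below. This assumes the stock file
ends with a catch-all +. line, which the -. line replaces, and that the
first matching rule decides whether a URL is kept. Escaping the dots in the
host name is my own optional tightening, not something from the thread.

```
# (keep the stock skip rules for file:, ftp:, mailto:, image suffixes, etc. above)

# accept URLs on the target host; must come before the catch-all below
+^http://museums\.bankofamerica\.com

# reject everything else (replaces the default "+." catch-all)
-.
```

The order matters here: if the -. line came first, it would match every URL
and nothing would be fetched.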


Lewis
