Hi Kartik and Alexis,

On Fri, Feb 6, 2015 at 5:19 AM, <[email protected]> wrote:
> The site you're trying to crawl is a Flash website. Unfortunately that will
> be a problem for Nutch.
> Nutch doesn't render the page, only fetches it. It won't load Flash, CSS or
> JS that are included in the page.

There is a patch which you guys can use:
https://issues.apache.org/jira/browse/NUTCH-1933
This will allow you to get all of the page content. It is a WIP and not
thoroughly tested, but it will achieve what you want right now. If you have
comments, please put them on the Jira ticket.

> To limit the crawls to the domain http://museums.bankofamerica.com try
> using the regex-urlfilter plugin. I believe setting a new line with +^
> http://museums.bankofamerica.com in the conf/regex-urlfilter.txt should
> limit it. The depth is only how many pages to crawl from root, this as I
> understand will include outlinks.

Correct. Just make sure to add a -. line after it, so that nothing from
outside of the domain is fetched.

Lewis
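
To illustrate why the order of the two rules matters, here is a minimal Python sketch that emulates how I understand the regex-urlfilter rules to be applied: the first pattern found in the URL decides, "+" accepts, "-" rejects, and a URL matching nothing is rejected. This is only an emulation for checking a rule set by hand, not Nutch code; the function names parse_rules and accepts are hypothetical, and escaping the dots in the host name is my own tightening of the rule quoted above.

```python
import re

# Rules as they might appear in conf/regex-urlfilter.txt (order matters).
# Assumption: dots are escaped here so they match literally.
RULES = r"""
# accept anything on the museums host
+^http://museums\.bankofamerica\.com
# reject everything else
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in ("http://museums.bankofamerica.com/arts",
                "http://www.example.com/"):
        print(url, "->", "fetch" if accepts(url, rules) else "skip")
```

If the -. line were placed before the + line, it would match every URL first and nothing at all would be fetched, which is why it goes last.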

