If you're behind a firewall, I think your best bet is one of two options.
Either open port 8983, run Nutch on a server that is open to the internet,
and have it index documents to Solr over that port, locked down so that it
only accepts the required HTTP headers. Or, better yet, if it's going to be
a static index and no one else will be writing to it, set up Solr on the
same server as Nutch, index locally, and then sftp the index to your
network behind the firewall. You can probably just copy the entire Solr
directory and it should work.
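
For the second option, here is a rough sketch of packing the whole Solr
directory into a single archive before you sftp it over. It only uses the
Python standard library. The function name pack_solr_dir and the paths in
the usage note are made up for illustration, and I am assuming Solr is
stopped while you copy so the index files are not being written to.

```python
import shutil
from pathlib import Path


def pack_solr_dir(solr_dir, out_dir):
    """Pack an entire Solr directory into a .tar.gz archive.

    pack_solr_dir is a hypothetical helper, not part of Nutch or Solr.
    Returns the path of the archive that was written.
    """
    solr_dir = Path(solr_dir)
    if not solr_dir.is_dir():
        raise FileNotFoundError(f"not a directory: {solr_dir}")
    # Archive name without extension; make_archive appends ".tar.gz".
    base_name = str(Path(out_dir) / solr_dir.name)
    # root_dir/base_dir keep the top-level directory name inside the archive,
    # so unpacking recreates the Solr directory as a whole.
    return shutil.make_archive(
        base_name,
        "gztar",
        root_dir=str(solr_dir.parent),
        base_dir=solr_dir.name,
    )
```

Something like pack_solr_dir("/opt/solr", "/tmp/transfer") (both paths are
placeholders) would give you one file to sftp, which you then unpack on
the machine behind the firewall.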


On Wed, Feb 12, 2014 at 5:44 PM, A Laxmi <[email protected]> wrote:

> Hi d_k! Yes, I am indexing them using Solr.
> Solr is also running on the same server on port 8983. I plan to perform the
> crawl every 30 days to update the old crawled data and to crawl any new
> sites.
>
>
> On Wed, Feb 12, 2014 at 1:37 AM, d_k <[email protected]> wrote:
>
> > What are you doing with the crawled data?
> > If you index it using solr then you can open the port solr is listening on
> > and run nutch on a server without a firewall and have it send the documents
> > to the solr behind your firewall using the port you opened.
> >
> > Is it a one time crawl? How often do you plan to perform the crawl?
> >
> >
> > On Wed, Feb 12, 2014 at 3:53 AM, A Laxmi <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I installed Nutch 2.2.1 on a server that is restricted by firewall to
> > > access internet. I tried to run my first crawl on that server, I started
> > > getting timedout errors and the crawl was getting hung. So, firewall was
> > > actually blocking my nutch crawler to crawl any site. I verified that with
> > > hosting admin and they mentioned firewall does block the crawler from
> > > crawling websites.
> > >
> > > I am not sure how I go about getting nutch to crawl websites in such a
> > > firewall restricted environment? Please suggest
> > >
> > > Thanks!
> > >
> >
>
