Re: Nutch 2.2.1 crawler cannot progress due to restriction from firewall

d_k Wed, 12 Feb 2014 09:01:26 -0800

I'm not sure how to set httpclient to work on a specific client port. You
can try the httpclient mailing list.
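
For reference, at the plain java.net level (not httpclient, and not a Nutch setting), fixing the client port comes down to binding the socket before connecting. A minimal sketch under that assumption; the class name, the helper connectFrom, and the port 40123 are made up for illustration, and whether httpclient or Nutch can be pointed at such a socket is a separate question:

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class FixedSourcePort {

    // Hypothetical helper: opens a TCP connection whose local (client) port
    // is chosen by the caller instead of being an ephemeral port.
    static Socket connectFrom(int localPort, InetAddress remoteHost,
                              int remotePort, int timeoutMs) throws IOException {
        Socket socket = new Socket();
        try {
            socket.setReuseAddress(true);
            // Bind to the wildcard address on the requested local port.
            socket.bind(new InetSocketAddress(localPort));
            socket.connect(new InetSocketAddress(remoteHost, remotePort), timeoutMs);
            return socket;
        } catch (IOException e) {
            socket.close();
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        int clientPort = 40123; // arbitrary example port, assumed to be free
        // A throwaway local listener stands in for the remote web server.
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = connectFrom(clientPort, loopback,
                                         server.getLocalPort(), 2000)) {
            System.out.println("local port: " + client.getLocalPort());
        }
    }
}
```

The firewall would still have to allow outbound traffic from that port, and a crawler that opens many connections in parallel would likely need a range of ports rather than a single one.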


If it's going to happen once a month, perhaps you can run it from your
desktop or rent a cheap server outside the production environment?

On Wed, Feb 12, 2014 at 6:26 PM, A Laxmi <[email protected]> wrote:

> thanks d_k for all your help! But it's a production environment and all the
> servers in that environment are restricted by the firewall. I am pretty
> sure that I will not find a server that is open to the internet for running
> nutch in that environment. :(
>
>
> On Wed, Feb 12, 2014 at 11:21 AM, d_k <[email protected]> wrote:
>
> > If you're behind a firewall, then I think your best bet would be one of
> > two things. Either open port 8983, run nutch on a server open to the
> > internet, and have it index documents to solr over the open port 8983,
> > which will only accept the required HTTP headers. Or better yet, if it's
> > going to be a static index and no one else will be writing to it, set up
> > solr on the same server as nutch, index locally, and then sftp the index
> > to your network behind the firewall. You can probably just copy the
> > entire solr directory and it should work.
> >
> >
> > On Wed, Feb 12, 2014 at 5:44 PM, A Laxmi <[email protected]> wrote:
> >
> > > Hi d_k! Yes, I am indexing them using Solr.
> > > Solr is also running on the same server on port 8983. I plan to perform
> > > the crawl every 30 days to update the old crawled data and to crawl any
> > > new sites.
> > >
> > >
> > > On Wed, Feb 12, 2014 at 1:37 AM, d_k <[email protected]> wrote:
> > >
> > > > What are you doing with the crawled data?
> > > > If you index it using solr then you can open the port solr is
> > > > listening on and run nutch on a server without a firewall and have it
> > > > send the documents to the solr behind your firewall using the port you
> > > > opened.
> > > >
> > > > Is it a one time crawl? How often do you plan to perform the crawl?
> > > >
> > > >
> > > > On Wed, Feb 12, 2014 at 3:53 AM, A Laxmi <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I installed Nutch 2.2.1 on a server whose internet access is
> > > > > restricted by a firewall. When I tried to run my first crawl on
> > > > > that server, I started getting timeout errors and the crawl hung.
> > > > > So the firewall was actually blocking my nutch crawler from
> > > > > crawling any site. I verified that with the hosting admin, and they
> > > > > mentioned that the firewall does block the crawler from crawling
> > > > > websites.
> > > > >
> > > > > I am not sure how to get nutch to crawl websites in such a
> > > > > firewall-restricted environment. Please suggest.
> > > > >
> > > > > Thanks!
> > > > >
> > > >
> > >
> >
>
