Thanks d_k! I will look into the options you suggested. Thanks a lot for your help!
On Wed, Feb 12, 2014 at 12:00 PM, d_k <[email protected]> wrote:

> I'm not sure how to set httpclient to work on a specific client port. You
> can try the httpclient mailing list.
>
> If it's going to happen once a month, perhaps you can run it from your
> desktop or rent a cheap server outside the production environment?
>
> On Wed, Feb 12, 2014 at 6:26 PM, A Laxmi <[email protected]> wrote:
>
> > Thanks d_k for all your help! But it's a production environment and all
> > the servers in that environment are restricted by the firewall. I am
> > pretty sure that I will not find a server that is open to the internet
> > for running nutch in that environment. :(
> >
> > On Wed, Feb 12, 2014 at 11:21 AM, d_k <[email protected]> wrote:
> >
> > > If you're behind a firewall then I think your best bet would be to
> > > either open port 8983 and run nutch on a server open to the internet
> > > and have it index documents to solr over the open port 8983 that will
> > > only accept the required HTTP headers or, better yet, if it's going to
> > > be a static index and no one else will be writing to it, set up solr
> > > on the same server as nutch, index locally and then sftp the index to
> > > your network behind the firewall. You can probably just copy the
> > > entire solr directory and it should work.
> > >
> > > On Wed, Feb 12, 2014 at 5:44 PM, A Laxmi <[email protected]> wrote:
> > >
> > > > Hi d_k! Yes, I am indexing them using Solr.
> > > > Solr is also running on the same server on port 8983. I plan to
> > > > perform the crawl every 30 days to update the old crawled data and
> > > > to crawl any new sites.
> > > >
> > > > On Wed, Feb 12, 2014 at 1:37 AM, d_k <[email protected]> wrote:
> > > >
> > > > > What are you doing with the crawled data?
> > > > > If you index it using solr then you can open the port solr is
> > > > > listening on and run nutch on a server without a firewall and
> > > > > have it send the documents to the solr behind your firewall using
> > > > > the port you opened.
> > > > >
> > > > > Is it a one-time crawl? How often do you plan to perform the crawl?
> > > > >
> > > > > On Wed, Feb 12, 2014 at 3:53 AM, A Laxmi <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I installed Nutch 2.2.1 on a server that is restricted by a
> > > > > > firewall from accessing the internet. When I tried to run my
> > > > > > first crawl on that server, I started getting timeout errors and
> > > > > > the crawl hung. So the firewall was actually blocking my nutch
> > > > > > crawler from crawling any site. I verified that with the hosting
> > > > > > admin, and they mentioned that the firewall does block the
> > > > > > crawler from crawling websites.
> > > > > >
> > > > > > I am not sure how to go about getting nutch to crawl websites
> > > > > > in such a firewall-restricted environment. Please suggest.
> > > > > >
> > > > > > Thanks!
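
A note on the timeouts described in the original message: whether outbound connections are blocked can be checked independently of Nutch with a small TCP probe. The following is only a minimal sketch, assuming Python 3 is available on the server. The function name `can_connect` and its default timeout are hypothetical and not part of Nutch or Solr.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration only; not part of Nutch or Solr.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS failures alike,
        # so a False result does not by itself prove a firewall is the cause.
        return False
```

Calling `can_connect("localhost", 8983)` on the server would be expected to succeed while Solr is listening on port 8983 as described in the thread, while a call against an external web host on port 80 would return False if the firewall blocks outbound traffic.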

