Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/GettingNutchRunningWithWindows

The comment on the change is:
Added some clarifications

------------------------------------------------------------------------------
  
  === Download ===
  
- [http://lucene.apache.org/nutch/release/ Download] the release and extract 
anywhere on your hard disk e.g. `c:\nutch-0.9`
+ [http://lucene.apache.org/nutch/release/ Download] the release and extract on 
your hard disk in a directory that ''does not'' contain a space in it (e.g., 
`c:\nutch-0.9`).  If the directory does contain a space (e.g., `c:\my 
programs\nutch-0.9`), the Nutch scripts will not work properly.
  
- Create an empty text file in your nutch directory e.g. `urls` and add the 
URLs of the sites you want to crawl.
+ Create an empty text file (use any name you wish) in your nutch directory 
(e.g., `urls`) and add the URLs of the sites you want to crawl.
  
- Add your URLs to the `crawl-urlfilter.txt` (e.g. 
`C:\nutch-0.9\conf\crawl-urlfilter.txt`). An entry could look like this:
+ Add your URLs to the `crawl-urlfilter.txt` (e.g., 
`C:\nutch-0.9\conf\crawl-urlfilter.txt`). An entry could look like this:
  {{{
  +^http://([a-z0-9]*\.)*apache.org/
  }}}
  
- Load up cygwin and naviagte to your nutch directory.  When cygwin launches 
you'll usually find yourself in your user folder (e.g. `C:\Documents and 
Settings\username`).
+ Load up cygwin and navigate to your `nutch` directory.  When cygwin launches, 
you'll usually find yourself in your user folder (e.g. `C:\Documents and 
Settings\username`).
  
- If your workstation needs to go through a windows authentication proxy to get 
to the internet then you can use an application such as the 
[http://sourceforge.net/projects/ntlmaps/ NTLM Authorization Proxy Server] to 
get through it.  You'll then need to edit the `nutch-site.xml` file to point to 
the port opened by the app.
+ If your workstation needs to go through a Windows Authentication Proxy to get 
to the Internet (this is not common), then you can use an application such as 
the [http://sourceforge.net/projects/ntlmaps/ NTLM Authorization Proxy Server] 
to get through it.  You'll then need to edit the `nutch-site.xml` file to point 
to the port opened by the app.
  
  == Intranet Crawling ==
  
@@ -48, +48 @@

  {{{
  bin/nutch crawl urls -dir crawl -depth 3 >& crawl.log
  }}}
- then a folder called crawl/ is created in your nutch directory, along with 
the crawl.log file.  Use this log file to debug any errors you might have.
+ then a folder called `crawl` is created in your `nutch` directory, along with 
the crawl.log file.  Use this log file to debug any errors you might have.
  
  You'll need to delete or move the crawl directory before starting the crawl 
off again unless you specify another path on the command above.
  
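The checks this change describes (an install path without spaces, a URL filter entry that matches the intended sites, and moving the old crawl directory aside before re-running) can be sketched as a few shell helpers for the cygwin prompt. This is a minimal sketch, not part of Nutch: the function names `path_has_no_space`, `url_matches_entry`, and `rotate_crawl_dir` are hypothetical, and `grep -E` only approximates the Java regular expressions that Nutch applies to `crawl-urlfilter.txt`.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; none of these names come from Nutch itself.

# Succeeds only when the given path contains no space character.
path_has_no_space() {
  case "$1" in
    *" "*) return 1 ;;
    *)     return 0 ;;
  esac
}

# Succeeds when URL $2 matches filter entry $1. The leading "+" of an entry
# is the include marker, so it is stripped before matching. grep -E is an
# assumption standing in for the Java regex engine used by Nutch.
url_matches_entry() {
  pattern=${1#+}
  printf '%s\n' "$2" | grep -Eq -- "$pattern"
}

# Moves an existing crawl directory aside so a new crawl can reuse the name.
# Prints the backup name; does nothing if the directory does not exist.
rotate_crawl_dir() {
  dir=$1
  [ -d "$dir" ] || return 0
  backup="$dir.$(date +%Y%m%d%H%M%S)"
  mv "$dir" "$backup" && printf '%s\n' "$backup"
}
```

For instance, `path_has_no_space 'c:\my programs\nutch-0.9'` fails, while `url_matches_entry '+^http://([a-z0-9]*\.)*apache.org/' 'http://lucene.apache.org/nutch/'` succeeds. Because the dot in `apache.org` is unescaped in the example entry, a URL such as `http://apacheXorg/` would match as well; writing `apache\.org` makes the entry stricter.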
