[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown

2009-04-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
For Vista, give cygwin administrative privileges

--
  
  Install cygwin and set the PATH environment variable for it. You can set it 
from the Control Panel, System, Advanced Tab, Environment Variables and 
edit/add PATH.
  
- I have in PATH like:
+ Example PATH:
  {{{
  C:\Sun\SDK\bin;C:\cygwin\bin
  }}}
  If you run bash from the Windows command line (Start > Run... > cmd.exe) it 
should successfully run cygwin.
  
- If you are running Eclipse on Vista, you will likely need to 
[http://www.mydigitallife.info/2006/12/19/turn-off-or-disable-user-account-control-uac-in-windows-vista/
 turn off Vista's User Access Control (UAC)]. Otherwise Hadoop will likely 
complain that it cannot change a directory permission when you later run the 
crawler:
+ If you are running Eclipse on Vista, you will need to either give cygwin 
administrative privileges or 
[http://www.mydigitallife.info/2006/12/19/turn-off-or-disable-user-account-control-uac-in-windows-vista/
 turn off Vista's User Account Control (UAC)]. Otherwise Hadoop will likely 
complain that it cannot change a directory permission when you later run the 
crawler:
  {{{
  org.apache.hadoop.util.Shell$ExitCodeException: chmod: changing permissions 
of ... Permission denied
  }}}


[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown

2009-04-16 Thread Apache Wiki

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
Removed install of whoami in Windows (cygwin's whoami is used)

--
  
  == Before you start ==
  
+ Setting up Nutch to run in Eclipse can be tricky, and most of the time it 
is much faster to edit Nutch in Eclipse but run the scripts from the 
command line (my 2 cents). However, it's very useful to be able to debug Nutch 
in Eclipse. Sometimes examining the logs (logs/hadoop.log) is a quicker way to 
debug a problem.
- Setting up Nutch to run into Eclipse can be tricky, and most of the time you 
are much faster if you edit Nutch in Eclipse but run the scripts from the 
command line (my 2 cents).
- However, it's very useful to be able to debug Nutch in Eclipse. But again you 
might be quicker by looking at the logs (logs/hadoop.log)...
  
  
  == Steps ==
@@ -34, +33 @@

  
  C:\Sun\SDK\bin;C:\cygwin\bin
  
- If you run bash in Start-RUN-cmd.exe it should work. 
+ If you run bash in Start > Run... > cmd.exe it should work. 
  
- 
- Then you should install tools from Microsoft website (adding 'whoami' 
command).
- 
- Example for Windows XP and sp2
- 
- 
http://www.microsoft.com/downloads/details.aspx?FamilyId=49AE8576-9BB9-4126-9761-BA8011FABF38&displaylang=en
- 
- 
- Then you can follow rest of these steps
  
  === Install Nutch ===
   * Grab a fresh release of Nutch 0.9 - 
http://lucene.apache.org/nutch/version_control.html


[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown

2009-04-16 Thread Apache Wiki

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
Add link to official release

--
  
  
  === Install Nutch ===
-  * Grab a fresh release of Nutch 0.9 - 
http://lucene.apache.org/nutch/version_control.html
+  * Grab a [http://lucene.apache.org/nutch/version_control.html fresh release] 
of Nutch 1.0 or download and untar the [http://lucene.apache.org/nutch/release/ 
official 1.0 release]. 
-  * Do not build Nutch now. Make sure you have no .project and .classpath 
files in the Nutch directory
+  * Do not build Nutch yet. Make sure you have no .project or .classpath 
files in the Nutch directory.
  
  
  === Create a new java project in Eclipse ===


[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown

2009-04-16 Thread Apache Wiki

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
Added fix for RTFParseFactory issues

--
  Copy the jar files into src/plugin/parse-mp3/lib and 
src/plugin/parse-rtf/lib/ respectively.
  Then add the jar files to the build path (first refresh the workspace by 
pressing F5, then right-click the project folder > Build Path > Configure Build 
Path... Then select the Libraries tab, click Add Jars... and add each 
.jar file individually).
  
+ === Two Errors with RTFParseFactory ===
+ 
+ If you are trying to build the official 1.0 release, Eclipse will complain 
about 2 errors regarding the RTFParseFactory (this is after adding the RTF jar 
file from the previous step).  This problem was fixed (see 
[http://issues.apache.org/jira/browse/NUTCH-644 NUTCH-644] and 
[http://issues.apache.org/jira/browse/NUTCH-705 NUTCH-705]) but was not 
included in the 1.0 official release because of licensing issues. So you will 
need to manually alter the code to remove these 2 build errors.
+ 
+ In RTFParseFactory.java:
+  1. Add the following import statement: {{{import 
org.apache.nutch.parse.ParseResult;}}}
+ 
+  2. Change 
+ 
+ {{{
+ public Parse getParse(Content content) {
+ }}}
+ to
+ {{{
+ public ParseResult getParse(Content content) {
+ }}}
+  1.#3 In the getParse function, replace
+ {{{
+ return new ParseStatus(ParseStatus.FAILED,
+                        ParseStatus.FAILED_EXCEPTION,
+                        e.toString()).getEmptyParse(conf);
+ }}}
+ with
+ {{{
+ return new ParseStatus(ParseStatus.FAILED,
+                        ParseStatus.FAILED_EXCEPTION,
+                        e.toString()).getEmptyParseResult(content.getUrl(), getConf());
+ }}}
+  1.#4 In the getParse function, replace
+ {{{
+ return new ParseImpl(text,
+                      new ParseData(ParseStatus.STATUS_SUCCESS,
+                                    title,
+                                    OutlinkExtractor.getOutlinks(text, this.conf),
+                                    content.getMetadata(),
+                                    metadata));
+ }}}
+ with
+ {{{
+ return ParseResult.createParseResult(content.getUrl(),
+     new ParseImpl(text,
+         new ParseData(ParseStatus.STATUS_SUCCESS,
+                       title,
+                       OutlinkExtractor.getOutlinks(text, this.conf),
+                       content.getMetadata(),
+                       metadata)));
+ }}}
+ 
+ In TestRTFParser.java, replace
+ {{{
+ parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content);
+ }}}
+ with
+ {{{
+ parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", 
content).get(urlString);
+ }}}
+ 
+ Once you have made these changes and saved the files, Eclipse should build 
with no errors.
  
  === Build Nutch ===
  If you setup the project correctly, Eclipse will build Nutch for you into 
tmp_build. See below for problems you could run into.


[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown

2009-04-16 Thread Apache Wiki

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
In case you forget to add cygwin to path

--
* add the hadoop project as a dependent project of nutch project 
* you can now also set breakpoints within Hadoop classes like InputFormat 
implementations, etc. 
  
+ 
+ === Failed to get the current user's information ===
+ 
+ On Windows, if the crawler throws an exception complaining that it "Failed to 
get the current user's information" or 'Login failed: Cannot run program 
"bash"', it is likely you forgot to set the PATH to point to cygwin.  Open a new 
command line window (All Programs > Accessories > Command Prompt) and type 
"bash".  This should start cygwin.  If it doesn't, type "path" to see your 
path. You should see the cygwin bin directory (e.g., C:\cygwin\bin) within the 
path. See the steps for adding this to your PATH at the top of the article 
under "For Windows Users".
+ 
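The manual check described above can also be done from Java. The following is only a minimal sketch: the class name `PathCheck` and the method `findOnPath` are hypothetical helpers invented for illustration and are not part of Nutch or Hadoop. It looks through each directory listed in PATH for a file named bash.exe or bash and reports the first match.

```java
import java.io.File;

// Hypothetical helper for illustration only; not part of Nutch or Hadoop.
class PathCheck {

    /**
     * Returns the first file in the given PATH-style string that matches one
     * of the given file names, or null if none of the directories contains one.
     */
    static File findOnPath(String pathValue, String... names) {
        if (pathValue == null) {
            return null;
        }
        // File.pathSeparator is ';' on Windows and ':' on Unix-like systems.
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : names) {
                File candidate = new File(dir, name);
                if (candidate.isFile()) {
                    return candidate;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        File bash = findOnPath(System.getenv("PATH"), "bash.exe", "bash");
        if (bash == null) {
            System.out.println("bash not found on PATH; "
                + "add the cygwin bin directory (e.g., C:\\cygwin\\bin)");
        } else {
            System.out.println("bash found at " + bash.getAbsolutePath());
        }
    }
}
```

Run it from Eclipse rather than from a separate shell, since Eclipse may have been started with a different PATH than a freshly opened command prompt.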
+ 
  Original credits: RenaudRichardet
  


[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown

2009-04-16 Thread Apache Wiki

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
Moved heap problem to location of other problems

--
   * if all works, you should see Nutch getting busy at crawling :-)
  
  
- == Java Heap Size problem ==
- 
- If you find in hadoop.log line similar to this:
- 
- {{{
- 2009-04-13 13:41:06,105 WARN  mapred.LocalJobRunner - job_local_0001
- java.lang.OutOfMemoryError: Java heap space
- }}}
- 
- You should increase amount of RAM for running applications from eclipse.
- 
- Just set it in:
- 
- Eclipse - Window - Preferences - Java - Installed JREs - edit - Default 
VM arguments
- 
- I've set mine to 
- {{{
- -Xms5m -Xmx150m 
- }}}
- because I have like 200MB RAM left after runnig all apps
- 
- -Xms (minimum ammount of RAM memory for running applications)
- -Xmx (maximum) 
- 
  == Debug Nutch in Eclipse (not yet tested for 0.9) ==
   * Set breakpoints and debug a crawl
   * It can be tricky to find out where to set the breakpoint, because of the 
Hadoop jobs. Here are a few good places to set breakpoints:
@@ -195, +171 @@

  == If things do not work... ==
  Yes, Nutch and Eclipse can be a difficult companionship sometimes ;-)
  
+ === Java Heap Size problem ===
+ 
+ If the crawler throws an IOException early in the crawl (Exception in thread 
"main" java.io.IOException: Job failed!), check the logs/hadoop.log file for 
further information. If you find lines similar to this in hadoop.log:
+ 
+ {{{
+ 2009-04-13 13:41:06,105 WARN  mapred.LocalJobRunner - job_local_0001
+ java.lang.OutOfMemoryError: Java heap space
+ }}}
+ 
+ then you should increase the amount of RAM available to applications run from 
Eclipse.
+ 
+ Set it in:
+ 
+ Eclipse -> Window -> Preferences -> Java -> Installed JREs -> Edit -> Default 
VM arguments
+ 
+ I've set mine to 
+ {{{
+ -Xms5m -Xmx150m 
+ }}}
+ because I have about 200MB of RAM left after running all my other apps.
+ 
+ -Xms sets the initial (minimum) heap size for running applications; 
+ -Xmx sets the maximum heap size.
+ 
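To confirm that the VM arguments are actually picked up by programs launched from Eclipse, a small check like the one below may help. It is only a sketch: `HeapCheck` and `toMegabytes` are hypothetical names made up for this example. It prints the maximum heap the running JVM reports via `Runtime.getRuntime().maxMemory()`; the reported value can be somewhat lower than the -Xmx setting, depending on the JVM.

```java
// Hypothetical helper for illustration only; not part of Nutch.
class HeapCheck {

    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes). */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() is the upper bound the JVM will try to use for the heap.
        long maxMb = toMegabytes(Runtime.getRuntime().maxMemory());
        System.out.println("Max heap available to this JVM: " + maxMb + " MB");
    }
}
```

If the printed value does not change after editing the default VM arguments, the run configuration is probably using a different JRE entry than the one that was edited.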
- === eclipse: Cannot create project content in workspace ===
+ === Eclipse: Cannot create project content in workspace ===
  The Nutch source code must be outside the workspace folder. My first attempt 
was to download the code with Eclipse (svn) into my workspace. When I tried to 
create the project from existing source, Eclipse would not let me do it with 
the source code inside the workspace. With the source code outside my 
workspace, it works fine.
  
  === plugin dir not found ===


[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown

2009-04-16 Thread Apache Wiki

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
UAC issue on Vista

--
  
  == Tested with ==
   * Nutch release 1.0
-  * Eclipse 3.3 - aka Europa, ganymede
+  * Eclipse 3.3 (Europa) and 3.4 (Ganymede)
   * Java 1.6
   * Ubuntu (should work on most platforms though)
-  * Windows XP
+  * Windows XP and Vista
  
  == Before you start ==
  
@@ -21, +21 @@

  
  === For Windows Users ===
  
- If you are running Windows (tested on Windows XP) you must first install 
cygwin
+ If you are running Windows (tested on Windows XP) you must first install 
cygwin. Download it from http://www.cygwin.com/setup.exe
  
+ Install cygwin and set the PATH environment variable for it. You can set it 
from the Control Panel, System, Advanced Tab, Environment Variables and 
edit/add PATH.
- Download cygwin from http://www.cygwin.com/setup.exe
- 
- Install cygwin and set PATH variable for it.
- 
- It's in control panel, system, advanced tab, environment variables and 
edit/add PATH
  
  I have in PATH like:
- 
+ {{{
  C:\Sun\SDK\bin;C:\cygwin\bin
+ }}}
+ If you run bash from the Windows command line (Start > Run... > cmd.exe) it 
should successfully run cygwin.
  
- If you run bash in Start  Run...  cmd.exe it should work. 
+ If you are running Eclipse on Vista, you will likely need to 
[http://www.mydigitallife.info/2006/12/19/turn-off-or-disable-user-account-control-uac-in-windows-vista/
 turn off Vista's User Account Control (UAC)]. Otherwise Hadoop will likely 
complain that it cannot change a directory permission when you later run the 
crawler:
+ {{{
+ org.apache.hadoop.util.Shell$ExitCodeException: chmod: changing permissions 
of ... Permission denied
+ }}}
  
+ See 
[http://markmail.org/message/ymgygimtvuksn2ic#query:Exception%20in%20thread%20main%20org.apache.hadoop.util.Shell%24ExitCodeException%3A%20chmod%3A%20changing%20permissions+page:1+mid:pj3spjhvdtjx736q+state:results
 this] for more information about the UAC issue.
  
  === Install Nutch ===
   * Grab a [http://lucene.apache.org/nutch/version_control.html fresh release] 
of Nutch 1.0 or download and untar the [http://lucene.apache.org/nutch/release/ 
official 1.0 release].