[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
For Vista, give cygwin administrative privileges

--------
  Install cygwin and set the PATH environment variable for it. You can set it from the Control Panel, System, Advanced Tab, Environment Variables and edit/add PATH.
- I have in PATH like:
+ Example PATH:
  {{{
  C:\Sun\SDK\bin;C:\cygwin\bin
  }}}
  If you run bash from the Windows command line (Start > Run... > cmd.exe) it should successfully run cygwin.
- If you are running Eclipse on Vista, you will likely need to [http://www.mydigitallife.info/2006/12/19/turn-off-or-disable-user-account-control-uac-in-windows-vista/ turn off Vista's User Access Control (UAC)]. Otherwise Hadoop will likely complain that it cannot change a directory permission when you later run the crawler:
+ If you are running Eclipse on Vista, you will need to either give cygwin administrative privileges or [http://www.mydigitallife.info/2006/12/19/turn-off-or-disable-user-account-control-uac-in-windows-vista/ turn off Vista's User Access Control (UAC)]. Otherwise Hadoop will likely complain that it cannot change a directory permission when you later run the crawler:
  {{{
  org.apache.hadoop.util.Shell$ExitCodeException: chmod: changing permissions of ... Permission denied
  }}}
[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown
The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
Removed install of whoami in Windows (cygwin's whoami is used)

--------
  == Before you start ==
+ Setting up Nutch to run into Eclipse can be tricky, and most of the time it is much faster if you edit Nutch in Eclipse but run the scripts from the command line (my 2 cents). However, it's very useful to be able to debug Nutch in Eclipse. Sometimes examining the logs (logs/hadoop.log) is quicker to debug a problem.
- Setting up Nutch to run into Eclipse can be tricky, and most of the time you are much faster if you edit Nutch in Eclipse but run the scripts from the command line (my 2 cents).
- However, it's very useful to be able to debug Nutch in Eclipse. But again you might be quicker by looking at the logs (logs/hadoop.log)...
  == Steps ==
@@ -34, +33 @@
  C:\Sun\SDK\bin;C:\cygwin\bin
- If you run bash in Start-RUN-cmd.exe it should work.
+ If you run bash in Start > Run... > cmd.exe it should work.
-
- Then you should install tools from Microsoft website (adding 'whoami' command).
-
- Example for Windows XP and sp2
-
- http://www.microsoft.com/downloads/details.aspx?FamilyId=49AE8576-9BB9-4126-9761-BA8011FABF38&displaylang=en
-
-
- Then you can follow rest of these steps
  === Install Nutch ===
  * Grab a fresh release of Nutch 0.9 - http://lucene.apache.org/nutch/version_control.html
[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown
The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
Add link to official release

--------
  === Install Nutch ===
- * Grab a fresh release of Nutch 0.9 - http://lucene.apache.org/nutch/version_control.html
+ * Grab a [http://lucene.apache.org/nutch/version_control.html fresh release] of Nutch 1.0 or download and untar the [http://lucene.apache.org/nutch/release/ official 1.0 release].
- * Do not build Nutch now. Make sure you have no .project and .classpath files in the Nutch directory
+ * Do not build Nutch yet. Make sure you have no .project and .classpath files in the Nutch directory
  === Create a new java project in Eclipse ===
[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown
The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
Added fix for RTFParseFactory issues

--------
  Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively. Then add the jar files to the build path (First refresh the workspace by pressing F5. Then right-click the project folder > Build Path > Configure Build Path... Then select the Libraries tab, click Add Jars... and then add each .jar file individually).
+ === Two Errors with RTFParseFactory ===
+
+ If you are trying to build the official 1.0 release, Eclipse will complain about 2 errors regarding the RTFParseFactory (this is after adding the RTF jar file from the previous step). This problem was fixed (see [http://issues.apache.org/jira/browse/NUTCH-644 NUTCH-644] and [http://issues.apache.org/jira/browse/NUTCH-705 NUTCH-705]) but was not included in the 1.0 official release because of licensing issues. So you will need to manually alter the code to remove these 2 build errors.
+
+ In RTFParseFactory.java:
+  1. Add the following import statement: {{{import org.apache.nutch.parse.ParseResult;}}}
+
+  2. Change
+
+ {{{
+ public Parse getParse(Content content) {
+ }}}
+ to
+ {{{
+ public ParseResult getParse(Content content) {
+ }}}
+  1.#3 In the getParse function, replace
+ {{{
+ return new ParseStatus(ParseStatus.FAILED,
+                        ParseStatus.FAILED_EXCEPTION,
+                        e.toString()).getEmptyParse(conf);
+ }}}
+ with
+ {{{
+ return new ParseStatus(ParseStatus.FAILED,
+                        ParseStatus.FAILED_EXCEPTION,
+                        e.toString()).getEmptyParseResult(content.getUrl(), getConf());
+ }}}
+  1.#4 In the getParse function, replace
+ {{{
+ return new ParseImpl(text,
+                      new ParseData(ParseStatus.STATUS_SUCCESS,
+                                    title,
+                                    OutlinkExtractor.getOutlinks(text, this.conf),
+                                    content.getMetadata(),
+                                    metadata));
+ }}}
+ with
+ {{{
+ return ParseResult.createParseResult(content.getUrl(),
+                                      new ParseImpl(text,
+                                          new ParseData(ParseStatus.STATUS_SUCCESS,
+                                                        title,
+                                                        OutlinkExtractor.getOutlinks(text, this.conf),
+                                                        content.getMetadata(),
+                                                        metadata)));
+
+ }}}
+
+ In TestRTFParser.java, replace
+ {{{
+ parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content);
+ }}}
+ with
+ {{{
+ parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content).get(urlString);
+ }}}
+
+ Once you have made these changes and saved the files, Eclipse should build with no errors.
  === Build Nutch ===
  If you setup the project correctly, Eclipse will build Nutch for you into tmp_build. See below for problems you could run into.
[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown
The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
In case you forget to add cygwin to path

--------
  * add the hadoop project as a dependent project of nutch project
  * you can now also set break points within hadoop classes lik inputformat implementations etc.
+
+ === Failed to get the current user's information ===
+
+ On Windows, if the crawler throws an exception complaining it "Failed to get the current user's information" or 'Login failed: Cannot run program "bash"', it is likely you forgot to set the PATH to point to cygwin. Open a new command line window (All Programs > Accessories > Command Prompt) and type bash. This should start cygwin. If it doesn't, type path to see your path. You should see within the path the cygwin bin directory (e.g., C:\cygwin\bin). See the steps to adding this to your PATH at the top of the article under For Windows Users.
+
+ Original credits: RenaudRichardet
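The PATH check described in this change (open a command prompt, type bash, inspect the path) can also be run from inside Eclipse, which matters because Eclipse only sees the PATH it was started with. The following is a minimal sketch, not part of Nutch or Hadoop: the class name PathCheck and the method findOnPath are hypothetical names introduced here for illustration. It scans each PATH entry for a file named bash or bash.exe and reports what it finds.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Nutch): looks for bash on the PATH seen by this JVM. */
public class PathCheck {

    /** Returns every file named like one of fileNames found in a directory of pathValue. */
    public static List<File> findOnPath(String pathValue, String... fileNames) {
        List<File> hits = new ArrayList<File>();
        if (pathValue == null) {
            return hits;
        }
        // File.pathSeparator is ";" on Windows and ":" on Unix-like systems.
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : fileNames) {
                File candidate = new File(dir, name);
                if (candidate.isFile()) {
                    hits.add(candidate);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String path = System.getenv("PATH");
        List<File> hits = findOnPath(path, "bash", "bash.exe");
        if (hits.isEmpty()) {
            System.out.println("bash not found on PATH: " + path);
        } else {
            for (File hit : hits) {
                System.out.println("found " + hit.getAbsolutePath());
            }
        }
    }
}
```

Run it as a Java application from the same Eclipse workspace as the crawler. If it reports that bash is missing while a fresh command prompt finds it, Eclipse was probably started before PATH was changed and needs a restart.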
[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown
The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
Moved heap problem to location of other problems

--------
  * if all works, you should see Nutch getting busy at crawling :-)
- == Java Heap Size problem ==
-
- If you find in hadoop.log line similar to this:
-
- {{{
- 2009-04-13 13:41:06,105 WARN mapred.LocalJobRunner - job_local_0001
- java.lang.OutOfMemoryError: Java heap space
- }}}
-
- You should increase amount of RAM for running applications from eclipse.
-
- Just set it in:
-
- Eclipse -> Window -> Preferences -> Java -> Installed JREs -> edit -> Default VM arguments
-
- I've set mine to
- {{{
- -Xms5m -Xmx150m
- }}}
- because I have like 200MB RAM left after runnig all apps
-
- -Xms (minimum ammount of RAM memory for running applications)
- -Xmx (maximum)
-
  == Debug Nutch in Eclipse (not yet tested for 0.9) ==
  * Set breakpoints and debug a crawl
  * It can be tricky to find out where to set the breakpoint, because of the Hadoop jobs. Here are a few good places to set breakpoints:
@@ -195, +171 @@
  == If things do not work... ==
  Yes, Nutch and Eclipse can be a difficult companionship sometimes ;-)
+ === Java Heap Size problem ===
+
+ If the crawler throws an IOException exception early in the crawl (Exception in thread "main" java.io.IOException: Job failed!), check the logs/hadoop.log file for further information. If you find in hadoop.log lines similar to this:
+
+ {{{
+ 2009-04-13 13:41:06,105 WARN mapred.LocalJobRunner - job_local_0001
+ java.lang.OutOfMemoryError: Java heap space
+ }}}
+
+ then you should increase amount of RAM for running applications from Eclipse.
+
+ Just set it in:
+
+ Eclipse -> Window -> Preferences -> Java -> Installed JREs -> edit -> Default VM arguments
+
+ I've set mine to
+ {{{
+ -Xms5m -Xmx150m
+ }}}
+ because I have like 200MB RAM left after running all apps
+
+ -Xms (minimum ammount of RAM memory for running applications)
+ -Xmx (maximum)
+
- === eclipse: Cannot create project content in workspace ===
+ === Eclipse: Cannot create project content in workspace ===
  The nutch source code must be out of the workspace folder. My first attempt was download the code with eclipse (svn) under my workspace. When I try to create the project using existing code, eclipse don't let me do it from source code into the workspace. I use the source code out of my workspace and it work fine.
  === plugin dir not found ===
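For the Java heap size problem covered in this change, it can help to confirm that the Default VM arguments actually reach the JVM that Eclipse launches. The following is a minimal sketch under that assumption. The class name HeapCheck and the method toMegabytes are hypothetical names introduced here, not part of Nutch; the code uses only the standard Runtime and java.lang.management classes.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical helper (not part of Nutch): prints the heap limits of the running JVM. */
public class HeapCheck {

    /** Converts a byte count to whole megabytes (rounded down). */
    public static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects -Xmx; the JVM may report slightly less than the value given.
        System.out.println("Max heap:     " + toMegabytes(rt.maxMemory()) + " MB");
        // totalMemory() is the heap currently reserved; it starts near -Xms and grows on demand.
        System.out.println("Current heap: " + toMegabytes(rt.totalMemory()) + " MB");

        // The arguments the JVM was started with, e.g. -Xms5m -Xmx150m if they were applied.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM arguments: " + jvmArgs);
    }
}
```

Run it as a Java application in the same project as the crawler. If the -Xms and -Xmx values do not appear in the printed JVM arguments, the setting was likely made on a different installed JRE than the one the launch configuration uses.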
[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown
The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
UAC issue on Vista

--------
  == Tested with ==
  * Nutch release 1.0
- * Eclipse 3.3 - aka Europa, ganymede
+ * Eclipse 3.3 (Europa) and 3.4 (Ganymede)
  * Java 1.6
  * Ubuntu (should work on most platforms though)
- * Windows XP
+ * Windows XP and Vista
  == Before you start ==
@@ -21, +21 @@
  === For Windows Users ===
- If you are running Windows (tested on Windows XP) you must first install cygwin
+ If you are running Windows (tested on Windows XP) you must first install cygwin. Download it from http://www.cygwin.com/setup.exe
+ Install cygwin and set the PATH environment variable for it. You can set it from the Control Panel, System, Advanced Tab, Environment Variables and edit/add PATH.
- Download cygwin from http://www.cygwin.com/setup.exe
-
- Install cygwin and set PATH variable for it.
-
- It's in control panel, system, advanced tab, environment variables and edit/add PATH
  I have in PATH like:
-
+ {{{
  C:\Sun\SDK\bin;C:\cygwin\bin
+ }}}
+ If you run bash from the Windows command line (Start > Run... > cmd.exe) it should successfully run cygwin.
- If you run bash in Start > Run... > cmd.exe it should work.
+ If you are running Eclipse on Vista, you will likely need to [http://www.mydigitallife.info/2006/12/19/turn-off-or-disable-user-account-control-uac-in-windows-vista/ turn off Vista's User Access Control (UAC)]. Otherwise Hadoop will likely complain that it cannot change a directory permission when you later run the crawler:
+ {{{
+ org.apache.hadoop.util.Shell$ExitCodeException: chmod: changing permissions of ... Permission denied
+ }}}
+ See [http://markmail.org/message/ymgygimtvuksn2ic#query:Exception%20in%20thread%20main%20org.apache.hadoop.util.Shell%24ExitCodeException%3A%20chmod%3A%20changing%20permissions+page:1+mid:pj3spjhvdtjx736q+state:results this] for more information about the UAC issue.
  === Install Nutch ===
  * Grab a [http://lucene.apache.org/nutch/version_control.html fresh release] of Nutch 1.0 or download and untar the [http://lucene.apache.org/nutch/release/ official 1.0 release].