Re: login failed exception

2009-04-14 Thread Bartosz Gadzimski

Hello Frank,

Yes, it is a memory issue; you must increase the Java heap size.

Just follow these instructions (another thing to add to the wiki ;)

Eclipse - Window - Preferences - Java - Installed JREs - edit - Default VM arguments

I've set mine to -Xms5m -Xmx150m because I have only about 200 MB of RAM left
after running all my apps.


-Xms (initial, i.e. minimum, amount of heap memory for the running application)
-Xmx (maximum amount of heap memory)
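
To check that the new VM arguments really reached the JVM, a tiny standalone class like the one below can be run as a Java Application from the same Eclipse workspace. This is only a sketch: the class name HeapCheck and its helper toMiB are made up for illustration and are not part of Nutch or Hadoop. It relies only on the standard java.lang.Runtime methods.

```java
// Hypothetical helper, not part of Nutch: prints the heap limits this JVM was started with.
class HeapCheck {

    /** Converts a byte count to whole mebibytes (rounds down). */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the upper bound the heap may grow to (driven by -Xmx).
        System.out.println("Max heap:     " + toMiB(rt.maxMemory()) + " MiB");
        // totalMemory() is the heap currently reserved (starts near -Xms and grows on demand).
        System.out.println("Current heap: " + toMiB(rt.totalMemory()) + " MiB");
        // freeMemory() is the unused part of the currently reserved heap.
        System.out.println("Free in heap: " + toMiB(rt.freeMemory()) + " MiB");
    }
}
```

With -Xmx150m the first line should report roughly 150 MiB; the exact figure can be slightly lower depending on the JVM. If it still shows the old default, the launch is probably using a different JRE entry than the one that was edited.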

It should help.

Thanks,
Bartosz

Frank McCown wrote:

Hello Bartosz,

I'm running the default Nutch 1.0 version on Windows XP (2 GB RAM)
with Eclipse 3.3.0.  I followed the directions at

http://wiki.apache.org/nutch/RunNutchInEclipse0.9

exactly as stated.  I'm able to run the default Nutch 0.9 release
without any problems in Eclipse.  But when I run 1.0, I always get the
java.io.IOException as stated in my last email.  I had assumed it was
due to the plugin issue, but maybe not.  I'm just running a very small
crawl with two seed URLs.

Here's what hadoop.log says:

2009-04-13 13:41:03,010 INFO  crawl.Crawl - crawl started in: crawl
2009-04-13 13:41:03,025 INFO  crawl.Crawl - rootUrlDir = urls
2009-04-13 13:41:03,025 INFO  crawl.Crawl - threads = 10
2009-04-13 13:41:03,025 INFO  crawl.Crawl - depth = 3
2009-04-13 13:41:03,025 INFO  crawl.Crawl - topN = 5
2009-04-13 13:41:03,479 INFO  crawl.Injector - Injector: starting
2009-04-13 13:41:03,479 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
2009-04-13 13:41:03,479 INFO  crawl.Injector - Injector: urlDir: urls
2009-04-13 13:41:03,479 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2009-04-13 13:41:03,588 WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-04-13 13:41:06,105 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:498)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)


I have not tried Sanjoy's advice yet... it looks like this is a memory issue.

Any advice would be much appreciated,
Frank


2009/4/10 Bartosz Gadzimski bartek...@o2.pl:
  

Hello Frank,

Please look into hadoop.log; maybe there is something more there.

About your error - you must give us more specifics about your Nutch
configuration.

The default Nutch installation works with no problems (I've never changed
the src/plugin path).

Please tell us: version of Nutch
any changes
different configurations (other than crawl-urlfilter - adding your
domain).

Thanks,
Bartosz

Frank McCown wrote:


Adding cygwin to my PATH solved my problem with whoami.  But now I'm
getting an exception when running the crawler:

Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
   at org.apache.nutch.crawl.Injector.inject(Injector.java:160)
   at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)

I know from searching the mailing list that this is normally due to a
bad plugin.folders setting in the nutch-default.xml, but I used the
same value as the tutorial (./src/plugin) to no avail.

(As an aside, seems like Hadoop should provide a better error message
if the plugin folder doesn't exist.)

Anyway, thanks, Bartosz, for your help.

Frank


2009/4/10 Bartosz Gadzimski bartek...@o2.pl:

  

Hello,

So now you have to install cygwin and be sure that you add it to PATH.

It's described in http://wiki.apache.org/nutch/RunNutchInEclipse0.9

After this you should be able to run the bash command from the command prompt
(Menu Start > RUN > cmd.exe)

Then you're done - everything will be working.

I must add it to the wiki; I forgot about the whoami problem.

Take care,
Bartosz

sanjoy.gh...@thomsonreuters.com wrote:



Thanks for the suggestion, Bartosz.  I downloaded whoami, and it promptly
crashed on bash.

09/04/10 12:02:28 WARN fs.FileSystem: uri=file:///
javax.security.auth.login.LoginException: Login failed: Cannot run program "bash": CreateProcess error=2, The system cannot find the file specified
  at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
  at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
  at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
  at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
  at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
  at org.apache.nutch.crawl.Crawl.main(Crawl.java:84)

Where am I going to find bash on 

[Nutch Wiki] Update of RunNutchInEclipse0.9 by BartoszGadzimski

2009-04-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by BartoszGadzimski:
http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9

The comment on the change is:
Added java heap size solution

--
- = RunNutchInEclipse =
+ = Run Nutch In Eclipse on Linux and Windows nutch version 0.9=
  
  This is a work in progress. If you find errors or would like to improve this 
page, just create an account [UserPreferences] and start editing this page :-)
  
@@ -104, +104 @@

   * click on Run
   * if all works, you should see Nutch getting busy at crawling :-)
  
- == Debug Nutch in Eclipse (not yet tested for 0.9) ==
+ == Java Heap Size problem ==
+ 
+ If you find a line similar to this in hadoop.log:
+ 
+ {{{
+ 2009-04-13 13:41:06,105 WARN  mapred.LocalJobRunner - job_local_0001
+ java.lang.OutOfMemoryError: Java heap space
+ }}}
+ 
+ You should increase the amount of RAM available to applications run from Eclipse.
+ 
+ Just set it in:
+ 
+ Eclipse - Window - Preferences - Java - Installed JREs - edit - Default VM arguments
+ 
+ I've set mine to 
+ {{{
+ -Xms5m -Xmx150m 
+ }}}
+ because I have only about 200 MB of RAM left after running all my apps.
+ 
+ -Xms (initial, i.e. minimum, amount of heap memory for the running application)
+ -Xmx (maximum amount of heap memory)
+ 
+ 
+ == Debug Nutch in Eclipse  ==
   * Set breakpoints and debug a crawl
   * It can be tricky to find out where to set the breakpoint, because of the 
Hadoop jobs. Here are a few good places to set breakpoints:
  {{{


[Nutch Wiki] Update of RunNutchInEclipse1.0 by BartoszGadzimski

2009-04-14 Thread Apache Wiki

The following page has been changed by BartoszGadzimski:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
Copied page for 1.0 release

New page:
= Run Nutch In Eclipse on Linux and Windows nutch version 1.0=

This is a work in progress. If you find errors or would like to improve this 
page, just create an account [UserPreferences] and start editing this page :-)

== Tested with ==
 * Nutch release 1.0
 * Eclipse 3.3 (Europa) and 3.4 (Ganymede)
 * Java 1.6
 * Ubuntu (should work on most platforms though)
 * Windows XP

== Before you start ==

Setting up Nutch to run in Eclipse can be tricky, and most of the time you 
are much faster if you edit Nutch in Eclipse but run the scripts from the 
command line (my 2 cents).
However, it's very useful to be able to debug Nutch in Eclipse. But again, you 
might be quicker by looking at the logs (logs/hadoop.log)...


== Steps ==


=== For Windows Users ===

If you are running Windows (tested on Windows XP), you must first install cygwin.

Download cygwin from http://www.cygwin.com/setup.exe

Install cygwin and add its bin directory to the PATH variable.

You can edit PATH under Control Panel, System, Advanced tab, Environment Variables.

My PATH looks like this:

C:\Sun\SDK\bin;C:\cygwin\bin

If you run bash from Start > Run > cmd.exe, it should work.


Then you should install the tools from the Microsoft website that add the 'whoami' command.

Example for Windows XP SP2:

http://www.microsoft.com/downloads/details.aspx?FamilyId=49AE8576-9BB9-4126-9761-BA8011FABF38&displaylang=en


Then you can follow the rest of these steps.
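
Because Eclipse only sees the PATH it was started with, it can be worth confirming from inside Java that both bash and whoami are reachable. The class below is a minimal sketch for that purpose; PathCheck and findOnPath are hypothetical names invented for this example, not part of Nutch or Hadoop. It simply walks the PATH entries and looks for each tool, with and without the .exe suffix.

```java
import java.io.File;

// Hypothetical helper, not part of Nutch: reports whether bash and whoami are on PATH.
class PathCheck {

    /**
     * Returns the first existing file among the candidate names in the
     * directories of the given PATH string, or null if none is found.
     */
    static File findOnPath(String pathValue, String... names) {
        if (pathValue == null) {
            return null;
        }
        // File.pathSeparator is ";" on Windows and ":" on Linux.
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : names) {
                File candidate = new File(dir, name);
                if (candidate.isFile()) {
                    return candidate;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String path = System.getenv("PATH");
        for (String tool : new String[] {"bash", "whoami"}) {
            File hit = findOnPath(path, tool, tool + ".exe");
            System.out.println(tool + ": " + (hit != null ? hit.getPath() : "NOT FOUND on PATH"));
        }
    }
}
```

If a tool is reported as not found even though it works in cmd.exe, restarting Eclipse after changing PATH is the first thing to try.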

=== Install Nutch ===
 * Grab a fresh release of Nutch 1.0 - 
http://lucene.apache.org/nutch/version_control.html
 * Do not build Nutch now. Make sure you have no .project and .classpath files 
in the Nutch directory


=== Create a new java project in Eclipse ===
 * File > New > Project > Java project > click Next
 * Name the project (Nutch_Trunk for instance)
 * Select Create project from existing source and use the location where you 
downloaded Nutch
 * Click on Next, and wait while Eclipse is scanning the folders
 * Add the folder conf to the classpath (third tab and then add class folder) 
 * Go to the Order and Export tab, find the entry for the added conf folder and 
move it to the top. This is required so that Eclipse takes the config resources 
(nutch-default.xml, nutch-final.xml, etc.) from our conf folder and not from 
anywhere else.
 * Eclipse should have guessed all the java files that must be added on your 
classpath. If it's not the case, add src/java, src/test and all plugin 
src/java and src/test folders to your source folders. Also add all jars in 
lib and in the plugin lib folders to your libraries 
 * Set output dir to tmp_build, create it if necessary
 * DO NOT add build to classpath


=== Configure Nutch ===
 * See the [http://wiki.apache.org/nutch/NutchTutorial Tutorial]
 * Change the property plugin.folders to ./src/plugin in 
$NUTCH_HOME/conf/nutch-default.xml 
 * Make sure Nutch is configured correctly before testing it in Eclipse ;-)
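
A relative plugin.folders value is easy to get wrong, so a quick check before the first launch may save some time. The sketch below assumes that the relative path ends up being resolved against the working directory of the launch (in Eclipse, the project root by default); that is an assumption about the setup, not something taken from the Nutch sources. PluginFolderCheck and countPluginDirs are hypothetical names used only for this example.

```java
import java.io.File;

// Hypothetical helper, not part of Nutch: checks that a plugin.folders value points at a real directory.
class PluginFolderCheck {

    /** Counts the immediate subdirectories of folder; returns -1 if folder is not a readable directory. */
    static int countPluginDirs(File folder) {
        File[] children = folder.listFiles(); // null when folder is missing or not a directory
        if (children == null) {
            return -1;
        }
        int count = 0;
        for (File child : children) {
            if (child.isDirectory()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Default mirrors the value used on this page; pass another path as the first argument.
        String value = args.length > 0 ? args[0] : "./src/plugin";
        File folder = new File(value);
        System.out.println("Working directory: " + new File("").getAbsolutePath());
        System.out.println("Resolved path:     " + folder.getAbsolutePath());
        int count = countPluginDirs(folder);
        if (count < 0) {
            System.out.println("Not a directory; check plugin.folders.");
        } else {
            System.out.println("Found " + count + " subdirectories (candidate plugins).");
        }
    }
}
```

Run it with the same working directory as the crawl launcher; a result of "Not a directory" or zero subdirectories suggests the property is pointing at the wrong place.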

=== Missing org.farng and com.etranslate ===
Eclipse will complain about some import statements in parse-mp3 and parse-rtf 
plugins (30 errors in my case).
Because of incompatibility with the Apache license, the .jar files that define 
the necessary classes were not included with the source code. 

Download them here:

http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/

http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/

Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ 
respectively.
Then add the jar files to the build path (First refresh the workspace by 
pressing F5. Then right-click the project folder > Build Path > Configure Build 
Path... Then select the Libraries tab, click Add Jars... and then add each 
.jar file individually).


=== Build Nutch ===
If you set up the project correctly, Eclipse will build Nutch for you into 
tmp_build. See below for problems you could run into.



=== Create Eclipse launcher ===
 * Menu Run > Run...
 * create New for Java Application
 * set in Main class
{{{
org.apache.nutch.crawl.Crawl
}}}
 * on tab Arguments, Program Arguments
{{{
urls -dir crawl -depth 3 -topN 50
}}}
 * in VM arguments
{{{
-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
}}}
 * click on Run
 * if all works, you should see Nutch getting busy at crawling :-)


== Java Heap Size problem ==

If you find a line similar to this in hadoop.log:

{{{
2009-04-13 13:41:06,105 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.OutOfMemoryError: Java heap space
}}}

You should increase the amount of RAM available to applications run from Eclipse.

Just set it in:

Eclipse - Window - Preferences - Java - Installed JREs - edit - Default VM arguments

I've set mine to 
{{{
-Xms5m -Xmx150m 
}}}
because I have only about 200 MB of RAM left after running all my apps.

-Xms 

[Nutch Wiki] Trivial Update of RunNutchInEclipse0.9 by BartoszGadzimski

2009-04-14 Thread Apache Wiki

The following page has been changed by BartoszGadzimski:
http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9

--
- = Run Nutch In Eclipse on Linux and Windows nutch version 0.9=
+ = Run Nutch In Eclipse on Linux and Windows nutch version 0.9 =
  
  This is a work in progress. If you find errors or would like to improve this 
page, just create an account [UserPreferences] and start editing this page :-)
  


[Nutch Wiki] Trivial Update of RunNutchInEclipse1.0 by BartoszGadzimski

2009-04-14 Thread Apache Wiki

The following page has been changed by BartoszGadzimski:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

--
- = Run Nutch In Eclipse on Linux and Windows nutch version 1.0=
+ = Run Nutch In Eclipse on Linux and Windows nutch version 1.0 =
  
  This is a work in progress. If you find errors or would like to improve this 
page, just create an account [UserPreferences] and start editing this page :-)
  


[Nutch Wiki] Update of FrontPage by BartoszGadzimski

2009-04-14 Thread Apache Wiki

The following page has been changed by BartoszGadzimski:
http://wiki.apache.org/nutch/FrontPage

--
   * UpgradeFrom07To08
   * [Upgrading_from_0.8.x_to_0.9]
   * RunNutchInEclipse for v0.8
-  * [RunNutchInEclipse0.9] for v0.9
+  * [RunNutchInEclipse0.9] for v0.9 (Linux and Windows)
+  * [RunNutchInEclipse1.0] for v1.0 (Linux and Windows)
   * [Crawl] - script to crawl (and possible recrawl too)
   * IntranetRecrawl - script to recrawl a crawl
   * MergeCrawl - script to merge 2 (or more) crawls