Running the Crawl without using bin/nutch inside a Scala program
Hi,

I am trying to run the crawl inside a Scala program without using the bin/nutch command. I am setting all the environment variables that bin/nutch sets when the crawl is run through it, and then calling Crawl.main(params). I am getting the following error:

    Exception in thread "main" java.io.IOException: Job failed!
            at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
            at org.apache.nutch.crawl.Injector.inject(Injector.java:160)
            at org.apache.nutch.crawl.Crawl.main(Crawl.java:113)

Here is the code I am trying to write:

    import scala.io.Source

    for (line <- Source.fromFile("/root/classpaths.sh").getLines) {
      if (line != null) {
        var bo: Array[Byte] = new Array[Byte](100)
        var cmd: Array[String] = new Array[String](3)
        cmd(0) = "bash"
        cmd(1) = "-c"
        cmd(2) = line
        var checkingCrawl: Process = Runtime.getRuntime().exec(cmd)
      }
    }

    var params: Array[String] = new Array[String](5)
    params(0) = "urls"
    params(1) = "-dir"
    params(2) = "insidejava"
    params(3) = "-depth"
    params(4) = "1"
    org.apache.nutch.crawl.Crawl.main(params)

Contents of classpaths.sh:

    JAVA=$JAVA_HOME/bin/java
    JAVA_HEAP_MAX=-Xmx1000m

    # check envvars which might override default args
    if [ "$NUTCH_HEAPSIZE" != "" ]; then
      #echo "run with heapsize $NUTCH_HEAPSIZE"
      JAVA_HEAP_MAX="-Xmx""$NUTCH_HEAPSIZE""m"
      #echo $JAVA_HEAP_MAX
    fi

    # CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf
    CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
    CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar

    # so that filenames w/ spaces are handled correctly in loops below
    IFS=

    # for developers, add plugins, job & test code to CLASSPATH
    if [ -d "$NUTCH_HOME/build/plugins" ]; then
      CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build
    fi
    if [ -d "$NUTCH_HOME/build/test/classes" ]; then
      CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/test/classes
    fi
    if [ "$IS_CORE" == "0" ]
    then
      for f in $NUTCH_HOME/build/nutch-*.job; do
        CLASSPATH=${CLASSPATH}:$f;
      done

      # for releases, add Nutch job to CLASSPATH
      for f in $NUTCH_HOME/nutch-*.job; do
        CLASSPATH=${CLASSPATH}:$f;
      done
    else
      CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/classes
    fi

    # add plugins to classpath
    if [ -d "$NUTCH_HOME/plugins" ]; then
      CLASSPATH=${NUTCH_HOME}:${CLASSPATH}
    fi

    # add libs to CLASSPATH
    for f in $NUTCH_HOME/lib/*.jar; do
      CLASSPATH=${CLASSPATH}:$f;
    done

    for f in $NUTCH_HOME/lib/jetty-ext/*.jar; do
      CLASSPATH=${CLASSPATH}:$f;
    done

    # setup 'java.library.path' for native-hadoop code if necessary
    JAVA_LIBRARY_PATH=''
    if [ -d "${NUTCH_HOME}/build/native" -o -d "${NUTCH_HOME}/lib/native" ]; then
      JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} org.apache.hadoop.util.PlatformName | sed -e 's/ /_/g'`

      if [ -d "$NUTCH_HOME/build/native" ]; then
        JAVA_LIBRARY_PATH=${HADOOP_HOME}/build/native/${JAVA_PLATFORM}/lib
      fi

      if [ -d "${NUTCH_HOME}/lib/native" ]; then
        if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
          JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
        else
          JAVA_LIBRARY_PATH=${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
        fi
      fi
    fi

    # restore ordinary behaviour
    unset IFS

    # default log directory & file
    if [ "$NUTCH_LOG_DIR" = "" ]; then
      NUTCH_LOG_DIR="$NUTCH_HOME/logs"
    fi
    if [ "$NUTCH_LOGFILE" = "" ]; then
      NUTCH_LOGFILE='hadoop.log'
    fi

    NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.dir=$NUTCH_LOG_DIR"
    NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.file=$NUTCH_LOGFILE"

    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      NUTCH_OPTS="$NUTCH_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"
    fi

Contents of hadoop.log:

    2009-07-27 18:48:55,345 INFO  crawl.Crawl - crawl started in: insidejava
    2009-07-27 18:48:55,347 INFO  crawl.Crawl - rootUrlDir = urls
    2009-07-27 18:48:55,347 INFO  crawl.Crawl - threads = 10
    2009-07-27 18:48:55,347 INFO  crawl.Crawl - depth = 1
    2009-07-27 18:48:55,779 INFO  crawl.Injector - Injector: starting
    2009-07-27 18:48:55,780 INFO  crawl.Injector - Injector: crawlDb: insidejava/crawldb
    2009-07-27 18:48:55,781 INFO  crawl.Injector - Injector: urlDir: urls
    2009-07-27 18:48:55,781 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
    2009-07-27 18:48:55,974 WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    2009-07-27 18:49:19,685 WARN  plugin.PluginRepository - Plugins: not a file: url. Can't load plugins from: jar:file:/nutch-1.0/crawler/nutch-1.0.job!/plugins
    2009-07-27 18:49:19,686 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
    2009-07-27 18:49:19,686 INFO  plugin.PluginRepository - Registered Plugins:
    2009-07-27 18:49:19,686 INFO  plugin.PluginRepository -         NONE
    2009-07-27 18:49:19,686 INFO  plugin.PluginRepository - Registered Extension-Points:
    2009-07-27 18:49:19,686 INFO  plugin.PluginRepository -         NONE
    2009-07-27 18:49:19,689 WARN  mapred.LocalJobRunner - job_local_0001
    java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
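[A note on the approach above: each line of classpaths.sh is handed to a fresh "bash -c" child process, so any variable it exports disappears when that child exits; the parent JVM's environment, and in particular its classpath (which is fixed at JVM launch), is never affected. A minimal Scala sketch of that isolation, using a variable name taken from the script:]

```scala
import scala.sys.process._

object EnvIsolationDemo {
  def main(args: Array[String]): Unit = {
    // Export a variable inside a child bash process, exactly as the
    // line-by-line exec loop above does for every line of classpaths.sh.
    Seq("bash", "-c", "export NUTCH_HEAPSIZE=2000").!

    // The export lived and died with the child process; the parent JVM
    // still does not see NUTCH_HEAPSIZE in its own environment.
    assert(sys.env.get("NUTCH_HEAPSIZE").isEmpty)
    println("child-process exports are not visible to the parent JVM")
  }
}
```

[This is why the "Registered Plugins: NONE" and "URLNormalizer not found" log lines appear even after running the script: the plugins never make it onto the classpath of the JVM that calls Crawl.main.]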
Re: Running the Crawl without using bin/nutch inside a Scala program
On Mon, Jul 27, 2009 at 16:47, Sailaja Dhiviti <sailaja_dhiv...@persistent.co.in> wrote:

> Hi,
>
> I am trying to run the crawl inside a Scala program without using the
> bin/nutch command. I am setting all the environment variables that
> bin/nutch sets when the crawl is run through it, and then calling
> Crawl.main(params). I am getting the following error:
>
>     Exception in thread "main" java.io.IOException: Job failed!
>             at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
>             at org.apache.nutch.crawl.Injector.inject(Injector.java:160)
>             at org.apache.nutch.crawl.Crawl.main(Crawl.java:113)
>
> [rest of quoted message trimmed]
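[One workable alternative, sketched here under stated assumptions: rather than replaying classpaths.sh line by line from inside the JVM, delegate to bin/nutch itself as a child process, so the script assembles CLASSPATH and NUTCH_OPTS the way it always does. The NUTCH_HOME path below is hypothetical; adjust it to the actual install.]

```scala
import scala.sys.process._
import java.io.File

object RunNutchCrawl {
  // Build the argument vector that the bin/nutch script expects for a
  // crawl, mirroring the params array from the original post.
  def crawlCommand(urlDir: String, outDir: String, depth: Int): Seq[String] =
    Seq("bin/nutch", "crawl", urlDir, "-dir", outDir, "-depth", depth.toString)

  def main(args: Array[String]): Unit = {
    // Hypothetical install location; replace with the real NUTCH_HOME.
    val nutchHome = new File("/nutch-1.0")

    // Run bin/nutch with nutchHome as the working directory, letting the
    // script build the classpath instead of the calling JVM.
    val exit = Process(crawlCommand("urls", "insidejava", 1), nutchHome).!
    println(s"nutch exited with status $exit")
  }
}
```

[This keeps the Scala program as a thin driver; the crawl itself runs in a JVM that bin/nutch launches with the plugin directories on its classpath.]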