Running the Crawl without using bin/nutch inside a Scala program

2009-07-27 Thread Sailaja Dhiviti
Hi,
I am trying to run the crawl inside a Scala program without using the
bin/nutch command. I am adding all the environment variables that nutch.sh
sets when the crawl runs through bin/nutch, and then I call
Crawl.main(params). I get the following error:

Exception in thread "main" java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
	at org.apache.nutch.crawl.Injector.inject(Injector.java:160)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:113)

Here is the code I am trying to write:
import scala.io.Source

// run each line of classpaths.sh through bash, one child process per line
for (line <- Source.fromFile("/root/classpaths.sh").getLines) {
  if (line != null) {
    val cmd = Array("bash", "-c", line)
    val checkingCrawl: Process = Runtime.getRuntime().exec(cmd)
  }
}

// then invoke the crawl programmatically
val params = Array("urls", "-dir", "insidejava", "-depth", "1")
org.apache.nutch.crawl.Crawl.main(params)
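
Note that this cannot work as written: each Runtime.getRuntime().exec(cmd)
call runs its line in a brand-new bash process, and any variables that
process sets (CLASSPATH, NUTCH_OPTS, and so on) die with it. Nothing
reaches the JVM that later calls Crawl.main. A quick sanity check, using
only plain JDK calls, makes this visible:

// whatever the JVM was started with is all Crawl.main will ever see;
// the exec'ed bash children cannot change either of these values
println(System.getProperty("java.class.path"))
println(System.getenv("CLASSPATH"))

Printed before and after the exec loop, both values stay the same, which
is why the job file's classes and plugins are missing at crawl time.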



contents of classpaths.sh:

JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m

# check envvars which might override default args
if [ "$NUTCH_HEAPSIZE" != "" ]; then
  #echo "run with heapsize $NUTCH_HEAPSIZE"
  JAVA_HEAP_MAX="-Xmx${NUTCH_HEAPSIZE}m"
  #echo $JAVA_HEAP_MAX
fi

# CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf
CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar

# so that filenames w/ spaces are handled correctly in loops below
IFS=

# for developers, add plugins, job & test code to CLASSPATH
if [ -d $NUTCH_HOME/build/plugins ]; then
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build
fi
if [ -d $NUTCH_HOME/build/test/classes ]; then
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/test/classes
fi

if [ "$IS_CORE" == "0" ]
then
  for f in $NUTCH_HOME/build/nutch-*.job; do
    CLASSPATH=${CLASSPATH}:$f;
  done

  # for releases, add Nutch job to CLASSPATH
  for f in $NUTCH_HOME/nutch-*.job; do
    CLASSPATH=${CLASSPATH}:$f;
  done
else
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/classes
fi
# add plugins to classpath
if [ -d $NUTCH_HOME/plugins ]; then
  CLASSPATH=${NUTCH_HOME}:${CLASSPATH}
fi
# add libs to CLASSPATH
for f in $NUTCH_HOME/lib/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

for f in $NUTCH_HOME/lib/jetty-ext/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

# setup 'java.library.path' for native-hadoop code if necessary
JAVA_LIBRARY_PATH=''
if [ -d ${NUTCH_HOME}/build/native -o -d ${NUTCH_HOME}/lib/native ]; then
  JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} org.apache.hadoop.util.PlatformName | sed -e 's/ /_/g'`

  if [ -d $NUTCH_HOME/build/native ]; then
    JAVA_LIBRARY_PATH=${HADOOP_HOME}/build/native/${JAVA_PLATFORM}/lib
  fi

  if [ -d ${NUTCH_HOME}/lib/native ]; then
    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
    else
      JAVA_LIBRARY_PATH=${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
    fi
  fi
fi

# restore ordinary behaviour
unset IFS

# default log directory & file
if [ "$NUTCH_LOG_DIR" = "" ]; then
  NUTCH_LOG_DIR="$NUTCH_HOME/logs"
fi
if [ "$NUTCH_LOGFILE" = "" ]; then
  NUTCH_LOGFILE='hadoop.log'
fi
NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.dir=$NUTCH_LOG_DIR"
NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.file=$NUTCH_LOGFILE"

if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
  NUTCH_OPTS="$NUTCH_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"
fi
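
These variables only matter if they are in force in the java process that
runs the crawl. bin/nutch gets that right by computing them first and
starting the JVM afterwards. The equivalent from a shell, as a rough
sketch (MyCrawler is a stand-in name for whatever main class wraps the
Scala code above; for a Scala main class, scala-library.jar would also
need to be on the classpath):

. /root/classpaths.sh
"$JAVA" $JAVA_HEAP_MAX $NUTCH_OPTS -classpath "$CLASSPATH" MyCrawler

Sourcing the script with . evaluates it in the current shell, so CLASSPATH
and NUTCH_OPTS are still set at the moment java starts; executing it line
by line from inside an already-running JVM can never have that effect.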


contents of hadoop.log:

2009-07-27 18:48:55,345 INFO  crawl.Crawl - crawl started in: insidejava
2009-07-27 18:48:55,347 INFO  crawl.Crawl - rootUrlDir = urls
2009-07-27 18:48:55,347 INFO  crawl.Crawl - threads = 10
2009-07-27 18:48:55,347 INFO  crawl.Crawl - depth = 1
2009-07-27 18:48:55,779 INFO  crawl.Injector - Injector: starting
2009-07-27 18:48:55,780 INFO  crawl.Injector - Injector: crawlDb: insidejava/crawldb
2009-07-27 18:48:55,781 INFO  crawl.Injector - Injector: urlDir: urls
2009-07-27 18:48:55,781 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2009-07-27 18:48:55,974 WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-07-27 18:49:19,685 WARN  plugin.PluginRepository - Plugins: not a file: url. Can't load plugins from: jar:file:/nutch-1.0/crawler/nutch-1.0.job!/plugins
2009-07-27 18:49:19,686 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-07-27 18:49:19,686 INFO  plugin.PluginRepository - Registered Plugins:
2009-07-27 18:49:19,686 INFO  plugin.PluginRepository - NONE
2009-07-27 18:49:19,686 INFO  plugin.PluginRepository - Registered Extension-Points:
2009-07-27 18:49:19,686 INFO  plugin.PluginRepository - NONE
2009-07-27 18:49:19,689 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
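
The two log lines that matter are the PluginRepository warning and
"Registered Plugins: NONE": PluginRepository cannot read plugins from
inside the nutch-1.0.job jar, so nothing loads, and the crawl fails as
soon as the Injector looks up the org.apache.nutch.net.URLNormalizer
extension point. The usual workaround when embedding Nutch is to point
the plugin.folders property at an unpacked plugins directory on disk. A
sketch of the override for nutch-site.xml, assuming the plugins directory
has been extracted next to the job file (the exact path here is an
assumption; adjust it for your layout):

<property>
  <name>plugin.folders</name>
  <!-- assumed location of an unpacked copy of the plugins directory -->
  <value>/nutch-1.0/crawler/plugins</value>
</property>

With that in place, the "Registered Plugins" list in hadoop.log should
name the activated plugins instead of NONE.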
   

Re: Running the Crawl without using bin/nutch inside a Scala program

2009-07-27 Thread Doğacan Güney