Nutch 2.3 Rest gets stuck on EMR

Ketan Bhokray Thu, 19 Nov 2015 03:00:06 -0800

Hi,

I'm running Nutch 2.3 on EMR (AMI version 2.4.2). The crawl steps work
fine in both local and distributed mode (`hadoop jar
apache-nutch-2.3.job <MainClass> <args>`), and I am able to call the steps by
spinning up the REST service in local mode. But when I try to run the REST
service in distributed mode (`hadoop jar apache-nutch-2.3.job
org.apache.nutch.api.NutchServer`), it receives the calls but does not get
the job done. What is the correct way to run the Nutch REST service in
distributed mode?


Info
----
When the InjectorJob is run directly (without the REST service) in
distributed mode, the output is as follows:

*COMMAND*:

    hadoop jar ./apache-nutch-2.3.job org.apache.nutch.crawl.InjectorJob s3://myemrbucket/urls -crawlId 2

    15/11/19 09:55:06 INFO crawl.InjectorJob: InjectorJob: starting at 2015-11-19 09:55:06
    15/11/19 09:55:06 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: s3://myemrbucket/urls
    15/11/19 09:55:06 INFO s3native.NativeS3FileSystem: Created AmazonS3 with InstanceProfileCredentialsProvider
    15/11/19 09:55:08 WARN store.HBaseStore: Mismatching schema's names. Mappingfile schema: 'webpage'. PersistentClass schema's name: '2_webpage'Assuming they are the same.
    15/11/19 09:55:08 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
    15/11/19 09:55:08 INFO mapred.JobClient: Default number of map tasks: null
    15/11/19 09:55:08 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 4
    15/11/19 09:55:08 INFO mapred.JobClient: Default number of reduce tasks: 0
    15/11/19 09:55:10 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache
    15/11/19 09:55:10 INFO mapred.JobClient: Setting group to hadoop
    15/11/19 09:55:10 INFO input.FileInputFormat: Total input paths to process : 1
    15/11/19 09:55:10 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
    15/11/19 09:55:10 WARN lzo.LzoCodec: Could not find build properties file with revision hash
    15/11/19 09:55:10 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
    15/11/19 09:55:10 WARN snappy.LoadSnappy: Snappy native library is available
    15/11/19 09:55:10 INFO snappy.LoadSnappy: Snappy native library loaded
    15/11/19 09:55:10 INFO mapred.JobClient: Running job: job_201511182052_0037
    15/11/19 09:55:11 INFO mapred.JobClient:  map 0% reduce 0%
    15/11/19 09:55:38 INFO mapred.JobClient:  map 100% reduce 0%
    15/11/19 09:55:43 INFO mapred.JobClient: Job complete: job_201511182052_0037
    15/11/19 09:55:43 INFO mapred.JobClient: Counters: 20
    15/11/19 09:55:43 INFO mapred.JobClient:   Job Counters
    15/11/19 09:55:43 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=16424
    15/11/19 09:55:43 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
    15/11/19 09:55:43 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
    15/11/19 09:55:43 INFO mapred.JobClient:     Rack-local map tasks=1
    15/11/19 09:55:43 INFO mapred.JobClient:     Launched map tasks=1
    15/11/19 09:55:43 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
    15/11/19 09:55:43 INFO mapred.JobClient:   File Output Format Counters
    15/11/19 09:55:43 INFO mapred.JobClient:     Bytes Written=0
    15/11/19 09:55:43 INFO mapred.JobClient:   injector
    15/11/19 09:55:43 INFO mapred.JobClient:     urls_injected=1
    15/11/19 09:55:43 INFO mapred.JobClient:   FileSystemCounters
    15/11/19 09:55:43 INFO mapred.JobClient:     HDFS_BYTES_READ=98
    15/11/19 09:55:43 INFO mapred.JobClient:     S3_BYTES_READ=61
    15/11/19 09:55:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=36254
    15/11/19 09:55:43 INFO mapred.JobClient:   File Input Format Counters
    15/11/19 09:55:43 INFO mapred.JobClient:     Bytes Read=61
    15/11/19 09:55:43 INFO mapred.JobClient:   Map-Reduce Framework
    15/11/19 09:55:43 INFO mapred.JobClient:     Map input records=1
    15/11/19 09:55:43 INFO mapred.JobClient:     Physical memory (bytes) snapshot=193712128
    15/11/19 09:55:43 INFO mapred.JobClient:     Spilled Records=0
    15/11/19 09:55:43 INFO mapred.JobClient:     CPU time spent (ms)=3960
    15/11/19 09:55:43 INFO mapred.JobClient:     Total committed heap usage (bytes)=298319872
    15/11/19 09:55:43 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1525059584
    15/11/19 09:55:43 INFO mapred.JobClient:     Map output records=1
    15/11/19 09:55:43 INFO mapred.JobClient:     SPLIT_RAW_BYTES=98
    15/11/19 09:55:44 INFO crawl.InjectorJob: InjectorJob: total number of urls rejected by filters: 0
    15/11/19 09:55:44 INFO crawl.InjectorJob: InjectorJob: total number of urls injected after normalization and filtering: 1
    15/11/19 09:55:44 INFO crawl.InjectorJob: Injector: finished at 2015-11-19 09:55:44, elapsed: 00:00:38

When the same job is called through the REST service, it gets stuck after
producing the following output:

*POST ARGS*:

    {
      "crawlId":"11",
      "confId":"default",
      "type":"INJECT",
      "args":{"seedDir":"s3://myemrbucket/urls"}
    }

    15/11/19 09:46:14 INFO api.NutchServer: Starting NutchServer on port: 8081 with logging level: INFO ...
    Nov 19, 2015 9:46:14 AM org.restlet.engine.connector.NetServerHelper start
    INFO: Starting the internal [HTTP/1.1] server on port 8081
    15/11/19 09:46:14 INFO api.NutchServer: Started NutchServer on port 8081
    Nov 19, 2015 9:46:25 AM org.restlet.engine.log.LogFilter afterHandle
    INFO: 2015-11-19 09:46:25 1xx.xx.x.xx - - 8081 POST /job/create - 200 28 110 498 http://ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com:8081 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36-
    15/11/19 09:46:25 INFO s3native.NativeS3FileSystem: Created AmazonS3 with InstanceProfileCredentialsProvider
    15/11/19 09:46:27 WARN store.HBaseStore: Mismatching schema's names. Mappingfile schema: 'webpage'. PersistentClass schema's name: '11_webpage'Assuming they are the same.
    15/11/19 09:46:28 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
    15/11/19 09:46:28 INFO mapred.JobClient: Default number of map tasks: null
    15/11/19 09:46:28 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 4
    15/11/19 09:46:28 INFO mapred.JobClient: Default number of reduce tasks: 0
    15/11/19 09:46:28 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

and does not move forward.
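
For anyone who wants to reproduce the REST call, here is a minimal Python sketch of it. The endpoint (`/job/create`), the port, and the payload are taken from the output above; the server address in `NUTCH_SERVER` and the function names are hypothetical and only there for illustration.

```python
import json
import urllib.request

# Hypothetical address; replace with the host where NutchServer is running.
NUTCH_SERVER = "http://localhost:8081"


def build_inject_request(base_url, crawl_id, seed_dir, conf_id="default"):
    """Build (but do not send) the POST /job/create request for an INJECT job."""
    payload = {
        "crawlId": crawl_id,
        "confId": conf_id,
        "type": "INJECT",
        "args": {"seedDir": seed_dir},
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/job/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(request, timeout=30):
    """Send the request and return the HTTP status and the response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

Calling `submit(build_inject_request(NUTCH_SERVER, "11", "s3://myemrbucket/urls"))` sends the same request as the POST shown above.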

Thanks,
