Hello everyone,
I started a crawl over a set of seeds with this command:

./bin/crawl /largeSeeds 1 http://localhost:8983/solr/ddcd 4

In the first iteration, all of the steps (inject, generate, fetch,
parse, update-table, indexing, and duplicate deletion) ran
successfully.
In the second iteration, the "update-table" step failed (see the
error log below), and because of that failure the whole crawl
terminated.
******************** LOG START ********************
CrawlDB update for 1
/usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch
updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true 1452969522-27478 -crawlId 1
16/01/17 02:10:17 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting at
2016-01-17 02:10:17
16/01/17 02:10:17 INFO crawl.DbUpdaterJob: DbUpdaterJob: batchId:
1452969522-27478
16/01/17 02:10:17 INFO plugin.PluginRepository: Plugins: looking in:
/tmp/hadoop-root/hadoop-unjar3649584948711945520/classes/plugins
16/01/17 02:10:18 INFO plugin.PluginRepository: Plugin Auto-activation
mode: [true]
16/01/17 02:10:18 INFO plugin.PluginRepository: Registered Plugins:
16/01/17 02:10:18 INFO plugin.PluginRepository: Rel-Tag microformat
Parser/Indexer/Querier (microformats-reltag)
16/01/17 02:10:18 INFO plugin.PluginRepository: HTTP Framework
(lib-http)
16/01/17 02:10:18 INFO plugin.PluginRepository: Html Parse Plug-in
(parse-html)
16/01/17 02:10:18 INFO plugin.PluginRepository: MetaTags
(parse-metatags)
16/01/17 02:10:18 INFO plugin.PluginRepository: Http / Https Protocol
Plug-in (protocol-httpclient)
16/01/17 02:10:18 INFO plugin.PluginRepository: the nutch core
extension points (nutch-extensionpoints)
16/01/17 02:10:18 INFO plugin.PluginRepository: Basic Indexing Filter
(index-basic)
16/01/17 02:10:18 INFO plugin.PluginRepository: XML Libraries (lib-xml)
16/01/17 02:10:18 INFO plugin.PluginRepository: JavaScript Parser
(parse-js)
16/01/17 02:10:18 INFO plugin.PluginRepository: Anchor Indexing Filter
(index-anchor)
16/01/17 02:10:18 INFO plugin.PluginRepository: Tika Parser Plug-in
(parse-tika)
16/01/17 02:10:18 INFO plugin.PluginRepository: Top Level Domain
Plugin (tld)
16/01/17 02:10:18 INFO plugin.PluginRepository: Language
Identification Parser/Filter (language-identifier)
16/01/17 02:10:18 INFO plugin.PluginRepository: Regex URL Filter
Framework (lib-regex-filter)
16/01/17 02:10:18 INFO plugin.PluginRepository: Metadata Indexing
Filter (index-metadata)
16/01/17 02:10:18 INFO plugin.PluginRepository: CyberNeko HTML Parser
(lib-nekohtml)
16/01/17 02:10:18 INFO plugin.PluginRepository: Subcollection indexing
and query filter (subcollection)
16/01/17 02:10:18 INFO plugin.PluginRepository: Link Analysis Scoring
Plug-in (scoring-link)
16/01/17 02:10:18 INFO plugin.PluginRepository: Pass-through URL
Normalizer (urlnormalizer-pass)
16/01/17 02:10:18 INFO plugin.PluginRepository: OPIC Scoring Plug-in
(scoring-opic)
16/01/17 02:10:18 INFO plugin.PluginRepository: More Indexing Filter
(index-more)
16/01/17 02:10:18 INFO plugin.PluginRepository: Http Protocol Plug-in
(protocol-http)
16/01/17 02:10:18 INFO plugin.PluginRepository: SOLRIndexWriter
(indexer-solr)
16/01/17 02:10:18 INFO plugin.PluginRepository: Creative Commons
Plugins (creativecommons)
16/01/17 02:10:18 INFO plugin.PluginRepository: Registered
Extension-Points:
16/01/17 02:10:18 INFO plugin.PluginRepository: Parse Filter
(org.apache.nutch.parse.ParseFilter)
16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch Index Cleaning
Filter (org.apache.nutch.indexer.IndexCleaningFilter)
16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch Content Parser
(org.apache.nutch.parse.Parser)
16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch URL Filter (
org.apache.nutch.net.URLFilter)
16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch URL Normalizer (
org.apache.nutch.net.URLNormalizer)
16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch Protocol
(org.apache.nutch.protocol.Protocol)
16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch Index Writer
(org.apache.nutch.indexer.IndexWriter)
16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch Indexing Filter
(org.apache.nutch.indexer.IndexingFilter)
16/01/17 02:10:19 INFO Configuration.deprecation:
mapred.map.tasks.speculative.execution is deprecated. Instead, use
mapreduce.map.speculative
16/01/17 02:10:19 INFO Configuration.deprecation:
mapred.reduce.tasks.speculative.execution is deprecated. Instead, use
mapreduce.reduce.speculative
16/01/17 02:10:19 INFO Configuration.deprecation:
mapred.compress.map.output is deprecated. Instead, use
mapreduce.map.output.compress
16/01/17 02:10:19 INFO Configuration.deprecation: mapred.reduce.tasks is
deprecated. Instead, use mapreduce.job.reduces
16/01/17 02:10:19 INFO zookeeper.RecoverableZooKeeper: Process
identifier=hconnection-0x60a2630a connecting to ZooKeeper
ensemble=localhost:2181
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client environment:host.name
=cism479
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:java.version=1.8.0_65
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:java.vendor=Oracle Corporation
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:java.home=/usr/lib/jvm/jdk1.8.0_65/jre
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:java.class.path=/usr/share/searchEngine/hadoop-2.5.2/conf:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-configuration-1.6.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jasper-compiler-5.5.23.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/activation-1.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jsp-api-2.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-io-2.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/paranamer-2.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/httpclient-4.2.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/log4j-1.2.17.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jets3t-0.9.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/zookeeper-3.4.6.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jsr305-1.3.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/hadoop-auth-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/java-xmlbuilder-0.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-el-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jettison-1.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jersey-server-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/avro-1.7.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-codec-1.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-cli-1.2.jar:/usr/share/searchEngine/hadoop
-2.5.2/share/hadoop/common/lib/commons-net-3.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jetty-util-6.1.26.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/netty-3.6.2.Final.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-digester-1.8.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/hadoop-annotations-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/guava-11.0.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-compress-1.4.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jsch-0.1.42.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jersey-core-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/api-util-1.0.0-M20.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/api-asn1-api-1.0.0-M20.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/xz-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-httpclient-3.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/stax-api-1.0-2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/asm-3.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jackson-xc-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-logging-1.1.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jersey-json-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/slf4j-api-1.7.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-collections-3.2.1.jar:/usr/share/searchEngin
e/hadoop-2.5.2/share/hadoop/common/lib/commons-math3-3.1.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jetty-6.1.26.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/snappy-java-1.0.4.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/hamcrest-core-1.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jaxb-api-2.2.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-lang-2.6.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/junit-4.11.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/mockito-all-1.8.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/servlet-api-2.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/httpcore-4.2.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/xmlenc-0.52.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jasper-runtime-5.5.23.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/hadoop-common-2.5.2-tests.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/hadoop-nfs-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/hadoop-common-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jsp-api-2.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-io-2.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/log4j-1.2.17.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-daemon-1.0.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jsr305-1.3.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-el-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jersey-server-1.9.jar:/usr/share/search
Engine/hadoop-2.5.2/share/hadoop/hdfs/lib/jackson-mapper-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-codec-1.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-cli-1.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jetty-util-6.1.26.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/protobuf-java-2.5.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/netty-3.6.2.Final.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/guava-11.0.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jersey-core-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/asm-3.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-logging-1.1.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jetty-6.1.26.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-lang-2.6.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/servlet-api-2.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/xmlenc-0.52.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jasper-runtime-5.5.23.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/hadoop-hdfs-2.5.2-tests.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/hadoop-hdfs-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/hadoop-hdfs-nfs-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jline-0.9.94.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/activation-1.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jackson-jaxrs-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jaxb-impl-2.2.3-1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-io-2.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/log4j-1.2.17.jar:/usr/share/searc
hEngine/hadoop-2.5.2/share/hadoop/yarn/lib/zookeeper-3.4.6.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jsr305-1.3.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jettison-1.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jersey-server-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jackson-mapper-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-codec-1.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-cli-1.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jetty-util-6.1.26.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/protobuf-java-2.5.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jersey-client-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/aopalliance-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/netty-3.6.2.Final.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/guava-11.0.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-compress-1.4.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jersey-core-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jersey-guice-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/guice-3.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/xz-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-httpclient-3.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/stax-api-1.0-2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/asm-3.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jackson-xc-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-logging-1.1.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jersey-json-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/guice-servlet-3.0.jar:/usr/share/searchEngine/h
adoop-2.5.2/share/hadoop/yarn/lib/commons-collections-3.2.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jetty-6.1.26.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jaxb-api-2.2.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-lang-2.6.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/leveldbjni-all-1.8.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jackson-core-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/servlet-api-2.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/javax.inject-1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-server-tests-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-api-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-server-common-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-server-web-proxy-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-applications-unmanaged-am-launcher-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-client-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-server-applicationhistoryservice-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-common-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/commons-io-2.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/paranamer-2.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/log4j-1.2.17.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/jersey-server-1.9
.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/jackson-mapper-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/avro-1.7.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/protobuf-java-2.5.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/aopalliance-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/netty-3.6.2.Final.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/hadoop-annotations-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/commons-compress-1.4.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/jersey-core-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/jersey-guice-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/guice-3.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/xz-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/asm-3.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/guice-servlet-3.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/snappy-java-1.0.4.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/hamcrest-core-1.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/junit-4.11.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/leveldbjni-all-1.8.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/jackson-core-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/javax.inject-1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-shuffle-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.5.2-tests.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapred
uce-client-jobclient-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-app-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-plugins-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/contrib/capacity-scheduler/*.jar:/usr/share/searchEngine/hbase-0.98.8-hadoop2/lib/*.jar:/usr/share/searchEngine/hbase-0.98.8-hadoop2/conf
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:java.library.path=/usr/share/searchEngine/hadoop-2.5.2/lib/native
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:java.io.tmpdir=/tmp
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:java.compiler=<NA>
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client environment:os.name
=Linux
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:os.arch=amd64
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:os.version=3.16.0-30-generic
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client environment:user.name
=root
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:user.home=/root
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:user.dir=/usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Initiating client connection,
connectString=localhost:2181 sessionTimeout=90000
watcher=hconnection-0x60a2630a, quorum=localhost:2181, baseZNode=/hbase
16/01/17 02:10:19 INFO zookeeper.ClientCnxn: Opening socket connection to
server localhost/127.0.0.1:2181. Will not attempt to authenticate using
SASL (unknown error)
16/01/17 02:10:19 INFO zookeeper.ClientCnxn: Socket connection established
to localhost/127.0.0.1:2181, initiating session
16/01/17 02:10:19 INFO zookeeper.ClientCnxn: Session establishment
complete on server localhost/127.0.0.1:2181, sessionid =
0x152495dedd00143, negotiated timeout = 90000
16/01/17 02:10:21 INFO Configuration.deprecation: hadoop.native.lib is
deprecated. Instead, use io.native.lib.available
16/01/17 02:10:21 WARN store.HBaseStore: Mismatching schema's names.
Mappingfile schema: 'webpage'. PersistentClass schema's name:
'1_webpage'Assuming they are the same.
16/01/17 02:10:21 INFO zookeeper.RecoverableZooKeeper: Process
identifier=catalogtracker-on-hconnection-0x60a2630a connecting to ZooKeeper
ensemble=localhost:2181
16/01/17 02:10:21 INFO zookeeper.ZooKeeper: Initiating client connection,
connectString=localhost:2181 sessionTimeout=90000
watcher=catalogtracker-on-hconnection-0x60a2630a, quorum=localhost:2181,
baseZNode=/hbase
16/01/17 02:10:21 INFO zookeeper.ClientCnxn: Opening socket connection to
server localhost/127.0.0.1:2181. Will not attempt to authenticate using
SASL (unknown error)
16/01/17 02:10:21 INFO zookeeper.ClientCnxn: Socket connection established
to localhost/127.0.0.1:2181, initiating session
16/01/17 02:10:21 INFO zookeeper.ClientCnxn: Session establishment
complete on server localhost/127.0.0.1:2181, sessionid =
0x152495dedd00144, negotiated timeout = 90000
16/01/17 02:10:23 INFO zookeeper.ZooKeeper: Session: 0x152495dedd00144
closed
16/01/17 02:10:23 INFO zookeeper.ClientCnxn: EventThread shut down
16/01/17 02:10:23 WARN store.HBaseStore: Mismatching schema's names.
Mappingfile schema: 'webpage'. PersistentClass schema's name:
'1_webpage'Assuming they are the same.
16/01/17 02:10:23 INFO zookeeper.RecoverableZooKeeper: Process
identifier=catalogtracker-on-hconnection-0x60a2630a connecting to ZooKeeper
ensemble=localhost:2181
16/01/17 02:10:23 INFO zookeeper.ZooKeeper: Initiating client connection,
connectString=localhost:2181 sessionTimeout=90000
watcher=catalogtracker-on-hconnection-0x60a2630a, quorum=localhost:2181,
baseZNode=/hbase
16/01/17 02:10:23 INFO zookeeper.ClientCnxn: Opening socket connection to
server localhost/127.0.0.1:2181. Will not attempt to authenticate using
SASL (unknown error)
16/01/17 02:10:23 INFO zookeeper.ClientCnxn: Socket connection established
to localhost/127.0.0.1:2181, initiating session
16/01/17 02:10:23 INFO zookeeper.ClientCnxn: Session establishment
complete on server localhost/127.0.0.1:2181, sessionid =
0x152495dedd00145, negotiated timeout = 90000
16/01/17 02:10:23 INFO zookeeper.ZooKeeper: Session: 0x152495dedd00145
closed
16/01/17 02:10:23 INFO zookeeper.ClientCnxn: EventThread shut down
16/01/17 02:10:23 INFO client.RMProxy: Connecting to ResourceManager at /
0.0.0.0:8032
16/01/17 02:10:27 WARN store.HBaseStore: Mismatching schema's names.
Mappingfile schema: 'webpage'. PersistentClass schema's name:
'1_webpage'Assuming they are the same.
16/01/17 02:10:27 INFO zookeeper.RecoverableZooKeeper: Process
identifier=catalogtracker-on-hconnection-0x60a2630a connecting to ZooKeeper
ensemble=localhost:2181
16/01/17 02:10:27 INFO zookeeper.ZooKeeper: Initiating client connection,
connectString=localhost:2181 sessionTimeout=90000
watcher=catalogtracker-on-hconnection-0x60a2630a, quorum=localhost:2181,
baseZNode=/hbase
16/01/17 02:10:27 INFO zookeeper.ClientCnxn: Opening socket connection to
server localhost/127.0.0.1:2181. Will not attempt to authenticate using
SASL (unknown error)
16/01/17 02:10:27 INFO zookeeper.ClientCnxn: Socket connection established
to localhost/127.0.0.1:2181, initiating session
16/01/17 02:10:27 INFO zookeeper.ClientCnxn: Session establishment
complete on server localhost/127.0.0.1:2181, sessionid =
0x152495dedd00146, negotiated timeout = 90000
16/01/17 02:10:27 INFO zookeeper.ZooKeeper: Session: 0x152495dedd00146
closed
16/01/17 02:10:27 INFO zookeeper.ClientCnxn: EventThread shut down
16/01/17 02:10:27 INFO mapreduce.JobSubmitter: number of splits:2
16/01/17 02:10:27 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1452929501009_0024
16/01/17 02:10:28 INFO impl.YarnClientImpl: Submitted application
application_1452929501009_0024
16/01/17 02:10:28 INFO mapreduce.Job: The url to track the job:
http://cism479:8088/proxy/application_1452929501009_0024/
16/01/17 02:10:28 INFO mapreduce.Job: Running job: job_1452929501009_0024
16/01/17 02:10:39 INFO mapreduce.Job: Job job_1452929501009_0024 running
in uber mode : false
16/01/17 02:10:39 INFO mapreduce.Job: map 0% reduce 0%
16/01/17 02:11:37 INFO mapreduce.Job: Task Id :
attempt_1452929501009_0024_m_000000_0, Status : FAILED
Error: java.net.MalformedURLException: For input string: "#10;from <a
href="https:"
at java.net.URL.<init>(URL.java:620)
at java.net.URL.<init>(URL.java:483)
at java.net.URL.<init>(URL.java:432)
at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.NumberFormatException: For input string: "#10;from <a
href="https:"
at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:569)
at java.lang.Integer.parseInt(Integer.java:615)
at java.net.URLStreamHandler.parseURL(URLStreamHandler.java:216)
at java.net.URL.<init>(URL.java:615)
... 13 more
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
16/01/17 02:12:13 INFO mapreduce.Job: map 33% reduce 0%
16/01/17 02:12:24 INFO mapreduce.Job: map 50% reduce 0%
16/01/17 02:12:44 INFO mapreduce.Job: Task Id :
attempt_1452929501009_0024_m_000000_1, Status : FAILED
Error: java.net.MalformedURLException: For input string: "#10;from <a
href="https:"
at java.net.URL.<init>(URL.java:620)
at java.net.URL.<init>(URL.java:483)
at java.net.URL.<init>(URL.java:432)
at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.NumberFormatException: For input string: "#10;from <a
href="https:"
at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:569)
at java.lang.Integer.parseInt(Integer.java:615)
at java.net.URLStreamHandler.parseURL(URLStreamHandler.java:216)
at java.net.URL.<init>(URL.java:615)
... 13 more
16/01/17 02:13:19 INFO mapreduce.Job: Task Id :
attempt_1452929501009_0024_m_000000_2, Status : FAILED
Error: java.net.MalformedURLException: For input string: "#10;from <a
href="https:"
at java.net.URL.<init>(URL.java:620)
at java.net.URL.<init>(URL.java:483)
at java.net.URL.<init>(URL.java:432)
at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.NumberFormatException: For input string: "#10;from <a
href="https:"
at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:569)
at java.lang.Integer.parseInt(Integer.java:615)
at java.net.URLStreamHandler.parseURL(URLStreamHandler.java:216)
at java.net.URL.<init>(URL.java:615)
... 13 more
16/01/17 02:13:42 INFO mapreduce.Job: map 100% reduce 100%
16/01/17 02:13:43 INFO mapreduce.Job: Job job_1452929501009_0024 failed
with state FAILED due to: Task failed task_1452929501009_0024_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
16/01/17 02:13:44 INFO mapreduce.Job: Counters: 34
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=49949067
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1193
HDFS: Number of bytes written=0
HDFS: Number of read operations=1
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed map tasks=4
Launched map tasks=5
Other local map tasks=3
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=829677
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=276559
Total vcore-seconds taken by all map tasks=276559
Total megabyte-seconds taken by all map tasks=849589248
Map-Reduce Framework
Map input records=30201
Map output records=1164348
Map output bytes=250659088
Map output materialized bytes=49832245
Input split bytes=1193
Combine input records=0
Spilled Records=1164348
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=3541
CPU time spent (ms)=42980
Physical memory (bytes) snapshot=2062766080
Virtual memory (bytes) snapshot=5086490624
Total committed heap usage (bytes)=2127036416
File Input Format Counters
Bytes Read=0
Exception in thread "main" java.lang.RuntimeException: job failed:
name=[1]update-table, jobid=job_1452929501009_0024
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:111)
at
org.apache.nutch.crawl.DbUpdaterJob.updateTable(DbUpdaterJob.java:140)
at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:174)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:178)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Error running:
/usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch
updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true 1452969522-27478 -crawlId 1
Failed with exit value 1.
******************** LOG END ********************
The error makes it fairly clear that the failure is caused by malformed
URLs. Is there a way to get rid of such URLs? Or is there a way to skip
or bypass them, so that the subsequent steps still run?
Please advise.
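
For illustration, here is a minimal sketch of the "skip them" idea, not
Nutch's own code. The stack trace shows the exception coming out of the
java.net.URL constructor called from TableUtil.reverseUrl, so the same
constructor can serve as a pre-check. The class UrlGuard and its methods
are hypothetical names; returning null to reject a URL is modeled on the
convention I understand Nutch's URLFilter plugins to follow, and wiring
this into an actual plugin is an assumption left untested here.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: drops URLs that java.net.URL
// cannot parse, so a later step never sees them.
final class UrlGuard {

    // Returns the URL unchanged if it parses, or null to reject it.
    static String filter(String urlString) {
        if (urlString == null) {
            return null;
        }
        try {
            new URL(urlString); // same constructor that fails in the log
            return urlString;
        } catch (MalformedURLException e) {
            return null; // skip instead of failing the whole job
        }
    }

    // Keeps only the URLs that pass filter(), preserving their order.
    static List<String> keepParsable(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (filter(url) != null) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample outlinks; the second has a non-numeric port,
        // which makes the URL constructor throw MalformedURLException.
        List<String> outlinks = Arrays.asList(
                "http://example.com/page.html",
                "http://example.com:abc/broken",
                "https://example.org:8443/ok");
        System.out.println(keepParsable(outlinks));
    }
}
```

The check is deliberately the same parse that update-table performs, so
anything it lets through should not trip that particular exception. It
does not explain how the broken outlink was extracted in the first place.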
Kshitij Shukla
Software developer
CIS