Thanks for your prompt reply, Zara. Really appreciated.
I checked the Nutch setup and found the regex-normalize.xml file already in place with its default settings. I am a newbie with regex; could you please tell me a bit more about the setting/regex I should write?
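
If it helps, is something along these lines what you mean? This is only my guess at a rule for conf/regex-normalize.xml, truncating the URL at the first stray whitespace, double quote, or angle bracket left over from broken HTML markup; I have not tested the pattern:

```xml
<!-- Untested guess at a rule for conf/regex-normalize.xml:
     cut the URL at the first whitespace, double quote, or
     angle bracket left over from malformed HTML. -->
<regex>
  <pattern>[\s"&lt;&gt;].*$</pattern>
  <substitution></substitution>
</regex>
```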

BR

On Monday 18 January 2016 01:41 PM, Zara Parst wrote:
Please use a normalization regex and your problem is solved.

On Mon, Jan 18, 2016 at 1:34 PM, Kshitij Shukla <[email protected]>
wrote:

Hello everyone,

I have added a set of seeds to crawl using this command:

./bin/crawl /largeSeeds 1 http://localhost:8983/solr/ddcd 4

For the first iteration, all of the steps (inject, generate, fetch, parse, update-table, index, and delete duplicates) executed successfully.
For the second iteration, the "update-table" step failed (please see the error log below for reference); because of this failure the whole
process gets terminated.
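
As far as I can tell from the stack trace, java.net.URL tries to parse the text after the host's colon as a port number, and the extracted outlink contains leftover HTML there. A minimal reproduction of the same exception (the URL below is illustrative, not taken from my crawl data):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class BadPortDemo {
    public static void main(String[] args) {
        // The text after the host's ':' must be a numeric port;
        // anything else makes Integer.parseInt throw inside
        // URLStreamHandler.parseURL, and the URL constructor wraps
        // that NumberFormatException in a MalformedURLException,
        // just like in the log below.
        String outlink = "https://example.com:not-a-port/page";
        try {
            new URL(outlink);
            System.out.println("parsed OK");
        } catch (MalformedURLException e) {
            // e.getCause() is the underlying NumberFormatException
            System.out.println("MalformedURLException: " + e.getMessage());
        }
    }
}
```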


**************************************** LOG START ****************************************
CrawlDB update for 1
/usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch
updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true 1452969522-27478 -crawlId 1
16/01/17 02:10:17 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting at
2016-01-17 02:10:17
16/01/17 02:10:17 INFO crawl.DbUpdaterJob: DbUpdaterJob: batchId:
1452969522-27478
16/01/17 02:10:17 INFO plugin.PluginRepository: Plugins: looking in:
/tmp/hadoop-root/hadoop-unjar3649584948711945520/classes/plugins
16/01/17 02:10:18 INFO plugin.PluginRepository: Plugin Auto-activation
mode: [true]
16/01/17 02:10:18 INFO plugin.PluginRepository: Registered Plugins:
16/01/17 02:10:18 INFO plugin.PluginRepository:     Rel-Tag microformat
Parser/Indexer/Querier (microformats-reltag)
16/01/17 02:10:18 INFO plugin.PluginRepository:     HTTP Framework
(lib-http)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Html Parse Plug-in
(parse-html)
16/01/17 02:10:18 INFO plugin.PluginRepository:     MetaTags
(parse-metatags)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Http / Https Protocol
Plug-in (protocol-httpclient)
16/01/17 02:10:18 INFO plugin.PluginRepository:     the nutch core
extension points (nutch-extensionpoints)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Basic Indexing Filter
(index-basic)
16/01/17 02:10:18 INFO plugin.PluginRepository:     XML Libraries (lib-xml)
16/01/17 02:10:18 INFO plugin.PluginRepository:     JavaScript Parser
(parse-js)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Anchor Indexing Filter
(index-anchor)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Tika Parser Plug-in
(parse-tika)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Top Level Domain
Plugin (tld)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Language
Identification Parser/Filter (language-identifier)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Regex URL Filter
Framework (lib-regex-filter)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Metadata Indexing
Filter (index-metadata)
16/01/17 02:10:18 INFO plugin.PluginRepository:     CyberNeko HTML Parser
(lib-nekohtml)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Subcollection indexing
and query filter (subcollection)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Link Analysis Scoring
Plug-in (scoring-link)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Pass-through URL
Normalizer (urlnormalizer-pass)
16/01/17 02:10:18 INFO plugin.PluginRepository:     OPIC Scoring Plug-in
(scoring-opic)
16/01/17 02:10:18 INFO plugin.PluginRepository:     More Indexing Filter
(index-more)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Http Protocol Plug-in
(protocol-http)
16/01/17 02:10:18 INFO plugin.PluginRepository:     SOLRIndexWriter
(indexer-solr)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Creative Commons
Plugins (creativecommons)
16/01/17 02:10:18 INFO plugin.PluginRepository: Registered
Extension-Points:
16/01/17 02:10:18 INFO plugin.PluginRepository:     Parse Filter
(org.apache.nutch.parse.ParseFilter)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Nutch Index Cleaning
Filter (org.apache.nutch.indexer.IndexCleaningFilter)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Nutch Content Parser
(org.apache.nutch.parse.Parser)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Nutch URL Filter (
org.apache.nutch.net.URLFilter)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Nutch URL Normalizer (
org.apache.nutch.net.URLNormalizer)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Nutch Protocol
(org.apache.nutch.protocol.Protocol)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Nutch Index Writer
(org.apache.nutch.indexer.IndexWriter)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Nutch Indexing Filter
(org.apache.nutch.indexer.IndexingFilter)
16/01/17 02:10:19 INFO Configuration.deprecation:
mapred.map.tasks.speculative.execution is deprecated. Instead, use
mapreduce.map.speculative
16/01/17 02:10:19 INFO Configuration.deprecation:
mapred.reduce.tasks.speculative.execution is deprecated. Instead, use
mapreduce.reduce.speculative
16/01/17 02:10:19 INFO Configuration.deprecation:
mapred.compress.map.output is deprecated. Instead, use
mapreduce.map.output.compress
16/01/17 02:10:19 INFO Configuration.deprecation: mapred.reduce.tasks is
deprecated. Instead, use mapreduce.job.reduces
16/01/17 02:10:19 INFO zookeeper.RecoverableZooKeeper: Process
identifier=hconnection-0x60a2630a connecting to ZooKeeper
ensemble=localhost:2181
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client environment:host.name
=cism479
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:java.version=1.8.0_65
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:java.vendor=Oracle Corporation
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:java.home=/usr/lib/jvm/jdk1.8.0_65/jre
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:java.class.path=/usr/share/searchEngine/hadoop-2.5.2/conf:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-configuration-1.6.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jasper-compiler-5.5.23.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/activation-1.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jsp-api-2.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-io-2.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/paranamer-2.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/httpclient-4.2.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/log4j-1.2.17.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jets3t-0.9.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/zookeeper-3.4.6.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jsr305-1.3.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/hadoop-auth-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/java-xmlbuilder-0.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-el-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jettison-1.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jersey-server-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/avro-1.7.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-codec-1.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-cli-1.2.jar:/usr/share/searchEngine/hadoop
-2.5.2/share/hadoop/common/lib/commons-net-3.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jetty-util-6.1.26.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/netty-3.6.2.Final.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-digester-1.8.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/hadoop-annotations-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/guava-11.0.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-compress-1.4.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jsch-0.1.42.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jersey-core-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/api-util-1.0.0-M20.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/api-asn1-api-1.0.0-M20.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/xz-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-httpclient-3.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/stax-api-1.0-2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/asm-3.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jackson-xc-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-logging-1.1.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jersey-json-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/slf4j-api-1.7.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-collections-3.2.1.jar:/usr/share/searchEngin
e/hadoop-2.5.2/share/hadoop/common/lib/commons-math3-3.1.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jetty-6.1.26.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/snappy-java-1.0.4.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/hamcrest-core-1.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jaxb-api-2.2.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-lang-2.6.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/junit-4.11.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/mockito-all-1.8.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/servlet-api-2.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/httpcore-4.2.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/xmlenc-0.52.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jasper-runtime-5.5.23.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/hadoop-common-2.5.2-tests.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/hadoop-nfs-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/hadoop-common-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jsp-api-2.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-io-2.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/log4j-1.2.17.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-daemon-1.0.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jsr305-1.3.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-el-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jersey-server-1.9.jar:/usr/share/search
Engine/hadoop-2.5.2/share/hadoop/hdfs/lib/jackson-mapper-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-codec-1.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-cli-1.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jetty-util-6.1.26.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/protobuf-java-2.5.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/netty-3.6.2.Final.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/guava-11.0.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jersey-core-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/asm-3.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-logging-1.1.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jetty-6.1.26.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-lang-2.6.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/servlet-api-2.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/xmlenc-0.52.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jasper-runtime-5.5.23.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/hadoop-hdfs-2.5.2-tests.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/hadoop-hdfs-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/hadoop-hdfs-nfs-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jline-0.9.94.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/activation-1.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jackson-jaxrs-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jaxb-impl-2.2.3-1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-io-2.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/log4j-1.2.17.jar:/usr/share/searc
hEngine/hadoop-2.5.2/share/hadoop/yarn/lib/zookeeper-3.4.6.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jsr305-1.3.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jettison-1.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jersey-server-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jackson-mapper-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-codec-1.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-cli-1.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jetty-util-6.1.26.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/protobuf-java-2.5.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jersey-client-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/aopalliance-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/netty-3.6.2.Final.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/guava-11.0.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-compress-1.4.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jersey-core-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jersey-guice-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/guice-3.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/xz-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-httpclient-3.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/stax-api-1.0-2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/asm-3.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jackson-xc-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-logging-1.1.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jersey-json-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/guice-servlet-3.0.jar:/usr/share/searchEngine/h
adoop-2.5.2/share/hadoop/yarn/lib/commons-collections-3.2.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jetty-6.1.26.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jaxb-api-2.2.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-lang-2.6.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/leveldbjni-all-1.8.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jackson-core-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/servlet-api-2.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/javax.inject-1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-server-tests-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-api-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-server-common-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-server-web-proxy-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-applications-unmanaged-am-launcher-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-client-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-server-applicationhistoryservice-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-common-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/commons-io-2.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/paranamer-2.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/log4j-1.2.17.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/jersey-server-1.9
.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/jackson-mapper-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/avro-1.7.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/protobuf-java-2.5.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/aopalliance-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/netty-3.6.2.Final.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/hadoop-annotations-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/commons-compress-1.4.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/jersey-core-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/jersey-guice-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/guice-3.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/xz-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/asm-3.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/guice-servlet-3.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/snappy-java-1.0.4.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/hamcrest-core-1.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/junit-4.11.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/leveldbjni-all-1.8.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/jackson-core-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/javax.inject-1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-shuffle-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.5.2-tests.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapred
uce-client-jobclient-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-app-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-plugins-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/contrib/capacity-scheduler/*.jar:/usr/share/searchEngine/hbase-0.98.8-hadoop2/lib/*.jar:/usr/share/searchEngine/hbase-0.98.8-hadoop2/conf
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:java.library.path=/usr/share/searchEngine/hadoop-2.5.2/lib/native
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:java.io.tmpdir=/tmp
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:java.compiler=<NA>
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client environment:os.name
=Linux
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:os.arch=amd64
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:os.version=3.16.0-30-generic
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client environment:user.name
=root
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:user.home=/root
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client
environment:user.dir=/usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Initiating client connection,
connectString=localhost:2181 sessionTimeout=90000
watcher=hconnection-0x60a2630a, quorum=localhost:2181, baseZNode=/hbase
16/01/17 02:10:19 INFO zookeeper.ClientCnxn: Opening socket connection to
server localhost/127.0.0.1:2181. Will not attempt to authenticate using
SASL (unknown error)
16/01/17 02:10:19 INFO zookeeper.ClientCnxn: Socket connection established
to localhost/127.0.0.1:2181, initiating session
16/01/17 02:10:19 INFO zookeeper.ClientCnxn: Session establishment
complete on server localhost/127.0.0.1:2181, sessionid =
0x152495dedd00143, negotiated timeout = 90000
16/01/17 02:10:21 INFO Configuration.deprecation: hadoop.native.lib is
deprecated. Instead, use io.native.lib.available
16/01/17 02:10:21 WARN store.HBaseStore: Mismatching schema's names.
Mappingfile schema: 'webpage'. PersistentClass schema's name:
'1_webpage'Assuming they are the same.
16/01/17 02:10:21 INFO zookeeper.RecoverableZooKeeper: Process
identifier=catalogtracker-on-hconnection-0x60a2630a connecting to ZooKeeper
ensemble=localhost:2181
16/01/17 02:10:21 INFO zookeeper.ZooKeeper: Initiating client connection,
connectString=localhost:2181 sessionTimeout=90000
watcher=catalogtracker-on-hconnection-0x60a2630a, quorum=localhost:2181,
baseZNode=/hbase
16/01/17 02:10:21 INFO zookeeper.ClientCnxn: Opening socket connection to
server localhost/127.0.0.1:2181. Will not attempt to authenticate using
SASL (unknown error)
16/01/17 02:10:21 INFO zookeeper.ClientCnxn: Socket connection established
to localhost/127.0.0.1:2181, initiating session
16/01/17 02:10:21 INFO zookeeper.ClientCnxn: Session establishment
complete on server localhost/127.0.0.1:2181, sessionid =
0x152495dedd00144, negotiated timeout = 90000
16/01/17 02:10:23 INFO zookeeper.ZooKeeper: Session: 0x152495dedd00144
closed
16/01/17 02:10:23 INFO zookeeper.ClientCnxn: EventThread shut down
16/01/17 02:10:23 WARN store.HBaseStore: Mismatching schema's names.
Mappingfile schema: 'webpage'. PersistentClass schema's name:
'1_webpage'Assuming they are the same.
16/01/17 02:10:23 INFO zookeeper.RecoverableZooKeeper: Process
identifier=catalogtracker-on-hconnection-0x60a2630a connecting to ZooKeeper
ensemble=localhost:2181
16/01/17 02:10:23 INFO zookeeper.ZooKeeper: Initiating client connection,
connectString=localhost:2181 sessionTimeout=90000
watcher=catalogtracker-on-hconnection-0x60a2630a, quorum=localhost:2181,
baseZNode=/hbase
16/01/17 02:10:23 INFO zookeeper.ClientCnxn: Opening socket connection to
server localhost/127.0.0.1:2181. Will not attempt to authenticate using
SASL (unknown error)
16/01/17 02:10:23 INFO zookeeper.ClientCnxn: Socket connection established
to localhost/127.0.0.1:2181, initiating session
16/01/17 02:10:23 INFO zookeeper.ClientCnxn: Session establishment
complete on server localhost/127.0.0.1:2181, sessionid =
0x152495dedd00145, negotiated timeout = 90000
16/01/17 02:10:23 INFO zookeeper.ZooKeeper: Session: 0x152495dedd00145
closed
16/01/17 02:10:23 INFO zookeeper.ClientCnxn: EventThread shut down
16/01/17 02:10:23 INFO client.RMProxy: Connecting to ResourceManager at /
0.0.0.0:8032
16/01/17 02:10:27 WARN store.HBaseStore: Mismatching schema's names.
Mappingfile schema: 'webpage'. PersistentClass schema's name:
'1_webpage'Assuming they are the same.
16/01/17 02:10:27 INFO zookeeper.RecoverableZooKeeper: Process
identifier=catalogtracker-on-hconnection-0x60a2630a connecting to ZooKeeper
ensemble=localhost:2181
16/01/17 02:10:27 INFO zookeeper.ZooKeeper: Initiating client connection,
connectString=localhost:2181 sessionTimeout=90000
watcher=catalogtracker-on-hconnection-0x60a2630a, quorum=localhost:2181,
baseZNode=/hbase
16/01/17 02:10:27 INFO zookeeper.ClientCnxn: Opening socket connection to
server localhost/127.0.0.1:2181. Will not attempt to authenticate using
SASL (unknown error)
16/01/17 02:10:27 INFO zookeeper.ClientCnxn: Socket connection established
to localhost/127.0.0.1:2181, initiating session
16/01/17 02:10:27 INFO zookeeper.ClientCnxn: Session establishment
complete on server localhost/127.0.0.1:2181, sessionid =
0x152495dedd00146, negotiated timeout = 90000
16/01/17 02:10:27 INFO zookeeper.ZooKeeper: Session: 0x152495dedd00146
closed
16/01/17 02:10:27 INFO zookeeper.ClientCnxn: EventThread shut down
16/01/17 02:10:27 INFO mapreduce.JobSubmitter: number of splits:2
16/01/17 02:10:27 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1452929501009_0024
16/01/17 02:10:28 INFO impl.YarnClientImpl: Submitted application
application_1452929501009_0024
16/01/17 02:10:28 INFO mapreduce.Job: The url to track the job:
http://cism479:8088/proxy/application_1452929501009_0024/
16/01/17 02:10:28 INFO mapreduce.Job: Running job: job_1452929501009_0024
16/01/17 02:10:39 INFO mapreduce.Job: Job job_1452929501009_0024 running
in uber mode : false
16/01/17 02:10:39 INFO mapreduce.Job:  map 0% reduce 0%
16/01/17 02:11:37 INFO mapreduce.Job: Task Id :
attempt_1452929501009_0024_m_000000_0, Status : FAILED
Error: java.net.MalformedURLException: For input string: "#10;from <a
href="https:"
     at java.net.URL.<init>(URL.java:620)
     at java.net.URL.<init>(URL.java:483)
     at java.net.URL.<init>(URL.java:432)
     at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
     at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
     at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:422)
     at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.NumberFormatException: For input string: "#10;from <a
href="https:"
     at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
     at java.lang.Integer.parseInt(Integer.java:569)
     at java.lang.Integer.parseInt(Integer.java:615)
     at java.net.URLStreamHandler.parseURL(URLStreamHandler.java:216)
     at java.net.URL.<init>(URL.java:615)
     ... 13 more

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

16/01/17 02:12:13 INFO mapreduce.Job:  map 33% reduce 0%
16/01/17 02:12:24 INFO mapreduce.Job:  map 50% reduce 0%
16/01/17 02:12:44 INFO mapreduce.Job: Task Id :
attempt_1452929501009_0024_m_000000_1, Status : FAILED
Error: java.net.MalformedURLException: For input string: "#10;from <a
href="https:"
     at java.net.URL.<init>(URL.java:620)
     at java.net.URL.<init>(URL.java:483)
     at java.net.URL.<init>(URL.java:432)
     at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
     at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
     at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:422)
     at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.NumberFormatException: For input string: "#10;from <a
href="https:"
     at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
     at java.lang.Integer.parseInt(Integer.java:569)
     at java.lang.Integer.parseInt(Integer.java:615)
     at java.net.URLStreamHandler.parseURL(URLStreamHandler.java:216)
     at java.net.URL.<init>(URL.java:615)
     ... 13 more

16/01/17 02:13:19 INFO mapreduce.Job: Task Id :
attempt_1452929501009_0024_m_000000_2, Status : FAILED
Error: java.net.MalformedURLException: For input string: "#10;from <a
href="https:"
     at java.net.URL.<init>(URL.java:620)
     at java.net.URL.<init>(URL.java:483)
     at java.net.URL.<init>(URL.java:432)
     at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
     at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
     at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:422)
     at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.NumberFormatException: For input string: "#10;from <a
href="https:"
     at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
     at java.lang.Integer.parseInt(Integer.java:569)
     at java.lang.Integer.parseInt(Integer.java:615)
     at java.net.URLStreamHandler.parseURL(URLStreamHandler.java:216)
     at java.net.URL.<init>(URL.java:615)
     ... 13 more

16/01/17 02:13:42 INFO mapreduce.Job:  map 100% reduce 100%
16/01/17 02:13:43 INFO mapreduce.Job: Job job_1452929501009_0024 failed
with state FAILED due to: Task failed task_1452929501009_0024_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0

16/01/17 02:13:44 INFO mapreduce.Job: Counters: 34
     File System Counters
         FILE: Number of bytes read=0
         FILE: Number of bytes written=49949067
         FILE: Number of read operations=0
         FILE: Number of large read operations=0
         FILE: Number of write operations=0
         HDFS: Number of bytes read=1193
         HDFS: Number of bytes written=0
         HDFS: Number of read operations=1
         HDFS: Number of large read operations=0
         HDFS: Number of write operations=0
     Job Counters
         Failed map tasks=4
         Launched map tasks=5
         Other local map tasks=3
         Data-local map tasks=2
         Total time spent by all maps in occupied slots (ms)=829677
         Total time spent by all reduces in occupied slots (ms)=0
         Total time spent by all map tasks (ms)=276559
         Total vcore-seconds taken by all map tasks=276559
         Total megabyte-seconds taken by all map tasks=849589248
     Map-Reduce Framework
         Map input records=30201
         Map output records=1164348
         Map output bytes=250659088
         Map output materialized bytes=49832245
         Input split bytes=1193
         Combine input records=0
         Spilled Records=1164348
         Failed Shuffles=0
         Merged Map outputs=0
         GC time elapsed (ms)=3541
         CPU time spent (ms)=42980
         Physical memory (bytes) snapshot=2062766080
         Virtual memory (bytes) snapshot=5086490624
         Total committed heap usage (bytes)=2127036416
     File Input Format Counters
         Bytes Read=0
Exception in thread "main" java.lang.RuntimeException: job failed:
name=[1]update-table, jobid=job_1452929501009_0024
     at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
     at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:111)
     at
org.apache.nutch.crawl.DbUpdaterJob.updateTable(DbUpdaterJob.java:140)
     at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:174)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
     at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:178)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
     at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:497)
     at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Error running:
/usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch
updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true 1452969522-27478 -crawlId 1
Failed with exit value 1.
****************************************************LOG END
************************************************************************************************

As the error makes pretty clear, the failure is caused by malformed URLs. So
is there a way to get rid of these malformed URLs? Or is there any
solution that could either skip or bypass such URLs, so that the
subsequent processes get executed?
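For example, would something along these lines help? Here is a quick sketch
(in Python, just to illustrate the pattern I have in mind; the sample URL is
made up) of a normalization rule that cuts a URL at the first whitespace,
quote, angle bracket, or stray HTML entity like the `&#10;` visible in the
stack trace above:

```python
import re

# Cut the URL at the first character that cannot legally continue it:
# whitespace, quotes, angle brackets, or the start of an HTML entity
# such as &#10; (an encoded newline that leaked into an extracted link).
CUT = re.compile(r"""(\s|["'<>]|&#).*$""")

def normalize(url):
    """Return the URL truncated at the first illegal character, if any."""
    return CUT.sub("", url)

# A made-up example resembling the broken links in the log:
bad = 'https://example.com/page&#10;from <a href="https:'
print(normalize(bad))  # https://example.com/page
```

If that is the right idea, I guess the equivalent entry in regex-normalize.xml
would be a `<regex>` element with that pattern (with `<` and `>` escaped as
`&lt;` and `&gt;`) and an empty `<substitution/>` — but please correct me if
a regex-urlfilter.txt rule rejecting such URLs is the better place for this.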

Please advise.

Kshitij Shukla
Software developer
CIS

--

------------------------------

*Cyber Infrastructure (P) Limited, [CIS] **(CMMI Level 3 Certified)*

Central India's largest Technology company.

*Ensuring the success of our clients and partners through our highly
optimized Technology solutions.*

www.cisin.com | +Cisin <https://plus.google.com/+Cisin/> | Linkedin <
https://www.linkedin.com/company/cyber-infrastructure-private-limited> |
Offices: *Indore, India.* *Singapore. Silicon Valley, USA*.

DISCLAIMER:  INFORMATION PRIVACY is important for us, If you are not the
intended recipient, you should delete this message and are notified that
any disclosure, copying or distribution of this message, or taking any
action based on it, is strictly prohibited by Law.



--

Please let me know if you have any questions , concerns or updates.
Have a great day ahead :)

Thanks and Regards,

Kshitij Shukla
Software developer

*Cyber Infrastructure(CIS)
**/The RightSourcing Specialists with 1250 man years of experience!/*


Please don't print this e-mail unless you really need to.
