Please use a URL normalization regex (the urlnormalizer-regex plugin) and your problem should be solved: a rule that strips the stray whitespace/entity fragments from extracted links means updatedb never sees the malformed URLs.

On Mon, Jan 18, 2016 at 1:34 PM, Kshitij Shukla <[email protected]> wrote:
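For reference, a minimal sketch of what such rules could look like in conf/regex-normalize.xml (read by the urlnormalizer-regex plugin, which must be enabled via plugin.includes). The two patterns below are assumptions tailored to the "#10;" artifact in the log that follows — adjust them to the malformed URLs you actually see:

```xml
<?xml version="1.0"?>
<!-- conf/regex-normalize.xml: rules are applied in order to every URL -->
<regex-normalize>
  <!-- Assumption: drop literal whitespace embedded in extracted URLs -->
  <regex>
    <pattern>\s+</pattern>
    <substitution></substitution>
  </regex>
  <!-- Assumption: drop leftover numeric HTML entities such as &#10; (newline) -->
  <regex>
    <pattern>&amp;#\d+;</pattern>
    <substitution></substitution>
  </regex>
</regex-normalize>
```

Alternatively (or additionally), URLs that still look broken after normalization can be rejected outright in conf/regex-urlfilter.txt — for example a line like `-[<>"]` (hypothetical pattern, same syntax as the default `-[?*!@=]` rule) excludes any URL containing angle brackets or quotes, so updatedb skips such links instead of aborting.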
> Hello everyone,
>
> I have added a set of seeds to crawl using this command:
>
>     ./bin/crawl /largeSeeds 1 http://localhost:8983/solr/ddcd 4
>
> For the first iteration, all of the steps (inject, generate, fetch,
> parse, update-table, indexer & delete duplicates) executed successfully.
> In the second iteration, the "update-table" step failed (please see the
> error log below for reference); because of this failure the whole
> process terminates.
>
> **************************************** LOG START ****************************************
> CrawlDB update for 1
> /usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1452969522-27478 -crawlId 1
> 16/01/17 02:10:17 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting at 2016-01-17 02:10:17
> 16/01/17 02:10:17 INFO crawl.DbUpdaterJob: DbUpdaterJob: batchId: 1452969522-27478
> 16/01/17 02:10:17 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-root/hadoop-unjar3649584948711945520/classes/plugins
> 16/01/17 02:10:18 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
> 16/01/17 02:10:18 INFO plugin.PluginRepository: Registered Plugins:
> 16/01/17 02:10:18 INFO plugin.PluginRepository: Rel-Tag microformat Parser/Indexer/Querier (microformats-reltag)
> 16/01/17 02:10:18 INFO plugin.PluginRepository: HTTP Framework (lib-http)
> 16/01/17 02:10:18 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html)
> 16/01/17 02:10:18 INFO plugin.PluginRepository: MetaTags (parse-metatags)
> 16/01/17 02:10:18 INFO plugin.PluginRepository: Http / Https Protocol Plug-in (protocol-httpclient)
> 16/01/17 02:10:18 INFO plugin.PluginRepository: the nutch core extension points
(nutch-extensionpoints) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Basic Indexing Filter > (index-basic) > 16/01/17 02:10:18 INFO plugin.PluginRepository: XML Libraries (lib-xml) > 16/01/17 02:10:18 INFO plugin.PluginRepository: JavaScript Parser > (parse-js) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Anchor Indexing Filter > (index-anchor) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Tika Parser Plug-in > (parse-tika) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Top Level Domain > Plugin (tld) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Language > Identification Parser/Filter (language-identifier) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Regex URL Filter > Framework (lib-regex-filter) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Metadata Indexing > Filter (index-metadata) > 16/01/17 02:10:18 INFO plugin.PluginRepository: CyberNeko HTML Parser > (lib-nekohtml) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Subcollection indexing > and query filter (subcollection) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Link Analysis Scoring > Plug-in (scoring-link) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Pass-through URL > Normalizer (urlnormalizer-pass) > 16/01/17 02:10:18 INFO plugin.PluginRepository: OPIC Scoring Plug-in > (scoring-opic) > 16/01/17 02:10:18 INFO plugin.PluginRepository: More Indexing Filter > (index-more) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Http Protocol Plug-in > (protocol-http) > 16/01/17 02:10:18 INFO plugin.PluginRepository: SOLRIndexWriter > (indexer-solr) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Creative Commons > Plugins (creativecommons) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Registered > Extension-Points: > 16/01/17 02:10:18 INFO plugin.PluginRepository: Parse Filter > (org.apache.nutch.parse.ParseFilter) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch Index Cleaning > Filter (org.apache.nutch.indexer.IndexCleaningFilter) > 16/01/17 02:10:18 INFO 
plugin.PluginRepository: Nutch Content Parser > (org.apache.nutch.parse.Parser) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch URL Filter ( > org.apache.nutch.net.URLFilter) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch Scoring > (org.apache.nutch.scoring.ScoringFilter) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch URL Normalizer ( > org.apache.nutch.net.URLNormalizer) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch Protocol > (org.apache.nutch.protocol.Protocol) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch Index Writer > (org.apache.nutch.indexer.IndexWriter) > 16/01/17 02:10:18 INFO plugin.PluginRepository: Nutch Indexing Filter > (org.apache.nutch.indexer.IndexingFilter) > 16/01/17 02:10:19 INFO Configuration.deprecation: > mapred.map.tasks.speculative.execution is deprecated. Instead, use > mapreduce.map.speculative > 16/01/17 02:10:19 INFO Configuration.deprecation: > mapred.reduce.tasks.speculative.execution is deprecated. Instead, use > mapreduce.reduce.speculative > 16/01/17 02:10:19 INFO Configuration.deprecation: > mapred.compress.map.output is deprecated. Instead, use > mapreduce.map.output.compress > 16/01/17 02:10:19 INFO Configuration.deprecation: mapred.reduce.tasks is > deprecated. 
Instead, use mapreduce.job.reduces > 16/01/17 02:10:19 INFO zookeeper.RecoverableZooKeeper: Process > identifier=hconnection-0x60a2630a connecting to ZooKeeper > ensemble=localhost:2181 > 16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client > environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT > 16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client environment:host.name > =cism479 > 16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client > environment:java.version=1.8.0_65 > 16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client > environment:java.vendor=Oracle Corporation > 16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client > environment:java.home=/usr/lib/jvm/jdk1.8.0_65/jre > 16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client > environment:java.class.path=/usr/share/searchEngine/hadoop-2.5.2/conf:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-configuration-1.6.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jasper-compiler-5.5.23.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/activation-1.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jsp-api-2.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-io-2.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/paranamer-2.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/httpclient-4.2.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/log4j-1.2.17.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jets3t-0.9.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/zookeeper-3.4.6.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jsr305-1.3.9.jar:/usr/share/searchEng
ine/hadoop-2.5.2/share/hadoop/common/lib/hadoop-auth-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/java-xmlbuilder-0.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-el-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jettison-1.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jersey-server-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/avro-1.7.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-codec-1.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-cli-1.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-net-3.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jetty-util-6.1.26.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/netty-3.6.2.Final.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-digester-1.8.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/hadoop-annotations-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/guava-11.0.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-compress-1.4.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jsch-0.1.42.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jersey-core-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/api-util-1.0.0-M20.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/api-asn1-api-1.0.0-M20.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/xz-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-httpclient-3.1.jar:/usr/share/searchEngine/had
oop-2.5.2/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/stax-api-1.0-2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/asm-3.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jackson-xc-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-logging-1.1.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jersey-json-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/slf4j-api-1.7.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-collections-3.2.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-math3-3.1.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jetty-6.1.26.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/snappy-java-1.0.4.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/hamcrest-core-1.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jaxb-api-2.2.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/commons-lang-2.6.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/junit-4.11.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/mockito-all-1.8.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/servlet-api-2.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/httpcore-4.2.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/xmlenc-0.52.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/lib/jasper-runtime-5.5.23.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/hadoop-common-2.5.2-tests.jar:/usr/share/searchEngine/hadoop
-2.5.2/share/hadoop/common/hadoop-nfs-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/common/hadoop-common-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jsp-api-2.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-io-2.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/log4j-1.2.17.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-daemon-1.0.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jsr305-1.3.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-el-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jersey-server-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jackson-mapper-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-codec-1.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-cli-1.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jetty-util-6.1.26.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/protobuf-java-2.5.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/netty-3.6.2.Final.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/guava-11.0.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jersey-core-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/asm-3.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-logging-1.1.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jetty-6.1.26.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/commons-lang-2.6.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/servlet-api-2.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/xmlenc-0.52.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/lib/jasper
-runtime-5.5.23.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/hadoop-hdfs-2.5.2-tests.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/hadoop-hdfs-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/hdfs/hadoop-hdfs-nfs-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jline-0.9.94.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/activation-1.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jackson-jaxrs-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jaxb-impl-2.2.3-1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-io-2.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/log4j-1.2.17.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/zookeeper-3.4.6.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jsr305-1.3.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jettison-1.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jersey-server-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jackson-mapper-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-codec-1.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-cli-1.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jetty-util-6.1.26.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/protobuf-java-2.5.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jersey-client-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/aopalliance-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/netty-3.6.2.Final.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/guava-11.0.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-compress-1.4.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jersey-core-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/
jersey-guice-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/guice-3.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/xz-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-httpclient-3.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/stax-api-1.0-2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/asm-3.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jackson-xc-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-logging-1.1.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jersey-json-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/guice-servlet-3.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-collections-3.2.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jetty-6.1.26.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jaxb-api-2.2.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/commons-lang-2.6.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/leveldbjni-all-1.8.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/jackson-core-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/servlet-api-2.5.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/lib/javax.inject-1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-server-tests-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-api-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-server-common-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-server-web-proxy-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-
applications-unmanaged-am-launcher-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-client-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-server-applicationhistoryservice-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/yarn/hadoop-yarn-common-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/commons-io-2.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/paranamer-2.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/log4j-1.2.17.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/jersey-server-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/jackson-mapper-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/avro-1.7.4.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/protobuf-java-2.5.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/aopalliance-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/netty-3.6.2.Final.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/hadoop-annotations-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/commons-compress-1.4.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/jersey-core-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/jersey-guice-1.9.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/guice-3.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/xz-1.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/asm-3.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/guice-servlet-3.0.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/snappy-java-1.0.4.1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduc
e/lib/hamcrest-core-1.3.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/junit-4.11.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/leveldbjni-all-1.8.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/jackson-core-asl-1.9.13.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/lib/javax.inject-1.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-shuffle-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.5.2-tests.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-app-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-plugins-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.5.2.jar:/usr/share/searchEngine/hadoop-2.5.2/contrib/capacity-scheduler/*.jar:/usr/share/searchEngine/hbase-0.98.8-hadoop2/lib/*.jar:/usr/share/searchEngine/hbase-0.98.8-hadoop2/conf > 16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client > environment:java.library.path=/usr/share/searchEngine/hadoop-2.5.2/lib/native > 16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client > environment:java.io.tmpdir=/tmp > 16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client > environment:java.compiler=<NA> > 16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client environment:os.name > =Linux > 16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client > environment:os.arch=amd64 > 16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client > 
environment:os.version=3.16.0-30-generic > 16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client environment:user.name > =root > 16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client > environment:user.home=/root > 16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client > environment:user.dir=/usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy > 16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Initiating client connection, > connectString=localhost:2181 sessionTimeout=90000 > watcher=hconnection-0x60a2630a, quorum=localhost:2181, baseZNode=/hbase > 16/01/17 02:10:19 INFO zookeeper.ClientCnxn: Opening socket connection to > server localhost/127.0.0.1:2181. Will not attempt to authenticate using > SASL (unknown error) > 16/01/17 02:10:19 INFO zookeeper.ClientCnxn: Socket connection established > to localhost/127.0.0.1:2181, initiating session > 16/01/17 02:10:19 INFO zookeeper.ClientCnxn: Session establishment > complete on server localhost/127.0.0.1:2181, sessionid = > 0x152495dedd00143, negotiated timeout = 90000 > 16/01/17 02:10:21 INFO Configuration.deprecation: hadoop.native.lib is > deprecated. Instead, use io.native.lib.available > 16/01/17 02:10:21 WARN store.HBaseStore: Mismatching schema's names. > Mappingfile schema: 'webpage'. PersistentClass schema's name: > '1_webpage'Assuming they are the same. > 16/01/17 02:10:21 INFO zookeeper.RecoverableZooKeeper: Process > identifier=catalogtracker-on-hconnection-0x60a2630a connecting to ZooKeeper > ensemble=localhost:2181 > 16/01/17 02:10:21 INFO zookeeper.ZooKeeper: Initiating client connection, > connectString=localhost:2181 sessionTimeout=90000 > watcher=catalogtracker-on-hconnection-0x60a2630a, quorum=localhost:2181, > baseZNode=/hbase > 16/01/17 02:10:21 INFO zookeeper.ClientCnxn: Opening socket connection to > server localhost/127.0.0.1:2181. 
Will not attempt to authenticate using > SASL (unknown error) > 16/01/17 02:10:21 INFO zookeeper.ClientCnxn: Socket connection established > to localhost/127.0.0.1:2181, initiating session > 16/01/17 02:10:21 INFO zookeeper.ClientCnxn: Session establishment > complete on server localhost/127.0.0.1:2181, sessionid = > 0x152495dedd00144, negotiated timeout = 90000 > 16/01/17 02:10:23 INFO zookeeper.ZooKeeper: Session: 0x152495dedd00144 > closed > 16/01/17 02:10:23 INFO zookeeper.ClientCnxn: EventThread shut down > 16/01/17 02:10:23 WARN store.HBaseStore: Mismatching schema's names. > Mappingfile schema: 'webpage'. PersistentClass schema's name: > '1_webpage'Assuming they are the same. > 16/01/17 02:10:23 INFO zookeeper.RecoverableZooKeeper: Process > identifier=catalogtracker-on-hconnection-0x60a2630a connecting to ZooKeeper > ensemble=localhost:2181 > 16/01/17 02:10:23 INFO zookeeper.ZooKeeper: Initiating client connection, > connectString=localhost:2181 sessionTimeout=90000 > watcher=catalogtracker-on-hconnection-0x60a2630a, quorum=localhost:2181, > baseZNode=/hbase > 16/01/17 02:10:23 INFO zookeeper.ClientCnxn: Opening socket connection to > server localhost/127.0.0.1:2181. Will not attempt to authenticate using > SASL (unknown error) > 16/01/17 02:10:23 INFO zookeeper.ClientCnxn: Socket connection established > to localhost/127.0.0.1:2181, initiating session > 16/01/17 02:10:23 INFO zookeeper.ClientCnxn: Session establishment > complete on server localhost/127.0.0.1:2181, sessionid = > 0x152495dedd00145, negotiated timeout = 90000 > 16/01/17 02:10:23 INFO zookeeper.ZooKeeper: Session: 0x152495dedd00145 > closed > 16/01/17 02:10:23 INFO zookeeper.ClientCnxn: EventThread shut down > 16/01/17 02:10:23 INFO client.RMProxy: Connecting to ResourceManager at / > 0.0.0.0:8032 > 16/01/17 02:10:27 WARN store.HBaseStore: Mismatching schema's names. > Mappingfile schema: 'webpage'. PersistentClass schema's name: > '1_webpage'Assuming they are the same. 
> 16/01/17 02:10:27 INFO zookeeper.RecoverableZooKeeper: Process identifier=catalogtracker-on-hconnection-0x60a2630a connecting to ZooKeeper ensemble=localhost:2181
> 16/01/17 02:10:27 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=catalogtracker-on-hconnection-0x60a2630a, quorum=localhost:2181, baseZNode=/hbase
> 16/01/17 02:10:27 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
> 16/01/17 02:10:27 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
> 16/01/17 02:10:27 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x152495dedd00146, negotiated timeout = 90000
> 16/01/17 02:10:27 INFO zookeeper.ZooKeeper: Session: 0x152495dedd00146 closed
> 16/01/17 02:10:27 INFO zookeeper.ClientCnxn: EventThread shut down
> 16/01/17 02:10:27 INFO mapreduce.JobSubmitter: number of splits:2
> 16/01/17 02:10:27 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1452929501009_0024
> 16/01/17 02:10:28 INFO impl.YarnClientImpl: Submitted application application_1452929501009_0024
> 16/01/17 02:10:28 INFO mapreduce.Job: The url to track the job: http://cism479:8088/proxy/application_1452929501009_0024/
> 16/01/17 02:10:28 INFO mapreduce.Job: Running job: job_1452929501009_0024
> 16/01/17 02:10:39 INFO mapreduce.Job: Job job_1452929501009_0024 running in uber mode : false
> 16/01/17 02:10:39 INFO mapreduce.Job:  map 0% reduce 0%
> 16/01/17 02:11:37 INFO mapreduce.Job: Task Id : attempt_1452929501009_0024_m_000000_0, Status : FAILED
> Error: java.net.MalformedURLException: For input string: "#10;from <a href="https:"
>         at java.net.URL.<init>(URL.java:620)
>         at java.net.URL.<init>(URL.java:483)
>         at java.net.URL.<init>(URL.java:432)
>         at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
>         at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
>         at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> Caused by: java.lang.NumberFormatException: For input string: "#10;from <a href="https:"
>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>         at java.lang.Integer.parseInt(Integer.java:569)
>         at java.lang.Integer.parseInt(Integer.java:615)
>         at java.net.URLStreamHandler.parseURL(URLStreamHandler.java:216)
>         at java.net.URL.<init>(URL.java:615)
>         ... 13 more
>
> Container killed by the ApplicationMaster.
> Container killed on request.
Exit code is 143
> Container exited with a non-zero exit code 143
>
> 16/01/17 02:12:13 INFO mapreduce.Job:  map 33% reduce 0%
> 16/01/17 02:12:24 INFO mapreduce.Job:  map 50% reduce 0%
> 16/01/17 02:12:44 INFO mapreduce.Job: Task Id : attempt_1452929501009_0024_m_000000_1, Status : FAILED
> Error: java.net.MalformedURLException: For input string: "#10;from <a href="https:"
>         at java.net.URL.<init>(URL.java:620)
>         at java.net.URL.<init>(URL.java:483)
>         at java.net.URL.<init>(URL.java:432)
>         at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
>         at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
>         at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> Caused by: java.lang.NumberFormatException: For input string: "#10;from <a href="https:"
>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>         at java.lang.Integer.parseInt(Integer.java:569)
>         at java.lang.Integer.parseInt(Integer.java:615)
>         at java.net.URLStreamHandler.parseURL(URLStreamHandler.java:216)
>         at java.net.URL.<init>(URL.java:615)
>         ... 13 more
>
> 16/01/17 02:13:19 INFO mapreduce.Job: Task Id : attempt_1452929501009_0024_m_000000_2, Status : FAILED
> Error: java.net.MalformedURLException: For input string: "#10;from <a href="https:"
>         at java.net.URL.<init>(URL.java:620)
>         at java.net.URL.<init>(URL.java:483)
>         at java.net.URL.<init>(URL.java:432)
>         at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
>         at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
>         at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> Caused by: java.lang.NumberFormatException: For input string: "#10;from <a href="https:"
>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>         at java.lang.Integer.parseInt(Integer.java:569)
>         at java.lang.Integer.parseInt(Integer.java:615)
>         at java.net.URLStreamHandler.parseURL(URLStreamHandler.java:216)
>         at java.net.URL.<init>(URL.java:615)
>         ... 13 more
>
> 16/01/17 02:13:42 INFO mapreduce.Job:  map 100% reduce 100%
> 16/01/17 02:13:43 INFO mapreduce.Job: Job job_1452929501009_0024 failed with state FAILED due to: Task failed task_1452929501009_0024_m_000000
> Job failed as tasks failed.
failedMaps:1 failedReduces:0
>
> 16/01/17 02:13:44 INFO mapreduce.Job: Counters: 34
>         File System Counters
>                 FILE: Number of bytes read=0
>                 FILE: Number of bytes written=49949067
>                 FILE: Number of read operations=0
>                 FILE: Number of large read operations=0
>                 FILE: Number of write operations=0
>                 HDFS: Number of bytes read=1193
>                 HDFS: Number of bytes written=0
>                 HDFS: Number of read operations=1
>                 HDFS: Number of large read operations=0
>                 HDFS: Number of write operations=0
>         Job Counters
>                 Failed map tasks=4
>                 Launched map tasks=5
>                 Other local map tasks=3
>                 Data-local map tasks=2
>                 Total time spent by all maps in occupied slots (ms)=829677
>                 Total time spent by all reduces in occupied slots (ms)=0
>                 Total time spent by all map tasks (ms)=276559
>                 Total vcore-seconds taken by all map tasks=276559
>                 Total megabyte-seconds taken by all map tasks=849589248
>         Map-Reduce Framework
>                 Map input records=30201
>                 Map output records=1164348
>                 Map output bytes=250659088
>                 Map output materialized bytes=49832245
>                 Input split bytes=1193
>                 Combine input records=0
>                 Spilled Records=1164348
>                 Failed Shuffles=0
>                 Merged Map outputs=0
>                 GC time elapsed (ms)=3541
>                 CPU time spent (ms)=42980
>                 Physical memory (bytes) snapshot=2062766080
>                 Virtual memory (bytes) snapshot=5086490624
>                 Total committed heap usage (bytes)=2127036416
>         File Input Format Counters
>                 Bytes Read=0
> Exception in thread "main" java.lang.RuntimeException: job failed: name=[1]update-table, jobid=job_1452929501009_0024
>         at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
>         at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:111)
>         at org.apache.nutch.crawl.DbUpdaterJob.updateTable(DbUpdaterJob.java:140)
>         at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:174)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:178)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:497)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> Error running:
> /usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1452969522-27478 -crawlId 1
> Failed with exit value 1.
> **************************************** LOG END ****************************************
>
> As is pretty clear from the error, the failure is caused by malformed
> URLs. Is there a way to get rid of this kind of malformed URL? Or is
> there a solution that could either skip or bypass such URLs, so that
> the subsequent steps still get executed?
>
> Please advise.
>
> Kshitij Shukla
> Software developer
> CIS

