Hi,

Not everyone here knows Nutch 2.x, and I don't know anything about Gora. Some others do, but not everyone reads their email all day. Please be patient, and perhaps try the Apache Gora list as well.

Markus
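A note on the error in the quoted log: the exception is raised by HBase's `Mutation.checkRow`, and the 32767 in the message is the largest row key HBase accepts (`Short.MAX_VALUE` bytes). The sketch below is only an illustration and rests on an assumption the thread does not confirm: that the oversized row key comes from one very long URL. As far as is known Nutch 2.x stores URLs in a reversed-host form, so the raw URL length is only an approximation of the key length. The function name `oversized_urls` and the sample URLs are hypothetical.

```python
# Largest row key HBase accepts, in bytes (Short.MAX_VALUE); this is the
# limit printed in the exception quoted in this thread.
MAX_ROW_KEY_BYTES = 32767


def oversized_urls(urls, limit=MAX_ROW_KEY_BYTES):
    """Return (url, byte_length) pairs whose UTF-8 encoding exceeds limit.

    Assumption: the row key is roughly as long as the URL itself.
    """
    flagged = []
    for url in urls:
        cleaned = url.strip()
        size = len(cleaned.encode("utf-8"))
        if size > limit:
            flagged.append((cleaned, size))
    return flagged


if __name__ == "__main__":
    # Hypothetical input: one ordinary URL and one with a huge query string.
    sample = [
        "http://example.com/",
        "http://example.com/?q=" + "a" * 41199,
    ]
    for url, size in oversized_urls(sample):
        # Print only the start of the URL so the output stays readable.
        print(f"{size} bytes: {url[:60]}...")
```

If a scan like this does turn up such URLs among the seeds or extracted outlinks, dropping them with a URL filter before they reach the update step would be one thing to try; the Gora list can say whether that matches how the HBase store builds its keys.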
-----Original message-----
> From: Kshitij Shukla <[email protected]>
> Sent: Thursday 21st January 2016 13:45
> To: [email protected]
> Subject: [CIS-CMMI-3] Re: IllegalArgumentException: Row length 41221 is > 32767
>
> So, no one is willing to help/guide me through the error ?
>
> On Wednesday 20 January 2016 12:24 PM, Kshitij Shukla wrote:
> > Hello everyone,
> >
> > I have added a set of seeds to crawl using this command
> > *
> > ./bin/crawl /largeSeeds 1 http://localhost:8983/solr/ddcd 4*
> >
> > For first iteration all of the commands(*inject, **generate, **fetch,
> > **parse, **update-table, **Indexer & delete duplicates.*) got executed
> > successfully.
> > For second iteration, *"CrawlDB update" *command got failed (please
> > see error log for reference), because of failure of this command the
> > whole process gets terminated.
> >
> >
> > ****************************************************LOG
> > START************************************************************************************************
> > 16/01/20 02:45:19 INFO parse.ParserJob: ParserJob: finished at
> > 2016-01-20 02:45:19, time elapsed: 00:06:57
> > CrawlDB update for 1
> > /usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch
> > updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m
> > -D mapred.reduce.tasks.speculative.execution=false -D
> > mapred.map.tasks.speculative.execution=false -D
> > mapred.compress.map.output=true 1453230757-13191 -crawlId 1
> > 16/01/20 02:45:27 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting at
> > 2016-01-20 02:45:27
> > 16/01/20 02:45:27 INFO crawl.DbUpdaterJob: DbUpdaterJob: batchId:
> > 1453230757-13191
> > 16/01/20 02:45:27 INFO plugin.PluginRepository: Plugins: looking in:
> > /tmp/hadoop-root/hadoop-unjar5654418190157422003/classes/plugins
> > 16/01/20 02:45:28 INFO plugin.PluginRepository: Plugin Auto-activation
> > mode: [true]
> > 16/01/20 02:45:28 INFO plugin.PluginRepository: Registered Plugins:
> > 16/01/20
02:45:28 INFO plugin.PluginRepository: HTTP Framework > > (lib-http) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Html Parse Plug-in > > (parse-html) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: MetaTags > > (parse-metatags) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: the nutch core > > extension points (nutch-extensionpoints) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Basic Indexing > > Filter (index-basic) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: XML Libraries > > (lib-xml) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Anchor Indexing > > Filter (index-anchor) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Basic URL > > Normalizer (urlnormalizer-basic) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Language > > Identification Parser/Filter (language-identifier) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Metadata Indexing > > Filter (index-metadata) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: CyberNeko HTML > > Parser (lib-nekohtml) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Subcollection > > indexing and query filter (subcollection) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: SOLRIndexWriter > > (indexer-solr) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Rel-Tag > > microformat Parser/Indexer/Querier (microformats-reltag) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Http / Https > > Protocol Plug-in (protocol-httpclient) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: JavaScript Parser > > (parse-js) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Tika Parser > > Plug-in (parse-tika) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Top Level Domain > > Plugin (tld) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Regex URL Filter > > Framework (lib-regex-filter) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Regex URL > > Normalizer (urlnormalizer-regex) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Link Analysis > > Scoring Plug-in 
(scoring-link) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: OPIC Scoring > > Plug-in (scoring-opic) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: More Indexing > > Filter (index-more) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Http Protocol > > Plug-in (protocol-http) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Creative Commons > > Plugins (creativecommons) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Registered > > Extension-Points: > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Parse Filter > > (org.apache.nutch.parse.ParseFilter) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Nutch Index > > Cleaning Filter (org.apache.nutch.indexer.IndexCleaningFilter) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Nutch Content > > Parser (org.apache.nutch.parse.Parser) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Nutch URL Filter > > (org.apache.nutch.net.URLFilter) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Nutch Scoring > > (org.apache.nutch.scoring.ScoringFilter) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Nutch URL > > Normalizer (org.apache.nutch.net.URLNormalizer) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Nutch Protocol > > (org.apache.nutch.protocol.Protocol) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Nutch Index Writer > > (org.apache.nutch.indexer.IndexWriter) > > 16/01/20 02:45:28 INFO plugin.PluginRepository: Nutch Indexing > > Filter (org.apache.nutch.indexer.IndexingFilter) > > 16/01/20 02:45:29 INFO Configuration.deprecation: > > mapred.map.tasks.speculative.execution is deprecated. Instead, use > > mapreduce.map.speculative > > 16/01/20 02:45:29 INFO Configuration.deprecation: > > mapred.reduce.tasks.speculative.execution is deprecated. Instead, use > > mapreduce.reduce.speculative > > 16/01/20 02:45:29 INFO Configuration.deprecation: > > mapred.compress.map.output is deprecated. 
Instead, use > > mapreduce.map.output.compress > > 16/01/20 02:45:29 INFO Configuration.deprecation: mapred.reduce.tasks > > is deprecated. Instead, use mapreduce.job.reduces > > 16/01/20 02:45:29 INFO zookeeper.RecoverableZooKeeper: Process > > identifier=hconnection-0x60a2630a connecting to ZooKeeper > > ensemble=localhost:2181 > > 16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client > > environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT > > 16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client > > environment:host.name=cism479 > > 16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client > > environment:java.version=1.8.0_65 > > 16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client > > environment:java.vendor=Oracle Corporation > > 16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client > > environment:java.home=/usr/lib/jvm/jdk1.8.0_65/jre > > 16/01/20 02:45:35 INFO zookeeper.ClientCnxn: EventThread shut down > > 16/01/20 02:45:35 INFO mapreduce.JobSubmitter: number of splits:2 > > 16/01/20 02:45:36 INFO mapreduce.JobSubmitter: Submitting tokens for > > job: job_1453210838763_0011 > > 16/01/20 02:45:36 INFO impl.YarnClientImpl: Submitted application > > application_1453210838763_0011 > > 16/01/20 02:45:36 INFO mapreduce.Job: The url to track the job: > > http://cism479:8088/proxy/application_1453210838763_0011/ > > 16/01/20 02:45:36 INFO mapreduce.Job: Running job: job_1453210838763_0011 > > 16/01/20 02:45:48 INFO mapreduce.Job: Job job_1453210838763_0011 > > running in uber mode : false > > 16/01/20 02:45:48 INFO mapreduce.Job: map 0% reduce 0% > > 16/01/20 02:47:31 INFO mapreduce.Job: map 33% reduce 0% > > 16/01/20 02:47:47 INFO mapreduce.Job: map 50% reduce 0% > > 16/01/20 02:48:08 INFO mapreduce.Job: map 83% reduce 0% > > 16/01/20 02:48:16 INFO mapreduce.Job: map 100% reduce 0% > > 16/01/20 02:48:31 INFO mapreduce.Job: map 100% reduce 31% > > 16/01/20 02:48:34 INFO mapreduce.Job: map 100% reduce 33% > > 16/01/20 02:50:30 INFO mapreduce.Job: map 100% 
reduce 34% > > 16/01/20 03:01:18 INFO mapreduce.Job: map 100% reduce 35% > > 16/01/20 03:11:58 INFO mapreduce.Job: map 100% reduce 36% > > 16/01/20 03:22:50 INFO mapreduce.Job: map 100% reduce 37% > > 16/01/20 03:24:22 INFO mapreduce.Job: map 100% reduce 50% > > 16/01/20 03:24:35 INFO mapreduce.Job: map 100% reduce 82% > > 16/01/20 03:24:38 INFO mapreduce.Job: map 100% reduce 83% > > 16/01/20 03:26:33 INFO mapreduce.Job: map 100% reduce 84% > > 16/01/20 03:37:35 INFO mapreduce.Job: map 100% reduce 85% > > 16/01/20 03:39:38 INFO mapreduce.Job: Task Id : > > attempt_1453210838763_0011_r_000001_0, Status : FAILED > > *Error: java.lang.IllegalArgumentException: Row length 41221 is > 32767* > > at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:506) > > at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:487) > > at org.apache.hadoop.hbase.client.Get.<init>(Get.java:89) > > at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:208) > > at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:79) > > at > > org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:156) > > at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:56) > > at > > org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:114) > > at > > org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:42) > > at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171) > > at > > org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627) > > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389) > > at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168) > > at java.security.AccessController.doPrivileged(Native Method) > > at javax.security.auth.Subject.doAs(Subject.java:422) > > at > > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) > > at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163) > > > > 16/01/20 03:39:39 INFO mapreduce.Job: map 100% reduce 
50% > > 16/01/20 03:39:52 INFO mapreduce.Job: map 100% reduce 82% > > 16/01/20 03:39:55 INFO mapreduce.Job: map 100% reduce 83% > > 16/01/20 03:41:56 INFO mapreduce.Job: map 100% reduce 84% > > 16/01/20 03:53:39 INFO mapreduce.Job: map 100% reduce 85% > > 16/01/20 03:55:49 INFO mapreduce.Job: Task Id : > > attempt_1453210838763_0011_r_000001_1, Status : FAILED > > *Error: java.lang.IllegalArgumentException: Row length 41221 is > 32767* > > at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:506) > > at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:487) > > at org.apache.hadoop.hbase.client.Get.<init>(Get.java:89) > > at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:208) > > at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:79) > > at > > org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:156) > > at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:56) > > at > > org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:114) > > at > > org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:42) > > at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171) > > at > > org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627) > > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389) > > at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168) > > at java.security.AccessController.doPrivileged(Native Method) > > at javax.security.auth.Subject.doAs(Subject.java:422) > > at > > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) > > at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163) > > > > 16/01/20 03:55:50 INFO mapreduce.Job: map 100% reduce 50% > > 16/01/20 03:56:01 INFO mapreduce.Job: map 100% reduce 83% > > 16/01/20 03:58:02 INFO mapreduce.Job: map 100% reduce 84% > > 16/01/20 04:10:09 INFO mapreduce.Job: map 100% reduce 85% > > 16/01/20 04:12:33 INFO mapreduce.Job: Task Id : > > 
attempt_1453210838763_0011_r_000001_2, Status : FAILED > > *Error: java.lang.IllegalArgumentException: Row length 41221 is > 32767* > > at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:506) > > at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:487) > > at org.apache.hadoop.hbase.client.Get.<init>(Get.java:89) > > at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:208) > > at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:79) > > at > > org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:156) > > at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:56) > > at > > org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:114) > > at > > org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:42) > > at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171) > > at > > org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627) > > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389) > > at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168) > > at java.security.AccessController.doPrivileged(Native Method) > > at javax.security.auth.Subject.doAs(Subject.java:422) > > at > > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) > > at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163) > > > > 16/01/20 04:12:34 INFO mapreduce.Job: map 100% reduce 50% > > 16/01/20 04:12:45 INFO mapreduce.Job: map 100% reduce 82% > > 16/01/20 04:12:48 INFO mapreduce.Job: map 100% reduce 83% > > 16/01/20 04:14:46 INFO mapreduce.Job: map 100% reduce 84% > > 16/01/20 04:26:53 INFO mapreduce.Job: map 100% reduce 85% > > 16/01/20 04:29:09 INFO mapreduce.Job: map 100% reduce 100% > > 16/01/20 04:29:10 INFO mapreduce.Job: Job job_1453210838763_0011 > > failed with state FAILED due to: Task failed > > task_1453210838763_0011_r_000001 > > Job failed as tasks failed. 
failedMaps:0 failedReduces:1 > > > > 16/01/20 04:29:11 INFO mapreduce.Job: Counters: 50 > > File System Counters > > FILE: Number of bytes read=38378343 > > FILE: Number of bytes written=115957636 > > FILE: Number of read operations=0 > > FILE: Number of large read operations=0 > > FILE: Number of write operations=0 > > HDFS: Number of bytes read=2382 > > HDFS: Number of bytes written=0 > > HDFS: Number of read operations=2 > > HDFS: Number of large read operations=0 > > HDFS: Number of write operations=0 > > Job Counters > > Failed reduce tasks=4 > > Launched map tasks=2 > > Launched reduce tasks=5 > > Data-local map tasks=2 > > Total time spent by all maps in occupied slots (ms)=789909 > > Total time spent by all reduces in occupied slots (ms)=30215090 > > Total time spent by all map tasks (ms)=263303 > > Total time spent by all reduce tasks (ms)=6043018 > > Total vcore-seconds taken by all map tasks=263303 > > Total vcore-seconds taken by all reduce tasks=6043018 > > Total megabyte-seconds taken by all map tasks=808866816 > > Total megabyte-seconds taken by all reduce tasks=30940252160 > > Map-Reduce Framework > > Map input records=49929 > > Map output records=1777904 > > Map output bytes=382773368 > > Map output materialized bytes=77228942 > > Input split bytes=2382 > > Combine input records=0 > > Combine output records=0 > > Reduce input groups=754170 > > Reduce shuffle bytes=38318183 > > Reduce input records=881156 > > Reduce output records=754170 > > Spilled Records=2659060 > > Shuffled Maps =2 > > Failed Shuffles=0 > > Merged Map outputs=2 > > GC time elapsed (ms)=17993 > > CPU time spent (ms)=819690 > > Physical memory (bytes) snapshot=4080136192 > > Virtual memory (bytes) snapshot=15234293760 > > Total committed heap usage (bytes)=4149739520 > > Shuffle Errors > > BAD_ID=0 > > CONNECTION=0 > > IO_ERROR=0 > > WRONG_LENGTH=0 > > WRONG_MAP=0 > > WRONG_REDUCE=0 > > File Input Format Counters > > Bytes Read=0 > > File Output Format Counters > > Bytes Written=0 
> > Exception in thread "main" java.lang.RuntimeException: job failed: > > name=[1]update-table, jobid=job_1453210838763_0011 > > at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120) > > at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:111) > > at > > org.apache.nutch.crawl.DbUpdaterJob.updateTable(DbUpdaterJob.java:140) > > at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:174) > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > > at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:178) > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > > at > > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > > at > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > > at org.apache.hadoop.util.RunJar.main(RunJar.java:212) > > Error running: > > /usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch > > updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m > > -D mapred.reduce.tasks.speculative.execution=false -D > > mapred.map.tasks.speculative.execution=false -D > > mapred.compress.map.output=true 1453230757-13191 -crawlId 1 > > Failed with exit value 1. > > ****************************************************LOG > > END************************************************************************************************ > > > > -- > > > > Please let me know if you have any questions , concerns or updates. 
> > Have a great day ahead :) > > > > Thanks and Regards, > > > > Kshitij Shukla > > Software developer > > > > *Cyber Infrastructure(CIS) > > **/The RightSourcing Specialists with 1250 man years of experience!/* > > > > DISCLAIMER: INFORMATION PRIVACY is important for us, If you are not > > the intended recipient, you should delete this message and are > > notified that any disclosure, copying or distribution of this message, > > or taking any action based on it, is strictly prohibited by Law. > > > > Please don't print this e-mail unless you really need to.

