Re: readlinkdb fails to dump linkdb
On Wed, Dec 3, 2008 at 8:29 PM, Doğacan Güney [EMAIL PROTECTED] wrote:
> On Wed, Dec 3, 2008 at 8:55 PM, brainstorm [EMAIL PROTECTED] wrote:
>> Using nutch 0.9 (hadoop 0.17.1):
>>
>> [EMAIL PROTECTED] working]$ bin/nutch readlinkdb /home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt
>> LinkDb dump: starting
>> LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb
>
> It seems you are providing a crawldb as argument. You should pass the linkdb.

Thanks a lot for the hint, but I cannot find a linkdb directory anywhere on the HDFS :_/ Can you point me to where it should be?

>> java.io.IOException: Type mismatch in value from map: expected org.apache.nutch.crawl.Inlinks, recieved org.apache.nutch.crawl.CrawlDatum
>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:427)
>>         at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
>>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
>>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
>>
>> LinkDbReader: java.io.IOException: Job failed!
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
>>         at org.apache.nutch.crawl.LinkDbReader.processDumpJob(LinkDbReader.java:110)
>>         at org.apache.nutch.crawl.LinkDbReader.run(LinkDbReader.java:127)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>         at org.apache.nutch.crawl.LinkDbReader.main(LinkDbReader.java:114)
>>
>> This is the first time I use readlinkdb, and the rest of the crawling process is working OK. I've looked in JIRA and there's no related bug.
>>
>> I've also tried the latest trunk Nutch, but DFS is not working for me:
>>
>> [EMAIL PROTECTED] trunk]$ bin/hadoop dfs -ls
>> Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem
>>         at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:648)
>>         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1334)
>>         at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56)
>>         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1351)
>>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:213)
>>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:118)
>>         at org.apache.hadoop.fs.FsShell.init(FsShell.java:88)
>>         at org.apache.hadoop.fs.FsShell.run(FsShell.java:1698)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>         at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847)
>> Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem
>>         at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
>>         at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
>>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>>         at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
>>         at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
>>         at java.lang.Class.forName0(Native Method)
>>         at java.lang.Class.forName(Class.java:247)
>>         at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:628)
>>         at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:646)
>>         ... 10 more
>>
>> Should I file both bugs on JIRA?
>
> This I am not sure, but did you try ant clean; ant? It may be a version mismatch.

Yes, I did ant clean; ant before trying the above command.

I also tried to upgrade the filesystem, unsuccessfully, and even created it from scratch:
https://issues.apache.org/jira/browse/HADOOP-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650556#action_12650556

> --
> Doğacan Güney
Re: readlinkdb fails to dump linkdb
On Thu, Dec 4, 2008 at 11:33 AM, brainstorm [EMAIL PROTECTED] wrote:
> On Wed, Dec 3, 2008 at 8:29 PM, Doğacan Güney [EMAIL PROTECTED] wrote:
>> On Wed, Dec 3, 2008 at 8:55 PM, brainstorm [EMAIL PROTECTED] wrote:
>>> Using nutch 0.9 (hadoop 0.17.1):
>>>
>>> [EMAIL PROTECTED] working]$ bin/nutch readlinkdb /home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt
>>> LinkDb dump: starting
>>> LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb
>>
>> It seems you are providing a crawldb as argument. You should pass the linkdb.
>
> Thanks a lot for the hint, but I cannot find a linkdb directory anywhere on the HDFS :_/ Can you point me to where it should be?

A linkdb is created with the invertlinks command, e.g.:

bin/nutch invertlinks crawl/linkdb crawl/segments/

>>> java.io.IOException: Type mismatch in value from map: expected org.apache.nutch.crawl.Inlinks, recieved org.apache.nutch.crawl.CrawlDatum
>>> [...]
>>> LinkDbReader: java.io.IOException: Job failed!
>>> [...]
>>>
>>> This is the first time I use readlinkdb, and the rest of the crawling process is working OK. I've looked in JIRA and there's no related bug.
>>>
>>> I've also tried the latest trunk Nutch, but DFS is not working for me:
>>>
>>> [EMAIL PROTECTED] trunk]$ bin/hadoop dfs -ls
>>> Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem
>>> [...]
>>>
>>> Should I file both bugs on JIRA?
>>
>> This I am not sure, but did you try ant clean; ant? It may be a version mismatch.
>
> Yes, I did ant clean; ant before trying the above command.
>
> I also tried to upgrade the filesystem, unsuccessfully, and even created it from scratch:
> https://issues.apache.org/jira/browse/HADOOP-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650556#action_12650556

--
Doğacan Güney
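The two-step sequence suggested above can be sketched as a small shell helper. The crawl path is taken from the thread's example and is an assumption about your own layout; invertlinks and readlinkdb are the Nutch subcommands discussed in the thread:

```shell
# Sketch of the suggested fix: build the linkdb with invertlinks first,
# then point readlinkdb at the linkdb directory (not the crawldb).
# Wrapped in a function so the crawl path can be adjusted; the example
# path is illustrative only.
dump_linkdb() {
  crawl="$1"   # e.g. /home/hadoop/crawl-20081201
  # invertlinks creates $crawl/linkdb by inverting the outlinks
  # recorded in the fetched segments
  bin/nutch invertlinks "$crawl/linkdb" "$crawl/segments/"
  # readlinkdb expects the linkdb directory as its first argument
  bin/nutch readlinkdb "$crawl/linkdb" -dump crawled_urls.txt
}

# Usage (illustrative): dump_linkdb /home/hadoop/crawl-20081201
```

Running readlinkdb against the crawldb instead is what produces the "expected Inlinks, recieved CrawlDatum" type mismatch seen in the trace, since the crawldb stores CrawlDatum values rather than Inlinks.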
readlinkdb fails to dump linkdb
Using nutch 0.9 (hadoop 0.17.1):

[EMAIL PROTECTED] working]$ bin/nutch readlinkdb /home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt
LinkDb dump: starting
LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb
java.io.IOException: Type mismatch in value from map: expected org.apache.nutch.crawl.Inlinks, recieved org.apache.nutch.crawl.CrawlDatum
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:427)
        at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

LinkDbReader: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
        at org.apache.nutch.crawl.LinkDbReader.processDumpJob(LinkDbReader.java:110)
        at org.apache.nutch.crawl.LinkDbReader.run(LinkDbReader.java:127)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDbReader.main(LinkDbReader.java:114)

This is the first time I use readlinkdb, and the rest of the crawling process is working OK. I've looked in JIRA and there's no related bug.

I've also tried the latest trunk Nutch, but DFS is not working for me:

[EMAIL PROTECTED] trunk]$ bin/hadoop dfs -ls
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:648)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1334)
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1351)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:213)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:118)
        at org.apache.hadoop.fs.FsShell.init(FsShell.java:88)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1698)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:628)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:646)
        ... 10 more

Should I file both bugs on JIRA?
Re: readlinkdb fails to dump linkdb
On Wed, Dec 3, 2008 at 8:55 PM, brainstorm [EMAIL PROTECTED] wrote:
> Using nutch 0.9 (hadoop 0.17.1):
>
> [EMAIL PROTECTED] working]$ bin/nutch readlinkdb /home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt
> LinkDb dump: starting
> LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb

It seems you are providing a crawldb as argument. You should pass the linkdb.

> java.io.IOException: Type mismatch in value from map: expected org.apache.nutch.crawl.Inlinks, recieved org.apache.nutch.crawl.CrawlDatum
> [...]
> LinkDbReader: java.io.IOException: Job failed!
> [...]
>
> This is the first time I use readlinkdb, and the rest of the crawling process is working OK. I've looked in JIRA and there's no related bug.
>
> I've also tried the latest trunk Nutch, but DFS is not working for me:
>
> [EMAIL PROTECTED] trunk]$ bin/hadoop dfs -ls
> Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem
> [...]
>
> Should I file both bugs on JIRA?

This I am not sure, but did you try ant clean; ant? It may be a version mismatch.

--
Doğacan Güney
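On the version-mismatch suggestion: a ClassNotFoundException for org.apache.hadoop.hdfs.DistributedFileSystem typically means the hadoop jar actually on the classpath is from a release where the class lived under a different package, so one quick check is which bundled jars ship the class at all. A diagnostic sketch (the lib/ glob is an assumption about the checkout layout, not something stated in the thread):

```shell
# Diagnostic sketch: list every hadoop jar that would land on the
# classpath and report whether it contains the class the FsShell
# is failing to load. The lib/ location is assumed, not verified.
check_hadoop_jars() {
  for jar in lib/hadoop-*.jar; do
    # glob stays literal when nothing matches, so guard on existence
    [ -f "$jar" ] || { echo "no hadoop jars found under lib/"; return 0; }
    if unzip -l "$jar" | grep -q 'hdfs/DistributedFileSystem.class'; then
      echo "$jar: contains org.apache.hadoop.hdfs.DistributedFileSystem"
    else
      echo "$jar: class missing (possibly a stale or mismatched version)"
    fi
  done
}
```

If a stale jar turns up, removing it and rerunning ant clean; ant so that only the version trunk expects remains on the classpath would be the next thing to try.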