Re: readlinkdb fails to dump linkdb

2008-12-04 Thread brainstorm
On Wed, Dec 3, 2008 at 8:29 PM, Doğacan Güney [EMAIL PROTECTED] wrote:
 On Wed, Dec 3, 2008 at 8:55 PM, brainstorm [EMAIL PROTECTED] wrote:
 Using nutch 0.9 (hadoop 0.17.1):

 [EMAIL PROTECTED] working]$ bin/nutch readlinkdb
 /home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt
 LinkDb dump: starting
 LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb
  

 It seems you are providing a crawldb as the argument. You should pass the linkdb instead.


Thanks a lot for the hint, but I cannot find a linkdb dir anywhere on
the HDFS :_/ Can you point me to where it should be?


 java.io.IOException: Type mismatch in value from map: expected
 org.apache.nutch.crawl.Inlinks, recieved org.apache.nutch.crawl.CrawlDatum
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:427)
        at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

 LinkDbReader: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
        at org.apache.nutch.crawl.LinkDbReader.processDumpJob(LinkDbReader.java:110)
        at org.apache.nutch.crawl.LinkDbReader.run(LinkDbReader.java:127)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDbReader.main(LinkDbReader.java:114)

 This is the first time I've used readlinkdb, and the rest of the crawling
 process works fine; I've searched JIRA and found no related bug.

 I've also tried the latest Nutch trunk, but DFS is not working for me:

 [EMAIL PROTECTED] trunk]$ bin/hadoop dfs -ls

 Exception in thread "main" java.lang.RuntimeException:
 java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:648)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1334)
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1351)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:213)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:118)
        at org.apache.hadoop.fs.FsShell.init(FsShell.java:88)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1698)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847)
 Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:628)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:646)
        ... 10 more

 Should I file both bugs in JIRA?


 This I am not sure about, but did you try ant clean; ant? It may be
 a version mismatch.


Yes, I did ant clean && ant before trying the above command. I also
tried to upgrade the filesystem, without success, and even created it
from scratch:

https://issues.apache.org/jira/browse/HADOOP-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650556#action_12650556
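
For the record, the upgrade and re-format attempts were along these lines
(commands from memory, so treat them as approximate):

bin/stop-all.sh
bin/start-dfs.sh -upgrade     # try an in-place HDFS upgrade first
bin/hadoop namenode -format   # last resort: recreate the filesystem
bin/start-all.sh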



 --
 Doğacan Güney



Re: readlinkdb fails to dump linkdb

2008-12-04 Thread Doğacan Güney
On Thu, Dec 4, 2008 at 11:33 AM, brainstorm [EMAIL PROTECTED] wrote:
 On Wed, Dec 3, 2008 at 8:29 PM, Doğacan Güney [EMAIL PROTECTED] wrote:
 On Wed, Dec 3, 2008 at 8:55 PM, brainstorm [EMAIL PROTECTED] wrote:
 Using nutch 0.9 (hadoop 0.17.1):

 [EMAIL PROTECTED] working]$ bin/nutch readlinkdb
 /home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt
 LinkDb dump: starting
 LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb
  

 It seems you are providing a crawldb as the argument. You should pass the linkdb instead.


 Thanks a lot for the hint, but I cannot find a linkdb dir anywhere on
 the HDFS :_/ Can you point me to where it should be?

A linkdb is created with the invertlinks command, e.g.:

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
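
Once invertlinks finishes, the dump you were trying should then work against
the new linkdb, along these lines (the output path is just an example):

bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump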



 [...]





-- 
Doğacan Güney


readlinkdb fails to dump linkdb

2008-12-03 Thread brainstorm
Using nutch 0.9 (hadoop 0.17.1):

[EMAIL PROTECTED] working]$ bin/nutch readlinkdb
/home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt
LinkDb dump: starting
LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb
java.io.IOException: Type mismatch in value from map: expected
org.apache.nutch.crawl.Inlinks, recieved org.apache.nutch.crawl.CrawlDatum
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:427)
        at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

LinkDbReader: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
        at org.apache.nutch.crawl.LinkDbReader.processDumpJob(LinkDbReader.java:110)
        at org.apache.nutch.crawl.LinkDbReader.run(LinkDbReader.java:127)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDbReader.main(LinkDbReader.java:114)

This is the first time I've used readlinkdb, and the rest of the crawling
process works fine; I've searched JIRA and found no related bug.

I've also tried the latest Nutch trunk, but DFS is not working for me:

[EMAIL PROTECTED] trunk]$ bin/hadoop dfs -ls

Exception in thread "main" java.lang.RuntimeException:
java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:648)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1334)
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1351)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:213)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:118)
        at org.apache.hadoop.fs.FsShell.init(FsShell.java:88)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1698)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:628)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:646)
        ... 10 more

Should I file both bugs in JIRA?


Re: readlinkdb fails to dump linkdb

2008-12-03 Thread Doğacan Güney
On Wed, Dec 3, 2008 at 8:55 PM, brainstorm [EMAIL PROTECTED] wrote:
 Using nutch 0.9 (hadoop 0.17.1):

 [EMAIL PROTECTED] working]$ bin/nutch readlinkdb
 /home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt
 LinkDb dump: starting
 LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb
  

It seems you are providing a crawldb as the argument. You should pass the linkdb instead.
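
For example, if an invertlinks step has created a linkdb next to your crawldb,
the dump would look something like this (the linkdb path is an assumption,
adjust it to your layout):

bin/nutch readlinkdb /home/hadoop/crawl-20081201/linkdb -dump crawled_urls.txt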

 [...]

 I've also tried the latest Nutch trunk, but DFS is not working for me:

 [EMAIL PROTECTED] trunk]$ bin/hadoop dfs -ls

 [...]

 Should I file both bugs in JIRA?


This I am not sure about, but did you try ant clean; ant? It may be
a version mismatch.
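
If I recall correctly, DistributedFileSystem lived in the org.apache.hadoop.dfs
package in the 0.17 line and only moved to org.apache.hadoop.hdfs in later
releases, so a stale Hadoop jar from an older build would explain exactly this
ClassNotFoundException. A quick sanity check (assuming the stock trunk layout,
which keeps the Hadoop jar under lib/) would be:

ant clean && ant
ls lib/hadoop*.jar

and making sure only one Hadoop core jar ends up on the classpath.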


-- 
Doğacan Güney