Oh yes, thank you very much Sergey, that was the problem. It would have been nice if the invertlinks command had told me that it ignored them :-)
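[Editor's note: the fix discussed here is the db.ignore.internal.links property. A sketch of the override, assuming the usual Nutch 1.x conf/nutch-site.xml layout; the description text is my own wording:]

```xml
<!-- conf/nutch-site.xml: keep links between pages of the same host,
     so invertlinks produces a non-empty linkdb for a single-site crawl -->
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If false, internal (same-host) links are kept when
  building the linkdb.</description>
</property>
```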
Cheers,
Marek

On 23.08.2011 19:26, Sergey A Volkov wrote:
> Hi
>
> Is it possible that you fetch documents from just one site/domain?
>
> It looks like by default Nutch ignores internal site links
> (db.ignore.internal.links).
>
> Sergey Volkov
>
> On 08/23/2011 07:04 PM, Marek Bachmann wrote:
>> Hi Lewis,
>>
>> thank you for your suggestion.
>> Unfortunately this isn't the problem. I have actually also tried to
>> merge all segments together and pass the one big segment to the
>> invertlinks command. Same (lack of) effect. :-(
>>
>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
>> ./nutch mergesegs crawl/one-seg -dir crawl/segments/
>> Merging 29 segments to crawl/one-seg/20110823165144
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817164804
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817164912
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817165053
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817165524
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817170729
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817171757
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817172919
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819135218
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819165658
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819170807
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819171841
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819173350
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822135934
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822141229
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822143419
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822143824
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144031
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144232
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144435
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144617
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144750
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144927
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822145249
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822150757
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822152354
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822152503
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822153900
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822155321
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822155732
>> SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# rm -rf crawl/linkdb/
>>
>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
>> ./nutch invertlinks crawl/linkdb crawl/one-seg/20110823165144/ -noNormalize -noFilter
>> LinkDb: starting at 2011-08-23 17:01:44
>> LinkDb: linkdb: crawl/linkdb
>> LinkDb: URL normalize: false
>> LinkDb: URL filter: false
>> LinkDb: adding segment: crawl/one-seg/20110823165144
>> LinkDb: finished at 2011-08-23 17:01:52, elapsed: 00:00:08
>>
>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
>> ./nutch readlinkdb crawl/linkdb/ -dump linkdump
>> LinkDb dump: starting at 2011-08-23 17:03:12
>> LinkDb dump: db: crawl/linkdb/
>> LinkDb dump: finished at 2011-08-23 17:03:13, elapsed: 00:00:01
>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# cd linkdump/
>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump# ll
>> total 0
>> -rwxrwxrwx 1 root root 0 Aug 23 17:03 part-00000
>>
>> On 23.08.2011 16:44, lewis john mcgibbney wrote:
>>> Hi
>>>
>>> Small suggestion, but I do not see any -dir argument passed alongside
>>> your initial invertlinks command. I understand that you have multiple
>>> segment directories, fetched over the last few days, and that the
>>> output would also suggest the process executed properly; however, I
>>> have never used the command without the -dir option (it has always
>>> worked for me that way), so I can only suggest that this may be the
>>> problem.
>>>
>>> On Tue, Aug 23, 2011 at 3:29 PM, Marek Bachmann <[email protected]> wrote:
>>>
>>>> Hi Markus,
>>>>
>>>> thank you for the quick reply. I already searched for this
>>>> Configuration error and found:
>>>>
>>>> http://www.mail-archive.com/[email protected]/msg15397.html
>>>>
>>>> where they say that "This exception is innocuous - it helps to debug
>>>> at which points in the code the Configuration instances are being
>>>> created. (...)"
>>>>
>>>> I indeed do not have much disk space on the machine, but it should
>>>> be enough at the moment:
>>>>
>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# df -h .
>>>> Filesystem            Size  Used Avail Use% Mounted on
>>>> /dev/vda1              20G  5.9G   15G  30% /home
>>>>
>>>> As I am root and all directories under
>>>> /home/nutchServer/relaunch_nutch/runtime/local/bin are set to
>>>> root:root with 755 permissions, that shouldn't be the problem either.
>>>>
>>>> Any further suggestions? :-/
>>>>
>>>> Thank you once again
>>>>
>>>> On 23.08.2011 16:10, Markus Jelsma wrote:
>>>>
>>>>> There are some peculiarities in your log:
>>>>>
>>>>> 2011-08-23 14:47:34,833 DEBUG conf.Configuration - java.io.IOException: config()
>>>>>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:211)
>>>>>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:198)
>>>>>     at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:213)
>>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:93)
>>>>>     at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
>>>>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
>>>>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>>>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:190)
>>>>>     at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292)
>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>     at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>>>
>>>>> 2011-08-23 14:47:34,922 INFO mapred.JobClient - Running job: job_local_0002
>>>>> 2011-08-23 14:47:34,923 DEBUG conf.Configuration - java.io.IOException: config(config)
>>>>>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:226)
>>>>>     at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:184)
>>>>>     at org.apache.hadoop.mapreduce.JobContext.<init>(JobContext.java:52)
>>>>>     at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:32)
>>>>>     at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:38)
>>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:111)
>>>>>
>>>>> Can you check permissions, disk space etc.?
>>>>>
>>>>> On Tuesday 23 August 2011 16:05:16 Marek Bachmann wrote:
>>>>>
>>>>>> Hey ho,
>>>>>>
>>>>>> for some reason the invertlinks command produces an empty linkdb.
>>>>>>
>>>>>> I did:
>>>>>>
>>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
>>>>>> ./nutch invertlinks crawl/linkdb crawl/segments/* -noNormalize -noFilter
>>>>>> LinkDb: starting at 2011-08-23 14:47:21
>>>>>> LinkDb: linkdb: crawl/linkdb
>>>>>> LinkDb: URL normalize: false
>>>>>> LinkDb: URL filter: false
>>>>>> LinkDb: adding segment: crawl/segments/20110817164804
>>>>>> LinkDb: adding segment: crawl/segments/20110817164912
>>>>>> LinkDb: adding segment: crawl/segments/20110817165053
>>>>>> LinkDb: adding segment: crawl/segments/20110817165524
>>>>>> LinkDb: adding segment: crawl/segments/20110817170729
>>>>>> LinkDb: adding segment: crawl/segments/20110817171757
>>>>>> LinkDb: adding segment: crawl/segments/20110817172919
>>>>>> LinkDb: adding segment: crawl/segments/20110819135218
>>>>>> LinkDb: adding segment: crawl/segments/20110819165658
>>>>>> LinkDb: adding segment: crawl/segments/20110819170807
>>>>>> LinkDb: adding segment: crawl/segments/20110819171841
>>>>>> LinkDb: adding segment: crawl/segments/20110819173350
>>>>>> LinkDb: adding segment: crawl/segments/20110822135934
>>>>>> LinkDb: adding segment: crawl/segments/20110822141229
>>>>>> LinkDb: adding segment: crawl/segments/20110822143419
>>>>>> LinkDb: adding segment: crawl/segments/20110822143824
>>>>>> LinkDb: adding segment: crawl/segments/20110822144031
>>>>>> LinkDb: adding segment: crawl/segments/20110822144232
>>>>>> LinkDb: adding segment: crawl/segments/20110822144435
>>>>>> LinkDb: adding segment: crawl/segments/20110822144617
>>>>>> LinkDb: adding segment: crawl/segments/20110822144750
>>>>>> LinkDb: adding segment: crawl/segments/20110822144927
>>>>>> LinkDb: adding segment: crawl/segments/20110822145249
>>>>>> LinkDb: adding segment: crawl/segments/20110822150757
>>>>>> LinkDb: adding segment: crawl/segments/20110822152354
>>>>>> LinkDb: adding segment: crawl/segments/20110822152503
>>>>>> LinkDb: adding segment: crawl/segments/20110822153900
>>>>>> LinkDb: adding segment: crawl/segments/20110822155321
>>>>>> LinkDb: adding segment: crawl/segments/20110822155732
>>>>>> LinkDb: merging with existing linkdb: crawl/linkdb
>>>>>> LinkDb: finished at 2011-08-23 14:47:35, elapsed: 00:00:14
>>>>>>
>>>>>> After that:
>>>>>>
>>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
>>>>>> ./nutch readlinkdb crawl/linkdb/ -dump linkdump
>>>>>> LinkDb dump: starting at 2011-08-23 14:48:26
>>>>>> LinkDb dump: db: crawl/linkdb/
>>>>>> LinkDb dump: finished at 2011-08-23 14:48:27, elapsed: 00:00:01
>>>>>>
>>>>>> And then:
>>>>>>
>>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# cd linkdump/
>>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump# ll
>>>>>> total 0
>>>>>> -rwxrwxrwx 1 root root 0 Aug 23 14:48 part-00000
>>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
>>>>>>
>>>>>> As you can see, the dump size is 0 bytes.
>>>>>>
>>>>>> Unfortunately I have no idea what went wrong.
>>>>>>
>>>>>> I have attached the hadoop.log for the invertlinks process.
>>>>>> Perhaps that helps somebody?
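[Editor's note: the symptom in this thread, a zero-byte part-00000 after readlinkdb, can be caught right away with a small shell check. A sketch only; the linkdump/part-00000 path follows the commands shown above, and the check itself is not part of Nutch:]

```shell
# Warn when the readlinkdb dump came out empty, as it did in this thread.
# [ -s FILE ] is true only when FILE exists and has a size greater than zero.
dump="linkdump/part-00000"
if [ -s "$dump" ]; then
    status="ok"
    echo "linkdb dump has data: $dump"
else
    status="empty"
    echo "linkdb dump is empty - check db.ignore.internal.links in nutch-site.xml"
fi
```

Run from the directory where readlinkdb wrote its output; a single-site crawl with db.ignore.internal.links left at its default will trip the "empty" branch.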

