Oh yes, thank you very much Sergey, that was the problem. It would have been nice if the invertlinks command had told me that it ignored them :-)
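[Editor's note: the fix discussed here is the db.ignore.internal.links property. A sketch of the override, assuming the usual Nutch 1.x conf/nutch-site.xml layout; the description text is my own wording:]

```xml
<!-- conf/nutch-site.xml: keep links between pages of the same host,
     so invertlinks produces a non-empty linkdb for a single-site crawl -->
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If false, internal (same-host) links are kept when
  building the linkdb.</description>
</property>
```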
Cheers,
Marek

On 23.08.2011 19:26, Sergey A Volkov wrote:
> Hi
>
> Is it possible that you fetch documents from just one site/domain?
>
> It looks like by default Nutch ignores internal site links
> (db.ignore.internal.links).
>
> Sergey Volkov
>
> On 08/23/2011 07:04 PM, Marek Bachmann wrote:
>> Hi Lewis,
>>
>> thank you for your suggestion.
>> Unfortunately this isn't the problem. I have actually also tried to
>> merge all segments together and pass the one big segment to the
>> invertlinks command. Same (lack of) effect. :-(
>>
>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
>> ./nutch mergesegs crawl/one-seg -dir crawl/segments/
>> Merging 29 segments to crawl/one-seg/20110823165144
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817164804
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817164912
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817165053
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817165524
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817170729
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817171757
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817172919
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819135218
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819165658
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819170807
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819171841
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819173350
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822135934
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822141229
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822143419
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822143824
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144031
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144232
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144435
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144617
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144750
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144927
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822145249
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822150757
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822152354
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822152503
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822153900
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822155321
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822155732
>> SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# rm -rf crawl/linkdb/
>>
>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
>> ./nutch invertlinks crawl/linkdb crawl/one-seg/20110823165144/ -noNormalize -noFilter
>> LinkDb: starting at 2011-08-23 17:01:44
>> LinkDb: linkdb: crawl/linkdb
>> LinkDb: URL normalize: false
>> LinkDb: URL filter: false
>> LinkDb: adding segment: crawl/one-seg/20110823165144
>> LinkDb: finished at 2011-08-23 17:01:52, elapsed: 00:00:08
>>
>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
>> ./nutch readlinkdb crawl/linkdb/ -dump linkdump
>> LinkDb dump: starting at 2011-08-23 17:03:12
>> LinkDb dump: db: crawl/linkdb/
>> LinkDb dump: finished at 2011-08-23 17:03:13, elapsed: 00:00:01
>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# cd linkdump/
>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump# ll
>> total 0
>> -rwxrwxrwx 1 root root 0 Aug 23 17:03 part-00000
>>
>> On 23.08.2011 16:44, lewis john mcgibbney wrote:
>>> Hi
>>>
>>> Small suggestion, but I do not see any -dir argument passed alongside
>>> your initial invertlinks command. I understand that you have multiple
>>> segment directories, fetched over the last few days, and that the
>>> output would also suggest the process executed properly; however, I
>>> have never used the command without the -dir option (it has always
>>> worked for me that way), so I can only suggest that this may be the
>>> problem.
>>>
>>> On Tue, Aug 23, 2011 at 3:29 PM, Marek Bachmann <[email protected]> wrote:
>>>
>>>> Hi Markus,
>>>>
>>>> thank you for the quick reply. I already searched for this
>>>> Configuration error and found:
>>>>
>>>> http://www.mail-archive.com/[email protected]/msg15397.html
>>>>
>>>> where they say that "This exception is innocuous - it helps to debug
>>>> at which points in the code the Configuration instances are being
>>>> created. (...)"
>>>>
>>>> I indeed do not have much disk space on the machine, but it should
>>>> be enough at the moment:
>>>>
>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# df -h .
>>>> Filesystem            Size  Used Avail Use% Mounted on
>>>> /dev/vda1              20G  5.9G   15G  30% /home
>>>>
>>>> As I am root and all directories under
>>>> /home/nutchServer/relaunch_nutch/runtime/local/bin are set to
>>>> root:root with 755 permissions, that shouldn't be the problem either.
>>>>
>>>> Any further suggestions? :-/
>>>>
>>>> Thank you once again
>>>>
>>>> On 23.08.2011 16:10, Markus Jelsma wrote:
>>>>
>>>>> There are some peculiarities in your log:
>>>>>
>>>>> 2011-08-23 14:47:34,833 DEBUG conf.Configuration - java.io.IOException: config()
>>>>>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:211)
>>>>>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:198)
>>>>>     at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:213)
>>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:93)
>>>>>     at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
>>>>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
>>>>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>>>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:190)
>>>>>     at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292)
>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>     at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>>>
>>>>> 2011-08-23 14:47:34,922 INFO mapred.JobClient - Running job: job_local_0002
>>>>> 2011-08-23 14:47:34,923 DEBUG conf.Configuration - java.io.IOException: config(config)
>>>>>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:226)
>>>>>     at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:184)
>>>>>     at org.apache.hadoop.mapreduce.JobContext.<init>(JobContext.java:52)
>>>>>     at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:32)
>>>>>     at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:38)
>>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:111)
>>>>>
>>>>> Can you check permissions, disk space etc.?
>>>>>
>>>>> On Tuesday 23 August 2011 16:05:16 Marek Bachmann wrote:
>>>>>
>>>>>> Hey ho,
>>>>>>
>>>>>> for some reason the invertlinks command produces an empty linkdb.
>>>>>>
>>>>>> I did:
>>>>>>
>>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
>>>>>> ./nutch invertlinks crawl/linkdb crawl/segments/* -noNormalize -noFilter
>>>>>> LinkDb: starting at 2011-08-23 14:47:21
>>>>>> LinkDb: linkdb: crawl/linkdb
>>>>>> LinkDb: URL normalize: false
>>>>>> LinkDb: URL filter: false
>>>>>> LinkDb: adding segment: crawl/segments/20110817164804
>>>>>> LinkDb: adding segment: crawl/segments/20110817164912
>>>>>> LinkDb: adding segment: crawl/segments/20110817165053
>>>>>> LinkDb: adding segment: crawl/segments/20110817165524
>>>>>> LinkDb: adding segment: crawl/segments/20110817170729
>>>>>> LinkDb: adding segment: crawl/segments/20110817171757
>>>>>> LinkDb: adding segment: crawl/segments/20110817172919
>>>>>> LinkDb: adding segment: crawl/segments/20110819135218
>>>>>> LinkDb: adding segment: crawl/segments/20110819165658
>>>>>> LinkDb: adding segment: crawl/segments/20110819170807
>>>>>> LinkDb: adding segment: crawl/segments/20110819171841
>>>>>> LinkDb: adding segment: crawl/segments/20110819173350
>>>>>> LinkDb: adding segment: crawl/segments/20110822135934
>>>>>> LinkDb: adding segment: crawl/segments/20110822141229
>>>>>> LinkDb: adding segment: crawl/segments/20110822143419
>>>>>> LinkDb: adding segment: crawl/segments/20110822143824
>>>>>> LinkDb: adding segment: crawl/segments/20110822144031
>>>>>> LinkDb: adding segment: crawl/segments/20110822144232
>>>>>> LinkDb: adding segment: crawl/segments/20110822144435
>>>>>> LinkDb: adding segment: crawl/segments/20110822144617
>>>>>> LinkDb: adding segment: crawl/segments/20110822144750
>>>>>> LinkDb: adding segment: crawl/segments/20110822144927
>>>>>> LinkDb: adding segment: crawl/segments/20110822145249
>>>>>> LinkDb: adding segment: crawl/segments/20110822150757
>>>>>> LinkDb: adding segment: crawl/segments/20110822152354
>>>>>> LinkDb: adding segment: crawl/segments/20110822152503
>>>>>> LinkDb: adding segment: crawl/segments/20110822153900
>>>>>> LinkDb: adding segment: crawl/segments/20110822155321
>>>>>> LinkDb: adding segment: crawl/segments/20110822155732
>>>>>> LinkDb: merging with existing linkdb: crawl/linkdb
>>>>>> LinkDb: finished at 2011-08-23 14:47:35, elapsed: 00:00:14
>>>>>>
>>>>>> After that:
>>>>>>
>>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
>>>>>> ./nutch readlinkdb crawl/linkdb/ -dump linkdump
>>>>>> LinkDb dump: starting at 2011-08-23 14:48:26
>>>>>> LinkDb dump: db: crawl/linkdb/
>>>>>> LinkDb dump: finished at 2011-08-23 14:48:27, elapsed: 00:00:01
>>>>>>
>>>>>> And then:
>>>>>>
>>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# cd linkdump/
>>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump# ll
>>>>>> total 0
>>>>>> -rwxrwxrwx 1 root root 0 Aug 23 14:48 part-00000
>>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
>>>>>>
>>>>>> As you can see, the dump size is 0 bytes.
>>>>>>
>>>>>> Unfortunately I have no idea what went wrong.
>>>>>>
>>>>>> I have attached the hadoop.log for the invertlinks process.
>>>>>> Perhaps that helps somebody?
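[Editor's note: the symptom in this thread, a zero-byte part-00000 after readlinkdb, can be caught right away with a small shell check. A sketch only; the linkdump/part-00000 path follows the commands shown above, and the check itself is not part of Nutch:]

```shell
# Warn when the readlinkdb dump came out empty, as it did in this thread.
# [ -s FILE ] is true only when FILE exists and has a size greater than zero.
dump="linkdump/part-00000"
if [ -s "$dump" ]; then
    status="ok"
    echo "linkdb dump has data: $dump"
else
    status="empty"
    echo "linkdb dump is empty - check db.ignore.internal.links in nutch-site.xml"
fi
```

Run from the directory where readlinkdb wrote its output; a single-site crawl with db.ignore.internal.links left at its default will trip the "empty" branch.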

