get CrawlDatum

2006-08-30 Thread Uroš Gruber
Hi, Could someone point me to how to get the CrawlDatum data for the key url in ParseOutputFormat.write [83]? I would like to add data to link urls, but this data depends on the data of the url being crawled. I hope I was clear enough about my problem. regards Uros

Re: get CrawlDatum

2006-08-30 Thread Andrzej Bialecki
Uroš Gruber wrote: Hi, Could someone point me to how to get the CrawlDatum data for the key url in ParseOutputFormat.write [83]? I would like to add data to link urls, but this data depends on the data of the url being crawled. You can't, because that instance of CrawlDatum is not available at that point.

Re: Use CrawlDb as a metadata Db?

2006-08-30 Thread Enis Soztutar
HUYLEBROECK Jeremy RD-ILAB-SSF wrote: If I am not wrong, segments generated by the Generator are some sort of CrawlDatum. I am putting metadata in the CrawlDb (I keep information that never changes), and I think it is copied to the segments by the Generator. But now I want to access those metadata

Fetch error

2006-08-30 Thread anton
I updated hadoop, but now I get the following error on the fetch step (reduce): 06/08/29 08:31:20 INFO mapred.TaskTracker: task_0003_r_00_3 0.3334% reduce copy (6 of 6 at 11.77 MB/s) 06/08/29 08:31:20 WARN /: /getMapOutput.jsp?map=task_0003_m_02_0&reduce=1: java.lang.IllegalStateException

Re: get CrawlDatum

2006-08-30 Thread Uroš Gruber
Andrzej Bialecki wrote: Uroš Gruber wrote: Hi, Could someone point me to how to get the CrawlDatum data for the key url in ParseOutputFormat.write [83]? I would like to add data to link urls, but this data depends on the data of the url being crawled. You can't, because that instance of CrawlDatum is not

Re: get CrawlDatum

2006-08-30 Thread Andrzej Bialecki
Uroš Gruber wrote: ParseData.metadata sounds nice, but I think I'm lost again :) If I understand the code flow, the best place would be in Fetcher [262], but I'm not sure that datum holds the info of the url being fetched. On the input to the fetcher you get a URL and a CrawlDatum (originally coming from

Re: get CrawlDatum

2006-08-30 Thread Uroš Gruber
Andrzej Bialecki wrote: Uroš Gruber wrote: ParseData.metadata sounds nice, but I think I'm lost again :) If I understand the code flow, the best place would be in Fetcher [262], but I'm not sure that datum holds the info of the url being fetched. On the input to the fetcher you get a URL and a CrawlDatum

RE: Fetch error

2006-08-30 Thread anton
The previous error I got from the tasktracker log. In the jobtracker log I now see the following error: 06/08/30 01:04:07 INFO mapred.TaskInProgress: Error from task_0001_r_00_1: java.lang.AbstractMethodError: org.apache.nutch.fetcher.FetcherOutputFormat.getRecordWriter(Lorg/apache/hadoop/fs/FileS

[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak

2006-08-30 Thread Enis Soztutar (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12431548 ] Enis Soztutar commented on NUTCH-356: I observed strange behaviour when one of the plug-ins could not be included. For example, the plugin system fails to
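The leak described in the issue title comes from the common pattern of caching one plugin repository per configuration object in a static, strongly-keyed map, which pins every configuration (and the plugins it loaded) in memory for the life of the JVM. A minimal sketch of the weak-key alternative, with invented stand-in names rather than Nutch's actual classes:

```java
import java.util.Map;
import java.util.WeakHashMap;

/**
 * Sketch of the caching pattern at issue in NUTCH-356. The class names
 * Config and Repository are illustrative stand-ins, not Nutch's own.
 */
public class PluginCacheSketch {

    static class Config {}      // stands in for a per-job configuration object
    static class Repository {}  // stands in for the plugin repository

    // Weak keys: once a Config is no longer referenced elsewhere, its
    // entry (and the Repository it maps to) becomes collectable. A
    // plain HashMap here would retain both forever.
    private static final Map<Config, Repository> CACHE = new WeakHashMap<>();

    public static synchronized Repository get(Config conf) {
        // One repository per configuration, created lazily.
        return CACHE.computeIfAbsent(conf, c -> new Repository());
    }
}
```

The `synchronized` wrapper matters because `WeakHashMap` is not thread-safe and entries can be expunged during any access.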

Should URL normalization iterate?

2006-08-30 Thread Doug Cook
Hi, I've run across a few patterns in URLs where applying a normalization puts the URL in a form matching another normalization pattern (or even the same one). But that pattern won't get executed because the patterns are applied only once. Should normalization iterate until no patterns match
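The iteration Doug proposes can be sketched as a fixed-point loop: keep re-applying the rule list until a full pass changes nothing, with a pass cap as a guard against rule sets that cycle. The two rules below are invented for illustration (Nutch's real rules live in its regex-normalize configuration), ordered deliberately so that a single pass is not enough:

```java
import java.util.List;
import java.util.regex.Pattern;

/**
 * Sketch of iterating URL normalization to a fixed point. The rules
 * are made-up examples, not Nutch's actual normalization patterns.
 */
public class FixpointNormalizer {

    record Rule(Pattern pattern, String replacement) {}

    // Ordered so one pass is NOT enough: stripping "index.html"
    // (rule 2) leaves a trailing slash that only rule 1, already
    // applied this pass, would remove.
    static final List<Rule> RULES = List.of(
        new Rule(Pattern.compile("/$"), ""),              // drop trailing slash
        new Rule(Pattern.compile("/index\\.html$"), "/")  // drop default page
    );

    static final int MAX_PASSES = 10;  // guard against cycling rule sets

    public static String normalize(String url) {
        String prev;
        int passes = 0;
        do {
            prev = url;
            for (Rule r : RULES) {
                url = r.pattern().matcher(url).replaceAll(r.replacement());
            }
        } while (!url.equals(prev) && ++passes < MAX_PASSES);
        return url;
    }

    public static void main(String[] args) {
        // A single pass would stop at "http://example.com/dir/";
        // iterating reaches the fixed point "http://example.com/dir".
        System.out.println(normalize("http://example.com/dir/index.html"));
    }
}
```

The pass cap is the answer to the obvious worry in the thread: two rules that undo each other would otherwise loop forever.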

RE: Use CrawlDb as a metadata Db?

2006-08-30 Thread HUYLEBROECK Jeremy RD-ILAB-SSF
I think at the parser plugin level you can't get back to the original CrawlDatum; the parsers get only the Content. What I did was put data from the CrawlDb into the Content metadata at fetch time. Then the parser gets this metadata and can put it in the Parse object as needed. If you do
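The pass-through Jeremy describes can be sketched with plain maps standing in for Nutch's metadata containers (the key name "db.custom.score" is invented for illustration): the fetcher is the only stage that sees both the CrawlDatum and the Content, so it copies the CrawlDb values forward, and the parser then merely forwards them again.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Schematic of carrying CrawlDb metadata through fetch and parse,
 * using plain Maps in place of Nutch's metadata classes.
 */
public class MetadataPassThrough {

    // Fetch time: the fetcher has the CrawlDatum in hand while it
    // builds the Content, so CrawlDb metadata can be copied across.
    static Map<String, String> fetch(Map<String, String> crawlDatumMeta) {
        Map<String, String> contentMeta = new HashMap<>(crawlDatumMeta);
        return contentMeta;
    }

    // Parse time: the parser only receives the Content, but the
    // CrawlDb values now ride inside its metadata and can be
    // forwarded into the ParseData.
    static Map<String, String> parse(Map<String, String> contentMeta) {
        return new HashMap<>(contentMeta);
    }

    public static void main(String[] args) {
        Map<String, String> datumMeta = Map.of("db.custom.score", "0.7");
        Map<String, String> parseMeta = parse(fetch(datumMeta));
        System.out.println(parseMeta.get("db.custom.score")); // prints 0.7
    }
}
```

This sidesteps the problem raised earlier in the thread: nothing downstream of the fetcher needs the CrawlDatum instance itself, only the values copied out of it.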

Re: books (and articles) about search engine algorithms

2006-08-30 Thread Thomas Delnoij
I found Mining the Web - Discovering Knowledge from Hypertext Data by Soumen Chakrabarti a useful reference. http://www.amazon.com/gp/product/1558607544/103-9548474-1631829?v=glance&n=283155 Rgrds, Thomas On 8/29/06, Andrzej Bialecki [EMAIL PROTECTED] wrote: Mladen Adamovic wrote: Hi! I

Re: Patch Available status?

2006-08-30 Thread Doug Cutting
Sami Siren wrote: I am not able to do it either, or then I just don't know how; can Doug help us here? This requires a change to the project's workflow. I'd be happy to move Nutch to use the workflow we use for Hadoop, which supports Patch Available. This workflow has one other

Re: Patch Available status?

2006-08-30 Thread Andrzej Bialecki
Doug Cutting wrote: Sami Siren wrote: I am not able to do it either, or then I just don't know how; can Doug help us here? This requires a change to the project's workflow. I'd be happy to move Nutch to use the workflow we use for Hadoop, which supports Patch Available. This workflow

[jira] Closed: (NUTCH-242) Add optional -urlFiltering to updatedb

2006-08-30 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-242?page=all ] Andrzej Bialecki closed NUTCH-242. Resolution: Fixed. Fixed in rev. 438670, with modifications.

[jira] Closed: (NUTCH-143) Improper error numbers returned on exit

2006-08-30 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-143?page=all ] Andrzej Bialecki closed NUTCH-143. Resolution: Fixed. Fixed in rev. 438670, with modifications.

Re: Patch Available status?

2006-08-30 Thread Chris Mattmann
Hi Doug and Andrzej, +1. I think that workflow makes a lot of sense. Currently users in the nutch-developers group can close and resolve issues. In the Hadoop workflow, would this continue to be the case? Cheers, Chris On 8/30/06 3:14 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doug