Hi,
Could someone point me to how to get the CrawlDatum data from the key URL in
ParseOutputFormat.write [83]?
I would like to add data to the link URLs, but this data depends on the data
of the URL being crawled.
I hope I was clear enough about my problem.
Regards,
Uros
Uroš Gruber wrote:
Hi,
Could someone point me to how to get the CrawlDatum data from the key URL in
ParseOutputFormat.write [83]?
I would like to add data to the link URLs, but this data depends on the data
of the URL being crawled.
You can't, because that instance of CrawlDatum is not available at this
point.
HUYLEBROECK Jeremy RD-ILAB-SSF wrote:
If I am not wrong, the segments generated by the Generator are some sort of
CrawlDatum.
I am putting metadata in the CrawlDb (I keep information that never
changes), and I think it is copied to the segments by the Generator.
But now I want to access that metadata
I updated Hadoop, but now I get the following error on the fetch step (reduce):
06/08/29 08:31:20 INFO mapred.TaskTracker: task_0003_r_00_3 0.3334%
reduce copy (6 of 6 at 11.77 MB/s)
06/08/29 08:31:20 WARN /:
/getMapOutput.jsp?map=task_0003_m_02_0reduce=1:
java.lang.IllegalStateException
Uroš Gruber wrote:
ParseData.metadata sounds nice, but I think I'm lost again :)
If I understand the code flow, the best place would be in Fetcher [262],
but I'm not sure that datum holds info about the URL being fetched.
On the input to the fetcher you get a URL and a CrawlDatum (originally
coming from
The previous error came from the tasktracker log. In the jobtracker log I
now see the following error:
06/08/30 01:04:07 INFO mapred.TaskInProgress: Error from
task_0001_r_00_1: java.lang.AbstractMethodError: org.apache.n
utch.fetcher.FetcherOutputFormat.getRecordWriter(Lorg/apache/hadoop/fs/FileS
[
http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12431548 ]
Enis Soztutar commented on NUTCH-356:
-
I observed strange behaviour when one of the plug-ins could not be included.
For example the plugin system fails to
Hi,
I've run across a few patterns in URLs where applying a normalization puts
the URL in a form matching another normalization pattern (or even the same
one). But that pattern won't get executed because the patterns are applied
only once.
Should normalization iterate until no patterns match?
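The iteration suggested above can be sketched as a fixpoint loop: keep applying the whole rule set until a pass changes nothing, with a cap to guard against rules that cycle. This is a minimal plain-regex sketch, not the actual Nutch URLNormalizer code; the class name, rules, and cap are illustrative assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch (not the real Nutch URLNormalizer): apply regex
// normalization rules repeatedly until the URL stops changing.
public class FixpointNormalizer {

    private final Map<String, String> rules = new LinkedHashMap<>();
    private static final int MAX_PASSES = 10; // safety cap against cycling rules

    public void addRule(String pattern, String replacement) {
        rules.put(pattern, replacement);
    }

    public String normalize(String url) {
        String current = url;
        for (int pass = 0; pass < MAX_PASSES; pass++) {
            String previous = current;
            for (Map.Entry<String, String> rule : rules.entrySet()) {
                current = current.replaceAll(rule.getKey(), rule.getValue());
            }
            if (current.equals(previous)) {
                return current; // fixpoint: no rule changed the URL this pass
            }
        }
        return current; // cap hit; return best effort
    }

    public static void main(String[] args) {
        FixpointNormalizer n = new FixpointNormalizer();
        n.addRule("(?<!:)//", "/"); // collapse double slashes outside the scheme
        n.addRule("/\\./", "/");    // drop "/./" segments
        // "////" collapses to "//" in one pass, so a second pass is needed
        // to finish the job -- exactly the case described above.
        System.out.println(n.normalize("http://example.com////a/./b"));
        // prints http://example.com/a/b
    }
}
```

Running each rule only once would leave `http://example.com//a/b` behind; the fixpoint loop catches the slashes that the first pass itself produced.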
I think at the parser plugin level you can't get back to the original
CrawlDatum. The parsers get only the Content.
What I did was put data from the CrawlDb into the Content metadata at
fetch time. The parser then gets this metadata and can put it in the
Parse object as needed.
If you do
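The hand-off described above can be sketched as follows. The class shapes here are simplified stand-ins, not the actual Nutch API (the real CrawlDatum, Content, and ParseData carry much more state), and the `site-category` key is a hypothetical example; only the direction of the metadata flow is the point.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for the Nutch types involved: the fetcher sees both
// the CrawlDatum and the Content, so it copies CrawlDb metadata into the
// Content; the parser only sees the Content, and copies the metadata onward
// into the ParseData.
public class MetadataHandoff {

    static class CrawlDatum {   // per-URL record from the CrawlDb
        final Map<String, String> meta = new HashMap<>();
    }

    static class Content {      // what the parser actually receives
        final Map<String, String> meta = new HashMap<>();
    }

    static class ParseData {    // parse output, written to the segment
        final Map<String, String> meta = new HashMap<>();
    }

    // Fetch step: the only place both the CrawlDatum and the Content exist.
    static Content fetch(String url, CrawlDatum datum) {
        Content content = new Content();
        content.meta.putAll(datum.meta); // carry CrawlDb metadata forward
        return content;
    }

    // Parse step: the metadata planted at fetch time is now reachable.
    static ParseData parse(Content content) {
        ParseData pd = new ParseData();
        pd.meta.putAll(content.meta);    // propagate into the parse output
        return pd;
    }

    public static void main(String[] args) {
        CrawlDatum datum = new CrawlDatum();
        datum.meta.put("site-category", "news"); // hypothetical CrawlDb metadata
        ParseData pd = parse(fetch("http://example.com/", datum));
        System.out.println(pd.meta.get("site-category")); // prints news
    }
}
```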
I found Mining the Web: Discovering Knowledge from Hypertext Data
by Soumen Chakrabarti a useful reference.
http://www.amazon.com/gp/product/1558607544/103-9548474-1631829?v=glancen=283155
Rgrds, Thomas
On 8/29/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Mladen Adamovic wrote:
Hi!
I
Doug Cutting wrote:
Sami Siren wrote:
I am not able to do it either, or then I just don't know how; can Doug
help us here?
This requires a change to the project's workflow. I'd be happy to move
Nutch to use the workflow we use for Hadoop, which supports Patch
Available.
This workflow has one other
[ http://issues.apache.org/jira/browse/NUTCH-242?page=all ]
Andrzej Bialecki closed NUTCH-242.
---
Resolution: Fixed
Fixed in rev. 438670, with modifications.
Add optional -urlFiltering to updatedb
--
[ http://issues.apache.org/jira/browse/NUTCH-143?page=all ]
Andrzej Bialecki closed NUTCH-143.
---
Resolution: Fixed
Fixed in rev. 438670, with modifications.
Improper error numbers returned on exit
---
Hi Doug and Andrzej,
+1. I think that workflow makes a lot of sense. Currently users in the
nutch-developers group can close and resolve issues. In the Hadoop workflow,
would this continue to be the case?
Cheers,
Chris
On 8/30/06 3:14 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doug