[jira] Updated: (NUTCH-163) LogFormatter design

2006-01-05 Thread Daniel Feinstein (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-163?page=all ] Daniel Feinstein updated NUTCH-163: Attachment: LogFormatter.java. Here is the solution we have in the RawSugar.com project.
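
For readers unfamiliar with the class under discussion: a custom java.util.logging formatter generally has the shape sketched below. This is only an illustrative sketch, not the LogFormatter.java attached to the issue.

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.logging.Formatter;
    import java.util.logging.LogRecord;

    // generic sketch of a custom log formatter producing
    // "060104 213608 message" style lines, as seen in Nutch logs
    public class SimpleLogFormatter extends Formatter {
      private final SimpleDateFormat df = new SimpleDateFormat("yyMMdd HHmmss");

      public String format(LogRecord record) {
        return df.format(new Date(record.getMillis()))
            + " " + formatMessage(record) + "\n";
      }
    }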

[jira] Commented: (NUTCH-163) LogFormatter design

2006-01-05 Thread nutch.newbie (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-163?page=comments#action_12361815 ] nutch.newbie commented on NUTCH-163: Hi Daniel, I'd really like to give this a try. Could you please provide some instructions? It would be very helpful. Thanks

[jira] Commented: (NUTCH-163) LogFormatter design

2006-01-05 Thread nutch.newbie (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-163?page=comments#action_12361816 ] nutch.newbie commented on NUTCH-163: What I mean is: should I just replace the existing /src/java/org/apache/nutch/util/LogFormatter.java with yours and recompile?

Re: mapred crawling exception - Job failed!

2006-01-05 Thread Andrzej Bialecki
Lukas Vlcek wrote: How can I learn that? What I do is run the regular one-step command [/bin/nutch crawl]. In that case your nutch-default.xml / nutch-site.xml decides; there is a boolean option there. If you didn't change it, it defaults to true (i.e. your fetcher is parsing the
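
The boolean option in question is fetcher.parse (assuming the stock nutch-default.xml of this era; verify the name in your copy). Overriding it in nutch-site.xml would look like:

    <!-- nutch-site.xml: tell the fetcher not to parse while fetching -->
    <property>
      <name>fetcher.parse</name>
      <value>false</value>
    </property>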

Per-page crawling policy

2006-01-05 Thread Andrzej Bialecki
Hi, I've been toying with the following idea, which is an extension of the existing URLFilter mechanism and the concept of a crawl frontier. Let's suppose we have several initial seed urls, each with a different subjective quality. We would like to crawl these, and expand the crawling
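
One way the idea could be shaped in Java, as a purely hypothetical sketch (this interface does not exist in Nutch): each policy bundles its own outlink filtering and scoring, and every page records which policy governs it.

    // hypothetical sketch of a per-page crawling policy
    public interface CrawlPolicy {
      /** Short identifier stored with each page crawled under this policy. */
      byte getId();
      /** Return null to discard an outlink, or the (possibly rewritten) URL. */
      String filter(String url);
      /** Initial score assigned to outlinks discovered under this policy. */
      float initialScore(float parentScore);
    }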

Re: Per-page crawling policy

2006-01-05 Thread Byron Miller
Excellent ideas, and that is what I'm hoping to do: use some of the social bookmarking ideas to build the starter sites and link maps from. I hope to work with Simpy or other bookmarking projects to build something of a popularity map (human-edited authority) to merge and calculate against a

Re: no static NutchConf

2006-01-05 Thread Stefan Groschupf
I have two more ideas: 1) create NutchConf as an interface (not a class); 2) make it work as a plugin. I like the idea of making the conf a singleton and understand the need to be able to integrate Nutch. However, I would love to do one first step, and later on we can take this second step. I made
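
A minimal sketch of idea (1), purely to show the shape (hypothetical method set, not a committed design):

    // NutchConf as an interface: callers depend only on these methods,
    // so an embedding application can supply its own implementation
    public interface NutchConf {
      String get(String name);
      String get(String name, String defaultValue);
      boolean getBoolean(String name, boolean defaultValue);
      void set(String name, Object value);
    }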

Re: no static NutchConf

2006-01-05 Thread Stefan Groschupf
(2) What I'd REALLY like to see is if NutchConf were an interface. As mentioned, give us some time to get the first step done, and then I'm sure such community contributions are always welcome. Many people can work together on this. Stefan

Re: Per-page crawling policy

2006-01-05 Thread Stefan Groschupf
I like the idea, and it is another step in the direction of vertical search, where I personally see the biggest chance for Nutch. How to implement it? Surprisingly, I think it's very simple: just adding a CrawlDatum.policyId field would suffice, assuming we have a means to store and
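
Sketched very loosely (hypothetical field layout, not the real CrawlDatum source), the proposal amounts to one more byte serialized alongside the existing fields:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    // loose sketch of the proposed CrawlDatum.policyId field
    public class CrawlDatumSketch {
      private byte status;
      private long fetchTime;
      private byte policyId;       // which crawl policy governs this URL

      public void write(DataOutput out) throws IOException {
        out.writeByte(status);
        out.writeLong(fetchTime);
        out.writeByte(policyId);   // serialized with the rest of the datum
      }

      public void readFields(DataInput in) throws IOException {
        status = in.readByte();
        fetchTime = in.readLong();
        policyId = in.readByte();
      }
    }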

Re: no static NutchConf

2006-01-05 Thread Andrzej Bialecki
Stefan Groschupf wrote: Hi Andrzej, maybe I come closer to your idea of caching some objects. Yes. If you remember our discussion, I'd also like to follow a pattern where such instances are cached inside this NutchConf instance, if appropriate (i.e. if they are reusable and multi-
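
The caching pattern being described might look roughly like this (hypothetical names; the point is that the cache lives in the conf instance rather than in a static field):

    import java.util.HashMap;
    import java.util.Map;

    // per-instance object cache: reusable helper objects are keyed by
    // name and scoped to one configuration, not shared process-wide
    public class ConfObjectCache {
      private final Map cache = new HashMap();

      public synchronized Object getObject(String key) {
        return cache.get(key);
      }

      public synchronized void setObject(String key, Object value) {
        cache.put(key, value);
      }
    }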

Re: Per-page crawling policy

2006-01-05 Thread Andrzej Bialecki
Stefan Groschupf wrote: I like the idea, and it is another step in the direction of vertical search, where I personally see the biggest chance for Nutch. How to implement it? Surprisingly, I think it's very simple: just adding a CrawlDatum.policyId field would suffice, assuming we

Re: Per-page crawling policy

2006-01-05 Thread Doug Cutting
Stefan Groschupf wrote: Before we start adding metadata and more metadata, why not add generic metadata to the CrawlDatum once; then we can have any kind of plugin that adds and processes metadata belonging to a URL. +1 This feature strikes me as something that might prove very
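
In the same loose style as the sketch above (hypothetical code): a generic metadata map subsumes the single policyId field, and plugins read and write their own keys.

    import java.util.Properties;

    // loose sketch of generic per-URL metadata on the crawl datum
    public class CrawlDatumMetaSketch {
      private final Properties meta = new Properties();

      public void setMeta(String key, String value) {
        meta.setProperty(key, value);
      }

      public String getMeta(String key) {
        return meta.getProperty(key);
      }
    }

    // usage: datum.setMeta("policy.id", "news-sites");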

Re: Per-page crawling policy

2006-01-05 Thread Andrzej Bialecki
Doug Cutting wrote: Stefan Groschupf wrote: Before we start adding metadata and more metadata, why not add generic metadata to the CrawlDatum once; then we can have any kind of plugin that adds and processes metadata belonging to a URL. +1 This feature strikes me as

Re: Per-page crawling policy

2006-01-05 Thread Neal Whitley
Andrzej, This sounds like another great way to create more of a vertical search application as well. By defining trusted seed sources we can limit the scope of the crawl to a more suitable set of links. Also, being able to apply additional rules by domain/host or by trusted source would be

Re: [VOTE] Committer access for Stefan Groschupf

2006-01-05 Thread Doug Cutting
[EMAIL PROTECTED] wrote: I'm late, but better late than never: +1 (I thought Stefan was already a committer, actually). +1 Not as late as I am! I'm still catching up on December email... The Lucene PMC has final say, and not all members of the PMC are on nutch-dev, so I'll forward the

Re: no static NutchConf

2006-01-05 Thread Thomas Jaeger
Doug Cutting wrote: Stefan Groschupf wrote: I have two more ideas: 1) create NutchConf as an interface (not a class); 2) make it work as a plugin. I like the idea of making the conf a singleton and understand the need to be able to integrate Nutch. However, I would love to do one first step and

Re: problems http-client

2006-01-05 Thread Doug Cutting
Andrzej Bialecki wrote: Hmm... I'm not saying it's flawless, there were surely some mysterious things going on with it. That large crawl you mention, was it with the (recently updated in Nutch) release 3.0? What were the issues? No, it was in early December, with the previous version. I

Normalizing URLs with anchors

2006-01-05 Thread Ken Krugler
Hi all, The default regex-normalize.xml currently strips out PHP session IDs. I'm wondering whether it would also make sense to remove anchors (the #fragment part) from URLs. For example, currently these two URLs are treated as different:

Re: Normalizing URLs with anchors

2006-01-05 Thread ogjunk-nutch
I think it's safe to strip anchors, as they simply point to a different portion of the same page for browser rendering. I do that for Simpy while normalizing URLs, in order not to have duplicates like this. Otis
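
A rule along these lines could be added to regex-normalize.xml (an untested sketch, following the pattern/substitution form the file already uses for session IDs):

    <regex>
      <!-- strip everything from the first '#' to the end of the URL -->
      <pattern>#.*</pattern>
      <substitution></substitution>
    </regex>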

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-05 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361924 ] Chris A. Mattmann commented on NUTCH-139: - Hi Doug, While it's true that content-length can be computed from the Content's data, wouldn't it also be nice to have it
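
The trade-off being weighed, in sketch form (hypothetical types, not Nutch's actual Content class): recompute the length from the raw bytes on demand, or record it once under a standard property name.

    import java.util.Properties;

    // sketch of the two options discussed in NUTCH-139
    public class ContentLengthSketch {
      public static void main(String[] args) {
        byte[] raw = "example page body".getBytes();
        Properties metadata = new Properties();
        int computed = raw.length;  // option 1: compute from the data
        // option 2: store it once under an agreed standard key
        metadata.setProperty("Content-Length", String.valueOf(computed));
        System.out.println(metadata.getProperty("Content-Length"));
      }
    }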

RE: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-05 Thread chris.mattmann
Guys, my apologies for the spam comments -- I tried to submit my comment through JIRA and it kept giving me "service unavailable", so I resubmitted about five times; on the fifth try it finally went through, but I guess the other comments went through too. I'll try to remove them

Re: mapred crawling exception - Job failed!

2006-01-05 Thread Lukas Vlcek
Hi, I found the reason for that exception! If you look into my crawl.log carefully, you will notice these lines: 060104 213608 Parsing [http://220.000.000.001/otd_04_Detailed_Design_Document.doc] with [EMAIL PROTECTED] 060104 213609 Unable to successfully parse content