[ http://issues.apache.org/jira/browse/NUTCH-163?page=all ]
Daniel Feinstein updated NUTCH-163:
---
Attachment: LogFormatter.java
Here is the solution we have in the RawSugar.com project.
LogFormatter design
---
Key: NUTCH-163
[
http://issues.apache.org/jira/browse/NUTCH-163?page=comments#action_12361815 ]
nutch.newbie commented on NUTCH-163:
Hi Daniel,
I would really like to give this a try. Could you please provide some instructions?
It would be very helpful.
Thanks
[
http://issues.apache.org/jira/browse/NUTCH-163?page=comments#action_12361816 ]
nutch.newbie commented on NUTCH-163:
What I mean is: should I just replace the existing
/src/java/org/apache/nutch/util/LogFormatter.java
with yours and recompile?
Lukas Vlcek wrote:
How can I learn that?
What I do is run the regular one-step command [/bin/nutch crawl]
In that case your nutch-default.xml / nutch-site.xml decides; there is a
boolean option there. If you didn't change it, then it defaults to
true (i.e. your fetcher is parsing the
Hi,
I've been toying with the following idea, which is an extension of the
existing URLFilter mechanism and the concept of a crawl frontier.
Let's suppose we have several initial seed urls, each with a different
subjective quality. We would like to crawl these, and expand the
crawling
Excellent ideas, and that is what I'm hoping to do: use
some of the social-bookmarking-type ideas to build the
starter sites and link maps from.
I hope to work with Simpy or other bookmarking
projects to build something of a popularity map (human-edited
authority) to merge and calculate against a
I have two more ideas:
1) create NutchConf as interface (not class)
2) make it work as plugin
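Idea (1) above could be sketched roughly as follows. This is a minimal, illustrative sketch in modern Java; the method names are assumptions, not the actual Nutch API. NutchConf becomes an interface, and a trivial map-backed implementation shows how an embedding application could supply its own configuration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: method names are assumptions, not the real Nutch API.
interface NutchConf {
    String get(String name, String defaultValue);
    void set(String name, String value);
}

// A trivial map-backed implementation an embedder might supply.
class MapNutchConf implements NutchConf {
    private final Map<String, String> props = new HashMap<>();

    public String get(String name, String defaultValue) {
        return props.getOrDefault(name, defaultValue);
    }

    public void set(String name, String value) {
        props.put(name, value);
    }
}
```

With callers depending only on the interface, swapping in a plugin-supplied or per-application implementation becomes possible without touching core code.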
I like the idea of making the conf a singleton, and I understand the
need to be able to integrate Nutch.
However, I would love to do one first step; later on we can make
this second step. I made
(2) What I'd REALLY like to see is if NutchConf were an interface,
As mentioned, give us some time to get the first step done, and then
I'm sure such community contributions are always welcome.
Many people can work together on this.
Stefan
I like the idea, and it is another step in the direction of vertical
search, where I personally see the biggest chance for Nutch.
How to implement it? Surprisingly, I think that it's very simple -
just adding a CrawlDatum.policyId field would suffice, assuming we
have a means to store and
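The policyId proposal above could look roughly like this. A hypothetical sketch only: the field and class names are illustrative, not the real CrawlDatum layout.

```java
// Hypothetical sketch: each crawl entry carries a policyId linking the URL
// back to the crawl policy (and seed quality) it was discovered under.
class CrawlDatumSketch {
    final String url;
    final int policyId;   // which crawl policy / frontier this URL belongs to

    CrawlDatumSketch(String url, int policyId) {
        this.url = url;
        this.policyId = policyId;
    }
}
```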
Stefan Groschupf wrote:
Hi Andrzej,
maybe I am coming closer to your idea of caching some objects.
Yes. If you remember our discussion, I'd also like to follow a
pattern where such instances are cached inside this NutchConf
instance, if appropriate (i.e. if they are reusable and multi-
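The caching pattern being discussed might look roughly like this, as a sketch under the assumption that reusable, thread-safe helpers are cached once per configuration instance; the class and method names are made up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: a configuration object that caches reusable helper
// instances (parsers, filters, ...) so each is created only once per conf.
class CachingConf {
    private final Map<String, Object> cache = new HashMap<>();

    public synchronized Object getCached(String key, Supplier<Object> factory) {
        // Create on first request, return the cached instance afterwards.
        return cache.computeIfAbsent(key, k -> factory.get());
    }
}
```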
Stefan Groschupf wrote:
I like the idea, and it is another step in the direction of vertical
search, where I personally see the biggest chance for Nutch.
How to implement it? Surprisingly, I think that it's very simple -
just adding a CrawlDatum.policyId field would suffice, assuming we
Stefan Groschupf wrote:
Before we start adding metadata and more metadata, why not add
general metadata support to the CrawlDatum once? Then we can have any kind
of plugin that adds and processes metadata belonging to a URL.
+1
This feature strikes me as something that might prove very
Doug Cutting wrote:
Stefan Groschupf wrote:
Before we start adding metadata and more metadata, why not add
general metadata support to the CrawlDatum once? Then we can have any
kind of plugin that adds and processes metadata belonging to a URL.
+1
This feature strikes me as
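The generic-metadata idea above could be sketched like this. An illustrative shape only, not the real CrawlDatum: instead of adding one fixed field per feature, the crawl entry carries an open key/value map that any plugin can read and write.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: an open key/value map on the crawl entry, so plugins
// can attach arbitrary per-URL metadata without schema changes.
class CrawlDatumWithMeta {
    private final Map<String, String> metaData = new HashMap<>();

    public void putMeta(String key, String value) {
        metaData.put(key, value);
    }

    public String getMeta(String key) {
        return metaData.get(key);   // null when no plugin has set the key
    }
}
```

With this shape, the earlier policyId proposal would just be one more key in the map rather than a schema change.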
Andrzej,
This sounds like another great way to create more of a vertical
search application as well. By defining trusted seed sources we can
limit the scope of the crawl to a more suitable set of links.
Also, being able to apply additional rules by domain/host or by
trusted source would be
[EMAIL PROTECTED] wrote:
I'm late, but better late than never: +1 (I thought Stefan was already a
committer, actually).
+1
Not as late as I am! I'm still catching up on December email...
The Lucene PMC has final say, and not all members of the PMC are on
nutch-dev, so I'll forward the
Doug Cutting wrote:
Stefan Groschupf wrote:
I have two more ideas:
1) create NutchConf as interface (not class)
2) make it work as plugin
I like the idea of making the conf a singleton, and I understand the
need to be able to integrate Nutch.
However, I would love to do one first step and
Andrzej Bialecki wrote:
Hmm... I'm not saying it's flawless, there were surely some mysterious
things going on with it. That large crawl you mention, was it with the
(recently updated in Nutch) release 3.0? What were the issues?
No, it was in early December, with the previous version. I
Hi all,
The default regex-normalize.xml currently strips out PHP session ids.
I'm wondering whether it would also make sense to remove anchor text
from URLs. For example, currently these two URLs are treated as
different:
I think it's safe to strip anchors, as they simply point to a different portion
of the same page for browser rendering. I do that for Simpy while normalizing
URLs, in order not to have duplicates like this.
Otis
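The anchor-stripping rule being discussed boils down to dropping everything from the "#" onward, so both URLs normalize to the same page. A minimal sketch; the class and method names are made up, and an actual rule would live as a pattern/substitution entry in regex-normalize.xml rather than in Java.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: drop the "#fragment" part of a URL so that
// http://example.com/page and http://example.com/page#section compare equal.
public class AnchorStripper {
    private static final Pattern ANCHOR = Pattern.compile("#.*$");

    public static String normalize(String url) {
        return ANCHOR.matcher(url).replaceAll("");
    }
}
```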
- Original Message
From: Ken Krugler [EMAIL PROTECTED]
To:
[
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361924 ]
Chris A. Mattmann commented on NUTCH-139:
-
Hi Doug,
While it's true that content-length can be computed from the Content's data,
wouldn't it also be nice to have it
Guys,
My apologies for the spam comments -- I tried to submit my comment
through JIRA once and it kept giving me "service unavailable". So I
resubmitted about 5 times; on the fifth try it finally went through -- but I
guess the other attempts went through too. I'll try to remove them
Hi,
I found the reason for that exception!
If you look at my crawl.log carefully, you will notice these lines:
060104 213608 Parsing
[http://220.000.000.001/otd_04_Detailed_Design_Document.doc] with
[EMAIL PROTECTED]
060104 213609 Unable to successfully parse content