ogjunk-nu...@yahoo.com is a member of nutch-...@lists.sourceforge.net
and nutch-gene...@lists.sourceforge.net. These lists do not otherwise
appear to forward to Apache lists. They used to perhaps forward through
nutch.org lists, but that domain no longer forwards any email. Please
check the
Will the next release really be 1.0 or will it be 0.10?
Doug
Briggs wrote:
I was just curious to know if there were any plans to release a
maintenance/bug-fix release before 1.0. I know there have been a slew
of patches and such (it's almost impossible to keep up, unless someone
has a
The problem is that nutch-dev (like most Apache mailing lists) sets the
Reply-to header to be itself, so that responses don't go back to the
sender. If you override this when responding (changing the To: line)
and respond to the sender, then it should end up as a comment, which
will be then
[
https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507473
]
Doug Cutting commented on NUTCH-479:
Neither. It would end up as the Lucene query:
+search phrase
Does the 0.9 crawl-delay implementation actually permit multiple threads
to access a site simultaneously?
Doug
Original Message
Subject: Nutch 0.9 and Crawl-Delay
Date: Sun, 3 Jun 2007 10:50:24 +0200
From: Lutz Zetzsche [EMAIL PROTECTED]
Reply-To: [EMAIL PROTECTED]
To: [EMAIL
[
https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500822
]
Doug Cutting commented on NUTCH-392:
Anchors, explain, and the cache are used relatively infrequently
Personnel discussions are conducted on the PMC's private mailing list.
I have forwarded your message there.
Thanks for the suggestion!
Doug
Gal Nitzan wrote:
Hi,
Since I'm no committer I can't really propose :-) but I just thought to draw
some attention to the great work done on the
karthik085 wrote:
How do you find when a revision was released?
Look at the tags in subversion:
http://svn.apache.org/viewvc/lucene/nutch/tags/
Doug
Tom White wrote:
I will be there too.
Unfortunately I won't be able to attend after all. The new baby in the
house won't let me!
Doug
Arun Kaundal wrote:
Actually Nutch people are kind of autocratic; don't expect more from them.
They do what they have decided.
Have you submitted patches that have been ignored or rejected?
Each Nutch contributor indeed does what he or she decides. Nutch is not
a service organization that
Steve Severance wrote:
I am not looking to really make an image retrieval engine. During indexing,
referencing docs will be analyzed and text content will be associated with the
image. Currently I want to keep this in a separate index. So despite the fact
that images will be returned the
[EMAIL PROTECTED] wrote:
[ ... ]
-/**
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
[ ... ]
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license
[
https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478854
]
Doug Cutting commented on NUTCH-455:
Alternately, we could define it as an error to attempt to dedup
Sami Siren wrote:
It would be more beneficial to everybody if the discussions (related to
the release or Nutch) were done in public (hey, this is open source!).
The off-the-list stuff IMO smells.
+1 Folks sometimes wish to discuss project matters off-list to spare
others the boring details, but
Chris Mattmann wrote:
It's too bad that
this has turned out to be an issue that I've handled incorrectly, and for
that, I apologize.
Sorry if I blew this out of proportion. We all help each other run this
project. I don't think any grave error was made. I just saw an
opportunity to remind
Zaheed Haque wrote:
It's been about a month that I have been trying to find the time to make
the necessary changes so that I could submit the code. Due to an enormous
workload I am unable to find the time. I am not sure how I should
proceed; I have personally tried to contact some of you off
list. (Which
[
https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476243
]
Doug Cutting commented on NUTCH-445:
Note that the site field is also used for search-time deduplication
Andrzej Bialecki wrote:
The degree of simplification is very substantial. Our NutchSuperQuery
doesn't have to do much more work than a simple TermQuery, so we can
assume that the cost to run it is the same as TermQuery times some
constant. What we gain then is the cost of not running all those
[
https://issues.apache.org/jira/browse/NUTCH-449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cutting reassigned NUTCH-449:
--
Assignee: Doug Cutting
Format of junit output should be configurable
Nutch's nightly builds have been moved to a Hudson server at:
http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/
I've stopped the old nightly build process and added a redirect from the
old nightly build distribution directory to this page.
Thanks to Nigel Daley for configuring
[
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472821
]
Doug Cutting commented on NUTCH-443:
this patch in some places removes the log guards
Most of the log guards
Doug Cutting (JIRA) wrote:
this patch in some places removes the log guards
Most of the log guards are misguided. Log guards should only be used on DEBUG
level messages in performance-critical inner loops. Since INFO is the expected log
level, a guard on INFO and WARN level messages does
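The rule of thumb in this message can be sketched in Java; java.util.logging is used here only to keep the example self-contained (Nutch at the time used commons-logging, whose isDebugEnabled()/isWarnEnabled() checks play the same role):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class GuardDemo {
    private static final Logger LOG = Logger.getLogger(GuardDemo.class.getName());

    static String describe(int[] items) {
        for (int item : items) {
            // DEBUG-level message in a hot loop: guard it, so the string
            // concatenation is skipped entirely when the level is disabled.
            if (LOG.isLoggable(Level.FINE)) {
                LOG.fine("processing item " + item);
            }
        }
        // INFO is the expected running level, so a guard here buys nothing:
        // the message is almost always emitted anyway.
        LOG.info("processed " + items.length + " items");
        return "processed " + items.length + " items";
    }

    public static void main(String[] args) {
        System.out.println(describe(new int[]{1, 2, 3}));
    }
}
```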
Renaud Richardet wrote:
I see. I was thinking that I could index the feed items without having
to fetch them individually.
Okay, so if Parser#parse returned a Map<String, Parse>, then the URL for
each parse should be that of its link, since you don't want to fetch
that separately. Right?
So
Chris Mattmann wrote:
Sorry to be so thick-headed, but could someone explain to me in really
simple language what this change is requesting that is different from the
current Nutch API? I still don't get it, sorry...
A Content would no longer generate a single Parse. Instead, a Content
Doğacan Güney wrote:
OK, then should I go forward with this and implement something? This
should be pretty easy,
though I am not sure what to give as keys to a Parse[].
I mean, when getParse returned a single Parse, ParseSegment output them
as <url, Parse>. But, if getParse
returns an array,
Renaud Richardet wrote:
The usecase is that you index RSS-feeds, but your users can search each
feed-entry as a single document. Does it make sense?
But each feed item also contains a link whose content will be indexed
and that's generally a superset of the item. So should there be two
Doğacan Güney wrote:
I think it would make much more sense to change parse plugins to take
content and return Parse[] instead of Parse.
You're right. That does make more sense.
Doug
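The change agreed on in this exchange (one Content yielding several Parses, keyed by the URL of each feed item) can be sketched roughly as below. The Content and Parse types here are simplified stand-ins, not the real Nutch classes, and the Map-returning signature is the proposal under discussion, not the API as it shipped:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FeedParseSketch {
    // Simplified stand-ins for Nutch's Content and Parse types.
    record Content(String url, String data) {}
    record Parse(String text) {}

    // Hypothetical multi-result signature: one fetched feed produces a Parse
    // per item, keyed by the item's link so the link need not be re-fetched.
    static Map<String, Parse> parse(Content feed) {
        Map<String, Parse> parses = new LinkedHashMap<>();
        for (String item : feed.data().split(";")) {
            String[] kv = item.split("=", 2); // toy item format: "link=text"
            parses.put(kv[0], new Parse(kv[1]));
        }
        return parses;
    }

    public static void main(String[] args) {
        Content feed = new Content("http://example.com/rss",
            "http://example.com/a=first item;http://example.com/b=second item");
        System.out.println(parse(feed).keySet());
    }
}
```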
Gal Nitzan wrote:
IMHO the data that is needed, i.e. the data that will be fetched in the next fetch
process, is already available in the item element. Each item element represents one
web resource, and there is no reason to go to the server and re-fetch that resource.
Perhaps ProtocolOutput
Teruhiko Kurosaka wrote:
I suggest i18n be renamed to l10n, short for
localization.
Can you please file an issue in Jira for this? Ideally you could even
provide a patch. The source for the website is in subversion at:
http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/site
Forrest
Scott Ganyo (JIRA) wrote:
... since Hadoop hijacks and reassigns all log formatters (also a bad
practice!) in the org.apache.hadoop.util.LogFormatter static constructor ...
FYI, Hadoop no longer does this.
Doug
Dennis Kubes wrote:
Andrzej Bialecki wrote:
I believe that at this point it's crucial to keep the project
well-focused (at the moment I think the main focus is on larger
installations, and not the small ones), and also to make Nutch
attractive to developers as a reusable search engine
Chris Mattmann wrote:
So, does this render the patch that I wrote obsolete?
It's at least out-of-date and perhaps obsolete. A quick read of
Fetcher.java looks like there might be a case where a fatal error is
logged but the fetcher doesn't exit, in FetcherThread#output().
Doug
[EMAIL PROTECTED] wrote:
Draft version of How to Become a Nutch Developer is on the wiki at:
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
Please take a look, and if you think anything needs to be added, removed,
or changed, let me know.
Thanks for taking the time to write this up!
Andrzej Bialecki wrote:
The workflow is different - I'm not sure about the details, perhaps Doug
can correct me if I'm wrong ... and yes, it uses JIRA extensively.
1. An issue is created
2. patches are added, removed, commented, etc...
3. finally, a candidate patch is selected, and the issue is
[EMAIL PROTECTED] wrote:
Yes, certainly, anything that can be shared and decoupled from pieces that make
each branch (not SVN/CVS branch) different, should be decoupled. But I was
really curious about whether people think this is a valid idea/direction, not
necessarily immediately how things
Dennis Kubes wrote:
Can you answer the question of how to add developer names to JIRA or if
that is only for committers?
It's not just for committers, but also for regular contributors. I have
added you. Anyone else?
Doug
Stefan Groschupf wrote:
I don't want to start a emotional discussion here, however talking about
the problem in public might help.
What, specifically, is the problem you perceive?
Doug
Dennis Kubes wrote:
I will say that it is difficult for people to understand how to get more
involved. I have been working with Nutch and Hadoop for almost a year
now on a daily basis and only now am I understanding how to contribute
through jira, etc. There needs to be more guidance in
Stefan Groschupf wrote:
We run the gui in several production environments with patched hadoop
code - since this is from our point of view the clean approach.
Everything else feels like a workaround to fix some strange hadoop
behaviors.
Are there issues in Hadoop's Jira for these? If so, do
Andrzej Bialecki wrote:
The reason is that if you pack this file into your job JAR, the job jar
would become very large (presumably this 40MB is already compressed?).
The job jar needs to be copied to each tasktracker for each task, so you
will experience a performance hit just because of the size
The wiki would be a good place for this.
Doug
Peter Landolt wrote:
Hello,
We tried to introduce Nutch at a telecommunication company in Switzerland
as the search engine of their future main search solution. As they were
also evaluating commercial products, we needed to offer them a brochure to
Sami Siren wrote:
Stefan Groschupf wrote:
See:
http://www.find23.net/nutch_guiToHadoop.pdf
Section required hadoop changes.
I guess you refer to these:
• LocalJobRunner:
• Run as a kind of singleton
• Have a kind of jobQueue
• Implement JobSubmissionProtocol status-report
[ http://issues.apache.org/jira/browse/NUTCH-392?page=all ]
Doug Cutting reassigned NUTCH-392:
--
Assignee: Doug Cutting
OutputFormat implementations should pass on Progressable
[ http://issues.apache.org/jira/browse/NUTCH-392?page=all ]
Doug Cutting updated NUTCH-392:
---
Attachment: NUTCH-392.patch
OutputFormat implementations should pass on Progressable
[
http://issues.apache.org/jira/browse/NUTCH-392?page=comments#action_12444719 ]
Doug Cutting commented on NUTCH-392:
This should not be applied until Nutch uses Hadoop 0.8. It also contains a
patch required to make Nutch work correctly
[ http://issues.apache.org/jira/browse/NUTCH-392?page=all ]
Doug Cutting updated NUTCH-392:
---
Attachment: (was: NUTCH-392.patch)
OutputFormat implementations should pass on Progressable
[ http://issues.apache.org/jira/browse/NUTCH-392?page=all ]
Doug Cutting updated NUTCH-392:
---
Attachment: NUTCH-392.patch
Oops. Attached the wrong patch. Here's the right one.
OutputFormat implementations should pass on Progressable
Sami Siren wrote:
Looks like somebody just enabled the email-to-jira-comments feature. I was
just wondering whether it would be good to use this feature more widely.
I think it would be good. That way mailing list discussion would be
logged to the bug as well.
This could be achieved by removing the
[ http://issues.apache.org/jira/browse/NUTCH-304?page=all ]
Doug Cutting resolved NUTCH-304.
Resolution: Fixed
I just fixed this. Thanks for noticing!
Change JIRA email address for nutch issues from apache incubator
[
http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439682 ]
Doug Cutting commented on NUTCH-353:
It's worth noting that Google, Yahoo! and Microsoft's searches all return lots
of links to www-XXX.ibm.com. Just some
Chris Mattmann wrote:
+1. I think that workflow makes a lot of sense. Currently users in the
nutch-developers group can close and resolve issues. In the Hadoop workflow,
would this continue to be the case?
In Hadoop, most developers can resolve but not close. Only members of a
separate
Sami Siren wrote:
I am not able to do it either, or else I just don't know how; can Doug
help us here?
This requires a change to the project's workflow. I'd be happy to move
Nutch to use the workflow we use for Hadoop, which supports Patch
Available.
This workflow has one other
Sami Siren wrote:
Patch works for me.
OK. I just committed it.
Thanks!
Doug
Jérôme Charron wrote:
In my environment, the crawl command terminates with the following error:
2006-07-06 17:41:49,735 ERROR mapred.JobClient
(JobClient.java:submitJob(273))
- Input directory /localpath/crawl/crawldb/current in local is invalid.
Exception in thread "main" java.io.IOException:
[ http://issues.apache.org/jira/browse/NUTCH-309?page=all ]
Doug Cutting reopened NUTCH-309:
I am re-opening this issue, as the guards were added in far too many places.
Jerome, can you please fix these so that guards are only added when (a) the log
[ http://issues.apache.org/jira/browse/NUTCH-312?page=all ]
Doug Cutting resolved NUTCH-312:
Fix Version: 0.8-dev
Resolution: Fixed
I just upgraded Nutch to Hadoop 0.4.0, incorporating this patch. Thanks,
Milind!
Fix for upcoming
[EMAIL PROTECTED] wrote:
NUTCH-309 : Added logging code guards
[ ... ]
+ if (LOG.isWarnEnabled()) {
+   LOG.warn("Line does not contain a field name: " + line);
+ }
[ ...]
-1
I don't think guards should be added everywhere. They make the code
bigger and provide
http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.html
Jérôme Charron wrote:
For now, I have used the same log4j properties as hadoop (see
http://svn.apache.org/viewvc/lucene/hadoop/trunk/conf/log4j.properties?view=markup&pathrev=411254
) for the back-end, and
I was thinking of using stdout for the front-end.
What do you think about this?
We
Stefan Groschupf wrote:
As far as I understand, hadoop uses commons logging. Should we switch to
using commons logging as well?
+1
Doug
[
http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12414114 ]
Doug Cutting commented on NUTCH-289:
It should be possible to partition by IP and limit fetchlists by IP. Resolving
only in the fetcher is too late to implement
Ken Krugler wrote:
2. Are the Nutch Devs replying to the emails sent to this list? I could
understand if they are replying off-list, but to an outside observer
such as
myself it appears as though webmasters are not getting many replies
to their
inquiries.
I can speak for myself only .. I'm
[
http://issues.apache.org/jira/browse/NUTCH-273?page=comments#action_12413528 ]
Doug Cutting commented on NUTCH-273:
Redirects should really not be followed immediately anyway. We should instead
note that it was redirected and to which URL
CrawlDatum should store IP address
--
Key: NUTCH-289
URL: http://issues.apache.org/jira/browse/NUTCH-289
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Reporter: Doug Cutting
If the CrawlDatum stored
[
http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413272 ]
Doug Cutting commented on NUTCH-288:
Is there a performant way of doing deduplication and knowing for sure how
many documents are available to view?
No. But we should
[
http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413305 ]
Doug Cutting commented on NUTCH-288:
Is there a quickfix possible somehow?
Someone needs to fix the OpenSearch servlet.
It looks like just changing line 146
[
http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12379116 ]
Doug Cutting commented on NUTCH-267:
re: it's as if we didn't want it to be re-crawled if we can't find any inlinks
to it
We prioritize crawling based on the number
Jérôme Charron wrote:
This means there's no markup in the OpenSearch output?
Yes, no markup for now.
Doesn't this break any existing application that uses OpenSearch and
displays summaries in a web browser? This is an incompatible change
which we should avoid.
Shouldn't there be?
This is a known, fixed, Hadoop bug:
http://issues.apache.org/jira/browse/HADOOP-201
I'm going to release Hadoop 0.2.1 with this and one other patch as soon
as Subversion is back up, then upgrade Nutch to use 0.2.1.
Doug
Marko Bauhardt wrote:
Hi all,
I start nutch-0.8-dev (Revision
Jérôme Charron wrote:
Yes Doug, but in fact, the idea is to add the toString(Formatter) method in
a common place (Summary).
And add one specific Formatter implementation for OpenSearch and another
one
for search.jsp:
The reason is that they should not use the same HTML code:
1. OpenSearch
[
http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378765 ]
Doug Cutting commented on NUTCH-267:
Andrzej: your analysis is correct, but it mostly only applies when re-crawling.
In an initial crawl, where each url is fetched only
[
http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378458 ]
Doug Cutting commented on NUTCH-134:
+1 for Summary as Writable and change HitSummarizer.getSummary() to return a
Summary directly rather than a String. I don't think
[
http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378560 ]
Doug Cutting commented on NUTCH-267:
The OPIC score is much like a count of incoming links, but a bit more refined.
OPIC(P) is one plus the sum of the OPIC contributions
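The truncated comment reads like the standard OPIC idea, where each page splits its "cash" evenly among its outlinks and a page's score is one plus what it receives; the exact Nutch formula is not shown here, so the sketch below is an assumption based on that standard scheme, not the project's implementation:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OpicSketch {
    // One propagation step of an OPIC-like score: every page divides its
    // current score evenly among its outlinks, and a page's new score is
    // one plus the contributions it receives.
    static Map<String, Double> step(Map<String, List<String>> outlinks,
                                    Map<String, Double> score) {
        Map<String, Double> next = new HashMap<>();
        for (String page : outlinks.keySet()) next.put(page, 1.0);
        for (var e : outlinks.entrySet()) {
            double share = score.getOrDefault(e.getKey(), 1.0) / e.getValue().size();
            for (String target : e.getValue()) next.merge(target, share, Double::sum);
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "A", List.of("B", "C"),   // A contributes half its score to each
            "B", List.of("C"),
            "C", List.of("A"));
        Map<String, Double> next = step(links, Map.of("A", 1.0, "B", 1.0, "C", 1.0));
        // C receives 0.5 from A and 1.0 from B, so its new score is 1 + 1.5.
        System.out.println(next.get("C"));
    }
}
```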
Chris Schneider wrote:
I just noticed that the generate.max.per.host property is only enforced
on a per reduce task basis during the first generate job (see
Generator.Selector.reduce for details). At a minimum, it should probably
be documented this way in nutch-default.xml.template.
Yes, but
This sort of error will become much harder to make once we upgrade to
Hadoop 0.2 and replace most uses of java.io.File with
org.apache.hadoop.fs.Path.
Doug
[EMAIL PROTECTED] wrote:
Author: ab
Date: Wed May 3 19:42:02 2006
New Revision: 399515
URL:
It seems Stefan is giving a talk...
http://events.commerce.net/?p=58
Doug
[EMAIL PROTECTED] wrote:
As far as we understood from the MapRed documentation, all reduce tasks must be
launched after the last map task is finished, i.e. map and reduce must not run
simultaneously. But often in logs we see such records: map 80%, reduce 10%,
and many more records where map is less than
Jérôme Charron wrote:
We had to turn off
the guessing of content types to index Apache correctly.
Instead of turning off the guessing of content types you should only
remove the magic for xml in mime-types.xml
Perhaps that would have worked also, but, with Apache, simply trusting
the
[
http://issues.apache.org/jira/browse/NUTCH-257?page=comments#action_12376989 ]
Doug Cutting commented on NUTCH-257:
I'd vote to never have Summary#toString() perform entity encoding, to fix
search.jsp to encode things itself, and *not* to add a new
[EMAIL PROTECTED] wrote:
We updated hadoop from trunk branch. But now we get new errors:
Oops. Looks like I introduced a bug yesterday. Let me fix it...
Sorry,
Doug
Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Thursday, April
27, 2006 12:48 AM
To: nutch-dev@lucene.apache.org
Subject: Re: exception
Importance: High
This is a Hadoop DFS error. It could mean that you don't have any
datanodes running, or that all your datanodes are full
Jérôme Charron wrote:
Finally, it is good news that Nutch seems to be more intelligent about
content-type guessing than Firefox or IE, no?
I'm not so sure. When crawling Apache we had trouble with this feature.
Some HTML files that had an XML header and the server identified as
text/html
This is a Hadoop DFS error. It could mean that you don't have any
datanodes running, or that all your datanodes are full. Or, it could be
a bug in dfs. You might try a recent nightly build of Hadoop to see if
it works any better.
Doug
Anton Potehin wrote:
What does the following error mean
[ http://issues.apache.org/jira/browse/NUTCH-250?page=all ]
Doug Cutting resolved NUTCH-250:
Fix Version: 0.8-dev
Resolution: Fixed
Assign To: Doug Cutting
I just committed this. Thanks, Rod.
Generate to log truncation caused
Anton Potehin wrote:
We have a question on this property. Is it really preferred to set this
parameter several times greater than the number of available hosts? We do
not understand why it should be so.
It should be at least numHosts*mapred.tasktracker.tasks.maximum, so that
all of the task
Anton Potehin wrote:
Are there any ways to rotate these logs?
One way would be to configure the JVM to use a rolling FileHandler:
file:///home/cutting/local/jdk1.5-docs/api/java/util/logging/FileHandler.html
This should be possible by setting HADOOP_OPTS (in conf/hadoop-env.sh)
and
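The suggestion above can be sketched as a java.util.logging configuration file; the handler property names are standard JDK ones, while the log path and sizes are placeholders:

```properties
# logging.properties (sketch): a FileHandler that rolls at ~10 MB,
# keeping 5 generations (nutch0.log .. nutch4.log).
handlers = java.util.logging.FileHandler
java.util.logging.FileHandler.pattern = /var/log/nutch/nutch%g.log
java.util.logging.FileHandler.limit = 10000000
java.util.logging.FileHandler.count = 5
java.util.logging.FileHandler.append = true
```

Pointing the JVM at it would then mean adding something like -Djava.util.logging.config.file=/path/to/logging.properties to HADOOP_OPTS in conf/hadoop-env.sh.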
Anton Potehin wrote:
1. We have found these flags in CrawlDatum class:
public static final byte STATUS_SIGNATURE = 0;
public static final byte STATUS_DB_UNFETCHED = 1;
public static final byte STATUS_DB_FETCHED = 2;
public static final byte STATUS_DB_GONE = 3;
public static final
Shailesh Kochhar wrote:
If I understand this correctly, you can only dedup by one field. This
would mean that if you were to implement and use content-based
deduplication, you'd have to give up limiting the number of hits per host.
Is this correct, or did I miss something?
That's correct.
[EMAIL PROTECTED] wrote:
+<!-- Copy the plugin.dtd file to the plugin doc-files dir -->
+<copy file="${plugins.dir}/plugin.dtd"
+      todir="${src.dir}/org/apache/nutch/plugin/doc-files"/>
The build should not make changes to the source tree. The source tree
should be read-only to the
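One way to honor the read-only-source rule, sketched as an Ant fragment: stage the DTD under the build output that javadoc consumes instead of under src/. The ${build.javadoc} property name here is an assumption, not Nutch's actual build.xml:

```xml
<!-- sketch: copy plugin.dtd into the build directory rather than
     mutating the source tree; build.javadoc is an assumed property -->
<copy file="${plugins.dir}/plugin.dtd"
      todir="${build.javadoc}/org/apache/nutch/plugin/doc-files"/>
```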
[
http://issues.apache.org/jira/browse/NUTCH-246?page=comments#action_12374272 ]
Doug Cutting commented on NUTCH-246:
It seems like the Injector should be loading the current time from a job
configuration property in the same way
Chris Mattmann wrote:
+1 for a release sooner rather than later.
I think this is a good plan. There's no reason we can't do another
release in a month. If it is back-compatbible we can call it 0.8.x and
if it's incompatible we can call it 0.9.0.
I'm going to make a Hadoop 0.1.1 release
Andrzej Bialecki wrote:
This selection is primarily made in the while() loop in
CrawlDbReducer:45. My main objection is that selecting the highest
value (meaning most recent) relies on the fact that values of status
codes in CrawlDatum are ordered according to their meaning, and they are
Piotr Kosiorowski wrote:
I will make it a totally separate target (so tests do not
depend on it).
That was actually Doug's idea (and I agree with it) to stop the build
file if PMD complains about something. It's similar to testing -- if
your tests fail, the entire build file fails.
I totally
Sami Siren wrote:
I know there are people who think that a plain xml interface is good
enough for all but I would like to give this new architecture a try.
I think this would be a great addition. The XML has a lot of uses, but
we should include a good native, extensible, skinnable search UI.
TDLN wrote:
I mean, how do others keep up to date with the main codeline? Do you
advise updating every day?
Should we make a 0.8.0 release soon? What features are still missing
that we'd like to get into this release?
Doug
FYI, Mike wrote some evaluation stuff for Nutch a long time ago. I
found it in the Sourceforge Attic:
http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/quality/Attic/
This worked by querying a set of search engines, those in:
Other options (raised on the Hadoop list) are Checkstyle:
http://checkstyle.sourceforge.net/
and FindBugs:
http://findbugs.sourceforge.net/
Although these are both under LGPL and thus harder to include in Apache
projects.
Anything that generates a lot of false positives is bad: it either
[
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372979 ]
Doug Cutting commented on NUTCH-240:
+1 for committing Generator.patch.txt now.
0 for committing the rest until I've had more time to think about it. I'm not
against
[
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372981 ]
Doug Cutting commented on NUTCH-240:
Also, note that we can now extend Hadoop's new MapReduceBase to implement
configure() and close() for many Mappers and Reducers
Jérôme Charron wrote:
One more question about javadoc (I hope the last one):
Do you think it makes sense to split the plugins gathered into the Misc
group
into many plugins (such as index-more / query-more), so that each sub-plugin
can be dispatched into proper Group.
No, I don't think so.