Re: Nutch ML cleanup

2009-03-10 Thread Doug Cutting
ogjunk-nu...@yahoo.com is a member of nutch-...@lists.sourceforge.net and nutch-gene...@lists.sourceforge.net. These lists do not otherwise appear to forward to Apache lists. They used to perhaps forward through nutch.org lists, but that domain no longer forwards any email. Please check the

Re: Plans on releasing another bug fix release?

2007-07-03 Thread Doug Cutting
Will the next release really be 1.0 or will it be 0.10? Doug Briggs wrote: I was just curious to know if there were any plans to release a maintenance/bug-fix release before 1.0. I know there have been a slew of patches and such (it's almost impossible to keep up, unless someone has a

Re: JIRA email question

2007-06-27 Thread Doug Cutting
The problem is that nutch-dev (like most Apache mailing lists) sets the Reply-to header to be itself, so that responses don't go back to the sender. If you override this when responding (changing the To: line) and respond to the sender, then it should end up as a comment, which will be then

[jira] Commented: (NUTCH-479) Support for OR queries

2007-06-22 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507473 ] Doug Cutting commented on NUTCH-479: Neither. It would end up as the Lucene query: +search phrase

[Fwd: Nutch 0.9 and Crawl-Delay]

2007-06-04 Thread Doug Cutting
Does the 0.9 crawl-delay implementation actually permit multiple threads to access a site simultaneously? Doug Original Message Subject: Nutch 0.9 and Crawl-Delay Date: Sun, 3 Jun 2007 10:50:24 +0200 From: Lutz Zetzsche [EMAIL PROTECTED] Reply-To: [EMAIL PROTECTED] To: [EMAIL

[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-01 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500822 ] Doug Cutting commented on NUTCH-392: Anchors, explain, and the cache are used relatively infrequently

Re: proposal for committer

2007-05-29 Thread Doug Cutting
Personnel discussions are conducted on the PMC's private mailing list. I have forwarded your message there. Thanks for the suggestion! Doug Gal Nitzan wrote: Hi, Since I'm no committer I can't really propose :-) but I just thought to draw some attention to the great work done on the

Re: NUTCH-348 and Nutch-0.7.2

2007-05-24 Thread Doug Cutting
karthik085 wrote: How do you find when a revision was released? Look at the tags in subversion: http://svn.apache.org/viewvc/lucene/nutch/tags/ Doug

Re: ApacheCon in Amsterdam

2007-04-23 Thread Doug Cutting
Tom White wrote: I will be there too. Unfortunately I won't be able to attend after all. The new baby in the house won't let me! Doug

Re: Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-13 Thread Doug Cutting
Arun Kaundal wrote: Actually nutch people are kind of autocrate., don't expect more from them They do what they have decided Have you submitted patches that have been ignored or rejected? Each Nutch contributor indeed does what he or she decides. Nutch is not a service organization that

Re: Image Search Engine Input

2007-03-29 Thread Doug Cutting
Steve Severance wrote: I am not looking to really make an image retrieval engine. During indexing referencing docs will be analyzed and text content will be associated with the image. Currently I want to keep this in a separate index. So despite the fact that images will be returned the

Re: svn commit: r516643 - in /lucene/nutch/trunk/src/plugin/parse-html/src: java/org/apache/nutch/parse/html/DOMContentUtils.java test/org/apache/nutch/parse/html/TestDOMContentUtils.java

2007-03-20 Thread Doug Cutting
[EMAIL PROTECTED] wrote: [ ... ] -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with [ ... ] +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license

[jira] Commented: (NUTCH-455) dedup on tokenized fields is faulty

2007-03-07 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478854 ] Doug Cutting commented on NUTCH-455: Alternately, we could define it as an error to attempt to dedup

Re: Issues pending before 0.9 release

2007-03-06 Thread Doug Cutting
Sami Siren wrote: It would be more beneficial to everybody if the discussions (related to release or Nutch) is done on public (hey this is open source!). The off the list stuff IMO smells. +1 Folks sometimes wish to discuss project matters off-list to spare others the boring details, but

Re: FW: Nutch release process help

2007-03-06 Thread Doug Cutting
Chris Mattmann wrote: It's too bad that this has turned out to be an issue that I've handled incorrectly, and for that, I apologize. Sorry if I blew this out of proportion. We all help each other run this project. I don't think any grave error was made. I just saw an opportunity to remind

Re: Nutch JSF front-end code submission - Please advice next steps?

2007-02-28 Thread Doug Cutting
Zaheed Haque wrote: Its been about a month I been trying to find time to make the necessary changes so that I could submit the code. Due to enormous amount of work load I am unable to find the time. I am not sure how should I proceed, I have personally try to contact some of you off list. (Which

[jira] Commented: (NUTCH-445) Domain İndexing / Query Filter

2007-02-27 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476243 ] Doug Cutting commented on NUTCH-445: Note that the site field is also used for search-time deduplication

Re: Performance optimization for Nutch index / query

2007-02-23 Thread Doug Cutting
Andrzej Bialecki wrote: The degree of simplification is very substantial. Our NutchSuperQuery doesn't have to do much more work than a simple TermQuery, so we can assume that the cost to run it is the same as TermQuery times some constant. What we gain then is the cost of not running all those

[jira] Assigned: (NUTCH-449) Format of junit output should be configurable

2007-02-23 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting reassigned NUTCH-449: -- Assignee: Doug Cutting Format of junit output should be configurable

nightly builds moved to hudson

2007-02-23 Thread Doug Cutting
Nutch's nightly builds have been moved to a Hudson server at: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ I've stopped the old nightly build process and added a redirect from the old nightly build distribution directory to this page. Thanks to Nigel Daley for configuring

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-13 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472821 ] Doug Cutting commented on NUTCH-443: this patch in some places removes the log guards Most of the log guards

log guards

2007-02-13 Thread Doug Cutting
Doug Cutting (JIRA) wrote: this patch in some places removes the log guards Most of the log guards are misguided. Log guards should only be used on DEBUG level messages in performance-critical inner loops. Since INFO is the expected log level, a guard on INFO or WARN level messages does
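The guard rule in this message can be sketched roughly as follows. This is a minimal illustration, not code from Nutch: it assumes java.util.logging in place of the commons-logging API used in the thread (FINE stands in for DEBUG), and the class and method names are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: guard only debug-level messages inside a hot loop.
final class LogGuardSketch {
    private static final Logger LOG = Logger.getLogger(LogGuardSketch.class.getName());

    static {
        // Assumption: INFO is the expected level, as the message states.
        LOG.setLevel(Level.INFO);
    }

    // Counts how often a debug message string was actually built.
    static int messagesBuilt = 0;

    static String describe(int i) {
        messagesBuilt++;
        return "processing item " + i;
    }

    static long sum(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            // Performance-critical inner loop: the guard avoids building
            // the message string when debug logging is switched off.
            if (LOG.isLoggable(Level.FINE)) {
                LOG.fine(describe(i));
            }
            total += i;
        }
        // Outside the loop, at the expected level: no guard is needed,
        // because the message is almost always going to be written anyway.
        LOG.info("summed " + n + " items");
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(1000));
    }
}
```

With the level at INFO, the guarded branch never runs, so no debug strings are built inside the loop, while the single unguarded INFO call costs one string concatenation per invocation.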

Re: RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Doug Cutting
Renaud Richardet wrote: I see. I was thinking that I could index the feed items without having to fetch them individually. Okay, so if Parser#parse returned a Map<String,Parse>, then the URL for each parse should be that of its link, since you don't want to fetch that separately. Right? So

Re: RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Doug Cutting
Chris Mattmann wrote: Sorry to be so thick-headed, but could someone explain to me in really simple language what this change is requesting that is different from the current Nutch API? I still don't get it, sorry... A Content would no longer generate a single Parse. Instead, a Content

Re: RSS-fecter and index individul-how can i realize this function

2007-02-06 Thread Doug Cutting
Doğacan Güney wrote: OK, then should I go forward with this and implement something? This should be pretty easy, though I am not sure what to give as keys to a Parse[]. I mean, when getParse returned a single Parse, ParseSegment output them as <url, Parse>. But, if getParse returns an array,

Re: RSS-fecter and index individul-how can i realize this function

2007-02-06 Thread Doug Cutting
Renaud Richardet wrote: The usecase is that you index RSS-feeds, but your users can search each feed-entry as a single document. Does it makes sense? But each feed item also contains a link whose content will be indexed and that's generally a superset of the item. So should there be two

Re: RSS-fecter and index individul-how can i realize this function

2007-02-05 Thread Doug Cutting
Doğacan Güney wrote: I think it would make much more sense to change parse plugins to take content and return Parse[] instead of Parse. You're right. That does make more sense. Doug

Re: RSS-fecter and index individul-how can i realize this function

2007-02-02 Thread Doug Cutting
Gal Nitzan wrote: IMHO the data that is needed i.e. the data that will be fetched in the next fetch process is already available in the item element. Each item element represents one web resource. And there is no reason to go to the server and re-fetch that resource. Perhaps ProtocolOutput

Re: i18n in nutch home page is misnomor

2007-01-25 Thread Doug Cutting
Teruhiko Kurosaka wrote: I suggest i18n be renamed to l10n, short for localization. Can you please file an issue in Jira for this? Ideally you could even provide a patch. The source for the website is in subversion at: http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/site Forrest

Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2007-01-25 Thread Doug Cutting
Scott Ganyo (JIRA) wrote: ... since Hadoop hijacks and reassigns all log formatters (also a bad practice!) in the org.apache.hadoop.util.LogFormatter static constructor ... FYI, Hadoop no longer does this. Doug

Re: Next Nutch release

2007-01-25 Thread Doug Cutting
Dennis Kubes wrote: Andrzej Bialecki wrote: I believe that at this point it's crucial to keep the project well-focused (at the moment I think the main focus is on larger installations, and not the small ones), and also to make Nutch attractive to developers as a reusable search engine

Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2007-01-25 Thread Doug Cutting
Chris Mattmann wrote: So, does this render the patch that I wrote obsolete? It's at least out-of-date and perhaps obsolete. A quick read of Fetcher.java looks like there might be a case where a fatal error is logged but the fetcher doesn't exit, in FetcherThread#output(). Doug

Re: Finished How to Become a Nutch Developer

2007-01-23 Thread Doug Cutting
[EMAIL PROTECTED] wrote: Draft version of How to Become a Nutch Developer is on the wiki at: http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer Please take a look and if you think anything needs to be added, removed, or changed let me know. Thanks for taking the time to write this up!

Re: How to Become a Nutch Developer

2007-01-22 Thread Doug Cutting
Andrzej Bialecki wrote: The workflow is different - I'm not sure about the details, perhaps Doug can correct me if I'm wrong ... and yes, it uses JIRA extensively. 1. An issue is created 2. patches are added, removed commented, etc... 3. finally, a candidate patch is selected, and the issue is

Re: Reviving Nutch 0.7

2007-01-22 Thread Doug Cutting
[EMAIL PROTECTED] wrote: Yes, certainly, anything that can be shared and decoupled from pieces that make each branch (not SVN/CVS branch) different, should be decoupled. But I was really curious about whether people think this is a valid idea/direction, not necessarily immediately how things

Re: How to Become a Nutch Developer

2007-01-22 Thread Doug Cutting
Dennis Kubes wrote: Can you answer the question of how to add developer names to JIRA or if that is only for committers? It's not just for committers, but also for regular contributors. I have added you. Anyone else? Doug

Re: Next Nutch release

2007-01-19 Thread Doug Cutting
Stefan Groschupf wrote: I don't want to start a emotional discussion here, however talking about the problem in public might help. What, specifically, is the problem you perceive? Doug

Re: Next Nutch release

2007-01-19 Thread Doug Cutting
Dennis Kubes wrote: I will say that it is difficult for people to understand how to get more involved. I have been working with Nutch and Hadoop for almost a year now on a daily basis and only now am I understanding how to contribute through jira, etc. There needs to be more guidance in

Re: Next Nutch release

2007-01-18 Thread Doug Cutting
Stefan Groschupf wrote: We run the gui in several production environments with patched hadoop code - since this is from our point of view the clean approach. Everything else feels like a workaround to fix some strange hadoop behaviors. Are there issues in Hadoop's Jira for these? If so, do

Re: How can I get one plugin's root dir

2007-01-16 Thread Doug Cutting
Andrzej Bialecki wrote: The reason is that if you pack this file into your job JAR, the job jar would become very large (presumably this 40MB is already compressed?). Job jar needs to be copied to each tasktracker for each task, so you will experience performance hit just because of the size

Re: Brochure for Nutch

2006-12-08 Thread Doug Cutting
The wiki would be a good place for this. Doug Peter Landolt wrote: Hello, We tried to introduce Nutch at a telecommunication company in Switzerland as search engine of their future main search solution. As they were also proofing commercial products we needed to offer them a brochure to

Re: What's the status of Nutch-GUI?

2006-12-08 Thread Doug Cutting
Sami Siren wrote: Stefan Groschupf wrote: See: http://www.find23.net/nutch_guiToHadoop.pdf Section required hadoop changes. I guess you refer to these: • LocalJobRunner: • Run as kind of singleton • Have a kind of jobQueue • Implement JobSubmissionProtocol status-report

[jira] Assigned: (NUTCH-392) OutputFormat implementations should pass on Progressable

2006-10-25 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-392?page=all ] Doug Cutting reassigned NUTCH-392: -- Assignee: Doug Cutting OutputFormat implementations should pass on Progressable

[jira] Updated: (NUTCH-392) OutputFormat implementations should pass on Progressable

2006-10-25 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-392?page=all ] Doug Cutting updated NUTCH-392: --- Attachment: NUTCH-392.patch OutputFormat implementations should pass on Progressable

[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2006-10-25 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-392?page=comments#action_12444719 ] Doug Cutting commented on NUTCH-392: This should not be applied until Nutch uses Hadoop 0.8. It also contains a patch required to make Nutch work correctly

[jira] Updated: (NUTCH-392) OutputFormat implementations should pass on Progressable

2006-10-25 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-392?page=all ] Doug Cutting updated NUTCH-392: --- Attachment: (was: NUTCH-392.patch) OutputFormat implementations should pass on Progressable

[jira] Updated: (NUTCH-392) OutputFormat implementations should pass on Progressable

2006-10-25 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-392?page=all ] Doug Cutting updated NUTCH-392: --- Attachment: NUTCH-392.patch Oops. Attached the wrong patch. Here's the right one. OutputFormat implementations should pass on Progressable

Re: email to jira comments (WAS Re: [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements)

2006-10-16 Thread Doug Cutting
Sami Siren wrote: looks like somebody just enabled email-to-jira-comments-feature. I was just wondering would it be good to use this feature more widely. I think it would be good. That way mailing list discussion would be logged to the bug as well. This could be achieved by removing the

[jira] Resolved: (NUTCH-304) Change JIRA email address for nutch issues from apache incubator

2006-10-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-304?page=all ] Doug Cutting resolved NUTCH-304. Resolution: Fixed I just fixed this. Thanks for noticing! Change JIRA email address for nutch issues from apache incubator

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439682 ] Doug Cutting commented on NUTCH-353: It's worth noting that Google, Yahoo! and Microsoft's searches all return lots of links to www-XXX.ibm.com. Just some

Re: Patch Available status?

2006-08-31 Thread Doug Cutting
Chris Mattmann wrote: +1. I think that workflow makes a lot of sense. Currently users in the nutch-developers group can close and resolve issues. In the Hadoop workflow, would this continue to be the case? In Hadoop, most developers can resolve but not close. Only members of a separate

Re: Patch Available status?

2006-08-30 Thread Doug Cutting
Sami Siren wrote: I am not able to do it either, or then I just don't know how, can Doug help us here? This requires a change to the project's workflow. I'd be happy to move Nutch to use the workflow we use for Hadoop, which supports Patch Available. This workflow has one other

Re: Error with Hadoop-0.4.0

2006-07-12 Thread Doug Cutting
Sami Siren wrote: Patch works for me. OK. I just committed it. Thanks! Doug

Re: Error with Hadoop-0.4.0

2006-07-10 Thread Doug Cutting
Jérôme Charron wrote: In my environment, the crawl command terminate with the following error: 2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273)) - Input directory /localpath/crawl/crawldb/current in local is invalid. Exception in thread main java.io.IOException:

[jira] Reopened: (NUTCH-309) Uses commons logging Code Guards

2006-07-07 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-309?page=all ] Doug Cutting reopened NUTCH-309: I am re-opening this issue, as the guards were added in far too many places. Jerome, can you please fix these so that guards are only added when (a) the log

[jira] Resolved: (NUTCH-312) Fix for upcoming incompatibility with Hadoop-0.4

2006-06-28 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-312?page=all ] Doug Cutting resolved NUTCH-312: Fix Version: 0.8-dev Resolution: Fixed I just upgraded Nutch to Hadoop 0.4.0, incorporating this patch. Thanks, Milind! Fix for upcoming

Re: svn commit: r416346 [1/3] - in /lucene/nutch/trunk/src: java/org/apache/nutch/analysis/ java/org/apache/nutch/clustering/ java/org/apache/nutch/crawl/ java/org/apache/nutch/fetcher/ java/org/apach

2006-06-22 Thread Doug Cutting
[EMAIL PROTECTED] wrote: NUTCH-309 : Added logging code guards [ ... ] + if (LOG.isWarnEnabled()) { + LOG.warn("Line does not contain a field name: " + line); + } [ ... ] -1 I don't think guards should be added everywhere. They make the code bigger and provide

IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-14 Thread Doug Cutting
http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.html

Re: Nutch logging questions

2006-06-09 Thread Doug Cutting
Jérôme Charron wrote: For now, I have used the same log4j properties than hadoop (see http://svn.apache.org/viewvc/lucene/hadoop/trunk/conf/log4j.properties?view=markup&pathrev=411254 ) for the back-end, and I was thinking to use the stdout for front-end. What do you think about this? We

Re: svn commit: r411943 - in /lucene/nutch/trunk/lib: commons-logging-1.0.4.jar hadoop-0.2.1.jar hadoop-0.3.1.jar log4j-1.2.13.jar

2006-06-06 Thread Doug Cutting
Stefan Groschupf wrote: As far I understand hadoop use commons logging. Should we switch to use commons logging as well? +1 Doug

[jira] Commented: (NUTCH-289) CrawlDatum should store IP address

2006-05-31 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12414114 ] Doug Cutting commented on NUTCH-289: It should be possible to partition by IP and limit fetchlists by IP. Resolving only in the fetcher is too late to implement

Re: Mailing List nutch-agent Reports of Bots Submitting Forms

2006-05-30 Thread Doug Cutting
Ken Krugler wrote: 2. Are the Nutch Devs replying to the emails sent to this list? I could understand if they are replying off-list, but to an outside observer such as myself it appears as though webmasters are not getting many replies to their inquiries. I can speak for myself only .. I'm

[jira] Commented: (NUTCH-273) When a page is redirected, the original url is NOT updated.

2006-05-26 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-273?page=comments#action_12413528 ] Doug Cutting commented on NUTCH-273: Redirects should really not be followed immediately anyway. We should instead note that it was redirected and to which URL

[jira] Created: (NUTCH-289) CrawlDatum should store IP address

2006-05-26 Thread Doug Cutting (JIRA)
CrawlDatum should store IP address -- Key: NUTCH-289 URL: http://issues.apache.org/jira/browse/NUTCH-289 Project: Nutch Type: Bug Components: fetcher Versions: 0.8-dev Reporter: Doug Cutting If the CrawlDatum stored

[jira] Commented: (NUTCH-288) hitsPerSite-functionality flawed: problems writing a page-navigation

2006-05-25 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413272 ] Doug Cutting commented on NUTCH-288: Is there a performant way of doing deduplication and knowing for sure how many documents are available to view? No. But we should

[jira] Commented: (NUTCH-288) hitsPerSite-functionality flawed: problems writing a page-navigation

2006-05-25 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413305 ] Doug Cutting commented on NUTCH-288: Is there a quickfix possible somehow? Someone needs to fix the OpenSearch servlet. It looks like just changing line 146

[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-11 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12379116 ] Doug Cutting commented on NUTCH-267: re: it's as if we didn't want it to be re-crawled if we can't find any inlinks to it We prioritize crawling based on the number

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-10 Thread Doug Cutting
Jérôme Charron wrote: This means there's no markup in the OpenSearch output? Yes, no markup for now. Doesn't this break any existing application that uses OpenSearch and displays summaries in a web browser? This is an incompatible change which we should avoid. Shouldn't there be?

Re: dfs -report

2006-05-10 Thread Doug Cutting
This is a known, fixed, Hadoop bug: http://issues.apache.org/jira/browse/HADOOP-201 I'm going to release Hadoop 0.2.1 with this and one other patch as soon as Subversion is back up, then upgrade Nutch to use 0.2.1. Doug Marko Bauhardt wrote: Hi all, i start nutch-0.8-dev (Revision

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-10 Thread Doug Cutting
Jérôme Charron wrote: Yes Doug, but in fact, the idea is to add the toString(Formatter) method in a common place (Summary). And add one specific Formatter implementation for OpenSearch and another one for search.jsp : The reason is that they should not use the same HTML code : 1. OpenSearch

[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-09 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378765 ] Doug Cutting commented on NUTCH-267: Andrzej: your analysis is correct, but it mostly only applies when re-crawling. In an initial crawl, where each url is fetched only

[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

2006-05-08 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378458 ] Doug Cutting commented on NUTCH-134: +1 for Summary as Writable and change HitSummarizer.getSummary() to return a Summary directly rather than a String. I don't think

[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-08 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378560 ] Doug Cutting commented on NUTCH-267: The OPIC score is much like a count of incoming links, but a bit more refined. OPIC(P) is one plus the sum of the OPIC contributions

Re: generate.max.per.host is per reduce task

2006-05-07 Thread Doug Cutting
Chris Schneider wrote: I just noticed that the generate.max.per.host property is only enforced on a per reduce task basis during the first generate job (see Generator.Selector.reduce for details). At a minimum, it should probably be documented this way in nutch-default.xml.template. Yes, but

Re: svn commit: r399515 - /lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java

2006-05-05 Thread Doug Cutting
This sort of error will become much harder to make once we upgrade to Hadoop 0.2 and replace most uses of java.io.File with org.apache.hadoop.fs.Path. Doug [EMAIL PROTECTED] wrote: Author: ab Date: Wed May 3 19:42:02 2006 New Revision: 399515 URL:

CommerceNet Events » Blog Archive » T 3 5/11: Stefan Groschupf on Extending Nutch

2006-05-05 Thread Doug Cutting
It seems Stefan is giving a talk... http://events.commerce.net/?p=58 Doug

Re: mapred question

2006-05-02 Thread Doug Cutting
[EMAIL PROTECTED] wrote: As far as we understood from MapRed documentation all reduce tasks must be launched after last map task is finished e.g map and reduce must not work simultaneously. But often in logs we see such records: map 80%, reduce 10% and many more records where map is less then

Re: Content-Type inconsistency?

2006-05-02 Thread Doug Cutting
Jérôme Charron wrote: We had to turn off the guessing of content types to index Apache correctly. Instead of turning off the guessing of content types you should only to remove the magic for xml in mime-types.xml Perhaps that would have worked also, but, with Apache, simply trusting the

[jira] Commented: (NUTCH-257) Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field

2006-04-28 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-257?page=comments#action_12376989 ] Doug Cutting commented on NUTCH-257: I'd vote to never have Summary#toString() perform entity encoding, to fix search.jsp to encode things itself, and *not* to add a new

Re: exception

2006-04-27 Thread Doug Cutting
[EMAIL PROTECTED] wrote: We updated hadoop from trunk branch. But now we get new errors: Oops. Looks like I introduced a bug yesterday. Let me fix it... Sorry, Doug

Re: TRUNK IllegalArgumentException: Argument is not an array (WAS: Re: exception)

2006-04-27 Thread Doug Cutting
Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Thursday, April 27, 2006 12:48 AM To: nutch-dev@lucene.apache.org Subject: Re: exception Importance: High This is a Hadoop DFS error. It could mean that you don't have any datanodes running, or that all your datanodes are full

Re: Content-Type inconsistency?

2006-04-27 Thread Doug Cutting
Jérôme Charron wrote: Finally it is a good news that Nutch seems to be more intelligent on content-type guessing than Firefox or IE, no? I'm not so sure. When crawling Apache we had trouble with this feature. Some HTML files that had an XML header and the server identified as text/html

Re: exception

2006-04-26 Thread Doug Cutting
This is a Hadoop DFS error. It could mean that you don't have any datanodes running, or that all your datanodes are full. Or, it could be a bug in dfs. You might try a recent nightly build of Hadoop to see if it works any better. Doug Anton Potehin wrote: What means error of following

[jira] Resolved: (NUTCH-250) Generate to log truncation caused by generate.max.per.host

2006-04-20 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-250?page=all ] Doug Cutting resolved NUTCH-250: Fix Version: 0.8-dev Resolution: Fixed Assign To: Doug Cutting I just committed this. Thanks, Rod. Generate to log truncation caused

Re: mapred.map.tasks

2006-04-20 Thread Doug Cutting
Anton Potehin wrote: We have a question on this property. Is it really preferred to set this parameter several times greater than number of available hosts? We do not understand why it should be so? It should be at least numHosts*mapred.tasktracker.tasks.maximum, so that all of the task
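The lower bound given in this reply, numHosts*mapred.tasktracker.tasks.maximum, is simple arithmetic. The sketch below only restates that product; the class and method names are hypothetical, and reading the bound as one map task per available task slot is an assumption, since the preview is cut off before the reasoning is finished.

```java
// Hypothetical sketch of the lower bound quoted above for mapred.map.tasks.
final class MapTaskSizing {
    /**
     * Returns numHosts * tasksMaximumPerTracker, i.e. the smallest number of
     * map tasks that (by assumption) gives every tracker slot one task.
     */
    static int minimumMapTasks(int numHosts, int tasksMaximumPerTracker) {
        if (numHosts < 0 || tasksMaximumPerTracker < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        // multiplyExact fails loudly instead of silently overflowing.
        return Math.multiplyExact(numHosts, tasksMaximumPerTracker);
    }

    public static void main(String[] args) {
        // Example: 10 hosts with mapred.tasktracker.tasks.maximum set to 2.
        System.out.println(minimumMapTasks(10, 2));
    }
}
```

For 10 hosts with two tasks allowed per tracker, the bound is 20; setting mapred.map.tasks to several times that value is the practice the original question asks about.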

Re: jobtaraker and tasktracker

2006-04-19 Thread Doug Cutting
Anton Potehin wrote: Are there any ways to rotate these logs ? One way would be to configure the JVM to use a rolling FileHandler: file:///home/cutting/local/jdk1.5-docs/api/java/util/logging/FileHandler.html This should be possible by setting HADOOP_OPTS (in conf/hadoop-env.sh) and
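A rolling FileHandler of the kind mentioned here can be sketched programmatically as below. Note the substitution: the reply suggests configuring the JVM through HADOOP_OPTS, whereas this sketch builds the handler in code purely to show the rotation parameters. The file name pattern, logger name, size limit, and helper names are all illustrative assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.FileHandler;
import java.util.logging.Handler;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;
import java.util.stream.Stream;

// Hypothetical sketch of java.util.logging.FileHandler used as a rolling log.
final class RollingLogSketch {
    static Logger rollingLogger(Path dir, int limitBytes, int fileCount) throws IOException {
        // %g is replaced by the generation number of each rotated file.
        String pattern = dir.resolve("tasktracker-%g.log").toString();
        // Rotate once a file reaches roughly limitBytes; keep at most fileCount files.
        FileHandler handler = new FileHandler(pattern, limitBytes, fileCount, true);
        handler.setFormatter(new SimpleFormatter());
        Logger logger = Logger.getLogger("rolling.sketch");
        logger.setUseParentHandlers(false); // keep the demo output off the console
        logger.addHandler(handler);
        return logger;
    }

    // Writes some messages to a temporary directory and counts the log files left behind.
    static int writeAndCountFiles(int messages) {
        try {
            Path dir = Files.createTempDirectory("rolling-log-sketch");
            Logger logger = rollingLogger(dir, 1024, 3);
            for (int i = 0; i < messages; i++) {
                logger.info("heartbeat " + i);
            }
            for (Handler h : logger.getHandlers()) {
                h.close(); // flushes output and releases the lock file
                logger.removeHandler(h);
            }
            try (Stream<Path> files = Files.list(dir)) {
                return (int) files
                        .filter(p -> p.getFileName().toString().endsWith(".log"))
                        .count();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeAndCountFiles(200));
    }
}
```

However many messages are written, the handler never keeps more than the configured number of files, which is the property that bounds the disk space used by long-running daemons.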

Re: question about crawldb

2006-04-18 Thread Doug Cutting
Anton Potehin wrote: 1. We have found these flags in CrawlDatum class: public static final byte STATUS_SIGNATURE = 0; public static final byte STATUS_DB_UNFETCHED = 1; public static final byte STATUS_DB_FETCHED = 2; public static final byte STATUS_DB_GONE = 3; public static final

Re: Duplicate Detection: Offlince vs. Search Time

2006-04-17 Thread Doug Cutting
Shailesh Kochhar wrote: If I understand this correctly, you can only dedup by one field. This would mean that if you were to implement and use content-based deduplication, you'd have to give up limiting the number of hits per host. Is this correct, or did I miss something? That's correct.

Re: svn commit: r394228 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/plugin/ src/plugin/ src/plugin/analysis-de/ src/plugin/analysis-fr/ src/plugin/clustering-carrot2/ src/plugin/creativecom

2006-04-17 Thread Doug Cutting
[EMAIL PROTECTED] wrote: + <!-- Copy the plugin.dtd file to the plugin doc-files dir --> + <copy file="${plugins.dir}/plugin.dtd" + todir="${src.dir}/org/apache/nutch/plugin/doc-files/"/> The build should not make changes to the source tree. The source tree should be read-only to the

[jira] Commented: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployement

2006-04-12 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-246?page=comments#action_12374272 ] Doug Cutting commented on NUTCH-246: It seems like the Injector should be loading the current time from a job configuration property in the same way

Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-07 Thread Doug Cutting
Chris Mattmann wrote: +1 for a release sooner rather than later. I think this is a good plan. There's no reason we can't do another release in a month. If it is back-compatible we can call it 0.8.x and if it's incompatible we can call it 0.9.0. I'm going to make a Hadoop 0.1.1 release

Re: CrawlDbReducer - selecting data for DB update

2006-04-07 Thread Doug Cutting
Andrzej Bialecki wrote: This selection is primarily made in the while() loop in CrawlDbReducer:45. My main objection is that selecting the highest value (meaning most recent) relies on the fact that values of status codes in CrawlDatum are ordered according to their meaning, and they are

Re: PMD integration

2006-04-07 Thread Doug Cutting
Piotr Kosiorowski wrote: I will make it totally separate target (so test do not depend on it). That was actually Doug's idea (and I agree with it) to stop the build file if PMD complains about something. It's similar to testing -- if your tests fail, the entire build file fails. I totally

Re: web ui improvement

2006-04-07 Thread Doug Cutting
Sami Siren wrote: I know there are people who think that a plain xml interface is good enough for all but I would like to give this new architecture a try. I think this would be a great addition. The XML has a lot of uses, but we should include a good native, extensible, skinnable search UI.

0.8 release schedule (was Re: latest build throws error - critical)

2006-04-06 Thread Doug Cutting
TDLN wrote: I mean, how do others keep uptodate with the main codeline? Do you advice updating everyday? Should we make a 0.8.0 release soon? What features are still missing that we'd like to get into this release? Doug

Re: Search quality evaluation

2006-04-05 Thread Doug Cutting
FYI, Mike wrote some evaluation stuff for Nutch a long time ago. I found it in the Sourceforge Attic: http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/quality/Attic/ This worked by querying a set of search engines, those in:

Re: Add .settings to svn:ignore on root Nutch folder?

2006-04-05 Thread Doug Cutting
Other options (raised on the Hadoop list) are Checkstyle: http://checkstyle.sourceforge.net/ and FindBugs: http://findbugs.sourceforge.net/ Although these are both under LGPL and thus harder to include in Apache projects. Anything that generates a lot of false positives is bad: it either

[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372979 ] Doug Cutting commented on NUTCH-240: +1 for committing Generator.patch.txt now. 0 for committing the rest until I've had more time to think about it. I'm not against

[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372981 ] Doug Cutting commented on NUTCH-240: Also, note that we can now extend Hadoop's new MapReduceBase to implement configure() and close() for many Mappers and Reducers

Re: Refactoring some plugins

2006-03-31 Thread Doug Cutting
Jérôme Charron wrote: One more question about javadoc (I hope the last one): Do you think it makes sense to split the plugins gathered into the Misc group into many plugins (such as index-more / query-more), so that each sub-plugin can be dispatched into proper Group. No, I don't think so.
