Re: [Nutch-dev] Plans on releasing another bug fix release?

2007-07-03 Thread Doug Cutting
Will the next release really be 1.0 or will it be 0.10? Doug Briggs wrote: > I was just curious to know if there were any plans to release a > maintenance/bug-fix release before 1.0. I know there have been a slew > of patches and such (it's almost impossible to keep up, unless someone > has a su

Re: [Nutch-dev] JIRA email question

2007-06-27 Thread Doug Cutting
The problem is that nutch-dev (like most Apache mailing lists) sets the "Reply-to" header to be itself, so that responses don't go back to the sender. If you override this when responding (changing the "To:" line) and respond to the sender, then it should end up as a comment, which will be the

[Nutch-dev] [jira] Commented: (NUTCH-479) Support for OR queries

2007-06-22 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507473 ] Doug Cutting commented on NUTCH-479: Neither. It would end up as the Lucene query: +"search p

[Nutch-dev] [Fwd: Nutch 0.9 and Crawl-Delay]

2007-06-04 Thread Doug Cutting
Does the 0.9 crawl-delay implementation actually permit multiple threads to access a site simultaneously? Doug Original Message Subject: Nutch 0.9 and Crawl-Delay Date: Sun, 3 Jun 2007 10:50:24 +0200 From: Lutz Zetzsche <[EMAIL PROTECTED]> Reply-To: [EMAIL PROTECTED] To: [EMAIL

[Nutch-dev] [jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-01 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500822 ] Doug Cutting commented on NUTCH-392: Anchors, explain, and the cache are used relatively infrequently

Re: [Nutch-dev] proposal for committer

2007-05-29 Thread Doug Cutting
Personnel discussions are conducted on the PMC's private mailing list. I have forwarded your message there. Thanks for the suggestion! Doug Gal Nitzan wrote: > Hi, > > Since I'm no committer I can't really "propose" :-) but I just thought to > draw > some attention to the great work done on

Re: [Nutch-dev] NUTCH-348 and Nutch-0.7.2

2007-05-24 Thread Doug Cutting
karthik085 wrote: > How do you find when a revision was released? Look at the tags in subversion: http://svn.apache.org/viewvc/lucene/nutch/tags/ Doug

Re: [Nutch-dev] ApacheCon in Amsterdam

2007-04-23 Thread Doug Cutting
Tom White wrote: > I will be there too. Unfortunately I won't be able to attend after all. The new baby in the house won't let me! Doug

Re: [Nutch-dev] Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-13 Thread Doug Cutting
Arun Kaundal wrote: > Actually nutch people are kind of autocrate., don't expect more from them > They do what they have decided Have you submitted patches that have been ignored or rejected? Each Nutch contributor indeed does what he or she decides. Nutch is not a service organization that

Re: [Nutch-dev] Image Search Engine Input

2007-03-29 Thread Doug Cutting
Steve Severance wrote: > I am not looking to really make an image retrieval engine. During indexing > referencing docs will be analyzed and text content will be associated with > the image. Currently I want to keep this in a separate index. So despite the > fact that images will be returned the

Re: [Nutch-dev] svn commit: r516643 - in /lucene/nutch/trunk/src/plugin/parse-html/src: java/org/apache/nutch/parse/html/DOMContentUtils.java test/org/apache/nutch/parse/html/TestDOMContentUtils.java

2007-03-20 Thread Doug Cutting
[EMAIL PROTECTED] wrote: [ ... ] > -/** > - * Licensed to the Apache Software Foundation (ASF) under one or more > - * contributor license agreements. See the NOTICE file distributed with [ ... ] > +/** > + * Licensed to the Apache Software Foundation (ASF) under one or more > + * contributor lice

Re: [Nutch-dev] ApacheCon in Amsterdam

2007-03-20 Thread Doug Cutting
I will probably be there. Doug Marc Boucher wrote: > I was wondering if anyone is going to ApacheCon > (http://www.eu.apachecon.com) > in May as they have a full day's workshop on Lucene and will other sessions > on Nutch, Hadoop and Solr? > > Marc Boucher

[Nutch-dev] [jira] Commented: (NUTCH-455) dedup on tokenized fields is faulty

2007-03-07 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478854 ] Doug Cutting commented on NUTCH-455: Alternately, we could define it as an error to attempt to dedup by a

Re: [Nutch-dev] FW: Nutch release process help

2007-03-06 Thread Doug Cutting
Chris Mattmann wrote: > It's too bad that > this has turned out to be an issue that I've handled incorrectly, and for > that, I apologize. Sorry if I blew this out of proportion. We all help each other run this project. I don't think any grave error was made. I just saw an opportunity to remi

Re: [Nutch-dev] Issues pending before 0.9 release

2007-03-06 Thread Doug Cutting
Sami Siren wrote: > It would be more beneficial to everybody if the discussions (related to > release or Nutch) is > done on public (hey this is open source!). The off the list stuff IMO > smells. +1 Folks sometimes wish to discuss project matters off-list to spare others the boring details, but

Re: [Nutch-dev] Nutch JSF front-end code submission - Please advice next steps?

2007-02-28 Thread Doug Cutting
Zaheed Haque wrote: > Its been about a month I been trying to find time to make the > necessary changes so that I could submit the code. Due to enormous > amount of work load I am unable to find the time. I am not sure how > should I proceed, I have personally try to contact some of you off > list.

[Nutch-dev] [jira] Commented: (NUTCH-445) Domain İndexing / Query Filter

2007-02-28 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476665 ] Doug Cutting commented on NUTCH-445: Setting the boost to non-zero permits a "site:" query with no o

[Nutch-dev] [jira] Commented: (NUTCH-445) Domain İndexing / Query Filter

2007-02-27 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476243 ] Doug Cutting commented on NUTCH-445: Note that the "site" field is also used for search-time deduplic

[Nutch-dev] nightly builds moved to hudson

2007-02-23 Thread Doug Cutting
Nutch's nightly builds have been moved to a Hudson server at: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ I've stopped the old nightly build process and added a redirect from the old nightly build distribution directory to this page. Thanks to Nigel Daley for configuring an

[Nutch-dev] [jira] Assigned: (NUTCH-449) Format of junit output should be configurable

2007-02-23 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting reassigned NUTCH-449: -- Assignee: Doug Cutting > Format of junit output should be configura

Re: [Nutch-dev] Performance optimization for Nutch index / query

2007-02-23 Thread Doug Cutting
Andrzej Bialecki wrote: > The degree of simplification is very substantial. Our NutchSuperQuery > doesn't have to do much more work than a simple TermQuery, so we can > assume that the cost to run it is the same as TermQuery times some > constant. What we gain then is the cost of not running all

[Nutch-dev] log guards

2007-02-13 Thread Doug Cutting
Doug Cutting (JIRA) wrote: >> this patch in some places removes the log guards > > Most of the log guards are misguided. Log guards should only be used on > DEBUG level messages in performance-critical inner loops. Since INFO is the > expected log level, a guard on INFO &
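
A minimal sketch of the trade-off described in this message. It uses java.util.logging instead of the Commons Logging API that Nutch uses, purely so the example is self-contained; FINE stands in for DEBUG, and the class and method names are hypothetical. The guard only pays off when building the message is costly and the level is normally disabled, which is the inner-loop DEBUG case.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical demo class; not part of Nutch or Hadoop.
class LogGuardSketch {
    static final Logger LOG = Logger.getLogger("LogGuardSketch");

    static {
        // Assumption: INFO is the configured level, so FINE (debug) is off.
        LOG.setLevel(Level.INFO);
    }

    // Counts how often the log message string is actually built.
    static int messagesBuilt = 0;

    static String describe(int i) {
        messagesBuilt++;
        return "processing record " + i;
    }

    // Unguarded: the message is built on every iteration and then discarded,
    // because FINE output is disabled.
    static void unguardedLoop(int n) {
        for (int i = 0; i < n; i++) {
            LOG.fine(describe(i));
        }
    }

    // Guarded: the level check skips message construction entirely.
    static void guardedLoop(int n) {
        for (int i = 0; i < n; i++) {
            if (LOG.isLoggable(Level.FINE)) {
                LOG.fine(describe(i));
            }
        }
    }

    public static void main(String[] args) {
        unguardedLoop(1000);
        System.out.println("unguarded built: " + messagesBuilt);
        messagesBuilt = 0;
        guardedLoop(1000);
        System.out.println("guarded built: " + messagesBuilt);
    }
}
```

For a message at the expected level (INFO here), the guard would save nothing, since the string is needed anyway.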

[Nutch-dev] [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-13 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472821 ] Doug Cutting commented on NUTCH-443: > this patch in some places removes the log guards Most of the log gua

Re: [Nutch-dev] RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Doug Cutting
Chris Mattmann wrote: > Got it. So, the logic behind this is, why bother waiting until the > following fetch to parse (and create ParseData objects from) the RSS items > out of the feed. Okay, I get it, assuming that the RSS feed has *all* of the > RSS metadata in it. However, it's perfectly accep

Re: [Nutch-dev] RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Doug Cutting
Chris Mattmann wrote: > Sorry to be so thick-headed, but could someone explain to me in really > simple language what this change is requesting that is different from the > current Nutch API? I still don't get it, sorry... A Content would no longer generate a single Parse. Instead, a Content co

Re: [Nutch-dev] RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Doug Cutting
Renaud Richardet wrote: > I see. I was thinking that I could index the feed items without having > to fetch them individually. Okay, so if Parser#parse returned a Map, then the URL for each parse should be that of its link, since you don't want to fetch that separately. Right? So now the ques
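
The idea discussed in this message (one fetched feed yielding several parses, each keyed by its item's link URL) could look roughly like the sketch below. Every type and method name here is hypothetical and stands in for whatever the real Nutch API would use; it only illustrates the shape of a parser that returns a map instead of a single parse.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; these are not Nutch's Parse or Parser classes.
class MultiParseSketch {

    // Stand-in for a parsed document.
    static final class Parse {
        final String title;
        final String text;

        Parse(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    // Stand-in for one item already extracted from an RSS feed.
    static final class FeedItem {
        final String link;
        final String title;
        final String description;

        FeedItem(String link, String title, String description) {
            this.link = link;
            this.title = title;
            this.description = description;
        }
    }

    // Returns one Parse per feed item, keyed by the item's link URL,
    // so each item can be indexed without fetching that link separately.
    static Map<String, Parse> parseFeed(List<FeedItem> items) {
        Map<String, Parse> parses = new LinkedHashMap<>();
        for (FeedItem item : items) {
            parses.put(item.link, new Parse(item.title, item.description));
        }
        return parses;
    }
}
```

Keying by link URL means two items that share a link would collapse into one entry; whether that is desirable is one of the open questions in the thread.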

Re: [Nutch-dev] RSS-fecter and index individul-how can i realize this function

2007-02-06 Thread Doug Cutting
Renaud Richardet wrote: > The usecase is that you index RSS-feeds, but your users can search each > feed-entry as a single document. Does it makes sense? But each feed item also contains a link whose content will be indexed and that's generally a superset of the item. So should there be two ur

Re: [Nutch-dev] RSS-fecter and index individul-how can i realize this function

2007-02-06 Thread Doug Cutting
Doğacan Güney wrote: > OK, then should I go forward with this and implement something? This > should be pretty easy, > though I am not sure what to give as keys to a Parse[]. > > I mean, when getParse returned a single Parse, ParseSegment output them > as . But, if getParse > returns an array, w

Re: [Nutch-dev] RSS-fecter and index individul-how can i realize this function

2007-02-05 Thread Doug Cutting
Doğacan Güney wrote: > I think it would make much more sense to change parse plugins to take > content and return Parse[] instead of Parse. You're right. That does make more sense. Doug

Re: [Nutch-dev] RSS-fecter and index individul-how can i realize this function

2007-02-02 Thread Doug Cutting
Gal Nitzan wrote: > IMHO the data that is needed i.e. the data that will be fetched in the next > fetch process is already available in the element. Each element > represents one web resource. And there is no reason to go to the server and > re-fetch that resource. Perhaps ProtocolOutput shou

Re: [Nutch-dev] Next Nutch release

2007-01-25 Thread Doug Cutting
Dennis Kubes wrote: > Andrzej Bialecki wrote: >> I believe that at this point it's crucial to keep the project >> well-focused (at the moment I think the main focus is on larger >> installations, and not the small ones), and also to make Nutch >> attractive to developers as a reusable "search en

Re: [Nutch-dev] [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2007-01-25 Thread Doug Cutting
Chris Mattmann wrote: > So, does this render the patch that I wrote obsolete? It's at least out-of-date and perhaps obsolete. A quick read of Fetcher.java looks like there might be a case where a "fatal" error is logged but the fetcher doesn't exit, in FetcherThread#output(). Doug

Re: [Nutch-dev] [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2007-01-25 Thread Doug Cutting
Scott Ganyo (JIRA) wrote: > ... since Hadoop hijacks and reassigns all log formatters (also a bad > practice!) in the org.apache.hadoop.util.LogFormatter static constructor ... FYI, Hadoop no longer does this. Doug

Re: [Nutch-dev] i18n in nutch home page is misnomor

2007-01-25 Thread Doug Cutting
Teruhiko Kurosaka wrote: > I suggest "i18n" be renamed to "l10n", short for > localization. Can you please file an issue in Jira for this? Ideally you could even provide a patch. The source for the website is in subversion at: http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/site Forres

Re: [Nutch-dev] Finished "How to Become a Nutch Developer"

2007-01-23 Thread Doug Cutting
[EMAIL PROTECTED] wrote: > Draft version of "How to Become a Nutch Developer" is on the wiki at: > > http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer > > Please take a look and if you think anything needs to be added, removed, > or changed let me know. Thanks for taking the time to write

Re: [Nutch-dev] How to Become a Nutch Developer

2007-01-22 Thread Doug Cutting
Dennis Kubes wrote: > Can you answer the question of how to add developer names to JIRA or if > that is only for committers? It's not just for committers, but also for regular contributors. I have added you. Anyone else? Doug

Re: [Nutch-dev] Reviving Nutch 0.7

2007-01-22 Thread Doug Cutting
[EMAIL PROTECTED] wrote: > Yes, certainly, anything that can be shared and decoupled from pieces that > make each branch (not SVN/CVS branch) different, should be decoupled. But I > was really curious about whether people think this is a valid idea/direction, > not necessarily immediately how t

Re: [Nutch-dev] How to Become a Nutch Developer

2007-01-22 Thread Doug Cutting
Andrzej Bialecki wrote: > The workflow is different - I'm not sure about the details, perhaps Doug > can correct me if I'm wrong ... and yes, it uses JIRA extensively. > > 1. An issue is created > 2. patches are added, removed commented, etc... > 3. finally, a candidate patch is selected, and the

Re: [Nutch-dev] Next Nutch release

2007-01-19 Thread Doug Cutting
Dennis Kubes wrote: > I will say that it is difficult for people to understand how to get more > involved. I have been working with Nutch and Hadoop for almost a year > now on a daily basis and only now am I understanding how to contribute > through jira, etc. There needs to be more guidance i

Re: [Nutch-dev] Next Nutch release

2007-01-19 Thread Doug Cutting
Stefan Groschupf wrote: > I don't want to start a emotional discussion here, however talking about > the problem in public might help. What, specifically, is the problem you perceive? Doug

Re: [Nutch-dev] Next Nutch release

2007-01-18 Thread Doug Cutting
Stefan Groschupf wrote: > We run the gui in several production environments with patched hadoop > code - since this is from our point of view the clean approach. > Everything else feels like a workaround to fix some strange hadoop > behaviors. Are there issues in Hadoop's Jira for these? If so

Re: [Nutch-dev] How can I get one plugin's root dir

2007-01-16 Thread Doug Cutting
Andrzej Bialecki wrote: > The reason is that if you pack this file into your job JAR, the job jar > would become very large (presumably this 40MB is already compressed?). > Job jar needs to be copied to each tasktracker for each task, so you > will experience performance hit just because of the

Re: [Nutch-dev] What's the status of Nutch-GUI?

2006-12-08 Thread Doug Cutting
Sami Siren wrote: > Stefan Groschupf wrote: >> See: >> http://www.find23.net/nutch_guiToHadoop.pdf >> Section required hadoop changes. > > I guess you refer to these: > > • LocalJobRunner: > • Run as kind of singleton > • Have a kind of jobQueue > • Implement JobSubmissionProtocol statu

Re: [Nutch-dev] Brochure for Nutch

2006-12-08 Thread Doug Cutting
The wiki would be a good place for this. Doug Peter Landolt wrote: > Hello, > > We tried to introduce Nutch at a telecommunication company in Switzerland > as search engine of their future main search solution. As they were also > proofing > commercial products we needed to offer them a brochur

Re: [Nutch-dev] email to jira comments (WAS Re: [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements)

2006-10-16 Thread Doug Cutting
Sami Siren wrote: > looks like somebody just enabled email-to-jira-comments-feature. I was > just wondering would it be good to use this feature more widely. I think it would be good. That way mailing list discussion would be logged to the bug as well. > This could be achieved by removing the

[Nutch-dev] [jira] Commented: (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2006-10-11 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12441552 ] Doug Cutting commented on NUTCH-385: > It would be one thing if whenever (fetcher.threads.per.host > 1), this > trumped the server delay [...] Are

[Nutch-dev] [jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439682 ] Doug Cutting commented on NUTCH-353: It's worth noting that Google, Yahoo! and Microsoft's searches all return lots of links to www-XXX.ibm.com.

[Nutch-dev] [jira] Resolved: (NUTCH-304) Change JIRA email address for nutch issues from apache incubator

2006-10-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-304?page=all ] Doug Cutting resolved NUTCH-304. Resolution: Fixed I just fixed this. Thanks for noticing! > Change JIRA email address for nutch issues from apache incuba

[Nutch-dev] [jira] Commented: (NUTCH-368) Message queueing system

2006-09-18 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-368?page=comments#action_12435539 ] Doug Cutting commented on NUTCH-368: How would you compare this to JMS? http://java.sun.com/j2ee/sdk_1.3/techdocs/api/javax/jms/package-summary.html Is it

Re: [Nutch-dev] Patch Available status?

2006-08-31 Thread Doug Cutting
Chris Mattmann wrote: > +1. I think that workflow makes a lot of sense. Currently users in the > nutch-developers group can close and resolve issues. In the Hadoop workflow, > would this continue to be the case? In Hadoop, most developers can resolve but not close. Only members of a separate J

Re: [Nutch-dev] Patch Available status?

2006-08-30 Thread Doug Cutting
Sami Siren wrote: > I am not able to do it either, or then I just don't know how, can Doug > help us here? This requires a change to the project's workflow. I'd be happy to move Nutch to use the workflow we use for Hadoop, which supports "Patch Available". This workflow has one other non-def

Re: [Nutch-dev] Error with Hadoop-0.4.0

2006-07-12 Thread Doug Cutting
Sami Siren wrote: > Patch works for me. OK. I just committed it. Thanks! Doug

Re: [Nutch-dev] Error with Hadoop-0.4.0

2006-07-10 Thread Doug Cutting
Jérôme Charron wrote: In my environment, the crawl command terminate with the following error: 2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273)) - Input directory /localpath/crawl/crawldb/current in local is invalid. Exception in thread "main" java.io.IOException: I

[Nutch-dev] [jira] Reopened: (NUTCH-309) Uses commons logging Code Guards

2006-07-07 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-309?page=all ] Doug Cutting reopened NUTCH-309: I am re-opening this issue, as the guards were added in far too many places. Jerome, can you please fix these so that guards are only added when (a) the log

Re: [Nutch-dev] 0.8 release

2006-07-05 Thread Doug Cutting
+1 Piotr Kosiorowski wrote: > +1. > P. > Andrzej Bialecki wrote: >> Sami Siren wrote: >>> How would folks feel about releasing 0.8 now, there has been quite a >>> lot of improvements/new features >>> since 0.7 series and I strongly feel that we should push the first >>> 0.8 series release (alfa/

[Nutch-dev] [jira] Resolved: (NUTCH-312) Fix for upcoming incompatibility with Hadoop-0.4

2006-06-28 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-312?page=all ] Doug Cutting resolved NUTCH-312: Fix Version: 0.8-dev Resolution: Fixed I just upgraded Nutch to Hadoop 0.4.0, incorporating this patch. Thanks, Milind! > Fix for upcom

[Nutch-dev] [jira] Commented: (NUTCH-303) logging improvements

2006-06-22 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-303?page=comments#action_12417346 ] Doug Cutting commented on NUTCH-303: Jerome: thanks very much for all of your great work improving Nutch's logging! > logging impr

Re: [Nutch-dev] svn commit: r416346 [1/3] - in /lucene/nutch/trunk/src: java/org/apache/nutch/analysis/ java/org/apache/nutch/clustering/ java/org/apache/nutch/crawl/ java/org/apache/nutch/fetcher/ ja

2006-06-22 Thread Doug Cutting
[EMAIL PROTECTED] wrote: > NUTCH-309 : Added logging code guards [ ... ] > + if (LOG.isWarnEnabled()) { > +LOG.warn("Line does not contain a field name: " + line); > + } [ ...] -1 I don't think guards should be added everywhere. They make the code bigger and provid

[Nutch-dev] IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-14 Thread Doug Cutting
http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.html ___ Nutch-developers mailing list Nutch-developers@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nutch-developers

Re: [Nutch-dev] Nutch logging questions

2006-06-09 Thread Doug Cutting
Jérôme Charron wrote: > For now, I have used the same log4 properties than hadoop (see > http://svn.apache.org/viewvc/lucene/hadoop/trunk/conf/log4j.properties?view=markup&pathrev=411254 > > > ) for the back-end, and > I was thinking to use the stdout for front-end. > What do you think about thi

Re: [Nutch-dev] svn commit: r411943 - in /lucene/nutch/trunk/lib: commons-logging-1.0.4.jar hadoop-0.2.1.jar hadoop-0.3.1.jar log4j-1.2.13.jar

2006-06-06 Thread Doug Cutting
Stefan Groschupf wrote: > As far I understand hadoop use commons logging. Should we switch to use > commons logging as well? +1 Doug

[Nutch-dev] [jira] Commented: (NUTCH-289) CrawlDatum should store IP address

2006-05-31 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12414114 ] Doug Cutting commented on NUTCH-289: It should be possible to partition by IP and limit fetchlists by IP. Resolving only in the fetcher is too late to implement these

[Nutch-dev] Re: Mailing List nutch-agent Reports of Bots Submitting Forms

2006-05-30 Thread Doug Cutting
Ken Krugler wrote: 2. Are the Nutch Devs replying to the emails sent to this list? I could understand if they are replying off-list, but to an outside observer such as myself it appears as though webmasters are not getting many replies to their inquiries. I can speak for myself only .. I'm

[Nutch-dev] [jira] Created: (NUTCH-289) CrawlDatum should store IP address

2006-05-26 Thread Doug Cutting (JIRA)
CrawlDatum should store IP address -- Key: NUTCH-289 URL: http://issues.apache.org/jira/browse/NUTCH-289 Project: Nutch Type: Bug Components: fetcher Versions: 0.8-dev Reporter: Doug Cutting If the CrawlDatum stored

[Nutch-dev] [jira] Commented: (NUTCH-273) When a page is redirected, the original url is NOT updated.

2006-05-26 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-273?page=comments#action_12413528 ] Doug Cutting commented on NUTCH-273: Redirects should really not be followed immediately anyway. We should instead note that it was redirected and to which URL in the

[Nutch-dev] [jira] Commented: (NUTCH-288) hitsPerSite-functionality "flawed": problems writing a page-navigation

2006-05-25 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413305 ] Doug Cutting commented on NUTCH-288: > Is there a quickfix possible somehow? Someone needs to fix the OpenSearch servlet. It looks like just changing line 146

[Nutch-dev] [jira] Commented: (NUTCH-288) hitsPerSite-functionality "flawed": problems writing a page-navigation

2006-05-25 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413272 ] Doug Cutting commented on NUTCH-288: > Is there a performant way of doing deduplication and knowing for sure how > many documents are available to view? No.

[Nutch-dev] [jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-05-22 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412846 ] Doug Cutting commented on NUTCH-272: In 0.8, urls are filtered both when generating and when updating the DB. Strictly speaking, they're only required when updatin

[Nutch-dev] [jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-05-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412605 ] Doug Cutting commented on NUTCH-272: Does the existing generate.max.per.host parameter not meet this need? > Max. pages to crawl/fetch per site (emergency li

[Nutch-dev] Re: Following tags

2006-05-19 Thread Doug Cutting
Andrzej Bialecki wrote: I read through your email exchange, and setting aside all emotional content I think this is a valid request - indeed, as far as I can tell other major crawlers don't follow these links. We could either remove this, or make it optional (default not to use them). Is this

[Nutch-dev] [jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-11 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12379116 ] Doug Cutting commented on NUTCH-267: re: it's as if we didn't want it to be re-crawled if we can't find any inlinks to it We prioritize crawling based o

[Nutch-dev] Re: Interleaved (parallel) fetch cycles

2006-05-11 Thread Doug Cutting
Andrzej Bialecki wrote: I'm planning to work on adding support in 0.8 for interleaved fetch cycles. Great! Then, when running an updatedb, the issue of scores and metadata comes into question. We can imagine now that there were some other updatedb-s run in the meantime, not necessarily with

[Nutch-dev] Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-10 Thread Doug Cutting
Jérôme Charron wrote: Yes Doug, but in fact, the idea is to add the toString(Formatter) method in a common place (Summary). And add one specific Formatter implementation for OpenSearch and another one for search.jsp : The reason is that they should not use the same HTML code : 1. OpenSearch sho

[Nutch-dev] Re: dfs -report

2006-05-10 Thread Doug Cutting
This is a known, fixed, Hadoop bug: http://issues.apache.org/jira/browse/HADOOP-201 I'm going to release Hadoop 0.2.1 with this and one other patch as soon as Subversion is back up, then upgrade Nutch to use 0.2.1. Doug Marko Bauhardt wrote: Hi all, i start nutch-0.8-dev (Revision 405738)

[Nutch-dev] Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-10 Thread Doug Cutting
Sami Siren wrote: Also a friendly hint to all plugin hackers, you need to enable summary-basic in your existing nutch-site.xml to get things working. Took me some time to realize this fact :) Sounds like we should enable it by default, no? Doug

[Nutch-dev] Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-10 Thread Doug Cutting
Jérôme Charron wrote: This means there's no markup in the OpenSearch output? Yes, no markup for now. Doesn't this break any existing application that uses OpenSearch and displays summaries in a web browser? This is an incompatible change which we should avoid. Shouldn't there be? Th

[Nutch-dev] Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-09 Thread Doug Cutting
Thanks for making this change! A few comments: [EMAIL PROTECTED] wrote: == --- lucene/nutch/trunk/src/java/org/apache/nutch/searcher/OpenSearchServlet.java (original) +++ lucene/nutch/trunk/src/java/org/apache/nutch/

[Nutch-dev] [jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-09 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378765 ] Doug Cutting commented on NUTCH-267: Andrzej: your analysis is correct, but it mostly only applies when re-crawling. In an initial crawl, where each url is fetched only

[Nutch-dev] [jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-08 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378560 ] Doug Cutting commented on NUTCH-267: The OPIC score is much like a count of incoming links, but a bit more refined. OPIC(P) is one plus the sum of the OPIC contributions

[Nutch-dev] Re: generate.max.per.host is per reduce task

2006-05-08 Thread Doug Cutting
Chris Schneider wrote: I just noticed that the generate.max.per.host property is only enforced on a "per reduce task" basis during the first generate job (see Generator.Selector.reduce for details). At a minimum, it should probably be documented this way in nutch-default.xml.template. Yes, bu
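
To illustrate why a per-host cap enforced "per reduce task" is weaker than a global cap, here is a small simulation. It is a sketch under stated assumptions and does not reproduce Generator.Selector: the round-robin assignment of URLs to reducers is a stand-in for the real partitioner, and all names are hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation; not Nutch's Generator code.
class PerReducerLimitSketch {

    // Applies maxPerHost independently inside each simulated reduce task
    // and returns how many URLs are selected in total.
    static int countSelected(List<String> urls, int numReducers, int maxPerHost) {
        // One host -> count map per reducer: no reducer sees the others' counts.
        List<Map<String, Integer>> perReducerCounts = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            perReducerCounts.add(new HashMap<>());
        }

        int selected = 0;
        for (int i = 0; i < urls.size(); i++) {
            String host = URI.create(urls.get(i)).getHost();
            // Assumption: round-robin stands in for the real partitioning.
            Map<String, Integer> counts = perReducerCounts.get(i % numReducers);
            int seen = counts.getOrDefault(host, 0);
            if (seen < maxPerHost) {
                counts.put(host, seen + 1);
                selected++;
            }
        }
        return selected;
    }
}
```

With ten URLs from one host, a cap of 2, and three simulated reducers, each reducer admits 2 URLs, so 6 are selected in total rather than 2. With a single reducer the cap behaves as a global limit.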

[Nutch-dev] [jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

2006-05-08 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378458 ] Doug Cutting commented on NUTCH-134: +1 for Summary as Writable and change HitSummarizer.getSummary() to return a Summary directly rather than a String. I don't

[Nutch-dev] CommerceNet Events » Blog Archive » T 3 5/11: Stefan Groschupf on Extending Nutch

2006-05-05 Thread Doug Cutting
It seems Stefan is giving a talk... http://events.commerce.net/?p=58 Doug

[Nutch-dev] Re: svn commit: r399515 - /lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java

2006-05-05 Thread Doug Cutting
This sort of error will become much harder to make once we upgrade to Hadoop 0.2 and replace most uses of java.io.File with org.apache.hadoop.fs.Path. Doug [EMAIL PROTECTED] wrote: Author: ab Date: Wed May 3 19:42:02 2006 New Revision: 399515 URL: http://svn.apache.org/viewcvs?rev=399515&vi

[Nutch-dev] Re: Content-Type inconsistency?

2006-05-02 Thread Doug Cutting
Jérôme Charron wrote: We had to turn off the guessing of content types to index Apache correctly. Instead of turning off the guessing of content types you should only to remove the magic for xml in mime-types.xml Perhaps that would have worked also, but, with Apache, simply trusting the decl

[Nutch-dev] Re: mapred question

2006-05-02 Thread Doug Cutting
[EMAIL PROTECTED] wrote: As far as we understood from MapRed documentation all reduce tasks must be launched after last map task is finished e.g map and reduce must not work simultaneously. But often in logs we see such records: "map 80%, reduce 10%" and many more records where map is less than 1

[Nutch-dev] [jira] Commented: (NUTCH-256) Cannot open filename ....index.done.crc

2006-04-28 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-256?page=comments#action_12376993 ] Doug Cutting commented on NUTCH-256: I think this is really a bug in Hadoop's FileSystem.createNewFile() method. I've just fixed that. Does that work for y

[Nutch-dev] [jira] Resolved: (NUTCH-256) Cannot open filename ....index.done.crc

2006-04-28 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-256?page=all ] Doug Cutting resolved NUTCH-256: Resolution: Fixed Assign To: Doug Cutting This is fixed in Hadoop 0.2. > Cannot open filename index.done.

[Nutch-dev] [jira] Commented: (NUTCH-257) Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field

2006-04-28 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-257?page=comments#action_12376989 ] Doug Cutting commented on NUTCH-257: I'd vote to never have Summary#toString() perform entity encoding, to fix search.jsp to encode things itself, and *not* to add

[Nutch-dev] [jira] Commented: (NUTCH-256) Cannot open filename ....index.done.crc

2006-04-27 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-256?page=comments#action_12376839 ] Doug Cutting commented on NUTCH-256: That's not a fatal exception, right? Everything still works? It should. This is just the DFS version of FileNotFound, whi

[Nutch-dev] Re: Content-Type inconsistency?

2006-04-27 Thread Doug Cutting
Jérôme Charron wrote: Finaly it is a good news that Nutch seems to be more "intelligent" on content-type guessing than Firefox or IE, no? I'm not so sure. When crawling Apache we had trouble with this feature. Some HTML files that had an XML header and the server identified as "text/html" N

[Nutch-dev] Re: TRUNK IllegalArgumentException: Argument is not an array (WAS: Re: exception)

2006-04-27 Thread Doug Cutting
at java.lang.reflect.Array.getLength(Native Method) at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:92) at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:250) -Original Message-----

[Nutch-dev] Re: exception

2006-04-27 Thread Doug Cutting
[EMAIL PROTECTED] wrote: We updated hadoop from trunk branch. But now we get new errors: Oops. Looks like I introduced a bug yesterday. Let me fix it... Sorry, Doug

[Nutch-dev] Re: exception

2006-04-26 Thread Doug Cutting
This is a Hadoop DFS error. It could mean that you don't have any datanodes running, or that all your datanodes are full. Or, it could be a bug in dfs. You might try a recent nightly build of Hadoop to see if it works any better. Doug Anton Potehin wrote: What means error of following typ

[Nutch-dev] Re: CrawlDatum.metaData should never be null

2006-04-25 Thread Doug Cutting
Andrzej Bialecki wrote: > Hmm.. I understand his point. But it means that I have to always put "if (datum.getMetaData() == null)" check, which pollutes the code in all places that deal with metadata. Currently this is just CrawlDbReducer (but it already looks ugly there), but it will be like t

[Nutch-dev] Re: [Proposal] New Lucene sub-project

2006-04-24 Thread Doug Cutting
Jérôme Charron wrote: we think it would be a good idea to split Nutch into a new sub-project based on content analysis manipulation. The components we have identified are : 1. MimeType Repository 2. Language Identifier 3. Content Signature (MD5Signature / TextProfileSignature / ...) (4. Generic

[Nutch-dev] Re: nutch user meeting in San Francisco: May 18th

2006-04-20 Thread Doug Cutting
Folks can say whether they'll attend at: http://www.evite.com/app/publicUrl/[EMAIL PROTECTED]/nutch-1 Doug

[Nutch-dev] Re: mapred.map.tasks

2006-04-20 Thread Doug Cutting
One more thing. This parameter should be set in mapred-default.xml, not hadoop-site.xml or nutch-site.xml. Parameters in those latter files cannot be overridden by application settings, and mapred.map.tasks is sometimes overridden. Doug
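
A toy model of the precedence described in this message may help. It is only a sketch, not Hadoop's configuration code: the function name `effective_value` and the plain dictionaries are hypothetical, and the only behaviour taken from the message is that application settings can override mapred-default.xml but not hadoop-site.xml or nutch-site.xml.

```python
def effective_value(key, default_file, app_settings, site_file):
    """Toy precedence model (hypothetical, not Hadoop's implementation).

    Layers are applied in order, so a later layer replaces an earlier one:
    defaults first, then application settings, then the site file last.
    """
    merged = {}
    merged.update(default_file)  # stands in for mapred-default.xml
    merged.update(app_settings)  # values set by the application
    merged.update(site_file)     # stands in for hadoop-site.xml / nutch-site.xml
    return merged.get(key)


# Set only in the defaults file: the application can still override it.
from_defaults = effective_value(
    "mapred.map.tasks",
    {"mapred.map.tasks": "20"},
    {"mapred.map.tasks": "50"},
    {},
)

# Set in the site file: the application's value is ignored.
from_site = effective_value(
    "mapred.map.tasks",
    {},
    {"mapred.map.tasks": "50"},
    {"mapred.map.tasks": "20"},
)
```

In this model `from_defaults` takes the application's value while `from_site` keeps the site file's value, which is why a parameter that applications sometimes override belongs in the defaults file.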

[Nutch-dev] Re: mapred.map.tasks

2006-04-20 Thread Doug Cutting
Anton Potehin wrote: We have a question on this property. Is it really preferred to set this parameter several times greater than number of available hosts? We do not understand why it should be so? It should be at least numHosts*mapred.tasktracker.tasks.maximum, so that all of the task slots
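
The lower bound given in this reply is simple arithmetic, sketched below. The function name `min_map_tasks` and the example cluster size are made up for illustration; only the formula numHosts*mapred.tasktracker.tasks.maximum comes from the message.

```python
def min_map_tasks(num_hosts, tasks_maximum):
    """Lower bound on mapred.map.tasks from the message:
    numHosts * mapred.tasktracker.tasks.maximum.
    """
    if num_hosts < 1 or tasks_maximum < 1:
        raise ValueError("both arguments must be positive")
    return num_hosts * tasks_maximum


# Hypothetical cluster: 10 hosts, each allowed 2 simultaneous tasks.
lower_bound = min_map_tasks(10, 2)
```

For this assumed cluster the bound is 20 map tasks, one per available task slot.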

[Nutch-dev] [jira] Commented: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

2006-04-20 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12375421 ] Doug Cutting commented on NUTCH-173: +1, with a few modifications. Can you please re-generate this against the current sources? This patch does not apply for me. Also

[Nutch-dev] [jira] Resolved: (NUTCH-250) Generate to log truncation caused by generate.max.per.host

2006-04-20 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-250?page=all ] Doug Cutting resolved NUTCH-250: Fix Version: 0.8-dev Resolution: Fixed Assign To: Doug Cutting I just committed this. Thanks, Rod. > Generate to log truncation caused
