[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

2006-03-30 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372556 ] Doug Cutting commented on NUTCH-171: Ideally we could overlap segment2 map with segment1 reduce to keep bandwidth usage constant. Overlapping map2 with reduce1 should

[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-03-30 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372574 ] Doug Cutting commented on NUTCH-240: First, I hope my critical remarks were not taken personally. I am thankful for this and all of your contributions. Initially, I did

[jira] Commented: (NUTCH-242) Add optional -urlFiltering to updatedb

2006-03-30 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-242?page=comments#action_12372581 ] Doug Cutting commented on NUTCH-242: Shouldn't you use the returned value of the filter? If so, then this should be done in a mapper, not in the reducer. Add optional

[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

2006-03-30 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372597 ] Doug Cutting commented on NUTCH-171: Generate for 20 Segments of 10M in size is almost as fast as 1 segment that is 10M in size. A single 200M URL segment is unwieldy

[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-03-29 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372341 ] Doug Cutting commented on NUTCH-240: The generator store/restore score stuff seems ugly. And it is not used by OPIC. Could we instead have a method that computes

[jira] Commented: (NUTCH-235) Duplicate Inlink values

2006-03-20 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-235?page=comments#action_12371122 ] Doug Cutting commented on NUTCH-235: I'm concerned about all of the contains() calls this adds to an ArrayList. This is a linear scan, and makes the cost of building
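
The concern above is that each ArrayList.contains() call scans the whole list, so checking every new entry against the list makes building it quadratic. A minimal sketch of one common alternative, assuming inlinks can be compared by a string key, is to deduplicate through a hash-based set. The class and method names below are hypothetical and are not taken from the Nutch patch under discussion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the Nutch Inlinks class.
final class InlinkDedup {

    // Removes duplicates while keeping first-seen order.
    // Each insertion into the set is an expected O(1) hash lookup,
    // instead of the O(n) scan that ArrayList.contains() performs.
    static List<String> dedup(List<String> inlinks) {
        Set<String> seen = new LinkedHashSet<String>(inlinks);
        return new ArrayList<String>(seen);
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
            "http://a.example/", "http://b.example/", "http://a.example/");
        System.out.println(dedup(raw));
    }
}
```

Whether this is worthwhile depends on how many inlinks a page typically has; for very short lists the linear scan may be just as fast.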

[jira] Commented: (NUTCH-235) Duplicate Inlink values

2006-03-20 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-235?page=comments#action_12371142 ] Doug Cutting commented on NUTCH-235: The iterator shouldn't be a problem. When we're indexing we also dedup them by domain, which is much more expensive than creating

[jira] Commented: (NUTCH-235) Duplicate Inlink values

2006-03-20 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-235?page=comments#action_12371147 ] Doug Cutting commented on NUTCH-235: +1 This looks good. It will be a little slower for simple crawls, where each link is only processed once, but probably not noticeably

Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Doug Cutting
Jérôme Charron wrote: So, two solutions: 1. Keep java regexp ... 2. Switch to automaton and provide a java implementation of this regexp (it is more a protection pattern than really a filter pattern, and it could probably be hard-coded). If it were easy to implement all java regex features in

Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Doug Cutting
Stefan Groschupf wrote: Instead I would suggest going a step forward by adding a (configurable) timeout mechanism and skipping bad records in reducing in general. Processing such big data and losing all data just because of one bad record is very sad. That's a good suggestion. Ideally we could use

[jira] Commented: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-03-14 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370381 ] Doug Cutting commented on NUTCH-230: Andrzej, that's true if we think links that are filtered are bad links, but if we instead think of them as non-links then this fix

Re: OPIC score calculation issues

2006-03-14 Thread Doug Cutting
Andrzej Bialecki wrote: When we used WebDB it was possible to overlap generate / fetch / update cycles, because we would lock pages selected by FetchListTool for a period of time. Now we don't do this. The advantage is that we don't have to rewrite CrawlDB. But operations on CrawlDB are

Re: AnalyzerFactory

2006-03-13 Thread Doug Cutting
Jérôme Charron wrote: It seems that the usage of AnalyzerFactory was removed while porting Indexer to map/reduce. (AnalyzerFactory is no longer called in trunk code) Is it intentional? (if not, I have a patch that I can commit, so please confirm) It was not intentional. Thanks for fixing

Re: Proposal for Avoiding Content Generation Sites

2006-03-09 Thread Doug Cutting
Rod Taylor wrote: First is to allow for cleaning up. This consists of a new option to updatedb which can scrub the database of all URLs which no longer match URLFilter settings (regex-urlfilter.txt). This allows a change in the urlfilter to be reflected against Nutch's current dataset,

Re: Nutch 0.7.2

2006-03-09 Thread Doug Cutting
Piotr Kosiorowski wrote: I found an email from Doug with title [Fwd: Crawler submits forms?] stating: This has been fixed in the mapred branch, but that patch is not in 0.7.1. This alone might be a reason to make a 0.7.2 release. I just want to make sure it was fixed by svn commit: r348533

Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

2006-03-08 Thread Doug Cutting
[EMAIL PROTECTED] wrote: Don't generate URLs that don't pass URLFilters. Just to be clear, this is to support folks changing their filters while they're crawling, right? We already filter before we put things into the db, so we're filtering twice now, no? If so, then perhaps there should

Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

2006-03-08 Thread Doug Cutting
Andrzej Bialecki wrote: Stefan Groschupf wrote: I notice filtering urls is done in the output format until parsing. Wouldn't it be better to filter it until updating crawlDb? Until == during ? As you observed, doing it at this stage saves space in segment data, and in consequence saves on

Re: Nutch web site

2006-03-06 Thread Doug Cutting
Piotr Kosiorowski wrote: It looks like Nutch web site was updated with site built from latest trunk - the only problem is it contains tutorial for unreleased (yet) version 0.8. I think we talked about it and agreed to keep tutorial for latest release on the Web. I have just updated site in svn

Re: record termination and MapReduce

2006-03-06 Thread Doug Cutting
Toby DiPasquale wrote: I have a question about the MapReduce and NDFS implementations. When writing records into an NDFS file, how does one make sure that records terminate cleanly on block boundaries such that a Map job's input does not span multiple physical blocks? We do not currently

Re: svn commit: r381751 - in /lucene/nutch/trunk: site/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src/java

2006-03-03 Thread Doug Cutting
Jérôme Charron wrote: It seems that NUTCH-143 patch has been committed too... is it intentional? That was indeed a mistake. Thanks for catching it! I just reverted the unintentional changes. Thanks also to: http://svnbook.red-bean.com/en/1.0/ch04s04.html#svn-ch-4-sect-4.2 Doug

[jira] Commented: (NUTCH-221) prepare nutch for upcoming lucene 2.0

2006-03-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-221?page=comments#action_12368779 ] Doug Cutting commented on NUTCH-221: +1 Thanks! prepare nutch for upcoming lucene 2.0 - Key: NUTCH-221 URL

Re: svn commit: r378655 - in /lucene/nutch/trunk/src/plugin: ./ analysis-de/ analysis-fr/ clustering-carrot2/ creativecommons/ index-basic/ index-more/ languageidentifier/ lib-commons-httpclient/ lib-

2006-03-02 Thread Doug Cutting
[EMAIL PROTECTED] wrote: Modified: lucene/nutch/trunk/src/plugin/analysis-de/build.xml URL: http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/plugin/analysis-de/build.xml?rev=378655&r1=378654&r2=378655&view=diff == ---

Re: svn commit: r378655 - in /lucene/nutch/trunk/src/plugin: ./ analysis-de/ analysis-fr/ clustering-carrot2/ creativecommons/ index-basic/ index-more/ languageidentifier/ lib-commons-httpclient/ lib-

2006-03-02 Thread Doug Cutting
Jérôme Charron wrote: It just ensures that the last modified core version is automatically compiled while compiling a single plugin. From my point of view the time for a whole build is not a problem. If I just work on core, then I can use the fast compile-core target. And if I just work on a

[jira] Created: (NUTCH-218) need DOAP file for Nutch

2006-02-28 Thread Doug Cutting (JIRA)
need DOAP file for Nutch Key: NUTCH-218 URL: http://issues.apache.org/jira/browse/NUTCH-218 Project: Nutch Type: Task Reporter: Doug Cutting Can someone please draft a DOAP file for Nutch, so that we're listed at http

Re: OPIC score calculation issues

2006-02-28 Thread Doug Cutting
Andrzej Bialecki wrote: * CrawlDBReducer (used by CrawlDB.update()) collects all CrawlDatum-s from crawl_parse with the same URL, which means that we get: * the original CrawlDatum * (optionally a CrawlDatum that contains just a Signature) * all CrawlDatum.LINKED entries pointing to

Re: Release Planning

2006-02-28 Thread Doug Cutting
Nutch developer wrote: What is the estimated date for a stable version of 0.8? I'm hoping to have a stable release of Hadoop by April 15th. This should substantially stabilize Nutch. So a 0.8 release of Nutch should probably follow shortly thereafter. By the way: What are the criteria

[jira] Resolved: (NUTCH-216) cannot build in windows

2006-02-24 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-216?page=all ] Doug Cutting resolved NUTCH-216: Fix Version: 0.8-dev Resolution: Fixed The reason 'exec' was used was to also restore file permissions, which 'untar' does not. So I switched

Re: Unable to complete a full fetch, reason Child Error

2006-02-24 Thread Doug Cutting
Mike Smith wrote: 060219 142408 task_m_grycae Parent died. Exiting task_m_grycae This means the child process, executing the task, was unable to ping its parent process (the task tracker). 060219 142408 task_m_grycae Child Error java.io.IOException: Task process exit with nonzero status.

Re: still need jetty jars?

2006-02-23 Thread Doug Cutting
Stefan Groschupf wrote: do we still need the lib/jetty-ext/ jars? Since the jobtracker info server is now part of hadoop, someone may be able to delete them. They're included with Nutch so that folks don't have to separately download Hadoop. You should be able to simply download Nutch and run

Re: URL Partitioning (Lexical vs. IP Address)

2006-02-22 Thread Doug Cutting
Chris Schneider wrote: My experience recently seeing attempted fetches of many ingrida.be URLs made me question the Nutch 0.8 algorithm for partitioning URLs among TaskTrackers (and their children processes). As I understand it, Nutch doesn't worry about two lexically distinct domains (e.g.,

Re: Summarier threads in nutch

2006-02-22 Thread Doug Cutting
Jack Tang wrote: In FetchedSegments class, below code shows how to get the hit summaries. public String[] getSummary(HitDetails[] details, Query query) throws IOException { SummaryThread[] threads = new SummaryThread[details.length]; for (int i = 0; i < threads.length; i++) {

Re: duplicate libs

2006-02-16 Thread Doug Cutting
Jérôme Charron wrote: Finally, the more I look at the ant code for plugins the more I think we must redesign it. In the actual ant scripts, each plugin is an ant project, so there is no way to define ant dependencies between plugins. (= if you compile a plugin A that depends on another one (B),

Re: Global locking

2006-02-16 Thread Doug Cutting
Gal Nitzan wrote: I have implemented a down and dirty Global Locking: [ ... ] I changed FetcherThread constructor to create an instance of SyncManager. And also in the run method I try to get a lock on the host. If not successful I add the url into a ListArraykey,datum for a later

[jira] Resolved: (NUTCH-211) FetchedSegments leave readers open

2006-02-16 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-211?page=all ] Doug Cutting resolved NUTCH-211: Resolution: Fixed I committed this, with a bunch of whitespace fixes. FetchedSegments leave readers open

[jira] Commented: (NUTCH-211) FetchedSegments leave readers open

2006-02-15 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-211?page=comments#action_12366505 ] Doug Cutting commented on NUTCH-211: The interfaces that FetchedSegments implements should have a close method. Moreover, these interfaces should extend a Closeable

Re: All tasktrackers access same site at the same time (hadoop) please help

2006-02-15 Thread Doug Cutting
Andrzej Bialecki wrote: (FYI: if you wonder how it was working before, the trick was to generate just 1 split for the fetch job, which then lead to just one task being created for any input fetchlist. I don't think that's right. The generator uses setNumReduceTasks() to the desired number

Re: duplicate libs

2006-02-14 Thread Doug Cutting
Jérôme Charron wrote: Isn't it already reported in http://issues.apache.org/jira/browse/NUTCH-196? Yes, you're right. I have still provided a patch for a log4j lib. If there is no objection, I will commit it and go ahead for * lib-commons-httpclient * lib-nekohtml +1 Thanks! Doug

duplicate libs

2006-02-13 Thread Doug Cutting
There are a number of duplicated libs in the plugins, namely: commons-httpclient-3.0-beta1.jar src/plugin/parse-rss/lib commons-httpclient-3.0.jar src/plugin/protocol-httpclient/lib log4j-1.2.11.jar src/plugin/clustering-carrot2/lib log4j-1.2.6.jar 1

[jira] Resolved: (NUTCH-209) include nutch jar in mapred jobs

2006-02-09 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-209?page=all ] Doug Cutting resolved NUTCH-209: Resolution: Fixed I just committed this. Michael, the 'bin/hadoop jar' command is not (yet) used by Nutch. Please file a Hadoop bug to add the feature

[jira] Commented: (NUTCH-209) include nutch jar in mapred jobs

2006-02-09 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-209?page=comments#action_12365798 ] Doug Cutting commented on NUTCH-209: Andrzej, sorry, I didn't see your remark before I committed this! A DFSClassLoader would have problems with plugins, since our plugin

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-08 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365618 ] Doug Cutting commented on NUTCH-192: Since these mappings are not something that users should alter, I'm not sure they should be in the config file. I added related

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-02-08 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365619 ] Doug Cutting commented on NUTCH-139: +1 This looks great. Thanks for all the hard work on this one! Standard metadata property names in the ParseData metadata

[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

2006-02-08 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=all ] Doug Cutting updated NUTCH-192: --- Attachment: (was: metadata08_02_06.patch) meta data support for CrawlDatum Key: NUTCH-192 URL: http

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-08 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365643 ] Doug Cutting commented on NUTCH-192: +1 This looks good to me. Thanks for your persistence. meta data support for CrawlDatum

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-07 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365450 ] Doug Cutting commented on NUTCH-192: Sorry, I misspoke and overstated things too. There are problems, but not with MapWritable, rather with WritableName: this refers

[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-02-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12365087 ] Doug Cutting commented on NUTCH-193: The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-02-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365089 ] Doug Cutting commented on NUTCH-139: Jerome: yes, it makes sense, but there's also metadata that's not tightly related to the protocol or the parser, e.g., the nutch

[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-02-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12365130 ] Doug Cutting commented on NUTCH-193: Okay, I've moved the code from Nutch to Hadoop. Now I need to repair Nutch so that it still works! One remaining problem is the need

Re: svn commit: r374731 - in /lucene/nutch/trunk/src/web/jsp: anchors.jsp cached.jsp explain.jsp index.jsp search.jsp text.jsp

2006-02-03 Thread Doug Cutting
[EMAIL PROTECTED] wrote: URL: http://svn.apache.org/viewcvs?rev=374731&view=rev Log: removed unused imports Sami, I was in the middle of the process of fixing NUTCH-193 (moving things to the new Hadoop project) when you made this commit. I merged in those changes you made to things still in

[jira] Resolved: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-02-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-193?page=all ] Doug Cutting resolved NUTCH-193: Resolution: Fixed I just committed this. Phew! move NDFS and MapReduce to a separate project

Re: [jira] Resolved: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-02-03 Thread Doug Cutting
Doug Cutting (JIRA) wrote: [ http://issues.apache.org/jira/browse/NUTCH-193?page=all ] Doug Cutting resolved NUTCH-193: Resolution: Fixed I just committed this. Phew! The major incompatibility I introduced with this was changing the top

[jira] Resolved: (NUTCH-197) NullPointerException in TaskRunner if application jar does not have lib directory

2006-02-01 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-197?page=all ] Doug Cutting resolved NUTCH-197: Fix Version: 0.8-dev Resolution: Fixed I just committed this. Thanks, Owen! NullPointerException in TaskRunner if application jar does not have lib

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-01 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364923 ] Doug Cutting commented on NUTCH-192: I'm worried that this will substantially slow things. I'd like to see some effort made to ensure that: 1. If no metadata is used

[jira] Created: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Doug Cutting (JIRA)
move NDFS and MapReduce to a separate project - Key: NUTCH-193 URL: http://issues.apache.org/jira/browse/NUTCH-193 Project: Nutch Type: Task Components: ndfs Versions: 0.8-dev Reporter: Doug Cutting

[jira] Commented: (NUTCH-191) InputFormat used in job must be in JobTracker classpath (not loaded from job JAR)

2006-01-31 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-191?page=comments#action_12364678 ] Doug Cutting commented on NUTCH-191: We've thus far avoided loading job-specific code in the JobTracker and TaskTracker, in order to keep these more reliable. File

[Fwd: NutchCVS/0.8-dev]

2006-01-31 Thread Doug Cutting
FYI Original Message Subject: NutchCVS/0.8-dev Date: Mon, 30 Jan 2006 13:40:45 +0900 (JST) From: [EMAIL PROTECTED] Reply-To: nutch-agent@lucene.apache.org To: nutch-agent@lucene.apache.org Hi, I see that NutchCVS/0.8-dev is trying to crawl the firecat.nihonsoft.org website,

[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364690 ] Doug Cutting commented on NUTCH-193: Otis: yes, thanks, I meant org.apache.hadoop.dfs. Andrzej: I'm awaiting Mike's commit of NUTCH-183, which should happen today. I'll

Re: Lucene's VInt for lengths/counts/sizes

2006-01-31 Thread Doug Cutting
Andrzej Bialecki wrote: I wonder, would it be a good idea to replace the (rather wasteful) 4-byte ints with Lucene's variable-byte int encoding, in all places where size matters? I'm not sure there are that many places where it could make a big difference. * UTF8 (2-byte string length)
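
For readers unfamiliar with the encoding being proposed: a variable-byte int stores seven payload bits per byte, low-order bits first, and sets the high bit of each byte when more bytes follow, so small values take one byte instead of four. The sketch below is a standalone illustration of that scheme under those assumptions; the class and method names are hypothetical and it is not the Lucene or Nutch code.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical standalone illustration of variable-byte int encoding.
final class VIntSketch {

    // Encodes value as 1 to 5 bytes, low-order 7 bits first.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;                     // unsigned shift, so negatives terminate
        }
        out.write(value);                     // final byte has the high bit clear
        return out.toByteArray();
    }

    // Decodes a single value from the start of bytes.
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;                        // high bit clear: last byte
            }
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        for (int v : new int[] {0, 127, 128, 16384, Integer.MAX_VALUE}) {
            System.out.println(v + " -> " + encode(v).length + " byte(s)");
        }
    }
}
```

The saving applies only to small non-negative values: anything that needs more than 28 bits, including every negative int, takes five bytes, one more than the fixed-width form.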

Re: [Nutch-cvs] svn commit: r372810 - /lucene/nutch/trunk/bin/nutch

2006-01-27 Thread Doug Cutting
Andrzej Bialecki wrote: Namely? I didn't notice any ... I think it's better to avoid bash-isms, if we easily can. Not all the world looks like Linux. ;-) IFS, at least. I tried running this on Solaris, where /bin/sh is not bash, and it didn't work. It complained about unsetting IFS. Doug

Re: [Nutch-cvs] svn commit: r372810 - /lucene/nutch/trunk/bin/nutch

2006-01-27 Thread Doug Cutting
Andrzej Bialecki wrote: Right, Solaris /bin/sh doesn't allow that... Hmm. Does this IFS setting/unsetting work for you? I mean, I just tried it on Linux, using the real Bash. I put the nutch distrib in a path containing spaces, and I'm not able to run anything... I initially added it to make

Re: svn commit: r372810 - /lucene/nutch/trunk/bin/nutch

2006-01-27 Thread Doug Cutting
Rod Taylor wrote: Please don't do that. bash-2.05b$ ls /bin/bash ls: /bin/bash: No such file or directory bash-2.05b$ uname -a FreeBSD home 6.0-RELEASE FreeBSD 6.0-RELEASE #13: Sat Nov 5 00:19:49 EST 2005 [EMAIL

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-27 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Doug Cutting updated NUTCH-139: --- Attachment: (was: NUTCH-139.jc.review.patch.txt) Standard metadata property names in the ParseData metadata

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-27 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Doug Cutting updated NUTCH-139: --- Attachment: (was: NUTCH-139.Mattmann.patch.txt) Standard metadata property names in the ParseData metadata

Re: svn commit: r372810 - /lucene/nutch/trunk/bin/nutch

2006-01-27 Thread Doug Cutting
Andrzej Bialecki wrote: #!/usr/bin/env bash +1 This works on Solaris, Linux & cygwin. Does it work on FreeBSD? Doug

Re: older Nutch list archives (@sf.net)?

2006-01-27 Thread Doug Cutting
The Sourceforge archives are still there, just hard to find, e.g.: http://sourceforge.net/mailarchive/forum.php?forum=nutch-developers These lists are also archived at mail-archive.com: http://www.mail-archive.com/nutch-developers%40lists.sourceforge.net/ Doug Gordon Mohr (archive.org)

Re: older Nutch list archives (@sf.net)?

2006-01-27 Thread Doug Cutting
Gordon Mohr (archive.org) wrote: Doug Cutting wrote: The Sourceforge archives are still there, just hard to find, e.g.: http://sourceforge.net/mailarchive/forum.php?forum=nutch-developers When I visit that URL, I get: # Permission Denied # # Access to this page is restricted (either

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-26 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364125 ] Doug Cutting commented on NUTCH-139: I think we're near agreement here. Here are the changes I think this patch still needs: MetadataNames belongs in the protocol package

Re: need volunteer to develop search for apache.org

2006-01-26 Thread Doug Cutting
John X wrote: Please count me in. Thanks, John. I forgot to mention that I'd prefer a committer for this, and you're a committer, so that works well! Is there a timetable for it? No, whenever you can get to it. I'll make you an account and send you the details. Doug

Re: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-26 Thread Doug Cutting
Andrzej Bialecki wrote: Erhm.. please bear with me. I'd rather see these two classes in a separate package altogether, org.apache.nutch.metadata. The reason is that most likely these two classes will be used elsewhere too, not just in the protocol and parse/fetch related context. I'm

Re: Optimizing which links to fetch

2006-01-25 Thread Doug Cutting
Ken Krugler wrote: It seems that the default behavior of Nutch when sorting links to fetch is to use scoreByLinkCount. This then sets the fetch score for links on a page to be the same as the containing page's in-bound link score (or actually the log of same). Please also see:

Re: Ideas for enhancements

2006-01-25 Thread Doug Cutting
Howie Wang wrote: 1. A String[] HitDetails.getValues(String field) method that returns an array of the values. The current one only returns a single string, and Lucene indexes can have multiple values per field. That sounds useful. Please submit a patch against the trunk attached to a bug

Re: Searchable mailing lists on nutch.org?

2006-01-25 Thread Doug Cutting
Andy Liu wrote: We're getting a lot of repeat questions in the mailing lists these days. I think it's partly because people don't know of a way to search the archives. The Mail Archive provides this: http://www.mail-archive.com/index.php?hunt=nutch Whoever maintains the

need volunteer to develop search for apache.org

2006-01-25 Thread Doug Cutting
Would someone volunteer to develop a Nutch-based site-search engine for all apache.org domains? We now have a Solaris zone to host this. Thanks, Doug

[jira] Commented: (NUTCH-183) MapReduce has a series of problems concerning task-allocation to worker nodes

2006-01-21 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-183?page=comments#action_12363554 ] Doug Cutting commented on NUTCH-183: Byron, that's exactly what Mike means by speculative execution. MapReduce has a series of problems concerning task-allocation

[jira] Closed: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type

2006-01-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-179?page=all ] Doug Cutting closed NUTCH-179: -- Resolution: Invalid Closed at submitter's request. Proposition: Enable Nutch to use a parser plugin not just based on content type

[jira] Resolved: (NUTCH-177) Default installation seems to produce working entity of nutch

2006-01-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-177?page=all ] Doug Cutting resolved NUTCH-177: Fix Version: 0.8-dev Resolution: Fixed The problem is that your seed url does not end in a slash, yet your url filter requires a slash. In 0.8-dev

[jira] Resolved: (NUTCH-176) Using -dir: creates an error, when the directory already exists

2006-01-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-176?page=all ] Doug Cutting resolved NUTCH-176: Resolution: Won't Fix This check is intentionally made to prevent folks from accidentally overwriting crawls. Using -dir: creates an error, when

Re: Generating multiple fetchlists between updates

2006-01-19 Thread Doug Cutting
Andrzej Bialecki wrote: In the 0.7 branch, whenever a segment was generated the WebDB was modified, so that the entries that ended up in the fetchlist wouldn't be immediately available to the next segment generation, if that happened before the WebDB was updated with the data from that first

[jira] Commented: (NUTCH-136) mapreduce segment generator generates 50 % less than excepted urls

2006-01-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363308 ] Doug Cutting commented on NUTCH-136: The mapred-default.xml file is actually the best place to set these. mapreduce segment generator generates 50 % less than excepted

[jira] Commented: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

2006-01-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12363309 ] Doug Cutting commented on NUTCH-173: Couldn't you instead use a prefix-urlfilter generated from your crawl seed? PerHost Crawling Policy ( crawl.ignore.external.links

[jira] Resolved: (NUTCH-102) jobtracker does not start when webapps is in src

2006-01-18 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-102?page=all ] Doug Cutting resolved NUTCH-102: Resolution: Fixed I just applied this patch. Thanks, Owen. jobtracker does not start when webapps is in src

Re: NutchQuery adding non required Terms

2006-01-12 Thread Doug Cutting
Stefan Groschupf wrote: Did I miss something in general to be able to support non required terms in nutch? I left OR and nesting out of the API to simplify what query filters have to process. Nutch's query features are approximately what Google supported for its first three years. (Google

Re: Crawl and parse exceptions

2006-01-11 Thread Doug Cutting
Matt Zytaruk wrote: Exception in thread main java.io.IOException: Not a file: /user/nutch/segments/20060107130328/parse_data/part-0/data at org.apache.nutch.ipc.Client.call(Client.java:294) This is an error returned from an RPC call. There should be more details about this in a

[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

2006-01-11 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12362507 ] Doug Cutting commented on NUTCH-171: I'd like to hear more about why you want multiple segments, what's motivating this patch. The 0.7 -numFetchers parameter was designed

Re: Reporter interface

2006-01-10 Thread Doug Cutting
Andrew McNabb wrote: On Mon, Jan 09, 2006 at 05:00:00PM -0800, Doug Cutting wrote: To read sequence files directly outside of MapReduce, just use SequenceFile directly, e.g., something like: MyKey key = new MyKey(); MyValue value = new MyValue(); SequenceFile.Reader reader = new

Re: fetch of XXX failed with: java.lang.ClassCastException: java.util.ArrayList

2006-01-10 Thread Doug Cutting
Gal Nitzan wrote: I traced it to ParseData line 147. UTF8.writeString(out, (String) e.getKey()); UTF8.writeString(out, (String) e.getValue()); it seems that Set-Cookie key comes with a ArrayList value? I think that was fixed yesterday by Andrzej.

Re: svn commit: r367137 - in /lucene/nutch/trunk/src: java/org/apache/nutch/net/protocols/ plugin/ plugin/lib-http/ plugin/lib-http/src/ plugin/lib-http/src/java/ plugin/lib-http/src/java/org/ plugin/

2006-01-09 Thread Doug Cutting
[EMAIL PROTECTED] wrote: --- lucene/nutch/trunk/src/plugin/build.xml (original) +++ lucene/nutch/trunk/src/plugin/build.xml Sun Jan 8 16:13:42 2006 @@ -6,13 +6,14 @@ !-- Build deploy all the plugin jars.-- !-- == --

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-09 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362242 ] Doug Cutting commented on NUTCH-139: We can just use different names, rather than two metaData objects: X-nutch names for derived or other values that are usually protocol

Re: Reporter interface

2006-01-09 Thread Doug Cutting
Andrew McNabb wrote: I'm looking at the Reporter interface, and I would like to verify my understanding of what it is. It appears to me that Reporter.setStatus() is called periodically during an operation to give a human-readable description of how far the progress is so far. Is that correct?

Re: why index not in segment anymore

2006-01-09 Thread Doug Cutting
Stefan Groschupf wrote: in nutch 0.8 the index is not in the segment folder any more. What was the reason for that? in the context of a web gui it would maybe be better to have the index also in the segment folder, since the segment folder would be the single item to manage a life-cycle,

Re: Reporter interface

2006-01-09 Thread Doug Cutting
Andrew McNabb wrote: One of the great things about open source is that projects can be used for unintended purposes. In fact, Nutch works well for parallel computing in general, not just for web indexing. Apparently Google has thousands of projects that use MapReduce. The plan is to move

Re: Reporter interface

2006-01-09 Thread Doug Cutting
Andrew McNabb wrote: SequenceFileInputFormat inputformat = new SequenceFileInputFormat(); RecordReader in = inputformat.getRecordReader(fshandle, split[i], logjob, nullreporter); To read sequence files directly outside of MapReduce, just use SequenceFile directly, e.g., something like:

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Doug Cutting updated NUTCH-139: --- Comment: was deleted Standard metadata property names in the ParseData metadata -- Key: NUTCH

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Doug Cutting updated NUTCH-139: --- Comment: was deleted Standard metadata property names in the ParseData metadata -- Key: NUTCH

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Doug Cutting updated NUTCH-139: --- Comment: was deleted Standard metadata property names in the ParseData metadata -- Key: NUTCH

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361994 ] Doug Cutting commented on NUTCH-139: Jerome, Some HTTP headers have multiple values. Correctly reflecting that was, I thought, the primary motivation for adding multiple

Re: Normalizing URLs with anchors

2006-01-06 Thread Doug Cutting
Ken Krugler wrote: I'm wondering whether it would also make sense to remove anchor text from URLs. For example, currently these two URLs are treated as different: http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex and http://www.dina.kvl.dk/~sestoft/gcsharp/index.html Is it

Re: [bug?] PRC called emthod require parameter

2006-01-06 Thread Doug Cutting
Stefan Groschupf wrote: Different parameters are sent to each address. So params.length should equal addresses.length, and if params.length==0 then addresses.length==0 and there's no call to be made. Make sense? It might be clearer if the test were changed to addresses.length==0. Yes,

[jira] Commented: (NUTCH-160) Use standard Java Regex library rather than org.apache.oro.text.regex

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-160?page=comments#action_12361999 ] Doug Cutting commented on NUTCH-160: +1 I like this patch. I don't see a need for us to use oro anywhere, since Java now has good builtin regex support. And Java's

Re: Adaptive fetch interval unmodified content detection, episode II

2006-01-06 Thread Doug Cutting
Andrzej Bialecki wrote: For efficiency reasons, most of this information is stored and passed to processing jobs inside instances of CrawlDatum - for the key step of DB update any other parts of segments (such as Content, ParseData or ParseText) are not used, which prevents easy access to
