[jira] Commented: (NUTCH-479) Support for OR queries

2007-06-22 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507473 ] Doug Cutting commented on NUTCH-479: Neither. It would end up as the Lucene query: +search phrase

[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-01 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500822 ] Doug Cutting commented on NUTCH-392: Anchors, explain, and the cache are used relatively infrequently,

[jira] Commented: (NUTCH-455) dedup on tokenized fields is faulty

2007-03-07 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478854 ] Doug Cutting commented on NUTCH-455: Alternately, we could define it as an error to attempt to dedup by a

[jira] Commented: (NUTCH-445) Domain İndexing / Query Filter

2007-02-27 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476243 ] Doug Cutting commented on NUTCH-445: Note that the site field is also used for search-time deduplication, and

[jira] Assigned: (NUTCH-449) Format of junit output should be configurable

2007-02-23 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting reassigned NUTCH-449: -- Assignee: Doug Cutting Format of junit output should be configurable

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-13 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472821 ] Doug Cutting commented on NUTCH-443: this patch in some places removes the log guards Most of the log guards

[jira] Assigned: (NUTCH-392) OutputFormat implementations should pass on Progressable

2006-10-25 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-392?page=all ] Doug Cutting reassigned NUTCH-392: -- Assignee: Doug Cutting OutputFormat implementations should pass on Progressable

[jira] Updated: (NUTCH-392) OutputFormat implementations should pass on Progressable

2006-10-25 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-392?page=all ] Doug Cutting updated NUTCH-392: --- Attachment: NUTCH-392.patch OutputFormat implementations should pass on Progressable

[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2006-10-25 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-392?page=comments#action_12444719 ] Doug Cutting commented on NUTCH-392: This should not be applied until Nutch uses Hadoop 0.8. It also contains a patch required to make Nutch work correctly

[jira] Updated: (NUTCH-392) OutputFormat implementations should pass on Progressable

2006-10-25 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-392?page=all ] Doug Cutting updated NUTCH-392: --- Attachment: (was: NUTCH-392.patch) OutputFormat implementations should pass on Progressable

[jira] Updated: (NUTCH-392) OutputFormat implementations should pass on Progressable

2006-10-25 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-392?page=all ] Doug Cutting updated NUTCH-392: --- Attachment: NUTCH-392.patch Oops. Attached the wrong patch. Here's the right one. OutputFormat implementations should pass on Progressable

[jira] Resolved: (NUTCH-304) Change JIRA email address for nutch issues from apache incubator

2006-10-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-304?page=all ] Doug Cutting resolved NUTCH-304. Resolution: Fixed I just fixed this. Thanks for noticing! Change JIRA email address for nutch issues from apache incubator

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439682 ] Doug Cutting commented on NUTCH-353: It's worth noting that Google, Yahoo! and Microsoft's searches all return lots of links to www-XXX.ibm.com. Just some

[jira] Reopened: (NUTCH-309) Uses commons logging Code Guards

2006-07-07 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-309?page=all ] Doug Cutting reopened NUTCH-309: I am re-opening this issue, as the guards were added in far too many places. Jerome, can you please fix these so that guards are only added when (a) the log

[jira] Resolved: (NUTCH-312) Fix for upcoming incompatibility with Hadoop-0.4

2006-06-28 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-312?page=all ] Doug Cutting resolved NUTCH-312: Fix Version: 0.8-dev Resolution: Fixed I just upgraded Nutch to Hadoop 0.4.0, incorporating this patch. Thanks, Milind! Fix for upcoming

[jira] Commented: (NUTCH-289) CrawlDatum should store IP address

2006-05-31 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12414114 ] Doug Cutting commented on NUTCH-289: It should be possible to partition by IP and limit fetchlists by IP. Resolving only in the fetcher is too late to implement these

[jira] Commented: (NUTCH-273) When a page is redirected, the original url is NOT updated.

2006-05-26 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-273?page=comments#action_12413528 ] Doug Cutting commented on NUTCH-273: Redirects should really not be followed immediately anyway. We should instead note that it was redirected and to which URL in the

[jira] Created: (NUTCH-289) CrawlDatum should store IP address

2006-05-26 Thread Doug Cutting (JIRA)
CrawlDatum should store IP address -- Key: NUTCH-289 URL: http://issues.apache.org/jira/browse/NUTCH-289 Project: Nutch Type: Bug Components: fetcher Versions: 0.8-dev Reporter: Doug Cutting If the CrawlDatum stored

[jira] Commented: (NUTCH-288) hitsPerSite-functionality flawed: problems writing a page-navigation

2006-05-25 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413272 ] Doug Cutting commented on NUTCH-288: Is there a performant way of doing deduplication and knowing for sure how many documents are available to view? No. But we should

[jira] Commented: (NUTCH-288) hitsPerSite-functionality flawed: problems writing a page-navigation

2006-05-25 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413305 ] Doug Cutting commented on NUTCH-288: Is there a quickfix possible somehow? Someone needs to fix the OpenSearch servlet. It looks like just changing line 146 of

[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-11 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12379116 ] Doug Cutting commented on NUTCH-267: re: it's as if we didn't want it to be re-crawled if we can't find any inlinks to it We prioritize crawling based on the number of

[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-09 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378765 ] Doug Cutting commented on NUTCH-267: Andrzej: your analysis is correct, but it mostly only applies when re-crawling. In an initial crawl, where each url is fetched only

[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

2006-05-08 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378458 ] Doug Cutting commented on NUTCH-134: +1 for Summary as Writable and change HitSummarizer.getSummary() to return a Summary directly rather than a String. I don't think

[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-08 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378560 ] Doug Cutting commented on NUTCH-267: The OPIC score is much like a count of incoming links, but a bit more refined. OPIC(P) is one plus the sum of the OPIC contributions

[jira] Commented: (NUTCH-257) Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field

2006-04-28 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-257?page=comments#action_12376989 ] Doug Cutting commented on NUTCH-257: I'd vote to never have Summary#toString() perform entity encoding, to fix search.jsp to encode things itself, and *not* to add a new

[jira] Resolved: (NUTCH-250) Generate to log truncation caused by generate.max.per.host

2006-04-20 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-250?page=all ] Doug Cutting resolved NUTCH-250: Fix Version: 0.8-dev Resolution: Fixed Assign To: Doug Cutting I just committed this. Thanks, Rod. Generate to log truncation caused by

[jira] Commented: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployement

2006-04-12 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-246?page=comments#action_12374272 ] Doug Cutting commented on NUTCH-246: It seems like the Injector should be loading the current time from a job configuration property in the same way that the

[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372979 ] Doug Cutting commented on NUTCH-240: +1 for committing Generator.patch.txt now. 0 for committing the rest until I've had more time to think about it. I'm not against it,

[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372981 ] Doug Cutting commented on NUTCH-240: Also, note that we can now extend Hadoop's new MapReduceBase to implement configure() and close() for many Mappers and Reducers,

[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

2006-03-30 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372556 ] Doug Cutting commented on NUTCH-171: Ideally we could overlap segment2 map with segment1 reduce to keep bandwidth usage constant. Overlapping map2 with reduce1 should

[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-03-30 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372574 ] Doug Cutting commented on NUTCH-240: First, I hope my critical remarks were not taken personally. I am thankful for this and all of your contributions. Initially, I did

[jira] Commented: (NUTCH-242) Add optional -urlFiltering to updatedb

2006-03-30 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-242?page=comments#action_12372581 ] Doug Cutting commented on NUTCH-242: Shouldn't you use the returned value of the filter? If so, then this should be done in a mapper, not in the reducer. Add optional

[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

2006-03-30 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372597 ] Doug Cutting commented on NUTCH-171: Generate for 20 Segments of 10M in size is almost as fast as 1 segment that is 10M in size. A single 200M URL segment is unwieldy

[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-03-29 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372341 ] Doug Cutting commented on NUTCH-240: The generator store/restore score stuff seems ugly. And it is not used by OPIC. Could we instead have a method that computes and

[jira] Commented: (NUTCH-235) Duplicate Inlink values

2006-03-20 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-235?page=comments#action_12371122 ] Doug Cutting commented on NUTCH-235: I'm concerned about all of the contains() calls this adds to an ArrayList. This is a linear scan, and makes the cost of building a

[jira] Commented: (NUTCH-235) Duplicate Inlink values

2006-03-20 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-235?page=comments#action_12371142 ] Doug Cutting commented on NUTCH-235: The iterator shouldn't be a problem. When we're indexing we also dedup them by domain, which is much more expensive than creating an

[jira] Commented: (NUTCH-235) Duplicate Inlink values

2006-03-20 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-235?page=comments#action_12371147 ] Doug Cutting commented on NUTCH-235: +1 This looks good. It will be a little slower for simple crawls, where each link is only processed once, but probably not

[jira] Commented: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-03-14 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370381 ] Doug Cutting commented on NUTCH-230: Andrzej, that's true if we think links that are filtered are bad links, but if we instead think of them as non-links then this fix is

[jira] Commented: (NUTCH-221) prepare nutch for upcoming lucene 2.0

2006-03-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-221?page=comments#action_12368779 ] Doug Cutting commented on NUTCH-221: +1 Thanks! prepare nutch for upcoming lucene 2.0 - Key: NUTCH-221 URL:

[jira] Created: (NUTCH-218) need DOAP file for Nutch

2006-02-28 Thread Doug Cutting (JIRA)
need DOAP file for Nutch Key: NUTCH-218 URL: http://issues.apache.org/jira/browse/NUTCH-218 Project: Nutch Type: Task Reporter: Doug Cutting Can someone please draft a DOAP file for Nutch, so that we're listed at

[jira] Resolved: (NUTCH-216) cannot build in windows

2006-02-24 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-216?page=all ] Doug Cutting resolved NUTCH-216: Fix Version: 0.8-dev Resolution: Fixed The reason 'exec' was used was to also restore file permissions, which 'untar' does not. So I switched it to

[jira] Resolved: (NUTCH-211) FetchedSegments leave readers open

2006-02-16 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-211?page=all ] Doug Cutting resolved NUTCH-211: Resolution: Fixed I committed this, with a bunch of whitespace fixes. FetchedSegments leave readers open --

[jira] Commented: (NUTCH-211) FetchedSegments leave readers open

2006-02-15 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-211?page=comments#action_12366505 ] Doug Cutting commented on NUTCH-211: The interfaces that FetchedSegments implements should have a close method. Moreover, these interfaces should extend a Closeable

[jira] Resolved: (NUTCH-209) include nutch jar in mapred jobs

2006-02-09 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-209?page=all ] Doug Cutting resolved NUTCH-209: Resolution: Fixed I just committed this. Michael, the 'bin/hadoop jar' command is not (yet) used by Nutch. Please file a Hadoop bug to add the feature

[jira] Commented: (NUTCH-209) include nutch jar in mapred jobs

2006-02-09 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-209?page=comments#action_12365798 ] Doug Cutting commented on NUTCH-209: Andrzej, sorry, I didn't see your remark before I committed this! A DFSClassLoader would have problems with plugins, since our plugin

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-08 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365618 ] Doug Cutting commented on NUTCH-192: Since these mappings are not something that users should alter, I'm not sure they should be in the config file. I added related

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-02-08 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365619 ] Doug Cutting commented on NUTCH-139: +1 This looks great. Thanks for all the hard work on this one! Standard metadata property names in the ParseData metadata

[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

2006-02-08 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=all ] Doug Cutting updated NUTCH-192: --- Attachment: (was: metadata08_02_06.patch) meta data support for CrawlDatum Key: NUTCH-192 URL:

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-08 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365643 ] Doug Cutting commented on NUTCH-192: +1 This looks good to me. Thanks for your persistence. meta data support for CrawlDatum

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-07 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365450 ] Doug Cutting commented on NUTCH-192: Sorry, I misspoke and overstated things too. There are problems, but not with MapWritable, rather with WritableName: this refers to

[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-02-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12365087 ] Doug Cutting commented on NUTCH-193: The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere:

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-02-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365089 ] Doug Cutting commented on NUTCH-139: Jerome: yes, it makes sense, but there's also metadata that's not tightly related to the protocol or the parser, e.g., the nutch

[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-02-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12365130 ] Doug Cutting commented on NUTCH-193: Okay, I've moved the code from Nutch to Hadoop. Now I need to repair Nutch so that it still works! One remaining problem is the need

[jira] Resolved: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-02-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-193?page=all ] Doug Cutting resolved NUTCH-193: Resolution: Fixed I just committed this. Phew! move NDFS and MapReduce to a separate project -

[jira] Resolved: (NUTCH-197) NullPointerException in TaskRunner if application jar does not have lib directory

2006-02-01 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-197?page=all ] Doug Cutting resolved NUTCH-197: Fix Version: 0.8-dev Resolution: Fixed I just committed this. Thanks, Owen! NullPointerException in TaskRunner if application jar does not have lib

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-01 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364923 ] Doug Cutting commented on NUTCH-192: I'm worried that this will substantially slow things. I'd like to see some effort made to ensure that: 1. If no metadata is used,

[jira] Created: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Doug Cutting (JIRA)
move NDFS and MapReduce to a separate project - Key: NUTCH-193 URL: http://issues.apache.org/jira/browse/NUTCH-193 Project: Nutch Type: Task Components: ndfs Versions: 0.8-dev Reporter: Doug Cutting

[jira] Commented: (NUTCH-191) InputFormat used in job must be in JobTracker classpath (not loaded from job JAR)

2006-01-31 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-191?page=comments#action_12364678 ] Doug Cutting commented on NUTCH-191: We've thus far avoided loading job-specific code in the JobTracker and TaskTracker, in order to keep these more reliable. File

[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364690 ] Doug Cutting commented on NUTCH-193: Otis: yes, thanks, I meant org.apache.hadoop.dfs. Andrzej: I'm awaiting Mike's commit of NUTCH-183, which should happen today. I'll

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-27 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Doug Cutting updated NUTCH-139: --- Attachment: (was: NUTCH-139.jc.review.patch.txt) Standard metadata property names in the ParseData metadata

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-27 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Doug Cutting updated NUTCH-139: --- Attachment: (was: NUTCH-139.Mattmann.patch.txt) Standard metadata property names in the ParseData metadata

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-26 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364125 ] Doug Cutting commented on NUTCH-139: I think we're near agreement here. Here are the changes I think this patch still needs: MetadataNames belongs in the protocol

[jira] Commented: (NUTCH-183) MapReduce has a series of problems concerning task-allocation to worker nodes

2006-01-21 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-183?page=comments#action_12363554 ] Doug Cutting commented on NUTCH-183: Byron, that's exactly what Mike means by speculative execution. MapReduce has a series of problems concerning task-allocation to

[jira] Closed: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type

2006-01-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-179?page=all ] Doug Cutting closed NUTCH-179: -- Resolution: Invalid Closed at submitter's request. Proposition: Enable Nutch to use a parser plugin not just based on content type

[jira] Resolved: (NUTCH-177) Default installation seems to produce working entity of nutch

2006-01-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-177?page=all ] Doug Cutting resolved NUTCH-177: Fix Version: 0.8-dev Resolution: Fixed The problem is that your seed url does not end in a slash, yet your url filter requires a slash. In 0.8-dev

[jira] Resolved: (NUTCH-176) Using -dir: creates an error, when the directory already exists

2006-01-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-176?page=all ] Doug Cutting resolved NUTCH-176: Resolution: Won't Fix This check is intentionally made to prevent folks from accidentally overwriting crawls. Using -dir: creates an error, when the

[jira] Commented: (NUTCH-136) mapreduce segment generator generates 50 % less than excepted urls

2006-01-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363308 ] Doug Cutting commented on NUTCH-136: The mapred-default.xml file is actually the best place to set these. mapreduce segment generator generates 50 % less than excepted

[jira] Commented: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

2006-01-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12363309 ] Doug Cutting commented on NUTCH-173: Couldn't you instead use a prefix-urlfilter generated from your crawl seed? PerHost Crawling Policy ( crawl.ignore.external.links )

[jira] Resolved: (NUTCH-102) jobtracker does not start when webapps is in src

2006-01-18 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-102?page=all ] Doug Cutting resolved NUTCH-102: Resolution: Fixed I just applied this patch. Thanks, Owen. jobtracker does not start when webapps is in src

[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

2006-01-11 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12362507 ] Doug Cutting commented on NUTCH-171: I'd like to hear more about why you want multiple segments, what's motivating this patch. The 0.7 -numFetchers parameter was designed

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-09 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362242 ] Doug Cutting commented on NUTCH-139: We can just use different names, rather than two metaData objects: X-nutch names for derived or other values that are usually protocol

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Doug Cutting updated NUTCH-139: --- Comment: was deleted Standard metadata property names in the ParseData metadata -- Key:

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Doug Cutting updated NUTCH-139: --- Comment: was deleted Standard metadata property names in the ParseData metadata -- Key:

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Doug Cutting updated NUTCH-139: --- Comment: was deleted Standard metadata property names in the ParseData metadata -- Key:

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361994 ] Doug Cutting commented on NUTCH-139: Jerome, Some HTTP headers have multiple values. Correctly reflecting that was I thought the primary motivation for adding multiple

[jira] Commented: (NUTCH-160) Use standard Java Regex library rather than org.apache.oro.text.regex

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-160?page=comments#action_12361999 ] Doug Cutting commented on NUTCH-160: +1 I like this patch. I don't see a need for us to use oro anywhere, since Java now has good builtin regex support. And Java's

[jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362002 ] Doug Cutting commented on NUTCH-153: Paul, Does http://issues.apache.org/jira/browse/NUTCH-160 address this issue too? I.e., is at least part of the problem that oro has

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362003 ] Doug Cutting commented on NUTCH-139: Also, since the primary use of multiple metadata values should be for protocols where multiple-values are required, the method to add

[jira] Commented: (NUTCH-152) TaskRunner io pipes are not setDaemon(true), cleanup and exception errors are incomplete, max heap too small

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-152?page=comments#action_12362004 ] Doug Cutting commented on NUTCH-152: re 1,2,5: sounds good. re 3: Why is a separate thread needed for stdout? Can you please elaborate on how this causes problems? re 4:

[jira] Resolved: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ] Doug Cutting resolved NUTCH-151: Fix Version: 0.8-dev Resolution: Fixed I just committed this. Thanks, Paul! CommandRunner can hang after the main thread exec is finished and has

[jira] Resolved: (NUTCH-150) OutlinkExtractor extremely slow on some non-plain text

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-150?page=all ] Doug Cutting resolved NUTCH-150: Fix Version: 0.7.2-dev Resolution: Fixed I just committed this. Thanks, Paul! OutlinkExtractor extremely slow on some non-plain text

[jira] Commented: (NUTCH-159) Specify temp/working directory for crawl

2006-01-02 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-159?page=comments#action_12361541 ] Doug Cutting commented on NUTCH-159: mapred.local.dir is the thing to set. If that fails, then there is a bug. What did you have this set to? Specify temp/working

[jira] Commented: (NUTCH-3) multi values of header discarded

2005-12-17 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-3?page=comments#action_12360665 ] Doug Cutting commented on NUTCH-3: -- I find the naming confusing, where setProperty adds a value. I wonder whether we should provide a 'setProperty' that replaces all values,

[jira] Commented: (NUTCH-3) multi values of header discarded

2005-12-17 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-3?page=comments#action_12360702 ] Doug Cutting commented on NUTCH-3: -- Yes, I prefer this. +1 multi values of header discarded Key: NUTCH-3 URL:

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2005-12-16 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360645 ] Doug Cutting commented on NUTCH-139: I'm confused as to why all of the constant names have X_nutch in them. I'd expect to see something like that in their string values,

[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

2005-12-07 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359624 ] Doug Cutting commented on NUTCH-133: It would be great to have some junit tests which illustrate these problems. If we can first all agree on the desired behaviour, then

[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

2005-12-07 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12359626 ] Doug Cutting commented on NUTCH-134: Can we yet replace Nutch's summarizer with the summarizer in Lucene's contrib directory? Are there features that Nutch requires that

[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

2005-12-07 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359634 ] Doug Cutting commented on NUTCH-133: Stefan, sorry I missed the test case. If others agree that these cases should pass, then we should commit the test case alone as a

[jira] Resolved: (NUTCH-130) Be explicit about target JVM when building (1.4.x?)

2005-12-01 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-130?page=all ] Doug Cutting resolved NUTCH-130: Fix Version: 0.8-dev Resolution: Fixed Assign To: Doug Cutting I just committed this. I moved the version to the default.properties file, and

[jira] Resolved: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-12-01 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ] Doug Cutting resolved NUTCH-116: Fix Version: 0.8-dev Resolution: Fixed I just committed this. Thanks, Paul, this is great to have! TestNDFS a JUnit test specifically for NDFS

[jira] Commented: (NUTCH-99) ports are hardcoded or random

2005-11-14 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-99?page=comments#action_12357617 ] Doug Cutting commented on NUTCH-99: --- Sounds good. We should also probably note in the config property descriptions that these port numbers are the first in a range that will

[jira] Commented: (NUTCH-99) ports are hardcoded or random

2005-11-10 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-99?page=comments#action_12357291 ] Doug Cutting commented on NUTCH-99: --- I cannot get patch on linux to accept this. The absolute DOS paths seem to cause problems. Can you please regenerate this with relative

[jira] Created: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt

2005-11-04 Thread Doug Cutting (JIRA)
protocol-httpclient does not follow redirects when fetching robots.txt -- Key: NUTCH-124 URL: http://issues.apache.org/jira/browse/NUTCH-124 Project: Nutch Type: Bug Components: fetcher

[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-10-20 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12332609 ] Doug Cutting commented on NUTCH-88: --- Jerome, This works well now. I've merged your changes to the mapred branch. Thanks! Doug Enhance ParserFactory plugin selection

[jira] Commented: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-10-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-116?page=comments#action_12332493 ] Doug Cutting commented on NUTCH-116: Paul, This looks like good stuff. I could commit it more easily if changes were restricted to those required by TestNDFS. Changes

[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-10-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12332518 ] Doug Cutting commented on NUTCH-88: --- These both sound like good changes. +1 Enhance ParserFactory plugin selection policy -

[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-10-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12332541 ] Doug Cutting commented on NUTCH-88: --- If it's to happen at parse time then it should happen in the Content constructor, so that it's only done in one place, and we don't rely

[jira] Commented: (NUTCH-82) Nutch Commands should run on Windows without external tools

2005-10-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-82?page=comments#action_12332543 ] Doug Cutting commented on NUTCH-82: --- I do not think we should have multiple versions of the command line tools, since that complicates maintenance. A windows batch file is

[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

2005-10-11 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331847 ] Doug Cutting commented on NUTCH-109: Is your HTTP client polite? Does it only have a single connection open to the server at a time, and does it pause

[jira] Commented: (NUTCH-99) ports are hardcoded or random

2005-10-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-99?page=comments#action_12331225 ] Doug Cutting commented on NUTCH-99: --- What command line would you add this to? I think this should simply start at the default port (e.g., 7030) and loop trying port+1 until
