[jira] Commented: (NUTCH-558) Need tool to retrieve domain statistics

2007-09-27 Thread Chris Schneider (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530755 ] Chris Schneider commented on NUTCH-558: The reason that DomainStats does not use URLUtils is that (as

[jira] Commented: (NUTCH-558) Need tool to retrieve domain statistics

2007-09-23 Thread Chris Schneider (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529749 ] Chris Schneider commented on NUTCH-558: I made a comment in the source about this, but thinking about it

[jira] Created: (NUTCH-558) Need tool to retrieve domain statistics

2007-09-19 Thread Chris Schneider (JIRA)
Need tool to retrieve domain statistics. Key: NUTCH-558; URL: https://issues.apache.org/jira/browse/NUTCH-558; Project: Nutch; Issue Type: New Feature; Affects Versions: 0.9.0; Reporter:

[jira] Commented: (NUTCH-351) Protocol forward proxy

2006-11-01 Thread Chris Schneider (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-351?page=comments#action_12446424 ] Chris Schneider commented on NUTCH-351: I just noticed a bug in the patch above. I believe it's missing a return sequence between the Host: host and

[jira] Created: (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2006-10-11 Thread Chris Schneider (JIRA)
Server delay feature conflicts with maxThreadsPerHost. Key: NUTCH-385; URL: http://issues.apache.org/jira/browse/NUTCH-385; Project: Nutch; Issue Type: Bug; Components: fetcher

[jira] Commented: (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2006-10-11 Thread Chris Schneider (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12441528 ] Chris Schneider commented on NUTCH-385: This comment was actually made by Andrzej in response to an email containing the analysis above that I sent him

[jira] Commented: (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2006-10-11 Thread Chris Schneider (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12441529 ] Chris Schneider commented on NUTCH-385: This comment was actually made by Ken Krugler, who was responding to Andrzej's comment above: [with respect to

[jira] Commented: (NUTCH-351) Protocol forward proxy

2006-09-26 Thread Chris Schneider (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-351?page=comments#action_12438002 ] Chris Schneider commented on NUTCH-351: I would really appreciate it if Sami could explain in a little more detail what this patch adds to the proxy support

[jira] Updated: (NUTCH-371) DeleteDuplicates should remove documents with duplicate URLs

2006-09-25 Thread Chris Schneider (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-371?page=all ] Chris Schneider updated NUTCH-371. Description: DeleteDuplicates is supposed to delete documents with duplicate URLs (after deleting documents with identical MD5 hashes), but this part is

[jira] Commented: (NUTCH-273) When a page is redirected, the original url is NOT updated.

2006-08-24 Thread Chris Schneider (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-273?page=comments#action_12430117 ] Chris Schneider commented on NUTCH-273: Another reason why it would be better to wait until the next segment to process the target of the redirect is that

[jira] Created: (NUTCH-348) Generator is building fetch list using *lowest* scoring URLs

2006-08-16 Thread Chris Schneider (JIRA)
Generator is building fetch list using *lowest* scoring URLs. Key: NUTCH-348; URL: http://issues.apache.org/jira/browse/NUTCH-348; Project: Nutch; Issue Type: Bug

[jira] Commented: (NUTCH-342) Nutch commands log to nutch/logs/hadoop.logs by default

2006-08-06 Thread Chris Schneider (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-342?page=comments#action_12426039 ] Chris Schneider commented on NUTCH-342: I apologize for my confusion. I had been thinking that hadoop-env.sh was getting sourced when a Nutch command was

[jira] Created: (NUTCH-342) Nutch commands log to nutch/logs/hadoop.logs by default

2006-08-05 Thread Chris Schneider (JIRA)
Nutch commands log to nutch/logs/hadoop.logs by default. Key: NUTCH-342; URL: http://issues.apache.org/jira/browse/NUTCH-342; Project: Nutch; Issue Type: Bug; Affects Versions: 0.8

[jira] Updated: (NUTCH-342) Nutch commands log to nutch/logs/hadoop.logs by default

2006-08-05 Thread Chris Schneider (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-342?page=all ] Chris Schneider updated NUTCH-342. Attachment: NUTCH-342.patch. Here's a patch that defaults NUTCH_LOG_DIR to $HADOOP_LOG_DIR and NUTCH_LOGFILE to $HADOOP_LOG_FILE. Nutch commands log to

[jira] Updated: (NUTCH-336) Harvested links shouldn't get db.score.injected in addition to inbound contributions

2006-08-02 Thread Chris Schneider (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-336?page=all ] Chris Schneider updated NUTCH-336. Attachment: NUTCH-336.patch.txt. Here's a patch that fixes the problem. It separates a new injectionScore API out from the initialScore API. Harvested

[jira] Created: (NUTCH-336) Harvested links shouldn't get db.score.injected in addition to inbound contributions

2006-08-01 Thread Chris Schneider (JIRA)
Harvested links shouldn't get db.score.injected in addition to inbound contributions. Key: NUTCH-336; URL: http://issues.apache.org/jira/browse/NUTCH-336; Project:

[jira] Created: (NUTCH-301) CommonGrams loads analysis.common.terms.file for each query

2006-06-06 Thread Chris Schneider (JIRA)
CommonGrams loads analysis.common.terms.file for each query. Key: NUTCH-301; URL: http://issues.apache.org/jira/browse/NUTCH-301; Project: Nutch; Type: Improvement; Components: searcher; Versions:

[jira] Created: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-08 Thread Chris Schneider (JIRA)
Indexer doesn't consider linkdb when calculating boost value. Key: NUTCH-267; URL: http://issues.apache.org/jira/browse/NUTCH-267; Project: Nutch; Type: Bug; Components: indexer; Versions: 0.8-dev

[jira] Commented: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployement

2006-04-12 Thread Chris Schneider (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-246?page=comments#action_12374253 ] Chris Schneider commented on NUTCH-246: As it turns out, this problem was due to a time synchronization problem between the jobtracker and the tasktrackers. When the URLs were

[jira] Updated: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployement

2006-04-12 Thread Chris Schneider (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-246?page=all ] Chris Schneider updated NUTCH-246. Priority: Minor (was: Blocker). segment size is never as big as topN or crawlDB size in a distributed deployement

[jira] Commented: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployement

2006-04-11 Thread Chris Schneider (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-246?page=comments#action_12374049 ] Chris Schneider commented on NUTCH-246: A few more details: Stefan and I were able to reproduce this problem using either an injection set of 4500 URLs or a larger set

[jira] Created: (NUTCH-195) RPC call times out while indexing map task is computing splits

2006-01-31 Thread Chris Schneider (JIRA)
RPC call times out while indexing map task is computing splits. Key: NUTCH-195; URL: http://issues.apache.org/jira/browse/NUTCH-195; Project: Nutch; Type: Bug; Components: indexer; Versions: 0.8-dev