[jira] Commented: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-21 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-354?page=comments#action_12429496 ] Stefan Groschupf commented on NUTCH-354: Since this issue is already closed I can not attach the patch file, so I attach it as text within this comment. If

[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak

2006-08-21 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12429534 ] Stefan Groschupf commented on NUTCH-356: Hi Enrico, there will be as many PluginRepositories as Configuration objects. So in case you create many

[jira] Created: (NUTCH-357) crawling simulation

2006-08-21 Thread Stefan Groschupf (JIRA)
crawling simulation --- Key: NUTCH-357 URL: http://issues.apache.org/jira/browse/NUTCH-357 Project: Nutch Issue Type: Improvement Affects Versions: 0.8.1, 0.9.0 Reporter: Stefan Groschupf Fix

[jira] Updated: (NUTCH-357) crawling simulation

2006-08-21 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-357?page=all ] Stefan Groschupf updated NUTCH-357: --- Attachment: protocol-simulation-pluginV1.patch A very first preview of a plugin that helps to simulate crawls. This protocol plugin can be used to

[jira] Created: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-19 Thread Stefan Groschupf (JIRA)
MapWritable, nextEntry is not reset when Entries are recycled --- Key: NUTCH-354 URL: http://issues.apache.org/jira/browse/NUTCH-354 Project: Nutch Issue Type: Bug Affects

[jira] Updated: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-19 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-354?page=all ] Stefan Groschupf updated NUTCH-354: --- Attachment: resetNextEntryInMapWritableV1.patch Resets the next Entry of a recycled entry. MapWritable, nextEntry is not reset when Entries are

[jira] Commented: (NUTCH-343) Index MP3 SHA1 hashes

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-343?page=comments#action_12428920 ] Stefan Groschupf commented on NUTCH-343: Thanks for the contribution, also that your patch has a test. :-) Just a small comment from taking a first look to

[jira] Updated: (NUTCH-341) IndexMerger now deletes entire workingdir after completing

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ] Stefan Groschupf updated NUTCH-341: --- Attachment: doNotDeleteTmpIndexMergeDirV1.patch +1. I agree it makes completely no sense to be required to create a tmp folder manually and nutch deletes

[jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ] Stefan Groschupf updated NUTCH-337: --- Attachment: respectFetcherParsePropertyV1.patch Hi Jeremy, thanks for catching this. Attached a fix. Should be easy for a contributor to commit this to

[jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ] Stefan Groschupf updated NUTCH-337: --- Priority: Major (was: Trivial) Fetcher ignores the fetcher.parse value configured in config file

[jira] Created: (NUTCH-350) urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE

2006-08-17 Thread Stefan Groschupf (JIRA)
urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE -- Key: NUTCH-350 URL:

[jira] Updated: (NUTCH-350) urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-350?page=all ] Stefan Groschupf updated NUTCH-350: --- Attachment: protocolRetryV5.patch This patch will dramatically increase the number of successfully fetched pages of an intranet crawl over time.

[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12428858 ] Stefan Groschupf commented on NUTCH-322: I think this is a serious problem. Page A server-side redirects to Page B. Page A is never written to the output.

[jira] Created: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-08-17 Thread Stefan Groschupf (JIRA)
pages that serverside forwards will be refetched every time --- Key: NUTCH-353 URL: http://issues.apache.org/jira/browse/NUTCH-353 Project: Nutch Issue Type: Bug Affects Versions:

[jira] Updated: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-353?page=all ] Stefan Groschupf updated NUTCH-353: --- Attachment: doNotRefecthForwarderPagesV1.patch Since we discussed that nutch needs to be more polite we should fix that asap. pages that serverside

[jira] Resolved: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-322?page=all ] Stefan Groschupf resolved NUTCH-322. Resolution: Duplicate duplicate of NUTCH-353 Fetcher discards ProtocolStatus, doesn't store redirected pages

[jira] Commented: (NUTCH-347) Build: plugins' Jars not found

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-347?page=comments#action_12428915 ] Stefan Groschupf commented on NUTCH-347: Please submit this patch! Thanks! Build: plugins' Jars not found --

[jira] Commented: (NUTCH-346) Improve readability of logs/hadoop.log

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-346?page=comments#action_12428917 ] Stefan Groschupf commented on NUTCH-346: +1 I agree, can you please create a patch file and attach it to this bug. Thanks Improve readability of

[jira] Commented: (NUTCH-345) Add support for Content-Encoding: deflated

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-345?page=comments#action_12428918 ] Stefan Groschupf commented on NUTCH-345: Shouldn't the DeflateUtils also be part of the protocol-http plugin? Also since it is a larger contribution and

[jira] Commented: (NUTCH-349) Port Nutch to use Hadoop Text instead of UTF8

2006-08-16 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-349?page=comments#action_12428537 ] Stefan Groschupf commented on NUTCH-349: my vote goes to #2. Having a tool that needs to be started manually would be better than complicating the already

[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-08-16 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12428542 ] Stefan Groschupf commented on NUTCH-233: Hi Otis, yes for a serious whole web crawl I need to change this reg ex first. It only hangs with some random urls

[jira] Updated: (NUTCH-348) Generator is building fetch list using *lowest* scoring URLs

2006-08-16 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-348?page=all ] Stefan Groschupf updated NUTCH-348: --- Attachment: sortPatchV1.patch What do people think about this kind of solution? Generator is building fetch list using *lowest* scoring URLs

[jira] Created: (NUTCH-332) doubling score causes by page internal anchors.

2006-07-28 Thread Stefan Groschupf (JIRA)
doubling score causes by page internal anchors. --- Key: NUTCH-332 URL: http://issues.apache.org/jira/browse/NUTCH-332 Project: Nutch Issue Type: Bug Affects Versions: 0.8-dev

[jira] Commented: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-26 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423539 ] Stefan Groschupf commented on NUTCH-318: Yes this happens only in a distributed environment. Please also see my last mail in the hadoop dev list. I think

[jira] Commented: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-25 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423433 ] Stefan Groschupf commented on NUTCH-318: Shouldn't that be fixed in .8, since as of today this tool just produces no output?! log4j not proper configured,

[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-07-25 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12423438 ] Stefan Groschupf commented on NUTCH-233: I think this should be fixed in .8 too, since everybody who does a real whole-web crawl with over 100 million pages

[jira] Created: (NUTCH-325) UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes

2006-07-20 Thread Stefan Groschupf (JIRA)
UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes --- Key: NUTCH-325 URL:

[jira] Updated: (NUTCH-325) UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes

2006-07-20 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-325?page=all ] Stefan Groschupf updated NUTCH-325: --- Attachment: UrlFiltersNPE.patch A patch that uses an ArrayList instead of an array and puts entries into the list only when the entry is not null. Means

[jira] Updated: (NUTCH-323) CrawlDatum.set just reference a mapWritable of a other object but not copy it.

2006-07-19 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-323?page=all ] Stefan Groschupf updated NUTCH-323: --- Attachment: MapWritableCopyConstructor.patch The attached patch adds a copy constructor to the MapWritable and uses it in the CrawlDatum.set method. However

[jira] Created: (NUTCH-324) db.score.link.internal and db.score.link.external are ignored

2006-07-19 Thread Stefan Groschupf (JIRA)
db.score.link.internal and db.score.link.external are ignored - Key: NUTCH-324 URL: http://issues.apache.org/jira/browse/NUTCH-324 Project: Nutch Issue Type: Improvement

[jira] Updated: (NUTCH-324) db.score.link.internal and db.score.link.external are ignored

2006-07-19 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-324?page=all ] Stefan Groschupf updated NUTCH-324: --- Attachment: InternalAndExternalLinkScoreFactor.patch Multiply the score of a page during distributeScoreToOutlink with db.score.link.internal or

[jira] Resolved: (NUTCH-319) OPICScoringFilter should use logging API instead of printStackTrace

2006-07-19 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-319?page=all ] Stefan Groschupf resolved NUTCH-319. Resolution: Won't Fix Sorry, that is bogus since it is written to the logging stream. OPICScoringFilter should use logging API instead of

[jira] Created: (NUTCH-319) OPICScoringFilter should use logging API instead of printStackTrace

2006-07-15 Thread Stefan Groschupf (JIRA)
OPICScoringFilter should use logging API instead of printStackTrace --- Key: NUTCH-319 URL: http://issues.apache.org/jira/browse/NUTCH-319 Project: Nutch Issue Type: Bug

[jira] Created: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-10 Thread Stefan Groschupf (JIRA)
log4j not proper configured, readdb doesnt give any information --- Key: NUTCH-318 URL: http://issues.apache.org/jira/browse/NUTCH-318 Project: Nutch Type: Bug Versions: 0.8-dev Reporter:

[jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-12 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ] Stefan Groschupf updated NUTCH-289: --- Attachment: ipInCrawlDatumDraftV5.patch Release Candidate 1 of this patch. This patch contains: + add IP Address to CrawlDatum Version 5 (as byte[4]) +

[jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-07 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ] Stefan Groschupf updated NUTCH-289: --- Attachment: ipInCrawlDatumDraftV4.patch Attached a patch that always uses only 4 bytes for the IP. This means we ignore IPv6. This saves us a 4 byte in

[jira] Created: (NUTCH-302) java doc of CrawlDb is wrong

2006-06-07 Thread Stefan Groschupf (JIRA)
java doc of CrawlDb is wrong Key: NUTCH-302 URL: http://issues.apache.org/jira/browse/NUTCH-302 Project: Nutch Type: Bug Reporter: Stefan Groschupf Priority: Trivial Fix For: 0.8-dev CrawlDb has the same java doc as

[jira] Updated: (NUTCH-301) CommonGrams loads analysis.common.terms.file for each query

2006-06-07 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-301?page=all ] Stefan Groschupf updated NUTCH-301: --- Attachment: CommonGramsCacheV1.patch Cache HashMap COMMON_TERMS in configuration instance. CommonGrams loads analysis.common.terms.file for each query

[jira] Commented: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-07 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415171 ] Stefan Groschupf commented on NUTCH-293: Any comments? There was already a posting in the nutch agent mailing list, where someone had banned nutch since nutch does not

[jira] Commented: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-07 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415236 ] Stefan Groschupf commented on NUTCH-293: Hi Andrzej, I agree but writing a queue based fetcher is a big step. I already have some basic code (nio based). Also I don't

[jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12414763 ] Stefan Groschupf commented on NUTCH-258: Scott, I agree with you. However we need a clean patch to solve the problem, we can not just comment things out of the code.

[jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-05 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ] Stefan Groschupf updated NUTCH-289: --- Attachment: ipInCrawlDatumDraftV1.patch To keep the discussion alive attached a _first draft_ for storing the ip in the crawlDatum for public discussion.

[jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned a NPE is thrown

2006-06-04 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ] Stefan Groschupf updated NUTCH-298: --- Summary: if a 404 for a robots.txt is returned a NPE is thrown (was: if a 404 for a robots.txt is returned no page is fetched at all from the host)

[jira] Created: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host

2006-06-03 Thread Stefan Groschupf (JIRA)
if a 404 for a robots.txt is returned no page is fetched at all from the host - Key: NUTCH-298 URL: http://issues.apache.org/jira/browse/NUTCH-298 Project: Nutch Type: Bug Reporter:

[jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host

2006-06-03 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ] Stefan Groschupf updated NUTCH-298: --- Attachment: fixNpeRobotRuleSet.patch Fixes the NPE in RobotRuleSet that happens when we use an empty RuleSet. if a 404 for a robots.txt is returned no page is

[jira] Commented: (NUTCH-282) Showing too few results on a page (Paging not correct)

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-282?page=comments#action_12414435 ] Stefan Groschupf commented on NUTCH-282: Is that related to host grouping we discussed? Can we in this case close this bug? Showing too few results on a page (Paging

[jira] Commented: (NUTCH-286) Handling common error-pages as 404

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-286?page=comments#action_12414439 ] Stefan Groschupf commented on NUTCH-286: This is difficult to realize since the http error code is read from the response in the fetcher and set in the protocol

[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12414443 ] Stefan Groschupf commented on NUTCH-292: +1, Can someone create a clean patch file? OpenSearchServlet: OutOfMemoryError: Java heap space

[jira] Commented: (NUTCH-291) OpenSearchServlet should return date as well as lastModified

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-291?page=comments#action_12414445 ] Stefan Groschupf commented on NUTCH-291: lastModified will be only indexed if you switch on the index-more plugin. If you think you should change the way lastmodified

[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414448 ] Stefan Groschupf commented on NUTCH-290: If a parser throws an exception: Fetcher, 261: try { parse = this.parseUtil.parse(content); parseStatus =

[jira] Closed: (NUTCH-287) Exception when searching with sort

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-287?page=all ] Stefan Groschupf closed NUTCH-287: -- Resolution: Won't Fix http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04696.html Exception when searching with sort

[jira] Closed: (NUTCH-284) NullPointerException during index

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-284?page=all ] Stefan Groschupf closed NUTCH-284: -- Resolution: Won't Fix Yes, I was missing index-basic. NullPointerException during index - Key: NUTCH-284

[jira] Commented: (NUTCH-284) NullPointerException during index

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12414453 ] Stefan Groschupf commented on NUTCH-284: Please try to discuss such things on the user mailing list first, then open an issue. Maintaining the issue tracking is very time

[jira] Commented: (NUTCH-281) cached.jsp: base-href needs to be outside comments

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-281?page=comments#action_12414454 ] Stefan Groschupf commented on NUTCH-281: Can you submit a patch file? cached.jsp: base-href needs to be outside comments

[jira] Commented: (NUTCH-274) Empty row in/at end of URL-list results in error

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-274?page=comments#action_12414457 ] Stefan Groschupf commented on NUTCH-274: Should we fix this in TextInputFormat of Hadoop to ignore empty lines or in the Injector? Empty row in/at end of URL-list

[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414469 ] Stefan Groschupf commented on NUTCH-290: As far as I understand the code, the next parser is only used if the previous parser returns an unsuccessful parsing status.

[jira] Closed: (NUTCH-286) Handling common error-pages as 404

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-286?page=all ] Stefan Groschupf closed NUTCH-286: -- Resolution: Won't Fix I hope everybody agrees with the statement: We can not detect http response codes based on responded html content. Prune the

[jira] Created: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-01 Thread Stefan Groschupf (JIRA)
support for Crawl-delay in Robots.txt - Key: NUTCH-293 URL: http://issues.apache.org/jira/browse/NUTCH-293 Project: Nutch Type: Improvement Components: fetcher Versions: 0.8-dev Reporter: Stefan Groschupf

[jira] Updated: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-01 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-293?page=all ] Stefan Groschupf updated NUTCH-293: --- Attachment: crawlDelayv1.patch A first draft of crawl delay support for nutch. The problem I see is that in case ip based delay is configured it can

[jira] Commented: (NUTCH-289) CrawlDatum should store IP address

2006-05-30 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413940 ] Stefan Groschupf commented on NUTCH-289: +1 Andrzej, I agree that looking up the ip in ParseOutputFormat would be the best, as Doug suggested. The biggest problem nutch has

[jira] Commented: (NUTCH-249) black- white list url filtering

2006-04-26 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-249?page=comments#action_12376477 ] Stefan Groschupf commented on NUTCH-249: I mean the Class and method naming isn't very good. Blacklist or blocklist? Whitelist or positivelist? Does this answer the

[jira] Created: (NUTCH-251) Administration GUI

2006-04-21 Thread Stefan Groschupf (JIRA)
Administration GUI -- Key: NUTCH-251 URL: http://issues.apache.org/jira/browse/NUTCH-251 Project: Nutch Type: Improvement Versions: 0.8-dev Reporter: Stefan Groschupf Priority: Minor Fix For: 0.8-dev Having a web based

[jira] Updated: (NUTCH-249) black- white list url filtering

2006-04-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-249?page=all ] Stefan Groschupf updated NUTCH-249: --- Attachment: blackWhiteListV2.patch A concept tryout of black- white list filtering. I'm looking for beta testers and improvement suggestions. (Especially

[jira] Updated: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployement

2006-04-13 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-246?page=all ] Stefan Groschupf updated NUTCH-246: --- Attachment: injectWithCurTimeMapper.patch setFetchTime moved to Mapper. segment size is never as big as topN or crawlDB size in a distributed

[jira] Created: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployement

2006-04-11 Thread Stefan Groschupf (JIRA)
segment size is never as big as topN or crawlDB size in a distributed deployement - Key: NUTCH-246 URL: http://issues.apache.org/jira/browse/NUTCH-246 Project: Nutch Type: Bug

[jira] Created: (NUTCH-247) robot parser to restrict.

2006-04-11 Thread Stefan Groschupf (JIRA)
robot parser to restrict. - Key: NUTCH-247 URL: http://issues.apache.org/jira/browse/NUTCH-247 Project: Nutch Type: Bug Components: fetcher Versions: 0.8-dev Reporter: Stefan Groschupf Priority: Minor Fix For:

[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-03-16 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12370686 ] Stefan Groschupf commented on NUTCH-233: Sorry, I don't have such a url since it happens while reducing a fetch. Reducing provides no logging and map data will be deleted

[jira] Created: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-03-15 Thread Stefan Groschupf (JIRA)
wrong regular expression hang reduce process for ever -- Key: NUTCH-233 URL: http://issues.apache.org/jira/browse/NUTCH-233 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: Stefan Groschupf

[jira] Created: (NUTCH-229) improved handling of plugin folder configuration

2006-03-12 Thread Stefan Groschupf (JIRA)
improved handling of plugin folder configuration Key: NUTCH-229 URL: http://issues.apache.org/jira/browse/NUTCH-229 Project: Nutch Type: Improvement Reporter: Stefan Groschupf Priority: Critical Fix For:

[jira] Updated: (NUTCH-229) improved handling of plugin folder configuration

2006-03-12 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-229?page=all ] Stefan Groschupf updated NUTCH-229: --- Attachment: pluginFolder.patch A patch that allows using relative paths that are not in the classpath. improved handling of plugin folder configuration

[jira] Created: (NUTCH-226) CrawlDb Filter tool

2006-03-08 Thread Stefan Groschupf (JIRA)
CrawlDb Filter tool --- Key: NUTCH-226 URL: http://issues.apache.org/jira/browse/NUTCH-226 Project: Nutch Type: Improvement Reporter: Stefan Groschupf Priority: Minor A tool to filter an existing crawlDb

[jira] Updated: (NUTCH-226) CrawlDb Filter tool

2006-03-08 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-226?page=all ] Stefan Groschupf updated NUTCH-226: --- Attachment: crawlDbFilter.patch Patch with tool to filter an existing crawlDb. In any case backup your crawlDb first. CrawlDb Filter tool

[jira] Closed: (NUTCH-222) Exception in thread main java.lang.NoClassDefFoundError: invertlink

2006-03-04 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-222?page=all ] Stefan Groschupf closed NUTCH-222: -- Resolution: Fixed Hi, I guess it is a typo; try invertlinks. In case the nutch script does not know the command, as in your case invertlink, it tries to

[jira] Commented: (NUTCH-204) multiple field values in HitDetails

2006-02-27 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367991 ] Stefan Groschupf commented on NUTCH-204: Jérôme, After taking a look at the HitDetails object again, after some time, I notice I had completely overlooked that there

[jira] Commented: (NUTCH-204) multiple field values in HitDetails

2006-02-27 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12368038 ] Stefan Groschupf commented on NUTCH-204: Yes that is a good idea. Thanks for getting this into the sources. Cheers, Stefan multiple field values in HitDetails

[jira] Commented: (NUTCH-204) multiple field values in HitDetails

2006-02-23 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367520 ] Stefan Groschupf commented on NUTCH-204: There is something I don't understand with this patch. The way Lucene manages multi-valued fields is to have many mono-valued

[jira] Commented: (NUTCH-204) multiple field values in HitDetails

2006-02-23 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367539 ] Stefan Groschupf commented on NUTCH-204: Wouldn't you end up with something very similar to what it is now, having one key and multiple values per key? The Lucene Document

[jira] Commented: (NUTCH-204) multiple field values in HitDetails

2006-02-23 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367552 ] Stefan Groschupf commented on NUTCH-204: Makes sense, I see, thanks for the clarification. multiple field values in HitDetails ---

[jira] Created: (NUTCH-213) checkstyle

2006-02-18 Thread Stefan Groschupf (JIRA)
checkstyle -- Key: NUTCH-213 URL: http://issues.apache.org/jira/browse/NUTCH-213 Project: Nutch Type: Improvement Versions: 0.8-dev Reporter: Stefan Groschupf Priority: Minor Adding checkstyle target to ant build file to support

[jira] Updated: (NUTCH-213) checkstyle

2006-02-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-213?page=all ] Stefan Groschupf updated NUTCH-213: --- Attachment: checkstyle.patch checkstyle-all-4.1.jar As part of my learning lesson 'whitespace' I added a checkstyle target to the build

[jira] Commented: (NUTCH-211) FetchedSegments leave readers open

2006-02-16 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-211?page=comments#action_12366645 ] Stefan Groschupf commented on NUTCH-211: Raghavendra, I'm not sure if I also close the linkDB reader, maybe I missed that. I will check this later today and may come

[jira] Updated: (NUTCH-211) FetchedSegments leave readers open

2006-02-16 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-211?page=all ] Stefan Groschupf updated NUTCH-211: --- Attachment: closeable160206.patch Now also closing linkdb reader and file system, thanks to Raghavendra. FetchedSegments leave readers open

[jira] Created: (NUTCH-211) FetchedSegments leave readers open

2006-02-15 Thread Stefan Groschupf (JIRA)
FetchedSegments leave readers open --- Key: NUTCH-211 URL: http://issues.apache.org/jira/browse/NUTCH-211 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: Stefan Groschupf Priority: Critical Fix For: 0.8-dev

[jira] Commented: (NUTCH-204) multiple field values in HitDetails

2006-02-15 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12366472 ] Stefan Groschupf commented on NUTCH-204: Any improvement suggestions or negative comments? If not it would be great if someone with write access to the svn can commit this

[jira] Assigned: (NUTCH-211) FetchedSegments leave readers open

2006-02-15 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-211?page=all ] Stefan Groschupf reassigned NUTCH-211: -- Assign To: Stefan Groschupf FetchedSegments leave readers open -- Key: NUTCH-211 URL:

[jira] Updated: (NUTCH-211) FetchedSegments leave readers open

2006-02-15 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-211?page=all ] Stefan Groschupf updated NUTCH-211: --- Attachment: closeFetchSegments.patch NutchBean, FetchedSegments, FetchedSegments.Segment, IndexSearcher and HitContent now extend / implement the hadoop

[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

2006-02-08 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=all ] Stefan Groschupf updated NUTCH-192: --- Attachment: metadata08_02_06.patch Doug, I'm afraid there is a misunderstanding or maybe I just do not understand your comments. A plugin never needs

[jira] Created: (NUTCH-204) multiple field values in HitDetails

2006-02-06 Thread Stefan Groschupf (JIRA)
multiple field values in HitDetails --- Key: NUTCH-204 URL: http://issues.apache.org/jira/browse/NUTCH-204 Project: Nutch Type: Improvement Components: searcher Versions: 0.8-dev Reporter: Stefan Groschupf Fix

[jira] Updated: (NUTCH-204) multiple field values in HitDetails

2006-02-06 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-204?page=all ] Stefan Groschupf updated NUTCH-204: --- Attachment: DetailGetValues070206.patch Patch that adds getValues to HitDetails. multiple field values in HitDetails

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-01 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364788 ] Stefan Groschupf commented on NUTCH-192: That's true. In any case I don't want to store the class id map. Since if we do that, you are right we can use strings. What

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-01 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364795 ] Stefan Groschupf commented on NUTCH-192: A perfect plan, I will do that and commit a new patch. :) THANKS! meta data support for CrawlDatum

[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

2006-02-01 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=all ] Stefan Groschupf updated NUTCH-192: --- Attachment: metadata010206.patch As discussed... meta data support for CrawlDatum Key: NUTCH-192

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-01-31 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364683 ] Stefan Groschupf commented on NUTCH-192: Andrzej, Doug. I'm not sure if I understand you correctly, do you suggest to have string keys and values, or just string keys?

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-01-31 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364699 ] Stefan Groschupf commented on NUTCH-192: * plus whatever it takes to put the class name-id mapping in the MapWritable header (the mapping table): let's assume 40

[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

2006-01-31 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=all ] Stefan Groschupf updated NUTCH-192: --- Attachment: metadata310106.patch Now 1 byte for the class type and the size of the type itself, this means we can have only 2 byte keys and 2 byte values

[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

2006-01-30 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=all ] Stefan Groschupf updated NUTCH-192: --- Attachment: metadata300106.patch Attached a first suggestion for a patch adding meta data support to crawlDatum. In general I created a MapWritable

[jira] Commented: (NUTCH-14) NullPointerException NutchBean.getSummary

2006-01-29 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-14?page=comments#action_12364401 ] Stefan Groschupf commented on NUTCH-14: --- I didn't see that anymore, but I haven't run any newer heavy load tests. We can probably close this for now. NullPointerException

[jira] Commented: (NUTCH-59) meta data support in webdb

2006-01-26 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-59?page=comments#action_12364136 ] Stefan Groschupf commented on NUTCH-59: --- Nutch 0.8 is very different to 0.7 in the way it stores page data and linkgraph. Therefore a reimplementation of meta data

[jira] Resolved: (NUTCH-127) uncorrect values using -du, or ls does not return items

2006-01-23 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-127?page=all ] Stefan Groschupf resolved NUTCH-127. Resolution: Fixed I guess it is solved, thanks. If I am able to reproduce it again I will just reopen this or file a new report. Thanks! uncorrect values

[jira] Commented: (NUTCH-169) remove static NutchConf

2006-01-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12363116 ] Stefan Groschupf commented on NUTCH-169: Thanks, we will fix this in the beginning of next week. remove static NutchConf --- Key:
