nutch is losing not-modified pages
Hi, in the Fetcher at line 192, in case the status is NOTMODIFIED we collect null as the content, even though we already have the content. I'm worried about what happens to a page that does not change for 60 days, since the concept in Nutch is to delete segments that are older than db.default.fetch.interval, isn't it? If this is true, maybe someone with write access can change null to content. Thanks for any comments. Stefan
Re: nutch is losing not-modified pages
Stefan Groschupf wrote: Hi, in the Fetcher at line 192, in case the status is NOTMODIFIED we collect null as the content, even though we already have the content. I'm worried about what happens to a page that does not change for 60 days, since the concept in Nutch is to delete segments that are older than db.default.fetch.interval, isn't it? If this is true, maybe someone with write access can change null to content. This requires a more systematic approach, which is part of the adaptive fetch patch. In that patch, pages which are older than the maximum fetch interval (a system-wide setting) will be forced onto the fetchlist, no matter what their state. This also ensures that pages in the GONE state are checked from time to time. I'll be working on this patch next week, with the goal of committing it, and I could use some testing and code review then... -- Best regards, Andrzej Bialecki (Information Retrieval, Semantic Web; Embedded Unix, System Integration) http://www.sigram.com Contact: info at sigram dot com
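The forced-refetch rule Andrzej describes can be sketched roughly as follows. The class and parameter names here are hypothetical, for illustration only - this is not the actual adaptive fetch patch code:

```java
// Illustrative sketch (not the real patch): a page is due for fetching when
// its own fetch interval has elapsed, and is force-fetched regardless of its
// state (e.g. GONE) once it is older than a system-wide maximum interval.
class FetchScheduler {
    static boolean shouldFetch(long lastFetchTime, long fetchInterval,
                               long maxInterval, long now) {
        long age = now - lastFetchTime;
        if (age > maxInterval) return true; // force onto fetchlist, any state
        return age > fetchInterval;         // normal re-fetch schedule
    }
}
```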
[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378387 ] Dawid Weiss commented on NUTCH-134: --- (back from holidays, so a bit delayed, but) I confirm Andrzej's suggestion -- a plain-text-only summarizer is ideal for clustering, for example. HTML is quite uncomfortable to work with. Summarizer doesn't select the best snippets --- Key: NUTCH-134 URL: http://issues.apache.org/jira/browse/NUTCH-134 Project: Nutch Type: Bug Components: searcher Versions: 0.7.2, 0.7.1, 0.7, 0.8-dev Reporter: Andrzej Bialecki Attachments: summarizer.060506.patch Summarizer.java tries to select the best fragments from the input text, where the frequency of query terms is the highest. However, the logic in line 223 is flawed in that the excerptSet.add() operation will add new excerpts only if they are not already present - the test is performed using a Comparator that compares only the numUniqueTokens. This means that if there are two or more excerpts which score equally high, only the first of them will be retained, and the rest of the equally-scoring excerpts will be discarded in favor of other excerpts (possibly lower-scoring). To fix this, the Set should be replaced with a List + a sort operation. To keep the relative position of excerpts in the original order, the Excerpt class should be extended with an int order field, and the collected excerpts should be sorted in that order prior to adding them to the summary. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
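The fix described in the issue (replace the Set with a List + sort, and restore document order via an int order field) can be sketched as follows. The Excerpt fields and the selector class are illustrative stand-ins, not the actual Nutch Summarizer API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for Nutch's Summarizer excerpt; field names are
// illustrative only.
class Excerpt {
    final String text;
    final int numUniqueTokens; // score: distinct query terms in the excerpt
    final int order;           // position in the original document

    Excerpt(String text, int numUniqueTokens, int order) {
        this.text = text;
        this.numUniqueTokens = numUniqueTokens;
        this.order = order;
    }
}

class ExcerptSelector {
    // Keep the top-n excerpts by score WITHOUT dropping equal-scoring ones
    // (a Set whose comparator compares only numUniqueTokens treats equal
    // scores as duplicates), then restore original document order.
    static List<Excerpt> select(List<Excerpt> candidates, int n) {
        List<Excerpt> sorted = new ArrayList<>(candidates);
        // stable sort: ties keep their original relative order
        sorted.sort(Comparator.comparingInt((Excerpt e) -> e.numUniqueTokens).reversed());
        List<Excerpt> top = new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
        top.sort(Comparator.comparingInt(e -> e.order));
        return top;
    }
}
```

With two equally-scoring excerpts and n = 2, both survive selection, which is exactly the case the Set-based logic gets wrong.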
Re: http chunked content
As far as I know, a lot of HTTP servers respond with chunked content - at least all that return dynamically generated pages. Should I file a bug? Any thoughts? In fact, the requests issued from the http plugin are HTTP 1.0, so the servers should never return chunked content. I think that readChunkedContent was included in the code for future use. Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
[jira] Created: (NUTCH-265) Getting Clustered results in better form.
Getting Clustered results in better form. - Key: NUTCH-265 URL: http://issues.apache.org/jira/browse/NUTCH-265 Project: Nutch Type: Improvement Components: searcher Versions: 0.7.2 Reporter: Kris K The cluster results currently come with a title and a link to the URL. As an improvement, they should be clustered keyword phrases (like Vivisimo's). Anyone can share their views on it.
Re: Merging segments
Chris Fellows wrote: Hello, So the last discussion on merging segments was back in Jan. Has there been any progress in this direction? What would be the benefit of being able to merge segments? Would being able to merge segments open up new functionality options, or is merging just a convenience? Also, what's the estimate for how involved merge functionality development is? Relief is on the way. Fine folks at houxou.com have sponsored the development of a brand-new SegmentMerger + slicer, and decided to donate it to the project - big thanks! I'm running some final tests, and will commit it today/tomorrow. -- Best regards, Andrzej Bialecki
[jira] Commented: (NUTCH-265) Getting Clustered results in better form.
[ http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12378425 ] Dawid Weiss commented on NUTCH-265: --- The clustering interface is very simple in Nutch because it usually needs to be adjusted to the needs of a particular application. Maintaining a complex user interface is not among Nutch's objectives, so I doubt it's possible. Carrot2, which Nutch uses internally, has a JavaScript-powered interface which could be added to Nutch if there are folks who really think it is worth the effort. See this one: http://carrot.cs.put.poznan.pl/carrot2-remote-controller/newsearch.do?query=nutch&processingChain=carrot2.process.lingo-yahooapi&resultsRequested=100 Getting Clustered results in better form. - Key: NUTCH-265 URL: http://issues.apache.org/jira/browse/NUTCH-265
Re: http chunked content
I'm almost sure that this is not related to HTTP 1.0 requests. On 08.05.2006 at 03:20, Jérôme Charron wrote: In fact, the requests issued from the http plugin are HTTP 1.0, so the servers should never return chunked content. I think that readChunkedContent was included in the code for future use. Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378458 ] Doug Cutting commented on NUTCH-134: +1 for Summary as Writable and change HitSummarizer.getSummary() to return a Summary directly rather than a String. I don't think this has bad performance implications. Summarizer doesn't select the best snippets --- Key: NUTCH-134 URL: http://issues.apache.org/jira/browse/NUTCH-134
Re: http chunked content
http://www.apple.com, for example, answers with chunked content even if you request with an HTTP 1.0 header. On 08.05.2006 at 03:20, Jérôme Charron wrote: In fact, the requests issued from the http plugin are HTTP 1.0, so the servers should never return chunked content. I think that readChunkedContent was included in the code for future use. Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: Merging segments
That's great. Well, my follow-up to that then is: will the new tool allow any form of diff'ing segments? In practice this would allow you to run a crawl on a series of sites one week, then run another crawl on the same sites a week or so later, diff the segments, and allow users to search on changes within the search domain. --- Andrzej Bialecki wrote: Relief is on the way. Fine folks at houxou.com have sponsored the development of a brand-new SegmentMerger + slicer, and decided to donate it to the project - big thanks! I'm running some final tests, and will commit it today/tomorrow.
Re: Merging segments
Chris Fellows wrote: That's great. Well, my follow-up to that then is: will the new tool allow any form of diff'ing segments? No, it does only two things - merging and slicing. That's already one too many... ;) Chris Fellows wrote: In practice this would allow you to run a crawl on a series of sites one week, then run another crawl on the same sites a week or so later, diff the segments, and allow users to search on changes within the search domain. Interesting concept, but I think it would be better implemented as a variant of de-duplication, rather than segment content manipulation. -- Best regards, Andrzej Bialecki
Re: http chunked content
Furthermore, we can read in the HTTP/1.1 specification that a server MUST NOT send transfer-codings to an HTTP/1.0 client. I once did a socket implementation against Anonymizer. This is a well-established proxy service that services $100K+ government and private contracts. Their server always sent chunked content despite all headers. I'm pretty sure that there are other well-established servers that send chunked content despite the RFC. I'm guessing that it might have something to do with wanting to control content compression. All the browsers can handle it, and that's probably all Apple is concerned with - even though they're overriding an RFC spec requirement. Chris --- Jérôme Charron wrote: http://www.apple.com, for example, answers with chunked content even if you request with an HTTP 1.0 header. Stefan, I don't see any Transfer-Encoding: chunked header in responses from www.apple.com. Furthermore, we can read in the HTTP/1.1 specification that a server MUST NOT send transfer-codings to an HTTP/1.0 client. Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: http chunked content
Just remembered - I got around it by using HttpClient, which handles reading the response (chunked or not) transparently. I haven't looked at the Nutch code, but if we were to use HttpClient 3.0.x or later, it should take care of it.
[jira] Created: (NUTCH-266) hadoop bug when doing updatedb
hadoop bug when doing updatedb -- Key: NUTCH-266 URL: http://issues.apache.org/jira/browse/NUTCH-266 Project: Nutch Type: Bug Versions: 0.8-dev Environment: windows xp, JDK 1.4.2_04 Reporter: Eugen Kochuev I constantly get the following error message:
060508 230637 Running job: job_pbhn3t
060508 230637 c:/nutch/crawl-20060508230625/crawldb/current/part-0/data:0+245
060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_fetch/part-0/data:0+296
060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_parse/part-0:0+5258
060508 230637 job_pbhn3t
java.io.IOException: Target /tmp/hadoop/mapred/local/reduce_qnd5sx/map_qjp7tf.out already exists
    at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:162)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:62)
    at org.apache.hadoop.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:191)
    at org.apache.hadoop.fs.FileSystem.rename(FileSystem.java:306)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:101)
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:54)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)
Re: http chunked content
Chris Fellows wrote: Just remembered - I got around it by using HttpClient, which handles reading the response (chunked or not) transparently. I haven't looked at the Nutch code, but if we were to use HttpClient 3.0.x or later, it should take care of it. Take a look at protocol-httpclient. This discussion is on whether/how to fix protocol-http. The other plugin already supports this. -- Best regards, Andrzej Bialecki
[jira] Closed: (NUTCH-264) Tools for merging and filtering CrawlDb-s and LinkDb-s
[ http://issues.apache.org/jira/browse/NUTCH-264?page=all ] Andrzej Bialecki closed NUTCH-264: --- Resolution: Fixed A version of this patch was included in rev. 405183 Tools for merging and filtering CrawlDb-s and LinkDb-s -- Key: NUTCH-264 URL: http://issues.apache.org/jira/browse/NUTCH-264 Project: Nutch Type: New Feature Versions: 0.8-dev Reporter: Andrzej Bialecki Attachments: patch.txt This patch contains implementations and unit tests for two new commands: * mergedb: merges one or more CrawlDb-s, optionally filtering urls through the current URLFilters. * mergelinkdb: as above, only for LinkDb-s. Optional filtering is applied both to toUrls and fromUrls in Inlinks.
[jira] Closed: (NUTCH-263) MapWritable.equals() doesn't work properly
[ http://issues.apache.org/jira/browse/NUTCH-263?page=all ] Andrzej Bialecki closed NUTCH-263: --- Resolution: Fixed Patch applied in rev. 405179. If further improvements are needed, please re-open this issue. MapWritable.equals() doesn't work properly -- Key: NUTCH-263 URL: http://issues.apache.org/jira/browse/NUTCH-263 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: Andrzej Bialecki Attachments: patch1.txt MapWritable.equals() is sensitive to the order in which map entries have been created. E.g. this fails but it should succeed: MapWritable map1 = new MapWritable(); MapWritable map2 = new MapWritable(); map1.put(new UTF8("key1"), new UTF8("val1")); map1.put(new UTF8("key2"), new UTF8("val2")); map2.put(new UTF8("key2"), new UTF8("val2")); map2.put(new UTF8("key1"), new UTF8("val1")); assertTrue(map1.equals(map2)); Users expect that this should not be the case, i.e. this class should follow the same rules as Map.equals() (returns true if the given object is also a map and the two maps represent the same mappings).
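The Map.equals() semantics the issue asks for can be illustrated with a simplified stand-in class (this is not the real MapWritable, just a sketch of the contract: equals() compares the set of mappings and ignores insertion order):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for MapWritable, showing the expected contract:
// two maps are equal iff they represent the same mappings, regardless of
// the order in which entries were put().
class OrderInsensitiveMap {
    private final Map<String, String> entries = new HashMap<>();

    void put(String key, String value) {
        entries.put(key, value);
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof OrderInsensitiveMap)) return false;
        // Delegate to Map.equals(), which is order-insensitive by contract.
        return entries.equals(((OrderInsensitiveMap) other).entries);
    }

    @Override
    public int hashCode() {
        // Keep hashCode consistent with equals, as the Object contract requires.
        return entries.hashCode();
    }
}
```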
Re: http chunked content
Okay, saw the code in the http-protocol plugin. I remember looking at this about a year ago. RFC 2616 (HTTP/1.1) does say, as Jérôme pointed out: A server MUST NOT send transfer-codings to an HTTP/1.0 client. Regardless, I can attest that there are servers out there that return chunked content regardless of the client. We had a socket implementation akin to the HttpResponse.java in the http-protocol plugin and were stumped on how to identify whether the response was chunked or not - as we could not reliably use the Transfer-Encoding header. The only way we could see was trying to use the initial hex characters denoting the size of the first chunk. The chunk-size field is a string of hex digits indicating the size of the chunk. The chunked encoding is ended by any chunk whose size is zero, followed by the trailer, which is terminated by an empty line. - more from RFC 2616 But in practice this was error-prone. Switching over to Apache HttpClient eliminated this problem, as it transparently handles chunked and un-chunked content. But HttpClient is much more heavyweight, and so the conversion could only be done after implementing some basic resource pooling on the primary HttpClient object. It does look like this would be a serious refactoring job, as Nutch uses java.net classes throughout. On the other hand, it might simplify some areas of the Nutch protocol classes, and HttpClient does have some interesting built-in support for multi-threading/performance tuning of requests. I hope this helps towards a solution. Best Regards, Chris --- Andrzej Bialecki wrote: Take a look at protocol-httpclient. This discussion is on whether/how to fix protocol-http. The other plugin already supports this.
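For reference, the chunk framing Chris quotes from RFC 2616 (hex chunk-size line, chunk data, zero-size last-chunk, trailer terminated by an empty line) can be decoded in a few lines. This is a minimal in-memory sketch that works on a String for clarity; a real decoder (like the plugin's readChunkedContent) must work on bytes, handle chunk extensions fully, and guard against malformed or truncated input:

```java
// Minimal sketch of HTTP/1.1 chunked transfer-decoding for a message body
// already held in memory. Not production code: no bounds checking, and it
// treats the body as text rather than bytes.
class ChunkedDecoder {
    static String decode(String raw) {
        StringBuilder body = new StringBuilder();
        int pos = 0;
        while (true) {
            int eol = raw.indexOf("\r\n", pos);
            String sizeLine = raw.substring(pos, eol);
            // chunk-size is hex; a ';' may introduce chunk extensions
            int semi = sizeLine.indexOf(';');
            if (semi >= 0) sizeLine = sizeLine.substring(0, semi);
            int size = Integer.parseInt(sizeLine.trim(), 16);
            pos = eol + 2;                     // skip the CRLF after the size
            if (size == 0) break;              // last-chunk; trailers follow
            body.append(raw, pos, pos + size); // chunk-data
            pos += size + 2;                   // skip data and its CRLF
        }
        return body.toString();
    }
}
```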
[jira] Created: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value
Indexer doesn't consider linkdb when calculating boost value Key: NUTCH-267 URL: http://issues.apache.org/jira/browse/NUTCH-267 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Reporter: Chris Schneider Priority: Minor Before OPIC was implemented (Nutch 0.7, very early Nutch 0.8-dev), if indexer.boost.by.link.count was true, the indexer boost value was scaled based on the log of the number of inbound links: if (boostByLinkCount) res *= (float)Math.log(Math.E + linkCount); This is no longer true (even before Andrzej implemented scoring filters). Instead, the boost value is just the square root (or some other scorePower) of the page score. Shouldn't the invertlinks command, which creates the linkdb, have some effect on the boost value calculated during indexing (either via the OPICScoringFilter or some other built-in filter)?
[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value
[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378560 ] Doug Cutting commented on NUTCH-267: The OPIC score is much like a count of incoming links, but a bit more refined. OPIC(P) is one plus the sum of the OPIC contributions for all links to a page. The OPIC contribution of a link from page P is OPIC(P) / numOutLinks(P). Indexer doesn't consider linkdb when calculating boost value Key: NUTCH-267 URL: http://issues.apache.org/jira/browse/NUTCH-267
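Doug's contribution rule can be illustrated with a toy single propagation step: each page passes score(P)/numOutLinks(P) along every outlink, and a target's new score is one plus the sum of incoming contributions. The `OpicStep` class below is illustrative only; it assumes every page in the graph has a score, and it is not how Nutch's OPICScoringFilter is actually implemented:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of one OPIC-style propagation step over a tiny link graph.
class OpicStep {
    // graph: page -> list of its outlinks; scores: current page scores.
    // Assumes every key in graph also appears in scores.
    static Map<String, Double> propagate(Map<String, List<String>> graph,
                                         Map<String, Double> scores) {
        Map<String, Double> next = new HashMap<>();
        for (String p : scores.keySet()) next.put(p, 1.0); // the "one plus"
        for (Map.Entry<String, List<String>> e : graph.entrySet()) {
            // each outlink carries an equal share of the source's score
            double share = scores.get(e.getKey()) / e.getValue().size();
            for (String target : e.getValue()) {
                next.merge(target, share, Double::sum);
            }
        }
        return next;
    }
}
```

For a graph A -> {B, C}, B -> {C} with all scores starting at 1.0, one step gives B = 1 + 1/2 = 1.5 and C = 1 + 1/2 + 1 = 2.5, showing how a page's score reflects both how many pages link to it and how selective those linkers are.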