Re: Deleting stale URLs from Nutch/Solr
On Mon, 26 Oct 2009 17:26:23 +0100 Andrzej Bialecki a...@getopt.org wrote:

[...] Stale (no longer existing) URLs are marked with STATUS_DB_GONE. They are kept in the Nutch crawldb to prevent their re-discovery (through stale links pointing to these URLs from other pages). If you really want to remove them from the CrawlDb you can filter them out (using CrawlDbMerger with just one input db, and setting your URLFilters appropriately). [...]

Thank you for your help. Your suggestions look promising, but I think that I did not make myself adequately clear. Once we have completed a site crawl with Nutch, ideally I would like to be able to find stale links without doing a complete recrawl, i.e., only by restarting the crawl from where it last left off. Is that possible?

I tried a simple test on a local webserver with five pages in a three-level hierarchy. The crawl completes, and discovers all five URLs as expected. Now, I remove a tertiary page. Ideally, I would like to be able to run a recrawl and have Nutch discover the now-missing URL. However, when I try that, it finds no new links, and exits. ./bin/nutch readdb crawl/crawldb -stats shows me:

CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 5
retry 0: 5
min score: 0.333
avg score: 0.4664
max score: 1.0
status 2 (db_fetched): 5
CrawlDb statistics: done

Regards,
Gora
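(For reference, the quoted CrawlDbMerger suggestion corresponds roughly to the sketch below. The mergedb alias and its -filter flag are the Nutch 1.x names as far as I know; the dump-and-grep step and the regex rule are illustrative, not verified.)

# 1. Find the URLs currently marked gone
bin/nutch readdb crawl/crawldb -dump /tmp/crawldb_dump
grep -h -B1 'db_gone' /tmp/crawldb_dump/part-* | grep '^http' | cut -f1 > /tmp/gone_urls.txt

# 2. Add exclusion rules for them to conf/regex-urlfilter.txt, e.g.
#    -^http://localhost/removed-page\.html$

# 3. Re-write the crawldb through the URL filters (single input db)
bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter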
Re: Deleting stale URLs from Nutch/Solr
Gora Mohanty wrote:
[...] Ideally, I would like to be able to run a recrawl and have Nutch discover the now-missing URL. However, when I try that, it finds no new links, and exits. [...]

I assume you mean that the generate step produces no new URLs to fetch? That's expected, because they become eligible for re-fetching only after Nutch considers them expired, i.e. after fetchTime + fetchInterval, and the default fetchInterval is 30 days.

You can pretend that time has moved on by using the -adddays parameter. Then Nutch will generate a new fetchlist, and when it discovers that the page is missing it will mark it as gone. Actually, you could then take that information directly from the Nutch segment: instead of processing the CrawlDb you could process the segment to collect a partial list of gone pages.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web; Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
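(A sketch of those steps, assuming Nutch 1.x commands and the default 30-day fetchInterval; the paths and the readseg flag set are illustrative:)

# Pretend 31 days have passed, so every page is due for re-fetch
bin/nutch generate crawl/crawldb crawl/segments -adddays 31
segment=crawl/segments/$(ls crawl/segments | tail -1)

bin/nutch fetch $segment
bin/nutch updatedb crawl/crawldb $segment

# The removed page should now appear as db_gone in the stats...
bin/nutch readdb crawl/crawldb -stats

# ...or can be collected from the segment itself, as suggested above
bin/nutch readseg -dump $segment /tmp/segdump -nocontent -noparse -noparsedata -noparsetext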
Re: Deleting stale URLs from Nutch/Solr
On Tue, 27 Oct 2009 07:29:10 +0100 Andrzej Bialecki a...@getopt.org wrote:

[...] I assume you mean that the generate step produces no new URLs to fetch? That's expected, because they become eligible for re-fetching only after Nutch considers them expired, i.e. after fetchTime + fetchInterval, and the default fetchInterval is 30 days. [...]

Yes, it was indeed stopping at the generate step, and your explanation makes sense.

[...] You can pretend that time has moved on by using the -adddays parameter. [...]

Thanks. This worked exactly as you said. I have tested this, and the removed page indeed shows up with status db_gone, so I can now script a solution to my problem with stale URLs, along the lines that you have suggested. Thank you very much for this quick and thorough response. As I imagine that this is a common requirement, I will write up a brief blog entry on this by the weekend, along with a solution.

Regards,
Gora
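(For the Solr side of the cleanup, one possible script sketch, assuming a gone_urls.txt list extracted from a crawldb dump as above, Solr 1.x at localhost:8983, and a "url" field in the schema; the field name, query escaping, and paths are illustrative:)

while read url; do
  curl -s http://localhost:8983/solr/update \
    -H 'Content-Type: text/xml' \
    --data-binary "<delete><query>url:\"$url\"</query></delete>"
done < /tmp/gone_urls.txt

# Make the deletes visible
curl -s http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<commit/>'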
Re: How to index only files of a specific type
If I disable the html parser (remove parse-(html from the plugin.includes property), HTML pages don't get parsed, so I don't get the outlinks to the kml files from the HTML, and so I can't parse and index the kml files. I might not be right, but I have a feeling that it's not possible without modifying the source code. thx

2009/10/26 BELLINI ADAM mbel...@msn.com:

Disable the html-parser in nutch-site and keep only your parser. You can also add this to your URL filter file: -(htm|html)$ thx

Date: Mon, 26 Oct 2009 17:53:11 +0300
Subject: How to index only files of a specific type
From: dfun...@gmail.com
To: nutch-user@lucene.apache.org

Hi, I've created a parser and an indexer for a specific file type (geo XML meta files, kml). I am trying to crawl a couple of sites, and index only files of this type. I don't want to index HTML or anything else. How can I achieve this? Thanks.
Re: How to index only files of a specific type
Dmitriy Fundak wrote:
[...] If I disable the html parser, HTML pages don't get parsed, so I don't get the outlinks to the kml files, and so I can't parse and index them. I might not be right, but I have a feeling that it's not possible without modifying the source code. [...]

It's possible to do this with a custom indexing filter; see the other indexing filters to get a feeling for what's involved. You could also do this with a scoring filter, although the scoring API looks more complicated. Either way, when you execute the Indexer these filters are run in a chain, and if one of them returns null then that document is discarded, i.e. it's not added to the output index. So it's easy to examine the content type (or just the URL of the document) in your indexing filter, and either pass the document on or reject it by returning null.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web; Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
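(As an illustration, a minimal sketch of such an indexing filter against the Nutch 1.x API; the package name is hypothetical, the plugin wiring (plugin.xml, build files) is omitted, and only the URL suffix is checked:)

package org.example.indexer; // hypothetical package

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class KmlOnlyIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // Keep only documents whose URL ends in .kml;
    // returning null drops the document from the index.
    if (url.toString().toLowerCase().endsWith(".kml")) {
      return doc;
    }
    return null;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}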
Re: How to index only files of a specific type
Checking the URL postfix, and returning null if it's not one I need, helped. Thanks, Andrzej.

2009/10/27 Andrzej Bialecki a...@getopt.org:
[...] It's possible to do this with a custom indexing filter [...] examine the content type (or just the URL of the document) in your indexing filter, and either pass the document on or reject it by returning null. [...]
How to run fetch from the local filesystem
I had generated the segments after the crawling process. Then I downloaded the segments to the local filesystem. Below are the four segments I generated and downloaded. Now, if I run fetch on these four segments, I get the error below. Please help me with how to run fetch locally.

[nu...@devcluster01 search]$ ls -lrt db/segments/crawled_22/segments/
total 32
drwxr-xr-x 8 nutch users 4096 Oct 23 03:17 20091022065049
drwxr-xr-x 8 nutch users 4096 Oct 23 03:17 20091022065828
drwxr-xr-x 8 nutch users 4096 Oct 23 03:17 20091022071136
drwxr-xr-x 8 nutch users 4096 Oct 23 03:17 20091022104701

[nu...@devcluster01 search]$ bin/nutch fetch db/segments/crawled_22/segments/20091022065049
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: db/segments/crawled_22/segments/20091022065049
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://devcluster01:9000/user/nutch/db/segments/crawled_22/segments/20091022065049/crawl_generate
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
        at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:101)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:969)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1003)
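(The stack trace shows the input path being resolved against hdfs://devcluster01:9000, so Nutch is still configured for HDFS while the segments sit on the local disk. Two possible workarounds, sketched but not verified; the local path below is illustrative:)

# Option 1: copy the segments back into HDFS where Nutch expects them
bin/hadoop dfs -put db/segments/crawled_22/segments/20091022065049 \
    db/segments/crawled_22/segments/20091022065049

# Option 2: point the fetch at the local filesystem with a fully qualified path
bin/nutch fetch file:///home/nutch/search/db/segments/crawled_22/segments/20091022065049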
Nutch indexes fewer pages than it fetches
Hi All,

I've got a strange problem: nutch indexes far fewer URLs than it fetches. Take, for example, the URL http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm. I assume that it fetched successfully, because in the fetch logs it is mentioned only once:

2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm

But it was not sent to the indexer in the indexing phase (I'm using a custom NutchIndexWriter, and it logs every page for which its write method is executed). What could be the possible reason? Is there a way to browse the crawldb to make sure that the page was really fetched? What else could I check?

Thanks
Redirect handling
Hi All,

I've done some googling, but found different answers, so I would appreciate it if you could tell me which is the correct one:

- when a page is redirected, the content of the target page is fetched and associated with the source (initial) page URL
- when a page is redirected, a new entry with the redirect target URL and its contents is added to the db

If the second option is the correct one, then one more question: when I have a NutchDocument instance which represents the target URL, is it possible to retrieve its redirect source URL somehow?

Thanks
Re: Redirect handling
There are two different types of redirect. When a web site returns a 301 status (permanent redirect), it means the URL you requested is no longer valid: don't ask for it again. When it returns a 307 status (temporary redirect), it means keep asking for the URL you asked for, and I'll tell you where to go from there.

In the first case, Nutch should remove the first URL from its database and put the redirection target in its place. In the second case, Nutch should leave the original URL in its database, but also go to the redirection target. I don't know if that's actually what Nutch does, but I assume so.

On Tue, Oct 27, 2009 at 11:30 AM, caezar caeza...@gmail.com wrote:
[...]

--
http://www.linkedin.com/in/paultomblin
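(You can check which kind of redirect a given server sends with a quick header request; the URL and response below are illustrative:)

$ curl -sI http://example.com/old-page
HTTP/1.1 301 Moved Permanently
Location: http://example.com/new-page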
Nutch in WebSphere
I'm very new at this, so forgive my novice questions. I'm trying to install nutch in WebSphere 6.1. While I can see that others have done this before, I've been unsuccessful. I keep getting this error:

Error 500: java.lang.Error: java.lang.NoClassDefFoundError: org.apache.jsp._search (wrong name: com/ibm/_jsp/_search)

I thought it was a conflict between the base WebSphere jars and the jars in the nutch lib. I attempted to resolve this by having the application's jars load first, but I still get this error. I'm not sure if this is complicated by the fact that I want to run my crawl on a different node, and just use WebSphere to serve the results.

I'll be exporting and importing the crawl directory next, so maybe I'll ask that question as well: where should I place the crawl directory in relation to my WebSphere war installation? Inside the installedApps directory, or can I specify the location somehow? Is there an install guide for WebSphere, instead of Tomcat?
ERROR: Checksum Error
This is my second time receiving this error:

Map output lost, rescheduling: getMapOutput (attempt_200910271443_0012_m_01_0,0) failed : org.apache.hadoop.fs.ChecksumException: Checksum Error

Does anyone know why I am getting this error and how to fix it? I tried deleting all my data nodes and formatting the namenode, to no avail.

Thanks,
Eric Osgood
Cal Poly - Computer Engineering
Moon Valley Software
eosg...@calpoly.edu, e...@lakemeadonline.com
www.calpoly.edu/~eosgood, www.lakemeadonline.com
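(A couple of possible first diagnostics, assuming a standard Hadoop layout; neither is a confirmed fix for this error:)

# Check HDFS block health
bin/hadoop fsck / -files -blocks -locations

# Look for disk or checksum complaints in the datanode logs
tail -n 100 logs/hadoop-*-datanode-*.log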
Re: Nutch indexes fewer pages than it fetches
Check the parse data first; maybe the parse was unsuccessful.

2009/10/27 caezar caeza...@gmail.com:
[...] I've got a strange problem: nutch indexes far fewer URLs than it fetches. [...] Is there a way to browse the crawldb to make sure that the page was really fetched? What else could I check? [...]
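(Two ways to check, assuming the Nutch 1.x readdb and readseg tools; the segment path is a placeholder and the output formats may vary by version:)

# Look up the status of a single URL in the crawldb
bin/nutch readdb crawl/crawldb -url http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm

# Dump a segment's parse data to see whether the page actually parsed
bin/nutch readseg -dump <segment> /tmp/segdump -nocontent -nofetch -nogenerate -noparsetext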
Re: Nutch indexes fewer pages than it fetches
I have had a similar experience. Reinhard Schwab posted a possible fix; see the mail in this group from Reinhard Schwab at Sun, 25 Oct 2009 10:03:41 +0100 (05:03 EDT). I haven't had a chance to try it out yet.

On Tue, 2009-10-27 at 07:34 -0700, caezar wrote:
[...]