Nutch changes 0.9.txt
Hi,

Does anybody know what this means exactly:

8. NUTCH-338 - Remove the text parser as an option for parsing PDF files in parse-plugins.xml (Chris A. Mattmann via siren)

In my crawl log file it says:

Error parsing: http://www.site.com/quick%20reference%20guide%202/$FILE/Law_v2.4_02122006.pdf: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType=application/pdf url=http://www.site.com/quick%20reference%20guide%202/$FILE/Law_v2.4_02122006.pdf

This may be a stupid question, but does the Nutch crawler only retrieve and index links, i.e. URLs, and not PDFs? The .pdf extension isn't excluded in the crawl-urlfilter.txt file either, and I can see it in the parse-plugins.xml file:

Thanks
Paul
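In practice that error usually means no parser plugin is enabled for the PDF content type at all: Nutch does fetch the PDF, but NUTCH-338 removed the plain-text parser as a fallback for application/pdf in parse-plugins.xml, so parsing fails unless parse-pdf is activated. A minimal sketch of the two pieces involved (the plugin.includes value below is only an illustration; start from the value in your own nutch-default.xml and add "pdf" to the parse-(...) group):

  <!-- nutch-site.xml: make sure the PDF parser plugin is enabled -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  </property>

  <!-- parse-plugins.xml: after NUTCH-338 the mapping for application/pdf
       should point only at parse-pdf (no parse-text fallback) -->
  <mimeType name="application/pdf">
    <plugin id="parse-pdf" />
  </mimeType>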
Re: Help please trying to crawl local file system
Did you set the agent name in the Nutch configuration? I think even when crawling only the local file system the agent name still needs to be set. If it is not set, I believe nothing is fetched and errors are thrown, but you would only see this if your logging was set up for it.

Dennis Kubes

jim shirreffs wrote:

I googled and googled and googled. I am trying to crawl my local file system and can't seem to get it right. I use this command

bin/nutch crawl urls -dir crawl

My urls dir contains one file (files) that looks like this

file:///c:/joms

c:/joms exists. I've modified the config file crawl-urlfilter.txt

#-^(file|ftp|mailto|sw|swf):
-^(http|ftp|mailto|sw|swf):
# skip everything else . web spaces
#-.
+.*

And the config file nutch-site.xml, adding

plugin.includes protocol-file|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic
file.content.limit -1

And lastly I've modified regex-urlfilter.txt

#file systems
+^file:///c:/top/directory/
-.
# skip file: ftp: and mailto: urls
#-^(file|ftp|mailto):
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept anything else
+.

I don't get any errors but nothing gets crawled either. If anyone can point out my mistake(s) I would greatly appreciate it.

thanks in advance
jim s

ps it would also be nice to know this email is getting into the nutch-users mailing list
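For reference, the agent name and its related properties go in conf/nutch-site.xml; a minimal sketch (all four values are placeholders to be replaced with your own):

  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>test crawler</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://www.example.com/crawler.html</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>crawler at example.com</value>
  </property>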
Nutch 0.9 officially released!
Hi Folks, After some hard work from all folks involved, we've managed to push out Apache Nutch, release 0.9. This is the second release of Nutch based entirely on the underlying Hadoop platform. This release includes several critical bug fixes, as well as key speedups described in more detail at Sami Siren's blog: http://blog.foofactory.fi/2007/03/twice-speed-half-size.html See the list of changes made in this version: http://www.apache.org/dist/lucene/nutch/CHANGES-0.9.txt The release is available here. http://www.apache.org/dyn/closer.cgi/lucene/nutch/ Special thanks to (in no particular order): Andrzej Bialecki, Dennis Kubes, Sami Siren, and the rest of the Nutch development team for providing lots of help along the way, and for allowing me to be the release manager! Enjoy the new release! Cheers, Chris
Re: Unable to load native-hadoop library
Yeah, it is 32-bit, and it is the 1.5.0_04 JDK. Lots of the commands throw this warning. Just for example,

bin/nutcher readdb nutcherdata/test/crawl/crawldb/ -stats

says

2007-04-06 08:58:09,992 WARN util.NativeCodeLoader (NativeCodeLoader.java:(51)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Andrzej Bialecki wrote:

wangxu wrote:
Linux wangxu.com 2.6.8-2-386 #1 Tue Aug 16 12:46:35 UTC 2005 i686 GNU/Linux

Andrzej Bialecki wrote:

wangxu wrote:
When I use nutch-nightly0.9, I got this:
Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
And when I echo $JAVA_LIBRARY_PATH, I got:
JAVA_LIBRARY_PATH: nutch/lib/native/Linux-i386-32
How can I correct it?

(Please send Nutch-related questions first to the Nutch groups.) What is your operating system (uname -a)? Currently, native libs are available only for 32-bit JVMs - so if you are running a 64-bit JVM it won't work. Also, I assume you are using a Sun JDK 1.5 or newer. If all of the above is correct, then you could try to send us the complete command that the bin/nutch script comes up with - simply echo the last command just before it executes, and copy this.
Re: Run Job Crashing
Figured this one out, just in case some other newbie has the same problem. Windows places hidden files in the urls dir if one customizes the folder view. These files must be removed, or Nutch thinks they are URL files and processes them. Once the hidden files are removed all is well.

jim s

- Original Message -
From: "jim shirreffs" <[EMAIL PROTECTED]>
To: "nutch lucene apache"
Sent: Thursday, April 05, 2007 11:51 AM
Subject: Run Job Crashing

Nutch-0.8.1
Windows 2000/Windows XP
Java 1.6
cygwin1.dll nov/2004 and cygwin1 latest release

Very strange, ran the crawler once

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50

and everything worked until this error

Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20070404094549
Indexer: adding segment: crawl/segments/20070404095026
Indexer: adding segment: crawl/segments/20070404095504
Optimizing index.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:296)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)

Tried running the crawler again

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50

and now I consistently get this error

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
run java in NUTCH_JAVA_HOME D:\java\jdk1.6
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

I have one file localhost in my url dir and it looks like this

http://localhost

My crawl-urlfilter.xml looks like this

# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto|swf|sw):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*localhost/
# skip everything else

My nutch-site.xml looks like this

http.agent.name RadioCity
http.agent.description nutch web crawler
http.agent.url www.RadioCity.dynip.com/RadioCity/HtmlPages/Nutch
http.agent.email jpsb at flash.net

I am getting the same behavior on two separate hosts. If anyone can suggest what I might be doing wrong I would greatly appreciate it.

jim s

PS tried to mail from a different host but did not see the message in the mailing list. Hope only this message gets into the mailing list.
Help please trying to crawl local file system
I googled and googled and googled. I am trying to crawl my local file system and can't seem to get it right. I use this command

bin/nutch crawl urls -dir crawl

My urls dir contains one file (files) that looks like this

file:///c:/joms

c:/joms exists. I've modified the config file crawl-urlfilter.txt

#-^(file|ftp|mailto|sw|swf):
-^(http|ftp|mailto|sw|swf):
# skip everything else . web spaces
#-.
+.*

And the config file nutch-site.xml, adding

plugin.includes protocol-file|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic
file.content.limit -1

And lastly I've modified regex-urlfilter.txt

#file systems
+^file:///c:/top/directory/
-.
# skip file: ftp: and mailto: urls
#-^(file|ftp|mailto):
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept anything else
+.

I don't get any errors but nothing gets crawled either. If anyone can point out my mistake(s) I would greatly appreciate it.

thanks in advance
jim s

ps it would also be nice to know this email is getting into the nutch-users mailing list
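For readability, the two nutch-site.xml additions mentioned in the post correspond to property entries along these lines (same values as quoted above):

  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>

Note that, as pointed out in the reply elsewhere in this digest, http.agent.name must also be set or nothing is fetched.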
Run Job Crashing
Nutch-0.8.1
Windows 2000/Windows XP
Java 1.6
cygwin1.dll nov/2004 and cygwin1 latest release

Very strange, ran the crawler once

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50

and everything worked until this error

Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20070404094549
Indexer: adding segment: crawl/segments/20070404095026
Indexer: adding segment: crawl/segments/20070404095504
Optimizing index.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:296)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)

Tried running the crawler again

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50

and now I consistently get this error

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
run java in NUTCH_JAVA_HOME D:\java\jdk1.6
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

I have one file localhost in my url dir and it looks like this

http://localhost

My crawl-urlfilter.xml looks like this

# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto|swf|sw):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*localhost/
# skip everything else

My nutch-site.xml looks like this

http.agent.name RadioCity
http.agent.description nutch web crawler
http.agent.url www.RadioCity.dynip.com/RadioCity/HtmlPages/Nutch
http.agent.email jpsb at flash.net

I am getting the same behavior on two separate hosts. If anyone can suggest what I might be doing wrong I would greatly appreciate it.

jim s

PS tried to mail from a different host but did not see the message in the mailing list. Hope only this message gets into the mailing list.
Re: Using nutch as a web crawler
Nutch has a file called crawl-urlfilter.txt where you can set your site domain or site list, so Nutch will only crawl that list. Download Nutch and see it working; that is better for you :). Take a look: http://lucene.apache.org/nutch/tutorial8.html

Regards,

On 4/5/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
Thanks. Can you please tell me how I can plug in my own handling when Nutch sees a site, instead of building the search database for that site?

On 4/3/07, Lourival Júnior <[EMAIL PROTECTED]> wrote:
> I have total certainty that Nutch is what you are looking for. Take a look
> at Nutch's documentation for more details and you will see :).
>
> On 4/3/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> >
> > I would like to know if it is a good idea to use the Nutch web
> > crawler?
> > Basically, this is what I need:
> > 1. I have a list of web sites.
> > 2. I want the web crawler to go through each site and parse the anchors; if
> > a link is in the same domain, repeat the same steps for 3 levels.
> > 3. For each link, write to a new file.
> >
> > Is Nutch a good solution? Or is there another, better open source
> > alternative for my purpose?
> >
> > Thank you.
> >
>
> --
> Lourival Junior
> Universidade Federal do Pará
> Curso de Bacharelado em Sistemas de Informação
> http://www.ufpa.br/cbsi
> Msn: [EMAIL PROTECTED]
>

--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
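As a rough sketch, a crawl-urlfilter.txt restricted to one domain from your list could look like this (example.com is a placeholder; add one '+' line per site you want to stay inside and keep a final '-.' so everything else is skipped):

# accept hosts in example.com only
+^http://([a-z0-9]*\.)*example.com/
# skip everything else
-.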
RE: help needed on filters
All your regexes look fine; however, I would do the following:

^- http://([a-z0-9]*\.)*example.com/stores/.*/merch
#ignore anything with ? in it
^- http://([a-z0-9]*\.)*example.com.*\?
#allow only home page
^+ http://([a-z0-9]*\.)*example.com/$
#allow only htm file
^+ http://([a-z0-9]*\.)*example.com/.*?\.htm
#allow only do file
^+ http://([a-z0-9]*\.)*example.com/.*?\.do

HTH,
Gal.

> -Original Message-
> From: cha [mailto:[EMAIL PROTECTED]
> Sent: Thursday, April 05, 2007 10:34 AM
> To: nutch-user@lucene.apache.org
> Subject: help needed on filters
>
> Hi,
>
> I want to crawl only .htm, .html and .do pages from my web site. Secondly, I
> want to ignore the following urls from crawling:
>
> http://www.example.com/stores/abcd/merch-cats-pg/abcd.*
> http://www.example.com/stores/abcd/merch-cats/abcd.*
> http://www.example.com/stores/abcd/merch/abd.*
>
> I have set all the filters in the regex-urlfilter and crawl-urlfilter files.
> Following is just the code which fulfills my purpose:
>
> # skip URLs containing certain characters as probable queries, etc.
> -^http://www.example.com/stores/.*/merch.*
>
> # accept hosts in MY.DOMAIN.NAME
>
> +^http://([a-z0-9]*\.)*example.com/.*\.htm$
> +http://([a-z0-9]*\.)*example.com/.*\.do
> +http://([a-z0-9]*\.)*example.com/$
>
> It crawls all the required pages correctly; the only problem is that I was
> getting ? or some other characters after htm, so I added the htm$.
>
> But after adding that, it is not crawling the merchant pages and neglects
> lots of urls which I require.
>
> So I don't know what to do.
>
> Please let me know with your valuable suggestions.
>
> Cheers,
> Cha
> --
> View this message in context: http://www.nabble.com/help-needed-on-filters-tf3530069.html#a9851344
> Sent from the Nutch - User mailing list archive at Nabble.com.
Re: [Nutch-general] Removing pages from index immediately
[EMAIL PROTECTED] wrote:

Hi Enis,

Right, I can easily delete the page from the Lucene index, though I'd prefer to follow the Nutch protocol and avoid messing something up by touching the index directly. However, I don't want that page to re-appear in one of the subsequent fetches. Well, it won't re-appear, because it will remain missing, but it would be great to be able to tell Nutch to "forget it" "from everywhere". Is that doable? I could read and re-write the *Db Maps, but that's a lot of IO... just to get a couple of URLs erased. I'd prefer a friendly persuasion where Nutch flags a given page as "forget this page as soon as possible" and it just happens later on.

Somehow you need to flag those pages, and keep track of them, so they have to remain in the CrawlDb. The simplest way to do this is, I think, through the scoring filter API - you can add your own filter which, during the updatedb operation, flags unwanted urls (by means of putting a piece of metadata in the CrawlDatum), and then during the generate step it checks this metadata and returns generateScore = Float.MIN_VALUE - which means this page will never be selected for fetching as long as there are other unfetched pages. You can also modify the Generator to completely skip such flagged pages.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web - Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
Re: [Nutch-general] Removing pages from index immediately
Andrzej Bialecki wrote:

[EMAIL PROTECTED] wrote:

Hi Enis,

Right, I can easily delete the page from the Lucene index, though I'd prefer to follow the Nutch protocol and avoid messing something up by touching the index directly. However, I don't want that page to re-appear in one of the subsequent fetches. Well, it won't re-appear, because it will remain missing, but it would be great to be able to tell Nutch to "forget it" "from everywhere". Is that doable? I could read and re-write the *Db Maps, but that's a lot of IO... just to get a couple of URLs erased. I'd prefer a friendly persuasion where Nutch flags a given page as "forget this page as soon as possible" and it just happens later on.

Somehow you need to flag those pages, and keep track of them, so they have to remain in the CrawlDb. The simplest way to do this is, I think, through the scoring filter API - you can add your own filter which, during the updatedb operation, flags unwanted urls (by means of putting a piece of metadata in the CrawlDatum), and then during the generate step it checks this metadata and returns generateScore = Float.MIN_VALUE - which means this page will never be selected for fetching as long as there are other unfetched pages. You can also modify the Generator to completely skip such flagged pages.

Maybe we should permanently remove the urls that have failed fetching k times from the crawldb during the updatedb operation. Since the web is highly dynamic, there can be as many gone sites as new sites (or slightly fewer). As far as I know, once a url is entered into the crawldb it will stay there with one of the possible states: STATUS_DB_UNFETCHED, STATUS_DB_FETCHED, STATUS_DB_GONE, STATUS_LINKED. Am I right? This way Otis's case would also be resolved.
Re: Nutch Step by Step Maybe someone will find this useful ?
Great work - could you just post these into the Nutch wiki as a step-by-step tutorial for newcomers?

zzcgiacomini wrote:

I have spent some time playing with nutch-0 and collecting notes from the mailing lists... maybe someone will find these notes useful and could point out my mistakes. I am not at all a Nutch expert...

-Corrado

0) CREATE NUTCH USER AND GROUP

Create a nutch user and group and perform all the following logged in as the nutch user. Put these lines in your .bash_profile:

export JAVA_HOME=/opt/jdk
export PATH=$JAVA_HOME/bin:$PATH

1) GET HADOOP AND NUTCH

Download the nutch and hadoop trunks as explained on http://lucene.apache.org/hadoop/version_control.html

(svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk)
(svn checkout http://svn.apache.org/repos/asf/lucene/hadoop/trunk)

2) BUILD HADOOP

Ex: Build and produce the tar file:

cd hadoop/trunk
ant tar

To build hadoop with 64-bit native libraries, proceed as follows:

A) Download and install the latest lzo library (http://www.oberhumer.com/opensource/lzo/download/)
Note: the packages currently available for fc5 are too old.

tar xvzf lzo-2.02.tar.gz
cd lzo-2.02
./configure --prefix=/opt/lzo-2.02
make install

B) Compile the native 64-bit libs for hadoop if needed:

cd hadoop/trunk/src/native
export LDFLAGS=-L/opt/jdk/jre/lib/amd64/server
export JVM_DATA_MODEL=64
CCFLAGS="-I/opt/lzo-2.02/include" CPPFLAGS="-I/opt/lzo-2.02/include" ./configure
cp src/org_apache_hadoop.h src/org/apache/hadoop/io/compress/zlib/
cp src/org_apache_hadoop.h ./src/org/apache/hadoop/io/compress/lzo
cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibCompressor.h
cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibDecompressor.h

In config.h replace the line

#define HADOOP_LZO_LIBRARY libnotfound.so

with this one

#define HADOOP_LZO_LIBRARY "liblzo2.so"

make

3) BUILD NUTCH

The nutch-dev nightly trunk now comes with hadoop-0.12.jar, but you may want to put in the latest nightly-build hadoop jar:

mv nutch/trunk/lib/hadoop-0.12.jar nutch/trunk/lib/hadoop-0.12.jar.ori
cp hadoop/trunk/build/hadoop-0.12.3-dev.jar nutch/trunk/lib/hadoop-0.12.jar
cd nutch/trunk
ant tar

4) INSTALL

Copy and untar the generated .tar.gz file on the machines that will participate in the engine activities. In my case I only have two identical machines available, called myhost2 and myhost1. On each of them I have installed the nutch binaries under /opt/nutch, while I have decided to have the hadoop distributed filesystem in a directory called hadoopFs located under a large disk mounted on /disk10.

On both machines create the directory:

mkdir /disk10/hadoopFs/

Copy the hadoop 64-bit native libraries if needed:

mkdir /opt/nutch/lib/native/Linux-x86_64
cp -fl hadoop/trunk/src/native/lib/.libs/* /opt/nutch/lib/native/Linux-x86_64

5) CONFIG

I will use myhost1 as the master machine running the namenode and jobtracker tasks; it will also run a datanode and tasktracker. myhost2 will only run a datanode and tasktracker.

A) On both machines change the conf/hadoop-site.xml configuration file. Here are the values I have used:

fs.default.name : myhost1.mydomain.org:9010
mapred.job.tracker : myhost1.mydomain.org:9011
mapred.map.tasks : 40
mapred.reduce.tasks : 3
dfs.name.dir : /opt/hadoopFs/name
dfs.data.dir : /opt/hadoopFs/data
mapred.system.dir : /opt/hadoopFs/mapreduce/system
mapred.local.dir : /opt/hadoopFs/mapreduce/local
dfs.replication : 2

"The mapred.map.tasks property tells how many tasks you want to run in parallel. This should be a multiple of the number of computers that you have. In our case, since we are starting out with 2 computers, we will have 4 map and 4 reduce tasks."

"The dfs.replication property states how many servers a single file should be replicated to before it becomes available. Because we are using 2 servers I have set this to 2."

Maybe you also want to change nutch-site.xml by adding http.redirect.max : 10, i.e. a different value than the default of 3.

B) Be sure that your conf/slaves file contains the names of the slave machines. In my case:

myhost1.mydomain.org
myhost2.mydomain.org

C) Create directories for pids and log files on both machines:

mkdir /opt/nutch/pids
mkdir /opt/
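As a sketch, the hadoop-site.xml values listed in step 5A above correspond to property entries like the following (only some of them shown; the hostnames and paths are the ones from the post):

  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>myhost1.mydomain.org:9010</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>myhost1.mydomain.org:9011</value>
    </property>
    <property>
      <name>dfs.name.dir</name>
      <value>/opt/hadoopFs/name</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/opt/hadoopFs/data</value>
    </property>
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>
  </configuration>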
Re: [Nutch-general] Removing pages from index immediately
Hi Enis, Right, I can easily delete the page from the Lucene index, though I'd prefer to follow the Nutch protocol and avoid messing something up by touching the index directly. However, I don't want that page to re-appear in one of the subsequent fetches. Well, it won't re-appear, because it will remain missing, but it would be great to be able to tell Nutch to "forget it" "from everywhere". Is that doable? I could read and re-write the *Db Maps, but that's a lot of IO... just to get a couple of URLs erased. I'd prefer a friendly persuasion where Nutch flags a given page as "forget this page as soon as possible" and it just happens later on. Thanks, Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Enis Soztutar <[EMAIL PROTECTED]> To: nutch-user@lucene.apache.org Sent: Thursday, April 5, 2007 3:29:55 AM Subject: Re: [Nutch-general] Removing pages from index immediately Since hadoop's map files are write once, it is not possible to delete some urls from the crawldb and linkdb. The only thing you can do is to create the map files once again without the deleted urls. But running the crawl once more as you suggested seems more appropriate. Deleting documents from the index is just lucene stuff. In your case it seems that every once in a while, you crawl the whole site, and create the indexes and db's and then just throw the old one out. And between two crawls you can delete the urls from the index. [EMAIL PROTECTED] wrote: > Hi, > > I'd like to be able to immediately remove certain pages from Nutch (index, > crawldb, linkdb...). > The scenario is that I'm using Nutch to index a single site or a set of > internal sites. Once in a while editors of the site remove a page from the > site. When that happens, I want to update at least the index and ideally > crawldb, linkdb, so that people searching the index don't get the missing > page in results and end up going there, hitting the 404. > > I don't think there is a "direct" way to do this with Nutch, is there? > If there really is no direct way to do this, I was thinking I'd just put the > URL of the recently removed page into the first next fetchlist and then > somehow get Nutch to immediately remove that page/URL once it hits a 404. > How does that sound? > > Is there a way to configure Nutch to delete the page after it gets a 404 for > it even just once? I thought I saw the setting for that somewhere a few > weeks ago, but now I can't find it. > > Thanks, > Otis > . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . > Simpy -- http://www.simpy.com/ - Tag - Search - Share > > > > - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys-and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV ___ Nutch-general mailing list Nutch-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nutch-general
help needed on filters
Hi,

I want to crawl only .htm, .html and .do pages from my web site. Secondly, I want to ignore the following urls from crawling:

http://www.example.com/stores/abcd/merch-cats-pg/abcd.*
http://www.example.com/stores/abcd/merch-cats/abcd.*
http://www.example.com/stores/abcd/merch/abd.*

I have set all the filters in the regex-urlfilter and crawl-urlfilter files. Following is just the code which fulfills my purpose:

# skip URLs containing certain characters as probable queries, etc.
-^http://www.example.com/stores/.*/merch.*

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*example.com/.*\.htm$
+http://([a-z0-9]*\.)*example.com/.*\.do
+http://([a-z0-9]*\.)*example.com/$

It crawls all the required pages correctly; the only problem is that I was getting ? or some other characters after htm, so I added the htm$.

But after adding that, it is not crawling the merchant pages and neglects lots of urls which I require.

So I don't know what to do.

Please let me know with your valuable suggestions.

Cheers,
Cha
--
View this message in context: http://www.nabble.com/help-needed-on-filters-tf3530069.html#a9851344
Sent from the Nutch - User mailing list archive at Nabble.com.
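For what it's worth, combining these requirements with the rules suggested in the "RE: help needed on filters" reply earlier in this digest gives a sketch along these lines (example.com stands in for the real site; rule order matters because the first matching pattern wins, so the exclusions come before the '+' rules):

# ignore the merchant pages and anything with a query string
-^http://([a-z0-9]*\.)*example.com/stores/.*/merch
-^http://([a-z0-9]*\.)*example.com/.*\?
# accept the home page, .htm/.html pages and .do pages
+^http://([a-z0-9]*\.)*example.com/$
+^http://([a-z0-9]*\.)*example.com/.*\.html?$
+^http://([a-z0-9]*\.)*example.com/.*\.do$
# skip everything else
-.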
Re: Nutch Step by Step Maybe someone will find this useful ?
2007/4/5, Enis Soztutar <[EMAIL PROTECTED]>:

Great work - could you just post these into the Nutch wiki as a step-by-step tutorial for newcomers?

Exactly what I wanted to say, both points. :)

Cheers,
t.n.a.
Re: [Nutch-general] Nutch Step by Step Maybe someone will find this useful ?
Corrado,

Would it be possible for you to add this to the Wiki? Also, there are several other tutorials:

http://lucene.apache.org/nutch/tutorial8.html
http://wiki.apache.org/nutch/NutchTutorial
http://wiki.apache.org/nutch/NutchHadoopTutorial

Maybe you can combine them?

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share

- Original Message -
From: zzcgiacomini <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Wednesday, April 4, 2007 10:53:54 AM
Subject: [Nutch-general] Nutch Step by Step Maybe someone will find this useful?
Re: Removing pages from index immediately
Since hadoop's map files are write once, it is not possible to delete some urls from the crawldb and linkdb. The only thing you can do is to create the map files once again without the deleted urls. But running the crawl once more as you suggested seems more appropriate. Deleting documents from the index is just lucene stuff. In your case it seems that every once in a while, you crawl the whole site, and create the indexes and db's and then just throw the old one out. And between two crawls you can delete the urls from the index. [EMAIL PROTECTED] wrote: Hi, I'd like to be able to immediately remove certain pages from Nutch (index, crawldb, linkdb...). The scenario is that I'm using Nutch to index a single site or a set of internal sites. Once in a while editors of the site remove a page from the site. When that happens, I want to update at least the index and ideally crawldb, linkdb, so that people searching the index don't get the missing page in results and end up going there, hitting the 404. I don't think there is a "direct" way to do this with Nutch, is there? If there really is no direct way to do this, I was thinking I'd just put the URL of the recently removed page into the first next fetchlist and then somehow get Nutch to immediately remove that page/URL once it hits a 404. How does that sound? Is there a way to configure Nutch to delete the page after it gets a 404 for it even just once? I thought I saw the setting for that somewhere a few weeks ago, but now I can't find it. Thanks, Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share
Re: ERROR org.apache.nutch.protocol.http.Http:?java.net.SocketTimeoutException: Read timed out
HI, What I can suggest you, at this moment is try to read the properties value of default.xml and find out which property deals with Server socket connection, then only you will be able to mention that property value in you nutch-site.xml. I havn't had done much with this.But will update if I get something related with this issue. Regards, Ratnesh, V2Solutions India cha wrote: > > HI Ratnesh, > > I am crawling the internet. I am able to get all the crawl pages but this > error do appear in my error log..I dont know what it mean for. I have used > two filter regex and crawl for my crawling..Is something do with that?? > > How should i eliminate the above menitioned error.Something need to be set > or modified in nutch-site.xml? > > Cheers, > cha > > Ratnesh,V2Solutions India wrote: >> >> This socket exception normally comes , if fetcher is not able to get the >> page to crawl?? >> I mean there is some problem with the server connection. >> if you r crawling for local stored pages, then check whether the server >> is started or not?? >> >> I have tested the same for my local crawl, but for internet specific >> crawl I don't have enough idea?? >> >> >> Ratnesh V2Solutions India >> >> >> cha wrote: >>> >>> HI ppl, >>> >>> when i crawl my website , it is giving me following error , though >>> crawling is doing fine. >>> >>> Can anyone tell me what the error is about?? Do i have to set anything >>> in nutch-site.xml?? >>> >>> Following are the error logs: >>> >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? java.net.SocketTimeoutException: >>> Read timed out >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.net.SocketInputStream.socketRead0(Native Method) >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.net.SocketInputStream.read(Unknown Source) >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.BufferedInputStream.read1(Unknown Source) >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.BufferedInputStream.read(Unknown Source) >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.FilterInputStream.read(Unknown Source) >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.PushbackInputStream.read(Unknown Source) >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.FilterInputStream.read(Unknown Source) >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:214) >>> >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> org.apache.nutch.protocol.http.HttpResponse.(HttpResponse.java:146) >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> org.apache.nutch.protocol.http.Http.getResponse(Http.java:63) >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:208) >>> >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? 
at >>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:144) >>> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? java.net.SocketTimeoutException: >>> Read timed out >>> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.net.SocketInputStream.socketRead0(Native Method) >>> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.net.SocketInputStream.read(Unknown Source) >>> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.BufferedInputStream.read1(Unknown Source) >>> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.BufferedInputStream.read(Unknown Source) >>> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.FilterInputStream.read(Unknown Source) >>> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.PushbackInputStream.read(Unknown Source) >>> [2007-04-04 16:23:22,062] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.FilterInputStream.read(Unknown Source) >>> [2007-04-04 16:23:22,062] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:214) >>> >>> [20
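If the timeouts are coming from slow or overloaded remote servers rather than a local misconfiguration, the property in nutch-default.xml that governs the fetcher's socket read timeout is http.timeout (in milliseconds), and it can be overridden in nutch-site.xml. A sketch, assuming the 30-second value is only an illustration:

  <property>
    <name>http.timeout</name>
    <!-- network timeout in milliseconds; raise it if target servers respond slowly -->
    <value>30000</value>
  </property>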
Re: ERROR org.apache.nutch.protocol.http.Http:?java.net.SocketTimeoutException: Read timed out
HI Ratnesh, I am crawling the internet. I am able to get all the crawl pages but this error do appear in my error log..I dont know what it mean for. I have used two filter regex and crawl for my crawling..Is something do with that?? How should i eliminate the above menitioned error.Something need to be set or modified in nutch-site.xml? Cheers, cha Ratnesh,V2Solutions India wrote: > > This socket exception normally comes , if fetcher is not able to get the > page to crawl?? > I mean there is some problem with the server connection. > if you r crawling for local stored pages, then check whether the server is > started or not?? > > I have tested the same for my local crawl, but for internet specific crawl > I don't have enough idea?? > > > Ratnesh V2Solutions India > > > cha wrote: >> >> HI ppl, >> >> when i crawl my website , it is giving me following error , though >> crawling is doing fine. >> >> Can anyone tell me what the error is about?? Do i have to set anything in >> nutch-site.xml?? >> >> Following are the error logs: >> >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? java.net.SocketTimeoutException: >> Read timed out >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.net.SocketInputStream.socketRead0(Native Method) >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.net.SocketInputStream.read(Unknown Source) >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.BufferedInputStream.read1(Unknown Source) >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.BufferedInputStream.read(Unknown Source) >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.FilterInputStream.read(Unknown Source) >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.PushbackInputStream.read(Unknown Source) >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.FilterInputStream.read(Unknown Source) >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:214) >> >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> org.apache.nutch.protocol.http.HttpResponse.(HttpResponse.java:146) >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> org.apache.nutch.protocol.http.Http.getResponse(Http.java:63) >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:208) >> >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:144) >> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? java.net.SocketTimeoutException: >> Read timed out >> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.net.SocketInputStream.socketRead0(Native Method) >> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? 
at >> java.net.SocketInputStream.read(Unknown Source) >> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.BufferedInputStream.read1(Unknown Source) >> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.BufferedInputStream.read(Unknown Source) >> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.FilterInputStream.read(Unknown Source) >> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.PushbackInputStream.read(Unknown Source) >> [2007-04-04 16:23:22,062] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.FilterInputStream.read(Unknown Source) >> [2007-04-04 16:23:22,062] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:214) >> >> [2007-04-04 16:23:22,062] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> org.apache.nutch.protocol.http.HttpResponse.(HttpResponse.java:146) >> [2007-04-04 16:23:22,062] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> org.apache.nutch.protocol.http.Http.getResponse(Http.java:63) >> [2007-04-04 16:23:22,062] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:208) >> >> [2007-0