Help Needed with Error: java.lang.StackOverflowError
During a crawl of about 3.8M TLDs to a depth of 2, when I try to index the segments I get the following error:

java.lang.StackOverflowError
        at java.util.regex.Pattern$Loop.match(Pattern.java:4295)

Any help with this error would be much appreciated; I have encountered this before. Here are the last 10 lines of the hadoop.log file:

tail -n 10 hadoop.log.2010-01-10
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3691)
        at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
        at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
        at java.util.regex.Pattern$Ques.match(Pattern.java:3691)
2010-01-11 00:31:53,221 WARN io.UTF8 - truncating long string: 62492 chars, starting with java.lang.StackOverf

Eric Osgood - Cal Poly / Moon Valley Software
Re: Help Needed with Error: java.lang.StackOverflowError
bin/nutch -Xss1024k index crawl1/indexes crawl1/crawldb crawl1/linkdb crawl/segments/*

Exception in thread "main" java.lang.NoClassDefFoundError: index
Caused by: java.lang.ClassNotFoundException: index
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
Could not find the main class: index. Program will exit.

Do you have to set the -Xss flag somewhere else?

Thanks,
Eric

On Jan 11, 2010, at 8:36 AM, Godmar Back wrote:

Very intriguing, considering that we teach our students to avoid recursion where possible for this very reason. Googling reveals http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4675952 and http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5050507, so you could try increasing the Java stack size in bin/nutch (-Xss), or use an alternate regexp if you can.

Just out of curiosity, why does a performance-critical program such as Nutch use Sun's backtracking-based regexp implementation rather than an efficient Thompson-based one? Do you need the additional expressiveness provided by PCRE?

- Godmar

On Mon, Jan 11, 2010 at 11:24 AM, Eric Osgood wrote:
During a crawl of about 3.8M TLDs to a depth of 2, when I try to index the segments, I get the following error: java.lang.StackOverflowError at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
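The failed command above also shows why the flag had no effect: bin/nutch treats its first argument as the command name, so "-Xss1024k" was most likely handed straight through to the JVM as an option and "index" was then read as a literal main class instead of being mapped to the Indexer - hence the NoClassDefFoundError. A minimal sketch of the fix Godmar suggests, editing bin/nutch itself (the variable name below is what the Nutch 1.0 script uses next to its -Xmx setting; it may differ in other releases, and the 4m stack size is only an example):

# bin/nutch (sketch): add the stack flag next to the heap flag the script already sets
JAVA_HEAP_MAX="-Xmx1000m -Xss4m"

Then run the tool without JVM flags on the command line:

bin/nutch index crawl1/indexes crawl1/crawldb crawl1/linkdb crawl/segments/*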
Re: Help Needed with Error: java.lang.StackOverflowError
How do I set the bin/nutch stack size and the hadoop job stack size?

--Eric

On Jan 11, 2010, at 9:22 AM, Fuad Efendi wrote:

Also, put it in Hadoop settings for tasks... http://www.tokenizer.ca/

From: Godmar Back
Sent: January-11-10 11:53 AM
Subject: Re: Help Needed with Error: java.lang.StackOverflowError

On Mon, Jan 11, 2010 at 11:50 AM, Eric Osgood wrote:
Do you have to set the -Xss flag somewhere else?

Yes, in bin/nutch - look for where it sets -Xmx.

- Godmar
Re: Help Needed with Error: java.lang.StackOverflowError
In hadoop-env.sh, how do you add such options as -Xss, -Xms, -Xmx?

--Eric

On Jan 11, 2010, at 9:34 AM, Mischa Tuffield wrote:

You can set it in hadoop-env.sh and then run it. Or you could add it to your /etc/bashrc or the bashrc file of the user which runs hadoop.

Mischa

On 11 Jan 2010, at 17:26, Eric Osgood wrote:
How do I set the bin/nutch stack size and the hadoop job stack size?

On Jan 11, 2010, at 9:22 AM, Fuad Efendi wrote:
Also, put it in Hadoop settings for tasks... http://www.tokenizer.ca/
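For the Hadoop side of the question: flags exported in conf/hadoop-env.sh (e.g. an export HADOOP_OPTS=... line) apply to the daemons and the command-line client, not to the map/reduce child JVMs that actually run the indexing job. Those children take their options from mapred.child.java.opts, which exists in Hadoop 0.19/0.20; the values below are examples rather than recommendations:

<!-- hadoop-site.xml (sketch): JVM options for map and reduce child tasks -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -Xss4m</value>
  <description>Java options passed to the task child JVMs.</description>
</property>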
Re: ERROR: Too Many Fetch Failures
I have a 3-node cluster. I changed the solr server to one of the nodes rather than have the master node do both the master work and serve solr. I tried to crawl 100k urls again last night and failed with too many fetch failures during the map and shuffle errors during the reduce. This just started happening - the only new additions to the cluster are the solr server and a Dell 2850 added as a node.

Here is my hadoop-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://opel:9000</value>
    <description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>opel:9001</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>30</value>
    <description>define mapred.map tasks to be number of slave hosts</description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>6</value>
    <description>define mapred.reduce tasks to be number of slave hosts</description>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/filesystem/name</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/home/hadoop/filesystem/name2</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/filesystem/data</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/home/hadoop/filesystem/mapreduce/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/home/hadoop/filesystem/mapreduce/local</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

Let me know if you need any other information - I have no idea how to fix this problem.

Thanks,
Eric

On Nov 20, 2009, at 1:30 AM, Julien Nioche wrote:

It was probably a one-off, network-related problem. Can you tell us a bit more about your cluster configuration?

2009/11/19 Eric Osgood:
Julien, Thanks for your help, how would I go about fixing this error now that it is diagnosed?

On Nov 19, 2009, at 1:50 PM, Julien Nioche wrote:
Could be a communication problem between the node and the master. It is not a fetching problem in the Nutch sense of the term but a Hadoop-related issue.

2009/11/19 Eric Osgood:
This is the first time I have received this error while crawling. During a crawl of 100K pages, one of the nodes had a task fail and cited "Too Many Fetch Failures" as the reason. The job completed successfully but took about 3 times longer than normal.
ERROR: Too Many Fetch Failures
This is the first time I have received this error while crawling. During a crawl of 100K pages, one of the nodes had a task fail and cited "Too Many Fetch Failures" as the reason. The job completed successfully but took about 3 times longer than normal. Here is the log output:

2009-11-19 11:19:56,377 WARN mapred.TaskTracker - Error running child
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:197)
        at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:1575)
        at java.io.FilterInputStream.close(FilterInputStream.java:155)
        at org.apache.hadoop.util.LineReader.close(LineReader.java:91)
        at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:169)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:198)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:346)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)
2009-11-19 11:19:56,380 WARN mapred.TaskRunner - Parent died. Exiting attempt_200911191100_0001_m_29_1
2009-11-19 11:20:21,135 WARN mapred.TaskRunner - Parent died. Exiting attempt_200911191100_0001_r_04_1

Can anyone tell me how to resolve this error?

Thanks,
Eric
Re: ERROR: Too Many Fetch Failures
Julien,

Thanks for your help, how would I go about fixing this error now that it is diagnosed?

On Nov 19, 2009, at 1:50 PM, Julien Nioche wrote:

Could be a communication problem between the node and the master. It is not a fetching problem in the Nutch sense of the term but a Hadoop-related issue.

2009/11/19 Eric Osgood:
This is the first time I have received this error while crawling. During a crawl of 100K pages, one of the nodes had a task fail and cited "Too Many Fetch Failures" as the reason. The job completed successfully but took about 3 times longer than normal.
Re: ERROR: Too Many Fetch Failures
Julien,

Another thought - I just installed Tomcat and Solr - would that interfere with Hadoop?

On Nov 19, 2009, at 2:41 PM, Eric Osgood wrote:
Julien, Thanks for your help, how would I go about fixing this error now that it is diagnosed?

On Nov 19, 2009, at 1:50 PM, Julien Nioche wrote:
Could be a communication problem between the node and the master. It is not a fetching problem in the Nutch sense of the term but a Hadoop-related issue.
HELP - ERROR: org.apache.hadoop.fs.ChecksumException: Checksum Error
Hi,

I think the checksum error during fetch is leading to a bunch of other errors I get when I try to run updatedb and generate after a fetch.

Errors during updatedb:
java.lang.RuntimeException: problem advancing post rec#1018238
Caused by: java.io.IOException: can't find class: org.apache.nutch.protocgl.ProtocolStatus because org.apache.nutch.protocgl.ProtocolStatus

(Note the mangled package name in that exception - the actual class is org.apache.nutch.protocol.ProtocolStatus - which itself points to corrupted data.)

Errors during generate:
java.lang.ArrayIndexOutOfBoundsException: 1107937
org.apache.hadoop.fs.ChecksumException: Checksum Error
java.io.IOException: Task: attempt_200910271443_0022_r_06_0 - The reduce copier failed

Any help would be greatly appreciated. I don't really know where to start fixing these problems since this is the first time I have encountered them - my guess is that they are rooted in the checksum error I sometimes get when fetching.

Thanks for the help,
Eric
ERROR: Checksum Error
This is my second time receiving this error:

Map output lost, rescheduling: getMapOutput(attempt_200910271443_0012_m_01_0,0) failed :
org.apache.hadoop.fs.ChecksumException: Checksum Error

Does anyone know why I am getting this error and how to fix it? I tried deleting all my data nodes and formatting the namenode, to no avail.

Thanks,
Eric
Re: Targeting Specific Links
Andrzej,

Based on what you suggested below, I have begun to write my own scoring plugin: in distributeScoreToOutlinks(), if the link contains the string I'm looking for, I set its score to kept_score and add a flag to the metaData in parseData (KEEP, true). How do I check for this flag in generatorSortValue()? I only see a way to check the score, not a flag.

Thanks,
Eric

On Oct 7, 2009, at 2:48 AM, Andrzej Bialecki wrote:

Eric Osgood wrote:
Andrzej, How would I check for a flag during fetch?

You would check for a flag during generation - please check ScoringFilter.generatorSortValue(), that's where you can check for a flag and set the sort value to Float.MIN_VALUE - this way the link will never be selected for fetching. And you would put the flag in CrawlDatum metadata when ParseOutputFormat calls ScoringFilter.distributeScoreToOutlinks().

Maybe this explanation can shed some light: Ideally, I would like to check the list of links for each page, but still needing a total of X links per page; if I find the links I want, I add them to the list up until X, and if I don't reach X, I add other links until X is reached. This way, I don't waste crawl time on non-relevant links.

You can modify the collection of target links passed to distributeScoreToOutlinks() - this way you can affect both which links are stored and what kind of metadata each of them gets. As I said, you can also use just plain URLFilters to filter out unwanted links, but that API gives you much less control because it's a simple yes/no that considers just the URL string. The advantage is that it's much easier to implement than a ScoringFilter.

--
Best regards,
Andrzej Bialecki - http://www.sigram.com
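To make the flag round-trip concrete, here is a rough sketch of the two methods under discussion, following Andrzej's suggestion of putting the flag on the outlink CrawlDatum metadata (rather than in ParseData), since a CrawlDatum is what generatorSortValue() receives. The signatures are written against the Nutch 1.0 ScoringFilter interface as best as can be reconstructed - double-check them against the version you build with; the remaining ScoringFilter methods (and getConf/setConf) are omitted and would simply pass values through. The KEEP key and the matched substring are hypothetical placeholders. Whether the metadata survives the updatedb merge into the CrawlDb depends on the CrawlDbReducer in the version used, so it is worth verifying with bin/nutch readdb before relying on it.

// Sketch only -- verify method signatures against your ScoringFilter interface.
import java.util.Collection;
import java.util.Map.Entry;

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.scoring.ScoringFilterException;

public class KeepLinkScoringFilter /* would declare: implements ScoringFilter, Configurable */ {

  private static final Text KEEP = new Text("KEEP");      // hypothetical flag key
  private static final String WANTED = "some-substring";  // hypothetical match string

  /** Tag interesting outlinks when ParseOutputFormat distributes scores. */
  public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData,
      Collection<Entry<Text, CrawlDatum>> targets, CrawlDatum adjust, int allCount)
      throws ScoringFilterException {
    for (Entry<Text, CrawlDatum> target : targets) {
      if (target.getKey().toString().contains(WANTED)) {
        // CrawlDatum metadata is a MapWritable in Nutch 1.0
        target.getValue().getMetaData().put(KEEP, new Text("true"));
      }
    }
    return adjust;
  }

  /** Only URLs carrying the flag get a sort value that lets them be generated. */
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    if (datum.getMetaData() != null && datum.getMetaData().containsKey(KEEP)) {
      return initSort * datum.getScore();
    }
    // Float.MIN_VALUE is the smallest *positive* float, so it can still be picked
    // when topN exceeds the number of candidates; a strongly negative value is
    // safer for "never fetch".
    return -Float.MAX_VALUE;
  }
}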
Scoring Filter Plugin
Hi,

I am trying to implement a scoring filter plugin that filters URL links. I was told that if I set the score of a link to Float.MIN_VALUE, it would never get selected for fetch. In my plugin, if a link doesn't have a high enough score when it gets to generatorSortValue(), I set its score to Float.MIN_VALUE; however, it is still getting fetched. Is there another way to tell the fetcher not to fetch certain links based on their score?

Thanks,
Eric
Re: Targeting Specific Links
Also, in the scoring-links plugin I set the return value of ScoringFilter.generatorSortValue() to Float.MIN_VALUE for all URLs and it still fetched everything - maybe Float.MIN_VALUE isn't the correct value to set so that a link never gets fetched?

Thanks,
Eric

On Oct 22, 2009, at 1:10 PM, Eric Osgood wrote:
Andrzej, Based on what you suggested below, I have begun to write my own scoring plugin: in distributeScoreToOutlinks(), if the link contains the string I'm looking for, I set its score to kept_score and add a flag to the metaData in parseData (KEEP, true). How do I check for this flag in generatorSortValue()? I only see a way to check the score, not a flag.
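One likely culprit for the behaviour reported above: in Java, Float.MIN_VALUE is the smallest positive float, not the most negative one, so if the Generator simply sorts candidates by the returned value and takes the topN, a URL scored with Float.MIN_VALUE can still be selected whenever there are fewer candidates than topN. A quick check:

public class FloatMinCheck {
  public static void main(String[] args) {
    System.out.println(Float.MIN_VALUE);   // 1.4E-45  -- tiny, but positive
    System.out.println(-Float.MAX_VALUE);  // -3.4028235E38 -- genuinely "never first"
  }
}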
Re: ERROR: current leaseholder is trying to recreate file.
Andrzej,

I updated Nutch to the trunk last night and I split up a crawl of 1.6M into 4 chunks of 400K using the updated Generator. However, the first crawl of 400K crashed last night with some new errors I have never seen before:

org.apache.hadoop.fs.ChecksumException: Checksum Error
java.io.IOException: Could not obtain block: blk_-8206810763586975866_5190 file=/user/hadoop/crawl/segments/20091020170107/crawl_generate/part-9

Do you know why I would be getting these errors? I also had a "lost tracker" error - could these problems be related?

Thanks,
Eric

On Oct 20, 2009, at 2:13 PM, Andrzej Bialecki wrote:

Eric Osgood wrote:
This is the error I keep getting whenever I try to fetch more than 400K files at a time using a 4-node Hadoop cluster running Nutch 1.0: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/hadoop/crawl/segments/20091013161641/crawl_fetch/part-00015/index for DFSClient_attempt_200910131302_0011_r_15_2 on client 192.168.1.201 because current leaseholder is trying to recreate file.

Please see this issue: https://issues.apache.org/jira/browse/NUTCH-692 - apply the patch that is attached there, rebuild Nutch, and tell me if this fixes your problem. (The patch will be applied to trunk anyway, since others confirmed that it fixes this issue.)

Can anybody shed some light on this issue? I was under the impression that 400K was small potatoes for a Nutch/Hadoop combo?

It is. This problem is rare - I think I crawled cumulatively ~500mln pages in various configs and it didn't occur to me personally. It requires a few things to go wrong (see the issue comments).

--
Best regards,
Andrzej Bialecki - http://www.sigram.com
ERROR: current leaseholder is trying to recreate file.
This is the error I keep getting whenever I try to fetch more than 400K files at a time using a 4-node Hadoop cluster running Nutch 1.0:

org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/hadoop/crawl/segments/20091013161641/crawl_fetch/part-00015/index for DFSClient_attempt_200910131302_0011_r_15_2 on client 192.168.1.201 because current leaseholder is trying to recreate file.

Can anybody shed some light on this issue? I was under the impression that 400K was small potatoes for a Nutch/Hadoop combo?

Thanks,
Eric
Re: ERROR: current leaseholder is trying to recreate file.
Andrzej,

I just downloaded the most recent trunk from svn as per your recommendations for fixing the generate bug. As soon as I have it all rebuilt with my configs I will let you know how a crawl of ~1.6mln pages goes. Hopefully no errors!

Thanks,
Eric

On Oct 20, 2009, at 2:13 PM, Andrzej Bialecki wrote:
Please see this issue: https://issues.apache.org/jira/browse/NUTCH-692 - apply the patch that is attached there, rebuild Nutch, and tell me if this fixes your problem.
Dynamic Html Parsing
Is there a way to enable dynamic HTML parsing in Nutch using a plugin or setting?

Eric Osgood - Cal Poly / Moon Valley Software
Re: Incremental Whole Web Crawling
Andrzej,

Where do I get the nightly builds from? I tried to use the Eclipse plugin that supports svn, to no avail. Is there an ftp or http server where I can download the Nutch source fresh?

Thanks,
Eric

On Oct 11, 2009, at 12:40 PM, Andrzej Bialecki wrote:

Eric Osgood wrote:
When I set generate.update.db to true and then run generate, it only runs twice and generates 100K for the 1st gen, 62.5K for the second gen and 0 for the 3rd gen on a seed list of 1.6M. I don't understand this; for a topN of 100K it should run 16 times and create 16 distinct lists if I am not mistaken.

There was a bug in this code that I fixed recently - please get a new nightly build and try it again.

--
Best regards,
Andrzej Bialecki - http://www.sigram.com
Re: Incremental Whole Web Crawling
Ok, I think I am on the right track now, but just to be sure: the code I want is the branch section of svn under nutchbase at http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase/ - correct?

Thanks,
Eric

On Oct 13, 2009, at 1:38 PM, Andrzej Bialecki wrote:

Eric Osgood wrote:
Andrzej, Where do I get the nightly builds from? I tried to use the Eclipse plugin that supports svn, to no avail. Is there an ftp or http server where I can download the Nutch source fresh?

Personally I prefer to use command-line svn, even though I do development in Eclipse - I'm probably old-fashioned, but I always want to be very clear on what's going on when I do an update. See the instructions here: http://lucene.apache.org/nutch/version_control.html

--
Best regards,
Andrzej Bialecki - http://www.sigram.com
Re: Incremental Whole Web Crawling
Oh, ok - you learn something new every day! I didn't know that the trunk was the most recent build. Good to know! So does this current trunk have a fix for the generator bug?

On Oct 13, 2009, at 2:05 PM, Andrzej Bialecki wrote:

Eric Osgood wrote:
So the trunk contains the most recent nightly update?

It's the other way around - a nightly build is created from a snapshot of the trunk. The trunk is always the most recent.

--
Best regards,
Andrzej Bialecki - http://www.sigram.com
Re: Incremental Whole Web Crawling
When I set generate.update.db to true and then run generate, it only runs twice and generates 100K for the 1st gen, 62.5K for the second gen and 0 for the 3rd gen on a seed list of 1.6M. I don't understand this; for a topN of 100K it should run 16 times and create 16 distinct lists if I am not mistaken.

Eric

On Oct 5, 2009, at 10:01 PM, Gaurang Patel wrote:

Hey, never mind. I got generate.update.db in nutch-default.xml and set it to true.

Regards,
Gaurang

2009/10/5 Gaurang Patel:
Hey Andrzej, can you tell me where to set this property (generate.update.db)? I am trying to run a similar kind of crawl scenario to the one Eric is running.

2009/10/5 Andrzej Bialecki:

Eric wrote:
Andrzej, just to make sure I have this straight: set the generate.update.db property to true, then run "bin/nutch generate crawl/crawldb crawl/segments -topN 10" 16 times?

Yes. When this property is set to true, each fetchlist will be different, because the records for pages that are already on another fetchlist will be temporarily locked. Please note that this lock holds only for 1 week, so you need to fetch all segments within one week of generating them. You can fetch and updatedb in arbitrary order, so once you have fetched some segments you can run the parsing and updatedb just from those segments, without waiting for all 16 segments to be processed.

--
Best regards,
Andrzej Bialecki - http://www.sigram.com
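For reference, since edits to nutch-default.xml tend to be overwritten on upgrade, the usual place for this override is conf/nutch-site.xml. A sketch, using the property name as discussed in this thread:

<!-- nutch-site.xml (sketch) -->
<property>
  <name>generate.update.db</name>
  <value>true</value>
  <description>If true, update the CrawlDb after generating a fetchlist so that
  URLs already placed on one fetchlist are locked (for about a week) and not
  selected again by subsequent generate runs.</description>
</property>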
Re: generate/fetch using multiple machines
Yes, using a Hadoop cluster. I would recommend the tutorial called NutchHadoopTutorial on the wiki.

On Oct 6, 2009, at 8:56 AM, Gaurang Patel wrote:

All - any idea how to configure Nutch to generate/fetch on multiple machines simultaneously?

-Gaurang
Hadoop Script
Has anyone written a script for whole-web crawling using Hadoop? The script for Nutch doesn't work since the data is inside HDFS (tail -f won't work with this).

Thanks,
Eric
Re: Hadoop Script
Sorry Ryan, I should have clarified that I am using Nutch as my crawler. There is a script for Nutch to do whole-web crawling, but it is not compatible with Hadoop.

Eric Osgood - Cal Poly / Moon Valley Software

On Oct 6, 2009, at 12:24 PM, Ryan Smith wrote:

This isn't a script per se, but it may help: http://code.google.com/p/hbase-writer - it's a plugin for the Heritrix 2 web crawler that writes crawled site data to HBase tables, which run on Hadoop. Each URL is written as a row key in the HBase table.

HTH,
-Ryan

On Tue, Oct 6, 2009 at 3:02 PM, Eric wrote:
Has anyone written a script for whole-web crawling using Hadoop? The script for Nutch doesn't work since the data is inside HDFS (tail -f won't work with this).
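For what it's worth, the whole-web cycle the stock script performs is just a handful of bin/nutch commands in a loop, and the only HDFS-specific change needed is locating the newest segment with "hadoop fs -ls" instead of a local ls. A rough sketch - the crawl path, depth and topN are placeholders, and the awk column assumes the usual "hadoop fs -ls" output where the path is the last field:

#!/bin/sh
# Sketch: whole-web crawl cycle with all data in HDFS (Nutch 1.0-era commands)
CRAWL=crawl
bin/nutch inject $CRAWL/crawldb urls
for depth in 1 2; do
  bin/nutch generate $CRAWL/crawldb $CRAWL/segments -topN 100000
  # newest segment = last listed entry; the path is the last column
  SEG=`bin/hadoop fs -ls $CRAWL/segments | tail -1 | awk '{print $NF}'`
  bin/nutch fetch $SEG
  bin/nutch updatedb $CRAWL/crawldb $SEG
done
bin/nutch invertlinks $CRAWL/linkdb -dir $CRAWL/segments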
Targeting Specific Links
Is there a way to inspect the list of links that Nutch finds per page and then, at that point, choose which links I want to include/exclude? That would be the ideal remedy to my problem.

Eric Osgood - Cal Poly / Moon Valley Software
Re: Targeting Specific Links
Andrzej,

How would I check for a flag during fetch?

Maybe this explanation can shed some light: Ideally, I would like to check the list of links for each page, but still needing a total of X links per page; if I find the links I want, I add them to the list up until X, and if I don't reach X, I add other links until X is reached. This way, I don't waste crawl time on non-relevant links.

Thanks,
Eric

On Oct 6, 2009, at 1:04 PM, Andrzej Bialecki wrote:

Eric Osgood wrote:
Is there a way to inspect the list of links that Nutch finds per page and then, at that point, choose which links I want to include/exclude? That is the ideal remedy to my problem.

Yes, look at ParseOutputFormat - you can make this decision there. There are two standard extension points where you can hook up: URLFilters and ScoringFilters. Please note that if you use URLFilters to filter out URLs too early, they will be rediscovered again and again. A better method to handle this, but also a more complicated one, is to still include such links but give them a special flag (in metadata) that prevents fetching. This requires that you implement a custom scoring plugin.

--
Best regards,
Andrzej Bialecki - http://www.sigram.com
Targeting Specific Links for Crawling
Does anyone know if it is possible to target only certain links for crawling, dynamically, during a crawl? My goal would be to write a plugin for this functionality, but I don't know where to start.

Thanks,
EO
Incremental Whole Web Crawling
My plan is to crawl ~1.6M TLDs to a depth of 2. Is there a way I can crawl it in increments of 100K? E.g. crawl 100K 16 times for the TLDs, then crawl the links generated from the TLDs in increments of 100K?

Thanks,
EO
Re: Targeting Specific Links for Crawling
Adam,

Yes, I have a list of strings I would look for in the link. My plan is to look for X number of links on the site - first looking for the links I want and, if they exist, adding them; if they don't exist, adding X links from the site. I am planning to start in the URLFilter plugin.

Eric

On Oct 5, 2009, at 12:58 PM, BELLINI ADAM wrote:

How do you target certain links? Do you know how the links are made - I mean, their format? You could just set a regular expression to accept only those kinds of links.

Date: Mon, 5 Oct 2009 21:39:52 +0200
From: Andrzej Bialecki

Eric wrote:
Does anyone know if it is possible to target only certain links for crawling, dynamically, during a crawl? My goal would be to write a plugin for this functionality, but I don't know where to start.

URLFilter plugins may be what you want.

--
Best regards,
Andrzej Bialecki - http://www.sigram.com
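If a plain yes/no decision on the URL string is enough, the whole thing can be two lines in conf/regex-urlfilter.txt instead of a plugin; rules are regexes tried top to bottom and the first match decides. The substrings below are hypothetical placeholders for the strings being looked for:

# accept links whose URL contains one of the wanted strings
+(careers|products|contact)
# reject everything else
-.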
Re: indexing just certain content
Adam,

You could turn off all the indexing plugins and write your own plugin that only indexes certain meta content from your intranet - giving you complete control of the fields indexed.

Eric

On Oct 5, 2009, at 1:06 PM, BELLINI ADAM wrote:

Hi,

Does anybody know if it's possible to index just certain content? I need to avoid indexing some garbage and repetitive data on my intranet. Put another way: is it possible to tell the indexer not to index the content between certain div tags, like:

<div id="bla bla"> plz dont index this bla bla bla </div>

Thanks to all
Re: Incremental Whole Web Crawling
Andrzej,

Just to make sure I have this straight: set the generate.update.db property to true, then run "bin/nutch generate crawl/crawldb crawl/segments -topN 10" 16 times?

Thanks,
Eric

On Oct 5, 2009, at 1:27 PM, Andrzej Bialecki wrote:

Eric wrote:
My plan is to crawl ~1.6M TLDs to a depth of 2. Is there a way I can crawl it in increments of 100K? E.g. crawl 100K 16 times for the TLDs, then crawl the links generated from the TLDs in increments of 100K?

Yes. Make sure that you have the generate.update.db property set to true, and then generate 16 segments, each having 100k urls. After you finish generating them, you can start fetching. Similarly, you can do the same for the next level, only you will have to generate more segments. This could be done much more simply with a modified Generator that outputs multiple segments from one job, but that's not implemented yet.

--
Best regards,
Andrzej Bialecki - http://www.sigram.com
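In concrete terms, the recipe Andrzej describes is just a loop around generate; it requires generate.update.db=true, and, per the note above, each segment then has to be fetched within about a week of being generated. A sketch using the 100K increment discussed in this thread:

# 1.6M seeds / 100K per fetchlist = 16 fetchlists
for i in `seq 1 16`; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000
done
# then fetch, parse and updatedb each of the 16 segments, in any order,
# within a week of generating them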
Re: Original tags, attribute defs, multiword tokens, how is this done.
On Mar 17, 2009, at 9:04 AM, Lukas, Ray wrote:

Question Four (I will start hunting for this): Last one, promise... The indexes themselves. Is there an explanation written up for each of the fields in the index?

http://wiki.apache.org/nutch/IndexStructure is the closest thing I've found apart from reading the code.

Eric

--
Eric J. Christeson, North Dakota State University
Re: Index Disaster Recovery
On Mar 16, 2009, at 7:55 PM, Otis Gospodnetic wrote:

Eric,

There are a couple of ways you can back up a Lucene index built by Solr:

1) Have a look at the Solr replication scripts, specifically snapshooter. This script creates a snapshot of an index. It's typically triggered by Solr after its commit or optimize calls, when the index is stable and not being modified. If you use snapshooter to create index snapshots, you could simply grab a snapshot and there is your backup.

2) Have a look at Solr's new replication mechanism (info on the Solr Wiki), which does something similar to the above, but without relying on replication (shell) scripts. It does everything via HTTP.

In my 10 years of using Lucene and N years of using Solr and Nutch I've never had index corruption. Nowadays Lucene even has transactions, so it's much harder (theoretically impossible) to corrupt the index.

Thank you for the information. I happened to read about snapshooter about 10 minutes after I sent that message, but didn't know about replication. It inspires confidence that you haven't experienced index corruption in your years of using this technology.

Eric

--
Eric J. Christeson, North Dakota State University
Index Disaster Recovery
What do people do when 'something goes wrong' with a crawl?

First, some background: we are a small-ish university using Nutch to crawl 60,000 - 100,000 pages across 50 or so domains. This probably puts us in a different category than most Nutch users. Our crawl cycle consists of a script to crawl everything, one domain at a time, each Sunday, and we run search across all the indexes (one per domain). Our original reason for this was that merging was taking too long, but it also keeps one bad index (or a crawl with bad results) from destroying everything. Maybe we're worrying about nothing, since we haven't had any problems in almost a year of production use (knock on wood) and I don't know how often indexes 'blow up'. We also move the previous week's indexes out of the way before replacing them, so we have a backup if something happens.

We have been moving things to a CMS and want to move to a system where pages are indexed as they are edited, while still being able to crawl things that don't fit in the CMS. This would be a big incentive for most of our people to use the CMS. The Solr back end looks promising, but I'm not sure how to implement a recovery plan with Solr. Any thoughts or experience with backing up Solr indexes? Is it as simple as moving the index like we do with Nutch indexes?

Thanks,
Eric

--
Eric J. Christeson, North Dakota State University
Re: How to use versions from the trunk
On Mar 5, 2009, at 4:12 PM, Jim Van Sciver wrote:

Hello all, newbie question alert. I have been using Nutch 0.9 and want to try out versions on the trunk. To do this I have installed the latest build, #743 (March 5, 2009), and made the small number of changes necessary for a trial run:
- add my organization name to the conf/nutch-default.xml file
- add a single domain name to conf/crawl-urlfilter.txt (this is an intranet crawl)
- create a single-entry urls.txt file

Then I run the nutch crawl command with depth 2 (first trial run) and get a stack dump (see below). I feel that I've missed something in the installation. Could someone tell me what? Many thanks.

Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad version number in .class file
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:620)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:268)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)

You need to be using Java 6; Hadoop 0.19 requires it.

Eric

--
Eric J. Christeson, North Dakota State University
Re: what is needed to index for about 10000 domains
On Mar 3, 2009, at 10:32 PM, Jasper Kamperman wrote:

There is a way to tell Nutch to look at only the beginning of a file; it's this section in your config XML:

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all.</description>
</property>

This is from the nutch-default.xml in 0.9; I don't know whether it has changed in 1.0.

This might also depend on what type of files you're trying to index. We ended up using -1 for unlimited after running into some 15MB PDF files. The PDF parser would barf if it didn't get the whole file. This was with 0.9, don't know if 1.0 includes

Eric

--
Eric J. Christeson, North Dakota State University
Re: AW: Does not locate my urls or filter problem.
Koch Martina wrote:

Please check your nutch-site.xml. If the urlfilter.regex.file property there points to a file other than your crawl-urlfilter.txt, that setting takes precedence. You can also disable the urlfilter-regex plugin by removing it from the plugin.includes property of nutch-site.xml and check whether your crawl starts fetching URLs.

Kind regards,
Martina

In nutch-0.9, nutch-default.xml has urlfilter.regex.file set to regex-urlfilter.txt.

Thanks,
Eric
Re: Build #722 won't start on Mac OS X, 10.4.11
On Feb 14, 2009, at 20:16, David M. Cole wrote:

Hiya:

Brand new to Nutch. Was able to get it to work with Tomcat on a Mac OS X PPC machine, 450MHz, dual processor, running OS X 10.4.11 with the latest version of Java (1.5.0_16-132). Indexed great, am able to search via the OpenSearch option with zero problems.

Unfortunately, I need HTTP authorization (basic, not digest or NTLM) for the site I'm trying to index. I downloaded nightly build #722 the other day, added the credentials info into conf/httpclient-auth.xml, and have not been able to get it to launch -- I receive the error "Bad version number in .class file" on the command line when I run a crawl command.

Later versions of nutch-dev use Hadoop 0.19, which requires Java 1.6 - they use some features introduced in 1.6. If you ask Google about the 'Bad version number' error you'll find that it refers to cases exactly like this, where a library needs a (usually) newer JVM.

Eric

--
Eric J. Christeson, North Dakota State University
Re: Crawler not fetching all the links
On Jan 14, 2009, at 12:44 PM, ahammad wrote:

Hello,

I'm still unable to find out why Nutch is unable to fetch and index all the links that are on the page. To recap, the Nutch urls file contains a link to a jhtml file that contains roughly 2000 links, all hosted on the same server in the same folder. Previously, I only got 111 links when I crawled. This was due to this:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.</description>
</property>

You may also want to change this one:

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all.</description>
</property>

Eric

--
Eric J. Christeson, North Dakota State University
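Both limits are normally overridden in conf/nutch-site.xml, and per their descriptions a negative value removes the cap entirely. A sketch - note that for pages fetched over HTTP the relevant size cap is http.content.limit rather than file.content.limit (both default to 64 KB):

<!-- nutch-site.xml (sketch) -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <!-- negative = process all outlinks on a page -->
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <!-- negative = do not truncate downloaded content -->
</property>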
Incremental indexing
How is it done? For now, what I do is merge two crawls into a new one:

bin/nutch mergedb crawl/crawldb crawl1/crawldb/ crawl2/crawldb/

Is that the only solution?
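One common alternative is to keep a single crawldb and linkdb and, each cycle, only generate/fetch/index the new segment, merging just the Lucene indexes at the end instead of whole crawls. A hedged sketch - command names as in Nutch 0.9/1.0, and the paths are placeholders; check "bin/nutch" without arguments for the exact usage of merge (IndexMerger) in your version:

# incremental cycle against an existing crawl/crawldb (sketch)
bin/nutch generate crawl/crawldb crawl/segments -topN 100000
SEG=`ls -d crawl/segments/2* | tail -1`        # newest segment (local filesystem)
bin/nutch fetch $SEG
bin/nutch updatedb crawl/crawldb $SEG
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes-new crawl/crawldb crawl/linkdb $SEG
bin/nutch merge crawl/index-merged crawl/indexes crawl/indexes-new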
Re: Can I update my search engine without restarting tomcat?
On Jun 19, 2008, at 1:20 PM, John Thompson wrote:

You can set up an account in Tomcat Manager if you don't have one already. The Manager lets you go in and independently start/stop/reload any of the different webapps you have running. This is exactly how I get new Nutch crawls/indexes to be active on our production server.

Won't restarting the webapp cause Tomcat to serve up error pages to users who are trying to connect to the webapp at that moment?

We set up our own servlet which uses a NutchBean in a Singleton pattern. It runs an update thread which periodically checks a known file which contains a directory name. When loading a new search db, we copy the directory where we want it (we use a directory with a date in the name) and edit the file to point to the new dir. A client never has problems related to unavailability because they either get the old NutchBean, referencing the old dir, or the new NutchBean, referencing the new dir. We keep the old ones around for a period of time (default 5 minutes) in case anyone has a search page open and wants the previous/next page of results. We haven't run into any problems with this setup.

If anyone wants more information, let me know.

--
Eric J. Christeson, North Dakota State University
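A rough sketch of the swap-on-update pattern described above, not the actual servlet: it assumes a NutchBean constructor that takes a Configuration and an index directory Path (as in Nutch 0.9/1.0 - verify against your version), and the file name, poll interval and 5-minute retirement delay are placeholders.

// Sketch only: holder that swaps in a new NutchBean when a marker file changes.
import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.util.NutchConfiguration;

public class SearchBeanHolder extends Thread {

  private volatile NutchBean current;   // searches always see a complete bean
  private NutchBean previous;           // held briefly for open result pages
  private String activeDir = "";

  public NutchBean getBean() { return current; }

  public void run() {
    while (true) {
      try {
        // the "known file" simply names the directory of the newest crawl
        BufferedReader in = new BufferedReader(new FileReader("/search/current.txt"));
        String dir = in.readLine().trim();
        in.close();
        if (!dir.equals(activeDir)) {
          Configuration conf = NutchConfiguration.create();
          NutchBean fresh = new NutchBean(conf, new Path(dir));
          previous = current;   // keep the old bean alive during the grace period
          current = fresh;      // new queries go to the new index from now on
          activeDir = dir;
          Thread.sleep(5 * 60 * 1000L);
          previous = null;      // drop the old reference; close it first if your
                                // NutchBean version exposes a close() method
        }
        Thread.sleep(60 * 1000L);
      } catch (Exception e) {
        // log and keep serving whatever bean we already have
      }
    }
  }
}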
Re: two questions about nutch url filter when inject
On Jun 18, 2008, at 9:38 AM, beansproud wrote:

Second, when I changed this file, the output of nutch didn't show any change. And when I recompiled, it changed. This took me 3 hours; can anybody tell me why?

What changes did you make to the file, and what specifically changed when you recompiled?

eric

--
Eric J. Christeson, North Dakota State University
Re: Field phrases
On Jun 8, 2008, at 12:31 PM, Aldarris wrote:

Hello:

How can I search for a field phrase? A content phrase is described in the help files -- one just wraps the words in quotes. But what if one wants to search for a url, for example, abc.go.com/folder? Apparently, I need to search for abc, go, com and folder next to each other. What is the proper syntax? url:abc url:folder produces quite a few wrong results.

It should work with url:"abc go com folder"

eric

--
Eric J. Christeson, North Dakota State University
Re: Ignoring robots.txt
On May 27, 2008, at 11:42 AM, Vijay Krishnan wrote:

Hi all,

Do you know what file in Nutch parses robots.txt? If there is no option in Nutch to ignore robots.txt, I would at least like to modify the source.

Thanks,
Vijay

src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java parses robots.txt.
src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HTMLMetaProcessor.java parses robot rules from html documents.

eric

--
Eric J. Christeson, North Dakota State University
Re: Problems with indexing sub-section of a site
On Thu, May 22, 2008 at 07:46:16PM -0700, foobar3001 wrote:

Hello!

In short: Is it possible to tell Nutch to follow the links through one larger namespace, but only index (add to its database) the content of links that are in a sub-namespace of that?

The background: I have started to experiment with crawling my blog with Nutch. The problem is that this blog doesn't have its own domain. Instead, it is hosted on a larger site, which also hosts discussion forums and other people's blogs. My URL there is http://www.geekzone.co.nz/foobar, so naturally I thought that adding something to the crawl-urlfilter.txt file would help. Something like this:

+^http://([a-z0-9]*\.)*geekzone.co.nz/foobar

But look at the bottom of that page: the navigation links to the other pages in my blog - or to the 'next' page - actually lead out of my namespace. Thus, they are not being picked up anymore and Nutch never sees the additional links that I have on those other pages.

Since eventually I would like this to be a bit more generic (I don't want anything specific for my blog, that's just a test case), I thought that maybe I have to open it up to the root URL, making the filter something like this:

+^http://([a-z0-9]*\.)*geekzone.co.nz

But then it picks up a ton of other stuff that I am not interested in having in my database. So, now I'm wondering whether it is possible to tell Nutch to follow links through one namespace, but only add those pages to its index database that are in a specific sub-namespace of the first one?

I did a quick scan of the page in question, and I noticed the urls are of this form: http://www.geekzone.co.nz/blog.asp?blogid=207

Could you filter like

+^http://([a-z0-9]*\.)*geekzone.co.nz/blog.asp\?blogid=207

You'll have to comment out the default '?' killer or put this rule before it. Maybe there's something I'm missing, though.

Eric

--
Eric J. Christeson, North Dakota State University
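To make the ordering point concrete: rules in conf/crawl-urlfilter.txt are tried top to bottom and the first match wins, so the accept rule has to sit above the stock "probable query" rule (shown here as it appears in the default file):

# accept the blog pages first -- must come before the query-killer below
+^http://([a-z0-9]*\.)*geekzone.co.nz/blog.asp\?blogid=207
# skip URLs containing certain characters as probable queries, etc. (default rule)
-[?*!@=]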
Re: Error: Generator: 0 records selected for fetching, exiting ...
On May 21, 2008, at 7:22 AM, Abhijit Bera wrote:

I totally give up. I tried whole-web crawling and I still get the above error! :(

Have you tried it with an empty crawl directory? If those pages had been successfully crawled before and are in your crawldb, the fetch interval probably hasn't passed yet.

eric

--
Eric J. Christeson, North Dakota State University
Null pointer error when performing a search
Hi, I am new to Nutch and I got a null pointer exception when I try to submit a search through the demo app. Please see the error message below. I have modified the demo app to run in its own webapp context rather than in the ROOT context. The first page shows, but when I put in a keyword to search I get the error. Did I do something wrong? Please help. Thanks. - Eric type: Exception report description: The server encountered an internal error () that prevented it from fulfilling this request. exception: org.apache.jasper.JasperException org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:370) org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:291) org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241) javax.servlet.http.HttpServlet.service(HttpServlet.java:802) root cause: java.lang.NullPointerException org.apache.nutch.searcher.NutchBean.init(NutchBean.java:96) org.apache.nutch.searcher.NutchBean.init(NutchBean.java:82) org.apache.nutch.searcher.NutchBean.init(NutchBean.java:72) org.apache.nutch.searcher.NutchBean.get(NutchBean.java:64) org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:112) org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97) javax.servlet.http.HttpServlet.service(HttpServlet.java:802) org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:322) org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:291) org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241) javax.servlet.http.HttpServlet.service(HttpServlet.java:802) note: The full stack trace of the root cause is available in the Apache Tomcat/5.5.9 logs.
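A NullPointerException thrown from NutchBean's constructor at search time usually means the webapp cannot locate the crawl data at all, rather than anything being wrong with the query. A sketch of the usual fix, assuming the searcher.dir property that the search webapp of this era reads (check nutch-default.xml in your version for the exact name): point it at the crawl output directory in a nutch-site.xml under the deployed webapp's WEB-INF/classes and restart the container.

    <!-- WEB-INF/classes/nutch-site.xml (sketch) -->
    <property>
      <name>searcher.dir</name>
      <value>/absolute/path/to/crawl</value>
      <description>Directory produced by the crawl, containing index and segments.</description>
    </property>

Running the app in its own context rather than ROOT does not matter by itself, but a relative searcher.dir is resolved against the directory the container was started from, which is a common way for the bean to come up empty.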
java.util.MissingResourceException on Resin
hello, I installed Nutch on Resin 3.0. When I open search.jsp, I get the error below. search.jsp line 116 is <i18n:bundle baseName="org.nutch.jsp.search" /> Any thoughts? Thank you java.util.MissingResourceException: Can't find bundle for base name org.nutch.jsp.search, locale ko_KR at java.util.ResourceBundle.throwMissingResourceException(ResourceBundle.java:837) at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:727) at java.util.ResourceBundle.getBundle(ResourceBundle.java:550) at org.apache.taglibs.i18n.BundleTag.findBundle(BundleTag.java:309) at org.apache.taglibs.i18n.BundleTag.doStartTag(BundleTag.java:333) at _jsp._search__jsp._jspService(search.jsp:116) at com.caucho.jsp.JavaPage.service(JavaPage.java:63) at com.caucho.jsp.Page.pageservice(Page.java:570) at com.caucho.server.dispatch.PageFilterChain.doFilter(PageFilterChain.java:159) at com.caucho.server.webapp.WebAppFilterChain.doFilter(WebAppFilterChain.java:163) at com.caucho.server.dispatch.ServletInvocation.service(ServletInvocation.java:208) at com.caucho.server.http.HttpRequest.handleRequest(HttpRequest.java:259) at com.caucho.server.port.TcpConnection.run(TcpConnection.java:341) at com.caucho.util.ThreadPool.runTasks(ThreadPool.java:490) at com.caucho.util.ThreadPool.run(ThreadPool.java:423) at java.lang.Thread.run(Thread.java:595)
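A MissingResourceException with that base name usually means the i18n properties files are not on the webapp's classpath at all (so there is nothing for the ko_KR locale to fall back to). One quick way to check, assuming the search webapp was deployed from a war file (the file name below is a guess -- use whatever is actually deployed under Resin), is to list its contents:

    # war/file names are placeholders -- adjust to your deployment
    jar tf nutch-0.7.war | grep 'org/nutch/jsp/search'
    # you would hope to see entries along the lines of:
    #   WEB-INF/classes/org/nutch/jsp/search.properties     (default fallback)
    #   WEB-INF/classes/org/nutch/jsp/search_ko.properties  (Korean, if shipped)

The bundles may also sit inside a jar under WEB-INF/lib, in which case list that jar instead. If the default search.properties is missing altogether, the lookup fails for every locale, which matches the trace above.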
Nutch 0.6 and Nutch 0.7 crawlers
hello. I tried to crawl a certain site using both nutch 0.6 and nutch 0.7, just to compare how they are different. However, I get fewer URLs crawled using Nutch 0.7 than Nutch 0.6. I'll paste 2 different log files below. As you can see below, both 0.6 and 0.7 fetch the same number of URLs at the first depth, but at the second depth Nutch 0.7 fetches only 15 URLs while Nutch 0.6 fetches 34 URLs. Of course, the configuration and settings are the same. Can you tell me why I get these results? Thank you Eric Park log for Nutch 0.6 crawler --
060328 182513 logging at INFO
060328 182513 fetching http://www.qmind.co.kr/sub4/sub4_1.htm
060328 182513 fetching http://www.qmind.co.kr/sub1/sub1_3.htm
060328 182513 fetching http://www.qmind.co.kr/sub1/sub1_5.htm
060328 182513 fetching http://www.qmind.co.kr/sub3/sub3_21.htm
060328 182513 fetching http://www.qmind.co.kr/sub1/sub1_1.htm
060328 182513 fetching http://www.qmind.co.kr/sub1/sub1_4.htm
060328 182513 fetching http://www.qmind.co.kr/sub5/sub5_4.htm
060328 182513 fetching http://www.qmind.co.kr/sub2/sub2_21.htm
060328 182513 fetching http://www.qmind.co.kr/sub6/sub6_2.htm
060328 182513 fetching http://qmind.co.kr/
060328 182513 fetching http://www.qmind.co.kr/sub3/sub3_11.htm
060328 182514 fetching http://www.qmind.co.kr/sub2/sub2_11.htm
060328 182515 fetching http://www.qmind.co.kr/sub6/sub6_1.htm
060328 182516 fetching http://www.qmind.co.kr/sub1/sub1_2.htm
060328 182517 fetching http://www.qmind.co.kr/sub5/sub5_1.htm
060328 182528 status: segment 20060328182513, 15 pages, 0 errors, 271658 bytes, 15149 ms
060328 182528 status: 0.99016434 pages/s, 140.09691 kb/s, 18110.533 bytes/page
060328 182529 Updating /usr/local/nutch-0.7/crawl_folder/crawl.qmind2/db
060328 182529 Updating for /usr/local/nutch-0.7/crawl_folder/crawl.qmind2/segments/20060328182513
060328 182529 Processing document 0
060328 182530 Finishing update
060328 182530 Processing pagesByURL: Sorted 437 instructions in 0.013 seconds.
060328 182530 Processing pagesByURL: Sorted 33615.38461538462 instructions/second
060328 182530 Processing pagesByURL: Merged to new DB containing 51 records in 0.014 seconds
060328 182530 Processing pagesByURL: Merged 3642.8571428571427 records/second
060328 182530 Processing pagesByMD5: Sorted 65 instructions in 0.0020 seconds.
060328 182530 Processing pagesByMD5: Sorted 32500.0 instructions/second
060328 182530 Processing pagesByMD5: Merged to new DB containing 51 records in 0.0040 seconds
060328 182530 Processing pagesByMD5: Merged 12750.0 records/second
060328 182530 Processing linksByMD5: Sorted 437 instructions in 0.0060 seconds.
060328 182530 Processing linksByMD5: Sorted 72833.333 instructions/second
060328 182530 Processing linksByMD5: Merged to new DB containing 275 records in 0.018 seconds
060328 182530 Processing linksByMD5: Merged 15277.778 records/second
060328 182530 Processing linksByURL: Sorted 260 instructions in 0.0040 seconds.
060328 182530 Processing linksByURL: Sorted 65000.0 instructions/second
060328 182530 Processing linksByURL: Merged to new DB containing 275 records in 0.016 seconds
060328 182530 Processing linksByURL: Merged 17187.5 records/second
060328 182530 Processing linksByMD5: Sorted 274 instructions in 0.0040 seconds.
060328 182530 Processing linksByMD5: Sorted 68500.0 instructions/second
060328 182530 Processing linksByMD5: Merged to new DB containing 275 records in 0.0090 seconds
060328 182530 Processing linksByMD5: Merged 30555.556 records/second
060328 182530 Update finished
060328 182530 FetchListTool started
060328 182530 Processing pagesByURL: Sorted 35 instructions in 0.0020 seconds.
060328 182530 Processing pagesByURL: Sorted 17500.0 instructions/second
060328 182530 Processing pagesByURL: Merged to new DB containing 51 records in 0.0010 seconds
060328 182530 Processing pagesByURL: Merged 51000.0 records/second
060328 182530 Processing pagesByMD5: Sorted 35 instructions in 0.0020 seconds.
060328 182530 Processing pagesByMD5: Sorted 17500.0 instructions/second
060328 182530 Processing pagesByMD5: Merged to new DB containing 51 records in 0.0020 seconds
060328 182530 Processing pagesByMD5: Merged 25500.0 records/second
060328 182530 Processing linksByMD5: Copied file (4096 bytes) in 0.0010 secs.
060328 182530 Processing linksByURL: Copied file (4096 bytes) in 0.0020 secs.
060328 182530 Processing /usr/local/nutch-0.7/crawl_folder/crawl.qmind2/segments/20060328182530/fetchlist.unsorted: Sorted 35 entries in 0.0020 seconds.
060328 182530 Processing /usr/local/nutch-0.7/crawl_folder/crawl.qmind2/segments/20060328182530/fetchlist.unsorted: Sorted 17500.0 entries/second
060328 182530 Overall processing: Sorted 35 entries in 0.0020 seconds.
060328 182530 Overall processing: Sorted 5.714285714285714E-5 entries/second
060328 182530 FetchListTool completed
060328 182530 logging at INFO
060328 182530 fetching http://www.qmind.co.kr/sub2/sub2_14.htm
060328 182530 fetching http://www.qmind.co.kr/sub2/sub2_16.htm
060328
Re: Nutch 0.6 and Nutch 0.7 crawlers
hello, the problem is that they are not unwanted URLs. I crawled the site 'www.qmind.co.kr'. I found that the Nutch 0.7 crawler works just fine at the first depth. However, at the second depth it filters out any links that start with 'www.qmind.co.kr' and only crawls URLs starting with 'qmind.co.kr'. I can't figure out why it filters out URLs starting with 'www' at the second depth. Nutch 0.6 works just fine. Are there any known bugs in the Nutch 0.7 crawler? thank you, Eric Park 2006/4/12, Andrzej Bialecki [EMAIL PROTECTED]: eric park wrote: hello. I tried to crawl a certain site using both nutch 0.6 and nutch 0.7, just to compare how they are different. However, I get fewer URLs crawled using Nutch 0.7 than Nutch 0.6. I'll paste 2 different log files below. As you can see below, both 0.6 and 0.7 fetch the same number of URLs at the first depth, but at the second depth Nutch 0.7 fetches only 15 URLs while Nutch 0.6 fetches 34 URLs. Of course, the configuration and settings are the same. IIRC (it was long ago...) version 0.6 had a bug where unwanted URLs would slip through the URLFilters. This was tightened in 0.7. Please check that the URLs that are rejected in 0.7 are really valid URLs, i.e. that they should indeed be accepted. -- Best regards, Andrzej Bialecki - Information Retrieval, Semantic Web - Embedded Unix, System Integration - http://www.sigram.com Contact: info at sigram dot com
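One quick sanity check along the lines Andrzej suggests is to confirm that the accept rule in the URL filter configuration covers both host spellings: a rule written as +^http://qmind.co.kr matches the bare host but not the www form. A sketch of what the relevant lines in crawl-urlfilter.txt could look like (the exact filter file consulted may differ between 0.6 and 0.7):

    # accept the site with or without a leading www (or any other subdomain)
    +^http://([a-z0-9]*\.)*qmind\.co\.kr/

    # reject everything else
    -.

If both URL forms pass when tested by hand against these regexes and 0.7 still drops the www links at depth two, that would point to something version-specific rather than to the filter configuration.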