Max outlinks count in Nutch
Hi all. Does the db.max.outlinks.per.page value in nutch-default.xml have a limitation? When I crawl using the default value of 100, it fails to get many links. Does this value control the number of links to be fetched from a page? Any suggestion would greatly help. Thanks in advance. Regards, -Hussain
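For reference, the cap can be raised by overriding the property in conf/nutch-site.xml instead of editing nutch-default.xml. A minimal sketch (the value 500 is just an example; check your version's nutch-default.xml description for the exact semantics):

<!-- conf/nutch-site.xml: override the default outlink cap of 100 -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>500</value>
  <description>The maximum number of outlinks processed per page.</description>
</property>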
Re: Out of memory error while updating
Should I change the value of 'io.sort.mb' and/or 'io.sort.factor', and if so, what should I change them to so as to eliminate the error?

Yes, since it looks like it crashed during sorting.

Also, is there any minimum RAM requirement for Nutch to do indexing and searching?

Well, not really, but you should have 1 GB RAM if you want to do serious things. You can set up the memory in the bin/nutch script:

# NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
#                 Default is 1000.
...
JAVA_HEAP_MAX=-Xmx1000m

HTH Stefan

Any help is greatly appreciated. Thanks in advance. Regards, -Hussain.

----- Original Message ----- From: Stefan Groschupf [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Monday, December 26, 2005 7:18 PM Subject: Re: Out of memory error while updating

Do you have a stack trace? Is it maybe related to a 'too many files open' exception? Also, you can try to minimize 'io.sort.mb' and/or 'io.sort.factor'. Stefan

On 26.12.2005 at 09:27, K.A.Hussain Ali wrote:

Hi all, I am using Nutch to crawl a few sites, and when I crawl to a certain depth and then update the webdb, I get an Out of Memory error while updating. I increased the JVM size using JAVA_OPTS and even reduced the tokens per page in nutch-default.xml, but I still get such an error. I am using Tomcat and I have only one application running on it. What are the system requirements for Nutch to get rid of this error? I have tried things mentioned in the mailing list but nothing has been fruitful. Any help is greatly appreciated. Thanks in advance. Regards, -Hussain.

--- company: http://www.media-style.com forum: http://www.text-mining.org blog: http://www.find23.net
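Since the heap cap lives in bin/nutch, it can also be raised per run through the environment instead of editing the script. A small sketch, assuming the quoted script behaviour (NUTCH_HEAPSIZE in MB becomes the -Xmx value); the db and segment paths are illustrative:

# give the Nutch JVM 2 GB of heap for this run only
export NUTCH_HEAPSIZE=2000
bin/nutch updatedb crawl.test/db crawl.test/segments/20051227212143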
Re: How to run Nutch?
This is the error message that I get:

[EMAIL PROTECTED] nutch-nightly]# bin/start-all.sh
cat: /root/.slaves: Arquivo ou diretório não encontrado (file or directory not found)
starting namenode, logging to /usr/nutch-nightly/nutch-root-namenode-localhost.localdomain.log
051227 085214 parsing file:/usr/nutch-nightly/conf/nutch-default.xml
051227 085214 parsing file:/usr/nutch-nightly/conf/nutch-site.xml
Exception in thread "main" java.lang.RuntimeException: Not a host:port pair: local
        at org.apache.nutch.ndfs.DataNode.createSocketAddr(DataNode.java:54)
        at org.apache.nutch.ndfs.NameNode.<init>(NameNode.java:52)
        at org.apache.nutch.ndfs.NameNode.main(NameNode.java:349)
starting jobtracker, logging to /usr/nutch-nightly/nutch-root-jobtracker-localhost.localdomain.log
051227 085215 parsing file:/usr/nutch-nightly/conf/nutch-default.xml
051227 085215 parsing file:/usr/nutch-nightly/conf/nutch-site.xml
Exception in thread "main" java.lang.RuntimeException: Bad mapred.job.tracker: local
        at org.apache.nutch.mapred.JobTracker.getAddress(JobTracker.java:254)
        at org.apache.nutch.mapred.JobTracker.<init>(JobTracker.java:228)
        at org.apache.nutch.mapred.JobTracker.startTracker(JobTracker.java:45)
        at org.apache.nutch.mapred.JobTracker.main(JobTracker.java:1070)
cat: /root/.slaves: Arquivo ou diretório não encontrado (file or directory not found)
[EMAIL PROTECTED] nutch-nightly]#

My nutch-site.xml is the saved nutch-default.xml, without modifications:

...
<property>
  <name>fs.default.name</name>
  <value>local</value>
  <description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
</property>

<property>
  <name>ndfs.datanode.port</name>
  <value>50010</value>
  <description>The port number that the ndfs datanode server uses as a starting point to look for a free port to listen on.</description>
</property>

<property>
  <name>ndfs.name.dir</name>
  <value>/tmp/nutch/ndfs/name</value>
  <description>Determines where on the local filesystem the NDFS name node should store the name table.</description>
</property>

<property>
  <name>ndfs.data.dir</name>
  <value>/tmp/nutch/ndfs/data</value>
  <description>Determines where on the local filesystem an NDFS data node should store its blocks. If this is a comma- or space-delimited list of directories, then data will be stored in all named directories, typically on different devices.</description>
</property>

<!-- map/reduce properties -->

<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

<property>
  <name>mapred.job.tracker.info.port</name>
  <value>50030</value>
  <description>The port that the MapReduce job tracker info webserver runs at.</description>
</property>

<property>
  <name>mapred.task.tracker.output.port</name>
  <value>50040</value>
  <description>The port number that the MapReduce task tracker output server uses as a starting point to look for a free port to listen on.</description>
</property>

<property>
  <name>mapred.task.tracker.report.port</name>
  <value>50050</value>
  <description>The port number that the MapReduce task tracker report server uses as a starting point to look for a free port to listen on.</description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/tmp/nutch/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate data files. May be a space- or comma-separated list of directories on different devices in order to spread disk i/o.</description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/tmp/nutch/mapred/system</value>
  <description>The shared directory where MapReduce stores control files.</description>
</property>

<property>
  <name>mapred.temp.dir</name>
  <value>/tmp/nutch/mapred/temp</value>
  <description>A shared directory for temporary files.</description>
</property>
...

As a final reminder, if that matters, this computer is on a small network (with a router) with another computer that runs another OS performing other tasks. Thank you for your attention.
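For context on the two exceptions: bin/start-all.sh launches the NDFS namenode and the MapReduce jobtracker, and both refuse the literal value "local" because they expect real host:port pairs. A sketch of the overrides a genuine multi-machine setup would put in nutch-site.xml (hostname and ports are illustrative); for a single-machine run, the advice later in this thread is to skip start-all.sh entirely and just use bin/nutch crawl:

<!-- illustrative values only; use your master's real hostname -->
<property>
  <name>fs.default.name</name>
  <value>master.example.com:9000</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>master.example.com:9001</value>
</property>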
Re: How to run Nutch?
Do you have a one-machine or a multi-machine installation planned?

On 27.12.2005 at 13:05, carmmello wrote:

Things got better. Using webapps directly under nutch-nightly and using local: 5 in the nutch-site.xml, I got:

[EMAIL PROTECTED] nutch-nightly]# bin/start-all.sh
cat: /root/.slaves: Arquivo ou diretório não encontrado (file or folder not found)
starting namenode, logging to /usr/nutch-nightly/nutch-root-namenode-localhost.localdomain.log
starting jobtracker, logging to /usr/nutch-nightly/nutch-root-jobtracker-localhost.localdomain.log
cat: /root/.slaves: Arquivo ou diretório não encontrado (file or folder not found)
[EMAIL PROTECTED] nutch-nightly]#
document markup to control indexing
Hi all, Another open source search engine, HtDig, allows web page authors to mark up a page such that some sections are not indexed. The syntax looks like the following:

<!--htdig_noindex--> ... material inside is not indexed ... <!--/htdig_noindex-->

Does a similar feature exist in Nutch? If the answer is "write a plugin", does anyone have tips on where to start? Also, how hard is something like this for a Nutch newbie who doesn't know anything about HTML parsing? I have a bunch of documents already marked up with the htdig syntax, and in the interests of interoperability I'm tempted to follow the syntax exactly. -Jeff
Re: document markup to control indexing
Hi Jeff, Please refer to the getText() method in the org.apache.nutch.parse.html.DOMContentUtils class (in the parse-html plugin, of course). You can add your filter there easily ;) /Jack

On 12/27/05, Jeff Breidenbach jeff@jab.org wrote:

Hi all, Another open source search engine, HtDig, allows web page authors to mark up a page such that some sections are not indexed. The syntax looks like the following:

<!--htdig_noindex--> ... material inside is not indexed ... <!--/htdig_noindex-->

Does a similar feature exist in Nutch? If the answer is "write a plugin", does anyone have tips on where to start? Also, how hard is something like this for a Nutch newbie who doesn't know anything about HTML parsing? I have a bunch of documents already marked up with the htdig syntax, and in the interests of interoperability I'm tempted to follow the syntax exactly. -Jeff

-- Keep Discovering ... ... http://www.jroller.com/page/jmars
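To make that concrete, here is a rough sketch of the kind of check Jack is describing: a DOM walk in the spirit of DOMContentUtils.getText() that toggles a flag on the htdig comment markers. This is not the actual Nutch 0.7 method body, it assumes the HTML parser keeps comment nodes in the DOM, and the class and parameter names are illustrative:

import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class NoIndexTextExtractor {

  // Walks the DOM in document order, appending text to sb but skipping
  // everything between <!--htdig_noindex--> and <!--/htdig_noindex-->.
  // 'skipping' is a one-element array so the flag survives across siblings.
  public static void getText(StringBuffer sb, Node node, boolean[] skipping) {
    if (node.getNodeType() == Node.COMMENT_NODE) {
      String data = node.getNodeValue() == null ? "" : node.getNodeValue().trim();
      if ("htdig_noindex".equals(data)) {
        skipping[0] = true;            // start of a no-index region
      } else if ("/htdig_noindex".equals(data)) {
        skipping[0] = false;           // end of a no-index region
      }
      return;
    }
    if (node.getNodeType() == Node.TEXT_NODE && !skipping[0]) {
      sb.append(node.getNodeValue()).append(' ');
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      getText(sb, children.item(i), skipping);
    }
  }
}

Called as getText(sb, document, new boolean[] { false }), this skips the marked regions during text extraction while leaving link extraction untouched.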
Re: document markup to control indexing
| Please refer to the getText() method in the org.apache.nutch.parse.html.DOMContentUtils class (in the parse-html plugin, of course). You can add your filter there easily ;)

Wow! That was really easy. Thanks. --Jeff
Re: How to run Nutch?
| Do you have a one-machine or a multi-machine installation planned?

Just one machine.
Re: Distributed search corrupted output problem
Ed, it is definitely not an encoding problem with the rpc calls; the following test passes on my box. It would be interesting to find the problem, but setting up a distributed system to verify your problem is too time-expensive. Can you try using the latest sources and check if this still occurs? I will read some more code and see if I can find anything that looks like a problem. It would be great if someone from the community could verify whether this is really a bug and whether it is reproducible. That search results using distributed search are different is a known problem (see jira). Can you provide a second tomcat running on another port, or maybe just another tomcat context running a nutch ui pointing to a local index? Stefan

/**
 * Copyright 2005 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.nutch.ipc;

import java.lang.reflect.Method;
import java.net.InetSocketAddress;

import junit.framework.TestCase;

import org.apache.nutch.io.UTF8;

public class TestEncoding extends TestCase {

  private int PORT = 50232;

  private String TEXT = "座頭市"; // no idea what this means :)

  public void testEncoding() throws Exception {
    Server server = RPC.getServer(new HelloWorld(), PORT);
    server.start();
    Method method = HelloWorld.class.getMethod("helloWorld",
        new Class[] { UTF8.class });
    Object[][] parameter = new Object[1][1];
    parameter[0][0] = new UTF8(TEXT);
    UTF8[] values = (UTF8[]) RPC.call(method, parameter,
        new InetSocketAddress[] { new InetSocketAddress("127.0.0.1", PORT) });
    assertEquals(TEXT, values[0].toString());
  }

  class HelloWorld {
    public UTF8 helloWorld(UTF8 utf8) {
      return utf8;
    }
  }
}

On 27.12.2005 at 05:38, Ed Whittaker wrote:

Hi, I'm running nutch-0.7.1 on a couple of RedHat-9 linux machines. When I execute catalina.sh start in the crawl directory (i.e. not using distributed search) and query with a 2-Kanji Japanese string, everything works fine, i.e. the pages seem relevant and the output is in the correct encoding. However, when I run a distributed search using one search server specified in search-servers.txt and the same index as used above, the *returned pages are not the same* and the *output is corrupted*. To see an example of this go to: http://asked.ru/search.jsp?query=%E6%9D%B1%E4%BA%AC This queries nutch with the string for Tokyo in Japanese. Unfortunately, I can't provide access to an example of the working (non-distributed) setup, but trust me, it looks good. Note, this is not a problem concerning the Tomcat integration with Apache, since accessing the distributed search setup via http://localhost:8080 gives identical (corrupted) output to what you'll get if you click on the above link. I would guess this is some socket encoding problem, since that is ostensibly the only difference in the two configurations, isn't it? Does anyone have a distributed search setup which doesn't have these encoding problems? I.e. is it something wrong with my setup somewhere, or is this a known bug? -Ed
Re: How to run Nutch?
Stefan wrote: OK, you are somehow trying to get a map reduce multi-machine installation running on one machine; that of course will fail. Just download or build a 0.8 release, decompress the archive into a folder called nutch-0.8, then try:

cd nutch-0.8
bin/nutch

The result should look like:

Usage: nutch COMMAND
where COMMAND is one of:
  crawl       one-step crawler for intranets
  readdb      read / dump crawl db
  readlinkdb  read / dump link db
  admin       database administration, including creation

Then you can start with the crawling command; you do not need any configuration change for now!!!

I have tried this at the beginning, but this does not work. Just see my original post, which initiated this topic. Thank you for all your attention.
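For reference, the one-step crawl Stefan mentions typically looks like this. Paths, url, and depth are illustrative; note that 0.8 expects 'urls' to be a directory of seed files, whereas 0.7 took a single file:

# seed list: one url per line
mkdir urls
echo "http://www.example.com/" > urls/seeds.txt

# one-step intranet crawl to depth 3, output under crawl.test
bin/nutch crawl urls -dir crawl.test -depth 3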
multibyte character support status
What is the current state and plan for multibyte character support in Nutch? As far as I can tell... The PDF plugin uses PDFBox (www.pdfbox.org), which does not work with Japanese and probably other multibyte characters and code sets. The Word plugin uses POI (http://jakarta.apache.org/poi/), which doesn't seem to support Japanese. Some patches to make it possible to support Japanese (and hopefully other code sets) have been submitted to the POI project, but they have not been integrated because the project currently has no committer. The RTF document plugin and the PowerPoint plugin use home-grown parsers. What is the status of multibyte code set (and single-byte code sets other than ISO-8859-1) support in these plugins? -Kuro
Re: file to http mapping
Thank you, this approach worked nicely.

On Tue, 27 Dec 2005 3:03 am, Stefan Groschupf wrote:

Jeff, no, such a solution does not exist. Take a look at the index filters. I suggest the following solution: + write an index filter that adds a field 'webserverUrl' to the lucene document containing your rewritten url (it should be of type keyword). + change the jsp page such that the 'webserverUrl' is used as the link url instead of the real url. Changing the url behind the scenes makes no sense, since nutch uses the url as the key for all records. HTH Stefan
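A rough sketch of the filter Stefan describes, reduced to the Lucene part. The IndexingFilter extension-point signature varies across Nutch versions, and the rewriteToHttp() helper plus the file-to-http mapping are hypothetical and must be adapted to the actual directory layout:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class WebserverUrlIndexingFilter {

  // Adds a 'webserverUrl' keyword field (stored, not tokenized) so the
  // jsp page can link to the http:// view of a file:// document.
  public Document filter(Document doc, String fileUrl) {
    doc.add(Field.Keyword("webserverUrl", rewriteToHttp(fileUrl)));
    return doc;
  }

  // Hypothetical mapping: swap the file prefix for the webserver's prefix.
  private String rewriteToHttp(String fileUrl) {
    return fileUrl.replaceFirst("^file:/var/www", "http://www.example.com");
  }
}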
How can I set a search server over NDFS
Hi, I have tried all available samples but was unsuccessful. I am using the following command to start the server:

bin/nutch-daemon.sh start server 9003 crawl

I have set up a directory /hosts with the file search-servers.txt, which contains:

localhost 9003

but the tomcat client does not connect to my search server at all. Any idea what I am doing wrong? Gal
Re: Trouble setting NDFS on multiple machines
The exception means that one client is unable to connect to one *datanode*. Check that the box that had this exception can open a connection to all other datanodes with the correct port. Try:

telnet machineNameAsUsedInNameNode DATANODE_PORT

Is it able to connect? Stefan

On 27.12.2005 at 22:20, Gal Nitzan wrote:

Hi, For some reason I am having trouble setting up NDFS on multiple machines; I keep on getting an exception. My settings follow the guidelines, i.e. Doug's cheat sheet, on all three machines:

<name>fs.default.name</name> <value>nutchmst1.XX.com:9000</value>

All machines seem to be connecting to the namenode:

051227 223242 10 Opened server at 50010
051227 223242 11 Starting DataNode in: /nutch/ndfs/data/data
051227 223242 11 using BLOCKREPORT_INTERVAL of 3500482msec
051227 223242 12 Client connection to x.x.22.185:9000: starting
051227 230013 Server connection on port 9000 from x.x.22.186: starting
051227 230013 Got brand-new heartbeat from nutchnd1:50010
051227 230013 Block report from nutchnd1:50010: 0 blocks.
051227 230013 Server connection on port 9000 from x.x.22.183: starting
051227 230013 Got brand-new heartbeat from nutchws1:50010
051227 230013 Block report from nutchws1:50010: 0 blocks.

The problem:

[EMAIL PROTECTED] trunk]$ bin/nutch ndfs -copyFromLocal urls.txt
051227 230324 parsing file:/home/nutchuser/nutch/trunk/conf/nutch-default.xml
051227 230324 parsing file:/home/nutchuser/nutch/trunk/conf/nutch-site.xml
051227 230324 No FS indicated, using default:nutchmst1.XXX.com:9000
051227 230324 Client connection to x.x.22.185:9000: starting
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2
        at org.apache.nutch.fs.NDFSShell.main(NDFSShell.java:234)

[EMAIL PROTECTED] trunk]$ bin/nutch ndfs -copyFromLocal urls.txt urls /urls.txt
051227 230422 parsing file:/home/nutchuser/nutch/trunk/conf/nutch-default.xml
051227 230423 parsing file:/home/nutchuser/nutch/trunk/conf/nutch-site.xml
051227 230423 No FS indicated, using default:nutchmst1.xxx.com:9000
051227 230423 Client connection to x.x.22.185:9000: starting
Exception in thread "main" java.lang.NullPointerException
        at java.net.Socket.<init>(Socket.java:357)
        at java.net.Socket.<init>(Socket.java:207)
        at org.apache.nutch.ndfs.NDFSClient$NDFSOutputStream.nextBlockOutputStream(NDFSClient.java:573)
        at org.apache.nutch.ndfs.NDFSClient$NDFSOutputStream.<init>(NDFSClient.java:521)
        at org.apache.nutch.ndfs.NDFSClient.create(NDFSClient.java:83)
        at org.apache.nutch.fs.NDFSFileSystem.createRaw(NDFSFileSystem.java:71)
        at org.apache.nutch.fs.NFSDataOutputStream$Summer.<init>(NFSDataOutputStream.java:41)
        at org.apache.nutch.fs.NFSDataOutputStream.<init>(NFSDataOutputStream.java:129)
        at org.apache.nutch.fs.NutchFileSystem.create(NutchFileSystem.java:187)
        at org.apache.nutch.fs.NutchFileSystem.create(NutchFileSystem.java:174)
        at org.apache.nutch.fs.NDFSFileSystem.doFromLocalFile(NDFSFileSystem.java:178)
        at org.apache.nutch.fs.NDFSFileSystem.copyFromLocalFile(NDFSFileSystem.java:153)
        at org.apache.nutch.fs.NDFSShell.copyFromLocal(NDFSShell.java:46)
        at org.apache.nutch.fs.NDFSShell.main(NDFSShell.java:234)

I know I am missing something, but I can't figure out what.

--- company: http://www.media-style.com forum: http://www.text-mining.org blog: http://www.find23.net
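To make the two checks concrete (hostnames and ports are taken from the log excerpt above; the destination path in the second command is a guess at what was intended):

# 1. raw connectivity test from the failing client box to a datanode
telnet nutchnd1 50010

# 2. -copyFromLocal wants exactly one source and one destination; the
#    one-argument call above is plausibly what triggered the
#    ArrayIndexOutOfBoundsException
bin/nutch ndfs -copyFromLocal urls.txt /urls.txt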
Re: How can I set a search server over NDFS
Try the real dns name of the box, or 127.0.0.1, instead of localhost. Any exception? Stefan

On 28.12.2005 at 00:42, Gal Nitzan wrote:

Hi, I have tried all available samples but was unsuccessful. I am using the following command to start the server:

bin/nutch-daemon.sh start server 9003 crawl

I have set up a directory /hosts with the file search-servers.txt, which contains:

localhost 9003

but the tomcat client does not connect to my search server at all. Any idea what I am doing wrong? Gal

--- company: http://www.media-style.com forum: http://www.text-mining.org blog: http://www.find23.net
How can I set a search server over NDFS - Revised
Do I need to run a search server if I want the search to use NDFS? Anyway, in the nutch-site.xml which resides under tomcat, searcher.dir = crawl, and the name of the ndfs root is the same. However, I still get 0 results, though I know for sure there are documents in the index.

On Wed, 2005-12-28 at 01:42 +0200, Gal Nitzan wrote:

Hi, I have tried all available samples but was unsuccessful. I am using the following command to start the server: bin/nutch-daemon.sh start server 9003 crawl I have set up a directory /hosts with the file search-servers.txt, which contains: localhost 9003 but the tomcat client does not connect to my search server at all. Any idea what I am doing wrong? Gal
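For what it's worth, the usual wiring for distributed search is that the webapp's nutch-site.xml points searcher.dir at a *local* directory, and that directory contains a search-servers.txt listing the running search servers. A sketch, with illustrative paths:

<!-- nutch-site.xml under tomcat: a local directory, not the ndfs path -->
<property>
  <name>searcher.dir</name>
  <value>/home/nutch/search</value>
</property>

/home/nutch/search/search-servers.txt then holds one "host port" pair per line, e.g. "localhost 9003"; if that file is present, the webapp queries the listed servers instead of opening an index directly.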
Re: Trouble setting NDFS on multiple machines
Interesting! That is not a feature, that is a bug; maybe you can open a minor bug report. Thanks. Stefan

On 28.12.2005 at 01:35, Gal Nitzan wrote:

Thanks for the prompt reply. However, it seems that the problem was with JDK 1.5; when I changed to 1.4.2, all seems to be working. Thanks. Gal.

On Wed, 2005-12-28 at 01:24 +0100, Stefan Groschupf wrote:

The exception means that one client is unable to connect to one *datanode*. Check that the box that had this exception can open a connection to all other datanodes with the correct port. Try: telnet machineNameAsUsedInNameNode DATANODE_PORT Is it able to connect? Stefan

On 27.12.2005 at 22:20, Gal Nitzan wrote: [snip]

--- company: http://www.media-style.com forum: http://www.text-mining.org blog: http://www.find23.net
Re: Trouble setting NDFS on multiple machines
Thanks for the prompt reply. However, it seems that the problem was with JDK 1.5; when I changed to 1.4.2, all seems to be working. Thanks. Gal.

On Wed, 2005-12-28 at 01:24 +0100, Stefan Groschupf wrote:

The exception means that one client is unable to connect to one *datanode*. Check that the box that had this exception can open a connection to all other datanodes with the correct port. Try: telnet machineNameAsUsedInNameNode DATANODE_PORT Is it able to connect? Stefan

On 27.12.2005 at 22:20, Gal Nitzan wrote: [snip]

--- company: http://www.media-style.com forum: http://www.text-mining.org blog: http://www.find23.net
Can we search based on two fields?
Hi everyone, I am currently indexing a single website, say www.somesite.com, but I do not want to crawl urls with a certain pattern, let's say 'nocrawl', i.e. www.somesite.com/nocrawl.html or www.somesite.com/apage.php?nocrawl. I want to discard any url that contains the pattern 'nocrawl'. How do I do it? I am using nutch version 0.7.1. Also, I want to use the 'crawl' command for crawling these pages. Thank you for your support. -- Keep on smiling :) Kumar
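A minimal sketch of what this could look like in conf/crawl-urlfilter.txt, the filter file the 'crawl' command uses. Rules are tried top-down and the first match decides, so the exclusion must come before the site-wide accept rule; the site pattern is illustrative:

# skip any url containing the string 'nocrawl'
-nocrawl

# accept everything else on the site (no trailing slash; see the
# reply in the next thread about trailing slashes in 0.7)
+^http://([a-z0-9]*\.)*somesite.com

# reject everything else
-.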
Crawler problem in 0.7 and 0.7.1
Hi all, I encountered problems when I ran the nutch 0.7 and 0.7.1 crawler. Although I have added a number of root urls in a plain text file *urls*, the crawler seems unwilling to fetch any of them. However, when I fall back to nutch 0.6, everything just works fine. Therefore, I am wondering if this problem happens to all of you? Currently, I am running nutch 0.7.1 with JDK 1.5 update 6 on Ubuntu 5.10. Anyway, I came across the same problem on my apple Mac too. Below is the content of the crawler log; it shows that the crawler returns 0 entries. Thanks in advance.

051227 212142 parsing file:/opt/nutch-0.7.1/conf/nutch-default.xml
051227 212143 parsing file:/opt/nutch-0.7.1/conf/crawl-tool.xml
051227 212143 parsing file:/opt/nutch-0.7.1/conf/nutch-site.xml
051227 212143 No FS indicated, using default:local
051227 212143 crawl started in: crawl.test
051227 212143 rootUrlFile = urls
051227 212143 threads = 10
051227 212143 depth = 3
...
051227 212143 *Added 0 pages*
051227 212143 FetchListTool started
051227 212144 *Overall processing: Sorted 0 entries in 0.0 seconds.*
051227 212144 Overall processing: Sorted NaN entries/second
051227 212144 FetchListTool completed
051227 212144 logging at INFO
051227 212145 Updating /opt/nutch-0.7.1/crawl.test/db
051227 212145 Updating for /opt/nutch-0.7.1/crawl.test/segments/20051227212143
051227 212145 Finishing update
051227 212145 Update finished
051227 212145 FetchListTool started
051227 212145 *Overall processing: Sorted 0 entries in 0.0 seconds.*
051227 212145 Overall processing: Sorted NaN entries/second
051227 212145 FetchListTool completed
051227 212145 logging at INFO
051227 212146 Updating /opt/nutch-0.7.1/crawl.test/db
051227 212146 Updating for /opt/nutch-0.7.1/crawl.test/segments/20051227212145
051227 212146 Finishing update
051227 212146 Update finished
051227 212146 FetchListTool started
051227 212146 Overall processing: Sorted 0 entries in 0.0 seconds.
051227 212146 Overall processing: Sorted NaN entries/second
051227 212146 FetchListTool completed
051227 212146 logging at INFO
051227 212147 Updating /opt/nutch-0.7.1/crawl.test/db
051227 212147 Updating for /opt/nutch-0.7.1/crawl.test/segments/20051227212146
051227 212147 Finishing update
051227 212147 Update finished
051227 212147 Updating /opt/nutch-0.7.1/crawl.test/segments from /opt/nutch-0.7.1/crawl.test/db
051227 212147 reading /opt/nutch-0.7.1/crawl.test/segments/20051227212143
051227 212148 reading /opt/nutch-0.7.1/crawl.test/segments/20051227212145
051227 212148 reading /opt/nutch-0.7.1/crawl.test/segments/20051227212146
051227 212148 Sorting pages by url...
051227 212148 Getting updated scores and anchors from db...
051227 212148 Sorting updates by segment...
051227 212148 Updating segments...
051227 212148 Done updating /opt/nutch-0.7.1/crawl.test/segments from /opt/nutch-0.7.1/crawl.test/db
051227 212148 indexing segment: /opt/nutch-0.7.1/crawl.test/segments/20051227212143
051227 212148 * Opening segment 20051227212143
051227 212148 * Indexing segment 20051227212143
051227 212148 * Optimizing index...
051227 212148 * Moving index to NFS if needed...
051227 212148 DONE indexing segment 20051227212143: total 0 records in 0.026s (NaN rec/s).
051227 212148 done indexing
051227 212148 indexing segment: /opt/nutch-0.7.1/crawl.test/segments/20051227212145
051227 212148 * Opening segment 20051227212145
051227 212148 * Indexing segment 20051227212145
051227 212148 * Optimizing index...
051227 212148 * Moving index to NFS if needed...
051227 212148 DONE indexing segment 20051227212145: total 0 records in 0.075s (NaN rec/s).
051227 212148 done indexing
051227 212148 indexing segment: /opt/nutch-0.7.1/crawl.test/segments/20051227212146
051227 212148 * Opening segment 20051227212146
051227 212148 * Indexing segment 20051227212146
051227 212148 * Optimizing index...
051227 212148 * Moving index to NFS if needed...
051227 212148 *DONE indexing segment 20051227212146: total 0 records in 0.011s (NaN rec/s).*
051227 212148 done indexing
051227 212148 Reading url hashes...
051227 212148 Sorting url hashes...
051227 212148 Deleting url duplicates...
051227 212148 Deleted 0 url duplicates.
051227 212148 Reading content hashes...
051227 212148 Sorting content hashes...
051227 212148 Deleting content duplicates...
051227 212148 Deleted 0 content duplicates.
051227 212148 Duplicate deletion complete locally. Now returning to NFS...
051227 212148 DeleteDuplicates complete
051227 212148 Merging segment indexes...
051227 212148 crawl finished: crawl.test

Rgds Chih-How Bong
Re: Crawler problem in 0.7 and 0.7.1
Hi there, Can you check your crawl-urlfilter.txt file? I guess there is a slight handling problem in the code:

+^http://([a-z0-9]*\.)*google.com    works, but
+^http://([a-z0-9]*\.)*google.com/   doesn't work

You see, the trailing slash messes things up and won't allow the urls to be injected. So try removing the / at the end in the crawl-urlfilter.txt file, and then it should work. HTH Pushpesh

On 12/28/05, Chih How Bong [EMAIL PROTECTED] wrote:

Hi all, I encountered problems when I ran the nutch 0.7 and 0.7.1 crawler. Although I have added a number of root urls in a plain text file *urls*, the crawler seems unwilling to fetch any of them. However, when I fall back to nutch 0.6, everything just works fine. Therefore, I am wondering if this problem happens to all of you? Currently, I am running nutch 0.7.1 with JDK 1.5 update 6 on Ubuntu 5.10. Anyway, I came across the same problem on my apple Mac too. Below is the content of the crawler log; it shows that the crawler returns 0 entries. Thanks in advance. [snip: full log quoted in the previous message]
Re: Is any one able to successfully run Distributed Crawl?
Have you tried the following: http://wiki.apache.org/nutch/HardwareRequirements and http://wiki.apache.org/nutch/ There is no quick answer if one is planning to crawl millions of pages. Read... try... read.

On 12/28/05, Pushpesh Kr. Rajwanshi [EMAIL PROTECTED] wrote:

Hi, I want to know if anyone has been able to successfully run a distributed crawl on multiple machines involving crawling millions of pages, and how hard it is to do that. Do I just have to do some configuration and setup, or some implementation also? Also, can anyone tell me: if I want to crawl around 20,000 websites (say to depth 5) in a day, is it possible, and if yes, how many machines would I roughly require? And what configuration will I need? I would appreciate even some very approximate numbers, as I can understand it might not be trivial to find out, or maybe it is :-) TIA Pushpesh