Re: fetcher.threads.per.host bug in 0.7.1?
Is there a bug in 0.7.1 that causes the fetcher.threads.per.host setting to be ignored? Why do you think it's getting ignored? Is it because of the "Exceeded http.max.delays" errors below?

These show up when the fetcher.threads.per.host limit causes a thread to delay and then loop, because another thread is already accessing a page from the same host. When a thread has looped more than http.max.delays times, it triggers that error. So it's actually a sign that fetcher.threads.per.host is being used, not ignored.

It looks like you're going after a bunch of pages from the same domain (fas.org), which means you're going to get a bunch of these errors even with just three threads.

-- Ken

[snip]

<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>This number is the maximum number of threads that should be allowed to access a host at one time.</description>
</property>

Fetch Log:

060109 202235 fetching http://www.fas.org/irp/news/1998/06/prs_rel21.html
060109 202250 fetch of http://www.fas.org/irp/news/1998/04/t04141998_t0414asd-3.html failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 202250 fetch of http://www.fas.org/asmp/campaigns/smallarms/sawgconf.PDF failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 202250 fetching http://www.fas.org/irp/commission/testhaas.htm
060109 202250 fetching http://www.fas.org/asmp/profiles/bahrain.htm
060109 202250 fetching http://www.fas.org/irp/cia/product/dci_speech_03082001.html
060109 202306 fetching http://www.fas.org/irp/news/1998/06/980609-drug10.htm
060109 202321 fetch of http://www.fas.org/irp/commission/testhaas.htm failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 202321 fetch of http://www.fas.org/asmp/profiles/bahrain.htm failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 202321 fetching http://www.fas.org/irp/news/1998/04/980422-terror2.htm
060109 202321 fetching http://www.fas.org/irp//congress/2004_cr/index.html
060109 202321 fetching http://www.fas.org/irp//congress/2001_rpt/index.html
060109 202338 fetching http://www.fas.org/irp/budget/fy98_navy/0601152n.htm
060109 202354 fetching http://www.fas.org/irp/dia/product/cent21strat.htm
060109 202408 fetch of http://www.fas.org/irp/news/1998/04/980422-terror2.htm failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 202408 fetch of http://www.fas.org/irp//congress/2004_cr/index.html failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 202408 fetching http://www.fas.org/faspir/2001/v54n2/qna.htm
060109 202408 fetching http://www.fas.org/graphics/predator/index.htm
060109 202409 fetching http://www.fas.org/irp/doddir/dod/5200-1r/chapter_6.htm
060109 202425 fetching http://www.fas.org/irp//congress/1995_hr/140.htm

--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
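[Editor's note: if the goal is to keep crawling a single host politely without hitting these RetryLater failures, one option is to tune the politeness settings in nutch-site.xml. A minimal sketch follows; both property names appear earlier in this thread, but the values are purely illustrative, not recommendations.]

```xml
<!-- Illustrative tuning for a crawl dominated by one host (values are examples only). -->
<property>
  <name>http.max.delays</name>
  <value>100</value>
  <description>How many times a fetcher thread may delay and loop on a busy
  host before the fetch fails with "Exceeded http.max.delays: retry later."</description>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Seconds the fetcher waits between successive requests to the
  same server; lowering it shortens each delay loop iteration.</description>
</property>
```

Raising http.max.delays lets threads wait out the per-host lock longer instead of failing the fetch.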
RE: http status 500?
Okay, I did that and restarted tomcat from my crawl.test directory... I still get an error when searching. Do I need to rerun the crawl?

andy

-----Original Message-----
From: Jerry Russell [mailto:[EMAIL PROTECTED]
Sent: Monday, January 09, 2006 5:59 PM
To: nutch-user@lucene.apache.org
Subject: Re: http status 500?

In the file {tomcatroot}\webapps\nutch-0.7\WEB-INF\classes\nutch-site.xml, add the following inside the configuration element:

<property>
  <name>searcher.dir</name>
  <value>/full/path/to/the/directory/containing/segments</value>
</property>

Jerry
[EMAIL PROTECTED]
http://circuitscout.com

Andy Morris wrote:
> What config file, the nutch-daemon.sh or just the nutch file? Or in the tomcat folder?
> Andy
>
> -----Original Message-----
> From: Jerry Russell [mailto:[EMAIL PROTECTED]
> Sent: Monday, January 09, 2006 5:22 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: http status 500?
>
> Hi Andy,
>
> Have you added the path to the segments directory to your configuration file? If not, you need to do that, or start tomcat while that is your current directory. Is there any stack trace, or error in the catalina.out?
>
> Jerry
> [EMAIL PROTECTED]
> http://circuitscout.com
>
> Andy Morris wrote:
>> Okay, I think I got nutch working and tomcat runs. I did a crawl and it got some data, I think. When I go to the search page and do a search I get an http status 500 page:
>>
>> javax.servlet.ServletException: Not implemented
>>
>> root cause
>>
>> java.lang.Error: Not implemented
>>
>> Any ideas? Do I need to build tomcat from scratch? This is on a fedora core 2 box with tomcat 4.1.27-13 from rpm.
>>
>> andy
Re: http status 500?
In the file {tomcatroot}\webapps\nutch-0.7\WEB-INF\classes\nutch-site.xml, add the following inside the configuration element:

<property>
  <name>searcher.dir</name>
  <value>/full/path/to/the/directory/containing/segments</value>
</property>

Jerry
[EMAIL PROTECTED]
http://circuitscout.com

Andy Morris wrote:
> What config file, the nutch-daemon.sh or just the nutch file? Or in the tomcat folder?
> Andy
>
> -----Original Message-----
> From: Jerry Russell [mailto:[EMAIL PROTECTED]
> Sent: Monday, January 09, 2006 5:22 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: http status 500?
>
> Hi Andy,
>
> Have you added the path to the segments directory to your configuration file? If not, you need to do that, or start tomcat while that is your current directory. Is there any stack trace, or error in the catalina.out?
>
> Jerry
> [EMAIL PROTECTED]
> http://circuitscout.com
>
> Andy Morris wrote:
>> Okay, I think I got nutch working and tomcat runs. I did a crawl and it got some data, I think. When I go to the search page and do a search I get an http status 500 page:
>>
>> javax.servlet.ServletException: Not implemented
>>
>> root cause
>>
>> java.lang.Error: Not implemented
>>
>> Any ideas? Do I need to build tomcat from scratch? This is on a fedora core 2 box with tomcat 4.1.27-13 from rpm.
>>
>> andy
RE: http status 500?
What config file, the nutch-daemon.sh or just the nutch file? Or in the tomcat folder?

Andy

-----Original Message-----
From: Jerry Russell [mailto:[EMAIL PROTECTED]
Sent: Monday, January 09, 2006 5:22 PM
To: nutch-user@lucene.apache.org
Subject: Re: http status 500?

Hi Andy,

Have you added the path to the segments directory to your configuration file? If not, you need to do that, or start tomcat while that is your current directory. Is there any stack trace, or error in the catalina.out?

Jerry
[EMAIL PROTECTED]
http://circuitscout.com

Andy Morris wrote:
> Okay, I think I got nutch working and tomcat runs. I did a crawl and it got some data, I think. When I go to the search page and do a search I get an http status 500 page:
>
> javax.servlet.ServletException: Not implemented
>
> root cause
>
> java.lang.Error: Not implemented
>
> Any ideas? Do I need to build tomcat from scratch? This is on a fedora core 2 box with tomcat 4.1.27-13 from rpm.
>
> andy
Re: http status 500?
Hi Andy,

Have you added the path to the segments directory to your configuration file? If not, you need to do that, or start tomcat while that is your current directory. Is there any stack trace, or error in the catalina.out?

Jerry
[EMAIL PROTECTED]
http://circuitscout.com

Andy Morris wrote:
> Okay, I think I got nutch working and tomcat runs. I did a crawl and it got some data, I think. When I go to the search page and do a search I get an http status 500 page:
>
> javax.servlet.ServletException: Not implemented
>
> root cause
>
> java.lang.Error: Not implemented
>
> Any ideas? Do I need to build tomcat from scratch? This is on a fedora core 2 box with tomcat 4.1.27-13 from rpm.
>
> andy
http status 500?
Okay, I think I got nutch working and tomcat runs. I did a crawl and it got some data, I think. When I go to the search page and do a search I get an http status 500 page:

javax.servlet.ServletException: Not implemented

root cause

java.lang.Error: Not implemented

Any ideas? Do I need to build tomcat from scratch? This is on a fedora core 2 box with tomcat 4.1.27-13 from rpm.

andy
No cluster results
"No cluster results" is displayed next to the search results. Is this because I turned clustering on after running the fetch and the indexing?

nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)|clustering-carrot2</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.</description>
</property>
Re: Search result is an empty site
Never mind, solved it. For tomcat 5, run:

export JAVA_OPTS="-Xmx128m -Xms128m"

Håvard W. Kongsgård wrote:
> No, I use 0.7.1. I have tested nutch/tomcat with 20 000 docs so I know it works. Searching using site, like "china site:www.fas.org", also works.
>
> Dominik Friedrich wrote:
>> If you use the mapred version from svn trunk you might have run into the same problem as I have. In the mapred version the searcher.dir property in nutch-default.xml is set to "crawl" and not "." anymore. If you use this version you have either to put the index and the segments dirs into a folder called crawl and start tomcat from above that folder, or change that value in the nutch-site.xml in webapps/ROOT/WEB-INF/classes of your tomcat nutch deployment.
>>
>> regards
>> Dominik
>>
>> Håvard W. Kongsgård wrote:
>>> Hi, I am running a nutch server with a db containing 20 docs. When I start tomcat and search for something the browser displays an empty site. Is this a memory problem, how do I fix it? System: 2,6 | Memory 1 GB | SUSE 9.2
Re: Multi CPU support
Teruhiko Kurosaka wrote:
> Can I use MapReduce to run Nutch on a multi CPU system?

Yes.

> I want to run the index job on two (or four) CPUs on a single system. I'm not trying to distribute the job over multiple systems. If MapReduce is the way to go, do I just specify config parameters like these:
>
> mapred.tasktracker.tasks.maximum=2
> mapred.job.tracker=localhost:9001
> mapred.reduce.tasks=2 (or 1?)
>
> and bin/start-all.sh ?

That should work. You'd probably want to set the default number of map tasks to be a multiple of the number of CPUs, and the number of reduce tasks to be exactly the number of CPUs. Don't use start-all.sh, but rather just:

bin/nutch-daemon.sh start tasktracker
bin/nutch-daemon.sh start jobtracker

> Must I use NDFS for MapReduce?

No.

Doug
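[Editor's note: Doug's advice, expressed as a hedged nutch-site.xml sketch for a single two-CPU box. The first three property names come from the question above; mapred.map.tasks is an assumption based on the same naming scheme, and all values are illustrative.]

```xml
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value> <!-- JobTracker on the local machine -->
</property>
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value> <!-- one concurrent task per CPU -->
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>4</value> <!-- a multiple of the number of CPUs -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value> <!-- exactly the number of CPUs -->
</property>
```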
Multi CPU support
Can I use MapReduce to run Nutch on a multi CPU system? I want to run the index job on two (or four) CPUs on a single system. I'm not trying to distribute the job over multiple systems. If MapReduce is the way to go, do I just specify config parameters like these:

mapred.tasktracker.tasks.maximum=2
mapred.job.tracker=localhost:9001
mapred.reduce.tasks=2 (or 1?)

and bin/start-all.sh ?

Must I use NDFS for MapReduce? Do I need to do anything else to make sure that the two processes run on different CPUs? Is this the only way to take advantage of a multi CPU system?

-kuro
Creating Multiple Nutch Beans for Searching
Will performance be drastically reduced if I create a NutchBean for each and every user? Can someone shed light on this issue?
Re: Is any one able to successfully run Distributed Crawl?
Pushpesh Kr. Rajwanshi wrote:
> Just wanted to confirm: this distributed crawl you did, was it using nutch version 0.7.1 or some other version? And was that a successful distributed crawl using map reduce, or some workaround for distributed crawl?

No, this is 0.8-dev. This was done in early December using the version of Nutch then in the mapred branch. This version has since been merged into the trunk and will eventually be released as 0.8. I believe everything in my previous message is still relevant to the current trunk.

Doug
Re: Full Range of Results Not Showing
Neal Whitley wrote:
> I have Nutch 0.7 up and running, however when I search there are a number of times where Nutch finds more total matching pages than it returns on a search. Example: On a search Nutch finds 81 matching pages but only returns 46 in a result set.
>
> Hits *46-46* (out of about 81 total matching pages):

Nutch, like Yahoo! and Google, only shows two hits from a site. Are there "more from site" links with some hits? There should be. Is there a "show all hits" link at the bottom of the last page? There should be.

Doug
Fetching only the pages in an urlfile
Hi,

How, or where, can I specify to the fetcher to only fetch content/pages of the urls in the specified urlfile? I.e., I want to avoid unnecessary fetching of extra content. I want to avoid the following (dumplinks output using readdb):

"from http://www.eurekalert.org/pub_releases/2005-12/uopm-mut120805.php to http://www.upmc.edu/"

(I don't want to fetch the "http://www.upmc.edu/" url, but only "http://www.eurekalert.org/pub_releases/2005-12/uopm-mut120805.php")

Thanks in advance for any help!

Vish
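[Editor's note: one common approach, sketched here rather than taken from a reply in this thread, is to tighten the URL filter so only the seeded hosts pass, and to fetch at depth 1 so discovered links are never scheduled. The file name and patterns below are assumptions based on Nutch's stock regex URL filter; the eurekalert.org host is taken from the question above.]

```
# Hypothetical crawl-urlfilter.txt / regex-urlfilter.txt entries.
# '+' accepts a matching URL, '-' rejects; the final '-.' rejects everything else.
+^http://www\.eurekalert\.org/
-.
```

With this in place, links pointing off the accepted host (such as http://www.upmc.edu/) are filtered out before they are ever fetched.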
Re: Appropriate MapReduce Hardware
Chris Schneider wrote:
> 2) The TaskTracker nodes should probably also be DataNodes in such a relatively small system. No significant data is saved on the TaskTracker machine, except in its role as a DataNode.

It is actually optimal for TaskTrackers and DataNodes to both be run on all slave boxes. That way map tasks can be assigned to nodes where their input data is local, and reduce tasks can write the first copy of their output locally, reducing network i/o. (These optimizations are not in the current code, but will be soon.)

> 3) The NameNode box probably wants to keep large indexes of blocks in memory, but I wouldn't expect these to exceed the same 2GB metric we're using for the TaskTrackers. Likewise, I wouldn't expect the CPU speed to be a major constraint (mostly network bound). Finally, I can't imagine why the NameNode would need tons of disk space.
>
> 4) I would imagine that the JobTracker would have even less need for big RAM and a fast CPU, let alone hard drive space. I'd probably start with this running on the same box as the NameNode.

I typically run the NameNode and JobTracker on the same box, the master. Ideally this box might be configured differently (e.g., a RAID for higher disk reliability) but practically speaking it's fine and simpler to have it configured the same as the others. I usually run a cron entry on the NameNode box which periodically copies NDFS name data to another drive or machine with rsync, since this is a single point of failure.

> 7) Since the local network will probably be the gating performance parameter, we'll need a 1GB network.

Yes. I've benchmarked 30 & 180 node NDFS systems with 100MB networking, and the network does appear to be the bottleneck.

Doug
Re: Help on language
Would you tell me where I can get a help document on how to use NGramProfile to train the language identifier, and how to detect the language? Marathi is a language used in India. It uses the Devanagari script, and space is used as the separator. Will it be OK if I use StopAnalyzer instead of NutchDocumentAnalyzer with my custom stopwords? Where do I have to make changes in the Nutch code?
RE: Help on language
Could you tell me where Marathi is used and what script (a set of letters) is used to write it? Does Marathi use spaces to separate words? If so, I don't see much problem from the architectural point of view. You just write the analyzer plugin (not very easy for some languages, but do-able). But if it doesn't use spaces, like Japanese (also Korean and Chinese?), then you'd have a problem. Currently, the Query expression analysis assumes that words are separated by spaces for non-CJK (Chinese, Japanese and Korean) characters, and that a single CJK character forms a word, an invalid assumption. The analysis part of the Query expression is not made pluggable yet. (I'm trying to come up with some proposal.)

Oh, by the way, you'd need a dev version of Nutch to use the pluggable language analyzer. The stable version has the generic analyzer hard-coded.

-kuro

> -----Original Message-----
> From: Sameer Tamsekar [mailto:[EMAIL PROTECTED]
> Sent: 2006-1-08 2:40
> To: nutch-user@lucene.apache.org
> Subject: Help on language
>
> Hello,
>
> I am working on building a custom analyzer and language detector for the native language "Marathi". Does anybody have an idea how to extend nutch for using this language?
>
> Regards,
> Sameer
fetcher.threads.per.host bug in 0.7.1?
Is there a bug in 0.7.1 that causes the fetcher.threads.per.host setting to be ignored?

Nutch-site.xml:

<property>
  <name>fetcher.server.delay</name>
  <value>15.0</value>
  <description>The number of seconds the fetcher will delay between successive requests to the same server.</description>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>3</value>
  <description>The number of FetcherThreads the fetcher should use. This also determines the maximum number of requests that are made at once (each FetcherThread handles one connection).</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>This number is the maximum number of threads that should be allowed to access a host at one time.</description>
</property>

Fetch Log:

060109 202235 fetching http://www.fas.org/irp/news/1998/06/prs_rel21.html
060109 202250 fetch of http://www.fas.org/irp/news/1998/04/t04141998_t0414asd-3.html failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 202250 fetch of http://www.fas.org/asmp/campaigns/smallarms/sawgconf.PDF failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 202250 fetching http://www.fas.org/irp/commission/testhaas.htm
060109 202250 fetching http://www.fas.org/asmp/profiles/bahrain.htm
060109 202250 fetching http://www.fas.org/irp/cia/product/dci_speech_03082001.html
060109 202306 fetching http://www.fas.org/irp/news/1998/06/980609-drug10.htm
060109 202321 fetch of http://www.fas.org/irp/commission/testhaas.htm failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 202321 fetch of http://www.fas.org/asmp/profiles/bahrain.htm failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 202321 fetching http://www.fas.org/irp/news/1998/04/980422-terror2.htm
060109 202321 fetching http://www.fas.org/irp//congress/2004_cr/index.html
060109 202321 fetching http://www.fas.org/irp//congress/2001_rpt/index.html
060109 202338 fetching http://www.fas.org/irp/budget/fy98_navy/0601152n.htm
060109 202354 fetching http://www.fas.org/irp/dia/product/cent21strat.htm
060109 202408 fetch of http://www.fas.org/irp/news/1998/04/980422-terror2.htm failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 202408 fetch of http://www.fas.org/irp//congress/2004_cr/index.html failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 202408 fetching http://www.fas.org/faspir/2001/v54n2/qna.htm
060109 202408 fetching http://www.fas.org/graphics/predator/index.htm
060109 202409 fetching http://www.fas.org/irp/doddir/dod/5200-1r/chapter_6.htm
060109 202425 fetching http://www.fas.org/irp//congress/1995_hr/140.htm
Full Range of Results Not Showing
I have Nutch 0.7 up and running, however when I search there are a number of times where Nutch finds more total matching pages than it returns on a search. Example: On a search Nutch finds 81 matching pages but only returns 46 in a result set.

Hits 46-46 (out of about 81 total matching pages):

Why is it doing this? Or, what do I need to correct?

Thanks,
Neal
Fedora core 2 install
Okay, I think I have nutch set up properly. I have java and tomcat installed. I can run a crawl and it processes the urls in the urls file. When I go to the search site and do a search I get an error, http status 500:

description: The server encountered an internal error () that prevented it from fulfilling this request.

exception

javax.servlet.ServletException: org.apache.nutch.clustering.OnlineClustererFactory

root cause

java.lang.NoClassDefFoundError: org.apache.nutch.clustering.OnlineClustererFactory

Is there something I am missing? I started tomcat from my crawl.test directory. Was that correct? Fedora core2 does not have the catalina.sh file to start tomcat; I installed tomcat from yum. I can get to the web site and the search site.

Andy
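[Editor's note: a NoClassDefFoundError for OnlineClustererFactory suggests the search webapp is trying to load the clustering extension point but the clustering plugin classes are not available to it. One possible fix, sketched under assumptions (this is not a reply from the thread, and the plugin list shown is illustrative), is to enable clustering-carrot2 in the plugin.includes of the nutch-site.xml deployed under the webapp's WEB-INF/classes, and make sure the clustering plugin was included when the war was built:]

```xml
<!-- Illustrative: include clustering-carrot2 so the OnlineClusterer
     extension point can be resolved at search time. -->
<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|clustering-carrot2</value>
</property>
```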
Nutch freezing - deflateBytes
Our nutch installation (version 0.7, running on Mandrake linux) continues to freeze sporadically during fetching. Our developer has it pinned down to the deflateBytes library: "it looped in the native method called deflateBytes for very long time. Some times, it took several hours." That's all we've got so far. Has anyone run into a problem with this library before, or know of a quick way around the issue? Thanks.
Re: Search result is an empty site
No, I use 0.7.1. I have tested nutch/tomcat with 20 000 docs so I know it works. Searching using site, like "china site:www.fas.org", also works.

Dominik Friedrich wrote:
> If you use the mapred version from svn trunk you might have run into the same problem as I have. In the mapred version the searcher.dir property in nutch-default.xml is set to "crawl" and not "." anymore. If you use this version you have either to put the index and the segments dirs into a folder called crawl and start tomcat from above that folder, or change that value in the nutch-site.xml in webapps/ROOT/WEB-INF/classes of your tomcat nutch deployment.
>
> regards
> Dominik
>
> Håvard W. Kongsgård wrote:
>> Hi, I am running a nutch server with a db containing 20 docs. When I start tomcat and search for something the browser displays an empty site. Is this a memory problem, how do I fix it? System: 2,6 | Memory 1 GB | SUSE 9.2
Re: Search result is an empty site
If you use the mapred version from svn trunk you might have run into the same problem as I have. In the mapred version the searcher.dir property in nutch-default.xml is set to "crawl" and not "." anymore. If you use this version you have either to put the index and the segments dirs into a folder called crawl and start tomcat from above that folder, or change that value in the nutch-site.xml in webapps/ROOT/WEB-INF/classes of your tomcat nutch deployment.

regards
Dominik

Håvard W. Kongsgård wrote:
> Hi, I am running a nutch server with a db containing 20 docs. When I start tomcat and search for something the browser displays an empty site. Is this a memory problem, how do I fix it? System: 2,6 | Memory 1 GB | SUSE 9.2
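[Editor's note: Dominik's second option, as a hedged nutch-site.xml sketch. The property name comes from his message; the path is a placeholder you would replace with the directory that actually contains your index and segments dirs.]

```xml
<!-- In webapps/ROOT/WEB-INF/classes/nutch-site.xml of the tomcat deployment. -->
<property>
  <name>searcher.dir</name>
  <value>/path/to/your/crawl</value>
</property>
```

With this override in place, tomcat no longer needs to be started from a particular working directory for the searcher to find the crawl data.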
Search result is an empty site
Hi, I am running a nutch server with a db containing 20 docs. When I start tomcat and search for something the browser displays an empty site. Is this a memory problem, how do I fix it? System: 2,6 | Memory 1 GB | SUSE 9.2
Re: Help on language
> I am working on building custom analyzer

To build a custom analyzer, take a look at the analysis-de and analysis-fr plugins (they use some lucene analyzers). A specific analyzer is used depending on the language guessed by the language identifier.

> and language detector for native language ("Marathi"), does anybody have idea how to extend nutch for using this language.

Use the org.apache.nutch.analysis.lang.NGramProfile command to generate a profile of ngrams for Marathi from a textual corpus. Usage for creating a new profile is:

NGramProfile -create profilename filename encoding

Regards
Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
Re: Help needed please !Please Ignore
On Mon, 2006-01-09 at 02:06 +0200, Gal Nitzan wrote:
> Hi,
>
> I see only one fetcher task but I have three tasktrackers.
>
> What am I missing?
>
> Thanks,
>
> G.