Re: Nutch Implementation query

2008-01-29 Thread bhupal
Hi Jaya, There is a class NutchBean in src/java/org/apache/nutch/searcher; you can use this to run nutch. bhupal. Jaya Ghosh wrote: Hello, Greetings from India! I went through your tutorial Latest step by Step Installation guide for dummies: Nutch 0.9 I have downloaded

Re: Simple crawl fails to find any URLs

2008-01-29 Thread bhupal
Hi, Look at your conf/nutch-default.xml. I think you have not added the crawl-urlfilter plugin to the plugin.includes property. bhupal. Barry Haddow wrote: Hi I'm trying to get the nutch/hadoop example from http://wiki.apache.org/nutch/NutchHadoopTutorial running. I've set up the

Re: Simple crawl fails to find any URLs

2008-01-29 Thread bhupal
Hi, in the plugin.includes value change urlfilter-regex to urlfilter-(crawl|regex). bhupal Barry Haddow wrote: Hi Bhupal The plugin.includes is below - I haven't changed it at all. What should it be? thanks and regards, Barry <property> <name>plugin.includes</name>
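For reference, a nutch-site.xml fragment along these lines would enable the crawl-time URL filter. This is a sketch adapted from the 0.9 defaults, not the poster's exact config; match the rest of the value to whatever plugin list your installation already uses:

```xml
<property>
  <name>plugin.includes</name>
  <!-- urlfilter-(crawl|regex) enables both the crawl-urlfilter.txt and
       regex-urlfilter.txt based filters, as suggested above. -->
  <value>protocol-http|urlfilter-(crawl|regex)|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```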

Re: Simple crawl fails to find any URLs

2008-01-29 Thread Barry Haddow
Hi Susam My urls file is [EMAIL PROTECTED] conf]$ hadoop dfs -cat urls/urllist.txt http://lucene.apache.org I'm using the crawl-urlfilter.txt suggested in the tutorial - i.e. changing +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ to read +^http://([a-z0-9]*\.)*apache.org/ When I run nutch crawl urls
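As a quick sanity check before re-running the crawl, you can feed the seed URL through the same regex used in crawl-urlfilter.txt (a sketch, assuming GNU grep is available; note the filter matches URLs with a trailing slash):

```shell
# grep -E prints the line only if the pattern matches, so any output
# means the filter would accept this URL.
echo "http://lucene.apache.org/" | grep -E '^http://([a-z0-9]*\.)*apache.org/'
```

If nothing is printed, the seed in urls/urllist.txt does not match the filter (for example because it lacks the trailing slash the pattern requires).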

Re: Simple crawl fails to find any URLs

2008-01-29 Thread Barry Haddow
Hi Bhupal The plugin.includes is below - I haven't changed it at all. What should it be? thanks and regards, Barry <property> <name>plugin.includes</name> <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|

Re: Need some advice about updating crawl data

2008-01-29 Thread bhupal
Hi Kevin, After you replace the crawl folder, just touch web.xml so Tomcat reloads the webapp. Use this command: touch your_webapp_folder/WEB-INF/web.xml bye, bhupal Kevin.Y wrote: I'm using nutch0.9 to crawl some specified content urls, such as http://x/art/1.htm http://x/art/2.htm http://x/art/3.htm

trying to perform an intentionally slow crawl - fetcher.server.delay ignored?

2008-01-29 Thread John Funke
For the sake of politeness, I am trying to run an intentionally slow crawl against one of our internal servers by setting the fetcher.server.delay value to 20, but no matter what I change this value to, it continues to fetch at the same speed. I am running the latest stable version of 0.9. Also set
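For anyone trying the same thing, the override normally goes in conf/nutch-site.xml so it takes precedence over nutch-default.xml. A sketch (the value is seconds, as a float); also check fetcher.threads.per.host, since allowing more than one thread per host can change how the delay behaves:

```xml
<property>
  <name>fetcher.server.delay</name>
  <value>20.0</value>
  <description>Seconds the fetcher waits between successive requests
  to the same server.</description>
</property>
```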

Can IndexReader be opened on a hadoop directory?

2008-01-29 Thread Kenji
I'm trying to open a Lucene index created on a hadoop dfs. Configuration nutchConf = NutchConfiguration.create(); FileSystem fs = FileSystem.get(nutchConf); Path lastIndex = this.dataConf.lastIndexDir(); IndexReader idxReader = IndexReader.open(fs.getUri().toString() +

Newbie Questions: http.max.delays, view fetched page, view link db

2008-01-29 Thread Vinci
Hi, I am new to nutch and I am trying to run it to fetch content from specific websites. Currently I am running 0.9. As I have limited resources, I don't want nutch to be too aggressive, so I want to set some delay, but I am confused by the value of http.max.delays: does it use
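For what it's worth, http.max.delays is not the length of the delay itself: it is the number of times a fetcher thread will defer a URL because its host is already busy (waiting fetcher.server.delay each time) before giving up on it. A hedged nutch-site.xml sketch; the values below are illustrative, not recommendations:

```xml
<property>
  <name>http.max.delays</name>
  <value>100</value>
  <description>Number of times a fetcher thread will delay on a busy
  host before dropping the URL.</description>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Seconds to wait between requests to the same host.</description>
</property>
```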

Re: Can IndexReader be opened on a hadoop directory?

2008-01-29 Thread Andrzej Bialecki
Kenji wrote: I'm trying to open a Lucene index created on a hadoop dfs. Configuration nutchConf = NutchConfiguration.create(); FileSystem fs = FileSystem.get(nutchConf); Path lastIndex = this.dataConf.lastIndexDir(); IndexReader idxReader = IndexReader.open(fs.getUri().toString() +

Re: Simple crawl fails to find any URLs

2008-01-29 Thread Barry Haddow
Hi OK, now I get more output on the console, so the crawl might have worked. How can I extract the crawled files from the dfs? And should I be worried about the following error in hadoop.log: 2008-01-29 09:54:54,428 WARN mapred.ReduceTask - java.io.FileNotFoundException:
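A sketch of how crawl output is usually pulled out of DFS (assumes a Nutch 0.9 install run from its root directory, and that `crawl` is the crawl directory in DFS; adjust names to your setup):

```shell
# See what the crawl actually produced in DFS.
bin/hadoop dfs -ls crawl

# Copy the whole crawl directory to the local filesystem.
bin/hadoop dfs -copyToLocal crawl crawl-local
```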

RE: Nutch Implementation query

2008-01-29 Thread Jaya Ghosh
Hello Bhupal, Thanks for the mail. I used src/java/org/apache/nutch/searcher It gave Total hits: 0 Where am I going wrong? In the crawl-urlfilter.txt file I specified the location where my online documentation is stored in html format. I have followed all the instructions from the tutorial

nutch won't crawl on windows

2008-01-29 Thread blackwater dev
I have nutch 0.8.1 loaded on my XP machine. I created a directory named urls and in there a file named yooroc which contains the line: http://www.yooroc.com I then edited crawl-urlfilter.txt and added this line: s+^http://([a-z0-9]*\.)*yooroc.com/ Then in nutch-site.xml I have this: ?xml

New Installation - Problems - Error 500

2008-01-29 Thread Paul Stewart
Hi folks... Just installing a new server for Nutch - testing at this point... Ran a crawl with no problems but can't do a search without getting an Error 500. CentOS5.1, Tomcat5.5.20, Java SDK 1.5.0_14 The last time I installed Nutch I ran into a similar issue and it had to do with a config

Re: Newbie Questions: http.max.delays, view fetched page, view link db

2008-01-29 Thread Martin Kuen
Hi there, On Jan 29, 2008 5:23 PM, Vinci [EMAIL PROTECTED] wrote: Hi, Thank you :) One more question for the fetched page reading: I prefer I can dump the fetched page into a single html file. You could modify the Fetcher class (org.apache.nutch.fetcher.Fetcher) to create a separate file

Re: nutch won't crawl on windows

2008-01-29 Thread blackwater dev
Any thoughts on this? I get the same error with nutch 9. Thanks. On Jan 29, 2008 9:19 AM, blackwater dev [EMAIL PROTECTED] wrote: I have nutch 0.8.1 loaded on my XP machine. I created a directory named urls and in there a file named yooroc which contains the line: http://www.yooroc.com

Re: New Installation - Problems - Error 500

2008-01-29 Thread Martin Kuen
Hi, if you type java -version in your shell, the shell will output the java version you are using. I assume the output will refer to gcj, not to the sun-jdk. You should change your environment variables or create the necessary ones. Open a shell and in your tomcat installation's root directory
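A sketch of the environment setup; the JDK path below is only an example, point it at wherever your Sun JDK is actually installed:

```shell
# Make the shell (and Tomcat started from it) use the Sun JDK, not gcj.
export JAVA_HOME=/usr/java/jdk1.5.0_14   # example path - adjust
export PATH="$JAVA_HOME/bin:$PATH"

# Verify: this should now report the Sun JVM, not gij/libgcj.
java -version

# Restart Tomcat so it picks up the new JAVA_HOME.
$CATALINA_HOME/bin/shutdown.sh && $CATALINA_HOME/bin/startup.sh
```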

Problems in Cygwin

2008-01-29 Thread Wilson Melo
I asked this question ten days ago. As I got no answer, I am posting it again: I have been using Nutch on Linux (Fedora) and decided to try Cygwin on Windows XP. I tried Nutch 0.9 (the official release) with Cygwin without any problems. After that, I decided to try the latest

Re: Newbie Questions: http.max.delays, view fetched page, view link db

2008-01-29 Thread Vinci
Hi, Thank you :) One more question about reading the fetched pages: I would prefer to dump each fetched page into a single html file. Is there no other way besides inverting the inverted file? Martin Kuen wrote: Hi, On Jan 29, 2008 11:11 AM, Vinci [EMAIL PROTECTED] wrote: Hi, I am new to nutch and I
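Short of modifying the Fetcher, the closest built-in option is the segment reader, which dumps fetched content as text rather than one HTML file per page. A sketch, assuming a 0.9 install; the segment name (here a hypothetical placeholder) will differ on your system:

```shell
# Dump everything in a segment (content, fetch data, parse text) to segdump/.
bin/nutch readseg -dump crawl/segments/20080129xxxxxx segdump

# Or keep only the raw fetched content by switching the other parts off.
bin/nutch readseg -dump crawl/segments/20080129xxxxxx segdump2 \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext
```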

RE: Nutch Implementation query

2008-01-29 Thread kishore.krishna2
Hi Can you attach the crawl-urlfilter... Thanks kishore -Original Message- From: Jaya Ghosh [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 29, 2008 5:22 PM To: nutch-user@lucene.apache.org Subject: RE: Nutch Implementation query Hello Bhupal, Thanks for the mail. I used

Re: trying to perform an intentionally slow crawl - fetcher.server.delay ignored?

2008-01-29 Thread Andrzej Bialecki
John Funke wrote: For the sake of politeness, I am trying to run an intentionally slow crawl against one of our internal servers by setting the fetcher.server.delay value to 20, but no matter what I change this value to, it continues to fetch at the same speed. I am running the latest stable

Re: Tomcat query

2008-01-29 Thread Vinci
Hi, Here are the answers for q1 and q3: 1. Tomcat is for the online search interface. If you won't include the documentation in the release product, you don't need to include it in the package; just set up tomcat on the server where the index file is located, modify the config file and

Re: Newbie Questions: http.max.delays, view fetched page, view link db

2008-01-29 Thread Martin Kuen
Hi, On Jan 29, 2008 11:11 AM, Vinci [EMAIL PROTECTED] wrote: Hi, I am new to nutch and I am trying to run it to fetch content from specific websites. Currently I am running 0.9. As I have limited resources, I don't want nutch to be too aggressive, so I want to set some delay, but I

Re: New Installation - Problems - Error 500

2008-01-29 Thread Andrzej Bialecki
Paul Stewart wrote: java.lang.NoClassDefFoundError: org.apache.hadoop.util.ReflectionUtils java.lang.Class.initializeClass(libgcj.so.7rh) This is not coming from Sun JDK - it's coming from GCJ. Check which version of Java is used by Tomcat.

RE: New Installation - Problems - Error 500

2008-01-29 Thread Paul Stewart
Thanks.. my apologies, as I am new to Java (to complicate matters). When I check in the tomcat.conf file I can't find a place to specify it. When I do a search, there are multiple versions installed: /usr/bin/java /usr/share/java /usr/include/c++/4.1.1/gnu/java /usr/include/c++/4.1.1/java /usr/java

Re: Simple crawl fails to find any URLs

2008-01-29 Thread Barry Haddow
Hi I just tried the crawl again, no changes to the configuration since this morning, using the exact same command. No URLs. The only error in hadoop.log is WARN crawl.Crawl - No URLs to fetch - check your seed list and URL filters. Is there anywhere else I should look for errors? The nutch

RE: New Installation - Problems - Error 500

2008-01-29 Thread Paul Stewart
Thanks for the reply... java -version shows this: java version 1.4.2 gij (GNU libgcj) version 4.1.2 20070626 (Red Hat 4.1.2-14) I used all pre-built packages hoping that they would do the trick ;) I updated the tomcat startup script with the proper JAVA_HOME and now I get: [EMAIL PROTECTED]

Re: New Installation - Problems - Error 500

2008-01-29 Thread Martin Kuen
Hi, On Jan 29, 2008 7:14 PM, Paul Stewart [EMAIL PROTECTED] wrote: Thanks for the reply... Java -version shows this: java version 1.4.2 gij (GNU libgcj) version 4.1.2 20070626 (Red Hat 4.1.2-14) I just had a closer look at your stacktrace and your gij version. It's version 1.4.2 and it

RE: New Installation - Problems - Error 500

2008-01-29 Thread Paul Stewart
Thanks to everyone for their help... I installed apache-tomcat by hand tonight and I have Nutch up and running now... Just a few questions if you don't mind: In Tomcat, I have webapps/nutch-0.9 as the directory making the URL http://www.blahblah.com:8080/nutch-0.9 I want it in the root URL - if

Re: New Installation - Problems - Error 500

2008-01-29 Thread John Mendenhall
Just a few questions if you don't mind: In Tomcat, I have webapps/nutch-0.9 as the directory making the URL http://www.blahblah.com:8080/nutch-0.9 I want it in the root URL - if I move the files up I just get a blank page even after restarting Tomcat? Also, the port is 8080 - where
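A sketch of the usual approach for both questions (paths assume a standard Tomcat layout under $CATALINA_HOME; back up anything you overwrite):

```shell
# Serve Nutch at the root URL: deploy it as ROOT instead of nutch-0.9.
cd $CATALINA_HOME/webapps
mv ROOT ROOT.orig      # keep the stock root webapp around
mv nutch-0.9 ROOT      # Tomcat now serves it at /
```

The listen port lives in $CATALINA_HOME/conf/server.xml: change the HTTP Connector's port="8080" to port="80" and restart Tomcat (binding ports below 1024 requires root on Linux).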

Re: Newbie Questions: http.max.delays, view fetched page, view link db

2008-01-29 Thread Vinci
Hi, thank you :) It seems I need to write a Java program to write out the file and do the transformation. Another question about the dumped linkdb: I find escaped html at the end of the link. Is it the fault of the parser? (the html is most likely not valid, but I really don't need the chunk of the

Dedup: Job Failed and crawl stopped at depth 1

2008-01-29 Thread Vinci
I ran the 0.9 crawler with the parameters -depth 2 -threads 1, and I get a job failed message for a dynamic-content site: Dedup: starting Dedup: adding indexes in: /var/crawl/indexes Exception in thread "main" java.io.IOException: Job failed! at