Hai Jaya,
There is a class NutchBean in src/java/org/apache/nutch/searcher; you can
use it to run searches against your index.
bhupal.
Jaya Ghosh wrote:
Hello,
Greetings from India!
I went through your tutorial "Latest step by step installation guide for
dummies: Nutch 0.9".
I have downloaded
Hi,
Look at your conf/nutch-default.xml.
I think you have not added the crawl-urlfilter plugin to the plugin.includes
property.
bhupal.
Barry Haddow wrote:
Hi
I'm trying to get the nutch/hadoop example from
http://wiki.apache.org/nutch/NutchHadoopTutorial
running.
I've set up the
hi
in plugin.includes value change urlfilter-regex to urlfilter-(crawl|regex)
bhupal
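Putting bhupal's suggestion together, the property would end up looking roughly like this in conf/nutch-site.xml (a sketch — the value is an example list and must match the plugins your checkout actually ships; whether a urlfilter-crawl plugin exists in your version is worth verifying):

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(crawl|regex)|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>
```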
Barry Haddow wrote:
Hi Bhupal
The plugin.includes is below - I haven't changed it at all. What should it
be?
thanks and regards,
Barry
<property>
<name>plugin.includes</name>
Hi Susam
My urls file is
[EMAIL PROTECTED] conf]$ hadoop dfs -cat urls/urllist.txt
http://lucene.apache.org
I'm using the crawl-urlfilter.txt suggested in the tutorial - i.e. changing
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
to read
+^http://([a-z0-9]*\.)*apache.org/
When I run
nutch crawl urls
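As a quick sanity check outside Nutch (just plain grep -E; the pattern is the tutorial's accept rule with its leading '+' dropped and the dot escaped), you can confirm the rule matches the seed once it carries the trailing slash that Nutch's URL normalizer adds:

```shell
# Accept rule from crawl-urlfilter.txt, minus the leading '+'
pattern='^http://([a-z0-9]*\.)*apache\.org/'

# The normalized form of the seed URL (note the trailing slash)
echo 'http://lucene.apache.org/' | grep -Eq "$pattern" && echo accepted || echo rejected
```

If this prints rejected for your own pattern and seed, the filter is discarding the URL before fetching even starts.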
Hi Bhupal
The plugin.includes is below - I haven't changed it at all. What should it be?
thanks and regards,
Barry
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|
anchor)|query-(basic|site|url)|summary-basic|scoring-opic|
Hai Kevin,
After you replace the crawl folder, just touch web.xml so Tomcat redeploys the webapp.
Use this command
touch your_webapp_folder/WEB-INF/web.xml
bye,
bhupal
Kevin.Y wrote:
I'm using nutch0.9 to crawl some specified content urls, such as
http://x/art/1.htm
http://x/art/2.htm
http://x/art/3.htm
For the sake of politeness, I am trying to run an intentionally slow crawl
against one of our internal servers by setting the
fetcher.server.delay value to 20, but no matter what I change this
value to, it continues to
fetch at the same speed. I am running the latest stable version of 0.9. Also
set
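One thing worth checking (an assumption, not a confirmed diagnosis — the message above is truncated): fetcher.server.delay is only honored when a single thread talks to each host, so fetcher.threads.per.host should stay at 1, and the override belongs in conf/nutch-site.xml rather than nutch-default.xml. A sketch:

```xml
<property>
  <name>fetcher.server.delay</name>
  <value>20.0</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
</property>
```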
I'm trying to open a Lucene index created on a hadoop dfs.
Configuration nutchConf = NutchConfiguration.create();
FileSystem fs = FileSystem.get(nutchConf);
Path lastIndex = this.dataConf.lastIndexDir();
IndexReader idxReader = IndexReader.open(fs.getUri().toString() +
Hi,
I am new to nutch and I am trying to run a nutch to fetch something from
specific websites. Currently I am running 0.9.
As I have limited resources, I don't want nutch be too aggressive, so I want
to set some delay, but I am confused with the value of http.max.delays, does
it use
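For reference, http.max.delays is not itself a delay: per its description in nutch-default.xml, it is how many times a fetcher thread will wait (each wait lasting fetcher.server.delay seconds) for a busy host before giving up on that page. A gentle setup might look like this (values are examples, not recommendations):

```xml
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
</property>
<property>
  <name>http.max.delays</name>
  <value>100</value>
</property>
```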
Kenji wrote:
I'm trying to open a Lucene index created on a hadoop dfs.
Configuration nutchConf = NutchConfiguration.create();
FileSystem fs = FileSystem.get(nutchConf);
Path lastIndex = this.dataConf.lastIndexDir();
IndexReader idxReader = IndexReader.open(fs.getUri().toString() +
Hi
OK, now I get more output on the console, so the crawl might have worked. How
can I extract the crawled files from the dfs?
And should I be worried about the following error in hadoop.log:
2008-01-29 09:54:54,428 WARN mapred.ReduceTask -
java.io.FileNotFoundException:
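On the first question — getting the crawled data out of the DFS — the usual route is to copy it to the local filesystem with the Hadoop shell (paths here are examples, adjust to your crawl directory):

```shell
# Copy the whole crawl directory out of the DFS (destination is an example path)
bin/hadoop dfs -copyToLocal crawl /tmp/crawl-local
```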
Hello Bhupal,
Thanks for the mail. I used src/java/org/apache/nutch/searcher
It gave Total hits: 0
Where am I going wrong?
In the crawl-urlfilter.txt file I specified the location where my online
documentation is stored in html format. I have followed all the instructions
from the tutorial
I have nutch 0.8.1 loaded on my XP machine.
I created a directory named urls and in there a file named yooroc which
contains the line:
http://www.yooroc.com
I then edited crawl-urlfilter.txt and added this line:
s+^http://([a-z0-9]*\.)*yooroc.com/
Then in nutch-site.xml I have this:
<?xml
Hi folks...
Just installing a new server for Nutch - testing at this point...
Ran a crawl with no problems but can't do a search without getting an
Error 500.
CentOS5.1, Tomcat5.5.20, Java SDK 1.5.0_14
The last time I installed Nutch I ran into a similar issue and it had to
do with a config
Hi there,
On Jan 29, 2008 5:23 PM, Vinci [EMAIL PROTECTED] wrote:
Hi,
Thank you :)
One more question about reading the fetched pages: I would prefer to dump each
fetched page into a single HTML file.
You could modify the Fetcher class (org.apache.nutch.fetch.Fetcher) to
create a separate file
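Short of modifying Fetcher, the stock segment reader can already dump fetched content as text (not separate HTML files, but often close enough; `<segment_name>` is a placeholder for an actual timestamped segment directory):

```shell
# Dump a segment's content as text under dump_dir/
bin/nutch readseg -dump crawl/segments/<segment_name> dump_dir
```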
Any thoughts on this?
I get the same error with nutch 9.
Thanks.
On Jan 29, 2008 9:19 AM, blackwater dev [EMAIL PROTECTED] wrote:
I have nutch 0.8.1 loaded on my XP machine.
I created a directory named urls and in there a file named yooroc which
contains the line:
http://www.yooroc.com
Hi,
if you type java -version in your shell, the shell will output the Java
version you are using. I assume the output will refer to gcj, not to the
Sun JDK. You should change your environment variables or create the
necessary ones.
Open a shell and in your tomcat installation's root directory
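Concretely, the environment change might look like this (the JDK path is an example — point JAVA_HOME at wherever the Sun JDK actually lives on your box, typically via Tomcat's startup script):

```shell
# Point the environment at the Sun JDK instead of GCJ (path is an example)
export JAVA_HOME=/usr/java/jdk1.5.0_14
export PATH="$JAVA_HOME/bin:$PATH"
echo "$JAVA_HOME"
```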
I asked this question ten days ago. As I got no answer, I am posting it
again:
I have been using Nutch on Linux (Fedora) and decided to try Cygwin
on Windows XP.
I tried Nutch 0.9 (the official release), with Cygwin, without any
problems.
After that, I decided to try the latest
Hi,
Thank you :)
One more question about reading the fetched pages: I would prefer to dump each
fetched page into a single HTML file. Is there no other way besides inverting
the inverted file?
Martin Kuen wrote:
Hi,
On Jan 29, 2008 11:11 AM, Vinci [EMAIL PROTECTED] wrote:
Hi,
I am new to nutch and I
Hi
Can you attach the crawl-urlfilter file?
Thanks
kishore
-Original Message-
From: Jaya Ghosh [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 29, 2008 5:22 PM
To: nutch-user@lucene.apache.org
Subject: RE: Nutch Implementation query
Hello Bhupal,
Thanks for the mail. I used
John Funke wrote:
For the sake of politeness, I am trying to run an intentionally slow crawl
against one of our internal servers by setting the
fetcher.server.delay value to 20, but no matter what I change this
value to, it continues to
fetch at the same speed. I am running the latest stable
Hi,
Here is the answer for q1 and q3:
1. Tomcat is for the online search interface. If you won't include the
documentation in the release product, you don't need to include it in the
package; just set up Tomcat on the server where the index files are
located, modify the config file and
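The config file being referred to is presumably the webapp's nutch-site.xml (an assumption — the message is truncated here); the property that points the search interface at the index is searcher.dir, with a placeholder path:

```xml
<property>
  <name>searcher.dir</name>
  <value>/path/to/your/crawl</value>
</property>
```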
Hi,
On Jan 29, 2008 11:11 AM, Vinci [EMAIL PROTECTED] wrote:
Hi,
I am new to nutch and I am trying to run a nutch to fetch something from
specific websites. Currently I am running 0.9.
As I have limited resources, I don't want nutch be too aggressive, so I
want
to set some delay, but I
Paul Stewart wrote:
java.lang.NoClassDefFoundError: org.apache.hadoop.util.ReflectionUtils
java.lang.Class.initializeClass(libgcj.so.7rh)
This is not coming from Sun JDK - it's coming from GCJ. Check which
version of Java is used by Tomcat.
Thanks.. my apologies as new to Java (to complicate matters).
When I check in the tomcat.conf file I can't find a place to specify it. When I
do a search, there are multiple versions installed:
/usr/bin/java
/usr/share/java
/usr/include/c++/4.1.1/gnu/java
/usr/include/c++/4.1.1/java
/usr/java
Hi
I just tried the crawl again, no changes to the configuration since this
morning, using the exact same command. No URLs. The only error in hadoop.log
is
WARN crawl.Crawl - No URLs to fetch - check your seed list and URL filters.
Is there anywhere else I should look for errors? The nutch
Thanks for the reply...
Java -version shows this:
java version 1.4.2
gij (GNU libgcj) version 4.1.2 20070626 (Red Hat 4.1.2-14)
I used all pre-built packages hoping that they would do the trick ;)
I updated the tomcat startup script with the proper JAVA_HOME and now I
get:
[EMAIL PROTECTED]
Hi,
On Jan 29, 2008 7:14 PM, Paul Stewart [EMAIL PROTECTED] wrote:
Thanks for the reply...
Java -version shows this:
java version 1.4.2
gij (GNU libgcj) version 4.1.2 20070626 (Red Hat 4.1.2-14)
I just had a closer look at your stacktrace and your gij version. It's
version 1.4.2 and it
Thanks to everyone for their help... I installed apache-tomcat by hand
tonight and I have Nutch up and running now...
Just a few questions if you don't mind:
In Tomcat, I have webapps/nutch-0.9 as the directory making the URL
http://www.blahblah.com:8080/nutch-0.9
I want it in the root URL - if
Just a few questions if you don't mind:
In Tomcat, I have webapps/nutch-0.9 as the directory making the URL
http://www.blahblah.com:8080/nutch-0.9
I want it in the root URL - if I move the files up I just get a blank
page even after restarting Tomcat? Also, the port is 8080 - where
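Two usual fixes, hedged since the message is cut off: to serve Nutch at the root URL, remove Tomcat's stock webapps/ROOT and rename webapps/nutch-0.9 to webapps/ROOT (a blank page after moving files usually means searcher.dir in WEB-INF/classes/nutch-site.xml no longer points at the crawl directory); to change the port, edit the Connector element in conf/server.xml:

```xml
<!-- conf/server.xml: listen on port 80 instead of 8080 -->
<Connector port="80" maxHttpHeaderSize="8192"
           maxThreads="150" enableLookups="false"
           connectionTimeout="20000" disableUploadTimeout="true" />
```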
Hi,
Thank you. :)
Seems I need to write a Java program to write out the file and do the
transformation.
Another question about the dumped linkdb: I find escaped HTML appearing at the
end of the link. Is it the fault of the parser (the HTML is most likely not valid,
but I really don't need the chunk of the
I run the 0.9 crawler with parameter -depth 2 -threads 1, and I get the job
failed message for a dynamic-content site:
Dedup: starting
Dedup: adding indexes in: /var/crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
at