Hello,
I'm working on the development of multi-agent software that
involves information indexing, information retrieval, and
information categorization tasks. I want to build the training set for
categorization using a set of HTML pages fetched from DMOZ RDF dumps.
I have tried the
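As background for building such a training set, here is a minimal, hedged sketch of pulling (URL, topic) pairs out of a DMOZ-style RDF dump. The `iter_url_topics` helper and the inline sample are hypothetical; the real content.rdf.u8 uses namespaced tags and attributes, which the suffix matching below is meant to tolerate:

```python
# A minimal, hypothetical sketch of extracting (URL, topic) pairs from a
# DMOZ-style RDF dump such as content.rdf.u8. The real dump uses namespaced
# tags/attributes (e.g. r:about), which the suffix matching below tolerates.
import xml.etree.ElementTree as ET
from io import BytesIO

def iter_url_topics(stream):
    """Yield (url, topic) pairs for every ExternalPage element."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag.rsplit("}", 1)[-1] == "ExternalPage":
            # 'about' may carry a namespace prefix in the real dump, so
            # fall back to the first attribute value if needed.
            url = elem.get("about") or next(iter(elem.attrib.values()), None)
            topic = None
            for child in elem:
                if child.tag.rsplit("}", 1)[-1] == "topic":
                    topic = child.text
            if url and topic:
                yield url, topic
            elem.clear()  # keep memory flat on multi-gigabyte dumps

# Tiny inline sample standing in for a real dump file.
sample = b"""<RDF>
  <ExternalPage about="http://example.com/">
    <topic>Top/Computers</topic>
  </ExternalPage>
</RDF>"""
pairs = list(iter_url_topics(BytesIO(sample)))
```

Each pair could then become a training example by fetching the page and labeling it with its DMOZ category.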
Hi Novelli,
Do you insist on the HtmlParser in Nutch?
If alternatives are an option, you could try the htmlparser
project hosted on sf.net:
http://htmlparser.sourceforge.net/
Regards
/Jack
On 7/29/05, Giovanni Novelli [EMAIL PROTECTED] wrote:
Hello,
I'm working on the development of multi-agent
Hello Joe,
If you are doing whole-web crawling you should change regex-urlfilter.txt
instead of crawl-urlfilter.txt.
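For reference, a minimal sketch of what entries in conf/regex-urlfilter.txt look like; the patterns here are illustrative, not taken from the original mail (each line is a +/- sign followed by a regex, applied in order):

```
# skip common binary suffixes
-\.(gif|jpg|png|zip|gz)$
# accept everything else
+.
```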
Piotr
On 7/28/05, Vacuum Joe [EMAIL PROTECTED] wrote:
I have a simple question: I'm using Nutch to do some
whole-web crawling (just a small dataset). Somehow
Nutch has gotten
Two options:
bin/nutch readdb crawl/db -stats
or use Luke (Google for luke lucene) to open the Lucene index.
Erik
On Jul 28, 2005, at 9:44 PM, blackwater dev wrote:
After I finish a crawl...what is the best way to go into my crawl
directory and get the number of indexed pages?
Hello,
The first one will give you the number of pages in the WebDB, and not
all of them are indexed.
Regards,
Piotr
On 7/29/05, Erik Hatcher [EMAIL PROTECTED] wrote:
Two options:
bin/nutch readdb crawl/db -stats
or use Luke (Google for luke lucene) to open the Lucene index.
Erik
On
Now what I tried (after what you said):
1. I started the command from a superuser terminal (SUSE 9.3)
= same problem
2. I stopped SUSE's firewall in YaST2 = same problem
3. The file is named urls, without any extension.
Regarding the misconfiguration of the network:
I'm not that much of a pro in Linux, so where
Try reinstalling a new version of J2EE?
I guess the JVM has a problem interfacing with the file system,
Michael,
--- Nils Hoeller [EMAIL PROTECTED] wrote:
Now what I tried (after what you said):
1. I started the command from a superuser
terminal (SUSE 9.3)
= same problem
2. I stopped SUSE's
I've now downloaded the newest J2EE from java.sun.com.
I've installed it by executing the bin file.
Should I do anything more?
The problem is: I still get the exception.
java -version gives me (if this matters)
java version 1.5.0_04
Java(TM) 2 Runtime Environment, Standard Edition
No :-(
I've added the PATH, but same error!
What does the exception mean exactly?
Is this really a problem with my machine?
Thanks Nils
On Friday, 29.07.2005 at 06:55 -0700, Feng (Michael) Ji wrote:
The Java path setting on my Linux (Red Hat 9) server is
as follows:
http://java.sun.com/j2se/1.4.2/docs/api/java/net/UnknownHostException.html
Could it be an IP problem with your server?
Michael,
--- Nils Hoeller [EMAIL PROTECTED] wrote:
No :-(
I've added the PATH, but same error!
What does the exception mean exactly?
Is this really a problem with my machine?
It seems I found the error!
... don't kill me, but when I use
the official nutch-0.6 version, everything works fine!
The problem only exists with the nutch-nightly versions!
Do you know why?
Anyway, I'll keep playing with the old version until
I start implementing my ideas.
Thanks to all
I am using nutch-nightly and everything is going well,
Michael,
--- Nils Hoeller [EMAIL PROTECTED] wrote:
It seems I found the error!
... don't kill me, but when I use
the official nutch-0.6 version, everything works
fine!
The problem only exists with the nutch-nightly
versions!
Hey Michael,
from which date is your nutch-nightly?
I used the build from two days ago.
The crawler is running fine at the moment
and fetching all of the sites I wanted,
as I said, with version nutch-0.6.
When I now start the nutch-nightly version,
I get the same old exception of the
My nightly version is about a month old. I might try
the latest Nutch if I have time later on, but I don't
think that will be the issue.
Nutch provides some high-level calls, mostly for
demo purposes, I guess;
any fancy customized system needs some programming
effort, at least against the Nutch API.
java.net.UnknownHostException: linux: linux
Something is wrong with your DNS configuration, I'm
guessing.
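A common cause of `java.net.UnknownHostException: linux: linux` is that the machine's own hostname (here `linux`) does not resolve to any address. A hedged sketch of the usual fix in /etc/hosts, assuming the hostname really is `linux` (adjust names and addresses to your machine; the 127.0.0.2 entry is a SUSE convention):

```
127.0.0.1   localhost
127.0.0.2   linux.site   linux
```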
--- Nils Hoeller [EMAIL PROTECTED] wrote:
Hi,
my problem is:
I've done everything as described in the Getting
Started tutorial at
nutch.org.
When I now run the command:
Hello Kamil,
Do you want to generate a fetchlist with URLs that are present in the WebDB
but have not been fetched yet?
I am not sure what you are trying to achieve, but you can generate any
fetchlist you want using the latest tool by Andrzej Bialecki