Text extraction from HTML

2005-07-29 Thread Giovanni Novelli
Hello, I'm working on the development of multi-agent software that involves information indexing, information retrieval, and information categorization tasks. I want to build the training set for categorization using a set of HTML pages fetched from DMOZ RDF dumps. I have tried the

Re: Text extraction from HTML

2005-07-29 Thread Jack Tang
Hi Novelli, Do you insist on using the HtmlParser in Nutch? If alternatives are acceptable, you could try the htmlparser project hosted on sf.net: http://htmlparser.sourceforge.net/ Regards, /Jack On 7/29/05, Giovanni Novelli [EMAIL PROTECTED] wrote: Hello, I'm working on the development of multi-agent
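For readers who only need plain text out of fetched pages (for example, to build a categorization training set as described above), here is a minimal sketch of the idea. It is not the Nutch HtmlParser and not the sf.net htmlparser library; it uses Python's standard html.parser module, and the names TextExtractor and extract_text are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse whitespace runs.
        # Assumption: a space between adjacent text nodes is acceptable,
        # even though it can split a word that is partly wrapped in a tag.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = "<html><body><p>Hello <b>world</b></p></body></html>"
    print(extract_text(page))
```

This sketch keeps the page title and does no boilerplate removal; a real training set would probably need both handled more carefully.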

Re: Preventing the fetch command from going to certain URLs

2005-07-29 Thread Piotr Kosiorowski
Hello Joe, If you are doing whole-web crawling, you should change regex-urlfilter.txt instead of crawl-urlfilter.txt. Piotr On 7/28/05, Vacuum Joe [EMAIL PROTECTED] wrote: I have a simple question: I'm using Nutch to do some whole-web crawling (just a small dataset). Somehow Nutch has gotten
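As a rough illustration of how rules in a file like regex-urlfilter.txt are evaluated: each non-comment line is a regular expression prefixed with + (accept) or - (reject), the first rule that matches a URL decides, and a URL that matches no rule is ignored. The sketch below models that first-match logic in Python. It is not Nutch code, the rules and the host unwanted.example are invented, and Python's re syntax is assumed to be close enough to Java's for these simple patterns.

```python
import re

# Hypothetical rule file contents; the host name is made up.
SAMPLE_RULES = r"""
# skip every host under unwanted.example
-^http://([a-z0-9-]+\.)*unwanted\.example/
# accept everything else
+.
"""


def parse_rules(text):
    """Parse '+regex' / '-regex' lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: %r" % line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL matching no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


if __name__ == "__main__":
    rules = parse_rules(SAMPLE_RULES)
    for url in ("http://www.unwanted.example/page",
                "http://lucene.apache.org/nutch/"):
        print(url, accepts(rules, url))
```

Because the first match decides, a reject rule has to appear above any catch-all accept rule such as `+.` to have an effect.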

Re: [Nutch-general] number of indexed pages

2005-07-29 Thread Erik Hatcher
Two options: bin/nutch readdb crawl/db -stats or use Luke (Google for luke lucene) to open the Lucene index. Erik On Jul 28, 2005, at 9:44 PM, blackwater dev wrote: After I finish a crawl...what is the best way to go into my crawl directory and get the number of indexed pages?

Re: [Nutch-general] number of indexed pages

2005-07-29 Thread Piotr Kosiorowski
Hello, The first one will give you the number of pages in the WebDB, and not all of them are indexed. Regards, Piotr On 7/29/05, Erik Hatcher [EMAIL PROTECTED] wrote: Two options: bin/nutch readdb crawl/db -stats or use Luke (Google for luke lucene) to open the Lucene index. Erik On

Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Nils Hoeller
Here is what I tried (after what you said): 1. I started the command from the Superuser Terminal (Suse 9.3) => same problem 2. I stopped Suse's firewall in Yast2 => same problem 3. The file is named urls, without any extension. Regarding the network misconfiguration: I'm not that experienced with Linux, so where

Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Feng (Michael) Ji
Try reinstalling a new version of J2EE? I guess the JVM has a problem interfacing with the file system. Michael --- Nils Hoeller [EMAIL PROTECTED] wrote: Here is what I tried (after what you said): 1. I started the command from the Superuser Terminal (Suse 9.3) => same problem 2. I stopped Suse's

Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Nils Hoeller
I've now downloaded the newest J2EE from java.sun.com and installed it by executing the bin file. Should I do anything more? The problem is that I still get the exception. java -version gives me (if this matters): java version 1.5.0_04 Java(TM) 2 Runtime Environment, Standard Edition

Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Nils Hoeller
No :-( I've added the PATH, but I get the same error! What does the exception mean exactly? Is this really a problem with my machine? Thanks, Nils On Friday, 29.07.2005, 06:55 -0700, Feng (Michael) Ji wrote: the java path setting on my Linux (redhat 9) server is as follows:

Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Feng (Michael) Ji
http://java.sun.com/j2se/1.4.2/docs/api/java/net/UnknownHostException.html Could it be an IP problem with your server? Michael --- Nils Hoeller [EMAIL PROTECTED] wrote: No :-( I've added the PATH, but I get the same error! What does the exception mean exactly? Is this really a problem with my machine?

Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Nils Hoeller
It seems I found the error! ... Don't kill me, but when I use the official nutch-0.6 version everything works! The problem only exists with the nutch-nightly versions! Do you know why? Anyway, I'll keep playing with the old version until I start implementing my ideas. Thanks to all

Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Feng (Michael) Ji
I am using nutch-nightly, and everything is going well. Michael --- Nils Hoeller [EMAIL PROTECTED] wrote: It seems I found the error! ... Don't kill me, but when I use the official nutch-0.6 version everything works! The problem only exists with the nutch-nightly versions!

Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Nils Hoeller
Hey Michael, what date is your nutch-nightly from? I used the build from 2 days ago. The crawler is running fine at the moment and fetching all of the sites I wanted, as I said, with version nutch-0.6. When I now start the nutch-nightly version, I get the same old exception of the

Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Feng (Michael) Ji
My nightly version is about 1 month old. I might try the latest Nutch if I have time later on, but I don't think that will be the issue. Nutch provides some high-level calls, mostly for demo purposes I guess; any fancy customized system needs some programming effort, at least in the Nutch API

Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Vacuum Joe
java.net.UnknownHostException: linux: linux Something is wrong with your DNS configuration, I'm guessing. --- Nils Hoeller [EMAIL PROTECTED] wrote: Hi, my problem is: I've done everything as described in the Getting Started Tutorial at nutch.org. When I now run the command:
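An exception of the form java.net.UnknownHostException: linux: linux typically means the machine could not resolve its own hostname (here, linux) to an address, which in Java happens in calls such as InetAddress.getLocalHost(). The sketch below performs the analogous lookup in Python as a quick check outside the JVM. It is only a diagnostic aid under that assumption, and check_local_hostname is a made-up name.

```python
import socket


def check_local_hostname():
    """Return (hostname, address); address is None if the lookup fails."""
    name = socket.gethostname()
    try:
        # Resolve the machine's own name, much as the JVM does when
        # it asks for the local host address.
        address = socket.gethostbyname(name)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return name, None
    return name, address


if __name__ == "__main__":
    name, address = check_local_hostname()
    if address is None:
        print("cannot resolve local hostname %r" % name)
    else:
        print("%s resolves to %s" % (name, address))
```

If the lookup fails, a common remedy is to make the hostname resolvable, for example with an entry for it in /etc/hosts. Whether that is the cause in this particular thread is not confirmed by the messages above.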

Re: prioritizing newly injected urls for fetching

2005-07-29 Thread Kamil Wnuk
Hello Kamil, Do you want to generate a fetchlist with URLs that are present in the WebDB but have not been fetched so far? I am not sure what you are trying to achieve, but you can generate any fetchlist you want using the latest tool by Andrzej Bialecki