Text extraction from HTML

2005-07-29 Thread Giovanni Novelli
Hello,
I'm working on the development of multi-agent software that
involves information indexing, information retrieval, and
information categorization tasks. I want to build the training set for
categorization from a set of HTML pages fetched from DMOZ RDF dumps.
I have tried the HtmlParser that comes with Nutch, but I wasn't able to
make it work without adjusting Nutch's global XML configuration;
is that perhaps the only way to make such a plugin work? Does Lucene offer
a good HTML parser in its contrib section for parsing web pages found
in the wild?

Best regards,
Giovanni Novelli

P.S.: This is a crosspost as I'm relying on both Lucene and Nutch.


Re: Text extraction from HTML

2005-07-29 Thread Jack Tang
Hi Novelli,

Do you specifically need the HtmlParser in Nutch?
If alternatives are acceptable, you could try the htmlparser
project hosted on sf.net:

http://htmlparser.sourceforge.net/

Regards
/Jack
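
For readers who only need plain text for a training set, the basic idea behind such parsers can be sketched without Nutch or Lucene. The following is a minimal illustration using Python's standard library, not the Nutch HtmlParser or the sf.net htmlparser API. The names TextExtractor and extract_text are hypothetical, and badly broken pages found in the wild may need a more tolerant parser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling extract_text on the content of each fetched page would give one plain-text string per page, which could then be fed to the categorizer.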

 


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars


Re: Preventing the fetch command from going to certain URLs

2005-07-29 Thread Piotr Kosiorowski
Hello Joe,
If you are doing whole-web crawling, you should change regex-urlfilter.txt
instead of crawl-urlfilter.txt.

Piotr
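
To illustrate what rules in such a filter file do, here is a small Python sketch of the matching behaviour as I understand it: each rule is a "+" (accept) or "-" (reject) followed by a regular expression, the first rule that matches decides, and a URL that matches no rule is rejected. The rules shown are hypothetical examples for the two cases Joe describes, not taken from any shipped configuration, and Python's re module stands in for the Java regex engine.

```python
import re

# Hypothetical rules written in the "+regex" / "-regex" style.
RULES = r"""
# skip the Afrikaans Wikipedia
-^http://af\.wikipedia\.org/
# skip hosts given as a bare IPv4 address
-^http://\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(?:[:/]|$)
# accept everything else
+.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_regex) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """First matching rule wins; a URL matching no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False
```

Because the first match wins, the reject rules have to appear before the final catch-all accept rule.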

On 7/28/05, Vacuum Joe [EMAIL PROTECTED] wrote:
 I have a simple question: I'm using Nutch to do some
 whole-web crawling (just a small dataset).  Somehow
 Nutch has gotten a lot of URLs from af.wikipedia.org
 into its segments, and when I generate another
 segment (using -topN 2) it wants to crawl a bunch
 more urls from af.wikipedia.org.  I don't want to
 crawl any of the Afrikaans Wikipedia.  Is there a way
 to block that?  Also, I want to block it from ever
 crawling domains like 33.44.55.66, because those are
 usually very badly configured servers with worthless
 content.
 
 I tried to put those things into crawl-urlfilter.txt
 file and the banned-hosts.txt file, but it seems that
 the fetch command doesn't pay attention to those two
 files.
 
 Should I be using crawl instead of fetch?
 
 



Re: [Nutch-general] number of indexed pages

2005-07-29 Thread Erik Hatcher

Two options:

bin/nutch readdb crawl/db -stats

or use Luke (Google for luke lucene) to open the Lucene index.

Erik

On Jul 28, 2005, at 9:44 PM, blackwater dev wrote:


After I finish a crawl...what is the best way to go into my crawl
directory and get the number of indexed pages?

Thanks!



___
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general





Re: [Nutch-general] number of indexed pages

2005-07-29 Thread Piotr Kosiorowski
Hello,
The first option gives you the number of pages in the WebDB, and not all
of them are indexed.

Regards,
Piotr

 
 



Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Nils Hoeller
Here is what I tried (after what you said):

1. I ran the command from the superuser terminal (Suse 9.3)
   => same problem

2. I stopped Suse's firewall in Yast2 => same problem

3. The file is named urls, without any extension

Regarding the network misconfiguration:

I'm not that experienced with Linux, so where should I look?
Currently I connect to the internet over PPPoE;
tomorrow, when my router arrives, I will connect directly over LAN.
As I mentioned, stopping the firewall (which I thought
was the reason for the exception) doesn't help.

What else could be misconfigured?

The exception is the same every time:

run java in /usr/java/jdk1.5.0_04
050729 131449 parsing file:/home/nils/Studienarbeit/nutch-nightly/conf/nutch-default.xml
050729 131449 parsing file:/home/nils/Studienarbeit/nutch-nightly/conf/crawl-tool.xml
050729 131449 parsing file:/home/nils/Studienarbeit/nutch-nightly/conf/nutch-site.xml
050729 131449 No FS indicated, using default:local
050729 131449 crawl started in: crawl.test
050729 131449 rootUrlFile = urls
050729 131449 threads = 10
050729 131449 depth = 3
Exception in thread "main" java.lang.RuntimeException: java.net.UnknownHostException: linux: linux
        at org.apache.nutch.io.SequenceFile$Writer.<init>(SequenceFile.java:67)
        at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:94)
        at org.apache.nutch.db.WebDBWriter.<init>(WebDBWriter.java:1507)
        at org.apache.nutch.db.WebDBWriter.createWebDB(WebDBWriter.java:1438)
        at org.apache.nutch.tools.WebDBAdminTool.main(WebDBAdminTool.java:172)
        at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:133)
Caused by: java.net.UnknownHostException: linux: linux
        at java.net.InetAddress.getLocalHost(InetAddress.java:1308)
        at org.apache.nutch.io.SequenceFile$Writer.<init>(SequenceFile.java:64)


Thanks for your help

Nils


On Thursday, 28.07.2005, at 18:41 -0700, Feng (Michael) Ji wrote:
 Try switching to superuser mode in Linux? It seems
 to be an I/O error from the JVM.
 
 Michael
 
 --- Nils Hoeller [EMAIL PROTECTED] wrote:
 
  Hi,
  
  My problem is:
  
  I've done everything as described in the Getting Started tutorial at
  nutch.org.
  
  When I now run the command:
  
  bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log
  
  I get this exception in the log file:
  
  run java in /usr/java/jdk1.5.0_04
  050828 104004 parsing file:/home/nils/Studienarbeit/nutch-nightly/conf/nutch-default.xml
  050828 104004 parsing file:/home/nils/Studienarbeit/nutch-nightly/conf/crawl-tool.xml
  050828 104004 parsing file:/home/nils/Studienarbeit/nutch-nightly/conf/nutch-site.xml
  050828 104004 No FS indicated, using default:local
  050828 104004 crawl started in: crawl.test
  050828 104004 rootUrlFile = urls
  050828 104004 threads = 10
  050828 104004 depth = 3
  Exception in thread "main" java.lang.RuntimeException: java.net.UnknownHostException: linux: linux
          at org.apache.nutch.io.SequenceFile$Writer.<init>(SequenceFile.java:67)
          at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:94)
          at org.apache.nutch.db.WebDBWriter.<init>(WebDBWriter.java:1507)
          at org.apache.nutch.db.WebDBWriter.createWebDB(WebDBWriter.java:1438)
          at org.apache.nutch.tools.WebDBAdminTool.main(WebDBAdminTool.java:172)
          at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:133)
  Caused by: java.net.UnknownHostException: linux: linux
          at java.net.InetAddress.getLocalHost(InetAddress.java:1308)
          at org.apache.nutch.io.SequenceFile$Writer.<init>(SequenceFile.java:64)
          ... 5 more
  
  My urls file looks like this:
  
  http://www.nutch.org/
  
  I've also tried:
  
  http://www.ifis.uni-luebeck.de/ which I'd like to get nutched.
  
  The urlfilter conf also contains:
  
  +^http://([a-z0-9]*\.)*ifis.uni-luebeck.de/
  +^http://([a-z0-9]*\.)*nutch.org/
  
  Can anyone give me a hint?
  Where is the error?
  
  Thanks, Nils
  
  
 
 
 
   
 
 



Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Feng (Michael) Ji
Try reinstalling a newer version of J2EE?

I guess the JVM has a problem interfacing with the file system.

Michael,

 


Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Nils Hoeller
I've now downloaded the newest J2EE from java.sun.com.

I installed it by executing the bin file.
Should I do anything more?

The problem is: I still get the exception.

java -version gives me (if this matters):
java version "1.5.0_04"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_04-b05)
Java HotSpot(TM) Client VM (build 1.5.0_04-b05, mixed mode, sharing)

These are the environment variables in my .bashrc:
export NUTCH_JAVA_HOME=/usr/java/jdk1.5.0_04
export JAVA_HOME=/usr/java/jdk1.5.0_04
export CATALINA_HOME=/home/nils/jakarta-tomcat-4.1.27

They work for Tomcat, so I guess they will also work for Nutch (the
Java path).

It's getting really frustrating... :-(

Thanks anyway 

Nils


Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Nils Hoeller
No :-(

I've added the PATH, but I get the same error!

What exactly does the exception mean?
Is this really a problem with my machine?

Thanks Nils

On Friday, 29.07.2005, at 06:55 -0700, Feng (Michael) Ji wrote:
 The Java path setting on my Linux (Red Hat 9) server is
 as follows:
 
 
 PATH=/home/michael/J2EE/jdk/bin:$PATH:$HOME/bin:./
 export PATH
 export JAVA_HOME=/home/michael/J2EE/jdk
 export CATALINA_HOME=/home/michael/SE/tomcat4
 
 
 Will that help you?
 
 Michael,
 

Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Feng (Michael) Ji
http://java.sun.com/j2se/1.4.2/docs/api/java/net/UnknownHostException.html

Could it be an IP problem with your server?

Michael,

 


Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Nils Hoeller
It seems I found the error!

... don't kill me, but when I use
the official nutch-0.6 version, everything works!

The problem only exists with the nutch-nightly versions!

Do you know why?

Anyway, I'll keep playing with the old version until
I start implementing my ideas.

Thanks to all

Nils



Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Feng (Michael) Ji
I am using nutch-nightly, and everything is going well.

Michael,



Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Nils Hoeller
Hey Michael,

What date is your nutch-nightly from?
I used the build from two days ago.

The crawler is running fine at the moment
and fetching all of the sites I wanted.
As I said, that is with version nutch-0.6.

When I start the nutch-nightly version,
I get the same old UnknownHostException.

Have there been deep changes (in the crawling part, where the
error seems to be) between 0.6 and today's nightly versions?

One last question:

What is the nutch-daemon good for?
Can I use it for the following case?

I want to have a Nutch process running that
looks at the urls file every few seconds/minutes and
performs a crawl/index when a new URL has been appended.

That would give me an on-demand crawling/indexing service.

Can I do this with the nutch-daemon?

Greetings Nils




Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Feng (Michael) Ji
My nightly version is about one month old. I might try the
latest Nutch if I have time later on, but I don't
think that is the issue.

Nutch provides some high-level calls, mostly for
demo purposes, I guess.

Any fancy customized system needs some programming
effort at least at the Nutch API level, if not at the
Lucene API level; actually, that is what I am
preparing to do now...

Michael,

--- Nils Hoeller [EMAIL PROTECTED] wrote:

 Hey Michael, 
 
 from which Date is your nutch-nightly?
 I used the 2 days ago build version.
 
 The crawler is running fine in this moment
 and fetching all of the sites i wanted.
 As I said with version nutch-0.6.
 
 When I now start the nutch-nightly version, 
 I get the same old exception of the unknownHost.
 
 Has there been deep changes (in crawling part, where
 the 
 error seems to exist) from 0.6 to the todays nighly
 versions ?
 
 One last question:
 
 What is the nutch-daemon good for? 
 Can I use him for that case:
 
 I want to have a nutch process running, that
 looks at the urls file every few seconds/minutes and
 performs a crawl/index when a new url has been
 appended.
 
 So this should give me a on demand crawling/indexing
 service?
 
 Can I do this, with the nutch-daemon.
 
 Greetings Nils
 
 Am Freitag, den 29.07.2005, 07:22 -0700 schrieb Feng
 (Michael) Ji:
  I am using nutch-nightly, everything going well, 
  
  Michael,
  
  --- Nils Hoeller [EMAIL PROTECTED] wrote:
  
   It seems I found the error !!
   
   ... don t kill me , but when I use
   the official nutch-0.6 Version everything is
 going
   right!
   
   The Problem only exist with the nutch-nightly
   versions!!
   
   Do you know why ? 
   
   Anyway I go playing with the old version, till
   I start implementing my thoughts.
   
   Thanks to all
   
   Nils
   
   
  
  
  __
  Do You Yahoo!?
  Tired of spam?  Yahoo! Mail has the best spam
 protection around 
  http://mail.yahoo.com 
 
 


__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 


Re: Problem Starting Nutch (Tutorial like)

2005-07-29 Thread Vacuum Joe
java.net.UnknownHostException: linux: linux

Something is wrong with your DNS configuration, I'm
guessing.
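
The trace ends in InetAddress.getLocalHost(), which fails when the machine's own hostname (here "linux") cannot be resolved to an address. A rough equivalent of that lookup, sketched in Python rather than Java, can confirm whether the hostname resolves. The function name check_local_hostname is hypothetical. If the lookup fails, a common remedy is an /etc/hosts entry that maps the hostname to an address, though whether that is the cause on this particular machine is an assumption.

```python
import socket


def check_local_hostname():
    """Return (hostname, address); address is None if resolution fails."""
    hostname = socket.gethostname()
    try:
        address = socket.gethostbyname(hostname)
    except OSError:
        # Resolution errors such as socket.gaierror are OSError subclasses.
        address = None
    return hostname, address


if __name__ == "__main__":
    name, addr = check_local_hostname()
    if addr is None:
        print(f"hostname {name!r} does not resolve; check /etc/hosts or DNS")
    else:
        print(f"hostname {name!r} resolves to {addr}")
```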

 


Re: prioritizing newly injected urls for fetching

2005-07-29 Thread Kamil Wnuk
 Hello Kamil,
 
 Do you want to generate a fetchlist with URLs that are present in the WebDB
 but have not been fetched so far?
 
 I am not sure what you are trying to achieve, but you can generate any
 fetchlist you want using the latest tool by Andrzej Bialecki
 (http://issues.apache.org/jira/browse/NUTCH-68) (I have not tried it myself).
 There was also a discussion (some time ago) on the Nutch mailing list
 about the refetchonly param for the fetchlist generator; some ideas are still
 not implemented, but you can read how it currently works.
 Regards
 Piotr

Hi Piotr,
Thanks for your advice. The sources you directed me to helped me track
down my issue.  I realized I was updating my webdb right after inject
operations and immediately before generating new fetchlists.  As a
result the scores that I meant for newly injected links to have were
being altered.  Thus I was initially misled to think that the
db.score.injected property did not work as advertised.

So I changed the order of my scripts a bit and now everything is working.

-Kamil