Text extraction from HTML
Hello,

I'm working on the development of multi-agent software that involves information indexing, information retrieval and information categorization tasks. I want to build the training set for categorization from a set of HTML pages fetched from the DMOZ RDF dumps.

I have tried the HtmlParser that comes with Nutch, but I wasn't able to make it work without adjusting Nutch's global XML configuration; perhaps that is the only way to make such a plugin work? Does Lucene offer a good HTML parser in its contrib section for parsing web pages found in the wild?

Best regards,
Giovanni Novelli

P.S.: This is a crosspost, as I'm relying on both Lucene and Nutch.
Re: Text extraction from HTML
Hi Novelli,

Do you insist on using the HtmlParser in Nutch? If alternatives are acceptable, you could try the htmlparser project hosted on sf.net: http://htmlparser.sourceforge.net/

Regards,
Jack
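For a quick experiment that needs neither Nutch's configuration nor an extra download, the HTML parser bundled with the JDK (the Swing `ParserDelegator`) can pull the character data out of a page. This is a swapped-in technique, not the Nutch HtmlParser and not the sf.net htmlparser suggested above; the class name `HtmlTextExtractor` and the sample page are made up for illustration, and the JDK parser is an old, lenient one that may cope poorly with some real-world markup.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; not part of Nutch or Lucene.
class HtmlTextExtractor {

    /** Collects the text chunks the parser reports, separated by single spaces. */
    static String extractText(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        // 'true' makes the parser ignore charset declarations inside the page,
        // because the Reader has already decoded the bytes.
        new ParserDelegator().parse(html, callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(new StringReader(page)));
    }
}
```

The extracted string could then be fed to the categorization training step; whether the quality is good enough for pages found in the wild would have to be checked against the other parsers.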
Re: Preventing the fetch command from going to certain URLs
Hello Joe,

If you are doing whole-web crawling, you should change regex-urlfilter.txt instead of crawl-urlfilter.txt.

Piotr

On 7/28/05, Vacuum Joe wrote:

I have a simple question: I'm using Nutch to do some whole-web crawling (just a small dataset). Somehow Nutch has gotten a lot of URLs from af.wikipedia.org into its segments, and when I generate another segment (using -topN 2) it wants to crawl a bunch more URLs from af.wikipedia.org. I don't want to crawl any of the Afrikaans Wikipedia. Is there a way to block that? Also, I want to block it from ever crawling domains like 33.44.55.66, because those are usually very badly configured servers with worthless content.

I tried to put those things into the crawl-urlfilter.txt file and the banned-hosts.txt file, but it seems that the fetch command doesn't pay attention to those two files. Should I be using crawl instead of fetch?
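Before editing regex-urlfilter.txt, it can help to try the candidate rules against a few URLs. The sketch below is not Nutch's filter code; it only imitates the behaviour I assume the filter has (rules of the form +regex or -regex, the first matching rule decides, a URL that matches nothing is rejected). The class name `UrlFilterSketch` and the three rules are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a +regex / -regex URL filter; not Nutch code.
class UrlFilterSketch {

    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Adds a rule written as "+regex" (accept) or "-regex" (reject). */
    UrlFilterSketch add(String line) {
        char sign = line.charAt(0);
        if (sign != '+' && sign != '-') {
            throw new IllegalArgumentException("rule must start with + or -: " + line);
        }
        rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        return this;
    }

    /** Assumed semantics: the first matching rule decides; no match means reject. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    /** Candidate rules for the two cases raised in the question. */
    static UrlFilterSketch example() {
        return new UrlFilterSketch()
                // -^http://af\.wikipedia\.org/
                .add("-^http://af\\.wikipedia\\.org/")
                // hosts given as a bare IPv4 address, with an optional port
                .add("-^http://[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+(:[0-9]+)?(/|$)")
                // accept everything else
                .add("+.");
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = example();
        String[] urls = {
            "http://af.wikipedia.org/wiki/Tuisblad",
            "http://33.44.55.66/index.html",
            "http://en.wikipedia.org/wiki/Main_Page"
        };
        for (String url : urls) {
            System.out.println((filter.accepts(url) ? "accept " : "reject ") + url);
        }
    }
}
```

If the patterns behave as intended here, the same three lines (without the Java string escaping) are what would go near the top of regex-urlfilter.txt, ahead of any broader accept rule.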
Re: [Nutch-general] number of indexed pages
Two options:

bin/nutch readdb crawl/db -stats

or use Luke (Google for luke lucene) to open the Lucene index.

Erik

On Jul 28, 2005, at 9:44 PM, blackwater dev wrote:

After I finish a crawl... what is the best way to go into my crawl directory and get the number of indexed pages? Thanks!

___
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
Re: [Nutch-general] number of indexed pages
Hello,

The first option gives you the number of pages in the WebDB, and not all of them are indexed.

Regards,
Piotr
Re: Problem Starting Nutch (Tutorial like)
Here is what I tried after your suggestions:

1. I ran the command from the superuser terminal (SuSE 9.3): same problem.
2. I stopped SuSE's firewall in YaST2: same problem.
3. The file is named urls, without any extension.

As for a network misconfiguration: I'm not that experienced with Linux, so where should I look? At the moment I connect to the internet over PPPoE; tomorrow, when my router arrives, I will connect directly over the LAN. As I mentioned, stopping the firewall (which I also suspected to be the reason for the exception) doesn't help. What else could be misconfigured?

The exception is the same every time:

run java in /usr/java/jdk1.5.0_04
050729 131449 parsing file:/home/nils/Studienarbeit/nutch-nightly/conf/nutch-default.xml
050729 131449 parsing file:/home/nils/Studienarbeit/nutch-nightly/conf/crawl-tool.xml
050729 131449 parsing file:/home/nils/Studienarbeit/nutch-nightly/conf/nutch-site.xml
050729 131449 No FS indicated, using default:local
050729 131449 crawl started in: crawl.test
050729 131449 rootUrlFile = urls
050729 131449 threads = 10
050729 131449 depth = 3
Exception in thread "main" java.lang.RuntimeException: java.net.UnknownHostException: linux: linux
        at org.apache.nutch.io.SequenceFile$Writer.<init>(SequenceFile.java:67)
        at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:94)
        at org.apache.nutch.db.WebDBWriter.<init>(WebDBWriter.java:1507)
        at org.apache.nutch.db.WebDBWriter.createWebDB(WebDBWriter.java:1438)
        at org.apache.nutch.tools.WebDBAdminTool.main(WebDBAdminTool.java:172)
        at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:133)
Caused by: java.net.UnknownHostException: linux: linux
        at java.net.InetAddress.getLocalHost(InetAddress.java:1308)
        at org.apache.nutch.io.SequenceFile$Writer.<init>(SequenceFile.java:64)

Thanks for your help,
Nils

On Thursday, 28.07.2005, at 18:41 -0700, Feng (Michael) Ji wrote:

Try changing your user mode to superuser in Linux? It seems to be an I/O error from the JVM.

Michael

--- Nils Hoeller wrote:

Hi,

my problem is: I've done everything as described in the Getting Started tutorial at nutch.org. When I now run the command

bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log

I get this exception in the log file:

run java in /usr/java/jdk1.5.0_04
050828 104004 parsing file:/home/nils/Studienarbeit/nutch-nightly/conf/nutch-default.xml
050828 104004 parsing file:/home/nils/Studienarbeit/nutch-nightly/conf/crawl-tool.xml
050828 104004 parsing file:/home/nils/Studienarbeit/nutch-nightly/conf/nutch-site.xml
050828 104004 No FS indicated, using default:local
050828 104004 crawl started in: crawl.test
050828 104004 rootUrlFile = urls
050828 104004 threads = 10
050828 104004 depth = 3
Exception in thread "main" java.lang.RuntimeException: java.net.UnknownHostException: linux: linux
        at org.apache.nutch.io.SequenceFile$Writer.<init>(SequenceFile.java:67)
        at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:94)
        at org.apache.nutch.db.WebDBWriter.<init>(WebDBWriter.java:1507)
        at org.apache.nutch.db.WebDBWriter.createWebDB(WebDBWriter.java:1438)
        at org.apache.nutch.tools.WebDBAdminTool.main(WebDBAdminTool.java:172)
        at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:133)
Caused by: java.net.UnknownHostException: linux: linux
        at java.net.InetAddress.getLocalHost(InetAddress.java:1308)
        at org.apache.nutch.io.SequenceFile$Writer.<init>(SequenceFile.java:64)
        ... 5 more

My urls file looks like this:

http://www.nutch.org/

I've also tried

http://www.ifis.uni-luebeck.de/

which I'd like to get nutched. The urlfilter conf also contains:

+^http://([a-z0-9]*\.)*ifis.uni-luebeck.de/
+^http://([a-z0-9]*\.)*nutch.org/

Can anyone give me a hint? Where is the error?

Thanks,
Nils
Re: Problem Starting Nutch (Tutorial like)
Try reinstalling a newer version of J2EE? I guess the JVM has a problem interfacing with the file system.

Michael
Re: Problem Starting Nutch (Tutorial like)
I've now downloaded the newest J2EE from java.sun.com and installed it by executing the bin file. Should I do anything more? The problem is that I still get the exception.

java -version gives me (if this matters):

java version 1.5.0_04
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_04-b05)
Java HotSpot(TM) Client VM (build 1.5.0_04-b05, mixed mode, sharing)

These are the environment variables in my .bashrc:

export NUTCH_JAVA_HOME=/usr/java/jdk1.5.0_04
export JAVA_HOME=/usr/java/jdk1.5.0_04
export CATALINA_HOME=/home/nils/jakarta-tomcat-4.1.27

They work for Tomcat, so I guess they will also do for Nutch (the Java path). It's getting really frustrating... :-(

Thanks anyway,
Nils
Re: Problem Starting Nutch (Tutorial like)
No :-( I've added the PATH, but I get the same error! What does the exception mean exactly? Is this really a problem with my machine?

Thanks,
Nils

On Friday, 29.07.2005, at 06:55 -0700, Feng (Michael) Ji wrote:

The Java path setting on my Linux (Red Hat 9) server is as follows:

PATH=/home/michael/J2EE/jdk/bin:$PATH:$HOME/bin:./
export PATH
export JAVA_HOME=/home/michael/J2EE/jdk
export CATALINA_HOME=/home/michael/SE/tomcat4

Will that help you?

Michael
Re: Problem Starting Nutch (Tutorial like)
http://java.sun.com/j2se/1.4.2/docs/api/java/net/UnknownHostException.html

Could it be an IP problem with your server?

Michael
Re: Problem Starting Nutch (Tutorial like)
It seems I found the error! Don't kill me, but when I use the official nutch-0.6 version, everything works! The problem only exists with the nutch-nightly versions. Do you know why? Anyway, I'll go on playing with the old version until I start implementing my ideas.

Thanks to all,
Nils
Re: Problem Starting Nutch (Tutorial like)
I am using nutch-nightly, and everything is going well.

Michael
Re: Problem Starting Nutch (Tutorial like)
Hey Michael,

which date is your nutch-nightly from? I used the build from two days ago. The crawler is running fine at the moment and fetching all of the sites I wanted, as I said, with version nutch-0.6. When I start the nutch-nightly version, I get the same old UnknownHostException. Have there been deep changes (in the crawling part, where the error seems to be) from 0.6 to today's nightly versions?

One last question: what is the nutch-daemon good for? Can I use it for this case: I want to have a Nutch process running that looks at the urls file every few seconds or minutes and performs a crawl/index when a new URL has been appended. This should give me an on-demand crawling/indexing service. Can I do this with the nutch-daemon?

Greetings,
Nils
Re: Problem Starting Nutch (Tutorial like)
My nightly version is about a month old. I might try the latest Nutch later on if I have time, but I don't think that is the issue.

Nutch provides some high-level calls, mostly for demo purposes, I guess. Any fancy customized system needs some programming effort at least at the Nutch API level, if not at the Lucene API level. Actually, that is what I am preparing to do now...

Michael
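The polling half of the on-demand service asked about in this thread needs no Nutch API at all and can be sketched on its own. Nothing in the thread confirms that nutch-daemon does this, so the following is only an illustration: the class `UrlFileWatcher`, the one-minute interval and the "crawl needed" message are all assumptions, and the step that would actually start a crawl for the new URLs is left as a placeholder.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Nutch.
class UrlFileWatcher {

    private final Path urlFile;
    private int linesSeen = 0;

    UrlFileWatcher(Path urlFile) {
        this.urlFile = urlFile;
    }

    /** Returns the non-blank lines appended since the previous call. */
    List<String> pollNewUrls() throws IOException {
        List<String> all = Files.readAllLines(urlFile, StandardCharsets.UTF_8);
        List<String> fresh = new ArrayList<>();
        for (int i = linesSeen; i < all.size(); i++) {
            String line = all.get(i).trim();
            if (!line.isEmpty()) {
                fresh.add(line);
            }
        }
        // Remember how far we have read; this assumes the file is append-only.
        linesSeen = all.size();
        return fresh;
    }

    public static void main(String[] args) throws Exception {
        Path file = Paths.get(args.length > 0 ? args[0] : "urls");
        UrlFileWatcher watcher = new UrlFileWatcher(file);
        while (true) {
            for (String url : watcher.pollNewUrls()) {
                // Placeholder: a real service would trigger a crawl/index run here.
                System.out.println("new URL, crawl needed: " + url);
            }
            Thread.sleep(60_000);
        }
    }
}
```

Tracking a line count keeps the sketch short; a more robust version would probably remember the URLs themselves so that edits in the middle of the file are not missed.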
Re: Problem Starting Nutch (Tutorial like)
java.net.UnknownHostException: linux: linux

Something is wrong with your DNS configuration, I'm guessing.
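The stack trace in this thread shows the exception coming out of `InetAddress.getLocalHost()`, so the name that fails to resolve is the machine's own host name (`linux`), not one of the URLs to be crawled. The same lookup can be reproduced outside Nutch with a few lines of plain Java; the class name `LocalHostCheck` is made up for this sketch.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical diagnostic; it repeats the lookup that fails in the stack trace.
class LocalHostCheck {

    /** Reports whether the machine's own host name resolves to an address. */
    static String describeLocalHost() {
        try {
            InetAddress local = InetAddress.getLocalHost();
            return "OK: " + local.getHostName() + " resolves to " + local.getHostAddress();
        } catch (UnknownHostException e) {
            return "FAILED: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describeLocalHost());
    }
}
```

If this prints a FAILED line, the usual remedy on Linux is to make the host name resolvable, for example with an entry for it in /etc/hosts. That remedy is a common one rather than something confirmed in this thread.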
Re: prioritizing newly injected urls for fetching
Hello Kamil,

Do you want to generate a fetchlist with URLs that are present in the WebDB but have not been fetched so far? I am not sure what you are trying to achieve, but you can generate any fetchlist you want using the latest tool by Andrzej Bialecki (http://issues.apache.org/jira/browse/NUTCH-68); I have not tried it myself. Some time ago there was also a discussion on the Nutch mailing list about the refetchonly parameter of the fetchlist generator. Some of the ideas are still not implemented, but you can read how it currently works.

Regards,
Piotr

Hi Piotr,

Thanks for your advice. The sources you directed me to helped me track down my issue. I realized I was updating my WebDB right after the inject operations and immediately before generating new fetchlists. As a result, the scores that I meant for newly injected links to have were being altered. Thus I was initially misled into thinking that the db.score.injected property did not work as advertised. So I changed the order of my scripts a bit, and now everything is working.

-Kamil