How to control contents to be indexed?
In the process of crawling and indexing, some pages are used only as temporary links to the pages I want to index, so how can I prevent those kinds of pages from being indexed? Or which part of Nutch should I extend?
Bug in closing the database?
Hi everyone, I constantly encounter this problem when Nutch reaches the database-closing stage. The crawler hangs my system, which then needs to be restarted. Can anyone help me figure this out? Here is my log file before the hang:

060210 161454 Finishing update
060210 161959 Processing pagesByURL: Sorted 25968131 instructions in 305.23 seconds.
060210 161959 Processing pagesByURL: Sorted 85077.25649510205 instructions/second
060210 162403 Processing pagesByURL: Merged to new DB containing 4088028 records in 133.31 seconds
060210 162403 Processing pagesByURL: Merged 30665.576475883277 records/second
060210 162428 Processing pagesByMD5: Sorted 3479369 instructions in 25.146 seconds.
060210 162428 Processing pagesByMD5: Sorted 138366.6984808717 instructions/second
060210 162529 Processing pagesByMD5: Merged to new DB containing 4088028 records in 56.115 seconds
060210 162529 Processing pagesByMD5: Merged 72850.89548249131 records/second
060210 163426 Processing linksByMD5: Sorted 25926879 instructions in 536.227 seconds.
060210 163426 Processing linksByMD5: Sorted 48350.56608488569 instructions/second
060210 164415 Processing linksByMD5: Merged to new DB containing 50840622 records in 465.897 seconds
060210 164415 Processing linksByMD5: Merged 109124.16692960032 records/second
060210 164840 Processing linksByURL: Sorted 22072525 instructions in 264.309 seconds.
060210 164840 Processing linksByURL: Sorted 83510.30422724916 instructions/second
060210 165830 Processing linksByURL: Merged to new DB containing 50840622 records in 478.304 seconds
060210 165830 Processing linksByURL: Merged 106293.53298320733 records/second
060210 170433 Processing linksByMD5: Sorted 23422502 instructions in 362.723 seconds.
060210 170433 Processing linksByMD5: Sorted 64574.0744314532 instructions/second
060210 171409 Processing linksByMD5: Merged to new DB containing 50840622 records in 453.612 seconds
060210 171409 Processing linksByMD5: Merged 112079.53493293827 records/second

The weird thing is that Nutch processed linksByMD5 twice, and the numbers of instructions are not the same. Regards, Giang
Server list
Is there any way to have a list of servers that you want to crawl? I want to crawl only three servers on campus. Do I just add them to the domain list? Andy
Re: Server list
Hi, add the domains to the url file. Rgds, Prabhu. On 2/10/06, Andy Morris [EMAIL PROTECTED] wrote: Is there any way to have a list of servers that you want to crawl? I want to crawl only three servers on campus. Do I just add them to the domain list? Andy
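In Nutch of this vintage that usually means two things: put the three start URLs in your seed url file, and confine fetching to those hosts in conf/crawl-urlfilter.txt (the URL filter rules the crawl command reads). A minimal sketch, with placeholder campus hostnames:

# accept pages on the three campus servers (hostnames are placeholders)
+^http://www1\.example\.edu/
+^http://www2\.example\.edu/
+^http://library\.example\.edu/
# reject everything else
-.

Each line is a + (accept) or - (reject) followed by a regular expression; rules are tried in order and the first match wins, so the final -. drops every url that didn't match one of the three hosts.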
RE: How to control contents to be indexed?
If you control the temporary links pages, then just add a robots meta tag. Take a look at http://www.robotstxt.org/wc/meta-user.html to see what your options are. Jake.

-----Original Message-----
From: Elwin [mailto:[EMAIL PROTECTED]]
Sent: Friday, February 10, 2006 4:38 AM
To: nutch-user@lucene.apache.org
Subject: How to control contents to be indexed?

In the process of crawling and indexing, some pages are used only as temporary links to the pages I want to index, so how can I prevent those kinds of pages from being indexed? Or which part of Nutch should I extend?
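The tag described on that page goes in the HTML head of each temporary page; a minimal example that keeps a page out of the index while still letting compliant crawlers follow its outlinks:

<meta name="robots" content="noindex,follow">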
Re: How to control contents to be indexed?
Thank you. But the pages I want to crawl are just from the internet, and certainly I can't control them.

2006/2/10, Vanderdray, Jacob [EMAIL PROTECTED]: If you control the temporary links pages, then just add a robots meta tag. Take a look at http://www.robotstxt.org/wc/meta-user.html to see what your options are. Jake.

-----Original Message-----
From: Elwin [mailto:[EMAIL PROTECTED]]
Sent: Friday, February 10, 2006 4:38 AM
To: nutch-user@lucene.apache.org
Subject: How to control contents to be indexed?

In the process of crawling and indexing, some pages are used only as temporary links to the pages I want to index, so how can I prevent those kinds of pages from being indexed? Or which part of Nutch should I extend?

--
"The Final Combat" (《盖世豪侠》) drew rave reviews and kept TVB's ratings sky-high, yet TVB still passed him over. Stephen Chow was no ordinary talent: once his comic gift had shown itself, he would not settle for neglect, so he moved into film and shone on the big screen. TVB had found its prized steed only to lose it, and regretted it too late.
local in hadoop-default.xml
Hi, when I didn't override hadoop-default.xml and tried to use the default settings, I got errors like "bad mapred.job.tracker: local" or "Not a host:port pair: local". Is anybody using nutch/hadoop without using map-reduce? Regards Michael -- Michael Nebel http://www.nebel.de/ http://www.netluchs.de/
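For reference, after the Hadoop split the defaults sit in hadoop-default.xml and overrides belong in a hadoop-site.xml on the classpath. A minimal sketch for a purely local, non-distributed setup (property names as shipped at the time):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>local</value> <!-- local filesystem instead of DFS -->
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value> <!-- run jobs in-process, no JobTracker daemon -->
  </property>
</configuration>

The "Not a host:port pair: local" message is thrown by code that expects a real tracker address and is handed the value local instead, so errors like the above usually mean the two modes are being mixed somewhere.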
Error while indexing (mapred)
Hi, I have 4 boxes (1 master, 3 slaves), about 33GB worth of segment data, and 4.6M fetched urls in my crawldb. I'm using the mapred code from trunk (revision 374061, Wed, 01 Feb 2006). I was able to generate the indexes from the crawldb and linkdb, but I recently started to see this error while running a dedup on my indexes:

060210 061707 reduce 9%
060210 061710 reduce 10%
060210 061713 reduce 11%
060210 061717 reduce 12%
060210 061719 reduce 11%
060210 061723 reduce 10%
060210 061725 reduce 11%
060210 061726 reduce 10%
060210 061729 reduce 11%
060210 061730 reduce 9%
060210 061732 reduce 10%
060210 061736 reduce 11%
060210 061739 reduce 12%
060210 061742 reduce 10%
060210 061743 reduce 9%
060210 061745 reduce 10%
060210 061746 reduce 100%
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:310)
at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:329)
at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:349)

I can see a lot of these messages in the jobtracker log on the master:

...
060210 061743 Task 'task_r_4t50k4' has been lost.
060210 061743 Task 'task_r_79vn7i' has been lost.
...

On every single slave, I get this file-not-found exception in the tasktracker log:

060210 061749 Server handler 0 on 50040 caught: java.io.FileNotFoundException: /var/epile/nutch/mapred/local/task_m_273opj/part-4.out
java.io.FileNotFoundException: /var/epile/nutch/mapred/local/task_m_273opj/part-4.out
at org.apache.nutch.fs.LocalFileSystem.openRaw(LocalFileSystem.java:121)
at org.apache.nutch.fs.NFSDataInputStream$Checker.<init>(NFSDataInputStream.java:45)
at org.apache.nutch.fs.NFSDataInputStream.<init>(NFSDataInputStream.java:226)
at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:160)
at org.apache.nutch.mapred.MapOutputFile.write(MapOutputFile.java:93)
at org.apache.nutch.io.ObjectWritable.writeObject(ObjectWritable.java:121)
at org.apache.nutch.io.ObjectWritable.write(ObjectWritable.java:68)
at org.apache.nutch.ipc.Server$Handler.run(Server.java:215)

I used to be able to complete the index dedupping successfully when my segments/crawldb were smaller, but I don't see why size would be related to the FileNotFoundException. I'm nowhere near running out of disk space and my hard discs work properly. Has anyone encountered a similar issue or has a clue about what's happening? Thanks, Florent
JobTracker does not start properly
hello, the JobTracker doesn't want to start. I am using a current version from the trunk. When trying to start the jobtracker with ./bin/hadoop jobtracker, I get the following stack trace after some operations:

Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.mapred.JobTrackerInfoServer.<init>(JobTrackerInfoServer.java:56)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:303)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:50)
at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:813)

any ideas? regards, ud
nutch configuration
After some time, I downloaded the latest nightly version of Nutch (2006-02-10). Going through nutch-default.xml I could no longer find the fs.default.name or mapred.reduce.tasks properties. Where can I find them? Is the FAQ that deals with mapreduce still valid? Thanks
Re: Which version of rss does parse-rss plugin support?
Hi,

> > the contentTitle will be a concatenation of the titles of the RSS Channels that we've parsed.
>
> So the titles of the RSS Channels are what is delivered for indexing, right?

They're certainly part of it, but not the only part. The concatenation of the titles of the RSS Channels is what is delivered for the title portion of indexing.

> If I want the indexer to include more information about an rss file (such as item descriptions), can I just concatenate them to the contentTitle?

They're already there. There is a variable called indexText: ultimately that variable includes the item descriptions, along with the channel descriptions. That, along with the title portion of indexing, is the full set of textual data delivered by the parser for indexing. So it already includes that information. Check out lines 137 and 161 in the parser to see what I mean. Also, check out lines 204-207, which are:

ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
                                    contentTitle.toString(), outlinks,
                                    content.getMetadata());
parseData.setConf(this.conf);
return new ParseImpl(indexText.toString(), parseData);

You can see that the return from the Parser, i.e. the ParseImpl, includes both the indexText and the parse data (which contains the title text). Now, if you want to add any other metadata gleaned from the RSS to the title text or the content text, you can always modify the code to do that in your own environment. The RSS Parser plugin returns a full channel model and item model that can be extended and used for those purposes. Hope that helps!

Cheers, Chris

On 2006-02-06, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there, That should work; however, the biggest problem will be making sure that text/xml is actually the content type of the RSS that you are parsing, which you'll have little or no control over. Check out this previous post of mine on the list to get a better idea of what the real issue is: http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html G'luck! Cheers, Chris

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
_
Jet Propulsion Laboratory, Pasadena, CA
Office: 171-266B  Mailstop: 171-246
Phone: 818-354-8810
___
Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.

-----Original Message-----
From: 盖世豪侠 [mailto:[EMAIL PROTECTED]]
Sent: Saturday, February 04, 2006 11:40 PM
To: nutch-user@lucene.apache.org
Subject: Re: Which version of rss does parse-rss plugin support?

Hi Chris, How do I change the plugin.xml? For example, if I want to crawl rss files ending with xml, do I just add a new element?

<implementation id="org.apache.nutch.parse.rss.RSSParser"
                class="org.apache.nutch.parse.rss.RSSParser"
                contentType="application/rss+xml"
                pathSuffix="rss"/>
<implementation id="org.apache.nutch.parse.rss.RSSParser"
                class="org.apache.nutch.parse.rss.RSSParser"
                contentType="application/rss+xml"
                pathSuffix="xml"/>

Am I right?

On 2006-02-03, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there, Sure it will, you just have to configure it to do that. Pop over to $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there there is an attribute called pathSuffix. Change that to handle whatever type of rss file you want to crawl. That will work locally. For web-based crawls, you need to make sure that the content type being returned for your RSS content matches the content type specified in the plugin.xml file that parse-rss claims to support. Note that you might not have *a lot* of success with being able to control the content type for rss files returned by web servers. I've seen a LOT of inconsistency in the way they're configured by the administrators, etc. However, just to let you know, there are some people in the group who are working on a solution to addressing this. Hope that helps. Cheers, Chris

On 2/3/06 7:16 AM, 盖世豪侠 [EMAIL PROTECTED] wrote: Hi Chris, The files of RSS 1.0 have a postfix of rdf. So will the parser recognize it automatically as an rss file?

On 2006-02-03, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there, parse-rss is based on commons-feedparser (http://jakarta.apache.org/commons/sandbox/feedparser). From the feedparser website: "...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92, 1.0, and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc extension and RSS 1.0 modules capability..." Hope that helps.
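Since parser selection hinges on the served content type, it can help to check what a server actually returns before blaming the plugin. For example, with a hypothetical feed URL:

curl -I http://blog.example.com/feed.rss | grep -i content-type

If the reported Content-Type doesn't match one of the contentType values registered in parse-rss's plugin.xml, Nutch will hand the page to a different parser, or to none at all.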
Re: nutch configuration
They are in the Apache Hadoop configuration now (hadoop-default.xml), which lives inside the hadoop jar.

On 10.02.2006 at 17:33, carmmello wrote: After some time, I downloaded the latest nightly version of Nutch (2006-02-10). Going through nutch-default.xml I could no longer find the fs.default.name or mapred.reduce.tasks properties. Where can I find them? Is the FAQ that deals with mapreduce still valid? Thanks

---
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
Re: nutch configuration
Hi, and if you want to change any of the parameters, just create a hadoop-site.xml. And attention: ndfs.* is now called dfs.* :-) Regards Michael

Stefan Groschupf wrote: They are in the Apache Hadoop configuration now (hadoop-default.xml), which lives inside the hadoop jar. On 10.02.2006 at 17:33, carmmello wrote: After some time, I downloaded the latest nightly version of Nutch (2006-02-10). Going through nutch-default.xml I could no longer find the fs.default.name or mapred.reduce.tasks properties. Where can I find them? Is the FAQ that deals with mapreduce still valid? Thanks

---
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
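To make the rename concrete, an override that used to point NDFS at its name directory would now be spelled with the dfs prefix (the path below is a placeholder):

<property>
  <name>dfs.name.dir</name> <!-- formerly ndfs.name.dir -->
  <value>/data/hadoop/name</value>
</property>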
Re: The latest svn version is not stable
Rafit Izhak_Ratzin wrote: I just checked out the latest svn version (376446) and built it from scratch. When I tried to run the jobtracker, I got the following message in the jobtracker log file:

060209 164707 Property 'sun.cpu.isalist' is
Exception in thread "main" java.lang.NullPointerException

Okay. I think I just fixed this. Please give it a try. Thanks, Doug
Re: nutch inject problem with hadoop
Michael Nebel wrote:
> I upgraded to the latest version from svn today, after some nuts-and-bolts fixes (missing hadoop-site.xml, webapps dir).

I just fixed these issues.

> I finally tried to inject a new set of urls. Doing so, I get the exception below.

I am not seeing this. Are you still seeing it with the current sources? If so, can you provide more details? What OS, JVM?

Thanks, Doug
Re: nutch inject problem with hadoop
Michael Nebel wrote:
> Now it's complaining about a missing class org/apache/nutch/util/LogFormatter :-(

That's been moved to Hadoop: org.apache.hadoop.util.LogFormatter.

Doug
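For code of your own that still imports the old class, the fix is a one-line import change. A minimal sketch, assuming a hypothetical utility class that uses the usual Nutch/Hadoop logging idiom:

import java.util.logging.Logger;
// before: import org.apache.nutch.util.LogFormatter;
import org.apache.hadoop.util.LogFormatter;

public class MyTool { // hypothetical example class
    // LogFormatter.getLogger hands back a java.util.logging Logger
    // configured with Hadoop's log format
    public static final Logger LOG =
        LogFormatter.getLogger("org.example.MyTool");
}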
Re: nutch inject problem with hadoop
Have you ever heard of someone who forgot to run ant clean and to update the plugin directory? No? Now you have! My mistake, sorry. Michael

Doug Cutting wrote: Michael Nebel wrote: Now it's complaining about a missing class org/apache/nutch/util/LogFormatter :-( That's been moved to Hadoop: org.apache.hadoop.util.LogFormatter. Doug
Need nutch 0.5
I need a copy of nutch 0.5, which is no longer on the Apache servers. Can someone send me a copy? Thanks a million. Scott
Re: Which version of rss does parse-rss plugin support?
According to the code:

theOutlinks.add(new Outlink(r.getLink(), r.getDescription()));

I can see that the item description is also included. However, when I tried it with this feed: http://kgrimm.bravejournal.com/feed.rss I could only get the title and description of the channel, and failed to find words from the item descriptions when searching. From the above code, the item description is attached to the outlink url; is it used as the contentTitle for that url? When the outlink is fetched and parsed, I think new data about that url will be generated.