Re: Adding additional metadata
Something like this may work for your filter. I have not tested it, but maybe it will give you a better idea of what you need to do for the author data. This is based on nutch-1.0, so I'm not sure whether it would work with the trunk version.

  public class AuthorFilter implements HtmlParseFilter {
    public ParseResult filter(Content content, ParseResult parse,
        HTMLMetaTags metaTags, DocumentFragment doc) {
      parse.get(content.getUrl()).getData().getParseMeta()
          .set("author", metaTags.getGeneralTags().getProperty("author"));
      return parse;
    }
  }

You will also need an indexing filter that will store the author data in the index.

-Jason

On Fri, Jan 8, 2010 at 6:00 AM, MilleBii wrote:
> For lastModified, just enable the index|query-more plugins; they will do
> the job for you.
>
> For other metadata, search the mailing list; it's explained many times how to do it.
>
> 2010/1/8, Erlend Garåsen :
> >
> > Hello,
> >
> > I have tried to add additional metadata by changing the code in
> > HtmlParser.java and MoreIndexingFilter.java without any luck. Do I
> > really have to do what is described on the following wiki page in
> > order to fetch the content of the metadata, i.e. write my own parser,
> > filter and a plugin.xml file:
> > http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html
> >
> > I find the plugin examples complicated and difficult to understand. What
> > the existing HtmlParser does is good enough for me, as long as I am able to
> > fetch two additional metadata fields (author and lastModified) which are
> > included in many of my university's webpages.
> >
> > The last thing I tried was to make HtmlParser implement the
> > HtmlParseFilter interface, but the implemented required method does not run.
> >
> > My hope was that we could use Solr/Nutch instead of Ultraseek, but that
> > requires that we are able to parse our metadata successfully.
> >
> > Erlend
> > --
> > Erlend Garåsen
> > Center for Information Technology Services
> > University of Oslo
> > P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> > Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
> --
> -MilleBii-
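A parse filter like the one sketched above also has to be registered through a plugin descriptor before Nutch will load it. Here is a minimal plugin.xml sketch; the plugin id, package name, jar name, and provider are hypothetical placeholders, and only the extension point `org.apache.nutch.parse.HtmlParseFilter` comes from Nutch itself:

```xml
<plugin id="author-filter" name="Author Parse Filter"
        version="1.0.0" provider-name="example.org">
  <runtime>
    <!-- jar built from the filter class; name is an assumption -->
    <library name="author-filter.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <extension id="org.example.nutch.AuthorFilter"
             name="Author Parse Filter"
             point="org.apache.nutch.parse.HtmlParseFilter">
    <implementation id="AuthorFilter" class="org.example.nutch.AuthorFilter"/>
  </extension>
</plugin>
```

The plugin directory would then need to be matched by the plugin.includes property in nutch-site.xml so that it is activated at parse time.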
Re: crawl-urlfilter.txt & regex-urlfilter.txt
Hi Ken,

If you are doing a whole-web crawl then only the regex-urlfilter.txt file is used. The regex-urlfilter.txt is applied during the inject and fetchlist generation steps. You can also apply the regex-urlfilter when updating the crawldb, but you will have to add the '-filter' option when running the updatedb command.

-Jason

On Wed, Jan 6, 2010 at 7:04 AM, Godmar Back wrote:
> On Wed, Jan 6, 2010 at 5:36 AM, Ken Ken wrote:
> >
> > Hello,
> >
> > I have a few questions on these two files in nutch-1.0/conf
> >
>
> Ken,
>
> I can't answer your question, but I'm struggling with a similar one; perhaps
> both questions can be answered at the same time.
> My question is which crawl-urlfilter.txt is being used when, and if it
> matters.
>
> If you have your nutch in, say, /opt/nutch and your tomcat in /opt/tomcat,
> then after deploying the .war you have /opt/nutch/conf/crawl-urlfilter.txt
> and /opt/tomcat/webapps/nutch/WEB-INF/classes/crawl-urlfilter.txt.
>
> I'd assume that bin/nutch will look at the former, and Tomcat will look at
> the latter, but how do you keep the two in sync?
>
> Specifically, if you rebuild nutch.war and redeploy, won't that override the
> crawl-urlfilter.txt in the WEB-INF tree?
>
> - Godmar
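For anyone unsure how the rules in regex-urlfilter.txt are evaluated, here is a small self-contained sketch of the semantics. This is illustrative code, not the actual Nutch RegexURLFilter class: each rule starts with '+' (keep) or '-' (drop) followed by a regular expression, the first rule whose pattern matches decides the URL's fate, and a URL matching no rule is dropped.

```java
import java.util.List;
import java.util.regex.Pattern;

/**
 * Illustrative sketch of regex-urlfilter.txt semantics (not the real
 * Nutch implementation): rules are tried in order, the first match
 * wins, '+' keeps the URL, '-' drops it, and no match drops it.
 */
public class RegexUrlFilterSketch {
  static String filter(List<String> rules, String url) {
    for (String rule : rules) {
      char sign = rule.charAt(0);
      if (Pattern.compile(rule.substring(1)).matcher(url).find()) {
        return sign == '+' ? url : null;  // first matching rule decides
      }
    }
    return null;  // no rule matched: drop the URL
  }
}
```

With rules like `-\.(gif|jpg)$` followed by `+^http://`, image URLs are rejected by the first rule before the catch-all accept rule is consulted, which is why rule order matters in the file.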
Re: Multiple Nutch instances for crawling?
The crawl directory contents ( crawldb, linkdb, etc. ) can be shared between jobs, although depending on the implementation you may run into locking issues. For example, if a 'generate' job is running while an 'updatedb' job is executed on the same crawldb, the 'updatedb' job will fail because the 'generate' job holds a lock on the crawldb.

On Thu, Dec 17, 2009 at 10:15 PM, Jun Mao wrote:
> Is that still true if I start two jobs ( they will not share crawldb/linkdb )
> and write the indexes to two different locations?
>
> Thanks,
>
> Jun
>
> -Original Message-
> From: MilleBii [mailto:mille...@gmail.com]
> Sent: 2009年12月17日 16:57
> To: nutch-user@lucene.apache.org
> Subject: Re: Multiple Nutch instances for crawling?
>
> I guess because of the different nutch-site.xml & url filter that you want
> to use it won't work... but you could try installing nutch twice and run the
> crawl/fetch/parse from those two locations, then join the segments to
> recreate a unified searchable index (make sure you put all your segments
> under the same location).
>
> Just one comment though: I think hadoop will serialize your jobs anyhow, so
> you won't get a parallel execution of your hadoop jobs unless you run them
> from different hardware.
>
> 2009/12/16 Christopher Bader
>> Felix,
>>
>> I've had trouble running multiple instances. I would be interested in
>> hearing from anyone who has done it successfully.
>>
>> CB
>>
>> On Wed, Dec 16, 2009 at 4:26 PM, Felix Zimmermann wrote:
>>> Hi,
>>>
>>> I would like to run at least two instances of nutch ONLY for crawling at
>>> one time; one for very frequently updated sites and one for other sites.
>>> Will the nutch instances get in trouble when running several
>>> crawlscripts, especially the nutch confdir variable?
>>>
>>> Thanks!
>>> Felix.
>
> --
> -MilleBii-
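The locking behaviour described above can be pictured with a small, self-contained sketch. This mimics the lock-file pattern used to guard a crawldb but it is not Nutch code; the ".locked" file name follows Nutch's convention, everything else is illustrative:

```java
import java.io.*;
import java.nio.file.*;

/**
 * Simplified sketch of a crawldb lock (illustrative only, not Nutch code):
 * a job atomically creates a ".locked" file when it starts and removes it
 * when it finishes, so a second job on the same crawldb fails fast instead
 * of corrupting shared state.
 */
public class CrawlDbLockSketch {
  static boolean acquire(Path db) {
    try {
      Files.createFile(db.resolve(".locked"));  // atomic: fails if lock exists
      return true;
    } catch (IOException e) {
      return false;  // another job holds the lock
    }
  }

  static void release(Path db) {
    try {
      Files.deleteIfExists(db.resolve(".locked"));
    } catch (IOException e) {
      // ignore: lock already gone
    }
  }

  static Path newTempDb() {
    try {
      return Files.createTempDirectory("crawldb");
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In this picture, a 'generate' job that acquired the lock causes a concurrent 'updatedb' on the same directory to fail immediately, which matches the behaviour reported above.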
Re: Multiple Nutch instances for crawling?
Nope. I do not know how that version of nutch works. I am not familiar with that project.

On Thu, Dec 17, 2009 at 3:39 PM, Felix Zimmermann wrote:
> Hi Jason,
>
> thank you very much for your detailed description, I'll give this a
> try. Do you know how the multiple instances are realised in the
> nutch-gui on http://github.com/101tec/nutch ?
>
> Felix.
>
> Am Donnerstag, den 17.12.2009, 15:11 -0800 schrieb J.G.Konrad:
>> I have upgraded to 0.19.2, but the capacity scheduler feature is
>> available in 0.19.1. You will need to download the hadoop common
>> package ( http://hadoop.apache.org/common/releases.html ) to get the
>> capacity scheduler jar. It is not included in the Nutch releases.
>>
>> After downloading, copy
>> contrib/capacity-scheduler/hadoop-0.19.x-capacity-scheduler.jar into the
>> lib directory where you have your nutch code. There is also an example
>> configuration file that comes with the hadoop package that is quite
>> self-explanatory ( conf/capacity-scheduler ). The docs can be found here:
>> http://hadoop.apache.org/common/docs/r0.19.1/capacity_scheduler.html
>>
>> The first thing to do is to set the scheduler and define the queues.
>> This is done in the hadoop-site.xml file that is used to start the
>> jobtracker. This is what mine looks like for defining two queues (plus
>> the default):
>>
>> <property>
>>   <name>mapred.jobtracker.taskScheduler</name>
>>   <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
>> </property>
>>
>> <property>
>>   <name>mapred.queue.names</name>
>>   <value>nutchFetchCycle,nutchIndex,default</value>
>> </property>
>>
>> The other file to add is conf/capacity-scheduler.conf. This is where
>> the properties of the queues are defined. Here is some of my scheduler
>> conf:
>>
>> <property>
>>   <name>mapred.capacity-scheduler.queue.nutchFetchCycle.guaranteed-capacity</name>
>>   <value>50</value>
>>   <description>Percentage of the number of slots in the cluster that are
>>   guaranteed to be available for jobs in this queue.</description>
>> </property>
>>
>> <property>
>>   <name>mapred.capacity-scheduler.queue.nutchIndex.guaranteed-capacity</name>
>>   <value>50</value>
>>   <description>Percentage of the number of slots in the cluster that are
>>   guaranteed to be available for jobs in this queue.</description>
>> </property>
>>
>> <property>
>>   <name>mapred.capacity-scheduler.queue.default.guaranteed-capacity</name>
>>   <value>0</value>
>>   <description>Percentage of the number of slots in the cluster that are
>>   guaranteed to be available for jobs in this queue.</description>
>> </property>
>>
>> You will need to restart your jobtracker in order for these changes
>> to be applied. You will be able to see the queues if you visit the web
>> interface of the jobtracker.
>>
>> To use the different queues you will need to set the
>> mapred.job.queue.name property. To accomplish this I have two
>> directories, nutch-fetch and nutch-index. Each directory has its own
>> conf/hadoop-site.xml file with a different queue name ( I also have
>> different numbers of map/reduce tasks ).
>>
>> nutch-index:
>> <property>
>>   <name>mapred.job.queue.name</name>
>>   <value>nutchIndex</value>
>> </property>
>>
>> nutch-fetch:
>> <property>
>>   <name>mapred.job.queue.name</name>
>>   <value>nutchFetchCycle</value>
>> </property>
>>
>> When a job is started from one of the directories, the job will be
>> placed in the corresponding queue and will be able to run
>> simultaneously with jobs in the other queue. In order for this to work
>> there needs to be capacity for at least 2 map and 2 reduce tasks,
>> so each queue will be guaranteed 1 of each ( in this example, since
>> it's a 50/50 distribution ).
>>
>> Good luck with your concurrent fetches, and don't forget to set the
>> generate.update.crawldb property to 'true' so that you will generate
>> different fetch lists for each instance.
>>
>> Enjoy,
>> Jason
>>
>> On Thu, Dec 17, 2009 at 1:03 PM, Yves Petinot wrote:
>> > Jason,
>> >
>> > that sounds really good! ... did you have to upgrade the default version of
>> > Hadoop, or were you able to get the distro that comes with Nutch (0.19.1 for
>> > me, which I assume is the standard) to accept it? If the latter worked for
>> > you, can you take us through your configuration changes?
>> >
>> > thanks a bunch ;-)
>> >
>> > -y
>> >
>> > J.G.Konrad wrote:
>> >> I have integrated the CapacityTaskScheduler into my Nutch 1.0 setup
>> >> although it is not for doing concurrent fetching but it could be used
>> >> for that purpose.
Re: Multiple Nutch instances for crawling?
I have upgraded to 0.19.2, but the capacity scheduler feature is available in 0.19.1. You will need to download the hadoop common package ( http://hadoop.apache.org/common/releases.html ) to get the capacity scheduler jar. It is not included in the Nutch releases.

After downloading, copy contrib/capacity-scheduler/hadoop-0.19.x-capacity-scheduler.jar into the lib directory where you have your nutch code. There is also an example configuration file that comes with the hadoop package that is quite self-explanatory ( conf/capacity-scheduler ). The docs can be found here: http://hadoop.apache.org/common/docs/r0.19.1/capacity_scheduler.html

The first thing to do is to set the scheduler and define the queues. This is done in the hadoop-site.xml file that is used to start the jobtracker. This is what mine looks like for defining two queues (plus the default):

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>

<property>
  <name>mapred.queue.names</name>
  <value>nutchFetchCycle,nutchIndex,default</value>
</property>

The other file to add is conf/capacity-scheduler.conf. This is where the properties of the queues are defined. Here is some of my scheduler conf:

<property>
  <name>mapred.capacity-scheduler.queue.nutchFetchCycle.guaranteed-capacity</name>
  <value>50</value>
  <description>Percentage of the number of slots in the cluster that are
  guaranteed to be available for jobs in this queue.</description>
</property>

<property>
  <name>mapred.capacity-scheduler.queue.nutchIndex.guaranteed-capacity</name>
  <value>50</value>
  <description>Percentage of the number of slots in the cluster that are
  guaranteed to be available for jobs in this queue.</description>
</property>

<property>
  <name>mapred.capacity-scheduler.queue.default.guaranteed-capacity</name>
  <value>0</value>
  <description>Percentage of the number of slots in the cluster that are
  guaranteed to be available for jobs in this queue.</description>
</property>

You will need to restart your jobtracker in order for these changes to be applied. You will be able to see the queues if you visit the web interface of the jobtracker.

To use the different queues you will need to set the mapred.job.queue.name property. To accomplish this I have two directories, nutch-fetch and nutch-index. Each directory has its own conf/hadoop-site.xml file with a different queue name ( I also have different numbers of map/reduce tasks ).

nutch-index:

<property>
  <name>mapred.job.queue.name</name>
  <value>nutchIndex</value>
</property>

nutch-fetch:

<property>
  <name>mapred.job.queue.name</name>
  <value>nutchFetchCycle</value>
</property>

When a job is started from one of the directories, the job will be placed in the corresponding queue and will be able to run simultaneously with jobs in the other queue. In order for this to work there needs to be capacity for at least 2 map and 2 reduce tasks, so each queue will be guaranteed 1 of each ( in this example, since it's a 50/50 distribution ).

Good luck with your concurrent fetches, and don't forget to set the generate.update.crawldb property to 'true' so that you will generate different fetch lists for each instance.

Enjoy,
Jason

On Thu, Dec 17, 2009 at 1:03 PM, Yves Petinot wrote:
> Jason,
>
> that sounds really good! ... did you have to upgrade the default version of
> Hadoop, or were you able to get the distro that comes with Nutch (0.19.1 for
> me, which I assume is the standard) to accept it? If the latter worked for
> you, can you take us through your configuration changes?
>
> thanks a bunch ;-)
>
> -y
>
> J.G.Konrad wrote:
>> I have integrated the CapacityTaskScheduler into my Nutch 1.0 setup
>> although it is not for doing concurrent fetching but it could be used
>> for that purpose. Using the capacity scheduler you set up two
>> separate queues and allocate half of the available resources
>> (map/reduce tasks) to each queue. Technically there are three queues
>> in my setup, two specialized ones and the 'default'. The capacity
>> scheduler requires the 'default' queue to be defined although you
>> don't have to send any jobs to it.
>>
>> The only catch is you are not guaranteed symmetrical distribution if
>> you have multiple machines in the cluster. That may or may not be an
>> issue depending on your requirements.
>>
>> -Jason
>>
>> On Thu, Dec 17, 2009 at 11:00 AM, Yves Petinot wrote:
>>>>> Just one comment though I think hadoop will serialize your jobs any how
>>>>> so you won't get a parallel execution of your hadoop jobs unless you run
>>>>> them from different hardware.
>>>
>>> I'm actually wondering if someone on this list has been able to use Hadoop's
>>> Fair Scheduler (or any other scheduler for that matter). This would
>>> definitely solve this problem (which i'm experiencing too). Is it at all
>>> possible to change the default scheduler with Nutch 1.0 (and the version of
>>> Hadoop that comes with it) or do we have to wait until the next hadoop
>>> upgrade ?
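The slot arithmetic behind the "at least 2 map and 2 reduce tasks" remark can be checked in a few lines. This is an illustrative simplification, not Hadoop's actual accounting: a queue's guaranteed slots are the cluster's slots scaled by the queue's guaranteed-capacity percentage, rounded down.

```java
/**
 * Illustrative arithmetic for capacity-scheduler queues (a simplification,
 * not Hadoop's real slot accounting): guaranteed slots for a queue are the
 * cluster's slots scaled by the queue's guaranteed-capacity percentage.
 */
public class QueueCapacitySketch {
  static int guaranteedSlots(int clusterSlots, int capacityPercent) {
    return clusterSlots * capacityPercent / 100;  // integer (floor) division
  }
}
```

With a 50/50 split, a cluster with only one map slot would leave each queue guaranteed zero map tasks, which is why at least two map and two reduce slots are needed for both queues to make progress concurrently.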
Re: Multiple Nutch instances for crawling?
I have integrated the CapacityTaskScheduler into my Nutch 1.0 setup, although it is not for doing concurrent fetching; it could be used for that purpose, though. Using the capacity scheduler you set up two separate queues and allocate half of the available resources (map/reduce tasks) to each queue. Technically there are three queues in my setup, two specialized ones and the 'default'. The capacity scheduler requires the 'default' queue to be defined although you don't have to send any jobs to it.

The only catch is you are not guaranteed symmetrical distribution if you have multiple machines in the cluster. That may or may not be an issue depending on your requirements.

-Jason

On Thu, Dec 17, 2009 at 11:00 AM, Yves Petinot wrote:
>>> Just one comment though I think hadoop will serialize your jobs any how
>>> so you won't get a parallel execution of your hadoop jobs unless you run
>>> them from different hardware.
>
> I'm actually wondering if someone on this list has been able to use Hadoop's
> Fair Scheduler (or any other scheduler for that matter). This would
> definitely solve this problem (which i'm experiencing too). Is it at all
> possible to change the default scheduler with Nutch 1.0 (and the version of
> Hadoop that comes with it) or do we have to wait until the next hadoop
> upgrade ?
>
> hopefully someone on the list can shed some light on this issue,
>
> cheers,
>
> -y
>
> MilleBii wrote:
>> I guess because of the different nutch-site.xml & url filter that you want
>> to use it won't work... but you could try installing nutch twice and run the
>> crawl/fetch/parse from those two locations, then join the segments to
>> recreate a unified searchable index (make sure you put all your segments
>> under the same location).
>>
>> Just one comment though: I think hadoop will serialize your jobs anyhow, so
>> you won't get a parallel execution of your hadoop jobs unless you run them
>> from different hardware.
>>
>> 2009/12/16 Christopher Bader
>>> Felix,
>>>
>>> I've had trouble running multiple instances. I would be interested in
>>> hearing from anyone who has done it successfully.
>>>
>>> CB
>>>
>>> On Wed, Dec 16, 2009 at 4:26 PM, Felix Zimmermann wrote:
>>>> Hi,
>>>>
>>>> I would like to run at least two instances of nutch ONLY for crawling at
>>>> one time; one for very frequently updated sites and one for other sites.
>>>> Will the nutch instances get in trouble when running several
>>>> crawlscripts, especially the nutch confdir variable?
>>>>
>>>> Thanks!
>>>> Felix.
Why does a url with a fetch status of 'fetch_gone' show up as 'db_unfetched'?
Why does a url with a fetch status of 'fetch_gone' show up as 'db_unfetched'? Shouldn't the crawldb entry have a status of 'db_gone'? This is happening in nutch-1.0.

Here is one example of what I'm talking about:

=
[jkon...@rampage search]$ ./bin/nutch readseg -get testParseSegment/20091202111849 "http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s"
Crawl Generate::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Fri Nov 27 16:28:09 PST 2009
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 7776000 seconds (90 days)
Score: 7.535359E-10
Signature: null
Metadata: _ngt_: 1259781530311

Crawl Fetch::
Version: 7
Status: 37 (fetch_gone)
Fetch time: Wed Dec 02 12:25:21 PST 2009
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Score: 2.47059988E10
Signature: null
Metadata: _ngt_: 1259781530311 _pst_: notfound(14), lastModified=0: http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s

[jkon...@rampage search]$ ./bin/nutch readdb testParseSegment/c -url "http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s"
URL: http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sat Apr 03 01:25:21 PDT 2010
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Score: 2.47059988E10
Signature: null
Metadata: _pst_: notfound(14), lastModified=0: http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s
=

Thanks,
Jason
support for robot rules that include a wild card
I'm using nutch-1.0 and have noticed after running some tests that the robot rules parser does not support wildcards (a.k.a. globbing) in rules. This means such a rule will not work the way the person who wrote the robots.txt file expected it to. For example:

User-Agent: *
Disallow: /somepath/*/someotherpath

Even yahoo has one such rule ( http://m.www.yahoo.com/robots.txt ):

User-agent: *
Disallow: /p/
Disallow: /r/
Disallow: /*?

With the popularity of the wildcard (*) in robots.txt files these days, what are the plans/thoughts on adding support for it in Nutch?

Thanks,
Jason
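For what wildcard support could look like, here is a self-contained sketch. This is not Nutch's robots parser; it just shows one way to interpret '*' in a Disallow path (any run of characters) and '$' (end-of-path anchor, as the major search engines treat it), matching everything else literally:

```java
import java.util.regex.Pattern;

/**
 * Sketch of wildcard handling for robots.txt Disallow rules (illustrative
 * only, not Nutch code): '*' matches any run of characters, '$' anchors
 * the end of the path, and all other characters match literally. The rule
 * is anchored at the start of the path, as robots.txt rules are.
 */
public class RobotsWildcardSketch {
  static boolean disallowed(String rule, String path) {
    StringBuilder re = new StringBuilder("^");
    for (char c : rule.toCharArray()) {
      if (c == '*') re.append(".*");            // wildcard: any characters
      else if (c == '$') re.append("$");        // end anchor
      else re.append(Pattern.quote(String.valueOf(c)));  // literal character
    }
    return Pattern.compile(re.toString()).matcher(path).find();
  }
}
```

Under this interpretation, yahoo's `Disallow: /*?` rule blocks any path containing a query string, and `/somepath/*/someotherpath` blocks paths with any single segment between the two fixed parts.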