Re: nutch crawling
Hi, a similar question was posted yesterday (Query in nutch) - as Lewis suggested, NUTCH-585 [1] might be what you need. Best, Elisabeth

[1] https://issues.apache.org/jira/browse/NUTCH-585

On 29.02.2012 12:15, sanjay87 wrote: Hi Techies, I have some questions about Nutch, the web crawler. I have managed to crawl a website and index it in Solr, but the problem is that Nutch crawls at the whole-page level, so menu items, anchor text, and other boilerplate are indexed even though they are not needed. I only want to index the legitimate content of each page. When I check the crawl through the localhost:8080/solr/admin page, the response is not what I want: the content field holds all of this unneeded data. We have tried a lot of options and still cannot find a solution; please share your valuable inputs. Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/nutch-crawling-tp3786913p3786913.html Sent from the Nutch - User mailing list archive at Nabble.com.
Distributed Indexing on MapReduce
Hi all, I am looking into reusing some existing code for distributed indexing to test a Mahout tool I am working on: https://issues.apache.org/jira/browse/MAHOUT-944

What I want is to index the Apache Public Mail Archives dataset (200 GB) via MapReduce on Hadoop. I have been going through the Nutch and contrib/index code, and from my understanding I have to:

* Create an InputFormat / RecordReader / InputSplit class for splitting the e-mails across mappers
* Create a Mapper which emits the e-mails as key/value pairs
* Create a Reducer which indexes the e-mails on the local filesystem (or straight to HDFS?)
* Copy these indexes from the local filesystem to HDFS. In the same Reducer?

I am unsure about the final steps, i.e. how to get to the end result: a set of index shards on HDFS. It seems that each Reducer needs to know which HDFS directory it will eventually write to, but I don't see how to get each reducer to copy its shard there. How do I set this up? Cheers, Frank
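One common pattern for that last step is to have each reducer build its index in a task-local directory and then, in its cleanup phase, move the finished shard to a per-task directory on HDFS (e.g. via FileSystem.copyFromLocalFile). A minimal sketch of the per-reducer shard-naming idea in plain Java; ShardPaths, the layout, and the directory names are illustrative assumptions, not Nutch or Mahout code, and in a real Hadoop job the task id would come from the reducer's task attempt ID:

```java
// Sketch: derive one unique HDFS output directory per reducer so that
// reducers finishing concurrently never write to the same shard path.
public class ShardPaths {

    // baseDir is the HDFS target directory for the whole job; taskId is
    // the reducer's numeric task id (unique per reducer in a job).
    static String shardPath(String baseDir, int taskId) {
        return String.format("%s/shard-%05d", baseDir, taskId);
    }

    public static void main(String[] args) {
        // Reducers 0 and 12 get disjoint shard directories.
        System.out.println(shardPath("/user/frank/indexes", 0));
        System.out.println(shardPath("/user/frank/indexes", 12));
    }
}
```

Because each reducer only ever touches its own shard directory, no coordination between reducers is needed; the job's output is then simply the set of shard-NNNNN directories under the base path.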
Re: http.redirect.max
Hello, I tried 1, 2, and -1 for the http.redirect.max setting, but Nutch still postpones redirected URLs to later depths. What is the correct setting to have Nutch crawl redirected URLs immediately? I need this because I am restricted to a depth of at most 2. Thanks. Alex.

-Original Message- From: xuyuanme xuyua...@gmail.com To: user user@nutch.apache.org Sent: Fri, Feb 24, 2012 1:31 am Subject: Re: http.redirect.max

The config file is used for some proof-of-concept testing, so the content might be confusing; please ignore the incorrect parts. Yes, on my end the crawl of http://www.scotland.gov.uk is redirected as expected. However, the website I am trying to crawl is a bit trickier. Here is what I want to do:

1. Set http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B as the seed page
2. Try to crawl one of its links (http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.Overview&DrugName=BACIGUENT) as a test

If you click the link, you will find that the site uses a redirect plus a cookie to control page navigation, so I used the protocol-httpclient plugin instead of protocol-http to handle the cookie. However, the redirect does not happen as expected. The only way I can fetch the second link is to manually change the response = getResponse(u, datum, false) call to response = getResponse(u, datum, true) in org.apache.nutch.protocol.http.api.HttpBase.java and recompile the lib-http plugin. So my issue is specific to this site: http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B

lewis john mcgibbney wrote: I've checked working with redirects and everything seems to work fine for me.
The site I checked, http://www.scotland.gov.uk, temporarily redirects to http://home.scotland.gov.uk/home. Nutch handles this fine once I tweak the redirects property in nutch-site.xml to -1 (just to demonstrate; I would not usually set it that way). Lewis
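For reference, the property being discussed lives in nutch-site.xml; a minimal sketch (the value 2 matches Alex's depth restriction; as this thread shows, whether the fetcher actually follows the redirect within the same fetch can still depend on the Nutch version and the protocol plugin in use):

```xml
<property>
  <name>http.redirect.max</name>
  <!-- Follow up to 2 redirects inside the same fetch. With the default of 0,
       the fetcher records the redirect target for a later
       generate/fetch cycle instead of following it immediately. -->
  <value>2</value>
</property>
```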
Re: Featured link support in Nutch
Hi, what is a featured link? Maybe Solr's elevator component is what you are looking for? cheers

On Thu, 1 Mar 2012 11:59:00 -0800, Stany Fargose stannyfarg...@gmail.com wrote: Hi All, I am working on replacing our current site search with Nutch + Solr. I am very new to these technologies, but I like what they offer. I have the basic setup working. I was wondering how we would implement a 'featured link' using Nutch and Solr. I would like to hear your thoughts. Thanks in advance. -Stan

-- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350
Re: multiple small crawlers on single machine conflict at /tmp/hadoop-username/mapred
You can either:

1. run on Hadoop
2. not run multiple concurrent jobs on a local machine
3. set a hadoop.tmp.dir per job
4. merge all crawls into a single crawl

On Thu, 1 Mar 2012 16:26:00 -0500, Jeremy Villalobos jeremyvillalo...@gmail.com wrote: Hello: I am running multiple small crawls on one machine. I notice that they are conflicting because they all access /tmp/hadoop-username/mapred. How do I change the location of this folder? Do I have to use Hadoop to run multiple crawlers, each specific to a site? thanks Jeremy
Re: multiple small crawlers on single machine conflict at /tmp/hadoop-username/mapred
That is what I was looking for, thank you. This property was added to $NUTCH_DIR/runtime/local/conf/nutch-site.xml. Jeremy

On Thu, Mar 1, 2012 at 7:01 PM, Markus Jelsma markus.jel...@openindex.io wrote: you can either: 1. run on hadoop 2. not run multiple concurrent jobs on a local machine 3. set a hadoop.tmp.dir per job 4. merge all crawls to a single crawl
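A sketch of what that property block could look like in each copy's nutch-site.xml; the path is only an example, and the important point is that every concurrent crawl needs a distinct value:

```xml
<property>
  <name>hadoop.tmp.dir</name>
  <!-- Give this crawl its own scratch space instead of the shared default
       /tmp/hadoop-${user.name}, so concurrent local jobs don't collide. -->
  <value>/tmp/hadoop-crawl-siteA</value>
</property>
```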
Only fetching initial seedlist
Hello, I am having a problem getting Nutch to crawl and fetch only the initial seed list. Nutch seems to skip some URLs, or perhaps it does not parse some of them. For example, with the following seed list:

http://www.domain.com/?_PageId=492&AreaId=441
http://www.domain.com/?_PageId=631&AreaId=11
http://www.domain.com/?_PageId=490&AreaId=19

Nutch does not fetch and parse all of the URLs. I am not interested in the outlinks; my goal is to crawl, fetch, and parse the seed list ONLY. I am using the crawl command with a depth of 1 and an unlimited topN. I have also tried injecting manually. Thanks, James Ford
Re: multiple small crawlers on single machine conflict at /tmp/hadoop-username/mapred
How did you define that property so it's different for each job? Remi

On Friday, March 2, 2012, Jeremy Villalobos jeremyvillalo...@gmail.com wrote: That is what I was looking for, thank you. This property was added to $NUTCH_DIR/runtime/local/conf/nutch-site.xml. Jeremy
Re: Only fetching initial seedlist
This question comes up a lot; try searching the mailing list archive.

On Friday, March 2, 2012, James Ford simon.fo...@gmail.com wrote: Hello, I am having a problem getting nutch to crawl and fetch the initial seedlist only. It seems like nutch tend to skip some urls? Or it does not parse some of them? [...] I am not that interested in the outlinks, my general purpose is to crawl, fetch and parse the seedlist ONLY. I am using the crawl command with a depth of 1 and infinite topN. I have also tried injecting manually. Thanks, James Ford
different fetch interval for each depth urls
Hello, I need different fetch intervals for the initial seed URLs and for the URLs extracted from them at depth 1. How can this be achieved? I tried the -adddays option of the generate command, but it does not seem to solve this issue. Thanks in advance. Alex.
Re: multiple small crawlers on single machine conflict at /tmp/hadoop-username/mapred
It is a small number of crawlers, so I copied a runtime directory for each one, and therefore each has its own configuration files. Jeremy

On Thu, Mar 1, 2012 at 10:57 PM, remi tassing tassingr...@gmail.com wrote: How did you define that property so it's different for each job? Remi
Re: multiple small crawlers on single machine conflict at /tmp/hadoop-username/mapred
You can also pass it to most jobs with $ nutch job -Dhadoop.tmp.dir=bla args. This can even be automated with some shell scripting.

On Fri, 2 Mar 2012 00:49:36 -0500, Jeremy Villalobos jeremyvillalo...@gmail.com wrote: It is a small number of crawlers, so I copied a runtime directory for each one, and therefore each has its own configuration files. Jeremy
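A hedged sketch of that shell scripting; this version only prints the command lines it would run, and both the exact bin/nutch invocation and the -D property pass-through are assumptions to verify against your Nutch version:

```shell
# Build one crawl command per site, each with a private hadoop.tmp.dir,
# so concurrent local jobs no longer collide in /tmp/hadoop-$USER/mapred.
# Replace 'echo' with the real invocation once the flags are confirmed.
for site in siteA siteB; do
  echo bin/nutch crawl "seeds/$site" -dir "crawl/$site" \
       "-Dhadoop.tmp.dir=/tmp/hadoop-$site"
done
```

Each loop iteration emits a command pointing at its own seed list, crawl directory, and temp directory, which is exactly the isolation the per-runtime-copy approach achieves by hand.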
Re: Only fetching initial seedlist
Indeed. Check your URL filters and plugins.

On Fri, 2 Mar 2012 05:59:20 +0200, remi tassing tassingr...@gmail.com wrote: This question comes up a lot; try searching the mailing list archive.
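Following Markus's hint, one frequent culprit for exactly this symptom is the stock conf/regex-urlfilter.txt, which rejects any URL containing query-string characters and would silently drop seeds like the ?_PageId=... ones above. A sketch of the fix, hedged since filter files vary between Nutch versions:

```
# conf/regex-urlfilter.txt -- the stock rule below skips URLs containing
# '?', '*', '!', '@' or '=', i.e. every seed carrying query parameters.
# Comment it out (or narrow it) if your seeds use query strings:
# -[?*!@=]

# keep the final accept-everything rule:
+.
```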
Re: different fetch interval for each depth urls
Well, you could set a new default fetch interval in your configuration after the first crawl cycle, but the depth information is lost if you continue crawling, so there is no complete solution. What problem are you trying to solve anyway?

On Fri, 2 Mar 2012 00:19:34 -0500 (EST), alx...@aim.com wrote: Hello, I need different fetch intervals for the initial seed URLs and for the URLs extracted from them at depth 1. How can this be achieved? I tried the -adddays option of the generate command, but it does not seem to solve this issue. Thanks in advance. Alex.
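One partial workaround, offered as a hedged sketch: the 1.x Injector understands per-URL metadata in the seed list, so the seeds can carry their own interval while everything discovered at depth 1 falls back to db.fetch.interval.default. Verify that your Nutch version supports nutch.fetchInterval seed metadata before relying on this; the URL below is a placeholder:

```
# seeds.txt -- tab-separated metadata after each URL; 86400 s = 1 day
http://www.example.com/	nutch.fetchInterval=86400
```

The interval for discovered (depth-1) URLs would then be controlled separately by the db.fetch.interval.default property in nutch-site.xml.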
Re: Featured link support in Nutch
That's what I was looking for. Thanks Markus! I also have another question (I did a lot of searching on this already). We want to get results using a 'starts with' or prefix query, e.g. return all results where the url starts with http://auto.yahoo.com. Thanks again!

On Thu, Mar 1, 2012 at 3:59 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, what is a featured link? Maybe Solr's elevator component is what you are looking for? cheers
Re: Featured link support in Nutch
A wildcard query or an edge n-gram filter. The terms component can also do this, or even facet.prefix!

On Thu, 1 Mar 2012 23:15:23 -0800, Stany Fargose stannyfarg...@gmail.com wrote: That's what I was looking for. Thanks Markus! I also have another question (I did a lot of searching on this already). We want to get results using a 'starts with' or prefix query, e.g. return all results where the url starts with http://auto.yahoo.com. Thanks again!
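Hedged request sketches for those options against a url field; the field name, handlers, and schema types are assumptions (a wildcard query needs url to be an un-tokenized, string-like field, with the ':' in the term escaped), so treat these as starting points rather than drop-in queries:

```
# wildcard query on an un-tokenized url field
q=url:http\://auto.yahoo.com*

# terms component in prefix mode
qt=/terms&terms.fl=url&terms.prefix=http://auto.yahoo.com

# facet prefix counting over the url field
q=*:*&facet=true&facet.field=url&facet.prefix=http://auto.yahoo.com
```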