Re: nutch crawling

2012-03-01 Thread Elisabeth Adler
Hi, A similar question was posted yesterday (Query in nutch) - as Lewis suggested, NUTCH-585 [1] might be what you need. Best, Elisabeth [1] https://issues.apache.org/jira/browse/NUTCH-585 On 29.02.2012 12:15, sanjay87 wrote: Hi Techies, I have some queries related to Nutch- the

Distributed Indexing on MapReduce

2012-03-01 Thread Frank Scholten
Hi all, I am looking into reusing some existing code for distributed indexing to test a Mahout tool I am working on: https://issues.apache.org/jira/browse/MAHOUT-944 What I want is to index the Apache Public Mail Archives dataset (200G) via MapReduce on Hadoop. I have been going through the

Re: http.redirect.max

2012-03-01 Thread alxsss
Hello, I tried 1, 2, -1 for the config http.redirect.max, but Nutch still postpones redirected URLs to later depths. What is the correct config setting to have Nutch crawl redirected URLs immediately? I need it because I have a restriction that depth be at most 2. Thanks. Alex.
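For reference, the documented intent of http.redirect.max is to make the fetcher follow up to that many redirects within the same fetch rather than queue the target URL for a later cycle. A sketch of the setting in nutch-site.xml (the value 2 is illustrative, and per this thread it may not behave as expected in all versions):

```xml
<!-- conf/nutch-site.xml: follow up to 2 redirects during the same
     fetch instead of postponing the redirect target to a later depth -->
<property>
  <name>http.redirect.max</name>
  <value>2</value>
</property>
```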

Re: Featured link support in Nutch

2012-03-01 Thread Markus Jelsma
Hi What is a featured link? Maybe Solr's elevation component is what you are looking for? cheers On Thu, 1 Mar 2012 11:59:00 -0800, Stany Fargose stannyfarg...@gmail.com wrote: Hi All, I am working on replacing our current site search with Nutch-Solr. I am very new to these technologies
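If "featured link" means pinning chosen documents to the top of certain queries, Solr's QueryElevationComponent reads an elevate.xml along these lines (the query text and document id below are made up for illustration):

```xml
<!-- conf/elevate.xml: pin doc-123 to the top of results
     whenever the user searches for "featured" -->
<elevate>
  <query text="featured">
    <doc id="doc-123"/>
  </query>
</elevate>
```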

Re: multiple small crawlers on single machine conflict at /tmp/hadoop-username/mapred

2012-03-01 Thread Markus Jelsma
You can either: 1. run on Hadoop 2. not run multiple concurrent jobs on a local machine 3. set a hadoop.tmp.dir per job 4. merge all crawls into a single crawl On Thu, 1 Mar 2012 16:26:00 -0500, Jeremy Villalobos jeremyvillalo...@gmail.com wrote: Hello: I am running multiple small crawls on
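Option 3 can be sketched as a per-crawl property in each runtime's nutch-site.xml, so concurrent local jobs stop colliding under /tmp/hadoop-username/mapred (the path below is illustrative):

```xml
<!-- nutch-site.xml for one crawl: give this job its own temp dir
     so it does not clash with other local crawls -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-crawl-a</value>
</property>
```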

Re: multiple small crawlers on single machine conflict at /tmp/hadoop-username/mapred

2012-03-01 Thread Jeremy Villalobos
That is what I was looking for, thank you. This property was added to: $NUTCH_DIR/runtime/local/conf/nutch-site.xml Jeremy On Thu, Mar 1, 2012 at 7:01 PM, Markus Jelsma markus.jel...@openindex.io wrote: You can either: 1. run on Hadoop 2. not run multiple concurrent jobs on a local machine

Only fetching initial seedlist

2012-03-01 Thread James Ford
Hello, I am having a problem getting Nutch to crawl and fetch the initial seedlist only. It seems like Nutch tends to skip some URLs? Or it does not parse some of them? For example, with the following seedlist: http://www.domain.com/?_PageId=492&AreaId=441

Re: multiple small crawlers on single machine conflict at /tmp/hadoop-username/mapred

2012-03-01 Thread remi tassing
How did you define that property so it's different for each job? Remi On Friday, March 2, 2012, Jeremy Villalobos jeremyvillalo...@gmail.com wrote: That is what I was looking for, thank you. This property was added to: $NUTCH_DIR/runtime/local/conf/nutch-site.xml Jeremy On Thu, Mar 1,

Re: Only fetching initial seedlist

2012-03-01 Thread remi tassing
This question comes up a lot; try searching the mailing list archive. On Friday, March 2, 2012, James Ford simon.fo...@gmail.com wrote: Hello, I am having a problem getting Nutch to crawl and fetch the initial seedlist only. It seems like Nutch tends to skip some URLs? Or it does not parse some of

different fetch interval for each depth urls

2012-03-01 Thread alxsss
Hello, I need different fetch intervals for the initial seed URLs and the URLs extracted from them at depth 1. How can this be achieved? I tried the -adddays option of the generate command, but it seems it cannot be used to solve this issue. Thanks in advance. Alex.

Re: multiple small crawlers on single machine conflict at /tmp/hadoop-username/mapred

2012-03-01 Thread Jeremy Villalobos
It is a small number of crawlers, so I copied a runtime for each, and therefore each has its own configuration files. Jeremy On Thu, Mar 1, 2012 at 10:57 PM, remi tassing tassingr...@gmail.com wrote: How did you define that property so it's different for each job? Remi On Friday, March 2, 2012,

Re: multiple small crawlers on single machine conflict at /tmp/hadoop-username/mapred

2012-03-01 Thread Markus Jelsma
You can also pass it to most jobs with $ nutch job -Dhadoop.tmp.dir=bla args. This can even be automated with some shell scripting. On Fri, 2 Mar 2012 00:49:36 -0500, Jeremy Villalobos jeremyvillalo...@gmail.com wrote: It is a small number of crawlers, so I copied a runtime for each.
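A minimal sketch of that shell-scripting idea: loop over crawl names and build a command line that gives each job its own hadoop.tmp.dir via -D. The crawl names and paths are illustrative, the command is only printed (not executed), and the exact argument order may vary by Nutch version:

```shell
# Give each concurrent local crawl its own hadoop.tmp.dir so they
# do not collide under /tmp. Names and paths are illustrative.
for name in crawl-a crawl-b; do
  cmd="bin/nutch crawl urls/$name -dir $name -Dhadoop.tmp.dir=/tmp/hadoop-$name"
  echo "$cmd"
done
```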

Re: Only fetching initial seedlist

2012-03-01 Thread Markus Jelsma
Indeed. Check your URL filters and plugins. On Fri, 2 Mar 2012 05:59:20 +0200, remi tassing tassingr...@gmail.com wrote: This question comes up a lot; try searching the mailing list archive. On Friday, March 2, 2012, James Ford simon.fo...@gmail.com wrote: Hello, I am having a problem getting Nutch to
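A common culprit for skipped seeds is conf/regex-urlfilter.txt, whose rules accept (+) or reject (-) URLs by regex. A sketch for keeping only one site (the domain pattern is illustrative; note the default rules also reject URLs containing query-string characters like ? and =):

```
# conf/regex-urlfilter.txt (sketch): accept only the target site
+^http://www\.domain\.com/
# reject everything else
-.
```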

Re: different fetch interval for each depth urls

2012-03-01 Thread Markus Jelsma
Well, you could set a new default fetch interval in your configuration after the first crawl cycle, but the depth information is lost if you continue crawling, so there is no real solution. What problem are you trying to solve anyway? On Fri, 2 Mar 2012 00:19:34 -0500 (EST), alx...@aim.com
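The workaround Markus describes amounts to editing the default interval between cycles. The relevant property (value in seconds; 30 days shown as an illustration) is:

```xml
<!-- conf/nutch-site.xml: default re-fetch interval in seconds;
     change this between crawl cycles per the workaround above -->
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
</property>
```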

Re: Featured link support in Nutch

2012-03-01 Thread Stany Fargose
That's what I was looking for. Thanks Markus! I also have another question (did a lot of searching on this already). We want to get results using a 'starts with' or prefix query, e.g. return all results where the URL starts with http://auto.yahoo.com Thanks again! On Thu, Mar 1, 2012 at 3:59 PM,

Re: Featured link support in Nutch

2012-03-01 Thread Markus Jelsma
Wildcard query or edge n-gram filter. The terms component can also do this, or even facet.prefix! On Thu, 1 Mar 2012 23:15:23 -0800, Stany Fargose stannyfarg...@gmail.com wrote: That's what I was looking for. Thanks Markus! I also have another question (did a lot of searching on this already). We want
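The edge n-gram option can be sketched as an index-time analyzer in schema.xml: the whole URL is kept as one token and expanded into prefixes at index time, so a plain term query on the prefix matches. The field type name and gram sizes below are illustrative:

```xml
<!-- schema.xml (sketch): index-time edge n-grams so a query for
     "http://auto.yahoo.com" matches any URL starting with it -->
<fieldType name="url_prefix" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```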