Hi,
A similar question was posted yesterday (Query in Nutch). As
Lewis suggested, NUTCH-585 [1] might be what you need.
Best,
Elisabeth
[1] https://issues.apache.org/jira/browse/NUTCH-585
On 29.02.2012 12:15, sanjay87 wrote:
Hi Techies,
I have some questions related to Nutch - the
Hi all,
I am looking into reusing some existing code for distributed indexing
to test a Mahout tool I am working on
https://issues.apache.org/jira/browse/MAHOUT-944
What I want is to index the Apache Public Mail Archives dataset (200G)
via MapReduce on Hadoop.
I have been going through the
Hello,
I tried 1, 2, and -1 for the http.redirect.max config, but Nutch still postpones
redirected URLs to later depths.
What is the correct setting to make Nutch crawl redirected URLs
immediately? I need it because I have a restriction that depth be at most 2.
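In case it helps, this is the shape of the setting in my nutch-site.xml (the value 2 is just one of the values I tried):

```xml
<!-- in conf/nutch-site.xml; a positive value should make the fetcher
     follow up to that many redirects itself rather than queueing them -->
<property>
  <name>http.redirect.max</name>
  <value>2</value>
</property>
```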
Thanks.
Alex.
Hi
What is a featured link? Maybe Solr's query elevation component is what you
are looking for?
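If by featured link you mean pinning certain documents to the top for certain queries, the elevation component reads an elevate.xml along these lines (the query text and doc id below are just placeholders):

```xml
<elevate>
  <query text="example query">
    <!-- documents listed here are forced to the top for that query -->
    <doc id="http://www.example.com/featured-page"/>
  </query>
</elevate>
```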
cheers
On Thu, 1 Mar 2012 11:59:00 -0800, Stany Fargose
stannyfarg...@gmail.com wrote:
Hi All,
I am working on replacing our current site search with Nutch and Solr. I am
very new to these technologies
you can either:
1. run on hadoop
2. not run multiple concurrent jobs on a local machine
3. set a hadoop.tmp.dir per job
4. merge all crawls to a single crawl
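For option 3, that would look something like this in each job's nutch-site.xml (the path is just an example):

```xml
<property>
  <name>hadoop.tmp.dir</name>
  <!-- give every concurrent local job its own tmp dir so they don't clash -->
  <value>/tmp/hadoop-crawl-job-1</value>
</property>
```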
On Thu, 1 Mar 2012 16:26:00 -0500, Jeremy Villalobos
jeremyvillalo...@gmail.com wrote:
Hello:
I am running multiple small crawls on
That is what I was looking for, thank you.
this property was added to:
$NUTCH_DIR/runtime/local/conf/nutch-site.xml
Jeremy
On Thu, Mar 1, 2012 at 7:01 PM, Markus Jelsma markus.jel...@openindex.io wrote:
you can either:
1. run on hadoop
2. not run multiple concurrent jobs on a local machine
Hello,
I am having a problem getting Nutch to crawl and fetch only the initial
seedlist. It seems like Nutch tends to skip some URLs, or does not parse some
of them?
For example with the following seedlist:
http://www.domain.com/?_PageId=492&AreaId=441
How did you define that property so it's different for each job?
Remi
On Friday, March 2, 2012, Jeremy Villalobos jeremyvillalo...@gmail.com
wrote:
That is what I was looking for, thank you.
this property was added to:
$NUTCH_DIR/runtime/local/conf/nutch-site.xml
Jeremy
On Thu, Mar 1,
This question comes up a lot; try searching the mailing list archive.
On Friday, March 2, 2012, James Ford simon.fo...@gmail.com wrote:
Hello,
I am having a problem getting nutch to crawl and fetch the initial
seedlist
only. It seems like Nutch tends to skip some URLs, or does not parse some
of
Hello,
I need to have different fetch intervals for the initial seed URLs and the URLs
extracted from them at depth 1. How can this be achieved? I tried the -adddays
option of the generate command, but it seems it cannot be used to solve this issue.
Thanks in advance.
Alex.
It is a small number of crawlers, so I copied a runtime directory for each,
and therefore each has its own configuration files.
Jeremy
On Thu, Mar 1, 2012 at 10:57 PM, remi tassing tassingr...@gmail.com wrote:
How did you define that property so it's different for each job?
Remi
On Friday, March 2, 2012,
You can also pass it to most jobs with $ nutch job
-Dhadoop.tmp.dir=bla args. This can even be automated with some shell
scripting.
On Fri, 2 Mar 2012 00:49:36 -0500, Jeremy Villalobos
jeremyvillalo...@gmail.com wrote:
It is a small number of crawlers, so I copied a runtime for each.
Indeed. Check your URL filters and plugins.
On Fri, 2 Mar 2012 05:59:20 +0200, remi tassing tassingr...@gmail.com
wrote:
This question comes up a lot; try searching the mailing list archive.
On Friday, March 2, 2012, James Ford simon.fo...@gmail.com wrote:
Hello,
I am having a problem getting nutch to
Well, you could set a new default fetch interval in your configuration
after the first crawl cycle, but the depth information is lost if you
continue crawling, so there is no real solution.
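A sketch of what I mean, in nutch-site.xml (the property is db.fetch.interval.default; the value is in seconds and just an example):

```xml
<property>
  <name>db.fetch.interval.default</name>
  <!-- refetch interval in seconds; 2592000 = 30 days -->
  <value>2592000</value>
</property>
```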
What problem are you trying to solve anyway?
On Fri, 2 Mar 2012 00:19:34 -0500 (EST), alx...@aim.com
That's what I was looking for. Thanks Markus!
I also have another question (I did a lot of searching on this already). We want
to get results using a 'starts with' or prefix query.
e.g. return all results where the URL starts with http://auto.yahoo.com
Thanks again!
On Thu, Mar 1, 2012 at 3:59 PM,
A wildcard query or an edge n-gram filter. The terms component can also do this,
or even facet.prefix!
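For the edge n-gram route, a sketch of a Solr field type (the name and gram sizes are just examples):

```xml
<fieldType name="url_prefix" class="solr.TextField">
  <analyzer type="index">
    <!-- keep the whole URL as one token, then index every prefix of it -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50"/>
  </analyzer>
  <analyzer type="query">
    <!-- at query time, match the literal URL prefix as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```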
On Thu, 1 Mar 2012 23:15:23 -0800, Stany Fargose
stannyfarg...@gmail.com wrote:
That's what I was looking for. Thanks Markus!
I also have another question (I did a lot of searching on this already). We want