Hi:
Which setting should I modify in order to do normalization before
filtering? Should I swap the order in the plugin.includes property?
Regards
On 7 June 2012 21:24, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:
Hi,
On Wed, Jun 6, 2012 at 10:16 AM, Markus Jelsma
Hi all:
Like I suspected, this vulnerability affects more properties besides the
ones I described in NUTCH-1385.
For instance, the property plugin.includes:
<value>plugin_1|plugin_2</value>
This is fine; it will load both plugins.
<value>plugin_1|plugin_2
</value>
This is not fine.
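For reference, a complete entry in nutch-site.xml would normally look like the sketch below (the plugin names are placeholders); keeping the whole value on a single line, with no surrounding whitespace, avoids the problem described above:

<property>
  <name>plugin.includes</name>
  <value>plugin_1|plugin_2</value>
  <description>Pipe-separated list of plugin IDs to include.</description>
</property>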
Hi,
Just checking if anyone could comment on my post below. :)
Thanks in advance.
Safdar
On Mon, Jun 11, 2012 at 8:10 AM, Ali Safdar Kureishy
safdar.kurei...@gmail.com wrote:
Hi,
I'm trying to build an incremental crawler, using the various Nutch
crawl tools (generate + fetch/parse +
The wiki information is not complete and doesn't work in all cases. I
have made some modifications; should I mail them to your personal ID?
Regards,
Bharat Goyal
On Monday 11 June 2012 11:00 AM, abhishek tiwari wrote:
Thanks for your response. I am very new to Nutch and Hadoop. Actually
I
Hi Ali,
Please check out this post [0] I found. I have to agree with the
response in the thread and state that I don't know how Hadoop ensures
even distribution of workload, but we can assume that by explicitly
specifying the number of mappers and reducers we can ensure that all 'will' be
used across your
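For what it's worth, a hedged sketch of the Hadoop 1.x settings this presumably refers to (the values are illustrative assumptions, not recommendations), set in mapred-site.xml or passed on the command line with -D:

<property>
  <name>mapred.map.tasks</name>
  <value>10</value>
  <description>Hint for the number of map tasks per job; the actual
  number also depends on the input splits.</description>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
  <description>Number of reduce tasks per job.</description>
</property>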
Hi Bharat,
If you are able to, it would be great to add this
info to the Nutch wiki. As you mention, the current Hadoop tutorial
seems awfully convoluted and we could do with simplifying it.
If you are able to contribute your efforts, please sign up to the wiki
and I will add
Hello, I have changed
<name>http.agent.name</name>
<value>MyNameSpider</value>
<name>http.robots.agents</name>
<value>MyNameSpider,*</value>
When I look at my website stats, under Robots/Spiders visitors I always see
Nutch, with a link to <http://nutch.apache.org/>.
Do you have a solution for the name of the spider?
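For comparison, a complete pair of entries in nutch-site.xml would look like this sketch (MyNameSpider is the poster's own example agent name):

<property>
  <name>http.agent.name</name>
  <value>MyNameSpider</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>MyNameSpider,*</value>
</property>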
That's the idea indeed. The urlmeta plugin allows you to do that simply by
setting urlmeta.tags in nutch-site.xml (see nutch-default.xml for a
description etc...)
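A minimal sketch of that property, assuming a custom tag named 'seed' (the tag name is an illustrative assumption):

<property>
  <name>urlmeta.tags</name>
  <value>seed</value>
  <description>Comma-separated list of metadata keys that the urlmeta
  scoring filter propagates to outlinks.</description>
</property>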
On 11 June 2012 22:45, Sebastian Nagel wastl.na...@googlemail.com wrote:
Hi Sandeep,
tracking the seed(s) for a document could be done
Forgot to say: this would work by adding a seed metadata key to the URLs in
the seed list, the value of which is then propagated by the scoring filter
in urlmeta
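Concretely, assuming urlmeta.tags contains a key named 'seed', each line of the seed list would carry the metadata as a tab-separated key=value pair, e.g.:

http://www.example.com/	seed=batch-01

The injector stores the pair in the CrawlDb, and the urlmeta scoring filter should then copy it onto the outlinks discovered from that page.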
On 12 June 2012 14:41, Julien Nioche lists.digitalpeb...@gmail.com wrote:
That's the idea indeed. The urlmeta plugin allows you to do that
Guys,
This has to do with the way URLs are grouped for politeness and not so much
with the number of blocks in the input. Limiting the number of URLs per host
name, domain or IP is a way of ensuring an even distribution across the cluster.
See nutch-default.xml for details
J.
On 12 June 2012 13:06,
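For reference, a sketch of the properties Julien mentions, as they appear in nutch-default.xml (override them in nutch-site.xml; the count value here is an illustrative assumption):

<property>
  <name>partition.url.mode</name>
  <value>byHost</value>
  <description>How URLs are partitioned across fetcher tasks:
  byHost, byDomain or byIP.</description>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
<property>
  <name>generate.max.count</name>
  <value>100</value>
  <description>Maximum number of URLs per host (or domain/IP, per
  generate.count.mode) in a single fetch list; -1 means no limit.</description>
</property>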
Why? Aren't unnecessary pages already filtered out in the parse step?
On Tue, Jun 12, 2012 at 12:52 AM, remi tassing tassingr...@gmail.com wrote:
Certainly, but you might need them to avoid crawling unnecessary pages
On Monday, June 11, 2012, Matthias Paul wrote:
Hi,
wouldn't it be better
Hello,
I am currently trying to use Nutch as a web site mirroring tool. To be more
explicit, I only need to download the pages, not to index them (I do not
intend to use it as a search engine). I couldn't figure out a simpler way to
accomplish my task, so what I do now is:
- crawl the site, using
Hello,
Here's a workaround as a last resort: I think you can add simple code to
remove all occurrences of the string "http://www.example.com/" from a dump
if you are going to use a Java program anyway.
Best,
Emre
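Along those lines, a minimal sketch of such a program (the file names and the target string are assumptions):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DumpCleaner {
    public static void main(String[] args) throws Exception {
        // Read the whole dump, drop every occurrence of the unwanted
        // string, and write the cleaned copy to a new file.
        String dump = new String(
                Files.readAllBytes(Paths.get("dump.txt")), StandardCharsets.UTF_8);
        String cleaned = dump.replace("http://www.example.com/", "");
        Files.write(Paths.get("dump-cleaned.txt"),
                cleaned.getBytes(StandardCharsets.UTF_8));
    }
}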
On Tue, Jun 12, 2012 at 5:01 PM, Vlad Paunescu vlad.paune...@gmail.com wrote:
Hello David,
can you specify which version of Nutch you are using?
I've run a local test crawl with Nutch 1.5 two weeks ago
and just looked into the Apache log file. All seems correct:
127.0.0.1 - - [31/May/2012:22:25:46 +0200] "GET /robots.txt HTTP/1.0" 404 462 "-" "sn-test-crawler/Nutch-1.5"
Thanks Julien,
I'd missed that urlmeta passes the tags to the outlinks.
Sebastian
On 06/12/2012 03:42 PM, Julien Nioche wrote:
Forgot to say: this would work by adding a seed metadata key to the URLs in
the seed list, the value of which is then propagated by the scoring filter
in urlmeta
On
According to the documentation, nutch inject is supposed to allow for an entry
with a custom score and fetchInterval as well as custom metadata values.
I have tried injecting a tab-delimited text file with entries like:
http://www.domain-one.com/ nutch.score=10 nutch.fetchInterval=172800
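For reference, a sketch of such a seed file, tab-separated, with one custom metadata key added (the 'category' key is an illustrative assumption):

http://www.domain-one.com/	nutch.score=10	nutch.fetchInterval=172800	category=news

nutch.score and nutch.fetchInterval are treated specially by the injector; any other key=value pair should end up as plain metadata on the CrawlDb entry.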
My current workaround would be to delete the .com and .au lines from
the configuration file.
You could also activate the +P option in suffix-urlfilter.txt:
# uncomment the line below to filter on url path
#+P
The patterns are then applied exclusively to the path of the URL
and not to the host or
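A sketch of what the relevant part of suffix-urlfilter.txt might look like with path-only matching enabled (the suffix list is illustrative; the listed suffixes are the ones filtered out):

# filter on url path only
+P
.exe
.zip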
Hi Magnús,
Firstly
On Tue, Jun 12, 2012 at 4:56 PM, Magnús Skúlason magg...@gmail.com wrote:
However, I would like to allow users to edit and extend the
content shown on my site, like adding a better description, adding
tags and sorting items into categories.
I have not built a search engine
Hello again,
Thanks. This does not seem very generalizable, though. Is there an up-to-date
way to achieve focused crawling using a plugin?
Hi everybody
I have an unusual issue. When I run Nutch on top of Hadoop, after the
map tasks finish, the reduce tasks start to finish very fast; almost all
of them finish in less than 2 hours, but there are always one or two that
take a lot longer. This is a link to the list of a completed reduce
Hi Magnus
-----Original Message-----
From: Magnús Skúlason [mailto:magg...@gmail.com]
Sent: Wednesday, 13 June 2012 1:57 AM
To: nutch-u...@lucene.apache.org
Subject: focused crawl extended with user generated content
Hi,
I am using Nutch for a focused-crawl vertical search engine, so