just for reference in the archives:
https://issues.apache.org/jira/browse/NUTCH-1538
--Roland
On 04.03.2013 11:05, Julien Nioche wrote:
Hi Roland,
Can you please open a JIRA for this? Thanks for investigating, the
explanation makes a lot of sense
Julien
On 4 March 2013 07:26, Roland <[email protected]> wrote:
Hi all,
I've read the sources ;)
(no, not really all, but enough, I hope)
So, the major difference between the generator and the fetcher is the set of
fields they load from the db.
As I had fetcher.store.content=true in the beginning, there was a lot of data
in the content fields.
I run with fetcher.parse=true, and that's why the FetcherJob loads all that
content during start-up.
I did this in my local 2.1 sources:
Index: src/java/org/apache/nutch/fetcher/FetcherJob.java
===================================================================
--- src/java/org/apache/nutch/fetcher/FetcherJob.java (revision 1448112)
+++ src/java/org/apache/nutch/fetcher/FetcherJob.java (working copy)
@@ -140,6 +140,8 @@
     if (job.getConfiguration().getBoolean(PARSE_KEY, false)) {
       ParserJob parserJob = new ParserJob();
       fields.addAll(parserJob.getFields(job));
+      fields.remove(WebPage.Field.CONTENT); // FIXME
+      fields.remove(WebPage.Field.OUTLINKS); // FIXME
     }
     ProtocolFactory protocolFactory = new ProtocolFactory(job.getConfiguration());
     fields.addAll(protocolFactory.getFields());
and now the start-up time of a FetcherJob is about 10 minutes :)
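For illustration: the field set collected in FetcherJob.getFields() is what ends
up restricting which columns Gora asks the backend for. Here is a minimal sketch
of that idea, assuming an already opened Gora DataStore<String, WebPage> named
store (the helper name and the exact field list are my own, not the real Nutch
code):

import org.apache.gora.query.Query;
import org.apache.gora.query.Result;
import org.apache.gora.store.DataStore;
import org.apache.nutch.storage.WebPage;

public class FieldProjectionSketch {
  // Hypothetical helper: scan the webtable but only ask the backend for the
  // columns we actually need, so content/outlinks never leave the store.
  static void scanWithoutContent(DataStore<String, WebPage> store) throws Exception {
    Query<String, WebPage> query = store.newQuery();
    query.setFields("baseUrl", "status", "fetchTime", "markers", "metadata");
    Result<String, WebPage> result = query.execute();
    while (result.next()) {
      WebPage page = result.get();
      // the content and outlinks fields were never read from the backend
      System.out.println(result.getKey() + " -> " + page.getStatus());
    }
    result.close();
  }
}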
--Roland
On 22.02.2013 10:28, Roland wrote:
Hi Julien,
ok, so thanks for the clarification, I think I have to read the sources :)
--Roland
On 22.02.2013 10:10, Julien Nioche wrote:
Hi Roland
My previous email should have started with "The point Alex is making is
..."
and not just "The point is ...".
I don't have an explanation as to why generating is faster than fetching, as
I don't use 2.x at all, but it would definitely be interesting to find out.
The behaviour of the fetcher is how I expect GORA to behave in its current
form, i.e. pull everything - filter - process.
Julien
On 21 February 2013 16:58, Roland <[email protected]> wrote:
Hi Julien,
the point I personally don't get is: why is generating fast and fetching not?
If it's possible to filter at the backend for the GeneratorJob (which I think
it does), shouldn't it be possible to do the same for the fetcher?
--Roland
On 21.02.2013 12:27, Julien Nioche wrote:
Lewis,
The point is whether the filtering is done on the backend side (e.g. using
queries, indices, etc.) and the result is then passed on to MapReduce via
GORA, or whether, as I assume from looking at the code, it is filtered within
MapReduce, which means that all the entries are pulled from the backend
anyway.
This makes quite a difference in terms of performance if you think e.g. about
a large webtable which would have to be passed entirely to MapReduce even if
only a handful of entries are to be processed.
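To make the contrast concrete, here is a rough sketch of the two options using
the Gora query API (the key range and the processing code are invented for
illustration, not taken from Nutch):

import org.apache.gora.query.Query;
import org.apache.gora.query.Result;
import org.apache.gora.store.DataStore;
import org.apache.nutch.storage.Mark;
import org.apache.nutch.storage.WebPage;

public class BackendVsMapReduceFilterSketch {
  // (a) pull everything - filter - process: the query is unrestricted, so every
  // webtable row is shipped out of the backend and only then discarded.
  static void filterAfterPulling(DataStore<String, WebPage> store) throws Exception {
    Result<String, WebPage> rows = store.newQuery().execute();
    while (rows.next()) {
      WebPage page = rows.get();
      if (Mark.GENERATE_MARK.checkMark(page) == null) {
        continue; // dropped here, but the row has already been transferred
      }
      // ... process the page ...
    }
    rows.close();
  }

  // (b) restrict on the backend side: only the requested key range ever leaves
  // the store, so unrelated rows are never pulled into MapReduce.
  static Query<String, WebPage> restrictAtBackend(DataStore<String, WebPage> store) {
    Query<String, WebPage> query = store.newQuery();
    query.setKeyRange("com.example/", "com.example0"); // hypothetical range
    return query;
  }
}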
Makes sense?
Julien
On 21 February 2013 01:52, Lewis John Mcgibbney
<[email protected]> wrote:
Those filters are skipped for URLs which do not have a null GENERATE_MARK,
e.g.
if (Mark.GENERATE_MARK.checkMark(page) != null) {
  if (GeneratorJob.LOG.isDebugEnabled()) {
    GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
  }
  return;
}
Therefore filters will be applied to all URLs which have a null
GENERATE_MARK value.
On Wed, Feb 20, 2013 at 2:45 PM, <[email protected]> wrote:
Hi,
Are those filters applied to all data selected from hbase, or sent to hbase
as filters to select a subset of all hbase records?
Thanks.
Alex.
-----Original Message-----
From: Lewis John Mcgibbney <[email protected]>
To: user <[email protected]>
Sent: Wed, Feb 20, 2013 12:56 pm
Subject: Re: nutch with cassandra internal network usage
Hi Alex,
On Wed, Feb 20, 2013 at 11:54 AM, <[email protected]> wrote:
The generator also does not have filters. Its mapper goes over all records
as far as I know. If you use hadoop you can see how many records go as input
to mappers. Also see this
I don't think this is true. The GeneratorMapper filters URLs before
selecting them for inclusion, based on the following:
- distance
- URLNormalizer(s)
- URLFilter(s)
in that order.
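Roughly, and simplified almost to pseudo-code, the per-URL checks look like
this (the method and its parameters are my own framing, not the actual
GeneratorMapper code):

import org.apache.nutch.net.URLFilters;
import org.apache.nutch.net.URLNormalizers;

public class GeneratorChecksSketch {
  // Returns the (possibly rewritten) URL if it survives the checks, null otherwise.
  static String check(String url, int distance, int maxDistance,
      URLNormalizers normalizers, URLFilters filters) throws Exception {
    // 1. distance: drop pages that are too many hops away from a seed
    if (maxDistance > -1 && distance > maxDistance) {
      return null;
    }
    // 2. URL normalizer(s) for the generate scope
    url = normalizers.normalize(url, URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
    // 3. URL filter(s): a null return value means the URL is rejected
    return url == null ? null : filters.filter(url);
  }
}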
I am going to start a new thread on improvements to the GeneratorJob
regarding better configuration as it is a crucial stage in the crawl
process.
So the issue here, as you correctly explain, is with the Fetcher obtaining
the URLs which have been marked with the desired batchId. This would be done
via a scanner in Gora.
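As a small illustration of the per-row check I have in mind (simplified, not
the actual FetcherMapper code):

import org.apache.nutch.storage.Mark;
import org.apache.nutch.storage.WebPage;

public class BatchIdCheckSketch {
  // A row carries its batch id in the generate marker; the fetcher only keeps
  // rows whose marker matches the batch it was asked to fetch.
  static boolean belongsToBatch(WebPage page, String batchId) {
    CharSequence mark = Mark.GENERATE_MARK.checkMark(page);
    return mark != null && mark.toString().equals(batchId);
  }
}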
--
Lewis