Re: Custom Generator or ScoringFilter (or Fetch)

Lewis John Mcgibbney Wed, 13 Jan 2016 17:55:30 -0800

Hi Lex,

On Wed, Jan 13, 2016 at 2:49 PM, <[email protected]> wrote:


> Thanks for the response Lewis.
>

np


>
> I'll give Nutch 2.3.1 a spin later tonight.
>

Nice


>
> I didn't have success with batchId. I thought I could overwrite this in the
> DB with 123 and then ./fetch 123 would get all urls marked with 123.
>

Well, yes, this is the case. However, please consider that batches are
generated based on the presence of a marker indicating that the URL is
suitable to be fetched. In addition, the size of any given batch defaults
to Long.MAX_VALUE. You can reduce this by passing the -topN parameter to
the generate command. Please see the command line arguments for further
details; scroll down to the bottom of the page below to see the CLI
parameters for 2.X.
http://wiki.apache.org/nutch/bin/nutch%20generate
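
As a rough sketch of capping the batch size, something like the following
builds the generate call. The path bin/nutch and the value 1000 are only
illustrative assumptions, so adjust them to your install and crawl size.

```shell
# Hypothetical values: path to the nutch script and the batch size cap.
NUTCH="bin/nutch"
TOPN=1000

# Cap the batch at TOPN URLs instead of the Long.MAX_VALUE default.
GENERATE_CMD="$NUTCH generate -topN $TOPN"

# Print the command so it can be checked before it is run.
echo "$GENERATE_CMD"

# To run it for real, from the Nutch runtime directory:
#   $GENERATE_CMD
# and then fetch the batch id that generate reports:
#   bin/nutch fetch <batchId>
```

The fetch step then only picks up URLs that the generate step marked for
that batch.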



> I seem to be missing where the generate command stores its segments.
>

Ah... so this is where you are lacking some context. Nutch 2.X does not
work off of the concept of segment(s). The entire persistence layer is
managed via a Gora datastore, e.g. a database. This is to say that all of
the data structures from Nutch 1.X (crawldb, linkdb and segments) are
represented by their equivalents in the Gora datastore.


>
> For now I'm happy looking through the code for the first time.
>

This is advisable, but Nutch is quite extensive, so it may take some
time.


> I think I'll try building a generator or fetch job which can
> prioritize/boost domains.


This can be done within the InjectorJob, as explained here:
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/crawl/InjectorJob.java#L51-L60
If you have any questions about this, then please ask.
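
A minimal sketch of what that could look like, assuming the reserved
nutch.score metadata key described in that Javadoc (one URL per line,
metadata separated by tabs, key=value). The example.com/example.org URLs,
the scores and the urls/ directory name are made up for illustration.

```shell
# Create a seed directory (the name "urls" is just an assumption).
mkdir -p urls

# One URL per line, followed by tab-separated key=value metadata.
# A higher nutch.score gives that URL a higher initial score.
printf 'http://www.example.com/\tnutch.score=10\n' >  urls/seed.txt
printf 'http://www.example.org/\tnutch.score=1\n'  >> urls/seed.txt

# Then inject the seeds from the Nutch runtime directory, e.g.:
#   bin/nutch inject urls
```

Note that this only sets the score at injection time; how scores change
afterwards depends on the scoring plugin you have configured.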


> I'm no Java wiz but it'll be a good exercise
> regardless if it works or not.
>

Agreed!
hth
