I got 2.3.1rc2 working. The updated mongo gora drivers helped.
I was getting "IllegalArgumentException: can't serialize" when trying to
run ./nutch fetch 1453178985-847903190

This works now in 2.3.1.

Thanks for all your help Lewis.


On Thu, Jan 14, 2016 at 2:54 AM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Lex,
>
> On Wed, Jan 13, 2016 at 2:49 PM, <[email protected]>
> wrote:
>
> > Thanks for the response Lewis.
> >
>
> np
>
>
> >
> > I'll give nutch 2.3.1 a spin later tonight.
> >
>
> Nice
>
>
> >
> > I didn't have success with batchId. I thought I could overwrite this in the
> > DB with 123 and then ./fetch 123 would get all urls marked with 123.
> >
>
> Well yes this is the case... however please consider that batches are
> generated based on the presence of a marker indicating that the URL is
> suitable to be fetched. In addition, the size of any given batch is
> determined by the default value Long.MAX_VALUE. You can restrict (reduce)
> this by passing in the -topN parameter to the generate command. Please see
> the command line arguments for further details. Scroll down to the bottom
> to see the CLI parameters for 2.X.
> http://wiki.apache.org/nutch/bin/nutch%20generate
>
>
>
> > I seem to be missing where the generate command stores its segments.
> >
>
> Ah... so this is where you are lacking some context. Nutch 2.X does not
> work off of the concept of segment(s). The entire persistence system is
> managed via a Gora datastore e.g. a database. This is to say that all of
> the data structures from Nutch 1.X e.g. crawldb, linkdb and segments are
> represented as an equivalent Gora datastore manifestation.
>
>
> >
> > For now I'm happy looking through the code for the first time.
> >
>
> This would be advised but Nutch is quite extensive so this may take some
> time.
>
>
> > I think I'll try building a generator or fetch job which can
> > prioritize/boost domains.
>
>
> This can be done within the InjectorJob as explained here
>
> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/crawl/InjectorJob.java#L51-L60
> If you have any questions about this then please ask.
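
For illustration, here is a minimal sketch of what the comment at that link describes: a seed file with one URL per line, optionally followed by tab-separated key=value metadata, where the reserved key nutch.score sets a custom score for that URL. The domain table, the score values and the helper names below are hypothetical and not part of Nutch. Note that this only sets the score at injection time, so it may not be enough on its own to boost URLs discovered later in the crawl.

```python
from urllib.parse import urlparse

# Hypothetical per-host boosts; hosts and values are illustrative only.
DOMAIN_SCORES = {"example.org": 10.0, "example.com": 2.0}


def seed_line(url, domain_scores, default_score=1.0):
    """Return one seed-file line: the URL, a tab, then nutch.score=<value>."""
    host = urlparse(url).hostname or ""
    # Exact host match only; subdomains fall back to the default score.
    score = domain_scores.get(host, default_score)
    return f"{url}\tnutch.score={score}"


def write_seed_file(urls, path, domain_scores=DOMAIN_SCORES):
    """Write one seed line per URL to the file at `path`."""
    with open(path, "w", encoding="utf-8") as fh:
        for url in urls:
            fh.write(seed_line(url, domain_scores) + "\n")
```

The resulting file would go in the seed directory passed to the inject command. Matching is on the exact host name, so www.example.org would need its own entry in the table.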
>
>
> > I'm no Java wiz but it'll be a good exercise
> > regardless if it works or not.
> >
> Agreed!
> hth
>
