I got 2.3.1rc2 working. The updated Mongo Gora drivers helped. I was getting "IllegalArgumentException: can't serialize" when trying to run ./nutch fetch 1453178985-847903190.
This works now in 2.3.1. Thanks for all your help Lewis.

On Thu, Jan 14, 2016 at 2:54 AM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Lex,
>
> On Wed, Jan 13, 2016 at 2:49 PM, <[email protected]> wrote:
>
> > Thanks for the response Lewis.
>
> np
>
> > I'll give Nutch 2.3.1 a spin later tonight.
>
> Nice
>
> > I didn't have success with batchId. I thought I could overwrite this in
> > the DB with 123 and then ./fetch 123 would get all urls marked with 123.
>
> Well yes this is the case... however please consider that batches are
> generated based on the presence of a marker indicating that the URL is
> suitable to be fetched. In addition, the size of any given batch is
> determined by the default value Long.MAX_VALUE. You can restrict (reduce)
> this by passing in the -topN parameter to the generate command. Please see
> the command line arguments for further details. Scroll down to the bottom
> to see the CLI parameters for 2.X.
> http://wiki.apache.org/nutch/bin/nutch%20generate
>
> > I seem to be missing where the generate command stores its segments.
>
> Ah... so this is where you are lacking some context. Nutch 2.X does not
> work off of the concept of segment(s). The entire persistence system is
> managed via a Gora datastore e.g. a database. This is to say that all of
> the data structures from Nutch 1.X e.g. crawldb, linkdb and segments are
> represented as an equivalent Gora datastore manifestation.
>
> > For now I'm happy looking through the code for the first time.
>
> This would be advised but Nutch is quite extensive so this may take some
> time.
>
> > I think I'll try building a generator or fetch job which can
> > prioritize/boost domains.
>
> This can be done within the InjectorJob as explained here
> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/crawl/InjectorJob.java#L51-L60
> If you have any questions about this then please ask.
>
> > I'm no Java wiz but it'll be a good exercise
> > regardless if it works or not.
>
> Agreed!
> hth

