Hi Lex,

On Wed, Jan 13, 2016 at 2:49 PM, <[email protected]> wrote:
> Thanks for the response Lewis.

np

> I'll give Nutch 2.3.1 a spin later tonight.

Nice

> I didn't have success with batchId. I thought I could overwrite this in the
> DB with 123 and then ./fetch 123 would get all urls marked with 123.

Well yes, this is the case. However, please consider that batches are generated based on the presence of a marker indicating that the URL is suitable to be fetched. In addition, the size of any given batch defaults to Long.MAX_VALUE. You can reduce this by passing the -topN parameter to the generate command. Please see the command line arguments for further details; scroll down to the bottom of the page to see the CLI parameters for 2.X.

http://wiki.apache.org/nutch/bin/nutch%20generate

> I seem to be missing where the generate command stores its segments.

Ah... so this is where you are lacking some context. Nutch 2.X does not work off of the concept of segments. The entire persistence layer is managed via a Gora datastore, e.g. a database. This is to say that all of the data structures from Nutch 1.X (crawldb, linkdb and segments) are represented by an equivalent in the Gora datastore.

> For now I'm happy looking through the code for the first time.

This is advisable, but Nutch is quite extensive so it may take some time.

> I think I'll try building a generator or fetch job which can
> prioritize/boost domains.

This can be done within the InjectorJob as explained here:

https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/crawl/InjectorJob.java#L51-L60

If you have any questions about this then please ask.

> I'm no Java wiz but it'll be a good exercise
> regardless if it works or not.

Agreed!

hth
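
P.S. In case it helps with the boosting idea, here is a rough sketch of how you could generate a seed file with per-host scores before injecting. It assumes the seed format described in the InjectorJob Javadoc linked above, as I recall it: one URL per line, optionally followed by tab-separated key=value metadata, with nutch.score as a reserved key. Please check that against the Javadoc for your version. The class name BoostedSeeds, the hosts and the boost values are all made up for illustration; none of this is Nutch API.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class BoostedSeeds {

    // Hypothetical per-host boosts; hosts and values are illustrative only.
    static final Map<String, Float> BOOSTS = new LinkedHashMap<>();
    static {
        BOOSTS.put("nutch.apache.org", 10.0f);
        BOOSTS.put("example.com", 2.0f);
    }

    // Builds one seed-file line: the URL, plus a tab-separated
    // nutch.score=<boost> entry when the URL's host has a boost.
    // ASSUMPTION: nutch.score is the reserved metadata key read by InjectorJob.
    static String seedLine(String url) {
        String host = URI.create(url).getHost();
        Float boost = BOOSTS.get(host);
        if (boost == null) {
            return url; // no boost configured: plain URL line
        }
        return url + "\tnutch.score=" + boost;
    }

    public static void main(String[] args) {
        // Print the lines; redirect stdout into the seed file you inject.
        for (String url : Arrays.asList(
                "http://nutch.apache.org/",
                "http://example.com/docs",
                "http://other.org/")) {
            System.out.println(seedLine(url));
        }
    }
}
```

You would then point the inject command at the directory containing the generated file. Doing the boost at inject time keeps the change outside of Nutch itself, which may be an easier first step than writing a custom generator or fetch job.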

