RE: Issues while crawling pagination

2018-07-28 Thread Yossi Tamari
Hi Shiva, Having looked at the specific site, I have to amend my recommended max-depth from 1 to 2, since I assume you want to fetch the stories themselves, not just the hubpages. If you want to crawl continuously, as Markus suggested, I still think you should keep the depth at 2, but define

RE: Issues while crawling pagination

2018-07-28 Thread Markus Jelsma
Hello, Yossi's suggestion is excellent if your use case is to crawl everything once and never again. However, if you need to crawl future articles as well, and have to deal with mutations, then let the crawler run continuously without regard for depth. The latter is the usual case, because after

RE: Issues while crawling pagination

2018-07-28 Thread Yossi Tamari
Hi Shiva, My suggestion would be to programmatically generate a seeds file containing these 497342 URLs (since you know them in advance), and then use a very low max-depth (probably 1), and a high number of iterations, since only a small number will be fetched in each iteration, unless you set
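
A minimal sketch of the seed-generation step described above, not Yossi's actual script. It assumes the paginated hub URLs follow the pattern of the single example URL posted in the thread (`latest-news-page<N>.html`) and that numbering starts at page 1; both are unverified assumptions. The names `URL_TEMPLATE`, `seed_urls`, and `write_seed_file` are hypothetical.

```python
from pathlib import Path

# ASSUMPTION: pattern inferred from the one example URL in the thread;
# the site may number its pages differently.
URL_TEMPLATE = "https://www.jagran.com/latest-news-page{page}.html"
# Page count quoted in the thread.
LAST_PAGE = 497342


def seed_urls(first_page=1, last_page=LAST_PAGE):
    """Yield one hub-page URL per page number, both ends inclusive."""
    for page in range(first_page, last_page + 1):
        yield URL_TEMPLATE.format(page=page)


def write_seed_file(path, first_page=1, last_page=LAST_PAGE):
    """Write the URLs to a plain-text seed file, one per line.

    Returns the number of URLs written.
    """
    path = Path(path)
    # Create the seed directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as fh:
        for url in seed_urls(first_page, last_page):
            fh.write(url + "\n")
            count += 1
    return count
```

Calling `write_seed_file("urls/seed.txt")` would write all 497342 URLs to a file in a seed directory, which could then be injected into the crawl database before running the low-depth, many-iteration crawl suggested above.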

Reg: Issues while crawling pagination

2018-07-28 Thread ShivaKarthik S
Hi, Can you help me figure out an issue with crawling a hub page that has pagination? The problem I am facing is what depth to set and how to handle the pagination. I have a hub page whose pagination runs to more than 4.95 lakh (about 495,000) pages, e.g. https://www.jagran.com/latest-news-page497342.html --

Re: using any23 with nutch

2018-07-28 Thread govind nitk
Tried 2.3-SNAPSHOT instead of 2.3 as : Error persists. On Sat, Jul 28, 2018 at 5:57 PM govind nitk wrote: > > hi all, > > I want to use any23 2.3-snapshot version with nutch. This is what I have > done: > 1. have "mvn install" in any23 repo. > so jars are released in local ~/.m2 dir. > >

using any23 with nutch

2018-07-28 Thread govind nitk
hi all, I want to use any23 2.3-snapshot version with nutch. This is what I have done: 1. have "mvn install" in any23 repo. so jars are released in local ~/.m2 dir. ex. /home/govind/.m2/repository/org/apache/any23/apache-any23-core/2.3-SNAPSHOT/apache-any23-core-2.3-SNAPSHOT.jar 2. nutch repo,

Re: [MASSMAIL][VOTE] Release Apache Nutch 1.15 RC#1

2018-07-28 Thread govind nitk
+1 for build plugins test - success On Thu, Jul 26, 2018 at 10:25 PM Roannel Fernández Hernández wrote: > +1 Great work, folks > > - Original message - > > From: "Sebastian Nagel" > > To: user@nutch.apache.org > > CC: d...@nutch.apache.org > > Sent: Thursday, July 26, 2018