Hi Shiva,
Having looked at the specific site, I have to amend my recommended max-depth
from 1 to 2, since I assume you want to fetch the stories themselves, not just
the hub pages.
If you want to crawl continuously, as Markus suggested, I still think you
should keep the depth at 2, but define
Hello,
Yossi's suggestion is excellent if your use case is to crawl everything once and
never again. However, if you also need to crawl future articles and have to
deal with pages that change, then let the crawler run continuously without
regard for depth.
The latter is the usual case, because after
Hi Shiva,
My suggestion would be to programmatically generate a seeds file containing
these 497342 URLs (since you know them in advance), and then use a very low
max-depth (probably 1), and a high number of iterations, since only a small
number will be fetched in each iteration, unless you set
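
A minimal sketch of generating such a seeds file, assuming the hub pages are numbered consecutively from 1 to 497342 and all follow the pattern of the example URL quoted in this thread (neither is confirmed here, so check against the live site). The names `URL_TEMPLATE`, `seed_urls`, `write_seeds` and the output path are illustrative only. The output is a plain text file with one URL per line.

```python
from pathlib import Path

# Assumption: every hub page follows the pattern of the example URL from the
# thread and the pages are numbered consecutively from 1 to 497342.
URL_TEMPLATE = "https://www.jagran.com/latest-news-page{page}.html"
LAST_PAGE = 497342


def seed_urls(first=1, last=LAST_PAGE):
    """Yield one hub-page URL per page number, both ends inclusive."""
    for page in range(first, last + 1):
        yield URL_TEMPLATE.format(page=page)


def write_seeds(path, first=1, last=LAST_PAGE):
    """Write the URLs one per line and return how many were written."""
    path = Path(path)
    # Create the seed directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for url in seed_urls(first, last):
            handle.write(url + "\n")
            count += 1
    return count


# Example (hypothetical path): write_seeds("urls/seed.txt")
```

Streaming the URLs through a generator keeps memory use flat, and the `first`/`last` arguments make it easy to write a small range first to check the pattern before producing the full file.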
Hi
Can you help me figure out an issue with crawling a hub page that has
pagination? The problem I am facing is what depth to set and how to
handle the pagination.
I have a hub page with more than 4.95 lakh (about 495,000) paginated pages,
e.g. https://www.jagran.com/latest-news-page497342.html
--
Tried 2.3-SNAPSHOT instead of 2.3 as :
Error persists.
On Sat, Jul 28, 2018 at 5:57 PM govind nitk wrote:
>
> hi all,
>
> I want to use any23 2.3-snapshot version with nutch. This is what I have
> done:
> 1. have "mvn install" in any23 repo.
> so jars are released in local ~/.m2 dir.
>
>
hi all,
I want to use any23 2.3-snapshot version with nutch. This is what I have
done:
1. Ran "mvn install" in the any23 repo,
so the jars are installed in the local ~/.m2 dir.
ex.
/home/govind/.m2/repository/org/apache/any23/apache-any23-core/2.3-SNAPSHOT/apache-any23-core-2.3-SNAPSHOT.jar
2. nutch repo,
+1 for build
plugins test - success
On Thu, Jul 26, 2018 at 10:25 PM Roannel Fernández Hernández
wrote:
> +1 Great work, folks
>
> - Original message -
> > From: "Sebastian Nagel"
> > To: user@nutch.apache.org
> > CC: d...@nutch.apache.org
> > Sent: Thursday, July 26, 2018