RE: Issues while crawling pagination

2018-07-28 Thread Yossi Tamari
Hi Shiva,

Having looked at the specific site, I have to amend my recommended max-depth 
from 1 to 2, since I assume you want to fetch the stories themselves, not just 
the hubpages.

If you want to crawl continuously, as Markus suggested, I still think you 
should keep the depth at 2, but define the first hubpage(s) to have a very high 
priority and very short recrawl delay. This is because stories are always added 
on the first page, and then get pushed back. I suspect that if you don't limit 
depth, and especially if you don't limit yourself to the domain, you will find 
yourself crawling the whole internet eventually. If you do limit to the domain, 
that won't be a problem, but unless you give special treatment to the first 
page(s), you will be continuously recrawling hundreds of thousands of static 
pages.
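
A minimal sketch of how the first hubpage(s) could get special treatment through the seeds file. It assumes the Nutch 1.x injector accepts tab-separated per-URL metadata such as nutch.score and nutch.fetchInterval (in seconds). The number of hubpages (3), the score (10.0), the interval (900 seconds), and the assumption that pages 1 to 3 follow the same URL pattern as the page497342 example are all illustrative and not taken from the site.

```python
# Sketch: build seed-file lines that give the first hub pages a higher score
# and a shorter fetch interval than the crawl-wide defaults.
# Assumptions (not from the thread): 3 hub pages, score 10.0, interval 900 s,
# and a URL pattern for pages 1..3 matching the page497342 example.

HUB_URL = "https://www.jagran.com/latest-news-page{}.html"


def seed_line(url, score=None, fetch_interval=None):
    """Return one tab-separated seed line: the URL plus optional metadata."""
    fields = [url]
    if score is not None:
        fields.append(f"nutch.score={score}")
    if fetch_interval is not None:
        fields.append(f"nutch.fetchInterval={fetch_interval}")
    return "\t".join(fields)


def hub_seed_lines(hub_pages=3, score=10.0, fetch_interval=900):
    """Seed lines for the first `hub_pages` pages of the hub."""
    return [
        seed_line(HUB_URL.format(n), score, fetch_interval)
        for n in range(1, hub_pages + 1)
    ]


if __name__ == "__main__":
    print("\n".join(hub_seed_lines()))
```

Whether a higher score and a shorter interval are enough to keep the first pages fresh depends on the configured fetch schedule, so treat the values as a starting point to tune.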

Yossi.

> -Original Message-
> From: Markus Jelsma 
> Sent: 29 July 2018 00:53
> To: user@nutch.apache.org
> Subject: RE: Issues while crawling pagination
> 
> Hello,
> 
> Yossi's suggestion is excellent if your use case is to crawl everything once
> and never again. However, if you need to crawl future articles as well, and
> have to deal with mutations, then let the crawler run continuously without
> regard for depth.
>
> The latter is the usual case because, after all, if you had gotten this task
> a few months ago, you wouldn't need to go to a depth of 497342, right?
> 
> Regards,
> Markus
> 
> 
> 
> 
> -Original message-
> > From:Yossi Tamari 
> > Sent: Saturday 28th July 2018 23:09
> > To: user@nutch.apache.org; shivakarthik...@gmail.com;
> > nu...@lucene.apache.org
> > Subject: RE: Issues while crawling pagination
> >
> > Hi Shiva,
> >
> > My suggestion would be to programmatically generate a seeds file containing
> > these 497342 URLs (since you know them in advance), and then use a very low
> > max-depth (probably 1), and a high number of iterations, since only a small
> > number will be fetched in each iteration, unless you set a very low
> > crawl-delay.
> > (Mathematically, if you fetch 1 URL per second from this domain, fetching
> > 497342 URLs will take 138 hours).
> >
> > Yossi.
> >
> > > -Original Message-
> > > From: ShivaKarthik S 
> > > Sent: 28 July 2018 23:20
> > > To: nu...@lucene.apache.org; user@nutch.apache.org
> > > Subject: Reg: Issues while crawling pagination
> > >
> > >  Hi
> > >
> > > > Can you help me figure out an issue with crawling a hub page that
> > > > has pagination? The problem I am facing is what depth to set
> > > > and how to handle pagination.
> > > > I have a hubpage with more than 4.95 lakh (about 495,000) pages,
> > > > e.g. https://www.jagran.com/latest-news-page497342.html (497342 is
> > > > the number of pages under the hubpage latest-news).
> > >
> > >
> > > --
> > > Thanks and Regards
> > > Shiva
> >
> >



RE: Issues while crawling pagination

2018-07-28 Thread Markus Jelsma
Hello,

Yossi's suggestion is excellent if your use case is to crawl everything once
and never again. However, if you need to crawl future articles as well, and
have to deal with mutations, then let the crawler run continuously without
regard for depth.

The latter is the usual case because, after all, if you had gotten this task a
few months ago, you wouldn't need to go to a depth of 497342, right?

Regards,
Markus


 
 
-Original message-
> From:Yossi Tamari 
> Sent: Saturday 28th July 2018 23:09
> To: user@nutch.apache.org; shivakarthik...@gmail.com; nu...@lucene.apache.org
> Subject: RE: Issues while crawling pagination
> 
> Hi Shiva,
> 
> My suggestion would be to programmatically generate a seeds file containing 
> these 497342 URLs (since you know them in advance), and then use a very low 
> max-depth (probably 1), and a high number of iterations, since only a small 
> number will be fetched in each iteration, unless you set a very low 
> crawl-delay.
> (Mathematically, if you fetch 1 URL per second from this domain, fetching 
> 497342 URLs will take 138 hours).
> 
>   Yossi.
> 
> > -Original Message-
> > From: ShivaKarthik S 
> > Sent: 28 July 2018 23:20
> > To: nu...@lucene.apache.org; user@nutch.apache.org
> > Subject: Reg: Issues while crawling pagination
> > 
> >  Hi
> > 
> > Can you help me figure out an issue with crawling a hub page that has
> > pagination? The problem I am facing is what depth to set and how to handle
> > pagination.
> > I have a hubpage with more than 4.95 lakh (about 495,000) pages,
> > e.g. https://www.jagran.com/latest-news-page497342.html (497342 is
> > the number of pages under the hubpage latest-news).
> > 
> > 
> > --
> > Thanks and Regards
> > Shiva
> 
> 


RE: Issues while crawling pagination

2018-07-28 Thread Yossi Tamari
Hi Shiva,

My suggestion would be to programmatically generate a seeds file containing 
these 497342 URLs (since you know them in advance), and then use a very low 
max-depth (probably 1), and a high number of iterations, since only a small 
number will be fetched in each iteration, unless you set a very low crawl-delay.
(For scale: if you fetch 1 URL per second from this domain, fetching 
497342 URLs will take about 138 hours.)
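
A minimal sketch of the seeds-file generation and the time estimate, assuming every page from 1 to 497342 follows the same URL pattern as the page497342 example in the original question (the first page may well use a different URL). The function names are illustrative only.

```python
import sys

# Assumption: pages 1..497342 all follow the pattern of the example URL.
URL_TEMPLATE = "https://www.jagran.com/latest-news-page{}.html"
LAST_PAGE = 497342


def seed_urls(last_page=LAST_PAGE, first_page=1):
    """Yield one hub-page URL per page number, in ascending order."""
    for n in range(first_page, last_page + 1):
        yield URL_TEMPLATE.format(n)


def write_seeds(out, last_page=LAST_PAGE):
    """Write one URL per line to the open text stream `out`; return the count."""
    count = 0
    for url in seed_urls(last_page):
        out.write(url + "\n")
        count += 1
    return count


def fetch_hours(url_count, urls_per_second=1.0):
    """Hours needed to fetch `url_count` URLs at a constant rate."""
    return url_count / urls_per_second / 3600.0


if __name__ == "__main__":
    written = write_seeds(sys.stdout)
    print(f"{written} URLs, about {fetch_hours(written):.0f} hours at 1 URL/s",
          file=sys.stderr)
```

Redirecting the script's standard output to a file inside the seeds directory gives the seed list. The estimate is simply 497342 / 3600, which is roughly 138 hours.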

Yossi.

> -Original Message-
> From: ShivaKarthik S 
> Sent: 28 July 2018 23:20
> To: nu...@lucene.apache.org; user@nutch.apache.org
> Subject: Reg: Issues while crawling pagination
> 
>  Hi
> 
> Can you help me figure out an issue with crawling a hub page that has
> pagination? The problem I am facing is what depth to set and how to handle
> pagination.
> I have a hubpage with more than 4.95 lakh (about 495,000) pages,
> e.g. https://www.jagran.com/latest-news-page497342.html (497342 is
> the number of pages under the hubpage latest-news).
> 
> 
> --
> Thanks and Regards
> Shiva



Reg: Issues while crawling pagination

2018-07-28 Thread ShivaKarthik S
 Hi

Can you help me figure out an issue with crawling a hub page that has
pagination? The problem I am facing is what depth to set and how to
handle pagination.
I have a hubpage with more than 4.95 lakh (about 495,000) pages,
e.g. https://www.jagran.com/latest-news-page497342.html (497342 is
the number of pages under the hubpage latest-news).


-- 
Thanks and Regards
Shiva


Re: using any23 with nutch

2018-07-28 Thread govind nitk
Tried 2.3-SNAPSHOT instead of 2.3 as :



Error persists.


On Sat, Jul 28, 2018 at 5:57 PM govind nitk  wrote:

>
> hi all,
>
> I want to use the any23 2.3-SNAPSHOT version with Nutch. This is what I have
> done:
> 1. Ran "mvn install" in the any23 repo,
> so the jars are installed in the local ~/.m2 dir.
>
> ex. 
> /home/govind/.m2/repository/org/apache/any23/apache-any23-core/2.3-SNAPSHOT/apache-any23-core-2.3-SNAPSHOT.jar
>
> 2. nutch repo, plugins/any23/ivy.xml
>
>  conf="*->default">
>
> 3. In nutch repo, have changed ivy setting as below:
> 
>  
> value="${user.home}/.m2/repository/[organisation]/[module]/[revision]/[module]-[revision](-[classifier]).[ext]"
>   override="false" />
>
> 
>   
>  
>  
>   
> 
>
>  >
>
>
> So the expectation is that any23 will start using the my_local releases.
>
> But it's failing with the error below:
>
> resolve-default:
> [ivy:resolve] :: loading settings :: file =
> /home/govind/apache/nutch/ivy/ivysettings.xml
> [ivy:resolve]
> [ivy:resolve] :: problems summary ::
> [ivy:resolve]  WARNINGS
> [ivy:resolve] module not found: org.apache.any23#apache-any23;2.3
> [ivy:resolve]  local-maven2: tried
> [ivy:resolve]
> /home/govind/.m2/repository/org/apache/any23/apache-any23/2.3/apache-any23-2.3.xml
> [ivy:resolve]   -- artifact
> org.apache.any23#apache-any23;2.3!apache-any23.jar:
> [ivy:resolve]
> /home/govind/.m2/repository/org/apache/any23/apache-any23/2.3/apache-any23-2.3.jar
> [ivy:resolve] ::
> [ivy:resolve] ::  UNRESOLVED DEPENDENCIES ::
> [ivy:resolve] ::
> [ivy:resolve] :: org.apache.any23#apache-any23;2.3: not found
> [ivy:resolve] ::
> [ivy:resolve]
> [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
> Target 'resolve-default' failed with message 'impossible to resolve
> dependencies:
>
>
> Am I missing something in the local resolver defined for any23?
> Is it the case that we cannot use locally installed jars in Nutch?
> Is there any other hack I can use to get this resolved?
>
> Regards,
> Govind
>


using any23 with nutch

2018-07-28 Thread govind nitk
hi all,

I want to use the any23 2.3-SNAPSHOT version with Nutch. This is what I have
done:
1. Ran "mvn install" in the any23 repo,
so the jars are installed in the local ~/.m2 dir.
ex. 
/home/govind/.m2/repository/org/apache/any23/apache-any23-core/2.3-SNAPSHOT/apache-any23-core-2.3-SNAPSHOT.jar

2. nutch repo, plugins/any23/ivy.xml



3. In nutch repo, have changed ivy setting as below:



  
 
 
  





So the expectation is that any23 will start using the my_local releases.

But it's failing with the error below:

resolve-default:
[ivy:resolve] :: loading settings :: file =
/home/govind/apache/nutch/ivy/ivysettings.xml
[ivy:resolve]
[ivy:resolve] :: problems summary ::
[ivy:resolve]  WARNINGS
[ivy:resolve] module not found: org.apache.any23#apache-any23;2.3
[ivy:resolve]  local-maven2: tried
[ivy:resolve]
/home/govind/.m2/repository/org/apache/any23/apache-any23/2.3/apache-any23-2.3.xml
[ivy:resolve]   -- artifact
org.apache.any23#apache-any23;2.3!apache-any23.jar:
[ivy:resolve]
/home/govind/.m2/repository/org/apache/any23/apache-any23/2.3/apache-any23-2.3.jar
[ivy:resolve] ::
[ivy:resolve] ::  UNRESOLVED DEPENDENCIES ::
[ivy:resolve] ::
[ivy:resolve] :: org.apache.any23#apache-any23;2.3: not found
[ivy:resolve] ::
[ivy:resolve]
[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Target 'resolve-default' failed with message 'impossible to resolve
dependencies:


Am I missing something in the local resolver defined for any23?
Is it the case that we cannot use locally installed jars in Nutch?
Is there any other hack I can use to get this resolved?
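
One way to narrow this down is to expand the resolver's layout pattern by hand and compare the result with what "mvn install" actually produced. The sketch below is a reading of the log rather than a confirmed diagnosis. It uses the pattern quoted in the follow-up message and the module names and revisions shown in the log and in step 1. Mapping the dots in the organisation to slashes mirrors the paths in the log. The helper name expand is hypothetical.

```python
import re

# Layout pattern as quoted from the ivysettings.xml fragment in the follow-up.
PATTERN = "[organisation]/[module]/[revision]/[module]-[revision](-[classifier]).[ext]"
M2_ROOT = "/home/govind/.m2/repository"


def expand(organisation, module, revision, ext="jar", classifier=None, root=M2_ROOT):
    """Expand PATTERN the way the paths in the log appear to be built."""
    tokens = {
        # The log shows org.apache.any23 resolved as org/apache/any23.
        "organisation": organisation.replace(".", "/"),
        "module": module,
        "revision": revision,
        "ext": ext,
    }
    if classifier:
        tokens["classifier"] = classifier

    def optional(match):
        # Keep a "(...)" group only if every token inside it has a value.
        inner = match.group(1)
        names = re.findall(r"\[(\w+)\]", inner)
        return inner if all(name in tokens for name in names) else ""

    path = re.sub(r"\(([^)]*)\)", optional, PATTERN)
    path = re.sub(r"\[(\w+)\]", lambda m: tokens[m.group(1)], path)
    return f"{root}/{path}"


if __name__ == "__main__":
    # What the resolver tried, according to the log.
    print(expand("org.apache.any23", "apache-any23", "2.3"))
    # What "mvn install" produced, according to step 1.
    print(expand("org.apache.any23", "apache-any23-core", "2.3-SNAPSHOT"))
```

The two printed paths differ in both module name (apache-any23 versus apache-any23-core) and revision (2.3 versus 2.3-SNAPSHOT), so the name and rev of the dependency in the plugin's ivy.xml may be worth checking against the installed artifact.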

Regards,
Govind


Re: [MASSMAIL][VOTE] Release Apache Nutch 1.15 RC#1

2018-07-28 Thread govind nitk
+1 for build
plugins test - success






On Thu, Jul 26, 2018 at 10:25 PM Roannel Fernández Hernández 
wrote:

> +1 Great work, folks
>
> - Original message -
> > From: "Sebastian Nagel" 
> > To: user@nutch.apache.org
> > CC: d...@nutch.apache.org
> > Sent: Thursday, 26 July 2018 11:05:06
> > Subject: [MASSMAIL][VOTE] Release Apache Nutch 1.15 RC#1
> >
> > Hi Folks,
> >
> > A first candidate for the Nutch 1.15 release is available at:
> >
> >   https://dist.apache.org/repos/dist/dev/nutch/1.15/
> >
> > The release candidate is a zip and tar.gz archive of the binary and sources
> > in:
> >   https://github.com/apache/nutch/tree/release-1.15
> >
> > The SHA1 checksum of the archive apache-nutch-1.15-bin.tar.gz is
> >   555d00ddc0371b05c5958bde7abb2a9db8c38ee2
> >
> > In addition, a staged maven repository is available here:
> >   https://repository.apache.org/content/repositories/orgapachenutch-1015/
> >
> > We addressed 119 Issues:
> >   https://s.apache.org/nczS
> >
> > Please vote on releasing this package as Apache Nutch 1.15.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Nutch PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Nutch 1.15.
> > [ ] -1 Do not release this package because…
> >
> > Cheers,
> > Sebastian
> > (On behalf of the Nutch PMC)
> >
> > P.S. Here is my +1.
> >
>