RE: Issues while crawling pagination
Hi Shiva,

Having looked at the specific site, I have to amend my recommended max-depth from 1 to 2, since I assume you want to fetch the stories themselves, not just the hubpages.

If you want to crawl continuously, as Markus suggested, I still think you should keep the depth at 2, but define the first hubpage(s) to have a very high priority and a very short recrawl delay. This is because stories are always added on the first page, and then get pushed back.

I suspect that if you don't limit depth, and especially if you don't limit yourself to the domain, you will find yourself crawling the whole internet eventually. If you do limit to the domain, that won't be a problem, but unless you give special treatment to the first page(s), you will be continuously recrawling hundreds of thousands of static pages.

Yossi.

> -----Original Message-----
> From: Markus Jelsma
> Sent: 29 July 2018 00:53
> To: user@nutch.apache.org
> Subject: RE: Issues while crawling pagination
>
> Hello,
>
> Yossi's suggestion is excellent if your case is to crawl everything once and
> never again. However, if you need to crawl future articles as well, and have
> to deal with mutations, then let the crawler run continuously without regard
> for depth.
>
> The latter is the usual case, because after all, if you got this task a few
> months ago you wouldn't need to go to a depth of 497342, right?
>
> Regards,
> Markus
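One possible way to express the "very high priority and very short recrawl delay" for the first hub page is per-URL metadata in the seeds file. The sketch below assumes the Nutch 1.x injector reads tab-separated `nutch.score` and `nutch.fetchInterval` entries after the URL, and that the first hub page follows the same URL pattern as the example in this thread. The score of 10.0 and the one-hour interval are illustrative values only, not recommendations from the thread.

```python
# Sketch: build seed-file lines that carry per-URL injector metadata.
# Assumption: the injector accepts "url<TAB>key=value<TAB>key=value" lines
# with the keys nutch.score and nutch.fetchInterval (interval in seconds).

def seed_line(url, score=None, fetch_interval_seconds=None):
    """Return one seeds-file line, adding metadata only when it is given."""
    fields = [url]
    if score is not None:
        fields.append(f"nutch.score={score}")
    if fetch_interval_seconds is not None:
        fields.append(f"nutch.fetchInterval={fetch_interval_seconds}")
    return "\t".join(fields)


# Hypothetical first hub page; the real URL of page 1 may differ.
FIRST_HUB_PAGE = "https://www.jagran.com/latest-news-page1.html"

# Illustrative values: a high score and a one-hour recrawl interval.
priority_seed = seed_line(FIRST_HUB_PAGE, score=10.0, fetch_interval_seconds=3600)
```

Writing `priority_seed` as a line of the seeds file would, under the assumptions above, mark the first hub page for special treatment while every other page keeps the default score and fetch interval.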
RE: Issues while crawling pagination
Hello,

Yossi's suggestion is excellent if your case is to crawl everything once and never again. However, if you need to crawl future articles as well, and have to deal with mutations, then let the crawler run continuously without regard for depth.

The latter is the usual case, because after all, if you got this task a few months ago you wouldn't need to go to a depth of 497342, right?

Regards,
Markus

-----Original message-----
> From: Yossi Tamari
> Sent: Saturday 28th July 2018 23:09
> To: user@nutch.apache.org; shivakarthik...@gmail.com; nu...@lucene.apache.org
> Subject: RE: Issues while crawling pagination
>
> Hi Shiva,
>
> My suggestion would be to programmatically generate a seeds file containing
> these 497342 URLs (since you know them in advance), and then use a very low
> max-depth (probably 1), and a high number of iterations, since only a small
> number will be fetched in each iteration, unless you set a very low
> crawl-delay.
> (Mathematically, if you fetch 1 URL per second from this domain, fetching
> 497342 URLs will take 138 hours).
>
> Yossi.
RE: Issues while crawling pagination
Hi Shiva,

My suggestion would be to programmatically generate a seeds file containing these 497342 URLs (since you know them in advance), and then use a very low max-depth (probably 1), and a high number of iterations, since only a small number will be fetched in each iteration, unless you set a very low crawl-delay.

(Mathematically, if you fetch 1 URL per second from this domain, fetching 497342 URLs will take 138 hours).

Yossi.

> -----Original Message-----
> From: ShivaKarthik S
> Sent: 28 July 2018 23:20
> To: nu...@lucene.apache.org; user@nutch.apache.org
> Subject: Reg: Issues while crawling pagination
>
> Hi
>
> Can you help me in figuring out the issue while crawling a hub page having
> pagination. Problem what i am facing is what depth to give and how to handle
> pagination.
> I have a hubpage which has a pagination of more than 4.95L.
> e.g. https://www.jagran.com/latest-news-page497342.html
> (497342 is the number of pages under the hubpage latest-news)
>
> --
> Thanks and Regards
> Shiva
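The seeds-file suggestion and the 138-hour estimate above can be sketched in a few lines of Python. This is a minimal sketch, assuming every hub page follows the pattern of the example URL in the original question (`latest-news-page<N>.html`, with N running from 1 to 497342); the function names and the output file name are made up for illustration.

```python
# Sketch: generate the hub-page URLs for a seeds file and estimate fetch time.
# Assumption: pages 1..497342 all follow the URL pattern of the example link.

URL_TEMPLATE = "https://www.jagran.com/latest-news-page{page}.html"
LAST_PAGE = 497342  # page count quoted in the thread


def hub_page_urls(last_page, first_page=1):
    """Yield one URL per hub page, from first_page to last_page inclusive."""
    for page in range(first_page, last_page + 1):
        yield URL_TEMPLATE.format(page=page)


def write_seeds(path, last_page):
    """Write one URL per line, the layout a seeds file is expected to have."""
    with open(path, "w", encoding="utf-8") as seeds:
        for url in hub_page_urls(last_page):
            seeds.write(url + "\n")


def hours_to_fetch(url_count, seconds_per_url=1.0):
    """Total fetch time in hours at a fixed delay between requests."""
    return url_count * seconds_per_url / 3600.0
```

Calling `write_seeds("seeds.txt", LAST_PAGE)` would produce the full list, and `hours_to_fetch(LAST_PAGE)` comes to a little over 138 hours at one URL per second, in line with the figure given in the message.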
Reg: Issues while crawling pagination
Hi,

Can you help me figure out an issue with crawling a hub page that has pagination? The problem I am facing is what depth to set and how to handle the pagination.

I have a hub page with more than 4.95 lakh (about 495,000) paginated pages, e.g. https://www.jagran.com/latest-news-page497342.html, where 497342 is the number of pages under the hub page latest-news.

--
Thanks and Regards
Shiva
Re: using any23 with nutch
Tried 2.3-SNAPSHOT instead of 2.3 as :

Error persists.

On Sat, Jul 28, 2018 at 5:57 PM govind nitk wrote:

> hi all,
>
> I want to use any23 2.3-snapshot version with nutch. This is what I have
> done:
>
> 1. have "mvn install" in any23 repo.
>    so jars are released in local ~/.m2 dir.
>    ex.
>    /home/govind/.m2/repository/org/apache/any23/apache-any23-core/2.3-SNAPSHOT/apache-any23-core-2.3-SNAPSHOT.jar
>
> 2. nutch repo, plugins/any23/ivy.xml
>    conf="*->default">
>
> 3. In nutch repo, have changed ivy setting as below:
>    value="${user.home}/.m2/repository/[organisation]/[module]/[revision]/[module]-[revision](-[classifier]).[ext]"
>    override="false" />
>
> So, expectation is any23 will start using my_local releases.
>
> But its failing with below error:
>
> resolve-default:
> [ivy:resolve] :: loading settings :: file = /home/govind/apache/nutch/ivy/ivysettings.xml
> [ivy:resolve]
> [ivy:resolve] :: problems summary ::
> [ivy:resolve] WARNINGS
> [ivy:resolve] module not found: org.apache.any23#apache-any23;2.3
> [ivy:resolve] local-maven2: tried
> [ivy:resolve] /home/govind/.m2/repository/org/apache/any23/apache-any23/2.3/apache-any23-2.3.xml
> [ivy:resolve] -- artifact org.apache.any23#apache-any23;2.3!apache-any23.jar:
> [ivy:resolve] /home/govind/.m2/repository/org/apache/any23/apache-any23/2.3/apache-any23-2.3.jar
> [ivy:resolve] ::
> [ivy:resolve] :: UNRESOLVED DEPENDENCIES ::
> [ivy:resolve] ::
> [ivy:resolve] :: org.apache.any23#apache-any23;2.3: not found
> [ivy:resolve] ::
> [ivy:resolve]
> [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
> Target 'resolve-default' failed with message 'impossible to resolve
> dependencies:
>
> Am I missing something in local resolver defined for any23 ?
> Is it the case, that we can not use the locally released jars in nutch ?
> Is there any other hack I can use to this resolved ?
>
> Regards,
> Govind
using any23 with nutch
hi all,

I want to use any23 2.3-snapshot version with nutch. This is what I have done:

1. have "mvn install" in any23 repo.
   so jars are released in local ~/.m2 dir.
   ex.
   /home/govind/.m2/repository/org/apache/any23/apache-any23-core/2.3-SNAPSHOT/apache-any23-core-2.3-SNAPSHOT.jar

2. nutch repo, plugins/any23/ivy.xml

3. In nutch repo, have changed ivy setting as below:

So, expectation is any23 will start using my_local releases.

But its failing with below error:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/govind/apache/nutch/ivy/ivysettings.xml
[ivy:resolve]
[ivy:resolve] :: problems summary ::
[ivy:resolve] WARNINGS
[ivy:resolve] module not found: org.apache.any23#apache-any23;2.3
[ivy:resolve] local-maven2: tried
[ivy:resolve] /home/govind/.m2/repository/org/apache/any23/apache-any23/2.3/apache-any23-2.3.xml
[ivy:resolve] -- artifact org.apache.any23#apache-any23;2.3!apache-any23.jar:
[ivy:resolve] /home/govind/.m2/repository/org/apache/any23/apache-any23/2.3/apache-any23-2.3.jar
[ivy:resolve] ::
[ivy:resolve] :: UNRESOLVED DEPENDENCIES ::
[ivy:resolve] ::
[ivy:resolve] :: org.apache.any23#apache-any23;2.3: not found
[ivy:resolve] ::
[ivy:resolve]
[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Target 'resolve-default' failed with message 'impossible to resolve
dependencies:

Am I missing something in local resolver defined for any23 ?
Is it the case, that we can not use the locally released jars in nutch ?
Is there any other hack I can use to this resolved ?

Regards,
Govind
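To compare what the resolver looks for with what "mvn install" produced, the path pattern quoted in this thread can be expanded by hand. The Python sketch below is only an illustration of that substitution, not Ivy's actual implementation: the function name is made up, `${user.home}` is replaced with `/home/govind` as in the log, and dots in the organisation are assumed to become directory separators because the paths in the log show that layout.

```python
import re

# Path pattern as quoted in the thread.
PATTERN = ("${user.home}/.m2/repository/[organisation]/[module]/[revision]/"
           "[module]-[revision](-[classifier]).[ext]")


def expand_pattern(pattern, organisation, module, revision, ext,
                   classifier=None, user_home="/home/govind"):
    """Fill the pattern tokens; a hand-written approximation for illustration."""
    path = pattern.replace("${user.home}", user_home)

    # A parenthesised part is optional: drop it when its token has no value.
    def optional_part(match):
        inner = match.group(1)
        if "[classifier]" in inner and classifier is None:
            return ""
        return inner

    path = re.sub(r"\(([^()]*)\)", optional_part, path)

    tokens = {
        # Assumption: Maven-style layout, so dots become directory separators.
        "organisation": organisation.replace(".", "/"),
        "module": module,
        "revision": revision,
        "ext": ext,
    }
    if classifier is not None:
        tokens["classifier"] = classifier
    for name, value in tokens.items():
        path = path.replace(f"[{name}]", value)
    return path


# What the failing resolve asked for, according to the log.
requested = expand_pattern(PATTERN, "org.apache.any23", "apache-any23", "2.3", "jar")
# What the message says "mvn install" placed in ~/.m2.
installed = expand_pattern(PATTERN, "org.apache.any23", "apache-any23-core",
                           "2.3-SNAPSHOT", "jar")
```

Under these assumptions `requested` matches the jar path the log reports as tried, while `installed` matches the jar path listed in step 1; the two differ in both module name and revision.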
Re: [MASSMAIL][VOTE] Release Apache Nutch 1.15 RC#1
+1 for build plugins test - success

On Thu, Jul 26, 2018 at 10:25 PM Roannel Fernández Hernández wrote:

> +1 Great work, folks
>
> ----- Original message -----
> > From: "Sebastian Nagel"
> > To: user@nutch.apache.org
> > CC: d...@nutch.apache.org
> > Sent: Thursday, 26 July 2018 11:05:06
> > Subject: [MASSMAIL][VOTE] Release Apache Nutch 1.15 RC#1
> >
> > Hi Folks,
> >
> > A first candidate for the Nutch 1.15 release is available at:
> >
> > https://dist.apache.org/repos/dist/dev/nutch/1.15/
> >
> > The release candidate is a zip and tar.gz archive of the binary and
> > sources in:
> > https://github.com/apache/nutch/tree/release-1.15
> >
> > The SHA1 checksum of the archive apache-nutch-1.15-bin.tar.gz is
> > 555d00ddc0371b05c5958bde7abb2a9db8c38ee2
> >
> > In addition, a staged maven repository is available here:
> > https://repository.apache.org/content/repositories/orgapachenutch-1015/
> >
> > We addressed 119 Issues:
> > https://s.apache.org/nczS
> >
> > Please vote on releasing this package as Apache Nutch 1.15.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Nutch PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Nutch 1.15.
> > [ ] -1 Do not release this package because…
> >
> > Cheers,
> > Sebastian
> > (On behalf of the Nutch PMC)
> >
> > P.S. Here is my +1.