Hi Eyeris,

For topN, check out the sizeFetchlist param in bin/crawl (line 62).
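To illustrate the idea, here is a rough sketch of how that part of the script behaves (the variable name sizeFetchlist is from bin/crawl, but the values and the simplified logic below are illustrative, not the script's actual defaults):

```shell
# Sketch: in bin/crawl the generate step is capped per round by
# sizeFetchlist, so lowering it acts like the old -topN option.
numSlaves=1
sizeFetchlist=$((numSlaves * 1000))   # e.g. cap each round at 1000 URLs

# The script then passes the cap to the generator, roughly:
#   bin/nutch generate crawl/crawldb crawl/segments -topN "$sizeFetchlist"
echo "per-round fetchlist cap: $sizeFetchlist"
```

So editing that one line in your copy of bin/crawl should give you back per-round limiting.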
On Fri, Dec 12, 2014 at 9:18 AM, Eyeris Rodríguez Rueda <[email protected]> wrote:

> Thanks Sebastian.
> About the second question, maybe I didn't explain it well. In the Nutch 1.9
> script the last parameter is the number of rounds, which I think is
> equivalent to the depth parameter in Nutch 1.5.1. But I can't find the topN
> parameter in Nutch 1.9. It is very useful for making a limited crawl
> process, because otherwise every round has no limit.
>
> About the third question, you are right: dragones.uci.cu is only available
> inside the university, but you can try with any website that uses https and
> check. I have an idea about what the problem might be, but I don't know how
> to solve it.
> When I try to access dragones.uci.cu with Firefox, I need to add an
> exception because the certificate has a problem, and Nutch doesn't know how
> to handle this error. It would be great if I could configure an option to
> trust websites with such errors. I was looking at the httpclient plugin
> code, but I couldn't find the code that handles this problem. Please could
> you help me or give me some advice?
>
> ----- Original Message -----
> From: "Sebastian Nagel" <[email protected]>
> To: [email protected]
> Sent: Thursday, December 11, 2014 6:07:05 PM
> Subject: Re: questions about nutch 1.9
>
> Hi Eyeris,
>
> > 1- How can I do a crawl process with a Solr parameter like in Nutch
> > 1.5.1, so that the spider skips this step if I don't set the Solr
> > parameter?
>
> Yes, that's possible in recent trunk of 1.x, see NUTCH-1832
> (in doubt, it should be possible to update/replace only bin/crawl):
> just pass an empty Solr URL.
>
> > 2- Is it possible to use topN or a similar parameter in Nutch 1.9, or
> > does every round include all links in the crawldb?
>
> On this point, I don't know about any differences between 1.9 and 1.5.1.
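A side note on the empty-Solr-URL behavior from NUTCH-1832: in effect the patched script just checks whether a Solr URL was supplied before running the indexing step. A simplified sketch of that check (not the actual bin/crawl code):

```shell
# Sketch: with NUTCH-1832 applied, an empty Solr URL means the
# indexing step is skipped rather than failing.
SOLRURL=""
if [ -z "$SOLRURL" ]; then
  echo "Skipping indexing (no Solr URL given)"
else
  echo "Indexing to $SOLRURL"
fi
```

With the patched script, an invocation like `bin/crawl urls/ crawl/ "" 2` (seed dir, crawl dir, empty Solr URL, two rounds) should crawl without indexing.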
> > 3- I have activated the httpclient plugin, and when I crawl a website
> > that uses the https protocol I get this error in the output console:
>
> Sorry, I remember you asked this question a month ago, and I didn't find
> the time to continue the thread. Can you try without httpclient:
> - use 1.9 or trunk
> - and remove protocol-httpclient from plugin.includes
>
> I'm not able to test/reproduce the problem because I cannot resolve
> the host dragones.uci.cu. Is this host only reachable within the
> university network?
>
> Best,
> Sebastian
>
> On 12/11/2014 10:11 PM, Eyeris Rodríguez Rueda wrote:
> > Please, any help?
> >
> > Hello.
> > I want to use Nutch 1.9, but there are some things I don't understand,
> > because I was using Nutch 1.5.1 before and some things have changed in
> > Nutch 1.9.
> > Sorry if these are basic things.
> > Some questions:
> >
> > 1- How can I do a crawl process with a Solr parameter like in Nutch
> > 1.5.1, so that the spider skips this step if I don't set the Solr
> > parameter?
> >
> > 2- Is it possible to use topN or a similar parameter in Nutch 1.9, or
> > does every round include all links in the crawldb?
> >
> > 3- I have activated the httpclient plugin, and when I crawl a website
> > that uses the https protocol I get this error in the output console:
> > *********************************
> > fetch of https://dragones.uci.cu/ failed with:
> > javax.net.ssl.SSLHandshakeException:
> > sun.security.validator.ValidatorException: PKIX path building failed:
> > sun.security.provider.certpath.SunCertPathBuilderException: unable to
> > find valid certification path to requested target
> >
> > The parsechecker tool throws a similar error.
> >
> > Any suggestion or advice will be appreciated.
> >
> > ---------------------------------------------------
> > XII Anniversary of the creation of the Universidad de las Ciencias
> > Informáticas. 12 years of history alongside Fidel. December 12, 2014.
--
Jonathan Cooper-Ellis
*Data Engineer*
myVBO, LLC dba Ziftr
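P.S. On the certificate question in the quoted thread: I don't know of a ready-made Nutch option for trusting sites with broken certificates, but the underlying JSSE mechanism is a custom TrustManager. A minimal sketch, under the assumption you are willing to patch the protocol plugin yourself (the class name is mine, the wiring into protocol-httpclient is up to the plugin, and trusting every certificate is obviously insecure outside a closed network):

```java
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;

// Sketch: build an SSLContext whose TrustManager accepts any certificate
// chain, which avoids the "PKIX path building failed" handshake error.
public class TrustAllSketch {

    public static SSLContext trustAllContext() throws Exception {
        TrustManager[] trustAll = new TrustManager[] {
            new X509TrustManager() {
                // Accept everything: no issuer restrictions, no checks.
                public X509Certificate[] getAcceptedIssuers() {
                    return new X509Certificate[0];
                }
                public void checkClientTrusted(X509Certificate[] chain, String authType) {}
                public void checkServerTrusted(X509Certificate[] chain, String authType) {}
            }
        };
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, trustAll, new SecureRandom());
        return ctx;
    }

    public static void main(String[] args) throws Exception {
        SSLContext ctx = trustAllContext();
        // Install as the process-wide default so HttpsURLConnection uses it:
        HttpsURLConnection.setDefaultSSLSocketFactory(ctx.getSocketFactory());
        System.out.println("protocol=" + ctx.getProtocol());
    }
}
```

A safer alternative is to import the server's certificate into the JVM's truststore with keytool, which fixes the handshake without disabling validation globally.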

