Hi Eyeris,

For topN, check out the sizeFetchlist param in bin/crawl (line 62).
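To illustrate the idea, here is a rough sketch of how that part of the script behaves (the variable name sizeFetchlist is from bin/crawl, but the values and the simplified logic below are illustrative, not the script's actual defaults):

```shell
# Sketch: in bin/crawl the generate step is capped per round by
# sizeFetchlist, so lowering it acts like the old -topN option.
numSlaves=1
sizeFetchlist=$((numSlaves * 1000))   # e.g. cap each round at 1000 URLs

# The script then passes the cap to the generator, roughly:
#   bin/nutch generate crawl/crawldb crawl/segments -topN "$sizeFetchlist"
echo "per-round fetchlist cap: $sizeFetchlist"
```

So editing that one line in your copy of bin/crawl should give you back per-round limiting.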
On Fri, Dec 12, 2014 at 9:18 AM, Eyeris Rodríguez Rueda <[email protected]> wrote:

> Thanks Sebastian.
> About the second question, maybe I didn't explain it well. In the Nutch 1.9
> script the last parameter is the number of rounds, which I think is
> equivalent to the depth parameter in Nutch 1.5.1. But I can't find the topN
> parameter in Nutch 1.9. It is very useful for making a limited crawl
> process, because otherwise every round has no limit.
>
> About the third question, you are right: dragones.uci.cu is only available
> inside the university, but you can try with any website that uses https and
> check. I have an idea about what the problem might be, but I don't know how
> to solve it.
> When I try to access dragones.uci.cu with Firefox, I need to add an
> exception because the certificate has a problem, and Nutch doesn't know how
> to handle this error. It would be great if I could configure an option to
> trust websites with such errors. I was looking at the httpclient plugin
> code, but I couldn't find the code that handles this problem. Please could
> you help me or give me some advice?
>
> ----- Original Message -----
> From: "Sebastian Nagel" <[email protected]>
> To: [email protected]
> Sent: Thursday, December 11, 2014 6:07:05 PM
> Subject: Re: questions about nutch 1.9
>
> Hi Eyeris,
>
> > 1- How can I do a crawl process with a Solr parameter like in Nutch
> > 1.5.1, so that the spider skips this step if I don't set the Solr
> > parameter?
>
> Yes, that's possible in recent trunk of 1.x, see NUTCH-1832
> (in doubt, it should be possible to update/replace only bin/crawl):
> just pass an empty Solr URL.
>
> > 2- Is it possible to use topN or a similar parameter in Nutch 1.9, or
> > does every round include all links in the crawldb?
>
> On this point, I don't know about any differences between 1.9 and 1.5.1.
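A side note on the empty-Solr-URL behavior from NUTCH-1832: in effect the patched script just checks whether a Solr URL was supplied before running the indexing step. A simplified sketch of that check (not the actual bin/crawl code):

```shell
# Sketch: with NUTCH-1832 applied, an empty Solr URL means the
# indexing step is skipped rather than failing.
SOLRURL=""
if [ -z "$SOLRURL" ]; then
  echo "Skipping indexing (no Solr URL given)"
else
  echo "Indexing to $SOLRURL"
fi
```

With the patched script, an invocation like `bin/crawl urls/ crawl/ "" 2` (seed dir, crawl dir, empty Solr URL, two rounds) should crawl without indexing.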
> > 3- I have activated the httpclient plugin, and when I crawl a website
> > that uses the https protocol I get this error in the output console:
>
> Sorry, I remember you asked this question a month ago, and I didn't find
> the time to continue the thread. Can you try without httpclient:
> - use 1.9 or trunk
> - and remove protocol-httpclient from plugin.includes
>
> I'm not able to test/reproduce the problem because I cannot resolve
> the host dragones.uci.cu. Is this host only reachable within the
> university network?
>
> Best,
> Sebastian
>
> On 12/11/2014 10:11 PM, Eyeris Rodríguez Rueda wrote:
> > Please, any help?
> >
> > Hello.
> > I want to use Nutch 1.9, but there are some things I don't understand,
> > because I was using Nutch 1.5.1 before and some things have changed in
> > Nutch 1.9.
> > Sorry if these are basic things.
> > Some questions:
> >
> > 1- How can I do a crawl process with a Solr parameter like in Nutch
> > 1.5.1, so that the spider skips this step if I don't set the Solr
> > parameter?
> >
> > 2- Is it possible to use topN or a similar parameter in Nutch 1.9, or
> > does every round include all links in the crawldb?
> >
> > 3- I have activated the httpclient plugin, and when I crawl a website
> > that uses the https protocol I get this error in the output console:
> > *********************************
> > fetch of https://dragones.uci.cu/ failed with:
> > javax.net.ssl.SSLHandshakeException:
> > sun.security.validator.ValidatorException: PKIX path building failed:
> > sun.security.provider.certpath.SunCertPathBuilderException: unable to
> > find valid certification path to requested target
> >
> > The parsechecker tool throws a similar error.
> >
> > Any suggestion or advice will be appreciated.
> >
> > ---------------------------------------------------
> > XII Anniversary of the creation of the Universidad de las Ciencias
> > Informáticas. 12 years of history alongside Fidel. December 12, 2014.
--
Jonathan Cooper-Ellis
*Data Engineer*
myVBO, LLC dba Ziftr
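P.S. On the certificate question in the quoted thread: I don't know of a ready-made Nutch option for trusting sites with broken certificates, but the underlying JSSE mechanism is a custom TrustManager. A minimal sketch, under the assumption you are willing to patch the protocol plugin yourself (the class name is mine, the wiring into protocol-httpclient is up to the plugin, and trusting every certificate is obviously insecure outside a closed network):

```java
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;

// Sketch: build an SSLContext whose TrustManager accepts any certificate
// chain, which avoids the "PKIX path building failed" handshake error.
public class TrustAllSketch {

    public static SSLContext trustAllContext() throws Exception {
        TrustManager[] trustAll = new TrustManager[] {
            new X509TrustManager() {
                // Accept everything: no issuer restrictions, no checks.
                public X509Certificate[] getAcceptedIssuers() {
                    return new X509Certificate[0];
                }
                public void checkClientTrusted(X509Certificate[] chain, String authType) {}
                public void checkServerTrusted(X509Certificate[] chain, String authType) {}
            }
        };
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, trustAll, new SecureRandom());
        return ctx;
    }

    public static void main(String[] args) throws Exception {
        SSLContext ctx = trustAllContext();
        // Install as the process-wide default so HttpsURLConnection uses it:
        HttpsURLConnection.setDefaultSSLSocketFactory(ctx.getSocketFactory());
        System.out.println("protocol=" + ctx.getProtocol());
    }
}
```

A safer alternative is to import the server's certificate into the JVM's truststore with keytool, which fixes the handshake without disabling validation globally.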

