Hi,

Thanks for the reply.

I tried applying the patch (http-client-form-authtication.patch) from
NUTCH-827 [1] and compiled the code using ant. When I ran the crawler, it
logged the following warning:

  httpclient.Http: Bad auth conf file: Element <removedFormFields> not
  recognized in httpclient-auth.xml - expected <authscope>

It seems the changes do not take effect while crawling. How can I make sure
that the code changes are reflected? What is the correct procedure for
compiling the code in the plugins?

Thanks,
Tizy

On Tue, Dec 16, 2014 at 6:34 PM, remi tassing <[email protected]> wrote:
>
> I have been doing a lot of POST authentication while crawling corporate
> stuff. Since POST methods may vary drastically between sites (e.g. typical
> JIRA to POST+JS redirection, NTLMv2...) it's hard not to extend the crawler
> with some additional Java.
>
> So what I've ended up doing is to build a "handler" class for each
> specific site, and that handler knows how to send requests and fetch the
> content. Some common response type is expected, so it looks like an
> extension/plugin design for the protocol-httpclient plugin.
>
> On Tue, Dec 16, 2014 at 5:46 PM, Tizy Ninan <[email protected]> wrote:
> >
> > Hi Talat,
> >
> > Thanks a lot for the reply. I will go through it and try it out.
> >
> > Thanks,
> > Tizy
> >
> > On Tue, Dec 16, 2014 at 2:25 PM, Talat Uyarer <[email protected]> wrote:
> > >
> > > Hi Tizy,
> > >
> > > There is some discussion about this; you can find it at NUTCH-827 [1].
> > > IMHO we need some help. If we create this feature it will be useful.
> > >
> > > Talat
> > >
> > > [1] https://issues.apache.org/jira/browse/NUTCH-827
> > >
> > > 2014-12-16 10:44 GMT+02:00 Tizy Ninan <[email protected]>:
> > > > Hi,
> > > >
> > > > Thanks for the reply.
> > > > Is there any alternative way to do this authentication? Does the
> > > > fetcher job of Nutch accept cookies for fetching the web sites from
> > > > the same domain?
> > > > Could you suggest any workaround to do form based authentication
> > > > using Nutch?
> > > >
> > > > Thanks,
> > > > Tizy
> > > >
> > > > On Tue, Dec 16, 2014 at 1:08 PM, Halil Ibrahim Simsek
> > > > <[email protected]> wrote:
> > > >>
> > > >> Hello Tizy,
> > > >>
> > > >> As far as I know, the development version of Nutch can currently do
> > > >> Basic, Digest and NTLM based authentication. [1] Nutch cannot do POST
> > > >> based authentication that depends on cookies. BTW there is a document
> > > >> which is supposed to provide this feature, but as far as I can see no
> > > >> code has been developed yet. [2]
> > > >>
> > > >> [1] https://wiki.apache.org/nutch/HttpAuthenticationSchemes
> > > >> [2] https://wiki.apache.org/nutch/HttpPostAuthentication
> > > >>
> > > >> Halil
> > > >>
> > > >> 2014-12-16 7:16 GMT+02:00 Tizy Ninan <[email protected]>:
> > > >> >
> > > >> > Hi,
> > > >> >
> > > >> > I am trying to develop a custom crawler in Java, using Nutch v1.9,
> > > >> > to crawl websites that require form based authentication. I am
> > > >> > following the HttpPostAuthentication feature of Nutch to implement
> > > >> > it.
> > > >> >
> > > >> > The login parameters required for authentication, such as the html
> > > >> > form-id and the login post data (username, password), are specified
> > > >> > as key-value pairs in a configuration file. What is required to
> > > >> > identify the html login form (id or name of the html form)? How can
> > > >> > the html form parameters be identified if the id or name of the
> > > >> > form is not specified?
> > > >> >
> > > >> > I have also posted the question to the developer mailing list, but
> > > >> > did not receive any reply. I have been stuck with this for a while.
> > > >> > Could somebody provide a solution on how to specify the html form
> > > >> > parameters of the websites to be crawled, in order to perform form
> > > >> > based authentication?
> > > >> >
> > > >> > Thanks and Regards,
> > > >> > Tizy
> > > >
> > > > --
> > > > Thanks and Regards,
> > > > Tizy
> > >
> > > --
> > > Talat UYARER
> > > Website: http://talat.uyarer.com
> > > Twitter: http://twitter.com/talatuyarer
> > > Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
> >
> > --
> > Thanks and Regards,
> > Tizy
>
--
Thanks and Regards,
Tizy
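
For reference, the per-site "handler" design that remi describes in the quoted message could look roughly like the sketch below. This is only an illustration of the idea, not code from Nutch or from the NUTCH-827 patch: the names SiteHandler, FetchResult, StubFormLoginHandler and Registry are all hypothetical, and the stub handler returns canned content where a real handler would send the site-specific login POST and keep the session cookies.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class SiteHandlerSketch {

    /** Common response type that every handler returns (hypothetical). */
    static final class FetchResult {
        final int status;
        final String content;

        FetchResult(int status, String content) {
            this.status = status;
            this.content = content;
        }
    }

    /** Hypothetical contract: one implementation per site. */
    interface SiteHandler {
        boolean handles(String host);

        FetchResult fetch(String url);
    }

    /** Stub handler; a real one would POST the login form and reuse the cookies. */
    static final class StubFormLoginHandler implements SiteHandler {
        private final String host;
        private boolean loggedIn = false;

        StubFormLoginHandler(String host) {
            this.host = host;
        }

        @Override
        public boolean handles(String candidateHost) {
            return host.equalsIgnoreCase(candidateHost);
        }

        @Override
        public FetchResult fetch(String url) {
            if (!loggedIn) {
                // Placeholder for the site-specific login request.
                loggedIn = true;
            }
            return new FetchResult(200, "stub content for " + url);
        }
    }

    /** Picks the first registered handler that claims the URL's host. */
    static final class Registry {
        private final List<SiteHandler> handlers = new ArrayList<>();

        void register(SiteHandler handler) {
            handlers.add(handler);
        }

        Optional<SiteHandler> find(String url) {
            String host = URI.create(url).getHost();
            if (host == null) {
                return Optional.empty();
            }
            return handlers.stream().filter(h -> h.handles(host)).findFirst();
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.register(new StubFormLoginHandler("jira.example.com"));

        String url = "https://jira.example.com/browse/ABC-1";
        registry.find(url)
                .map(h -> h.fetch(url))
                .ifPresent(r -> System.out.println(r.status + " " + r.content));
    }
}
```

Keeping one shared result type means the calling code does not need to know which site-specific handler produced the response, which is presumably what makes the approach resemble a plugin design.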

