Can you create a ticket and submit a patch? There is already code in place that overrides HttpClient's default cookie policy. I'm surprised this case is not already covered.
Thanks!
Karl

On Thu, Jul 7, 2016 at 8:49 AM, jetnet <[email protected]> wrote:
> hi Karl,
>
> the problem was the host name in the seeding URL: it was the short host
> name, not the FQDN. So, the default cookie policy works with FQDNs only.
> That's why the obtained cookies were never used for the further requests.
> Changing the seeding URL to the "full host name" format solved the problem.
>
> jeeeez, that was a weird one...
>
> How about adding the next line to the code?
>
> cookie.setAttribute(ClientCookie.DOMAIN_ATTR, "true");
>
> Thanks!
> Konstantin
>
> 2016-07-07 13:24 GMT+02:00 Karl Wright <[email protected]>:
> > Hi Konstantin,
> >
> > The mock site that the test crawls and logs into is generated by
> > MockSessionWebService.java, under
> > connectors/webcrawler/connector/src/test/java/org/apache/manifoldcf/crawler/connectors/webcrawler/tests.
> > It does almost precisely what your site is doing. The test itself is
> > SessionTester.java. Your setup should be similar to how the test sets up
> > the login sequence and protected content area.
> >
> > Thanks,
> > Karl
> >
> >
> > On Thu, Jul 7, 2016 at 7:17 AM, Karl Wright <[email protected]> wrote:
> >>
> >> Hi Konstantin,
> >>
> >> There is an advanced Web Connector integration test, which currently
> >> passes, that tests session login and cookie transmission. I'll look over
> >> the test to be sure it is complete, but if so you should really be looking
> >> at your login sequence and verifying that the cookie set takes place in a
> >> request that is part of the login sequence.
> >>
> >> Thanks,
> >> Karl
> >>
> >>
> >> On Thu, Jul 7, 2016 at 6:58 AM, jetnet <[email protected]> wrote:
> >>>
> >>> Thanks for the hint regarding the httpclient logging!
> >>> So, it turned out, the cookies do NOT get added to the request: > >>> > >>> DEBUG 2016-07-07 12:49:26,015 (Worker thread '4') - WEB: Get method > >>> for '/sitemap.xml' > >>> DEBUG 2016-07-07 12:49:26,015 (Worker thread '4') - WEB: Adding 2 > >>> cookies for '/sitemap.xml' > >>> DEBUG 2016-07-07 12:49:26,015 (Worker thread '4') - WEB: Cookie > >>> '[version: 0][name: PHPSESSID][value: > >>> 8jegbs2dqb6r9oc3mb4pt0q777][domain: wikisite][path: /][expiry: null]' > >>> added > >>> DEBUG 2016-07-07 12:49:26,015 (Worker thread '4') - WEB: Cookie > >>> '[version: 0][name: authtoken][value: > >>> 920_636034784213249598_d2f40072be60b4de7bee72d74fc04400][domain: > >>> wikisite][path: /][expiry: Thu Jul 14 10:53:41 CEST 2016]' added > >>> DEBUG 2016-07-07 12:49:26,030 (Thread-1214) - CookieSpec selected: > >>> standard > >>> DEBUG 2016-07-07 12:49:26,093 (Thread-1214) - Auth cache not set in the > >>> context > >>> DEBUG 2016-07-07 12:49:26,093 (Thread-1214) - Connection request: > >>> [route: {}->http://wikisite:80][total kept alive: 0; route allocated: > >>> 0 of 1; total allocated: 0 of 20] > >>> DEBUG 2016-07-07 12:49:26,140 (Thread-1214) - Connection leased: [id: > >>> 0][route: {}->http://wikisite:80][total kept alive: 0; route > >>> allocated: 1 of 1; total allocated: 1 of 20] > >>> DEBUG 2016-07-07 12:49:26,140 (Thread-1214) - Opening connection > >>> {}->http://wikisite:80 > >>> DEBUG 2016-07-07 12:49:26,155 (Thread-1214) - Connecting to > >>> wikisite/10.0.0.100:80 > >>> DEBUG 2016-07-07 12:49:26,155 (Thread-1214) - Connection established > >>> 10.0.0.184:58501<->10.0.0.100:80 > >>> DEBUG 2016-07-07 12:49:26,155 (Thread-1214) - http-outgoing-0: set > >>> socket timeout to 300000 > >>> DEBUG 2016-07-07 12:49:26,155 (Thread-1214) - Executing request GET > >>> /sitemap.xml HTTP/1.1 > >>> DEBUG 2016-07-07 12:49:26,155 (Thread-1214) - Target auth state: > >>> UNCHALLENGED > >>> DEBUG 2016-07-07 12:49:26,155 (Thread-1214) - Proxy auth state: > >>> UNCHALLENGED > >>> 
DEBUG 2016-07-07 12:49:26,155 (Thread-1214) - http-outgoing-0 >> GET > >>> /sitemap.xml HTTP/1.1 > >>> DEBUG 2016-07-07 12:49:26,155 (Thread-1214) - http-outgoing-0 >> > >>> User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; > >>> [email protected]) > >>> DEBUG 2016-07-07 12:49:26,155 (Thread-1214) - http-outgoing-0 >> From: > >>> [email protected] > >>> DEBUG 2016-07-07 12:49:26,155 (Thread-1214) - http-outgoing-0 >> > Accept: > >>> */* > >>> DEBUG 2016-07-07 12:49:26,155 (Thread-1214) - http-outgoing-0 >> > >>> Accept-Encoding: gzip,deflate > >>> DEBUG 2016-07-07 12:49:26,155 (Thread-1214) - http-outgoing-0 >> Host: > >>> wikisite > >>> DEBUG 2016-07-07 12:49:26,155 (Thread-1214) - http-outgoing-0 >> > >>> Connection: Keep-Alive > >>> DEBUG 2016-07-07 12:49:37,768 (Thread-1214) - http-outgoing-0 << > HTTP/1.1 > >>> 200 OK > >>> DEBUG 2016-07-07 12:49:37,768 (Thread-1214) - http-outgoing-0 << > >>> Content-Type: application/xml; charset=utf-8 > >>> DEBUG 2016-07-07 12:49:37,768 (Thread-1214) - http-outgoing-0 << > >>> Server: Microsoft-IIS/7.5 > >>> DEBUG 2016-07-07 12:49:37,768 (Thread-1214) - http-outgoing-0 << > >>> X-Powered-By: PHP/5.2.14 > >>> DEBUG 2016-07-07 12:49:37,768 (Thread-1214) - http-outgoing-0 << > >>> Set-Cookie: PHPSESSID=bk9487elppchvshc38c7pfnv01; path=/; HttpOnly > >>> DEBUG 2016-07-07 12:49:37,768 (Thread-1214) - http-outgoing-0 << > >>> X-Powered-By: ASP.NET > >>> DEBUG 2016-07-07 12:49:37,768 (Thread-1214) - http-outgoing-0 << Date: > >>> Thu, 07 Jul 2016 10:49:38 GMT > >>> DEBUG 2016-07-07 12:49:37,768 (Thread-1214) - http-outgoing-0 << > >>> Content-Length: 684207 > >>> > >>> > >>> Jira tiket? :) > >>> > >>> Thanks, > >>> Konstantin > >>> > >>> > >>> 2016-07-07 12:37 GMT+02:00 Karl Wright <[email protected]>: > >>> > It really does add cookies as stated. > >>> > > >>> > That doesn't mean, however, that the cookies being sent correspond > to a > >>> > session that is correctly logged in. 
There's no way to tell this > from > >>> > the > >>> > logs. > >>> > > >>> > You can possibly get more information about the back-and-forth by > >>> > enabling > >>> > httpcomponents/httpclient wire logging. Headers only should be > >>> > sufficient. > >>> > You should see the exact cookies and be able to verify that the > cookies > >>> > sent > >>> > are the ones that were returned. You still won't be able to tell if > >>> > the > >>> > login was successful or not. > >>> > > >>> > Karl > >>> > > >>> > > >>> > > >>> > On Thu, Jul 7, 2016 at 6:25 AM, jetnet <[email protected]> wrote: > >>> >> > >>> >> ok, so, it means, that I do not need the 3rd stage at all? As the > >>> >> second stage (form authentication) records the cookies and redirects > >>> >> back: > >>> >> > >>> >> the second stage: > >>> >> > >>> >> DEBUG 2016-07-07 10:52:48,231 (Worker thread '79') - WEB: Post > method > >>> >> for '/Special:UserLogin' > >>> >> DEBUG 2016-07-07 10:52:48,231 (Worker thread '79') - WEB: Post > >>> >> parameter name 'username' value 'someuser' for '/Special:UserLogin' > >>> >> DEBUG 2016-07-07 10:52:48,231 (Worker thread '79') - WEB: Post > >>> >> parameter name 'returntourl' value 'http://wikisite/sitemap.xml' > for > >>> >> '/Special:UserLogin' > >>> >> DEBUG 2016-07-07 10:52:48,231 (Worker thread '79') - WEB: Post > >>> >> parameter name 'password' value 'XXXXXX' for '/Special:UserLogin' > >>> >> DEBUG 2016-07-07 10:52:48,231 (Worker thread '79') - WEB: Adding 2 > >>> >> cookies for '/Special:UserLogin' > >>> >> DEBUG 2016-07-07 10:52:48,231 (Worker thread '79') - WEB: Cookie > >>> >> '[version: 0][name: PHPSESSID][value: > >>> >> bughgf8fbjkkevk79ot4ef2vj1][domain: wikisite][path: /][expiry: > null]' > >>> >> added > >>> >> DEBUG 2016-07-07 10:52:48,231 (Worker thread '79') - WEB: Cookie > >>> >> '[version: 0][name: authtoken][value: > >>> >> 920_636034352097041592_136c71f2ac1fc2dd1ba72de805fcd1b5][domain: > >>> >> wikisite][path: /][expiry: Wed Jul 13 22:53:29 CEST 
2016]' added > >>> >> DEBUG 2016-07-07 10:52:48,434 (Worker thread '79') - WEB: Retrieving > >>> >> cookies... > >>> >> DEBUG 2016-07-07 10:52:48,434 (Worker thread '79') - WEB: Cookie > >>> >> '[version: 0][name: PHPSESSID][value: > >>> >> 589h3f20tjndhkc391nu5u0u51][domain: wikisite][path: /][expiry: > null]' > >>> >> DEBUG 2016-07-07 10:52:48,434 (Worker thread '79') - WEB: Cookie > >>> >> '[version: 0][name: authtoken][value: > >>> >> 920_636034783686256706_585415102d050458acfd91a9d1f223d5][domain: > >>> >> wikisite][path: /][expiry: Thu Jul 14 10:52:48 CEST 2016]' > >>> >> INFO 2016-07-07 10:52:48,449 (Worker thread '79') - WEB: FETCH > >>> >> LOGIN|http://wikisite/Special:UserLogin|1467881568231+218|302|153| > >>> >> DEBUG 2016-07-07 10:52:48,449 (Worker thread '79') - WEB: Document > >>> >> 'http://wikisite/Special:UserLogin' did not match expected form, > link, > >>> >> redirection, or content for sequence 'wikisite' > >>> >> > >>> >> so, the last message means, nothing matches in the sequence anymore > - > >>> >> logon end. > >>> >> And the last two cookies are being used for the next fetch of the > >>> >> sitemap, but the its content still matches the public pattern. > >>> >> > >>> >> Strange things happen... I just tried to use the authtoken cookie > from > >>> >> the log direct in the browser - and it gets authenticated without > >>> >> problems: I get the "private" content. But the manifoldcf not... > >>> >> weird... 
> >>> >> > >>> >> DEBUG 2016-07-07 10:52:48,543 (Worker thread '79') - WEB: Adding 2 > >>> >> cookies for '/sitemap.xml' > >>> >> DEBUG 2016-07-07 10:52:48,543 (Worker thread '79') - WEB: Cookie > >>> >> '[version: 0][name: PHPSESSID][value: > >>> >> 589h3f20tjndhkc391nu5u0u51][domain: wikisite][path: /][expiry: > null]' > >>> >> added > >>> >> DEBUG 2016-07-07 10:52:48,543 (Worker thread '79') - WEB: Cookie > >>> >> '[version: 0][name: authtoken][value: > >>> >> 920_636034783686256706_585415102d050458acfd91a9d1f223d5][domain: > >>> >> wikisite][path: /][expiry: Thu Jul 14 10:52:48 CEST 2016]' added > >>> >> INFO 2016-07-07 10:52:58,500 (Worker thread '79') - WEB: FETCH > >>> >> URL|http://wikisite/sitemap.xml|1467881568543+9957|200|684072| > >>> >> > >>> >> size: 684072 - is public content. > >>> >> > >>> >> Does it **really** add the cookies to the request? :) > >>> >> > >>> >> Thanks! > >>> >> Konstantin > >>> >> > >>> >> 2016-07-07 11:44 GMT+02:00 Karl Wright <[email protected]>: > >>> >> > "I thought, when the auth sequence is done > >>> >> > (exit login mode), the redirect to the original page happens > >>> >> > automatically (which is the case here, but somehow the content is > >>> >> > still "public")." > >>> >> > > >>> >> > That is correct BUT if the final redirection is what sets the > >>> >> > cookies > >>> >> > THEN > >>> >> > the cookies will only be recorded by the web connector if the > final > >>> >> > redirection is part of the login sequence. > >>> >> > > >>> >> > Thanks, > >>> >> > Karl > >>> >> > > >>> >> > > >>> >> > On Thu, Jul 7, 2016 at 5:33 AM, jetnet <[email protected]> wrote: > >>> >> >> > >>> >> >> hi Karl, > >>> >> >> thank you for the very prompt feedback! > >>> >> >> > >>> >> >> > 1) Have you made sure to include the redirection back to the > >>> >> >> > content? > >>> >> >> This is the step I don't quite understand - could you please > >>> >> >> clarify > >>> >> >> how that could be done? 
I thought, when the auth sequence is done > >>> >> >> (exit login mode), the redirect to the original page happens > >>> >> >> automatically (which is the case here, but somehow the content is > >>> >> >> still "public"). > >>> >> >> > >>> >> >> > 2) your check for *entering* the login sequence is too broad > and > >>> >> >> > fires > >>> >> >> > again even though the private sitemap page is being returned. > >>> >> >> totally agree, that's why the first step is to look into the > >>> >> >> content > >>> >> >> of the page, to check, if there is a pattern which appears in the > >>> >> >> public version ONLY. > >>> >> >> This is the only solution I can imagine so far, but any ideas - > >>> >> >> very > >>> >> >> welcome! > >>> >> >> > >>> >> >> The simple history shows basically the same - the process never > >>> >> >> leaves > >>> >> >> the login stage. > >>> >> >> > >>> >> >> If I remove the 3rd step, then I see, that the login stage is > over > >>> >> >> (logon end), but as the content of the sitemap.xml is still > >>> >> >> "public", > >>> >> >> the login process kicks in again. > >>> >> >> > >>> >> >> Thanks! > >>> >> >> Konstantin > >>> >> >> > >>> >> >> 2016-07-07 11:07 GMT+02:00 Karl Wright <[email protected]>: > >>> >> >> > Hi Konstantin, > >>> >> >> > > >>> >> >> > There are two possibilities: > >>> >> >> > > >>> >> >> > (1) You have missed one stage when specifying the login > sequence. > >>> >> >> > The > >>> >> >> > cookies are getting set, but not during a step that's part of > the > >>> >> >> > login > >>> >> >> > sequence. Have you made sure to include the redirection back > to > >>> >> >> > the > >>> >> >> > content? > >>> >> >> > (2) You really are logging in but your check for *entering* the > >>> >> >> > login > >>> >> >> > sequence is too broad and fires again even though the private > >>> >> >> > sitemap > >>> >> >> > page > >>> >> >> > is being returned. 
> >>> >> >> > > >>> >> >> > You can also look at the simple history as well to get an idea > >>> >> >> > what > >>> >> >> > MCF > >>> >> >> > is > >>> >> >> > doing for your job for session handling. > >>> >> >> > > >>> >> >> > Thanks, > >>> >> >> > Karl > >>> >> >> > > >>> >> >> > > >>> >> >> > On Thu, Jul 7, 2016 at 4:35 AM, jetnet <[email protected]> > wrote: > >>> >> >> >> > >>> >> >> >> Hi All, > >>> >> >> >> > >>> >> >> >> I've been trying to setup a session-based auth sequence for a > >>> >> >> >> forked > >>> >> >> >> MediaWiki site (Wiki connector does not work with this > version), > >>> >> >> >> but > >>> >> >> >> somehow got stuck with the configuration. > >>> >> >> >> The idea is to index the site using its sitemap.xml with > hops=1. > >>> >> >> >> The > >>> >> >> >> "public" version (user not logged in) of the sitemap.xml > >>> >> >> >> contains a > >>> >> >> >> different set of links as the "authenticated" one (user logged > >>> >> >> >> in). > >>> >> >> >> The current auth sequence looks like this (the job's seeding > >>> >> >> >> URL=http://wikisite/sitemap.xml): > >>> >> >> >> > >>> >> >> >> 1) the first call to the seeding URL should be redirected to > the > >>> >> >> >> login > >>> >> >> >> page > >>> >> >> >> Login URL regexp: sitemap.xml > >>> >> >> >> Page type: content > >>> >> >> >> Identification regular expression: <some content from the > >>> >> >> >> "public" > >>> >> >> >> version> > >>> >> >> >> Override target URL: /Special:UserLogin > >>> >> >> >> > >>> >> >> >> 2) enter user's credentials on the login page > >>> >> >> >> Login URL regexp: Special:UserLogin > >>> >> >> >> Page type: form > >>> >> >> >> Override form parameters: username=someuser, password=******, > >>> >> >> >> returntourl=http://wikisite/sitemap.xml > >>> >> >> >> > >>> >> >> >> 3) the login page ***should*** redirect back to the seeding > URL > >>> >> >> >> with > >>> >> >> >> the authorized content > >>> >> >> >> Login URL regexp: /Special:UserLogin > >>> >> >> 
>> Page type: redirection > >>> >> >> >> Identification regular expression: /sitemap.xml > >>> >> >> >> > >>> >> >> >> From the log-file I can see, that first 2 steps work fine - > the > >>> >> >> >> public > >>> >> >> >> content gets recognized, the form data get sent, the session's > >>> >> >> >> cookies > >>> >> >> >> get set. But the 3rd step returns the "public" version of the > >>> >> >> >> sitemap.xml again, and the login process is getting stuck in a > >>> >> >> >> loop. > >>> >> >> >> Am I on the right way or did I miss something? > >>> >> >> >> > >>> >> >> >> here is the log for the 3rd step: > >>> >> >> >> > >>> >> >> >> INFO 2016-07-06 22:52:27,285 (Worker thread '43') - WEB: > FETCH > >>> >> >> >> > >>> >> >> >> LOGIN| > http://wikisite/Special:UserLogin|1467838347082+203|302|153| > >>> >> >> >> DEBUG 2016-07-06 22:52:27,285 (Worker thread '43') - WEB: > Tried > >>> >> >> >> to > >>> >> >> >> match raw url 'http://wikisite/sitemap.xml' > >>> >> >> >> DEBUG 2016-07-06 22:52:27,285 (Worker thread '43') - WEB: > Tried > >>> >> >> >> to > >>> >> >> >> match cooked url 'http://wikisite/sitemap.xml' > >>> >> >> >> DEBUG 2016-07-06 22:52:27,285 (Worker thread '43') - WEB: > >>> >> >> >> Redirection > >>> >> >> >> link lookup matched 'http://wikisite/sitemap.xml' > >>> >> >> >> DEBUG 2016-07-06 22:52:27,285 (Worker thread '43') - WEB: > >>> >> >> >> Document > >>> >> >> >> 'http://wikisite/Special:UserLogin' matches preferred > >>> >> >> >> redirection, > >>> >> >> >> so > >>> >> >> >> determined to be login page for sequence 'wikisite' > >>> >> >> >> DEBUG 2016-07-06 22:52:27,394 (Worker thread '43') - WEB: > >>> >> >> >> Waiting > >>> >> >> >> for > >>> >> >> >> an HttpClient object > >>> >> >> >> DEBUG 2016-07-06 22:52:27,394 (Worker thread '43') - WEB: For > >>> >> >> >> http://wikisite/sitemap.xml, setting virtual host to wikisite > >>> >> >> >> DEBUG 2016-07-06 22:52:27,394 (Worker thread '43') - WEB: Got > an > >>> >> >> >> HttpClient object after 0 ms. 
> >>> >> >> >> DEBUG 2016-07-06 22:52:27,394 (Worker thread '43') - WEB: Get > >>> >> >> >> method > >>> >> >> >> for '/sitemap.xml' > >>> >> >> >> DEBUG 2016-07-06 22:52:27,394 (Worker thread '43') - WEB: > Adding > >>> >> >> >> 2 > >>> >> >> >> cookies for '/sitemap.xml' > >>> >> >> >> DEBUG 2016-07-06 22:52:27,394 (Worker thread '43') - WEB: > >>> >> >> >> Cookie > >>> >> >> >> '[version: 0][name: PHPSESSID][value: > >>> >> >> >> 1vnhgi0f84dc9pi6eaoj0nau45][domain: wikisite][path: /][expiry: > >>> >> >> >> null]' > >>> >> >> >> added > >>> >> >> >> DEBUG 2016-07-06 22:52:27,394 (Worker thread '43') - WEB: > >>> >> >> >> Cookie > >>> >> >> >> '[version: 0][name: authtoken][value: > >>> >> >> >> > 920_636034351472613318_616a5fd45ce4d5fed6c5318d73b38070][domain: > >>> >> >> >> wikisite][path: /][expiry: Wed Jul 13 22:52:27 CEST 2016]' > added > >>> >> >> >> DEBUG 2016-07-06 22:52:35,660 (Worker thread '43') - WEB: > >>> >> >> >> Retrieving > >>> >> >> >> cookies... > >>> >> >> >> DEBUG 2016-07-06 22:52:35,660 (Worker thread '43') - WEB: > >>> >> >> >> Cookie > >>> >> >> >> '[version: 0][name: PHPSESSID][value: > >>> >> >> >> vqfpr88pqa6d62nl6h4lp03nu1][domain: wikisite][path: /][expiry: > >>> >> >> >> null]' > >>> >> >> >> DEBUG 2016-07-06 22:52:35,660 (Worker thread '43') - WEB: > >>> >> >> >> Cookie > >>> >> >> >> '[version: 0][name: authtoken][value: > >>> >> >> >> > 920_636034351472613318_616a5fd45ce4d5fed6c5318d73b38070][domain: > >>> >> >> >> wikisite][path: /][expiry: Wed Jul 13 22:52:27 CEST 2016]' > >>> >> >> >> INFO 2016-07-06 22:52:37,004 (Worker thread '43') - WEB: > FETCH > >>> >> >> >> LOGIN| > http://wikisite/sitemap.xml|1467838347394+9610|200|683773| > >>> >> >> >> DEBUG 2016-07-06 22:52:37,004 (Worker thread '43') - WEB: > >>> >> >> >> Document > >>> >> >> >> 'http://wikisite/sitemap.xml' is text, with encoding 'utf-8'; > >>> >> >> >> link > >>> >> >> >> extraction starting > >>> >> >> >> DEBUG 2016-07-06 22:52:37,019 (Worker thread '43') - WEB: > 
>>> >> >> >> Document > >>> >> >> >> 'http://wikisite/sitemap.xml' matches content, so determined > to > >>> >> >> >> be > >>> >> >> >> login page for sequence 'wikisite' > >>> >> >> >> > >>> >> >> >> > >>> >> >> >> Thank you! > >>> >> >> >> regards, Konstantin > >>> >> >> > > >>> >> >> > > >>> >> > > >>> >> > > >>> > > >>> > > >> > >> > > >
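
For anyone hitting the same symptom: the thread traces it to a seeding URL that used a short host name (http://wikisite/...) rather than an FQDN. Below is a minimal sketch of a pre-flight check for that condition. It is not part of ManifoldCF or HttpClient. The class name SeedHostCheck, its methods, and the host wikisite.example.com are hypothetical, and the "contains an inner dot" rule is only a rough heuristic (an IP address such as 10.0.0.100 would also pass it).

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper for illustration; not part of ManifoldCF or HttpClient. */
final class SeedHostCheck {

    private SeedHostCheck() {}

    /** Returns the host part of a URL, or null if the URL cannot be parsed. */
    static String hostOf(String url) {
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /**
     * Heuristic (an assumption of this sketch): a host whose first dot is
     * neither the first nor the last character is treated as fully qualified.
     */
    static boolean looksFullyQualified(String url) {
        String host = hostOf(url);
        if (host == null || host.isEmpty()) {
            return false;
        }
        int dot = host.indexOf('.');
        return dot > 0 && dot < host.length() - 1;
    }

    public static void main(String[] args) {
        String[] seeds = {
            "http://wikisite/sitemap.xml",             // short host name, as in the logs
            "http://wikisite.example.com/sitemap.xml"  // hypothetical FQDN form
        };
        for (String seed : seeds) {
            String verdict = looksFullyQualified(seed)
                ? "looks like an FQDN"
                : "short host name, session cookies may not be sent";
            System.out.println(seed + " -> " + verdict);
        }
    }
}
```

Running a check like this on the seeding URLs before starting a job would flag the short-host-name case early. It does not change how cookies are matched; that would still need the seeding URL change described above or a patch to the connector's cookie handling.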
