Sorry, I meant "second-to-last page" ;)
> Date: Mon, 3 Sep 2012 08:18:02 -0400
> Subject: Re: Web connector - Session-based access credentials
> From: [email protected]
> To: [email protected]
>
> What do you mean, "first last page"?
> The Web Connector needs to refetch the page that caused the
> redirection, because that is likely to be a content page based on the
> user's own description of the login sequence. Otherwise pages would
> be missing from the crawl, whenever login needed to be redone.
>
> Karl
>
> On Mon, Sep 3, 2012 at 7:43 AM, Michael Kooloos <[email protected]> wrote:
> > No, same thing happens in the browser also, so need to find a different seed
> > page that doesn't have this behaviour, but no luck there yet..
> >
> > Other way to solve this 'issue' is if the connector will go back to the
> > first last page after finishing the login-sequence, instead of the last page
> > (since the last page stays in a loop). Should be possible, right?
> >
> > Michael
> >
> >> Date: Mon, 3 Sep 2012 07:15:20 -0400
> >> Subject: Re: Web connector - Session-based access credentials
> >> From: [email protected]
> >> To: [email protected]
> >>
> >> Ok - if the redirect is occurring in a browser whether or not you are
> >> logged in, then yes, you cannot use that page as a seed. If this only
> >> seems to happen in the Web Connector, on the other hand, we should
> >> keep talking, because your login sequence is not actually succeeding
> >> in setting up the session cookies properly.
> >>
> >> Thanks!
> >> Karl
> >>
> >> On Mon, Sep 3, 2012 at 6:14 AM, Michael Kooloos <[email protected]> wrote:
> >> > Hi Karl,
> >> >
> >> > Thanks. Found the issue, the seed document keeps redirecting to the logon
> >> > page (even after login has occurred). This is an issue (protection?) of the
> >> > website and it now makes sense to me why the connector stays in a loop.
> >> > Haven't found a solution yet, have to find a more appropriate seed document
> >> > or a way to skip the redirect the second time it enters the loop..
> >> >
> >> > Many thanks for your support!
> >> >
> >> >> Date: Thu, 30 Aug 2012 11:52:01 -0400
> >> >> Subject: Re: Web connector - Session-based access credentials
> >> >> From: [email protected]
> >> >> To: [email protected]
> >> >>
> >> >> If I understand how you have it set up, what the ManifoldCF web
> >> >> connector will do is this:
> >> >>
> >> >> (1) Fetch the seed document.
> >> >> (2) Take the redirection to the logon page, and thus enter the login
> >> >> sequence
> >> >> (3) Do the login sequence and establish the correct cookies
> >> >> (4) Refetch the seed document
> >> >> (5) Take the redirection to the logon page...
> >> >>
> >> >> So, as you can see, your seed document must redirect ONLY if login has
> >> >> not yet occurred, or you will be stuck in a loop. So either fix that,
> >> >> or choose a more appropriate seed document.
> >> >>
> >> >> On a normal site, typically you get different results on most content
> >> >> pages when login has occurred vs. when login has not yet occurred. It
> >> >> is up to you to define in the Web Connector what combination of pages
> >> >> and content constitute a logon request vs. normal content fetch. And
> >> >> that's the whole problem, and why this is so complicated.
> >> >>
> >> >> Thanks,
> >> >> Karl
> >> >>
> >> >> On Thu, Aug 30, 2012 at 11:38 AM, Michael Kooloos <[email protected]> wrote:
> >> >> > Karl,
> >> >> >
> >> >> > My seed document is not a logon page, but the seed document url
> >> >> > automatically redirects to the logon pages. So the first regex is of the
> >> >> > logon page, then the regex for the Login URL is the same (since it's the
> >> >> > logon page), type = Form. Do I define any redirect after the logon form?
> >> >> >
> >> >> > Hope this makes a bit of sense..
> >> >> >
> >> >> > Didn't think it would be that hard to set up some access credentials..
> >> >> >
> >> >> >> Date: Thu, 30 Aug 2012 10:03:20 -0400
> >> >> >> Subject: Re: Web connector - Session-based access credentials
> >> >> >> From: [email protected]
> >> >> >> To: [email protected]
> >> >> >>
> >> >> >> It sounds like your regular expression(s) which describe what pages
> >> >> >> belong to the logon sequence may be incorrect. After the logon
> >> >> >> sequence exits, the crawler will attempt to refetch the page it was
> >> >> >> working on before it entered the logon sequence. If that page is PART
> >> >> >> of the logon sequence it will loop as you describe.
> >> >> >>
> >> >> >> Your seed documents should therefore NOT be logon pages or you will
> >> >> >> never get anywhere...
> >> >> >>
> >> >> >> Karl
> >> >> >>
> >> >> >> On Thu, Aug 30, 2012 at 9:58 AM, Michael Kooloos <[email protected]> wrote:
> >> >> >> > Karl,
> >> >> >> >
> >> >> >> > I've read through the similar problems/questions on the list (only found
> >> >> >> > 3), but without any luck. In the Seed I've the page I want to crawl, but
> >> >> >> > this is protected by security, so I set up a redirect to the login-page
> >> >> >> > and a form for the login-page with the username/password parameters. When
> >> >> >> > I look in the Simple History I see the fetch of the first page, the
> >> >> >> > begin-logon, redirect to the login-page, the end-logon, but then it starts
> >> >> >> > all over again and keeps in a loop. Any ideas? I think a working example
> >> >> >> > will help me a lot..
> >> >> >> >
> >> >> >> > Michael
> >> >> >> >
> >> >> >> >> Date: Thu, 30 Aug 2012 09:29:08 -0400
> >> >> >> >> Subject: Re: Web connector - Session-based access credentials
> >> >> >> >> From: [email protected]
> >> >> >> >> To: [email protected]
> >> >> >> >>
> >> >> >> >> I set it up to crawl Angie's List at one point. It was developed to
> >> >> >> >> crawl an oil-and-gas exploration subscription site. Others have
> >> >> >> >> fielded fairly detailed questions and/or problems to this list, so I
> >> >> >> >> know it has been used by many.
> >> >> >> >>
> >> >> >> >> Can you give a more thorough and detailed description of what you are
> >> >> >> >> trying to crawl, and what is happening for you?
> >> >> >> >>
> >> >> >> >> Karl
> >> >> >> >>
> >> >> >> >> On Thu, Aug 30, 2012 at 9:25 AM, Michael Kooloos <[email protected]> wrote:
> >> >> >> >> >
> >> >> >> >> > Hi,
> >> >> >> >> >
> >> >> >> >> > Does anyone have a working example of the session-based access
> >> >> >> >> > credentials for the web connector? Following the end-user-documentation
> >> >> >> >> > as good as possible, but still no luck :(
> >> >> >> >> >
> >> >> >> >> > Thanks!
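
For anyone reading this thread later: the fetch / logon / refetch sequence Karl describes in steps (1)-(5) can be reproduced with a small toy model. This is only a sketch in Python, not ManifoldCF code. The URLs "/seed" and "/login", the two site functions, and the max_logins guard are all made up for illustration; nothing in the thread says the Web Connector has such a guard.

```python
from typing import Callable, List, Tuple

LOGIN_URL = "/login"  # hypothetical logon page URL

# A response is ("redirect", target_url) or ("content", body).
Response = Tuple[str, str]
Site = Callable[[str, bool], Response]


def well_behaved_site(url: str, logged_in: bool) -> Response:
    """Redirects the seed to the logon page ONLY while not logged in."""
    if url == "/seed" and not logged_in:
        return ("redirect", LOGIN_URL)
    return ("content", f"body of {url}")


def looping_site(url: str, logged_in: bool) -> Response:
    """Redirects the seed to the logon page even after login (the problem case)."""
    if url == "/seed":
        return ("redirect", LOGIN_URL)
    return ("content", f"body of {url}")


def crawl(seed: str, site: Site, max_logins: int = 3) -> List[str]:
    """Fetch the seed, log in on redirect, then refetch the seed.

    max_logins is an illustrative guard so the toy model terminates.
    """
    history: List[str] = []
    logged_in = False
    logins = 0
    while True:
        history.append(f"fetch {seed}")               # steps (1) and (4)
        kind, _ = site(seed, logged_in)
        if kind == "content":
            history.append("content")
            return history
        if logins >= max_logins:                      # redirected yet again
            history.append("loop detected")
            return history
        history.append("begin-logon")                 # step (2)
        logged_in = True                              # step (3): cookies set
        logins += 1
        history.append("end-logon")


if __name__ == "__main__":
    print(crawl("/seed", well_behaved_site))
    print(crawl("/seed", looping_site, max_logins=2))
```

With well_behaved_site the history ends in "content" after one logon. With looping_site the fetch / begin-logon / end-logon pattern repeats until the guard stops it, which matches what Michael saw in the Simple History.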
