So you're saying wget can be run in a mode where it follows the redirect to fetch the content, but uses the original, pre-redirect URL to create the directory in which to store it?
On Tue, Nov 19, 2013 at 2:41 PM, Karl Wright <[email protected]> wrote:

> Hi Mark,
>
> Yes, but I'm afraid we *can't* emulate the redirect behavior, because
> that's an upstream connector choice. Wget can operate in a mode where it
> uses the pre-redirect URL, and resolves conflicts nonetheless. How does it
> do that?
>
> Karl
>
> On Tue, Nov 19, 2013 at 5:33 PM, Mark Libucha <[email protected]> wrote:
>
>> wget -x uses the redirect URL as the basis for the path it creates.
>>
>> So, if http://mysite/news returns a 302 redirecting to
>> http://mysite/news/index.html, wget saves it as:
>>
>> mysite/news/index.html
>>
>> MCF, on the other hand, saves it as:
>>
>> http/mysite/news
>>
>> Mark
>>
>> On Tue, Nov 19, 2013 at 2:15 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Mark,
>>>
>>> The filesystem connector is supposed to emulate wget behavior. What
>>> does wget do in this case?
>>>
>>> Karl
>>>
>>> On Tue, Nov 19, 2013 at 4:17 PM, Mark Libucha <[email protected]> wrote:
>>>
>>>> I noticed this problem while crawling a web site and saving to the
>>>> file system with the FileSystem output connector.
>>>>
>>>> Let's say the website defines a URL like this:
>>>>
>>>> http://mysite/news
>>>>
>>>> That URI actually gets mapped to a file on the web server, say
>>>> http://mysite/news/index.html, but the http://mysite/news URI does
>>>> exist and gets sent as the documentURI to addOrReplaceDocument().
>>>>
>>>> MCF's FileSystem connector gets the http://mysite/news URL and creates
>>>> a path for saving that content that looks like http/mysite/news,
>>>> where news is a file.
>>>>
>>>> But then, if the site also defines a URL like
>>>> http://mysite/news/local/today.html, MCF's FileSystem connector fails
>>>> trying to create the directory http/mysite/news/local, because part of
>>>> it, http/mysite/news, already exists as a file.
>>>> Of course, if the URIs are crawled in the reverse order, the file
>>>> can't be created because a directory already exists with that name.
>>>>
>>>> Make sense?
>>>>
>>>> The real killer is that when this happens, it's fatal to the job. That
>>>> is, it doesn't just fail to save that one URL; the connector returns a
>>>> fatal error and the crawl is stopped.
>>>>
>>>> Mark
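[Editor's note: the file/directory collision described in the thread can be reproduced with a short sketch. The `url_to_path` and `save` helpers below are illustrative assumptions about how a wget-style connector might map URLs to paths (scheme/host/path segments, no redirect awareness); they are not MCF's actual code.]

```python
import os
import tempfile
from urllib.parse import urlparse

def url_to_path(root, url):
    """Map a URL to a path the way the thread describes MCF doing it:
    scheme/host/path-segments, with no awareness of redirects."""
    p = urlparse(url)
    return os.path.join(root, p.scheme, p.netloc, *p.path.strip("/").split("/"))

def save(root, url, content):
    """Save content under the path derived from the *pre-redirect* URL."""
    path = url_to_path(root, url)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(content)

root = tempfile.mkdtemp()

# First URL saved: http/mysite/news is now a plain *file*.
save(root, "http://mysite/news", b"redirect target content")

try:
    # Creating the directory http/mysite/news/local now fails, because
    # http/mysite/news already exists as a file (NotADirectoryError on
    # POSIX). Crawled in the reverse order, the file write fails instead.
    save(root, "http://mysite/news/local/today.html", b"today")
except OSError as e:
    print("collision:", e)
```

Running it prints the collision error, which is the same condition the connector hits; the difference is that the connector treats it as fatal to the whole job rather than skipping the one document.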
