Hi Mark,

The filesystem connector is supposed to emulate WGET behavior. What does WGET do in this case?

Karl

On Tue, Nov 19, 2013 at 4:17 PM, Mark Libucha <[email protected]> wrote:

> Noticed this problem while crawling a web site and saving to the file
> system with the FileSystem output connector.
>
> Let's say the website defines a URL like this:
>
> http://mysite/news
>
> That URI actually gets mapped to a file on the web server, say
> http://mysite/news/index.html, but the http://mysite/news URI does exist
> and gets sent as the documentURI to addOrReplaceDocument().
>
> MCF's FileSystem connector gets the http://mysite/news URL and creates a
> directory for saving that content that looks like this http/mysite/news,
> where news is a file.
>
> But then if the site also defines a URL like this
> http://mysite/news/local/today.html, MCF's FileSystem connector fails
> trying to create the directory http/mysite/news/local because part of it,
> http/mysite/news, already exists as a file.
>
> Of course, if the URIs are crawled in the reverse order, the file can't be
> created because a directory already exists with that name.
>
> Make sense?
>
> The real killer is that when this happens it's fatal to the job. That is,
> it doesn't just fail to get that one URL, the connector returns a fatal
> error and the crawl is stopped.
>
> Mark
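
For anyone following the thread, the collision Mark describes can be reproduced outside MCF with a small Python sketch. This is not the connector's code and makes no claim about what WGET does. `uri_to_path`, `save_naive`, and `save_tolerant` are hypothetical helpers, and falling back to `index.html` is only one possible assumption about how such a clash could be resolved.

```python
import os
import tempfile
from urllib.parse import urlsplit


def uri_to_path(root, uri):
    """Map a URI to root/<scheme>/<host>/<path segments>, as in the thread."""
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    return os.path.join(root, parts.scheme, parts.netloc, *segments)


def save_naive(root, uri, content):
    """Write content at the mapped path. Raises OSError when a path
    component already exists as a file, or the target is a directory."""
    path = uri_to_path(root, uri)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(content)
    return path


def save_tolerant(root, uri, content, index_name="index.html"):
    """Hypothetical workaround: a file that blocks a directory is moved to
    <dir>/<index_name>, and a target that is already a directory is
    written as <dir>/<index_name>. index_name is an assumption."""
    path = uri_to_path(root, uri)
    rel = os.path.relpath(os.path.dirname(path), root)
    current = root
    for segment in rel.split(os.sep):
        current = os.path.join(current, segment)
        if os.path.isfile(current):
            # Turn the blocking file into a directory holding that file.
            tmp = current + ".tmp-move"
            os.replace(current, tmp)
            os.mkdir(current)
            os.replace(tmp, os.path.join(current, index_name))
        elif not os.path.isdir(current):
            os.mkdir(current)
    if os.path.isdir(path):
        path = os.path.join(path, index_name)
    with open(path, "wb") as f:
        f.write(content)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        save_naive(root, "http://mysite/news", b"front page")
        try:
            save_naive(root, "http://mysite/news/local/today.html", b"today")
        except OSError as exc:
            print("naive save failed:", type(exc).__name__)
```

With `save_tolerant`, either crawl order ends up with `http/mysite/news/index.html` and `http/mysite/news/local/today.html` on disk. Whether the connector should resolve the clash this way, or simply skip the one document instead of aborting the job, is a separate design question.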
