Re: FileSystem connector path issue

Karl Wright Tue, 19 Nov 2013 14:59:18 -0800

Hi Mark,

Yes, at least the materials I see online say that this is the case.  But I
don't know exactly how.


For the purposes of the File System Output Connector, it doesn't matter,
since anyone can construct a site that does NOT redirect and still has the
URL layout as you originally described.  So the problem has to be solved.

I can experiment with WGET here, to check out what its behavior might be,
but not while I'm doing Windows stuff - so I thought you might be able to
do that.

Thanks,
Karl



On Tue, Nov 19, 2013 at 5:52 PM, Mark Libucha <[email protected]> wrote:

> So you're saying wget can be run in a mode whereby it follows the redirect
> to fetch the content but uses the original, pre-redirect url to create the
> directory to store the content?
>
>
> On Tue, Nov 19, 2013 at 2:41 PM, Karl Wright <[email protected]> wrote:
>
>> Hi Mark,
>>
>> Yes, but I'm afraid we *can't* emulate the redirect behavior because
>> that's an upstream connector choice.  WGet can operate in a mode where it
>> uses the pre-redirect URL, and resolves conflicts nonetheless.  How does it
>> do it?
>>
>> Karl
>>
>>
>>
>> On Tue, Nov 19, 2013 at 5:33 PM, Mark Libucha <[email protected]> wrote:
>>
>>> wget -x uses the redirect url as the basis for the path it creates.
>>>
>>> So, if http://mysite/news returns a 302 redirecting to
>>> http://mysite/news/index.html, wget saves as:
>>>
>>> mysite/news/index.html
>>>
>>> MCF, on the other hand, saves as:
>>>
>>> http/mysite/news
>>>
>>> Mark
>>>
>>>
>>> On Tue, Nov 19, 2013 at 2:15 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Mark,
>>>>
>>>> The filesystem connector is supposed to emulate WGET behavior.  What
>>>> does WGET do in this case?
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Tue, Nov 19, 2013 at 4:17 PM, Mark Libucha <[email protected]> wrote:
>>>>
>>>>> Noticed this problem while crawling a web site and saving to the file
>>>>> system with the FileSystem output connector.
>>>>>
>>>>> Let's say the website defines a URL like this:
>>>>>
>>>>> http://mysite/news
>>>>>
>>>>> That URI actually gets mapped to a file on the web server, say
>>>>> http://mysite/news/index.html, but the http://mysite/news URI does
>>>>> exist and gets sent as the documentURI to addOrReplaceDocument().
>>>>>
>>>>> MCF's FileSystem connector gets the http://mysite/news URL and
>>>>> creates a path for saving that content that looks like this:
>>>>> http/mysite/news, where news is a file.
>>>>>
>>>>> But then if the site also defines a URL like this
>>>>> http://mysite/news/local/today.html, MCF's FileSystem connector fails
>>>>> trying to create the directory http/mysite/news/local because part of it,
>>>>> http/mysite/news, already exists as a file.
>>>>>
>>>>> Of course, if the URIs are crawled in the reverse order, the file
>>>>> can't be created because a directory already exists with that name.
>>>>>
>>>>> Make sense?
>>>>>
>>>>> The real killer is that when this happens it's fatal to the job. That
>>>>> is, it doesn't just fail to get that one URL; the connector returns a
>>>>> fatal error and the crawl is stopped.
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>
>>>
>>
>
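
The file-versus-directory clash described in the original report can be
reproduced and worked around in a few lines. The following is only a
minimal sketch of one possible resolution, not what wget or the MCF
FileSystem connector actually does: when a path has to be both a file and a
directory, the file's content is moved inside the directory under a
reserved name. The names DIR_CONTENT_NAME, naive_path and save are
hypothetical, and the sketch assumes no crawled URL uses "_content" as a
path segment and that no sibling entry named "<segment>.tmp" exists.

```python
import os
import tempfile
from urllib.parse import urlsplit

# Hypothetical reserved name for content whose URL path is also a directory.
DIR_CONTENT_NAME = "_content"


def naive_path(root, url):
    """Map scheme/host/path onto directories, e.g. http/mysite/news."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return os.path.join(root, parts.scheme, parts.netloc, *segments)


def _promote_files_to_dirs(root, directory):
    """Turn any ancestor that exists as a file into a directory.

    The old file content is kept as <ancestor>/DIR_CONTENT_NAME.
    """
    current = root
    for segment in os.path.relpath(directory, root).split(os.sep):
        current = os.path.join(current, segment)
        if os.path.isfile(current):
            tmp = current + ".tmp"  # assumed not to clash with a real entry
            os.rename(current, tmp)
            os.mkdir(current)
            os.rename(tmp, os.path.join(current, DIR_CONTENT_NAME))


def save(root, url, data):
    """Write data for url, resolving file/directory clashes instead of failing."""
    target = naive_path(root, url)
    parent = os.path.dirname(target)
    _promote_files_to_dirs(root, parent)
    os.makedirs(parent, exist_ok=True)
    if os.path.isdir(target):
        # The URL was already used as a directory: store the content inside it.
        target = os.path.join(target, DIR_CONTENT_NAME)
    with open(target, "wb") as f:
        f.write(data)
    return target


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    print(save(root, "http://mysite/news", b"front page"))
    print(save(root, "http://mysite/news/local/today.html", b"local news"))
```

With this scheme both crawl orders end up with the same layout:
http/mysite/news becomes a directory holding _content and local/today.html,
and neither document causes the write to fail.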
