A simple URL-rewriting configuration should fix the problem without touching the file system.
Everything can be done server-side.
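
As a rough illustration of the mapping such a rewrite rule would apply, here is a small Python sketch. It is not server configuration, the function name is made up for the example, and it assumes that only the path needs lower-casing while the host, query string, and fragment are left alone:

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_url_path(url: str) -> str:
    """Return url with only its path component lower-cased (hypothetical helper)."""
    parts = urlsplit(url)
    # Rebuild the URL, changing nothing but the path.
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        parts.path.lower(),
        parts.query,
        parts.fragment,
    ))
```

With a rule like this applied on the server, every mixed-case variant of a link would resolve to the same lower-case path, so a mirror would see one file rather than several.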

Best Regards

On Thu, Jun 19, 2008 at 6:29 AM, Coombe, Allan David (DPS)
<[EMAIL PROTECTED]> wrote:
> Thanks everyone for the contributions.
>
> Ultimately, our purpose is to process documents from the site into our
> search database, so probably the most important thing is to limit the
> number of files being processed.  The case of the URLs in the html
> probably wouldn't cause us much concern, but I could see that it might
> be useful to "convert" a site for mirroring from a non-case sensitive
> (windows) environment to a case sensitive (li|u)nix one - this would
> need to include translation of urls in content as well as filenames on
> disk.
>
> In the meantime - does anyone know of a proxy server that could
> translate urls from mixed case to lower case?  I thought that if we
> downloaded using wget via such a proxy server we might get the
> appropriate result.
>
> The other alternative we were thinking of was to post process the files
> with symlinks for all mixed case versions of files and directories (I
> think someone already suggested this - great minds and all that...). I
> assume that wget would correctly use the symlink to determine the
> time/date stamp of the file for determining if it requires updating (or
> would it use the time/date stamp of the symlink?). I also assume that if
> wget downloaded the file it would overwrite the symlink and we would
> have to run our "convert files to symlinks" process again.
>
> Just to put it in perspective, the actual site is approximately 45gb
> (that's what the administrator said) and wget downloaded > 100gb
> (463,000 files) when I did the first process.
>
> Cheers
> Allan
>
> -----Original Message-----
> From: Micah Cowan [mailto:[EMAIL PROTECTED]
> Sent: Saturday, 14 June 2008 7:30 AM
> To: Tony Lewis
> Cc: Coombe, Allan David (DPS); 'Wget'
> Subject: Re: Wget 1.11.3 - case sensetivity and URLs
>
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Tony Lewis wrote:
>> Micah Cowan wrote:
>>
>>> Unfortunately, nothing really comes to mind. If you'd like, you could
>>> file a feature request at
>>> https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option
>>> asking Wget to treat URLs case-insensitively.
>>
>> To have the effect that Allan seeks, I think the option would have to
>> convert all URIs to lower case at an appropriate point in the process.
>> I think you probably want to send the original case to the server
>> (just in case it really does matter to the server). If you're going to
>> treat different case URIs as matching then the lower-case version will
>> have to be stored in the hash. The most important part (from the
>> perspective that Allan voices) is that the versions written to disk
>> use lower case characters.
>
> Well, that really depends. If it's doing a straight recursive download,
> without preexisting local files, then all that's really necessary is to
> do lookups/stores in the blacklist in a case-normalized manner.
>
> If preexisting files matter, then yes, your solution would fix it.
> Another solution would be to scan directory contents for the first name
> that matches case insensitively. That's obviously much less efficient,
> but has the advantage that the file will match at least one of the
> "real" cases from the server.
>
> As Matthias points out, your lower-case normalization solution could be
> achieved in a more general manner with a hook. Which is something I was
> planning on introducing perhaps in 1.13 anyway (so you could, say, run
> sed on the filenames before Wget uses them), so that's probably the
> approach I'd take. But probably not before 1.13, even if someone
> provides a patch for it in time for 1.12 (too many other things to focus
> on, and I'd like to introduce the "external command" hooks as a suite,
> if possible).
>
> OTOH, case normalization in the blacklists would still be useful, in
> addition to that mechanism. Could make another good addition for 1.13
> (because it'll be more useful in combination with the rename hooks).
>
> - --
> Micah J. Cowan
> Programmer, musician, typesetting enthusiast, gamer,
> and GNU Wget Project Maintainer.
> http://micah.cowan.name/
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.6 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFIUua+7M8hyUobTrERAr0tAJ98A/WCfPNhTOQ3Xcfx2eWP2stofgCcDUUQ
> nVYivipui+0TRmmK04kD2JE=
> =OMsD
> -----END PGP SIGNATURE-----
>
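
To make the two ideas from Micah's message concrete (case-normalized lookups/stores in the blacklist, and scanning a directory for the first name that matches case-insensitively), here is a minimal Python sketch. It is only an illustration under my own assumptions, not Wget's code: Wget is written in C, and the class and function names below are invented for the example.

```python
import os
from typing import Optional


class CaseInsensitiveBlacklist:
    """Set of already-seen URLs, compared without regard to case (hypothetical helper)."""

    def __init__(self) -> None:
        self._seen = set()

    def add(self, url: str) -> bool:
        """Record url; return True if it was new, False if a case variant was already seen."""
        key = url.casefold()  # normalize case only for the lookup key
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def find_existing_name(directory: str, name: str) -> Optional[str]:
    """Return the first entry in directory that matches name case-insensitively, else None."""
    wanted = name.casefold()
    # Sorted so that "first match" is deterministic; this is a linear scan per lookup.
    for entry in sorted(os.listdir(directory)):
        if entry.casefold() == wanted:
            return entry
    return None
```

The first helper keeps the original URL for the request and normalizes only the key used for duplicate detection, which is enough for a fresh recursive download. The second is the slower alternative for preexisting files: it returns whichever on-disk spelling already exists, so the local name still matches one of the cases the server actually used.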



-- 
-mmw
