A simple URL-rewriting configuration should fix the problem. Without touching the file system, everything can be done server-side.
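To make the idea concrete, here is a rough Python sketch of the logic such a rewrite rule would apply. It is not a server configuration, and the function name `lowercase_redirect_target` is made up for this example. It assumes that only the path should be lower-cased and that the query string is left alone.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_redirect_target(url):
    """Return `url` with its path lower-cased, or None if the path is
    already lower case (meaning no redirect would be needed)."""
    parts = urlsplit(url)
    lowered = parts.path.lower()
    if lowered == parts.path:
        return None
    # Assumption: only the path is normalized; scheme, host, query and
    # fragment are passed through unchanged.
    return urlunsplit(
        (parts.scheme, parts.netloc, lowered, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(lowercase_redirect_target("http://example.com/Docs/Index.HTML?Id=7"))
    print(lowercase_redirect_target("http://example.com/docs/index.html"))
```

A server that redirected every mixed-case request to the URL returned here would let wget see only one spelling of each document.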
Best Regards

On Thu, Jun 19, 2008 at 6:29 AM, Coombe, Allan David (DPS) <[EMAIL PROTECTED]> wrote:
> Thanks everyone for the contributions.
>
> Ultimately, our purpose is to process documents from the site into our
> search database, so probably the most important thing is to limit the
> number of files being processed. The case of the URLs in the HTML
> probably wouldn't cause us much concern, but I could see that it might
> be useful to "convert" a site for mirroring from a non-case-sensitive
> (Windows) environment to a case-sensitive (li|u)nix one - this would
> need to include translation of URLs in content as well as filenames on
> disk.
>
> In the meantime - does anyone know of a proxy server that could
> translate URLs from mixed case to lower case? I thought that if we
> downloaded using wget via such a proxy server we might get the
> appropriate result.
>
> The other alternative we were thinking of was to post-process the files
> with symlinks for all mixed-case versions of files and directories (I
> think someone already suggested this - great minds and all that...). I
> assume that wget would correctly use the symlink to determine the
> time/date stamp of the file for determining if it requires updating (or
> would it use the time/date stamp of the symlink?). I also assume that if
> wget downloaded the file it would overwrite the symlink and we would
> have to run our "convert files to symlinks" process again.
>
> Just to put it in perspective, the actual site is approximately 45 GB
> (that's what the administrator said) and wget downloaded more than
> 100 GB (463,000 files) when I did the first process.
>
> Cheers
> Allan
>
> -----Original Message-----
> From: Micah Cowan [mailto:[EMAIL PROTECTED]
> Sent: Saturday, 14 June 2008 7:30 AM
> To: Tony Lewis
> Cc: Coombe, Allan David (DPS); 'Wget'
> Subject: Re: Wget 1.11.3 - case sensetivity and URLs
>
> Tony Lewis wrote:
>> Micah Cowan wrote:
>>
>>> Unfortunately, nothing really comes to mind. If you'd like, you could
>>> file a feature request at
>>> https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option
>>> asking Wget to treat URLs case-insensitively.
>>
>> To have the effect that Allan seeks, I think the option would have to
>> convert all URIs to lower case at an appropriate point in the process.
>> I think you probably want to send the original case to the server
>> (just in case it really does matter to the server). If you're going to
>> treat different case URIs as matching then the lower-case version will
>> have to be stored in the hash. The most important part (from the
>> perspective that Allan voices) is that the versions written to disk
>> use lower case characters.
>
> Well, that really depends. If it's doing a straight recursive download,
> without preexisting local files, then all that's really necessary is to
> do lookups/stores in the blacklist in a case-normalized manner.
>
> If preexisting files matter, then yes, your solution would fix it.
> Another solution would be to scan directory contents for the first name
> that matches case insensitively. That's obviously much less efficient,
> but has the advantage that the file will match at least one of the
> "real" cases from the server.
>
> As Matthias points out, your lower-case normalization solution could be
> achieved in a more general manner with a hook. Which is something I was
> planning on introducing perhaps in 1.13 anyway (so you could, say, run
> sed on the filenames before Wget uses them), so that's probably the
> approach I'd take. But probably not before 1.13, even if someone
> provides a patch for it in time for 1.12 (too many other things to focus
> on, and I'd like to introduce the "external command" hooks as a suite,
> if possible).
>
> OTOH, case normalization in the blacklists would still be useful, in
> addition to that mechanism. Could make another good addition for 1.13
> (because it'll be more useful in combination with the rename hooks).
>
> --
> Micah J. Cowan
> Programmer, musician, typesetting enthusiast, gamer,
> and GNU Wget Project Maintainer.
> http://micah.cowan.name/

--
-mmw
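
For reference, the case-normalized blacklist lookup discussed in the quoted messages could be sketched roughly as follows. This is only an illustration in Python and not how Wget implements its blacklist. The class name `CaseInsensitiveSeen` and its methods are hypothetical, and lower-casing the entire URL (rather than only the path) is an assumption made to keep the example short.

```python
class CaseInsensitiveSeen:
    """Hypothetical stand-in for a crawler's set of already-queued URLs."""

    def __init__(self):
        # Maps the lower-cased URL to the first spelling encountered, so
        # the original case can still be sent to the server.
        self._seen = {}

    def add(self, url):
        """Record `url`. Return True if it is new, or False if a URL that
        differs only by case has already been recorded."""
        key = url.lower()
        if key in self._seen:
            return False
        self._seen[key] = url
        return True

    def original(self, url):
        """Return the first-seen spelling matching `url`, or None."""
        return self._seen.get(url.lower())


if __name__ == "__main__":
    seen = CaseInsensitiveSeen()
    for link in ("http://example.com/Docs/A.html",
                 "http://example.com/docs/a.HTML"):
        status = "queue" if seen.add(link) else "skip (case duplicate)"
        print(link, "->", status)
```

Storing the first-seen spelling as the value follows Tony's point that the original case should be sent to the server while the lower-case form is used as the hash key.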
