without touching the file system

On Thu, Jun 19, 2008 at 9:23 AM, mm w <[EMAIL PROTECTED]> wrote:
> a simple url-rewriting conf should fix the problem, wihout touch the file
> system
> everything can be done server side
>
> Best Regards
>
> On Thu, Jun 19, 2008 at 6:29 AM, Coombe, Allan David (DPS)
> <[EMAIL PROTECTED]> wrote:
>> Thanks everyone for the contributions.
>>
>> Ultimately, our purpose is to process documents from the site into our
>> search database, so probably the most important thing is to limit the
>> number of files being processed. The case of the URLs in the html
>> probably wouldn't cause us much concern, but I could see that it might
>> be useful to "convert" a site for mirroring from a non-case sensitive
>> (Windows) environment to a case sensitive (li|u)nix one - this would
>> need to include translation of urls in content as well as filenames on
>> disk.
>>
>> In the meantime - does anyone know of a proxy server that could
>> translate urls from mixed case to lower case? I thought that if we
>> downloaded using wget via such a proxy server we might get the
>> appropriate result.
>>
>> The other alternative we were thinking of was to post process the files
>> with symlinks for all mixed case versions of files and directories (I
>> think someone already suggested this - great minds and all that...). I
>> assume that wget would correctly use the symlink to determine the
>> time/date stamp of the file for determining if it requires updating (or
>> would it use the time/date stamp of the symlink?). I also assume that if
>> wget downloaded the file it would overwrite the symlink and we would
>> have to run our "convert files to" symlinks process again.
>>
>> Just to put it in perspective, the actual site is approximately 45gb
>> (that's what the administrator said) and wget downloaded > 100gb
>> (463,000 files) when I did the first process.
>>
>> Cheers
>> Allan
>>
>> -----Original Message-----
>> From: Micah Cowan [mailto:[EMAIL PROTECTED]
>> Sent: Saturday, 14 June 2008 7:30 AM
>> To: Tony Lewis
>> Cc: Coombe, Allan David (DPS); 'Wget'
>> Subject: Re: Wget 1.11.3 - case sensetivity and URLs
>>
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Tony Lewis wrote:
>>> Micah Cowan wrote:
>>>
>>>> Unfortunately, nothing really comes to mind. If you'd like, you could
>>>> file a feature request at
>>>> https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option
>>>> asking Wget to treat URLs case-insensitively.
>>>
>>> To have the effect that Allan seeks, I think the option would have to
>>> convert all URIs to lower case at an appropriate point in the process.
>>> I think you probably want to send the original case to the server
>>> (just in case it really does matter to the server). If you're going to
>>> treat different case URIs as matching then the lower-case version will
>>> have to be stored in the hash. The most important part (from the
>>> perspective that Allan voices) is that the versions written to disk
>>> use lower case characters.
>>
>> Well, that really depends. If it's doing a straight recursive download,
>> without preexisting local files, then all that's really necessary is to
>> do lookups/stores in the blacklist in a case-normalized manner.
>>
>> If preexisting files matter, then yes, your solution would fix it.
>> Another solution would be to scan directory contents for the first name
>> that matches case insensitively. That's obviously much less efficient,
>> but has the advantage that the file will match at least one of the
>> "real" cases from the server.
>>
>> As Matthias points out, your lower-case normalization solution could be
>> achieved in a more general manner with a hook.
>> Which is something I was
>> planning on introducing perhaps in 1.13 anyway (so you could, say, run
>> sed on the filenames before Wget uses them), so that's probably the
>> approach I'd take. But probably not before 1.13, even if someone
>> provides a patch for it in time for 1.12 (too many other things to focus
>> on, and I'd like to introduce the "external command" hooks as a suite,
>> if possible).
>>
>> OTOH, case normalization in the blacklists would still be useful, in
>> addition to that mechanism. Could make another good addition for 1.13
>> (because it'll be more useful in combination with the rename hooks).
>>
>> - --
>> Micah J. Cowan
>> Programmer, musician, typesetting enthusiast, gamer,
>> and GNU Wget Project Maintainer.
>> http://micah.cowan.name/
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.6 (GNU/Linux)
>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>>
>> iD8DBQFIUua+7M8hyUobTrERAr0tAJ98A/WCfPNhTOQ3Xcfx2eWP2stofgCcDUUQ
>> nVYivipui+0TRmmK04kD2JE=
>> =OMsD
>> -----END PGP SIGNATURE-----
>>
>
>
>
> --
> -mmw
>
--
-mmw
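
For readers of the archive, here is a minimal Python sketch of the two ideas Micah describes in the quoted message: doing lookups/stores in the table of visited URLs in a case-normalized manner (while keeping the original case for the request to the server), and scanning a directory for the first name that matches case-insensitively. This is not Wget code and does not reflect how Wget implements its blacklist. The names case_normalized_key, SeenUrls and find_existing_name are hypothetical, and leaving the query string untouched is an assumption made here, not something stated in the thread.

```python
import os
from urllib.parse import urlsplit, urlunsplit


def case_normalized_key(url):
    """Return a lower-cased form of the URL, for use as a lookup key only.

    Assumption: only scheme, host and path are folded; the query string
    is left alone because it may be case-sensitive to the application.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.lower(), parts.query, parts.fragment))


class SeenUrls:
    """Hypothetical stand-in for a crawler's table of visited URLs."""

    def __init__(self):
        # normalized key -> first URL seen, in its original case
        self._seen = {}

    def add_if_new(self, url):
        """Record url; return False if a case-variant was already seen."""
        key = case_normalized_key(url)
        if key in self._seen:
            return False
        self._seen[key] = url
        return True

    def original(self, url):
        """Return the first-seen spelling to send to the server, or None."""
        return self._seen.get(case_normalized_key(url))


def find_existing_name(directory, name):
    """Return the first directory entry matching name case-insensitively.

    This is a linear scan of the directory, so it is slower than a
    direct lookup, but the result is one of the names already on disk.
    """
    wanted = name.lower()
    for entry in sorted(os.listdir(directory)):
        if entry.lower() == wanted:
            return entry
    return None
```

With this, a crawler would call add_if_new() before queueing a link, so /Docs/Index.html and /docs/index.html are fetched only once, and would still request the spelling returned by original().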