RE: .1, .2 before suffix rather than after

2007-11-06 Thread Christopher G. Lewis
Hmm - changing the rename schema would potentially create a HUGE issue with
clobbering.

For example, and quite hypothetical...

Given a directory with the following:
  index.html
  index-1.html
  index.1.html

All three are served by the server and rendered by the browser.  They are
distinct files given the file system and the URL interpretation of the file
system by the web server.

Now, Wget downloads index.html, then downloads it again.  Our choices for
the second file are:
  1) index.html.1
  2) index-1.html
  3) index.1.html

Of the three, only #1 is pretty much guaranteed *not* to exist on the web
server.  Why?  Because by changing the extension, we've changed the content
type.  So if our intentions are to not clobber (which, I believe, is the
whole point) we are *much* better off sticking with the current schema and
creating a file that most likely can't be served by the web server.

Note that this is quite a contrived example to illustrate the point.


However, my 2 cents on the behavior - It would be *wonderful* if wget could
look at the local file system and rename each existing version to file.ext.n+1, so
the new download is index.html, not index.html.1.  I've been caught a couple of
times by this, so to me the default behavior is backwards (i.e., the new file
should be named after the URL, and the older files get versioned).
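
A minimal sketch of that idea, as a hypothetical wrapper script (the file
name, URL, and versioning pattern are only illustrative assumptions, not
anything wget does today):

  #!/bin/sh
  # Hypothetical sketch: shift existing versions up by one so the fresh
  # download can keep the plain name (index.html).
  f=index.html

  # Find the highest existing version number (index.html.N).
  n=0
  while [ -e "$f.$((n + 1))" ]; do
    n=$((n + 1))
  done

  # Rename from the highest version downwards so nothing gets clobbered,
  # then push the current file to .1.
  while [ "$n" -ge 1 ]; do
    mv "$f.$n" "$f.$((n + 1))"
    n=$((n - 1))
  done
  [ -e "$f" ] && mv "$f" "$f.1"

  # The new download now lands on the plain name.
  wget "http://example.com/$f"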

Chris


Christopher G. Lewis
http://www.ChristopherLewis.com
 

 -Original Message-
 From: Hrvoje Niksic [mailto:[EMAIL PROTECTED] 
 Sent: Sunday, November 04, 2007 4:19 PM
 To: Wget
 Cc: Christian Roche
 Subject: Re: .1, .2 before suffix rather than after
 
 Hrvoje Niksic [EMAIL PROTECTED] writes:
 
  Micah Cowan [EMAIL PROTECTED] writes:
 
  Christian Roche has submitted a revised version of a patch 
 to modify
  the unique-name-finding algorithm to generate names in the pattern
  foo-n.html rather than foo.html.n. The patch looks good, and
  will likely go in very soon.
 
  foo.html.n has the advantage of simplicity: you can tell at a glance
  that foo.n is a duplicate of foo.  Also, it is trivial to remove
  the unwanted files by removing foo.*.
 
 It just occurred to me that this change breaks backward compatibility.
 It will break scripts that try to clean up after Wget or that in any
 way depend on the current naming scheme.
 



Re: .1, .2 before suffix rather than after

2007-11-06 Thread Micah Cowan

Christopher G. Lewis wrote:
 Hmm - changing the rename schema would potentially create a HUGE issue with
 clobbering.
 
 For example, and quite hypothetical...
 
 Given a directory with the following:
   index.html
   index-1.html
   index.1.html
 
 All three are served by the server and rendered by the browser.  They are
 distinct files given the file system and the URL interpretation of the file
 system by the web server.
 
 Now, Wget downloads index.html, then downloads it again.  Our choices for
 the second file are:
   1) index.html.1
   2) index-1.html
   3) index.1.html
 
 Of the three, only #1 is pretty much guaranteed *not* to exist on the web
 server.  Why?  Because by changing the extension, we've changed the content
 type.  So if our intentions are to not clobber (which, I believe, is the
 whole point) we are *much* better off sticking with the current schema and
 creating a file that most likely can't be served by the web server.

Of course you are 100% correct that it is the whole point.

However, while this is indeed a problem, I don't think it's a clobbering
problem. I believe Wget would then choose (or could be made to then
choose) index-2.html, etc, for the file which on the server is named
index-1.html.

Of course, while that would resolve clobbering, that would make it
virtually impossible to determine what file had what local name, which
is entirely unacceptable.

I wonder how Wget currently handles perverse cases like index.html.1
actually existing on the server and already on the local system. :)

 Note that this is quite a contrived example to illustrate the point.

Yeah. Unfortunately, though, something like page-1.html, page-2.html,
isn't quite so unlikely.

It's intended that Reget (I'll call it that for now, until we figure out
what the hell we're going to do with that whole cluster of
functionality) will have support for a database of download-session
metadata that would handle mappings between the remote URI and the
local file. With that, it'd be possible to construct a simple utility
which could be invoked like "reget-fmap http://example.com/foo.html",
and might spit out something like "./example.com/foo.html".
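
A rough sketch of how such a lookup might behave, assuming (purely
hypothetically) that the session metadata were kept as a tab-separated
URL-to-path map file named .reget-session.map:

  #!/bin/sh
  # Hypothetical reget-fmap: print the local path recorded for a URL.
  # Assumes map lines of the form: <url><TAB><local path>
  url="$1"
  awk -F '\t' -v u="$url" '$1 == u { print $2; exit }' .reget-session.map

With that, reget-fmap http://example.com/foo.html would print
./example.com/foo.html if such a mapping had been recorded, regardless of
which renaming scheme produced the local name.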

This might couple quite well with providing a plugin hook to control the
renaming scheme.

Given your excellent points, and the fact that I didn't get the
overwhelmingly positive response to this suggestion that I had
anticipated, I'd better table this patch. :(

 However, my 2 cents on the behavior - It would be *wonderful* if wget could
 look at the local file system and rename each existing version to file.ext.n+1, so
 the new download is index.html, not index.html.1.  I've been caught a couple of
 times by this, so to me the default behavior is backwards (i.e., the new file
 should be named after the URL, and the older files get versioned).

That would of course be substantially more work, and provide even
greater opportunity for race conditions and interoperability issues than
we already have, but I agree that it'd be nice-to-have.

Unfortunately, I don't think there's any way we'll ever do this in Wget:
 it'd be too confusing for people used to the current way. And while, as
Hrvoje pointed out, the currently proposed suffixes patch could
potentially break backwards compatibility, it's not likely to do so in a
harmful/destructive way, whereas any existing scripts that download
files and then erase the renamed ones will suddenly be destroying the
new data, rather than the old, if we reverse the renaming. :\

That problem is partly exacerbated by the fact that, from a certain
perspective, we ought to be able to stick our noses in the air and
claim that any scripts of that sort ought to have been telling Wget to
clobber files, rather than letting Wget rename them and trying to delete
them afterwards... but there is not currently a way to tell Wget to do that.
There is no way to ask Wget to clobber files when it normally wouldn't.

However, with the proper hook in Reget, it'd be easy enough to have a
plugin that handles it this way. Actually, since Reget will probably be
an entirely new beast, and we'll certainly have to break
compatibility with traditional Wget, we could consider making this the
default renaming mechanism for Reget; but I'm still concerned about the
extra work, race conditions, and potential for screwing with other
programs that may be operating on some of the files involved.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/



Re: .1, .2 before suffix rather than after

2007-11-06 Thread Andreas Pettersson

Hrvoje Niksic wrote:

It just occurred to me that this change breaks backward compatibility.
It will break scripts that try to clean up after Wget or that in any
way depend on the current naming scheme


I'm also a bit hesitant about changing the way files get named.

With a .1 at the absolute end of the filename I _know_ this file got its
name because there already was a file with the same name. If the new
file instead is named filename-1.jpg I cannot be certain whether this is
because of a file collision, or whether the original file really had this
name, which of course it might have had.


If a script is supposed to restore the original filename of a downloaded
file (perhaps for future downloads), it's easy to just cut the trailing
number, if there is one. How could that be done in an easy and safe
way if there may be a number before the extension, when I don't even
know whether that number is part of the original filename or not?
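
With the current trailing-number scheme that really is trivial; a quick
sketch (file names purely illustrative):

  # Strip a trailing .N collision suffix, if present.
  for f in index.html.1 photo.jpg.12 report.pdf; do
    echo "$f" | sed 's/\.[0-9][0-9]*$//'
  done
  # Prints: index.html, photo.jpg, report.pdf (unchanged when no suffix).

There is no equally safe rule for a number embedded before the extension,
since a -1 there might simply be part of the original name.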


And already having local files named -1.ext is not so uncommon. What 
happens if there is a local file with that name? -2.ext could be the 
answer, but that makes it really difficult to find downloaded files 
programmatically.


And how is .tar.gz renamed?  .tar-1.gz?

Sorry, but I'm not so sure about this..

--
Andreas




Delete a partial download after a timeout

2007-11-06 Thread moisi
Hello,

I use wget to retrieve an XML feed. The problem is that sometimes I get a timeout
error. I used the right wget options to handle this problem, but when wget
retries the download, the data is appended to the file, which results in a
corrupt file. Here is the wget command line:

wget -c -nc -O tmp/out.xml -t 5 -w 60 -T 60 http://url.of.the.feed/feed.xml

Can you help me fix this problem? I am not subscribed to the list, so please
CC me on the reply.
Thanks a lot

Sydney




  


Re: Delete a partial download after a timeout

2007-11-06 Thread Micah Cowan

moisi wrote:
 Hello,
 
 I use wget to retrieve an XML feed. The problem is that sometimes I
 get a timeout error. I used the right wget options to handle this
 problem, but when wget retries the download, the data is appended to
 the file, which results in a corrupt file. Here is the wget command line:
 
 wget -c -nc -O tmp/out.xml -t 5 -w 60 -T 60
 http://url.of.the.feed/feed.xml

Hm. I'm not sure how Wget could deal with this in a general way. We
could certainly ask Wget to truncate the output-file, when the
output-file refers to a truncate-able file; but it doesn't always (such
as with -O -).

I think you'd be better off not using -O, and instead renaming/appending
the file after downloading.
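
A sketch of that workaround, keeping the retry/timeout options and the feed
URL from the original message (the scratch directory and final name are just
illustrative):

  # Fetch to a scratch name without -O, and only replace the previous
  # good copy once the download has succeeded.
  rm -f tmp/feed.xml
  wget -t 5 -w 60 -T 60 -P tmp/ http://url.of.the.feed/feed.xml \
    && mv tmp/feed.xml tmp/out.xml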

I believe your use of -c and -nc with -O is not meaningful. -O is meant
to work very similarly to a redirection.

Hm... but current development Wget seems to have a regression with -nc
and -O in relation to Wget 1.10.2:

$ ls -a
. ..
$ wget -c -nc -O foo micah.cowan.name
--2007-11-06 10:55:28--  http://micah.cowan.name/
Resolving micah.cowan.name... 66.150.225.51
Connecting to micah.cowan.name|66.150.225.51|:80... connected.
HTTP request sent, awaiting response... 200 OK
File ‘foo’ already there; not retrieving.

$

Question to group: should -nc even work with -O? And if so, I suppose
the wget-1.10.2 behavior would be the expected...

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/



Re: .1, .2 before suffix rather than after

2007-11-06 Thread Hrvoje Niksic
Andreas Pettersson [EMAIL PROTECTED] writes:

 And how is .tar.gz renamed?  .tar-1.gz?

Ouch.


Re: Question re server actions

2007-11-06 Thread Micah Cowan

Alan Thomas wrote:
   I admittedly do not know much about web server responses, and I
 have a question about why wget did not retrieve a document. . . .
  
I executed the following wget command:
  
 wget --recursive --level=20 --append-output=wget_log.txt
 --accept=pdf,doc,ppt,xls,zip,tar,gz,mov,avi,mpeg,mpg,wmv --no-parent
 --no-directories --directory-prefix=TEST_AnyLogic_Docs
 http://www.xjtek.com
  
 However, it did not get the PDF document found by clicking on
 this link: http://www.xjtek.com/anylogic/license_agreement.  This URL
 automatically results in a download of a PDF file.
  
 Why?  Is there a wget option that will include this file? 

I believe it's being rejected because it doesn't end in a suffix that's
in your --accept list; it's a PDF file, but its URL doesn't end in .pdf.
It does use Content-Disposition to specify a filename, but the release
version of Wget doesn't acknowledge those.

If you use the current development version of Wget, and specify -e
content_disposition=on, it will download. If you're willing to try
that, you'll need to look at
http://wget.addictivecode.org/RepositoryAccess for information on how to
get the current development version of Wget (you should use the 1.11
repository, not mainline), and special building requirements.
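
In other words, roughly the quoted command above with the extra -e setting
added (all other options taken straight from your message):

  wget --recursive --level=20 --append-output=wget_log.txt \
    -e content_disposition=on \
    --accept=pdf,doc,ppt,xls,zip,tar,gz,mov,avi,mpeg,mpg,wmv \
    --no-parent --no-directories --directory-prefix=TEST_AnyLogic_Docs \
    http://www.xjtek.com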

HTH,
--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/



Bugs! [Re: Question re server actions]

2007-11-06 Thread Micah Cowan

Alan Thomas wrote:
 Thanks.  I unzipped those binaries, but I still have a problem. . . .
 
 I changed the wget command to:
 
 wget --recursive --level=20 --append-output=wget_log.txt -econtent_disposition=on --accept=pdf,doc,ppt,xls,zip,tar,gz --no-parent --no-directories --directory-prefix=TEST_AnyLogic_Docs http://www.xjtek.com
 
 However, the log file shows:
 
 --2007-11-06 21:33:55--  http://www.xjtek.com/
 Resolving www.xjtek.com... 207.228.227.14
 Connecting to www.xjtek.com|207.228.227.14|:80... connected.
 HTTP request sent, awaiting response... 200 OK
 Length: unspecified [text/html]
 --2007-11-06 21:34:11--  http://www.xjtek.com/
 Connecting to www.xjtek.com|207.228.227.14|:80... connected.
 HTTP request sent, awaiting response... 200 OK
 Length: unspecified [text/html]
 Saving to: `TEST_AnyLogic_Docs/index.html'
 
  0K ..  128K=0.08s
 
 2007-11-06 21:34:12 (128 KB/s) - `TEST_AnyLogic_Docs/index.html' saved
 [11091]
 
 Removing TEST_AnyLogic_Docs/index.html since it should be rejected.
 
 FINISHED --2007-11-06 21:34:12--
 Downloaded: 1 files, 11K in 0.08s (128 KB/s)
 
 The version of wget is shown as 1.10+devel.

Congratulations! Looks like you've discovered a bug! :\

And just in time, too, as we're expecting to release 1.11 any day now.

When I try your version with --debug, it looks like it thinks all the
links are trying to escape upwards: that is, it thinks that they
disobey your --no-parent. You should be able to remove the --no-parent
from your command-line, and it will work, as in your case there _are_ no
parents to traverse to, and the --no-parent is superfluous.
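
That is, something along these lines (just your quoted command with
--no-parent dropped):

  wget --recursive --level=20 --append-output=wget_log.txt \
    -e content_disposition=on --accept=pdf,doc,ppt,xls,zip,tar,gz \
    --no-directories --directory-prefix=TEST_AnyLogic_Docs \
    http://www.xjtek.com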

I also discovered (what I consider to be) a bug, in that

wget -e content_disposition=on --accept=pdf -r
http://www.xjtek.com/anylogic/license_agreement/

downloads the file to ./License_AnyLogic_6.x.x.pdf, rather than to
www.xjtek.com/file/114/License_AnyLogic_6.x.x.pdf (the dirname for which
matches its URL after redirection).

 Also, I'm not sure why - is required vice -- in front of the new
 option.

It's not a long option; it's the short option -e, followed by an
argument, content_disposition=on. There is not currently a long-option
version for this. Support for Content-Disposition will be enabled by
default in Wget 1.12, so a long-option probably won't be added (unless
it's to disable the support).

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
