Re: Accept and Reject - particularly for PHP and CGI sites

2008-03-20 Thread Todd Pattist





   When deciding whether it should delete a file afterwards, however, it
 uses the _local_ filename (relevant code also in recur.c, near "Either
 --delete-after was specified,"). I'm not positive, but this probably
 means query strings _do_ matter in that case. :p
 
 Confused? Coz I sure am!

I had thought there was already an issue filed against this, but upon
searching I discovered I was thinking of a couple of related bugs that
had been closed. I've filed a new issue for this:

https://savannah.gnu.org/bugs/?22670


I'm not sure whether this post should go into the buglist discussion or
here, but I'll put it here.

I have to say, I'm not sure this is properly classed as a bug. If
accept/reject applies to the original URL filename, why should the code
bother to apply it again to the local name? If the URL filename doesn't
pass the filters, wget doesn't retrieve the file, so it has nothing to
save. I assume the answer was to handle script and Content-Disposition
cases where you don't know what you're going to get back. If you match
only on the URL, you have no way to control traversal separately from
file retention, and that's something you definitely want (it's what you
get by default on conventional html-based sites). To put it another way,
I usually want to download all the php files and traverse all that turn
out to be html, but I may only want to keep the zips or jpgs. With two
checks, one before download on the URL filename and another after
download on the local filename, I've got control on cgi/php script-based
sites that is similar to the control on a conventional html site.
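
To make the two check points concrete, here's a simplified sketch of the
decision flow I mean (my own illustration, not wget's actual code; the
helper names and the single fnmatch pattern are stand-ins for wget's real
suffix/wildcard matching):

  #include <fnmatch.h>
  #include <stdbool.h>

  /* Hypothetical illustration only -- not wget's real code. */
  static bool passes_filters(const char *fname, const char *accept_pat)
  {
    /* wget's real matching handles suffix lists and wildcards;
       a single fnmatch pattern stands in for that here. */
    return fnmatch(accept_pat, fname, 0) == 0;
  }

  /* Check 1: before download, on the filename taken from the URL
     (query string not considered).  Decides whether to retrieve
     (and, if it turns out to be HTML, parse) the resource. */
  bool should_retrieve(const char *url_filename, const char *accept_pat)
  {
    return passes_filters(url_filename, accept_pat);
  }

  /* Check 2: after download, on the local filename (which may now
     include the query string, a Content-Disposition name, or an -E
     suffix).  Decides whether the saved file is kept or deleted. */
  bool should_keep(const char *local_filename, const char *accept_pat)
  {
    return passes_filters(local_filename, accept_pat);
  }

The point is just that the same pattern set gets consulted against two
different names.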

If this behavior is changed, then you'd probably need two sets of
accept/reject filters that could be defined separately: one set to
control traversal, and one to control file retention. I'd actually
prefer that, particularly with matching extended to the query string
portion of the URL. Right now it may be impossible to prevent
traversing some links: if you don't want to traverse
"index.php?mode=logout" but do want to get "index.php?mode=getfile",
there's no way to do it, since the URL filename is the same for both.
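
Purely to illustrate the idea (nothing like this exists in wget today;
the field names and patterns are made up), separate traversal and
retention filters might look something like:

  /* Hypothetical: independent pattern lists, matched against the full
     URL including the query string. */
  struct recurse_filters
  {
    const char *traverse_accept;  /* e.g. "index.php?mode=getfile*" */
    const char *traverse_reject;  /* e.g. "index.php?mode=logout*"  */
    const char *retain_accept;    /* e.g. "*.zip,*.jpg"             */
  };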

In the short term, it would help to add something to the documentation
in the accept/reject area, such as the following:

The accept/reject filters are applied to the filename twice: once to
the filename in the URL, before downloading, to determine whether the
file should be retrieved (and parsed for more links if it turns out
after download to be an html file), and again to the local filename,
after retrieval, to determine whether the file should be kept. The
local filename after retrieval may differ significantly from the URL
filename before retrieval, for several reasons:
1) The URL filename does not include any query string portion of the
URL, such as the string "?topic=16" in the URL
"http://site.com/index.php?topic=16". After download the file may be
stored as the local filename "[EMAIL PROTECTED]". Accept/reject
matching does not apply to the URL query string portion before
download, but will apply after download when the query string is
incorporated into the local filename.
2) When Content-Disposition support is on, the local filename may be
completely different from the URL filename. The URL
"index.php?getfile=21" may return a Content-Disposition header that
produces a local file named "some_interesting_file.zip".
3) The -E (html extension) switch adds an .html suffix, and the -nd
(no directories) switch may add a numeric suffix such as .1 to
distinguish duplicate files.

If the URL filenames of links found when the starting page is parsed
do not pass the accept/reject filters, the links will not be followed
and will not be parsed for more links unless the filename ends in .html
or .htm. If accept/reject filters are used on cgi, php, asp and similar
script-based sites, the URL filename must pass the filters (without
considering any query string portion) if the links are to be
traversed/parsed, and the local filename must pass the filters if the
retrieved files are to be retained.




Re: regarding wget and gsoc

2008-03-20 Thread Siddhant Goel
Hi.
I couldn't gather much from the wget source code, so I want to put up
some questions.
1. Does this necessarily have to be a change in the code of wget? Or
something separate?
2. If it does, could you please tell me which files I should look into
to get started? I have the general solution in mind, but to implement it
I need to know where to start. :)
One thing I had in mind. Does grepping the database file, each time a
download has to be made, sound good? This way you could check easily for
every issue mentioned here -
http://wget.addictivecode.org/FeatureSpecifications/MetaDataBase .
Thanks,
Siddhant


Session Info Database API concepts [regarding wget and gsoc]

2008-03-20 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

(I've rearranged your post slightly so that the answers are ordered more
conveniently for me.)

Siddhant Goel wrote:
 Hi.
 I couldn't gather much from the wget source code, so I want to put up some
 questions.
 1. Does this necessarily have to be a change in the code of wget? Or
 something separate?

I don't see how a separate program could get all the necessary
information to put this together; particularly things like content-type,
etc. Perhaps some of the top use cases might be implemented with some
sort of wrapper around log files, but even there, without access to
internal Wget structures, I don't see how even file-existence checks
could be done (since the external program would have to know what Wget
was considering downloading).

 One thing I had in mind. Does grepping the database file, each time a
 download has to be made, sound good? This way you could check easily for
 every issue mentioned here -
 http://wget.addictivecode.org/FeatureSpecifications/MetaDataBase .

Using grep each time Wget has to decide whether to download would be
absolutely hideous, in terms of efficiency. Wget would have to fork a
new process for grep, which would then read the entire database file on
each download. Other operations might require Wget to invoke grep
multiple times.

OTOH, if Wget builds appropriate data structures (hash tables and the
like), it can determine in constant time whether a given URL was already
downloaded in a previous session.

No, grep is not a viable option. Besides, wget is intended to run on
non-Unixen (notably, Windows, and VMS too, at least to some degree);
requiring grep (or other external utilities) is not desirable.
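
As a rough sketch of the constant-time approach (assuming wget's own
hash-table helpers from hash.c; the wrapper names here are made up and
the exact signatures may differ slightly):

  #include <string.h>   /* strdup */
  #include "hash.h"     /* wget's string-keyed hash table */

  static struct hash_table *downloaded;

  /* Record that URI was fetched and saved under LOCAL_NAME. */
  void
  sidb_remember (const char *uri, const char *local_name)
  {
    if (!downloaded)
      downloaded = make_string_hash_table (0);
    hash_table_put (downloaded, strdup (uri), strdup (local_name));
  }

  /* Constant-time lookup: the local name if URI was fetched in a
     previous session, NULL otherwise. */
  const char *
  sidb_local_name (const char *uri)
  {
    return downloaded ? hash_table_get (downloaded, uri) : NULL;
  }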

 2. If it does, could you please tell me which files I should look into
 to get started? I have the general solution in mind, but to implement it
 I need to know where to start. :)

It would be wholly new code, so it'd be in a new module. You'd need to
interface with it from existing code, naturally; but to start, one might
wish to design and code the module completely apart from Wget, and write
little test drivers to demonstrate how it might be used from within Wget.

For instance, you might start by writing functions with prototypes like:

  /** SIDB Writer Facilities **/
  sidb_writer *
  sidb_write_start(const char *filename);

  sidb_writer_entry *
  sidb_writer_entry_new(sidb_writer *w, const char *uri);

  void
  sidb_writer_entry_redirect(sidb_writer_entry *rw, const char *uri, int
redirect_http_status_code);

  void
  sidb_writer_entry_local_path(sidb_writer_entry *rw, const char
*fname);

  void
  sidb_entry_finish(sidb_writer_entry *rw);

  void
  sidb_write_end(sidb_writer *w);

  /** SIDB Reader Facilities **/
  sidb_reader *
  sidb_read(const char *filename);

  sidb_entry *
  sidb_lookup_uri(sidb_reader *reader, const char *uri);

  const char *
  sidb_entry_get_local_name(sidb_entry *e);

sidb is for Session Info DataBase. I've left out error-code returns; it
seems to me that wget will not normally want to terminate just because
an error occurred in writing the session info database. We can ask the
sidb module to spew warnings automatically when errors occur by giving
it flags, or supply it with an error callback function, etc. For
situations where we do want wget to abort immediately on sidb errors
(for instance, when continuing a session), it could check a sidb_error
function or some such.
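
One possible shape for those hooks (entirely hypothetical names, just to
make the idea concrete; none of this is part of the prototypes above):

  /* Hypothetical error-reporting hooks. */
  typedef void (*sidb_error_fn)(const char *message, void *userdata);

  /* Ask the writer to call FN on each write error instead of aborting. */
  void
  sidb_writer_set_error_handler(sidb_writer *w, sidb_error_fn fn,
                                void *userdata);

  /* For callers that would rather poll: nonzero if any write failed. */
  int
  sidb_error(const sidb_writer *w);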

The intent is that all the writer operations would take virtually no
time at all. The sidb_read function should take at most O(N log N) time
on the size of the SIDB file, and should take less than a second under
normal circumstances on typical machines, for a file with entries for a
thousand web resources. Thereafter, the other SIDB reader operations
should take virtually no time at all.

The sidb_lookup_uri function should be able to find an entry based
either on the URI that was specified in the corresponding call to
sidb_writer_entry_new, _or_ on any URI that was added to a resource
entry via sidb_writer_entry_redirect.
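
A rough usage sketch of the reader side, using the prototypes above (the
database filename and URIs are just the placeholders from the writer
example below):

  sidb_reader *sr = sidb_read(".wget-sidb");
  if (sr)
    {
      /* Either the original URI or a URI added via
         sidb_writer_entry_redirect should find the same entry. */
      sidb_entry *e = sidb_lookup_uri(sr,
                        "http://newsite.example.com/news.html");
      if (e)
        {
          const char *local = sidb_entry_get_local_name(e);
          /* e.g. skip re-downloading if `local` already exists on disk. */
          (void) local;
        }
    }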

Interleaving writes to different entries should be allowed and explicitly
tested (to prepare the way for multiple simultaneous downloads in the
future). That is, the following should be valid:

  sidb_writer *sw = sidb_write_start(".wget-sidb");
  sidb_writer_entry *foo, *bar;
  foo = sidb_writer_entry_new(sw, "http://example.com/foo");
  bar = sidb_writer_entry_new(sw, "http://example.com/bar");
 /* Add info to foo entry. */
  sidb_writer_entry_redirect(foo, "http://newsite.example.com/news.html",
                             301 /* example redirect status */);
 /* Add to bar entry. */
  sidb_writer_entry_local_path(bar, "example.com/bar");
 /* Add to foo entry again. */
  sidb_writer_entry_local_path(foo, "newsite.example.com/news.html");
  sidb_entry_finish(foo);
  sidb_entry_finish(bar);
  sidb_write_end(sw);

On reading back the information, calls to sidb_lookup_uri for either
"http://example.com/foo" or "http://newsite.example.com/news.html"
should both give back the same entry.

Re: Accept and Reject - particularly for PHP and CGI sites

2008-03-20 Thread Todd Pattist






  If we were going to leave this behavior in for some time, then I think
it'd be appropriate to at least mention it (maybe I'll just mention it
anyway, without a comprehensive explanation

It would probably be sufficient to just add a very brief mention to
the 1.11 docs of the two things that confused the heck out of me:
1) The accept/reject filters are applied twice, once to the URL
filename before retrieval and once to the local filename after
retrieval, and
2) A query string is not considered to be part of the URL filename.

You can probably imagine my confusion when I saw [EMAIL PROTECTED]
being saved. I then tried to prevent that link from being traversed
with a match on part of the query string, and I'd see that file
disappear, only to realize later that it had still been traversed. I had
no idea that the query string was not being considered during the
acc/rej match, nor that the process was performed twice.

I look forward to 1.12.





Google SoC 2008

2008-03-20 Thread Saint Xavier

Hi all,

I would like to participate in GSOC 2008. I saw some Wget project ideas on
http://www.gnu.org/software/soc-projects/ideas.html. I'm particularly
interested in improving international support and HTTP/1.1 header support,
and in some not-too-big tasks like FTP proxy authentication.

Do you have any specific requirements for applying? Has someone already
picked these or similar tasks? Is that enough to apply?

BTW, I'm an electronic engineering student in Belgium (Western Europe)
and a GNU enthusiast.

Regards,
Saint Xavier


Re: Google SoC 2008

2008-03-20 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Saint Xavier wrote:
 Hi all,
 
 I would like to participate in GSOC 2008. I saw some Wget project ideas on
 http://www.gnu.org/software/soc-projects/ideas.html. I'm particularly
 interested in improving international support and HTTP/1.1 header support,
 and in some not-too-big tasks like FTP proxy authentication.

Great! Those are certainly important improvements.

 Do you have any specific requirements for applying?

We don't have any specific requirements above and beyond the ones
mentioned at http://www.gnu.org/software/soc-projects/guidelines.html.

 Has someone already picked these or similar tasks?

Anything listed there (and more!) is fair game. The tasks aren't
picked until you submit your application: we rank the applications in
order of preference, and Google then assigns students to specific
project tasks. So the available tasks will all be assigned at the same time.

 Is that enough to apply?

I'm not sure what you mean by that; but GNU's proposal guidelines are at
http://www.gnu.org/software/soc-projects/guidelines.html. Application is
made through Google's website at http://code.google.com/soc/2008/.

Note that the better an idea you give us of how you plan to tackle the
problem, the better an idea we'll have of how suitable a candidate you
are; so please provide a fair amount of detail regarding what you plan
to do. If you have any code that is representative of your experience,
knowledge and/or style, please include that with your application as
well.

Good luck!

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH4tvc7M8hyUobTrERAr4cAJoDf5d5JFfOVMrrVHwaFcBZepuw5ACfQ4tr
14L5XBXdn04P/QG6ud868Ao=
=Q4Q8
-END PGP SIGNATURE-


Re: Session Info Database API concepts [regarding wget and gsoc]

2008-03-20 Thread Charles
On Fri, Mar 21, 2008 at 2:33 AM, Micah Cowan [EMAIL PROTECTED] wrote:
  The intent is that all the writer operations would take virtually no
  time at all. The sidb_read function should take at most O(N log N) time
  on the size of the SIDB file, and should take less than a second under
  normal circumstances on typical machines

What encoding will this session database file use?


Re: Toward a 1.11.1 release

2008-03-20 Thread Steven M. Schweda
 [...]  Is it even useful to _do_ prereleases?

   I was waiting for the version which integrated the (previously
suggested) VMS-related changes.  (There are some generic FTP-related
fixes hidden among the VMS-related ones, too, of course.)

   Perhaps the Summer of Code thing will turn up someone with interests
broader than Linux.



   Steven M. Schweda   [EMAIL PROTECTED]
   382 South Warwick Street(+1) 651-699-9818
   Saint Paul  MN  55105-2547