Re: [Bug-wget] WARC output

2011-10-08 Thread Giuseppe Scrivano
Hi Gijs,


Gijs van Tulder gvtul...@gmail.com writes:

 can you please send a complete diff against the current development
 tree version?

 Here's the diff of the WARC additions (1.9MB zipped) to revision 2565:

  http://dl.dropbox.com/u/365100/wget_warc-20110926-complete.patch.bz2

the patch is huge and I think we don't want to add so many files into
the wget tree.  Can't we assume the user will install the warc tools by
herself and let configure check if they are installed or not?  This will
require some more work but the result will be much less intrusive.  What
do you think?

Thanks,
Giuseppe



Re: [Bug-wget] WARC output

2011-10-08 Thread Ángel González

Giuseppe Scrivano wrote:

 the patch is huge and I think we don't want to add so many files into
 the wget tree.  Can't we assume the user will install the warc tools by
 herself and let configure check if they are installed or not?  This will
 require some more work but the result will be much less intrusive.  What
 do you think?

 Thanks,
 Giuseppe

I don't think all those files are even remotely needed.
I see, for instance, Python files for creating WARC files, interacting
with curl.

Why would that be useful in the wget repository?
I -optimistically- think we could make warc files with a simpler 
implementation.
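
As a rough illustration of how small the core of the format is, here is
a minimal Python sketch (not taken from the patch or from warc-tools)
that serializes a single WARC/1.0 record: a version line, a handful of
named header fields, a blank line, the content block, and two trailing
CRLFs. The function name warc_record and the example URL are made up
for illustration. A real implementation in Wget would be written in C
and would likely also need request records, digests and compression.

```python
import uuid
from datetime import datetime, timezone


def warc_record(record_type, target_uri, content_type, block,
                date=None, record_id=None):
    """Serialize one WARC/1.0 record as bytes (hypothetical helper)."""
    date = date or datetime.now(timezone.utc)
    record_id = record_id or "<urn:uuid:%s>" % uuid.uuid4()
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", record_id),
        ("WARC-Date", date.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(block))),
    ]
    # Version line, header fields, then an empty line before the block.
    head = "WARC/1.0\r\n"
    head += "".join("%s: %s\r\n" % (name, value) for name, value in headers)
    head += "\r\n"
    # Every record ends with two CRLFs after the content block.
    return head.encode("utf-8") + block + b"\r\n\r\n"


# Example: keep a 404 response verbatim, HTTP headers included.
http_response = (b"HTTP/1.1 404 Not Found\r\n"
                 b"Content-Length: 9\r\n"
                 b"\r\n"
                 b"not found")
record = warc_record("response", "http://example.com/missing",
                     "application/http; msgtype=response", http_response)
```

Appending such records to one file, one after another, gives a WARC
archive in its simplest form.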

Also, the patch seems to duplicate code (compare lines 337731-337810 with
337944-338013 in the patch file). Surely that could be refactored?




Re: [Bug-wget] WARC output

2011-09-26 Thread Giuseppe Scrivano
Gijs van Tulder gvtul...@gmail.com writes:

 Hi.

 It's been a while since we've discussed the WARC addition to Wget. Is
 there anything I can help with?

can you please send a complete diff against the current development tree
version?

I'll take a look at it ASAP.

Thanks,
Giuseppe



Re: [Bug-wget] WARC output

2011-09-26 Thread Gijs van Tulder

 can you please send a complete diff against the current development
 tree version?

Here's the diff of the WARC additions (1.9MB zipped) to revision 2565:

 http://dl.dropbox.com/u/365100/wget_warc-20110926-complete.patch.bz2

Thanks,

Gijs



Re: [Bug-wget] WARC output

2011-09-25 Thread Gijs van Tulder

Hi.

It's been a while since we've discussed the WARC addition to Wget. Is 
there anything I can help with?


Gijs



Re: [Bug-wget] WARC output

2011-08-10 Thread Giuseppe Scrivano
Gijs van Tulder gvtul...@gmail.com writes:

 It would be cool if Wget could become one of these tools. Already the
 Swiss army knife for mirroring websites, the one thing that Wget is
 missing is a good way to store these mirrors. The current output of
 --mirror is not sufficient for archival purposes:

Sure we do!



 With some help from others, I've added WARC functions to Wget. With
 the --warc-file option you can specify that the mirror should also be
 written to a WARC archive. Wget will then keep everything, including

Can you please track all contributors?  Any contribution to GNU wget
requires copyright assignments to the FSF.



 Do you think this is something that could be included in the main Wget
 version? If that's the case, what should be the next step?

Sure, I will take a look at the code in the next few days.  In the
meantime, can you check whether you are following the GNU Coding
Standards for the new code[1]?



 The implementation makes use of the open source WARC Tools library
 (Apache License 2.0):
  http://code.google.com/p/warc-tools/

How much code is really needed from that library?  I wonder if we can
avoid this dependency altogether.

Cheers,
Giuseppe



[1] http://www.gnu.org/prep/standards/



Re: [Bug-wget] WARC output

2011-08-10 Thread Gijs van Tulder

Giuseppe Scrivano writes:

 The implementation makes use of the open source WARC Tools library
 (Apache License 2.0):
   http://code.google.com/p/warc-tools/

 how much code is really needed from that library?  I wonder if we can
 avoid this dependency at all.

The library comes with some utilities, an HTTrack plugin, a Java module 
etc. These extra things are not needed for Wget. But of the C library, I 
used pretty much everything. The library handles all the WARC writing 
stuff. It can also read WARCs, but that's not needed here.


Rough estimate: 12,000 lines of code (excluding comments).

It's probably important to note that I have changed a few small things 
in the warc-tools library. (I have records in Git.)



As for the other dependencies:
- I used an MIT-licenced base32 encoder (there seems to be no such
  module in Gnulib), but that's quite small so could be replaced;
- it links to the UUID library.
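
For context on those two dependencies: WARC record IDs are
conventionally urn:uuid: URIs, and payload digests are conventionally
written as a Base32-encoded SHA-1, which is presumably why a UUID
library and a Base32 encoder are needed. A minimal Python sketch of
both, using only the standard library (the function names are
hypothetical; the actual Wget code is C and may differ):

```python
import base64
import hashlib
import uuid


def payload_digest(payload: bytes) -> str:
    """Return a digest label of the form 'sha1:<Base32 of SHA-1>'."""
    raw = hashlib.sha1(payload).digest()      # 20 raw bytes
    # 20 bytes encode to exactly 32 Base32 characters, with no padding.
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def new_record_id() -> str:
    """Return a fresh record ID as a urn:uuid: URI in angle brackets."""
    return "<urn:uuid:%s>" % uuid.uuid4()


digest = payload_digest(b"not found")
record_id = new_record_id()
```

Both pieces are small, which fits the remark that the Base32 encoder
could be replaced.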


 Can you please track all contributors?  Any contribution to GNU wget
 requires copyright assignments to the FSF.

Yes, it's all in the Git history, so it's easy to make a list. (There's 
only one other contributor of code, others helped with testing.)


 In the meanwhile, can you check if you are following the GNU Coding
 Standards for the new code?

I tried to do that. So except for the warc-tools library, which uses a 
different standard, all new code follows the GNU standards (I hope).


Thanks,

Gijs



Re: [Bug-wget] WARC output

2011-08-09 Thread Patrick Steil
That sounds awesome!  You have my vote... :)



On Tue, Aug 9, 2011 at 4:49 AM, Gijs van Tulder gvtul...@gmail.com wrote:

 Hi,

 I'd like to propose a new feature that allows Wget to make WARC files.

 Perhaps you're already familiar with it, but in short: WARC is a file
 format for web archives. In a single WARC file, you can store every file of
 the website, plus the HTTP request and response headers and other metadata.
 This makes it a very useful format for web archivists: you keep everything
 together, in the most detailed and original form.

 The WARC format (an ISO standard, ISO 28500) has been developed by the
 International Internet Preservation Consortium, which includes the Internet
 Archive and many national libraries. It is supposed to become *the* standard
 file format for web archives. For example, it is used in the Internet
 Archive's Wayback Machine and its Heritrix crawler. There are several
 projects building tools to work with WARC files.


 It would be cool if Wget could become one of these tools. Already the Swiss
 army knife for mirroring websites, the one thing that Wget is missing is a
 good way to store these mirrors. The current output of --mirror is not
 sufficient for archival purposes:

  - it throws away the HTTP headers (of the request and response);
  - it doesn't keep 404 pages and redirects;
  - it doesn't store the original URLs but mangles the filenames;
  - and, if you're not careful, it even rewrites the links inside
   the documents that it has downloaded.

 The WARC format supports these things.


 With some help from others, I've added WARC functions to Wget. With the
 --warc-file option you can specify that the mirror should also be written to
 a WARC archive. Wget will then keep everything, including the HTTP request
 and response headers, redirects and 404 pages.

 Do you think this is something that could be included in the main Wget
 version? If that's the case, what should be the next step?

 Description, links to more information about WARC:
  
  http://www.archiveteam.org/index.php?title=Wget_with_WARC_output

 Code:
  https://github.com/alard/wget-warc/
  https://github.com/downloads/alard/wget-warc/wget-warc-20110809.tar.bz2

 The implementation makes use of the open source WARC Tools library
 (Apache License 2.0):
  http://code.google.com/p/warc-tools/


 I look forward to your response.

 Kind regards,

 Gijs van Tulder



