Re: [Bug-wget] WARC output
Hi Gijs,

Gijs van Tulder gvtul...@gmail.com writes:

>> can you please send a complete diff against the current development tree version?
>
> Here's the diff of the WARC additions (1.9MB zipped) to revision 2565:
> http://dl.dropbox.com/u/365100/wget_warc-20110926-complete.patch.bz2

the patch is huge and I think we don't want to add so many files to the wget tree. Can't we assume the user will install the warc tools by herself and let configure check whether they are installed? This will require some more work, but the result will be much less intrusive. What do you think?

Thanks,
Giuseppe
Re: [Bug-wget] WARC output
Giuseppe Scrivano wrote:

> the patch is huge and I think we don't want to add so many files to the wget tree. Can't we assume the user will install the warc tools by herself and let configure check whether they are installed? This will require some more work, but the result will be much less intrusive. What do you think?
>
> Thanks,
> Giuseppe

I don't think all those files are even remotely needed. I am seeing, for instance, Python files for creating WARCs by interacting with curl. Why would that be useful in the wget repository? I (optimistically) think we could make WARC files with a simpler implementation. Also, the patch seems to duplicate code (compare lines 337731-337810 with 337944-338013 in the patch file). Surely that could be refactored?
Re: [Bug-wget] WARC output
Gijs van Tulder gvtul...@gmail.com writes:

> Hi. It's been a while since we've discussed the WARC addition to Wget. Is there anything I can help with?

can you please send a complete diff against the current development tree version? I'll take a look at it ASAP.

Thanks,
Giuseppe
Re: [Bug-wget] WARC output
> can you please send a complete diff against the current development tree version?

Here's the diff of the WARC additions (1.9MB zipped) to revision 2565:
http://dl.dropbox.com/u/365100/wget_warc-20110926-complete.patch.bz2

Thanks,
Gijs
Re: [Bug-wget] WARC output
Hi. It's been a while since we've discussed the WARC addition to Wget. Is there anything I can help with? Gijs
Re: [Bug-wget] WARC output
Gijs van Tulder gvtul...@gmail.com writes:

> It would be cool if Wget could become one of these tools. Already the Swiss army knife for mirroring websites, the one thing that Wget is missing is a good way to store these mirrors. The current output of --mirror is not sufficient for archival purposes:

Sure we do!

> With some help from others, I've added WARC functions to Wget. With the --warc-file option you can specify that the mirror should also be written to a WARC archive. Wget will then keep everything, including

Can you please track all contributors? Any contribution to GNU wget requires copyright assignments to the FSF.

> Do you think this is something that could be included in the main Wget version? If that's the case, what should be the next step?

Sure, I will take a look at the code in the next few days. In the meantime, can you check whether you are following the GNU Coding Standards for the new code[1]?

> The implementation makes use of the open source WARC Tools library (Apache License 2.0):
> http://code.google.com/p/warc-tools/

how much code is really needed from that library? I wonder if we can avoid this dependency altogether.

Cheers,
Giuseppe

1) http://www.gnu.org/prep/standards/
Re: [Bug-wget] WARC output
Giuseppe Scrivano writes:

>> The implementation makes use of the open source WARC Tools library (Apache License 2.0):
>> http://code.google.com/p/warc-tools/
>
> how much code is really needed from that library? I wonder if we can avoid this dependency at all.

The library comes with some utilities, an HTTrack plugin, a Java module etc. These extras are not needed for Wget. But of the C library, I used pretty much everything. The library handles all the WARC writing. It can also read WARCs, but that's not needed here. Rough estimate: 12,000 lines of code (excluding comments).

It's probably important to note that I have changed a few small things in the warc-tools library. (I have records in Git.)

As for the other dependencies:
- I used an MIT-licensed base32 encoder (there seems to be no such module in Gnulib), but that's quite small, so it could be replaced;
- it links to the UUID library.

> Can you please track all contributors? Any contribution to GNU wget requires copyright assignments to the FSF.

Yes, it's all in the Git history, so it's easy to make a list. (There's only one other contributor of code; others helped with testing.)

> In the meantime, can you check whether you are following the GNU Coding Standards for the new code?

I tried to do that. So except for the warc-tools library, which uses a different standard, all new code follows the GNU standards (I hope).

Thanks,
Gijs
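
As an aside on the two small dependencies listed above: this thread does not say what they are used for, but in WARC files base32 commonly appears in digest headers (for example WARC-Payload-Digest) and UUIDs in WARC-Record-ID. Assuming that is their role here too, a minimal Python sketch shows how little code each needs when a standard library provides them. The function names are hypothetical and are not taken from the Wget patch or from warc-tools.

```python
import base64
import hashlib
import uuid


def payload_digest(payload: bytes) -> str:
    """Return a 'sha1:<base32>' label for the payload (assumed digest style)."""
    sha1 = hashlib.sha1(payload).digest()      # 20 raw bytes
    # 20 bytes encode to exactly 32 base32 characters, so there is no padding.
    return "sha1:" + base64.b32encode(sha1).decode("ascii")


def new_record_id() -> str:
    """Return a fresh record identifier in '<urn:uuid:...>' form."""
    return "<" + uuid.uuid4().urn + ">"


if __name__ == "__main__":
    print("WARC-Payload-Digest:", payload_digest(b"example payload"))
    print("WARC-Record-ID:", new_record_id())
```

In C the same two pieces would need a base32 routine and a UUID generator, which matches the two dependencies named in the message.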
Re: [Bug-wget] WARC output
That sounds awesome! You have my vote... :)

On Tue, Aug 9, 2011 at 4:49 AM, Gijs van Tulder gvtul...@gmail.com wrote:

> Hi,
>
> I'd like to propose a new feature that allows Wget to make WARC files. Perhaps you're already familiar with it, but in short: WARC is a file format for web archives. In a single WARC file, you can store every file of the website, plus the HTTP request and response headers and other metadata. This makes it a very useful format for web archivists: you keep everything together, in the most detailed and original form.
>
> The WARC format (an ISO standard, ISO 28500) has been developed by the International Internet Preservation Consortium, which includes the Internet Archive and many national libraries. It is supposed to become *the* standard file format for web archives. For example, it is used in the Internet Archive's Wayback Machine and its Heritrix crawler.
>
> There are several projects building tools to work with WARC files. It would be cool if Wget could become one of these tools. Already the Swiss army knife for mirroring websites, the one thing that Wget is missing is a good way to store these mirrors. The current output of --mirror is not sufficient for archival purposes:
>
> - it throws away the HTTP headers (of the request and response);
> - it doesn't keep 404 pages and redirects;
> - it doesn't store the original URLs but mangles the filenames;
> - and, if you're not careful, it even rewrites the links inside the documents that it has downloaded.
>
> The WARC format supports these things.
>
> With some help from others, I've added WARC functions to Wget. With the --warc-file option you can specify that the mirror should also be written to a WARC archive. Wget will then keep everything, including the HTTP request and response headers, redirects and 404 pages.
>
> Do you think this is something that could be included in the main Wget version? If that's the case, what should be the next step?
>
> Description, links to more information about WARC:
> http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
>
> Code:
> https://github.com/alard/wget-warc/
> https://github.com/downloads/alard/wget-warc/wget-warc-20110809.tar.bz2
>
> The implementation makes use of the open source WARC Tools library (Apache License 2.0):
> http://code.google.com/p/warc-tools/
>
> I look forward to your response.
>
> Kind regards,
>
> Gijs van Tulder

--
Patrick Steil | ChurchBuzz.org
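
To make the format described in the proposal more concrete, here is a rough Python sketch of a single WARC "response" record. It wraps the raw HTTP response, status line and headers included, so a 404 page is kept just like any other response. The header names follow WARC/1.0 (ISO 28500); the function name and the example URL are made up for illustration and say nothing about how the Wget patch or warc-tools actually build records.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(target_uri: str, http_response: bytes) -> bytes:
    """Wrap a raw HTTP response (headers and body) in one WARC/1.0 record."""
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        "WARC-Target-URI: " + target_uri,   # the original URL, unmangled
        "WARC-Date: " + datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "WARC-Record-ID: <" + uuid.uuid4().urn + ">",
        "Content-Type: application/http; msgtype=response",
        "Content-Length: " + str(len(http_response)),
    ]
    # A blank line ends the WARC header; two CRLFs end the record.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + http_response + b"\r\n\r\n"


if __name__ == "__main__":
    raw = b"HTTP/1.1 404 Not Found\r\nContent-Type: text/plain\r\n\r\nnot here"
    record = warc_response_record("http://example.com/missing", raw)
    print(record.decode("utf-8"))
```

A full archive would be a sequence of such records (requests, responses, metadata) written one after another to the same file, which is why nothing from the HTTP exchange has to be thrown away.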