Re: [CODE4LIB] WARC file format now ISO standard

2009-06-02 Thread st...@archive.org

point well taken. :)

there were no significant changes to the WARC format
between the last draft and the published standard.

you can use Heritrix WARCReader or WARC Tools warcvalidator
to verify that you have created a valid WARC in accordance
with the spec.
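for a quick sanity check without installing those tools, here is a
rough sketch in Python (stdlib only). it checks only the version line
and a few required named fields from the spec, nothing like the full
validation warcvalidator does — the example record bytes are made up:

```python
# Minimal WARC record-header sanity check (not a substitute for warcvalidator).
# Reads the first record's header block and verifies the version line and
# a few required named fields from ISO 28500.

REQUIRED = {"WARC-Record-ID", "Content-Length", "WARC-Date", "WARC-Type"}

def check_warc_header(raw: bytes) -> list:
    """Return a list of problems found in the first record header; empty = ok."""
    header, sep, _ = raw.partition(b"\r\n\r\n")
    if not sep:
        return ["no CRLFCRLF header terminator found"]
    problems = []
    lines = header.split(b"\r\n")
    if not lines[0].startswith(b"WARC/"):
        problems.append("missing WARC/<version> line")
    fields = {}
    for line in lines[1:]:
        name, _, value = line.partition(b":")
        fields[name.strip().decode("ascii", "replace")] = value.strip()
    missing = REQUIRED - fields.keys()
    if missing:
        problems.append("missing fields: " + ", ".join(sorted(missing)))
    if "Content-Length" in fields and not fields["Content-Length"].isdigit():
        problems.append("Content-Length is not a number")
    return problems

# a tiny hand-built record for illustration:
record = (b"WARC/1.0\r\n"
          b"WARC-Type: warcinfo\r\n"
          b"WARC-Record-ID: <urn:uuid:1234>\r\n"
          b"WARC-Date: 2009-06-02T21:00:00Z\r\n"
          b"Content-Length: 0\r\n"
          b"\r\n\r\n")
check_warc_header(record)  # → []
```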


/st...@archive.org


On 6/2/09 2:27 PM, Ray Denenberg, Library of Congress wrote:
But you have to pay $200 for the document that lists changes from last 
draft to first official version.


(Ok, Ok, it was just a joke. But you do get the point.)


- Original Message - From: "st...@archive.org" 
To: 
Sent: Tuesday, June 02, 2009 5:18 PM
Subject: Re: [CODE4LIB] WARC file format now ISO standard



hi Karen,

understood.

the final draft of the spec is available here:
http://www.scribd.com/doc/4303719/WARC-ISO-28500-final-draft-v018-Zentveld-080618 



and other (similar) versions here:
http://archive-access.sourceforge.net/warc/


/st...@archive.org



On 6/2/09 2:15 PM, Karen Coyle wrote:
Unfortunately, being an ISO standard, to obtain it costs 118 CHF 
(about $110 USD). Hard to follow a standard you can't afford to read. 
Is there an online version somewhere?


kc

st...@archive.org wrote:

hi code4lib,

if you're archiving web content, please use the WARC format.

thanks,
/st...@archive.org



WARC File Format Published as an International Standard
http://netpreserve.org/press/pr20090601.php

ISO 28500:2009 specifies the WARC file format:

* to store both the payload content and control information from
  mainstream Internet application layer protocols, such as the
  Hypertext Transfer Protocol (HTTP), Domain Name System (DNS),
  and File Transfer Protocol (FTP);
* to store arbitrary metadata linked to other stored data
  (e.g. subject classifier, discovered language, encoding);
* to support data compression and maintain data record integrity;
* to store all control information from the harvesting protocol
  (e.g. request headers), not just response information;
* to store the results of data transformations linked to other
  stored data;
* to store a duplicate detection event linked to other stored
  data (to reduce storage in the presence of identical or
  substantially similar resources);
* to be extended without disruption to existing functionality;
* to support handling of overly long records by truncation or
  segmentation, where desired.


more info here:
http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
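to make the list above concrete, here is a sketch of writing a single
WARC "resource" record with the stdlib only. the format allows each
record to be its own gzip member (that's the compression point above);
the URI and payload here are invented for illustration:

```python
# Sketch: serialize one gzip-compressed WARC "resource" record (stdlib only).
# Field names follow ISO 28500; the target URI and payload are made up.
import gzip
import io
import uuid
from datetime import datetime, timezone

def make_warc_record(uri: str, payload: bytes) -> bytes:
    """Serialize a single WARC resource record, gzipped as its own member."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        b"WARC/1.0",
        b"WARC-Type: resource",
        b"WARC-Target-URI: " + uri.encode(),
        b"WARC-Date: " + now.encode(),
        b"WARC-Record-ID: <urn:uuid:" + str(uuid.uuid4()).encode() + b">",
        b"Content-Length: " + str(len(payload)).encode(),
    ]
    record = b"\r\n".join(headers) + b"\r\n\r\n" + payload + b"\r\n\r\n"
    buf = io.BytesIO()
    # each record is its own gzip member, so a reader can seek to and
    # decompress records individually without unpacking the whole file
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(record)
    return buf.getvalue()

blob = make_warc_record("http://example.org/", b"hello")
```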







Re: [CODE4LIB] A Book Grab by Google

2009-05-20 Thread st...@archive.org

On 5/20/09 11:19 AM, Eric Hellman wrote:
> I don't see how bashing Google (which is NOT what the
> library association briefs are doing, btw) for gaps in US
> and international Copyright Law(orphan works, for example)
> will end up helping libraries.

i think the concern is that the settlement could give
_only_ Google the right to scan orphaned works, and no
one else. that certainly wouldn't help libraries.

/st...@archive.org


Re: [CODE4LIB] A Book Grab by Google [hack]

2009-05-19 Thread st...@archive.org

also, if your script can handle a redirect, you can use
our locator to find each item, e.g.

http://www.archive.org/download/librariesreaders00fostuoft/
http://www.archive.org/download/developmentofchi00tancuoft/
http://www.archive.org/download/rulesregulations00brituoft/

as the data does migrate occasionally for maintenance.
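the redirect-following approach can be sketched in a few lines of
Python — urllib follows HTTP redirects automatically, so a script built
on the stable locator keeps working when items migrate between nodes
(the identifier below is from the examples in this thread):

```python
# Sketch: use the stable archive.org locator instead of hard-coded
# storage-node URLs. urllib follows the HTTP redirect to whichever
# node currently holds the item, so scripts survive data migration.
from urllib.request import urlopen

def locator_url(identifier: str) -> str:
    """Build the stable download locator for an archive.org item."""
    return "http://www.archive.org/download/" + identifier + "/"

# resolving it is a network call; the final URL is whatever node
# currently hosts the item, e.g.:
# with urlopen(locator_url("librariesreaders00fostuoft")) as resp:
#     print(resp.geturl())
```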


/st...@archive.org



On 5/19/09 10:51 AM, raj kumar wrote:

On May 19, 2009, at 10:40 AM, Eric Lease Morgan wrote:


On May 19, 2009, at 1:24 PM, Eric Lease Morgan wrote:


I applaud the Internet Archive and the Open Content Alliance's
efforts.  archive.org++


Try this hack with Google Books, not.

$ echo http://ia300206.us.archive.org/3/items/librariesreaders00fostuoft/ > libraries.urls
$ echo http://ia310827.us.archive.org/0/items/developmentofchi00tancuoft/ >> libraries.urls
$ echo http://ia310832.us.archive.org/2/items/rulesregulations00brituoft/ >> libraries.urls
$ echo 'wget -erobots=off --wait 1 -np -m -nd -A _djvu.txt,.pdf,.gif,_marc.xml -R _bw.pdf -i $1' > mirror.sh
$ chmod +x mirror.sh
$ ./mirror.sh libraries.urls


Here is a script that will let you download all the books from archive.org:

http://blog.openlibrary.org/2008/11/24/bulk-access-to-ocr-for-1-million-books/ 



You'll have to slightly modify it to download the format you want...

-raj


[CODE4LIB] A Book Grab by Google

2009-05-19 Thread st...@archive.org

fyi - [the Google Book Settlement] "should not be approved"


A Book Grab by Google
by Brewster Kahle
Tuesday, May 19, 2009
Washington Post | Opinions
http://www.washingtonpost.com/wp-dyn/content/article/2009/05/18/AR2009051802637.html


/st...@archive.org


Re: [CODE4LIB] Wolfram Alpha (was: Another nail in the coffin)

2009-05-07 Thread st...@archive.org

thanks so much for your post Alex, i hadn't had a chance to
consider Wolfram|Alpha (WA) seriously until you posted the link
to the talk (and i had the time to actually watch it).

On 5/3/09 6:13 PM, Alexander Johannesen wrote:
> http://www.youtube.com/watch?v=5TIOH80Qg7Q
> Organisations and people are slowly turning into data
> producers, not book producers.

when i think of data producers, i think CRC press and the like,
companies that compile and publish scientific data. certainly
much of this data is now born-digital or being converted to
digital formats (or put on the web), rather than only being
published in books. but these organizations and people are
still producing data, and those that produce books are in a
rapidly changing space (aren't we all).

imo, the advent of WA will likely result in the production of
_more_ books, not less, and will almost certainly benefit
libraries and learners.

after watching Mr. Wolfram's talk, i realize that most of the
responses to Wolfram Alpha on the net appear to be missing the
point. more specifically,

* WA consists of curated (computable) data + algorithms (5M+
  lines of Mathematica) + (inverted) linguistic analysis[1] +
  automated presentation.

* afaict WA does not attempt to compete with Google or Wikipedia
  or open source/public science; they are all complementary and
  compatible!

* WA is admirably unique in its effort to make quality data
  useful, rather than merely organizing/regurgitating heaps of
  folk data and net garbage.

* the value added by WA is that it makes (so-called) public data
  "computable", in the NKS[2] sense, as executable Mathematica
  code.

as mentioned in the talk, Wolfram engineers take data from
proprietary, widely accepted, peer-reviewed sources (probably
familiar to any research librarian) and transform it into
datasets computable in the WA environment[3].

there is considerable confusion as to how WA compares to Google,
Wikipedia, and the Open Source world. i think Google is solving
a different problem with very different data, and Wikipedia (as
mentioned in the talk) is one of many input sources to WA. more
specifically,

* Google's input data set is un-curated (albeit cleverly ranked)
  links to web pages, plus _some_ data from the web. it (rightly)
  does not have "computable" data or the Mathematica
  computational engine, but does have many of the natural
  language and automatic presentation features, as well as a
  search engine query box type interface (which i think is the
  cause of much incorrect comparison).

* Wikipedia is merely folk input to WA, complementary but
  missing _quality_ data (think CRC press), computational
  algorithms, natural language processing, and automated
  presentation. the only basis for comparison i can see here is
  that both Wikipedia and WA contain a lot of useful information
  - however, what is done with and how you interact with that
  data is clearly very different.

* WA is not in danger of being "open-sourced" because curating
  and converting quality scientific data into computable
  datasets is non-trivial, and so is the Mathematica
  computational engine. the comparisons here, i think stem from
  the fact that it has a web interface, and much of the data is
  available from public sources. for many problem-solvers, i
  think it's natural to respond with, "hmmm, how would i have
  done this..."

ultimately, i think Wolfram Alpha will be an extremely valuable
tool for libraries, and could (hopefully) change the way
learners think about how to get information and solve problems.

i think it's exciting to think that it could steer learners and
researchers away from looking to the web (unfortunately, almost
always Google by default) for quick answers, and back to
thinking about how they can answer questions for themselves,
given quality information, and powerful tools for problem
solving.


/st...@archive.org


Notes:

[1] as mentioned near 0:39:00 in the video, Wolfram explains
that the natural language problem that WA attempts to solve
(like search engines) is different than the traditional one.
the traditional NLP problem is taking a mass of data produced
by humans and trying to make sense of it, while the query box
problem is taking short human utterances, and trying to
formulate a problem which is computable from a mass of data.

[2] A New Kind of Science
http://www.wolframscience.com/nksonline/toc.html
i must confess, i haven't completely digested this material.

[3] as a long-time MATLAB user in a former life, this makes a
lot of sense. in MATLAB, everything is a computable matrix, and
solving problems in that environment is about taking (highly
non-linear) real-world problems, and linearizing them to be
computable in the MATLAB environment. this approach has deep
mathematical roots, and is consistent in solving problems across
many scientific disciplines, so the kind of prob

Re: [CODE4LIB] Recommend book scanner?

2009-05-02 Thread st...@archive.org

On 5/1/09 8:27 PM, Lars Aronsson wrote:
Does anybody have a printed test sheet that we can scan or photo, 
and then compare the resulting digital images?  It should have 
lines at various densities and areas of different colours, just 
like an old TV test image.  Can you buy such calibration sheets?


archive.org scans typically include a color card target
image near the back (or front) of the book, e.g.

http://www-steve.us.archive.org/public/data/eg/birdsthateverych00doub/birdsthateverych00doub_jp2/birdsthateverych00doub_0371.jp2

typical specs for our scanning rig (scribe) are roughly:

  1 8x8x5' scribe structure
  2 Canon EOS 5Ds
  2 light boxes
  1 orthogonal glass platen and cradle
  1 foot pedal, pulley system
  1 Linux PC
    LAMP stack
    custom web-based UI
    gphoto, imagemagick, leptonica, rsync
  fast internet


we scan over 1,000 books a day with about 100 scribes like this.


/st...@archive.org



Re: [CODE4LIB] Linux Public Computers - time and ticket reservation system

2009-01-05 Thread st...@archive.org

hi Darrell,

thanks for your intriguing post.

a few observations: 1) this is one instance of the use
of a GNU/Linux system which may seem to be at odds with the
very premise of free (in the GNU sense) software, and that
is, to NOT limit the ability of users to do things. so your
use cases may seem odd at first, but you have a valid and
important case.

2) many open source programmers may not be familiar with
commercial software products (and may not want to be), so
you might have a better chance of getting an answer if you
do the groundwork of listing the features you are in search
of yourself, rather than asking the list to go learn them.

3) it seems that a good desktop linux distro would allow
an administrator or programmer to create a system (based on
the existing pieces you mention) that might consist of
some shell scripts, perhaps a "lite" database, a web server,
and client- and server-side scripts to accomplish the
features that you list, and then provide hooks for that
system to be made into a distributable package (e.g. Ubuntu).
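to make point 3 concrete, here is a minimal sketch of the session
time-limit piece — SQLite standing in for the "lite" database, with
hypothetical table and field names and an assumed one-hour policy:

```python
# Sketch: a minimal public-PC session table with a time limit,
# using SQLite as the "lite" database mentioned above.
# Table/field names and the one-hour limit are illustrative only.
import sqlite3
import time

LIMIT_SECONDS = 60 * 60  # one hour per session (assumed policy)

def start_session(db, pc_id: str) -> None:
    """Record the start of a session on the given public PC."""
    db.execute("INSERT INTO sessions (pc_id, started) VALUES (?, ?)",
               (pc_id, time.time()))

def seconds_left(db, pc_id: str) -> float:
    """Time remaining in the most recent session on pc_id; 0 if none."""
    row = db.execute("SELECT started FROM sessions WHERE pc_id = ? "
                     "ORDER BY started DESC LIMIT 1", (pc_id,)).fetchone()
    if row is None:
        return 0.0
    return max(0.0, LIMIT_SECONDS - (time.time() - row[0]))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sessions (pc_id TEXT, started REAL)")
start_session(db, "pc-07")
```

a cron job or client-side script could poll seconds_left and log the
user out (or lock the screen) when it reaches zero.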

i wouldn't be surprised if listing the desired features
explicitly seeded some capable programmer's mind to suggest
something (or even code something up) that helps you right
away. or, it may just prompt someone to remember
that something _does_ already exist that answers your needs.
(i think Francis' LibPrint suggestion seems very helpful)

just keep in mind that the very nature of the linux system
is organic, and the workforce is distributed and laissez-faire.
it doesn't seem to be very agile in responding to monolithic
deficiencies (just look at how we ended up with the linux
kernel vs. hurd :).


/st...@archive.org



On 12/30/08 12:37 PM, Darrell Eifert wrote:

Hi Folks --

Nicolaie Constantinescu recommended that I contact this list with my 
questions after posting a query to the "Linux in Libraries" group.  I 
will be presenting an introduction to Desktop Linux at the New Hampshire 
Library Association next year, and would like some help on answering a 
question that is sure to arise from my prospective audience.  Many 
librarians are intrigued by the possibility of lowering IT costs and 
maintenance time, especially for their public-use computers.  Right now 
however, there don't seem to be any open-source versions of a 
reservation / ticket system (such as the excellent WinXP "Time Limit 
Manager" from Fortres) and a desktop security application such as Deep 
Freeze.   There are commercial options from Groovix or Userful, but that 
pretty much defeats the practical goal of lowering IT costs, or the 
ideological goal of moving to free and open-source applications.


All the 'bits and pieces' for a good reservation and security system 
seem to be out there.  Edubuntu gives us a LTSP solution with a central 
server and the ability to see 'screenshots' of individual PCs if 
necessary.  CUPS gives a very fine-grained control over printing, and 
perhaps can be modified to function as a print-upon-payment release 
station.  A MySQL / PHP module could handle generating and storing 
random passwords / logins, while a small program to set folder 
permissions may be able to lock down a Gnome or KDE desktop to prevent 
users from changing icons, menus, or wallpaper.  Web content filtering 
is available from several sources if necessary.  A browser-based central 
server module might help to make the project "distro agnostic".
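the random-logins piece described above could be prototyped in a few
lines — `secrets` is Python's stdlib CSPRNG module, and the "guest-"
prefix and lengths here are purely illustrative:

```python
# Sketch: generating throwaway login/password pairs for public PCs,
# the job the MySQL/PHP module described above would do.
import secrets
import string

ALPHABET = string.ascii_lowercase + string.digits

def make_ticket(n: int = 8) -> tuple:
    """Return a (login, password) pair of random lowercase+digit strings."""
    login = "guest-" + "".join(secrets.choice(ALPHABET) for _ in range(4))
    password = "".join(secrets.choice(ALPHABET) for _ in range(n))
    return login, password

login, password = make_ticket()
```

the pairs could then be printed on a paper ticket at the desk and
inserted into the sessions database with an expiry time.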


I think many small and medium-sized libraries would be much more likely 
to consider the advantages of choosing Linux for their public-use 
computers if a polished open-source reservation and printing control 
system was available.  In the world of commercial software, an 
entrepreneur or company sees an opportunity, programs a solution, and 
sells the product.  On that model we have the aforementioned "Time 
Limit Manager" for XP (which we use here at the Lane Library and highly 
recommend) at a one-time cost of only $20 per PC.
In the world of Linux and open-source software, how does one go about 
getting a programmer or group of programmers to provide a free solution 
(with regular maintenance and updates) to a pressing need?  Would 
Canonical (for example) be interested in creating the program as a way 
to popularize Ubuntu with the thousands who use library computers every 
day?  Would anyone on this list be interested in spearheading such a 
project?  Is there a place to float such a project before a group of 
up-and-coming programmers (Google Summer of Code??) that would give them 
bragging rights on a resume?


Any ideas (including ideas on a basic programming framework or project 
"how to") would be more than welcome ...


Cheers,