First up - there are several single-button solutions out there already 
(and many that deliver something more useful than PDF, such as 
structured XML). So why build another mousetrap?

Secondly, I wrote a handbook for Minerva on cost reduction in 
digitisation; see:
http://www.minervaeurope.org/publications/costreduction.htm

The problem is not the technology or the software; it is the labour and 
person-hours required. Any time human intervention is needed, the costs 
go up.

A real problem for museums is that a very large proportion of their 
materials require some form of special handling. The cost of removing 
an item from its preservation enclosure, getting it to the scanning 
mechanism, and then returning it to that enclosure far outweighs the 
cost of pressing the button to image the thing.

I have set up dozens of point-and-click solutions that are extremely 
easy to use, as well as automated page-turning solutions. The staffing 
issue never goes away, though, and remains the defining cost issue. One 
very well-known institution runs automated page-turning scanners with 
staff sitting and watching to ensure they don't foul up - a seemingly 
nonsensical approach necessitated by the nature of the core materials. 
Google's book-scanning operation is (anecdotally) rejecting 3-5% of 
library books from its mass digitisation programme purely on format and 
condition - so automated imaging is never a 100% solution.

The other real problem is that many collections are under-catalogued, 
so making the digital objects useful requires a great deal of human 
intervention to add metadata. There is thus a dichotomy: it is possible 
to deliver far more images than can be effectively managed and 
described in ways that add value to the content and context. So unless 
the content is text-driven and self-describing (which pretty much means 
science and newspaper-type content only), there are substantial 
metadata costs attached.

I want significant amounts of content to be available online and 
large-scale digitisation to happen - I have worked on many such 
projects. What I think gets too much emphasis is the "technology will 
save us" thinking around these issues. Technology is only a partial 
solution, and raising it to the level of a panacea is a mistake.

PS. All our research shows that the long-tail economic model does not 
apply very well to museum collections - hence my scepticism.

Best,
        Simon Tanner
        King's College London

akeshet at imj.org.il wrote:
> I've been following this discussion, and for what it's worth, I must say that 
> I fully support those
> who have emphasized the labor issues.  There's no lack of software (or 
> hardware) solutions.  The REAL
> problem is capable manpower, and what it costs.  That goes for museums of any 
> size.
> 
> The "one-button, anyone can do it" idea is usually misleading. Before and 
> after that "for dummies" methodology can be used, a lot of professional staff 
> has to do a lot of work -- there is a huge "labor bottleneck" involved in any 
> project of this kind.  That costs time and money, and that is the real 
> problem in search of a funding solution.
> 
> 
> Amalyah Keshet
> Head of Image Resources & Copyright Management
> The Israel Museum, Jerusalem
> 
> 
> ________________________________
> 
> From: mcn-l-bounces at mcn.edu on behalf of Cherry, Rich
> Sent: Fri 08/08/2008 00:49
> To: Museum Computer Network Listserv
> Subject: Re: [MCN-L] RESPONSES: Low-cost digitization rig
> 
> I think your answers address some of the labor issues but not all.  I
> don't want to sound like I am against this (as I am not) but I think
> there are some major practical issues that cannot be addressed with
> interns and volunteers in most cases:
> 
> Who leads the project (gets the institution behind it, finds the free
> labor source, finds space, organizes tasks and manages a schedule)?
> Who selects the material?
> Who moves the material to the location for scanning (is the free labor a
> security issue?)?
> Who reviews the material to see if there are copyright issues?
> Who proofs the final product to see if errors were made?
> Where will the product live when the funding for online archives
> disappears?
> If there is no cataloging for access other than the OCR, is the only use
> a huge repository of unconnected individual pages? And if it's books and
> collections, who catalogs them and connects them?
> 
> I think some institutions will find a way to accomplish the tasks above,
> and the product described below would help them. But my guess is that if
> they can get the buy-in required to expend the effort listed above, very
> few would be stopped by the software being a little expensive or hard to
> use.
> 
> To put it another way:  I don't think there are a lot of institutions
> sitting on the fence waiting for a low-cost software solution to the
> problem.  They are trying to figure out the challenges listed above.
> 
> I do think that the online archive piece might move a few organizations
> closer to doing it.  It might even be more attractive if some of the OCR
> processing took place there as well.
> 
> Is part of the plan to use something like Amazon Web Services for this?
> 
> Rich
> 
> 
> Rich Cherry
> Director of Operations
> Skirball Cultural Center
> 2701 N. Sepulveda Blvd.
> Los Angeles, CA 90049
> Work: (310) 440-4777
> Fax: (310) 440-4595
> rcherry at skirball.org
> 
> 
> 
> -----Original Message-----
> From: mcn-l-bounces at mcn.edu [mailto:mcn-l-bounces at mcn.edu] On Behalf Of
> Christopher J. Mackie
> Sent: Thursday, August 07, 2008 8:13 AM
> To: mcn-l at mcn.edu
> Subject: [MCN-L] RESPONSES: Low-cost digitization rig
> 
> Thanks to everyone who responded to my query yesterday about
> digitization rigs. The responses were extremely helpful, and I am most
> grateful.
> 
> There were so many responses on and offline, some of which include
> several back-and-forths, that I'm not going to attempt a summary of the
> questions and concerns; instead, let me respond by providing what I hope
> will be a clearer and more comprehensive explanation of what we're
> talking about (with my apologies for not making all of it clearer the
> first time :-).
> 
> 1) The focus of the project is on the software, not the hardware. We're
> using off-the-shelf, consumer-grade hardware so that we don't have to
> think about hardware at all; the whole reason to create the software
> now, as some of you noted, is that the hardware has reached the point
> where it's no longer the bottleneck. The sub-$2,000 configuration we
> describe will produce 600dpi output, for example--and runs fast enough
> that the page-turner, not the imaging hardware, is the throughput
> bottleneck. (For those who really want to know, it's a two-camera,
> stereo rig that shoots both pages of a book at once and uses the stereo
> plus some other tricks to dewarp and align text.)
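> 
> To give a feel for the dewarping step: the stereo surface modelling is
> beyond a listserv post, but the final projective correction is routine.
> Here is a minimal single-camera sketch in Python with OpenCV; the corner
> coordinates, file names, and page size are placeholder assumptions, and
> a real curved page would first need the depth model described above.
> 
>     import cv2
>     import numpy as np
> 
>     # Page corners in the photo (pixels). In the rig these would come
>     # from the stereo depth model; here they are hard-coded placeholders.
>     src = np.float32([[112, 80], [1490, 95], [1520, 2010], [90, 1980]])
> 
>     # Target: a flat, axis-aligned page at 600 dpi for a 6x9 inch page.
>     w, h = 3600, 5400
>     dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
> 
>     img = cv2.imread("page_left.tif")
>     M = cv2.getPerspectiveTransform(src, dst)
>     cv2.imwrite("page_left_flat.tif", cv2.warpPerspective(img, M, (w, h)))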
> 
> 2) The software will be, literally, one-button. Anyone who can turn a
> page and press a button will be able to use it--no training, no special
> skills required. Someone will have to assemble the rig (we anticipate a
> one-sheet instruction set, with room for many pictures :-), but once
> it's up, it will keep itself in calibration and perform all of the
> ancillary tasks of digitizing--dewarping, aligning, etc.--without user
> input. It will auto-structure the text, but there will also be
> opportunities for user interaction.
> 
> 3) The PDF output is the only *final* output we have discussed, and it's
> required for the 'flowing' of text for reader device independence. But
> the system preserves interim formats as well, including the original
> scanned TIFFs and the interim hOCR format, so getting data into other
> formats will be relatively straightforward.
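> 
> As a small illustration of why keeping the hOCR matters: hOCR is just
> (X)HTML with coordinates in title attributes, so a short script can
> re-target it to other formats. A minimal sketch, assuming the file is
> well-formed XHTML (the file name is a placeholder; messier output would
> need an HTML-tolerant parser):
> 
>     import re
>     from xml.etree import ElementTree
> 
>     def hocr_words(path):
>         """Yield (text, (x0, y0, x1, y1)) for each recognised word."""
>         for el in ElementTree.parse(path).iter():
>             if 'ocrx_word' in (el.get('class') or ''):
>                 m = re.search(r'bbox (\d+) (\d+) (\d+) (\d+)',
>                               el.get('title') or '')
>                 if m and el.text:
>                     yield el.text, tuple(int(n) for n in m.groups())
> 
>     for word, box in hocr_words('page_0001.hocr'):
>         print(box, word)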
> 
> 4) The metadata to be input will be at the discretion of the institution
> doing the digitizing, and can be applied at the level of the individual
> image or collection of images (e.g., the book, or even a collection of
> books).
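> 
> (One way to picture this, with field names invented purely for
> illustration: collection-level metadata is entered once and inherited by
> every page image at export time.)
> 
>     # Invented fields, for illustration only.
>     collection = {
>         "title": "Annual Reports, 1901-1910",
>         "rights": "Public domain",
>     }
> 
>     def page_record(page_no, image_file):
>         rec = dict(collection)   # inherit the collection-level fields
>         rec.update({"page": page_no, "file": image_file})
>         return rec
> 
>     print(page_record(1, "0001.tif"))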
> 
> 5) The software is SOA and also scriptable, so it can easily be
> decomposed and extended (e.g., it will use the OCRopus open-source OCR
> system, which can itself be implemented as a set of services). We're
> going to build a seamless wrapper for institutions that don't have the
> tech capacity to do their own customization, but that won't prevent
> institutions that have the chops from doing whatever they want with it.
> As you might suspect from this, by the way, the software will be
> web-based and multi-user; this means that a larger institution could set
> up several volunteer-operated rigs, all feeding one staffer who's doing
> QA and adding/reviewing metadata.
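> 
> As a sketch of what "scriptable" means in practice: an institution with
> the chops could drive the pipeline in batch from a few lines of Python.
> The 'ocroscript recognize' invocation below is an assumption based on
> early OCRopus releases, and the directory names are placeholders.
> 
>     import subprocess
>     from pathlib import Path
> 
>     Path("hocr").mkdir(exist_ok=True)
>     # Run OCR over every scanned page and keep the hOCR alongside it.
>     for tif in sorted(Path("scans").glob("*.tif")):
>         hocr = subprocess.run(
>             ["ocroscript", "recognize", str(tif)],  # command name assumed
>             capture_output=True, text=True, check=True,
>         ).stdout
>         Path("hocr", tif.stem + ".hocr").write_text(hocr)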
> 
> 6) The system will include automated open archiving online (i.e., not at
> institutional expense) for both the raw images and the finished outputs
> (perhaps also the interim formats), with a stable, permanent URL;
> institutions that want to host locally will be able to do that
> instead/in addition. To respond in particular to Joe's point about
> Internet Archive; we've talked with IA already. I don't want to speak
> for them, but it's safe to say that we've taken their views into account
> and I think they'll be very pleased by what's delivered, should we move
> forward.
> 
> 7) The model we're thinking about is a "Long Tail" model; while it's
> entirely possible that even 'big' digitizers might benefit from the
> software, our primary purpose is to extend the capability to small
> institutions with small collections. In the model we envision, these
> smaller institutions won't be devoting staff--or anyone--full-time to
> the project, but will be digitizing using opportunistic labor
> (volunteers, interns, min-wage teens, etc.) as available. Staff will
> most likely only engage in small quantities, for QA and metadata
> purposes. This model trades time for money: it removes the labor
> bottleneck by allowing an institution to pay nearly nothing as long as
> it's willing to wait a while for the output, instead of getting results
> quickly but paying a lot for them. In short, we're envisioning a very
> different kind of 'scalability' than the one(s) that many of you
> mentioned.
> 
> 8) Several people inquired about the flowing PDFs. The technology will
> be fully open-source, so it will also be usable by projects other than
> this.
> 
> 9) One thing nobody mentioned, because I didn't mention it in the first
> place: because this project will use OCRopus, it has the potential to be
> used for texts in many languages (including math and chem formulae), not
> just English. How quickly and well that capability emerges will depend
> on how big and multi-lingual a community forms around OCRopus, but at
> least, unlike many commercial OCR products, there's hope....
> 
> As for the questions I asked originally, here's what I think I heard:
> 
> 1) Setting aside the concerns about whether/how the technology will
> work, many of you feel that this is likely to be quite valuable. Of
> those who do, most of you think that small institutions will be willing
> to publish most of the materials openly, provided they are supported
> (with hosting, etc.) in doing so.
> 
> 2) Quite a few of you remain skeptical that even a low-cost rig will
> enable small institutions to digitize, because you believe that labor
> remains the true bottleneck. I'd be interested to know if that remains
> true even in the context of the "Long Tail" model I described in (7),
> above, or whether you were thinking implicitly about a 'big'
> digitization project on fixed timelines.
> 
> Again, thanks *very* much to everyone who took the trouble to respond.
> We're still quite a ways from a final decision, so if you think of
> anything else you'd like to share, please don't hesitate to get in
> touch.
> 
> All best,  --Chris
> 
> Christopher J. Mackie
> Associate Program Officer
> Research in Information Technology
> The Andrew W. Mellon Foundation
> --
> 282 Alexander Rd.
> Princeton, NJ 08540
> --
> 140 E. 62nd St.
> New York, NY 10065
> --
> +1 609.924.9424 (office: GMT - 5:00)
> +1 609.933.1877 (mobile)
> +1 646.274.6351 (fax)
> cjmackie06 @ AIM
> cjmackie5 @ Yahoo
> --
> http://rit.mellon.org; http://www.mellon.org
> 
> _______________________________________________
> You are currently subscribed to mcn-l, the listserv of the Museum
> Computer Network (http://www.mcn.edu)
> 
> To post to this list, send messages to: mcn-l at mcn.edu
> 
> To unsubscribe or change mcn-l delivery options visit:
> http://toronto.mediatrope.com/mailman/listinfo/mcn-l

-- 
Simon Tanner
Director,  King's Digital Consultancy Services,
King's College London,
Centre for Computing in the Humanities,
26-29 Drury Lane, London WC2B 5RL
Tel: +44 (0)7887 691716 or Admin: +44 (0)20 7848 2861
Email: simon.tanner at kcl.ac.uk
http://www.kdcs.kcl.ac.uk/
