[CODE4LIB] Mashed Library UK 2009 - registration now open
Hope this might be of interest to some of you. I'm not sure how feasible it'll be to stream and/or video the event, but we're currently looking into it.

regards
Dave Pattern
University of Huddersfield

Mashed Library UK 2009: Mash Oop North!
Date: Tuesday 7th July 2009
Time: 10.00am until late afternoon
Venue: University of Huddersfield, Huddersfield, HD1 3DH
Web site: http://mashlib09.wordpress.com
Fee: £15 (ex. VAT)
Speakers: Tony Hirst, Mike Ellis, Brendan Dawes, Richard Wallis and more
Primary sponsor: Talis

The first Mashed Library UK event, organised by Owen Stephens, was held at Birkbeck College in November 2008 with the aim of bringing together interested people and doing interesting stuff with libraries and technology. Further details about the 2008 event are available here: http://mashedlibrary.ning.com

The University of Huddersfield is proud to be hosting the second event, dubbed Mash Oop North!, which is being sponsored by Talis. The event will take place in Huddersfield on July 7th.

Mashed Library is aimed at librarians, library developers and library techies who want to learn more about Web 2.0/3.0, Library 2.0, creating mash-ups, and generally doing interesting/cool/useful things with data. In particular, we expect the event to generate the following outcomes for all attendees:

1) Awareness of the latest developments in library technology
2) Application of Web 2.0 technologies in a library context
3) Community building and networking
4) New skills learned and existing skills developed

The event is primarily an unconference, so attendees will be encouraged to participate throughout the day. Further information is available on the event blog: http://mashlib09.wordpress.com

A small token registration fee of £15 is the only charge for the event. Places are limited to around 60 delegates, so we would advise booking early to avoid disappointment!
Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All
Ray Denenberg, Library of Congress writes:
> Thanks, Ross. For SRU, this is an opportune time to reconcile these differences. Opportune, because we are approaching standardization of SRU/CQL within OASIS, and there will be a number of areas that need to change.

Agreed. Looking at the situation as it stands, it really does seem insane that we've ended up with these three or four different URIs describing each of the data formats; and if we, with our library background, can't get this right, what hope does the rest of the world have?

Because OpenURL 1.0 seems to have been more widely implemented than SRU (though much less so than OpenURL 0.1), I think it would be less painful to change SRU to adopt OpenURL's data-format URIs than vice versa; good implementations will of course recognise both old and new URIs.

> Some observations.
>
> 1. The 'ofi' namespace of 'info' has the advantage that the name, ofi, isn't necessarily tied to a community or application (I suppose one could claim that the acronym ofi means "openURL something-starting-with-'f' Identifiers", but it doesn't say so anywhere that I can find). However, the namespace itself (if not the name) is tied to OpenURL: "Namespace of Registry Identifiers used by the NISO OpenURL Framework Registry."

That seems like a simple problem to fix. (Changing that title would not cause any technical problems.)

> 2. In contrast, with the srw namespace, the actual name is srw. So at least in name, it is tied to an application.

Agreed -- another reason to prefer the OpenURL standard's URIs.

> 3. On the other side, the srw namespace has the distinct advantage of built-in extensibility. For the URI info:srw/schema/1/onix-v2.0, the "1" is an authority. There are (currently) 15 such authorities; they are listed in the (second) table at http://www.loc.gov/standards/sru/resources/infoURI.html. Authority 1 is the SRU maintenance agency, and the objects registered under that authority are, more or less, public. But objects can be defined under the other authorities with no registration process required.
>
> 4. ofi does not offer this sort of extensibility.

But SRU's has always been a clumsy extensibility mechanism -- the assignment of integer identifiers for sub-namespaces has the distinct whiff of an OID hangover. In these enlightened days, we use our domains for namespace partitioning, as with HTTP URLs. I'd like to see the info:ofi URI specification extended to allow this kind of thing:

  info:ofi/ext:miketaylor.org.uk:whateverTheHeckIWantToPutHere

> So, if we were going to unify these two systems (and I can't speak for the SRU community and commit to doing so yet), the extensibility offered by the srw approach would be an absolute requirement. If it could somehow be built in to ofi, then I would not be opposed to migrating the srw identifiers. Another approach would be to register an entirely new 'info:' URI namespace and migrate all of these identifiers to the new namespace.

Oh, gosh, no -- introducing yet ANOTHER set of identifiers is really not the answer! :-)

Mike Taylor, m...@indexdata.com, http://www.miketaylor.org.uk
"Conclusion: is left to the reader (see Table 2). Acknowledgements: I wrote this paper for money." -- A. A. Chastel, _A critical analysis of the explanation of red-shifts by a new field_, A&A 53, 67 (1976)
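The `info:ofi/ext:<domain>:<name>` form above is only Mike's proposal, not part of any published spec, but a sketch of what consuming it might look like is straightforward (the regex and function name here are illustrative, not from any standard):

```python
import re

# Sketch of a parser for the *proposed* (not standardized) extension form:
#   info:ofi/ext:<domain>:<local-name>
# where <domain> is a DNS name controlled by whoever mints the identifier.
EXT_URI = re.compile(r"^info:ofi/ext:(?P<domain>[A-Za-z0-9.-]+):(?P<name>\S+)$")

def parse_ext_uri(uri):
    """Return (domain, local_name) for a proposed info:ofi extension URI, else None."""
    m = EXT_URI.match(uri)
    return (m.group("domain"), m.group("name")) if m else None

print(parse_ext_uri("info:ofi/ext:miketaylor.org.uk:whateverTheHeckIWantToPutHere"))
# ('miketaylor.org.uk', 'whateverTheHeckIWantToPutHere')
print(parse_ext_uri("info:ofi/fmt:xml:xsd:MARC21"))  # None -- registered, not an extension
```

Domain-based partitioning like this is exactly the trick HTTP URLs and Java package names use: no central registry needed, because DNS ownership already disambiguates.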
Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All
Jonathan Rochkind writes:
> Crosswalk is exactly the wrong answer for this. Two very small overlapping communities of most library developers can surely agree on using the same identifiers, and then we make things easier for US. We don't need to solve the entire universe of problems. Solve the simple problem in front of you in the simplest way that could possibly work and still leave room for future expansion and improvement. From that, we learn how to solve the big problems, when we're ready. Overreach and try to solve the huge problem including every possible use case, many of which don't apply to you but SOMEDAY MIGHT ... and you end up with the kind of over-abstracted, over-engineered, too-complicated-to-actually-catch-on solutions that ... we in the library community normally end up with.

I strongly, STRONGLY agree with this. It's exactly what I was about to write myself, in response to Peter's message, until I saw that Jonathan had saved me the trouble :-)

Let's solve the problem that's in front of us right now: bring SRU into harmony with OpenURL in this respect, and the very act of doing so will lend extra legitimacy to the agreed-on identifiers, which will then be more strongly positioned as The Right Identifiers for other initiatives to use.

Mike Taylor, m...@indexdata.com, http://www.miketaylor.org.uk
"You cannot really appreciate Dilbert unless you've read it in the original Klingon." -- Klingon Programming Mantra
Re: [CODE4LIB] Recommend book scanner?
> On the other hand, there are projects like bkrpr [2] and [3], home-brew scanning stations built for marginally more than the cost of a pair of $100 cameras.

Cameras around $100 are very low quality. You could get nowhere near the dpi recommended for materials that need to be OCRed. The quality of images from such cameras would be not only low, but the OCR (even with the best software) would probably have many errors. For someone scanning items at home this might be OK, but for archival quality I would not recommend cameras. If you are grant funded and the grant provider requires a certain level of quality, you need to make sure the scanning mechanism you use can scan at that quality.

On Thu, Apr 30, 2009 at 11:49 AM, Erik Hetzner erik.hetz...@ucop.edu wrote:
> At Wed, 29 Apr 2009 13:32:08 -0400, Christine Schwartz wrote:
> > We are looking into buying a book scanner which we'll probably use for archival papers as well -- probably something in the $1,000.00 range. Any advice?
>
> Most organizations, or at least the big ones, Internet Archive and Google, seem to be using a design based on 2 fixed cameras rather than a traditional scanner-type device. Is this what you had in mind? Unfortunately none of these products are cheap. Internet Archive's Scribe machine cost upwards (3 years ago) of $15k, [1] mostly because it has two very expensive cameras. Google's data is unavailable. A company called Kirtas also sells what look like very expensive machines of a similar design.
>
> On the other hand, there are projects like bkrpr [2] and [3], home-brew scanning stations built for marginally more than the cost of a pair of $100 cameras. I think that these are a real possibility for smaller organizations. The maturity of the software and workflow is problematic, but with Google's Ocropus OCR software [4] freely available as the heart of a scanning workflow, the possibility is there. Both bkrpr and [3] have software currently available, although in the case of bkrpr at least the software is in the very early stages of development.
>
> best, Erik Hetzner
>
> 1. http://redjar.org/jared/blog/archives/2006/02/10/more-details-on-open-archives-scribe-book-scanner-project/
> 2. http://bkrpr.org/doku.php
> 3. http://www.instructables.com/id/DIY-High-Speed-Book-Scanner-from-Trash-and-Cheap-C/
> 4. http://code.google.com/p/ocropus/
>
> ;; Erik Hetzner, California Digital Library
> ;; gnupg key id: 1024D/01DB07E3
Re: [CODE4LIB] Recommend book scanner?
Amanda P wrote:
> Cameras around $100 are very low quality. You could get nowhere near the dpi recommended for materials that need to be OCRed. The quality of images from cameras would be not only low, but the OCR (even with the best software) would probably have many errors. For someone scanning items at home this might be OK, but for archival quality I would not recommend cameras. If you are grant funded and the grant provider requires a certain level of quality, you need to make sure the scanning mechanism you use can scan at that quality.

To capture an 8.5 x 11 image at 300 dpi, you need roughly 8.4 megapixels, which is well within the capabilities of an inexpensive pocket camera. (If you need 600 dpi, then you're in the 33.6 megapixel range.)

As to whether the quality will be sufficient, that depends on the goals and requirements of the project, but 300 dpi should be enough to get good OCR results for normal-sized text. Our very old version of PrimeOCR recommends 300 dpi, and suggests that 400 dpi may provide substantially better quality for text sizes smaller than 8 point, while 200 dpi will be sufficient for text of 12 points and up. At 300 and 400 dpi on 19th-century small-print, variable-quality texts, we are generally getting good to very good recognition: the quality of the original document itself is the limiting factor. More modern documents (and OCR software) should produce even better results.

The cameras used by the Internet Archive are only 12 megapixels, though they are of substantially higher quality than a Canon PowerShot. Some applications require very high quality images, and cheap cameras might not be able to deliver the goods; but if you just want to make sure the text of your documents is digitally preserved and/or available to read online, you don't really need all that much in the way of hardware. Using a pocket camera and a stand to digitize more than a few pages is going to be slow, clumsy and painful, but for many applications the end result may be entirely acceptable.

-William
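The megapixel figures above are easy to reproduce: pixels needed is just (width × dpi) × (height × dpi). A throwaway helper (the function name is mine, purely illustrative):

```python
def megapixels_needed(width_in, height_in, dpi):
    """Pixels (in millions) needed to capture a page of the given size at the given DPI."""
    return (width_in * dpi) * (height_in * dpi) / 1_000_000

# Letter-size page (8.5 x 11 inches) at various target resolutions:
print(megapixels_needed(8.5, 11, 300))  # 8.415  -- within reach of a cheap pocket camera
print(megapixels_needed(8.5, 11, 600))  # 33.66  -- well beyond consumer cameras of the era
print(megapixels_needed(8.5, 11, 200))  # 3.74   -- the 12-point-and-up OCR floor
```

Note this is the sensor resolution needed only if the page exactly fills the frame; any margin or misalignment eats into the effective DPI.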
Re: [CODE4LIB] Recommend book scanner?
At Fri, 1 May 2009 09:51:19 -0500, Amanda P wrote:
> > On the other hand, there are projects like bkrpr [2] and [3], home-brew scanning stations built for marginally more than the cost of a pair of $100 cameras.
>
> Cameras around $100 are very low quality. You could get nowhere near the dpi recommended for materials that need to be OCRed. The quality of images from cameras would be not only low, but the OCR (even with the best software) would probably have many errors. For someone scanning items at home this might be OK, but for archival quality I would not recommend cameras. If you are grant funded and the grant provider requires a certain level of quality, you need to make sure the scanning mechanism you use can scan at that quality.

I know very little about digital cameras, so I hope I get this right. According to Wikipedia, Google uses (or used) an 11 MP camera (Elphel 323). You can get a 12 MP camera for about $200. With a 12 MP camera you should easily be able to get 300 DPI images of book pages and letter-size archival documents. With a $100 camera you can get more or less 300 DPI images of book pages. *

The problems I have always seen with OCR had more to do with alignment and artifacts than with DPI. 300 DPI is fine for OCR as far as my (limited) experience goes -- as long as you have quality images.

If your intention is to scan items for preservation, then, yes, you want higher quality -- but I can't imagine any setup for archival quality costing anywhere near $1,000. If you just want to make scans and full-text OCR available, these setups seem worth looking at -- especially if the software workflow can be improved.

best, Erik

* 12 MP seems to equal 4256 x 2848 pixels. To take a 'scan' (photo) of a page at 300 DPI, that page would need to be no larger than 14.18 x 9.49 inches (dividing pixels by 300). As long as you can get the camera close enough to the page so as not to waste many pixels, you will be getting close to 300 DPI for pages of size 8.5 x 11 or smaller.

;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3
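The footnote's arithmetic can be run the other way: given a sensor resolution and a page size, compute the DPI you actually get when the page fills the frame. A sketch (the function name is mine; 4256 x 2848 is the 12 MP figure from the footnote):

```python
def effective_dpi(px_long, px_short, page_long_in, page_short_in):
    """DPI achieved when a page exactly fills the camera frame, matching the
    long page edge to the long sensor edge. The result is limited by whichever
    axis runs out of pixels first."""
    return min(px_long / page_long_in, px_short / page_short_in)

# 12 MP sensor (4256 x 2848) shooting a letter-size page (11 x 8.5 inches):
print(round(effective_dpi(4256, 2848, 11, 8.5)))   # 335
# The footnote's maximum 300 DPI page size, 14.18 x 9.49 inches:
print(round(effective_dpi(4256, 2848, 14.18, 9.49)))  # 300
```

So a 12 MP camera has a ~10% margin over 300 DPI on letter-size pages, consistent with the claim above, provided framing is tight.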
Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All
I am pleased to disagree to various levels of 'strongly' (if we can agree on a definition for it :-).

Ross earlier gave a sample of a 'crosswalk' for my MARC problem. What he supplied:

-snip
We could have something like:

  <http://purl.org/DataFormat/marcxml>
      skos:prefLabel "MARC21 XML" ;
      skos:notation "info:srw/schema/1/marcxml-v1.1" ;
      skos:notation "info:ofi/fmt:xml:xsd:MARC21" ;
      skos:notation "http://www.loc.gov/MARC21/slim" ;
      skos:broader <http://purl.org/DataFormat/marc> ;
      skos:description "..." .

Or maybe those skos:notations should be owl:sameAs -- anyway, that's not really the point. The point is that all of these various identifiers would be valid, but we'd have a real way of knowing what they actually mean. Maybe this is what you mean by a crosswalk.
--end

Is exactly what I meant by a crosswalk: basically a translating dictionary which allows any entity (system or person) to relate the various identifiers.

I would love to see a single unified set of identifiers; my life as a wrangler of record semantics would be so much easier. But I don't see it happening. That does not mean we should not try. Even a unification in our space (and if not in the library/information space, then where? as Mike said) reduces the larger problem. However, I don't believe it is a scalable solution (which may not matter -- if all of a group of users agree, then why not leave them to it), as at any time one group/organisation/person/system could introduce a new scheme, and a world view which relies on unified semantics would no longer be viable. Which means that until global unification on an object (better, a (large) set of objects) is achieved, it will be necessary to have the translating dictionary and systems which know how to use it. Unification reduces Ray's list of 15 alternative URIs to 14 or 13 or whatever. As long as that number is greater than 1, translation will be necessary.

(I will leave aside discussions of massive record bloat, continual system re-writes, the politics of whose view prevails, the unhelpfulness of compromises for joint solutions, and so on.)

Peter
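Peter's "translating dictionary" can be tiny in practice. A minimal sketch, using the alias URIs from Ross's MARCXML example (the data structure and function names are mine, purely illustrative, not any existing registry's API):

```python
# One canonical concept per format, each with its known alias URIs
# (aliases taken from Ross's MARCXML example; the structure is illustrative).
FORMAT_ALIASES = {
    "http://purl.org/DataFormat/marcxml": {
        "info:srw/schema/1/marcxml-v1.1",
        "info:ofi/fmt:xml:xsd:MARC21",
        "http://www.loc.gov/MARC21/slim",
    },
}

# Invert once: any known identifier -> its canonical concept URI.
CANONICAL = {alias: concept
             for concept, aliases in FORMAT_ALIASES.items()
             for alias in aliases}

def same_format(uri_a, uri_b):
    """True if two format URIs are known to name the same underlying format.
    Unknown URIs fall back to literal comparison."""
    return CANONICAL.get(uri_a, uri_a) == CANONICAL.get(uri_b, uri_b)

print(same_format("info:srw/schema/1/marcxml-v1.1",
                  "info:ofi/fmt:xml:xsd:MARC21"))  # True
```

This is exactly the dictionary's weakness Peter identifies: the table only works for schemes it already knows about, so every newly minted identifier scheme means another maintenance burden.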
Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All
Ideally, though, if we have some buy-in and extend this outside our communities, future identifiers *should* have fewer variations, since people can find the appropriate URI for the format and use that. I readily admit that this is wishful thinking, but so be it.

I do think that modeling it as SKOS/RDF at least would make it attractive to the Linked Data/Semweb crowd, who are likely the sorts of people that would be interested in seeing URIs anyway. I mean, the worst that can happen is that nobody cares, right?

-Ross.
Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All
I agree with Ross wholeheartedly. Particularly in the use of an RDF-based mechanism to describe, and then have systems act on, the semantics of these uniquely identified objects. Semantics (as in Web) has been exercising my thoughts recently, and the problems we have here are writ large over all the SW people are trying to achieve. Perhaps we can help...

Peter
Re: [CODE4LIB] Recommend book scanner?
On Fri, May 1, 2009 at 5:39 PM, Mike Taylor m...@indexdata.com wrote:
> If you want real 300 dpi images, at anything like the quality you get from a flatbed scanner, then you're going to need cameras much more expensive than $100.

Or just wait, say, about 3 years.
Re: [CODE4LIB] Recommend book scanner?
That is right. In addition, for certain printing (gold seal), a digital camera delivers better results than scanners.

-----Original Message-----
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Jonathan Rochkind
Sent: Friday, May 01, 2009 2:38 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Recommend book scanner?

Yeah, I don't think people use cameras instead of flatbed scanners because they produce superior results, or are cheaper: they use them because they're _faster_ for large-scale digitization, and they also make it possible to capture pages from rare/fragile materials with less damage to the materials. (Flatbeds are not good on bindings, if you want to get a good image.) If these things don't apply, is there any reason not to use a flatbed scanner? Not that I know of.

Jonathan

Randy Stern wrote:
> My understanding is that a flatbed or sheetfed document scanner that produces 300 dpi will produce much better OCR results than a cheap digital camera that produces 300 dpi. The reasons have to do with the resolution and distortion of the resulting image, where resolution is defined as the number of line pairs per mm that can be resolved (for example when scanning a test chart) -- in other words, the details that will show up for character images -- and distortion is image aberration that can appear at the edges of the page image areas, particularly when illumination is not even. A scanner has much more even illumination.
Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All
From my perspective, all we're talking about is using the same URI to refer to the same format(s) accross the library community standards this community generally can control. That will make things much easier for developers, especially but not only when building software that interacts with more than one of these standards (as client or server). Now, once you've done that, you've ALSO set the stage for that kind of RDF scenario, among other RDF scenarios. I agree with Mike that that particular scenario is unlikely, but once you set the stage for RDF experimentation like that, if folks are interested in experimenting (and many in our community are), maybe something more attractively useful will come out of it. Or maybe not. Either way, you've made things easier and more inter-operable just by using the same set of URIs across multiple standards to refer to the same thing. So, yeah, I'd still focus on that, rather than any kind of 'cross walk', RDF or not. It's the actual use case in front of us, in which the benefit will definitely be worth the effort (if the effort is kept manageable by avoiding trying to solve the entire universe of problems at once). Jonathan Mike Taylor wrote: So what are we talking about here? A situation where an SRU server receives a request for response records to be delivered in a particular format, it doesn't recognise the format URI, so it goes and looks it up in an RDF database and discovers that it's equivalent to a URI that it does know? Hmm ... it's crazy, but it might just work. I bet no-one does it, though. _/|____ /o ) \/ Mike Taylorm...@indexdata.comhttp://www.miketaylor.org.uk )_v__/\ Someday, I'll show you around monster-free Tokyo -- dialogue from Gamera: Guardian of the Universe Peter Noerr writes: I agree with Ross wholeheartedly. Particularly in the use of an RDF based mechanism to describe, and then have systems act on, the semantics of these uniquely identified objects. 
Semantics (as in Web) has been exercising my thoughts recently, and the problems we have here are writ large over all the SW people are trying to achieve. Perhaps we can help... Peter

-----Original Message----- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Ross Singer Sent: Friday, May 01, 2009 13:40 To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

Ideally, though, if we have some buy-in and extend this outside our communities, future identifiers *should* have fewer variations, since people can find the appropriate URI for the format and use that. I readily admit that this is wishful thinking, but so be it. I do think that modeling it as SKOS/RDF at least would make it attractive to the Linked Data/Semweb crowd, who are likely the sorts of people that would be interested in seeing URIs, anyway. I mean, the worst that can happen is that nobody cares, right? -Ross.

On Fri, May 1, 2009 at 3:41 PM, Peter Noerr pno...@museglobal.com wrote: I am pleased to disagree to various levels of 'strongly' (if we can agree on a definition for it :-). Ross earlier gave a sample of a 'crosswalk' for my MARC problem. What he supplied --snip--

We could have something like:

    <http://purl.org/DataFormat/marcxml>
        skos:prefLabel "MARC21 XML" ;
        skos:notation "info:srw/schema/1/marcxml-v1.1" ;
        skos:notation "info:ofi/fmt:xml:xsd:MARC21" ;
        skos:notation "http://www.loc.gov/MARC21/slim" ;
        skos:broader <http://purl.org/DataFormat/marc> ;
        skos:description "..." .

Or maybe those skos:notations should be owl:sameAs -- anyway, that's not really the point. The point is that all of these various identifiers would be valid, but we'd have a real way of knowing what they actually mean. Maybe this is what you mean by a crosswalk. --end--

...is exactly what I meant by a crosswalk. Basically a translating dictionary which allows any entity (system or person) to relate the various identifiers.
I would love to see a single unified set of identifiers; my life as a wrangler of record semantics would be so much easier. But I don't see it happening. That does not mean we should not try. Even a unification in our space (and if not in the library/information space, then where? as Mike said) reduces the larger problem. However, I don't believe it is a scalable solution (which may not matter if all of a group of users agree, then why not leave them to it) as, at any time, one group/organisation/person/system could introduce a new scheme, and a world view which relies on unified semantics would no longer be viable. Which means until global unification on an object (better, a (large) set of objects) is achieved, it will be necessary to have the translating dictionary and systems which know how to use it.
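Peter's "translating dictionary" can be sketched in a few lines. The variant identifiers below are the real ones quoted in this thread (SRU, OpenURL, and the MARCXML namespace); the dictionary structure and function name are purely illustrative assumptions, not any existing registry API:

```python
# Minimal sketch of a "translating dictionary": map every known variant
# identifier for a format onto one canonical URI, so two systems using
# different variants can discover they mean the same thing. The
# canonical purl.org URI follows Ross's example; same_format() is a
# hypothetical helper, not part of any real system.

CANONICAL = {
    # SRU record schema identifier for MARCXML
    "info:srw/schema/1/marcxml-v1.1": "http://purl.org/DataFormat/marcxml",
    # OpenURL format identifier for MARC21 XML
    "info:ofi/fmt:xml:xsd:MARC21": "http://purl.org/DataFormat/marcxml",
    # The MARCXML namespace itself, often used as an identifier
    "http://www.loc.gov/MARC21/slim": "http://purl.org/DataFormat/marcxml",
}

def same_format(uri_a: str, uri_b: str) -> bool:
    """True if two identifiers resolve to the same canonical format URI."""
    a = CANONICAL.get(uri_a, uri_a)
    b = CANONICAL.get(uri_b, uri_b)
    return a == b
```

Note that, as Peter says, the dictionary only shrinks the problem: any newly introduced scheme is just one more key to add, so the lookup itself never goes away until the variant count reaches one.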
Re: [CODE4LIB] Recommend book scanner?
My understanding is that a flatbed or sheetfed document scanner that produces 300 dpi will produce much better OCR results than a cheap digital camera that produces 300 dpi. The reasons have to do with the resolution and distortion of the resulting image, where resolution is defined as the number of line pairs per mm that can be resolved (for example, when scanning a test chart) - in other words, the details that will show up for character images - and distortion is image aberration that can appear at the edges of the page image areas, particularly when illumination is not even. A scanner has much more even illumination.

At 11:21 AM 5/1/2009 -0700, Erik Hetzner wrote: At Fri, 1 May 2009 09:51:19 -0500, Amanda P wrote: On the other hand, there are projects like bkrpr [2] and [3], home-brew scanning stations built for marginally more than the cost of a pair of $100 cameras.

Cameras around $100 are very low quality. You could get nowhere near the dpi recommended for materials that need to be OCRed. The quality of images from cameras would be not only low, but the OCR (even with the best software) would probably have many errors. For someone scanning items at home this might be OK, but for archival quality, I would not recommend cameras. If you are grant funded and the grant provider requires a certain level of quality, you need to make sure the scanning mechanism you use can scan at that quality.

I know very little about digital cameras, so I hope I get this right. According to Wikipedia, Google uses (or used) an 11MP camera (Elphel 323). You can get a 12MP camera for about $200. With a 12MP camera you should easily be able to get 300 DPI images of book pages and letter-size archival documents. For a $100 camera you can get more or less 300 DPI images of book pages. * The problems I have always seen with OCR had more to do with alignment and artifacts than with DPI. 300 DPI is fine for OCR as far as my (limited) experience goes - as long as you have quality images.
If your intention is to scan items for preservation, then, yes, you want higher quality - but I can't imagine any setup for archival quality costing anywhere near $1000. If you just want to make scans and full-text OCR available, these setups seem worth looking at - especially if the software workflow can be improved. best, Erik

* 12 MP seems to equal 4256 x 2848 pixels. To take a 'scan' (photo) of a page at 300 DPI, that page would need to be 14.18 x 9.49 (dividing pixels / 300). As long as you can get the camera close enough to the image to not waste much space, you will be getting in the close-to-300-DPI range for images of size 8.5 x 11 or less.

;; Erik Hetzner, California Digital Library ;; gnupg key id: 1024D/01DB07E3
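Erik's footnote arithmetic is easy to check. A quick sketch of the calculation (Python, purely illustrative; the sensor dimensions are the 4256 x 2848 figure from the footnote, and `effective_dpi` is a hypothetical helper):

```python
# Back-of-the-envelope check of Erik's footnote: what page size can a
# 12 MP sensor (4256 x 2848 pixels) cover at a true 300 DPI?

PIXELS = (4256, 2848)
TARGET_DPI = 300

# Maximum page size coverable at the target DPI: pixels / DPI.
max_page_inches = tuple(p / TARGET_DPI for p in PIXELS)
# -> roughly 14.19 x 9.49 inches, so a letter-size (8.5 x 11) page
#    fits with room to spare, provided framing wastes little space.

def effective_dpi(pixels_long_side: int, page_long_side_in: float) -> float:
    """DPI actually achieved when the page's long side fills the frame."""
    return pixels_long_side / page_long_side_in

# An 11-inch page long side filling the 4256-pixel axis:
print(round(effective_dpi(4256, 11), 1))  # 386.9
```

So at the sensor level the 300 DPI claim holds; whether the lens delivers that much real detail is the separate question raised later in the thread.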
Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All
So what are we talking about here? A situation where an SRU server receives a request for response records to be delivered in a particular format, it doesn't recognise the format URI, so it goes and looks it up in an RDF database and discovers that it's equivalent to a URI that it does know? Hmm ... it's crazy, but it might just work. I bet no-one does it, though. -- Mike Taylor m...@indexdata.com http://www.miketaylor.org.uk "Someday, I'll show you around monster-free Tokyo" -- dialogue from Gamera: Guardian of the Universe

Peter Noerr writes: I agree with Ross wholeheartedly. Particularly in the use of an RDF-based mechanism to describe, and then have systems act on, the semantics of these uniquely identified objects. Semantics (as in Web) has been exercising my thoughts recently, and the problems we have here are writ large over all the SW people are trying to achieve. Perhaps we can help... Peter

-----Original Message----- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Ross Singer Sent: Friday, May 01, 2009 13:40 To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

Ideally, though, if we have some buy-in and extend this outside our communities, future identifiers *should* have fewer variations, since people can find the appropriate URI for the format and use that. I readily admit that this is wishful thinking, but so be it. I do think that modeling it as SKOS/RDF at least would make it attractive to the Linked Data/Semweb crowd, who are likely the sorts of people that would be interested in seeing URIs, anyway. I mean, the worst that can happen is that nobody cares, right? -Ross.

On Fri, May 1, 2009 at 3:41 PM, Peter Noerr pno...@museglobal.com wrote: I am pleased to disagree to various levels of 'strongly' (if we can agree on a definition for it :-). Ross earlier gave a sample of a 'crosswalk' for my MARC problem.
What he supplied --snip--

We could have something like:

    <http://purl.org/DataFormat/marcxml>
        skos:prefLabel "MARC21 XML" ;
        skos:notation "info:srw/schema/1/marcxml-v1.1" ;
        skos:notation "info:ofi/fmt:xml:xsd:MARC21" ;
        skos:notation "http://www.loc.gov/MARC21/slim" ;
        skos:broader <http://purl.org/DataFormat/marc> ;
        skos:description "..." .

Or maybe those skos:notations should be owl:sameAs -- anyway, that's not really the point. The point is that all of these various identifiers would be valid, but we'd have a real way of knowing what they actually mean. Maybe this is what you mean by a crosswalk. --end--

...is exactly what I meant by a crosswalk. Basically a translating dictionary which allows any entity (system or person) to relate the various identifiers. I would love to see a single unified set of identifiers; my life as a wrangler of record semantics would be so much easier. But I don't see it happening. That does not mean we should not try. Even a unification in our space (and if not in the library/information space, then where? as Mike said) reduces the larger problem. However, I don't believe it is a scalable solution (which may not matter if all of a group of users agree, then why not leave them to it) as, at any time, one group/organisation/person/system could introduce a new scheme, and a world view which relies on unified semantics would no longer be viable. Which means until global unification on an object (better, a (large) set of objects) is achieved, it will be necessary to have the translating dictionary and systems which know how to use it. Unification reduces Ray's list of 15 alternative URIs to 14 or 13 or whatever. As long as that number is greater than 1, translation will be necessary. (I will leave aside discussions of massive record bloat, continual system re-writes, the politics of whose view prevails, the unhelpfulness of compromises for joint solutions, and so on.)
Peter

-----Original Message----- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Mike Taylor Sent: Friday, May 01, 2009 02:36 To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

Jonathan Rochkind writes: Crosswalk is exactly the wrong answer for this. Two very small overlapping communities of most library developers can surely agree on using the same identifiers, and then we make things easier for US. We don't need to solve the entire universe of problems. Solve the simple problem in front of you in the simplest way that could possibly work and still leave room for future expansion and
Re: [CODE4LIB] Recommend book scanner?
William Wueppelmann writes: Cameras around $100 are very low quality. You could get nowhere near the dpi recommended for materials that need to be OCRed. The quality of images from cameras would be not only low, but the OCR (even with the best software) would probably have many errors. For someone scanning items at home this might be OK, but for archival quality, I would not recommend cameras. If you are grant funded and the grant provider requires a certain level of quality, you need to make sure the scanning mechanism you use can scan at that quality.

To capture an image 8.5 x 11 at 300 dpi, you need roughly 8.4 megapixels, which is well within the capabilities of an inexpensive pocket camera.

Or not. Cheap cameras may well produce JPEGs that contain eight million pixels, but that doesn't mean that they are using all or even much of that resolution. In my experience, most cheap cameras are producing way more data than their lenses can actually feed them, so that you can halve the resolution or more without losing any actual information. Such cameras will, in effect, give you a 150 dpi scan -- even if that scan is expressed as a 300 dpi image. If you want real 300 dpi images, at anything like the quality you get from a flatbed scanner, then you're going to need cameras much more expensive than $100. -- Mike Taylor m...@indexdata.com http://www.miketaylor.org.uk "I think it should either be unrestricted garnishing, or a single Olympic standard mayonnaise" -- Monty Python.
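Both halves of this exchange are simple arithmetic. A sketch (Python, purely illustrative; the factor-of-two lens loss is Mike's rough estimate, not a measured figure, and both function names are hypothetical):

```python
def required_megapixels(width_in: float, height_in: float, dpi: int) -> float:
    """Sensor megapixels needed to cover a page at a given nominal DPI."""
    return (width_in * dpi) * (height_in * dpi) / 1e6

# William's figure: an 8.5 x 11 inch page at 300 dpi
print(required_megapixels(8.5, 11, 300))  # 8.415 -> "roughly 8.4 megapixels"

# Mike's caveat: if the lens only delivers half the sensor's nominal
# linear resolution, a "300 dpi" capture carries ~150 dpi of real detail.
def lens_limited_dpi(nominal_dpi: int, lens_factor: float = 0.5) -> float:
    """Effective DPI after an assumed lens resolution loss."""
    return nominal_dpi * lens_factor

print(lens_limited_dpi(300))  # 150.0
```

In other words, the pixel count and the resolved detail are separate claims, which is exactly the distinction the thread turns on.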
Re: [CODE4LIB] Recommend book scanner?
Yeah, I don't think people use cameras instead of flatbed scanners because they produce superior results, or are cheaper: they use them because they're _faster_ for large-scale digitization, and also make it possible to capture pages from rare/fragile materials with less damage to the materials. (Flatbeds are not good on bindings, if you want to get a good image.) If these things don't apply, is there any reason not to use a flatbed scanner? Not that I know of. Jonathan

Randy Stern wrote: My understanding is that a flatbed or sheetfed document scanner that produces 300 dpi will produce much better OCR results than a cheap digital camera that produces 300 dpi. The reasons have to do with the resolution and distortion of the resulting image, where resolution is defined as the number of line pairs per mm that can be resolved (for example, when scanning a test chart) - in other words, the details that will show up for character images - and distortion is image aberration that can appear at the edges of the page image areas, particularly when illumination is not even. A scanner has much more even illumination.

At 11:21 AM 5/1/2009 -0700, Erik Hetzner wrote: At Fri, 1 May 2009 09:51:19 -0500, Amanda P wrote: On the other hand, there are projects like bkrpr [2] and [3], home-brew scanning stations built for marginally more than the cost of a pair of $100 cameras.

Cameras around $100 are very low quality. You could get nowhere near the dpi recommended for materials that need to be OCRed. The quality of images from cameras would be not only low, but the OCR (even with the best software) would probably have many errors. For someone scanning items at home this might be OK, but for archival quality, I would not recommend cameras. If you are grant funded and the grant provider requires a certain level of quality, you need to make sure the scanning mechanism you use can scan at that quality.

I know very little about digital cameras, so I hope I get this right.
According to Wikipedia, Google uses (or used) an 11MP camera (Elphel 323). You can get a 12MP camera for about $200. With a 12MP camera you should easily be able to get 300 DPI images of book pages and letter-size archival documents. For a $100 camera you can get more or less 300 DPI images of book pages. * The problems I have always seen with OCR had more to do with alignment and artifacts than with DPI. 300 DPI is fine for OCR as far as my (limited) experience goes - as long as you have quality images. If your intention is to scan items for preservation, then, yes, you want higher quality - but I can't imagine any setup for archival quality costing anywhere near $1000. If you just want to make scans and full-text OCR available, these setups seem worth looking at - especially if the software workflow can be improved. best, Erik

* 12 MP seems to equal 4256 x 2848 pixels. To take a 'scan' (photo) of a page at 300 DPI, that page would need to be 14.18 x 9.49 (dividing pixels / 300). As long as you can get the camera close enough to the image to not waste much space, you will be getting in the close-to-300-DPI range for images of size 8.5 x 11 or less.

;; Erik Hetzner, California Digital Library ;; gnupg key id: 1024D/01DB07E3
Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All
I agree that most software probably won't do it. But the data will be there and free and relatively easy to integrate if one wanted to. In a lot of ways, Jonathan, it's got Umlaut written all over it. Now to get to Jonathan's point -- yes, I think the primary goal still needs to be working towards bringing use of identifiers for a given thing to a single variant. However, we would obviously have to know what the options are in order to figure out what that one is -- while we're doing that, why not enter the different options into the registry and document them in some way (such as, who uses this variant?). Voila, we have a crosswalk. Of course, the downside is that we technically also have a new URI for this resource (since the skos:Concept would need to have a URI), but we could probably hand-wave that away as the id for the registry concept, not the data format. So -- we seem to have some agreement here? -Ross.

On Fri, May 1, 2009 at 5:53 PM, Jonathan Rochkind rochk...@jhu.edu wrote: From my perspective, all we're talking about is using the same URI to refer to the same format(s) across the library community standards this community generally can control. That will make things much easier for developers, especially but not only when building software that interacts with more than one of these standards (as client or server). Now, once you've done that, you've ALSO set the stage for that kind of RDF scenario, among other RDF scenarios. I agree with Mike that that particular scenario is unlikely, but once you set the stage for RDF experimentation like that, if folks are interested in experimenting (and many in our community are), maybe something more attractively useful will come out of it. Or maybe not. Either way, you've made things easier and more interoperable just by using the same set of URIs across multiple standards to refer to the same thing. So, yeah, I'd still focus on that, rather than any kind of 'crosswalk', RDF or not.
It's the actual use case in front of us, in which the benefit will definitely be worth the effort (if the effort is kept manageable by avoiding trying to solve the entire universe of problems at once). Jonathan

Mike Taylor wrote: So what are we talking about here? A situation where an SRU server receives a request for response records to be delivered in a particular format, it doesn't recognise the format URI, so it goes and looks it up in an RDF database and discovers that it's equivalent to a URI that it does know? Hmm ... it's crazy, but it might just work. I bet no-one does it, though. -- Mike Taylor m...@indexdata.com http://www.miketaylor.org.uk "Someday, I'll show you around monster-free Tokyo" -- dialogue from Gamera: Guardian of the Universe

Peter Noerr writes: I agree with Ross wholeheartedly. Particularly in the use of an RDF-based mechanism to describe, and then have systems act on, the semantics of these uniquely identified objects. Semantics (as in Web) has been exercising my thoughts recently, and the problems we have here are writ large over all the SW people are trying to achieve. Perhaps we can help... Peter

-----Original Message----- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Ross Singer Sent: Friday, May 01, 2009 13:40 To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

Ideally, though, if we have some buy-in and extend this outside our communities, future identifiers *should* have fewer variations, since people can find the appropriate URI for the format and use that. I readily admit that this is wishful thinking, but so be it. I do think that modeling it as SKOS/RDF at least would make it attractive to the Linked Data/Semweb crowd, who are likely the sorts of people that would be interested in seeing URIs, anyway. I mean, the worst that can happen is that nobody cares, right? -Ross.
On Fri, May 1, 2009 at 3:41 PM, Peter Noerr pno...@museglobal.com wrote: I am pleased to disagree to various levels of 'strongly' (if we can agree on a definition for it :-). Ross earlier gave a sample of a 'crosswalk' for my MARC problem. What he supplied --snip--

We could have something like:

    <http://purl.org/DataFormat/marcxml>
        skos:prefLabel "MARC21 XML" ;
        skos:notation "info:srw/schema/1/marcxml-v1.1" ;
        skos:notation "info:ofi/fmt:xml:xsd:MARC21" ;
        skos:notation "http://www.loc.gov/MARC21/slim" ;
        skos:broader <http://purl.org/DataFormat/marc> ;
        skos:description "..." .

Or maybe those skos:notations should be owl:sameAs -- anyway, that's not really the point. The point is that all of these various identifiers would be valid, but we'd have a real way of knowing what they actually mean.
Re: [CODE4LIB] Recommend book scanner?
Mike Taylor wrote: Or not. Cheap cameras may well produce JPEGs that contain eight million pixels, but that doesn't mean that they are using all or even much of that resolution.

Does anybody have a printed test sheet that we can scan or photograph, and then compare the resulting digital images? It should have lines at various densities and areas of different colours, just like an old TV test image. Can you buy such calibration sheets? We could make it a standard routine to always shoot such a sheet at the beginning of any captured book, to give the reader an idea of the digitization quality of the equipment used.

They are called "technical targets" in figure 14, page 149, of Lisa L. Fox (ed.), Preservation Microfilming, 2nd ed. (1996), ISBN 0-8389-0653-2. The example there is manufactured by AP International, http://www.a-p-international.com/ However, their price list is $100-400 per package of 50 sheets. I wouldn't pay more for the calibration targets than for the camera, if I could avoid it. -- Lars Aronsson (l...@aronsson.se) Aronsson Datateknik - http://aronsson.se Project Runeberg - free Nordic literature - http://runeberg.org/