Re: [CODE4LIB] one tool and/or resource that you recommend to newbie coders in a library?
Bohyun Kim k...@fiu.edu wrote:

Hi all code4lib-bers, As coders and coding librarians, what is ONE tool and/or resource that you recommend to newbie coders in a library (and why)? I promise I will create and circulate the list and make it into a Code4Lib wiki page for collective wisdom. =)

How to Design Programs is online at http://www.ccs.neu.edu/home/matthias/HtDP2e/. Good for newbie coders. StackOverflow.com is a great site for questions. Also a pretty good list at http://grokcode.com/11/the-top-9-in-a-hackers-bookshelf/

Bill
Re: [CODE4LIB] Writing good documentation
Francis Kayiwa kay...@uic.edu wrote:

Aside from wikis, can anyone recommend any freely available document-creation tools? Eric Hellman's[0] post this AM spurred this. My (Our?) goal is an easy way to create How-To-like documentation geared towards a novice.

GNU Emacs (http://www.gnu.org/software/emacs/) is the Swiss Army knife of document creation tools.

Bill
Re: [CODE4LIB] OCR Solutions
Tesseract is free, but in my experience you usually have to train a model to make it work well. The model that comes with it seems to be set up for scanning English book pages, though, so it may be appropriate for library use.

OCRopus, from a research group in Germany, seems more powerful than Tesseract, and is also freeware, but is (IMO) currently a pain to set up. And, again, to get good results, you often need to train a model. But it seems to have much more functionality than Tesseract (which may or may not be a good thing :-).

If you have Microsoft Office (2003 or later, but before Office 2010) on a Windows machine, you also have (had) Microsoft's OCR package, which exposes its functionality through a COM interface, so you can call it from other programs. See http://msdn.microsoft.com/en-us/library/aa202819(v=office.11).aspx

Similarly, Google Docs 3.0 offers free OCR via the Google Docs API.

I've also tried TOCR, a $100-per-machine Windows-only OCR library from www.transym.com, a British MOD spin-off. It comes as a DLL, plus a simple application, and you can build your own application to use the DLL. Pretty accurate for printed English text -- gives bounding boxes and word and character confidences.

Bill
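For anyone who wants to try the Tesseract route from a script, here is a minimal Python sketch. It assumes the `tesseract` executable is installed and on the PATH, and that the stock English model (`eng`) is good enough for the pages in question. The function names and file names are made up for illustration; only the basic `tesseract <image> <output base> -l <lang>` command-line form is relied on.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for a basic Tesseract run.

    Tesseract writes the recognized text to `<output_base>.txt`.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_page(image_path, output_base, lang="eng"):
    """Run Tesseract on one page image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling `ocr_page("page-001.png", "page-001")` would leave the text in `page-001.txt` and return it as a string. Accuracy on anything other than clean printed pages is not something this sketch addresses.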
Re: [CODE4LIB] iPads as Kiosks
David Uspal david.us...@villanova.edu wrote:

Then again, by selecting the iPad you're essentially tethered to Apple's iron grip of the iWorld via its iTunes vetting process and strict control of Apple hardware. YMMV on this depending on what you're doing, but it should definitely be a consideration when choosing between Android tablets and the iPad.

Only if you go the native app route. I ruthlessly adapted Intuity's Notabene app code to create an HTML5 kiosk app.

http://blog.intuitymedialab.eu/2010/05/19/intuitys-notabene-rapid-html5-prototyping-on-the-ipad/

Bill
Re: [CODE4LIB] Advice on a class
Simon Spero sesunc...@gmail.com wrote:

Additional languages which carry weight with me on a resume are OCaml, Processing, and any of Common Lisp, Scheme, or Clojure.

Did you mean Clozure? The other two are kinds of Lisp. :-P

;-). Nothing wrong with Clojure -- presumably JSR 292 in Java 7 will make it even better. I thought I'd covered Clozure under CL, though.

Bill
Re: [CODE4LIB] Advice on a class
Bill Dueber b...@dueber.com wrote:

Unless you're in a very, *very* different library than mine, all the low-level stuff written in C and variants are at a low-enough level (and in very specialized domains) that I'd never have an expectation that anyone working in the library would mess with it. I presume there are people in research libraries that muck around with C/C++, rolling their own libraries and what not, but I've been at Michigan for six years and I haven't heard of it, here or elsewhere. For most of us, I think, doing something in C is premature optimization.

Completely agree with most of this. The issue comes up when you need to *fix* something in one of these libraries that someone else has provided, or hook one of them up to your higher-level language.

I hear the assertion often that a programmer needs to know C/assembly so they can truly understand how the machine is working at a deep level, and I've grown over the years to dismiss that assertion.

Yes, I'd say needs is too strong.

Almost no one who makes a living programming in libraries is going to do a better job by hand than a modern optimizing compiler, certainly not if you take ROI into account. Data structures and Big-O is enough to tell you if you're moving down the right path, and then it's just a matter of being smart enough to use existing libraries where applicable --- and, again, avoiding premature optimization, at the code level or at the selection-of-a-language level. If you're hiring someone who's going to have to know the details of cache misses and how far to unroll loops or whatnot, you know exactly what you're hiring him/her for and don't need advice about hiring a generic programmer.

Yep, and yep.

For stuff where you need not access to the bare metal but just raw speed, Java is mature, super-fast, and has lots of optimized libraries available.

Here's where we differ. Java is mature, and very well-optimized (though super-fast is perhaps a bit much), but the language itself is rather poorly designed, and the libraries available are in my experience a sad hodge-podge of brilliance and stodge. I've spent far too much time working around inexplicably years-old bugs in J2SE's standard library. Some of the J2EE systems are very good, though.

It also turns out many of us are using Solr these days, so Java has snuck into the library infrastructure already.

:-). I used Doug's text-indexing library when he and Jan wrote it in Common Lisp at PARC in the early 90's, and I was happy to see it appear again in more commercially acceptable form as Lucene. Lucene, and parts of Solr, are important contributions to the technical landscape, and Lucene is an example of an excellent Java library (I don't know Solr well enough to comment). But I use Lucene through Andi Vajda's excellent PyLucene, and I use a couple of other Java libraries via JCC, Vajda's wrapper for those Java libraries worth using.

90% of my argument for doing green-field projects in Ruby is that I can access low-level Java libraries to do the heavy lifting and get an optimizing VM for free.

The thing I've found about Python is that one rarely needs to stoop to using Java libraries, as there are generally superior Python libraries available. This is perhaps just a comment on the relative novelty of Ruby, which is a fine programming language in its own right. Though if my goal was to exploit the Java ecosystem, I imagine I'd use Clojure or Jython instead.

I don't care what scripting language you use, as long as you use it well. Python, Ruby, Perl, whatever. If you know one, you can learn the others.

I tend to agree about the learning, but that doesn't make them equally useful. The size and shape of the user community and the variety of offerings and third-party libraries available are important considerations. A good programmer is aware of that and reflects that awareness in his skill set.

I would never foist a project in, say, OCaml or Scheme on my library, because who the hell is going to maintain it (and its environment)?? What kind of programmer can I hire 'off the street'?

This is an important and unfortunate commercial consideration which sometimes forces a trade-off with the quality and functionality benefits available with higher-level boutique languages.

But I sure want someone who understands side effects and their effect on multi-threaded programming, because I've got a lot of idle cores sitting around waiting for work.

Yep. Some nice features in Clojure for exploiting that.

Finally, I always ask someone what their favorite programming environment is. I've had a few candidates tell me that they just use Notepad, and I don't mind admitting that that's almost a dealbreaker for me. Using a good editor or an IDE is a critical part of taking advantage of the language ecosystem. A good programming editor is a must. Yep. Not having at least syntax highlighting and checking is, to me, a sign that
Re: [CODE4LIB] Advice on a class
If I'm hiring a programmer, I want them to know C and Python. C because all the low-level stuff is written in that, Python because it's simply the most useful all-around programming language at the moment, and if you don't know it, well, how devoted are you really to your craft? Various flavors of C are acceptable: Objective-C is OK with me, and C++ is a plus -- it's an order of magnitude more difficult than C to use properly, and people who can sling it properly are rare. Additional languages which carry weight with me on a resume are OCaml, Processing, and any of Common Lisp, Scheme, or Clojure.

If I was hiring a digital *librarian*, I'd also expect them to know JavaScript, the language at the heart of the EPUB format. But JavaScript is kind of tricky; it's a subtle, powerful language with bad syntax and weak libraries. I certainly wouldn't recommend it to start with.

Cary Gordon listu...@chillco.com wrote:

There are still plenty of opportunities for Cobol coders, but I wouldn't recommend that either.

Java is the COBOL of the 21st century, so if you know Java well, there will be a job in that for the next 20-30 years, I'd expect. Until the Singularity happens, anyway. I'd think there will always be lots of enterprise Java jobs around.

Bill
Re: [CODE4LIB] LAMP Hosting service that supports php_yaz?
Richard, Joel M richar...@si.edu wrote:

If you're looking just to learn and not spend any money at all, you could always set up a Linux flavor running on VirtualBox.

Second that.

It's a lot of effort and I daresay you'd learn a lot about many things, but it may not be viable.

Really? I think it's pretty simple:

(1) Install VB from http://download.virtualbox.org/virtualbox/4.0.4/VirtualBox-4.0.4-70112-Win.exe.
(2) Download a CD image of Ubuntu from http://www.ubuntu.com/desktop/get-ubuntu/download.
(3) Fire up VB, say new machine, and point it at the Ubuntu CD file. Let Ubuntu install.
(4) In Ubuntu, say sudo apt-get install yaz php5.

Should get you at least part of the way there.

Bill
Re: [CODE4LIB] reading early versions of FrameMaker...
Hi, Louis. Thanks for your note. I have FrameMaker 5, but when I point it at these files, it tells me an earlier version is needed to open them. The header in the file (from 1989, for instance) says that the version is MakerFile 1.03. I'm on the track of a version of Maker 3.

The side-topic question I have (given that we're in the middle of trying to find the best passive solution for long-term *reliable* storage of data) is how were these files stored? Optical media? (If so, what flavor?) Hard drives? Tape? (What flavor?)

Hard drives. PARC has been big on network file systems since the '70s, of course. In the early '90s, we were experimenting with the idea of an eternal infinite network file system. As part of this, we invested in some optical archive technology systems, and put a lot of stuff on them which had previously been stored on tape and disk pack. Those bits have been brought forward systematically to successive generations of storage technology. (If you're curious, the future we're looking towards today is content-centric networking (http://www.ccnx.org/) -- data matters, not where it lives.)

One possibility I recall. Frame binary files must end at a 1K boundary. When they were sent via email, they often got a CR/LF added to the end, and then Frame would no longer touch them. We wrote a little utility that fixed this by trimming off anything after the last 1K boundary. So if the file was 2050 bytes, it truncated it to 2048. If the exact byte sizes are not a multiple of 1024, let me know and I'll see if I can find that utility someplace. Or you could do it with a binary editor.

Ah, good tip. I'll look into that.

Last three quotes from: http://answerpot.com/showthread.php?185398-FrameMaker+3+files

So while it may not relate to the original question of making older, archived files available in modern formats, I suppose we could definitively say that so long as the data isn't in a proprietary format, or if it is, so long as the software is still in production, you've a good chance at opening older data -- as long as it's not corrupt.

I still like the older PARC document formats, tedit and Tioga. They stored the plain text at the front of the file, a marker (typically a run of two zero bytes), and then the binary markup directives, which referenced byte positions in the plain text. So, you could visit any document as a plain-text file and get some sense of what was in it.

Bill
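Two of the file-level details in this message are easy to sketch in Python: trimming a file back to its last 1K boundary, and pulling the leading plain text out of a text-then-marker-then-markup layout. This is only an illustration under stated assumptions. It is not the original trimming utility, the function names are invented, and the real tedit and Tioga formats are certainly more involved than "split at the first run of two zero bytes" (the ASCII decoding is likewise an assumption).

```python
BLOCK_SIZE = 1024           # the "1K boundary" from the quoted tip
TEXT_MARKER = b"\x00\x00"   # assumed separator: a run of two zero bytes


def trim_to_block_boundary(data, block_size=BLOCK_SIZE):
    """Drop any bytes after the last complete block (2050 bytes -> 2048)."""
    return data[: len(data) - (len(data) % block_size)]


def split_text_and_markup(data, marker=TEXT_MARKER):
    """Split at the first marker into (plain_text, markup_bytes).

    If no marker is present, the whole input is treated as plain text
    and the markup part is empty.
    """
    text, _marker, markup = data.partition(marker)
    return text.decode("ascii", errors="replace"), markup
```

Both functions work on bytes already read from disk (for example with `open(path, "rb").read()`), so the original file is never modified; writing the trimmed bytes to a new file keeps the untouched copy around in case the guess about the format is wrong.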
[CODE4LIB] reading early versions of FrameMaker...
At PARC, we have some digital documents from the early '90s in FrameMaker versions 1 and 2. But we have no versions of FrameMaker suitable for opening them and re-rendering them in a more accessible format. I'm wondering if others have faced this issue in making archives accessible, and if so, what they did about it?

Bill
Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?
Deng, Sai sai.d...@wichita.edu wrote:

Do you know the Digital Library systems which can search within the documents (e.g. PDFs) and handle access restrictions (e.g. DRM)?

Not sure what you mean by handle access restrictions. Do you mean it can index the documents put into it even if they have DRM encumbrances?

UpLib has search within the documents -- if you search for a word or phrase, it shows you all the documents which match, but also all the pages in each document which match. It supports a wide variety of document formats, from JPEG2000 to PDF to PowerPoint. But as far as I know it doesn't deal with DRM restrictions.

Bill
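To make "search within the documents" concrete, here is a toy Python sketch of page-level matching: for each document it reports which pages contain the phrase. This is not UpLib's implementation or API. The in-memory `library` dictionary and the function name are assumptions for illustration, and a real system would use a full-text index rather than scanning every page.

```python
def find_matching_pages(library, phrase):
    """Map each matching document name to its matching 1-based page numbers.

    `library` maps a document name to a list of page texts.
    Matching is a case-insensitive substring test.
    """
    needle = phrase.lower()
    hits = {}
    for name, pages in library.items():
        page_numbers = [
            number
            for number, text in enumerate(pages, start=1)
            if needle in text.lower()
        ]
        if page_numbers:
            hits[name] = page_numbers
    return hits
```

A result shaped like this (document, then pages) is what lets a UI show both the list of matching documents and the matching pages inside each one.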
Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?
Deng, Sai sai.d...@wichita.edu wrote:

For access restriction, I mean we would like to have certain documents open only to certain communities (UpLib cannot do that, right?).

OK, that's not what I typically think of when I hear DRM. Access control is (I think) the way it's usually put. No, UpLib has no built-in access control system, though the hooks are there, and I know that some have used them to do access control. I know of one UpLib application which requires incoming connections to provide a client certificate, which it uses to give different clients different access rights. Probably overkill for most uses. You'd probably want to do an application-specific Web UI, though -- you could put the access restrictions there. I recently saw a Tomcat app which uses the UpLib Java client-side library to search for documents, then provided a completely custom UI.

On second thought, I searched for DSpace full text search and found this: https://wiki.duraspace.org/display/DSPACE/Configure+full+text+indexing However, I haven't seen any instance which shows the full text search results as I would see from vendor databases. Any idea on what system might be good/best for search within documents and DRM?

How about Greenstone?

Bill
Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?
Deng, Sai sai.d...@wichita.edu wrote:

Thanks for the questions! We don't have a clear idea yet and we are looking for a system now. The basic idea is that we'll deposit some licensed materials for some department and open them only to that group. I guess a local account would be OK; of course, if a campus account can be recognized, that's better.

In which case you'll need some access control system which can understand your campus login system.

They'll need to log in to see the document if it's not IP-restricted, right? IP restriction might not be the best way since faculty members will not always be in their departments.

Will you let them search for documents, and show the search results, even if they can't retrieve the full document, as the ACM Digital Library does? Or do search results have to be filtered, too? How many different access groups will you have? One per department? One per licensed set of material? And what's the approximate size of each of those numbers?

A simple thing to do would be to install something like DocuShare, which already does all this stuff and is built on top of Autonomy, one of the better suites for extracting and indexing content from documents.

Bill

Sophie

From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Mark Jordan [mjor...@sfu.ca]
Sent: Wednesday, October 20, 2010 5:08 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?

Sophie,

It might help some of us on the list to understand what types of access control you need if you can describe some of the ways that the allowed users (people and/or departments, to use your examples) will identify themselves? Will they have already logged into the system with a local (to the system) account, or with a campus account that knows that they are part of a specific department? Will they need to log into the system when they request to see a specific document? Will where they are sitting matter (i.e., restricted by IP address)?

Mark

Mark Jordan
Head of Library Systems
W.A.C. Bennett Library, Simon Fraser University
Burnaby, British Columbia, V5A 1S6, Canada
Voice: 778.782.5753 / Fax: 778.782.3023 / Skype: mark.jordan50
mjor...@sfu.ca

- Original Message -

Thanks for the information! Greenstone has full text search, but I heard that its access control is much weaker than DSpace. Will it be able to allow certain documents open only to certain people or certain departments? Thanks.

Sophie

From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Bill Janssen [jans...@parc.com]
Sent: Wednesday, October 20, 2010 4:31 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?

Deng, Sai sai.d...@wichita.edu wrote:

For access restriction, I mean we would like to have certain documents open only to certain communities (UpLib cannot do that, right?).

OK, that's not what I typically think of when I hear DRM. Access control is (I think) the way it's usually put. No, UpLib has no built-in access control system, though the hooks are there, and I know that some have used them to do access control. I know of one UpLib application which requires incoming connections to provide a client certificate, which it uses to give different clients different access rights. Probably overkill for most uses. You'd probably want to do an application-specific Web UI, though -- you could put the access restrictions there. I recently saw a Tomcat app which uses the UpLib Java client-side library to search for documents, then provided a completely custom UI.

On second thought, I searched for DSpace full text search and found this: https://wiki.duraspace.org/display/DSPACE/Configure+full+text+indexing However, I haven't seen any instance which shows the full text search results as I would see from vendor databases. Any idea on what system might be good/best for search within documents and DRM?

How about Greenstone?

Bill
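The question in this thread about whether restricted documents should still appear in search results comes down to a small policy choice, sketched below in Python. Everything here is hypothetical: the `Document` class, the group names, and the `hide_restricted` flag are illustration only and do not correspond to DocuShare, DSpace, Greenstone, or UpLib. How a user's groups are obtained (campus login, local account, IP range) is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Document:
    title: str
    # Groups allowed to open the document; an empty set means open to everyone.
    allowed_groups: frozenset = field(default_factory=frozenset)


def can_retrieve(doc, user_groups):
    """True if the document is open, or the user shares a group with it."""
    return not doc.allowed_groups or bool(doc.allowed_groups & set(user_groups))


def visible_results(results, user_groups, hide_restricted=True):
    """Return (document, retrievable) pairs for a list of search results.

    hide_restricted=True drops documents the user cannot open;
    hide_restricted=False keeps them in the list but flags them.
    """
    pairs = [(doc, can_retrieve(doc, user_groups)) for doc in results]
    if hide_restricted:
        return [(doc, ok) for doc, ok in pairs if ok]
    return pairs
```

With `hide_restricted=False` the result list behaves like the ACM Digital Library case mentioned above (you can see that a document exists but not open it); with `hide_restricted=True` the search results themselves are filtered.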