Re: [CODE4LIB] one tool and/or resource that you recommend to newbie coders in a library?

2012-11-01 Thread Bill Janssen
Bohyun Kim k...@fiu.edu wrote:

 Hi all code4lib-bers,
 
 As coders and coding librarians, what is ONE tool and/or resource that you 
 recommend to newbie coders in a library (and why)?  I promise I will create 
 and circulate the list and make it into a Code4Lib wiki page for collective 
 wisdom.  =)

How to Design Programs is online at
http://www.ccs.neu.edu/home/matthias/HtDP2e/.  Good for newbie coders.

StackOverflow.com is a great site for questions.

Also a pretty good list at
http://grokcode.com/11/the-top-9-in-a-hackers-bookshelf/

Bill


Re: [CODE4LIB] Writing good documentation

2012-11-01 Thread Bill Janssen
Francis Kayiwa kay...@uic.edu wrote:

 Aside from wikis, can anyone recommend any freely available document
 creation tools? Eric Hellman's[0] post this AM spurred this. My (Our?) goal
 is an easy way to create How-To-like documentation geared towards a
 novice.

GNU Emacs (http://www.gnu.org/software/emacs/) is the Swiss army knife
of document creation tools.

Bill


Re: [CODE4LIB] OCR Solutions

2011-11-05 Thread Bill Janssen
Tesseract is free, but in my experience you usually have to train up a
model to make it work well.  That said, the model that comes with it
seems to be set up for scanning English book pages, so it may be
appropriate for library use as is.
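
For anyone who wants to try the stock English model before training
anything, a minimal Python sketch of driving the tesseract command-line
program might look like this.  It assumes the tesseract executable is
installed and on the PATH; the function names are hypothetical and chosen
only for illustration.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_page(image_path, output_base, lang="eng"):
    """Run tesseract on one page image and return the recognized text.

    Assumes the tesseract executable is on the PATH.  Tesseract writes
    its plain-text result to OUTPUTBASE.txt, which is read back here.
    """
    command = build_tesseract_command(image_path, output_base, lang)
    subprocess.run(command, check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Calling ocr_page("page-001.tif", "page-001") (made-up file names) would
leave the text in page-001.txt and return it as a string.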

OCRopus, from a research group in Germany, seems more powerful than
Tesseract, and is also freeware, but is (IMO) currently a pain to set
up.  And, again, to get good results, you often need to train a model.
But it seems to have much more functionality than Tesseract (which may
or may not be a good thing :-).

If you have Microsoft Office (versions including MS Office 2003 or later
but prior to MS Office 2010) on a Windows machine, you also have (had)
Microsoft's OCR package, which exposes its functionality through a COM
interface, so you can call it from other programs.  See
http://msdn.microsoft.com/en-us/library/aa202819(v=office.11).aspx

Similarly, Google Docs 3.0 offers free OCR via the Google Docs API.

I've also tried TOCR, a $100-per-machine Windows-only OCR library from
www.transym.com, a British MOD spin-off.  Comes as a DLL, plus a simple
application, and you can build your own application to use the DLL.
Pretty accurate, for printed English text -- gives bounding boxes and
word and character confidences.

Bill


Re: [CODE4LIB] iPads as Kiosks

2011-09-08 Thread Bill Janssen
David Uspal david.us...@villanova.edu wrote:

 Then again, by selecting the iPad you're essentially tethered to
 Apple's iron grip of the iWorld via its iTunes vetting process and
 strict control of Apple hardware.  YMMV on this depending on what
 you're doing, but it should definitely be a consideration when
 choosing between Android tablets and the iPad.

Only if you go the native app route.  I ruthlessly adapted Intuity's
Notabene app code to create an HTML5 kiosk app.

http://blog.intuitymedialab.eu/2010/05/19/intuitys-notabene-rapid-html5-prototyping-on-the-ipad/

Bill


Re: [CODE4LIB] Advice on a class

2011-07-29 Thread Bill Janssen
Simon Spero sesunc...@gmail.com wrote:

  Additional languages which carry weight with me on a resume are
  OCaml, Processing, and any of Common Lisp, Scheme, or Clojure.
 
 Did you mean Clozure?  The other two are kinds of lisp. :-P

;-).  Nothing wrong with Clojure -- presumably JSR 292 in Java 7 will
make it even better.  I thought I'd covered Clozure under CL, though.

Bill


Re: [CODE4LIB] Advice on a class

2011-07-29 Thread Bill Janssen
Bill Dueber b...@dueber.com wrote:

 Unless you're in a very, *very* different library than mine, all the
 low-level stuff written in C and variants are at a low-enough level (and in
 very specialized domains) that I'd never have an expectation that anyone
 working in the library would mess with it. I presume there are people in
 research libraries that muck around with C/C++, rolling their own libraries
 and what not, but I've been at Michigan for six years and I haven't heard of
 it, here or elsewhere. For most of us, I think, doing something in C is
 premature optimization.

Completely agree with most of this.  The issue comes up when you need to
*fix* something in one of these libraries that someone else has
provided, or hook one of them up to your higher-level language.

 I hear the assertion often that a programmer needs to know C/assembly so
 they can truly understand how the machine is working at a deep level, and
 I've grown over the years to dismiss that assertion.

Yes, I'd say "needs" is too strong.

 Almost no one who makes
 a living programming in libraries is going to do a better job by hand than
 a modern optimizing compiler, certainly not if you take ROI into account.
 Data structures and Big-O is enough to tell you if you're moving down the
 right path, and then it's just a matter of being smart enough to use
 existing libraries where applicable --- and, again, avoiding premature
 optimization, at the code level or at the selection-of-a-language level. If
 you're hiring someone who's going to have to know the details of cache
 misses and how far to unroll loops or whatnot, you know exactly what you're
 hiring him/her for and don't need advice about hiring a generic programmer.

Yep, and yep.

 For stuff where you need not access to the bare metal but just raw speed,
 Java is mature, super-fast, and has lots of optimized libraries available.

Here's where we differ.  Java is mature, and very well-optimized (though
"super-fast" is perhaps a bit much), but the language itself is rather
poorly designed, and the libraries available are in my experience a sad
hodge-podge of brilliance and stodge.  I've spent far too much time
working around inexplicably years-old bugs in J2SE's standard library.

Some of the J2EE systems are very good, though.

 It also turns out many of us are using Solr these days, so Java has snuck
 into the library infrastructure already.

:-).  I used Doug's text-indexing library when he and Jan wrote it in
Common Lisp at PARC in the early 90's, and I was happy to see it appear
again in more commercially acceptable form as Lucene.  Lucene, and parts
of Solr, are important contributions to the technical landscape, and
Lucene is an example of an excellent Java library (I don't know Solr
well enough to comment).

But I use Lucene through Andi Vajda's excellent PyLucene, and I use a
couple of other Java libraries via JCC, Vajda's wrapper for those Java
libraries worth using.

 90% of my argument for doing green-field projects in Ruby is that I
 can access low-level Java libraries to do the heavy lifting and get an
 optimizing VM for free.

The thing I've found about Python is that one rarely needs to stoop to
using Java libraries, as there are generally superior Python libraries
available.  This is perhaps just a comment on the relative novelty of
Ruby, which is a fine programming language in its own right.  Though if
my goal was to exploit the Java ecosystem, I imagine I'd use Clojure or
Jython instead.

 I don't care what scripting language you use, as long as you use it well.
 Python, Ruby, Perl, whatever. If you know one, you can learn the others.

I tend to agree about the learning, but that doesn't make them equally
useful.  The size and shape of the user community and the variety of
offerings and third-party libraries available are important
considerations.  A good programmer is aware of that and reflects that
awareness in his skill set.

 I would never foist a project in, say, OCaml or Scheme on my library,
 because who the hell is going to maintain it (and its environment)??

What kind of programmer can I hire 'off the street'?  This is an
important and unfortunate commercial consideration which sometimes
forces a trade-off with the quality and functionality benefits available
with higher-level boutique languages.

 But I sure want someone who understands side effects and their effect
 on multi-threaded programming, because I've got a lot of idle cores
 sitting around waiting for work.

Yep.  Some nice features in Clojure for exploiting that.
  
 Finally, I always ask someone what their favorite programming environment
 is. I've had a few candidates tell me that they just use Notepad, and I
 don't mind admitting that that's almost a dealbreaker for me. Using a good
 editor or an IDE is a critical part of taking advantage of the language
 ecosystem. A good programming editor is a must.

Yep.

 Not having at least syntax highlighting and checking is, to me, a sign
 that 

Re: [CODE4LIB] Advice on a class

2011-07-27 Thread Bill Janssen
If I'm hiring a programmer, I want them to know C and Python.  C because
all the low-level stuff is written in that, Python because it's simply
the most useful all-around programming language at the moment, and if
you don't know it, well, how devoted are you really to your craft?

Various flavors of C are acceptable:  Objective-C is OK with me, and C++
is a plus -- it's an order of magnitude more difficult than C to use
properly, and people who can sling it properly are rare.  Additional
languages which carry weight with me on a resume are OCaml, Processing,
and any of Common Lisp, Scheme, or Clojure.

If I were hiring a digital *librarian*, I'd also expect them to know
JavaScript, the language at the heart of the EPUB format.  But
JavaScript is kind of tricky; it's a subtle, powerful language with bad
syntax and weak libraries.  I certainly wouldn't recommend it to start
with.

Cary Gordon listu...@chillco.com wrote:
 There are still plenty of opportunities for Cobol coders, but I
 wouldn't recommend that either.

Java is the COBOL of the 21st century, so if you know Java well, there
will be a job in that for the next 20-30 years, I'd expect.  Until the
Singularity happens, anyway.  I'd think there will always be lots of
enterprise Java jobs around.

Bill


Re: [CODE4LIB] LAMP Hosting service that supports php_yaz?

2011-03-23 Thread Bill Janssen
Richard, Joel M richar...@si.edu wrote:

 If you're looking just to learn and not spend any money at all, you
 could always set up a Linux flavor running on VirtualBox.

Second that.

 It's a lot of effort and I daresay you'd learn a lot about many
 things, but it may not be viable.

Really?  I think it's pretty simple:

  (1) Install VB from
      http://download.virtualbox.org/virtualbox/4.0.4/VirtualBox-4.0.4-70112-Win.exe.
  (2) Download a CD image of Ubuntu from
      http://www.ubuntu.com/desktop/get-ubuntu/download.
  (3) Fire up VB, say "new machine", and point it at the Ubuntu CD file.
      Let Ubuntu install.
  (4) In Ubuntu, say "sudo apt-get install yaz php5".
Should get you at least part of the way there.

Bill


Re: [CODE4LIB] reading early versions of FrameMaker...

2011-01-25 Thread Bill Janssen
Hi, Louis.  Thanks for your note.

I have FrameMaker 5, but when I point it at these files, it tells me an
earlier version is needed to open them.  The header in the file (from
1989, for instance) says that the version is MakerFile 1.03.  I'm on
the track of a version of Maker 3.

 The side-topic question I have (given that we're in the middle of trying
  to find the best passive solution for long-term *reliable* storage of data)
  is how were these files stored? Optical media? (If so, what flavor?)
  Harddrives? Tape? (What flavor?)

Hard drives.  PARC has been big on network file systems since the 70's,
of course.  In the early 90's, we were experimenting with the idea of an
eternal infinite network file system.  As part of this, we invested in
some optical archive technology systems, and put a lot of stuff on them
which had previously been stored on tape and disk pack.  Those bits have
been brought forward systematically to successive generations of storage
technology.  (If you're curious, the future we're looking towards today
is content-centric networking (http://www.ccnx.org/) -- data matters,
not where it lives.)

  One possibility I recall. Frame binary files must end at a 1K boundary. When
  they were sent via email, they often got a CR/LF added to the end, and then
  Frame would no longer touch them. We wrote a little utility that trimmed off
  anything after the last 1K boundary that fixed this. So if the file was 2050
  bytes, it truncated it to 2048. If the exact byte sizes are not a multiple
  of 1024, let me know and I'll see if I can find that utility someplace. Or
  you could do it with a binary editor.

Ah, good tip.  I'll look into that.
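
The trimming utility described in the quoted tip is simple enough to
sketch in Python.  This is not the original utility, just a minimal
reconstruction of the idea, assuming (as the tip says) that a valid Frame
binary file ends on a 1K boundary.  The function names are hypothetical.
Run it on a copy, never on the only archival original.

```python
import os

BLOCK_SIZE = 1024  # per the tip: Frame binary files end on a 1K boundary


def trimmed_length(size, block_size=BLOCK_SIZE):
    """Return the largest multiple of block_size that is <= size."""
    return size - (size % block_size)


def trim_to_block_boundary(path, block_size=BLOCK_SIZE):
    """Cut off any bytes after the last full block; return bytes removed.

    Files that already end on a boundary are left alone, and so are files
    shorter than one block, since trimming those would empty them.
    """
    size = os.path.getsize(path)
    new_size = trimmed_length(size, block_size)
    if new_size == 0 or new_size == size:
        return 0
    with open(path, "r+b") as f:
        f.truncate(new_size)
    return size - new_size
```

For the 2050-byte example in the tip, trimmed_length(2050) gives 2048, so
the two stray bytes (a CR/LF, say) are dropped.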

 Last three quotes from:
 http://answerpot.com/showthread.php?185398-FrameMaker+3+files
 
 So while it may not relate to the original question of making older,
 archived files available in modern formats, I suppose we could definitively
 say that so long as the data isn't in a proprietary format, or if it is, so
 long as the software is still in production, you've a good chance at opening
 older data -- as long as it's not corrupt.

I still like the older PARC document formats, tedit and Tioga.  They
stored the plain text at the front of the file, a marker (typically a
run of two zero bytes), and then the binary markup directives, which
referenced byte positions in the plain text.  So, you could visit any
document as a plain-text file and get some sense of what was in it.
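
To make that layout concrete, here is a rough Python sketch of pulling
the readable text out of such a file.  It is based only on the
description above (plain text, then a marker, then binary markup) and
assumes the marker is the first run of two zero bytes.  The real tedit
and Tioga formats may differ in detail, and the function names are made
up for illustration.

```python
MARKER = b"\x00\x00"  # assumed marker: a run of two zero bytes


def plain_text_prefix(data, marker=MARKER):
    """Return the bytes before the first marker, or all of data if none."""
    index = data.find(marker)
    return data if index == -1 else data[:index]


def read_plain_text(path, encoding="ascii"):
    """Read a file and decode only the plain-text portion at its front."""
    with open(path, "rb") as f:
        prefix = plain_text_prefix(f.read())
    # Undecodable bytes are replaced rather than raising an error.
    return prefix.decode(encoding, errors="replace")
```

The binary markup after the marker is simply ignored, which is the point:
the text stays recoverable even with no software that understands the
markup.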

Bill


[CODE4LIB] reading early versions of FrameMaker...

2011-01-24 Thread Bill Janssen
At PARC, we have some digital documents from the early '90s in
FrameMaker versions 1 and 2.  But we have no versions of FrameMaker
suitable for opening them, and re-rendering them in a more accessible
format.  I'm wondering if others have faced this issue in making
archives accessible, and if so, what they did about it?

Bill


Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?

2010-10-20 Thread Bill Janssen
Deng, Sai sai.d...@wichita.edu wrote:

 Do you know the Digital Library systems which can search within the
 documents (e.g. PDFs) and handle access restrictions (e.g. DRM)?

Not sure what you mean by "handle access restrictions".  Do you mean it
can index the documents put into it even if they have DRM encumbrances?

UpLib has search within the documents -- if you search for a word or
phrase, it shows you all the documents which match, but also all the
pages in each document which match.  Supports a wide variety of document
formats, from JPEG2000 to PDF to PowerPoint.  But as far as I know it
doesn't deal with DRM restrictions.

Bill


Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?

2010-10-20 Thread Bill Janssen
Deng, Sai sai.d...@wichita.edu wrote:

 For access restriction, I mean we would like to have certain documents
 open only to certain communities (UpLib cannot do that, right?).

OK, that's not what I typically think of when I hear "DRM".  "Access
control" is (I think) the way it's usually put.

No, UpLib has no built-in access control system, though the hooks are
there, and I know that some have used them to do access control.  I know
of one UpLib application which requires incoming connections to provide
a client certificate, which it uses to give different clients different
access rights.  Probably overkill for most uses.
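
As a rough illustration of the client-certificate approach (this is not
UpLib's hook API, just a sketch using Python's standard ssl module), a
server can refuse connections that lack a trusted certificate and then
map the certificate's common name to a set of access groups.  The group
names, the common names in the table, and the function names are all
hypothetical.

```python
import ssl

# Hypothetical mapping from certificate common names to access groups.
ACCESS_BY_COMMON_NAME = {
    "reading-room-kiosk": {"public"},
    "physics-dept-server": {"public", "physics-licensed"},
}


def make_server_context(server_cert, server_key, client_ca_file):
    """Build a TLS context that rejects clients without a trusted cert."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.load_cert_chain(certfile=server_cert, keyfile=server_key)
    context.load_verify_locations(cafile=client_ca_file)
    context.verify_mode = ssl.CERT_REQUIRED
    return context


def common_name(peer_cert):
    """Extract commonName from the dict returned by SSLSocket.getpeercert()."""
    for rdn in peer_cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


def access_groups(peer_cert, table=ACCESS_BY_COMMON_NAME):
    """Return the access groups for a client; unknown clients get none."""
    return table.get(common_name(peer_cert), set())
```

After the TLS handshake, the application would call access_groups() on
the result of getpeercert() for the connection and check the returned
set before serving a document.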

You'd probably want to do an application-specific Web UI, though -- you
could put the access restrictions there.  I recently saw a Tomcat app
which uses the UpLib Java client-side library to search for documents,
then provides a completely custom UI.

 On second thought, I searched for DSpace full text search and found
 this:
 https://wiki.duraspace.org/display/DSPACE/Configure+full+text+indexing
 However, I haven't seen any instance which shows the full text search
 results as I would see from vendor databases.
 
 Any idea on what system might be good/best for search within documents and 
 DRM?

How about Greenstone?

Bill


Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?

2010-10-20 Thread Bill Janssen
Deng, Sai sai.d...@wichita.edu wrote:

 Thanks for the questions!

 We don't have a clear idea yet and we are looking for a system
 now. The basic idea is that we'll deposit some licensed materials for
 some department and open them only to that group. I guess a local
 account would be ok, of course, if a campus account can be recognized,
 that's better.

In which case you'll need some access control system which can
understand your campus login system.

 They'll need to log in to see the document if it's not
 ip restricted, right? IP restriction might not be the best way since
 faculty members will not always be in their departments.

Will you let them search for documents, and show the search results,
even if they can't retrieve the full document, as the ACM Digital
Library does?  Or do search results have to be filtered, too?
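
The two policies in that question can be sketched in a few lines of
Python.  This is only an illustration under assumed names (Hit,
required_group, visible_hits), not any particular system's API: either
every hit is listed and flagged with whether the user may retrieve it,
or restricted hits are dropped from the list altogether.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    title: str
    required_group: str  # group a user needs in order to retrieve the document


def visible_hits(hits, user_groups, show_restricted=True):
    """Return (hit, can_retrieve) pairs for one user's search results.

    With show_restricted=True every hit is listed and flagged, so users
    can see that a document exists even if they cannot open it.  With
    show_restricted=False, hits the user cannot retrieve are filtered out.
    """
    results = []
    for hit in hits:
        can_retrieve = hit.required_group in user_groups
        if can_retrieve or show_restricted:
            results.append((hit, can_retrieve))
    return results
```

Which policy is right depends on the license terms: some licenses allow
showing that a restricted document exists, others may not.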

How many different access groups will you have?  One per department?
One per licensed set of material?  And what's the approximate size of
each of those numbers?

A simple thing to do would be to install something like DocuShare, which
already does all this stuff and is built on top of Autonomy, one of the
better suites for extracting and indexing content from documents.

Bill



 
 Sophie  
 
 From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Mark Jordan 
 [mjor...@sfu.ca]
 Sent: Wednesday, October 20, 2010 5:08 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] DL Systems (allowing search within documents and 
 access restrictions)?
 
 Sophie,
 
 It might help some of us on the list to understand what types of access 
 control you need if you can describe some of the ways that the allowed users 
 (people and/or departments, to use your examples) will identify themselves? 
 Will they have already logged into the system with a local (to the system) 
 account, or with a campus account that knows that they are part of a specific 
 department? Will they need to log into the system when they request to see a
 specific document? Will where they are sitting matter (i.e., restricted by IP 
 address)?
 
 Mark
 
 Mark Jordan
 Head of Library Systems
 W.A.C. Bennett Library, Simon Fraser University
 Burnaby, British Columbia, V5A 1S6, Canada
 Voice: 778.782.5753 / Fax: 778.782.3023 / Skype: mark.jordan50
 mjor...@sfu.ca
 
 - Original Message -
  Thanks for the information!
  Greenstone has full text search, but I heard that its access control
  is much weaker than DSpace. Will it be able to allow certain documents
  open only to certain people or certain departments?
  Thanks.
  Sophie
  