Re: [CODE4LIB] Language codes

2016-06-01 Thread Andrew Cunningham
It is better to refer to BCP-47 instead.

https://tools.ietf.org/html/bcp47

An RFC can be updated; when it is, it receives a new number. For language
tagging, the relevant information is split across two RFCs. BCP-47 is a
permanent IETF identifier that always references the latest versions of the
two RFCs relating to language tagging.

Andrew

On 2 Jun 2016 9:24 am, "Stuart A. Yeates"  wrote:
>
> I recommend reading https://tools.ietf.org/html/rfc5646 which seems to do
> what you need.
>
> cheers
> stuart
>
> --
> ...let us be heard from red core to black sky
>
> On Thu, Jun 2, 2016 at 10:59 AM, Greg Lindahl  wrote:
>
> > Some of the Internet Archive's library partners are asking us about
> > language metadata for regional languages that don't have standard
> > codes.  Is there a standard way of dealing with this situation?
> >
> > Overall we use MARC codes https://www.loc.gov/marc/languages/ which
> > were last updated in 2007. LOC also maintains ISO639-2
> > https://www.loc.gov/standards/iso639-2/php/code_list.php last updated
> > in 2014.
> >
> > The languages in question are regional languages which are currently
> > lumped together in both standards. With the recent rise in interest
> > and funding for regional languages, it's no surprise that some
> > catalogers want to split these languages out into separate codes.
> >
> > Thanks!
> >
> > -- greg
> >


Re: [CODE4LIB] Language codes

2016-06-01 Thread Andrew Cunningham
On 2 Jun 2016 9:40 am, "Andrew Cunningham" <lang.supp...@gmail.com> wrote:
>
>
> Ultimately it depends on what a library is working on; if you are
cataloguing then all you have is ISO-639-3/B
>

Oops, I meant to type ISO-639-2/B.

Andrew


Re: [CODE4LIB] Language codes

2016-06-01 Thread Andrew Cunningham
Outside the library sector, the most common approach to language tagging
and matching isn't ISO-639-2 or ISO-639-3, but rather BCP-47.

Quite a number of ISO-639-2 language tags represent what ISO-639-3 refers
to as macrolanguages. For instance, 'kar' in ISO-639-2 resolves to 20
language codes in ISO-639-3.

But ISO-639-3 by itself isn't sufficient to fully identify a written
language.

E.g. you could have sr-Cyrl for Serbian written in the Cyrillic script,
sr-Latn for Serbian written in the Latin orthography, and sr-Latn-alalc97
for romanised Serbian based on the ALA-LC Cyrillic romanisation table
published in 1997.

It's worth noting that the only ALA-LC romanisation tables that can be
specified in BCP-47 are the 1997 editions.
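
To make the structure concrete, here is a very rough Python sketch of how
such a tag breaks into subtags. It is illustrative only, not a full RFC 5646
parser (no extlang, extension or private-use handling, no registry
validation):

    def split_tag(tag):
        """Very rough breakdown of a BCP-47 tag into its main subtags."""
        subtags = tag.split('-')
        parsed = {'language': subtags[0].lower(), 'script': None,
                  'region': None, 'variants': []}
        for st in subtags[1:]:
            if len(st) == 4 and st.isalpha() and parsed['script'] is None:
                parsed['script'] = st.title()            # Cyrl, Latn, ...
            elif (len(st) == 2 and st.isalpha()) or (len(st) == 3 and st.isdigit()):
                parsed['region'] = st.upper()            # RS, 419, ...
            else:
                parsed['variants'].append(st.lower())    # alalc97, ...
        return parsed

    split_tag('sr-Latn-alalc97')
    # {'language': 'sr', 'script': 'Latn', 'region': None, 'variants': ['alalc97']}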

Ultimately it depends on what a library is working on; if you are
cataloguing, then all you have is ISO-639-3/B.

If you are working on a digitisation or linked data project, it is much
better to use BCP-47 correctly, which would align your resources more
accurately with the broader information ecosystem in which they exist.

Andrew
On 2 Jun 2016 9:15 am, "Craig Franklin"  wrote:

> We've never had any problems sticking to ISO639-2 codes (in cases where
> there isn't a shorter ISO639-1 code available).  I'm interested in what sort of
> regional languages you might be dealing with where there are significant
> gaps in that standard?
>
> You might also look at ISO 639-3, which is quite comprehensive but also
> introduces a fair chunk of complexity:
>
> http://www-01.sil.org/iso639-3/download.asp
>
> Cheers,
> Craig Franklin
>
> On 2 June 2016 at 08:59, Greg Lindahl  wrote:
>
> > Some of the Internet Archive's library partners are asking us about
> > language metadata for regional languages that don't have standard
> > codes.  Is there a standard way of dealing with this situation?
> >
> > Overall we use MARC codes https://www.loc.gov/marc/languages/ which
> > were last updated in 2007. LOC also maintains ISO639-2
> > https://www.loc.gov/standards/iso639-2/php/code_list.php last updated
> > in 2014.
> >
> > The languages in question are regional languages which are currently
> > lumped together in both standards. With the recent rise in interest
> > and funding for regional languages, it's no surprise that some
> > catalogers want to split these languages out into separate codes.
> >
> > Thanks!
> >
> > -- greg
> >
>


[CODE4LIB] Fwd: [camms-ccaam] Common encoding errors

2016-02-22 Thread Andrew Cunningham
On behalf of Charles Riley:

-- Forwarded message --
From: Riley, Charles <charles.ri...@yale.edu>
Date: 23 February 2016 at 05:37
Subject: [camms-ccaam] Common encoding errors
To: "voyage...@listserv.nd.edu" <voyage...@listserv.nd.edu>, "
lit...@lists.ala.org" <lit...@lists.ala.org>, "camms-cc...@lists.ala.org" <
camms-cc...@lists.ala.org>, "ol-tech-boun...@archive.org" <
ol-tech-boun...@archive.org>, "ole.technical.usergr...@kuali.org" <
ole.technical.usergr...@kuali.org>, "auto...@listserv.syr.edu" <
auto...@listserv.syr.edu>


Hi all,



This is something I’ve noticed happening with somewhat regular, and
probably increasing, frequency lately:  a class of problems with records
containing either escaped entity references from HTML or XML (like
‘’), or accented characters that have become corrupted in a data
migration (like ‘français
<https://openlibrary.org/works/OL10004281W/Les_archets_français>’).  I was
asked by another librarian if I could point them to any resources that deal
with this class of issues, and rounded up a few that I thought would be
good to share.  Here’s what I came across, in terms of examples and
explanations for some of the more common cases:



http://markmcb.com/2011/11/07/replacing-ae%E2%80%9C-ae%E2%84%A2-aeoe-etc-with-utf-8-characters-in-ruby-on-rails/



https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
(But treat this list with caution in using it to search; there will be
false positives for a search for ‘amp;’, for example.)



http://www.i18nqa.com/debug/utf8-debug.html (See also associated links on
this page.)
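
As a small addition to those links: the most common form of the second
problem (UTF-8 that was decoded as Windows-1252 somewhere along the way) can
often be repaired mechanically, and the entity references can be unescaped
with the standard library. A rough Python sketch only; run it over a copy of
the data and review the output by eye before trusting it:

    import html

    def fix_double_encoded(s):
        """Attempt to repair UTF-8 that was mis-decoded as Windows-1252
        (e.g. 'franÃ§ais' -> 'français'). Sketch only."""
        try:
            return s.encode('cp1252').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            return s          # not the pattern we expected, leave untouched

    fix_double_encoded('Les archets franÃ§ais')        # -> 'Les archets français'
    html.unescape('Dvo&#345;&aacute;k &amp; friends')  # undoes HTML entity references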



Hope this helps!



Charles Riley



*Charles Riley*

*Interim Librarian for African Studies and Catalog Librarian*

*Sterling Memorial Library*

*Yale University*



*charles.ri...@yale.edu <charles.ri...@yale.edu>*

*(203)432-7566 or (203)432-9301*







-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: [CODE4LIB] Best way to handle non-US keyboard chars in URLs?

2016-02-21 Thread Andrew Cunningham
Hi,
On Monday, 22 February 2016, Chris Moschini <ch...@brass9.com> wrote:
> On Feb 20, 2016 9:33 PM, "Stuart A. Yeates" <syea...@gmail.com> wrote:
>>
>> 1) With Unicode 8, SignWriting and ASL, the American / international
>> dichotomy is largely specious. Before that there were American indigenous
>> languages (Cheyenne etc.), but in my experience Americans don't usually
>> think of them as American.
>
> It's not about the label, so don't get too hung up on that. It's about
> what's easy to type on a typical US keyboard.
>

If you are accessing a non-English resource, then having characters outside
the basic Latin block would seem to be perfectly acceptable to me.

There are two types of users involved: those that can read the target
language and those that can't.

Those who can should be able to work with keyboards other than a US English
layout. On most devices this is fairly trivial. Not to mention the user may
not actually have the US English keyboard layout as their default input
system.

On a multilingual site I prefer the access points to be in the language of
the resource.

Obviously there are cases where people who cannot read the language need
to access a resource. In those cases I would look at APIs that expose the
resource in a different way, maybe through a transliteration mapping,
rather than having a second URL.
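
As an aside on the mechanics: the readable non-ASCII form and the
percent-encoded ASCII form are two representations of the same identifier,
so both kinds of users can be served by one access point. A quick Python
illustration:

    from urllib.parse import quote, unquote

    # The same access point in its readable (IRI) form and in the
    # percent-encoded ASCII form that browsers actually transmit.
    path = '/catalogue/ภาษาไทย'          # Thai, purely as an example
    encoded = quote(path, safe='/')
    # encoded is now all ASCII: '/catalogue/%E0%B8%A0%E0%B8%B2...'
    unquote(encoded) == path             # True: nothing is lost either way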

Ultimately it comes down to who the users are and why they are accessing
the resource.

It seems to me your primary concern is for users who cannot read the
resource in any event.

Andrew

-- 
Andrew Cunningham
lang.supp...@gmail.com


[CODE4LIB]

2016-02-08 Thread Andrew Cunningham
Thanks, I will look into them.

On 9 February 2016 at 03:56, Han, Yan - (yhan) <y...@email.arizona.edu>
wrote:

> Yes. Use iText or PDFBox
>
> These are common PDF libraries.
>
>
>
>
>
> On 2/6/16, 2:24 PM, "Code for Libraries on behalf of Andrew Cunningham" <
> CODE4LIB@LISTSERV.ND.EDU on behalf of lang.supp...@gmail.com> wrote:
>
> >Hi all,
> >
> >I am working with PDF files in some South Asian and South East Asian
> >languages. Each PDF has ActualText added for each tag in the PDF. Each PDF
> >has ActualText as an alternative for the visible text layer in the PDF.
> >
> >Is anyone aware of tools that will allow me to index and search PDFs based
> >on the ActualText content rather than the visible text layers in the PDF?
> >
> >Andrew
> >
> >--
> >Andrew Cunningham
> >lang.supp...@gmail.com
>



-- 
Andrew Cunningham
lang.supp...@gmail.com


[CODE4LIB]

2016-02-08 Thread Andrew Cunningham
Thanks Levy, I will look at PDFBox and see what I can leverage from it.

Andrew


On 9 February 2016 at 04:33, Levy, Michael <ml...@ushmm.org> wrote:

> There is a method named getActualText() in PDFBox, but there are some listserv
> postings (circa 2012) that indicate that the command-line PDFBox did not
> support extraction of the ActualText contents at that time. That may have
> changed. I'd like to know more.
>
> Thank you Andrew for sending me scurrying to learn about ActualText. I
> don't think we have any in any of the PDFs that I'm indexing, but I
> wouldn't have known it existed without your posting.
>
>
> On Mon, Feb 8, 2016 at 11:56 AM, Han, Yan - (yhan) <y...@email.arizona.edu
> >
> wrote:
>
> > Yes. Use iText or PDFBox
> >
> > These are common PDF libraries.
> >
> >
> >
> >
> >
> > On 2/6/16, 2:24 PM, "Code for Libraries on behalf of Andrew Cunningham" <
> > CODE4LIB@LISTSERV.ND.EDU on behalf of lang.supp...@gmail.com> wrote:
> >
> > >Hi all,
> > >
> > >I am working with PDF files in some South Asian and South East Asian
> > >languages. Each PDF has ActualText added for each tag in the PDF. Each
> PDF
> > >has ActualText as an alternative for the visible text layer in the PDF.
> > >
> > >Is anyone aware of tools that will allow me to index and search PDFs
> based
> > >on the ActualText content rather than the visible text layers in the
> PDF?
> > >
> > >Andrew
> > >
> > >--
> > >Andrew Cunningham
> > >lang.supp...@gmail.com
> >
>



-- 
Andrew Cunningham
lang.supp...@gmail.com


[CODE4LIB]

2016-02-06 Thread Andrew Cunningham
Hi all,

I am working with PDF files in some South Asian and South East Asian
languages. Each PDF has ActualText added for each tag in the PDF. Each PDF
has ActualText as an alternative for the visible text layer in the PDF.

Is anyone aware of tools that will allow me to index and search PDFs based
on the ActualText content rather than the visible text layers in the PDF?
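
For anyone wondering what that extraction might look like, here is a rough
Python sketch using pikepdf to pull /ActualText values out of marked-content
operators. The library choice, attribute names and traversal are assumptions
on my part rather than a known, tested tool, and it ignores properties
referenced by name from /Resources:

    import pikepdf
    from pikepdf import Dictionary, parse_content_stream

    def actual_text(path):
        """Collect /ActualText strings from marked-content (BDC) operators.
        Sketch only: assumes a recent pikepdf."""
        found = []
        with pikepdf.open(path) as pdf:
            for page in pdf.pages:
                for op in parse_content_stream(page):
                    if str(op.operator) == 'BDC' and len(op.operands) == 2:
                        props = op.operands[1]
                        if isinstance(props, Dictionary) and '/ActualText' in props:
                            found.append(str(props['/ActualText']))
        return found

    # The returned strings could then be fed to whatever indexer is in use
    # (Solr, Whoosh, etc.) in place of the extracted visible text layer.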

Andrew

-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: [CODE4LIB] Library community web standards (was: LibGuides v2 - Templates and Nav)

2014-09-30 Thread Andrew Cunningham
Hi Brad,

An interesting idea, but many potential failure points.

I have been in the position of spending considerable time developing best
practice materials on web internationalisation for our state government,
without any prospect of being able to roll it out within our own library.

Whether we are discussing corporate or open source solutions, web
technologies within the library sector are at the long tail of
implementation.

But best practice should be encouraged.

Andrew

On 01/10/2014 12:23 AM, Brad Coffield bcoffield.libr...@gmail.com wrote:

 I agree that it would be a bad idea to endeavor to create our own special
 standards that deviate from accepted web best practices and standards. My
 own thought was more towards a guide for librarians, curated by
librarians,
 that provides a summary of best practices. On the one hand, something to
 help those without a deep tech background to quickly get up to speed with
 best practices instead of needing to conduct a lot of research and
reading.
 But beyond that, it would also be a resource that went deeper for those
who
 wanted to explore the literature.

 So, bullet points and short lists of information accompanied by links to
 additional resources etc. (So, right now, it sounds like a libguide lol)

 Though I do think there would potentially be additional information that
 did apply mostly/only to libraries and our particular sites etc. Off the
 top of my head: a thorough treatment and recommendations regarding
 libguides v2 and accessibility, customizing common library-used products
 (like Serial Solutions 360 link, Worldcat Local and all their competitors)
 so that they are most usable and accessible.

 At its core, though, what I'm picturing is something where librarians get
 together and cut through the noise, pull out best web practices, and
 display them in a quickly digested format. Everything else would be the
 proverbial gravy.

 On Tue, Sep 30, 2014 at 10:01 AM, Michael Schofield mschofi...@nova.edu
 wrote:

  I am interested but I am a little hazy about what kind of standards you
  all are suggesting. I would warn against creating standards that
conflict
  with any actual web standards, because I--and, I think, many
others--would
  honestly recommend that the #libweb should aspire to and adhere more
firmly
  to larger web standards and best practices that conflict with something
  that's more, ah, librarylike. Although that might not be what you folks
  have in mind at all : ).
 
  Michael S.
 
  -Original Message-
  From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
  Brad Coffield
  Sent: Tuesday, September 30, 2014 9:30 AM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] Library community web standards (was: LibGuides
v2
  - Templates and Nav)
 
  Josh, thanks for separating this topic out and starting this new
thread. I
  don't know of any such library standards that exist on the web. I agree
  that this sounds like a great idea. As for this group or not... why not!
  It's 2014 and they don't exist yet and they would be incredibly useful
for
  many libraries, if not all. Now all we need is a cool 'working group'
title
  for ourselves and we're halfway done! Right???
 
  But seriously, I'd love to help.
 
  Brad
 
 
 
 
  --
  Brad Coffield, MLIS
  Assistant Information and Web Services Librarian Saint Francis
University
  814-472-3315
  bcoffi...@francis.edu
 



 --
 Brad Coffield, MLIS
 Assistant Information and Web Services Librarian
 Saint Francis University
 814-472-3315
 bcoffi...@francis.edu


Re: [CODE4LIB] Natural language programming

2014-07-01 Thread Andrew Cunningham
Since you may be looking at Drupal integration down the path, I would look
at using Python and the NLTK, and develop a web service that could then be
used by Drupal. A rough sketch of what I mean is below.
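
Something along these lines, sketched with Flask and scikit-learn purely as
placeholders (the endpoint name and stack are illustrative, not a
recommendation; NLTK would slot in similarly for tokenising and stemming):

    from flask import Flask, request, jsonify
    from sklearn.feature_extraction.text import TfidfVectorizer

    app = Flask(__name__)

    @app.route('/suggest-terms', methods=['POST'])
    def suggest_terms():
        docs = request.get_json()['documents']       # list of extracted texts
        # Fitted per request only to keep the sketch self-contained; in
        # practice you would fit on the whole collection so document
        # frequencies are meaningful.
        vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
        matrix = vectorizer.fit_transform(docs)
        terms = vectorizer.get_feature_names_out()
        suggestions = []
        for row in matrix:
            weights = row.toarray().ravel()
            top = weights.argsort()[::-1][:10]
            suggestions.append([terms[i] for i in top if weights[i] > 0])
        return jsonify(suggestions)

Drupal would then just POST the extracted text to the endpoint and store the
returned terms.
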
On 01/07/2014 11:13 PM, Katie konrad.ka...@gmail.com wrote:

 Hello,

 Has anyone here experience in the world of natural language programming
 (while applying information retrieval techniques)?

 I'm currently trying to develop a tool that will:

 1. take a pdf and extract the text (paying no attention to images or
 formatting)
 2. analyze the text via term weighting, inverse document frequency, and
 other natural language processing techniques
 3. assemble a list of suggested terms and concepts that are weighted
 heavily in that document

 Step 1 is straightforward and I've had much success there. Step 2 is the
 problem child. I've played around with a few APIs (like AlchemyAPI) but
 they have character length limitations or other shortcomings that keep me
 looking.

 The background behind this project is that I work for a digital library
 with a large pre-existing collection of pdfs with rudimentary metadata. The
 aforementioned tool will be used to classify and group the pdfs according
 to the themes of the library. Our CMS is Drupal so depending on my level of
 ambition, this *might* develop into a module.

 Does this sound like a project that has been done/attempted before? Any
 suggested tools or reading materials?



Re: [CODE4LIB] Cataloguing Telugu

2014-04-07 Thread Andrew Cunningham
Stuart, I had a quick look at the proposal; I'm not sure cataloguing is an
appropriate term, nor are they citations.

I suspect that a simple database, web interface, simple search interface
and Telugu collation should suffice. No specific tools would be needed. We
are talking about fairly common web infrastructure requirements; the
challenge will be integrating it with Wikimedia platforms.

Best to discuss that with the internationalisation team at WMF.
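
For the collation piece, a minimal sketch of what I have in mind, assuming
PyICU is available ('te' is the Telugu locale):

    import icu   # PyICU

    # Locale-aware collation for Telugu instead of raw code-point order.
    collator = icu.Collator.createInstance(icu.Locale('te'))
    titles = ['కమలం', 'అమ్మ', 'ఖడ్గం']
    sorted(titles, key=collator.getSortKey)
    # sorts అమ్మ before కమలం before ఖడ్గం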

On 08/04/2014 7:02 AM, Stuart Yeates stuart.yea...@vuw.ac.nz wrote:

 Currently there is a funding proposal for cataloguing Telugu works up
 before the Wikimedia foundation. If anyone has experience with Telugu or
 knows of any tools that are likely to be useful, please give your input:

 https://meta.wikimedia.org/wiki/Grants:IEG/Making_telugu_
 content_accessible

 cheers
 stuart



Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Andrew Cunningham
You may want to consider how best to handle PDF files where the text would
contain ligatures and glyph IDs rather than the underlying characters.
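
Where the extracted text at least contains Unicode presentation-form
ligatures, compatibility normalisation recovers the underlying characters;
text stored as bare glyph IDs has no such easy fix. A quick Python
illustration:

    import unicodedata

    # Fold presentation-form ligatures back to their underlying characters.
    unicodedata.normalize('NFKC', 'ﬁnal oﬃce')   # -> 'final office'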

A.
On 12/10/2013 4:58 AM, Eric Lease Morgan emor...@nd.edu wrote:

 On Oct 11, 2013, at 1:49 PM, Matthew Sherman matt.r.sher...@gmail.com
 wrote:

  For a limited period of time I am making publicly available a Web-based
  program called PDF2TXT -- http://bit.ly/1bJRyh8
 
  Very slick, good work.  I can see where this tool can be very helpful.
  It
  does have some issues with some characters, but this is rather common
 with
  most systems.

 Again, thank you for the support. Yes, there are some escaping issues to
 be resolved. Release early. Release often. I need help with the graphic
 design in general.

 Here's an enhancement I thought of:

   1. allow readers to authenticate
   2. allow readers to upload documents
   3. documents get saved in readers' cache
   4. allow interface to list documents in the cache
   5. provide text mining services against reader-selected documents
   6. go to Step #1

 It would also be cool if I could figure out how to finish the installation
 of Tesseract to enable OCRing. [1]

 [1] OCRing -
 http://serials.infomotions.com/code4lib/archive/2013/201303/1554.html

 --
 Eric Morgan



Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Andrew Cunningham
Hi Mark,

I suspect the tool will only be able to handle select languages, and it is
very doubtful you could develop a tool to handle non-LCG (Latin, Cyrillic,
Greek) text.

For a fully internationalised tool, you would have to ignore all text
layers in a PDF and run all PDFs through OCR to generate text.

Then you'd need to apply very sophisticated word boundary identification
routines.
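
For the word boundary step, ICU's break iteration is the usual starting
point. A rough Python sketch assuming PyICU, and assuming its break
iterators yield successive boundary offsets when iterated:

    import icu   # PyICU

    def segment_words(text, locale='th'):
        """Locale-aware word segmentation via ICU break iteration -- a sketch
        for scripts written without spaces (Thai, Khmer, Burmese, ...)."""
        bi = icu.BreakIterator.createWordInstance(icu.Locale(locale))
        bi.setText(text)
        words, start = [], 0
        for end in bi:
            piece = text[start:end]
            if piece.strip():
                words.append(piece)
            start = end
        return words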

A.
On 12/10/2013 9:40 AM, Mark Pernotto mark.perno...@gmail.com wrote:

 Very cool tool, thank you!

 Putting my devil's advocate hat on, it doesn't parse foreign documents well
 (I got it to break!).  I also got inconsistent results feeding it PDF files
 with tables embedded (but haven't been able to figure out what it is about
 them it doesn't like).

 Just from a curiosity standpoint, what encoding is being utilized?  I know
 nothing about Perl.  It seemed to have no problem parsing a dash (-) if it
 was up against another character (2007-2012), but barfs when it's by itself
 (2007 � 2012). I'm only referring to 'extracted text' mode.

 If it helps, I can send along *most* of my test PDF files used.

 Thank you!
 .m





 On Fri, Oct 11, 2013 at 10:58 AM, Eric Lease Morgan emor...@nd.edu
 wrote:

  On Oct 11, 2013, at 1:49 PM, Matthew Sherman matt.r.sher...@gmail.com
  wrote:
 
   For a limited period of time I am making publicly available a
 Web-based
   program called PDF2TXT -- http://bit.ly/1bJRyh8
  
   Very slick, good work.  I can see where this tool can be very helpful.
   It
   does have some issues with some characters, but this is rather common
  with
   most systems.
 
  Again, thank you for the support. Yes, there are some escaping issues to
  be resolved. Release early. Release often. I need help with the graphic
  design in general.
 
  Here's an enhancement I thought of:
 
1. allow readers to authenticate
2. allow readers to upload documents
3. documents get saved in readers' cache
4. allow interface to list documents in the cache
5. provide text mining services against reader-selected documents
6. go to Step #1
 
  It would also be cool if I could figure out how to finish the
 installation
  of Tesseract to enable OCRing. [1]
 
  [1] OCRing -
  http://serials.infomotions.com/code4lib/archive/2013/201303/1554.html
 
  --
  Eric Morgan
 



Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Andrew Cunningham
Perl has its own encoding model: strings could be Unicode or in a legacy
encoding. Unicode is indicated by the presence of a flag on a string, and
it's decided on a string-by-string basis.

If it is a legacy encoding, then it could be any legacy encoding.

If your data is truly multilingual, multiscript and in a variety of
encodings, it becomes a challenge to manage it in Perl.

In our own projects we found the Perl modules to be inadequate and needed
our own internal modules to handle encoding issues, especially when you
factor in the fact that some CPAN modules have the nasty habit of stripping
the Unicode flag from strings.

Although that said, Perl still has better Unicode support than most
languages.

A.


Re: [CODE4LIB] Python and Ruby

2013-07-29 Thread Andrew Cunningham
Both Ruby and Python have their strengths and weaknesses, and as others
have mentioned, it will come down to need and existing projects you want to
leverage.

We use both Python and Ruby internally.

Know your tools and their strengths and weaknesses.

My personal interest more and more revolves around natural language
processing and its potential in library-based tools. Python is quite
strong in computational linguistics and has useful libraries for natural
language processing.

Andrew
 On 30/07/2013 1:43 AM, Joshua Welker wel...@ucmo.edu wrote:

 Not intending to start a language flame war/holy war here, but in the
 library coding community, is there a particular reason to use Ruby over
 Python or vice-versa? I am personally comfortable with Python, but I have
 noticed that there is a big Ruby following in Code4Lib and similar
 communities. Am I going to be able to contribute and work better with the
 community if I use Ruby rather than Python?

 I am 100% aware that there is no objective way to answer which of the two
 languages is the best. I am interested in the much more narrow question of
 which will work better for library-related scripting projects in terms of
 the following factors:

 -existing modules that I can re-use that are related to libraries (MARC
 tools, XML/RDF tools, modules released by major vendors, etc)
 -availability of help from others in the community
 -interest/ability of others to re-use my code

 Thanks.

 Josh Welker
 Information Technology Librarian
 James C. Kirkpatrick Library
 University of Central Missouri
 Warrensburg, MO 64093
 JCKL 2260
 660.543.8022



Re: [CODE4LIB] Python and Ruby

2013-07-29 Thread Andrew Cunningham
White space is potentially an illusion ... it isn't necessarily there,
esp. when the whitespace is not a character ...

;)
On 30/07/2013 8:02 AM, Michael J. Giarlo leftw...@alumni.rutgers.edu
wrote:

 And you would think Python developers would know how to...

 ( •_•)
 ( •_•)⌐■-■
 (⌐■_■)

 read between the (whitespace) lines?

 YEAH


 On Mon, Jul 29, 2013 at 2:57 PM, Ross Singer rossfsin...@gmail.com
 wrote:

  Muahahahahahahaha!
 
  MUAHAHAHAHAHAHA!
 
  And you walked right into it!  You fools!
 
  -Ross.
 
  On Monday, July 29, 2013, Jay Luker wrote:
 
   On Mon, Jul 29, 2013 at 4:38 PM, Joshua Welker wel...@ucmo.edu
  javascript:;
   wrote:
  
And I hate Python whitespace.
  
   Ah-ha!
  
   A more paranoid pythonista than I might suspect this whole thread was
   simply an exercise in Ruby shilling.
  
   --jay
  
 



Re: [CODE4LIB] tiff2pdf, then back to pdf?

2013-04-26 Thread Andrew Cunningham
Although I do find the persistent myth of PDF/A as an archival format
amusing.

Under very specific circumstances it can be, but it's rare for those
circumstances to be deliberately met.

And for many languages it is impossible to use PDF for archival purposes
ever.

It is the nature of PDF.
On 27/04/2013 8:28 AM, Jason Curtis cur...@sandiego.edu wrote:

 Hi, Edward:

 After reading through the string of messages and the options that you list
 below, I think that #3 is your best option.  It seems to best fall in line
 with good archiving practices as I understand them (have one copy for
 public use and another for archival purposes).  If you really want to
 convert the TIFF to PDF and ditch the TIFF file, I would suggest using
 PDF/A, the archival version of PDF, if you can.  Best of luck!

 Sincerely,
 Jason

 __
 Jason Curtis
 Technical Services Librarian
 Legal Research Center
 University of San Diego
 5998 Alcalá Park
 San Diego, CA 92110
 Ph: (619) 260-4600, ext.2875
 Fax: (619) 260-7495
 cur...@sandiego.edu

 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Edward M. Corrado
 Sent: Friday, April 26, 2013 2:55 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] tiff2pdf, then back to pdf?

 On Fri, Apr 26, 2013 at 5:29 PM, Ethan Gruber ewg4x...@gmail.com wrote:

  What's your use case in this scenario? Do you want to provide access
  to the PDFs over the web or are you using them as your archival
  format?  You probably don't want to use PDF to achieve both objectives.
 



 The problem I have is I have multipage TIFF files and I don't currently
 have a good way for users to view them. I also need to preserve these
 files. Ideally my use case would be to use PDF files created from the TIFFs
 for both preservation and an archival format. But, as I said, that depends
 on if I can recreate the original tiff. I have the option of creating a
 custom viewer that can deal with the the display of the tiff files, but I'm
 looking for other options.

 So I have a few choices that I thought of implementing (that I haven't
 ruled out):

 1) This is what I asked about. Make a PDF from the TIFF files. If I could
 embed the tiff into a pdf, and then at some point recreate the tiff if
 needed for archival purposes, I have my solution.

 2) Convert the multipage TIFF files to individual TIFF files. This would
 work for my endusers, but would be more clunky than a PDF for them. The new
 TIFF fiels could be my archival copy.

 3) Convert the multipage TIFF files to PDF (probably in a smaller,
 compressed? state), use the PDF for display/access, save the TIFF for
 archival purposes.

 4) Convert the multipage TIFFs to PDF (or PDF/A?), and don't worry about
 being able to recreate the original TIFF files.

 I should add, the content is what is important in these documents and they
 are mostly type written or hand written text. Still, I'd like to keep them
 in as high quality of a format as possible.

 I'm sure there are some other possible solutions as well. I really would
 like #1, but it may not be possible. If it isn't, I need to decide (with
 representatives of my user community) which of the others are better. My
 guess is it would be #3, but I am not positive.

 Edward






 
  Ethan
  On Apr 26, 2013 5:11 PM, Edward M. Corrado ecorr...@ecorrado.us
 wrote:
 
   This works sometimes. Well, it does give me a new tiff file from the
   pdf all of the time, but it is not always anywhere near the same
   size as the original tiff. My guess is that maybe there is a flag or
   something that would help. Here is what I get with one file:
  
  
   ecorrado@ecorrado:~/Desktop/test$ convert -compress none A001a.tif A001a.pdf
   ecorrado@ecorrado:~/Desktop/test$ convert -compress none A001a.pdf A001b.tif
   ecorrado@ecorrado:~/Desktop/test$ ls -al
   total 361056
   drwxrwxr-x 2 ecorrado ecorrado 4096 Apr 26 17:07 .
   drwxr-xr-x 7 ecorrado ecorrado20480 Apr 26 16:54 ..
   -rw-rw-r-- 1 ecorrado ecorrado 38497046 Apr 26 17:07 A001a.pdf
   -rw-r--r-- 1 ecorrado ecorrado 38178650 Apr 26 17:07 A001a.tif
   -rw-rw-r-- 1 ecorrado ecorrado  5871196 Apr 26 17:07 A001b.tif
  
  
   In this case, the two tif files should be the same size. They are
   not
  even
   close. Maybe there is a flag to convert (besides compress) that I
   can
  use.
   FWIW: I tried three files/ 2 are like this. The other one, the
   resulting tiff is the same size as the original.
  
   Edward
  
  
  
  
  
   On Fri, Apr 26, 2013 at 4:25 PM, Aaron Addison 
  addi...@library.umass.edu
   wrote:
  
Imagemagick's convert will do it both ways.
   
convert a.tiff b.pdf
convert b.pdf a.tiff
   
If the pdf is more than one page, the tiff will be a multipage tiff.
   
Aaron
   
--
Aaron Addison
Unix Administrator
W. E. B. Du Bois Library UMass Amherst
413 577 2104
   
   
   
On Fri, 2013-04-26 at 16:08 -0400, Edward M. Corrado 

Re: [CODE4LIB] From Chinese characters to convert Pinyin and Traditional and Simplified Chinese and Hangul

2013-04-18 Thread Andrew Cunningham
HI Wataru,

very interesting script, although I'd be inclined to suggest an
enhancement.

It would be useful to add language tagging to the input field and each of
the conversions.

The page as it stands will not use appropriate fonts for each language; web
browsers need appropriate language tagging to facilitate appropriate font
fallback behaviours.

Andrew


On 18 April 2013 19:29, Wataru Ono ono.wataru.p...@gmail.com wrote:

 Hi,

 I'm Wataru ONO, librarian at Hitotsubashi University Library in Japanese.

 This tool is From Chinese characters to convert Pinyin and
 Traditional and Simplified Chinese and Hangul
 https://googledrive.com/host/0B_vZSxPrv8xmVnZwSkk0ZmU2Zmc/han2pin.html

 You can convert between Simplified and Traditional Chinese and
 Japanese characters.
 This is made of pure javascript.
 If you are interested in this tool, please feel free to use and down load.

 Best regards




-- 
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunning...@slv.vic.gov.au
  lang.supp...@gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/


Re: [CODE4LIB] one tool and/or resource that you recommend to newbie coders in a library?

2012-11-01 Thread Andrew Cunningham
My 2 cents worth ... and one for each cent:

* Komodo Edit

* www.w3.org/International



On 2 November 2012 07:24, Bohyun Kim k...@fiu.edu wrote:

 Hi all code4lib-bers,

 As coders and coding librarians, what is ONE tool and/or resource that you
 recommend to newbie coders in a library (and why)?  I promise I will create
 and circulate the list and make it into a Code4Lib wiki page for collective
 wisdom.  =)

 Thanks in advance!
 Bohyun

 ---
 Bohyun Kim, MA, MSLIS
 Digital Access Librarian
 bohyun@fiu.edu
 305-348-1471
 Medical Library, College of Medicine
 Florida International University
 http://medlib.fiu.edu
 http://medlib.fiu.edu/m (Mobile)




-- 
Andrew Cunningham
Project Manager, Research and Development
Social and Digital Inclusion Unit
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunning...@slv.vic.gov.au
  lang.supp...@gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/


Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-20 Thread Andrew Cunningham
 on marc4j (which
is used heavily by SolrMarc)
 is that for any significant processing of Marc records the only solution
that makes sense is to
 translate the record data into Unicode characters as it is being read
in.  Of course as you and others
 have stated, determining what the data actually is, in order to
correctly translate it to Unicode, is
 no easy task.  The leader byte that merely indicates "is UTF8" or "is not
 UTF8" is wrong often enough in the real world that it is of little value
 when it indicates "is UTF-8" and is even less value when it indicates
 "is not UTF-8".

 Significant portions of the code I've added to marc4j deal with trying
to determine what the encoding
 of that data actually is and trying to translate the data correctly into
Unicode even when the data is
 incorrect.

 You also argued in another message that cataloger entry tools should
 give feedback to help the cataloger not create errors.   I agree.  I
 think one possible step towards this would be that the editor must work
in Unicode, irrespective of
 the data format that the underlying system expects the data to be.  If
the underlying system expects
 MARC8 then the save as process should be able to translate the data
into MARC8 on output.

 -Robert Haschart


-- 
Andrew Cunningham
Senior Project Manager, Research and Development
Vicnet
State Library of Victoria
Australia

andr...@vicnet.net.au
lang.supp...@gmail.com


Re: [CODE4LIB] Unicode font for PDF generation?

2012-03-18 Thread Andrew Cunningham
There are no pan-Unicode fonts. The last one I saw was for Unicode 2.0.

There is a limit to the number of glyphs a font can contain.

It is possible to create a subset of Unicode and place it in a single font,
but you need to be able to identify your current and future character
requirements.

But I'm not sure why you need a single font, unless your XML-to-PDF
conversion can't process stylesheets.
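
On the merging question, FontForge can be scripted from Python. A rough
sketch with placeholder file names; check that the source fonts' licences
permit this kind of modification, and test how overlapping codepoints are
resolved on a small sample first:

    import fontforge

    # Merge glyphs from a CJK font into a Latin/Cyrillic base font and
    # generate a new font file. Sketch only.
    base = fontforge.open('DejaVuSans.ttf')
    base.mergeFonts('SourceHanSans.ttf')
    base.fontname = 'MergedPanUnicode'
    base.generate('merged.ttf')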

Andrew

On Saturday, 17 March 2012, Mark Redar mark.re...@ucop.edu wrote:
 Hi All,

 We're having some fun with unicode characters in PDF generation. We have
a process that automatically generates a pdf from XML input. The tool stack
doesn't support multiple fonts for displaying different codepoints so we
need a good pan-unicode font to bundle with the pdfs.

 Currently, we use the DejaVu font family for creating the pdfs. This has
good coverage for latin  cyrillic characters but has no CJK
(chinese-japanese-korean) coverage. We've looked into licensing a
commercial fonts, but for web server use these require annual licensing
fees that are substantial (in the thousands of $).
 A number of our source documents contain CJK characters and some
contributors have noticed the lack of support for these characters.

 Does anyone know of a good pan-unicode free font that includes CJK
codepoints that looks good? Gnu unifont has the coverage, but it is not the
best looking font.

 Barring that, we're thinking of rolling our own pan-unicode font. There
are good open source fonts for portions of the unicode character sets.
We're hoping to find some way to take a number of open source fonts and
combine them into one large pan-unicode font.

 Does anyone have experience with font authoring and merging different
fonts?

 It looks as though FontForge can merge fonts, but it's not clear how to
deal with overlapping codepoints in the merged fonts.

 Thanks,

 Mark


-- 
Andrew Cunningham
Senior Project Manager, Research and Development
Vicnet
State Library of Victoria
Australia

andr...@vicnet.net.au
lang.supp...@gmail.com


Re: [CODE4LIB] Unicode font for PDF generation?

2012-03-18 Thread Andrew Cunningham
A couple of additional thoughts:

• The most complete CJK font projects require 2 fonts to handle all CJK
characters
• There are language-specific glyph variations between Chinese and
Japanese, so the ideal situation is to use different fonts tailored for each
On Saturday, 17 March 2012, Mark Redar mark.re...@ucop.edu wrote:
 Hi All,

 We're having some fun with unicode characters in PDF generation. We have
a process that automatically generates a pdf from XML input. The tool stack
doesn't support multiple fonts for displaying different codepoints so we
need a good pan-unicode font to bundle with the pdfs.

 Currently, we use the DejaVu font family for creating the pdfs. This has
good coverage for latin  cyrillic characters but has no CJK
(chinese-japanese-korean) coverage. We've looked into licensing a
commercial fonts, but for web server use these require annual licensing
fees that are substantial (in the thousands of $).
 A number of our source documents contain CJK characters and some
contributors have noticed the lack of support for these characters.

 Does anyone know of a good pan-unicode free font that includes CJK
codepoints that looks good? Gnu unifont has the coverage, but it is not the
best looking font.

 Barring that, we're thinking of rolling our own pan-unicode font. There
are good open source fonts for portions of the unicode character sets.
We're hoping to find some way to take a number of open source fonts and
combine them into one large pan-unicode font.

 Does anyone have experience with font authoring and merging different
fonts?

 It looks as though FontForge can merge fonts, but it's not clear how to
deal with overlapping codepoints in the merged fonts.

 Thanks,

 Mark


-- 
Andrew Cunningham
Senior Project Manager, Research and Development
Vicnet
State Library of Victoria
Australia

andr...@vicnet.net.au
lang.supp...@gmail.com


Re: [CODE4LIB] Unicode font for PDF generation?

2012-03-18 Thread Andrew Cunningham
For additional CJKV fonts look at:

http://en.wikipedia.org/wiki/List_of_CJK_fonts






-- 
Andrew Cunningham
Senior Project Manager, Research and Development
Vicnet
State Library of Victoria
Australia

andr...@vicnet.net.au
lang.supp...@gmail.com


Re: [CODE4LIB] Plea for help from Horowhenua Library Trust to Koha Community

2011-11-23 Thread Andrew Cunningham
On 23 November 2011 06:32, MJ Ray m...@phonecoop.coop wrote:
 Mike Taylor m...@indexdata.com


 2. Koha means akin to gift.  The irony of trying to trademark that
 word in particular is mindboggling and should shame PTFS in the eyes
 of everyone who likes sharing information - basically all of us who
 are involved with libraries at some level, isn't it?


I'm wondering if cultural property rights can be used to overturn a
trademark. Not only is koha a Māori word, it is a cultural concept.



-- 
Andrew Cunningham
Senior Project Manager, Research and Development
Vicnet
State Library of Victoria
Australia

andr...@vicnet.net.au
lang.supp...@gmail.com


Re: [CODE4LIB] Plea for help from Horowhenua Library Trust to Koha Community

2011-11-23 Thread Andrew Cunningham
I'd be inclined to have a quiet chat with Māori political activists
and see what their feelings are on non-New Zealand companies applying
for trademark status on Māori words in New Zealand.

-- 
Andrew Cunningham
Senior Project Manager, Research and Development
Vicnet
State Library of Victoria
Australia

andr...@vicnet.net.au
lang.supp...@gmail.com


Re: [CODE4LIB] MARCXML - What is it for?

2010-10-27 Thread Andrew Cunningham
I'd suspect that MARCXML isn't going anywhere fast, a shame perhaps.

The key difference between MARCXML and MARC is that MARCXML inherits
XML's internationalisation features.

That is an aspect in which MARC is very poor.

Andrew

-- 
Andrew Cunningham
Senior Project Manager, Research and Development
Vicnet
State Library of Victoria
Australia

andr...@vicnet.net.au
lang.supp...@gmail.com


Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Andrew Cunningham
Hi

2009/12/17 stuart yeates stuart.yea...@vuw.ac.nz:

 If, however, you need to deal with characters which don't qualify for
 inclusion in Unicode (or which do qualify but which haven't yet been
 assigned code points), I recommend tei:glyph:





 http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-glyph.html

 We use this to represent typographically interesting but short-lived
 approaches to the representation of Māori in printed works. See for example
 the 'wh' ligature (which looks like a 'vh' and is pronounced in modern usage
 like 'f') in the following text:

An interesting approach, although not the only way to address that
particular issue.

It depends on whether you want to treat it as a ligature or as a character.

Other approaches have been to:
1) use PUA assignments, e.g. the MUFI and SIL PUA
assignments/registries as examples; or
2) use U+200D to request ligation

Both these approaches would require specifically defined or modified fonts.
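
In terms of the text itself, the two approaches look like this (the PUA
codepoint is purely illustrative; a real project would use a registered
assignment such as MUFI's):

    # Two ways the 'wh' ligature in 'whare' could be carried in the text;
    # both depend on a font built to display them.
    pua_form = '\ue001are'        # a single PUA codepoint standing in for the ligature
    zwj_form = 'w\u200dhare'      # ZERO WIDTH JOINER requesting ligation of w + h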

 http://www.nzetc.org/tm/scholarly/tei-Auc1911NgaM-t1-body-d4.html

 for the underlying TEI XML representation see:

 http://www.nzetc.org/tei-source/Auc1911NgaM.xml

 cheers
 stuart
 --
 Stuart Yeates
 http://www.nzetc.org/       New Zealand Electronic Text Centre
 http://researcharchive.vuw.ac.nz/     Institutional Repository




-- 
Andrew Cunningham
Vicnet Research and Development Coordinator
State Library of Victoria
Australia

andr...@vicnet.net.au
lang.supp...@gmail.com