[CODE4LIB] character-sets for dummies?

2009-12-16 Thread Ken Irwin
Hi all,

I'm looking for a good source to help me understand character sets and how to 
use them. I pretty much know nothing about this - the whole world of Unicode, 
ASCII, octal, UTF-8, etc. is baffling to me.

My immediate issue is that I think I need to integrate data from a variety of 
character sets into one MySQL table - I expect I need some way to convert from 
one to another, but I don't really even know how to tell which data are in 
which format.

Our homegrown journal list (akin to SerialsSolutions) includes data ingested 
from publishers, vendors, the library catalog (III), etc. When I look at the 
data in emacs, some of it renders like this:
 Revista de Oncolog\303\255a  [slashes-and-digits instead of 
diacritics]
And other data looks more like:
 Revista de Música Latinoamericana  [weird characters instead of diacritics]

My MySQL table is currently set up with the collation set to utf8_bin, and 
the titles from the second category (weird characters display in emacs) render 
properly when the database data is output to a web browser. The data from 
the former example (\###) renders as an "I don't know what character this is" 
placeholder in Firefox and IE.

So, can someone please point me toward any or all of the following?

· A good primer for understanding all of this stuff

· A method for converting all of my data to the same character set so 
it plays nicely in the database

· The names of which character-sets I might be working with here

Many thanks!

Ken


Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Chad Fennell
A classic general overview (on the topic of what the heck ARE
character sets???):

http://www.joelonsoftware.com/articles/Unicode.html





Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Hagedon, Mike
This is probably one place to start:

http://www.joelonsoftware.com/articles/Unicode.html

Mike



Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Jonathan Rochkind
So, character encodings are really confusing, even for those who have 
dealt with them before. I'm not sure if there is a good 'dealing with 
character encodings for dummies' book, but if there is, I think I could 
use it too!


But from your case, I can say: Ideally your source records are in a 
_known_ character set.  Either they are in a format where it's 
documented somewhere which encoding that format always uses, or they are in a 
format that specifies exactly what the encoding is. You didn't mention 
exactly where your data is coming from.


For instance, MARC data is (legally) always in either UTF-8 or MARC-8.  
And there's a byte somewhere in the MARC header that specifies which one.
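
As a rough illustration of that byte: in MARC 21 it is position 09 of the 24-byte leader, where "a" means UCS/Unicode (UTF-8 in practice) and a blank means MARC-8. The sketch below is only a minimal check under that assumption; the function name and the sample leaders are made up for the example.

```python
def marc_declared_encoding(record: bytes) -> str:
    """Report the encoding a MARC 21 record claims in leader position 09."""
    if len(record) < 24:
        raise ValueError("a MARC 21 leader is 24 bytes long")
    flag = record[9:10]  # leader/09, the character coding scheme
    if flag == b"a":
        return "utf-8"
    if flag == b" ":
        return "marc-8"
    return "unknown"


# Hypothetical leaders, identical except for position 09.
print(marc_declared_encoding(b"00714cam a2200205 a 4500"))
print(marc_declared_encoding(b"00714cam  2200205 a 4500"))
```

This only reports what the record declares; it does not verify that the data actually matches.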


Assuming that byte is set properly.  If you really don't have any 
'metadata' specifying what character encoding your data is in, and you 
have to guess from the data itself... that's not good. There's no 
foolproof way to do this, it's going to rely on heuristics. So you'd 
probably want to first narrow down a set of possibilities it could be 
encoded in, and then look around on the web for heuristic algorithms to 
try and guess from among that set.


Or you could just try assuming everything is UTF-8, and see if it 
works.  Your examples look like they _could_ be UTF-8, hard to say.
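
One way to test that guess, sketched in Python on the assumption that the \303\255 emacs shows stands for two raw bytes written in octal: decode them as UTF-8 and see what comes out. The second half reproduces the "weird characters" by reading UTF-8 bytes as Latin-1, which is one common cause of that symptom but not necessarily what happened to this data.

```python
# \303\255 as displayed by emacs: two bytes in octal notation.
raw = bytes([0o303, 0o255])
print(raw.hex())            # the same two bytes in hexadecimal
print(raw.decode("utf-8"))  # what they mean if the data is UTF-8

title = b"Revista de Oncolog" + raw + b"a"
decoded = title.decode("utf-8")
print(decoded)

# UTF-8 bytes read as Latin-1 give the two-character garbage;
# reversing the misreading recovers the original text.
mojibake = "Música".encode("utf-8").decode("latin-1")
repaired = mojibake.encode("latin-1").decode("utf-8")
print(mojibake, "->", repaired)
```

If both samples decode cleanly this way, the data is probably UTF-8 already and the difference lies in how each program displays it.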


Because once you do figure out what everything is, I can recommend with 
confidence that what you want to do is translate EVERYTHING into UTF-8 
in your database.  Try to do all UTF-8 all the time, and it will save 
you a world of headaches. And once you know what something is, you 
should be able to find a tool to translate it to UTF-8.


Hope this helps somewhat get you started thinking about the questions to 
ask. Character encoding issues are definitely confusing. Which is why 
the more UTF-8 the better, just get everything into UTF-8 and don't look 
back.


Jonathan



Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Nate Vack
On Wed, Dec 16, 2009 at 11:24 AM, Walker, David dwal...@calstate.edu wrote:

 If you're looking to convert that data to UTF-8 (which I assume you would), 
 then your best friend is a program from Index Data called yaz-marcdump, which 
 comes with the Yaz toolkit.  It runs on Linux and Windows, and can be invoked 
 from the command line or from scripts to quickly and painlessly convert your 
 catalog data into UTF-8.

Do keep in mind that if you've got a *mix* of character encodings in
your database, you may have a Big Annoying Problem. Unless you know
what records are in what format, there's no general way to do a
conversion.

You can use the sweet sweet python 'chardet' library to get a good
idea of what encoding things are in, and maybe run things through
iconv to normalize them to UTF8.
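
For a feel of what such a normalizing pass does, here is a much cruder stand-in that uses only the Python standard library rather than chardet or iconv: accept anything that is already valid UTF-8, and otherwise assume a single legacy encoding. The function name and the cp1252 fallback are assumptions for the sketch; with a real mix of legacy encodings the fallback will guess wrong for some records.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return raw re-encoded as UTF-8, guessing the source encoding."""
    try:
        # Text in a legacy 8-bit encoding rarely happens to be valid UTF-8,
        # so a clean strict decode is a strong hint.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback; bytes undefined in it become U+FFFD.
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")


latin1_title = "Revista de Música Latinoamericana".encode("latin-1")
utf8_title = "Revista de Oncología".encode("utf-8")
print(to_utf8(latin1_title).decode("utf-8"))
print(to_utf8(utf8_title).decode("utf-8"))
```

Pure ASCII passes through unchanged, since ASCII is valid UTF-8.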

Cheers,
-Nate


Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Brian Stamper

When you see this kind of error:

 Revista de Música Latinoamericana[weird characters instead of  
diacritics]


if you can look at the data in a web browser it can be used as a tool to  
help you identify the correct encoding. Web browsers usually render  
character sets based on whatever appears in this line in the HTML source:


<meta http-equiv="content-type" content="text/html; charset=UTF-8">

but most browsers allow you to force a different character encoding, so if  
something is rendering incorrectly you can use browser display options to  
try to find the correct set. It would be under something like View >  
Encoding > (whatever). I find Opera to be great for this because I was  
able to add a handy button to quickly cycle through the most common  
encodings. Of course, web browsers in general might not grok MARC-8, but  
you get the idea.
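
The same cycling can be done outside a browser. This is a small sketch that decodes one byte string under several candidate encodings so the results can be compared by eye; the candidate list is an arbitrary assumption, not a list of what any browser offers.

```python
def render_as(raw: bytes, encodings=("utf-8", "latin-1", "cp1252", "mac_roman")):
    """Decode the same bytes under several candidate encodings."""
    return {name: raw.decode(name, errors="replace") for name in encodings}


raw = "Revista de Música Latinoamericana".encode("utf-8")
for name, text in render_as(raw).items():
    print(f"{name:>9}: {text}")
```

Whichever line reads correctly is the likely encoding; several candidates can look plausible when the sample contains few non-ASCII characters.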



Brian Stamper
The Ohio State University Libraries
Scholarly Resources Integration
610 Ackerman Road Rm. 5833
Columbus, OH 43202-4500


Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Cory Rockliff
If you're looking for a book-length treatment, 'Unicode Explained' is 
fairly readable, and the first three chapters are about character 
encodings in general:


http://books.google.com/books?id=PcWU2yxc8WkC&printsec=frontcover


Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Doran, Michael D
Hi Ken,

In an effort to better understand character sets myself, I have brought 
together some information on my website, with an emphasis on library automation 
and the internet environment:
  
  Coded Character Sets: A Technical Primer for Librarians
  http://rocky.uta.edu/doran/charsets/

Make sure you look at the "Resources on the Web" page, too 
(http://rocky.uta.edu/doran/charsets/resources.html).

The quote about character sets that most resonated with me was "An apparently 
simple subject which turns out to be brutally complicated."  They are 
definitely worth learning about, though!  Have fun.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/
 



[CODE4LIB] Payment for code4lib 2010?

2009-12-16 Thread Mads Villadsen
I just went through the registration for code4lib 2010, but at no point was I 
asked for payment information - but I did get a Registration Receipt.

How is the payment supposed to take place?

Regards.

Mads Villadsen


Re: [CODE4LIB] Payment for code4lib 2010?

2009-12-16 Thread Debbie La
TRY REGISTERING AGAIN,,,SAME THING HAPPENED TO ME,,,GL
 

Re: [CODE4LIB] Payment for code4lib 2010?

2009-12-16 Thread Edward M. Corrado
Mads,

There was a glitch with the form (since fixed). I am waiting to hear
from Kevin Clarke if we should register again or if we will be billed
some other way.

Edward







Re: [CODE4LIB] Payment for code4lib 2010?

2009-12-16 Thread Matthew Bachtell
I was only charged the $25 for the preconference but wish to attend
the entire event.  Should I also wait or reregister?





Re: [CODE4LIB] Payment for code4lib 2010?

2009-12-16 Thread Beacom, Matthew
Where is the registration form?

Matthew





[CODE4LIB] registration

2009-12-16 Thread Kevin S. Clarke
So it seems we had a few hiccups with registrations.  It should be
fixed now.  Hiccups included 1) not charging at all for the conference
and 2) incrementing the cost of the conference each time you went back
to register (clearing your cookies should help with the last one - or
switch to a different browser).

For those of you who got in free of charge, we'll contact you to
arrange to get the money from you.  You are registered though.

Everyone should be receiving a confirmation email.  If you don't, you
might have hit the site while it was being fixed.  Register again and
we'll refund if you get charged twice.

Kevin


Re: [CODE4LIB] Payment for code4lib 2010?

2009-12-16 Thread Lovins, Daniel
https://wcupg.wcu.edu/C20252_ustores/web/product_detail.jsp?PRODUCTID=49&SINGLESTORE=true

 


Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Ken Irwin
Hi all -- thanks for these fabulous replies. I'm learning a lot. 

Armed with a bit of new knowledge, I've done some tinkering. I think I've 
solved my original quandaries, and have opened new cans of worms. I have a few 
more specific questions:

1) It appears that once I switch my MySQL table over from a latin character set 
to UTF-8, it is no longer case-insensitive (this makes sense based on what I 
learned from the Joel on Software post). All of the scripting I've done until 
now takes advantage of the case insensitivity; is there an easy way to keep 
this case insensitive while in UTF-8? 

2) Is there a good/easy way to make the database agnostic about diacritics, so 
that a search for "cafe" will also find "café"? 

The answers to both of these may be "convert data to some normalized A-Z field 
that never displays", but I can only imagine that normalizing even 
most-Roman-characters-with-diacritics to plain ASCII-style characters can be 
a daunting task.
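
For what it's worth, the normalized-field idea can be sketched in a few lines of Python with the standard unicodedata module: decompose each character, drop the combining marks, and fold case. The function name is hypothetical, and this is only one possible approach.

```python
import unicodedata


def search_key(text: str) -> str:
    """Build a case- and diacritic-insensitive key for matching."""
    decomposed = unicodedata.normalize("NFD", text)  # é -> e + combining acute
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(search_key("Café"))
print(search_key("Revista de Oncología"))
```

Letters that have no decomposition, such as ø or ł, pass through unchanged, so a field built this way covers most but not all Roman characters with diacritics.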

Any advice on these particulars? 

Thanks,
Ken


Re: [CODE4LIB] Payment for code4lib 2010?

2009-12-16 Thread Kevin S. Clarke
If you were charged only for the preconference, you're in (and will be
contacted later for the conference fee).  We had a form glitch, but it
should be fixed now.

Kevin








[CODE4LIB] Another approach to shared conference transportation

2009-12-16 Thread Michael B. Klein
I've added a page to the code4lib wiki to help coordinate rides to/from
Charlotte Douglas International Airport, Asheville Regional Airport, and
other locations. If you need a ride, or have a car and are willing to share
a ride, please sign up so we get everyone matched up!

http://wiki.code4lib.org/index.php/C4L2010rideshare

Michael


Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Thomale, J
Ken,

Great suggestions so far--I have just one thing to add.

If you ever reach the point at which you find yourself examining code tables to 
figure out what character set something is using, you might also want to find a 
good hex editor so that you can examine your data byte by byte. Since what 
you're looking at otherwise is always going to be the data as interpreted by a 
particular program (email program, web browser, text editor), looking at it 
with a hex editor can give you a nice grounding in reality, without that extra 
layer of interpretation.

I use XVI32: http://www.chmaas.handshake.de/delphi/freeware/xvi32/xvi32.htm
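
Where installing a hex editor is not convenient, a few lines of Python give a similar byte-by-byte view. This is a minimal sketch, not a substitute for a real tool; the function name and output layout are made up here.

```python
def hex_dump(raw: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex values, and printable ASCII."""
    lines = []
    for offset in range(0, len(raw), width):
        chunk = raw[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part.ljust(width * 3)} {ascii_part}")
    return "\n".join(lines)


# The same word stored two ways: í is two bytes in UTF-8, one in Latin-1.
print(hex_dump("Oncología".encode("utf-8")))
print(hex_dump("Oncología".encode("latin-1")))
```

To inspect real data, pass in bytes read from a file opened in binary mode.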

Jason Thomale
Metadata Librarian
Texas Tech University Libraries





Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread stuart yeates

Ken Irwin wrote:

Hi all,

I'm looking for a good source to help me understand character sets and how to 
use them. I pretty much know nothing about this - the whole world of Unicode, 
ASCII, octal, UTF-8, etc. is baffling to me.


Other people have recommended a whole lot of fabulous resources, so I 
won't cover ground they already have.


If, however, you need to deal with characters which don't qualify for 
inclusion in Unicode (or which do qualify but which haven't yet been 
assigned code points), I recommend tei:glyph:


http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-glyph.html

We use this to represent typographically interesting but short-lived 
approaches to the representation of Māori in printed works. See for 
example the 'wh' ligature (which looks like a 'vh' and is pronounced in 
modern usage like 'f') in the following text:


http://www.nzetc.org/tm/scholarly/tei-Auc1911NgaM-t1-body-d4.html

for the underlying TEI XML representation see:

http://www.nzetc.org/tei-source/Auc1911NgaM.xml

cheers
stuart
--
Stuart Yeates
http://www.nzetc.org/   New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/ Institutional Repository


Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread KREYCHE, MICHAEL
Ken--

You may find a reason to create a normalized stealth field, but I have a 
couple of suggestions that will probably help you avoid that scenario.

1) Read up a little on the Unicode Normalization Forms 
(http://unicode.org/reports/tr15/) and convert all your UTF-8 characters to the 
composed form (NFC). The standard for MARC data is the decomposed form (NFD), 
but this is a real pain to work with if you like things to sort nicely (at 
least in MySQL). One way to do this is in perl with Unicode::Normalize. 

2) Use a collation other than utf8_bin (here's where you lost your case 
insensitivity, I think). Try utf8_unicode_ci (ci as in case insensitive).
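
To make point 1 concrete, here is a small Python sketch of the difference between the decomposed and composed forms. It uses Python's standard unicodedata module as a stand-in for the Perl Unicode::Normalize module mentioned above, so it shows the idea and not that module's interface.

```python
import unicodedata

# Decomposed (NFD-style): "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "cafe\u0301"
# Composed (NFC): the single precomposed character U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # number of code points in each form
print(decomposed == composed)          # they display alike but differ as strings
print(composed.encode("utf-8"))        # bytes that would go into the database
```

Because the two forms are different byte sequences, a byte-wise comparison or sort treats them as different strings, so converting everything to one form before loading avoids that mismatch.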

I wish I had written down everything I learned about this stuff, but I 
didn't--and I keep having to go back and refresh my memory.

Mike
--
Michael Kreyche
Systems Librarian / Associate Professor
Libraries and Media Services 
Kent State University
330-672-1918



[CODE4LIB] SVN/Mercurial hosting

2009-12-16 Thread Yitzchak Schaffer

Hello all,

As I was considering whether to migrate our SVN repositories to 
Mercurial (or possibly Bazaar) so as to allow for distributed control 
(like if I'm on the train or otherwise off the grid), I got word from 
our IT higher-ups that they want us to stop hosting our code on our 
domain and server.


Before I start trekking around looking for hosting, does anyone in the 
crowd here have a server set up, and is potentially willing to host 
Trac+SVN or Trac+HG for our open-source projects?  We currently have two.


Alternately, I'd love to hear suggestions on regular hosting providers - 
particularly for Trac+Mercurial.


Many thanks,

--
Yitzchak Schaffer
Systems Manager
Touro College Libraries
33 West 23rd Street
New York, NY  10010
Tel (212) 463-0400 x5230
Fax (212) 627-3197
Email yitzchak.schaf...@gmx.com

Access Problems? Contact systems.libr...@touro.edu


Re: [CODE4LIB] NYC carpool?

2009-12-16 Thread Yitzchak Schaffer

On 12/16/2009 2:27 PM, Schwartz, Raymond wrote:

Anyone in the New York City area interested in this?  /Ray


As an addendum, the wiki page for updates on the carpool is
http://wiki.code4lib.org/index.php/C4L2010planning:RoommatesRidesEtc

I am still planning to go, pending getting funding approval.  I do *not* 
plan on attending the preconferences; if this were a breaking factor in 
the carpool, I would reconsider.


--
Yitzchak Schaffer
Systems Manager
Touro College Libraries
33 West 23rd Street
New York, NY  10010
Tel (212) 463-0400 x5230
Fax (212) 627-3197
Email yitzchak.schaf...@gmx.com

Access Problems? Contact systems.libr...@touro.edu


Re: [CODE4LIB] SVN/Mercurial hosting

2009-12-16 Thread Mark A. Matienzo
Hi Yitzchak,

I've been pretty happy with using BitBucket [1] to host Mercurial
repositories. It doesn't have Trac, but it does have its own decently
featured issue tracker, commit log viewer, and wiki system. The free
plan is generous enough for you to get started.

[1] http://bitbucket.org/

Mark A. Matienzo
Applications Developer, Strategic Planning
The New York Public Library






Re: [CODE4LIB] SVN/Mercurial hosting

2009-12-16 Thread Ross Singer
Also, Google Code offers both HG and SVN support.

http://code.google.com/projecthosting/

I have several projects there (although haven't used Mercurial) and
certainly find it a lot less frustrating than admin'ing Trac.

-Ross.

On Wed, Dec 16, 2009 at 2:39 PM, Mark A. Matienzo m...@matienzo.org wrote:
 Hi Yitzchak,

 I've been pretty happy with using BitBucket [1] to host Mercurial
 repositories. It doesn't have Trac, but it does have its own decently
 featured issue tracker, commit log viewer, and wiki system. The free
 plan is generous enough for you to get started.

 [1] http://bitbucket.org/

 Mark A. Matienzo
 Applications Developer, Strategic Planning
 The New York Public Library



 On Wed, Dec 16, 2009 at 2:22 PM, Yitzchak Schaffer
 yitzchak.schaf...@gmx.com wrote:
 Hello all,

 As I was considering whether to migrate our SVN repositories to Mercurial
 (or possibly Bazaar) so as to allow for distributed control (like if I'm on
 the train or otherwise off the grid), I got word from our IT higher-ups that
 they want us to stop hosting our code on our domain and server.

 Before I start trekking around looking for hosting, does anyone in the crowd
 here have a server set up, and is potentially willing to host Trac+SVN or
 Trac+HG for our open-source projects?  We currently have two.

 Alternately, I'd love to hear suggestions on regular hosting providers -
 particularly for Trac+Mercurial.

 Many thanks,

 --
 Yitzchak Schaffer
 Systems Manager
 Touro College Libraries
 33 West 23rd Street
 New York, NY  10010
 Tel (212) 463-0400 x5230
 Fax (212) 627-3197
 Email yitzchak.schaf...@gmx.com

 Access Problems? Contact systems.libr...@touro.edu




Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Doran, Michael D
Hi Ken,

 1) It appears that once I switch my MySQL table over from a latin
 character set to UTF-8

My understanding is that a database character set is essentially a *label* that
means "My intention is to put data encoded in X character set in columns/fields
of certain string datatypes."  I'm more familiar with Oracle than with MySQL,
but I assume they are similar in that changing the database character set from
Latin-1 to UTF-8 doesn't change any data, just how that data is labeled.  If
all that data *was* UTF-8 then all is well.  If some of the data was in a
different character set, you still have the problem of data in mixed character
sets in columns of similar datatype (a database no-no).
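
To make the "label" point concrete, here is a minimal sketch in Python (the language is an assumption; the thread does not say what Ken scripts in). It reads the same stored bytes under two different labels, and it decodes the octal escapes emacs displayed, which are simply the raw UTF-8 bytes. The helper looks_like_utf8 is hypothetical and only a heuristic: bytes that decode cleanly as UTF-8 are probably, not certainly, UTF-8.

```python
# Same bytes, two labels: the bytes never change, only the interpretation.
title = "Revista de M\u00fasica Latinoamericana"
stored = title.encode("utf-8")            # the bytes actually sitting in the column

as_utf8 = stored.decode("utf-8")          # label matches the data
as_latin1 = stored.decode("latin-1")      # label does not match: mojibake

# emacs showed \303\255 because those are the two UTF-8 bytes of i-acute.
emacs_bytes = b"Revista de Oncolog\303\255a"
decoded = emacs_bytes.decode("utf-8")


def looks_like_utf8(raw: bytes) -> bool:
    """Hypothetical helper: True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(as_utf8)
print(as_latin1)
print(decoded)
```

Running a check like looks_like_utf8 over each ingested source before loading is one way to find which rows need converting, under the assumption that the non-UTF-8 sources are single-byte encodings such as Latin-1.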

 2) Is there a good/easy way to make the database agnostic about
 diacritics, so that a search for cafe will also find café
 
 The answers to both of these may be "convert data to some normalized
 A-Z field that never displays", but I can only imagine that normalizing
 even most-Roman-characters-with-diacritics to plain ASCII-style
 characters can be a daunting task.

When I hear "normalized A-Z" it strikes me as a very English-centric approach.
That may be fine for your particular database and situation, but it tends not
to scale well if at some point you find yourself having to deal with non-Roman
languages.  If you are learning about character sets, you might as well aim for
solutions that will have a wider applicability.  ;-)

As suggested by Michael Kreyche, normalization is important, both for your 
database data and also in regards to user-supplied search terms.  Unlike Mr. 
Kreyche, I would strongly advocate for NFD, the *decomposed* normalized form.  
Once both the search terms and the data are NFD, the quick-and-dirty way is to 
then strip out any combining characters and match on what remains.  This is not 
ideal, since in some languages, certain accented characters are considered to 
be different characters (and sort differently, too, if correctly localized) 
than the base, un-accented character.  However, I am guessing that will 
probably work fine for your purposes.
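
The quick-and-dirty approach described above can be sketched in a few lines of Python using the standard unicodedata module (Python is an assumption here, and the function names strip_diacritics and fold are made up for illustration). Adding casefold() also covers the case-insensitivity question, under the assumption that the folded value is stored in a separate search column that never displays.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop every combining character."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold(text: str) -> str:
    """Search key: diacritics removed and case folded."""
    return strip_diacritics(text).casefold()


print(fold("Caf\u00e9"))                    # precomposed e-acute
print(fold("Cafe\u0301"))                   # e followed by a combining acute
print(fold("Revista de Oncolog\u00eda"))
```

Applying fold to both the stored titles and the user's search term means a search for cafe and a search for café produce the same key.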

Personally, I think a search feature that would list exact matches first (i.e.
terms that match before stripping out the combining characters) and then fuzzy
matches (i.e. terms that didn't match the first iteration but that match after
stripping out the combining characters) is better, but it is also more complex
to implement and perhaps overkill in this situation.
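
A rough sketch of that two-pass ranking, again in Python as an assumption. The function ranked_search is hypothetical; it matches a single-word term against whitespace-separated words of each title, which is a simplification (a real search would need proper tokenizing and probably an index rather than a loop).

```python
import unicodedata


def nfd(text: str) -> str:
    return unicodedata.normalize("NFD", text)


def strip_marks(text: str) -> str:
    return "".join(ch for ch in nfd(text) if not unicodedata.combining(ch))


def ranked_search(term: str, titles: list[str]) -> list[str]:
    """Exact (diacritic-sensitive) matches first, then diacritic-blind ones."""
    wanted_exact = nfd(term).casefold()
    wanted_fuzzy = strip_marks(term).casefold()
    exact, fuzzy = [], []
    for title in titles:
        if wanted_exact in nfd(title).casefold().split():
            exact.append(title)
        elif wanted_fuzzy in strip_marks(title).casefold().split():
            fuzzy.append(title)
    return exact + fuzzy


titles = ["Cafe Society", "Caf\u00e9 Europa", "Tea Room"]
print(ranked_search("caf\u00e9", titles))
print(ranked_search("cafe", titles))
```

Words are compared whole rather than as substrings because, in NFD, the plain string "cafe" is a prefix of the decomposed "café", so a substring test would wrongly count it as an exact match.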

Depending on which scripting language you are using (and how much trouble you 
want to go to) I may have some more (opinionated) suggestions.  If you end up 
coding some of this yourself, you may also want to investigate the Unicode 
Properties/Sub-Properties available in regular expressions.  They provide a lot 
of power and flexibility.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/
 

 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Ken Irwin
 Sent: Wednesday, December 16, 2009 12:26 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] character-sets for dummies?
 
 Hi all -- thanks for these fabulous replies. I'm learning a lot.
 
 Armed with a bit of new knowledge, I've done some tinkering. I think
 I've solved my original quandaries, and have opened new cans of worms.
 I have a few more specific questions:
 
 1) It appears that once I switch my MySQL table over from a latin
 character set to UTF-8, it is no longer case-insensitive (this makes
 sense based on what I learned from the Joel on Software post). All of
 the scripting I've done until now takes advantage of the case
 insensitivity; is there an easy way to keep this case insensitive while
 in UTF-8?
 
 2) Is there a good/easy way to make the database agnostic about
 diacritics, so that a search for cafe will also find café
 
 The answers to both of these may be "convert data to some normalized
 A-Z field that never displays", but I can only imagine that normalizing
 even most-Roman-characters-with-diacritics to plain ASCII-style
 characters can be a daunting task.
 
 Any advice on these particulars?
 
 Thanks,
 Ken


Re: [CODE4LIB] SVN/Mercurial hosting

2009-12-16 Thread Nate Vack
On Wed, Dec 16, 2009 at 1:22 PM, Yitzchak Schaffer
yitzchak.schaf...@gmx.com wrote:

 Before I start trekking around looking for hosting, does anyone in the crowd
 here have a server set up, and is potentially willing to host Trac+SVN or
 Trac+HG for our open-source projects?  We currently have two.

If you're not married to Mercurial or bzr for your distributed source
control, and would be willing to play with git, GitHub is positively lovely
and free for open-source projects.

http://github.com/

Cheers,
-Nate


[CODE4LIB] New WorldCat Basic API Released

2009-12-16 Thread Roy Tennant
The new WorldCat Basic API http://worldcat.org/devnet/wiki/BasicAPIDetails
provides an easy method to search WorldCat http://worldcat.org/ for items
in libraries such as books, videos, music and more, and receive results back
in XML. The OpenSearch protocol is supported, and results are returned in
Atom and RSS. Information included in results are authors, titles, ISBNs and
OCLC numbers. Records will be returned in standard bibliographic citation
formats such as APA, Chicago, Harvard, MLA and Turabian. The API will also
provide links back to WorldCat.org for geographically-sorted library
information. 
 
Anyone can gain access to the WorldCat Basic API for non-commercial purposes
from the WorldCat Affiliates site
http://www.worldcat.org/wcpa/content/affiliate/, by signing up for a
unique key to use the service. Commercial uses of the WorldCat Basic API are
encouraged, but interested parties are required to contact the WorldCat
Partnership team
mailto:owcpart...@oclc.org?subject=commercial%20app%20interest%20for%20worldCat%20Basic%20API
to arrange access.

OCLC Web Services offers ways to connect people to knowledge through
libraries, and for libraries to reap the benefits of library cooperation.
There are many other Web Services http://www.oclc.org/services/web/
available from OCLC, including the more full-featured WorldCat Search API,
and also ready-made WorldCat widgets
http://www.worldcat.org/wcpa/content/affiliate/ available for download.
 
Feedback about any of the OCLC Web Services is welcome on the OCLC Developer
Network blog and mailing list, available from the OCLC Developer Network
site http://worldcat.org/devnet/.
Roy Tennant


Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread Andrew Cunningham
Hi

2009/12/17 stuart yeates stuart.yea...@vuw.ac.nz:

 If, however, you need to deal with characters which don't qualify for
 inclusion in Unicode (or which do qualify but which haven't yet been
 assigned code points), I recommend tei:glyph:

 http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-glyph.html

 We use this to represent typographically interesting but short-lived
 approaches to the representation of Māori in printed works. See for example
 the 'wh' ligature (which looks like a 'vh' and is pronounced in modern usage
 like 'f') in the following text:

An interesting approach, although not the only way to address that
particular issue. It depends on whether you want to treat it as a ligature
or as a character.

Other approaches have been to:
1) use PUA assignments, e.g. the MUFI and SIL PUA
assignments/registries; or
2) use U+200D (ZERO WIDTH JOINER) to request ligation

Both these approaches would require specifically defined or modified fonts.
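
For anyone who wants to see what those two options look like at the code-point level, here is a small Python sketch (the language and the helper names request_ligature and is_private_use are assumptions for illustration). It only builds and inspects the strings; whether a ligature is actually drawn still depends entirely on the font.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def request_ligature(first: str, second: str) -> str:
    """Join two characters with ZWJ to ask the renderer for a ligature."""
    return first + ZWJ + second


def is_private_use(ch: str) -> bool:
    """True if ch is a Private Use code point (general category 'Co')."""
    return unicodedata.category(ch) == "Co"


wh = request_ligature("w", "h")
print([f"U+{ord(ch):04X}" for ch in wh])
print(is_private_use("\ue000"), is_private_use("w"))
```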

 http://www.nzetc.org/tm/scholarly/tei-Auc1911NgaM-t1-body-d4.html

 for the underlying TEI XML representation see:

 http://www.nzetc.org/tei-source/Auc1911NgaM.xml

 cheers
 stuart
 --
 Stuart Yeates
 http://www.nzetc.org/       New Zealand Electronic Text Centre
 http://researcharchive.vuw.ac.nz/     Institutional Repository




-- 
Andrew Cunningham
Vicnet Research and Development Coordinator
State Library of Victoria
Australia

andr...@vicnet.net.au
lang.supp...@gmail.com


Re: [CODE4LIB] character-sets for dummies?

2009-12-16 Thread stuart yeates

Andrew Cunningham wrote:

Hi

2009/12/17 stuart yeates stuart.yea...@vuw.ac.nz:


If, however, you need to deal with characters which don't qualify for
inclusion in Unicode (or which do qualify but which haven't yet been
assigned code points), I recommend tei:glyph:

http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-glyph.html

We use this to represent typographically interesting but short-lived
approaches to the representation of Māori in printed works. See for example
the 'wh' ligature (which looks like a 'vh' and is pronounced in modern usage
like 'f') in the following text:


An interesting approach, although not the only way to address that
particular issue. It depends on whether you want to treat it as a ligature
or as a character.

Other approaches have been to:
1) use PUA assignments, e.g. the MUFI and SIL PUA
assignments/registries; or
2) use U+200D (ZERO WIDTH JOINER) to request ligation


On reflection, this is a subtly more general approach than our TEI one, 
since this allows new non-glyph characters to be introduced as well as 
new glyph characters.


OTOH, there are a limited number of PUA code-points, a constraint that 
the TEI approach does not suffer.


[For those unfamiliar with Unicode PUA mechanisms, see 
http://unicode.org/faq/casemap_charprop.html#8 and 
http://www.alanwood.net/unicode/private_use_area.html ]
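
To put a number on "limited", the three Private Use ranges defined by the Unicode Standard can be tallied with a few lines of Python (the language is an assumption; the ranges themselves come from the standard):

```python
# The three Private Use ranges in Unicode, as inclusive (start, end) pairs.
PUA_RANGES = [
    (0xE000, 0xF8FF),      # Private Use Area in the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
]

sizes = [end - start + 1 for start, end in PUA_RANGES]
total = sum(sizes)

print(sizes)
print(total)
```

Only the first range sits in the Basic Multilingual Plane; the two supplementary ranges need software and fonts that handle code points above U+FFFF.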



Both these approaches would require specifically defined or modified fonts.


In our case, when generating (X)HTML (our primary delivery formats) we 
substitute character images cut from page scans of the original 
documents. Generating the right HTML and CSS for this is non-trivial.


cheers
stuart
--
Stuart Yeates
http://www.nzetc.org/   New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/ Institutional Repository


Re: [CODE4LIB] New WorldCat Basic API Released

2009-12-16 Thread Ziso, Ya'aqov
 Information included in results are authors, titles, ISBNs and OCLC 
numbers ...
==
Roy Tennant, are there no results for subjects? If there are, from which specific indexes?
Ya'aqov Ziso