Re: [GENERAL] Searching for bare letters

2011-10-03 Thread Andrew Sullivan
On Sun, Oct 02, 2011 at 05:45:48PM +0200, Reuven M. Lerner wrote:
 quite grateful for that.  (I really hadn't ever needed to deal with
 such issues in the past, having worked mostly with English and
 Hebrew, which don't have such accent marks.)

That isn't quite true about English.  We have words like coöperate and
naïve.  The former is sometimes fixed with a hyphen instead, but the
latter can't be.

I think what happened is that English speakers, because we're already
used to being sloppy (you can't tell what's a subjunctive in English,
either, just by looking) were willing to adapt our spelling to reflect
the limitations of typewriters.  Also, English never really had an
official standard spelling -- by the time the English were attempting
to standardize seriously, there was already an American branch with
its own Bossypants Official Reformer of Spelling (BORS, which in
that case was Noah Webster.  See G.B. Shaw for a British example).  So
we mostly lost the accents in standard spelling.  We also lost various
standard digraphs, like that in encyclopædia (which, depending on
which branch of nonsense you subscribe to, can be spelled instead
encyclopedia or encyclopaedia; both would have been called wrong
once upon a time).

 As for the unaccent dictionary, I hadn't heard of it before, but
 just saw it now in contrib, and it looks like it might fit
 perfectly.  I'll take a look; thanks for the suggestion.

The big problem there is what someone else pointed to up-thread: in
some languages, the natural thing to do is to transliterate using
multiple characters.  The usual example is that in German it is common
to use "e" after a vowel to approximate the umlaut.  So, "ö" becomes
"oe".  Unfortunately, in Swedish this is clearly a mistake, and if you
can't use the diaeresis, then you just use the undecorated character
instead.  The famous Swedish ship called the Götheborg cannot be
transliterated as "Goetheborg".  Even in German, the rule is
complicated, because it's not two-way: you can't spell the famous
writer's name "Göthe" (even though Google seems to think you can).

As far as I can tell, the unaccent dictionary doesn't handle the
two-character case, though it sure looks like it could be extended to
do it.  But it doesn't seem to have a facility for differentiating
based on the language of the string.  I don't know whether that could
be added.

The upshot is that, if you need to store multilingual input and do
special handling on the strings afterwards, you are wise to store the
string with a language tag so that you can apply the right rules later
on.  See RFC 5646 (http://www.rfc-editor.org/rfc/rfc5646.txt) for some
pointers.  If just stripping accents is good enough for you, then
the unaccent dictionary will probably be good enough.
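
For what it's worth, a minimal sketch of that "just strip accents" path
(unaccent() is the real contrib function; the table and column names here
are invented for illustration):

-- 9.1: CREATE EXTENSION unaccent;  on 9.0, run the contrib SQL script instead
SELECT unaccent('Señor Muñoz');                      -- 'Senor Munoz'

SELECT * FROM people
 WHERE unaccent(name) ILIKE '%' || unaccent('munoz') || '%';

Note that this is strictly per-character stripping, so the oe/ue questions
above still need language-specific handling.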

A

-- 
Andrew Sullivan
a...@crankycanuck.ca



Re: [GENERAL] Searching for bare letters

2011-10-02 Thread Reuven M. Lerner

Hi, everyone.  Uwe wrote:


What kind of client are the users using?  I assume you will have some kind
of user interface. For me this is a typical job for a user interface. The
number of letters with equivalents in different languages are extremely
limited, so a simple matching routine in the user interface should give you a
way to issue the proper query.


The user interface will be via a Web application.  But we need to store 
the data with the European characters, such as ñ, so that we can display 
them appropriately.  So much as I like your suggestion, we need to do 
the opposite of what you're saying -- namely, take a bare letter, and 
then search for letters with accents and such on them.


I am beginning to think that storing two versions of each name, one bare 
and the other not, might be the easiest way to go.   But hey, I'm open 
to more suggestions.
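
A rough sketch of that two-column idea (names invented; unaccent() is the
contrib function suggested elsewhere in this thread):

ALTER TABLE people ADD COLUMN name_bare text;
UPDATE people SET name_bare = unaccent(name);
-- keep name_bare in sync from a trigger or from the application on writes

SELECT * FROM people
 WHERE name_bare ILIKE '%' || unaccent('n') || '%';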


Reuven

--
Reuven M. Lerner -- Web development, consulting, and training
Mobile: +972-54-496-8405 * US phone: 847-230-9795
Skype/AIM: reuvenlerner



Re: [GENERAL] Searching for bare letters

2011-10-02 Thread hamann . w
Reuven M. Lerner wrote:

 Hi, everyone.  I'm working on a project on PostgreSQL 9.0 (soon
   to be upgraded to 9.1, given that we haven't yet launched).  The
   project will involve numerous text fields containing English,
   Spanish, and Portuguese.  Some of those text fields will be
   searchable by the user.  That's easy enough to do; for our
   purposes, I was planning to use some combination of LIKE searches;
   the database is small enough that this doesn't take very much
   time, and we don't expect the number of searchable records (or
   columns within those records) to be all that large.
 The thing is, the people running the site want searches to work
   on what I'm calling (for lack of a better term) "bare" letters.
   That is, if the user searches for "n", then the search should also
   match Spanish words containing "ñ".  I'm told by Spanish-speaking
   members of the team that this is how they would expect searches to
   work.  However, when I just did a quick test using a UTF-8 encoded
   9.0 database, I found that PostgreSQL didn't see the two
   characters as identical.  (I must say, this is the behavior that I
   would have expected, had the Spanish-speaking team member not said
   anything on the subject.)
 So my question is whether I can somehow wrangle PostgreSQL into
   thinking that "n" and "ñ" are the same character for search
   purposes, or if I need to do something else -- use regexps, keep a
   "naked," searchable version of each column alongside the native
   one, or something else entirely -- to get this to work.
 Any ideas?
 Thanks,
 Reuven

I had the same problem with German (there is ä ö ü).
I ended up with a normalized version of the database (for many purposes, this
could be just an extra column) plus preprocessing the input.
There is one difficulty with German searches: these letters are commonly
transliterated into ue etc., as in Muenchen. So depending on culture, some
people would expect a "u" search term to match, and others the "ue". So
preprocessing the query means replacing a bare "u" (not followed by "e") with
a "ue?" regex, as sketched below.
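
A sketch of that query preprocessing in PostgreSQL regex terms (made-up table
and column; it assumes the stored text is already normalized to "ue" etc., and
that the raw search term contains no regex metacharacters):

-- 'Munchen' becomes the pattern 'Mue?nchen', which matches both
-- 'Munchen' and 'Muenchen'
SELECT * FROM places
 WHERE name ~* regexp_replace('Munchen', 'u(?!e)', 'ue?', 'g');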

BTW: if your search form does not explicitly tell the browser to use UTF-8 to
encode the search field, you might expect a small proportion of ISO Latin-1
requests.

Regards
Wolfgang






Re: [GENERAL] Searching for bare letters

2011-10-02 Thread Uwe Schroeder
 Hi, everyone.  Uwe wrote:
  What kind of client are the users using?  I assume you will have some
  kind of user interface. For me this is a typical job for a user
  interface. The number of letters with equivalents in different
  languages are extremely limited, so a simple matching routine in the
  user interface should give you a way to issue the proper query.
 
 The user interface will be via a Web application.  But we need to store
 the data with the European characters, such as ñ, so that we can display
 them appropriately.  So much as I like your suggestion, we need to do
 the opposite of what you're saying -- namely, take a bare letter, and
 then search for letters with accents and such on them.
 
 I am beginning to think that storing two versions of each name, one bare
 and the other not, might be the easiest way to go.   But hey, I'm open
 to more suggestions.
 
 Reuven


That still doesn't hinder you from using a matching algorithm. Here is a simple 
example (to my understanding of the problem).
You have texts stored in the db containing both "n" and "ñ". Now a client 
enters "n" on the website. What you want to do is look for both variations, so 
"n" translates into "n" or "ñ".
There you have it. In the routine that receives the request you have a 
matching method that matches on "n" (or any of the few other characters with 
equivalents) and the routine will issue a query with xx like '%n%' or xx 
like '%ñ%' (personally I would use ilike, since that eliminates the case 
problem).
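
In SQL terms, the expanded query would look something like this (sketch;
table and column names are placeholders):

SELECT * FROM people
 WHERE name ILIKE '%n%'
    OR name ILIKE '%ñ%';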

Since you're referring to a name, I sure don't know the specifics of the 
problem or data layout, but from what I know I think you can tackle this with a 
rather primitive match-and-translate kind of algorithm.

One thing I'd not do: store duplicate versions. There's always a way to deal 
with data the way it is. In my opinion, storing different versions of the same 
data just bloats a database when there is a smarter way to deal with the 
initial data.

Uwe





Re: [GENERAL] Searching for bare letters

2011-10-02 Thread Oleg Bartunov
I don't see the problem - you can have a dictionary, which does all the work of 
recognizing bare letters and outputs several versions. Have you seen the 
unaccent dictionary?

Oleg
On Sun, 2 Oct 2011, Uwe Schroeder wrote:


Hi, everyone.  Uwe wrote:

What kind of client are the users using?  I assume you will have some
kind of user interface. For me this is a typical job for a user
interface. The number of letters with equivalents in different
languages are extremely limited, so a simple matching routine in the
user interface should give you a way to issue the proper query.


The user interface will be via a Web application.  But we need to store
the data with the European characters, such as ñ, so that we can display
them appropriately.  So much as I like your suggestion, we need to do
the opposite of what you're saying -- namely, take a bare letter, and
then search for letters with accents and such on them.

I am beginning to think that storing two versions of each name, one bare
and the other not, might be the easiest way to go.   But hey, I'm open
to more suggestions.

Reuven



That still doesn't hinder you from using a matching algorithm. Here a simple
example (to my understanding of the problem)
You have texts stored in the db both containing a n and a ñ. Now a client
enters n on the website. What you want to do is look for both variations, so
n translates into n or ñ.
There you have it. In the routine that receives the request you have a
matching method that matches on n (or any of the few other characters with
equivalents) and the routine will issue a query with a xx like %n% or xx
like %ñ% (personally I would use ilike, since that eliminates the case
problem).

Since you're referring to a name, I sure don't know the specifics of the
problem or data layout, but by what I know I think you can tackle this with a
rather primitive match - translate to kind of algorithm.

One thing I'd not do: store duplicate versions. There's always a way to deal
with data the way it is. In my opinion storing different versions of the same
data just bloats a database in favor of a smarter way to deal with the initial
data.

Uwe






Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [GENERAL] Searching for bare letters

2011-10-02 Thread Reuven M. Lerner

Hi, Oleg.  You wrote:


I don't see the problem - you can have a dictionary, which does all work
on recognizing bare letters and output several versions. Have you seen
unaccent
dictionary ?


This seems to be the direction that everyone is suggesting, and I'm 
quite grateful for that.  (I really hadn't ever needed to deal with such 
issues in the past, having worked mostly with English and Hebrew, which 
don't have such accent marks.)


As for the unaccent dictionary, I hadn't heard of it before, but just 
saw it now in contrib, and it looks like it might fit perfectly.  I'll 
take a look; thanks for the suggestion.


Reuven

--
Reuven M. Lerner -- Web development, consulting, and training
Mobile: +972-54-496-8405 * US phone: 847-230-9795
Skype/AIM: reuvenlerner



Re: [GENERAL] Searching for bare letters

2011-10-02 Thread Mike Christensen
 I don't see the problem - you can have a dictionary, which does all work
 on recognizing bare letters and output several versions. Have you seen
 unaccent
 dictionary ?

 This seems to be the direction that everyone is suggesting, and I'm quite
 grateful for that.  (I really hadn't ever needed to deal with such issues in
 the past, having worked mostly with English and Hebrew, which don't have
 such accent marks.)

 As for the unaccent dictionary, I hadn't heard of it before, but just saw it
 now in contrib, and it looks like it might fit perfectly.  I'll take a look;
 thanks for the suggestion.

I wrote this code for something similar I was doing, feel free to rip
it off or copy the regular expressions:


using System.Text.RegularExpressions;

// Fold common Latin-1 accented characters (given as \xNN escapes) onto bare ASCII letters.
input = Regex.Replace(input, @"[\xC0-\xC5\xE0-\xE5]", "a");  // À-Å, à-å -> a
input = Regex.Replace(input, @"[\xC8-\xCB\xE8-\xEB]", "e");  // È-Ë, è-ë -> e
input = Regex.Replace(input, @"[\xCC-\xCF\xEC-\xEF]", "i");  // Ì-Ï, ì-ï -> i
input = Regex.Replace(input, @"[\xD1\xF1]", "n");            // Ñ, ñ -> n
input = Regex.Replace(input, @"[\xD2-\xD6\xF2-\xF6]", "o");  // Ò-Ö, ò-ö -> o
input = Regex.Replace(input, @"[\xD9-\xDC\xF9-\xFC]", "u");  // Ù-Ü, ù-ü -> u
input = Regex.Replace(input, @"[\xDD\xDF\xFF]", "y");        // Ý, ß, ÿ -> y (note: \xDF is ß)



Re: [GENERAL] Searching for bare letters

2011-10-02 Thread Eduardo Morras

At 01:25 02/10/2011, Reuven M. Lerner wrote:

Hi, everyone.  I'm working on a project on 
PostgreSQL 9.0 (soon to be upgraded to 9.1, 
given that we haven't yet launched).  The 
project will involve numerous text fields 
containing English, Spanish, and 
Portuguese.  Some of those text fields will be 
searchable by the user.  That's easy enough to 
do; for our purposes, I was planning to use some 
combination of LIKE searches; the database is 
small enough that this doesn't take very much 
time, and we don't expect the number of 
searchable records (or columns within those records) to be all that large.


The thing is, the people running the site want 
searches to work on what I'm calling (for lack 
of a better term) bare letters.  That is, if 
the user searches for n, then the search 
should also match Spanish words containing 
ñ.  I'm told by Spanish-speaking members of 
the team that this is how they would expect 
searches to work.  However, when I just did a 
quick test using a UTF-8 encoded 9.0 database, I 
found that PostgreSQL didn't  see the two 
characters as identical.  (I must say, this is 
the behavior that I would have expected, had the 
Spanish-speaking team member not said anything on the subject.)


So my question is whether I can somehow wrangle 
PostgreSQL into thinking that n and ñ are 
the same character for search purposes, or if I 
need to do something else -- use regexps, keep a 
naked, searchable version of each column 
alongside the native one, or something else entirely -- to get this to work.


Any ideas?


You can use phonetic hashing for that.  There are multiple algorithms, and
some of them can be tuned for specific languages.


See this documentation:

http://en.wikipedia.org/wiki/Phonetic_algorithm for a general description,

http://en.wikipedia.org/wiki/Soundex is the first one developed, very old,

http://en.wikipedia.org/wiki/Metaphone is a 
family of several modern algorithms.


Remember that they are hashing algorithms: some words can collide because
they have the same pronunciation but are written differently.
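
PostgreSQL ships several of these in contrib/fuzzystrmatch (not mentioned
above, so just a pointer); a quick sketch:

-- 9.1: CREATE EXTENSION fuzzystrmatch;
SELECT soundex('Lerner'), soundex('Learner');     -- both 'L656': a collision
SELECT metaphone('Thompson', 10), dmetaphone('Thompson');

These algorithms are designed for English, so results for Spanish or
Portuguese words should be checked before relying on them.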


I remember that the DataparkSearch engine uses them with dictionaries. You
can check it too.


http://www.dataparksearch.org/


Thanks,

Reuven


HTH 






[GENERAL] Searching for bare letters

2011-10-01 Thread Reuven M. Lerner

  
  
Hi, everyone.  I'm working on a project on PostgreSQL 9.0 (soon
  to be upgraded to 9.1, given that we haven't yet launched).  The
  project will involve numerous text fields containing English,
  Spanish, and Portuguese.  Some of those text fields will be
  searchable by the user.  That's easy enough to do; for our
  purposes, I was planning to use some combination of LIKE searches;
  the database is small enough that this doesn't take very much
  time, and we don't expect the number of searchable records (or
  columns within those records) to be all that large.
The thing is, the people running the site want searches to work
  on what I'm calling (for lack of a better term) "bare" letters. 
  That is, if the user searches for "n", then the search should also
  match Spanish words containing "ñ".  I'm told by Spanish-speaking
  members of the team that this is how they would expect searches to
  work.  However, when I just did a quick test using a UTF-8 encoded
  9.0 database, I found that PostgreSQL didn't  see the two
  characters as identical.  (I must say, this is the behavior that I
  would have expected, had the Spanish-speaking team member not said
  anything on the subject.)
So my question is whether I can somehow wrangle PostgreSQL into
  thinking that "n" and "ñ" are the same character for search
  purposes, or if I need to do something else -- use regexps, keep a
  "naked," searchable version of each column alongside the native
  one, or something else entirely -- to get this to work.
Any ideas?
Thanks,
Reuven

 
-- 
Reuven M. Lerner -- Web development, consulting, and training
Mobile: +972-54-496-8405 * US phone: 847-230-9795
Skype/AIM: reuvenlerner

  



Re: [GENERAL] Searching for bare letters

2011-10-01 Thread planas
On Sun, 2011-10-02 at 01:25 +0200, Reuven M. Lerner wrote:

 Hi, everyone.  I'm working on a project on PostgreSQL 9.0 (soon to be
 upgraded to 9.1, given that we haven't yet launched).  The project
 will involve numerous text fields containing English, Spanish, and
 Portuguese.  Some of those text fields will be searchable by the user.
 That's easy enough to do; for our purposes, I was planning to use some
 combination of LIKE searches; the database is small enough that this
 doesn't take very much time, and we don't expect the number of
 searchable records (or columns within those records) to be all that
 large.
 
 The thing is, the people running the site want searches to work on
 what I'm calling (for lack of a better term) bare letters.  That is,
 if the user searches for n, then the search should also match
 Spanish words containing ñ.  I'm told by Spanish-speaking members of
 the team that this is how they would expect searches to work.
 However, when I just did a quick test using a UTF-8 encoded 9.0
 database, I found that PostgreSQL didn't  see the two characters as
 identical.  (I must say, this is the behavior that I would have
 expected, had the Spanish-speaking team member not said anything on
 the subject.)
 
 So my question is whether I can somehow wrangle PostgreSQL into
 thinking that n and ñ are the same character for search purposes,
 or if I need to do something else -- use regexps, keep a naked,
 searchable version of each column alongside the native one, or
 something else entirely -- to get this to work.
 

Could you parse the search string for the non-English characters and
convert them to the appropriate English character? My skills are not
that good or I would offer more details.

 Any ideas?
 
 Thanks,
 
 Reuven
 
 
 
 -- 
 Reuven M. Lerner -- Web development, consulting, and training
 Mobile: +972-54-496-8405 * US phone: 847-230-9795
 Skype/AIM: reuvenlerner



-- 
Jay Lozier
jsloz...@gmail.com


Re: [GENERAL] Searching for bare letters

2011-10-01 Thread Uwe Schroeder


 Hi, everyone.  I'm working on a project on PostgreSQL 9.0 (soon to be
 upgraded to 9.1, given that we haven't yet launched).  The project will
 involve numerous text fields containing English, Spanish, and Portuguese. 
 Some of those text fields will be searchable by the user.  That's easy
 enough to do; for our purposes, I was planning to use some combination of
 LIKE searches; the database is small enough that this doesn't take very
 much time, and we don't expect the number of searchable records (or
 columns within those records) to be all that large. The thing is, the
 people running the site want searches to work on what I'm calling (for
 lack of a better term) bare letters.  That is, if the user searches for
 n, then the search should also match Spanish words containing ñ.  I'm
 told by Spanish-speaking members of the team that this is how they would
 expect searches to work.  However, when I just did a quick test using a
 UTF-8 encoded 9.0 database, I found that PostgreSQL didn't  see the two
 characters as identical.  (I must say, this is the behavior that I would
 have expected, had the Spanish-speaking team member not said anything on
 the subject.) So my question is whether I can somehow wrangle PostgreSQL
 into thinking that n and ñ are the same character for search purposes,
 or if I need to do something else -- use regexps, keep a naked,
 searchable version of each column alongside the native one, or something
 else entirely -- to get this to work. Any ideas?
 Thanks,
 Reuven



What kind of client are the users using?  I assume you will have some kind 
of user interface. For me this is a typical job for a user interface. The 
number of letters with equivalents in different languages is extremely 
limited, so a simple matching routine in the user interface should give you a 
way to issue the proper query.

Uwe



Re: [GENERAL] Searching for bare letters

2011-10-01 Thread Cody Caughlan
One approach would be to normalize all the text and search against that.

That is, basically convert all non-ASCII characters to their equivalents. 

I've had to do this in Solr for searching, for the exact reasons you've 
outlined: treat "ñ" as "n". Ditto for "ü" -> "u", "é" -> "e", etc.

This is easily done in Solr via the included ASCIIFoldingFilterFactory:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory

You could look at the code to see how they do the conversion and implement it.
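
A rough PostgreSQL-side analogue of that folding, using the built-in
translate() on a hand-picked character set (a sketch; extend both lists,
including the uppercase forms, as needed):

SELECT translate('São Gonçalo piñata',
                 'ãâáàçñõóòôéêíúü',
                 'aaaacnooooeeiuu');                -- 'Sao Goncalo pinata'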

/Cody

On Oct 1, 2011, at 7:09 PM, planas wrote:

 On Sun, 2011-10-02 at 01:25 +0200, Reuven M. Lerner wrote:
 Hi, everyone.  I'm working on a project on PostgreSQL 9.0 (soon to be 
 upgraded to 9.1, given that we haven't yet launched).  The project will 
 involve numerous text fields containing English, Spanish, and Portuguese.  
 Some of those text fields will be searchable by the user.  That's easy 
 enough to do; for our purposes, I was planning to use some combination of 
 LIKE searches; the database is small enough that this doesn't take very much 
 time, and we don't expect the number of searchable records (or columns 
 within those records) to be all that large.
 
 The thing is, the people running the site want searches to work on what I'm 
 calling (for lack of a better term) bare letters.  That is, if the user 
 searches for n, then the search should also match Spanish words containing 
 ñ.  I'm told by Spanish-speaking members of the team that this is how they 
 would expect searches to work.  However, when I just did a quick test using 
 a UTF-8 encoded 9.0 database, I found that PostgreSQL didn't  see the two 
 characters as identical.  (I must say, this is the behavior that I would 
 have expected, had the Spanish-speaking team member not said anything on the 
 subject.)
 
 So my question is whether I can somehow wrangle PostgreSQL into thinking 
 that n and ñ are the same character for search purposes, or if I need to 
 do something else -- use regexps, keep a naked, searchable version of each 
 column alongside the native one, or something else entirely -- to get this 
 to work.
 
 Could you parse the search string for the non-English characters and convert 
 them to the appropriate English character? My skills are not that good or I 
 would offer more details.
 Any ideas?
 
 Thanks,
 
 Reuven
 
 
 -- 
 Reuven M. Lerner -- Web development, consulting, and training
 Mobile: +972-54-496-8405 * US phone: 847-230-9795
 Skype/AIM: reuvenlerner
 
 
 -- 
 Jay Lozier
 jsloz...@gmail.com