Re: [GENERAL] Searching for bare letters
On Sun, Oct 02, 2011 at 05:45:48PM +0200, Reuven M. Lerner wrote:

> quite grateful for that. (I really hadn't ever needed to deal with such issues in the past, having worked mostly with English and Hebrew, which don't have such accent marks.)

That isn't quite true about English. We have words like coöperate and naïve. The former is sometimes fixed with a hyphen instead, but the latter can't be. I think what happened is that English speakers, because we're already used to being sloppy (you can't tell what's a subjunctive in English, either, just by looking), were willing to adapt our spelling to reflect the limitations of typewriters. Also, English never really had an official standard spelling -- by the time the English were attempting to standardize seriously, there was already an American branch with its own Bossypants Official Reformer of Spelling (BORS, which in that case was Noah Webster; see G.B. Shaw for a British example). So we mostly lost the accents in standard spelling. We also lost various standard digraphs, like that in encyclopædia (which, depending on which branch of nonsense you subscribe to, can be spelled instead encyclopedia or encyclopaedia; both would have been called wrong once upon a time).

> As for the unaccent dictionary, I hadn't heard of it before, but just saw it now in contrib, and it looks like it might fit perfectly. I'll take a look; thanks for the suggestion.

The big problem there is what someone else pointed to up-thread: in some languages, the natural thing to do is to transliterate using multiple characters. The usual example is that in German it is common to use e after a vowel to approximate the umlaut. So, ö becomes oe. Unfortunately, in Swedish this is clearly a mistake, and if you can't use the diaeresis, then you just use the undecorated character instead. The famous Swedish ship called the Götheborg cannot be transliterated as Goetheborg.
Even in German, the rule is complicated, because it's not two-way: you can't spell the famous writer's name Göthe (even though Google seems to think you can). As far as I can tell, the unaccent dictionary doesn't handle the two-character case, though it sure looks like it could be extended to do it. But it doesn't seem to have a facility for differentiating based on the language of the string. I don't know whether that could be added.

The upshot is that, if you need to store multilingual input and do special handling on the strings afterwards, you are wise to store the string with a language tag so that you can apply the right rules later on. See RFC 5646 (http://www.rfc-editor.org/rfc/rfc5646.txt) for some pointers. If just stripping accents is good enough for you, then the unaccent dictionary will probably be good enough.

A

--
Andrew Sullivan
a...@crankycanuck.ca

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
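[Editor's note] Andrew's point that the right folding rule depends on the language of the string can be sketched in a few lines. The rule tables below (German "vowel + e", Swedish "drop the mark") and the function name are illustrative assumptions, not anything shipped with PostgreSQL or its unaccent module:

```python
# Sketch: language-dependent accent folding, keyed by an RFC 5646-style
# language tag. Rule tables here are illustrative, not exhaustive.

FOLDING_RULES = {
    # German convention: umlauts transliterate to vowel + e, ß to ss
    "de": {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"},
    # Swedish convention: just drop the diacritic
    "sv": {"å": "a", "ä": "a", "ö": "o"},
}

def fold(text: str, lang: str) -> str:
    """Fold accented characters according to the rules for `lang`."""
    rules = FOLDING_RULES.get(lang, {})
    return "".join(rules.get(ch, ch) for ch in text.lower())

# The same word folds differently depending on the language tag,
# which is exactly why the tag needs to be stored with the string:
# fold("Götheborg", "de") -> "goetheborg"
# fold("Götheborg", "sv") -> "gotheborg"
```

This is the reason a single global substitution table cannot be correct for both German and Swedish input at once.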
Re: [GENERAL] Searching for bare letters
Hi, everyone. Uwe wrote:

> What kind of client are the users using? I assume you will have some kind of user interface. For me this is a typical job for a user interface. The number of letters with equivalents in different languages is extremely limited, so a simple matching routine in the user interface should give you a way to issue the proper query.

The user interface will be via a Web application. But we need to store the data with the European characters, such as ñ, so that we can display them appropriately. So much as I like your suggestion, we need to do the opposite of what you're saying -- namely, take a bare letter, and then search for letters with accents and such on them.

I am beginning to think that storing two versions of each name, one bare and the other not, might be the easiest way to go. But hey, I'm open to more suggestions.

Reuven

--
Reuven M. Lerner -- Web development, consulting, and training
Mobile: +972-54-496-8405 * US phone: 847-230-9795
Skype/AIM: reuvenlerner
Re: [GENERAL] Searching for bare letters
Reuven M. Lerner wrote:

> Hi, everyone. I'm working on a project on PostgreSQL 9.0 (soon to be upgraded to 9.1, given that we haven't yet launched). The project will involve numerous text fields containing English, Spanish, and Portuguese. Some of those text fields will be searchable by the user. That's easy enough to do; for our purposes, I was planning to use some combination of LIKE searches; the database is small enough that this doesn't take very much time, and we don't expect the number of searchable records (or columns within those records) to be all that large.
>
> The thing is, the people running the site want searches to work on what I'm calling (for lack of a better term) "bare" letters. That is, if the user searches for "n", then the search should also match Spanish words containing "ñ". I'm told by Spanish-speaking members of the team that this is how they would expect searches to work. However, when I just did a quick test using a UTF-8 encoded 9.0 database, I found that PostgreSQL didn't see the two characters as identical. (I must say, this is the behavior that I would have expected, had the Spanish-speaking team member not said anything on the subject.)
>
> So my question is whether I can somehow wrangle PostgreSQL into thinking that "n" and "ñ" are the same character for search purposes, or if I need to do something else -- use regexps, keep a "naked," searchable version of each column alongside the native one, or something else entirely -- to get this to work.
>
> Any ideas?
>
> Thanks,
>
> Reuven

I had the same problem with German (there are ä, ö, and ü). I ended up with a normalized version of the database (for many purposes, this could be just an extra column) plus preprocessing the input. There is one difficulty with German searches: these letters are commonly transliterated into ue etc., like in Muenchen. So depending on culture, some people would expect a u search term to match, and others the ue.
So preprocessing the query means replacing a bare u (not followed by e) with a "ue?" regex.

BTW: if your search form does not explicitly tell the browser to use utf8 to encode the search field, you might expect a small proportion of iso-latin1 requests.

Regards
Wolfgang
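[Editor's note] Wolfgang's preprocessing idea can be sketched as a small term-rewriter. This version extends his "ue?" suggestion so that a bare vowel also matches the umlaut form itself; the expansion table is an assumption for illustration, and the resulting pattern is the kind of thing you could feed to PostgreSQL's ~* operator:

```python
import re

# Expand a bare search term so it matches the plain vowel, the umlaut,
# and the "vowel + e" transliteration. The table is illustrative.
EXPANSIONS = {"a": "(a|ä|ae)", "o": "(o|ö|oe)", "u": "(u|ü|ue)"}

def expand_term(term: str) -> str:
    """Turn a bare term into a regex covering umlaut/digraph spellings."""
    out = []
    for i, ch in enumerate(term):
        # A bare vowel not already followed by 'e' gets expanded;
        # "ue" typed by the user is left alone, per Wolfgang's caveat.
        if ch in EXPANSIONS and not (i + 1 < len(term) and term[i + 1] == "e"):
            out.append(EXPANSIONS[ch])
        else:
            out.append(re.escape(ch))
    return "".join(out)

# expand_term("munchen") -> "m(u|ü|ue)nchen", which matches both
# "münchen" and "munchen"; a term already containing "ue" is untouched.
```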
Re: [GENERAL] Searching for bare letters
> Hi, everyone. Uwe wrote:
>
>> What kind of client are the users using? I assume you will have some kind of user interface. For me this is a typical job for a user interface. The number of letters with equivalents in different languages is extremely limited, so a simple matching routine in the user interface should give you a way to issue the proper query.
>
> The user interface will be via a Web application. But we need to store the data with the European characters, such as ñ, so that we can display them appropriately. So much as I like your suggestion, we need to do the opposite of what you're saying -- namely, take a bare letter, and then search for letters with accents and such on them.
>
> I am beginning to think that storing two versions of each name, one bare and the other not, might be the easiest way to go. But hey, I'm open to more suggestions.
>
> Reuven

That still doesn't hinder you from using a matching algorithm. Here's a simple example (to my understanding of the problem): you have texts stored in the db containing both a n and a ñ. Now a client enters n on the website. What you want to do is look for both variations, so n translates into "n or ñ". There you have it. In the routine that receives the request, you have a matching method that matches on n (or any of the few other characters with equivalents), and the routine will issue a query with a "xx like '%n%' or xx like '%ñ%'" (personally I would use ilike, since that eliminates the case problem).

Since you're referring to a name, I sure don't know the specifics of the problem or data layout, but by what I know I think you can tackle this with a rather primitive match-and-translate kind of algorithm.

One thing I'd not do: store duplicate versions. There's always a way to deal with data the way it is. In my opinion storing different versions of the same data just bloats a database in favor of a smarter way to deal with the initial data.
Uwe
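[Editor's note] Uwe's match-and-translate idea can be sketched as follows: expand the search term into every accented variant and OR the resulting ILIKE patterns together. The equivalence table, table name, and column name are made up for illustration; note that the number of variants grows multiplicatively with the number of expandable letters in the term:

```python
# Expand a bare term into all accented spellings, then build a
# parameterized WHERE clause (placeholders in psycopg2 %s style).

EQUIVALENTS = {
    "n": ["n", "ñ"],
    "a": ["a", "á", "ã"],
    "e": ["e", "é"],
    "o": ["o", "ó", "õ"],
}

def variants(term):
    """All spellings of `term` with each letter swapped for equivalents."""
    results = [""]
    for ch in term:
        options = EQUIVALENTS.get(ch, [ch])
        results = [prefix + opt for prefix in results for opt in options]
    return results

def build_query(term):
    """Build a parameterized query covering every variant of `term`."""
    forms = variants(term)
    clause = " OR ".join(["name ILIKE %s"] * len(forms))
    params = ["%" + f + "%" for f in forms]
    return "SELECT * FROM people WHERE " + clause, params
```

A usage sketch: build_query("n") yields two ILIKE branches with parameters "%n%" and "%ñ%", which is exactly the "n or ñ" translation Uwe describes.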
Re: [GENERAL] Searching for bare letters
I don't see the problem - you can have a dictionary which does all the work of recognizing bare letters and outputs several versions. Have you seen the unaccent dictionary?

Oleg

On Sun, 2 Oct 2011, Uwe Schroeder wrote:

>> Hi, everyone. Uwe wrote:
>>
>>> What kind of client are the users using? I assume you will have some kind of user interface. For me this is a typical job for a user interface. The number of letters with equivalents in different languages is extremely limited, so a simple matching routine in the user interface should give you a way to issue the proper query.
>>
>> The user interface will be via a Web application. But we need to store the data with the European characters, such as ñ, so that we can display them appropriately. So much as I like your suggestion, we need to do the opposite of what you're saying -- namely, take a bare letter, and then search for letters with accents and such on them.
>>
>> I am beginning to think that storing two versions of each name, one bare and the other not, might be the easiest way to go. But hey, I'm open to more suggestions.
>>
>> Reuven
>
> That still doesn't hinder you from using a matching algorithm. Here's a simple example (to my understanding of the problem): you have texts stored in the db containing both a n and a ñ. Now a client enters n on the website. What you want to do is look for both variations, so n translates into "n or ñ". There you have it. In the routine that receives the request, you have a matching method that matches on n (or any of the few other characters with equivalents), and the routine will issue a query with a "xx like '%n%' or xx like '%ñ%'" (personally I would use ilike, since that eliminates the case problem).
>
> Since you're referring to a name, I sure don't know the specifics of the problem or data layout, but by what I know I think you can tackle this with a rather primitive match-and-translate kind of algorithm.
>
> One thing I'd not do: store duplicate versions. There's always a way to deal with data the way it is.
> In my opinion storing different versions of the same data just bloats a database in favor of a smarter way to deal with the initial data.
>
> Uwe

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Re: [GENERAL] Searching for bare letters
Hi, Oleg. You wrote:

> I don't see the problem - you can have a dictionary which does all the work of recognizing bare letters and outputs several versions. Have you seen the unaccent dictionary?

This seems to be the direction that everyone is suggesting, and I'm quite grateful for that. (I really hadn't ever needed to deal with such issues in the past, having worked mostly with English and Hebrew, which don't have such accent marks.)

As for the unaccent dictionary, I hadn't heard of it before, but just saw it now in contrib, and it looks like it might fit perfectly. I'll take a look; thanks for the suggestion.

Reuven

--
Reuven M. Lerner -- Web development, consulting, and training
Mobile: +972-54-496-8405 * US phone: 847-230-9795
Skype/AIM: reuvenlerner
Re: [GENERAL] Searching for bare letters
>> I don't see the problem - you can have a dictionary which does all the work of recognizing bare letters and outputs several versions. Have you seen the unaccent dictionary?
>
> This seems to be the direction that everyone is suggesting, and I'm quite grateful for that. (I really hadn't ever needed to deal with such issues in the past, having worked mostly with English and Hebrew, which don't have such accent marks.) As for the unaccent dictionary, I hadn't heard of it before, but just saw it now in contrib, and it looks like it might fit perfectly. I'll take a look; thanks for the suggestion.

I wrote this code for something similar I was doing; feel free to rip it off or copy the regular expressions:

    input = Regex.Replace(input, @"[\xC0-\xC5\xE0-\xE5]", "a"); // Replace with a
    input = Regex.Replace(input, @"[\xC8-\xCB\xE8-\xEB]", "e"); // Replace with e
    input = Regex.Replace(input, @"[\xCC-\xCF\xEC-\xEF]", "i"); // Replace with i
    input = Regex.Replace(input, @"[\xD1\xF1]", "n");           // Replace with n
    input = Regex.Replace(input, @"[\xD2-\xD6\xF2-\xF6]", "o"); // Replace with o
    input = Regex.Replace(input, @"[\xD9-\xDC\xF9-\xFC]", "u"); // Replace with u
    input = Regex.Replace(input, @"[\xDD\xDF\xFF]", "y");       // Replace with y
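[Editor's note] A rough Python port of the snippet above, built from the same Latin-1 code-point ranges but using a translation table instead of seven regex passes. It is an illustrative sketch, not a drop-in replacement (like the original, it only covers Latin-1, not the full range of accented characters):

```python
# Build a str.translate table from Latin-1 accented ranges to bare letters.

def _table():
    table = {}
    ranges = [
        ("a", [(0xC0, 0xC5), (0xE0, 0xE5)]),  # À-Å, à-å
        ("e", [(0xC8, 0xCB), (0xE8, 0xEB)]),  # È-Ë, è-ë
        ("i", [(0xCC, 0xCF), (0xEC, 0xEF)]),  # Ì-Ï, ì-ï
        ("n", [(0xD1, 0xD1), (0xF1, 0xF1)]),  # Ñ, ñ
        ("o", [(0xD2, 0xD6), (0xF2, 0xF6)]),  # Ò-Ö, ò-ö
        ("u", [(0xD9, 0xDC), (0xF9, 0xFC)]),  # Ù-Ü, ù-ü
    ]
    for repl, spans in ranges:
        for lo, hi in spans:
            for cp in range(lo, hi + 1):
                table[cp] = repl
    return table

TABLE = _table()

def strip_accents_latin1(text):
    """Replace Latin-1 accented characters with their bare equivalents."""
    return text.translate(TABLE)
```

A single translate pass avoids recompiling and rescanning for each character class, which matters if this runs on every search request.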
Re: [GENERAL] Searching for bare letters
At 01:25 02/10/2011, Reuven M. Lerner wrote:

> Hi, everyone. I'm working on a project on PostgreSQL 9.0 (soon to be upgraded to 9.1, given that we haven't yet launched). The project will involve numerous text fields containing English, Spanish, and Portuguese. Some of those text fields will be searchable by the user. That's easy enough to do; for our purposes, I was planning to use some combination of LIKE searches; the database is small enough that this doesn't take very much time, and we don't expect the number of searchable records (or columns within those records) to be all that large.
>
> The thing is, the people running the site want searches to work on what I'm calling (for lack of a better term) "bare" letters. That is, if the user searches for "n", then the search should also match Spanish words containing "ñ". I'm told by Spanish-speaking members of the team that this is how they would expect searches to work. However, when I just did a quick test using a UTF-8 encoded 9.0 database, I found that PostgreSQL didn't see the two characters as identical. (I must say, this is the behavior that I would have expected, had the Spanish-speaking team member not said anything on the subject.)
>
> So my question is whether I can somehow wrangle PostgreSQL into thinking that "n" and "ñ" are the same character for search purposes, or if I need to do something else -- use regexps, keep a "naked," searchable version of each column alongside the native one, or something else entirely -- to get this to work.
>
> Any ideas?

You can use phonetic hashing for that. There are multiple algorithms; some of them can be tuned for specific languages. See this documentation:

http://en.wikipedia.org/wiki/Phonetic_algorithm for a general description,
http://en.wikipedia.org/wiki/Soundex is the first one developed, very old,
http://en.wikipedia.org/wiki/Metaphone is a family of several modern algorithms.
Remember that they are hashing algorithms: some words can collide because they have the same pronunciation but are written differently.

I remember that the DataparkSearch engine uses them with dictionaries. You can check it too: http://www.dataparksearch.org/

> Thanks,
> Reuven

HTH
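[Editor's note] To make the collision behavior concrete, here is a from-scratch sketch of American Soundex (the oldest of the phonetic hashes linked above); it is written for illustration, not taken from any of the linked implementations:

```python
# American Soundex: keep the first letter, encode the rest as digits,
# collapse repeats, drop vowels, pad to four characters.

CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    word = word.lower()
    first = word[0].upper()
    digits = []
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue            # h and w do not reset the previous code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code             # vowels reset the previous code
    return (first + "".join(digits) + "000")[:4]

# soundex("Robert") and soundex("Rupert") both hash to "R163" --
# exactly the same-pronunciation collision described above.
```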
[GENERAL] Searching for bare letters
Hi, everyone. I'm working on a project on PostgreSQL 9.0 (soon to be upgraded to 9.1, given that we haven't yet launched). The project will involve numerous text fields containing English, Spanish, and Portuguese. Some of those text fields will be searchable by the user. That's easy enough to do; for our purposes, I was planning to use some combination of LIKE searches; the database is small enough that this doesn't take very much time, and we don't expect the number of searchable records (or columns within those records) to be all that large.

The thing is, the people running the site want searches to work on what I'm calling (for lack of a better term) "bare" letters. That is, if the user searches for "n", then the search should also match Spanish words containing "ñ". I'm told by Spanish-speaking members of the team that this is how they would expect searches to work. However, when I just did a quick test using a UTF-8 encoded 9.0 database, I found that PostgreSQL didn't see the two characters as identical. (I must say, this is the behavior that I would have expected, had the Spanish-speaking team member not said anything on the subject.)

So my question is whether I can somehow wrangle PostgreSQL into thinking that "n" and "ñ" are the same character for search purposes, or if I need to do something else -- use regexps, keep a "naked," searchable version of each column alongside the native one, or something else entirely -- to get this to work.

Any ideas?

Thanks,

Reuven

--
Reuven M. Lerner -- Web development, consulting, and training
Mobile: +972-54-496-8405 * US phone: 847-230-9795
Skype/AIM: reuvenlerner
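[Editor's note] The mismatch Reuven observed can be demonstrated outside PostgreSQL: "ñ" (U+00F1) is simply a different code point from "n", so a plain substring match fails; whether a "bare" n is even present in the string depends on the Unicode normalization form:

```python
import unicodedata

text = "España"

# The composed string contains U+00F1, not a bare 'n':
print("n" in text)   # False

# In decomposed (NFD) form, 'ñ' becomes 'n' + combining tilde (U+0303),
# so the bare letter suddenly is present as its own code point:
decomposed = unicodedata.normalize("NFD", text)
print("n" in decomposed)  # True
```

This is why the solutions in the thread all amount to normalizing one side (or both) of the comparison before matching.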
Re: [GENERAL] Searching for bare letters
On Sun, 2011-10-02 at 01:25 +0200, Reuven M. Lerner wrote:

> Hi, everyone. I'm working on a project on PostgreSQL 9.0 (soon to be upgraded to 9.1, given that we haven't yet launched). The project will involve numerous text fields containing English, Spanish, and Portuguese. Some of those text fields will be searchable by the user. That's easy enough to do; for our purposes, I was planning to use some combination of LIKE searches; the database is small enough that this doesn't take very much time, and we don't expect the number of searchable records (or columns within those records) to be all that large.
>
> The thing is, the people running the site want searches to work on what I'm calling (for lack of a better term) "bare" letters. That is, if the user searches for "n", then the search should also match Spanish words containing "ñ". I'm told by Spanish-speaking members of the team that this is how they would expect searches to work. However, when I just did a quick test using a UTF-8 encoded 9.0 database, I found that PostgreSQL didn't see the two characters as identical. (I must say, this is the behavior that I would have expected, had the Spanish-speaking team member not said anything on the subject.)
>
> So my question is whether I can somehow wrangle PostgreSQL into thinking that "n" and "ñ" are the same character for search purposes, or if I need to do something else -- use regexps, keep a "naked," searchable version of each column alongside the native one, or something else entirely -- to get this to work.

Could you parse the search string for the non-English characters and convert them to the appropriate English character? My skills are not that good or I would offer more details.

> Any ideas?
>
> Thanks,
>
> Reuven
>
> --
> Reuven M. Lerner -- Web development, consulting, and training
> Mobile: +972-54-496-8405 * US phone: 847-230-9795
> Skype/AIM: reuvenlerner

--
Jay Lozier
jsloz...@gmail.com
Re: [GENERAL] Searching for bare letters
> Hi, everyone. I'm working on a project on PostgreSQL 9.0 (soon to be upgraded to 9.1, given that we haven't yet launched). The project will involve numerous text fields containing English, Spanish, and Portuguese. Some of those text fields will be searchable by the user. That's easy enough to do; for our purposes, I was planning to use some combination of LIKE searches; the database is small enough that this doesn't take very much time, and we don't expect the number of searchable records (or columns within those records) to be all that large.
>
> The thing is, the people running the site want searches to work on what I'm calling (for lack of a better term) "bare" letters. That is, if the user searches for "n", then the search should also match Spanish words containing "ñ". I'm told by Spanish-speaking members of the team that this is how they would expect searches to work. However, when I just did a quick test using a UTF-8 encoded 9.0 database, I found that PostgreSQL didn't see the two characters as identical. (I must say, this is the behavior that I would have expected, had the Spanish-speaking team member not said anything on the subject.)
>
> So my question is whether I can somehow wrangle PostgreSQL into thinking that "n" and "ñ" are the same character for search purposes, or if I need to do something else -- use regexps, keep a "naked," searchable version of each column alongside the native one, or something else entirely -- to get this to work.
>
> Any ideas?
>
> Thanks,
>
> Reuven

What kind of client are the users using? I assume you will have some kind of user interface. For me this is a typical job for a user interface. The number of letters with equivalents in different languages is extremely limited, so a simple matching routine in the user interface should give you a way to issue the proper query.

Uwe
Re: [GENERAL] Searching for bare letters
One approach would be to normalize all the text and search against that. That is, basically convert all non-ASCII characters to their equivalents. I've had to do this in Solr for searching, for the exact reasons you've outlined: treat ñ as n. Ditto for ü = u, é = e, etc. This is easily done in Solr via the included ASCIIFoldingFilterFactory:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory

You could look at the code to see how they do the conversion and implement it.

/Cody

On Oct 1, 2011, at 7:09 PM, planas wrote:

> On Sun, 2011-10-02 at 01:25 +0200, Reuven M. Lerner wrote:
>
>> Hi, everyone. I'm working on a project on PostgreSQL 9.0 (soon to be upgraded to 9.1, given that we haven't yet launched). The project will involve numerous text fields containing English, Spanish, and Portuguese. Some of those text fields will be searchable by the user. That's easy enough to do; for our purposes, I was planning to use some combination of LIKE searches; the database is small enough that this doesn't take very much time, and we don't expect the number of searchable records (or columns within those records) to be all that large.
>>
>> The thing is, the people running the site want searches to work on what I'm calling (for lack of a better term) "bare" letters. That is, if the user searches for "n", then the search should also match Spanish words containing "ñ". I'm told by Spanish-speaking members of the team that this is how they would expect searches to work. However, when I just did a quick test using a UTF-8 encoded 9.0 database, I found that PostgreSQL didn't see the two characters as identical. (I must say, this is the behavior that I would have expected, had the Spanish-speaking team member not said anything on the subject.)
>> So my question is whether I can somehow wrangle PostgreSQL into thinking that "n" and "ñ" are the same character for search purposes, or if I need to do something else -- use regexps, keep a "naked," searchable version of each column alongside the native one, or something else entirely -- to get this to work.
>
> Could you parse the search string for the non-English characters and convert them to the appropriate English character? My skills are not that good or I would offer more details.
>
>> Any ideas?
>>
>> Thanks,
>>
>> Reuven
>>
>> --
>> Reuven M. Lerner -- Web development, consulting, and training
>> Mobile: +972-54-496-8405 * US phone: 847-230-9795
>> Skype/AIM: reuvenlerner
>
> --
> Jay Lozier
> jsloz...@gmail.com
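[Editor's note] A minimal Python analogue of the Solr ASCII-folding approach Cody describes: decompose each character (NFKD) and drop the combining marks, leaving the base letter. This covers ñ → n, ü → u, é → e, but not single-character digraphs like ß or æ, and it is a sketch of the idea rather than a port of ASCIIFoldingFilterFactory:

```python
import unicodedata

def ascii_fold(text):
    """Strip combining marks after NFKD decomposition, keeping base letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# ascii_fold("señor naïve café") -> "senor naive cafe"
```

Storing this folded form in an extra indexed column (as Wolfgang suggested up-thread) and folding the search term the same way gives the "bare letter" matching the original post asks for.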