Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
Behdad Esfahbod wrote:

> That's the tricky part, or where the runtime-hell comes in. What
> I did was to write a small java program based on the samples in
> Lucene to connect to my database and feed the data into Lucene.
> At search time, I have another little Java program that takes the
> query string from command line and prints out search results to
> standard output. My PHP script then just fires up a shell script
> that in turn runs the Java program, piping the output into PHP...

Knowledge is Power. (Alvin Toffler)

That's a wonderful architecture. It seems I was blind before reading your e-mail. I had never thought about the power of the shell before, or about using it as an interface to talk to Java. I like your point of view. Very interesting!

Thank you very much for sharing the source code!

Behzad
___
PersianComputing mailing list
[email protected]
http://lists.sharif.edu/mailman/listinfo/persiancomputing
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
On Wed, 30 Nov 2005, AmirBehzad Eslami wrote:

> Dear Behdad,
>
> On 25 Nov 2005, you wrote:
>
> > Another option is to get yourself a real search engine, like
> > Apache Lucene. I've written my experience using that here:
> >
> > http://mces.blogspot.com/2005/04/on-lucene-and-its-decency.html
>
> You always offer the most brilliant solutions!!
> Unfortunately, I have no experience with this method. But I'm still eager.
> I read your weblog and visited the "Apache Lucene" homepage.
>
> I'm impressed. Would you tell us how you have integrated this
> Java-driven package with PHP at http://rira.ir/ ?!! It works
> really fast.

That's the tricky part, or where the runtime-hell comes in. What I did was to write a small Java program, based on the samples in Lucene, that connects to my database and feeds the data into Lucene. At search time, I have another little Java program that takes the query string from the command line and prints the search results to standard output. My PHP script then just fires up a shell script that in turn runs the Java program, piping the output into PHP...

I don't have access to the Java code at this time, but the PHP code involved is available here:

http://cvs.sourceforge.net/viewcvs.py/rira/rira/php/page/search.php?rev=1.1.1.1&view=log

If you are developing in .NET, there is a functional port of Lucene to .NET too. There is even a port of an older version of it to Python.

BTW, you need to make sure you compile it with Unicode turned on. I don't quite remember the details, but there were some. I also have a Persian class written for it, but it didn't do much anyway.

In a few weeks I will get access to the rira.ir server and hopefully move the site to the above sf.net project, so you can see what's inside.

> Thanks in advance,
> Behzad

Cheers,
--behdad
http://behdad.org/

"Commandment Three says Do Not Kill, Amendment Two says Blood Will Spill"
-- Dan Bern, "New American Language"
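
[Editor's sketch] The pipeline described in this message (a web script that starts an external command-line searcher and reads its standard output) can be illustrated roughly as follows. This is a minimal sketch in Python, not the rira.ir code: the original used PHP, a shell script and a Java program, none of which are shown in the thread. The function name run_search, the one-result-per-line output format and the FAKE_SEARCHER stand-in are all assumptions made for illustration.

```python
import subprocess
import sys


def run_search(command, query, timeout=10):
    """Run an external search program and return its output lines.

    `command` is the argv prefix of the searcher (hypothetical here).
    The query is passed as a separate argument, never pasted into a
    shell string, so shell metacharacters in user input stay harmless.
    """
    completed = subprocess.run(
        list(command) + [query],
        capture_output=True,
        text=True,
        encoding="utf-8",
        timeout=timeout,
        check=True,  # raise CalledProcessError if the searcher exits non-zero
    )
    # Assumed output format: one result per line; blank lines are dropped.
    return [line for line in completed.stdout.splitlines() if line.strip()]


# Stand-in for the real searcher: a Python one-liner that prints two
# fake tab-separated hits for whatever query it receives.
FAKE_SEARCHER = [
    sys.executable,
    "-c",
    "import sys; q = sys.argv[1]; print('1\\t' + q); print('2\\t' + q)",
]

if __name__ == "__main__":
    for hit in run_search(FAKE_SEARCHER, "lucene"):
        print(hit)
```

Starting a new process for every query is simple, but it pays the start-up cost of the search program each time, which is presumably part of what the message calls "runtime-hell".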
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
Dear Behdad,

On 25 Nov 2005, you wrote:

> Another option is to get yourself a real search engine, like
> Apache Lucene. I've written my experience using that here:
>
> http://mces.blogspot.com/2005/04/on-lucene-and-its-decency.html

You always offer the most brilliant solutions!!
Unfortunately, I have no experience with this method, but I'm still eager. I read your weblog and visited the "Apache Lucene" homepage.

I'm impressed. Would you tell us how you have integrated this Java-driven package with PHP at http://rira.ir/ ?!! It works really fast.

Thanks in advance,
Behzad
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
Dear Ehsan,

On Nov 28, 2005, you wrote:

> I've actually implemented this approach in a project. I have not yet published the
> code, but if you want, I can make it available under the GPL.

Yes! I would appreciate it. Thank you very much for your kindness.

Behzad
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
> Dear Ehsan,
>
> You suggested a creative solution. Thank you.
>
> My application consists of a database and two user interfaces.
>
> The first UI is used for data entry, where I parse a given XML file,
> extract and "Romanize" its data - based on a "Persian-Roman Conversion
> Map" - and then insert it into the DB. Luckily, PHP provides a very fast
> function for such conversions, named strtr(). Now I have a "Roman DB".
>
> The second UI is used for data retrieval (searching), where I "Romanize"
> the given search argument and look for it through the DB records. The
> results are decoded and converted back to Persian before being sent to
> stdout.

I've actually implemented this approach in a project. I have not yet published the code, but if you want, I can make it available under the GPL.

Ehsan
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
Mohsen wrote:

> But himself solved his problem.
> with : mysql_query("SET NAMES utf8");
> Even 4.0.x

Wrong. I decided to prepare two different versions of my software:

- A MySQL 4.0-friendly version using the Romanizing method (hats off to you, Ehsan)
- A MySQL 4.1-compatible version.

The code you mentioned belongs to the second version.

"SET NAMES indicates what is in the SQL statements that the client sends. Thus, SET NAMES 'cp1251' tells the server future incoming messages from this client are in character set cp1251. It also specifies the character set for results that the server sends back to the client. (For example, it indicates what character set column values are if you use a SELECT statement.)"
(MySQL Manual 4.1 -> 10.3.6. Connection Character Sets and Collations)

Kind Regards,
Behzad
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
[EMAIL PROTECTED] wrote:

> AmirBehzad Eslami <[EMAIL PROTECTED]> wrote on 24/11/2005 17:48:29:
>
> > Dear list,
> >
> > I'm considering programming a simple "Search Engine" for a website,
> > to find Arabic/Persian data within a MySQL database.
> > This database contains a huge amount of data, encoded with Unicode (UTF-8).
> >
> > The big deal is to ** reduce the response time ** to end-users.
> >
> > My first solution is to create an Index and use the "FULL-TEXT
> > Searching" method.
> >
> > Luckily, MySQL provides FULL-TEXT Indexing support in MyISAM tables.
> > But unfortunately, it doesn't support multi-byte charsets (e.g.
> > Unicode). [1]
> > Technically, MySQL creates Indexes over words.
> > A "word" is any sequence of characters consisting of letters and
> > numbers [2].
> >
> > Assuming this, I tried to save the records as Unicode Character
> > References (), but the search failed again :-(
> >
> > Any suggestion?
> > I appreciate any solution to solve this problem.
> >
> > Thanks in Advance,
> > Behzad
> >
> > [1] MySQL Manual -> 6.8.3 Full-text Search TODO
> > [2] MySQL Manual -> 6.8 MySQL Full-text Search
> >
> > P.S. *** I use MySQL 4.0 ***
>
> I think this is your problem: MySQL does not properly support Unicode
> until version 4.1. I am successfully using FullText with MySQL 4.1 to
> sort UTF-8 encoded Japanese text. I see no reason why it should not work
> for Arabic - if you upgrade.
>
> Alec

But he solved his problem himself, with:

mysql_query("SET NAMES utf8");

Even on 4.0.x.
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
AmirBehzad Eslami <[EMAIL PROTECTED]> wrote on 24/11/2005 17:48:29:

> Dear list,
>
> I'm considering programming a simple "Search Engine" for a website,
> to find Arabic/Persian data within a MySQL database.
> This database contains a huge amount of data, encoded with Unicode (UTF-8).
>
> The big deal is to ** reduce the response time ** to end-users.
>
> My first solution is to create an Index and use the "FULL-TEXT
> Searching" method.
>
> Luckily, MySQL provides FULL-TEXT Indexing support in MyISAM tables.
> But unfortunately, it doesn't support multi-byte charsets (e.g.
> Unicode). [1]
> Technically, MySQL creates Indexes over words.
> A "word" is any sequence of characters consisting of letters and
> numbers [2].
>
> Assuming this, I tried to save the records as Unicode Character
> References (), but the search failed again :-(
>
> Any suggestion?
> I appreciate any solution to solve this problem.
>
> Thanks in Advance,
> Behzad
>
> [1] MySQL Manual -> 6.8.3 Full-text Search TODO
> [2] MySQL Manual -> 6.8 MySQL Full-text Search
>
> P.S. *** I use MySQL 4.0 ***

I think this is your problem: MySQL does not properly support Unicode until version 4.1. I am successfully using FullText with MySQL 4.1 to sort UTF-8 encoded Japanese text. I see no reason why it should not work for Arabic - if you upgrade.

Alec
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
Ehsan Akhgari wrote:

> Another solution is to make the db believe your text is English.
> This could be done by "romanizing" the text before inserting it into the db,
> and converting it back to Unicode after reading it from the db and before
> displaying it to the user. This can be done by choosing a Roman letter for
> each Persian letter, and reading Persian characters one by one and looking
> them up in a conversion table and writing the equivalent Roman characters to
> the output. However, this has the downside that IIRC MySQL's full-text search
> is case-insensitive, and if I'm right in that you'd have to choose Roman
> characters all from one case (upper or lower.) In addition to that, the data
> stored in the db might be difficult/impossible to use without such a conversion.
> It's you who should judge the tradeoffs before choosing to use this method or
> not.

Dear Ehsan,

You suggested a creative solution. Thank you.

My application consists of a database and two user interfaces.

The first UI is used for data entry, where I parse a given XML file, extract and "Romanize" its data - based on a "Persian-Roman Conversion Map" - and then insert it into the DB. Luckily, PHP provides a very fast function for such conversions, named strtr(). Now I have a "Roman DB".

The second UI is used for data retrieval (searching), where I "Romanize" the given search argument and look for it through the DB records. The results are decoded and converted back to Persian before being sent to stdout.

There are two disadvantages to this method:

- Firstly, as you pointed out, it is impossible to use the data without the conversion. However, I can extend "phpMyAdmin" to handle this and simplify data manipulation for the client.
- Secondly, Romanizing adds a little overhead to the system. But since only 10 records are retrieved and displayed each time, this overhead is negligible. In addition, PHP's strtr() function works fast enough. :-D

I think your solution is the only MySQL 4.0-friendly way to implement FULL-TEXT searching for Persian (well, that's not Persian, but Roman ;-) )

Once again, thank you for sharing your knowledge.

Behzad
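
[Editor's sketch] The romanize-on-insert, decode-on-display round trip described above could look roughly like this. It is written in Python with str.translate rather than PHP's strtr(), purely as an illustration. The conversion map below is hypothetical and deliberately tiny (six letters); the actual "Persian-Roman Conversion Map" was never posted to the list. All codes are lowercase, following the remark in the thread that MySQL's full-text search is case-insensitive.

```python
# Hypothetical, deliberately incomplete conversion map. A real map would
# have to give every Persian letter its own unique code, all in one case.
PERSIAN_TO_ROMAN = {
    "\u0633": "s",  # seen
    "\u0648": "v",  # vav
    "\u0627": "a",  # alef
    "\u0644": "l",  # lam
    "\u0628": "b",  # be
    "\u062a": "t",  # te
}

# str.translate works on tables built by str.maketrans.
_ENCODE = str.maketrans(PERSIAN_TO_ROMAN)
_DECODE = str.maketrans({roman: persian for persian, roman in PERSIAN_TO_ROMAN.items()})


def romanize(text):
    """Replace each mapped Persian letter with its Roman code.

    Characters that are not in the map pass through unchanged.
    """
    return text.translate(_ENCODE)


def deromanize(text):
    """Inverse of romanize() for text made only of mapped letters."""
    return text.translate(_DECODE)


if __name__ == "__main__":
    # The search word used in the sample query earlier in this thread.
    query = "\u0633\u0648\u0627\u0644"
    stored = romanize(query)
    print(stored)                       # sval
    print(deromanize(stored) == query)  # True
```

Two limits of the sketch are worth keeping in mind. The Persian alphabet has more letters than the 26 lowercase Latin ones, so a complete one-character-per-letter map needs digits or other word characters as well. And decoding is only safe if the stored text contains no genuine Latin letters, since deromanize() would turn those into Persian too.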
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
On Nov 24, 2005, Medi Montaseri wrote:

> One solution would be to augment a DB capability
> at the application level. That is instead of the search
> or select qualified by a SQL where clause, simply get
> everything (select *) and then let the application filter
> what you want. Then when your given DB provides
> that operation by itself, simplify your application
> and delegate that to DB (Query Engine).

Actually, the client asked me to write a PHP-driven search engine to locate words in HTML resources. I'm considering MySQL as an "Indexing" tool to store the plain-text data and speed up this search.

The solution you explained requires that I write my own indexer in PHP. I'm looking for a faster and easier way.

> I'm not sure about PHP support of unicode, but I know
> Perl is pretty strong on Regular Expressions with
> support for Unicode as well...

With the aid of the "mbstring" extension, PHP supports multi-byte characters. In case you're interested, take a look at:

1) Toppa, Michael, "Solving the Unicode Puzzle," php|arch Magazine, May 2005. Available online at http://www.phparch.com/issuedata/articles/article_179.pdf
2) http://www.php.net/ref.mbstring

BTW, if you code in Perl, I have something for you: http://www.dataparksearch.org/

If you know a PHP-driven search engine like this, please let me know.

Thanks in advance,
Behzad
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
On Fri, 25 Nov 2005, Ehsan Akhgari wrote:

> > One solution would be to augment a DB capability
> > at the application level. That is instead of the search
> > or select qualified by a SQL where clause, simply get
> > everything (select *) and then let the application filter
> > what you want. Then when your given DB provides
> > that operation by itself, simplify your application
> > and delegate that to DB (Query Engine).
>
> Another solution is to make the db believe your text is English.
> This could be done by "romanizing" the text before inserting it
> into the db, and converting it back to Unicode after reading it
> from the db and before displaying it to the user. This can be
> done by choosing a Roman letter for each Persian letter, and
> reading Persian characters one by one and looking them up in a
> conversion table and writing the equivalent Roman characters to
> the output. However, this has the downside that IIRC MySQL's
> full-text search is case-insensitive, and if I'm right in that
> you'd have to choose Roman characters all from one case (upper
> or lower.) In addition to that, the data stored in the db
> might be difficult/impossible to use without such a conversion.
> It's you who should judge the tradeoffs before choosing to use
> this method or not.
>
> For some good romanizing scripts, check out
> http://home.byu.net/jmd56/download.html.

Another option is to get yourself a real search engine, like Apache Lucene. I've written my experience using that here:

http://mces.blogspot.com/2005/04/on-lucene-and-its-decency.html

> Ehsan

--behdad
http://behdad.org/

"Commandment Three says Do Not Kill, Amendment Two says Blood Will Spill"
-- Dan Bern, "New American Language"
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
> One solution would be to augment a DB capability
> at the application level. That is instead of the search
> or select qualified by a SQL where clause, simply get
> everything (select *) and then let the application filter
> what you want. Then when your given DB provides
> that operation by itself, simplify your application
> and delegate that to DB (Query Engine).

Another solution is to make the db believe your text is English. This could be done by "romanizing" the text before inserting it into the db, and converting it back to Unicode after reading it from the db and before displaying it to the user. This can be done by choosing a Roman letter for each Persian letter, then reading the Persian characters one by one, looking each up in a conversion table, and writing the equivalent Roman character to the output.

However, this has the downside that, IIRC, MySQL's full-text search is case-insensitive, and if I'm right about that, you'd have to choose Roman characters all from one case (upper or lower). In addition, the data stored in the db might be difficult or impossible to use without such a conversion. It's you who should judge the tradeoffs before choosing whether to use this method.

For some good romanizing scripts, check out http://home.byu.net/jmd56/download.html.

Ehsan
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
One solution would be to augment a DB capability at the application level. That is, instead of a search or select qualified by a SQL WHERE clause, simply get everything (select *) and then let the application filter what you want. Then, when your given DB provides that operation by itself, simplify your application and delegate that to the DB (Query Engine).

I'm not sure about PHP support of Unicode, but I know Perl is pretty strong on regular expressions, with support for Unicode as well...

Medi

On 11/24/05, Behnam Esfahbod <[EMAIL PROTECTED]> wrote:

> AmirBehzad Eslami wrote:
> > 2) Find another Web Hosting Company with PHP and MySQL 4.1 support.
> >
> > Would you (or anyone else in the list) recommend a reliable Web Hosting
> > Company with such services?!
>
> You may like to see www.1and1.com. It's been our web hosting for 2
> years now.
>
> --
> Behnam Esfahbod
> http://zwnj.org
> http://zwnj.info
> http://behnam.esfahbod.info
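
[Editor's sketch] The application-level filtering suggested in this message (fetch everything, then filter outside the database) amounts to something like the following Python sketch. The row layout reuses the column names from the articles table posted at the start of the thread; the function name filter_rows and the sample rows are assumptions made for illustration, and no real database connection is involved.

```python
def filter_rows(rows, term):
    """Return the rows whose title or text contains `term`.

    `rows` stands in for the result of an unqualified SELECT on the
    articles table: (article_id, article_title, article_text) tuples
    whose text columns are already decoded to Unicode strings.
    """
    return [row for row in rows if term in row[1] or term in row[2]]


if __name__ == "__main__":
    # Invented sample data; the first title is the word from the
    # sample query in this thread.
    rows = [
        (1, "\u0633\u0648\u0627\u0644", "first article body"),
        (2, "another title", "second article body"),
    ]
    matches = filter_rows(rows, "\u0633\u0648\u0627\u0644")
    print([row[0] for row in matches])  # [1]
```

Because the comparison happens on decoded strings in the application, the database's character-set support no longer matters. The price is a scan of every row on every query, which works against the original goal of reducing response time on a large table.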
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
AmirBehzad Eslami wrote:

> 2) Find another Web Hosting Company with PHP and MySQL 4.1 support.
>
> Would you (or anyone else in the list) recommend a reliable Web Hosting
> Company with such services?!

You may like to see www.1and1.com. It's been our web hosting for 2 years now.

--
Behnam Esfahbod
http://zwnj.org
http://zwnj.info
http://behnam.esfahbod.info
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
Mohsen wrote:

> Please use MySQL 4.1 or higher.

Dear Mohsen,

Nice to e-meet(!) you here, on the PersianComputing mailing list!

Thanks for your advice. I just heard the same message from the MySQL geeks at [EMAIL PROTECTED]

I already know that MySQL 4.1 supports Unicode [1], and I can install and use it on my own computer. So why am I bothering you here?

Here's the problem: HostRocket.com - my preferred web hosting company - has not installed MySQL 4.1 yet. They still use MySQL 4.0.2 and they won't install MySQL 4.1 :-(

What Can I Do Now?
===
1) Find a "MySQL 4.0-friendly" method to perform quick searches. That's why I'm here, asking people to help me.
2) Find another Web Hosting Company with PHP and MySQL 4.1 support.

Would you (or anyone else in the list) recommend a reliable Web Hosting Company with such services?!

Thanks in advance,
Behzad

[1] http://lists.mysql.com/mysql/155039
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
AmirBehzad Eslami wrote:
Dear list,
I'm considering programming a simple "Search Engine" for a website,
to find Arabic/Persian data within a MySQL database.
This database contains a huge amount of data, encoded with Unicode
(UTF-8).
The big deal is to ** reduce the response time ** to end-users.
My first solution is to create an Index and use the "FULL-TEXT
Searching" method.
Luckily, MySQL provides FULL-TEXT Indexing support in MyISAM tables.
But unfortunately, it doesn't support multi-byte charsets (e.g.
Unicode). [1]
Technically, MySQL creates Indexes over words.
A "word" is any sequence of characters consisting of letters and
numbers [2].
Assuming this, I tried to save the records as Unicode Character
References (), but the search failed again :-(
Any suggestion?
I appreciate any solution to solve this problem.
Thanks in Advance,
Behzad
[1] MySQL Manual -> 6.8.3 Full-text Search TODO
[2] MySQL Manual -> 6.8 MySQL Full-text Search
P.S.
I use MySQL 4.0
1) Table Structure
CREATE TABLE `articles` (
`article_id` int(10) unsigned NOT NULL auto_increment,
`article_title` NATIONAL varchar(255) NOT NULL default '',
`article_text` text NOT NULL,
PRIMARY KEY (`article_id`),
FULLTEXT (`article_title`,`article_text`)
) TYPE=MyISAM ;
ALTER TABLE `articles` CHARACTER SET utf8;
2) SQL-Query to Perform a Full-text search
SELECT * FROM articles WHERE MATCH(article_title, article_text)
AGAINST('سوال')
Please use MySQL 4.1 or higher.
