[persiancomputing] Thesis: Web Content Mining - Problems of Persian Websites
Hi,

I'm still around ;-) How's everybody?

I found something on the net that I guess many people on the list might be interested in.

Thesis: Web Content Mining - Problems of Persian Websites
Summary: http://www.irandoc.ac.ir/data/e_j/vol4/shahidi_abs.htm
PDF (Full Paper): http://www.irandoc.ac.ir/data/e_j/vol4/shahidi.pdf (1,896 Kbytes)

Cheers,
Behzad

___
PersianComputing mailing list
PersianComputing@lists.sharif.edu
http://lists.sharif.edu/mailman/listinfo/persiancomputing
MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
Dear list,

I'm considering programming a simple "search engine" for a website, to find Arabic/Persian data within a MySQL database. The database contains a huge amount of data, encoded in Unicode (UTF-8). The main goal is to ** reduce the response time ** for end users.

My first idea is to create an index and use the "FULL-TEXT searching" method. Luckily, MySQL provides FULL-TEXT indexing support in MyISAM tables. Unfortunately, it doesn't support multi-byte charsets (e.g. Unicode) [1].

Technically, MySQL builds its indexes over words. A "word" is any sequence of characters consisting of letters and numbers [2]. Assuming this, I tried to save the records as Unicode character references, but the search failed again :-(

Any suggestions? I'd appreciate any solution to this problem.

Thanks in advance,
Behzad

[1] MySQL Manual -> 6.8.3 Full-text Search TODO
[2] MySQL Manual -> 6.8 MySQL Full-text Search

P.S. I use MySQL 4.0.

1) Table structure

    CREATE TABLE `articles` (
      `article_id` int(10) unsigned NOT NULL auto_increment,
      `article_title` NATIONAL varchar(255) NOT NULL default '',
      `article_text` text NOT NULL,
      PRIMARY KEY (`article_id`),
      -- one full-text index covering both the title and the body
      FULLTEXT (`article_title`, `article_text`)
    ) TYPE=MyISAM;

    ALTER TABLE `articles` CHARACTER SET utf8;

2) SQL query to perform a full-text search

    SELECT * FROM articles
    WHERE MATCH(article_title, article_text) AGAINST('سوال');
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
Mohsen wrote:
> Please use MySQL 4.1 or higher.

Dear Mohsen,

Nice to e-meet(!) you here on the PersianComputing mailing list!

Thanks for your advice. I just heard the same message from the MySQL geeks at [EMAIL PROTECTED]

I already know that MySQL 4.1 supports Unicode [1], and I can install and use it on my own computer. So why am I bothering you here?

Here's the problem: HostRocket.com, my preferred web hosting company, has not installed MySQL 4.1 yet. They still use MySQL 4.0.2 and they won't install MySQL 4.1 :-(

What Can I Do Now?
===

1) Find a "MySQL 4.0-friendly" method to perform quick searches. That's why I'm here, asking people to help me.

2) Find another web hosting company with PHP and MySQL 4.1 support. Would you (or anyone else on the list) recommend a reliable web hosting company with such services?

Thanks in advance,
Behzad

[1] http://lists.mysql.com/mysql/155039
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
On Nov 24, 2005, Medi Montaseri wrote:
> One solution would be to augment a DB capability
> at the application level. That is instead of the search
> or select qualified by a SQL where clause, simply get
> everything (select *) and then let the application filter
> what you want. Then when your given DB provides
> that operation by itself, simplify your application
> and delegate that to DB (Query Engine).

Actually, the client asked me to write a PHP-driven search engine to locate words in HTML resources. I'm considering MySQL as an "indexing" tool to store the plain-text data and speed up this search.

The solution you explained requires that I write my own indexer in PHP. I'm looking for a faster and easier way.

> I'm not sure about PHP support of unicode, but I know
> Perl is pretty strong on Regular Expressions with
> support for Unicode as well...

With the aid of the "mbstring" extension, PHP supports multi-byte characters. In case you're interested, take a look at:

1) Toppa, Michael, "Solving the Unicode Puzzle," php|arch Magazine, May 2005. Available online at http://www.phparch.com/issuedata/articles/article_179.pdf
2) http://www.php.net/ref.mbstring

BTW, if you code in Perl, I have something for you: http://www.dataparksearch.org/

If you know of a PHP-driven search engine like this, please let me know.

Thanks in advance,
Behzad
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
Ehsan Akhgari wrote:
> Another solution is make the db believe your text is English.
> This could be done by "romanizing" the text before inserting it to the db,
> and converting it back to Unicode after reading it from the db and before
> displaying it to the user. This can be done by choosing a Roman letter for
> each Persian letter, and reading Persian characters one by one and looking
> them up in a conversion table and writing the equivalent Roman characters to
> the output. However, this has the downside that IIRC MySQL's full-text search
> is case-insensitive, and if I'm right in that you'd have to choose Roman
> characters all from one case (upper or lower.) In addition to that, the data
> stored in the db might be difficult/impossible to use without such a conversion.
> It's you who should judge the tradeoffs before choosing to use this method or
> not.

Dear Ehsan,

You suggested a creative solution. Thank you.

My application consists of a database and two user interfaces.

The first UI is used for data entry: I parse a given XML file, extract its data, "Romanize" it based on a "Persian-Roman conversion map", and then insert it into the DB. Luckily, PHP provides a very fast function for such conversions, named strtr(). Now I have a "Roman DB".

The second UI is used for data retrieval (searching): I "Romanize" the given search argument and look for it through the DB records. The results are decoded and converted back to Persian before being sent to stdout.

There are two disadvantages to this method:
- Firstly, as you pointed out, it is impossible to use the data without the conversion. However, I can extend phpMyAdmin to handle this and simplify data manipulation for the client.
- Secondly, Romanizing adds some overhead to the system. But since only 10 records are retrieved and displayed at a time, this overhead is negligible. In addition, PHP's strtr() function works fast enough. :-D

I think your solution is the only MySQL 4.0-friendly way to implement FULL-TEXT searching for Persian (well, that's not Persian, it's Roman ;-) )

Once again, thank you for sharing your knowledge.

Behzad
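
The romanizing round trip described in this message can be sketched as follows. This is only an illustration in Python, not the conversion map used in the thread: the function names and the fixed-width two-letter code are assumptions made here so that the mapping is provably reversible and uses lowercase letters only (the thread's suggestion was one Roman letter per Persian letter, with PHP's strtr() doing the substitution).

```python
# Hypothetical sketch: encode every code point in the Unicode Arabic block
# (U+0600..U+06FF) as two lowercase Latin letters, and decode it back.
ARABIC_BLOCK_START = 0x0600
ARABIC_BLOCK_END = 0x06FF
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def romanize(text: str) -> str:
    """Replace each Arabic-block character with a two-letter lowercase code."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ARABIC_BLOCK_START <= cp <= ARABIC_BLOCK_END:
            hi, lo = divmod(cp - ARABIC_BLOCK_START, len(ALPHABET))
            out.append(ALPHABET[hi] + ALPHABET[lo])
        else:
            # Spaces, punctuation and everything else pass through unchanged.
            out.append(ch)
    return "".join(out)


def deromanize(text: str) -> str:
    """Inverse of romanize(); assumes the original text had no Latin letters."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair[0] in ALPHABET and pair[1] in ALPHABET:
            offset = ALPHABET.index(pair[0]) * len(ALPHABET) + ALPHABET.index(pair[1])
            out.append(chr(ARABIC_BLOCK_START + offset))
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


stored = romanize("سوال جواب")      # what would be inserted into the DB
restored = deromanize(stored)        # what would be shown to the user
```

The same romanize() step would be applied to the search argument before it is placed in the AGAINST(...) clause. Two limitations of this sketch: source text that already contains lowercase Latin letters cannot be decoded unambiguously, and characters outside the Arabic block (for example the zero-width non-joiner) are stored as they are.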
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
Mohsen wrote:
> But himself solved his problem.
> with : mysql_query("SET NAMES utf8");
> Even 4.0.x

Wrong. I decided to prepare two different versions of my software:
- A MySQL 4.0-friendly version using the Romanizing method (hats off to you, Ehsan)
- A MySQL 4.1-compatible version.

The code you mentioned belongs to the second version.

"SET NAMES indicates what is in the SQL statements that the client sends. Thus, SET NAMES 'cp1251' tells the server future incoming messages from this client are in character set cp1251. It also specifies the character set for results that the server sends back to the client. (For example, it indicates what character set column values are if you use a SELECT statement.)"
(MySQL Manual 4.1 -> 10.3.6. Connection Character Sets and Collations)

Kind Regards,
Behzad
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
Dear Ehsan,

On Nov 28, 2005, you wrote:
> I've actually implemented this approach in a project. I have not yet published the
> code, but if you want, I can make it available under the GPL.

Yes! I would appreciate it. Thank you very much for your kindness.

Behzad
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
Dear Behdad,

On 25 Nov 2005, you wrote:
> Another option is to get yourself a real search engine, like
> Apache Lucene. I've written my experience using that here:
>
> http://mces.blogspot.com/2005/04/on-lucene-and-its-decency.html

You always offer the most brilliant solutions!!

Unfortunately, I have no experience with this method, but I'm still eager. I read your weblog and visited the Apache Lucene homepage. I'm impressed. Would you tell us how you integrated this Java-driven package with PHP at http://rira.ir/ ? It works really fast.

Thanks in advance,
Behzad
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
Behdad Esfahbod wrote:
> That's the tricky part, or where the runtime-hell comes in. What
> I did was to write a small java program based on the samples in
> Lucene to connect to my database and feed the data into Lucene.
> At search time, I have another little Java program that takes the
> query string from command line and prints out search results to
> standard output. My PHP script then just fires up a shell script
> that in turn runs the Java program, piping the output into PHP...

Knowledge is Power. (Alvin Toffler)

That's a wonderful architecture. It seems I was blind before reading your e-mail. I had never thought about the power of the shell, or about using it as an interface to talk to Java. I like your point of view. Very interesting!

Thank you very much for sharing the source code!

Behzad
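
The architecture quoted above (a web script launching a command-line search program and reading its standard output) can be sketched roughly as below. This is a Python stand-in for the PHP side, written under assumptions: the function name `run_search`, the one-result-per-line output format, and the Java command shown in the trailing comment are all hypothetical and are not taken from the shared source code.

```python
import subprocess
from typing import List, Sequence


def run_search(command: Sequence[str], query: str, timeout: float = 10.0) -> List[str]:
    """Run an external search program and return its non-empty output lines.

    `command` is the program plus its fixed arguments; the query string is
    appended as the last argument, mirroring "takes the query string from
    command line and prints out search results to standard output".
    """
    completed = subprocess.run(
        [*command, query],       # argument list, so no shell quoting is involved
        capture_output=True,
        encoding="utf-8",        # decode the child's stdout as UTF-8 text
        timeout=timeout,
        check=True,              # raise CalledProcessError on a non-zero exit code
    )
    return [line for line in completed.stdout.splitlines() if line.strip()]


# Hypothetical usage (class name and classpath are placeholders):
# hits = run_search(["java", "-cp", "lucene-core.jar:.", "SearchCli"], "سوال")
```

Passing the query as a separate list element rather than building a shell command string sidesteps quoting problems with user-supplied text; starting a new process per search does add start-up cost on every request, which is the trade-off of this design.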
[persiancomputing] Separating Persian numbers with comma is incorrect
e-Greetings,

I saw http://students.washington.edu/irina/persianword/format.htm, but a note:

As described in the book "Nogh-teh GozAri" (The Official Persian Manual of Editing, Vol. 5, Punctuation Book) by Mohammad-RezA Mohammadi-Far, page 460:

It is not correct to use a comma (",") or U+066C to separate groups of thousands in numbers. Instead, the editor must use the Persian letter Reh (U+0631).

Farewell,
BEHzad
Using of U+066C as a number-separator
Thanks to Connie, who convinced me. It seems that using U+066C is the best option.

But don't you think the shape of U+066C is very similar to the signs for 'foot' and 'minute'?
(http://students.washington.edu/irina/persianword/afgDecSep.JPG)

Farewell,
Behzad
Re: Salam
Hello to dear Amir Azemat, veteran and well-known blogger!

This mailing list, or as you call it, "Group", is about issues related to writing Persian on the Internet, the WWW, software, applications, and so on (in short, Persian computing).

Here, people come and ask questions and look for solutions, and others answer. Some also argue, go at each other's throats, and spill blood!

For those reading this reply, let me say that Amir is one of the people to whom Iranian bloggers are indebted, and he has a long track record in writing Persian on the Web. Although he doesn't use the Persian Yeh (U+06CC) ;-) But in one section of his very readable blog (http://weblog.azemat.com/), called "Link-dooni", he uses the Nesf font :-D

For my part, as the junior tag-along of this mailing list, I welcome Amir: you have brightened the place, Amir ;-)

Farewell,
Behzad

- Original Message -
From: Amir Azemati
To: [EMAIL PROTECTED]
Sent: Friday, January 10, 2003 2:35 PM
Subject: Salam

Hello to all the folks in the Group ...
Could someone kindly explain a little about this Group?
Take care ... Bye
http://weblog.azemat.com
Re: Using of U+066C as a number-separator
On Thursday, January 08, 2004 7:00 PM Roozbeh Pournader wrote:
> On Thu, 2004-01-08 at 18:14, AmirBehzad Eslami wrote:
> > But don't you think shape of U+066C is very similar to sign of 'foot'
> > and 'minute'?
> > (http://students.washington.edu/irina/persianword/afgDecSep.JPG)
>
> Depends on the font. Compare with
> <http://www.bamdad.org/~roozbeh/thsep.png>, for example.
>
> roozbeh

Roozbeh,

Thank you very much for that screenshot. OK, I admit it: at your service, O U+066C ;-)

But as Connie mentioned, some users are unable to see this character correctly. I wonder whether even the Nesf2 font has a bug with U+066C.

So, if I don't want to use U+066C (for Web usability reasons), is there an alternative for me? May I use the 'Reh' until most users have standard systems? If your answer is no, I have two more options:

1. Forget about separating numbers (that is, dodge the problem altogether)
2. Ask my website visitors to download a newer version of Tahoma (what about the Nesf font?)

A) Does anybody have a better option?
B) What is this "Arabic Decimal Separator" (U+066B)?

Thanks in advance,
Behzad
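
For reference, the Unicode names of the two characters under discussion are ARABIC DECIMAL SEPARATOR (U+066B) and ARABIC THOUSANDS SEPARATOR (U+066C). A minimal Python sketch of how they could be combined with the Persian digits (U+06F0 to U+06F9) is given below; the function name and its `decimals` parameter are illustrative assumptions, and whether a reader's font renders U+066C acceptably is exactly the open question in this thread.

```python
# Map ASCII digits 0-9 to the Persian (Extended Arabic-Indic) digits U+06F0..U+06F9.
PERSIAN_DIGITS = str.maketrans(
    "0123456789", "".join(chr(0x06F0 + d) for d in range(10))
)
THOUSANDS_SEP = "\u066C"  # ARABIC THOUSANDS SEPARATOR
DECIMAL_SEP = "\u066B"    # ARABIC DECIMAL SEPARATOR


def format_persian_number(value: float, decimals: int = 0) -> str:
    """Group digits in threes with U+066C and mark the fraction with U+066B."""
    ascii_text = f"{value:,.{decimals}f}"  # e.g. 1,234.50 for (1234.5, 2)
    swapped = ascii_text.replace(",", THOUSANDS_SEP).replace(".", DECIMAL_SEP)
    return swapped.translate(PERSIAN_DIGITS)


grouped = format_persian_number(1234567)
with_fraction = format_persian_number(1234.5, 2)
```

Because the value goes through floating-point formatting, very large integers may lose precision in this sketch; a minus sign, if present, is left as the ASCII hyphen.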
Re: WEFT webpage font embedding--Call for feedback
Dear Connie,

Like you, I use WinXP and IE 6.0. I'm sorry to say that I can't help you with the other platforms. But take a look at http://www.browsercam.com , which provides good services for testing web pages on multiple platforms and different browsers.

Hope this helps :-) Please let us know the result.

Thanks,
Behzad

- Original Message -
From: "C Bobroff" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, May 07, 2004 11:01 AM
Subject: WEFT webpage font embedding--Call for feedback

> We've had a few discussions about WEFT before in the past but never really
> explored it completely. Therefore, I made this demo page in both
> English and Persian and embedded Tahoma, Koodak(by FarsiWeb) and Arabic
> Typesetting:
> http://students.washington.edu/irina/persianword/weft.htm
>
> Can you please check if Weft has worked? Do you see my fonts correctly?
> Is the Yeh (medial form) showing up correctly in all fonts, especially on
> Win98? Is the load time any longer than usual? If you have the old, buggy
> Tahoma font, is my corrected font showing up instead? If you have the old
> Sinasoft or Borna Koodak, is my FarsiWeb Koodak showing up?
>
> Please report your findings! Be sure to mention which version of Windows
> and IE. By the way, you have to uninstall these fonts if you have them,
> otherwise, the test is not too helpful :)
>
> As you may know, Weft only works on Windows and IE so don't bother to
> check on anything else. Also please don't look at the source code! I was
> in a great hurry and yes, it's a mess. Anyone who is qualified is welcome
> to redo it if too unbearable. I would appreciate that!
>
> Thanks!
> -Connie