Implement Thai language analyzer during Nutch crawl process
Hi everyone,

I would like to use the Thai language analyzer during a Nutch crawl, and I know this is possible. I would appreciate some help, as there is little or no documentation about it. I tried to activate the language identifier plugin for analysis by changing the plugin.includes property in nutch-site.xml, but nothing happens: the log still shows "not including" for the language identifier. :-( What's wrong? Please help.

Thanks in advance.

-- View this message in context: http://www.nabble.com/implement-thai-lanaguage-analyzer-during-nutch-crawl-process-tf2709770.html#a7554763 Sent from the Nutch - Dev mailing list archive at Nabble.com.
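One thing that may be worth checking for the question above: plugin.includes is, as far as is known, treated as a regular expression that must match each plugin's id, and the id is not necessarily the same as the plugin's directory name. The sketch below only illustrates that whole-string matching with plain java.util.regex. The property values and the id language-identifier are assumptions for illustration, not taken from the poster's configuration; the real id is whatever the plugin's own plugin.xml declares.

```java
import java.util.regex.Pattern;

// Sketch only: mimics a whole-string regex match of plugin ids against a
// plugin.includes value. The ids and values below are illustrative assumptions.
final class PluginIncludesCheck {

    // Returns true when the entire plugin id matches the includes expression.
    static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String before = "protocol-http|parse-(text|html)|index-basic";
        String after = before + "|language-identifier";
        String id = "language-identifier"; // assumed plugin id

        System.out.println("included before edit: " + isIncluded(before, id));
        System.out.println("included after edit:  " + isIncluded(after, id));
        // A value spelled like the directory name does not match the assumed id.
        System.out.println("directory-style name: "
                + isIncluded(before + "|languageidentifier", id));
    }
}
```

If the expression looks right and the plugin is still reported as not included, it may also be worth confirming that the edited nutch-site.xml is the copy actually on the classpath of the crawl.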
RE: implement Thai language analyzer in Nutch
Thank you, Mr. Teruhiko Kurosaka.

I was able to locate the th.ngp file in the nutch-0.8-dev distribution, and I was able to compile the distribution. When I ran the crawl, I was not sure whether it picked up the language identifier. I added

<implementation id="org.apache.nutch.analysis.th.ThaiAnalyzer" class="org.apache.nutch.analysis.th.ThaiAnalyzer" lang="th"/>

to languageidentifier/plugin.xml. Then I ran a crawl and got a stupid error:

dedup ...
Dedup: adding indexes in: crawlnewpantip14nov2/indexes
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:393)
    at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:432)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:131)

Your help is much appreciated.

-- View this message in context: http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7334391 Sent from the Nutch - Dev mailing list archive at Nabble.com.
RE: implement Thai language analyzer in Nutch
Oh, Thai words are not space delimited? OK, in that case you'd need to study how ThaiAnalyzer works and then modify the rules in NutchAnalysis.jj (if you are going to use the web search GUI from Nutch). This is because search expressions are first parsed by the parser generated from NutchAnalysis.jj, before each term is handed to the language-specific analyzer, and currently, if a character belongs to the CJK category, each character is treated as though it were a word. If ThaiAnalyzer does not do the same, you can index the Thai docs, but you won't be able to find any doc unless the search term is a single Unicode character.

-kuro

-----Original Message-----
From: sanjeev [mailto:[EMAIL PROTECTED]
Sent: 2006-11-08 19:28
To: nutch-dev@lucene.apache.org
Subject: Re: implement thai lanaguage analyzer in nutch

I need a Thai Analyzer for Nutch. I want the crawler to be intelligent enough to split Thai words correctly, since Thai doesn't have spaces between words. :-(
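The mismatch described in the message above can be made concrete with a small sketch. It assumes, purely for illustration, that the index-time analyzer emits whole words while the query side splits the same text into one term per character; the two-word segmentation below is a hand-written stand-in, not real ThaiAnalyzer output, and no Nutch or Lucene classes are used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: shows why per-character query terms miss word-level index terms.
// The indexed words are a hypothetical stand-in for an analyzer's output.
final class TokenMismatchDemo {

    // Query side as described above: every character becomes its own term.
    static List<String> perCharacterTerms(String query) {
        List<String> terms = new ArrayList<>();
        query.codePoints().forEach(cp -> terms.add(new String(Character.toChars(cp))));
        return terms;
    }

    // Counts how many query terms exist verbatim in the set of indexed terms.
    static long matchingTerms(Set<String> indexedTerms, List<String> queryTerms) {
        return queryTerms.stream().filter(indexedTerms::contains).count();
    }

    public static void main(String[] args) {
        // A Thai phrase of seven characters, written with Unicode escapes so the
        // source compiles regardless of file encoding.
        String first = "\u0E20\u0E32\u0E29\u0E32";
        String second = "\u0E44\u0E17\u0E22";

        // Hypothetical index-time output: two word-level terms.
        Set<String> indexed = new HashSet<>(Arrays.asList(first, second));

        List<String> query = perCharacterTerms(first + second);
        System.out.println("query terms: " + query.size());
        System.out.println("found in index: " + matchingTerms(indexed, query));
    }
}
```

Under these assumptions none of the single-character query terms equals an indexed word, which is the "can index but cannot find" situation the message warns about; both sides have to agree on how text is split.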
Re: implement Thai language analyzer in Nutch
Sanjeev,

You have implemented the Thai language, right? What other changes have you made to the original code? Do I need to make the same changes for, say, the Hindi and Punjabi languages? If you have a bit of time to explain these things, it will be of great help to me.

Thank you.
./Arun

On 11/8/06, sanjeev [EMAIL PROTECTED] wrote:

Arun,

I'm sure there is/must be a patch for Hindi too. I saw something on the forum about the Marathi language. Only there is no documentation anywhere for these things. I'm assuming that, in the pluggable architecture of Nutch, support for one language works the same as for any other language. But yes, even I would appreciate any information to resolve this problem.

regards,
sanjeev.
Re: implement Thai language analyzer in Nutch
Arun,

I tried implementing Thai search for Nutch. I followed the steps outlined in this tutorial for Chinese: http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_62153

So sorry, I am not able to help much. How urgent is your requirement? Mine is very urgent, as I have to get this implemented ASAP and I can't wait.

cheers,
sanjeev.
RE: implement Thai language analyzer in Nutch
Sanjay,

I don't think you should follow the Chinese example and extend the CJK range. That was needed because Chinese and Japanese don't use spaces to separate words. I believe Thai uses spaces, right? If so, you should extend the LETTER range to include Thai characters rather than CJK.

Another place you would need to change is the LanguageIdentifier. You would either train it or implement some hack so that it can detect Thai-language documents that are not HTML with a lang=th attribute.

-kuro
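As an illustration of the kind of "hack" mentioned above, the sketch below guesses whether text is Thai by counting letters that fall in the Unicode Thai block (U+0E00 to U+0E7F). It is not how a trained, n-gram based identifier works, and the class name, method names and the 0.5 threshold are arbitrary assumptions chosen for the example.

```java
// Sketch only: a crude script-based guess, not a trained language identifier.
// The 0.5 threshold is an arbitrary assumption.
final class ThaiScriptGuess {

    // Fraction of the letters in the text that belong to the Unicode Thai block.
    static double thaiLetterRatio(String text) {
        long letters = text.codePoints().filter(Character::isLetter).count();
        if (letters == 0) {
            return 0.0; // no letters at all, so nothing to judge
        }
        long thai = text.codePoints()
                .filter(Character::isLetter)
                .filter(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.THAI)
                .count();
        return (double) thai / letters;
    }

    // True when more than half of the letters are Thai.
    static boolean looksThai(String text) {
        return thaiLetterRatio(text) > 0.5;
    }

    public static void main(String[] args) {
        String thaiSample = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";
        System.out.println("Thai sample:  " + looksThai(thaiSample));
        System.out.println("Latin sample: " + looksThai("nutch crawl"));
    }
}
```

The same block test is also a quick way to see which characters a LETTER or CJK range in a grammar would need to cover, although the grammar itself has to be edited separately.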
Re: implement Thai language analyzer in Nutch
Regarding Thai, there is a Thai Analyzer in Lucene already:

$ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
total 24
drwxrwxr-x 7 otis otis 4096 Oct 27 02:08 .svn/
-rw-rw-r-- 1 otis otis 1528 Jun  5 14:27 ThaiAnalyzer.java
-rw-rw-r-- 1 otis otis 2437 Jun  5 14:27 ThaiWordFilter.java

Otis
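For readers wondering how Thai text can be split into words at all: the ThaiWordFilter listed above is understood to delegate to the JDK's java.text.BreakIterator for the Thai locale, though that should be confirmed against the Lucene source in use. The sketch below shows only that JDK call, without any Lucene or Nutch classes. The class and method names are made up for the example, and where the boundaries fall depends on the locale data shipped with the JDK.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only: plain JDK word segmentation, no Lucene or Nutch classes involved.
final class ThaiSegmenter {

    // Splits text at the word boundaries the JDK reports for the Thai locale.
    static List<String> segment(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String piece = text.substring(start, end).trim();
            if (!piece.isEmpty()) {
                tokens.add(piece); // whitespace-only segments are skipped
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Unbroken Thai text, written with Unicode escapes.
        String thai = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";
        List<String> tokens = segment(thai);
        System.out.println("segments: " + tokens.size());
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}
```

Running this on sample pages is a cheap way to judge whether the JDK's segmentation is good enough before wiring an analyzer into the crawl.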
Re: implement Thai language analyzer in Nutch
OK. I downloaded the code examples from the Lucene in Action book and found some analyzers and tests/demos, which included Chinese. But these analyzers were standalone Java programs with a main method. My question is how to integrate one into Nutch so that the index created by the crawl process can be searched in Thai. Someone please help, as I'm hopelessly confused by the whole thing. :-(

cheers,
sanjeev.
Re: implement Thai language analyzer in Nutch
OK Kuro, you are wrong about the Thai language having spaces between words. Thai doesn't put spaces between words, and segmenting Thai is a bit tricky, methinks. I will appreciate any and all help you can give me.

cheers,
sanjeev
Re: implement Thai language analyzer in Nutch
Arun,

No, I haven't come anywhere near the solution. I am a little confused myself. From what I've learnt, one approach is to use NutchAnalysis.jj and compile it using JavaCC. Another is to download the dev version of Nutch and try to use the patches for the language analyzer and identifier. I failed on both counts, simply because I don't have a clear picture of how to make Nutch work for the Thai language. Can someone please outline for me the steps and procedure to make Nutch run in Thai?

cheers,
sanjeev.

Arun Kaundal wrote:

Sanjeev,

My requirement is also very urgent. I tried a lot to solve this at my end but was unable to do so. Well, I wish you the best of luck with your problem. I think you are very near to a solution. I hope you will get things working soon, so that you can also help me in this regard.

Thanks.
./arun
Re: implement Thai language analyzer in Nutch
I think you should learn JavaCC and then understand NutchAnalysis.jj; then the Thai problem will be resolved soon. Just try it.

On 11/7/06, sanjeev [EMAIL PROTECTED] wrote:

Hello,

After playing around with Nutch for a few months, I was trying to implement the Thai language analyzer for Nutch. I downloaded the Subversion version and compiled it using ant; everything was fine. Next, I didn't see any tutorial for Thai, but I did see one for Chinese at http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_62153

I tried following the steps outlined there but ran into compiler errors: a type mismatch between the Lucene Token and the Nutch Token. Suffice it to say, I am back at square one as far as implementing the Thai language analyzer for Nutch is concerned. Can someone please outline the exact procedure for this, or point me to a tutorial that explains how? I would be highly obliged.

Thanks.

-- www.babatu.com
Re: implement Thai language analyzer in Nutch
Oh, by the way, I followed the Chinese tutorial and was able to compile, and everything was fine. Let me just test whether it is working properly. However, I didn't make any changes to NutchAnalysis.jj. I need more information, please.

Thanks a bunch.
Re: implement Thai language analyzer in Nutch
Hi sanjeev and Kauu,

I want to support Hindi, a language widely spoken in India. Can you guide me on what else I need to modify? I think there is currently no support for searching and indexing Hindi, and I want to work on this. But I need some information about what to modify and where exactly the changes are required. Can anybody help me?

Thanks.
./Arun