implement Thai language analyzer during Nutch crawl process

2006-11-26 Thread sanjeev

Hi everyone,

I would like to use the Thai language analyzer during a Nutch crawl,
and I know this is possible.

I would appreciate some help in this regard, as there is little or no
documentation about this.

I tried to activate the language identifier plugin for analysis by changing
the plugin.includes property in nutch-site.xml, but nothing happens:
the log still shows "not including" for the language identifier. What's wrong?



Please help.

Thanks in advance.


-- 
View this message in context: 
http://www.nabble.com/implement-thai-lanaguage-analyzer-during-nutch-crawl-process-tf2709770.html#a7554763
Sent from the Nutch - Dev mailing list archive at Nabble.com.
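
A note on the plugin.includes question above. As far as I know, Nutch treats
the plugin.includes value as a regular expression that must match a plugin's
id in full, and plugins that do not match are the ones logged as "not
including". The sketch below only illustrates that matching rule in plain
Java; it does not call any Nutch code. The plugin.includes values and the
plugin id "language-identifier" are assumptions, so check them against your
own nutch-site.xml and the id attribute in the plugin's plugin.xml.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {

    /** True when the whole plugin id matches the plugin.includes regex. */
    public static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Hypothetical plugin.includes values; substitute the one from your nutch-site.xml.
        String before = "protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)";
        String after = before + "|language-identifier";

        // Assumed plugin id; verify it in the plugin's own plugin.xml.
        String pluginId = "language-identifier";

        System.out.println("before: " + isIncluded(before, pluginId));
        System.out.println("after:  " + isIncluded(after, pluginId));
        // A directory-style name without the hyphen is a different string and does not match.
        System.out.println("dir name: " + isIncluded(after, "languageidentifier"));
    }
}
```

If the id does match and the log still says "not including", it may be worth
checking that the crawl really reads the edited nutch-site.xml and not
another copy on the classpath.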



RE: implement Thai language analyzer in Nutch

2006-11-14 Thread sanjeev

Thank you, Mr. Teruhiko Kurosaka.


I was able to locate the th.ngp file in the nutch-0.8-dev distribution.

I was able to compile the distribution. When I ran the crawl, I was not sure
whether it picked up the language identifier. I added

 <implementation id="org.apache.nutch.analysis.th.ThaiAnalyzer"
 class="org.apache.nutch.analysis.th.ThaiAnalyzer" lang="th"/>

to languageidentifier/plugin.xml.

Then I ran a crawl and got an error during the dedup step:

Dedup: adding indexes in: crawlnewpantip14nov2/indexes
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:393)
    at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:432)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:131)

Your help is much appreciated.




-- 
View this message in context: 
http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7334391
Sent from the Nutch - Dev mailing list archive at Nabble.com.



RE: implement Thai language analyzer in Nutch

2006-11-10 Thread Teruhiko Kurosaka
Oh, Thai words are not space delimited?
OK, in that case, you'd need to study how ThaiAnalyzer works and
then modify the rules in NutchAnalysis.jj (if you are going to use
the web search GUI from Nutch).  This is because search
expressions are first parsed by the parser generated from NutchAnalysis.jj,
before each term is handed to the language-specific analyzer,
and currently, if a character belongs to the CJK category, each character
is treated as though it were a word.  If ThaiAnalyzer does not do the
same, you can index the Thai docs but you won't be able to find any doc
unless the search term is one Unicode character.


-kuro
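
The mismatch described above can be shown with a toy model. This is only a
sketch in plain Java, not Nutch or Lucene code: the index vocabulary is
hard-coded and hypothetical (the two words of "ภาษาไทย", "Thai language",
written as Unicode escapes), and the per-character splitting stands in for
what the query parser is said to do for characters in the CJK category.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class QueryIndexMismatch {

    /** Query-side behaviour described above: every character becomes its own term. */
    public static List<String> perCharacterTerms(String query) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < query.length(); ) {
            int cp = query.codePointAt(i);
            terms.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp);
        }
        return terms;
    }

    /** Counts how many query terms appear verbatim in the index vocabulary. */
    public static int matchingTerms(List<String> queryTerms, Set<String> indexedTerms) {
        int hits = 0;
        for (String term : queryTerms) {
            if (indexedTerms.contains(term)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical index vocabulary: word-level tokens ("phasa", "thai"),
        // as a word-segmenting analyzer might produce at index time.
        String phasa = "\u0E20\u0E32\u0E29\u0E32";
        String thai = "\u0E44\u0E17\u0E22";
        Set<String> indexed = new HashSet<>(Arrays.asList(phasa, thai));

        String query = phasa + thai;
        // Per-character query terms never equal the word-level index terms.
        System.out.println(matchingTerms(perCharacterTerms(query), indexed));
        // Word-level query terms do.
        System.out.println(matchingTerms(Arrays.asList(phasa, thai), indexed));
    }
}
```

In this model the per-character query finds nothing while the word-level
query finds both words, which is why the query side and the index side have
to split Thai text the same way.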

 -Original Message-
 From: sanjeev [mailto:[EMAIL PROTECTED] 
 Sent: 2006-11-08 19:28
 To: nutch-dev@lucene.apache.org
 Subject: Re: implement thai lanaguage analyzer in nutch
 
 
 I need a Thai analyzer for Nutch. I want the crawler to be
 intelligent enough to split Thai words correctly, since Thai
 doesn't have spaces between words.
 :-(
 
 
 
 
 
 -- 
 View this message in context: 
 http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7251826
 Sent from the Nutch - Dev mailing list archive at Nabble.com.
 
 


Re: implement Thai language analyzer in Nutch

2006-11-08 Thread Arun Kaundal

Sanjeev,

You have implemented Thai language support, right? What other changes have
you made to the original code? Do I need to make the same changes for, say,
Hindi and Punjabi?

If you have a bit of time to explain things, it will be of great help to
me.

Thank you

./Arun


On 11/8/06, sanjeev [EMAIL PROTECTED] wrote:



Arun,

I'm sure there is (or must be) a patch for Hindi too.

I saw something on the forum about the Marathi language.

Only there is no documentation anywhere for these things.

I'm assuming that, in the pluggable architecture of Nutch, support for
one language works the same way as for any other language.

But yes, I too would appreciate any information to resolve this problem.

Regards,
sanjeev.


--
View this message in context:
http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7233864
Sent from the Nutch - Dev mailing list archive at Nabble.com.




Re: implement Thai language analyzer in Nutch

2006-11-08 Thread sanjeev

Arun,

I tried implementing Thai search for Nutch.

I followed the steps outlined in this tutorial for Chinese:
http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_62153

So sorry, I am not able to help much. How urgent is your requirement?

Mine is very urgent, as I have to get this implemented ASAP and I can't wait.


cheers,
sanjeev.





-- 
View this message in context: 
http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7236321
Sent from the Nutch - Dev mailing list archive at Nabble.com.



RE: implement Thai language analyzer in Nutch

2006-11-08 Thread Teruhiko Kurosaka
Sanjay,
I don't think you should follow the Chinese example and extend the CJK
range.
This was needed because Chinese and Japanese don't use spaces to separate
words.  I believe Thai uses spaces, right? If so, you should extend the
LETTER range to include Thai characters rather than CJK.

Another place you would need to change is the LanguageIdentifier.
You would either train it, or implement some hack, in order for it to
be able to detect Thai-language documents that are not HTML with a
lang=th attribute.

-kuro
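
One possible form of the "hack" mentioned above is to look at the script of
the text itself. The sketch below is plain Java and does not use the Nutch
LanguageIdentifier API; the class name, method names and the threshold are
made up for illustration. It measures what fraction of the letters in a
string fall in the Unicode Thai block.

```java
public class ThaiScriptHeuristic {

    /** Fraction of letters in the text that belong to the Unicode Thai block. */
    public static double thaiRatio(String text) {
        int letters = 0;
        int thaiLetters = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // ignore digits, spaces, punctuation and combining marks
            }
            letters++;
            if (Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.THAI) {
                thaiLetters++;
            }
        }
        return letters == 0 ? 0.0 : (double) thaiLetters / letters;
    }

    /** Hypothetical decision rule: treat the text as Thai above a chosen ratio. */
    public static boolean looksThai(String text, double threshold) {
        return thaiRatio(text) >= threshold;
    }

    public static void main(String[] args) {
        String thaiText = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22"; // "phasa thai"
        System.out.println(thaiRatio(thaiText));
        System.out.println(looksThai("Nutch crawl", 0.5));
    }
}
```

A script check like this cannot tell Thai apart from other languages written
in the same script, so training the identifier with a Thai profile is the
more general route.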


Re: implement Thai language analyzer in Nutch

2006-11-08 Thread ogjunk-nutch
Regarding Thai, there is a Thai Analyzer in Lucene already:

$ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
total 24
drwxrwxr-x  7 otis otis 4096 Oct 27 02:08 .svn/
-rw-rw-r--  1 otis otis 1528 Jun  5 14:27 ThaiAnalyzer.java
-rw-rw-r--  1 otis otis 2437 Jun  5 14:27 ThaiWordFilter.java

Otis
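
As far as I can tell, the ThaiWordFilter listed above relies on the JDK's
java.text.BreakIterator for Thai word boundaries. The sketch below uses
only the JDK, not Lucene, to show what that kind of segmentation looks like.
The class and method names are hypothetical, and the actual boundaries
depend on the Thai locale data shipped with your JDK, so no particular
output is promised here.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ThaiSegmentSketch {

    /** Splits text at the word boundaries reported by the JDK's Thai BreakIterator. */
    public static List<String> segment(String text) {
        BreakIterator breaker = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        breaker.setText(text);
        List<String> words = new ArrayList<>();
        int start = breaker.first();
        for (int end = breaker.next(); end != BreakIterator.DONE; start = end, end = breaker.next()) {
            String piece = text.substring(start, end);
            // Keep only pieces that contain a letter or digit (drops spaces and punctuation).
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                words.add(piece);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // "phasa thai" ("Thai language") written without spaces, as Unicode escapes.
        String text = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";
        System.out.println(segment(text));
    }
}
```

Running this on a few sample sentences is a quick way to see whether the JDK
in use can segment Thai at all before wiring an analyzer into a crawl.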






Re: implement Thai language analyzer in Nutch

2006-11-08 Thread sanjeev

OK. I downloaded the code examples from the book Lucene in Action and found
that they include some analyzers and tests/demos, including ones for Chinese.

But these analyzers are standalone Java programs with a main method.

My question is how to integrate one into Nutch so that the index created by
the crawl process is searchable in Thai.

Someone please help, as I'm hopelessly confused by the whole thing. :-(

cheers,
sanjeev.






-- 
View this message in context: 
http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7252838
Sent from the Nutch - Dev mailing list archive at Nabble.com.



Re: implement Thai language analyzer in Nutch

2006-11-08 Thread sanjeev

OK Kuro, you are wrong about the Thai language having spaces between words.
Thai doesn't have spaces between words, and segmenting Thai is a bit tricky,
methinks.

I will appreciate any and all help you can give me.

cheers,
sanjeev









-- 
View this message in context: 
http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7252863
Sent from the Nutch - Dev mailing list archive at Nabble.com.



Re: implement Thai language analyzer in Nutch

2006-11-08 Thread sanjeev

Arun,

No, I haven't come anywhere near a solution. I am a little confused myself.
From what I've learnt, one approach is to modify NutchAnalysis.jj and compile
it using javacc.

Another is to download the dev version of Nutch and try to use the patches
for the language analyzer and identifier.

I failed on both counts, simply because I don't have a clear picture of how
to make Nutch work for the Thai language.

Can someone please outline for me the steps and procedure to make Nutch run
in Thai?


cheers,
sanjeev.








Arun Kaundal wrote:
 
 Sanjeev,
 
 My requirement is also very urgent. I tried a lot to solve this at my
 end but was unable to do so.
 Well, I wish you the best of luck with your problem; I think you are very
 near a solution.
 I hope you will get things working soon, so that you can also help me in
 this regard.
 
 Thanks.
 
 ./arun
 
 
 
 

-- 
View this message in context: 
http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7252973
Sent from the Nutch - Dev mailing list archive at Nabble.com.



Re: implement Thai language analyzer in Nutch

2006-11-07 Thread kauu

I think you should learn JavaCC and then understand NutchAnalysis.jj;
then the Thai problem will be resolved soon.
Just try it.

On 11/7/06, sanjeev [EMAIL PROTECTED] wrote:



Hello,

After playing around with Nutch for a few months, I was trying to implement
the Thai language analyzer for Nutch.

I downloaded the Subversion version and compiled it using ant; everything
was fine.

Next, I didn't see any tutorial for Thai, but I did see one for Chinese at

http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_62153

I tried following the steps outlined there but ran into compiler errors:
a type mismatch between the Lucene Token and the Nutch Token.

Suffice it to say I am back at square one as far as implementing the Thai
language analyzer for Nutch is concerned.

Can someone please outline the exact procedure for this, or point me to a
tutorial that explains how to do it?

I would be highly obliged.
Thanks.



--
View this message in context:
http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7214087
Sent from the Nutch - Dev mailing list archive at Nabble.com.





--
www.babatu.com


Re: implement Thai language analyzer in Nutch

2006-11-07 Thread sanjeev

Oh, by the way: I followed the Chinese tutorial again and was able to compile,
and everything was fine.

Let me just test whether it is working properly. However, I didn't make any
changes to NutchAnalysis.jj.

I need more information, please.

Thanks a bunch.
-- 
View this message in context: 
http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7232439
Sent from the Nutch - Dev mailing list archive at Nabble.com.



Re: implement Thai language analyzer in Nutch

2006-11-07 Thread Arun Kaundal

Hi sanjeev and Kauu,

I want to support Hindi, a language widely spoken in India.
Can you guide me on what else I need to modify? I think there is no support
for searching and indexing Hindi.
I want to work on this, but I need some information about what to modify
and where exactly the changes are required. Can anybody help me?

Thanks.
./Arun

