Modifying StopAnalyzer

2007-12-26 Thread Liaqat Ali


Hi, Erick

Thanks for your suggestion; putting the declaration of the StringBuffer 
variable sb inside the for loop works well. I want to ask another 
question: can we modify the StopAnalyzer to use stop words of another 
language instead of English, like the Urdu words given below?





public static final String[] URDU_STOP_WORDS = { "پر", "کا", "کی", "کو" };






Re: Modifying StopAnalyzer

2007-12-26 Thread Doron Cohen
>
>  can we modify the StopAnalyzer to use stop words of
> another language instead of English, like the Urdu words given below:
> public static final String[] URDU_STOP_WORDS = { "پر", "کا", "کی", "کو" };
>

"new StandardAnalyzer(URDU_STOP_WORDS)" should work.

Regards,
Doron
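
A minimal sketch of that suggestion (assuming the Lucene 2.x StandardAnalyzer(String[]) constructor; the class name, the field name "contents", and the sample text are illustrative). It prints the tokens that survive analysis, so a sample made only of the stop words should print nothing:

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class UrduStopWordsDemo {
    public static final String[] URDU_STOP_WORDS = { "پر", "کا", "کی", "کو" };

    public static void main(String[] args) throws Exception {
        // Pass the custom stop-word list instead of the default English one.
        StandardAnalyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);
        // Analyze a sample consisting only of the stop words above.
        TokenStream ts = analyzer.tokenStream("contents",
                new StringReader("پر کا کی کو"));
        // In the 2.x API, next() returns null when the stream is exhausted.
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t.termText());
        }
    }
}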


Re: Index lucene database details.

2007-12-26 Thread Grant Ingersoll
I would start at the Lucene Java home page (http://lucene.apache.org/java)
and dig in from there.  There are a number of good docs on Scoring
and the IR model used (Boolean plus Vector).  From there, I would dig
into the javadocs and whip up some example code that indexes a set of
tokens and documents with a controlled vocabulary.  From there, you
can dig into the source itself, especially the new DocumentsWriter
class.  And, of course, along the way, please feel free to submit
documentation patches!


Also, this mailing list and the java-dev mailing list have a wealth of  
information about the internals of Lucene, so please dig through the  
archives and ask questions here as well.


-Grant

On Dec 22, 2007, at 9:10 PM, Berlin Brown wrote:


Do you guys have article links or other documents that describe the
Lucene database? E.g., what is it composed of?

--
Berlin Brown
http://botspiritcompany.com/botlist/spring/help/about.html




--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







Re: optimize Index problem

2007-12-26 Thread Grant Ingersoll
Great, I think.  Except now I am really interested in the exception
and what settings you had for heap size, Lucene version, etc.




On Dec 23, 2007, at 11:03 PM, Zhou Qi wrote:


Hi, Grant

After I adjusted the mergeFactor of the IndexWriter from 1000 to 100, it
worked.

Thank you.






--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
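
For reference, the setting Zhou Qi adjusted, as a minimal sketch against the Lucene 2.x IndexWriter API (the index path is illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class MergeFactorDemo {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        // mergeFactor controls how many segments are merged at once; very
        // large values (like 1000) make each merge far more resource-hungry.
        writer.setMergeFactor(100);
        writer.close();
    }
}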







RE: Pagination ...

2007-12-26 Thread Dragon Fly

Any advice on this? Thanks.

> From: [EMAIL PROTECTED]
> To: java-user@lucene.apache.org
> Subject: Pagination ...
> Date: Sat, 22 Dec 2007 10:19:30 -0500
> 
> 
> Hi,
> 
> What is the most efficient way to do pagination in Lucene? I have always done 
> the following because this "flavor" of the search call allows me to specify 
> the top N hits (e.g. 1000) and a Sort object:
> 
> TopFieldDocs topFieldDocs = searcher.search(query, null, 1000, 
> SORT_BY_DATE);
> 
> Is it the best way? Thank you.
> 

Re: Pagination ...

2007-12-26 Thread Zhou Qi
Using the search function for each page will carry out an unnecessary index
search every time you go to the previous or next page. Generally, most
information needs (e.g. 80%) can be satisfied by the first 100 documents
(20%). In Lucene, the number of documents returned is set to 100 for the
sake of speed.

I am not quite sure my way of pagination is best, but it works fine under
test pressure: just keep the first search result in a cache and fetch the
snippet when the document is presented in the current page.
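
A minimal sketch of that approach (assuming the Lucene 2.x search(Query, Filter, int, Sort) method; the 1000-hit cap, the cache object's lifetime, and the paging arithmetic are illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopFieldDocs;

public class PageCache {
    private final IndexSearcher searcher;
    private final ScoreDoc[] hits; // filled once per query, then reused

    public PageCache(IndexSearcher searcher, Query query, Sort sort) throws Exception {
        this.searcher = searcher;
        // Run the index search only once, keeping the top 1000 hits.
        TopFieldDocs top = searcher.search(query, null, 1000, sort);
        this.hits = top.scoreDocs;
    }

    // Going to the previous or next page only loads that page's documents.
    public Document[] page(int pageNo, int pageSize) throws Exception {
        int from = pageNo * pageSize;
        int to = Math.min(from + pageSize, hits.length);
        Document[] docs = new Document[Math.max(to - from, 0)];
        for (int i = from; i < to; i++) {
            docs[i - from] = searcher.doc(hits[i].doc);
        }
        return docs;
    }
}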



Re: Index lucene database details.

2007-12-26 Thread Zhou Qi
Hi Grant,

The exception is thrown from a Java native method: "Failed to merge indexes,
java.lang.OutOfMemoryError: Java heap space". (I have set -Xmx1024m in the
JVM.)
I guess it is similar to the problem that appeared in a previous thread (
http://www.nabble.com/Index-merge-and-java-heap-space-tt505274.html#a505274).
But I don't know the exact cause. Does anyone have an answer?



Re: Pagination ...

2007-12-26 Thread Mike Richmond
You might want to take a look at Solr (http://lucene.apache.org/solr/).  You
could either use Solr directly, or see how they implement paging.


--Mike




Analyzer choices for indexing and searching multiple languages

2007-12-26 Thread Jay Hill
I'm working on a project where we will be searching across several languages
with a single query. There will be different categories which will include
different groups of languages to search (i.e. category "a": English, French,
Spanish; category "b": Spanish, Portuguese, Italian, etc.). Originally I began
indexing each language using a language-specific Analyzer, but I'm not sure
how to handle the QueryParser at search time, i.e. which Analyzer to pass to
it.

Does anyone have any experience with indexing all the languages using the
StandardAnalyzer? Right now we only have European languages to index, so I'm
wondering if anyone has had any experience using the StandardAnalyzer to
index European languages, and then using QueryParser with the
StandardAnalyzer at search time.

Or would it be better to analyze each language at index time using a
language-specific Analyzer, and then still use the QueryParser with the
StandardAnalyzer at search time? I've considered building a BooleanQuery of
QueryParsers, with each QueryParser built with a language-specific Analyzer
(see the sketch after this message), but that seems like it would be bound
to be very slow.

Any opinions or thoughts appreciated.

-Jay
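
A minimal sketch of the BooleanQuery-of-QueryParsers idea described above (assuming Lucene 2.x with the contrib analyzers on the classpath; the per-language field names and the choice of languages are illustrative):

import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class MultiLanguageQuery {
    public static Query parse(String userQuery) throws Exception {
        // One SHOULD clause per language: a match in any language qualifies.
        BooleanQuery bq = new BooleanQuery();
        bq.add(new QueryParser("contents_en", new StandardAnalyzer()).parse(userQuery),
                BooleanClause.Occur.SHOULD);
        bq.add(new QueryParser("contents_fr", new FrenchAnalyzer()).parse(userQuery),
                BooleanClause.Occur.SHOULD);
        bq.add(new QueryParser("contents_de", new GermanAnalyzer()).parse(userQuery),
                BooleanClause.Occur.SHOULD);
        return bq;
    }
}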


StopWords problem

2007-12-26 Thread Liaqat Ali

Hi, Doron Cohen

Thanks for your reply, but I am facing a small problem over here. As I 
am using Notepad for coding, in which format should the file be saved?



public static final String[] URDU_STOP_WORDS = { "کے", "کی", "سے", "کا", "کو", "ہے" };

Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);


If I save it in ANSI format it loses the contents. I tried Unicode but 
it does not work, and I also tried UTF-8, but that generates two errors 
about illegal characters. What should be the solution? Kindly guide me 
on this.


Thanks ..




Re: StopWords problem

2007-12-26 Thread 李晓峰
"javac" has an option "-encoding", which tells the compiler the encoding 
the input source file is using, this will probably solve the problem.
or you can try the unicode escape: \u, then you can save it in ANSI, 
had for human to read though.
or use an IDE, eclipse is a good choice, you can set the source file 
encoding, and it will take care of the compiler for you.
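
For example (the file name is illustrative; \u06A9, \u0627, \u06CC and \u0648 are the Unicode code points of ک, ا, ی and و):

javac -encoding UTF-8 UrduIndexer.java

// the same words written with Unicode escapes, safe to save in ANSI:
public static final String[] URDU_STOP_WORDS = { "\u06A9\u0627", "\u06A9\u06CC", "\u06A9\u0648" };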


regards.




Re: StopWords problem

2007-12-26 Thread Liaqat Ali



Hi,
Thanks a lot for your suggestion.
Using javac -encoding UTF-8 still raises the following error:

urduIndexer.java : illegal character: \65279
?
^
1 error

What am I doing wrong?




Re: StopWords problem

2007-12-26 Thread 李晓峰

It's Notepad.
It adds a byte-order mark (BOM; in this case 65279, or 0xFEFF) at the front 
of your file, which javac does not recognize, for reasons not quite clear to me.

Here is the bug: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
It won't be fixed, so try to eliminate the BOM before compiling your code.
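
A minimal sketch of stripping the BOM (the file names are illustrative; a UTF-8 reader in Java decodes the BOM to the single character \uFEFF, i.e. 65279):

import java.io.*;

public class StripBom {
    public static void main(String[] args) throws IOException {
        Reader in = new InputStreamReader(
                new FileInputStream("urduIndexer.java"), "UTF-8");
        Writer out = new OutputStreamWriter(
                new FileOutputStream("urduIndexer.nobom.java"), "UTF-8");
        int first = in.read();
        // Drop a leading BOM; keep any other first character.
        if (first != -1 && first != 0xFEFF) {
            out.write(first);
        }
        for (int c = in.read(); c != -1; c = in.read()) {
            out.write(c);
        }
        out.close();
        in.close();
    }
}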




Re: StopWords problem

2007-12-26 Thread 李晓峰

Or you can save it as "Unicode" and use javac -encoding Unicode.

This way you can still use Notepad.




Re: StopWords problem

2007-12-26 Thread Doron Cohen
On Dec 26, 2007 10:33 PM, Liaqat Ali <[EMAIL PROTECTED]> wrote:

> Using javac -encoding UTF-8 still raises the following error.
>
> urduIndexer.java : illegal character: \65279
> ?
> ^
> 1 error
>
> What am I doing wrong?
>

If you have the stop words in a file, say one word per line,
they can be read like this:

BufferedReader r = new BufferedReader(new InputStreamReader(
        new FileInputStream("Urdu.txt"), "UTF8"));
String word = r.readLine(); // loop this line, you get the picture

(Make sure to specify encoding "UTF8" when saving the file from notepad).

Regards,
Doron
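
A fuller sketch along those lines (the file name and class name are illustrative; assumes the Lucene 2.x StandardAnalyzer(String[]) constructor):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class UrduStopWords {
    // Reads one stop word per line from a UTF-8 file.
    public static String[] load(String path) throws Exception {
        List words = new ArrayList();
        BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), "UTF8"));
        for (String line = r.readLine(); line != null; line = r.readLine()) {
            line = line.trim();
            if (line.length() > 0) {
                words.add(line);
            }
        }
        r.close();
        return (String[]) words.toArray(new String[words.size()]);
    }

    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(load("Urdu.txt"));
    }
}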


Re: StopWords problem

2007-12-26 Thread Liaqat Ali


Hi, Doron

The compilation problem is solved, but there is no change in the index.

public static final String[] URDU_STOP_WORDS = { "کی", "کا", "کو", "ہے", "کے", "نے", "پر", "اور", "سے", "میں", "بھی", "ان", "ایک", "تھا", "تھی", "کیا", "ہیں", "کر", "وہ", "جس", "نہں", "تک" };

Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);

These words still appear in the index with high ranks.

Regards,
Liaqat




Re: StopWords problem

2007-12-26 Thread Grant Ingersoll

Are you altering (stemming) the token before it gets to the StopFilter?




--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







Re: StopWords problem

2007-12-26 Thread Liaqat Ali



No, at this level I am not using any stemming technique. I am just 
trying to eliminate stop words.





Re: StopWords problem

2007-12-26 Thread Grant Ingersoll


On Dec 26, 2007, at 5:24 PM, Liaqat Ali wrote:




No, at this level I am not using any stemming technique. I am just  
trying to eliminate stop words.


Can you share your analyzer code?

-Grant




Re: StopWords problem

2007-12-26 Thread Liaqat Ali




Hi, Grant

I think I did not make myself clear. I am trying to pass a list of Urdu 
stop words as an argument to the StandardAnalyzer, but it does not work 
well for me:

public static final String[] URDU_STOP_WORDS = { "کی", "کا", "کو", "ہے", "کے", "نے", "پر", "اور", "سے", "میں", "بھی", "ان", "ایک", "تھا", "تھی", "کیا", "ہیں", "کر", "وہ", "جس", "نہں", "تک" };

Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);


Kindly give some guidelines.

Regards,
Liaqat
