Modifying StopAnalyzer
Hi Erick, thanks for your suggestion. Putting the declaration of the StringBuffer variable sb inside the for loop works well. I want to ask another question: can we modify StopAnalyzer to use stop words of another language instead of English, such as Urdu, given below?

public static final String[] URDU_STOP_WORDS = { "پر", "کا", "کی", "کو" };
Re: Modifying StopAnalyzer
> can we modify the StopAnalyzer to insert stop words of another language, instead of English, like Urdu given below:
> public static final String[] URDU_STOP_WORDS = { "پر", "کا", "کی", "کو" };

"new StandardAnalyzer(URDU_STOP_WORDS)" should work.

Regards, Doron
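To make the effect of such a constructor concrete, here is a stdlib-only sketch of what the stop-filtering stage of an analyzer does with that array. This is not Lucene's actual StopFilter code, just an illustration; the class name `UrduStopDemo` and the `filter` helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UrduStopDemo {
    // Subset of the Urdu stop words from the thread.
    static final Set<String> STOPS = new HashSet<>(Arrays.asList("پر", "کا", "کی", "کو"));

    // Mimics what a stop filter does: drop any token found in the stop set.
    static List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOPS.contains(t)) out.add(t);
        }
        return out;
    }
}
```

The key point is that membership is an exact string comparison, so the stop words in the array must match the indexed tokens character for character.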
Re: Index lucene database details.
I would start at the Lucene Java home page (http://lucene.apache.org/java) and dig in from there. There are a number of good docs on scoring and the IR model used (Boolean plus vector). From there, I would dig into the javadocs and whip up some example code that indexes a set of tokens and documents with a controlled vocabulary. Then you can dig into the source itself, especially the new DocumentsWriter class. And, of course, along the way, please feel free to submit documentation patches!

Also, this mailing list and the java-dev mailing list have a wealth of information about the internals of Lucene, so please dig through the archives and ask questions here as well.

-Grant

On Dec 22, 2007, at 9:10 PM, Berlin Brown wrote:
> Do you guys have article links or other documents that describe the Lucene database, e.g. what is it composed of?

--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: optimize Index problem
Great, I think. Except now I am really interested in the exception and what settings you had for heap size, Lucene version, etc.

On Dec 23, 2007, at 11:03 PM, Zhou Qi wrote:
> Hi Grant, after I adjusted the mergeFactor of the IndexWriter from 1000 to 100, it worked. Thank you.
RE: Pagination ...
Any advice on this? Thanks.

> From: [EMAIL PROTECTED]
> To: java-user@lucene.apache.org
> Subject: Pagination ...
> Date: Sat, 22 Dec 2007 10:19:30 -0500
>
> Hi,
>
> What is the most efficient way to do pagination in Lucene? I have always done the following because this "flavor" of the search call allows me to specify the top N hits (e.g. 1000) and a Sort object:
>
> TopFieldDocs topFieldDocs = searcher.search(query, null, 1000, SORT_BY_DATE);
>
> Is it the best way? Thank you.
Re: Pagination ...
Using the search function for pagination will carry out unnecessary index searches when you go to the previous or next page. Generally, most of the information need (e.g. 80%) can be satisfied by the first 100 documents (20%). In Lucene, the number of returned documents is set to 100 for the sake of speed.

I am not quite sure my way of pagination is best, but it works fine under test pressure: just keep the first search result in a cache and fetch the snippet when the document is presented on the current page.

2007/12/26, Dragon Fly wrote:
> Any advice on this? Thanks.
>
> > What is the most efficient way to do pagination in Lucene? I have always done the following because this "flavor" of the search call allows me to specify the top N hits (e.g. 1000) and a Sort object:
> >
> > TopFieldDocs topFieldDocs = searcher.search(query, null, 1000, SORT_BY_DATE);
> >
> > Is it the best way? Thank you.
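The caching approach described above can be sketched without any Lucene dependency: run one search for the top N hits, keep the ordered result in memory, and serve each page as a slice of that cached list. The `PageCache` class below is a hypothetical illustration of the slicing step only; in a real application the cached values would be Lucene doc ids from a single `searcher.search(...)` call.

```java
import java.util.ArrayList;
import java.util.List;

public class PageCache {
    // Top-N doc ids from one search, kept in ranked order.
    private final List<Integer> cachedDocIds;

    PageCache(List<Integer> topDocs) {
        this.cachedDocIds = topDocs;
    }

    // Return the slice for a 0-based page of the given size, without re-searching.
    List<Integer> page(int page, int size) {
        int from = page * size;
        int to = Math.min(from + size, cachedDocIds.size());
        if (from >= to) return new ArrayList<>();
        return new ArrayList<>(cachedDocIds.subList(from, to));
    }
}
```

The trade-off is memory and staleness: the cached list does not reflect index updates made after the original search, so it should be scoped to one user session or expire quickly.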
Re: Index lucene database details.
Hi Grant, the exception is thrown from a Java native method: "Failed to merge indexes, java.lang.OutOfMemoryError: Java heap space". (I have set -Xmx1024m in the JVM.) I guess it is similar to the problem that appeared in a previous thread (http://www.nabble.com/Index-merge-and-java-heap-space-tt505274.html#a505274), but I don't know the detailed reason. Does anyone have an answer?

2007/12/26, Grant Ingersoll wrote:
> I would start at the Lucene Java home page (http://lucene.apache.org/java) and dig in from there. There are a number of good docs on scoring and the IR model used (Boolean plus vector). [...]
Re: Pagination ...
You might want to take a look at Solr (http://lucene.apache.org/solr/). You could either use Solr directly, or see how they implement paging.

--Mike

On Dec 26, 2007 12:12 PM, Zhou Qi wrote:
> Using the search function for pagination will carry out unnecessary index searches when you go to the previous or next page. [...]
>
> I am not quite sure my way of pagination is best, but it works fine under test pressure: just keep the first search result in a cache and fetch the snippet when the document is presented on the current page.
Analyzer choices for indexing and searching multiple languages
I'm working on a project where we will be searching across several languages with a single query. There will be different categories, each covering a different group of languages to search (i.e. category "a": English, French, Spanish; category "b": Spanish, Portuguese, Italian, etc.). Originally I began indexing each language using a language-specific Analyzer, but I'm not sure how to handle the QueryParser at search time, i.e. which Analyzer to pass to it.

Does anyone have experience indexing all the languages using the StandardAnalyzer? Right now we only have European languages to index, so I'm wondering if anyone has had experience using the StandardAnalyzer to index European languages, and then using QueryParser with the StandardAnalyzer at search time. Or would it be better to analyze each language at index time using a language-specific Analyzer, and still use the QueryParser with the StandardAnalyzer at search time?

I've considered building a BooleanQuery out of QueryParsers, each built with a language-specific Analyzer, but that seems bound to be very slow. Any opinions or thoughts appreciated.

-Jay
StopWords problem
Hi Doron, thanks for your reply, but I am facing a small problem here. As I am using Notepad for coding, in which format should the file be saved?

public static final String[] URDU_STOP_WORDS = { "کے" ,"کی" ,"سے" ,"کا" ,"کو" ,"ہے" };
Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);

If I save it in ANSI format it loses the contents. I tried Unicode but it does not work, and I also tried UTF-8, but that generates two errors about illegal characters. What should the solution be? Kindly guide me on this. Thanks.
Re: StopWords problem
"javac" has an option, -encoding, which tells the compiler the encoding the input source file is using; this will probably solve the problem. Or you can use the Unicode escape form \uXXXX, so you can save the file in ANSI, though it is hard for humans to read. Or use an IDE. Eclipse is a good choice: you can set the source file encoding and it will pass the right options to the compiler for you.

Regards.

Liaqat Ali wrote:
> Hi Doron, thanks for your reply, but I am facing a small problem here. As I am using Notepad for coding, in which format should the file be saved? If I save it in ANSI format it loses the contents. I tried Unicode but it does not work, and I also tried UTF-8, but that generates two errors about illegal characters.
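The \uXXXX escape route mentioned above works because the escapes are plain ASCII in the source file, so the file's encoding no longer matters. As a small sketch (class name `EscapeDemo` is hypothetical), the Urdu word "پر" can be written entirely in escapes:

```java
public class EscapeDemo {
    // "\u067E\u0631" spells the Urdu word "پر" using only ASCII characters,
    // so this line compiles regardless of the source file's encoding:
    // U+067E is ARABIC LETTER PEH and U+0631 is ARABIC LETTER REH.
    static final String PAR = "\u067E\u0631";
}
```

A whole stop-word array can be written the same way, at the cost of readability, which is exactly the trade-off noted above.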
Re: StopWords problem
李晓峰 wrote:
> "javac" has an option, -encoding, which tells the compiler the encoding the input source file is using; this will probably solve the problem. [...]

Hi, thanks a lot for your suggestion. Using javac -encoding UTF-8 still raises the following error:

urduIndexer.java : illegal character: \65279
?
^
1 error

What am I doing wrong?
Re: StopWords problem
It's Notepad. It adds a byte-order mark (BOM, in this case 65279, or 0xFEFF) at the front of your file, which javac does not recognize for reasons not quite clear to me. Here is the bug: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058 . It won't be fixed, so try to eliminate the BOM before compiling your code.

Liaqat Ali wrote:
> Using javac -encoding UTF-8 still raises the following error:
>
> urduIndexer.java : illegal character: \65279
> 1 error
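To make the connection between the two numbers explicit: 65279 in the javac error is exactly the decimal value of the U+FEFF byte-order mark Notepad writes. Stripping it is a one-character check; the class and method names below (`BomStrip`, `strip`) are hypothetical, shown only to illustrate the fix.

```java
public class BomStrip {
    // U+FEFF is the character Notepad prepends to UTF-8 files;
    // javac reports it in decimal as \65279.
    static final char BOM = '\uFEFF';

    // Remove a leading BOM, if present, from file contents read as a String.
    static String strip(String s) {
        return (!s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }
}
```

Running a file's contents through such a strip step (or saving from an editor that omits the BOM) before invoking javac avoids the "illegal character" error.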
Re: StopWords problem
Or you can save it as "Unicode" and run javac -encoding Unicode; this way you can still use Notepad.

Liaqat Ali wrote:
> Using javac -encoding UTF-8 still raises the following error:
>
> urduIndexer.java : illegal character: \65279
> 1 error
Re: StopWords problem
On Dec 26, 2007 10:33 PM, Liaqat Ali wrote:
> Using javac -encoding UTF-8 still raises the following error:
>
> urduIndexer.java : illegal character: \65279
> 1 error
>
> What am I doing wrong?

If you have the stop words in a file, say one word per line, they can be read like this:

BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream("Urdu.txt"), "UTF8"));
String word = r.readLine(); // loop this line, you get the picture

(Make sure to specify encoding "UTF8" when saving the file from Notepad.)

Regards, Doron
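Filling in the loop that the snippet above leaves as an exercise, a complete version might look like this. The class name `StopWordFileReader` and the trimming/blank-line handling are additions for illustration, not part of the original suggestion; the returned array is in the shape StandardAnalyzer's stop-word constructor expects.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class StopWordFileReader {
    // Read one stop word per line from a UTF-8 text file,
    // skipping blank lines and trimming surrounding whitespace.
    static String[] read(String path) throws IOException {
        List<String> words = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            String word;
            while ((word = r.readLine()) != null) {
                word = word.trim();
                if (!word.isEmpty()) words.add(word);
            }
        }
        return words.toArray(new String[0]);
    }
}
```

Note that `StandardCharsets.UTF_8` decodes the bytes but does not remove a leading BOM, so a file saved from Notepad may still need its first character checked.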
Re: StopWords problem
Doron Cohen wrote:
> If you have the stop words in a file, say one word per line, they can be read like this: [...]
> (Make sure to specify encoding "UTF8" when saving the file from Notepad.)

Hi Doron, the compilation problem is solved, but there is no change in the index.

public static final String[] URDU_STOP_WORDS = { "کی" ,"کا" ,"کو" ,"ہے" ,"کے" ,"نے" ,"پر" ,"اور" ,"سے","میں" ,"بھی" ,"ان" ,"ایک" ,"تھا" ,"تھی" ,"کیا" ,"ہیں" ,"کر" ,"وہ" ,"جس" ,"نہں" ,"تک" };
Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);

These words still appear in the index with high ranks.

Regards, Liaqat
Re: StopWords problem
Are you altering (stemming) the tokens before they get to the StopFilter?

On Dec 26, 2007, at 5:08 PM, Liaqat Ali wrote:
> The compilation problem is solved, but there is no change in the index. [...] These words still appear in the index with high ranks.

-Grant
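The question above matters because stop filtering is an exact match against the token as it arrives at the filter: if an earlier stage has already changed the token, the stop set no longer matches. A stdlib-only toy (the `stem` rule here is invented purely for illustration, not a real Urdu or English stemmer) shows the order dependence:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class OrderDemo {
    static final Set<String> STOPS = new HashSet<>(Arrays.asList("running"));

    // Toy "stemmer": strips a trailing "ning" (hypothetical, for illustration only).
    static String stem(String t) {
        return t.endsWith("ning") ? t.substring(0, t.length() - 4) : t;
    }

    // Stemming first changes "running" to "run", so the stop set no longer matches
    // and the token survives filtering.
    static boolean survivesStemThenStop(String t) {
        return !STOPS.contains(stem(t));
    }

    // Stop filtering first catches "running" before it is altered.
    static boolean survivesStopThenStem(String t) {
        return !STOPS.contains(t);
    }
}
```

The same mismatch can come from normalization (case folding, diacritic removal, Unicode form differences), not just stemming, which is one reason stop words can "leak" into an index even when the list looks correct.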
Re: StopWords problem
Grant Ingersoll wrote:
> Are you altering (stemming) the tokens before they get to the StopFilter?

No, at this level I am not using any stemming technique. I am just trying to eliminate stop words.
Re: StopWords problem
On Dec 26, 2007, at 5:24 PM, Liaqat Ali wrote:
> No, at this level I am not using any stemming technique. I am just trying to eliminate stop words.

Can you share your analyzer code?

-Grant
Re: StopWords problem
Grant Ingersoll wrote:
> Can you share your analyzer code?

Hi Grant, I think I did not make myself clear. I am trying to pass a list of Urdu stop words as an argument to the StandardAnalyzer, but it does not work well for me:

public static final String[] URDU_STOP_WORDS = { "کی" ,"کا" ,"کو" ,"ہے" ,"کے" ,"نے" ,"پر" ,"اور" ,"سے","میں" ,"بھی" ,"ان" ,"ایک" ,"تھا" ,"تھی" ,"کیا" ,"ہیں" ,"کر" ,"وہ" ,"جس" ,"نہں" ,"تک" };
Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);

Kindly give some guidelines.

Regards, Liaqat