Re: result.jsp in the webdemo
I chose another variable name, and then it worked, but why?

/Michelle

Quoting [EMAIL PROTECTED]:

> Hello. I'm trying to modify the result.jsp file in the Lucene web demo. I can
> create a for loop, but I can't declare a variable anywhere in the JSP file.
> I get the following errors:
>
> Generated servlet error:
> C:\jakarta-tomcat-4.0.6\work\Standalone\localhost\luceneweb\results$jsp.java:250: Invalid expression statement.
>         int interface;
>             ^
> An error occurred between lines: 104 and 120 in the jsp file: /results.jsp
>
> Generated servlet error:
> C:\jakarta-tomcat-4.0.6\work\Standalone\localhost\luceneweb\results$jsp.java:250: ';' expected.
>         int interface;
>             ^
> An error occurred between lines: 104 and 120 in the jsp file: /results.jsp
>
> Generated servlet error:
> C:\jakarta-tomcat-4.0.6\work\Standalone\localhost\luceneweb\results$jsp.java:250: '}' expected.
>         int interface;
>             ^
> An error occurred between lines: 104 and 120 in the jsp file: /results.jsp
>
> Generated servlet error:
> C:\jakarta-tomcat-4.0.6\work\Standalone\localhost\luceneweb\results$jsp.java:250: Identifier expected.
>         int interface;
>             ^
>
> 4 errors, 1 warning
>
> thanks. /Michelle

- This mail sent through IMP: http://horde.org/imp/ -
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: extracting keywords
> I would like to know if and how I can extract the keyword list from an indexed document.

I believe this is not directly possible. You can create such a list by iterating over all terms in the index and checking, for each term, whether the document you are interested in is part of the list of all documents that contain the current term. Not very efficient.

--
Eric Jain
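A rough sketch of that brute-force loop, assuming the Lucene 1.3 IndexReader / TermEnum / TermDocs API (the class name, the indexPath argument, and the docId parameter are my own illustrative choices):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class DocumentTerms {
    // Collect every term whose posting list contains document docId.
    public static List termsFor(String indexPath, int docId) throws Exception {
        IndexReader reader = IndexReader.open(indexPath);
        List found = new ArrayList();
        TermEnum terms = reader.terms();              // enumerates ALL terms in the index
        while (terms.next()) {
            TermDocs docs = reader.termDocs(terms.term());
            while (docs.next()) {                     // walk this term's posting list
                if (docs.doc() == docId) {            // is our document in it?
                    found.add(terms.term().text());
                    break;
                }
            }
            docs.close();
        }
        terms.close();
        reader.close();
        return found;
    }
}
```

As Eric says, this touches every posting in the index, so it is only workable for small indexes or offline analysis.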
RE: result.jsp in the webdemo
Because "interface" is a reserved keyword in Java. You can't use a variable named interface, just as you can't use class, implements, and so on.

Greetings
Jimmy

Jimmy Van Broeck
Syntegra, creating winners in the digital economy
+32 2 247 92 20 - check us out at www.syntegra.be (http://www.syntegra.be/)

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Thursday, 10 July 2003 12:26
To: Lucene Users List
Subject: Re: result.jsp in the webdemo

[quoted message trimmed]
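To illustrate the point outside a JSP: the commented-out line below is exactly the declaration the generated servlet choked on, and any non-reserved identifier compiles fine (hitIndex is just an arbitrary example name):

```java
public class ReservedWordDemo {
    public static void main(String[] args) {
        // int interface = 0;  // will not compile: "interface" is a Java reserved word
        int hitIndex = 0;      // renamed variable: any non-reserved identifier is fine
        for (int i = 0; i < 10; i++) {
            hitIndex += i;
        }
        System.out.println(hitIndex);
    }
}
```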
interesting phrase query issue
I have several document sections that are being indexed via the StandardAnalyzer. One of these documents contains the line "access, the manager". When searching for the phrase "access manager", this document is being returned. I understand why (at least I think I do): "the" is a stop word, and the "," is being removed by the tokenizer. My question is: is there any way I can avoid having this document returned in the results? My thoughts were to create a new analyzer that indexes the word "the" (blick, too many of those), or to index the "," in some way (also not good). Any suggestions?

Thanks,
Greg T Robertson
Re: interesting phrase query issue
On Thursday 17 July 2003 07:20, greg wrote:
> I have several document sections that are being indexed via the
> StandardAnalyzer. One of these documents contains the line "access, the
> manager". When searching for the phrase "access manager", this document is
> being returned. [...] Any suggestions?

You can also replace all stop words with a dummy token ( might be an ok candidate?). That would be similar to indexing "the" (which is probably a better idea than indexing ","). I'm planning to do something similar for paragraph breaks (in the case of plain text, a double linefeed; for HTML, <p> etc.), to prevent similar problems.

-+ Tatu +-
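A sketch of Tatu's suggestion: a filter that swaps each stop word for a placeholder instead of dropping it. This assumes the Lucene 1.3 analysis API (in particular that Token's (text, start, end) constructor is accessible in your version); the "_stop_" placeholder and the tiny stop list are my own choices:

```java
import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class StopMarkerAnalyzer extends Analyzer {
    private static final Set STOP = new HashSet();
    static { STOP.add("the"); STOP.add("a"); STOP.add("an"); }  // extend as needed

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new LowerCaseFilter(new StandardTokenizer(reader));
        return new TokenFilter(stream) {
            public Token next() throws IOException {
                Token t = input.next();
                if (t != null && STOP.contains(t.termText())) {
                    // keep the position occupied, but with a dummy token, so a
                    // phrase query no longer skips over the removed stop word
                    return new Token("_stop_", t.startOffset(), t.endOffset());
                }
                return t;
            }
        };
    }
}
```

The same analyzer must be used at query time, and existing documents must be reindexed so the positions line up.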
RAM index usage
Hi there,

I want to create an in-memory index, be able to search it, and delete documents from it. I believe I am creating the index correctly:

this._writer = new IndexWriter(new org.apache.lucene.store.RAMDirectory(), new StandardAnalyzer(), true);

and I can add docs to the index. What I am not sure of is how to create a reader, and how to delete from this index. The reader I create for file-based indexes uses the following:

_reader = IndexReader.open(folder);

But I am not sure how to open a reader that will use the index I created above. Any help is appreciated.

Thanks,
Gregg
Re: RAM index usage
> What I am not sure of is how to create a reader, and how to delete from this
> index. The reader I create for file indexes uses the following:
> _reader = IndexReader.open(folder);

According to the javadocs I have, IndexReader has an open method that takes a Directory. Use it instead of the IndexReader.open that takes a String - at least for opening it.

g
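Putting the two pieces together, a minimal sketch (assuming the Lucene 1.3 API) that keeps a reference to the RAMDirectory, instead of creating it inline, so the same in-memory index can be reopened for reading and deleting:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class RamIndexDemo {
    public static void main(String[] args) throws Exception {
        // Hold on to the Directory; an inline "new RAMDirectory()" passed to
        // the writer would leave no way to open a reader on the same index.
        RAMDirectory dir = new RAMDirectory();

        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("contents", "hello in-memory index"));
        writer.addDocument(doc);
        writer.close();  // close before opening a reader on the same directory

        IndexReader reader = IndexReader.open(dir);  // the Directory overload
        reader.delete(0);                            // delete by document number
        reader.close();
    }
}
```

There is also a delete(Term) overload on IndexReader for deleting every document that contains a given term, which is usually more convenient than tracking internal document numbers.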
Re: interesting phrase query issue
> One of these documents has the line "access, the manager". When searching for
> the phrase "access manager", this document is being returned. [...] is there
> any way I can avoid having this returned in the results?

I don't think you can without reindexing the documents and changing QueryParser a bit. The reason is that even if you introduce your new tokenizer/analyzer, the original documents were indexed with those stop words removed. You have to create an analyzer that doesn't drop your stop words and start the reindexing again.

However, you must be careful when using your custom analyzer to do the query parsing, because sometimes you may want to drop the stop words in a non-quoted query. So:

  hello and world       --> +hello +world

but:

  "hello and world"     --> "hello and world"

One solution I can think of is passing two analyzers to QueryParser: the standard analyzer, plus a separate analyzer for phrase queries. Down in QueryParser.jj, around this area, do something like:

  | term=QUOTED [ slop=SLOP ] [ CARAT boost=NUMBER ]
    {
      if (phraseAnalyzer != null) {
        // use the phrase-query analyzer that doesn't drop stop words
      } else {
        // otherwise use the normal analyzer
      }
    }

This may work; as a matter of fact I think it should.

HTH
victor
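The phrase-side analyzer could be, for example, a StandardAnalyzer lookalike with the StopFilter stage simply left out, so "the" and friends survive into the index and into phrase queries (a sketch against the Lucene 1.3 analysis API):

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Same pipeline as StandardAnalyzer, minus the stop-word removal step.
public class KeepStopWordsAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        return result;  // no StopFilter here
    }
}
```

As Victor notes, documents must be reindexed with this analyzer before phrase queries parsed with it can match correctly.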
Re: CJK support in lucene
I think Traditional Chinese as used in HK and TW is supported, since CJK characters are identified via the CJK_UNIFIED_IDEOGRAPHS character block. More: http://sourceforge.net/projects/weblucene/

Che, Dong

----- Original Message -----
From: Eric Isakson [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, July 17, 2003 2:04 AM
Subject: FW: CJK support in lucene

-----Original Message-----
From: Eric Isakson
Sent: Wednesday, July 16, 2003 2:04 PM
To: 'Avnish Midha'
Subject: RE: CJK support in lucene

I'm no linguist, so the short answer is: I'm not sure about Taiwanese. If they share the same character sets and a bigram indexing approach makes sense for that language (read the links in the CJKTokenizer source), then it would probably work. For Latin-1 languages, it will tokenize (it is set up to deal with mixed-language documents where some of the text might be Chinese and some might be English), but it will be far less efficient than the standard tokenizer supplied with the Lucene core. You should run your own tests to see whether that would be livable.

Eric

-----Original Message-----
From: Avnish Midha [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 16, 2003 1:50 PM
To: Eric Isakson
Cc: Lucene Users List
Subject: RE: CJK support in lucene

Eric,

Does this tokenizer also support Taiwanese and European languages (Latin-1)?

Regards,
Avnish

-----Original Message-----
From: Eric Isakson [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 16, 2003 10:38 AM
To: Avnish Midha
Cc: Lucene Users List
Subject: RE: CJK support in lucene

This archived message has the CJKTokenizer code attached (there are some links in the code to material that describes the tokenization strategy):

http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED] e.orgmsgId=330905

You have to write your own analyzer that uses this tokenizer. See http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html for some details on how to write an analyzer.
Here is one you could use:

package my.package;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKTokenizer;

import java.io.Reader;

public class CJKAnalyzer extends Analyzer {

    public CJKAnalyzer() {
    }

    /**
     * Creates a TokenStream which tokenizes all the text in the provided Reader.
     *
     * @return A TokenStream built from a CJKTokenizer
     */
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new CJKTokenizer(reader);
        // CJKTokenizer emits a sometimes, haven't been able to figure it
        // out, so this is a workaround
        result = new StopFilter(result, new String[] {});
        return result;
    }
}

Lastly, you have to package those things up and use them along with the core Lucene code. CC'ing this to Lucene User so everyone can benefit from these answers. Maybe a FAQ on indexing CJK languages would be a good thing to add. The existing one (http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q28) is somewhat light on details (so is this answer, but it is a bit more direct about dealing with CJK), and http://www.jguru.com/faq/view.jsp?EID=108 is useful to be aware of too.

Good luck,
Eric

-----Original Message-----
From: Avnish Midha [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 16, 2003 1:06 PM
To: Eric Isakson
Subject: CJK support in lucene

Hi Eric,

I read the description of the bug (#18933) you reported on the Apache site, and I have a question related to it. In the description you mention that CJK support should be included in the core build. Is there any other way we can enable CJK support in the Lucene search engine? I would be grateful if you could let me know of any such method of enabling CJK support in the search engine. Eagerly waiting for your reply.
Thanks Regards,
Avnish Midha
Phone no.: +1-949-8852540
multiple words indexing
Is there any way in Lucene that I can index multiple words as a single term? For example: "Jakarta Lucene" appears together in my document, and I want it to be indexed as the single term "Jakarta Lucene", not as the two separate terms "Jakarta" and "Lucene".

Thanks,
Gourav
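Two workarounds are commonly suggested, sketched below with the Lucene 1.3 Field factories. The field names, the "Jakarta_Lucene" gluing convention, and the analyzer caveat are illustrative assumptions, not the only way to do it:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class SingleTermDemo {
    public static Document build(String body) {
        Document doc = new Document();
        // 1) A Keyword field is indexed as ONE untokenized term, so the whole
        //    value "Jakarta Lucene" becomes a single term in the index.
        doc.add(Field.Keyword("project", "Jakarta Lucene"));
        // 2) For phrases inside free text, glue them together before analysis
        //    so the tokenizer cannot split them; the same rewrite must then be
        //    applied to queries, and the analyzer must not split on "_"
        //    (WhitespaceAnalyzer keeps it; StandardAnalyzer may not).
        doc.add(Field.Text("contents", body.replaceAll("Jakarta Lucene", "Jakarta_Lucene")));
        return doc;
    }
}
```

A third option, if the two-term index is acceptable, is simply to search with the phrase query "Jakarta Lucene" rather than forcing a single term.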
Lucene's scoring algorithm
I am curious to know whether Lucene's scoring algorithm was updated in the latest 1.3 version. I found the following scoring algorithm in the Similarity class of the Java API documentation. It is different from the one shown in the official FAQ. Could you tell me which one is being used in 1.3? If the algorithm was updated, please send me the formula. I would appreciate that.

Thanks,
Chong-Ki

The score of query q for document d is defined in terms of these methods as follows:

  score(q,d) = sum over t in q of:
      tf(t in d) * idf(t) * getBoost(t.field in d) * lengthNorm(t.field in d)
      * coord(q,d) * queryNorm(q)

(Each factor is a method documented at http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html, except getBoost, which is at http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html.)

For the official FAQ, Lucene's scoring algorithm is shown as:

31. How does Lucene assign scores to hits?
Here is a quote from Doug himself (posted in July 2001 to the Lucene users mailing list):

For the record, Lucene's scoring algorithm is, roughly:

  score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t )

where:
  score_d   : score for document d
  sum_t     : sum over all terms t
  tf_q      : the square root of the frequency of t in the query
  tf_d      : the square root of the frequency of t in d
  idf_t     : log(numDocs/(docFreq_t+1)) + 1.0
  numDocs   : number of documents in the index
  docFreq_t : number of documents containing t
  norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
  norm_d_t  : square root of the number of tokens in d in the same field as t

(I hope that's right!)

[Doug later added...]

Make that:

  score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t ) * coord_q_d

where:
  boost_t   : the user-specified boost for term t
  coord_q_d : number of terms in both query and document / number of terms in query

The coordination factor gives an AND-like boost to documents that contain, e.g., all three terms of a three-word query over those that contain just two of the words.

http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q31
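To make the arithmetic concrete, here is a small self-contained sketch (the numbers are my own toy example, not from the thread) that evaluates Doug's formula for a one-term query, where the tf_q * idf_t / norm_q factor cancels to 1:

```java
public class ScoreSketch {
    // idf_t = log(numDocs / (docFreq_t + 1)) + 1.0
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // Single-term case of Doug's formula:
    // score_d = tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t * coord_q_d
    static double score(int freqInQuery, int docFreq, int numDocs,
                        int freqInDoc, int docFieldLength,
                        double boost, double coord) {
        double tfQ = Math.sqrt(freqInQuery);   // sqrt of term frequency in the query
        double tfD = Math.sqrt(freqInDoc);     // sqrt of term frequency in the document
        double idfT = idf(numDocs, docFreq);
        double normQ = Math.sqrt((tfQ * idfT) * (tfQ * idfT));  // one-term query norm
        double normDT = Math.sqrt(docFieldLength);
        return tfQ * idfT / normQ * tfD * idfT / normDT * boost * coord;
    }

    public static void main(String[] args) {
        // toy index: 1000 docs, the term appears in 9 of them,
        // and 4 times in our document, whose field has 100 tokens
        System.out.println(score(1, 9, 1000, 4, 100, 1.0, 1.0));  // approximately 1.121
    }
}
```

With one query term, normQ equals tfQ * idfT, so the score reduces to tfD * idfT / normDT, which matches the tf-idf intuition: more occurrences and a rarer term push the score up, a longer field pushes it down.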