Re: result.jsp in the webdemo

2003-07-17 Thread di99mwo
I chose another variable name, and then it worked, but why?

/Michelle


Quoting [EMAIL PROTECTED]:

 Hello.
 
 I'm trying to modify the result.jsp file in the Lucene web demo. I can create a
 for loop but can't declare any variables anywhere in the JSP file.
 
 I get the following errors:
 
 Generated servlet error:
 C:\jakarta-tomcat-4.0.6\work\Standalone\localhost\luceneweb\results$jsp.java:250: Invalid expression statement.
   int interface;
   ^
 
 An error occurred between lines: 104 and 120 in the jsp file: /results.jsp
 
 Generated servlet error:
 C:\jakarta-tomcat-4.0.6\work\Standalone\localhost\luceneweb\results$jsp.java:250: ';' expected.
   int interface;
   ^
 
 An error occurred between lines: 104 and 120 in the jsp file: /results.jsp
 
 Generated servlet error:
 C:\jakarta-tomcat-4.0.6\work\Standalone\localhost\luceneweb\results$jsp.java:250: '}' expected.
   int interface;
   ^
 
 An error occurred between lines: 104 and 120 in the jsp file: /results.jsp
 
 Generated servlet error:
 C:\jakarta-tomcat-4.0.6\work\Standalone\localhost\luceneweb\results$jsp.java:250: Identifier expected.
   int interface;
   ^
 
 4 errors, 1 warning
 
 
 
 thanks.
 
 /Michelle 
 
 
 
   
   
 
 
 
 







Re: extracting keywords

2003-07-17 Thread Eric Jain
 I would like to know if and how can i extract the keywords list from
 an indexed document.

I believe this is not directly possible. You can build such a list by
iterating over all terms in the index and checking, for each term, whether
the document you are interested in appears in the list of documents that
contain that term - not very efficient.
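A minimal sketch of that brute-force loop, against the Lucene 1.x API of the
time (the class and method names here are illustrative, and docNum is the
internal Lucene document number of the document you are interested in):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class DocumentTerms {

    /** Collects every term that occurs in the document with the given number. */
    public static List termsOfDocument(IndexReader reader, int docNum) throws IOException {
        List result = new ArrayList();
        TermEnum terms = reader.terms();                // all terms in the index
        try {
            while (terms.next()) {
                Term term = terms.term();
                TermDocs docs = reader.termDocs(term);  // documents containing this term
                try {
                    while (docs.next()) {
                        if (docs.doc() == docNum) {     // our document is in the postings
                            result.add(term);
                            break;
                        }
                    }
                } finally {
                    docs.close();
                }
            }
        } finally {
            terms.close();
        }
        return result;
    }
}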

--
Eric Jain





RE: result.jsp in the webdemo

2003-07-17 Thread Jimmy Van Broeck
Because interface is a reserved keyword in Java. You can't use a variable
named interface, just like class, implements, and the other reserved words.
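
For example, a JSP scriptlet like the following (a hypothetical snippet, not
the actual results.jsp code) fails with exactly the errors quoted above, while
any non-reserved identifier compiles fine:

<%
    // int interface = 0;   // will not compile: "interface" is a reserved word
    int hitCount = 0;        // any non-reserved identifier works
%>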

Greetings
Jimmy


Jimmy Van Broeck
Syntegra, creating winners in the digital economy
+32 2 247 92 20 - check us out at http://www.syntegra.be/







interesting phrase query issue

2003-07-17 Thread greg
I have several document sections that are being indexed via the
StandardAnalyzer.  One of these documents has the line "access, the
manager".  When searching for the phrase "access manager", this document is
being returned.  I understand why (at least I think I do): "the" is a stop
word and the "," is being removed by the tokenizer.  My question is: is
there any way I can avoid having this document returned in the results?  My
thoughts were to create a new analyzer that indexes the word "the" (blick,
too many of those), or to index the "," in some way (also not good).  Any
suggestions?

Thanks, 

Greg T Robertson



Re: interesting phrase query issue

2003-07-17 Thread Tatu Saloranta
On Thursday 17 July 2003 07:20, greg wrote:
 I have several document sections that are being indexed via the
 StandardAnalyzer.  One of these documents has the line "access, the
 manager".  When searching for the phrase "access manager", this document is
 being returned.  I understand why (at least I think I do): "the" is a stop
 word and the "," is being removed by the tokenizer.  My question is: is
 there any way I can avoid having this document returned in the results?  My
 thoughts were to create a new analyzer that indexes the word "the" (blick,
 too many of those), or to index the "," in some way (also not good).  Any
 suggestions?

You can also replace all stop words with a dummy token. That would be similar
to indexing "the" (which is probably a better idea than indexing ",").
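
A minimal sketch of that idea, written against the Lucene 1.x analysis API used
in this thread (the constructor sets the protected input field directly;
depending on the exact Lucene version it may need to call super(in) instead).
The class name, the "_" placeholder text and the way the stop words are passed
in are all illustrative choices, not anything from the original posts:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class StopReplacementFilter extends TokenFilter {

    private Set stopWords = new HashSet();

    public StopReplacementFilter(TokenStream in, String[] words) {
        input = in;                       // the TokenStream this filter wraps
        for (int i = 0; i < words.length; i++) {
            stopWords.add(words[i]);
        }
    }

    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) {
            return null;                  // end of the token stream
        }
        if (stopWords.contains(t.termText())) {
            // Emit a placeholder instead of dropping the token, so the stop word
            // still occupies a position and a phrase query like "access manager"
            // no longer matches "access, the manager".
            return new Token("_", t.startOffset(), t.endOffset());
        }
        return t;
    }
}

An analyzer would then use this filter in its tokenStream() method in place of
the usual StopFilter, and the same analyzer has to be used for both indexing
and query parsing.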

I'm planning to do something similar for paragraph breaks (for plain text, a
double linefeed; for HTML, the <p> tag, etc.), to prevent similar problems.

-+ Tatu +-






RAM index usage

2003-07-17 Thread Gregg Cote
Hi There,

I want to create an in-memory index.  I want to be able to search the index
and delete documents from this index.  I believe I am creating the index
correctly:

 this._writer = new IndexWriter(new org.apache.lucene.store.RAMDirectory(),
   new StandardAnalyzer(), true);

And I can add docs to the index.

What I am not sure of is how I create a reader, and how to delete from this
index.  The reader I create for file indexes uses the following:

_reader = IndexReader.open(folder);

But I am not sure how to open a reader that will use the index I created
above.

Any help is appreciated.

Thanks,
Gregg






Re: RAM index usage

2003-07-17 Thread greg
What I am not sure of is how I create a reader, and how to delete from this
index.  The reader I create for file indexes uses the following: 

_reader = IndexReader.open(folder); 

According to the javadocs I have, IndexReader has an open method which takes
a Directory.  Use it instead of the IndexReader.open that takes a String, at
least for opening it.
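
A minimal sketch of that pattern against the Lucene 1.x API in use here (the
field name "id", the sample values and the class name are illustrative, not
taken from the original code): keep a reference to the RAMDirectory so the
writer and the reader share the same in-memory index.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class RamIndexExample {

    public static void main(String[] args) throws Exception {
        // Keep the RAMDirectory in a variable (or field) so it can be reused.
        Directory dir = new RAMDirectory();

        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Keyword("id", "42"));
        doc.add(Field.Text("body", "some text to index"));
        writer.addDocument(doc);
        writer.close();                         // close the writer before reading

        // Open the reader on the same Directory instead of a path on disk.
        IndexReader reader = IndexReader.open(dir);
        reader.delete(new Term("id", "42"));    // delete the document by term
        reader.close();
    }
}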

g



Re: interesting phrase query issue

2003-07-17 Thread Victor Hadianto
 One of these documents has the line "access, the manager".  When searching
 for the phrase "access manager", this document is being returned.  I
 understand why (at least I think I do): "the" is a stop word and the "," is
 being removed by the tokenizer.  My question is: is there any way I can
 avoid having this returned in the results?

I don't think you can without reindexing the documents and changing
QueryParser a bit. The reason is that even if you introduce a new
tokenizer/analyzer, the original documents have already been indexed with
those stop words removed.

You have to create an analyzer that doesn't drop your stop words and start
the reindexing again.

However, you must be careful when using your custom analyser to do the query
parsing, because sometimes you may want to drop the stop words in a
non-quoted query, so

hello and world  -->  +hello +world

but

"hello and world"  -->  "hello and world"  (the quoted phrase keeps the stop word)

One solution that I can think of is passing two analysers to QueryParser,
one being the standard analyser and the other the phrase query analyser.
Down in QueryParser.jj, around this area, do something like this:

  | term=<QUOTED>
    [ slop=<SLOP> ]
    [ <CARAT> boost=<NUMBER> ]
    {
      if (phraseAnalyzer != null) {
        // use the phrase query custom analyser that doesn't drop stop words
      } else {
        // otherwise use the normal analyzer
      }
    }

This may work; as a matter of fact, I think it should.

HTH

victor





Re: CJK support in lucene

2003-07-17 Thread Che Dong
I think Traditional Chinese as used in HK and TW is supported, since CJK
characters are identified by the character block CJK_UNIFIED_IDEOGRAPHS.

more:
http://sourceforge.net/projects/weblucene/

Che, Dong
- Original Message - 
From: Eric Isakson [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, July 17, 2003 2:04 AM
Subject: FW: CJK support in lucene




-Original Message-
From: Eric Isakson 
Sent: Wednesday, July 16, 2003 2:04 PM
To: 'Avnish Midha'
Subject: RE: CJK support in lucene


I'm no linguist, so the short answer is, I'm not sure about Taiwanese. If they share 
the same character sets and a bigram indexing approach makes sense for that language 
(read the links in the CJKTokenizer source), then it would probably work.

For Latin-1 languages, it will tokenize (it is set up to deal with mixed-language
documents where some of the text might be Chinese and some might be English), but it
will be far less efficient than the standard tokenizer supplied with the Lucene core.
But you should run your own tests to see if that would be livable.

Eric

-Original Message-
From: Avnish Midha [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 16, 2003 1:50 PM
To: Eric Isakson
Cc: Lucene Users List
Subject: RE: CJK support in lucene



Eric,

Does this tokenizer also support Taiwanese & European languages (Latin-1)?

Regards,
Avnish

-Original Message-
From: Eric Isakson [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2003 10:38 AM
To: Avnish Midha
Cc: Lucene Users List
Subject: RE: CJK support in lucene


This archived message has the CJKTokenizer code attached (there are some links in the 
code to material that describes the tokenization strategy).

http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]
e.orgmsgId=330905

You have to write your own analyzer that uses this tokenizer. See 
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html for some details on how to 
write an analyzer.

here is one you could use:

package mypackage;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKTokenizer;
import java.io.Reader;

public class CJKAnalyzer extends Analyzer {

    public CJKAnalyzer() {
    }

    /**
     * Creates a TokenStream which tokenizes all the text in the provided Reader.
     *
     * @return A TokenStream built from a CJKTokenizer
     */
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new CJKTokenizer(reader);
        // CJKTokenizer emits a "" (empty) token sometimes; I haven't been able to
        // figure out why, so filtering it out with a StopFilter is a workaround.
        result = new StopFilter(result, new String[] { "" });
        return result;
    }
}
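
A hypothetical usage sketch (the index path, field name and sample query are
illustrative, and it assumes the CJKAnalyzer above is in the same package or
imported): pass the analyzer to both IndexWriter and QueryParser so that
indexing and query parsing tokenize CJK text the same way.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class CJKAnalyzerUsage {

    public static void main(String[] args) throws Exception {
        // Index a document with the CJK-aware analyzer.
        IndexWriter writer = new IndexWriter("/tmp/cjk-index", new CJKAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("contents", "some mixed Chinese and English text"));
        writer.addDocument(doc);
        writer.close();

        // Parse queries with the same analyzer so the terms line up.
        Query query = QueryParser.parse("some text", "contents", new CJKAnalyzer());
        System.out.println(query.toString("contents"));
    }
}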

Lastly, you have to package those things up and use them along with the core lucene 
code.

CC'ing this to Lucene User so everyone can benefit from these answers. Maybe a FAQ on
indexing CJK languages would be a good thing to add. The existing one
(http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q28)
is somewhat light on details (so is this answer, but it is a bit more direct about
dealing with CJK) and http://www.jguru.com/faq/view.jsp?EID=108 is useful to be aware
of too.

Good luck,
Eric

-Original Message-
From: Avnish Midha [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 16, 2003 1:06 PM
To: Eric Isakson
Subject: CJK support in lucene



Hi Eric,

I read the description of the bug (#18933) you reported on the Apache site. I had a
question related to this defect. In the description you mentioned that CJK support
should be included in the core build. Is there any other way we can enable CJK
support in the Lucene search engine? I would be grateful if you could let me know of
any such method of enabling CJK support in the search engine.

Eagerly waiting for your reply.

Thanks & Regards,
Avnish Midha
Phone no.: +1-949-8852540







multiple words indexing

2003-07-17 Thread Gourav Raj Budhia
Is there any way in Lucene to index multiple words as a single term?

For example: "Jakarta Lucene" appears together in my document and I want it
to be indexed as a single term, "Jakarta Lucene", and not as two separate
terms, "Jakarta" and "Lucene".

Thanks,
Gourav






Lucene's scoring algorithm

2003-07-17 Thread Chong-Ki Tsang
I am curious to know whether Lucene's scoring algorithm was updated in the
latest 1.3 version.

I found the following scoring algorithm in the Similarity class of the Java
API documentation. It is different from the one shown in the official FAQ.
Could you tell me which one is used in 1.3? If the algorithm was updated,
please send me the formula. I would appreciate that.

 

Thanks,

Chong-Ki

 

The score of query q for document d is defined in terms of these methods
as follows:

  score(q,d) = sum over t in q of:
               tf(t in d) * idf(t) * getBoost(t.field in d) * lengthNorm(t.field in d)
               * coord(q,d) * queryNorm(q)

(tf, idf, getBoost, lengthNorm, coord and queryNorm are all documented at
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html)

 

 

In the official FAQ, Lucene's scoring algorithm is shown as follows:

31. How does Lucene assign scores to hits?

Here is a quote from Doug himself (posted in July 2001 to the Lucene users
mailing list):

For the record, Lucene's scoring algorithm is, roughly:

  score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)

where:

  score_d   : score for document d
  sum_t     : sum for all terms t
  tf_q      : the square root of the frequency of t in the query
  tf_d      : the square root of the frequency of t in d
  idf_t     : log(numDocs/docFreq_t+1) + 1.0
  numDocs   : number of documents in the index
  docFreq_t : number of documents containing t
  norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
  norm_d_t  : square root of the number of tokens in d in the same field as t

(I hope that's right!)

[Doug later added...]

Make that:

  score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t)
            * coord_q_d

where

  boost_t   : the user-specified boost for term t
  coord_q_d : number of terms in both query and document / number of terms in query

The coordination factor gives an AND-like boost to documents that contain,
e.g., all three terms in a three-word query over those that contain just two
of the words.
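
As a quick worked example of the coordination factor alone (using the numbers
from the sentence above): for a three-word query, a document containing all
three terms gets coord_q_d = 3/3 = 1.0, while a document containing only two of
them gets coord_q_d = 2/3, so, with all other factors equal, the first document
scores 1.5 times higher than the second.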

 

http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q31