RE: Dates and others

2003-11-26 Thread Dion Almaer
Hi guys -

So I am getting happier with search, and just pushed the lucene version live at:

http://www.theserverside.com (on the leftbar) and:
http://www.theserverside.com/home/search/index.jsp

The only real item that I still want to tweak more is getting recent results higher in 
the list.

I was wondering if something like this could work (or if there is a better
solution):

At index time, I have the date of the content.  I could do some math where
the higher the date (based on the time_t version or whatever), the bigger
the setBoost(metric).  Or, for every month in the past, apply a larger
negative number to setBoost()... or something like that.

Would something like this make sense?
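
Concretely, I'm picturing something like this at index time (just a sketch -
Field.setBoost is the call I mean, and the per-month decay factor is a
number I pulled out of the air to tune later):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DateBoostedDoc {
  // Boost the searchable content by recency: ~1.0 for fresh content,
  // decaying as it ages.
  static Document make(String contents, long publishedSecs) {
    long nowSecs = System.currentTimeMillis() / 1000L;        // time_t-style
    float ageMonths = Math.max(0f,
        (nowSecs - publishedSecs) / (30f * 24 * 60 * 60));
    Field body = Field.Text("contents", contents);
    body.setBoost(1.0f / (1.0f + 0.1f * ageMonths));          // decay per month
    Document doc = new Document();
    doc.add(body);
    return doc;
  }
}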

Dion


> -Original Message-
> From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
> Sent: Sunday, November 23, 2003 3:52 PM
> To: Lucene Users List
> Subject: Re: Dates and others
> 
> On Saturday, November 22, 2003, at 06:33  PM, Dion Almaer wrote:
> > 3. I have some fields such as title, owner, etc as well as the content
> > blob which I index and use as the default search field.  Is there an
> > easy way to extend the QueryParser to merge it with a MultiTermQuery
> > which can also search this meta data and give them certain weights?
> > Or, if you go down this path do you have to leave the QueryParser
> > behind and build your own queries?  Any best practices would be great.
> 
> And Ype said:
> You can provide field weights at document indexing time 
> (norms) and use a MultiTermQuery for searching multiple 
> fields. At query time you can again use field weights.
> I don't know how the scoring of the MultiTermQuery is done, 
> it might use the max. score over the fields of a document, or 
> combine the scores in the fields of a document.
>  end Ype's reply cut and paste
> 
> I'm a little confused with this question and Ype's reply.  
> MultiTermQuery is an abstract base class under Query, which 
> is the parent for WildcardQuery and FuzzyQuery.
> 
> What I think you're after is using MultiFieldQueryParser, but 
> you want to weight the fields differently.  You can add the 
> boosts at indexing time using Field.setBoost.  Unfortunately 
> at the moment MultiFieldQueryParser is not very extensible -
> there are some open issues with its subclassability - but once
> those are resolved, subclassing MFQP and overriding getFieldQuery
> will do the trick, allowing you to boost at query time.
> 
> Making an educated guess at what you're doing with Lucene, 
> Dion, I'd venture to say that boosting at indexing time is 
> sufficient for your needs.
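> 
> For reference, the query side can be as simple as this (a sketch - I'm
> going from memory on MultiFieldQueryParser's static parse() signature,
> so double-check it against your version):
> 
>   String[] fields = {"title", "owner", "contents"};
>   Query query = MultiFieldQueryParser.parse("your query", fields,
>       new StandardAnalyzer());
> 
> with the index-time Field.setBoost calls doing the field weighting.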
> 
>   Erik



Re: log4j.properties

2003-11-26 Thread Victor Hadianto

> java -Dlog4j.configuration=log4j.xml org.pdfbox.searchengine.lucene.IndexFiles
> -create -index c:\\index ..

Hmm, try creating log4j.xml instead of log4j.properties, since log4j.xml is
the file your command-line parameter points to.
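
(If you would rather keep log4j.properties, note that the config quoted
below never attaches an appender to the root logger - rootCategory needs
a level plus an appender name. Something like this should silence the
warning; untested, log4j 1.2 syntax:

log4j.rootCategory=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %c - %m%n

Also make sure the file is on the classpath, or point
-Dlog4j.configuration at it as a URL.)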



- Original Message - 
From: "Tun Lin" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Thursday, November 27, 2003 2:05 AM
Subject: RE: log4j.properties


> I have integrated Lucene and PDFBox and tried the following command to
> index files
>
> java -Dlog4j.configuration=log4j.xml org.pdfbox.searchengine.lucene.IndexFiles
> -create -index c:\\index ..
>
> But I have the following error message:
> log4j:WARN No appenders could be found for logger
> (org.pdfbox.pdfparser.PDFParser).
> log4j:WARN Please initialize the log4j system properly.
>
> Anyone can help?
>
> -Original Message-
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, November 26, 2003 5:19 PM
> To: Lucene Users List
> Subject: Re: log4j.properties
>
> What does this have to do with Lucene?
>
>
> On Wednesday, November 26, 2003, at 01:04  AM, Tun Lin wrote:
>
> > I have created the following "log4j.properties" and put it in my
> > classpath, but it still has that error. Anyone can help?
> >
> > log4j.rootCategory=stdout
> >
> > log4j.appender.stdout=org.apache.log4j.ConsoleAppender
> > log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
> > log4j.appender.stdout.layout.ConversionPattern=%d %c - %m%n
> >



Re: Collaborative Filtering API

2003-11-26 Thread Steven J. Owens
On Tue, Nov 25, 2003 at 01:18:19PM -0500, Michael Giles wrote:
> Yes, he was the lead Ph.D. student on the GroupLens project at Minnesota.

 I've actually worked on a system that bundled GroupLens.  I think
it was Vignette StoryServer.  The Vignette docs were incredibly dense
with MarketingNewSpeak, so I could never quite figure out what they
said GroupLens actually *did* (not at a web-capable terminal right
now, or I'd just google it).

 Collaborative filtering in general is a topic I'm interested in,
and is why I first got into Lucene.  I wanted and still want to build
a collaborative filtering search engine for mailing lists and the
like.

 I do remember that FireFly's engine was supposed to graph all of
the users' ratings on a topic in an N-dimensional space, then find
users "close" to a given user in that space, and suggest topics that
they'd liked but that the current user hadn't rated.

 I'm interested in more of a "free market" sort of approach than
in statistical analysis; I want to build a system that helps users
express their opinions, then nurture an emerging consensus.  My
experience has been that systems/technologies that try to
facilitate the way users already do things, instead of replacing them
with new ways of doing things, tend to work better.

-- 
Steven J. Owens
[EMAIL PROTECTED]

"I'm going to make broad, sweeping generalizations and strong,
 declarative statements, because otherwise I'll be here all night and
 this document will be four times longer and much less fun to read.
 Take it all with a grain of salt." - Me at http://darksleep.com





Re: Search Question - not returning desired results

2003-11-26 Thread Erik Hatcher
On Wednesday, November 26, 2003, at 11:08  AM, Pleasant, Tracy wrote:
But now i have another question.

Let's say I have 'return_results.pl' in the document in one of the
fields.
Actually there is a little bit more to it than understanding the 
analysis phase, and you were right in saying you need to understand '*' 
and '~' as well.

More below...

When I search for return_res* or return_res~ it won't return the
document.
this is searching for all terms that start with "return_res", and 
during analysis you split this into "return" and "res", so no terms 
match.  Same goes for the fuzzy query with ~.

But searching for any of these does return the document:
1. 'return_results'
2. 'results' or 'return'
3. 'results.pl'
4. 'results~'
5. 'return_results~'
in all of these cases, you're searching for terms that got split by the 
analyzer on indexing (and during QueryParser analysis for 
"return_results", "results.pl").

Tricky stuff, eh?
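
When in doubt, dump the tokens. The gist of the AnalysisDemo from my
article is just this (a minimal sketch):

import java.io.StringReader;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class ShowTokens {
  public static void main(String[] args) throws Exception {
    // Run the file name through SimpleAnalyzer and print each token
    TokenStream stream = new SimpleAnalyzer()
        .tokenStream("field", new StringReader("return_results.pl"));
    for (Token t = stream.next(); t != null; t = stream.next())
      System.out.println("[" + t.termText() + "]");  // [return] [results] [pl]
  }
}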

	Erik



Re: Search Question - not returning desired results

2003-11-26 Thread Erik Hatcher
On Wednesday, November 26, 2003, at 11:08  AM, Pleasant, Tracy wrote:
But now i have another question.

Let's say I have 'return_results.pl' in the document in one of the
fields.
When I search for return_res* or return_res~ it won't return the
document.
But searching for any of these does return the document:
1. 'return_results'
2. 'results' or 'return'
3. 'results.pl'
4. 'results~'
5. 'return_results~'
I guess I have to read more about the '*' and '~'?
What does my AnalysisDemo tell you about all of the text you're feeding 
in here?  :))  The answer lies therein!

	Erik



Re: Search Question - not returning desired results

2003-11-26 Thread Erik Hatcher
On Wednesday, November 26, 2003, at 11:33  AM, Pleasant, Tracy wrote:
Your website says:

org.apache.lucene.analysis.standard.StandardAnalyzer:
[xy&z] [corporation] [EMAIL PROTECTED] [com]
When I run it, it keeps the entire email '[EMAIL PROTECTED]',
but according to your website it separates the '[EMAIL PROTECTED]' from
the 'com'.
Is there a difference between the versions of Lucene? I'm using 1.3rc2.
Yes, I fixed the bug in the StandardTokenizer that caused e-mail 
addresses to get split, but fixed it after the article was written.  
Good eye!



RE: Eliminating duplicate result

2003-11-26 Thread Pleasant, Tracy
If you are searching for the same term and searching the same index twice,
it will return the same results...

I don't get what you are asking.


-Original Message-
From: Dragan Jotanovic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 26, 2003 3:19 AM
To: Lucene Users List
Subject: Re: Eliminating duplicate result


> When you are doing two searches are you searching for two different terms?
> 

No, I am searching for the same term.


What is the easiest way to eliminate duplicate documents if one is doing
two searches on the same index?

Has anybody done something similar?
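
If you really do need the two separate searches, one simple way to drop
the duplicates is to remember which documents you have already seen - a
rough sketch (assuming both searches run against the same IndexSearcher,
so the internal document numbers line up):

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class DedupSearch {
  // Collect documents from two searches, skipping ones already seen.
  public static ArrayList searchBoth(IndexSearcher searcher, Query q1,
                                     Query q2) throws IOException {
    HashSet seen = new HashSet();
    ArrayList docs = new ArrayList();
    Hits[] all = { searcher.search(q1), searcher.search(q2) };
    for (int h = 0; h < all.length; h++) {
      for (int i = 0; i < all[h].length(); i++) {
        Integer id = new Integer(all[h].id(i));  // internal doc number
        if (seen.add(id))                        // add() is false on a duplicate
          docs.add(all[h].doc(i));
      }
    }
    return docs;
  }
}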






RE: Search Question - not returning desired results

2003-11-26 Thread Pleasant, Tracy
Erik,

I think there may be a typo in the website.

When I run the AnalyzerDemo:

Analyzing "xy&z corporation - [EMAIL PROTECTED]"
org.apache.lucene.analysis.standard.StandardAnalyzer:
[xy&z] [corporation] [EMAIL PROTECTED] 

Your website says:

org.apache.lucene.analysis.standard.StandardAnalyzer:
[xy&z] [corporation] [EMAIL PROTECTED] [com] 

When I run it, it keeps the entire email '[EMAIL PROTECTED]',
but according to your website it separates the '[EMAIL PROTECTED]' from
the 'com'.

Is there a difference between the versions of Lucene? I'm using 1.3rc2.

Plus, I think what I want is a StandardAnalyzer with a little tweaking.
The simple one was fine until I realized that it doesn't do numbers,
which I need as part of my search, since numbers are important for what
I'm doing. The Standard does numbers, but I need it to be a little
different, of course. Thanks for the site.
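
Maybe all I need is a CharTokenizer that keeps letters and digits
together - something like this sketch (untested, and the character set
is just my guess at what my data needs):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

public class AlphanumericAnalyzer extends Analyzer {
  static class AlphanumericTokenizer extends CharTokenizer {
    public AlphanumericTokenizer(Reader in) {
      super(in);
    }
    // Keep letters and digits in one token; everything else splits.
    protected boolean isTokenChar(char c) {
      return Character.isLetterOrDigit(c);
    }
    protected char normalize(char c) {
      return Character.toLowerCase(c);
    }
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new AlphanumericTokenizer(reader);
  }
}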

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 26, 2003 4:58 AM
To: Lucene Users List
Subject: Re: Search Question - not returning desired results


[Erik's full reply - quoting my original question and pointing to the two
java.net articles - snipped; the same message appears in full later in
this digest]





RE: Search Question - not returning desired results

2003-11-26 Thread Pleasant, Tracy
It seems like what I should be using is something more like a
SimpleAnalyzer or StopAnalyzer.

I've changed my code and the query to use SimpleAnalyzer.

But now i have another question.

Let's say I have 'return_results.pl' in the document in one of the
fields. 

When I search for return_res* or return_res~ it won't return the
document.

But searching for any of these does return the document:
1. 'return_results'
2. 'results' or 'return'
3. 'results.pl'
4. 'results~'
5. 'return_results~'


I guess I have to read more about the '*' and '~'?

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 26, 2003 4:58 AM
To: Lucene Users List
Subject: Re: Search Question - not returning desired results


[Erik's full reply - quoting my original question and pointing to the two
java.net articles - snipped; the same message appears in full later in
this digest]





RE: Search Question - not returning desired results

2003-11-26 Thread Pleasant, Tracy
Thanks, this helps a lot :)

 



-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 26, 2003 4:58 AM
To: Lucene Users List
Subject: Re: Search Question - not returning desired results


[Erik's full reply - quoting my original question and pointing to the two
java.net articles - snipped; the same message appears in full later in
this digest]





RE: log4j.properties

2003-11-26 Thread Tun Lin
I have integrated Lucene and PDFBox and tried the following command to index
files 

java -Dlog4j.configuration=log4j.xml org.pdfbox.searchengine.lucene.IndexFiles
-create -index c:\\index .. 

But I have the following error message:
log4j:WARN No appenders could be found for logger
(org.pdfbox.pdfparser.PDFParser).
log4j:WARN Please initialize the log4j system properly.

Anyone can help?

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 26, 2003 5:19 PM
To: Lucene Users List
Subject: Re: log4j.properties

What does this have to do with Lucene?


On Wednesday, November 26, 2003, at 01:04  AM, Tun Lin wrote:

> I have created the following "log4j.properties" and put it in my
> classpath, but it still has that error. Anyone can help?
>
> log4j.rootCategory=stdout
>
> log4j.appender.stdout=org.apache.log4j.ConsoleAppender
> log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
> log4j.appender.stdout.layout.ConversionPattern=%d %c - %m%n
>





Re: log4j.properties

2003-11-26 Thread Stephane Vaucher
As I've said previously, it's a log4j problem and not a Lucene problem,
so you should post to the log4j list.

sv

On Wed, 26 Nov 2003, Tun Lin wrote:

> I have created the following "log4j.properties" and put it in my classpath, but
> it still has that error. Anyone can help?
> 
> log4j.rootCategory=stdout
> 
> log4j.appender.stdout=org.apache.log4j.ConsoleAppender
> log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
> log4j.appender.stdout.layout.ConversionPattern=%d %c - %m%n
> 
> 





RE: Tokenizing text custom way

2003-11-26 Thread MOYSE Gilles (Cetelem)
Do you want to define expressions, i.e. a set of terms that must be
interpreted as a whole?
For instance, when the Analyzer catches "time" followed by "out" it
returns "time_out"?


-Original Message-
From: Dragan Jotanovic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 26, 2003 12:12 PM
To: Lucene Users List
Subject: Re: Tokenizing text custom way


> You will need to write a custom analyzer.  Don't worry, though it's
> quite straightforward.  You will also need to write a Tokenizer, but
> Lucene helps you a lot here.

Wouldn't I achieve the same result if I index "time out" as "time_out",
using StandardAnalyzer? Later, if I search for "time out" (inside quotes)
I should get the proper result, but if I search for "time" I shouldn't
get a result. Is this right?






Re: Tokenizing text custom way

2003-11-26 Thread Erik Hatcher
On Wednesday, November 26, 2003, at 06:12  AM, Dragan Jotanovic wrote:
You will need to write a custom analyzer.  Don't worry, though it's
quite straightforward.  You will also need to write a Tokenizer, but
Lucene helps you a lot here.
Wouldn't I achieve the same result if I index "time out" as "time_out",
using StandardAnalyzer? Later, if I search for "time out" (inside quotes)
I should get the proper result, but if I search for "time" I shouldn't
get a result. Is this right?
I'm confused about what you are planning to do.  Are you going to replace 
all spaces with an underscore before handing it to the analyzer?  
StandardAnalyzer will still split at the underscores though.

If you have special tokenization needs, why try to hack it somehow 
rather than address it cleanly in the way Lucene was designed to work?

	Erik



Re: Chinese input.

2003-11-26 Thread Otis Gospodnetic
Maybe this will help?

  http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23545

Otis

--- Tun Lin <[EMAIL PROTECTED]> wrote:
> Hi,
> 
> May I know how do I analyse Chinese input from Chinese text in
> Lucene?
> 
> Do I use Analyser function in Lucene? If yes, how to go about using
> it?
> 





Re: Tokenizing text custom way

2003-11-26 Thread Dragan Jotanovic
> You will need to write a custom analyzer.  Don't worry, though it's
> quite straightforward.  You will also need to write a Tokenizer, but
> Lucene helps you a lot here.

Wouldn't I achieve the same result if I index "time out" as "time_out",
using StandardAnalyzer? Later, if I search for "time out" (inside quotes)
I should get the proper result, but if I search for "time" I shouldn't
get a result. Is this right?







Re: Search Question - not returning desired results

2003-11-26 Thread Erik Hatcher
On Tuesday, November 25, 2003, at 12:11  PM, Pleasant, Tracy wrote:
The documents I have indexed contain information regarding file names
also.

For instance 'return_results.pl' or something like that may be in the 
document fields.

I am not understanding Lucene's way of searching:

1. If I search for 'return_results', the search does not return 
anything
2. If I search for 'results' or 'return', the search does not return 
anything
3. If I search for 'results.pl', the search does return the document
containing 'return_results.pl'
4. If I search for 'results~', the search does return the document
containing 'return_results.pl'
5. If I search for 'return_results~', the search does not return 
anything

What is going on?

I want it to return the document in all of the situations.

I also don't want to have to use '~' all the time.
We sure do have a recurring theme lately :)  Analysis!

Please refer to my article at java.net:

	http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html

Look at the AnalysisDemo code.  Copy it over and try it out on the text 
you're using and the Analyzer you're using.  The bracketed text that 
comes out are the "tokens" that you can search on.  It is very very 
important to understand this process and to really know what terms come 
out of text you hand it - otherwise it is a mystery why some things can 
be found and some things cannot despite your expectations to the 
contrary.

A follow-up to the Analysis is querying - and QueryParser has its own
set of quirks and caveats related to how things are tokenized/analyzed.
 And, I've got just the follow-up article for you handy...

	http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html

If you digest both of these articles (analysis one first please) then I 
think a lot of questions that get asked on this list will be implicitly 
answered.  Understanding analysis is key.

	Erik



Re: Tokenizing text custom way

2003-11-26 Thread Erik Hatcher
On Tuesday, November 25, 2003, at 06:41  AM, Dragan Jotanovic wrote:
Hi. I need to tokenize text while indexing, but I don't want space to
be the delimiter. The delimiter should be my custom character (for
example, a comma). I understand that I would probably need to implement
my own analyzer, but could someone help me with where to start? Is there
any other way to do this without writing a custom analyzer?
You will need to write a custom analyzer.  Don't worry, though it's 
quite straightforward.  You will also need to write a Tokenizer, but 
Lucene helps you a lot here.  Lucene's LetterTokenizer is simply this:

public class LetterTokenizer extends CharTokenizer {
  /** Construct a new LetterTokenizer. */
  public LetterTokenizer(Reader in) {
    super(in);
  }

  /** Collects only characters which satisfy
   * {@link Character#isLetter(char)}. */
  protected boolean isTokenChar(char c) {
    return Character.isLetter(c);
  }
}
You could change the isTokenChar method in your custom CommaTokenizer 
to only return true if the character is not a ','.  And you might want 
to implement the normalize method to lowercase (look at 
LowerCaseTokenizer).

My advice is for you to check out Lucene's source code in the 
TokenStream hierarchy (ctrl-H in IntelliJ is quite nice! :).  
CharTokenizer seems a good starting point for you.  Then have a look at 
SimpleAnalyzer:

public final class SimpleAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseTokenizer(reader);
  }
}
Just create your own CommaAnalyzer that uses your CommaTokenizer 
similar to this.  Have a look at my java.net article and try the sample 
code provided there to observe the analysis process in greater detail 
so you can check that you get what you expect.
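
Putting those pieces together, the whole thing could be as small as this
(a sketch - drop or change normalize() if you don't want lowercasing,
and note that spaces stay inside the tokens, so trim if needed):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

public class CommaAnalyzer extends Analyzer {
  // Tokenizer that treats everything except a comma as part of a token.
  static class CommaTokenizer extends CharTokenizer {
    public CommaTokenizer(Reader in) {
      super(in);
    }
    protected boolean isTokenChar(char c) {
      return c != ',';
    }
    // Lowercase each character, like LowerCaseTokenizer does.
    protected char normalize(char c) {
      return Character.toLowerCase(c);
    }
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new CommaTokenizer(reader);
  }
}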

and if I enter 'time' as a search word, I don't want to get "time out"
in results. I need exact keyword matching. I would achieve this if I
tokenize "time out" as one token while indexing.
It will be a little trickier on the query part if you're using 
QueryParser - you will need to double-quote "time out" for it to work, 
I believe - but don't worry about this until you get the analysis phase 
worked out and then we can revisit the QueryParser issue then.

	Erik



Re: Tokenizing text custom way

2003-11-26 Thread Erik Hatcher
woah that seems like an awfully complex answer to the question of 
how to tokenize at a comma rather than a space!  %-)

On Tuesday, November 25, 2003, at 11:48  AM, MOYSE Gilles (Cetelem) 
wrote:

Hi.

You should define expressions.
To define expressions, you first have to define an expression file.
An expression file contains one expression per line.
For instance:
	time_out
	expert_system
	...
You can use any character to specify the "expression link". Here, I use
the underscore (_).

Then, you have to build an expression loader. You can store expressions
in recursive HashMaps.
Such a HashMap must be built so that HashMap.get("word1") = HashMap, and
(HashMap.get("word1")).get("word2") = null, if you want to code the
expression "word1_word2".
In other words, 'HashMap.get("a_word")' returns a HashMap containing all
the successors of the word 'a_word'.

So, if your expression file looks like this:
time_out
expert_system
expert_in_information
you'll have to build a loader which returns a HashMap H so that:
H.keySet() = {"time", "expert"}
((HashMap)H.get("time")).keySet() = {"out"}
((HashMap)H.get("time")).get("out") = null // null indicates the end of the expression
((HashMap)H.get("expert")).keySet() = {"system", "in"}
((HashMap)H.get("expert")).get("system") = null
((HashMap)((HashMap)H.get("expert")).get("in")).keySet() = {"information"}
((HashMap)((HashMap)H.get("expert")).get("in")).get("information") = null
These recursive HashMaps code the following tree:
time - out - null
expert - system - null
       |- in - information - null
Such an expression loader may be designed this way:

// Needs java.io.* and java.util.*; FILE_COMMENT_CHARACTER is a String
// constant such as "#".
public static HashMap getExpressionMap(File wordfile) {
    HashMap result = new HashMap();
    try {
        String line = null;
        LineNumberReader in = new LineNumberReader(new FileReader(wordfile));
        HashMap hashToAdd = null;

        while ((line = in.readLine()) != null) {
            if (line.startsWith(FILE_COMMENT_CHARACTER))
                continue;
            if (line.trim().length() == 0)
                continue;
            StringTokenizer stok = new StringTokenizer(line, " \t_");
            String curTok = "";
            HashMap currentHash = result;

            // Test whether the expression contains at least 2 words
            if (stok.countTokens() < 2) {
                System.err.println("Warning : '" + line + "' in file '"
                    + wordfile.getAbsolutePath() + "' line "
                    + in.getLineNumber() + " is not an expression.\n\tA"
                    + " valid expression contains at least 2 words.");
                continue;
            }

            while (stok.hasMoreTokens()) {
                curTok = stok.nextToken();
                // if comment at the end of the line, break
                if (curTok.startsWith(FILE_COMMENT_CHARACTER))
                    break;
                if (stok.hasMoreTokens())
                    hashToAdd = new HashMap(6);
                else
                    hashToAdd = (HashMap) null;

                if (!(currentHash.containsKey(curTok)))
                    currentHash.put(curTok, hashToAdd);

                currentHash = (HashMap) currentHash.get(curTok);
            }
        }
        return result;
    }
    // On error, use an empty table
    catch (Exception e) {
        System.err.println("While processing '"
            + wordfile.getAbsolutePath() + "' : " + e.getMessage());
        e.printStackTrace();
        return new HashMap();
    }
}

Then, you must build a filter with 2 FIFO stacks: one is the expression
stack, the other is the default stack.
Then, you define a 'curMap' variable, initially pointing to the HashMap
returned by the ExpressionFileLoader.

When you receive a token, you check whether it is null or not.
	If it is, you check whether the default stack is empty or not.
		If it is not, you pop a token from the default stack and
return it.
		If it is, you return null.
	If it is not (the token is not null), you check whether it is
contained in the HashMap or not (curMap.containsKey(token)).
		If it is not contained and you were building an expression,
you pop all the terms in the expression stack and push them onto the
default stack (so as not to lose information).
		If it is not contained and the default stack is empty, you
return the token.
		If it is not contained and the default stack is not empty,
you return the popped token from the default stack and you push the
current token.
	If the token is contained in curMap, then the token MAY be the
first element of an expression.
		You push the token onto the expression stack, and you dive
into the next level of your expression tree (curMap = curMap.get(token)).
		If the next level (now curMap) is null, then you have
completed your expression. You can pop all the tokens from the expression
stack to concatenate them, separated by underscores, and push 

Re: log4j.properties

2003-11-26 Thread Erik Hatcher
What does this have to do with Lucene?

On Wednesday, November 26, 2003, at 01:04  AM, Tun Lin wrote:

I have created the following "log4j.properties" and put it in my
classpath, but it still has that error. Anyone can help?

log4j.rootCategory=stdout

log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %c - %m%n




Re: unexpected results from query

2003-11-26 Thread Erik Hatcher
On Tuesday, November 25, 2003, at 10:45  PM, marc wrote:
Hi,

assume a field has the following text

"Adenylate kinase (mitochondrial GTP:AMP phosphotransferase) "

the following searches all return this document

AMP
&
&
Can someone explain this to me? I figured that only the first query
would be successful.
This depends on the Analyzer you're using.  I'm assuming you're using 
the QueryParser and an analyzer that rips off special characters - so 
essentially the TermQuery underneath is always for AMP.

Have a look at my first java.net article which shows the analysis 
process.  Run your sample text through the code provided there to see 
the effect first-hand.

	Erik



Re: Eliminating duplicate result

2003-11-26 Thread Dragan Jotanovic
> When you are doing two searches are you searching for two different terms?
> 

No, I am searching for the same term.





