RE: Null or no analyzer

2004-10-20 Thread Morus Walter
Aviran writes:
 You can use WhitespaceAnalyzer
 
Can he? If "Elections 2004" is one token in the subject field (keyword), 
this will fail, since WhitespaceAnalyzer will tokenize it into `Elections' 
and `2004'.
So I guess he has to write an identity analyzer himself, unless there is
one provided (which doesn't seem to be the case).
The only alternatives are not using the query parser, or extending the query
parser with a keyword syntax, as far as I can see.
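A minimal sketch of what such an identity analyzer could look like against the Lucene 1.4-era analysis API (the class name and the buffering details are illustrative, not anything Lucene ships):

```java
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

/** Hypothetical pass-through analyzer: emits the entire field value
 *  as a single token, so "Elections 2004" stays one term. */
public class IdentityAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, final Reader reader) {
        return new TokenStream() {
            private boolean done = false;

            public Token next() throws IOException {
                if (done) {
                    return null;
                }
                done = true;
                // Read the whole field value and return it untouched.
                StringBuffer text = new StringBuffer();
                char[] buffer = new char[256];
                for (int n = reader.read(buffer); n != -1; n = reader.read(buffer)) {
                    text.append(buffer, 0, n);
                }
                return new Token(text.toString(), 0, text.length());
            }
        };
    }
}
```

Wrapped in a PerFieldAnalyzerWrapper for the subject field, this would let the query parser pass keyword fields through unanalyzed.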

Morus
 
 -Original Message-
 From: Rupinder Singh Mazara [mailto:[EMAIL PROTECTED] 
 Sent: Tuesday, October 19, 2004 11:23 AM
 To: Lucene Users List
 Subject: Null or no analyzer
 
 
 Hi All
 
   I have a question regarding selection of Analyzer's during query parsing
 
 
   i have three field in my index db_id, full_text, subject
   all three are indexed, however while indexing I specified to lucene to
 index db_id and subject but not tokenize them
 
   I want to give a single search box in my application to enable searching
 for documents
  some query can look like "motor cross rally"; this will get fed to
 QueryParser to do the relevant parsing
 
  however if the user enters  "John Kerry" subject:"Elections 2004"  I want to
 make sure that no analyzer is used for the subject field. How can that be
 done?
 
   this is because I expect the users to know the subject from a list of
 controlled vocabularies, and also I am searching for documents that have the
 exact subject. I tried using the PerFieldAnalyzerWrapper, but how do I get
 hold of an Analyzer that does nothing but pass the text through to the
 Searcher?
 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Range Query

2004-10-20 Thread Karthik N S
Hi

   Jonathan


   "When searching I also pad the query term" ???

    When exactly are you handling this  [ during the indexing process also, or
 only while searching ]

    Can you please be specific.

    [ if time permits and is possible, please can you send me the sample code
 for the same ]

    . :)


 Thx in advance


-Original Message-
From: Jonathan Hager [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 3:31 AM
To: Lucene Users List
Subject: Re: Range Query


That is exactly right.  It is searching the ASCII.  To solve it I pad
my price using a method like this:

  /**
   * Pads the Price so that all prices are the same number of characters and
   * can be compared lexicographically.
   * @param price
   * @return
   */
  public static String formatPriceAsString(Double price) {
if (price == null) {
  return null;
}
return PRICE_FORMATTER.format(price.doubleValue());
  }

where PRICE_FORMATTER contains enough digits for your largest number.

  private static final DecimalFormat PRICE_FORMATTER = new
DecimalFormat("000.00");

When searching I also pad the query term.  I looked into hooking into
QueryParser, but since the lower/upper prices for my application are
different inputs, I choose to handle them without hooking into the
QueryParser.

Jonathan
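A self-contained version of the padding idea (pinning the locale is an addition here, so the decimal separator is always '.'):

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class PricePad {
    // Wide enough for the largest expected price; Locale.US pins the
    // decimal separator to '.' so the output is stable across machines.
    private static final DecimalFormat PRICE_FORMATTER =
        new DecimalFormat("000.00", new DecimalFormatSymbols(Locale.US));

    /** Zero-pads a price so that lexicographic order matches numeric order. */
    public static String formatPriceAsString(Double price) {
        if (price == null) {
            return null;
        }
        return PRICE_FORMATTER.format(price.doubleValue());
    }
}
```

Unpadded, "9.5" sorts after "10.0" as a string; padded to "009.50" and "010.00", string order matches numeric order, which is what a lexicographic range query needs.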


On Tue, 19 Oct 2004 12:35:06 +0530, Karthik N S
[EMAIL PROTECTED] wrote:

 Hi

 Guys

 Apologies.

 I have a field of type Text, 'ItemPrice', using it to store a price
 factor in numeric form, such as 10, 25.25, 50.00

 If I am supposed to find the range between 2 prices

 ex -
  Contents:shoes +ItemPrice:[10.00 TO 50.60]

 I get results other than the range that was requested  [This may be
 due to the query comparing the ASCII values instead of the numeric values ]

 Am I missing something in the query syntax, or is this the wrong way to
 construct the query?

 Please Somebody Advise me ASAP.  :(

 Thx in advance

   WITH WARM REGARDS
   HAVE A NICE DAY
   [ N.S.KARTHIK]




RE: Downloading Full Copies of Web Pages

2004-10-20 Thread Karthik N S
Hi


Try

nutch   [ http://www.nutch.org/docs/en/about.html ]  underneath it uses
Lucene  :)





-Original Message-
From: Luciano Barbosa [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 3:06 AM
To: [EMAIL PROTECTED]
Subject: Downloading Full Copies of Web Pages


Hi folks,
I want to download full copies of web pages and store them locally, along
with the hyperlink structures, as local directories. I tried to use
Lucene, but I've realized that it doesn't have a crawler.
Does anyone know of software that does this?
Thanks,




Re: Downloading Full Copies of Web Pages

2004-10-20 Thread John Moylan
wget does this. Little point in reinventing the wheel.



TestRangeQuery.java

2004-10-20 Thread Karthik N S

Hi

Does anybody have trouble compiling  TestRangeQuery.java  in the Eclipse
3.0 IDE?

[
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/test/org/apache/lucene/
search ]

Seems there is an error:


doc.add(new Field("id", "id" + docCount, Field.Store.YES,
Field.Index.UN_TOKENIZED));
doc.add(new Field("content", content, Field.Store.NO,
Field.Index.TOKENIZED));



The compiler error is with Lucene 1.4.1 on Windows:
Field.Store.YES is not found
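The Field.Store / Field.Index style constructors belong to a later Lucene than 1.4.1, which would explain why they don't compile there. A hedged guess at the 1.4.x-era equivalents of those two lines, using the static factory methods that release does have:

```java
// Lucene 1.4.x factory methods (no Field.Store / Field.Index yet):
doc.add(Field.Keyword("id", "id" + docCount));  // stored, indexed, not tokenized
doc.add(Field.UnStored("content", content));    // indexed, tokenized, not stored
```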





Thx in Advance


  WITH WARM REGARDS
  HAVE A NICE DAY
  [ N.S.KARTHIK]







RE: Null or no analyzer

2004-10-20 Thread Aviran
AFAIK, if the term "Election 2004" is between quotation marks, this should
work fine.

Aviran
http://aviran.mordos.com




RE: Range Query

2004-10-20 Thread Chuck Williams
Karthik,

It is all spelled out in a Lucene HowTo here:
http://wiki.apache.org/jakarta-lucene/SearchNumericalFields

Have fun with it,

Chuck




RE: Spell checker

2004-10-20 Thread Lynn Li
Where can I download it? 

Thanks,
Lynn

-Original Message-
From: Nicolas Maisonneuve [mailto:[EMAIL PROTECTED]
Sent: Monday, October 11, 2004 1:26 PM
To: Lucene Users List
Subject: Spell checker 


hy lucene users
i developed a Spell checker for lucene inspired by the David Spencer code

see the wiki doc: http://wiki.apache.org/jakarta-lucene/SpellChecker

Nicolas Maisonneuve

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Null or no analyzer

2004-10-20 Thread Erik Hatcher
On Oct 20, 2004, at 9:55 AM, Aviran wrote:
AFAIK if the term "Election 2004" will be between quotation marks this
should work fine.
No, it won't.  The Analyzer will analyze it, and the WhitespaceAnalyzer 
would split it into two tokens [Election] and [2004].

This is a tricky situation with no clear *best* way to do this sort of 
thing.  However, given what I've seen of this thread so far I'd 
recommend using the PerFieldAnalyzerWrapper and associate the fields 
indexed as Field.Keyword with a KeywordAnalyzer.  There have been some 
variants of this posted on the list - it is not included in the API, 
however perhaps it should be.  Or perhaps there are other options to 
solve this recurring dilemma folks have with Field.Keyword indexed 
fields and QueryParser?

Erik
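A sketch of that wiring, assuming a KeywordAnalyzer class along the lines of the variants posted to the list (that class itself is not in the 1.4 API):

```java
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class SubjectQueryDemo {
    public static Query parse(String userInput) throws ParseException {
        // Tokenize everything normally, except the keyword-indexed field.
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        analyzer.addAnalyzer("subject", new KeywordAnalyzer()); // assumed class
        return QueryParser.parse(userInput, "full_text", analyzer);
    }
}
```

With this wiring, subject:"Elections 2004" reaches the index as a single untokenized term, while the rest of the query is analyzed as usual.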



RE: Null or no analyzer

2004-10-20 Thread Rupinder Singh Mazara
Hi Erik

 I think the best solution is to have a NullAnalyzer class that
 allows a simple pass-through.

 The query parser can then be passed a PerFieldAnalyzer that knows
 when to select the NullAnalyzer or some other analyzer based on the
 Field:data... Field2:pp
format; this is something that the query parser is already geared up to do

regards

 Rupinder




Re: Null or no analyzer

2004-10-20 Thread sergiu gordea
I still don't understand what is wrong with the idea of indexing the 
title in a separate field and searching with a phrase query
+title:"Elections 2004" ?
I think that the real problem is that the title is not tokenized and the 
title contains more than "Elections 2004".

I think it is worth giving this solution a try.
Or maybe I don't understand the problem correctly ...
All the best,
Sergiu





RE: Null or no analyzer

2004-10-20 Thread Rupinder Singh Mazara
hi

the basic problem here is that there are data sources which contain
a) id, b) text, c) title, d) authors AND e) subject heading

text, title and authors need to be tokenized

the subject heading can be one or more words;
anyone searching such a data source is expected to know the subject headings.
if the user is trying to find all articles that have the phrases
"John Kerry" and "George Bush", as well as those classified as "Election
2004",
it is possible that there are other documents classified as "National
Service Records"
or "Tax Returns" etc...

so the objective is to find documents that have the above-mentioned phrases
as well as one
of the subject classifiers, so as to pull out the most meaningful
documents

the subject classifiers pertain to domain knowledge, and it is possible that
2 or more
subject classification headings are composed of the same set of words, but
the sequence
in which they appear can drastically alter the meaning; hence tokenizing the
subject field
is not exactly a healthy solution.

also, such search tools are meant for people who know / understand this
classification system;
the taxonomy of animals can be taken as one such example.

hope this helps define the problem









how to find coherent terms

2004-10-20 Thread Miro Max
Hello,

i've to realize one function in my project and i hope
i can find someone who can help me.

the idea is about searching for coherent terms!

my imagination:

1. search for a specific term_a
2. result: hits from lucene
   resultlist:
   term_a term_b term_c term_d
   term_b term_a term_e
   term_e term_a term_b term_f
3. now i can see that term_a is in a special
relation to term_b - but how can i check this with
lucene? is this supported by any function of lucene, or
does any other api exist?
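Lucene has no built-in call for this; one rough approach is to post-process the hit texts yourself and count which terms co-occur with the query term. A plain-Java sketch (method and class names are illustrative):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Cooccurrence {
    /**
     * Counts how often each other term appears in the hit documents for a
     * given query term. Terms with high counts are candidates for a
     * "special relation" to the query term.
     */
    public static Map<String, Integer> neighborCounts(String queryTerm,
                                                      List<String> hitTexts) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String text : hitTexts) {
            for (String token : text.toLowerCase().split("\\s+")) {
                if (!token.equals(queryTerm)) {
                    Integer c = counts.get(token);
                    counts.put(token, c == null ? 1 : c + 1);
                }
            }
        }
        return counts;
    }
}
```

On the result list above, term_b would come out with the highest count (3), matching the observation that term_a and term_b are specially related.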

thx

miro









RE: Spell checker

2004-10-20 Thread Aviran
Here http://issues.apache.org/bugzilla/showattachment.cgi?attach_id=13009

Aviran
http://aviran.mordos.com




Re: Null or no analyzer

2004-10-20 Thread Sergiu Gordea
Rupinder Singh Mazara wrote:
hi
the basic problem here is that there  are data source which contain
a) id, b) text c) title d) authors AND  d) subject heading
 

text, title and authors need to be tokenized
the subject heading can be one or more words,
 

the subject must also be tokenized, otherwise you cannot get any 
results that don't match the term exactly

so ... for example, let's assume you have the following titles:
"George Trash Elections"
"George Trash"
if you search for "George Trash" and your title is not tokenized you 
will get just the second document (I hope I'm
not making any mistake when I say that; anyway, it can be easily tested).

anyone searching such datasource is expected to know the subject headings ,
if the user is trying to find all articles that have the phrases
Jhon Kerry and Goerge Bush as well as that are classified as Election
2004
it is possible that there are other documents that are classified as Nation
Service Records
or Tax Returns etc...
 

how is this represented in the GUI? as a select box, or an input field?
if it is a select box, and you have the concept of a unique domain concept
.. you can use a not-tokenized string, or even a numerical
representation, but I think that is not your case.
In the case of input fields .. again I suggest you tokenize the string

so the object is to find documents that have the above mentioned phrases as
well as one one
of the subject classifiers, so as to pull out the most meaning full
documents
 

no problem ... once again .. use
+subject:"my searched subject"
the subject classifiers pretain to domain knowledge, and it is possible that
2 or more
subject classification headings are composed of the same set of words, but
the sequence
in which they appear can drastically alter the meaning hence tokenizing the
subject field
is not exactly a healthy solution.
 

the tokenization doesn't change the word order; in the case you use a 
PhraseQuery you will get the correct results

+title:"George Bush"
doesn't return documents with the title
"Bush George"
also such search tools are meant for people who know / understand  this
classification system
 

:)) This is a general truth: the results are better when people 
know what they are searching for :)

Taxonomy of animals can be taken as one such example,
hope this helps define the problem
 

I cannot see anything special in your problem.
Before starting to implement a complex solution, it is probably better 
to give the simple one a chance ...
I assure you that you won't lose anything, and even if you decide to 
implement complex solutions you will have
a lot of reusable code.

so ... Have fun,
 Sergiu
PS: if you can provide an example with a false positive, please ... 
provide us the case
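A sketch of the phrase query Sergiu describes, built programmatically (the field name and the lowercased terms assume a tokenizing, lowercasing analyzer at index time):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class SubjectPhrase {
    /** Builds subject:"elections 2004": both terms must appear in this
     *  exact order in the tokenized subject field. */
    public static PhraseQuery elections2004() {
        PhraseQuery query = new PhraseQuery();
        query.add(new Term("subject", "elections"));
        query.add(new Term("subject", "2004"));
        return query;
    }
}
```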



 



Re: Spell checker

2004-10-20 Thread Jonathan Hager
I investigated how the algorithm implemented in this spell checker
compares with my simple implementation of a spell checker.

First here is what my implementation looks like:

//Each word becomes a single Lucene Document

//To find suggestions:
   FuzzyQuery fquery = new FuzzyQuery(new Term("word", word));
   Hits dicthits = dictionarySearcher.search(fquery);

For a simple test I misspelled brown, as follows:
 * bronw
 * bruwn
 * brownz

To validate my testcases I checked if Microsoft Word and Google had
any idea what I was trying to spell.  Google suggested brown, brown,
browns, respectively.

Word's suggestions were:

bronw==brown, brow
bruwn==brown, brawn, bruin
brownz==browns, brown

The suggestions using  David Spencer/Nicolas Maisonneuve's algorithm
against my index were:

bronw==jaron, brooks, citron, brookline
bruwn==brush
brownz==bronze, brooks, brooke, brookline


The suggestions using my real simple algorithm against my index were:

bronw==brown, brwn, brush
bruwn==brown, brwn, brush
brownz==brown, bronze

It appears that David Spencer/Nicolas Maisonneuve's spell checking
algorithm returns a broader result set than most commercial algorithms
or a real simple algorithm.  I will be the first to say that this is
just anecdotal evidence and not a rigorous test of either algorithm.
But until extensive testing has been done I'm going to stick with my
real simple dictionary lookup.

Jonathan




RE: Spell checker

2004-10-20 Thread Alexey Lef
If you look at the FuzzyQuery code, it is based on computing Levenshtein
distance between the original term and every term in the index and keeping
the terms that are within the specified relative distance of the original
term. This would explain why FuzzyQuery may work well for small indexes but
for large indexes (I have ~5 million terms in mine) it is impossibly slow.

What n-gram based (or any other secondary index based) spell checkers are
trying to do is to select a limited number of candidate terms in a very
quick manner and then apply the distance algorithm to them. If you use the
same cutoff rules as the FuzzyQuery, you will get a very similar result set.
Secondary index-based spell checkers also give you a lot more control on how
many similar terms to bring back and in what order.
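For reference, the distance measure both approaches rely on is the classic dynamic-programming edit distance; this plain-Java version illustrates the idea (it is not Lucene's actual implementation):

```java
public class Levenshtein {
    /** Minimum number of single-character insertions, deletions, and
     *  substitutions needed to turn a into b. */
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        return d[a.length()][b.length()];
    }
}
```

Computing this against every term in a multi-million-term index per query is exactly the cost the n-gram candidate-selection step is meant to avoid.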

Regards,

Alexey





Re: TestRangeQuery.java

2004-10-20 Thread Vladimir Yuryev
Hi,
If the tests compile and run outside Eclipse, then you only need to 
configure the Eclipse project correctly :-)

Good luck,
Vladimir.