Re: Lucene Unicode Usage

2005-02-10 Thread Andrzej Bialecki
Owen Densmore wrote:
I'm building an index from a FileMaker database by dumping the data to a 
tab-separated file.  Because the FileMaker output is encoded in 
MacRoman, and uses Mac line separators, I run a script across the tab 
file to clean it up:
tr '\r\v' '\n ' | iconv -f MAC -t UTF-8
This basically converts the Mac \r's to \n's, replaces FileMaker's vtabs 
(for inter-field CRs) with blanks, and runs a character converter to 
build utf-8 data for Java to use.  Looks fine in jEdit and BBEdit, both 
of which understand UTF.
However, it matters how you read the files into your Java
application. Did you use InputStreamReader with the default platform
encoding (probably ISO-8859-1), or did you specify UTF-8 explicitly?
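
A minimal sketch of the explicit-encoding variant, assuming a modern JDK
(StandardCharsets arrived in Java 7; on older JVMs the charset name "UTF-8"
can be passed to InputStreamReader as a string instead). The class and method
names are illustrative only and are not part of Lucene or Luke:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene: decodes a stream as UTF-8.
class Utf8ReadSketch {

    // Reads the whole stream, decoding bytes as UTF-8 regardless of the
    // platform default encoding.
    static String readUtf8(InputStream in) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // "cafe" with an e-acute is five bytes in UTF-8 but four chars once decoded.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        String decoded = readUtf8(new ByteArrayInputStream(utf8Bytes));
        System.out.println(decoded.length());
    }
}
```

If the reader were created without a charset, the same bytes would be decoded
with whatever the platform default happens to be, which is one way accented
characters can end up garbled in the index.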

BUT -- when I look at the indexes created in Lucene using Luke, I get 
unprintable letters!  Writing programs to dump the terms (using Writer 
By default Luke uses the standard platform-specific font dialog. On 
Windows this font doesn't support Unicode glyphs, so you will see just 
blanks (or rectangles). In the upcoming release you will be able to 
select the display font.

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: sounds like spellcheck [auf Viren geprueft]

2005-02-10 Thread Jonathan O'Connor
Aad, Well at least that's easier.

Ciao,
Jonathan O'Connor
XCOM Dublin



Aad Nales [EMAIL PROTECTED] wrote on 09/02/2005 16:16 to the
Lucene Users List lucene-user@jakarta.apache.org:

Jonathan O'Connor wrote:

Aad,
Are you trying to check the spelling of English words by Dutch children? 


Uh no, I am trying to correct the spelling of Dutch words by Dutch
children who, as most children do, make phonetic spelling mistakes.




Re: wildcards, stemming and searching

2005-02-10 Thread Erik Hatcher
How would you deal with a query like a*z though?
I suspect, however, that you only care about suffix queries and 
stemming those.  If that's the case, then you could subclass 
getWildcardQuery and do internal stemming (remove the trailing wildcard, 
run the rest through the analyzer directly there, and return a modified 
WildcardQuery instance).

With wildcard queries though, this is risky.  Prefixes won't 
necessarily stem to what the full word would stem to.
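
A rough sketch of the "strip the trailing wildcard, analyze, re-append" idea,
written in plain Java so it runs without Lucene. Everything here is an
assumption made for illustration: the class and method names are hypothetical,
and toyStem is a deliberately crude stand-in for running one term through the
real index-time analyzer chain:

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical helper, not a Lucene class. The "analyze" function stands in
// for running a single term through the index-time analyzer.
class SuffixWildcardSketch {

    // Toy stand-in for a stemmer: lowercases and strips a trailing "ed".
    static String toyStem(String term) {
        String lower = term.toLowerCase(Locale.ROOT);
        return lower.endsWith("ed") ? lower.substring(0, lower.length() - 2) : lower;
    }

    // Returns the text to use for the query term or wildcard pattern.
    static String rewrite(String userTerm, UnaryOperator<String> analyze) {
        int firstStar = userTerm.indexOf('*');
        boolean hasQuestionMark = userTerm.indexOf('?') >= 0;
        if (firstStar < 0 && !hasQuestionMark) {
            // No wildcard at all: analyze the term as usual.
            return analyze.apply(userTerm);
        }
        boolean onlyTrailingStar = firstStar == userTerm.length() - 1 && !hasQuestionMark;
        if (onlyTrailingStar) {
            // Suffix query: analyze the prefix, then put the wildcard back.
            String prefix = userTerm.substring(0, firstStar);
            return analyze.apply(prefix) + "*";
        }
        // Patterns such as a*z are left unstemmed; only lowercased.
        return userTerm.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(rewrite("united*", SuffixWildcardSketch::toyStem));
        System.out.println(rewrite("a*z", SuffixWildcardSketch::toyStem));
    }
}
```

The sketch does nothing about the risk mentioned above: a prefix the user
typed may stem differently from the full words it was meant to match.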

Erik
On Feb 9, 2005, at 6:26 PM, aaz wrote:
Hi,
We are not using QueryParser and have some custom Query construction.
We have an index that indexes various documents. Each document is 
Analyzed and indexed via

StandardTokenizer() -> StandardFilter() -> LowerCaseFilter() -> 
StopFilter() -> PorterStemFilter()

We also want to support wildcard queries, so on an inbound query we 
need to deal with * in the value side of the comparison. We also 
need to analyze the value side of the query with the same 
analyzer the index was built with. This leads to some 
problems, and we would like your opinion on a solution.

User queries.
somefield = united*
After the analyzer hits united*, we get back unit. Hence we cannot 
detect that the user requested a wildcard.

Let's say we come up with some solution to escape the * char before 
the Analyzer hits it. For example

somefield = united*  -> unitedXXWILDCARDXX
After analysis this then becomes unitedxxwildcardxx, which we can 
then turn into a WildcardQuery united*

The problem here is that the term united will never exist in the 
index: the indexed text was stemmed (to unit), while our escape 
mechanism kept the query term from being stemmed.

How can I solve this problem?



Re: wildcards, stemming and searching

2005-02-10 Thread aaz
How would you deal with a query like a*z though?
Yeah I know, a user submitting that is certainly possible. I have no idea. I 
am starting to think that NOT stemming on indexing might be the safest 
solution.



Re: Problem searching Field.Keyword field

2005-02-10 Thread Luke Shannon
Are there any issues with having a bunch of boolean queries and then adding
them to one big boolean query (making them all required)?

Or should I be looking at Query.combine()?

Thanks,

Luke
- Original Message - 
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Tuesday, February 08, 2005 12:02 PM
Subject: Re: Problem searching Field.Keyword field


Kelvin - I respectfully disagree - could you elaborate on why this is
not an appropriate use of Field.Keyword?

If the category is "How To", Field.Text would split this (depending on
the Analyzer) into "how" and "to".

If the user is selecting a category from a drop-down, though, you
shouldn't be using QueryParser on it, but instead aggregating a
TermQuery("category", "How To") into a BooleanQuery with the rest of
it.  The rest may be other API-created clauses and likely a piece from
QueryParser.

Erik


On Feb 8, 2005, at 11:28 AM, Kelvin Tan wrote:

 As I posted previously, Field.Keyword is appropriate in only certain
 situations. For your use-case, I believe Field.Text is more suitable.

 k

 On Tue, 8 Feb 2005 10:02:19 -0600, Mike Miller wrote:
 This may or may not be correct, but I am indexing it as a keyword
 because I provide a (required) radio button on the add screen for
 the user to determine which category the document should be
 assigned. Then in the search, provide a dropdown that can be used
 in the advanced search so that they can search only for a specific
 category of documents (like HowTo, Troubleshooting, etc).

 -Original Message-
 From: Kelvin Tan [mailto:[EMAIL PROTECTED] Sent: Tuesday,
 February 08, 2005 9:32 AM To: Lucene Users List
 Subject: RE: Problem searching Field.Keyword field

 Mike, is there a reason why you're indexing category as keyword
 not text?

 k

 On Tue, 8 Feb 2005 08:26:13 -0600, Mike Miller wrote:

 Thanks for the quick response.

 Sorry for my lack of understanding, but I am learning! Won't the
 query parser still handle this query? My limited understanding
 was that the search call provides the 'all' field as default
 field for query terms in the case where fields aren't specified.
 Using the current code, searches like author:Mike and
 title:Lucene work fine.

 -Original Message-
 From: Miles Barr [mailto:[EMAIL PROTECTED] Sent:
 Tuesday, February 08, 2005 8:08 AM To: Lucene Users List Subject:
 Re: Problem searching Field.Keyword field

 You're using the query parser with the standard analyser. You
 should construct a term query manually instead.


 --
 Miles Barr [EMAIL PROTECTED] Runtime Collective Ltd.




Re: != queries

2005-02-10 Thread Jason Haruska
If this is a query you need to support often, you could create a field
"x" that contains the value "x" in every document. Then combine a required
query on that field with your prohibited query.

If not, you could get the document list by running your search and then
removing all of those documents from a complete set outside of Lucene.
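
The second option can be sketched with java.util.BitSet. This is only an
illustration: it assumes you already have the total document count and the
ids of the documents that matched the positive query, both of which are
hypothetical inputs here, and it ignores deleted documents:

```java
import java.util.BitSet;

// Hypothetical helper, not a Lucene class: computes "everything except
// the matches" outside of the search engine.
class ComplementSketch {

    // Returns a BitSet with one bit per document id in [0, docCount),
    // cleared for every id that matched the positive query.
    static BitSet allExcept(int docCount, int[] matchingDocIds) {
        BitSet remaining = new BitSet(docCount);
        remaining.set(0, docCount);      // start from the complete set
        for (int docId : matchingDocIds) {
            remaining.clear(docId);      // drop each matching document
        }
        return remaining;
    }

    public static void main(String[] args) {
        // Ten documents, one of which (id 4, chosen arbitrarily) matched A505*.
        BitSet remaining = allExcept(10, new int[] {4});
        System.out.println(remaining.cardinality());
    }
}
```

With ten documents and a single match, nine ids remain, which corresponds to
the nine documents expected in the question quoted below.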


On Thu, 10 Feb 2005 11:19:03 -0700, aaz [EMAIL PROTECTED] wrote:
 Ok, that makes sense. Any suggestions on how to AND that prohibited clause
 with a query to get everything?
 
 - Original Message -
 From: Miles Barr [EMAIL PROTECTED]
 To: Lucene Users List lucene-user@jakarta.apache.org
 Sent: Thursday, February 10, 2005 11:07 AM
 Subject: Re: != queries
 
  On Thu, 2005-02-10 at 11:02 -0700, aaz wrote:
  I have an index with field documentNumber. There are 10 documents. One
  of the documents has documentNumber A5058970
 
  I want to return all matches where documentNumber != A505*. I should get
  9 docs back.
 
  I construct a query like
 
  WildcardQuery wq = new WildcardQuery(new Term("documentNumber", "a505*"));
 
  BooleanQuery bq = new BooleanQuery();
  bq.add(wq, false, true);  // required = false, prohibited = true
 
  I always get no results for this type of query.
 
  Ideas?
 
  A restriction can only filter out search results and not add to them. So
  the search is starting with an empty set, then trying to filter out the
  results with a document number starting A505, i.e. doing nothing.
 
 
 
 
  --
  Miles Barr [EMAIL PROTECTED]
  Runtime Collective Ltd.
 
 



Re: Problem searching Field.Keyword field

2005-02-10 Thread Paul Elschot
On Thursday 10 February 2005 18:44, Luke Shannon wrote:
 Are there any issues with having a bunch of boolean queries and then adding
 them to one big boolean query (making them all required)?

In 1.4.3 and earlier, BooleanScorer throws an out-of-bounds exception
("More than 32 required/prohibited clauses in query") when a query has
more than 32 required or prohibited clauses.

In the development version this restriction is gone.

The limitation of the maximum clause count (default 1024,
configurable) is still there.

Regards,
Paul Elschot





Newbie questions

2005-02-10 Thread Paul Jans
Hi,

A couple of newbie questions. I've searched the
archives and read the Javadoc but I'm still having
trouble figuring these out. 

1. What's the best way to index and handle queries
like the following: 

Find me all users with (a CS degree and a GPA > 3.0)
or (a Math degree and a GPA > 3.5).

2. What are the best practices for using Lucene in a
clustered J2EE environment? A standalone index/search
server or storing the index in the database or
something else?

Thank you in advance,
PJ







Re: new segment for each document

2005-02-10 Thread Doug Cutting
Daniel Naber wrote:
On Thursday 10 February 2005 22:27, Ravi wrote:
I tried setting the minMergeFactor on the writer to one. But
it did not work.
I think there's an off-by-one bug so two is the smallest value that works 
as expected.
You can simply create a new IndexWriter for each add and then close it.
IndexWriter is pretty lightweight, so this shouldn't have too much
overhead.

Doug


Re: Negative Match

2005-02-10 Thread Erik Hatcher
On Feb 10, 2005, at 4:06 PM, Luke Shannon wrote:
I think I found a pretty good way to do a negative match.
In this query I am looking for all the Documents that have a 
kcfileupload
field with any value except for jpg.

Query negativeMatch = new WildcardQuery(new Term("kcfileupload", "*jpg*"));
BooleanQuery typeNegAll = new BooleanQuery();
Query allResults = new WildcardQuery(new Term("kcfileupload", "*"));
IndexSearcher searcher = new IndexSearcher(fsDir);
BooleanClause clause = new BooleanClause(negativeMatch, false, true);
typeNegAll.add(allResults, true, false);
typeNegAll.add(clause);
Hits hits = searcher.search(typeNegAll);

With the little testing I have done this *seems* to work. Does anyone see a
problem with this approach?
Sure. Do you realize what WildcardQuery does under the covers?  It 
literally expands to a BooleanQuery over all terms that match the 
pattern.  BooleanQuery has a built-in, adjustable limit of 1,024 
clauses.  You obviously have not hit that limit ... yet!

You're better off using the advice offered previously on this thread: 
create a single dummy field with a fixed value for all documents.  
Combine a TermQuery for that dummy value with a prohibited clause like 
your negativeMatch above.

Erik


Access Lucene from PHP or Perl

2005-02-10 Thread Andy
Greetings.

Can anyone point me to a tutorial on how to
access Lucene from a web page generated by PHP or
Perl? I've been looking but couldn't find anything.
Thanks a lot.

Andy
