Re: PHP-Lucene Integration

2005-02-09 Thread Daniel Cortes
Hi, I have a problem with PHP and Lucene too.
I have phpBB (a forum) and a Java portal, and I need to index the forum
posts in a Lucene index; phpBB uses a MySQL database.
I have two options. The first is to index the database directly, something
I have never done, and which I think is complex because I suppose I would
have to decide how often to re-index the database.
The second option, and the one I think is best, is to have every add or
edit button in phpBB call a Java thread that receives parameters such as
the topic text, the author and so on; these things would be indexed but
not stored, and the only thing stored would be the URL of the topic.
I hope this will be useful for someone.

PS: I have no idea yet how to do the second option :D. I would have to
modify all the buttons, and I would rather not install a Java bridge for
this, because I don't need full PHP-Java communication; the only thing I
need is to kick off a Java thread from PHP.
Perhaps my ideas are wrong; please tell me.
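To make the idea concrete, here is a minimal sketch of the Java side (the
port, index path and field names are invented; PHP would connect with
fsockopen() and write one tab-separated line per post):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Minimal indexing daemon: phpBB sends one "url \t author \t text" line per post.
public class IndexDaemon {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(9000);
        while (true) {
            Socket client = server.accept();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(client.getInputStream(), "UTF-8"));
            String line = in.readLine();
            if (line != null) {
                String[] parts = line.split("\t");
                Document doc = new Document();
                // only the URL is stored; everything else is indexed but not stored
                doc.add(Field.Keyword("url", parts[0]));
                doc.add(Field.UnStored("author", parts[1]));
                doc.add(Field.UnStored("contents", parts[2]));
                // false = append to an existing index
                IndexWriter writer = new IndexWriter("/path/to/index",
                        new StandardAnalyzer(), false);
                writer.addDocument(doc);
                writer.close();
            }
            client.close();
        }
    }
}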



Re: Problem searching Field.Keyword field

2005-02-09 Thread Miles Barr
On Tue, 2005-02-08 at 12:19 -0500, Steven Rowe wrote:
 Why is there no KeywordAnalyzer?  That is, an analyzer which doesn't 
 mess with its input in any way, but just returns it as-is?
 
 I realize that under most circumstances, it would probably be more code 
 to use it than just constructing a TermQuery, but having it would 
 regularize query handling, and simplify new users' experience.  And for 
 the purposes of the PerFieldAnalyzerWrapper, it could be helpful.

It's fairly straightforward to write one. Here's the one I put together
for PerFieldAnalyzerWrapper situations:


package org.apache.lucene.analysis;

import java.io.Reader;

public class VerbatimAnalyzer extends Analyzer {

    public VerbatimAnalyzer() {
        super();
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new VerbatimTokenizer(reader);
    }

    /**
     * This tokenizer assumes that the entire input is just one token.
     */
    public static class VerbatimTokenizer extends CharTokenizer {

        public VerbatimTokenizer(Reader reader) {
            super(reader);
        }

        // accept every character, so the input is never split on a delimiter
        protected boolean isTokenChar(char c) {
            return true;
        }
    }
}


-- 
Miles Barr [EMAIL PROTECTED]
Runtime Collective Ltd.




Re: Configurable indexing of an RDBMS, has it been done before?

2005-02-09 Thread mark harwood
A GUI plugin for Squirrel SQL (
http://squirrel-sql.sourceforge.net/) would make a
great way of configuring the mapping.
It already does all the heavy lifting for connecting
to different types of database and poking around the
internals.
I've got the bare bones of a plugin sorted (connect to
any DB, right-click a table name, click "Define Lucene
index...", list DB column names/types). Next steps are
controls to define the required mapping, run indexing,
and provide an option to save the configuration in
some XML format for ongoing batch operation.

Before taking this further, I suppose some wider
questions are:

1) Should we build this mapper into Luke instead? We
would have to lift a LOT of the DB handling smarts
from Squirrel. Luke, however, is doing a lot with
Analyzer configuration which would certainly be useful
code in any mapping tool (can we lift those parts and
use them in Squirrel?).
2) What should the XML for the batch-driven
configuration look like? Is it Ant tasks or a custom
framework?
3) If our mapping understands the make-up of the RDBMS
and the Lucene index, should we introduce a
higher-level software layer for searching which sits
over the RDBMS and Lucene and abstracts them to some
extent? This layer would know where to go to retrieve
field values or how to construct filters, i.e. it
understands whether a title field for display comes
from a database column or a Lucene stored field, and
whether a "price below $100" search criterion is
resolved by a Lucene query or by an RDBMS query that
produces a Lucene filter. It seems that currently
every DB+Lucene integration project struggles with
designing a solution to manage this divide and
hand-codes the solution.

Any thoughts appreciated
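
For reference, the kind of code such a mapping would boil down to at
runtime is roughly this (a sketch only; the driver URL, table and column
names are invented):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class JdbcIndexer {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/test", "user", "password");
        IndexWriter writer = new IndexWriter("/tmp/index",
                new StandardAnalyzer(), true);  // true = create a new index
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT id, title, body FROM articles");
        while (rs.next()) {
            // one Lucene Document per row, one Field per mapped column
            Document doc = new Document();
            doc.add(Field.Keyword("id", rs.getString("id")));      // stored, untokenized
            doc.add(Field.Text("title", rs.getString("title")));   // stored, tokenized
            doc.add(Field.UnStored("body", rs.getString("body"))); // indexed only
            writer.addDocument(doc);
        }
        rs.close();
        stmt.close();
        writer.optimize();
        writer.close();
        conn.close();
    }
}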










Re: Configurable indexing of an RDBMS, has it been done before?

2005-02-09 Thread Aad Nales
Not sure that I get everything, but:
In the framework we have built, we use a 'simple' object mapping
that connects a database table with an object, and implicitly with a
cache. It is built on top of JDBC.

The key fields of the database are used to create a DbKey element, a
simple array of Objects. Through our mapping layer you can read and
write objects as atomic elements; the layer takes care of things like
inheritance and delegation. (E.g. our object model has Items, Versions
and States; reading an item creates an ItemBean, a VersionBean with the
current version, and a StateBean with that version's most current state.)

The framework provides for a number of different applications, e.g. Forum,
News, Questionnaires, Pages and then some, each inheriting from Item. What
we needed was an indexing approach that could do the following:

1. When a Page, Questionnaire, NewsItem etc. was updated, this needed to
be reflected in the search results directly.
2. Every now and then (say once per night) a batch update was needed.

During the creation of our search solution we encountered a number of
issues:

1. Duplicate results.
When you execute a search every hit counts, but if a forum thread
consists of 200 items, then 200 hits on replies does not really add to
the user's feeling of service. We added the concept of a 'key' field to
the index; duplicate results are filtered out and only the highest hit is
displayed (see the sketch after this list).

2. Delegate objects.
A thread consists of messages, and the content of each message should be
in the index. To solve this we create detailers. A detailer is called
with a DbKey and returns a set of Objects. Each individual object is
parsed (by calling the detailer again) based on the rules defined in
search.xml (see previous mail).

3. Batch jobs.
To optimize the index, every now and then we need to reindex the whole
thing. This is done by executing a query and getting the DbKey elements.
(This, by the way, is done in a so-called DbMap, a sparse implementation
of a HashMap where objects are only loaded into the cache when getValue()
is called on their entry.) The query is run on the Item 'table'. The
parsing is done by calling the appropriate detailers per Item type.
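
The filtering itself is straightforward. A sketch, assuming hits is the
Hits object returned by the search and 'key' is the name of our key field:

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;

public class DuplicateFilter {
    // Hits come back in descending score order, so the first document
    // seen for each key is also the highest-scoring one.
    public static List filterDuplicates(Hits hits) throws IOException {
        List filtered = new ArrayList();
        Set seenKeys = new HashSet();
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            String key = doc.get("key");   // e.g. the forum thread id
            if (seenKeys.add(key)) {       // add() returns false for duplicates
                filtered.add(doc);
            }
        }
        return filtered;
    }
}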

Now, coming back to our discussion:
- The cache/mapping layer does not care much about the type of database,
since it is built on JDBC and does not use any stored procedures or
constraints other than primary keys.

- Searches are executed on an object that ensures no readers or writers
are active.

- The result of a search is given back as a Map. This way the URI that is
created as part of the result can be completely ignored if your
application so pleases.

Erik's suggestions:
Per-field analyzers and wrappers are not a problem and could very easily
be added to this framework.

Creating an object as a result is possible, I guess, but does this not
somewhat defeat the purpose of a search index? The fields in the index,
especially when set against a database, are there to present what is
interesting to search on.

The second part I don't quite 'get' is how the 'dot' mapping would work,
company.president.name for instance. I can see it writing to the index,
but not creating an object returned from a call. Or would this simply be
a key field that is then used as part of a query? Using it to navigate an
object structure is quite feasible, especially if you create a key.
E.g. I would store a key in Lucene called company.role.person and a
related field with the CSV values "XYZ, VP, Jenssen". Then, if the
company 'object' can be derived from some kind of persistent object, the
result of the query would be:

persistentObject.getCompany("XYZ").getRole("VP").getPerson("Jenssen");

The stuff we have built so far would be able to cope with something like
that, I guess, although quite a few elements would still be missing.
Using Lucene this way more or less creates a 'unified' index.

Also: I have not yet been able to look at Squirrel.
Cheers,
Aad



Re: Problem searching Field.Keyword field

2005-02-09 Thread Erik Hatcher
The only caveat with your VerbatimAnalyzer is that it will still split
strings that are over 255 characters; CharTokenizer does that.
Granted, though, keyword fields probably don't make much sense at that
length.

As mentioned yesterday, I added the LIA KeywordAnalyzer to the contrib
area of Subversion. I had built one like yours as well, but the one I
contributed reads the entire input stream into a StringBuffer, ensuring
it does not get split the way CharTokenizer would.

Erik
On Feb 9, 2005, at 4:40 AM, Miles Barr wrote:
 [original message and code snipped; quoted in full earlier in this digest]



Re: Problem searching Field.Keyword field

2005-02-09 Thread Miles Barr
On Wed, 2005-02-09 at 06:56 -0500, Erik Hatcher wrote:
 The only caveat with your VerbatimAnalyzer is that it will still split
 strings that are over 255 characters. [...]

That's good to know. When indexing web sites I use the URL as the
identifier and hence store it in a keyword field. While not common, it
is possible for URLs to be longer than 255 characters. That could have
led to some very awkward bugs to track down.

I'll probably switch over to your KeywordAnalyzer.


-- 
Miles Barr [EMAIL PROTECTED]
Runtime Collective Ltd.




sounds like spellcheck

2005-02-09 Thread Aad Nales
In my Clipper days I could build an index on English words using a
technique called soundex. Searching in that index resulted in hits of
words that sounded the same. From what I remember, this technique only
worked for English. Has it ever been generalized?

What I am trying to solve is this: a customer is looking for a solution
to spelling mistakes made by children (up to age 10) when typing in
queries. The site is Dutch. A common mistake is 'sgool' when searching
for 'school'. The 'normal' spellcheckers and suggesters typically
generate a list where the 'sounds like' candidates are too far away from
the result. So what I am thinking about doing is this:

1. create a parser that takes a word and creates a sound-index entry.
2. create a list of 'correctly' spelled words, either based on the index
of the website or on some kind of dictionary.
2a. perhaps create an n-gram index based on these words.

3. accept a query and figure out that a spelling mistake has been made.
3a. find alternatives by parsing the query, searching the 'sounds like'
index, and then calculating and ordering the results.

Steps 2 and 3 have been discussed at length in this forum and have even
made it to the sandbox. What I am left with is step 1.

My thinking is to process a series of replacement statements that go like:
--
g sounds like ch if the immediate predecessor is an s.
o sounds like oo if the immediate predecessor is a consonant.
--
But before I take this to the next step, I am wondering if anybody has
created or thought up alternative solutions?
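
To make step 1 concrete, here is a sketch of such a rule engine with just
the two example rules hard-coded (a real version would load the ruleset
from a file):

public class DutchSoundKey {
    public static String soundKey(String word) {
        String s = word.toLowerCase();
        // g sounds like ch if the immediate predecessor is an s
        s = s.replaceAll("sg", "sch");
        // o sounds like oo if the immediate predecessor is a consonant
        s = s.replaceAll("([bcdfghjklmnpqrstvwxz])o(?!o)", "$1oo");
        return s;
    }

    public static void main(String[] args) {
        // the misspelling and the correct word map to the same key
        System.out.println(soundKey("sgool"));   // prints "school"
        System.out.println(soundKey("school"));  // prints "school"
    }
}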

Cheers,
Aad





Re: sounds like spellcheck

2005-02-09 Thread Morus Walter
Aad Nales writes:
 
 Steps 2 and 3 have been discussed at length in this forum and have even
 made it to the sandbox. What I am left with is step 1.

 My thinking is to process a series of replacement statements [...]
An implementation of a rule-based system to create such a pronunciation
form can be found in a library called makelib, which is part of an editor
named leanedit.
Unfortunately the website seems to be down.
The lib is LGPL. If you're interested, I can send you a copy of the
sources. The only ruleset available is German, though.

Morus




Re: sounds like spellcheck

2005-02-09 Thread Aad Nales
Morus Walter wrote:
 Unfortunately the website seems to be down.

Do you have the URL? The sources are of course very welcome as well.
Cheers,
Aad


Re: sounds like spellcheck

2005-02-09 Thread Erik Hatcher
On Feb 9, 2005, at 7:23 AM, Aad Nales wrote:
In my Clipper days I could build an index on English words using a 
technique that was called soundex. Searching in that index resulted in 
hits of words that sounded the same. From what i remember this 
technique only worked for English. Has it ever been generalized?
I do not know how Soundex/Metaphone/Double Metaphone work with
non-English languages, but these algorithms are in Jakarta Commons
Codec.  I used the Metaphone algorithm as a custom analyzer example in
Lucene in Action.  You'll see it in the source code distribution under
src/lia/analysis/codec.  I did a couple of variations: one that adds
the metaphoned version as a token in the same position, and one that
simply replaces the original in the token stream.
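
The replacement variant looks roughly like this (a sketch along the lines
of the book's example, using the Metaphone class from Commons Codec):

import java.io.IOException;

import org.apache.commons.codec.language.Metaphone;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// every token is swapped for its Metaphone encoding, so "cool" and
// "col" both end up indexed as the same sound key
public class MetaphoneReplacementFilter extends TokenFilter {
    private final Metaphone metaphoner = new Metaphone();

    public MetaphoneReplacementFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) return null;
        String encoded = metaphoner.encode(t.termText());
        return new Token(encoded, t.startOffset(), t.endOffset(), t.type());
    }
}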

I even envisioned this sounds-like feature being used for children.  I
was mulling over the idea while having lunch with my son one day last
spring (he was 5 at the time).  I asked him how to spell "cool cat" and
he replied "c-o-l c-a-t".  I tried it out with the Metaphone algorithm
and it matches!

http://www.lucenebook.com/search?query=cool+cat
Erik



Re: sounds like spellcheck

2005-02-09 Thread Kelvin Tan
Hey Aad, I believe http://jakarta.apache.org/lucene/docs/contributions.html
has a link to Phonetix
(http://www.companywebstore.de/tangentum/mirror/en/products/phonetix/index.html),
an LGPL-licensed lib for phonetic algorithms like Soundex, Metaphone and
DoubleMetaphone. There are Lucene adapters.

As to the suitability of the algorithms, I haven't taken a look at the
Phonetix implementation, but if
http://spottedtiger.tripod.com/D_Language/D_DoubleMetaPhone.html is
anything to go by (do a search for "dutch"), then it should meet your
needs, or at least won't be difficult to customize.

Is that what you're looking for?

k

On Wed, 09 Feb 2005 13:23:57 +0100, Aad Nales wrote:
 [original message snipped]



Re: sounds like spellcheck

2005-02-09 Thread Aad Nales
Thanks for the reference to Metaphone et al.; this is the direction I am
looking for. What I don't get is why so much of the 'knowledge' of these
algorithms is stored in the 'process'. I guess it must be for performance.

cheers,
Aad


Re: Configurable indexing of an RDBMS, has it been done before?

2005-02-09 Thread Erik Hatcher
On Feb 9, 2005, at 4:51 AM, mark harwood wrote:
A GUI plugin for Squirrel SQL (
http://squirrel-sql.sourceforge.net/) would make a
great way of configuring the mapping.
That would be slick!
1) Should we build this mapper into Luke instead? We
would have to lift a LOT of the DB handling smarts
from Squirrel. Luke however is doing a lot with
Analyzer configuration which would certainly be useful
code in any mapping tool (can we lift those and use in
Squirrel?).
The dilemma with Luke is that it's not ASL'd (because of the Thinlet
integration).  Anyone up for a Swing conversion project?  :)

It would be quite cool if Lucene had a built-in UI tool (like, or
actually, Luke).  Luke itself is ASL'd and I believe Andrzej has said
he'd gladly donate it to Lucene's codebase, but the Thinlet LGPL is an
issue.

2) What should the XML for the batch-driven
configuration look like? Is it ANT tasks or a custom
framework?
Don't concern yourselves with Ant at the moment.  Anything that is
easily callable from Java can be made into an Ant task.  In fact, the
minimum requirement for an Ant task is a public void execute()
method.  Whatever Java infrastructure you come up with, I'll gladly
create the Ant task wrapper for it when it's ready.
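
That is, something of this shape is all it takes (class and attribute
names here are made up):

import org.apache.tools.ant.BuildException;
import org.apache.tools.ant.Task;

// Ant instantiates the task, calls the attribute setters, then execute()
public class IndexTask extends Task {
    private String indexDir;

    public void setIndexDir(String indexDir) {  // maps to indexdir="..."
        this.indexDir = indexDir;
    }

    public void execute() throws BuildException {
        log("Indexing into " + indexDir);
        // delegate to the batch indexing infrastructure here
    }
}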

 3) If our mapping understands the make-up of the rdbms
 and the Lucene index should we introduce a
 higher-level software layer for searching which sits
 over the rdbms and Lucene and abstracts them to some
 extent? [...]
Wow... that is getting pretty clever.  I like it!
I don't personally have a need for relational database indexing, but I 
support this effort to make a generalized mapping facility.

Erik


Re: sounds like spellcheck [auf Viren geprueft]

2005-02-09 Thread Jonathan O'Connor
Aad,
Are you trying to check the spelling of English words by Dutch children?
If so, Phonetix or any of these other solutions may not be perfect.
From my little knowledge of Dutch, a 'g' is some sort of velar fricative
(pronounced at the back of the throat), and 'ch' in English is also a
velar fricative. You have to hope that the soundex/metaphone rules are
broad enough to be used by both languages.

Interesting little problem. No J2EE libraries to call, just a static
String convertToSoundex(String word) to implement. Ah, if only I could
do more of that sort of coding.
Ciao,
Jonathan O'Connor
XCOM Dublin




Re: Configurable indexing of an RDBMS, has it been done before?

2005-02-09 Thread Andrzej Bialecki
Erik Hatcher wrote:
1) Should we build this mapper into Luke instead? We
would have to lift a LOT of the DB handling smarts
from Squirrel. Luke however is doing a lot with
Analyzer configuration which would certainly be useful
code in any mapping tool (can we lift those and use in
Squirrel?).
You are welcome - you can take any parts except for Thinlet.java (which 
is LGPL-ed).


The dilemma with Luke is that its not ASL'd (because of the thinlet 
integration).  Anyone up for a Swing conversion project?  :)

It would be quite cool if Lucene had a built-in UI tool (like or 
actually Luke).  Luke itself is ASL'd and I believe Andrzej has said 
he'd gladly donate it to Lucene's codebase, but the Thinlet LGPL is an 
issue.

Yes, I can confirm that all the parts of Luke that I wrote are under the
ASL, and I would actually prefer to donate it rather than maintain it all
on my own, especially with the recent speed of development.

Regarding Thinlet: there is some ongoing discussion about forking the
project (it's a long story), and we're lobbying to put the fork under the
ASL - but it's up to the original author to decide this, and he's rather
reluctant to let it go.

So, if anyone wants to rewrite Luke in Swing, SwiXML or something else,
he's more than welcome - but it won't be me, because I hate Swing
programming...

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: sounds like spellcheck [auf Viren geprueft]

2005-02-09 Thread Aad Nales
Jonathan O'Connor wrote:
Aad,
Are you trying to check the spelling of English words by Dutch children? 
 

Uh, no. I am trying to correct the spelling of Dutch words by Dutch
children who, as most children do, make phonetic spelling mistakes.



wildcards, stemming and searching

2005-02-09 Thread aaz
Hi,
We are not using QueryParser; we have some custom Query construction.

We have an index of various documents. Each document is analyzed and
indexed via:

StandardTokenizer() -> StandardFilter() -> LowercaseFilter() ->
StopFilter() -> PorterStemFilter()

We also want to support wildcard queries, so on an inbound query we need
to deal with '*' on the value side of the comparison. We also need to
analyze the value side of the query with the same analyzer the index was
built with. This leads to some problems, and we would like your opinion
on a solution.

The user queries:

somefield = united*

After the analyzer hits "united*", we get back "unit". Hence we cannot
detect that the user requested a wildcard.

Let's say we come up with some solution to escape the '*' character
before the analyzer hits it. For example:

somefield = united*  ->  unitedXXWILDCARDXX

After analysis this then becomes "unitedxxwildcardxx", which we can then
turn into a WildcardQuery on "united*".

The problem here is that the term "united" will never exist in the index,
because the stemming did not work properly due to our escape mechanism.

How can I solve this problem?
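
For what it's worth, the escape mechanism itself would look something
like this sketch (the marker name is arbitrary); it shows the round trip
but, as described, still leaves the stemming mismatch unsolved:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class WildcardEscape {
    private static final String MARKER = "xxwildcardxx";

    // hide '*' from the analyzer behind a marker, analyze, then restore it
    public static String analyzeWithWildcard(Analyzer analyzer, String field,
                                             String value) throws IOException {
        String escaped = value.replaceAll("\\*", MARKER);
        TokenStream ts = analyzer.tokenStream(field, new StringReader(escaped));
        Token token = ts.next();  // assumes a single-term value
        ts.close();
        return token.termText().replaceAll(MARKER, "*");
    }
}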



Follow-up to sorting tokenised field

2005-02-09 Thread Kauler, Leto S

Have been reading this thread:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg11180.html

Praveen Peddi (or anyone else), did you ever try the patch?  I would be
interested to know what sort of performance difference it makes.

I have been trying to create a most-simple solution to indexing and
sorting.  I was hoping that it would be possible to sort on our fields
without requiring the use of (and therefore prior knowledge of) specific
sort fields.

Useful would be the ability to add a sort term to fields, along with
their regular terms.  If the field is not tokenised then a sort term
might not be necessary, so the sort engine performs as normal; but if the
field is tokenised then the engine could use this defined sort term,
thus allowing all terms to be kept together in the one field.

I don't know what the technical implications of this are, though.  Just a
thought.
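
For reference, the status-quo approach this idea would replace is a
dedicated sort-only field, roughly like this (the field name is invented;
doc, title, searcher and query are assumed to exist):

import org.apache.lucene.document.Field;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Sort;

// index time: an untokenised companion field, used only for sorting
doc.add(Field.Keyword("title_sort", title));

// search time: sort on the companion field instead of the tokenised one
Hits hits = searcher.search(query, new Sort("title_sort"));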
--Leto




Tokenised and non-tokenised terms in one field

2005-02-09 Thread Kauler, Leto S
Hi all,

Seeking some best practice advice, or even if there is an alternative
solution.  Sorry for the email length, just trying to explain
succinctly.

Currently we add fields to our index like this (for reference, the Field
constructor booleans are STORE, INDEX, TOKENISE):

doc.add(new Field(field, value, true, true, false));  // stored, indexed, not tokenised
doc.add(new Field(field, value, false, true, true));  // not stored, indexed, tokenised

This creates two fields in the document with the same name: one is stored
but not tokenised, the other is not stored but tokenised, and both are
indexed for searchability.  The non-tokenised term is there so we can do
exact-match searches.
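
An exact-match search then uses the whole original value as a single term
against the non-tokenised variant, along these lines (value borrowed from
the title example below):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

// the entire stored value is one term in the non-tokenised field
TermQuery exact = new TermQuery(new Term("title", "A Guide to Lucene (PDF)"));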

In my mind, the terms of a title field might look like:

title
 - A Guide to Lucene (PDF)  [stored flag?]
title
 - guide
 - lucene
 - pdf

Can these be merged together in some way, and would it even make sense
to do so?  I am thinking in terms of creating a more lightweight index.

Thanks, --Leto




Lucene Unicode Usage

2005-02-09 Thread Owen Densmore
I'm building an index from a FileMaker database by dumping the data to
a tab-separated file.  Because the FileMaker output is encoded in
MacRoman and uses Mac line separators, I run a script across the tab
file to clean it up:
	tr '\r\v' '\n ' | iconv -f MAC -t UTF-8
This basically converts the Mac \r's to \n's, replaces FileMaker's
vtabs (for inter-field CRs) with blanks, and runs a character converter
to build UTF-8 data for Java to use.  It looks fine in jEdit and BBEdit,
both of which understand UTF.

BUT -- when I look at the indexes created in Lucene using Luke, I get
unprintable letters!  Writing programs to dump the terms (using Writer
subclasses which handle Unicode correctly) shows that indeed the files
now have odd characters when viewed with jEdit and BBEdit.

The analyzer used to build the index looks like:
public class RedfishAnalyser extends Analyzer {
  String[] stopwords;

  public RedfishAnalyser(String[] stopwords) {
    this.stopwords = stopwords;
  }

  public RedfishAnalyser() {
    this.stopwords = StopAnalyzer.ENGLISH_STOP_WORDS;
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    // tokenize, normalize, lowercase, drop stopwords, then stem
    return new PorterStemFilter(
        new StopFilter(
            new LowerCaseFilter(
                new StandardFilter(
                    new StandardTokenizer(reader))),
            stopwords));
  }
}
Yikes, what am I doing wrong?!  Is the analyzer at fault?  It's about
the only place where I can see a problem happening.

Thanks for any pointers,
Owen


Re: Lucene Unicode Usage

2005-02-09 Thread aurora
So you've got a UTF-8 encoded text file. But how do you read the file
into Java? The default encoding of Java is likely to be something other
than UTF-8. Make sure you specify the encoding, like:

  Reader reader = new InputStreamReader(new FileInputStream(filename), "UTF-8");
On Wed, 9 Feb 2005 22:32:38 -0700, Owen Densmore [EMAIL PROTECTED]  
wrote:

 [original message snipped]

--
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/