Re: PHP-Lucene Integration
Hi, I have a problem with PHP and Lucene too. I have phpBB (a forum) and a Java portal, and I need to index forum posts in a Lucene index; phpBB stores its data in a MySQL database. I see two options. The first is to index the database directly, which I have never done and which I think is complex, because I suppose I would have to decide how often to re-index the database. The second option, and the one I think is best, is to have every add/edit button in phpBB call a Java program that receives parameters such as the topic text, the author, and so on. These fields would be indexed but not stored; the only stored field would be the topic URL. I hope this is useful to someone. PS: I have no idea yet how to implement the second option :D. I would have to modify all the buttons, and I would prefer not to install a PHP-Java bridge for this, since I don't need general PHP-Java communication; the only thing I need is to invoke a Java program. Perhaps my ideas are wrong; please tell me. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Problem searching Field.Keyword field
On Tue, 2005-02-08 at 12:19 -0500, Steven Rowe wrote:
> Why is there no KeywordAnalyzer? That is, an analyzer which doesn't mess with its input in any way, but just returns it as-is? I realize that under most circumstances it would probably be more code to use it than just constructing a TermQuery, but having it would regularize query handling and simplify new users' experience. And for the purposes of the PerFieldAnalyzerWrapper, it could be helpful.

It's fairly straightforward to write one. Here's the one I put together for PerFieldAnalyzerWrapper situations:

    package org.apache.lucene.analysis;

    import java.io.Reader;

    public class VerbatimAnalyzer extends Analyzer {
        public VerbatimAnalyzer() {
            super();
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new VerbatimTokenizer(reader);
        }

        /**
         * This tokenizer assumes that the entire input is just one token.
         */
        public static class VerbatimTokenizer extends CharTokenizer {
            public VerbatimTokenizer(Reader reader) {
                super(reader);
            }

            protected boolean isTokenChar(char c) {
                return true;
            }
        }
    }

-- 
Miles Barr [EMAIL PROTECTED]
Runtime Collective Ltd.
Re: Configurable indexing of an RDBMS, has it been done before?
A GUI plugin for Squirrel SQL (http://squirrel-sql.sourceforge.net/) would make a great way of configuring the mapping. It already does all the heavy lifting for connecting to different types of database and poking around the internals. I've got the bare bones of a plugin sorted (connect to any DB, right-click a table name, click "define Lucene index...", list DB column names/types). Next steps are controls to define the required mapping, run indexing, and provide an option to save the configuration in some XML format for ongoing batch operation.

Before taking this further, I suppose some wider questions are:

1) Should we build this mapper into Luke instead? We would have to lift a LOT of the DB handling smarts from Squirrel. Luke, however, is doing a lot with Analyzer configuration which would certainly be useful code in any mapping tool (can we lift those parts and use them in Squirrel?).

2) What should the XML for the batch-driven configuration look like? Is it Ant tasks or a custom framework?

3) If our mapping understands the make-up of the RDBMS and the Lucene index, should we introduce a higher-level software layer for searching which sits over the RDBMS and Lucene and abstracts them to some extent? This layer would know where to go to retrieve field values or construct filters, i.e. it understands whether a title field for display is retrieved from a database column or a Lucene stored field, and whether a "price < $100" search criterion is resolved by a Lucene query or by an RDBMS query that produces a Lucene filter. It seems like currently every DB+Lucene integration project struggles with designing a solution to manage this divide and hand-codes the solution.

Any thoughts appreciated
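On question 2, a hypothetical sketch of what such a batch-driven mapping configuration might look like. Every element and attribute name here is invented for illustration; this is not an existing format, just one possible shape for a table-to-index mapping:

```xml
<!-- Hypothetical mapping: one RDBMS table to one Lucene index. -->
<lucene-mapping jdbc-url="jdbc:mysql://localhost/forum"
                driver="com.mysql.jdbc.Driver">
  <index dir="/var/index/posts">
    <table name="posts" key-column="post_id">
      <!-- store/tokenize flags mirror Lucene's Field options -->
      <field column="title"   lucene-field="title" store="true"  tokenize="true"
             analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
      <field column="body"    lucene-field="body"  store="false" tokenize="true"/>
      <field column="post_id" lucene-field="id"    store="true"  tokenize="false"/>
    </table>
  </index>
</lucene-mapping>
```

A format along these lines would carry enough information for both a batch indexer and, per question 3, a search layer that knows which side (column or stored field) each value lives on.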
Re: Configurable indexing of an RDBMS, has it been done before?
Not sure that I get everything. In the framework that we have built, we use a 'simple' object mapping that connects a database table with an object and, implicitly, with a cache. It is built on top of JDBC. The key fields of the database are used to create a DbKey element, a simple array of Objects. Through our mapping layer you can read and write objects as atomic elements; the layer takes care of things like inheritance and delegation. (E.g. our object model has Items, Versions and States; reading an item creates an ItemBean, a VersionBean with the current version, and a StateBean with that version's most current state.) The framework provides a number of different applications, e.g. Forum, News, Questionnaires, Pages and then some, each inherited from Item.

What we needed was an indexing approach that could do the following:
1. When a Page, Questionnaire, NewsItem etc. was updated, this needed to be reflected in the search results directly.
2. Every now and then (say once per night) a batch update was needed.

During the creation of our search solution we encountered a number of issues:

1. Double results. When you execute a search, every hit counts, but if a forum thread consists of 200 items then 200 hits on replies does not really add to the user's feeling of service. We added the concept of a 'key' field to the index. Double results are filtered out; only the highest hit is displayed.

2. Delegate objects. A thread consists of messages, and the content of each message should be in the index. To solve this we created detailers. A detailer is called with a DbKey and returns a set of Objects. Each individual object is parsed (by calling the detailer again) based on the rules that are defined in search.xml (see previous mail).

3. Batch jobs. To optimize the index now and then, we need to reindex the whole thing. This is done by executing a query and getting the DbKey elements. (This, by the way, is done in a so-called DbMap, a sparse implementation of a HashMap where objects are only loaded into the cache when getValue() is called on an entry.) The query is called on the Item 'table'. The parsing is done by calling the appropriate detailers per Item type.

Now coming back to our discussion:
- The cache/mapping layer does not care much about the type of database, since it is built on JDBC and does not use any stored procedures or constraints other than primary keys.
- Searches are executed on an object that ensures no readers or writers are active.
- The result of a search is given back as a Map. This way the URI that is created as part of the result can be completely ignored if your application so pleases.

On Erik's suggestions: per-field analyzers and wrappers are not a problem and could very easily be added to this framework. Creating an object as a result is possible, I guess, but does this not defeat the purpose of a search index somewhat? The fields in the index, especially when set against a database, are there to present those fields that are interesting to search. The second part I don't quite 'get' is how the 'dot' mapping would work, company.president.name for instance. I can see it writing to the index, but not creating an object returned from a call. Or would this simply be a key field that is then used as part of a query? Using it to navigate an object structure is quite feasible, especially if you create a key. E.g. I would store a key in Lucene called company.role.person and a related field with the CSV values XYZ, VP, Jenssen. Then, if the company 'object' can be derived from some kind of persistent object, the result of the query would be: persistentObject.getCompany("XYZ").getRole("VP").getPerson("Jenssen"). The stuff we have built so far would be able to cope with something like that, I guess, although quite some elements would still be missing. Using Lucene this way more or less creates a 'unified' index.

Also: I have not been able to look at Squirrel.

Cheers,
Aad
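The duplicate-filtering idea in point 1 (one 'key' field per logical document, keep only the best hit per key) can be sketched independently of Lucene: walk the hits in descending score order and keep the first hit seen for each key. A minimal stand-alone sketch; the Hit class and its fields are invented stand-ins for a real Lucene hit, not part of the framework described above:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DedupByKey {
    // Invented stand-in for a search hit: the 'key' field value plus a score.
    public static class Hit {
        public final String key;
        public final float score;
        public Hit(String key, float score) { this.key = key; this.score = score; }
    }

    // Assumes hits arrive sorted by descending score (as Lucene's Hits does);
    // keeps only the first, i.e. highest-scoring, hit per key (e.g. per forum thread).
    public static List<Hit> dedup(List<Hit> hits) {
        Set<String> seen = new LinkedHashSet<String>();
        List<Hit> result = new ArrayList<Hit>();
        for (Hit h : hits) {
            if (seen.add(h.key)) {   // add() returns false if the key was already seen
                result.add(h);
            }
        }
        return result;
    }
}
```

So 200 replies in one thread collapse to a single displayed hit, the highest-scoring one.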
Re: Problem searching Field.Keyword field
The only caveat to your VerbatimAnalyzer is that it will still split strings that are over 255 characters; CharTokenizer does that. Granted, though, keyword fields probably don't make much sense at that length. As mentioned yesterday, I added the LIA KeywordAnalyzer into the contrib area of Subversion. I had built one like yours as well, but the one I contributed reads the entire input stream into a StringBuffer, ensuring it does not get split the way CharTokenizer would.

Erik
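The difference comes down to how the tokenizer buffers its input: rather than relying on CharTokenizer's fixed-size internal buffer, read the whole Reader into a StringBuffer and emit the result as one token. A stand-alone sketch of just that reading step (this is an illustration of the idea, not the actual contrib KeywordAnalyzer source; the unchecked rethrow is a simplification):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReadWholeStream {
    // Reads the entire Reader into one String, however long,
    // so the result can be emitted as a single, unsplit token.
    public static String readFully(Reader reader) {
        try {
            StringBuffer sb = new StringBuffer();
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // A 300-character value, e.g. a long URL used as a keyword field.
        StringBuffer url = new StringBuffer("http://example.com/");
        while (url.length() < 300) url.append('x');
        String token = readFully(new StringReader(url.toString()));
        System.out.println(token.length()); // 300: one token, not split at 255
    }
}
```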
Re: Problem searching Field.Keyword field
On Wed, 2005-02-09 at 06:56 -0500, Erik Hatcher wrote:
> The only caveat to your VerbatimAnalyzer is that it will still split strings that are over 255 characters. CharTokenizer does that.

That's good to know. When indexing web sites I use the URL as the identifier and hence store it in a keyword field. While not common, it is possible for URLs to be longer than 255 characters. That could have led to some very awkward bugs to track down. I'll probably switch over to your KeywordAnalyzer.

-- 
Miles Barr [EMAIL PROTECTED]
Runtime Collective Ltd.
sounds like spellcheck
In my Clipper days I could build an index on English words using a technique called soundex. Searching in that index resulted in hits on words that sounded the same. From what I remember, this technique only worked for English. Has it ever been generalized?

What I am trying to solve is this. A customer is looking for a solution to spelling mistakes made by children (up to 10) when typing in queries. The site is Dutch. A common mistake is 'sgool' when searching for 'school'. The 'normal' spellcheckers and suggesters typically generate a list where the 'sounds like' candidates are too far away from the result. So what I am thinking about doing is this:

1. Create a parser that takes a word and creates a sound-index entry.
2. Create a list of 'correctly' spelled words, either based on the index of the website or on some kind of dictionary.
2a. Perhaps create an n-gram index based on these words.
3. Accept a query and figure out that a spelling mistake has been made.
3a. Find alternatives by parsing the query, searching the 'sounds like' index, and then calculating and ordering the results.

Steps 2 and 3 have been discussed at length in this forum and have even made it to the sandbox. What I am left with is step 1. My thinking is processing a series of replacement statements that go like:

--
g sounds like ch if the immediate predecessor is an s.
o sounds like oo if the immediate predecessor is a consonant.
--

But before I take this to the next step, I am wondering if anybody has created or thought up alternative solutions?

Cheers,
Aad
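The replacement-statement idea in step 1 can be prototyped as a short list of ordered rewrite rules that map both the canonical spelling and the phonetic misspelling onto the same sound key. The two rules below are only naive generalizations of the examples in this mail ('sch'/'sg' sound alike, doubled vowels collapse); a real Dutch ruleset would need many more, and the rules themselves are an assumption for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DutchSoundKey {
    // Ordered rewrite rules: regex pattern -> replacement. Purely illustrative.
    private static final Map<String, String> RULES = new LinkedHashMap<String, String>();
    static {
        RULES.put("sch", "sg");          // 'sch' and 'sg' sound alike ('school' vs 'sgool')
        RULES.put("([aeiou])\\1", "$1"); // collapse doubled vowels ('cool' vs 'col')
    }

    // Both the correct spelling and the phonetic misspelling
    // should reduce to the same key.
    public static String soundKey(String word) {
        String key = word.toLowerCase();
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            key = key.replaceAll(rule.getKey(), rule.getValue());
        }
        return key;
    }
}
```

Index the sound key of every known-good word, compute the sound key of the query term, and look it up; because rule order matters (the 'sch' rewrite must run before vowel collapsing touches the word), an ordered map is used.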
Re: sounds like spellcheck
Aad Nales writes:
> Steps 2 and 3 have been discussed at length in this forum and have even made it to the sandbox. What I am left with is step 1. [...] But before I take this to the next step I am wondering if anybody has created or thought up alternative solutions?

An implementation of a rule-based system to create such a pronunciation form can be found in a library called makelib that is part of an editor named leanedit. Unfortunately the website seems to be down. The lib is LGPL. If you're interested, I can send you a copy of the sources. The only ruleset available is German, though.

Morus
Re: sounds like spellcheck
Morus Walter wrote:
> Unfortunately the website seems to be down.

Do you have the URL? The sources are of course very welcome as well.

Cheers,
Aad
Re: sounds like spellcheck
On Feb 9, 2005, at 7:23 AM, Aad Nales wrote:
> In my Clipper days I could build an index on English words using a technique that was called soundex. Searching in that index resulted in hits of words that sounded the same. From what I remember this technique only worked for English. Has it ever been generalized?

I do not know how Soundex/Metaphone/Double Metaphone work with non-English languages, but these algorithms are in Jakarta Commons Codec. I used the Metaphone algorithm as a custom analyzer example in Lucene in Action. You'll see it in the source code distribution under src/lia/analysis/codec. I did a couple of variations: one that adds the metaphoned version as a token in the same position, and one that simply replaces it in the token stream.

I even envisioned this sounds-like feature being used for children. I was mulling over this idea while having lunch with my son one day last spring (he was 5 at the time). I asked him how to spell "cool cat" and he replied "c-o-l c-a-t". I tried it out with the metaphone algorithm and it matches! http://www.lucenebook.com/search?query=cool+cat

Erik
Re: sounds like spellcheck
Hey Aad, I believe http://jakarta.apache.org/lucene/docs/contributions.html has a link to Phonetix (http://www.companywebstore.de/tangentum/mirror/en/products/phonetix/index.html), an LGPL-licensed lib for phonetic algorithms like Soundex, Metaphone and DoubleMetaphone. There are Lucene adapters. As to the suitability of the algorithms, I haven't taken a look at the Phonetix implementation, but if http://spottedtiger.tripod.com/D_Language/D_DoubleMetaPhone.html is anything to go by (do a search for "dutch"), then it should meet your needs, or at least won't be difficult to customize. Is that what you're looking for?

k
Re: sounds like spellcheck
Thanks for the reference to Metaphone et al. This is the direction I am looking for. What I don't get is why so much of the 'knowledge' of these algorithms is stored in the 'process'. I guess it has to be for performance.

Cheers,
Aad
Re: Configurable indexing of an RDBMS, has it been done before?
On Feb 9, 2005, at 4:51 AM, mark harwood wrote:
> A GUI plugin for Squirrel SQL (http://squirrel-sql.sourceforge.net/) would make a great way of configuring the mapping.

That would be slick!

> 1) Should we build this mapper into Luke instead? We would have to lift a LOT of the DB handling smarts from Squirrel. Luke however is doing a lot with Analyzer configuration which would certainly be useful code in any mapping tool (can we lift those and use in Squirrel?).

The dilemma with Luke is that it's not ASL'd (because of the Thinlet integration). Anyone up for a Swing conversion project? :) It would be quite cool if Lucene had a built-in UI tool (like, or actually, Luke). Luke itself is ASL'd and I believe Andrzej has said he'd gladly donate it to Lucene's codebase, but the Thinlet LGPL is an issue.

> 2) What should the XML for the batch-driven configuration look like? Is it ANT tasks or a custom framework?

Don't concern yourselves with Ant at the moment. Anything that is easily callable from Java can be made into an Ant task. In fact, the minimum requirement for an Ant task is a public void execute() method. Whatever Java infrastructure you come up with, I'll gladly create the Ant task wrapper for it when it's ready.

> 3) If our mapping understands the make-up of the rdbms and the Lucene index should we introduce a higher-level software layer for searching which sits over the rdbms and Lucene and abstracts them to some extent? [...] It seems like currently, every DB+Lucene integration project struggles with designing a solution to manage this divide and handcodes the solution.

Wow... that is getting pretty clever. I like it! I don't personally have a need for relational database indexing, but I support this effort to make a generalized mapping facility.

Erik
Re: sounds like spellcheck [auf Viren geprueft]
Aad,
Are you trying to check the spelling of English words by Dutch children? Then Phonetix or any of these other solutions may not be perfect. From my little knowledge of Dutch, a g is some sort of velar fricative (pronounced at the back of the throat), and ch in English is also a velar fricative. You have to hope that the soundex/metaphone rules are broad enough to be used by both languages.

Interesting little problem. No J2EE libraries to call, just static String convertToSoundex(String word) to implement. Ah, if only I could do more of that sort of coding.

Ciao,
Jonathan O'Connor
XCOM Dublin
Re: Configurable indexing of an RDBMS, has it been done before?
Erik Hatcher wrote:
> 1) Should we build this mapper into Luke instead? We would have to lift a LOT of the DB handling smarts from Squirrel. Luke however is doing a lot with Analyzer configuration which would certainly be useful code in any mapping tool (can we lift those and use in Squirrel?).

You are welcome to; you can take any parts except for Thinlet.java (which is LGPL-ed).

> The dilemma with Luke is that it's not ASL'd (because of the Thinlet integration). Anyone up for a Swing conversion project? :) It would be quite cool if Lucene had a built-in UI tool (like, or actually, Luke). Luke itself is ASL'd and I believe Andrzej has said he'd gladly donate it to Lucene's codebase, but the Thinlet LGPL is an issue.

Yes, I can confirm that all the parts of Luke that I wrote are under ASL, and I would actually prefer to donate it rather than maintain it all on my own, especially with the recent speed of development.

Regarding Thinlet: there is some ongoing discussion about forking the project (it's a long story), and we're lobbying to put the fork under ASL, but it's up to the original author to decide this, and he's rather reluctant to let it go like this... So, if anyone wants to rewrite Luke in Swing, SwiXML or something else, he's more than welcome; but it won't be me, because I hate Swing programming...

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
Re: sounds like spellcheck [auf Viren geprueft]
Jonathan O'Connor wrote:
> Aad, Are you trying to check the spelling of English words by Dutch children?

Uh, no. I am trying to correct the spelling of Dutch words by Dutch children who, as most children do, make phonetic spelling mistakes.
wildcards, stemming and searching
Hi, We are not using QueryParser and have some custom Query construction. We have an index of various documents. Each document is analyzed and indexed via:

    StandardTokenizer -> StandardFilter -> LowerCaseFilter -> StopFilter -> PorterStemFilter

We also want to support wildcard queries, hence on an inbound query we need to deal with * on the value side of the comparison. We also need to analyze the value side of the query with the same analyzer the index was built with. This leads to some problems, and we would like your opinion on a solution.

The user queries: somefield = united*

After the analyzer hits "united*", we get back "unit". Hence we cannot detect that the user requested a wildcard. Let's say we come up with some solution to escape the * character before the analyzer hits it. For example, somefield = united* becomes unitedXXWILDCARDXX. After analysis this becomes "unitedxxwildcardxx", which we can then turn into a WildcardQuery for united*. The problem here is that the term "united" will never exist in the index: at indexing time it was stemmed to "unit", while our escape mechanism prevented the query term from being stemmed the same way. How can I solve this problem?
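One common approach, sketched here as a possibility rather than a definitive fix, is to inspect the raw term for wildcard characters before it ever reaches the analyzer: if it contains * or ?, bypass stemming and only lowercase it, then build a WildcardQuery (or PrefixQuery) from it; otherwise run it through the normal analysis chain. This sidesteps the escaping problem entirely. The routing logic in isolation; the QueryPlan class is invented for illustration, and analyze() is a lowercase-only stand-in for the real StandardTokenizer/PorterStemFilter chain so the sketch stays self-contained:

```java
public class WildcardRouter {
    public enum Kind { WILDCARD, ANALYZED }

    // Invented result holder: which kind of query to build, and from what term.
    public static class QueryPlan {
        public final Kind kind;
        public final String term;
        public QueryPlan(Kind kind, String term) { this.kind = kind; this.term = term; }
    }

    // Stand-in for the real analysis chain (tokenize, lowercase, stop, Porter stem);
    // here we only lowercase so the sketch has no Lucene dependency.
    static String analyze(String term) {
        return term.toLowerCase();
    }

    // Decide *before* analysis whether this is a wildcard term.
    public static QueryPlan route(String rawTerm) {
        if (rawTerm.indexOf('*') >= 0 || rawTerm.indexOf('?') >= 0) {
            // Lowercase only: stemming "united*" to "unit*" would be wrong anyway.
            return new QueryPlan(Kind.WILDCARD, rawTerm.toLowerCase());
        }
        return new QueryPlan(Kind.ANALYZED, analyze(rawTerm));
    }
}
```

Note the remaining wrinkle the mail points out: because the index holds stemmed terms, "united*" still won't match documents whose term was stemmed to "unit". Some projects deal with this by indexing an additional unstemmed (lowercased-only) field and running wildcard queries against that field instead.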
Follow-up to sorting tokenised field
Have been reading this thread: http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg11180.html. Praveen Peddi (or anyone else), did you ever try the patch? I would be interested to know what sort of performance difference it makes.

I have been trying to create a most-simple solution to indexing and sorting. I was hoping that it would be possible to sort on our fields without requiring the use of (and therefore prior knowledge of) specific sort fields. Useful would be the ability to add a sort term to fields, along with their regular terms. If the field is not tokenised then a sort term might not be necessary, so the sort engine performs as normal, but if the field is tokenised then the engine could use this defined sort term, thus allowing all terms to be kept together in the one field. I don't know what the technical implications of this are, though. Just a thought.

--Leto

CONFIDENTIALITY NOTICE AND DISCLAIMER Information in this transmission is intended only for the person(s) to whom it is addressed and may contain privileged and/or confidential information. If you are not the intended recipient, any disclosure, copying or dissemination of the information is unauthorised and you should delete/destroy all copies and notify the sender. No liability is accepted for any unauthorised use of the information contained in this transmission. This disclaimer has been automatically added.
Tokenised and non-tokenised terms in one field
Hi all, Seeking some best-practice advice, or even an alternative solution. Sorry for the email length; just trying to explain succinctly.

Currently we add fields to our index like this (for reference, the Field booleans are STORE, INDEX, TOKENISE):

    doc.add(new Field(field, value, true, true, false));
    doc.add(new Field(field, value, false, true, true));

This creates two fields in a document with the same name: one is stored but not tokenised, the other is not stored but tokenised, and both are indexed for searchability. The non-tokenised term is so we can do exact-match searches. In my mind, the terms of a title field might look like:

    title - A Guide to Lucene (PDF) [stored flag?]
    title - guide - lucene - pdf

Can these be merged together in some way, and would it even make sense to do so? I am thinking in terms of creating a more lightweight index.

Thanks,
--Leto
Lucene Unicode Usage
I'm building an index from a FileMaker database by dumping the data to a tab-separated file. Because the FileMaker output is encoded in MacRoman and uses Mac line separators, I run a script across the tab file to clean it up:

    tr '\r\v' '\n ' | iconv -f MAC -t UTF-8

This basically converts the Mac \r's to \n's, replaces FileMaker's vtabs (for inter-field CRs) with blanks, and runs a character converter to build UTF-8 data for Java to use. It looks fine in jEdit and BBEdit, both of which understand UTF.

BUT -- when I look at the indexes created in Lucene using Luke, I get unprintable letters! Writing programs to dump the terms (using Writer subclasses which handle Unicode correctly) shows that the dumped files do indeed have odd characters when viewed with jEdit and BBEdit. The analyzer used to build the index looks like:

    public class RedfishAnalyser extends Analyzer {
        String[] stopwords;

        public RedfishAnalyser(String[] stopwords) {
            this.stopwords = stopwords;
        }

        public RedfishAnalyser() {
            this.stopwords = StopAnalyzer.ENGLISH_STOP_WORDS;
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new PorterStemFilter(
                new StopFilter(
                    new LowerCaseFilter(
                        new StandardFilter(
                            new StandardTokenizer(reader))),
                    stopwords));
        }
    }

Yikes, what am I doing wrong?! Is the analyzer at fault? It's about the only place where I can see a problem happening.

Thanks for any pointers,
Owen
Re: Lucene Unicode Usage
On Wed, 9 Feb 2005 22:32:38 -0700, Owen Densmore wrote:
> BUT -- when I look at the indexes created in Lucene using Luke, I get unprintable letters!

So you got a UTF-8 encoded text file. But how do you read the file into Java? The default encoding of Java is likely to be something other than UTF-8. Make sure you specify the encoding, like:

    new InputStreamReader(new FileInputStream(filename), "UTF-8");

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/
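The failure mode is easy to reproduce without Lucene: decode UTF-8 bytes with the wrong charset and every non-ASCII character comes out mangled, exactly the "odd characters" described above. A minimal demonstration, using ISO-8859-1 as a stand-in for a wrong platform default encoding (and the modern java.nio.charset.StandardCharsets constants for brevity):

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String original = "caf\u00e9";                   // "café": one non-ASCII character
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8); // the é becomes two bytes

        String wrong = new String(utf8, StandardCharsets.ISO_8859_1); // wrong charset
        String right = new String(utf8, StandardCharsets.UTF_8);      // correct round trip

        System.out.println(right.equals(original));  // true
        System.out.println(wrong.equals(original));  // false: the two bytes decode
                                                     // as two mojibake characters
    }
}
```

The same mismatch happens silently when a FileReader (which uses the platform default) is handed a UTF-8 file, which is why the explicit InputStreamReader charset argument matters.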