Re: Lucene Unicode Usage
Owen Densmore wrote:

> I'm building an index from a FileMaker database by dumping the data to a tab-separated file. Because the FileMaker output is encoded in MacRoman and uses Mac line separators, I run a script across the tab file to clean it up:
>
>     tr '\r\v' '\n ' | iconv -f MAC -t UTF-8
>
> This basically converts the Mac \r's to \n's, replaces FileMaker's vtabs (for inter-field CRs) with blanks, and runs a character converter to build UTF-8 data for Java to use. It looks fine in jEdit and BBEdit, both of which understand UTF.

However, it matters how you read the files in your Java application. Did you use InputStreamReader with the default platform encoding (probably 8859-1), or did you specify UTF-8 explicitly?

> BUT -- when I look at the indexes created in Lucene using Luke, I get unprintable letters! Writing programs to dump the terms (using Writer

By default Luke uses the standard platform-specific font dialog. On Windows this font doesn't support Unicode glyphs, so you will see just blanks (or rectangles). In the upcoming release you will be able to select the display font.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
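The question about InputStreamReader can be made concrete with a small sketch. It is not Owen's code: the class name Utf8TabReader and the method readRows are made up for illustration, and StandardCharsets needs Java 7 or later (on older JVMs the charset name "UTF-8" can be passed as a string instead). The only point is that the charset is named explicitly, so the platform default never comes into play.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Lucene or of the original poster's code.
class Utf8TabReader {

    // Reads tab-separated rows, decoding the bytes as UTF-8 regardless of
    // the platform default charset.
    static List<String[]> readRows(InputStream in) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(line.split("\t", -1)); // -1 keeps trailing empty fields
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // One row with two fields containing non-ASCII characters, as UTF-8 bytes.
        byte[] sample = "Zo\u00eb\tcaf\u00e9\n".getBytes(StandardCharsets.UTF_8);
        List<String[]> rows = readRows(new ByteArrayInputStream(sample));
        System.out.println(rows.size() + " row(s), " + rows.get(0).length + " fields");
    }
}
```

The same applies on the way out: a Writer used to dump terms should wrap its stream in an OutputStreamWriter with an explicit UTF-8 charset, otherwise correctly indexed terms can still look broken in the dump.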
Re: sounds like spellcheck
Aad,

Well, at least that's easier.

Ciao,
Jonathan O'Connor
XCOM Dublin

On 09/02/2005 16:16, Aad Nales wrote:

> Jonathan O'Connor wrote:
>
> > Aad, are you trying to check the spelling of English words by Dutch children?
>
> Uh, no. I am trying to correct the spelling of Dutch words by Dutch children who, as most children do, make phonetic spelling mistakes.
Re: wildcards, stemming and searching
How would you deal with a query like a*z though?

I suspect, however, that you only care about suffix queries (a trailing wildcard) and stemming those. If that's the case, then you could override getWildcardQuery in a QueryParser subclass and do the stemming internally: remove the trailing wildcard, run the remaining text through the analyzer directly there, and return a modified WildcardQuery instance.

With wildcard queries, though, this is risky. Prefixes won't necessarily stem to what the full word would stem to.

Erik

On Feb 9, 2005, at 6:26 PM, aaz wrote:

> Hi,
>
> We are not using QueryParser and have some custom Query construction. We have an index that indexes various documents. Each document is analyzed and indexed via
>
>     StandardTokenizer() -> StandardFilter() -> LowerCaseFilter() -> StopFilter() -> PorterStemFilter()
>
> We also want to support wildcard queries, so on an inbound query we need to deal with * in the value side of the comparison. We also need to analyze the value side of the query with the same analyzer the index was built with. This leads to some problems, and we would like your opinion on a solution.
>
> The user queries:
>
>     somefield = united*
>
> After the analyzer hits united*, we get back unit. Hence we cannot detect that the user requested a wildcard.
>
> Let's say we come up with some way to escape the * char before the analyzer hits it. For example:
>
>     somefield = united* -> unitedXXWILDCARDXX
>
> After analysis this becomes unitedxxwildcardxx, which we can then turn into a WildcardQuery for united*. The problem is that the term united will never exist in the index: the index holds the stemmed form (unit), while our escape mechanism prevented the query term from being stemmed properly.
>
> How can I solve this problem?
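The "strip the trailing wildcard, stem, re-attach" idea and its risk can be seen in a toy model. This is a sketch under loud assumptions: it uses no Lucene classes, WildcardStemSketch and both of its methods are hypothetical names, and toyStem is a crude suffix stripper standing in for the Porter stemmer, so its outputs are not what PorterStemFilter would produce in general.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

// Toy model only: toyStem is a stand-in, NOT the Porter stemmer.
class WildcardStemSketch {

    // Hypothetical stemmer: strips one common suffix if at least three letters remain.
    static String toyStem(String word) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Stems the text before a single trailing '*'.
    // Any other wildcard shape (such as a*z) is returned untouched.
    static String stemTrailingWildcard(String term, UnaryOperator<String> stemmer) {
        boolean trailingOnly = term.length() > 1
                && term.endsWith("*")
                && term.indexOf('*') == term.length() - 1
                && term.indexOf('?') < 0;
        if (!trailingOnly) {
            return term;
        }
        String prefix = term.substring(0, term.length() - 1).toLowerCase(Locale.ROOT);
        return stemmer.apply(prefix) + "*";
    }

    public static void main(String[] args) {
        System.out.println(stemTrailingWildcard("united*", WildcardStemSketch::toyStem));
        System.out.println(stemTrailingWildcard("a*z", WildcardStemSketch::toyStem));
        // The risk: the indexed form of "running" versus the pattern built from "runni*".
        System.out.println(toyStem("running") + " vs "
                + stemTrailingWildcard("runni*", WildcardStemSketch::toyStem));
    }
}
```

In this toy model the word running is indexed under a shorter stem than the pattern built from runni*, so the pattern can no longer match it. That is the "prefixes won't necessarily stem to what the full word would stem to" problem in miniature.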
Re: wildcards, stemming and searching
On Thursday, February 10, 2005 8:55 AM, Erik Hatcher wrote:

> How would you deal with a query like a*z though?

Yeah, I know; a user submitting that is certainly possible. I have no idea. I am starting to think that NOT stemming at index time might be the safest solution.
Re: Problem searching Field.Keyword field
Are there any issues with having a bunch of boolean queries and then adding them to one big boolean query (making them all required)? Or should I be looking at Query.combine()?

Thanks,
Luke

On Tuesday, February 08, 2005 12:02 PM, Erik Hatcher wrote:

> Kelvin - I respectfully disagree - could you elaborate on why this is not an appropriate use of Field.Keyword?
>
> If the category is "How To", Field.Text would split this (depending on the Analyzer) into "how" and "to". If the user is selecting a category from a drop-down, though, you shouldn't be using QueryParser on it, but instead aggregating a TermQuery (category, "How To") into a BooleanQuery with the rest of it. The rest may be other API-created clauses and likely a piece from QueryParser.
>
> Erik
>
> On Feb 8, 2005, at 11:28 AM, Kelvin Tan wrote:
>
> As I posted previously, Field.Keyword is appropriate in only certain situations. For your use-case, I believe Field.Text is more suitable.
>
> k
>
> On Tue, 8 Feb 2005 10:02:19 -0600, Mike Miller wrote:
>
> This may or may not be correct, but I am indexing it as a keyword because I provide a (required) radio button on the add screen for the user to determine which category the document should be assigned to. Then, in the search, I provide a dropdown in the advanced search so that they can search only for a specific category of documents (like HowTo, Troubleshooting, etc.).
>
> On Tuesday, February 08, 2005 9:32 AM, Kelvin Tan wrote:
>
> Mike, is there a reason why you're indexing category as keyword, not text?
>
> k
>
> On Tue, 8 Feb 2005 08:26:13 -0600, Mike Miller wrote:
>
> Thanks for the quick response. Sorry for my lack of understanding, but I am learning! Won't the query parser still handle this query? My limited understanding was that the search call provides the 'all' field as the default field for query terms in the case where fields aren't specified. Using the current code, searches like author:Mike and title:Lucene work fine.
>
> On Tuesday, February 08, 2005 8:08 AM, Miles Barr wrote:
>
> You're using the query parser with the standard analyser. You should construct a term query manually instead.
>
> --
> Miles Barr
> Runtime Collective Ltd.
Re: != queries
If this is a query you need to support often, you could create a field x that contains the value x in every document. Then search on that field and combine it with your prohibited query. If not, you could run your search to get the list of matching documents, then remove all of those documents from a complete set outside of Lucene.

On Thu, 10 Feb 2005 11:19:03 -0700, aaz wrote:

> Ok, that makes sense. Any suggestions on how to AND that prohibited clause with a query to get everything?
>
> On Thursday, February 10, 2005 11:07 AM, Miles Barr wrote:
>
> > On Thu, 2005-02-10 at 11:02 -0700, aaz wrote:
> >
> > > I have an index with the field documentNumber. There are 10 documents. One of the documents has documentNumber A5058970. I want to return all matches where documentNumber != A505*. I should get 9 docs back. I construct a query like
> > >
> > >     Query wq = new WildcardQuery(new Term("documentNumber", "a505*"));
> > >     BooleanQuery bq = new BooleanQuery();
> > >     bq.add(wq, false, true);  // not required, prohibited
> > >
> > > I always get no results for this type of query. Ideas?
> >
> > A restriction can only filter out search results, not add to them. So the search starts with an empty set, then tries to filter out the results with a document number starting with A505, i.e. it does nothing.
> >
> > --
> > Miles Barr
> > Runtime Collective Ltd.
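The second suggestion, subtracting the matches from a complete set outside of Lucene, comes down to a set difference. The sketch below assumes documents are identified by integer ids 0 to docCount - 1 with no deletions; ComplementSketch and allExcept are hypothetical names, and how the matched ids are collected from the search results is left out.

```java
import java.util.Collections;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper; assumes ids run from 0 to docCount - 1 with no deleted documents.
class ComplementSketch {

    // Returns every id in [0, docCount) that is not in the matched set.
    static SortedSet<Integer> allExcept(int docCount, Set<Integer> matched) {
        SortedSet<Integer> result = new TreeSet<>();
        for (int id = 0; id < docCount; id++) {
            if (!matched.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Ten documents; suppose the positive query for A505* matched only id 4.
        Set<Integer> matched = Collections.singleton(4);
        System.out.println(allExcept(10, matched));
    }
}
```

This keeps the index unchanged, but it costs a pass over all ids on every such query, which is why the dummy-field approach is the better fit when the query is common.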
Re: Problem searching Field.Keyword field
On Thursday 10 February 2005 18:44, Luke Shannon wrote:

> Are there any issues with having a bunch of boolean queries and then adding them to one big boolean query (making them all required)?

In 1.4.3 and earlier, BooleanScorer throws an out-of-bounds exception ("More than 32 required/prohibited clauses in query") when a query has more than 32 required or prohibited clauses. In the development version this restriction is gone. The limit on the maximum clause count (default 1024, configurable) is still there.

Regards,
Paul Elschot
Newbie questions
Hi,

A couple of newbie questions. I've searched the archives and read the Javadoc, but I'm still having trouble figuring these out.

1. What's the best way to index and handle queries like the following: find me all users with (a CS degree and a GPA 3.0) or (a Math degree and a GPA 3.5)?

2. What are the best practices for using Lucene in a clustered J2EE environment? A standalone index/search server, storing the index in the database, or something else?

Thank you in advance,
PJ
Re: new segment for each document
Daniel Naber wrote:

> On Thursday 10 February 2005 22:27, Ravi wrote:
>
> > I tried setting the minMergeFactor on the writer to one. But it did not work.
>
> I think there's an off-by-one bug, so two is the smallest value that works as expected.

You can simply create a new IndexWriter for each add and then close it. IndexWriter is pretty lightweight, so this shouldn't have too much overhead.

Doug
Re: Negative Match
On Feb 10, 2005, at 4:06 PM, Luke Shannon wrote:

> I think I found a pretty good way to do a negative match. In this query I am looking for all the Documents that have a kcfileupload field with any value except for jpg.
>
>     Query negativeMatch = new WildcardQuery(new Term("kcfileupload", "*jpg*"));
>     BooleanQuery typeNegAll = new BooleanQuery();
>     Query allResults = new WildcardQuery(new Term("kcfileupload", "*"));
>     IndexSearcher searcher = new IndexSearcher(fsDir);
>     BooleanClause clause = new BooleanClause(negativeMatch, false, true);
>     typeNegAll.add(allResults, true, false);
>     typeNegAll.add(clause);
>     Hits hits = searcher.search(typeNegAll);
>
> With the little testing I have done this *seems* to work. Does anyone see a problem with this approach?

Sure: do you realize what WildcardQuery does under the covers? It literally expands to a BooleanQuery over all terms that match the pattern. BooleanQuery has a built-in, adjustable limit of 1,024 clauses. You obviously have not hit that limit ... yet!

You're better off using the advice offered previously on this thread: create a single dummy field with a fixed value for all documents. Combine a TermQuery for that dummy value with a prohibited clause like your negativeMatch above.

Erik
Access Lucene from PHP or Perl
Greetings.

Can anyone point me to a how-to tutorial on how to access Lucene from a web page generated by PHP or Perl? I've been looking but couldn't find anything.

Thanks a lot.
And