RE: Lucene 1.4 RC 3 issue with temp directory
Your catalina.bat script is guessing your CATALINA_HOME environment variable since you don't have one set, and is setting java.io.tmpdir based on that guess. You could work around this by setting a CATALINA_HOME environment variable or by setting the system property org.apache.lucene.lockdir. That doesn't solve the problem for Lucene locks when java.io.tmpdir is set to a relative path that does not exist, though.

Eric

-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Monday, May 17, 2004 2:15 PM
To: [EMAIL PROTECTED]
Subject: Lucene 1.4 RC 3 issue with temp directory

Hi All,

I just upgraded to 1.4 RC 3 and am now unable to open my index. I am getting:

java.io.IOException: The system cannot find the path specified
    at java.io.WinNTFileSystem.createFileExclusively(Native Method)
    at java.io.File.createNewFile(File.java:828)
    at org.apache.lucene.store.FSDirectory$1.obtain(FSDirectory.java:297)
    at org.apache.lucene.store.Lock.obtain(Lock.java:53)
    at org.apache.lucene.store.Lock$With.run(Lock.java:108)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:95)
    at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:38)

I _have_ reindexed using the new Lucene jar. I am positive the path is correct, as I can open an index in the same directory with the old Lucene with no problems. I notice that the problem only occurs when I am deployed inside of Tomcat; if I run searches on the command line or through JUnit, everything functions correctly.

When I print out the lockDir location that is trying to be obtained above, it looks like:

C:\ENG\index\LDC\trec-ar-dar\..\temp

which is the directory my index resides in, except ..\temp does not exist. When I create the directory, it works. I suppose I could create the temp directory for every index, but I didn't know that was a requirement.
I do notice that Tomcat has a temp directory at the top level, so it is probably setting some system property (java.io.tmpdir) to ..\temp that is being picked up by Lucene. The question is, what changed in RC 3 that would cause this to be used when it wasn't before?

On a side note, would it be useful to create the lock directory if it doesn't exist? If the developers think so, I can submit a patch for it.

Thanks,
Grant

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
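The failure in the stack trace above can be reproduced with nothing but java.io.File: FSDirectory's lock is just an exclusively created file, and File.createNewFile throws the same IOException when the lock directory does not exist. The following is a stdlib-only sketch (the canLock helper is illustrative, not Lucene's actual code):

```java
import java.io.File;
import java.io.IOException;

public class LockDirDemo {
    // Approximates what FSDirectory's lock does: create the lock file
    // exclusively in the lock directory. When the directory itself is
    // missing, File.createNewFile throws the IOException seen in the
    // stack trace ("The system cannot find the path specified").
    static boolean canLock(File lockDir, String lockName) {
        try {
            File lock = new File(lockDir, lockName);
            boolean obtained = lock.createNewFile();
            if (obtained) {
                lock.delete(); // release immediately; this is only a probe
            }
            return obtained;
        } catch (IOException e) {
            return false; // lock dir (or a parent) does not exist
        }
    }
}
```

Creating the missing directory first (lockDir.mkdirs()) makes the probe succeed, which matches Grant's observation that creating ..\temp by hand fixes the error.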
RE: IndexSearcher on JAR resources?
This isn't exactly what you were asking for, and I know it is a somewhat ugly way to implement this (it violates some OO rules by having knowledge of RAMDirectory's internal implementation), but I thought it might be of use to some folks and/or might provide a starting point for someone else to try and tackle this. It provides a mechanism for getting a RAMDirectory for an index stored on your classpath, provided that you know the names of the files that comprise the index, using ClassLoader.getResource. The test class creates an index, puts it in a jar, adds the jar to a class loader, then reads the index from the jar via the class loader into a RAMDirectory.

Eric

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, May 07, 2004 9:21 AM
To: Lucene Users List
Subject: Re: IndexSearcher on JAR resources?

On May 7, 2004, at 6:14 AM, Edin Pezerovic wrote:
> Hi, I found the following entry within the mail archives:
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg02129.html
> Is there now (2 years ago) a possibility to have the index within a jar file?

Someone posted something like this at one point, but ironically I cannot _find_ it. I definitely would be interested in having something like this handy. If anyone has pointers to the implementations posted, please let us know.

Thanks,
Erik
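The idea behind the attachment can be sketched without Lucene at all: read each known index file fully into memory, keyed by name, the same way a RAMDirectory keeps a name-to-contents map of index files. This stdlib sketch (class and method names are illustrative, not the actual posted code) works with any InputStream source, e.g. streams obtained from ClassLoader.getResourceAsStream:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

public class InMemoryIndexLoader {
    // Reads a stream fully into a byte[], the way a RAMDirectory holds
    // one index file entirely in memory.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Loads all named streams into a name -> contents map, mimicking the
    // set of files (segments, .fnm, .fdt, ...) that make up an index.
    static Map<String, byte[]> loadAll(Map<String, InputStream> streams) throws IOException {
        Map<String, byte[]> files = new HashMap<>();
        for (Map.Entry<String, InputStream> e : streams.entrySet()) {
            files.put(e.getKey(), readFully(e.getValue()));
        }
        return files;
    }
}
```

The caveat is the same one Eric states: a class loader cannot list resources, so you must know the index file names up front.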
RE: languages supported by lucene 1.2.1 in eclipse help system
I'm assuming what you have is an Eclipse plugin that is making use of the Eclipse help system. If you are relying on the Lucene Eclipse plugin, you may want to look at the help system anyway, since it will give you an example of an Eclipse plugin that uses the Lucene plugin. The Eclipse help system uses Lucene, but they have their own Analyzer class that uses BreakIterator to identify tokens for languages other than English and German. The Lucene Eclipse plugin just exports the Lucene jar and the HTML parser so that any plugin that depends on the Lucene plugin (like the help system) will have those jars in its classpath.

For English they use the PorterStemFilter with a StopAnalyzer and a stopword list. For German, they use the GermanAnalyzer supplied by the Lucene jar.

In the latest CVS at :pserver:[EMAIL PROTECTED]:/home/eclipse, see the project in org.eclipse.help.base/src/org/eclipse/help/internal/search; in older Eclipse versions, see the R2_1_maintenance branch of org.eclipse.help/src/org/eclipse/help/internal/search. The class DefaultAnalyzer is the analyzer implementation for languages other than English and German, and WordTokenStream is where they use BreakIterator to break the content from the reader into individual tokens.

The default Eclipse help system sets these extensions in the org.eclipse.help.base plugin:

<!-- Text Analyzers for search -->
<extension id="org.eclipse.help.base.Analyzer_en"
           point="org.eclipse.help.base.luceneAnalyzer">
   <analyzer locale="en"
             class="org.eclipse.help.internal.search.Analyzer_en"/>
</extension>
<extension id="org.eclipse.help.base.Analyzer_de"
           point="org.eclipse.help.base.luceneAnalyzer">
   <analyzer locale="de"
             class="org.apache.lucene.analysis.de.GermanAnalyzer"/>
</extension>

Look at the extension point schema at http://dev.eclipse.org/viewcvs/index.cgi/~checkout~/org.eclipse.help.base/schema/luceneAnalyzer.exsd?rev=HEAD&content-type=text/plain for how to declare your own analyzer extensions.
Beware though: I read that this affects all help searches in that language, not just the ones for your plugin. Also, since WordTokenStream is in a package with "internal" in its path, you aren't supposed to make use of that class from other plugins; so if you wanted your own analyzer based on that class and a stop list, you shouldn't use it without talking the Eclipse help developers into moving it outside of an internal package. Most of this has been around for a while, so it is probably the same or very similar in previous Eclipse versions; you may need to poke around at the extension point schema in your Eclipse plugins directory to verify that the extension point works the same way in your version of Eclipse. I haven't used it in versions prior to 3.0M8.

Hope this is useful to you,
Eric

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Saturday, April 24, 2004 10:18 AM
To: Lucene Users List
Subject: Re: languages supported by lucene 1.2.1 in eclipse help system

That's no myth :) Core Lucene (even the current version) does not include classes that know how to analyze/tokenize text in languages other than English, Russian, and German. However, take a look at the Snowball contributions in the Lucene Sandbox, where a few more analyzers are available, including those for the CJK group of languages.

Otis

--- Jason Elliott [EMAIL PROTECTED] wrote:

We have a plugin in our eclipse project named org.apache.lucene_1.2.1. It works quite well in that help system. I've been notified that this particular version of the lucene search analyzer searches well in German and English (GE), but not so well in the rest of the languages on this planet. I have several questions:

1. If it does not search very well in French, Italian and Japanese (FIJ), what does that really mean to a user conducting searches?
   a. If this is a myth and the searches work the same in EFIG-J, please let me know that.
   b. If this is not a myth, are there plugins that enable the search to work well in FIJ?

Thanks,
jason
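The BreakIterator approach taken by Eclipse's WordTokenStream can be sketched with the JDK alone. This is an illustrative approximation of the technique, not the Eclipse class itself: locale-aware word boundaries from java.text.BreakIterator, keeping only tokens that contain a letter or digit.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorTokens {
    // Uses java.text.BreakIterator to find word boundaries, the approach
    // Eclipse's help system takes for languages where simple whitespace
    // tokenization is insufficient.
    static List<String> tokenize(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String token = text.substring(start, end);
            // BreakIterator also reports punctuation and whitespace runs
            // as segments; keep only segments with letters or digits.
            if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(token.toLowerCase(locale));
            }
        }
        return tokens;
    }
}
```

For languages such as Japanese or Thai, the word instance for the matching locale finds boundaries that whitespace splitting would miss, which is exactly why the help system falls back to it for locales without a dedicated analyzer.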
RE: Searches containing a dollar sign $
I think Erik Hatcher commented on a similar problem the other day. When QueryParser handles a "*" query, which it turns into a prefix query, the token the prefix query is built from is not analyzed. StandardAnalyzer would turn abc$def into two tokens, "abc" and "def". QueryParser would take query 2 and build a PrefixQuery with "abc" as the prefix, and would take query 3 and build a PrefixQuery with "abc$" as the prefix.

There are probably a million valid reasons why this is appropriate default behavior for QueryParser. One off the top of my head: with a stemming analyzer, you may not get an appropriate stem if you analyzed the prefix. If this is not appropriate behavior for your application, you should probably create a custom query parser with different behavior.

Eric

Here is the snip of QueryParser.jj that builds the query objects. The only one that is analyzed is the field query. The term productions generally break on whitespace and special unescaped query operators (see the .jj file for the full details):

  ( term=<TERM>
  | term=<PREFIXTERM> { prefix=true; }
  | term=<WILDTERM> { wildcard=true; }
  | term=<NUMBER>
  )
  [ <FUZZY> { fuzzy=true; } ]
  [ <CARAT> boost=<NUMBER> [ <FUZZY> { fuzzy=true; } ] ]
  {
    String termImage=discardEscapeChar(term.image);
    if (wildcard) {
      q = getWildcardQuery(field, termImage);
    } else if (prefix) {
      q = getPrefixQuery(field,
            discardEscapeChar(term.image.substring(0, term.image.length()-1)));
    } else if (fuzzy) {
      q = getFuzzyQuery(field, termImage);
    } else {
      q = getFieldQuery(field, analyzer, termImage);
    }
  }

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 18, 2004 11:44 AM
To: Lucene Users List
Subject: Re: Searches containing a dollar sign $

Are you indexing your documents with the same Analyzer? Are you using QueryParser? Are you able to get query 3) to work when using queries directly, without a QueryParser?
Otis

--- Reece [EMAIL PROTECTED] wrote:

Hi,

I have a field that has a dollar sign in it, like this: abc$def

I perform the following queries using the StandardAnalyzer:

1). myField:abc$def  - work
2). myField:abc*     - work
3). myField:abc$*    - no work

Why doesn't the third query work? Is there an analyzer that will handle all three of these queries?

Thanks,
Reece
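Why query 3 finds nothing can be shown with plain string handling. The analyze method below is only a rough stand-in for StandardAnalyzer (a regex split, not Lucene's real grammar), but it captures the relevant behavior: "$" breaks tokens at index time, while the prefix of a wildcard query is passed through unanalyzed, so no indexed term ever starts with "abc$".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PrefixMismatchDemo {
    // Rough stand-in for StandardAnalyzer: split on anything that is not
    // a letter or digit, lowercase the rest. "$" acts as a token break.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // QueryParser does NOT analyze the prefix of a prefix query, so
    // "abc$*" looks for indexed terms starting with the raw "abc$".
    static boolean prefixMatches(List<String> indexedTerms, String rawPrefix) {
        for (String term : indexedTerms) {
            if (term.startsWith(rawPrefix)) return true;
        }
        return false;
    }
}
```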
RE: Japanese Analyzer
I've been using the CJKAnalyzer for a while now, and our native Japanese-speaking development staff haven't had any complaints with the results they are getting in their searches. Just be sure you get all the character encoding issues straight: one of the gotchas I ran into when I first started working with this was improper character encoding handling in my web application.

Eric

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 29, 2004 1:46 PM
To: Lucene Users List
Subject: Re: Japanese Analyzer

I think that's the only one we've got. You can browse the Lucene Sandbox contributions directory; it's there.

Otis

--- Weir, Michael [EMAIL PROTECTED] wrote:

Is the CJKAnalyzer the best to use for Japanese? If not, which is? If so, from where can I download it? Thanks.

Michael Weir . Transform Research Inc. . 613.238.1363 x.114

This message may contain privileged and/or confidential information. If you have received this e-mail in error or are not the intended recipient, you may not use, copy, disseminate or distribute it; do not open any attachments, delete it immediately from your system and notify the sender promptly by e-mail that you have done so. Thank you.
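The "character encoding issues" usually come down to decoding bytes with the wrong charset somewhere in the pipeline. A small stdlib check (the charset names are standard JDK ones) shows both the correct round trip and the mojibake you get from a mismatched decode:

```java
import java.io.UnsupportedEncodingException;

public class EncodingCheck {
    // Decoding bytes with the charset they were written in round-trips.
    static String roundTrip(String text, String charset) throws UnsupportedEncodingException {
        return new String(text.getBytes(charset), charset);
    }

    // Decoding them with a different charset produces mojibake: this is the
    // typical web-app bug where Shift_JIS or UTF-8 bytes get read as Latin-1.
    static String mismatch(String text, String writeCharset, String readCharset)
            throws UnsupportedEncodingException {
        return new String(text.getBytes(writeCharset), readCharset);
    }
}
```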
RE: Keyword search with space and wildcard
Not sure about documented examples, but I often find the unit tests (in src/test of Lucene's CVS) very useful for examples; I didn't see any for what you are looking for, though.

Basically, QueryParser builds up a vector of BooleanClause objects, then loops over those, calling add(BooleanClause) on a BooleanQuery object. I agree JavaCC isn't really simple to follow, but there is a lot of plain Java in there that does the parts you are interested in, and if you generate the .java file and ignore the token parsing stuff, you can look at it in your favorite Java IDE.

What you can do is cast the Query you get from QueryParser to a BooleanQuery (that is the only type of Query that QueryParser will return), then create your WildcardQuery, or any other queries you need that you didn't get in the query string, and add them as clauses to the BooleanQuery using add(Query query, boolean required, boolean prohibited).

I don't know how Query.combine() works (never used it), but the javadoc comment leads me to believe it is not what you are looking for, and a bit of poking around in the sources gives me the same impression.

Eric

-Original Message-
From: Brian Campbell [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 02, 2003 11:05 AM
To: [EMAIL PROTECTED]
Subject: Re: Keyword search with space and wildcard

Great. Is there an example anywhere on how I might be able to build such a Query? QueryParser isn't really all that simple since it's built with JavaCC. What might be ideal for me is if I can continue to use the high-level interface to build the main query (i.e. use it to parse my query string and return me some kind of Query: BooleanQuery, TermQuery, etc.) and then build a WildcardQuery by hand and combine the two together. For example, is it as simple as calling Query.combine() to combine the two? Is there a better way? Is there a documented example like this?

Thanks!
-Brian

This can be done, AFAIK.
This is one thing that many people seem unaware of: you don't HAVE to use QueryParser to build queries. In your case, it seems like you should be able to construct the query you want if you either bypass QueryParser or create a dummy analyzer (one that does no tokenization but returns all input as one token).
RE: doc number as integer
I remember this coming up before... long causes thread safety issues (reads and writes of a non-volatile long are not guaranteed to be atomic in Java):

http://www.javaworld.com/javaworld/jw-09-1997/jw-09-raceconditions.html

I couldn't find anything on Sun's Java site to reference, but I didn't look too hard.

Eric

-Original Message-
From: Neil [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 27, 2003 1:40 PM
To: [EMAIL PROTECTED]
Subject: doc number as integer

It seems that since the index document number value is a positive int, this restricts the number of documents in an index to ( 2^31 - 1 ) = 2,147,483,647. Do I misunderstand? I mean, that's enough for me, but it seems a kind of surprising restriction, considering long could be used instead for unimaginably large numbers of documents. Well, I grant I probably can't imagine 2 billion documents either, but Google can. Just curious, sorry to bother anyone.

Neil
RE: javacc problem + path/link problem in html demo
JavaCC 3 is not supported by Ant yet:

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=19468
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=763762
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=774059

Eric

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Friday, August 01, 2003 2:31 PM
To: [EMAIL PROTECTED]
Subject: javacc problem + path/link problem in html demo

I get the error, when I run ant, that it won't build.

Why am I building? When I run the web demo, all the links are formed with luceneweb/ preceding them (the links are incorrect): they come out as http://localhost:8080/luceneweb/examples/foo.jsp when they should be http://localhost:8080/examples/foo.jsp. I'm using Tomcat, btw. I hunted down the line that gets the path in HTMLDocument (in the demo), and added some scaffolding to see what it says the link is; and so I wanted to recompile it. The thought is that I could do a substring on the path, if it is indeed adding luceneweb/ as part of all the paths. (It's a bit of a hack, but it would work.)

Anyway, I downloaded JavaCC and am trying to build, to no avail. I've read through the newsgroup archives, read the help files, and looked on the net... so here I am, emailing the group. Thanks so much.

Some more detail: ant can't find javacc (also, it wants javacc.zip, but the JavaCC distribution I got only comes with javacc.jar). From my default.properties file (I added this myself):

# Home directory of JavaCC
javacc.home = c:/Java tools/javacc-3.1/
javacc.zip.dir = ${javacc.home}/bin/lib
javacc.zip = ${javacc.zip.dir}/javacc.jar

(the above snippet seems to do no good :(

-Jill
RE: javacc problem + path/link problem in html demo
JavaCC 2 is no longer available. You will have to upgrade or dig for it (i.e., you and whomever you get it from would be violating the license agreement).

Eric

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Friday, August 01, 2003 2:40 PM
To: Lucene Users List
Subject: RE: javacc problem + path/link problem in html demo

I saw that bug, but tried it anyway... but since the bug is still active, do you know where I can get an earlier copy of javacc? (And which version, exactly, do I need?)

thanks
-Jill
RE: JavaCC v3 and Lucene
If you go to the bug, right-click the link for this attachment (Save Target As... in IE; not sure what the Netscape equivalent is):

06/14/03 01:23  javacc3-ant-support.jar  to be added to /lib  (application/octet-stream)

and save it as javacc3-ant-support.jar into your Lucene /lib directory. Then save this other attachment (it is a patch file):

06/14/03 02:39  Complete Patch including refactoring the javacc tasks out of the compile target  (text/plain)

and apply the patch. I'm not sure what tools you can use to do that; I use the Team support in Eclipse (www.eclipse.org, Team > Apply Patch). I noticed a day or two ago that the build.xml diff is a little out of synch with current CVS, so you may need to look at that some. I started fixing up a new patch but haven't had enough free time to finish it yet.

Eric

-Original Message-
From: Liliya Kharevych [mailto:[EMAIL PROTECTED]
Sent: Monday, July 21, 2003 6:56 PM
To: [EMAIL PROTECTED]
Subject: RE: JavaCC v3 and Lucene

Hi,

I was trying to build Lucene with JavaCC 3.0 and completely got lost. Sorry about the dummy question, but where can I download the patch? I tried the bug URL and was able to download JavaCC_3.java, but the last attachment is this big text file and I cannot figure out what to do with it. As I understand, build.xml should be changed and javacc3-ant-support.jar should be somewhere, but I cannot find it.

Thanks,
lily
RE: CJK support in lucene
This archived message has the CJKTokenizer code attached (there are some links in the code to material that describes the tokenization strategy):

http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=330905

You have to write your own analyzer that uses this tokenizer. See http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html for some details on how to write an analyzer. Here is one you could use:

package my.analysis; // pick your own package name ("my.package" would not compile, since "package" is a reserved word)

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKTokenizer;

import java.io.Reader;

public class CJKAnalyzer extends Analyzer {

    public CJKAnalyzer() {
    }

    /**
     * Creates a TokenStream which tokenizes all the text in the provided Reader.
     *
     * @return A TokenStream built from a CJKTokenizer
     */
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new CJKTokenizer(reader);
        // CJKTokenizer emits a "" sometimes; haven't been able to figure
        // it out, so this is a workaround
        result = new StopFilter(result, new String[] {});
        return result;
    }
}

Lastly, you have to package those things up and use them along with the core Lucene code.

CC'ing this to Lucene User so everyone can benefit from these answers. Maybe a FAQ on indexing CJK languages would be a good thing to add. The existing one (http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q28) is somewhat light on details (so is this answer, but it is a bit more direct about dealing with CJK), and http://www.jguru.com/faq/view.jsp?EID=108 is useful to be aware of too.

Good luck,
Eric

-Original Message-
From: Avnish Midha [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2003 1:06 PM
To: Eric Isakson
Subject: CJK support in lucene

Hi Eric,

I read the description of the bug (#18933) reported by you on the Apache site. I had a question related to this defect.
In the description you have mentioned that CJK support should be included in the core build. Is there any other way we can enable CJK support in the Lucene search engine? Would be grateful if you could let me know of any such method of enabling CJK support in the search engine. Eagerly waiting for your reply.

Thanks & Regards,
Avnish Midha
Phone no.: +1-949-8852540
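CJKTokenizer indexes runs of CJK characters as overlapping two-character "bigrams" (the strategy described by the links in its source). A minimal stdlib sketch of that strategy, not the real tokenizer, shows what gets emitted for a run of adjacent characters:

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Emit every overlapping pair of adjacent characters: the bigram
    // strategy CJKTokenizer applies to runs of CJK text, which has no
    // whitespace between words.
    static List<String> bigrams(String run) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < run.length(); i++) {
            out.add(run.substring(i, i + 2));
        }
        return out;
    }
}
```

So a four-character run yields three bigram tokens; a search phrase tokenized the same way will match whenever its bigrams line up, without needing a dictionary-based word segmenter.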
FW: CJK support in lucene
-Original Message-
From: Eric Isakson
Sent: Wednesday, July 16, 2003 2:04 PM
To: 'Avnish Midha'
Subject: RE: CJK support in lucene

I'm no linguist, so the short answer is: I'm not sure about Taiwanese. If they share the same character sets and a bigram indexing approach makes sense for that language (read the links in the CJKTokenizer source), then it would probably work. For Latin-1 languages, it will tokenize (it is set up to deal with mixed-language documents where some of the text might be Chinese and some might be English), but it will be far less efficient than the standard tokenizer supplied with the Lucene core. You should run your own tests to see if that would be livable.

Eric

-Original Message-
From: Avnish Midha [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2003 1:50 PM
To: Eric Isakson
Cc: Lucene Users List
Subject: RE: CJK support in lucene

Eric,

Does this tokenizer also support Taiwanese & European languages (Latin-1)?

Regards,
Avnish
RE: '-' character not interpreted correctly in field names
You left out the ~ character in your _FIELDNAME_START_CHAR production. That character tells the grammar to take all the characters except the ones you specified (the complement). Change:

| <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":",

To:

| <#_FIELDNAME_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":",

and it should probably work.

Eric

-Original Message-
From: Victor Hadianto [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 09, 2003 4:53 AM
To: Lucene Users List
Subject: Re: '-' character not interpreted correctly in field names

Hi Erik and others,

I'm looking for a similar solution, where I need QueryParser not to drop the - character from the field name; however, outside the field I do want the - sign interpreted as a NOT modifier. I'm definitely not an expert in JavaCC and, to be honest, I have only a limited idea of how Erik's suggestion works. Anyway, I followed the suggestion and added the following:

| <#_WHITESPACE: ( " " | "\t" ) >
| <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":",
      "^", "[", "]", "\"", "{", "}", "~", "*", "?" ] | <_ESCAPED_CHAR> ) >
| <#_FIELDNAME_CHAR: ( <_FIELDNAME_START_CHAR> | <_ESCAPED_CHAR> ) >

and again below I added:

| <TERM: <_TERM_START_CHAR> (<_TERM_CHAR>)* >
| <FIELDNAME: <_FIELDNAME_START_CHAR> (<_FIELDNAME_CHAR>)* >

And I changed:

LOOKAHEAD(2) fieldToken=<TERM> <COLON> { field = fieldToken.image; }

to:

LOOKAHEAD(2) fieldToken=<FIELDNAME> <COLON> { field = fieldToken.image; }

Well, after doing all these mods, all queries that involve field names cause problems. For example, if I search for fieldname:hello, the query is blank (yes, blank, nothing in it), and if the field name does contain a dash (-), for example field-name:hello, the query is: +field -name, so hello is dropped. Does anyone have any idea? Help and suggestions will be much appreciated. I really need to get this dash working; changing the field name will be my last resort, which I won't explore until I really have to.
Thanks,
Victor

On Thu, 15 May 2003 04:54 am, Eric Isakson wrote:

I think the query parser changes would not be too bad; I've outlined a couple of relevant lines you should look at so you don't have to try to comprehend the productions for the entire QueryParser. I do not think I would like to have to maintain one of those myself, though. Your other unmentioned alternative is to choose field names that match the TERM production of QueryParser.jj without escapes.

QueryParser.jj line 557:

fieldToken=<TERM> <COLON> { field = fieldToken.image; }

and earlier...

<#_ESCAPED_CHAR: "\\" [ "\\", "+", "-", "!", "(", ")", ":",
                        "^", "[", "]", "\"", "{", "}", "~", "*", "?" ] >
| <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":",
      "^", "[", "]", "\"", "{", "}", "~", "*", "?" ] | <_ESCAPED_CHAR> ) >
| <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >
...
<TERM: <_TERM_START_CHAR> (<_TERM_CHAR>)* >

So the characters you need to avoid in your field names are the ones from _ESCAPED_CHAR: [ "\\", "+", "-", "!", "(", ")", ":", "^", "[", "]", "\"", "{", "}", "~", "*", "?" ]

If you need to modify the parser, you will probably want to add a FIELDNAME token and other supporting productions that look really similar to these lines I've copied, but modify the complement, ~[...], at the beginning of _FIELDNAME_START_CHAR (you would add this production) so it will match the - that you are using in your field names (and fix it to match any other characters you want to use in field names that it doesn't allow right now).

Eric

-Original Message-
From: Jon Pipitone [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 14, 2003 2:26 PM
To: Lucene Users List
Subject: Re: '-' character not interpreted correctly in field names

Eric Isakson wrote:
> I just looked at the QueryParser.jj code; your field names never get processed by the analyzer. It does look like the query parser will honor escapes, though. I haven't tried this, but try a query like foo\-bar:foo, and have a look at the QueryParser.jj file for how it handles field names when parsing your query.

Hrm.. that's what I had found too.
So you're saying that, other than escaping dashes, I'd have to change QueryParser..? I'm not too familiar just yet with JavaCC syntax, so reading through QueryParser is a little tough going.

Thanks Eric,
jp

-Original Message-
From: Jon Pipitone [mailto:[EMAIL PROTECTED]
Sent: Monday, May 12, 2003 4:03 PM
To: Lucene Users List
Subject: Re: '-' character not interpreted correctly in field names

Hi Otis, Terry,

> You can write a custom Analyzer that does not remove dashes from tokens, and use it for both indexing and searching. This is a frequent question and answer on this list.

Sorry for the noise, but I haven't been able to find a solution in the mailing list archives, or by writing my own analyzer:

public class MyAnalyzer
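The escaping alternative discussed in this thread (a query like foo\-bar:foo) can be automated instead of typed by hand. This hypothetical helper, not part of Lucene, backslash-prefixes the operator characters listed in the _ESCAPED_CHAR production:

```java
public class FieldNameEscaper {
    // The QueryParser special characters from _ESCAPED_CHAR:
    // \ + - ! ( ) : ^ [ ] " { } ~ * ?
    private static final String SPECIALS = "\\+-!():^[]\"{}~*?";

    // Prefix each special character with a backslash so a field name
    // like "field-name" survives parsing as field\-name.
    static String escape(String fieldName) {
        StringBuilder sb = new StringBuilder();
        for (char c : fieldName.toCharArray()) {
            if (SPECIALS.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```

You would then build the query string as escape("field-name") + ":" + term before handing it to QueryParser, keeping the - inside field names while leaving it free to act as a NOT modifier elsewhere.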
RE: Querying Question
This query.toLowerCase() lowercased your query to become:

name:"checkpoint" and value:"filenane_1"

The keyword AND must be uppercase when the query parser gets hold of it; lowercase "and" is treated as just another term, so both clauses become optional and all four files match. If your RepositoryIndexAnalyzer lowercases its tokens, you don't need to do query.toLowerCase(). If it doesn't lowercase its tokens, you may want to modify it so that it does.

Eric

-Original Message-
From: Rob Outar [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 03, 2003 5:11 PM
To: Lucene Users List
Subject: Querying Question
Importance: High

Hi all,

I am a little fuzzy on complex querying using AND, OR, etc. For example, I have the following name/value pairs:

file 1 = name = checkpoint, value = filename_1
file 2 = name = checkpoint, value = filename_2
file 3 = name = checkpoint, value = filename_3
file 4 = name = checkpoint, value = filename_4

I ran the following query:

name:"checkpoint" AND value:"filenane_1"

Instead of getting back file 1, I got back all four files. Then, after trying different things, I did:

+(name:"checkpoint") AND +(value:"filenane_1")

and it then returned file 1. Our project queries solely on name/value pairs, and we need the ability to query using AND, OR, NOT, etc. What is the correct syntax for such queries? The code I use is:

QueryParser p = new QueryParser("", new RepositoryIndexAnalyzer());
this.query = p.parse(query.toLowerCase());
Hits hits = this.searcher.search(this.query);

Thanks as always,
Rob
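If the analyzer can't be changed to lowercase its tokens, the blanket toLowerCase() can at least spare the boolean operators. A quick hypothetical preprocessor (whitespace-split, so it does not special-case operators inside quoted phrases):

```java
public class QueryLowercaser {
    // Lowercase every whitespace-separated token except the QueryParser
    // keywords AND, OR, NOT, which must stay uppercase to act as operators.
    static String lowercaseTerms(String query) {
        StringBuilder sb = new StringBuilder();
        for (String tok : query.trim().split("\\s+")) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            if (tok.equals("AND") || tok.equals("OR") || tok.equals("NOT")) {
                sb.append(tok);
            } else {
                sb.append(tok.toLowerCase());
            }
        }
        return sb.toString();
    }
}
```

The cleaner fix remains making RepositoryIndexAnalyzer lowercase its tokens so no string preprocessing is needed at all.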
RE: Indexing and searching non-latin languages using utf-8
Have you verified that your form inputs are getting to your query objects without the String being mangled due to encoding problems? I'm getting Japanese in UTF-8 and use the technique described at http://w6.metronet.com/~wjm/tomcat/2001/Aug/msg00230.html to get the data from the browser to Lucene. I build my index using the HTMLParser in the Lucene demos and give them a Reader object that was created from an InputStreamReader that specifies the HTML file encoding (Shift_JIS in my case). There are a bunch of other issues I'm working on to support Japanese, but I'm getting search results at this point. The two places that encodings should come into play for you are parsing your source content into the Reader or String that you use to create org.apache.lucene.document.Field objects, and getting the user query from their browser to the Query objects. Eric -- Eric D. Isakson, Application Developer, XML Technologies, SAS Institute Inc., SAS Campus Drive, Cary, NC 27513, (919) 531-3639, http://www.sas.com -Original Message- From: MERCIER ALEXANDRE [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 18, 2003 11:36 AM To: [EMAIL PROTECTED] Subject: Indexing and searching non-latin languages using utf-8 Hi all, I have a problem with indexing and then searching docs written in non-Latin languages and encoded in UTF-8 (Russian, for example). I have a web application with a simple form to search the contents of the docs. When I submit the form, I encode the query term in UTF-8 with encodeURI(String), but I match no doc. I think that is due to bad indexing, but I'm not sure. Lucene normally indexes docs by writing Terms to the 'xxx.tis' file, encoding them in UTF-8, I believe. So when it reads the file, it correctly gets Russian characters (2 bytes), but when writing them in the index, they seem different (I've listed the terms in my application console). If someone has a solution to my problem, all advice is welcome. Thanks. 
Alex
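The first place Eric points at, decoding the source bytes with an explicit charset instead of the platform default, can be sketched in plain Java (the helper class is invented for illustration; in the real indexing code the resulting Reader/String would feed a Lucene Field):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

// Sketch: decode source bytes with an explicit charset before handing the
// text to the indexer. Relying on the platform default encoding is the
// usual cause of mangled non-Latin terms.
class EncodingDemo {
    static String decode(byte[] bytes, String charsetName) {
        try {
            Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), charsetName);
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) sb.append((char) c);
            return sb.toString();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Decoding UTF-8 bytes as UTF-8 round-trips the accented/Cyrillic characters; decoding them as ISO-8859-1 would produce the "different-looking" terms Alex describes.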
RE: Indexing and searching non-latin languages using utf-8
There are a bunch of other issues... I should have qualified that. There really aren't any issues with the Lucene core to support Japanese, just other issues in my app that uses Lucene, and working with my content providers to ensure consistent use of encodings, etc. I have found what I think is a bug in the CJKTokenizer, in that it emits an empty-string token after processing my Japanese characters. I haven't tracked down the bug in CJKTokenizer yet, but as a workaround I'm using a StopFilter that removes it. Eric
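The workaround Eric describes, filtering the spurious empty token out of the stream, can be sketched with plain strings standing in for Lucene Token objects (this is an illustrative stand-in, not the actual StopFilter code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the workaround above: drop zero-length tokens from a token
// stream. Plain strings stand in for Lucene Token objects here; a real
// TokenFilter would do the same check in its next() method.
class EmptyTokenFilter {
    static List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            if (t != null && t.length() > 0) out.add(t); // skip empty tokens
        }
        return out;
    }
}
```

An empty-string token is harmful because it gets indexed as a real term and can then match (or fail to match) queries in surprising ways.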
RE: Multi Language support
Hi Günter, I had a similar requirement for my use of Lucene. We have documents with mixed languages, some of the text in the user's native language and some in English. We made the decision not to use any of the stemming analyzers and to index with no stop words (I didn't like the no-stop-words decision, but it wasn't really my call). My analyzer's tokenStream method:

public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    return result;
}

Do you really need stemming in your application? Do you really need stop words? See this note http://archives.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=653731 for a discussion about the advantages/disadvantages of stemming. If you still want stop words, you can create a list that includes words from more than one language, then use the same analyzer for all of your content. If you still need stemming, you will probably have to give your user the ability to tell you which language index they wish to search, and you would probably be better off maintaining separate indices for each language at that point. Best of luck, Eric -Original Message- From: Günter Kukies [mailto:[EMAIL PROTECTED] Sent: Thursday, March 06, 2003 2:08 AM To: Lucene Users List Subject: Multi Language support Hello, this is what I know about indexing international documents: 1. I have a language ID 2. with this ID I choose a special Analyzer for that language 3. I can use one index for all languages But what about searching international documents? I don't have a language ID, because the user is interested in documents in his native language and a second language, mostly English. So, what Analyzer do I use for searching? Thanks Günter
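Eric's suggestion of a single stop-word list covering more than one language can be sketched as follows; the word lists are deliberately abbreviated and the class is invented for illustration, with strings standing in for analyzer tokens:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: one stop-word set merged from several languages, so a single
// analyzer can serve mixed-language content. Word lists abbreviated.
class MultiLangStops {
    static final Set<String> STOPS = new HashSet<String>(Arrays.asList(
        "the", "and", "of",   // English (abbreviated)
        "der", "die", "und"   // German (abbreviated)
    ));

    static List<String> removeStops(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens)
            if (!STOPS.contains(t.toLowerCase())) out.add(t);
        return out;
    }
}
```

The trade-off of the merged list is that a stop word in one language may be a content word in another (e.g. German "die"), so it disappears from both.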
RE: Phrase query and porter stemmer
Ramesh, I haven't examined the code closely that does this positioning, but this is how I believe it works: Let's say you had a token stream that returned the tokens you, are, running, faster, than, me and that didn't do any setPositionIncrement calls. The default increment is 1. Each token in the stream gets a position, which allows you to do things like proximity searches: the query "are than"~3 would find the document that token stream came from, since are occurs at position 2 and than at position 5, and 5 - 2 = 3. Now let's say you wanted to stem running to run but keep the original token. You would create a token filter that inserted the stem run into the token stream when the running token occurred but also kept the original token running. If you didn't set the position increment on the inserted token, the distance between are and than would become 6 - 2 = 4, which is greater than 3, and your proximity query would fail. When you set the position increment to zero for the added token, it gets treated as if it is at the same position as the original token, which prevents you from breaking your proximity query. Proximity queries are the place I know this affects; I'm unsure how the positions affect other parts of Lucene. Hope I got all that right and that it helps you understand setPositionIncrement. Eric -Original Message- From: Mailing Lists Account [mailto:[EMAIL PROTECTED]] Sent: Thursday, February 13, 2003 7:07 AM To: Lucene Users List Subject: Re: Phrase query and porter stemmer Hi Eric, Thanks for the reply. The option of a custom token filter sounds good to me. I am not sure what the advantage of the Token.setPositionIncrement() option is. Let me look into the docs before I ask further questions on this. regards Ramesh Eric Isakson wrote: You won't get hits for security if you do not use the stemmer. The stem of security is the token that gets stored in the index. 
If you don't use the stemming algorithm when you create the index, you could search for security and get only those documents that contain security. See the FAQ http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q15 If you have a list of terms you want to treat differently (i.e. you know there are certain words you don't want to stem), you could build a custom TokenFilter that checks the tokens for those words before applying the stemming algorithm, then add that TokenFilter to your analyzer. You might also consider allowing the tokens to be stemmed and adding the original non-stemmed term at the same position using Token.setPositionIncrement(0); you might also want to figure out some way to boost the score on those non-stemmed tokens when you build your query (not sure how you might accomplish that, but some custom query parsing code could do the trick). Eric
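The position arithmetic Eric walks through can be sketched directly: each token's absolute position is the running sum of its position increments, so an inserted stem with increment 0 shares its original token's position. The class is an illustrative stand-in, not Lucene code:

```java
// Sketch of the position arithmetic described above: Lucene's default
// position increment is 1; an added token with increment 0 sits at the
// same position as the token before it.
class Positions {
    // increments[i] is the position increment of token i
    static int[] absolutePositions(int[] increments) {
        int[] pos = new int[increments.length];
        int p = 0;
        for (int i = 0; i < increments.length; i++) {
            p += increments[i];   // running sum gives the absolute position
            pos[i] = p;
        }
        return pos;
    }
}
```

For the stream you, are, running, run, faster, than, me with increments {1,1,1,0,1,1,1}, "run" lands at the same position (3) as "running", and "than" stays at position 5, so the "are than"~3 proximity query keeps working.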
RE: Phrase query and porter stemmer
You won't get hits for security if you do not use the stemmer. The stem of security is the token that gets stored in the index. If you don't use the stemming algorithm when you create the index, you could search for security and get only those documents that contain security. See the FAQ http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q15 If you have a list of terms you want to treat differently (i.e. you know there are certain words you don't want to stem), you could build a custom TokenFilter that checks the tokens for those words before applying the stemming algorithm, then add that TokenFilter to your analyzer. You might also consider allowing the tokens to be stemmed and adding the original non-stemmed term at the same position using Token.setPositionIncrement(0); you might also want to figure out some way to boost the score on those non-stemmed tokens when you build your query (not sure how you might accomplish that, but some custom query parsing code could do the trick). Eric -Original Message- From: Mailing Lists Account [mailto:[EMAIL PROTECTED]] Sent: Wednesday, February 12, 2003 4:17 AM To: [EMAIL PROTECTED] Subject: Phrase query and porter stemmer Hi, I use PorterStemmer with my analyzer for indexing the documents, and I have been using the same analyzer for searching too. When I search for a phrase like security AND database, I would like to avoid matches for terms like secure or securities. I observed that Google and a couple of other search engines do not return such matches. 1) In other words, in a single query, is it possible not to use the porter stemmer for phrase queries but use it for other queries (such as Term query etc.)? 2) As an alternative, is it advisable to manually construct a PhraseQuery by adding terms without applying the porter stemmer? 
regards Ramesh
RE: Is the searched string 'on' a special case ?
Assuming you are using StandardAnalyzer, the default stop words are:

public static final String[] STOP_WORDS = {
    "a", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "s",
    "such", "t", "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
};

Your state field must not have been built with StandardAnalyzer, or ON would have been removed by the analyzer when you created the field. It looks like you will need to use lower-level APIs than QueryParser to create your Query object, or not use the default stop words. Eric -Original Message- From: Alain Lauzon [mailto:[EMAIL PROTECTED]] Sent: Monday, January 13, 2003 1:23 PM To: Lucene Users List Subject: Is the searched string 'on' a special case ? I have an index with many fields, notably one for company name and one for state. When I search for: +company:inc~100 I get 114 results from 2 states, HI (Hawaii) and ON (Ontario). If I search for: +state:hi +company:inc~100 I get 7 results for Hawaii. But when I search for: +state:on +company:inc~100 I get no results at all for Ontario. So what is going on? I tried with many other states and all are working, but not 'on'. Is 'on' a special case? Like on/off? Alain Lauzon [EMAIL PROTECTED]
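A quick way to see why "on" behaves specially is to check the term against that stop list before building the query. This is an illustrative sketch (the class is invented, the stop list copied from above); a lower-level alternative in Lucene itself would be constructing the clause directly, e.g. new TermQuery(new Term("state", "on")), which bypasses the analyzer entirely:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch: check whether a term survives the default stop list. "on" is on
// the list, which is why the analyzed query state:on matches nothing.
class StopCheck {
    static final Set<String> DEFAULT_STOPS = new HashSet<String>(Arrays.asList(
        "a", "and", "are", "as", "at", "be", "but", "by", "for", "if",
        "in", "into", "is", "it", "no", "not", "of", "on", "or", "s",
        "such", "t", "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"));

    static boolean isStopWord(String term) {
        return DEFAULT_STOPS.contains(term.toLowerCase());
    }
}
```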
RE: Query grouping
Abhay, This query is processed by the query parser... ((+SEARCH_NAME:dvd +SEARCH_NAME:cd ) OR (+DEF_DOC_FIELD:dvd +DEF_DOC_FIELD:cd )) AND ((-SEARCH_NAME:player) OR (-DEF_DOC_FIELD:player)) and comes out looking like... +((+SEARCH_NAME:dvd +SEARCH_NAME:cd) (+DEF_DOC_FIELD:dvd +DEF_DOC_FIELD:cd)) ++((-SEARCH_NAME:player) (-DEF_DOC_FIELD:player)) using org.apache.lucene.search.Query.toString(String fieldName). I use this representation as it shows me what happened after my query was processed by the QueryParser and Analyzer, so stop words would be removed and case modified if the analyzer does such things. This part... +((+SEARCH_NAME:dvd +SEARCH_NAME:cd) (+DEF_DOC_FIELD:dvd +DEF_DOC_FIELD:cd)) will produce a set of documents as hits that have the dvd and cd terms in those fields. This part... +((-SEARCH_NAME:player) (-DEF_DOC_FIELD:player)) will always produce an empty set, so when the two sets are joined with an intersection, you will always get an empty set. The problem is that the NOT or - operator excludes documents from the set of found documents, not from the set of all documents. This is correct Lucene behavior. So, since there are no found documents in that required part of the query, your results will always be no hits. This is mentioned in the jGuru FAQ at http://www.jguru.com/faq/view.jsp?EID=593598 Rearranging the query the way you mentioned is the correct way to deal with this. Eric -Original Message- From: Abhay Saswade [mailto:[EMAIL PROTECTED]] Sent: Friday, January 03, 2003 9:07 PM To: [EMAIL PROTECTED] Subject: Re: Query grouping ... However, when I try to do this in a single query by grouping I get no results: ((+SEARCH_NAME:dvd +SEARCH_NAME:cd ) OR (+DEF_DOC_FIELD:dvd +DEF_DOC_FIELD:cd )) AND ((-SEARCH_NAME:player) OR (-DEF_DOC_FIELD:player)) I don't get any results on a single term query like this (and this explains why I am not getting any results in the above query): -SEARCH_NAME:player Is this a known issue? 
Is there any way of dealing with the above-mentioned problem other than rearranging the query like this? (+SEARCH_NAME:dvd +SEARCH_NAME:cd -SEARCH_NAME:player) OR (+DEF_DOC_FIELD:dvd +DEF_DOC_FIELD:cd -DEF_DOC_FIELD:player) Thanks Abhay From: Otis Gospodnetic [EMAIL PROTECTED] Reply-To: Lucene Users List [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Subject: Re: Query grouping Date: Fri, 3 Jan 2003 12:33:53 -0800 (PST) Does the '... AND -.' part even make sense? Why not just - ? Also, AND + doesn't make sense, does it? +field:term means the term has to be in the result, so AND is not really needed, is it? I am not sure if spaces after 'SEARCH_NAME:' make a difference or not. Also, field:term1 field:term2 implies term1 OR term2, so no need for OR there, especially with +, I think. Otis --- Abhay Saswade [EMAIL PROTECTED] wrote: I am using lucene release 1.2. I am using StandardAnalyzer. Has anybody faced this problem? I get the same results when I run the following queries 1. (+SEARCH_NAME:jhon +SEARCH_NAME:joy) AND -SEARCH_NAME:chan 2. (+SEARCH_NAME:jhon AND +SEARCH_NAME: joy) AND -SEARCH_NAME:chan 3. (+SEARCH_NAME:jhon OR +SEARCH_NAME: joy) AND -SEARCH_NAME:chan But when I regroup the query by putting brackets around the last term as mentioned below, I don't get any results 1. (+SEARCH_NAME:jhon +SEARCH_NAME: joy) AND (-SEARCH_NAME:chan) 2. (+SEARCH_NAME:jhon AND +SEARCH_NAME: joy) AND (-SEARCH_NAME:chan) 3. (+SEARCH_NAME:jhon OR +SEARCH_NAME: joy) AND (-SEARCH_NAME:chan) This is just an example. I need to do grouping on various fields. Am I missing something? Is there any document other than http://jakarta.apache.org/lucene/docs/queryparsersyntax.html? Can somebody throw some light on this? Thanks, Abhay
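The set semantics Eric describes can be sketched with plain Java sets (an illustrative stand-in for Lucene's scoring, with integers standing in for document ids): a prohibited clause removes documents from the set its own group found, not from the set of all documents, so a group containing only prohibited clauses starts from the empty set.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the set semantics described above: a group's hits are the
// documents its non-prohibited clauses found, minus the prohibited hits.
class NegationDemo {
    static Set<Integer> groupHits(Set<Integer> found, Set<Integer> prohibited) {
        Set<Integer> out = new HashSet<Integer>(found);
        out.removeAll(prohibited); // "-" subtracts from the FOUND set only
        return out;
    }
}
```

A group like (-SEARCH_NAME:player) has an empty "found" set, so it yields the empty set no matter what it prohibits; intersecting anything with it then gives no hits, which is why rearranging the query is the fix.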
RE: package information?
I think this info is available via the Manifest that is created during the build. This is cut from the build.xml in the latest CVS...

<!-- Create Jar MANIFEST file -->
<echo file="${build.manifest}">Manifest-Version: 1.0
Created-By: Apache Jakarta
Name: org/apache/lucene
Specification-Title: Lucene Search Engine
Specification-Version: ${version}
Specification-Vendor: Lucene
Implementation-Title: org.apache.lucene
Implementation-Version: build ${DSTAMP} ${TSTAMP}
Implementation-Vendor: Lucene
</echo>

This is only added to the core jar; there is no such Manifest generated for the demo jar. Eric -Original Message- From: petite_abeille [mailto:[EMAIL PROTECTED]] Sent: Friday, December 20, 2002 3:04 PM To: [EMAIL PROTECTED] Subject: package information? Hi, Would it be possible for Lucene to provide package information? Basically all the java.lang.Package attributes... things like implementation vendor, name, version and so on... This would make it easier to identify which packages/versions are used. Thanks. PA.
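Those manifest entries are exactly what java.lang.Package exposes at run time. A sketch of reading them (the helper class is invented; for Lucene the package name would be "org.apache.lucene", and note that Package.getPackage returns null until a class from that jar has actually been loaded):

```java
// Sketch: reading manifest-backed attributes at run time via
// java.lang.Package. Demonstrated with describe(), which tolerates the
// null that Package.getPackage returns for not-yet-loaded packages.
class PackageInfo {
    static String describe(Package p) {
        if (p == null) return "unknown";
        return p.getName() + " spec=" + p.getSpecificationTitle()
             + " version=" + p.getSpecificationVersion();
    }
}
```

For example, describe(String.class.getPackage()) reports on java.lang, whose attributes come from the runtime's own manifest.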
RE: help w/ phrase query
Dominic, Are you constructing the PhraseQuery directly, using its add(Term) method to add terms to the query? If so, you need to make sure your terms go through the same normalization (via the Analyzer) that your content went through when you created your index. So if the field you are querying was created in your index using StandardAnalyzer, the terms in your query should also be run through StandardAnalyzer. Does this help? If not, give us a little more detail about what Analyzer you are using to create your index and how you are creating your PhraseQuery object. Eric -Original Message- From: host unknown [mailto:[EMAIL PROTECTED]] Sent: Friday, December 13, 2002 1:17 PM To: [EMAIL PROTECTED] Subject: help w/ phrase query Hi All. I'm out of ideas on how to get the PhraseQuery to return any results. I'm guessing I might not be indexing properly when the document data is being stored. Is there any particular Field type that should be used? I've tried both Field.Text(String, String) and Field.Text(String, Reader). If Field type is irrelevant, any pointers on where to look next are appreciated. Dominic madison.com
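The mismatch Eric suspects can be sketched without Lucene: if the index-time analyzer lowercased the tokens, raw-cased terms added straight to a PhraseQuery can never match. The helper below is an invented illustration (a crude stand-in for an analyzer), with strings standing in for Lucene Term objects:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: normalize phrase terms the same way the indexing analyzer did
// (here, just whitespace-split and lowercase, a crude stand-in for
// StandardAnalyzer). Terms added to a PhraseQuery must match the indexed
// tokens byte-for-byte.
class PhraseTerms {
    static List<String> normalize(String phrase) {
        List<String> terms = new ArrayList<String>();
        for (String t : phrase.split("\\s+"))
            if (t.length() > 0) terms.add(t.toLowerCase());
        return terms;
    }
}
```

So a phrase typed as "New York" must be added to the query as the terms "new" and "york", matching what the analyzer stored, or the PhraseQuery returns nothing.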
RE: Accentuated characters
I don't know if any of the code in this French analyzer contributed by Patrick Talbot may apply; any reason you don't just use it? See http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]msgNo=870 Eric -- Eric D. Isakson, Application Developer, XML Technologies, SAS Institute Inc., SAS Campus Drive, Cary, NC 27513, (919) 531-3639, http://www.sas.com -Original Message- From: stephane vaucher [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 10, 2002 2:58 PM To: [EMAIL PROTECTED] Subject: Accentuated characters Hello everyone, I wish to implement a TokenFilter that will remove accentuated characters, so for example 'é' will become 'e'. As I would rather not reinvent the wheel, I've tried to find something on the web and in the mailing list archives. I saw a mention of a contribution that could do this (see http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html), but I don't see anything applicable. Has anyone done this yet? If so, I would much appreciate some pointers (or code); otherwise, I'll be happy to contribute whatever I produce (but it might be very simple since I'll only need to deal with French). Cheers, Stephane
RE: Accentuated characters
If you really want to make your own TokenFilter, have a look at org.apache.lucene.analysis.LowerCaseFilter.next(); it does:

public final Token next() throws java.io.IOException {
    Token t = input.next();
    if (t == null)
        return null;
    t.termText = t.termText.toLowerCase();
    return t;
}

The termText member of the Token class is package scoped, so you will have to implement your filter in the org.apache.lucene.analysis package. No worries about encoding, as the termText is already a Java (Unicode) string. You will just have to provide the mechanism to get the accented characters converted to their non-accented equivalents. java.text.Collator has some magic that does this for string comparisons, but I couldn't find any public methods that give you access to convert a string to its non-accented equivalent. Eric -- Eric D. Isakson, Application Developer, XML Technologies, SAS Institute Inc., SAS Campus Drive, Cary, NC 27513, (919) 531-3639, http://www.sas.com
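The missing piece Eric mentions, converting a string to its non-accented equivalent, is solved in later Java versions by java.text.Normalizer (added in Java 6, well after this thread): decompose to NFD and strip the combining marks. A sketch of the filter's core transformation, with the surrounding Lucene TokenFilter boilerplate omitted:

```java
import java.text.Normalizer;

// Sketch: strip accents by decomposing to NFD (base character + combining
// mark) and removing the combining marks, so 'é' becomes 'e'. This is the
// transformation a TokenFilter would apply to each token's text.
class Accents {
    static String strip(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", ""); // drop combining marks
    }
}
```

For example, strip("café") yields "cafe", which is exactly the behavior Stephane asked for.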