RE: Dates and others
Hi guys -

So I am getting happier with search, and just pushed the Lucene version live at:
http://www.theserverside.com (on the leftbar)
and:
http://www.theserverside.com/home/search/index.jsp

The only real item that I still want to tweak more is getting recent results higher in the list.

I was wondering if something like this could work (or if there is a better solution):

At index time, I have the date of the content. I could do some math where the higher the date (based on the time_t version or whatever), the larger the value passed to setBoost(). Or, for every month in the past, create a larger negative number to pass to setBoost()... or something like that.

Would something like this make sense?

Dion

> -----Original Message-----
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Sunday, November 23, 2003 3:52 PM
> To: Lucene Users List
> Subject: Re: Dates and others
>
> On Saturday, November 22, 2003, at 06:33 PM, Dion Almaer wrote:
> > 3. I have some fields such as title, owner, etc. as well as the content
> > blob which I index and use as the default search field. Is there an
> > easy way to extend the QueryParser to merge it with a MultiTermQuery
> > which can also search this meta data and give them certain weights?
> > Or, if you go down this path do you have to leave the QueryParser
> > behind and build your own queries? Any best practices would be great.
>
> And Ype said:
> You can provide field weights at document indexing time
> (norms) and use a MultiTermQuery for searching multiple
> fields. At query time you can again use field weights.
> I don't know how the scoring of the MultiTermQuery is done,
> it might use the max. score over the fields of a document, or
> combine the scores in the fields of a document.
> end Ype's reply cut and paste
>
> I'm a little confused with this question and Ype's reply.
> MultiTermQuery is an abstract base class under Query, which
> is the parent for WildcardQuery and FuzzyQuery.
>
> What I think you're after is using MultiFieldQueryParser, but
> you want to weight the fields differently. You can add the
> boosts at indexing time using Field.setBoost. Unfortunately
> at the moment MultiFieldQueryParser is not very extensible -
> there are some open issues with its subclassability but
> subclassing MFQP and overriding getFieldQuery will do the
> trick when the subclassing issues are resolved allowing you
> to boost at query time.
>
> Making an educated guess at what you're doing with Lucene,
> Dion, I'd venture to say that boosting at indexing time is
> sufficient for your needs.
>
> Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
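
The date-based boost idea from this thread can be sketched in plain Java. This is only an illustration under stated assumptions: the class name RecencyBoost, the boostFor method and the 180-day half-life are all made up for the example, and the code does not call Lucene itself. It assumes the boost acts as a multiplier on the score, so it produces a positive value that shrinks with age rather than a negative number.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: maps a document's age to a positive index-time boost. */
final class RecencyBoost {
    /** Assumed tuning knob: the age in days at which the boost has dropped to one half. */
    static final double HALF_LIFE_DAYS = 180.0;

    /** Returns a boost in (0, 1]: 1.0 for brand-new content, 0.5 at the half-life. */
    static float boostFor(long publishedMillis, long nowMillis) {
        // Content dated in the future is treated as brand new.
        long ageMillis = Math.max(0L, nowMillis - publishedMillis);
        double ageDays = ageMillis / (double) TimeUnit.DAYS.toMillis(1);
        return (float) Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long halfLifeAgo = now - TimeUnit.DAYS.toMillis(180);
        System.out.println("new content:      " + boostFor(now, now));
        System.out.println("180-day-old item: " + boostFor(halfLifeAgo, now));
        // At index time the result would be handed to the boost setter mentioned above.
    }
}
```

Because the value is computed once at index time, using the same reference "now" for a whole indexing run keeps the boosts of different documents comparable; a full re-index would be needed to refresh them.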
Re: log4j.properties
> java -Dlog4j.configuration=log4j.xml org.pdfbox.searchengine.lucene.IndexFiles
> -create -index c:\\index ..

Hmm, try creating log4j.xml instead of log4j.properties, since log4j.xml is what the command-line parameter specifies.

----- Original Message -----
From: "Tun Lin" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Thursday, November 27, 2003 2:05 AM
Subject: RE: log4j.properties

> I have integrated Lucene and PDFBox and tried the following command to index
> files
>
> java -Dlog4j.configuration=log4j.xml org.pdfbox.searchengine.lucene.IndexFiles
> -create -index c:\\index ..
>
> But I have the following error message:
> log4j:WARN No appenders could be found for logger (org.pdfbox.pdfparser.PDFParser).
> log4j:WARN Please initialize the log4j system properly.
>
> Anyone can help?
>
> -----Original Message-----
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, November 26, 2003 5:19 PM
> To: Lucene Users List
> Subject: Re: log4j.properties
>
> What does this have to do with Lucene?
>
> On Wednesday, November 26, 2003, at 01:04 AM, Tun Lin wrote:
>
> > I have created the following "log4j.properties" and put it in your
> > classpath but it still has that error. Anyone can help?
> >
> > log4j.rootCategory=stdout
> >
> > log4j.appender.stdout=org.apache.log4j.ConsoleAppender
> > log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
> > log4j.appender.stdout.layout.ConversionPattern=%d %c - %m%n
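
For anyone who would rather keep the properties file than write log4j.xml, here is a minimal sketch of what the file from this thread might look like. Two things are assumptions on my part: that log4j 1.x expects the root logger value to be a level followed by the appender names (so the bare "stdout" in the posted file leaves no appender attached), and that INFO is an acceptable level.

```properties
# Root logger: a level first (INFO is an assumed choice), then the appender name
log4j.rootCategory=INFO, stdout

# Console appender, unchanged from the posted file
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %c - %m%n
```

With that file on the classpath, the system property in the command would need to name it, i.e. -Dlog4j.configuration=log4j.properties rather than log4j.xml.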
Re: Collaborative Filtering API
On Tue, Nov 25, 2003 at 01:18:19PM -0500, Michael Giles wrote:
> Yes, he was the lead Ph.D. student on the GroupLens project at Minnesota.

I've actually worked on a system that bundled GroupLens. I think it was Vignette StoryServer. The Vignette docs were incredibly dense with MarketingNewSpeak, so I could never quite figure out what they said GroupLens actually *did* (not at a web-capable terminal right now, or I'd just google it).

Collaborative filtering in general is a topic I'm interested in, and is why I first got into Lucene. I wanted and still want to build a collaborative filtering search engine for mailing lists and the like.

I do remember that FireFly's engine was supposed to graph all of the users' ratings on a topic in an N-dimensional space, and then find users "close" to the same user in that N-dimensional space, and suggest topics that they'd liked, but that the current user hadn't rated.

I'm interested in more of a "free market" sort of approach than in statistical analysis; I want to build a system that helps users express their opinions, then nurture an emerging consensus. My experience has been that systems/technologies that try to facilitate the way users already do things, instead of replacing them with new ways of doing things, tend to work better.

--
Steven J. Owens
[EMAIL PROTECTED]

"I'm going to make broad, sweeping generalizations and strong, declarative statements, because otherwise I'll be here all night and this document will be four times longer and much less fun to read. Take it all with a grain of salt." - Me at http://darksleep.com
Re: Search Question - not returning desired results
On Wednesday, November 26, 2003, at 11:08 AM, Pleasant, Tracy wrote:
> But now i have another question. Let's say I have 'return_results.pl'
> in the document in one of the fields.

Actually there is a little bit more to it than understanding the analysis phase, and you were right in saying you need to understand '*' and '~' as well. More below...

> When I search for return_res* or return_res~ it won't return the
> document.

this is searching for all terms that start with "return_res", and during analysis you split this into "return" and "res", so no terms match. Same goes for the fuzzy query with ~.

> But searching for any of these does return the document:
> 1. 'return_results'
> 2. 'results' or 'return'
> 3. 'results.pl'
> 4. 'results~'
> 5. 'return_results~'

in all of these cases, you're searching for terms that got split by the analyzer on indexing (and during QueryParser analysis for "return_results", "results.pl").

Tricky stuff, eh?

Erik
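
The point about return_res* can be illustrated with a small standalone sketch. It is not Lucene code: the class PrefixDemo and its two methods are hypothetical, the tokenizer simply keeps lowercase runs of ASCII letters as a rough stand-in for a letter-based analyzer, and the prefix is assumed to be compared against the indexed terms as typed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Standalone illustration only; none of this is Lucene API. */
final class PrefixDemo {
    /** Splits text into lowercase runs of ASCII letters (a rough stand-in for a letter-based analyzer). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** True if any indexed term starts with the prefix, which is what a trailing '*' asks for. */
    static boolean prefixMatches(List<String> indexedTerms, String prefix) {
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> terms = tokenize("return_results.pl");
        System.out.println("indexed terms: " + terms);
        System.out.println("return_res* matches: " + prefixMatches(terms, "return_res"));
        System.out.println("res* matches:        " + prefixMatches(terms, "res"));
    }
}
```

Since the underscore never survives into an indexed term in this sketch, no term can start with "return_res", while a prefix such as "res" still finds "results".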
Re: Search Question - not returning desired results
On Wednesday, November 26, 2003, at 11:08 AM, Pleasant, Tracy wrote:
> But now i have another question. Let's say I have 'return_results.pl'
> in the document in one of the fields.
>
> When I search for return_res* or return_res~ it won't return the
> document.
>
> But searching for any of these does return the document:
> 1. 'return_results'
> 2. 'results' or 'return'
> 3. 'results.pl'
> 4. 'results~'
> 5. 'return_results~'
>
> I guess I have to read more about the '*' and '~'?

What does my AnalysisDemo tell you about all of the text you're feeding in here? :)) The answer lies therein!

Erik
Re: Search Question - not returning desired results
On Wednesday, November 26, 2003, at 11:33 AM, Pleasant, Tracy wrote:
> Your website says:
> org.apache.lucene.analysis.standard.StandardAnalyzer:
> [xy&z] [corporation] [EMAIL PROTECTED] [com]
>
> When I run it it keeps the entire email '[EMAIL PROTECTED]' but
> according to your website it separates the '[EMAIL PROTECTED]' from the 'com'
>
> Is there a difference between the versions of Lucene? I'm using
> 1.3rc2.

Yes, I fixed the bug in the StandardTokenizer that caused e-mail addresses to get split, but fixed it after the article was written. Good eye!
RE: Eliminating duplicate result
You are searching for the same term, and you are searching the same index twice, so it will return the same results... I don't get what you are asking.

-----Original Message-----
From: Dragan Jotanovic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 26, 2003 3:19 AM
To: Lucene Users List
Subject: Re: Eliminating duplicate result

> When you are doing two searches are you searching for two different terms?
>

No, I am searching for the same term.

What is the easiest way to eliminate duplicate documents if one is doing two searches on the same index? Has anybody done something similar?
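
If two result lists really do have to be combined, one common approach is to key each hit by a unique document identifier and keep only one entry per key. The sketch below is an illustration in plain Java, not Lucene code: the Hit class and the MergeHits.merge method are hypothetical, and it assumes every document carries a unique id (for example a stored key field) that can be read from each result.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Standalone sketch: merges two result lists and drops duplicate documents. */
final class MergeHits {
    /** Hypothetical stand-in for a search hit: a unique document key plus its score. */
    static final class Hit {
        final String id;
        final float score;

        Hit(String id, float score) {
            this.id = id;
            this.score = score;
        }
    }

    /** Keeps one hit per id (the higher-scoring one), in order of first appearance. */
    static List<Hit> merge(List<Hit> first, List<Hit> second) {
        Map<String, Hit> byId = new LinkedHashMap<>();
        addAll(byId, first);
        addAll(byId, second);
        return new ArrayList<>(byId.values());
    }

    private static void addAll(Map<String, Hit> byId, List<Hit> hits) {
        for (Hit hit : hits) {
            Hit seen = byId.get(hit.id);
            if (seen == null || hit.score > seen.score) {
                byId.put(hit.id, hit); // replacing a value keeps the original position
            }
        }
    }

    public static void main(String[] args) {
        List<Hit> a = new ArrayList<>();
        a.add(new Hit("doc1", 1.0f));
        a.add(new Hit("doc2", 0.5f));
        List<Hit> b = new ArrayList<>();
        b.add(new Hit("doc2", 0.9f));
        b.add(new Hit("doc3", 0.2f));
        for (Hit hit : merge(a, b)) {
            System.out.println(hit.id + " " + hit.score);
        }
    }
}
```

Keeping the higher score is an arbitrary choice for the example; keeping the first occurrence would work just as well when both searches score documents the same way.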
RE: Search Question - not returning desired results
Erik,

I think there may be a typo on the website.

When I run the AnalyzerDemo:

Analzying "xy&z corporation - [EMAIL PROTECTED]"
  org.apache.lucene.analysis.standard.StandardAnalyzer:
    [xy&z] [corporation] [EMAIL PROTECTED]

Your website says:

  org.apache.lucene.analysis.standard.StandardAnalyzer:
    [xy&z] [corporation] [EMAIL PROTECTED] [com]

When I run it, it keeps the entire email '[EMAIL PROTECTED]', but according to your website it separates the '[EMAIL PROTECTED]' from the 'com'.

Is there a difference between the versions of Lucene? I'm using 1.3rc2.

Plus, I think what I want is a StandardAnalyzer with a little tweaking. The simple one was fine until I realized that it doesn't do numbers, which I need as part of my search since numbers are important for what I'm doing. The standard one does numbers, but I need it to be a little different, of course.

Thanks for the site.

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 26, 2003 4:58 AM
To: Lucene Users List
Subject: Re: Search Question - not returning desired results

On Tuesday, November 25, 2003, at 12:11 PM, Pleasant, Tracy wrote:
>
> The documents I have index contain information regarding file names
> also.
>
> For instance 'return_results.pl' or something like that may be in the
> document fields.
>
> I am not understanding Lucene's way of searching:
>
> 1. If I search for 'return_results', the search does not return
> anything
> 2. If I search for 'results' or 'return', the search does not return
> anything
> 3. If I search for 'results.pl', the search does return the document
> containing 'return_results.pl'
> 4. If I search for 'results~', the search does return the document
> containing 'return_results.pl'
> 5. If I search for 'return_results~', the search does not return
> anything
>
> What is going on?
>
> I want it to return the document in all of the situations.
>
> I also don't want to have to use '~' all the time.

We sure do have a recurring theme lately :) Analysis!

Please refer to my article at java.net:

http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html

Look at the AnalysisDemo code. Copy it over and try it out on the text you're using and the Analyzer you're using. The bracketed text that comes out are the "tokens" that you can search on. It is very very important to understand this process and to really know what terms come out of text you hand it - otherwise it is a mystery why some things can be found and some things cannot despite your expectations to the contrary.

A follow-up to the Analysis is querying - and QueryParser has its own set of quirks and caveats related to how things are tokenized/analyzed. And, I've got just the follow-up article for you handy...

http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html

If you digest both of these articles (analysis one first please) then I think a lot of questions that get asked on this list will be implicitly answered. Understanding analysis is key.

Erik
RE: Search Question - not returning desired results
It seems like what I should be using is something more like a SimpleAnalyzer or StopAnalyzer. I've changed my code and the query to use SimpleAnalyzer.

But now I have another question. Let's say I have 'return_results.pl' in the document in one of the fields.

When I search for return_res* or return_res~ it won't return the document.

But searching for any of these does return the document:
1. 'return_results'
2. 'results' or 'return'
3. 'results.pl'
4. 'results~'
5. 'return_results~'

I guess I have to read more about the '*' and '~'?
RE: Search Question - not returning desired results
Thanks this helps a lot :)
RE: log4j.properties
I have integrated Lucene and PDFBox and tried the following command to index files

java -Dlog4j.configuration=log4j.xml org.pdfbox.searchengine.lucene.IndexFiles -create -index c:\\index ..

But I have the following error message:
log4j:WARN No appenders could be found for logger (org.pdfbox.pdfparser.PDFParser).
log4j:WARN Please initialize the log4j system properly.

Anyone can help?

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 26, 2003 5:19 PM
To: Lucene Users List
Subject: Re: log4j.properties

What does this have to do with Lucene?

On Wednesday, November 26, 2003, at 01:04 AM, Tun Lin wrote:

> I have created the following "log4j.properties" and put it in your
> classpath but it still has that error. Anyone can help?
>
> log4j.rootCategory=stdout
>
> log4j.appender.stdout=org.apache.log4j.ConsoleAppender
> log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
> log4j.appender.stdout.layout.ConversionPattern=%d %c - %m%n
Re: log4j.properties
As I've said previously, it's a log4j problem and not a Lucene problem; you should post there.

sv

On Wed, 26 Nov 2003, Tun Lin wrote:

> I have created the following "log4j.properties" and put it in your classpath but
> it still has that error. Anyone can help?
>
> log4j.rootCategory=stdout
>
> log4j.appender.stdout=org.apache.log4j.ConsoleAppender
> log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
> log4j.appender.stdout.layout.ConversionPattern=%d %c - %m%n
RE: Tokenizing text custom way
Do you want to define expressions, i.e. a set of terms that must be interpreted as a whole? For instance, when the Analyzer catches "time" followed by "out", it returns "time_out"?

-----Original Message-----
From: Dragan Jotanovic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 26, 2003 12:12
To: Lucene Users List
Subject: Re: Tokenizing text custom way

> You will need to write a custom analyzer. Don't worry, though it's
> quite straightforward. You will also need to write a Tokenizer, but
> Lucene helps you a lot here.

Wouldn't I achieve the same result if I index "time out" like "time_out", using StandardAnalyzer and later if I search for "time out" (inside quotes) I should get proper result, but if I search for "time" I shouldn't get result. Is this right?
Re: Tokenizing text custom way
On Wednesday, November 26, 2003, at 06:12 AM, Dragan Jotanovic wrote:
>> You will need to write a custom analyzer. Don't worry, though it's
>> quite straightforward. You will also need to write a Tokenizer, but
>> Lucene helps you a lot here.
>
> Wouldn't I achieve the same result if I index "time out" like
> "time_out", using StandardAnalyzer and later if I search for "time out"
> (inside quotes) I should get proper result, but if I search for "time"
> I shouldn't get result. Is this right?

I'm confused about what you are planning to do. Are you going to replace all spaces with an underscore before handing it to the analyzer? StandardAnalyzer will still split at the underscores though.

If you have special tokenization needs, why try to hack it somehow rather than address it cleanly in the way Lucene was designed to work?

Erik
Re: Chinese input.
Maybe this will help?

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23545

Otis

--- Tun Lin <[EMAIL PROTECTED]> wrote:
> Hi,
>
> May I know how do I analyse Chinese input from Chinese text in
> Lucene?
>
> Do I use Analyser function in Lucene? If yes, how to go about using
> it?
>
Re: Tokenizing text custom way
> You will need to write a custom analyzer. Don't worry, though it's
> quite straightforward. You will also need to write a Tokenizer, but
> Lucene helps you a lot here.

Wouldn't I achieve the same result if I index "time out" like "time_out", using StandardAnalyzer and later if I search for "time out" (inside quotes) I should get proper result, but if I search for "time" I shouldn't get result. Is this right?
Re: Search Question - not returning desired results
On Tuesday, November 25, 2003, at 12:11 PM, Pleasant, Tracy wrote:
> The documents I have index contain information regarding file names
> also.
>
> For instance 'return_results.pl' or something like that may be in the
> document fields.
>
> I am not understanding Lucene's way of searching:
>
> 1. If I search for 'return_results', the search does not return
> anything
> 2. If I search for 'results' or 'return', the search does not return
> anything
> 3. If I search for 'results.pl', the search does return the document
> containing 'return_results.pl'
> 4. If I search for 'results~', the search does return the document
> containing 'return_results.pl'
> 5. If I search for 'return_results~', the search does not return
> anything
>
> What is going on?
>
> I want it to return the document in all of the situations.
>
> I also don't want to have to use '~' all the time.

We sure do have a recurring theme lately :) Analysis!

Please refer to my article at java.net:

http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html

Look at the AnalysisDemo code. Copy it over and try it out on the text you're using and the Analyzer you're using. The bracketed text that comes out are the "tokens" that you can search on. It is very very important to understand this process and to really know what terms come out of text you hand it - otherwise it is a mystery why some things can be found and some things cannot despite your expectations to the contrary.

A follow-up to the Analysis is querying - and QueryParser has its own set of quirks and caveats related to how things are tokenized/analyzed. And, I've got just the follow-up article for you handy...

http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html

If you digest both of these articles (analysis one first please) then I think a lot of questions that get asked on this list will be implicitly answered. Understanding analysis is key.

Erik
Re: Tokenizing text custom way
On Tuesday, November 25, 2003, at 06:41 AM, Dragan Jotanovic wrote:
> Hi. I need to tokenize text while indexing but I don't want space to be
> delimiter. Delimiter should be my custom character (for example comma).
> I understand that I would probably need to implement my own analyzer,
> but could someone help me where to start. Is there any other way to do
> this without writing custom analyzer?

You will need to write a custom analyzer. Don't worry, though it's quite straightforward. You will also need to write a Tokenizer, but Lucene helps you a lot here.

Lucene's LetterTokenizer is simply this:

    public class LetterTokenizer extends CharTokenizer {
      /** Construct a new LetterTokenizer. */
      public LetterTokenizer(Reader in) {
        super(in);
      }

      /** Collects only characters which satisfy
       * {@link Character#isLetter(char)}.*/
      protected boolean isTokenChar(char c) {
        return Character.isLetter(c);
      }
    }

You could change the isTokenChar method in your custom CommaTokenizer to only return true if the character is not a ','. And you might want to implement the normalize method to lowercase (look at LowerCaseTokenizer).

My advice is for you to check out Lucene's source code in the TokenStream hierarchy (ctrl-H in IntelliJ is quite nice! :). CharTokenizer seems a good starting point for you.

Then have a look at SimpleAnalyzer:

    public final class SimpleAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseTokenizer(reader);
      }
    }

Just create your own CommaAnalyzer that uses your CommaTokenizer similar to this.

Have a look at my java.net article and try the sample code provided there to observe the analysis process in greater detail so you can check that you get what you expect.

> and if I enter 'time' as a search word, I don't want to get "time out"
> in results. I need exact keyword matching. I would achieve this if I
> tokenize "time out" as one token while indexing.

It will be a little trickier on the query part if you're using QueryParser - you will need to double-quote "time out" for it to work, I believe - but don't worry about this until you get the analysis phase worked out, and then we can revisit the QueryParser issue.

Erik
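
To see what comma-based tokenization would produce before writing the real classes, the idea can be tried in a standalone sketch. This is not the CommaTokenizer itself and extends no Lucene class: the class CommaTokens is hypothetical, its isTokenChar and normalize methods only mirror the two hooks described in the message above, and trimming the whitespace around each keyword is an extra assumption of mine.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone sketch of the isTokenChar/normalize idea; it does not extend any Lucene class. */
final class CommaTokens {
    /** Every character except ',' belongs to a token, so spaces stay inside a keyword. */
    static boolean isTokenChar(char c) {
        return c != ',';
    }

    /** Lowercases each character, mirroring what a normalize hook would do. */
    static char normalize(char c) {
        return Character.toLowerCase(c);
    }

    /** Splits on commas only; surrounding whitespace is trimmed (an assumption of this sketch). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i <= text.length(); i++) {
            boolean atEnd = (i == text.length());
            if (!atEnd && isTokenChar(text.charAt(i))) {
                current.append(normalize(text.charAt(i)));
            } else {
                // A comma or the end of input closes the current token.
                String token = current.toString().trim();
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
                current.setLength(0);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("Time out, Expert System,timeout");
        System.out.println(tokens);
        System.out.println("contains 'time' on its own: " + tokens.contains("time"));
    }
}
```

Because "time out" stays a single token here, a lookup for the bare term "time" finds nothing, which is the exact-keyword behaviour asked for in the thread.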
Re: Tokenizing text custom way
woah, that seems like an awfully complex answer to the question of how to tokenize at a comma rather than a space! %-)

On Tuesday, November 25, 2003, at 11:48 AM, MOYSE Gilles (Cetelem) wrote:
> Hi.
>
> You should define expressions.
> To define expressions, you first have to define an expression file.
> An expression file contains one expression per line.
> For instance:
>
>     time_out
>     expert_system
>     ...
>
> You can use any character to specify the "expression link". Here, I use
> the underscore (_).
>
> Then, you have to build an expression loader. You can store expressions
> in recursive HashMaps.
> Such a HashMap must be built so that HashMap.get("word1") = HashMap, and
> (HashMap.get("word1")).get("word2") = null, if you want to code the
> expression "word1_word2".
> In other words, 'HashMap.get("a_word")' returns a HashMap containing all
> the successors of the word 'a_word'.
>
> So, if your expression file looks like this:
>
>     time_out
>     expert_system
>     expert_in_information
>
> you'll have to build a loader which returns a HashMap H so that:
>
>     H.keySet() = {"time", "expert"}
>     ((HashMap)H.get("time")).keySet = {"out"}
>     ((HashMap)H.get("time")).get("out") = null   // null indicates the end of the expression
>     ((HashMap)H.get("expert")).keySet = {"system", "in"}
>     ((HashMap)H.get("expert")).get("system") = null
>     ((HashMap)((HashMap)H.get("expert")).get("in")).keySet() = {"information"}
>     ((HashMap)((HashMap)H.get("expert")).get("in")).get("information") = null
>
> These recursive HashMaps code the following tree:
>
>     time --- out --- null
>     expert -+- system --- null
>             +- in --- information --- null
>
> Such an expression loader may be designed this way:
>
>     public static HashMap getExpressionMap( File wordfile ) {
>         HashMap result = new HashMap();
>         try {
>             String line = null;
>             LineNumberReader in = new LineNumberReader(new FileReader(wordfile));
>             HashMap hashToAdd = null;
>             while ((line = in.readLine()) != null) {
>                 if (line.startsWith(FILE_COMMENT_CHARACTER))
>                     continue;
>                 if (line.trim().length() == 0)
>                     continue;
>                 StringTokenizer stok = new StringTokenizer(line, " \t_");
>                 String curTok = "";
>                 HashMap currentHash = result;
>                 // Test whether the expression contains at least 2 words or not
>                 if (stok.countTokens() < 2) {
>                     System.err.println("Warning : '" + line + "' in file '"
>                         + wordfile.getAbsolutePath() + "' line " + in.getLineNumber()
>                         + " is not an expression.\n\tA valid expression contains at least 2 words.");
>                     continue;
>                 }
>                 while (stok.hasMoreTokens()) {
>                     curTok = stok.nextToken();
>                     // if comment at the end of the line, break
>                     if (curTok.startsWith(FILE_COMMENT_CHARACTER))
>                         break;
>                     if (stok.hasMoreTokens())
>                         hashToAdd = new HashMap(6);
>                     else
>                         hashToAdd = (HashMap)null;
>                     if (!(currentHash.containsKey(curTok)))
>                         currentHash.put(curTok, hashToAdd);
>                     currentHash = (HashMap)currentHash.get(curTok);
>                 }
>             }
>             return result;
>         }
>         // On error, use an empty table
>         catch ( Exception e ) {
>             System.err.println("While processing '" + wordfile.getAbsolutePath()
>                 + "' : " + e.getMessage());
>             e.printStackTrace();
>             return new HashMap();
>         }
>     }
>
> Then, you must build a filter with 2 FIFO stacks: one is the expression
> stack, the other is the default stack.
> Then, you define a 'curMap' variable, initially pointing to the HashMap
> returned by the ExpressionFileLoader.
>
> When you receive a token, you check whether it is null or not.
>   If it is, you check whether the default stack is empty or not.
>     If it is not, you pop a token from the default stack and you return it.
>     If it is, you return null.
>   If it is not (the token is not null), you check whether it is contained
>   in the HashMap or not (curMap.containsKey(token)).
>     If it is not contained and you were building an expression, you pop
>     all the terms in the expression stack to push them onto the default
>     stack (so as not to lose information).
>     If it is not contained and the default stack is empty, you return the
>     token.
>     If it is not contained and the default stack is not empty, you return
>     the popped token from the default stack and you push the current token.
>     If the token is contained in the curMap, then the token MAY be the
>     first element of an expression. You push the token onto the expression
>     stack, and you dive into the next level in your expression tree
>     (curMap = curMap.get("token")).
>     If the next level (now, curMap) is null, then you have completed your
>     expression. You can pop all the tokens from the expression stack to
>     concatenate them, separated by underscores, and push
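
The quoted description breaks off above, so here is a compact standalone sketch of the same idea for readers who want to experiment. It is not the code from the message: the class ExpressionMerger and its methods are hypothetical, it marks the end of an expression with a boolean flag instead of a null map, and it looks ahead over an already tokenized list rather than juggling two FIFO stacks inside a filter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Standalone sketch: joins known multi-word expressions into single tokens. */
final class ExpressionMerger {
    /** Tree node: maps the next word of an expression to the node below it. */
    static final class Node {
        final Map<String, Node> next = new HashMap<>();
        boolean endsExpression; // true if an expression ends at this node
    }

    /** Builds the tree from expressions written with '_' between words, e.g. "time_out". */
    static Node build(List<String> expressions) {
        Node root = new Node();
        for (String expression : expressions) {
            Node node = root;
            for (String word : expression.split("_")) {
                node = node.next.computeIfAbsent(word, w -> new Node());
            }
            node.endsExpression = true;
        }
        return root;
    }

    /** Replaces the longest known expression starting at each position with one joined token. */
    static List<String> merge(List<String> tokens, Node root) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Node node = root;
            int matchEnd = -1; // index of the last token of the longest match, if any
            for (int j = i; j < tokens.size(); j++) {
                node = node.next.get(tokens.get(j));
                if (node == null) {
                    break;
                }
                if (node.endsExpression) {
                    matchEnd = j;
                }
            }
            if (matchEnd >= i) {
                out.add(String.join("_", tokens.subList(i, matchEnd + 1)));
                i = matchEnd + 1;
            } else {
                out.add(tokens.get(i)); // no expression starts here; pass the token through
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> expressions = new ArrayList<>();
        expressions.add("time_out");
        expressions.add("expert_system");
        expressions.add("expert_in_information");
        List<String> tokens = new ArrayList<>();
        for (String word : "the expert in information hit a time out".split(" ")) {
            tokens.add(word);
        }
        System.out.println(merge(tokens, build(expressions)));
    }
}
```

A partial match such as "expert in" followed by an unrelated word falls back to emitting the individual tokens, which corresponds to the "so as not to lose information" step in the description.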
Re: log4j.properties
What does this have to do with Lucene?

On Wednesday, November 26, 2003, at 01:04 AM, Tun Lin wrote:
> I have created the following "log4j.properties" and put it in your
> classpath but it still has that error. Anyone can help?
>
> log4j.rootCategory=stdout
>
> log4j.appender.stdout=org.apache.log4j.ConsoleAppender
> log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
> log4j.appender.stdout.layout.ConversionPattern=%d %c - %m%n
Re: unexpected results from query
On Tuesday, November 25, 2003, at 10:45 PM, marc wrote:
> Hi,
>
> assume a field has the following text
>
> "Adenylate kinase (mitochondrial GTP:AMP phosphotransferase) "
>
> the following searches all return this document
>
> AMP
> &
> &
>
> can someone explain this to me..i figured that only the first query
> would be successful

This depends on the Analyzer you're using. I'm assuming you're using the QueryParser and an analyzer that rips off special characters - so essentially the TermQuery underneath is always for AMP.

Have a look at my first java.net article, which shows the analysis process. Run your sample text through the code provided there to see the effect first-hand.

Erik
Re: Eliminating duplicate result
> When you are doing two searches are you searching for two different terms?
>

No, I am searching for the same term.