Re: IN Query for NumericFields
I suspect he's running the query through an analyzer that is dropping out single-digit numerics, which would basically be a query that pulls back everything from the indexes... or at least I think so.

Uwe Schindler wrote:
Sorry, if you have an IN query, it must be BooleanClause.Occur.SHOULD, as the CategoryID can be 1, or 3, or 7. Your query should not match any doc (I verified this).
- Uwe Schindler, H.-H.-Meier-Allee 63, D-28213 Bremen, http://www.thetaphi.de, eMail: u...@thetaphi.de

-----Original Message-----
From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: Thursday, December 10, 2009 7:03 PM
To: java-user@lucene.apache.org
Subject: RE: IN Query for NumericFields

Cannot be :-) Is the precisionStep identical?

-----Original Message-----
From: comparis.ch - Roman Baeriswyl [mailto:roman.baeris...@comparis.ch]
Sent: Thursday, December 10, 2009 5:24 PM
To: 'java-user@lucene.apache.org'
Subject: RE: IN Query for NumericFields

I tried:

Query q = new BooleanQuery();
((BooleanQuery)q).Add(NumericRangeQuery.NewLongRange("CategoryID", 1, 1, true, true), BooleanClause.Occur.MUST);
((BooleanQuery)q).Add(NumericRangeQuery.NewLongRange("CategoryID", 3, 3, true, true), BooleanClause.Occur.MUST);
((BooleanQuery)q).Add(NumericRangeQuery.NewLongRange("CategoryID", 7, 7, true, true), BooleanClause.Occur.MUST);

But that seems to match all Documents in my Index.

-----Original Message-----
From: shashi@gmail.com [mailto:shashi@gmail.com] On Behalf Of Shashi Kant
Sent: Thursday, 10 December 2009 16:40
To: java-user@lucene.apache.org
Subject: Re: IN Query for NumericFields

Have you looked at BooleanQuery? Create individual TermQuery instances and OR them together using a BooleanQuery.
On Thu, Dec 10, 2009 at 10:34 AM, comparis.ch - Roman Baeriswyl roman.baeris...@comparis.ch wrote:
Hi, I do have some indices where I need to get results based on a fixed list of numbers (not a range). Let's say I have a field named CategoryID and I now need all results where CategoryID is 1, 3 or 7. In Lucene 2.4 I created a QueryParser query which looked like: CategoryID:(1 3 7). But the QueryParser won't work with NumericFields... How can I achieve the same for NumericFields? Btw, I'm using Lucene.Net. Thanks for help // Roman

Follow comparis.ch on Twitter: http://twitter.com/comparis
Become a friend on Facebook: http://www.facebook.com/comparis.ch

--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
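For anyone following along: the SHOULD-versus-MUST distinction Uwe points out is easy to see outside Lucene. Below is a plain-Java sketch (toy data, hypothetical class name, not the Lucene API) of why three MUST clauses over a single-valued CategoryID can never all be satisfied at once, while SHOULD gives the intended IN semantics.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Toy model of BooleanQuery clause semantics: each "document" has one CategoryID.
public class InQueryDemo {
    static final Map<String, Long> DOCS = Map.of(
            "docA", 1L, "docB", 3L, "docC", 5L, "docD", 7L);

    // Occur.SHOULD: a doc matches if it satisfies at least one clause (union).
    public static Set<String> shouldMatch(List<Long> wanted) {
        return DOCS.entrySet().stream()
                   .filter(e -> wanted.contains(e.getValue()))
                   .map(Map.Entry::getKey)
                   .collect(Collectors.toSet());
    }

    // Occur.MUST: a doc must satisfy every clause (intersection). A single-valued
    // CategoryID can never equal 1 AND 3 AND 7 at the same time, so this is
    // always empty -- which is why Uwe expected the MUST version to match nothing.
    public static Set<String> mustMatch(List<Long> wanted) {
        return DOCS.entrySet().stream()
                   .filter(e -> wanted.stream().allMatch(w -> w.equals(e.getValue())))
                   .map(Map.Entry::getKey)
                   .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<Long> in = List.of(1L, 3L, 7L);
        System.out.println("SHOULD: " + shouldMatch(in)); // docA, docB, docD
        System.out.println("MUST:   " + mustMatch(in));   // empty set
    }
}
```

If the MUST version appears to match everything, as Roman reports, the clauses are probably not reaching the index as written (e.g. an analyzer dropping the single-digit terms), since their logical intersection is empty.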
Re: singular and plural search
If I recall correctly the highlighter also has an analyzer passed to it. Ensure that this is the same one as well.

Matt

m.harig wrote:
Thanks erick, it works fine if I use the same analyzer for both indexing and querying. But the highlighting is gone for plural words. I hope to search more; I'll come back to you if I can't find it out. Thanks again erick.
Re: Why does this search succeed with web app, but not Luke?
Luke defaults to KeywordAnalyzer when you do a search in it. You have to specifically choose StandardAnalyzer. You are probably already doing this, but I figure it's worth a check.

Matt

Andrzej Bialecki wrote:
oh...@cox.net wrote:
Hi Phil, Well, kind of... but... Then why, when I do the search in Luke, do I get the results I cited: == succeeds, .yyy == fails (no results)? I guess that I've been assuming that the search in Luke is correct, and I've been using that to test my understanding, but maybe that's an invalid assumption?

Luke has some bugs, that's for sure, but not as many as one would think ;) I recommend the following exercise:
* First, check what the rewritten query looks like, in both cases. This could be enlightening, because depending on the choice of default field and query analyzer, results could differ dramatically.
* Then, if a query succeeds in matching one or more documents, open one such document and view its fields using Reconstruct & Edit, especially the tokenized version of the field. At this point any potential mismatch between query terms and the analyzed tokens in the field should become apparent.
Re: Is there a way for me to handle a multiword synonym correctly?
Create a field that is specifically for this type of matching. What you could then do is, at indexing time, manipulate your data in such a way that it can be matched in a punctuation-irrelevant way. So in this field you would convert all non-letter characters into spaces, and reduce all whitespace runs to single spaces ("  " becomes " "); you could also likely lowercase it at the same time. Then at search time perform a special search against this field that does the same thing to the query string. At this point plain old phrase queries should work for you. Our corpus contains remarkably obnoxious items like: Rara^tm3.1Ipc. So we need to do very similar things to what you are describing, and the above-mentioned technique worked like a charm.

Matt

Donna L Gresh wrote:
I saw some discussion on the board but I'm not sure I've got quite the same problem. As an example, I have a query that might be a technical skill: "SAP EM FIN AM". I would like that to match a document that has *either* "SAP.EM.FIN.AM" or "SAP EM FIN AM" (in that order and all together, not spread out through the document). The approach I had tried was: at index time, if I saw "SAP.EM.FIN.AM", I would consider "SAP EM FIN AM" a synonym for it, using the Lucene in Action example. Luke shows me that I have two terms in the index for this document: "SAP.EM.FIN.AM" and "SAP EM FIN AM" (one term). Thus it appears differently in the index than if it had been organically found as just the string of tokens, in which case there would be separate terms for SAP, EM, and so on. At query time, if I look for "SAP EM FIN AM", it is formed as a phrase query with a slop of 0, which does *not* match the one-term version "SAP EM FIN AM". (For that matter, a simple boolean query doesn't find it either.) Luke confirms that the phrase query does not find my synonym term. The query "SAP EM FIN AM" finds *only* documents that originally had those separated tokens in them.
Is there a way to handle this situation such that at index time I can turn "SAP.EM.FIN.AM" into something that will be found with a query for "SAP EM FIN AM"? Thanks for any pointers.

Donna
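The normalization step Matt describes can be sketched in plain Java. This illustrates only the string transformation (class name is hypothetical), not the Lucene analyzer plumbing; if digits matter in your corpus (e.g. Rara^tm3.1Ipc), widen the character class to [^A-Za-z0-9].

```java
public class PunctuationNormalizer {
    // Collapse every run of non-letter characters to a single space,
    // trim the ends, and lowercase -- so "SAP.EM.FIN.AM" and
    // "SAP EM FIN AM" normalize to the same string.
    public static String normalize(String s) {
        return s.replaceAll("[^A-Za-z]+", " ").trim().toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(normalize("SAP.EM.FIN.AM")); // sap em fin am
        System.out.println(normalize("SAP EM FIN AM")); // sap em fin am
    }
}
```

Applying the same function to both the indexed field and the query string is what makes the plain phrase query line up.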
Re: Searching doubt
Well... search on both anyhow. "about us" OR "aboutus" should hit the spot, I think.

Matt

Ian Lea wrote:
The question was: how, given a string "aboutus" in a document, can you return that document as a result to the query "about us" (note the space)? So we're mostly discussing how to detect and then break the word "aboutus" into two words.

I haven't really been following this thread, so apologies if way off target, but reading the above makes me wonder if it can simply be reversed: remove the space from "about us" and search on that.
-- Ian.
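Ian's reversal idea can be done mechanically at query time: alongside the original phrase, also search its space-stripped form. A tiny sketch of building such a query string (hypothetical helper, not a Lucene class):

```java
public class SpaceVariantQuery {
    // "about us" should also hit documents that indexed the run-together
    // "aboutus", so emit both forms joined by OR.
    public static String expand(String phrase) {
        String joined = phrase.replaceAll("\\s+", "");
        return "\"" + phrase + "\" OR " + joined;
    }

    public static void main(String[] args) {
        System.out.println(expand("about us")); // "about us" OR aboutus
    }
}
```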
Re: indexing multiple email addresses in one field
And to address the stop word issue, you can override the stop word list that it uses. Most analyzers that use stop words (Standard included) have an option to pass in an arbitrary list of stop words, which will override the defaults. You could also just roll your own analyzer (which is what you are going to end up doing here anyhow); when you do, just don't include stop word removal in the processing of your token stream.

Matt

Phil Whelan wrote:
Hi Matthew / Paul,

On Thu, Jul 30, 2009 at 4:32 PM, Paul Cowan co...@aconex.com wrote:
Matthew Hall wrote:
Place a delimiter between the email addresses that doesn't get removed in your analyzer (preferably something you know will never be searched on).

Or add them separately. Rather than:

doc.add(new Field("email", "f...@bar.com b...@foo.com c...@bar.foo" ...);

use:

doc.add(new Field("email", "f...@bar.com" ...);
doc.add(new Field("email", "b...@foo.com" ...);
doc.add(new Field("email", "c...@bar.foo" ...);

using an Analyzer that overrides getPositionIncrementGap(). This inserts a 'gap' between each set of tokens for the same field, which stops phrase queries from 'crossing the boundaries' between subsequent values.

I like the sound of that! I think I understand it. getPositionIncrementGap() returns 0 by default, which keeps the email field tokens sequential. Overriding with 1 will add an effective blank token between the email addresses (overriding with 2 would leave 2). Similar to Matthew's delimiter token, but a bit neater. So the tokens (with positions in brackets) would look something like this:

foo(0) bar(1) com(2) bar(4) foo(5) com(6) com(8) bar(9) foo(10)

Up until now I've only been using WhitespaceAnalyzer, as I've been keeping quite tight control over the fields going into the index (not making best use of Lucene). What analyzer would you recommend I use for this? I'll also be indexing IPs, and other things, but that's pretty much the same story. It seems I have to use the same analyzer for all the fields in the index?
I've been looking at StandardAnalyzer, but I do not want to remove stop words. I want to keep letters and numbers mainly, and also override getPositionIncrementGap. Is there anything that does these things already, or close to it? Overriding getPositionIncrementGap shouldn't be difficult though.

Cheers,
Phil
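Paul's getPositionIncrementGap() trick is easiest to see by computing token positions by hand. The following plain-Java sketch (hypothetical class, a simplified tokenizer that just splits on non-alphanumerics, and stand-in addresses written out in full) reproduces Phil's position arithmetic for a gap of 1:

```java
import java.util.ArrayList;
import java.util.List;

public class PositionGapDemo {
    // Emulates the effect of Analyzer.getPositionIncrementGap(): tokens within
    // one field value are consecutive; the first token of the next value is
    // pushed "gap" extra positions forward, leaving phantom positions that a
    // zero-slop phrase query can never cross.
    public static List<Integer> positions(List<String> values, int gap) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;
        boolean first = true;
        for (String value : values) {
            if (!first) pos += gap;   // insert the gap between field values
            first = false;
            for (String tok : value.split("[^A-Za-z0-9]+")) {
                pos += 1;             // normal increment within a value
                out.add(pos);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> emails = List.of("foo@bar.com", "baz@foo.com", "com@bar.foo");
        System.out.println(positions(emails, 1)); // [0, 1, 2, 4, 5, 6, 8, 9, 10]
    }
}
```

With gap 1, positions 3 and 7 are skipped, matching the foo(0) bar(1) com(2) bar(4)... layout sketched above.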
Re: How to index IP addresses?
I'm a little unclear on how you could be getting both aa.bb.cc.dd as a term and then also the octets. Are you adding the contents field into the index multiple times, possibly with separate analyzers?

Could you possibly try a very simple test case? Just create an index with a single Lucene document, with that document's contents being aa.bb.cc.dd, and then take a look at the index via Luke again. When you look at the terms section (it's what comes up by default) you SHOULD see only aa, bb, cc, and dd as the top (and thusly ONLY) terms in the index. This could vary depending on your analyzer, as some will show an index containing only a single term, aa.bb.cc.dd. What I would not expect is an index that contains both.

Furthermore, by making the field not analyzed you will now have a trickier time searching for it, as you will need to use a KeywordAnalyzer or something similar to search, which, if I'm understanding the spirit of your problem, isn't really something that you want to do.

So, if you could run that test scenario that I've outlined, I think you should have a nice test bed for seeing what effect swapping in different analyzers has on the data that you are trying to index. Then, after you have played with that a bit, you should be able to re-expand your corpus and see if the analyzer you have chosen continues to stand up.

I... had thought that StandardAnalyzer already kept IP addresses together as a single token, but maybe it's doing something special and interesting, and thusly you are seeing the behavior that you are describing.

Matt

oh...@cox.net wrote:
Hi, Oh. Ok, thanks! I'll give that a try. Jim

Armasu wrote:
Keyword: Field.Index.NOT_ANALYZED

-----Original Message-----
From: oh...@cox.net [mailto:oh...@cox.net]
Sent: Thursday, July 30, 2009 4:36 PM
To: java-user@lucene.apache.org
Subject: How to index IP addresses?

Hi, I am trying to index information in some proprietary-formatted files.
In particular, these files contain some IP addresses in dotted notation, e.g., aa.bb.cc.dd. For my initial test, I have a Document implementation, and after I extract what I need into a String named Info, I do:

doc.add(new Field("contents", Info, Field.Store.YES, Field.Index.ANALYZED));

From looking at the resulting index using Luke, it appears that I am getting terms for the full IP address string (e.g., aa.bb.cc.dd), but I am also getting terms for each octet of each IP address string, e.g.: aa bb cc dd. I'm still just getting started with Lucene, but from the research that I've done, it seems like Lucene is treating the "." in the dotted-notation strings as noise. Is that correct? If so, is there a way to get it not to do that?

Thanks,
Jim

Amazon Development Center (Romania) S.R.L. registered office: 37 Lazar Street, floor 5, Iasi, Iasi County, Iasi 700049, Romania. Registered in Romania. Registration number J40/12967/2005.
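Jim's symptom, octet terms instead of one IP term, comes from the tokenizer treating '.' as a token break. A minimal sketch of the two behaviors (a stand-in split, not Lucene's actual tokenizer; class name hypothetical):

```java
import java.util.List;

public class IpTokenDemo {
    // A word-splitting analyzer treats '.' as a break, so each octet of a
    // dotted IP becomes its own term.
    public static List<String> splitTokens(String text) {
        return List.of(text.split("\\."));
    }

    public static void main(String[] args) {
        System.out.println(splitTokens("aa.bb.cc.dd")); // [aa, bb, cc, dd]
        // A NOT_ANALYZED field (or a keyword-style analyzer) instead keeps
        // the whole literal as one term:
        System.out.println(List.of("aa.bb.cc.dd"));     // [aa.bb.cc.dd]
    }
}
```

Running Matt's one-document experiment in Luke shows which of these two term layouts your chosen analyzer actually produces.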
Re: indexing multiple email addresses in one field
1. Sure, just have an analyzer that splits on all non-letter characters.
2. Phrase queries keep the order intact. (And yes, the positional information for the terms is kept, which is what allows span queries to work.)

So searching on the phrase "foo bar com" will match f...@bar.com but not b...@foo.com.

Matt

Phil Whelan wrote:
Hi, we have a very large Lucene index that we're developing that has a field of email addresses. (Actually multiple fields with multiple email addresses, but I'll simplify here.) Each document will have one email field containing multiple email addresses. I am indexing email addresses using only WhitespaceAnalyzer, so as to preserve the exact addresses and store multiple emails for one document. Example...

doc.add(new Field("email", "f...@bar.com b...@foo.com c...@bar.foo", Field.Store.YES, Field.Index.ANALYZED));

Terms for this document will then be...

email:f...@bar.com
email:b...@foo.com
email:c...@bar.foo

The problem I'm having is that these terms are rarely re-used in other documents. There is little overlap in email usage, and there are a lot of very long email addresses. Because of this, the number of terms in my index is very big, and I think it's causing performance issues and bloating the index. I think I'm not using Lucene optimally here. A couple of questions...

1) Is there a way I can analyze these emails down to smaller terms but still search for the exact email address? For instance, if I used a different analyzer and broke these down to the terms foo, bar, and com, is Lucene able to find email:f...@bar.com without matching email:c...@foo.bar?
2) Does Lucene retain the positional information of tokens in the index? Knowing this will help me answer question 1.
Thanks,
Phil
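Matt's phrase-order claim, and the boundary problem this thread circles around, can both be checked with a toy zero-slop phrase matcher in plain Java (hypothetical class; stand-in addresses written out in full; no position gap applied between values):

```java
import java.util.List;

public class PhraseDemo {
    // Tokenize on non-alphanumerics, with NO position gap between values.
    public static List<String> tokens(String text) {
        return List.of(text.split("[^A-Za-z0-9]+"));
    }

    // A zero-slop phrase query: the phrase terms must appear in order at
    // consecutive positions.
    public static boolean phraseMatch(List<String> toks, List<String> phrase) {
        outer:
        for (int i = 0; i + phrase.size() <= toks.size(); i++) {
            for (int j = 0; j < phrase.size(); j++) {
                if (!toks.get(i + j).equals(phrase.get(j))) continue outer;
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> toks = tokens("foo@bar.com baz@foo.com");
        System.out.println(phraseMatch(toks, List.of("foo", "bar", "com"))); // true
        System.out.println(phraseMatch(toks, List.of("bar", "foo")));        // false
        // Without a position gap or delimiter, a phrase can falsely straddle
        // two adjacent values -- the problem the gap/delimiter fixes:
        System.out.println(phraseMatch(toks, List.of("com", "baz")));        // true
    }
}
```

The last line is the false positive that Paul's getPositionIncrementGap() override (or Matt's delimiter token) is meant to rule out.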
Re: indexing multiple email addresses in one field
Place a delimiter between the email addresses that doesn't get removed in your analyzer (preferably something you know will never be searched on). That way you can ensure that each email matches independently of the others. So something like:

f...@bar.com DELIM123 b...@foo.com DELIM123 c...@bar.foo

Matt

Phil Whelan wrote:
On Thu, Jul 30, 2009 at 11:22 AM, Matthew Hall mh...@informatics.jax.org wrote:
1. Sure, just have an analyzer that splits on all non-letter characters.
2. Phrase queries keep the order intact. (And yes, the positional information for the terms is kept, which is what allows span queries to work.) So searching on the phrase "foo bar com" will match f...@bar.com but not b...@foo.com.

Thanks, I really appreciate your help with this. That's great to know. Can I take this a little further... If I have "f...@bar.com b...@foo.com c...@bar.foo" and analyze it, I get "foo bar com bar foo com com bar foo", so perhaps I need a different way of delimiting the emails, as it will match some other combinations here, e.g. f...@com.com, which is not one of the emails. Has anyone done anything similar? I can imagine that one option would be to filter the returned docs based on the original content of the string I'm analyzing. Does Lucene do this for me?

Thanks,
Phil
Re: New to Lucene - some questions about demo
Restart Tomcat. When the indexes are read in at initialization time, they are a snapshot of what the indexes contained at that moment. Unless the demo specifically either closes its IndexReader and creates a new one, or calls IndexReader.reopen periodically (which I don't remember it doing), you will not see updates in the web app until you restart.

Matt

Ohaya wrote:
Hi, I'm just starting to work with Lucene, and I guess that I learn best by working with code, so I've started with the demos in the Lucene distribution. I got IndexFiles.java and IndexHTML.java working, and also the luceneweb.war deployed to Tomcat. I used IndexFiles.java to index some text files, and then used both SearchFiles.java and the luceneweb web app to do some testing. One of the things that I noticed with the luceneweb web app is that when I searched, the search results returned a Summary of null, so I added:

doc.add(new Field("summary", "FooFoo", Field.Store.YES, Field.Index.NOT_ANALYZED));

to IndexFiles.java and ran it again. I had expected that I'd then be able to do a search for something like summary:foofoo, but when I did that, I got no results. I also tried SearchFiles.java, and again got no results. I tried using Luke, and that is showing that the summary field is in the indexes, so I'm wondering why I am not able to search on other fields such as summary, path, etc.? Can anyone explain what else I need to do, esp. in the luceneweb web app, to be able to search these other fields?

Thanks!
Jim
Re: New to Lucene - some questions about demo
Oh, also check to see which analyzer the demo webapp/indexer is using. It's entirely possible the analyzer that has been chosen isn't lowercasing input, which could also cause you issues. I'd be willing to bet your issue lies in one of these two problems I've mentioned ^^

Matt

Matthew Hall wrote:
Restart Tomcat. When the indexes are read in at initialization time, they are a snapshot of what the indexes contained at that moment. Unless the demo specifically either closes its IndexReader and creates a new one, or calls IndexReader.reopen periodically (which I don't remember it doing), you will not see updates in the web app until you restart.

Matt

Ohaya wrote:
Hi, I'm just starting to work with Lucene, and I guess that I learn best by working with code, so I've started with the demos in the Lucene distribution. I got IndexFiles.java and IndexHTML.java working, and also the luceneweb.war deployed to Tomcat. I used IndexFiles.java to index some text files, and then used both SearchFiles.java and the luceneweb web app to do some testing. One of the things that I noticed with the luceneweb web app is that when I searched, the search results returned a Summary of null, so I added:

doc.add(new Field("summary", "FooFoo", Field.Store.YES, Field.Index.NOT_ANALYZED));

to IndexFiles.java and ran it again. I had expected that I'd then be able to do a search for something like summary:foofoo, but when I did that, I got no results. I also tried SearchFiles.java, and again got no results. I tried using Luke, and that is showing that the summary field is in the indexes, so I'm wondering why I am not able to search on other fields such as summary, path, etc.? Can anyone explain what else I need to do, esp. in the luceneweb web app, to be able to search these other fields?

Thanks!
Jim
Re: New to Lucene - some questions about demo
Yeah, Ian has nailed it on the head here. Can't believe I missed it in the initial writeup.

Matt

Ian Lea wrote:
Jim, glancing at SearchFiles.java I can see:

Analyzer analyzer = new StandardAnalyzer();
...
QueryParser parser = new QueryParser(field, analyzer);
...
Query query = parser.parse(line);

so any query term you enter will be run through StandardAnalyzer, which will, amongst other things, convert it to lowercase, and it will not match the indexed value of FooFoo. If you're just playing, it would probably be easiest to tell Lucene to analyze the summary field, e.g.

doc.add(new Field("summary", "FooFoo", Field.Store.YES, Field.Index.ANALYZED));

That will cause FooFoo to be indexed as foofoo and thus should be matched on search.
-- Ian.

On Tue, Jul 28, 2009 at 2:22 PM, oh...@cox.net wrote:
Ian and Matthew, I've tried foofoo, summary:foofoo, FooFoo, and summary:FooFoo. No results returned for any of those :(. Also, Matthew, I bounced Tomcat after running IndexFiles, so I don't think that's the problem either :(... I looked at the SearchFiles.java code, and it looks like it's literally using whatever query string I'm entering (ditto for luceneweb). Is there something with the query itself that needs to be modified to support searching on fields other than the contents field (recall, I'm pretty sure that all those other fields are in the index, via Luke)?

Jim

Ian Lea ian@gmail.com wrote:
Hi, Field.Index.NOT_ANALYZED means it will be stored as-is, i.e. FooFoo in your example, and if you search for foofoo it won't match. A search for FooFoo would, assuming that your search terms are not being lowercased.
-- Ian.

On Tue, Jul 28, 2009 at 1:56 PM, Ohaya oh...@cox.net wrote:
Hi, I'm just starting to work with Lucene, and I guess that I learn best by working with code, so I've started with the demos in the Lucene distribution. I got IndexFiles.java and IndexHTML.java working, and also the luceneweb.war deployed to Tomcat.
I used IndexFiles.java to index some text files, and then used both SearchFiles.java and the luceneweb web app to do some testing. One of the things that I noticed with the luceneweb web app is that when I searched, the search results returned a Summary of null, so I added:

doc.add(new Field("summary", "FooFoo", Field.Store.YES, Field.Index.NOT_ANALYZED));

to IndexFiles.java and ran it again. I had expected that I'd then be able to do a search for something like summary:foofoo, but when I did that, I got no results. I also tried SearchFiles.java, and again got no results. I tried using Luke, and that is showing that the summary field is in the indexes, so I'm wondering why I am not able to search on other fields such as summary, path, etc.? Can anyone explain what else I need to do, esp. in the luceneweb web app, to be able to search these other fields?

Thanks!
Jim
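The NOT_ANALYZED-versus-lowercased-query mismatch that Ian diagnoses boils down to a one-line comparison. A plain-Java sketch (the lowercasing stands in for what StandardAnalyzer does to the query term; class name hypothetical):

```java
public class CaseMismatchDemo {
    // StandardAnalyzer lowercases query terms (among other things); a
    // NOT_ANALYZED field keeps the literal term. The two never line up.
    public static String analyzeLikeStandard(String term) {
        return term.toLowerCase();
    }

    public static void main(String[] args) {
        String indexedTerm = "FooFoo";                     // NOT_ANALYZED field
        String queryTerm = analyzeLikeStandard("FooFoo");  // becomes "foofoo"
        System.out.println(indexedTerm.equals(queryTerm)); // false -> no hits
    }
}
```

Indexing the field as ANALYZED (so it is stored as foofoo), as Ian suggests, makes the two sides agree again.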
Re: New to Lucene - some questions about demo
You can choose to do either. Having items in multiple fields allows you to apply field-specific boosts, thusly making matches to certain fields more important than others. But if that's not something you care about, the second technique is useful in that it vastly simplifies your index structure (and thusly your query structure). So it depends on what you want to be able to do in the end. Do you envision doing something like being able to search the summary and the contents at the same time, but weighting hits on the summary as a higher priority? If so, use multiple fields. If not, keep this first iteration in Lucene simple, and compress everything down.

Also please note that the + " " + in the example cited is important. That space will ensure that your contents and summary fields will be tokenized properly (just in case they are single words, let's say).

Matt

oh...@cox.net wrote:
Hi Matthew and Ian, thanks, I'll try that, but in the meantime I've been doing some reading (Lucene in Action), and on pg. 159, section 5.3, it discusses "Querying on multiple fields". I was just about to try what's described in that section, i.e., using MultiFieldQueryParser.parse(), or, as another note on pg. 161 mentions, doing something like:

doc.add(Field.UnStored("contents", contents + " " + summary));

So, I guess I'm a little confused (happens a lot :)!): In the situation I'm talking about (starting with the Lucene demo and demo webapp, and trying to be able to index and search more than just the contents field), do I not need to use MultiFieldQueryParser.parse() or do what they call creating a "synthetic content" field?

Thanks,
Jim

Matthew Hall mh...@informatics.jax.org wrote:
Yeah, Ian has nailed it on the head here. Can't believe I missed it in the initial writeup.

Matt

Ian Lea wrote:
Jim, glancing at SearchFiles.java I can see:

Analyzer analyzer = new StandardAnalyzer();
...
QueryParser parser = new QueryParser(field, analyzer);
...
Query query = parser.parse(line);

so any query term you enter will be run through StandardAnalyzer, which will, amongst other things, convert it to lowercase, and it will not match the indexed value of FooFoo. If you're just playing, it would probably be easiest to tell Lucene to analyze the summary field, e.g.

doc.add(new Field("summary", "FooFoo", Field.Store.YES, Field.Index.ANALYZED));

That will cause FooFoo to be indexed as foofoo and thus should be matched on search.
-- Ian.

On Tue, Jul 28, 2009 at 2:22 PM, oh...@cox.net wrote:
Ian and Matthew, I've tried foofoo, summary:foofoo, FooFoo, and summary:FooFoo. No results returned for any of those :(. Also, Matthew, I bounced Tomcat after running IndexFiles, so I don't think that's the problem either :(... I looked at the SearchFiles.java code, and it looks like it's literally using whatever query string I'm entering (ditto for luceneweb). Is there something with the query itself that needs to be modified to support searching on fields other than the contents field (recall, I'm pretty sure that all those other fields are in the index, via Luke)?

Jim

Ian Lea ian@gmail.com wrote:
Hi, Field.Index.NOT_ANALYZED means it will be stored as-is, i.e. FooFoo in your example, and if you search for foofoo it won't match. A search for FooFoo would, assuming that your search terms are not being lowercased.
-- Ian.

On Tue, Jul 28, 2009 at 1:56 PM, Ohaya oh...@cox.net wrote:
Hi, I'm just starting to work with Lucene, and I guess that I learn best by working with code, so I've started with the demos in the Lucene distribution. I got IndexFiles.java and IndexHTML.java working, and also the luceneweb.war deployed to Tomcat. I used IndexFiles.java to index some text files, and then used both SearchFiles.java and the luceneweb web app to do some testing.
One of the things that I noticed with the luceneweb web app is that when I searched, the search results returned a Summary of null, so I added:

doc.add(new Field("summary", "FooFoo", Field.Store.YES, Field.Index.NOT_ANALYZED));

to IndexFiles.java and ran it again. I had expected that I'd then be able to do a search for something like summary:foofoo, but when I did that, I got no results. I also tried SearchFiles.java, and again got no results. I tried using Luke, and that is showing that the summary field is in the indexes, so I'm wondering why I am not able to search on other fields such as summary, path, etc.? Can anyone explain what else I need to do, esp. in the luceneweb web app, to be able to search these other fields?

Thanks!
Jim
Re: New to Lucene - some questions about demo
Oh... no. If you specifically include a "fieldname:blah" clause, you don't need a MultiFieldQueryParser. The purpose of the MFQP is to turn a query like "blah" automatically into "field1:blah AND field2:blah AND field3:blah" (or OR, if you set it up that way). When you set up the MFQP, you specify which fields you want this behavior to apply to, and you can even give each field its own specific analyzer. So if in your index you have multiple fields, each of which was created with a different analyzer, you could search these effortlessly in your webapp using the MFQP. (If, for example, you have an exact_contents and a contents field, one where punctuation and capitalization matter, and one where they do not.) Hope that clears things up for you.

Matt

oh...@cox.net wrote:
Matthew, I'll keep your comments in mind, but I'm still confused about something. I currently haven't changed much in the demo, other than adding that doc.add for summary. With JUST that doc.add, having done my reading, I kind of expected NOT to be able to search on the summary at all, but it kind of seems like SOMETIMES I am still getting responses when I search on something in summary. Does that mean that Lucene will automatically do multi-field searching? Maybe I've been up too long, but it seems like, for example, when I search on summary:foofoo I am not getting a response, but, for example, if I search on summary:foofoo AND contents:test1 I get results in the search response. Since I haven't yet added the multi-field query, shouldn't it ONLY be searching on the contents field (because the summary:foofoo should have been false, and because I am using an AND)? Like I said, maybe I've been staring at this too long and need to do some more structured testing :)... Sorry.

Later,
Jim

Matthew Hall mh...@informatics.jax.org wrote:
You can choose to do either. Having items in multiple fields allows you to apply field-specific boosts, thusly making matches to certain fields more important than others.
But, if that's not something that you care about, the second technique is useful in that it vastly simplifies your index structure (and thus your query structure). So, it depends on what you want to be able to do in the end. Do you envision doing something like being able to search by the summary and the contents at the same time, but weighing hits to the summary as a higher priority? If so, use multiple fields. If not, keep this first iteration in lucene simple, and compress everything down. Also please note that the + " " + in the example cited is important. That space will ensure that your contents and summary fields will be tokenized properly. (Just in case they are single words, let's say.) Matt oh...@cox.net wrote: Hi Matthew and Ian, Thanks, I'll try that, but, in the meantime, I've been doing some reading (Lucene in Action), and on pg. 159, section 5.3, it discusses querying on multiple fields. I was just about to try what's described in that section, i.e., using MultiFieldQueryParser.parse(), or, as another note on pg. 161 mentions, doing something like: doc.add(Field.UnStored("contents", contents + " " + summary)); So, I guess I'm a little confused (happens a lot :)!): In the situation I'm talking about (starting with the Lucene demo and demo webapp, and trying to be able to index and search more than just the contents field), do I not need to use MultiFieldQueryParser.parse() or do what they call creating a synthetic content field? Thanks, Jim Matthew Hall mh...@informatics.jax.org wrote: Yeah, Ian has it nailed on the head here. Can't believe I missed it in the initial writeup. Matt Ian Lea wrote: Jim Glancing at SearchFiles.java I can see Analyzer analyzer = new StandardAnalyzer(); ... QueryParser parser = new QueryParser(field, analyzer); ... Query query = parser.parse(line); so any query term you enter will be run through StandardAnalyzer which will, amongst other things, convert it to lowercase and will not match the indexed value of FooFoo.
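The mismatch Ian describes, a lowercased query term compared against an un-analyzed indexed term, comes down to an exact string comparison. A minimal plain-Java sketch; the analyzeLikeStandard helper is a stand-in, since the real StandardAnalyzer does far more than lowercasing.

```java
// Why the demo search misses: StandardAnalyzer lowercases the query term,
// but an un-analyzed field keeps its original casing, and Lucene term
// matching is an exact comparison.
public class CaseMismatch {

    // Simplified stand-in for what StandardAnalyzer does to a query term.
    static String analyzeLikeStandard(String term) {
        return term.toLowerCase();
    }

    public static void main(String[] args) {
        String indexedTerm = "FooFoo";                     // stored un-analyzed
        String queryTerm = analyzeLikeStandard("FooFoo");  // becomes "foofoo"
        System.out.println(queryTerm.equals(indexedTerm)); // false -> no hit
    }
}
```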
If you're just playing, it would probably be easiest to tell lucene to analyze the summary field, e.g. doc.add(new Field("summary", "FooFoo", Field.Store.YES, Field.Index.ANALYZED)); That will cause FooFoo to be indexed as foofoo and thus should be matched on search. -- Ian. On Tue, Jul 28, 2009 at 2:22 PM, oh...@cox.net wrote: Ian and Matthew, I've tried foofoo, summary:foofoo, FooFoo, and summary:FooFoo. No results returned for any of those :(. Also, Matthew, I bounced Tomcat after running IndexFiles, so I don't think that's the problem either :(... I looked at the SearchFiles.java code, and it looks like it's literally using whatever query string I'm entering (ditto for luceneweb). Is there something with the query itself that needs to be modified to support searching on fields other than the contents field (recall, I'm
Re: Batch searching
This was at least one of the threads that was bouncing around... I'm fairly sure there were others as well. Hopefully it's worth the read to you ^^ http://www.opensubscriber.com/message/java-...@lucene.apache.org/11079539.html Phil Whelan wrote: On Wed, Jul 22, 2009 at 12:28 PM, Matthew Hall mh...@informatics.jax.org wrote: Not sure if this helps you, but some of the issues you are facing seem similar to those in the real time search threads. Hi Matthew, Do you have a pointer to where to go to see the real time threads? Thanks, Phil -- Matthew Hall Software Engineer Mouse Genome Informatics mh...@informatics.jax.org (207) 288-6012
Re: Combining hits
Erm.. I have to be missing something here, wouldn't you be able to just do the following: do a search on Term 1 AND Term 2, then do a search on Term 1 AND Term 2 AND Term 3. This would ensure that you have two result objects back, one of which is guaranteed to be a subset of the other. Then, when you are iterating on your documents to do your highlighting over the results from the first search (at least I think that's what you are doing here), check to see if the current document exists in the hits or topDocs object that came from the second search. If it does, use the three term highlighter; if it doesn't, use the two term highlighter. But, what sort of reordering are you trying to do here anyhow? Doing just a normal search against Term 1 OR Term 2 OR Term 3 with a standard highlighter would most likely get you ... well, exactly the same results as what you are describing. The only real difference I could see is the order that the documents are returned to you. Matt Max Lynch wrote: Hi, I am doing a search on my index for a query like this: query = "Term 1" "Term 2" "Term 3" where I want to find Term 1, Term 2 and Term 3 in the index. However, I only want to search for Term 3 if I find Term 1 and Term 2 first, to avoid doing processing on hits that only contain Term 3. To do this, I was thinking of doing a search for "Term 1" "Term 2" and then, if there are hits for these terms, I would do another search for Term 3 on these resulting documents. I am running a background search so I am not too worried about performance issues caused by searching twice. Is there a way to search on a subset of documents and then combine the hits for the document? For example, if Term 1 and Term 2 are found in Document1, and Term 3 is also later found in Document1, I want to be able to process the hits on my highlighter as containing all three terms. Sorry if it's confusing.
Thanks, Max
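The subset check Matt describes, collect the doc ids from the narrower three-term search into a set, then pick a highlighter per document while iterating the broader two-term results, can be sketched in plain Java. The doc ids and the chooseHighlighter helper are illustrative; in real code the ids would come from the second search's Hits or TopDocs.

```java
import java.util.HashSet;
import java.util.Set;

// Pick the right highlighter per document by checking membership in the
// result set of the narrower (three-term) search.
public class SubsetHighlighting {

    static String chooseHighlighter(int docId, Set<Integer> threeTermHits) {
        return threeTermHits.contains(docId) ? "three-term" : "two-term";
    }

    public static void main(String[] args) {
        Set<Integer> threeTermHits = new HashSet<>();
        threeTermHits.add(7);                 // doc 7 matched all three terms
        int[] twoTermHits = {3, 7, 12};       // superset from the broader query
        for (int doc : twoTermHits) {
            // 3 -> two-term, 7 -> three-term, 12 -> two-term
            System.out.println(doc + " -> " + chooseHighlighter(doc, threeTermHits));
        }
    }
}
```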
Re: Combining hits
Looking at what you wrote: I am doing a weighting system where I rank documents that have Term 1 AND Term 2 AND Term 3 more highly than documents that have just Term 1 AND Term 2, and more highly than documents that just have Term 1 OR Term 2 but not both. Couldn't you maybe get the same effect using some clever term boosting? I.. think something like "Term 1" OR "Term 2" OR "Term 3"^0.25 would return in almost the exact order that you are asking for here, with the only real difference being that you would have some matches for only Term 3 way, way at the bottom of your list score wise. It might be worth investigating something like this, where you cut off displaying documents that don't match a certain score threshold, thus cutting out the matches that you don't want (the Term 3 only ones).
Re: Batch searching
If you did this, wouldn't you be binding the processing of the results of all queries to that of the slowest performing one within the collection? I'm guessing you are trying for some sort of performance benefit by batch processing, but I question whether you will actually get more performance that way than by performing your queries in a threaded environment and then processing their results as they come in. Could you give a bit more description of what you are actually trying to accomplish? I'm sure this list could help better if we had more information. Matt tsuraan wrote: If I understand lucene correctly, when doing multiple simultaneous searches on the same IndexSearcher, they will basically all do their own index scans and collect results independently. If that's correct, is there a way to batch searches together, so only one index scan is done? What I'd like is a Searcher.search(Query[], Collector[]) type function, where the search only scans over the index once for each collection of (basically unrelated) searches. Is that possible, or does that even make sense?
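The threaded alternative Matt suggests, submit each query independently and consume results as they finish rather than waiting on the slowest of a batch, can be sketched with an ExecutorCompletionService. The string-building task is a stand-in for an actual searcher.search call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Run each "search" as its own task; the completion service hands back
// results in the order they finish, not the order they were submitted.
public class ParallelSearches {

    static List<String> runAll(String[] queries) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CompletionService<String> done = new ExecutorCompletionService<>(pool);
        try {
            for (String q : queries) {
                done.submit(() -> q + ": hits"); // stand-in for searcher.search(q)
            }
            List<String> results = new ArrayList<>();
            for (int i = 0; i < queries.length; i++) {
                results.add(done.take().get()); // fastest results arrive first
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(new String[]{"q1", "q2", "q3"}));
    }
}
```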
Re: Batch searching
Out of curiosity, what is the size of your corpus? How much and how quickly do you expect it to grow? I'm just trying to make sure that we are all on the same page here ^^ I can see the benefits of doing what you are describing with a very large corpus that is expected to grow at a quick rate, but if that's not really your use case, then perhaps it might be worth investigating whether a simpler solution would serve you just as well. In the example you provided, you are only talking about searching against 1M documents, which I can guarantee will search with VERY good performance in a single properly set up lucene index. Now if we are talking more on the order of... 100M or more documents, you may be onto something. Well, those are my thoughts anyhow. Matt tsuraan wrote: If you did this, wouldn't you be binding the processing of the results of all queries to that of the slowest performing one within the collection? I would imagine it would, but I haven't seen too much variance between lucene query speeds in our data. I'm guessing you are trying for some sort of performance benefit by batch processing, but I question whether or not you will actually get more performance by performing your queries in a threaded type environment, and then processing their results as they come in. Could you give a bit more description about what you are actually trying to accomplish, I'm sure this list could help better if we had more information. What I'd like to do is build lots of small indices (a few thousand documents per index) and put them into HDFS for search distribution. We already have our own map-reduce framework for searching, but HDFS seems to be a really good fit for an actual storage mechanism. My concern is that when we have one searcher using thousands of HDFS-backed indices, the seeking might get a bit nasty.
HDFS apparently has pretty good seeking performance, but it really looks like it was designed for streaming, so if I could make my searches use sequential index access, I would expect better performance than having a ton of simultaneous searches making HDFS seek all over the place.
Re: Batch searching
Not sure if this helps you, but some of the issues you are facing seem similar to those in the real time search threads. Basically their problem involves indexing twitter and the blogosphere, and making lucene work for super large data sets like that. Perhaps some of the discussion in those threads could help. I'd imagine they went over things like massively distributed searches, and other things that might be of interest to you. Sorry I can't be of more help than that. Matt tsuraan wrote: Out of curiosity, what is the size of your corpus? How much and how quickly do you expect it to grow? In terms of lucene documents, we tend to have in the 10M-100M range. Currently we use merging to make larger indices from smaller ones, so a single index can have a lot of documents in it, but merging takes a long time, so I'm trying to test out just using a ton of tiny indices to see if the search penalty from doing that is worth the time savings from not having to build and optimize large indices. I'm just trying to make sure that we are all on the same page here ^^ I can see the benefits of doing what you are describing with a very large corpus that is expected to grow at a quick rate, but if that's not really your use case, then perhaps it might be worth investigating if a simpler solution would serve you just as well. The indices also grow pretty quickly. We have some cases where we get nearly a million new documents per day. I haven't looked at those machines for quite a while, but I guess they'd probably have well over a hundred million documents, and are still growing. We also don't have a lot of simultaneous searches yet, but that's changing, so I'm getting concerned about how well that's being handled. We expect that we will soon be dealing with tens to hundreds of searches being executed simultaneously.
In the example you provided, you are only talking about searching against 1M documents, which I can guarantee will search with VERY good performance in a single properly set up lucene index. Now if we are talking more on the order of... 100M or more documents, you may be onto something. But in any case, there isn't currently any framework for making multiple searches simultaneously use an index in a coordinated fashion. I was pretty much just planning on adding it to my tests if such a thing existed. Since it doesn't, I guess I'll stick to searching in parallel and hoping that the Linux VFS layer is smart enough to keep things fast until I have time to try putting something together myself.
Re: Tokenizer question: how can I force ? and ! to be separate tokens?
I'd think extending WhitespaceTokenizer would be a good place to start. Then create a new Analyzer that exactly mirrors your current Analyzer, with the exception that it uses your new tokenizer instead of WhitespaceTokenizer (well.. there is of course my assumption that you are using an Analyzer that already uses WhitespaceTokenizer... but you likely are). OBender wrote: Hi All, I need to make the ? and ! characters into separate tokens, e.g. to split [how are you?] into 4 tokens [how], [are], [you] and [?]. What would be the best way to do this? Thanks
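A plain-Java sketch of the token stream the custom tokenizer should produce: split on whitespace as before, but emit ? and ! as their own tokens. A real implementation would subclass Lucene's Tokenizer as Matt suggests; the regex here only illustrates the splitting rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Tokenize on whitespace, treating '?' and '!' as standalone tokens.
public class PunctuationSplit {

    // Either a single ? or !, or a run of characters that is neither
    // whitespace nor one of those two punctuation marks.
    private static final Pattern TOKEN = Pattern.compile("[?!]|[^\\s?!]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("how are you?")); // [how, are, you, ?]
    }
}
```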
Re: Search in non-linguistic text
Assuming your dataset isn't incredibly large, I think you could.. cheat here, and optimize your data for searching. Am I correct in assuming that BC should also match on ABCD? If so, then yes, your current thoughts on the problems that you face are correct, and everything you do will be turning into a contains search, which is, yes.. not the best performance you have ever seen. However, knowing this, you can manipulate your data in such a way that you can get around that limitation, and turn everything into a prefix (or postfix) search if you so prefer. So here's what you do: when you are indexing the term ABCD, you are actually going to add several documents into the index (or into various special purpose indexes, if you so prefer.. but more on that later on). Let's say you want to turn everything into a prefix search under the covers. In the index you would store the following values, all of which point at the document ABCD: 'ABCD' 'BCD' 'CD' 'D' Then, when you do your search for the terms BC you will really be searching on BC*, which will produce a match to the second document. Now Lucene documents can be considered as giant data holding objects; you can and SHOULD have fields in the document that are not used at search time, but ARE used at display generation time (or whatever layer feeds your display, if you are going in a more OO fashion). Now this technique isn't without its drawbacks of course, you will see an increase in your index size, but unless you are playing around with some VERY large datasets that really shouldn't matter. Now, if I was the one implementing this, I would probably make at least two indexes, one for exact punctuation relevant data. The other index would contain the data that I've described above, with one important difference: any and all punctuation (including whitespace) has been removed, and all of the letters in your codes were collapsed down into a single word.
That way you can perform two searches, and ensure that exact punctuation relevant matches will appear higher in your results list than non punctuation relevant ones. Anyhow, that's pretty much it in a nutshell. I think this technique should work for you, after you have decided JesL wrote: Hello, Are there any suggestions / best practices for using Lucene for searching non-linguistic text? What I mean by non-linguistic is that it's not English or any other language, but rather product codes. This is presenting some interesting challenges. Among them are the need for pretty lax wildcard searches. For example, ABC should match on ABCD, but so should BCD. Also, it needs to be agnostic to special characters. So, ABC/D should match ABCD as well as ABC-D or ABC D. As I write an analyzer to handle these cases, I seem to be pretty quickly degrading into a like '%blah%' search, with rules to treat all special characters as single-character, optional wildcards. I'm concerned that the performance of this will be disappointing, though. Any help would be much appreciated. Thanks! - Jes
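The suffix trick Matt describes can be sketched in plain Java: generate every suffix of a code at index time, so a "contains" search ("BC" should hit "ABCD") becomes a prefix search ("BC*") against one of the stored suffixes. The wiring of these values into Lucene fields is omitted here.

```java
import java.util.ArrayList;
import java.util.List;

// Generate the values to index for one code, so a contains search can be
// rewritten as a prefix search over the suffixes.
public class SuffixExpansion {

    static List<String> suffixes(String code) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < code.length(); i++) {
            out.add(code.substring(i));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(suffixes("ABCD")); // [ABCD, BCD, CD, D]
        // A query for "BC" is rewritten to the prefix query "BC*", which
        // matches the suffix "BCD" and thus the document for ABCD.
        System.out.println(suffixes("ABCD").stream().anyMatch(s -> s.startsWith("BC"))); // true
    }
}
```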
Re: Ugh
They are upgrading our mail servers here, so if you are seeing.. many MANY duplicates of things I posted.. I'm really sorry about that. T_T Matt -- Matthew Hall Software Engineer Mouse Genome Informatics mh...@informatics.jax.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Highlighter fails using JapaneseAnalyzer
Out of curiosity, when you try your other test string "aaa _bbb ccc", what do the token byte offsets show? Matt Mark Harwood wrote: On 1 Jul 2009, at 17:39, k.sayama wrote: I could verify token byte offsets. The system outputs aaa:0:3 bbb:0:3 ccc:4:7 That explains the highlighter behaviour. Clearly BBB is not at position 0-3 in the String you supplied: String CONTENTS = "AAA :BBB CCC"; Looks like the Tokenizer needs fixing. Is this yours or a standard Lucene class? If the latter, raising a JIRA bug with a Junit test would be the best way to get things moving. Cheers Mark
Re: Highlighter fails using JapaneseAnalyzer
Does the same thing happen when you use SimpleAnalyzer, or StandardAnalyzer? I have a sneaking suspicion that the : in your contents string is what's causing your issue here, as : is a reserved character that denotes a field specification. But I could be wrong. Try swapping analyzers; if you no longer have the same issue with Simple, try Standard. Assuming the same problem shows up there, I think you might need to do something about the :. Matt k.sayama wrote: hello. I've tried to highlight a string using Highlighter (2.4.1) and JapaneseAnalyzer, but the following code extract shows the problem: String F = "f"; String CONTENTS = "AAA :BBB CCC"; JapaneseAnalyzer analyzer = new JapaneseAnalyzer(); QueryParser qp = new QueryParser( F, analyzer ); Query query = qp.parse( "BBB" ); Highlighter h = new Highlighter( new QueryScorer( query, F ) ); System.out.println( h.getBestFragment( analyzer, F, CONTENTS ) ); The system outputs <B>AAA</B> :BBB CCC When you change CONTENTS to "AAA _BBB CCC" the system outputs AAA _/B CCC Are there any problems? Thanks in advance
Re: Query which gives high score proportional to 'distinct term matches'
Well, we have a very similar requirement here, but for us it's every single field that we wanted this kind of behavior on. We got it by eliminating the TF (term frequency) contribution to the score via a custom Similarity (which is very easy to do). I... think in the newer versions of lucene you can omit TF more programmatically at query time, but I don't recall if you could do it on a per field basis. Anyone else want to speak on this a bit better? Matt chandrakant k wrote: I have an index which has got fields like title : content : If I search for, let's say, obama fly, then the documents having obama and fly should be given high scores irrespective of the number of times they may occur. This requirement is for the fields title and content. The implementation which I did with a simple OR query will score highly the documents having, e.g., more occurrences of 'obama' even if there is no occurrence of the word 'fly'. The tf for 'obama' in this case is higher, so even if the word 'fly' is not present the document is scored higher. Expected behaviour is that (a) documents having both 'obama' and 'fly' should be scored higher, in order of their tf, and (b) documents having either of the terms should be given scores, but less than those matched in (a). I tried overriding coord() in a custom Similarity implementation and boosting it if multiple terms match, but what I see is that coord() gets boosted even if the same word matches in multiple fields (say obama is present in title: and content:). Searching for solutions, I have not got any results which talk about a similar requirement... I guess I am not using the right keywords. Thanks Chandrakant K.
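The effect of dropping the TF contribution can be sketched in plain Java: score a document by how many distinct query terms it contains, so a document matching both "obama" and "fly" outranks one containing "obama" ten times. This is a simplification; a real custom Similarity would override its tf factor rather than rescoring by hand.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Score by distinct matched query terms, ignoring term frequency.
public class DistinctTermScore {

    static int score(Set<String> docTerms, String[] queryTerms) {
        int matched = 0;
        for (String t : queryTerms) {
            if (docTerms.contains(t)) {
                matched++; // each term contributes at most 1, however often it occurs
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        String[] query = {"obama", "fly"};
        Set<String> docA = new HashSet<>(Arrays.asList("obama", "fly", "news"));
        Set<String> docB = new HashSet<>(Arrays.asList("obama")); // "obama" many times
        System.out.println(score(docA, query)); // 2
        System.out.println(score(docB, query)); // 1
    }
}
```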
Re: No hits while searching!
( PhysicianDocumentBuilder.PhysicianFieldInfo.FIRST_NAME .toString(), new MetaphoneReplacementAnalyzer()); wrapper.addAnalyzer( PhysicianDocumentBuilder.PhysicianFieldInfo.LAST_NAME .toString(), new MetaphoneReplacementAnalyzer()); } /** * @see PerFieldAnalyzerWrapper#tokenStream(String, Reader) */ @Override public TokenStream tokenStream(String fieldName, Reader reader) { return wrapper.tokenStream(fieldName, reader); } } //lastly the query builder if(physicianQuery.getExactNameSearch()){ if(StringUtils.isNotEmpty(physicianQuery.getFirstNameStartsWith())){ TermQuery term = new TermQuery(new Term(FIRST_NAME_EXACT.toString(), physicianQuery.getFirstNameStartsWith())); query.add(term,MUST); } if(StringUtils.isNotEmpty(physicianQuery.getLastNameStartsWith())){ TermQuery term = new TermQuery(new Term(LAST_NAME_EXACT.toString(), physicianQuery.getLastNameStartsWith())); query.add(term,MUST); } } else{ //we want metaphone search if (StringUtils.isNotEmpty(physicianQuery.getFirstNameStartsWith())) { query.add(buildMultiTermPrefixQuery(FIRST_NAME.toString(), physicianQuery.getFirstNameStartsWith()), MUST); } if (StringUtils.isNotEmpty(physicianQuery.getLastNameStartsWith())) { query.add(buildMultiTermPrefixQuery(LAST_NAME.toString(), physicianQuery.getLastNameStartsWith()), MUST); } } -- View this message in context: http://www.nabble.com/No-hits-while-searching%21-tp23735920p23735920.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: No hits while searching!
Just build your own. Here's exactly what you are looking for (mind you, I just whipped this out and didn't compile it... so there could be minor syntax errors here). You will also obviously have to make your own package declaration, and your own imports. So anyhow, the really neat thing about lucene is being able to do exactly what we just did here: you can chain these tokenizers and filters together in almost any way you want, and create custom analyzers out of them. It's a good thing to become familiar with, because I will nearly promise you that this analyzer here will ALSO probably be insufficient for your needs. Anyhow, hope this helps. Matt /** * Custom Lowercase Analyzer * * @author mhall * * This analyzer tokenizes on whitespace, and then lowercases the token. */ public class LowerCaseAnalyzer extends Analyzer { public LowerCaseAnalyzer() { super(); } /** * Worker for this Analyzer. * * Specifically this analyzer chains WhitespaceTokenizer -> LowerCaseFilter together to form customized Tokens */ public TokenStream tokenStream(String fieldName, Reader reader) { return new LowerCaseFilter(new WhitespaceTokenizer(reader)); } } vanshi wrote: Thanks Matt, Sithu. Yes, it was due to the stop word analyzer... now I'm using a simple analyzer temporarily, as I know even a simple analyzer cannot handle quotes in names. However, can somebody plz direct me towards how to handle quotes within the name in a query using a lowercase analyzer? Thanks, Vanshi Matthew Hall-7 wrote: Yeah, he's gotta be. You might be better off using something like a lowercase analyzer here, since punctuation in a name is likely important. Matt Sudarsan, Sithu D. wrote: Do you use stopword filtering? Sincerely, Sithu D Sudarsan -Original Message- From: vanshi [mailto:nilu.tha...@gmail.com] Sent: Monday, June 01, 2009 11:39 AM To: java-user@lucene.apache.org Subject: Re: No hits while searching!
Thanks Erick, I was able to get this work...as you said ..Luke is a great tool to look in to what gets stored as indexes though in my case I was searching before the indexes were created so i was getting zero hits. On side note, I'm running a strange output with prefix query...it only works when i have 3 or more than 3 letters in the first name/last name. Any idea what is going on here? Please see the output from log here. 02:05:20,996 INFO [PhysicianQueryBuilder] Entered addTypeSpecificTerms in PhysicianQuerybuilder with exactName=true 02:05:20,996 INFO [PhysicianQueryBuilder] Before running Prefix query, First name: ang 02:05:20,996 INFO [PhysicianQueryBuilder] Before running Prefix query, Last name: john 02:05:21,012 INFO [LuceneIndexService] the query is: +(FIRST_NAME_EXACT:ang*) +(LAST_NAME_EXACT:john*) 02:05:21,012 INFO [LuceneIndexService] Result Size: 1 02:06:03,578 INFO [PhysicianQueryBuilder] Entered addTypeSpecificTerms in PhysicianQuerybuilder with exactName=true 02:06:03,578 INFO [PhysicianQueryBuilder] Before running term query, First name: a 02:06:03,578 INFO [PhysicianQueryBuilder] Before running term query, Last name: johns 02:06:03,578 INFO [LuceneIndexService] the query is: +() +(LAST_NAME_EXACT:johns*) 02:06:03,578 INFO [LuceneIndexService] Result Size: 0 02:08:01,548 INFO [PhysicianQueryBuilder] Entered addTypeSpecificTerms in PhysicianQuerybuilder with exactName=true 02:08:01,548 INFO [PhysicianQueryBuilder] Before running term query, First name: an 02:08:01,548 INFO [PhysicianQueryBuilder] Before running term query, Last name: johns 02:08:01,548 INFO [LuceneIndexService] the query is: +() +(LAST_NAME_EXACT:johns*) 02:08:01,580 INFO [LuceneIndexService] Result Size: 0 As one can see the query works with first name=ang but not with first name=a or an. Appreciate all your inputs. Vanshi Erick Erickson wrote: The most common issue with this kind of thing is that UN_TOKENIZEDimplies no case folding. So if your case differs you won't get a match. 
That aside, the very first thing I'd do is get a copy of Luke (google Lucene Luke) and examine the index to see if what's in your index is what you *think* is in there. The second thing I'd do is look at query.toString() to see what the actual query is. You can even paste the output of toString() into Luke and see what happens. I'm not sure what buildMultiTermPrefixQuery is all about, but I assume you have a good reason for using that. But the other strategy I use for this kind of what happened? question is to peel back to simpler cases until I get what I expect, then build back up until it breaks. But really get a copy of Luke, it's a wonderful tool that'll give you lots of insight about what's *really* going on... Best Erick On Wed
Re: Parsing large xml files
2G... should not be a maximum for any JVM that I know of. Assuming you are running a 32 bit JVM, you are actually able to address a bit under 4G of memory; I've always used around 3.6G when trying to max out a 32 bit JVM. Technically speaking it should be able to address 4G under a 32 bit OS, however a certain percentage of the memory is set aside for overhead, so you can only really use a bit less than the max. If you have a 64 bit OS/JVM (which you likely might), you can use the -d64 setting for your runtime environment to set your maximum memory much.. MUCH higher; for example we regularly use 6G of memory on our application servers here at the lab. Hope this helps you a bit, Matt crack...@comcast.net wrote: http://vtd-xml.sf.net - Original Message - From: Sithu D. Sudarsan sithu.sudar...@fda.hhs.gov To: java-user@lucene.apache.org Sent: Thursday, May 21, 2009 7:42:59 AM GMT -08:00 US/Canada Pacific Subject: Parsing large xml files Hi, While trying to parse xml documents of about 50MB size, we run into OutOfMemoryError due to java heap space. Increasing the JVM to use close to 2GB (that is the max) does not help. Is there any API that could be used to handle such large single xml files? If Lucene is not the right place, please let me know alternate places to look. Thanks in advance, Sithu D Sudarsan sithu.sudar...@fda.hhs.gov sdsudar...@ualr.edu
Re: Parsing large xml files
Yeah, there's a setting on Windows that allows you to use up to.. erm, 3G I think it was (the /3GB boot switch). The limitation there is due to how 32-bit Windows splits the process address space between user and kernel, not the JVM itself. I don't remember off hand exactly where that setting lives, but I'm 100% certain that it's there. If you do a google search for jvm maximum memory settings on windows you should be able to find a few articles about it. (At least that's certainly my recollection.) Secondly, if you have a linux machine available you should likely just use that, particularly if it's a 64 bit processor, because then a whole ton more memory becomes available to you. When I'm developing my indexes I do it via eclipse on my windows platform, but with the actual directories themselves mounted from a solaris machine. When I go to actually MAKE the indexes I simply login to the machine, do a quick ant compile, and run them. Sure it's an extra step, but the gains are more than worth it in our case. Matt Sudarsan, Sithu D. wrote: Hi Matt, We use a 32 bit JVM. Though it is supposed to have up to 4GB, any assignment above 2GB on Windows XP fails. The machine has quad-core dual processors. On Linux we're able to use 4GB though! If there is any setting that will let us use 4GB do let me know. Thanks, Sithu D Sudarsan -Original Message- From: Matthew Hall [mailto:mh...@informatics.jax.org] Sent: Friday, May 22, 2009 8:59 AM To: java-user@lucene.apache.org Subject: Re: Parsing large xml files 2g... should not be a maximum for any Jvm that I know of. Assuming you are running a 32 bit Jvm you are actually able to address a bit under 4G of memory, I've always used around 3.6G when trying to max out a 32 bit jvm. Technically speaking it should be able to address 4g under a 32 bit or, however a certain percentage of the memory is set aside for overhead, so you can only really use a bit less than the max. If you have a 64 bit os/jvm (which you likely might), you can use the -d64 setting for your runtime environment to set your maximum memory much..
MUCH higher, for example we regularly use 6G of memory on our application servers here at the lab. Hope this helps you a bit, Matt crack...@comcast.net wrote: http://vtd-xml.sf.net - Original Message - From: Sithu D. Sudarsan sithu.sudar...@fda.hhs.gov To: java-user@lucene.apache.org Sent: Thursday, May 21, 2009 7:42:59 AM GMT -08:00 US/Canada Pacific Subject: Parsing large xml files Hi, While trying to parse xml documents of about 50MB size, we run into OutOfMemoryError due to java heap space. Increasing the JVM to use close to 2GB (that is the max) does not help. Is there any API that could be used to handle such large single xml files? If Lucene is not the right place, please let me know alternate places to look. Thanks in advance, Sithu D Sudarsan sithu.sudar...@fda.hhs.gov sdsudar...@ualr.edu - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
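The heap-size discussion above treats the symptom; for a 50MB document the usual cure is to stream the parse rather than build the whole tree in memory. As a rough sketch (the class name and the tiny inline document are made up for illustration), Java's built-in SAX API visits the file tag by tag with near-constant memory, so each record can be turned into a Lucene Document and discarded as it is seen:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingXmlCount {

    // Counts start tags via SAX callbacks; the parser never holds
    // more than the current element in memory, so file size doesn't
    // translate into heap usage the way a DOM parse does.
    static int countElements(InputSource in) throws Exception {
        final int[] count = {0};
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                count[0]++;
                // In an indexing app, you would accumulate field text here
                // and add a Document to the IndexWriter in endElement().
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // A tiny inline document stands in for the 50MB file on disk.
        String xml = "<docs><doc>a</doc><doc>b</doc></docs>";
        InputSource in = new InputSource(new StringReader(xml));
        System.out.println(countElements(in)); // prints 3
    }
}
```

The same callback structure works for a multi-gigabyte file; only the handler's own state occupies the heap.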
Re: Searching index problems with tomcat
For writing indexes? Well I guess it depends on what you want.. but I personally use this: (2.3.2 API) File INDEX_DIR = new File("/data/searchtool/thisismyindexdirectory"); Analyzer analyzer = new WhateverConcreteAnalyzerYouWant(); writer = new IndexWriter(INDEX_DIR, analyzer, true); Your best bet would be to peruse the API docs of whatever lucene version you are using. However, I'm still pretty sure this ISN'T your actual issue here. Looking at your full path example those still seem to be by reference to me. Let me be more specific and tell you EXACTLY what I mean by that. Let's say you are running your program in the following directory: /home/test/app/ Trying to open an index like you have below will effectively be trying to open an index in the following location: /home/test/app/home/marco/RdfIndexLucene What I think you MEAN to be doing is: /home/marco/RdfIndexLucene That leading slash is VERY VERY important, as it's the entire difference between a relative path and an absolute one. Matt Marco Lazzara wrote: I was talking with my teacher. Is it correct to use FSDirectory? Could you please look again at the code I've posted here?? Should I choose a different way to Indexing?? Marco Lazzara 2009/5/22 Ian Lea ian@gmail.com OK. I'd still like to see some evidence, but never mind. Next suggestion is the old standby - cut the code down to the absolute minimum to demonstrate the problem and post it here. I know you've already posted some code, but maybe not all of it, and definitely not cut down to the absolute minimum. -- Ian. On Thu, May 21, 2009 at 10:48 PM, Marco Lazzara marco.lazz...@gmail.com wrote: _I strongly suggest that you use a full path name and/or provide some evidence that your readers and writers are using the same directory and thus lucene index. _ I try a full path like home/marco/RdfIndexLucene, even media/disk/users/fratelli/RDFIndexLucene. But nothing is changed.
MARCOLAZZARA _ _ Its been a few days, and we haven't heard back about this issue, can we assume that you fixed it via using fully qualified paths then? Matt Ian Lea wrote: Marco You haven't answered Matt's question about where you are running it from. Tomcat's default directory may well not be the same as yours. I strongly suggest that you use a full path name and/or provide some evidence that your readers and writers are using the same directory and thus lucene index. -- Ian. On Wed, May 20, 2009 at 9:59 AM, Marco Lazzara marco.lazz...@gmail.com wrote: I've posted the indexing part,but I don't use this in my app.After I create the index,I put that in a folder like /home/marco/RDFIndexLucece and when I run the query I'm only searching (and not indexing). String[] fieldsearch = new String[] {name, synonyms, propIn}; //RDFinder rdfind = new RDFinder(RDFIndexLucene/,fieldsearch); TreeMapInteger, ArrayListString paths; try { this.paths = this.rdfind.Search(text, path); } catch (ParseException e1) { e1.printStackTrace(); } catch (IOException e1) { e1.printStackTrace(); } Marco Lazzara Sorry, anyhow looking over this quickly here's a summarization of what I see: You have documents in your index that look like the following: name which is indexed and stored. synonyms which are indexed and stored path, which is stored but not indexed propin, which is stored and indexed propinnum, which is stored but not indexed and ... vicinity I guess which is stored but not indexed For an analyzer you are using Standard analyzer (which considering all the Italian? is an interesting choice.) And you are opening your index using FSDirectory, in what appears to be a by reference fashion (You don't have a fully qualified path to where your index is, you are ASSUMING that its in the same directory as this code, unless FSDirectory is not implemented as I think it is.) Now can I see the consumer code? Specifically the part where you are opening the index/constructing your queries? 
I'm betting what's going on here is you are deploying this as a war file into tomcat, and its just not really finding the index as a result of how the war file is getting deployed, but looking more closely at the source code should reveal if my suspicion is correct here. Also runtime wise, when you run your standalone app, where specifically in your directory structure are you running it from? Cause if you are opening your index reader/searcher in the same way as you are creating your writer here, I'm pretty darn certain that will cause you problems. Matt Marco Lazzara wrote: _Could you further post your Analyzer Setup/Query Building code from BOTH apps. _ there is only one code.It is the same for web and for standalone. And it is exactly the real problem!!the code is
Re: Searching index problems with tomcat
because that's the default index write behavior. It will create any directory that you ask it to. Matt Marco Lazzara wrote: ok.I understand what you really mean but It doesn't work. I understand one thing.For example When i try to open an index in the following location : RDFIndexLucene/ but the folder doesn't exist,*Lucene create an empty folder named RDFIndexLucene* in my home folder...WHY??? MARCO LAZZARA 2009/5/22 Matthew Hall mh...@informatics.jax.org For writing indexes? Well I guess it depends on what you want.. but I personally use this: (2.3.2 API) File INDEX_DIR = /data/searchtool/thisismyindexdirectory Analyzer analyzer = new WhateverConcreteAnalyzerYouWant(); writer = new IndexWriter(/INDEX_DIR/, /analyzer/, true); Your best bet would be to peruse the API docs of whatever lucene version you are using. However, I'm still pretty sure this ISN'T your actual issue here. Looking at your full path example those still seem to be by reference to me. Let me be more specific and tell you EXACTLY what I mean by that, Lets say you are running your program in the following directory: /home/test/app/ Trying to open an index like you have below will effectively be trying to open an index in the following location: /home/test/app/home/marco/RdfIndexLucene What I think you MEAN to be doing is: /home/marco/RdfIndexLucene That leading slash is VERY VERY important, as its the entire difference between an relative path and an absolute one. Matt Marco Lazzara wrote: I was talking with my teacher. Is it correct to use FSDirectory?Could you please look again at the code I've posted here?? Should I choose a different way to Indexing ?? Marco Lazzara 2009/5/22 Ian Lea ian@gmail.com OK. I'd still like to see some evidence, but never mind. Next suggestion is the old standby - cut the code down to the absolute minimum to demonstrate the problem and post it here. I know you've already posted some code, but maybe not all of it, and definitely not cut down to the absolute minimum. 
-- Ian. On Thu, May 21, 2009 at 10:48 PM, Marco Lazzara marco.lazz...@gmail.com wrote: _I strongly suggest that you use a full path name and/or provide some evidence that your readers and writers are using the same directory and thus lucene index. _ I try a full path like home/marco/RdfIndexLucene,even media/disk/users/fratelli/RDFIndexLucene.But nothing is changed. MARCOLAZZARA _ _ Its been a few days, and we haven't heard back about this issue, can we assume that you fixed it via using fully qualified paths then? Matt Ian Lea wrote: Marco You haven't answered Matt's question about where you are running it from. Tomcat's default directory may well not be the same as yours. I strongly suggest that you use a full path name and/or provide some evidence that your readers and writers are using the same directory and thus lucene index. -- Ian. On Wed, May 20, 2009 at 9:59 AM, Marco Lazzara marco.lazz...@gmail.com wrote: I've posted the indexing part,but I don't use this in my app.After I create the index,I put that in a folder like /home/marco/RDFIndexLucece and when I run the query I'm only searching (and not indexing). String[] fieldsearch = new String[] {name, synonyms, propIn}; //RDFinder rdfind = new RDFinder(RDFIndexLucene/,fieldsearch); TreeMapInteger, ArrayListString paths; try { this.paths = this.rdfind.Search(text, path); } catch (ParseException e1) { e1.printStackTrace(); } catch (IOException e1) { e1.printStackTrace(); } Marco Lazzara Sorry, anyhow looking over this quickly here's a summarization of what I see: You have documents in your index that look like the following: name which is indexed and stored. synonyms which are indexed and stored path, which is stored but not indexed propin, which is stored and indexed propinnum, which is stored but not indexed and ... vicinity I guess which is stored but not indexed For an analyzer you are using Standard analyzer (which considering all the Italian? is an interesting choice.) 
And you are opening your index using FSDirectory, in what appears to be a by reference fashion (You don't have a fully qualified path to where your index is, you are ASSUMING that its in the same directory as this code, unless FSDirectory is not implemented as I think it is.) Now can I see the consumer code? Specifically the part where you are opening the index/constructing your queries? I'm betting what's going on here is you are deploying this as a war file into tomcat, and its just not really finding the index as a result of how the war file
Re: Searching index problems with tomcat
humor me. Open up your indexing software package. Step 1: In all places where you reference your index, replace whatever the heck you have there with the following EXACT STRING: /home/marco/testIndex Do not leave off the leading slash. After you have made these changes to the indexing software, recompile and create your indexes. Step 2: After your indexing process completes do the following: cd /home/marco/testIndex/index You should see files in there, they will look something like this: drwxrwxr-x 3 mhall progs 4.0K May 18 11:19 .. -rw-rw-r-- 1 mhall progs 80 May 21 16:47 _9j7.fnm -rw-rw-r-- 1 mhall progs 4.1G May 21 16:50 _9j7.fdt -rw-rw-r-- 1 mhall progs 434M May 21 16:50 _9j7.fdx -rw-rw-r-- 1 mhall progs 280M May 21 16:52 _9j7.frq -rw-rw-r-- 1 mhall progs 108M May 21 16:52 _9j7.prx -rw-rw-r-- 1 mhall progs 329M May 21 16:52 _9j7.tis -rw-rw-r-- 1 mhall progs 4.7M May 21 16:52 _9j7.tii -rw-rw-r-- 1 mhall progs 108M May 21 16:52 _9j7.nrm -rw-rw-r-- 1 mhall progs 47 May 21 16:52 segments_9je -rw-rw-r-- 1 mhall progs 20 May 21 16:52 segments.gen You have now confirmed that you are actually creating indexes. And the indexes you are creating exist at EXACTLY the place you have asked them to. Step 3: Then.. go download luke, and open these indexes. Perform a query on them, confirm that the data you want is actually IN the indexes. Step 4: Now, open up your standalone application, and replace whatever you are using to open the index with the SAME string I have listed above. Perform a search, verify that the indexes are there, and actually return values. Step 5: Lastly, go into your web application and again replace the path with the one I have above, recompile, and perform a search. Verify that the indexes are actually THERE and searchable. This.. damn well SHOULD work; if it doesn't, it is likely pointing to some other issues in what you have set up. For example your tomcat instance could perhaps not have permission to read the lucene indexes directory.
You should be able to tell this in the tomcat logs, BUT don't do this yet. Carefully and fully follow the steps I have outlined for you, and then you have chased down the full debugging path for this. If this yields nothing for you, I'd be happy to take a closer look at your source code, but until then give this a shot. Oh.. if it fails, please post back EXACTLY which steps in the above outlined process failed for you, as that will be really really helpful. Matt Marco Lazzara wrote: I dont't know hot to solve the problem..I've tried all rationals things.Maybe the last thing is to try to index not with FSDirectory but with something else.I have to peruse the api documentation. But.IF IT WAS A LUCENE'S BUG??? 2009/5/22 Matthew Hall mh...@informatics.jax.org because that's the default index write behavior. It will create any directory that you ask it to. Matt Marco Lazzara wrote: ok.I understand what you really mean but It doesn't work. I understand one thing.For example When i try to open an index in the following location : RDFIndexLucene/ but the folder doesn't exist,*Lucene create an empty folder named RDFIndexLucene* in my home folder...WHY??? MARCO LAZZARA 2009/5/22 Matthew Hall mh...@informatics.jax.org For writing indexes? Well I guess it depends on what you want.. but I personally use this: (2.3.2 API) File INDEX_DIR = /data/searchtool/thisismyindexdirectory Analyzer analyzer = new WhateverConcreteAnalyzerYouWant(); writer = new IndexWriter(/INDEX_DIR/, /analyzer/, true); Your best bet would be to peruse the API docs of whatever lucene version you are using. However, I'm still pretty sure this ISN'T your actual issue here. Looking at your full path example those still seem to be by reference to me. 
Let me be more specific and tell you EXACTLY what I mean by that, Lets say you are running your program in the following directory: /home/test/app/ Trying to open an index like you have below will effectively be trying to open an index in the following location: /home/test/app/home/marco/RdfIndexLucene What I think you MEAN to be doing is: /home/marco/RdfIndexLucene That leading slash is VERY VERY important, as its the entire difference between an relative path and an absolute one. Matt Marco Lazzara wrote: I was talking with my teacher. Is it correct to use FSDirectory?Could you please look again at the code I've posted here?? Should I choose a different way to Indexing ?? Marco Lazzara 2009/5/22 Ian Lea ian@gmail.com OK. I'd still like to see some evidence, but never mind. Next suggestion is the old standby - cut the code down to the absolute minimum to demonstrate the problem and post it here. I know you've already posted some code, but maybe not all of it, and definitely not cut
Re: Searching index problems with tomcat
Right, so again, you are opening your index by reference there. Your application has to assume that the index it's looking for exists in the same directory as the application itself lives in. Since you are deploying this application as a deployable war file, that's not going to work really well. Well.. on the other hand this seems to be commented out in this snippet, but wherever you actually DO initialize the directory that holds your index, try doing it with the full path. In your example below: RDFinder rdfind = new RDFinder("/home/marco/RDFIndexLucene/", fieldsearch); instead of what you have written here. Matt Marco Lazzara wrote: I've posted the indexing part, but I don't use this in my app. After I create the index, I put that in a folder like /home/marco/RDFIndexLucece and when I run the query I'm only searching (and not indexing). String[] fieldsearch = new String[] {"name", "synonyms", "propIn"}; //RDFinder rdfind = new RDFinder("RDFIndexLucene/", fieldsearch); TreeMap<Integer, ArrayList<String>> paths; try { this.paths = this.rdfind.Search(text, path); } catch (ParseException e1) { e1.printStackTrace(); } catch (IOException e1) { e1.printStackTrace(); } Marco Lazzara Sorry, anyhow looking over this quickly here's a summarization of what I see: You have documents in your index that look like the following: name, which is indexed and stored; synonyms, which are indexed and stored; path, which is stored but not indexed; propin, which is stored and indexed; propinnum, which is stored but not indexed; and ... vicinity I guess, which is stored but not indexed. For an analyzer you are using StandardAnalyzer (which, considering all the Italian, is an interesting choice.) And you are opening your index using FSDirectory, in what appears to be a by reference fashion (You don't have a fully qualified path to where your index is, you are ASSUMING that it's in the same directory as this code, unless FSDirectory is not implemented as I think it is.) Now can I see the consumer code?
Specifically the part where you are opening the index/constructing your queries? I'm betting what's going on here is you are deploying this as a war file into tomcat, and its just not really finding the index as a result of how the war file is getting deployed, but looking more closely at the source code should reveal if my suspicion is correct here. Also runtime wise, when you run your standalone app, where specifically in your directory structure are you running it from? Cause if you are opening your index reader/searcher in the same way as you are creating your writer here, I'm pretty darn certain that will cause you problems. Matt Marco Lazzara wrote: _Could you further post your Analyzer Setup/Query Building code from BOTH apps. _ there is only one code.It is the same for web and for standalone. And it is exactly the real problem!!the code is the same,libraries are the same,query index etc etc. are the same. This is the class that create index public class AlternativeRDFIndexing { private Analyzer analyzer; private Directory directory; private IndexWriter iwriter; private WordNetSynonymEngine wns; private AlternativeResourceAnalysis rs; public ArrayListString commonnodes; //private RDFinder rdfind = new RDFinder(RDFIndexLucene/,new String[] {name}); //public boolean Exists(String node) throws ParseException, IOException{ // //return rdfind.Exists(node); //} public AlternativeRDFIndexing(String inputfilename) throws IOException, ParseException{ commonnodes = new ArrayListString(); // bisogna istanziare un oggetto per fare analisi sul documento rdf rs = new AlternativeResourceAnalysis(inputfilename); ArrayListString nodelist = rs.getResources(); int nodesize = nodelist.size(); ArrayListString sourcelist = rs.getsource(); int sourcesize = sourcelist.size(); //sinonimi wns = new WordNetSynonymEngine(sinonimi/); //creazione di un analyzer standard analyzer = new StandardAnalyzer(); //Memorizza l'indice in RAM: //Directory directory = new RAMDirector(); //Memorizza 
l'indice su file directory = FSDirectory.getDirectory(RDFIndexLucene/); //Creazione istanza per la scrittura dell'indice //Tale istanza viene fornita di analyzer, di un boolean per indicare se ricreare o meno da zero //la struttura e di una dimensione massima (o infinita IndexWriter.MaxFieldLength.UNLIMITED) iwriter = new IndexWriter(directory, analyzer, true, new IndexWriter.MaxFieldLength(25000)); //costruiamo un indice con solo n documenti: un documento per nodo for (int i = 0; i nodesize; i++){ Document doc
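Matt's point about the leading slash can be checked with nothing but java.io.File, no Lucene involved (the paths are the ones from the thread): a relative index path is resolved against whatever the current working directory happens to be, which is why the same code can find the index from a shell but not from inside Tomcat.

```java
import java.io.File;

public class PathCheck {
    public static void main(String[] args) {
        // The path as written in the thread: no leading slash, so it is relative.
        File relative = new File("home/marco/RdfIndexLucene");
        // The form Matt recommends: absolute, same target from any working directory.
        File absolute = new File("/home/marco/RdfIndexLucene");

        System.out.println(relative.isAbsolute()); // false
        System.out.println(absolute.isAbsolute()); // true on Unix-like systems

        // The relative form silently becomes <cwd>/home/marco/RdfIndexLucene,
        // e.g. /home/test/app/home/marco/RdfIndexLucene when run from /home/test/app.
        System.out.println(relative.getAbsolutePath());
    }
}
```

Handing the absolute form to FSDirectory makes the indexer, the standalone searcher, and the Tomcat webapp agree on a single index location.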
Re: Searching index problems with tomcat
Things that could help us immensely here. Can you post your IndexReader/Searcher initialization code from your standalone app, as well as your webapp? Could you further post your analyzer setup/query building code from both apps? Could you further post the document creation code used at indexing time? (Which analyzer, and which fields are indexed/stored.) Give us this, and I'm pretty darn sure we can nail down your issue. Matt Ian Lea wrote: ... There are no exceptions. When I run the query a new shell is displayed but with no result. New shell? "Are you sure the index is the same - what do IndexReader.maxDoc(), numDocs() and getVersion() say, standalone and in tomcat?" What do you mean with this question?? IndexReader ir = ... System.out.printf("maxDoc=%s, ...", ir.maxDoc(), ...); and run in tomcat and standalone. To absolutely confirm you're looking at the same index, and it has documents, etc. -- Ian.
Re: Searching index problems with tomcat
Sorry, anyhow looking over this quickly here's a summarization of what I see: You have documents in your index that look like the following: name which is indexed and stored. synonyms which are indexed and stored path, which is stored but not indexed propin, which is stored and indexed propinnum, which is stored but not indexed and ... vicinity I guess which is stored but not indexed For an analyzer you are using Standard analyzer (which considering all the Italian? is an interesting choice.) And you are opening your index using FSDirectory, in what appears to be a by reference fashion (You don't have a fully qualified path to where your index is, you are ASSUMING that its in the same directory as this code, unless FSDirectory is not implemented as I think it is.) Now can I see the consumer code? Specifically the part where you are opening the index/constructing your queries? I'm betting what's going on here is you are deploying this as a war file into tomcat, and its just not really finding the index as a result of how the war file is getting deployed, but looking more closely at the source code should reveal if my suspicion is correct here. Also runtime wise, when you run your standalone app, where specifically in your directory structure are you running it from? Cause if you are opening your index reader/searcher in the same way as you are creating your writer here, I'm pretty darn certain that will cause you problems. Matt Marco Lazzara wrote: _Could you further post your Analyzer Setup/Query Building code from BOTH apps. _ there is only one code.It is the same for web and for standalone. And it is exactly the real problem!!the code is the same,libraries are the same,query index etc etc. are the same. 
This is the class that creates the index: public class AlternativeRDFIndexing { private Analyzer analyzer; private Directory directory; private IndexWriter iwriter; private WordNetSynonymEngine wns; private AlternativeResourceAnalysis rs; public ArrayList<String> commonnodes; //private RDFinder rdfind = new RDFinder("RDFIndexLucene/", new String[] {"name"}); //public boolean Exists(String node) throws ParseException, IOException{ // //return rdfind.Exists(node); //} public AlternativeRDFIndexing(String inputfilename) throws IOException, ParseException{ commonnodes = new ArrayList<String>(); // we need an object to run the analysis over the rdf document rs = new AlternativeResourceAnalysis(inputfilename); ArrayList<String> nodelist = rs.getResources(); int nodesize = nodelist.size(); ArrayList<String> sourcelist = rs.getsource(); int sourcesize = sourcelist.size(); // synonyms wns = new WordNetSynonymEngine("sinonimi/"); // create a standard analyzer analyzer = new StandardAnalyzer(); // store the index in RAM: //Directory directory = new RAMDirectory(); // store the index on disk directory = FSDirectory.getDirectory("RDFIndexLucene/"); // create the instance that writes the index // it is given the analyzer, a boolean saying whether to rebuild the structure from scratch, // and a maximum field length (or unlimited: IndexWriter.MaxFieldLength.UNLIMITED) iwriter = new IndexWriter(directory, analyzer, true, new IndexWriter.MaxFieldLength(25000)); // build an index of just n documents: one document per node for (int i = 0; i < nodesize; i++){ Document doc = new Document(); // create the various fields // each document will have // a name field: the node's name // stored (Store.YES) and indexed with the analyzer (ANALYZED) String node = nodelist.get(i); //if (sourcelist.contains(node)) break; //if (rdfind.Exists(node)) commonnodes.add(node); Field field = new Field("name", node, Field.Store.YES, Field.Index.ANALYZED); // add the field to the document doc.add(field); // add the synonyms String[] nodesynonyms = wns.getSynonyms(node); for (int is = 0; is < nodesynonyms.length; is++) { field = new Field("synonyms", nodesynonyms[is], Field.Store.YES, Field.Index.ANALYZED); // add the field to the document doc.add(field); } // one or more path_i fields: minimal paths from the sources to the node // not indexed for (int j = 0; j < sourcesize; j++) { String source = sourcelist.get(j);
NE Lucene User's Interest Group
Since everyone else seems to be trying to start these up, I figured I would poll the community and see if there is any interest in the greater New England area for a Lucene users group. Searching over on Google leads me to believe that such a group doesn't currently exist, and I think it would certainly be something interesting to attempt. Sadly, I'm betting that for any such group to succeed it would likely have to be Boston based, but perhaps there is a secret pocket of Lucene users in Maine that I am not aware of. So, would anyone be interested in such a thing, with location/topics of discussion TBD based on interest? Assuming there is enough interest in such a thing I would be willing to help organize/plan it, and I think I could convince my group to discuss our practical application of Lucene when searching Genomic information. -Matt
Re: how to get the word before and the word after the matched Term?
Well, when you get the Document object, you have access to the fields in that document, including the text that was searched against. You could simply retrieve this string, and then use simple java String manipulation to get what you want. Matt Kamal Najib wrote: Hi all, I want to get the word before and the word after the matched term. For example, if I have the text "The drug was freshly prepared at 4-hour intervals. Eleven courses were administered to seven patients at this dose level and no patient experienced nausea or vomiting" and the matched term is, for example, "patient", I want to get the word "level" and the word "experienced" ("and" and "no" are stop words, therefore I don't want to get them). I have looked at the class TermPositions, but in this class I can only get the position of the matched term. How can I get the word before and after it, any suggestion? Thank you in advance. Kamal -- Matthew Hall Software Engineer Mouse Genome Informatics mh...@informatics.jax.org (207) 288-6012
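A minimal sketch of that string-manipulation idea, using only the JDK (the whitespace tokenizer and the tiny stop list below are stand-ins; in practice you would reuse the same analyzer and stop set that built the index):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NeighborWords {
    // Toy stop list; a real one should match the analyzer's stop set.
    static final Set<String> STOP =
            new HashSet<String>(Arrays.asList("and", "no", "or", "at", "the"));

    // Returns {wordBefore, wordAfter} around the first exact occurrence of
    // term, after dropping stop words; an entry is null at a text boundary.
    static String[] neighbors(String text, String term) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty() && !STOP.contains(t)) tokens.add(t);
        }
        int i = tokens.indexOf(term.toLowerCase());
        if (i < 0) return null;
        return new String[] {
            i > 0 ? tokens.get(i - 1) : null,
            i + 1 < tokens.size() ? tokens.get(i + 1) : null
        };
    }

    public static void main(String[] args) {
        String text = "Eleven courses were administered to seven patients "
                + "at this dose level and no patient experienced nausea or vomiting";
        // prints [level, experienced], matching Kamal's expected answer
        System.out.println(Arrays.toString(neighbors(text, "patient")));
    }
}
```

Position-aware alternatives (TermPositions plus a stored field, or the contrib highlighter) avoid re-tokenizing, but for short stored texts plain string handling like this is often enough.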
Re: Stemming
Ganesh wrote: My opinion is Stemming process is to get the base word. Here it is not doing so. Unfortunately this is where your problem lies: stemming doesn't do this, it breaks words that are almost lexically equivalent down into a similar root word, thus cat = cats. From the wiki: *Stemming* is the process for reducing inflected (or sometimes derived) words to their stem, base or root form – generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. The algorithm has been a long-standing problem in computer science; the first paper on the subject was published in 1968. The process of stemming, often called *conflation*, is useful in search engines for query expansion or indexing and other natural language processing problems. But the words hard and harder mean different things (in the opinion of those who developed the Snowball algorithm), and as such shouldn't stem down to a single word. Now, I find it to be an arguable point whether hard and harder are close enough to stem to the same root, but in order to get this effect you will need to either change the snowball algorithm, or process your words into a more base form before they go into the stemmer, which is a hairy road indeed ^^ Hope this helps.
Matt -- Matthew Hall Software Engineer Mouse Genome Informatics mh...@informatics.jax.org (207) 288-6012
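To make the distinction concrete, here is a toy suffix-stripper (an illustration of the idea only, not the Snowball algorithm): it conflates plain plural inflection the way a stemmer does, while deliberately leaving comparative forms like "harder" alone, mirroring Snowball's caution about suffixes that carry a meaning change.

```java
public class ToyStemmer {
    // Conflates simple plural inflection; leaves derivational suffixes
    // like "-er" untouched, so hard and harder stay distinct terms.
    static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ies") && w.length() > 4) {
            return w.substring(0, w.length() - 3) + "y"; // ponies -> pony
        }
        if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3) {
            return w.substring(0, w.length() - 1);       // cats -> cat
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("cats"));   // cat
        System.out.println(stem("harder")); // harder: comparative kept distinct
        System.out.println(stem("ponies")); // pony
    }
}
```

A real stemmer applies many such rules with measure conditions on the stem, which is exactly why changing its notion of "base form" is the hairy road described above.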
Re: problem with indexformat and luke
Which version of luke are you using? Timon Roth wrote: hello list i am using lucene 2.9. when i try to open the index with luke i got an error: unknown format version: -8 any hints? -- Matthew Hall Software Engineer Mouse Genome Informatics mh...@informatics.jax.org (207) 288-6012
Re: Seattle / PNW Hadoop + Lucene User Group?
Same here, sadly there isn't much call for Lucene user groups in Maine. It would be nice though ^^ Matt Amin Mohammed-Coleman wrote: I would love to come but I'm afraid I'm stuck in rainy old England :( Amin On 18 Apr 2009, at 01:08, Bradford Stephens bradfordsteph...@gmail.com wrote: OK, we've got 3 people... that's enough for a party? :) Surely there must be dozens more of you guys out there... c'mon, accelerate your knowledge! Join us in Seattle! On Thu, Apr 16, 2009 at 3:27 PM, Bradford Stephens bradfordsteph...@gmail.com wrote: Greetings, Would anybody be willing to join a PNW Hadoop and/or Lucene User Group with me in the Seattle area? I can donate some facilities, etc. -- I also always have topics to speak about :) Cheers, Bradford
Re: ebook resources - including lucene in action
Strange.. as far as I can tell I never even got this email at all, was it not originally sent to the lucene lists? Matt Grant Ingersoll wrote: Lest you think silence equals acceptance... This is not appropriate use of these lists. -Grant On Apr 19, 2009, at 11:58 PM, wu fuheng wrote: welcome to download http://www.ultraie.com/admin/flist.php -- Matthew Hall Software Engineer Mouse Genome Informatics mh...@informatics.jax.org (207) 288-6012
Re: A Challenge!: Combining 2 searches into a single resultset?
Erm, I likely should have mentioned that this technique requires the use of a MultiFieldQueryParser. Matt Matthew Hall wrote: If you can build an analyzer that tokenizes the second field so that it filters out the words you don't want, you can then take advantage of more intelligent queries as well. So for the example that pjaol wrote, the query would become something like this: Query= body:(game OR redskins) keyword:(redskins)^10 Depending on your corpus this may or may not be possible, the determining factor being whether or not the list of words being removed from each document to create the second field varies. (More specifically: in some documents you remove the word game, and in others you don't; if this is the case, this technique won't work for you.) Matt theDude_2 wrote: Ah, interesting... I didn't think of that! I will try it and report back. pjaol wrote: Why not put the keywords into the same document as another field? And search both fields at once; you can then use Lucene syntax to give a boosting to the keyword fields. e.g. body:A good game last night by the redskins keyword: redskins Query= body:(game OR redskins) keyword:(game OR redskins)^10 And adjust the boosting until you're happy. Check out the FAQ entry on querying multiple fields: http://wiki.apache.org/lucene-java/LuceneFAQ#head-300f0756fdaa71f522c96a868351f716573f2d77 You might even want to consider Solr and its DisMax search component http://wiki.apache.org/solr/DisMaxRequestHandler to make it easier On Fri, Apr 17, 2009 at 11:19 AM, theDude_2 aornst...@webmd.net wrote: I appreciate your response, and read the wiki article concerning Federated search, and I'm not sure that my project falls into the Federated Search bucket... What I've done is create 2 indexes from the same documents.
One index contains the full documents - great for pure relevancy search. The second index contains all of the same documents, but a small subset of each document's contents - only allowing words to be indexed that we deem as good words. (For example) if this was a football article database: Index 1 would index 100% of the article about the Redskins and the New York Giants. Index 2 would index the same article by only the good words in the document like Redskins, Giants, Quarterback, Linebacker, etc. What I'm trying to do, if it's even possible, is run the search on both indexes containing references to the same article, and multiply the scores together to get a final score that would represent something like a relevancy AND good word score. Figuring that if a user searches on Who is the Quarterback for the Giants this will get the user an article that is both related to the query, and deemed important to the query... I will look further into federated search and related items, but I think that Lucene probably won't be able to help me with this, am I right? pjaol wrote: I'd start by doing some research on the question rather than asking for a solution.. What you're asking for can be considered 'Federated Search' http://en.wikipedia.org/wiki/Federated_search And it can be conceived in as many ways as you have document types. Any answer will probably end up customized and weighted by your document silo value; usually companies weight those by business rules rather than head down the path of federated search, as it's just quicker and cheaper, and you can accomplish more. e.g. Medication = score * 2 (as higher advertising incentives) Diseases = score Books = score * 0.75 (thousands of books, which nobody buys etc..) You might also want to try consolidating your data into 1 schema, and consider layering or collapsing results based on type. P On Fri, Apr 17, 2009 at 10:39 AM, theDude_2 aornst...@webmd.net wrote: (bump) - any thoughts? theDude_2 wrote: hi!
I am trying to do something a little unique... I have 90k text documents that I am trying to search. Search A: indexes and searches the documents using regular relevancy search. Search B: indexes and searches the documents using a smaller subset of key words that I have chosen. This gives me 2 separate scores: Score A and Score B... I am trying to show the top 10 results of the scores combined, so FinalScoretextDoc = (scoreA_of_td1 * 0.5) * (scoreB_of_td1 * 0.5) While it seems straightforward, I do not want to calculate the scores of all the documents outside of Lucene. How can I integrate this better into the Lucene search engine? Is this possible to do by any simple means? Thanks guys + gals!
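The score-combination formula above can be sketched in plain Java. This is an illustration of the arithmetic the poster describes (doc id to score maps from two searches, multiplied together with 0.5 weights), not a Lucene API; the class and method names are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the score-combination step described above: two searches over
 * the same documents produce two score maps (doc id -> score), and the
 * final score is the product of the two weighted scores.
 */
public class CombinedScorer {

    /** FinalScore = (scoreA * 0.5) * (scoreB * 0.5), as in the post. */
    public static double combine(double scoreA, double scoreB) {
        return (scoreA * 0.5) * (scoreB * 0.5);
    }

    /** Combine two doc-id -> score maps; docs missing from either map are dropped. */
    public static Map<Integer, Double> combineAll(Map<Integer, Double> a,
                                                  Map<Integer, Double> b) {
        Map<Integer, Double> out = new HashMap<Integer, Double>();
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                out.put(e.getKey(), combine(e.getValue(), other));
            }
        }
        return out;
    }
}
```

Doing this inside Lucene (rather than post-processing two result sets like this) would require a custom scoring mechanism, which is the harder part of the question.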
Re: Query any data
I think I would tackle this in a slightly different manner. When you are creating this index, make sure that that field has a default value. Make sure this value is something that could never appear in the index otherwise. Then, when you go to place this field into the index, either write out your actual value, or the default one. Then when you get the document back, you can look at that field and answer your question. You can also craft queries that specifically avoid entries that don't have a value in this field with a NOT clause. Hope this helps, Matt Erick Erickson wrote: searching for fieldname:* will be *extremely* expensive as it will, by default, build a giant OR clause consisting of every term in the field. You'll throw MaxClauses exceptions right and left. I'd follow Tim's thread lead first. Best Erick 2009/4/8 王巍巍 ww.wang...@gmail.com first you should change your query parser to accept wildcard queries by calling QueryParser's setAllowLeadingWildcard method, then you can query like this: fieldname:* 2009/4/9 Tim Williams william...@gmail.com On Wed, Apr 8, 2009 at 11:45 AM, addman addiek...@yahoo.com wrote: Hi, Is it possible to create a query to search a field for any value? I just need to know if the optional field contains any data at all. Google for: lucene field existence There's no way built in; one strategy[1] is to have a 'meta field' that contains the names of the fields the document contains.
--tim [1] - http://www.mail-archive.com/lucene-u...@jakarta.apache.org/msg07703.html -- 王巍巍 (Weiwei Wang) Department of Computer Science Gulou Campus of Nanjing University Nanjing, P.R.China, 210093 Mobile: 86-13913310569 MSN: ww.wang...@gmail.com Homepage: http://cs.nju.edu.cn/rl/weiweiwang
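Matt's default-value trick above can be sketched as a small indexing-side helper. The sentinel token "__EMPTY__" is an arbitrary choice for illustration; pick anything that can never occur as a real value in your corpus.

```java
/**
 * Sketch of the default-value trick described above: at indexing time,
 * write a sentinel into the optional field when no real value exists, so
 * "has no value" becomes directly searchable (or excludable with a NOT
 * clause). The sentinel "__EMPTY__" is an illustrative assumption.
 */
public class OptionalFieldValue {

    public static final String SENTINEL = "__EMPTY__";

    /** Value to actually index: the real one, or the sentinel. */
    public static String toIndex(String raw) {
        return (raw == null || raw.trim().isEmpty()) ? SENTINEL : raw;
    }

    /** At display/query time: did this document really have a value? */
    public static boolean hasRealValue(String indexed) {
        return !SENTINEL.equals(indexed);
    }
}
```

A query such as myField:__EMPTY__ then finds the empty documents cheaply, and -myField:__EMPTY__ excludes them, with no wildcard expansion at all.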
Re: How can I make an analyzer that ignores the numbers of the texts?
You can define your own STOP_LIST and pass it in as a constructor argument to most analyzers. For example, from the Lucene Javadocs: StandardAnalyzer public *StandardAnalyzer*(String http://java.sun.com/j2se/1.4/docs/api/java/lang/String.html[] stopWords) Builds an analyzer with the given stop words. The only thing that you need to be careful of is to make sure that the analyzer isn't doing some sort of conversion of the tokens before the stop list is checked, but otherwise that should work out just fine. Matt Ariel wrote: Hi everybody: I would like to know how I can make an analyzer that ignores the numbers of the texts, the way stop words are ignored. For example, the terms 3.8, 100, 4.15, 4,33 shouldn't be added to the index. How can I do that? Regards Ariel
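Since numbers can't be enumerated in a stop list, another option is to drop any token that parses as numeric. In a real Lucene analyzer this test would live inside a custom TokenFilter; the plain-Java sketch below just shows the filtering logic on a token list (class and method names are illustrative).

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the filtering step discussed above: drop any token that is
 * purely numeric (3.8, 100, 4.15, 4,33), the way a stop list drops stop
 * words. Shown as token-list post-processing rather than a TokenFilter.
 */
public class NumericTokenDropper {

    /** True if the token is purely numeric, allowing '.' and ',' separators. */
    public static boolean isNumeric(String token) {
        return token.matches("[0-9]+([.,][0-9]+)*");
    }

    /** Return only the non-numeric tokens, preserving order. */
    public static List<String> dropNumbers(List<String> tokens) {
        List<String> kept = new ArrayList<String>();
        for (String t : tokens) {
            if (!isNumeric(t)) {
                kept.add(t);
            }
        }
        return kept;
    }
}
```

Note this keeps mixed tokens like "a4"; tighten or loosen the regex to taste.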
Re: What is the right query syntax for matching some field's substring?
Which analyzer are you using here? Depending on your choice, the comma-separated values might be being kept together in your index, rather than tokenized as you expected. Secondly, you should get Luke and take a look into your index; this should give you a much better idea of what's going on in your index. Anyhow, closely examine your analyzer choice, and then your query type choice, and see if that's where the problem lies. Matt Bon wrote: Hi all, I have a question about the query syntax statement. There is a Lucene text field and the value of the field is like ,11,12,15,16, If I want to query some data where the value of the field includes some number that I want (11 or 15), how can I do it? I tried a query like (field_name:,11,) but it does not get a match. Or must I reformat the field value with some other symbol, not the comma ','? Bon
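To make values like ",11,12,15,16," matchable per-number, the field should be tokenized on commas at indexing time. In Lucene this splitting would be done by the analyzer; the helper below is just a plain-Java sketch of the token stream you want to end up with.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the tokenization fix discussed above: break a comma-delimited
 * field value like ",11,12,15,16," into individual tokens so a query for
 * 11 or 15 can match directly, with no substring tricks.
 */
public class CommaTokenizerSketch {

    public static List<String> tokenize(String fieldValue) {
        List<String> tokens = new ArrayList<String>();
        for (String part : fieldValue.split(",")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }
}
```

Once the index holds the tokens 11, 12, 15, 16, the query is simply field_name:11 OR field_name:15.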
Re: Memory Leak?
Perhaps this is a simple question, but looking at your stack trace, I'm not seeing where it was set during the tomcat initialization, so here goes: Are you setting up the jvm's heap size during your Tomcat initialization somewhere? If not, that very well could be part of your issue, as the standard JVM heapsize varies from platform to platform, so your windows based installation of tomcat simply might not have enough JVM Heap available to completely instantiate your RAMDirectory. So, to start what is your heap currently set at for tomcat? Secondly, if you try to increase it to a more reasonable value (say 512M or 1G) do you still run into this issue? Matt Chetan Shah wrote: The stack trace is attached. http://www.nabble.com/file/p22667542/dump dump The file size of _30.cfx - 1462KB _32.cfs - 3432KB _30.cfs - 645KB The source code of WatchListHTMLUtilities.getHTMLTitle is as follows : File f = new File(htmlFileName); FileInputStream fis = new FileInputStream(f); org.apache.lucene.demo.html.HTMLParser parser = new HTMLParser(fis); String title = parser.getTitle(); fis.close(); fis = null; f = null; return title; Michael McCandless-2 wrote: Hmm... after how many queries do you see the crash? Can you post the full OOME stack trace? You're using a RAMDirectory to hold the entire index... how large is your index? Mike Chetan Shah wrote: After reading this forum post : http://www.nabble.com/Lucene-Memory-Leak-tt19276999.html#a19364866 I created a Singleton For Standard Analyzer too. But the problem still persists. I have 2 singletons now. 1 for Standard Analyzer and other for IndexSearcher. 
The code is as follows:

package watchlistsearch.core;

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

import watchlistsearch.utils.Constants;

public class IndexSearcherFactory {

    private static IndexSearcherFactory instance = null;
    private IndexSearcher indexSearcher;

    private IndexSearcherFactory() { }

    public static IndexSearcherFactory getInstance() {
        if (IndexSearcherFactory.instance == null) {
            IndexSearcherFactory.instance = new IndexSearcherFactory();
        }
        return IndexSearcherFactory.instance;
    }

    public IndexSearcher getIndexSearcher() throws IOException {
        if (this.indexSearcher == null) {
            Directory directory = new RAMDirectory(Constants.INDEX_DIRECTORY);
            indexSearcher = new IndexSearcher(directory);
        }
        return this.indexSearcher;
    }
}

---

package watchlistsearch.core;

import java.io.IOException;

import org.apache.log4j.Logger;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerFactory {

    private static AnalyzerFactory instance = null;
    private StandardAnalyzer standardAnalyzer;

    Logger logger = Logger.getLogger(AnalyzerFactory.class);

    private AnalyzerFactory() { }

    public static AnalyzerFactory getInstance() {
        if (AnalyzerFactory.instance == null) {
            AnalyzerFactory.instance = new AnalyzerFactory();
        }
        return AnalyzerFactory.instance;
    }

    public StandardAnalyzer getStandardAnalyzer() throws IOException {
        if (this.standardAnalyzer == null) {
            this.standardAnalyzer = new StandardAnalyzer();
            logger.debug("StandardAnalyzer Initialized..");
        }
        return this.standardAnalyzer;
    }
}
Re: newbie seeking explanation of semantics of Field class
Comments inline: rolaren...@earthlink.net wrote: R2.4 I have been looking through the soon-to-be-superseded (by its 2nd ed.) book Lucene In Action (hope it's ok on this newsgroup to say I like that book); also at these two tutorials: http://darksleep.com/lucene/ and http://www.informit.com/articles/article.aspx?p=461633seqNum=3 and also at the Lucene online docco (http://lucene.apache.org/java/2_4_0/index.html) the last of which has nothing on the topic at all! I've also tried to search http://www.nabble.com/Lucene---Java-Users-f45.html -- but there are almost 10,000 docs there on Field. so that is too much data. The book is consistent with the two tutorials, but all three seem to be out of date (and the design less clear) compared to the code: http://lucene.apache.org/java/2_4_0/api/index.html I have copied some code and it is working for me, but I am a little uncertain how to decide what value of Field.Index and Field.Store to choose in order to get the behavior I'd like. If I read the javadocs, and decide to ignore all the expert items, it looks like this: Field.Store.NO = I'll never see that data again; I wonder why I'd do this? This is useful in the cases where you have data you want to be able to search by, but never need to display it. For example in my application we have complex data like: kitGsfco1 ^ http://www.informatics.jax.org/javawi2/servlet/WIFetch?page=alleleDetailid=MGI:3530308 In one of our searchable indexes we do quite a bit of transformation to this data, and remove all of the punctuation, etc etc. so it turns into: kit gsfcol This is great for searching, cause it allows us to have punctuation irrelevant search results, but the user simply doesn't care whatsoever. So at display time, we show them the unmodified, case sensitive version of this data, which is stored in another field. 
Field.Store.YES = good, the data will be stored. Storage takes up space, so if you are ONLY going to search on a piece of data, and never display it, you should not store it. Field.Store.COMPRESS = even better, stored and compressed; why would anyone do anything else? I agree. Field.Index.NO = I cannot search that data, but if I need its value for a given document (e.g., to decorate a result), I can retrieve it (use-case: maybe, the date the document was created -- but why not just make that searchable? I am having a hard time thinking of an actually useful piece of data that could go here and would not want to be one of ANALYZED or NOT_ANALYZED) Correct, you use this type of data as additional information about the data you matched on. Field.Index.ANALYZED = the normal value, I would guess, except in the special case of stuff not searchable but used to decorate results (Field.Index.NO) Correct. Field.Index.NOT_ANALYZED = I can search for this value, but it won't get analyzed, so it is searched for as the very same value I put in (the docco suggests product numbers: any other interesting use-cases anyone can suggest?) It's highly useful for exact match searching. = thanks in advance for helping me get clearer on this! -Paul
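The guidance in this thread boils down to two independent decisions per field. A small decision helper, purely as a summary of the thread (the returned strings name Lucene 2.4 constants; the class itself is not a Lucene API):

```java
/**
 * Summary of the Field combinations discussed above, as a decision helper:
 * store what you display, analyze what you search by words, leave exact
 * identifiers untokenized, and mark decorate-only data as not indexed.
 */
public class FieldChoice {

    public static String storeFor(boolean neededAtDisplayTime) {
        // Storage costs space: only store what you will show to the user.
        return neededAtDisplayTime ? "Field.Store.YES" : "Field.Store.NO";
    }

    public static String indexFor(boolean searchable, boolean exactMatchOnly) {
        if (!searchable) {
            return "Field.Index.NO";        // decorate-only data
        }
        // Product numbers, IDs: index the literal value untokenized.
        return exactMatchOnly ? "Field.Index.NOT_ANALYZED" : "Field.Index.ANALYZED";
    }
}
```

So Matt's punctuation-stripped search field would be storeFor(false) plus indexFor(true, false), while the display copy is storeFor(true) plus indexFor(false, false).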
Re: waaaay too many files in the index!
Did you optimize your index? If not, depending on your merge factor, this could be a very normal index for you. -Matt John Byrne wrote: Hi, I've got a weird problem with a Lucene index, using 2.3.1. The index contains 6660 files. I don't know how this happened. Maybe someone can tell me something about the files themselves? (examples below) On one day, between 10 and 40 of these files were being created every minute. The index updates are triggered by updates to an SVN repository, but I can't find any corresponding activity in the SVN logs. The Lucene files all have names like this: _1qsw.cfs _1qsx.cfs _1qsy.cfs _1qsz.cfs _1qt0.cfs and are mostly 5K in size. My application uses just one instance each of IndexReader/IndexWriter/IndexSearcher. Can anyone shed any light on these files? I'm not too hopeful about fixing this index because we are getting too many open files, even with an unlimited ulimit, but any info/suggestions would be great. Thanks. -John
Re: Performance issue
Do you NEED to be using 7 fields here? Like Erick said, if you could give us an example of the types of data you are trying to search against, it would be quite helpful. It's possible that you might be able to, say, collapse your 7 fields down to a single field, which would likely reduce the overall number of OR clauses in your searches, speeding things up nicely. On my project we serve two-letter prefix searches in under a second, for much larger datasets. A lot of this however is directly due to how our indexes are structured. -Matt Erick Erickson wrote: Prefix queries are expensive here. The problem is that each one forms a very large OR clause on all the terms that start with those two letters. For instance, if a field in your index contained mine milanta mica a prefix search on mi would form mine OR milanta OR mica. Doing this across seven fields could get expensive. Two things: 1 what is the problem you are trying to solve? Perhaps some of the folks on the list can give you some suggestions. You can think about many strategies depending upon what you want to accomplish. A 300M index isn't very big, so you could, for instance, think about indexing a separate field that contains only the two beginning letters and search *that* in this case. I'll assume that three letter prefix queries are OK. 2 How are you measuring query time? If you're measuring the time it takes when you first start a searcher, be aware that the first few queries are usually slow because the caches haven't been filled. Further, are you measuring total response time or are you measuring *just* the query time? It's possible that the time is being spent assembling the response in your code rather than actual searching. You might insert some timers to determine that. Best Erick On Mon, Feb 2, 2009 at 2:58 AM, Mittal, Sourabh (IDEAS) sourabh-931.mit...@morganstanley.com wrote: Hi All, We face serious performance issues when users do 2 letter search e.g ho, jo, pa ma, um ar, ma fi etc.
The time taken is between 10 and 15 secs. Below are our implementation details: 1. Search performs on 7 fields. 2. PrefixQuery implementation on all fields 3. AND search. 4. Our index size is 300 MB. 5. We show only the 100 top documents, on the basis of score. 6. We use StandardAnalyzer / StandardTokenizer for indexing and searching. 7. Lucene 2.4 8. JDK 1.6 Please suggest how we can improve the performance. Regards, Sourabh Mittal Morgan Stanley | IDEAS Practice Areas Manikchand Ikon | South Wing 18 | Dhole Patil Road Pune, 411001 Phone: +91 20 2620-7053 sourabh-931.mit...@morganstanley.com -- NOTICE: If received in error, please destroy and notify sender. Sender does not intend to waive confidentiality or privilege. Use of this email is prohibited when received in error.
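Erick's suggestion of a separate field holding only the first two letters can be sketched as a tiny indexing-side helper. Field naming and placement are assumptions for illustration; the point is that a two-letter "prefix" search then becomes one cheap exact TermQuery per field instead of a huge expanded OR clause.

```java
/**
 * Sketch of the auxiliary-field idea above: at indexing time, derive a
 * two-letter prefix from each term and index it in a dedicated field, so
 * a two-letter user query is answered by an exact term lookup.
 */
public class PrefixFieldSketch {

    /** Value to index in the auxiliary two-letter prefix field. */
    public static String twoLetterPrefix(String term) {
        String t = term.toLowerCase();
        return t.length() <= 2 ? t : t.substring(0, 2);
    }
}
```

At query time, a user input of "mi" is routed to prefix2:mi rather than field:mi*; three letters and up can keep using PrefixQuery.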
Re: Filtering accents
If you are constrained in such a way as to not use the FrenchAnalyzer, you might instead consider transforming the input as an additional step at both search and indexing time. Use something like a regex that looks for é and always replaces it with e in the index, and at search time. (Expand this transformation step as needed.) You likely also need to store the original word somewhere, so I would suggest adding a second stored, but unindexed field that stores the original value of the word, so when you match on your search criteria, you will also get the original form of the word in your hits object. Hope this helps, Matt egrand thomas wrote: Dear all, I'd like my Lucene searches to be insensitive to (French) accents. For example, considering an indexed term métal, I want to get it when searching for metal or métal. I use lucene-2.3.2 and the searches are performed with IndexSearcher.search(query, filter, sorter). Another filter is already used together with a Sort object. Furthermore, I cannot use the FrenchAnalyzer as my index does not only contain French words. Can anybody help? Thanks in advance, Tom
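The regex-replacement transform Matt describes can be generalized with Unicode decomposition instead of one rule per accented letter: decompose to NFD, then strip the combining marks. A minimal sketch using only the standard library (apply the same transform at indexing time and at query time):

```java
import java.text.Normalizer;

/**
 * Sketch of the accent-stripping transform described above: NFD
 * decomposition splits é into e plus a combining acute accent, and the
 * \p{M} regex class then removes the combining marks, leaving plain e.
 */
public class AccentFolder {

    public static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches the combining diacritical marks left behind by NFD.
        return decomposed.replaceAll("\\p{M}", "");
    }
}
```

This is language-neutral, so it sidesteps the FrenchAnalyzer constraint; the unmodified word still goes into the second stored-but-unindexed field for display.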
Re: IDF scoring issue
Well, you could also do a simple test of removing IDF from the scoring equation and seeing if the query then reacts the way you want it to. Simply write your own custom similarity that does this, and test out to see how it works. Handily enough, I've already done this, so here's some code you can try. Fix the package declaration to something that works for you, and then simply use the custom similarity at the appropriate times.

==
package org.jax.mgi.shr.searchtool;

import org.apache.lucene.search.DefaultSimilarity;

/**
 * This is our custom similarity class, which removes document frequency
 * from the calculation of score.
 *
 * It extends the DefaultSimilarity class, and thusly inherits most of its
 * methods from it.
 *
 * @author mhall
 */
public class MGISimilarity extends DefaultSimilarity {

    /**
     * If we have any doc frequency at all in the index, normalize it to 1
     * (the document exists). Otherwise, return 0 (does not exist).
     *
     * @param docFreq This item's doc frequency
     * @param numDocs How many documents this item appears in.
     *
     * This API is enforced by the DefaultSimilarity class.
     */
    public float idf(int docFreq, int numDocs) {
        if (docFreq > 0) {
            return 1.0f;
        } else {
            return 0.0f;
        }
    }
}
===

Rajiv2 wrote: Because the search term is provided by a user, and that user would explicitly have to put quotes around marietta ga, when I believe the search text as it is: fleming roofing inc., marietta ga -- should score higher for marietta ga rajiv Grant Ingersoll-6 wrote: On Dec 16, 2008, at 8:19 PM, Rajiv2 wrote: Hello, I'm using the default Lucene QueryParser on the search text: fleming roofing inc., marietta ga Also, I don't want to modify the search text by putting quotes around marietta ga which forces the query parser to make a phrase query. Why not?
Re: How to search for -2 in field?
Are you absolutely, 100% sure that the -2 token has actually made it into your index? As a VERY basic way to check this, try something like this:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;

public class IndexTerms {
    public static void main(String[] args) {
        try {
            IndexReader ir = IndexReader.open("C:/Search/index/index");
            TermEnum te = ir.terms();
            while (te.next()) {
                System.out.println(te.term().text());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Then look through the output, verifying that the tokens you are expecting to exist in your index actually do. I have a feeling that whatever analyzer you are using is dropping the - from the front of your -2 at indexing time, and if so it can sometimes be pretty hard to tell via Luke. Hope this helps, -Matt Darren Govoni wrote: Tried them all, with quotes, without. Doesn't work. At least in Luke it doesn't. On Fri, 2008-12-12 at 07:03 +0530, prabin meitei wrote: whitespace analyzer will tokenize on white space irrespective of quotes. Use standard analyzer or keyword analyzer. Prabin meitei toostep.com On Thu, Dec 11, 2008 at 11:28 PM, Darren Govoni dar...@ontrenet.com wrote: I'm using Luke to find the right combination of quotes, \'s and analyzers. No combination can produce a positive result for -2 String for the field 'type'. (any -number String) type: 0 -2 Word analyzer: query - rewritten = result default field is 'type'. WhitespaceAnalyzer: \-2 ConfigurationFile\ - type:-2 type:ConfigurationFile = NO -2 ConfigurationFile - -type:2 type:ConfigurationFile = NO \-2 ConfigurationFile - type:-2 type:ConfigurationFile = NO \-2 ConfigurationFile - type:-2 ConfigurationFile = NO (thought this one would work). Same results for the other analyzers, more or less. Weird. Darren On Thu, 2008-12-11 at 23:02 +0530, prabin meitei wrote: Hi, While constructing the query give the query string in quotes.
eg: query = queryparser.parse(\-2 word\); Prabin meitei toostep.com On Thu, Dec 11, 2008 at 10:37 PM, Darren Govoni dar...@ontrenet.com wrote: I'm hoping to do this with a simple query string, but not sure if it's possible. I'll try your suggestion though as a workaround. Thanks!! On Thu, 2008-12-11 at 16:48 +, Robert Young wrote: You could do it with a TermQuery but I'm not quite sure if that's the answer you're looking for. Cheers Rob On Thu, Dec 11, 2008 at 3:59 PM, Darren Govoni dar...@ontrenet.com wrote: Hi, This might be a dumb question, but I have a simple field like this field: 0 -2 Word that is indexed, tokenized and stored. I've tried various ways in Lucene (using Luke) to search for -2 Word and none of them work; the query is re-written improperly. I escaped the -2 to \-2 Word and it still doesn't work. I've used all the analyzers. What's the trick here? Thanks, Darren
Re: Using AND with MultiFieldQueryParser
Which Analyzer have you assigned per field? The PerFieldAnalyzerWrapper uses a default analyzer (the one you passed during its construction), and then you assign specific analyzers to each field that you want to have special treatment. For example:

PerFieldAnalyzerWrapper aWrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
aWrapper.addAnalyzer("data", new MGIAnalyzer());
aWrapper.addAnalyzer("sdata", new StemmedMGIAnalyzer());

Now, for the fields in question, have you assigned an Analyzer that doesn't actually use stopwords? (There are several available in core.) Or are you perchance using a custom Analyzer that doesn't process stop words? Could you possibly post your initialization code for this? If so I think we could be of more help to you. Matt Rafael Cunha de Almeida wrote: On Thu, 13 Nov 2008 14:53:59 +0530 prabin meitei [EMAIL PROTECTED] wrote: Hi, From whatever you have written you are trying to write a query *word1 AND stopword AND word2*; this means that the result should contain all of word1, word2 and the stopword. Since you have already removed the stopword during index time you will never find any document matching your query. (This is expected behaviour.) You can possibly use word1 OR stopword OR word2 (depends on what you want in the result). If you can clarify more about what you want in the result we can discuss what can be done. I wanted MultiFieldQueryParser to ignore any stopword the user may type in. In that particular case I'd like the result to be word1 AND word2. I thought that was what would happen because I pass the Analyzer to MultiFieldQueryParser, so I expected the parser to ignore stopwords for fields whose analyzer drops stopwords (I use a PerFieldAnalyzerWrapper analyzer). On Thu, Nov 13, 2008 at 10:30 AM, Rafael Cunha de Almeida [EMAIL PROTECTED] wrote: Hello, I used an Analyzer which removes stopwords when indexing, then I wanted to do an AND search using MultiFieldQueryParser.
So I did this: word1 AND stopword AND word2 I thought the stopword would be ignored by the searcher (I use the same Analyzer to index and search). But instead, I get no results whenever I have a stopword like that. If I remove the stopword, giving me: word1 AND word2 then the search is successful. Is that the expected behaviour? Am I doing something wrong?
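One way around this, independent of which analyzer the parser ends up using, is to strip stopwords from the user's terms before building the "a AND b AND c" string, so a stopword can never become a mandatory clause that the stopword-free index cannot satisfy. A plain-Java sketch (the stop set here is a tiny illustrative sample, not Lucene's list):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Sketch of pre-filtering the user's terms: drop stopwords first, then
 * join what remains with AND, so the query only requires terms that can
 * actually exist in a stopword-free index.
 */
public class StopwordSafeAndQuery {

    private static final Set<String> STOPS =
            new HashSet<String>(Arrays.asList("the", "a", "an", "and", "of"));

    public static String buildAndQuery(List<String> userTerms) {
        List<String> kept = new ArrayList<String>();
        for (String t : userTerms) {
            if (!STOPS.contains(t.toLowerCase())) {
                kept.add(t);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < kept.size(); i++) {
            if (i > 0) sb.append(" AND ");
            sb.append(kept.get(i));
        }
        return sb.toString();
    }
}
```

For the thread's example, word1, stopword, word2 collapses to word1 AND word2, which is exactly the query that succeeds.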
Re: Lucene vs. Database
Another thing you could consider is that rather than meshing all this data into a single index, logically break out the data you need for searching into one index, and the data you need for display into another index. This is the technique we use here and it's been wildly successful for us, as compared to going directly into the database. Our DB is structured for ease of data entry/annotation rather than for ease of display. So we use our display index to have standard realized datafields that are pulled from various tables in the database. So, when we search we include the unique key that each matched term points to, and then use this unique key to pull out our display-time information. It's worked pretty well for us, so it's certainly a viable approach. - Matt agatone wrote: Hi, I asked this question already on the lucene-general list but also got advised to ask here too. I'm working on a project that has a big database in the background (some tables have about 150 rows). We decided to use Lucene for faster search. Our search works similar to all searches: you write a search string, get a list of hits with a detail link. But there is a dilemma whether we should store more data into the index than is needed. One side of the developing team insists that we should use the Lucene index as some kind of storage for data, so when you get a hit, you go onto details and then again use Lucene to find the document that matches the selected ID and take the data from the Lucene index. So in the end you end up copying complete database tables into the Lucene index. The other side insists on storing to the index only data that is displayed directly to the user when showing the search results list and needed for search criteria. When you go onto details, you have the matching ID so you can pick up that row from the database by that ID rather than search for it inside the Lucene index. Can someone please describe drawbacks and advantages of both approaches.
Actually, can someone describe the actual benefit of Lucene itself, where and when, in a real production environment? It would be great if there is anyone who could share his experience with indexing and searching large amounts of data. Thank you - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
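A minimal sketch of the split Matt describes, against the Lucene 2.x-era API (field names, key values and the sample text are all illustrative): the search index carries only analyzed, unstored text plus a stored unique key, and display data would be fetched separately by that key from a display index or the database.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class TwoIndexSketch {
    static String searchAndGetKey() throws Exception {
        RAMDirectory searchIndex = new RAMDirectory();
        IndexWriter w = new IndexWriter(searchIndex, new StandardAnalyzer(), true);
        Document d = new Document();
        // Search index: analyzed text is unstored; only the unique key is stored.
        d.add(new Field("text", "mouse genome annotation", Field.Store.NO, Field.Index.TOKENIZED));
        d.add(new Field("uniqueKey", "row-42", Field.Store.YES, Field.Index.UN_TOKENIZED));
        w.addDocument(d);
        w.close();

        IndexSearcher s = new IndexSearcher(searchIndex);
        Hits hits = s.search(new TermQuery(new Term("text", "genome")));
        // The stored key is all we pull from the search index; display data
        // would be looked up in a second index or the database by this key.
        String key = hits.doc(0).get("uniqueKey");
        s.close();
        return key;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(searchAndGetKey()); // row-42
    }
}
```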
Re: Query attached words
We have a similar requirement here at our work. In order to get around it we create two indexes, one in which punctuation is relevant, and one in which all punctuation is treated as a place to break tokens. We then do a search against both indexes and merge the results; it seems that such a technique might be able to help you here as well. (Though upon rereading it seems like perhaps you want SOME punctuation to be relevant, but other punctuation not; the technique itself could still be applied with those rules used instead) - Matt Jean-Claude Antonio wrote: Thanks Erick, you are right about the various combinations. Cheers, Erick Erickson wrote: Yes you can query *method. But you have to turn on leading wildcards (the exact call for which I don't have right on the tips of my fingers, but I know it's been an option for some time now). But your solution doesn't scale well. If you had a.b.c.d.e.f.g.h you'd have to store many combinations in order to do what you want, quickly becoming really, really ugly. But you could store the tokens a . b . c . e . f . g . h by using the appropriate analyzer (or perhaps rolling your own). Then you could use either PhraseQuerys or SpanQuerys to do what you want Best Erick On Mon, Sep 22, 2008 at 5:40 PM, Jean-Claude Antonio [EMAIL PROTECTED] wrote: Hello, If I had a file with the following content: ... object.method(); ... I would like to be able to query for object method object.method My guess is that I should store not only object.method, but also object and method as I cannot query *method. Any other suggestion? Kind regards, JClaude - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
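Erick's suggestion of keeping the punctuation as its own token can be sketched roughly like this (Lucene 2.x-era API; WhitespaceAnalyzer and pre-split text stand in for a custom analyzer that emits "." as a token): with `object . method` indexed at three positions, a phrase query for `object method` with a slop of 1 spans the dot.

```java
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.store.RAMDirectory;

public class PunctuationTokenSketch {
    static int countMatches() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter w = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
        Document d = new Document();
        // Pre-split text stands in for an analyzer that emits "." as a token,
        // so the positions are: object(0) .(1) method(2).
        d.add(new Field("code", "object . method", Field.Store.NO, Field.Index.TOKENIZED));
        w.addDocument(d);
        w.close();

        PhraseQuery pq = new PhraseQuery();
        pq.add(new Term("code", "object"));
        pq.add(new Term("code", "method"));
        pq.setSlop(1); // allow one intervening token (the ".")
        IndexSearcher s = new IndexSearcher(dir);
        Hits hits = s.search(pq);
        int n = hits.length();
        s.close();
        return n;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countMatches()); // 1
    }
}
```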
Re: lucene Front-end match
The reason the wildcard is being dropped is that you have wrapped it in a phrase query. Wildcards are not supported in Phrase Queries. At least not in any Analyzers that I'm aware of. A really good tool to see the transformations that happen to a query is Luke, open it up against your index, go into the search section, choose the analyzer you use and start playing around. This has helped me countless times when creating my own queries and not getting the results that I expect. -Matt 叶双明 wrote: I am sorry, I just put the string to QueryParser. But what makes me confused is the code: Query query = qp.parse("bbb:\"b*\" AND ccc:\"cc*\""); doesn't work as I have expected. It drops the wildcard *. 2008/9/19, 叶双明 [EMAIL PROTECTED]: Thanks! Now, I just use Query query = qp.parse("a*"); and it meets my requirements. Another question: how to parse a query string like: title:The Right Way AND text:go please show me in Java code. thanks. 2008/9/19 Karl Wettin [EMAIL PROTECTED] 19 sep 2008 kl. 11.05 skrev 叶双明: Document stored/uncompressed,indexed field:abc Document stored/uncompressed,indexed field:bcd How can I get the first Document by some query string like a , ab or abc but no b and bc? You would create an ngram filter that creates grams from the first position only. Take a look at EdgeNGramTokenFilter in contrib/analyzers. karl - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Sorry for my English!! 明 Please help me correct my English expression and error in syntax - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: lucene Front-end match
To be more specific (just in case you are new to Lucene) Your Query: Query query = qp.parse("bbb:\"b*\" AND ccc:\"cc*\""); What I think you actually want here: Query query = qp.parse("bbb:b* AND ccc:cc*"); Give it a shot, and then like I said, go get Luke, it will help you tremendously ^^ Matthew Hall wrote: The reason the wildcard is being dropped is that you have wrapped it in a phrase query. Wildcards are not supported in Phrase Queries. At least not in any Analyzers that I'm aware of. A really good tool to see the transformations that happen to a query is Luke, open it up against your index, go into the search section, choose the analyzer you use and start playing around. This has helped me countless times when creating my own queries and not getting the results that I expect. -Matt 叶双明 wrote: I am sorry, I just put the string to QueryParser. But what makes me confused is the code: Query query = qp.parse("bbb:\"b*\" AND ccc:\"cc*\""); doesn't work as I have expected. It drops the wildcard *. 2008/9/19, 叶双明 [EMAIL PROTECTED]: Thanks! Now, I just use Query query = qp.parse("a*"); and it meets my requirements. Another question: how to parse a query string like: title:The Right Way AND text:go please show me in Java code. thanks. 2008/9/19 Karl Wettin [EMAIL PROTECTED] 19 sep 2008 kl. 11.05 skrev 叶双明: Document stored/uncompressed,indexed field:abc Document stored/uncompressed,indexed field:bcd How can I get the first Document by some query string like a , ab or abc but no b and bc? You would create an ngram filter that creates grams from the first position only. Take a look at EdgeNGramTokenFilter in contrib/analyzers. karl - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Sorry for my English!!
明 Please help me correct my English expression and error in syntax - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Matthew Hall Software Engineer Mouse Genome Informatics [EMAIL PROTECTED] (207) 288-6012 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
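The difference Matt points out can be seen directly in the parser's output (Lucene 2.x-era API; the field names come from the thread). Quoting the wildcard turns it into a phrase, where `*` loses its meaning and gets stripped by the analyzer:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;

public class WildcardParseDemo {
    static String parse(String input) throws Exception {
        return new QueryParser("bbb", new StandardAnalyzer()).parse(input).toString();
    }

    public static void main(String[] args) throws Exception {
        // Unquoted: prefix queries survive.
        System.out.println(parse("bbb:b* AND ccc:cc*")); // +bbb:b* +ccc:cc*
        // Quoted: analyzed as a phrase text, so the * is stripped.
        System.out.println(parse("bbb:\"b*\" AND ccc:\"cc*\""));
    }
}
```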
Re: AW: AW: Search with multiple wildcards
Well, you could certainly manipulate your search string, removing the wildcard characters, and then use that for what you pass to the highlighter. That should give you the functionality you are looking for. -Matt mark harwood wrote: Is this possible? Not currently, the highlighter works with a list of words (or words AND phrases using the new span support) and highlights those. To do anything else would require the highlighter to faithfully re-implement much of the logic in all of the different query types (fuzzy, wildcard, regex etc etc) which is much more challenging/difficult to maintain. - Original Message From: Sertic Mirko, Bedag [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Thursday, 11 September, 2008 12:07:36 Subject: AW: AW: Search with multiple wildcards Ok, one final question: If I query for *ll*, the query is expanded to (hallo or alle or ...), so the Highlighter will highlight the words hallo or alle. But how can I highlight only the original query, so only the ll? Is this possible? Thanks a lot Mirko -Ursprüngliche Nachricht- Von: mark harwood [mailto:[EMAIL PROTECTED] Gesendet: Donnerstag, 11. September 2008 11:20 An: java-user@lucene.apache.org Betreff: Re: AW: Search with multiple wildcards You need to call rewrite on the query to expand it then give that version to the highlighter - see the package javadocs. http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/highlight/package-summary.html#package_description Cheers Mark - Original Message From: Sertic Mirko, Bedag [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Thursday, 11 September, 2008 9:34:13 Subject: AW: Search with multiple wildcards Ok, I gave it a try, but I ran into this TooManyClauses Exception.
I see that wildcard queries are expanded before they are processed, and I see that I can set the clauses count to Integer.MAX_VALUE, and queries can consume a lot of memory, but one final thing is still open: does a wildcard query work together with the Lucene Highlighter? I tried it, but I only got an empty result. Without wildcards, the highlighter works pretty smoothly! Regards Mirko -Ursprüngliche Nachricht- Von: Erick Erickson [mailto:[EMAIL PROTECTED] Gesendet: Mittwoch, 10. September 2008 18:15 An: java-user@lucene.apache.org Betreff: Re: Search with multiple wildcards Of course you can construct your own BooleanQuery programmatically. It's relatively easy, just try it. On Wed, Sep 10, 2008 at 11:52 AM, Sertic Mirko, Bedag [EMAIL PROTECTED] wrote: Jep, this is what I have read. Do I need to use the query parser, or can I create a query via the API? Is there an example available? Thanks a lot Mirko -Ursprüngliche Nachricht- Von: Erick Erickson [mailto:[EMAIL PROTECTED] Gesendet: Mittwoch, 10. September 2008 16:45 An: java-user@lucene.apache.org Betreff: Re: Search with multiple wildcards Is this what you're referring to? Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries). (from http://lucene.apache.org/java/docs/queryparsersyntax.html) I'm pretty sure you can have multiple *terms* with wildcards. Luke is your friend here, download a copy and try it <G>. Be sure on the search tab to specify StandardAnalyzer or some such, rather than KeywordAnalyzer. The phrase is trying to point out that a phrase query does NOT respect wildcards. That is, submitting ab* bc* cd* AS A PHRASE QUERY won't do what you expect. But I'm pretty sure that +field:ab* +field:bc* +field:cd* will work just fine. The key here is within single terms, which I think of as within a single term query. You can add as many TermQuerys as you want. See the query documentation for how to submit phrase queries.
Best Erick On Wed, Sep 10, 2008 at 10:11 AM, Sertic Mirko, Bedag [EMAIL PROTECTED] wrote: Hi Thank you for your quick response:-) Of course I need to use the * character :-) But I have read somewhere in the documentation that leading wildcards are not supported, and only one wildcard term per query. Is this limitation resolved in the current version? Regards Mirko -Ursprüngliche Nachricht- Von: Erick Erickson [mailto:[EMAIL PROTECTED] Gesendet: Mittwoch, 10. September 2008 15:47 An: java-user@lucene.apache.org Betreff: Re: Search with multiple wildcards Sure, but you'll have to set the leading wildcard option, which I've forgotten the exact call for, but it's in the docs. And use * rather than % <G>. But wildcards are tricky, especially the TooManyClauses exception. You might want to peruse the archive for wildcard posts... Best Erick On Wed, Sep 10, 2008 at 9:06 AM, Sertic Mirko, Bedag [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] Is it possible to do a search with multiple wildcards in one query, for instance %MANAGE% AND CORE%? Is there a code example available? Thanks a lot Mirko
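Mark's advice about rewriting the query before highlighting, sketched end to end (Lucene 2.x-era core plus the contrib highlighter; the text and field name are illustrative). A WildcardQuery is only expanded into concrete terms by rewrite(), and the Highlighter can only mark terms it actually sees:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.store.RAMDirectory;

public class WildcardHighlightDemo {
    static String highlight() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer();
        IndexWriter w = new IndexWriter(dir, analyzer, true);
        String text = "hallo alle zusammen";
        Document d = new Document();
        d.add(new Field("body", text, Field.Store.YES, Field.Index.TOKENIZED));
        w.addDocument(d);
        w.close();

        IndexReader reader = IndexReader.open(dir);
        Query wildcard = new WildcardQuery(new Term("body", "*ll*"));
        // Without rewrite() the highlighter sees no concrete terms and
        // returns nothing; rewriting expands *ll* to "hallo", "alle", ...
        Query expanded = wildcard.rewrite(reader);
        Highlighter h = new Highlighter(new QueryScorer(expanded));
        String frag = h.getBestFragment(analyzer, "body", text);
        reader.close();
        return frag;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(highlight()); // e.g. <B>hallo</B> <B>alle</B> zusammen
    }
}
```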
Re: AW: AW: Search with multiple wildcards
Ah.. that's a darn good point.. Though, that second bit of code you have there could be used at display time for him to get the functionality that he wants. You could also modify it somewhat, and apply it against the displayable part of the hit he's getting back rather than the individual tokens. This is of course only assuming that this functionality would be used after a contains type search was detected. Considering that's the only real use case for a technique like this, I'm thinking it's probably more trouble than it's worth for his case. mark harwood wrote: That should give you the functionality you are looking for. If I understand your suggestion correctly, it won't. The Highlighter uses a tokenized version of the document text. Simplistically it does the following pseudo code: for all tokens in documentTokenStream, if(queryTermsSet.contains(token)) output "<b>"+token+"</b>" else output token NOT for all tokens in query string fullDocumentString.replaceAll(queryStringToken, "<b>"+queryStringToken+"</b>") So in the given example while you suggest manipulating ll to be in the query string, you cannot make ll appear as a token in documentTokenStream. Actually the Highlighter logic is a fair bit more involved than this (especially when using SpanQueryScorer) but the basis of it is there in the above pseudo code. - Original Message From: Matthew Hall [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Thursday, 11 September, 2008 14:40:26 Subject: Re: AW: AW: Search with multiple wildcards Well, you could certainly manipulate your search string, removing the wildcard characters, and then use that for what you pass to the highlighter. That should give you the functionality you are looking for. -Matt mark harwood wrote: Is this possible? Not currently, the highlighter works with a list of words (or words AND phrases using the new span support) and highlights those.
To do anything else would require the highlighter to faithfully re-implement much of the logic in all of the different query types (fuzzy, wildcard, regex etc etc) which is much more challenging/difficult to maintain. - Original Message From: Sertic Mirko, Bedag [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Thursday, 11 September, 2008 12:07:36 Subject: AW: AW: Search with multiple wildcards Ok, one final question: If I query for *ll*, the query is expanded to (hallo or alle or ...), so the Highlighter will highlight the words hallo or alle. But how can I highlight only the original query, so only the ll? Is this possible? Thanks a lot Mirko -Ursprüngliche Nachricht- Von: mark harwood [mailto:[EMAIL PROTECTED] Gesendet: Donnerstag, 11. September 2008 11:20 An: java-user@lucene.apache.org Betreff: Re: AW: Search with multiple wildcards You need to call rewrite on the query to expand it then give that version to the highlighter - see the package javadocs. http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/highlight/package-summary.html#package_description Cheers Mark - Original Message From: Sertic Mirko, Bedag [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Thursday, 11 September, 2008 9:34:13 Subject: AW: Search with multiple wildcards Ok, I gave it a try, but I ran into this TooManyClauses Exception. I see that wildcard queries are expanded before they are processed, and I see that I can set the clauses count to Integer.MAX_VALUE, and queries can consume a lot of memory, but one final thing is still open: does a wildcard query work together with the Lucene Highlighter? I tried it, but I only got an empty result. Without wildcards, the highlighter works pretty smoothly! Regards Mirko -Ursprüngliche Nachricht- Von: Erick Erickson [mailto:[EMAIL PROTECTED] Gesendet: Mittwoch, 10.
September 2008 18:15 An: java-user@lucene.apache.org Betreff: Re: Search with multiple wildcards Of course you can construct your own BooleanQuery programmatically. It's relatively easy, just try it. On Wed, Sep 10, 2008 at 11:52 AM, Sertic Mirko, Bedag [EMAIL PROTECTED] wrote: Jep, this is what I have read. Do I need to use the query parser, or can I create a query via the API? Is there an example available? Thanks a lot Mirko -Ursprüngliche Nachricht- Von: Erick Erickson [mailto:[EMAIL PROTECTED] Gesendet: Mittwoch, 10. September 2008 16:45 An: java-user@lucene.apache.org Betreff: Re: Search with multiple wildcards Is this what you're referring to? Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries). (from http://lucene.apache.org/java/docs/queryparsersyntax.html) I'm pretty sure you can have multiple *terms* with wildcards. Luke is your friend here, download a copy and try it <G>. Be sure on the search tab to specify StandardAnalyzer or some such, rather than KeywordAnalyzer. The phrase
Re: escaping special characters
You can simply change your input string to lowercase before passing it to the analyzers, which will give you the effect of escaping the boolean operators. (i.e. you will now search on and, or and not) Remember however that these are extremely common words, and chances are high that you are removing them via your stop words list in your analyzer. This is also assuming you are using an analyzer that does lowercasing as part of its normal processing, which many do. Matt Steven A Rowe wrote: On 08/11/2008 at 2:14 PM, Chris Hostetter wrote: Aravind R Yarram wrote: can I escape built-in Lucene keywords like OR, AND as well? as of the last time I checked: no, they're baked into the grammar. I have not tested this, but I've read somewhere on this list that enclosing OR and AND in double quotes effectively escapes them. (that may have changed when it switched from a javacc to a flex grammar though, so I'm not 100% positive) Although the StandardTokenizer was switched about a year ago from a JavaCC to a JFlex grammar, QueryParser's grammar remains in the JavaCC camp. Steve - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Matthew Hall Software Engineer Mouse Genome Informatics [EMAIL PROTECTED] (207) 288-6012 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
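Matt's lowercasing trick, sketched against the Lucene 2.x-era QueryParser (the field name is illustrative). Note his own caveat: with StandardAnalyzer's default stop list, a lowercased "and"/"or" is usually discarded entirely rather than searched on.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;

public class LowercaseOperatorsDemo {
    static String parse(String input) throws Exception {
        // Lowercasing before parsing stops AND/OR/NOT being read as
        // operators, since QueryParser only recognizes the uppercase forms.
        return new QueryParser("body", new StandardAnalyzer())
                .parse(input.toLowerCase()).toString();
    }

    public static void main(String[] args) throws Exception {
        // "and" survives operator parsing but is then dropped as a stop word.
        System.out.println(parse("cats AND dogs")); // body:cats body:dogs
    }
}
```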
Re: Using lucene as a database... good idea or bad idea?
Yeah.. we do the same thing here for indexes of up to 57M documents (rows), and that's just one part of our implementation. It takes quite a bit of.. wrangling to use Lucene in this manner.. but we've found it to be utterly worthwhile. Matt Ian Lea wrote: John I think it's a great idea, and do exactly this to store 5 million+ documents with info that it takes way too long to get out of our Oracle database (think days). Not as many docs as you are talking about, and less data for each doc, but I wouldn't have any concerns about scaling. There are certainly lucene indexes out there bigger than what you propose. You can compress the stored data to save some space. Run times for optimization might get interesting but see recent threads for suggestions on that. And since you are not too concerned about performance you may not need to optimize much, or even at all. Of course you need to remember that this is not a DBMS solution in the sense of transactions, recovery, etc. but I'm sure you are already aware of that. -- Ian. On Tue, Jul 29, 2008 at 2:53 AM, John Evans [EMAIL PROTECTED] wrote: Hi All, I have successfully used Lucene in the traditional way to provide full-text search for various websites. Now I am tasked with developing a data-store to back a web crawler. The crawler can be configured to retrieve arbitrary fields from arbitrary pages, so the result is that each document may have a random assortment of fields. It seems like Lucene may be a natural fit for this scenario since you can obviously add arbitrary fields to each document and you can store the actual data in the database. I've done some research to make sure that it would meet all of our individual requirements (that we can iterate over documents, update (delete/replace) documents, etc.) and everything looks good. I've also seen a couple of references around the net to other people trying similar things...
however, I know it's not meant to be used this way, so I thought I would post here and ask for guidance? Has anyone done something similar? Is there any specific reason to think this is a bad idea? The one thing that I am least certain about is how well it will scale. We may reach the point where we have tens of millions of documents and a high percentage of those documents may be relatively large (10k-50k each). We actually would NOT be expecting/needing Lucene's normal extreme fast text search times for this, but we would need reasonable times for adding new documents to the index, retrieving documents by ID (for iterating over all documents), optimizing the index after a series of changes, etc. Any advice/input/theories anyone can contribute would be greatly appreciated. Thanks, - John -- Matthew Hall Software Engineer Mouse Genome Informatics [EMAIL PROTECTED] (207) 288-6012 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
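A rough sketch of the datastore pattern under discussion (Lucene 2.x-era API; field names and values are illustrative): each record carries a stored, untokenized primary key, fetching by ID is a TermQuery on that key, and updateDocument gives delete-and-replace semantics.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class LuceneAsStoreSketch {
    static String fetchLatest() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter w = new IndexWriter(dir, new StandardAnalyzer(), true);

        Document d = new Document();
        d.add(new Field("id", "page-001", Field.Store.YES, Field.Index.UN_TOKENIZED));
        d.add(new Field("content", "original crawl text", Field.Store.YES, Field.Index.TOKENIZED));
        w.addDocument(d);

        // updateDocument = delete-by-term plus add, i.e. replace by primary key.
        Document d2 = new Document();
        d2.add(new Field("id", "page-001", Field.Store.YES, Field.Index.UN_TOKENIZED));
        d2.add(new Field("content", "recrawled text", Field.Store.YES, Field.Index.TOKENIZED));
        w.updateDocument(new Term("id", "page-001"), d2);
        w.close();

        IndexSearcher s = new IndexSearcher(dir);
        Hits hits = s.search(new TermQuery(new Term("id", "page-001")));
        String content = hits.doc(0).get("content");
        s.close();
        return content;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetchLatest()); // recrawled text
    }
}
```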
Re: Luke shows in top terms but no search results??
Erm.. if it's not tokenized, that's your problem. You are setting up an Analyzer when indexing.. but then not actually USING it. Whereas when you are searching you are running your query through the analyzer, which transforms your text in such a way that it no longer matches against your untokenized form. So, re-run your indexing, changing untokenized to tokenized, and I think you will see the results you are looking for. Matt samd wrote: Oh and the field is not tokenized and stored. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
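The mismatch Matt describes can be reproduced in a few lines (Lucene 2.x-era API; field name and text are illustrative): an analyzed query term never equals the single untokenized term stored for the field.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;

public class TokenizedMismatchDemo {
    static int hitCount() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter w = new IndexWriter(dir, new StandardAnalyzer(), true);
        Document d = new Document();
        // UN_TOKENIZED: the whole string "Mouse Genome" is one index term.
        d.add(new Field("name", "Mouse Genome", Field.Store.YES, Field.Index.UN_TOKENIZED));
        w.addDocument(d);
        w.close();

        // The analyzed query becomes name:mouse name:genome, neither of
        // which equals the untokenized term "Mouse Genome": zero hits.
        IndexSearcher s = new IndexSearcher(dir);
        Hits hits = s.search(new QueryParser("name", new StandardAnalyzer())
                .parse("Mouse Genome"));
        int n = hits.length();
        s.close();
        return n;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hitCount()); // 0
    }
}
```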
Re: What is the percent of size of lucene's index ?
You can also use Luke after you've created your indexes to get their exact size, and other interesting data points. Like Ian said though, the decisions you make on a field by field basis will make your index size vary quite a bit, so probably the best thing you could do is simply try it out, and then examine it. Matt Ian Lea wrote: I think there are too many variables to give a simple answer. How much of your data are you storing? Indexing? Compressing? Get a representative sample of your data and try it out. -- Ian. On Wed, Jul 23, 2008 at 5:00 PM, Ariel [EMAIL PROTECTED] wrote: I need to know what is the percent of size of lucene's index respect the information I'm going to index, I have read some articles that say if a I index 120 Gb of information the index will grow until 40 Gb, that means the percent is 30 %, Could somebody tell me how can be proved that ? Is there any official document of apache lucene where says that ? I hope somebody can help me. Thanks. Ariel - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Matthew Hall Software Engineer Mouse Genome Informatics [EMAIL PROTECTED] (207) 288-6012 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene Search Error: Java.io.IOException: Bad file descriptor
Did you try to open the index using Luke? Luke will be able to tell you whether or not the index is in fact corrupted, but looking at your stack trace, it almost looks like the file.. simply isn't there? Matt Jamie wrote: Hi Everyone I am getting the following error when executing Hits hits = searchers.search(query, queryFilter, sort): 18007414-java.io.IOException: Bad file descriptor 18007455- at java.io.RandomAccessFile.seek(Native Method) 18007504- at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:545) 18007592- at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:131) 18007678- at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:240) -- 18009148- at org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(FieldSortedHitQueue.java:168) 18009247- at org.apache.lucene.search.FieldSortedHitQueue.<init>(FieldSortedHitQueue.java:56) 18009332- at org.apache.lucene.search.TopFieldDocCollector.<init>(TopFieldDocCollector.java:43) 18009419- at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:122) 18009493- at org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:250) Does this mean the index is corrupted? Any idea why it would be corrupted? Jamie - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene Search Error: Java.io.IOException: Bad file descriptor
I'm not sure which file in particular would be the one corrupted/missing, which is why I suggested looking at the index with Luke. As for the Java 1.6 Lucene 2.3.2 index corruption issue, I'm not 100% familiar with the details on that one, but as a quick test, you should be able to swap to a 1.5 version of Java, reindex and see if that fixes things. Well.. unless your code uses something Java 6 specific I suppose. Matt Jamie wrote: Hi Matthew Thanks in advance for the suggestion. Which file do you think does not exist? This is what we have: _15zw.cfs _19od.cfs _1a5d.cfs _1a7n.cfs _1ahf.cfs _1ahh.cfs _qzl.cfs segments.gen _1993.cfs _1a0w.cfs _1a7c.cfs _1a9m.cfs _1ahg.cfs _1ahi.cfs segments_158j Aside from Luke (which requires a GUI), is there a command line utility that can check the integrity of the index? Jamie Matthew Hall wrote: Did you try to open the index using Luke? Luke will be able to tell you whether or not the index is in fact corrupted, but looking at your stack trace, it almost looks like the file.. simply isn't there?
Matt Jamie wrote: Hi Everyone I am getting the following error when executing Hits hits = searchers.search(query, queryFilter, sort): 18007414-java.io.IOException: Bad file descriptor 18007455- at java.io.RandomAccessFile.seek(Native Method) 18007504- at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:545) 18007592- at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:131) 18007678- at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:240) -- 18009148- at org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(FieldSortedHitQueue.java:168) 18009247- at org.apache.lucene.search.FieldSortedHitQueue.<init>(FieldSortedHitQueue.java:56) 18009332- at org.apache.lucene.search.TopFieldDocCollector.<init>(TopFieldDocCollector.java:43) 18009419- at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:122) 18009493- at org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:250) Does this mean the index is corrupted? Any idea why it would be corrupted? Jamie - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Matthew Hall Software Engineer Mouse Genome Informatics [EMAIL PROTECTED] (207) 288-6012 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
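On the command-line-utility question: recent Lucene releases do ship one in the core jar. A hedged example (the jar name and index path are illustrative; CheckIndex appeared around the 2.3 line, and it also has a -fix mode that drops unreadable segments, with data loss, so back up first):

```shell
# Run the stock index checker against the index directory.
java -cp lucene-core-2.3.2.jar org.apache.lucene.index.CheckIndex /path/to/index
```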
Term Frequency for more complex terms
I have a quick question, could someone point me towards where in the API I'll have to investigate in order to figure out the term frequencies of more complex terms? For example I want to know the tf of kit ligand treated as a phrase. I see that luke has access to this information in its explain method, but the api call is currently eluding me. Thanks, Matt -- Matthew Hall Software Engineer Mouse Genome Informatics [EMAIL PROTECTED] (207) 288-6012 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
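One way to get a phrase's per-document frequency without going through the Explanation API is the span API; a sketch against the Lucene 2.x-era classes (the field and phrase come from the question above): each successful Spans.next() is one occurrence.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

public class PhraseFreq {
    // Counts occurrences of the exact phrase "kit ligand" per document id.
    static Map phraseFreqs(IndexReader reader, String field) throws Exception {
        SpanNearQuery phrase = new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term(field, "kit")),
                new SpanTermQuery(new Term(field, "ligand")) },
                0, true); // slop 0, in order = exact phrase
        Map freqs = new HashMap();
        Spans spans = phrase.getSpans(reader);
        while (spans.next()) { // one iteration per phrase occurrence
            Integer doc = new Integer(spans.doc());
            Integer old = (Integer) freqs.get(doc);
            freqs.put(doc, new Integer(old == null ? 1 : old.intValue() + 1));
        }
        return freqs;
    }
}
```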
Re: Can you create a Field that is a copy of another Field?
Hrm, sorry then I'm not sure how much more help I'm going to be able to be on this one. I have to index things that have a DAG Structure (Treelike), but in order to get that functionality into my search I simply flatten out my dag, so any single term knows all of its children, but loses the structure of those children beyond that. This approach works for my data, but it doesn't sound like it will for yours. So, while I think you can still use the general technique that I showed you on this one, I have a feeling you are going to need to customize it some for your domain. Best of luck, and if there's anything else I can help with let me know. Matt [EMAIL PROTECTED] wrote: Matthew, It has to do with the fact that we're trying to represent these Property entities hierarchically. We are displaying them in a tree structure, similar to the way Windows Explorer displays directories and files in your file system. E.g. all the states would be at the root level. If you expanded a particular state you would see all the cities in that state, etc. If the user does a search we want to filter or reduce the tree. E.g. imagine you search on the term 'Smith'. Well since it's a safe bet to assume that there's somebody with the last name of Smith in all fifty states, then all fifty states would show up at the root level. On the other hand, suppose there's one guy in the whole country with the last name of 'Fleebleflabble' and he lives in Michigan. If I search on that term I would expect only one state, namely Michigan, to show up at the root level. Each level in the hierarchy is filtered by the specified search terms in this way. Searches are not limited to people's names though. We want to reduce the tree by matches on ANY field in the Properties from 'State' to 'Name'. So for example, a search on 'Smith' would return matches for everybody that lived in a city named 'Smith City' or on a street named 'Smith Avenue', etc.
This doesn't make a lot of sense for people and addresses, I admit. I just used that as an easy-to-follow example. But it does make sense for the data we're storing. And BTW, maybe you can see a few holes in this approach. There's a bit more to it than I've described above. We have had to get a little creative with other documents and fields in order for it to work correctly. I'd be happy to elaborate if anybody is interested. There may be better ways to do it. Like I said I'm fairly new to Lucene. Was just trying to keep it simple. -- Bill -Original Message- From: Matthew Hall [mailto:[EMAIL PROTECTED] Sent: Monday, June 30, 2008 8:26 AM To: java-user@lucene.apache.org Subject: Re: Can you create a Field that is a copy of another Field? Sorry, didn't get this until this morning. Yes, both fields should be indexed and searchable, though the data_type one should likely be untokenized. Data should be indexed and tokenized with whatever appropriate Analyzer works for your data. As for what you're indexing, may I ask why you are doing it like that? I would have thought indexing each property separately (a separate doc) would have been sufficient for your needs, but if you can explain a bit more about your situation perhaps I can be more helpful on this matter? Matt [EMAIL PROTECTED] wrote: Hmmm, I think maybe I am missing something. In your design is the 'data' field indexed, i.e. searchable? Or is it an unindexed, stored field? I was thinking that both 'data' and 'data_type' were indexed and searchable. Maybe the confusion stems from the fact that for the Document corresponding to State=California, we're not just indexing on the token 'California'. We're indexing on all the tokens from all the Properties in the set of Properties corresponding to a person's address. In my original example this would be: California, Sacramento, 94203, South, Main, 1234, Joe and Smith.
For the 'data_type' field I was thinking you were saying we'd index on a single token, namely 'State' (or whatever the left-hand side is). Does that make sense? -- Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103 Tel 734-332-4405 * Fax 734-332-4440 * [EMAIL PROTECTED] www.sungard.com/energy -Original Message- From: Matthew Hall [mailto:[EMAIL PROTECTED] Sent: Friday, June 27, 2008 3:33 PM To: java-user@lucene.apache.org Subject: Re: Can you create a Field that is a copy of another Field? Yup, you're pretty much there. The only part I'm a bit confused about is what you've said in your data field there, I'm thinking you mean that for the data_type: State, you would have the data entry of California, right? If so, then yup, you are spot on ^^ We use this technique all the time on our side, and its helped considerably. We then use the db_key to reference into a display time cache that holds all of the display information for the underlying object that we would ever want
Re: Can you create a Field that is a copy of another Field?
)); propertyIndexWriter.addDocument(doc); tokenStream.close(); } Hope that clears it up. BTW, in case this seems like a strange way to index things, I will also add that we are doing it this way in order to impose a hierarchical structure on Properties. So my example above should really look like this: State=California City=Sacramento ZipCode=94203 StreetName=South Main StreetNumber=1234 Name=Joe Smith Use your imagination to visualize what the tree might look like with millions of peoples' addresses. Now imagine trying to tokenize the Document corresponding to State=California. Each path through the tree from root (State) to leaf (Name) represents a set of Properties that is used to index the keywords field in the State=California document. In other words it takes a long time to index. This is why I'm looking for a way to just copy one field to another. There is a lot more to our design to facilitate this hierarchical structure but this is probably more than you wanted to know. :) thanks in advance, -- Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103 Tel 734-332-4405 * Fax 734-332-4440 * [EMAIL PROTECTED] www.sungard.com/energy -Original Message- From: Grant Ingersoll [mailto:[EMAIL PROTECTED] Sent: Friday, June 27, 2008 7:26 AM To: java-user@lucene.apache.org Subject: Re: Can you create a Field that is a copy of another Field? On Jun 27, 2008, at 12:01 AM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hello Lucene Gurus, I'm new to Lucene so sorry if this question is basic or naïve. I have a Document to which I want to add a Field named, say, foo that is tokenized, indexed and unstored. I am using the Field(String name, TokenStream tokenStream) constructor to create it. The TokenStream may take a fairly long time to return all its tokens. Can you share some code here? What's the reasoning behind using it (not saying it's wrong, just wondering what led you to it)?
Are you just loading it up from a file, string, or something, or do you have another reason?

Now, for querying reasons, I want to add another Field named, say, "bar" that is tokenized and indexed in exactly the same way as foo. I could just pass it the same TokenStream that I used to create foo, but since it takes so long to return all its tokens, I was wondering if there is a way to, say, create bar as a copy of foo. I looked through the javadoc but didn't see anything.

By exactly the same, do you really mean exactly the same? What's the point of that? What are the querying reasons? You may want to look at the TeeTokenFilter and the SinkTokenizer, but I guess I'd like to know more about what's going on before fully recommending anything.

Is this possible in Lucene, or do I just have to bite the bullet and build the new Field using the same TokenStream again?

-- Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103 Tel 734-332-4405 * Fax 734-332-4440 * [EMAIL PROTECTED] www.sungard.com/energy

-- Grant Ingersoll http://www.lucidimagination.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ

-- Matthew Hall Software Engineer Mouse Genome Informatics [EMAIL PROTECTED] (207) 288-6012

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
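The Tee/Sink pattern Grant mentions is the standard way to avoid running an expensive TokenStream twice. A minimal sketch under the Lucene 2.3-2.9 API (`expensiveTokenStream` and `doc` are placeholders for the poster's own objects):

```java
// Tee the tokens of the expensive stream into a sink while the first
// field is being indexed; the second field then replays the cached
// tokens instead of re-tokenizing.
SinkTokenizer sink = new SinkTokenizer();
TokenStream foo = new TeeTokenFilter(expensiveTokenStream, sink);

doc.add(new Field("foo", foo));   // consumed once at indexing time, fills the sink
doc.add(new Field("bar", sink));  // replays the cached tokens for the copy field
```

Note that field order matters here: "foo" must be added (and therefore consumed) before "bar", so the sink is already populated when it is read. In Lucene 2.9+ this pair was superseded by TeeSinkTokenFilter.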
Re: Can you create a Field that is a copy of another Field?
Yup, you're pretty much there. The only part I'm a bit confused about is what you've said in your data field there; I'm thinking you mean that for the data_type State, you would have the data entry of California, right? If so, then yup, you are spot on ^^ We use this technique all the time on our side, and it's helped considerably. We then use the db_key to reference into a display-time cache that holds all of the display information for the underlying object that we would ever want to present to the user. This allows our search-time index to be very concise, and as a result nearly every search we hit it with is subsecond, which is a nice place to be ^^ Matt

[EMAIL PROTECTED] wrote: Matthew, Thanks for the reply. This looks very interesting. If I'm understanding correctly, your db_key, data and data_type are Fields within the Document, correct? So is this how you envision it?

Document: State=California
Field: 'db_key'='1395' (primary key into relational table, correct?)
Field: 'data' indexed by 'California', 'Sacramento', '94203', etc.
Field: 'data_type' indexed by 'State'

Document: City=Sacramento
Field: 'db_key'='2405'
Field: 'data' indexed by 'California', 'Sacramento', '94203', etc.
Field: 'data_type' indexed by 'City'

Then my query for all Properties would be: +data:South
My query for only 'City' Properties would be: +data:South +data_type:City

Is that right? I think that would work. Very nice. Thank you very much
-- Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103 Tel 734-332-4405 * Fax 734-332-4440 * [EMAIL PROTECTED] www.sungard.com/energy

-----Original Message----- From: Matthew Hall [mailto:[EMAIL PROTECTED]] Sent: Friday, June 27, 2008 11:49 AM To: java-user@lucene.apache.org Subject: Re: Can you create a Field that is a copy of another Field?

I'm not sure if this is helpful, but I do something VERY similar to this in my project.
So, for the example you are citing, I would design my index as follows: db_key, data, data_type, where the data_type is some sort of value representing the thing that's on the left-hand side of your property relationship there. So, then, in order to satisfy your search, the queries become quite simple: the search for everything simply searches against the data field in this index, whereas the search for a specific data_type + searchterm becomes a simple boolean query that has a MUST clause for the data_type value. As an even BETTER bonus, this will then mean that all of your searchable values will now have relevance to each other at scoring time, which is quite useful in the long run. Hope this helps you out, Matt

[EMAIL PROTECTED] wrote: Grant, Thanks for the reply. What we're trying to do is kind of esoteric and hard to explain without going into a lot of gory details, so I was trying to keep it simple. But I'll try to summarize. We're trying to index entities in a relational database. One of the entities we're trying to index is something called a Property. Think of a Property kind of like the java.util.Properties class, i.e. a name/value pair. So some examples of Properties might be:

State=California
City=Sacramento
ZipCode=94203
StreetName=South Main
StreetNumber=1234
Name=Joe Smith

Etc., etc. (Note: this isn't the type of data we're storing... just trying to keep it simple.) Imagine that the above list represents the set of Properties that specifies the address for a single person, Joe Smith. Each Property in the set will be indexed by the values on the right-hand side of all the other name/value pairs in the set, i.e.: California, Sacramento, 94203, South, Main, 1234, Joe and Smith. There are two types of queries that we want to do. 1) retrieve every Property matching the specified search terms, regardless of its left-hand side. For this we want to create a field in EVERY Document called keywords and index it by the right-hand-side values as described above.
2) retrieve every Property with a given left-hand side that matches the specified search terms. For example, find all the 'City' Properties that match the term 'South'. For this we want to create a field with the name of the left-hand side (e.g. State, City, ZipCode, etc.), but only in those Documents that correspond to a Property with that left-hand side. Again this field will be indexed by the right-hand-side values as described above. So a couple of examples from the above list might look something like:

Document: State=California
Field: 'keywords' indexed by 'California', 'Sacramento', '94203', etc.
Field: 'State' indexed by 'California', 'Sacramento', '94203', etc.

Document: City=Sacramento
Field: 'keywords' indexed by 'California', 'Sacramento', '94203', etc.
Field: 'City' indexed by 'California', 'Sacramento', '94203', etc.

Now if I'm interested in all the Properties that match the word South, I
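The two query shapes Matthew describes can be sketched with Lucene's BooleanQuery, using the data/data_type field names from this thread (the term values are illustrative):

```java
// 1) Everything matching the term, regardless of type:
Query all = new TermQuery(new Term("data", "south"));

// 2) Restricted to one left-hand side, e.g. only 'City' properties:
//    both clauses are MUST, mirroring +data:south +data_type:City.
BooleanQuery typed = new BooleanQuery();
typed.add(new TermQuery(new Term("data", "south")), BooleanClause.Occur.MUST);
typed.add(new TermQuery(new Term("data_type", "City")), BooleanClause.Occur.MUST);
```

Because every searchable value lives in the single data field, both queries score against the same field statistics, which is the relevance benefit Matthew mentions.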
Re: lucene wildcard query with stop character
I assume you want all of your queries to function in this way? If so, you could just translate the * character into a ? at search time, which should give you the functionality you are asking for. Unless I'm missing something. Matt

Cam Bazz wrote: Hello, Imagine I have the following documents having keys:

A
AB
ABC
ABD
ABCD

Now imagine a query with the keyword analyzer and a wildcard: AB*, which will bring me ABC, ABD and ABCD, but I just want to get ABC and ABD. So can I make a query like AB* that excludes keys with more than one character after AB? Best Regards, -C.B.
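Matt's suggestion works because `?` matches exactly one character while `*` matches zero or more. A minimal sketch of the search-time substitution (the class and method names are hypothetical):

```java
public class WildcardNarrower {
    // Rewrite each '*' (zero-or-more) into '?' (exactly one character)
    // before handing the string to the QueryParser, so AB* becomes AB?
    // and matches ABC and ABD but not ABCD or AB itself.
    public static String narrow(String query) {
        return query.replace('*', '?');
    }
}
```

As the follow-up message points out, this only helps when the desired matches are exactly one character longer than the prefix.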
Re: lucene wildcard query with stop character
Hrm.. can we see a more specific example of the type of data you are trying to query against here? Matt

Cam Bazz wrote: Well, the ? would work if the length of each token were the same. However, instead of ABC I want tags that change dynamically from 1 to unlimited length. I suppose I could just pad every token to a normalized length such as ...000A, but I am hoping there is a better method. If only we could tell Lucene, as in a regular expression, where to insert the ?'s ... Another way could be to do the regular expression outside Lucene, but then there is still the need for fetching the hits. Best. -C.B.

On Thu, Jun 12, 2008 at 8:47 PM, Matthew Hall [EMAIL PROTECTED] wrote: I assume you want all of your queries to function in this way? If so, you could just translate the * character into a ? at search time, which should give you the functionality you are asking for. Unless I'm missing something. Matt

Cam Bazz wrote: Hello, Imagine I have the following documents having keys A AB ABC ABD ABCD. Now imagine a query with the keyword analyzer and a wildcard: AB*, which will bring me ABC, ABD and ABCD, but I just want to get ABC and ABD. So can I make a query like AB* that excludes keys with more than one character after AB? Best Regards, -C.B.
Re: Possible Bug when Querying?
Very very interesting. I went ahead and turned on the AllowLeadingWildcard toggle, and everything works just as expected now, which is odd in a way. I'm still not certain why a search for '\*ache*' would be considered to have a leading wildcard. I'm searching for the literal * character here, which I would have assumed would be a completely fine thing to do in a search, but somehow it's triggering the leading-wildcard checking logic. Well, anyhow, thanks much for the suggestion; things are working properly now. Matt

Karl Wettin wrote: On 15 May 2008 at 18:33, Matthew Hall wrote:

12:23:05,602 INFO [STDOUT] org.apache.lucene.queryParser.ParseException: Cannot parse '\*ache*': '*' not allowed as first character in PrefixQuery
12:23:05,602 INFO [STDOUT] Failure in QS_MarkerSearch.searchMarkerNomen
12:23:05,602 ERROR [STDERR] java.lang.NullPointerException
12:23:05,602 ERROR [STDERR] at org.jax.mgi.search.model.QS_MarkerSearch.searchInexactMatches(Unknown Source)

Which looks to me a lot like something akin to the AllowLeadingWildcard stuff that comes along with wildcard queries. But the odd thing is that the leading character in my search string ISN'T *, it's the escaped star character, which I would have thought would work with no problems at all. Have I stumbled across a bug here?

Did you setAllowLeadingWildcard(true)?

/**
 * Set to <code>true</code> to allow leading wildcard characters.
 * <p>
 * When set, <code>*</code> or <code>?</code> are allowed as
 * the first character of a PrefixQuery and WildcardQuery.
 * Note that this can produce very slow
 * queries on big indexes.
 * <p>
 * Default: false.
 */
public void setAllowLeadingWildcard(boolean allowLeadingWildcard) {

karl
Possible Bug when Querying?
Greetings, I'm searching against a data set using Lucene that contains search terms such as the following: *ache*, *aChe*, and so forth. Sadly, this part of the dataset is imported via an external client, so we have no real way of controlling how they format it. Now, to make matters a bit more complex, my clients have decided to turn off all wildcard searching EXCEPT for prefix searches, so when I process the query string I go through it and escape out all Lucene special characters, except for the trailing *. So I end up sending the following string to the query parser: \*ache* (I'm doing standard things like converting everything to lowercase), and when I put that into the query parser it throws the following exception:

12:23:05,602 INFO [STDOUT] org.apache.lucene.queryParser.ParseException: Cannot parse '\*ache*': '*' not allowed as first character in PrefixQuery
12:23:05,602 INFO [STDOUT] Failure in QS_MarkerSearch.searchMarkerNomen
12:23:05,602 ERROR [STDERR] java.lang.NullPointerException
12:23:05,602 ERROR [STDERR] at org.jax.mgi.search.model.QS_MarkerSearch.searchInexactMatches(Unknown Source)

Which looks to me a lot like something akin to the AllowLeadingWildcard stuff that comes along with wildcard queries. But the odd thing is that the leading character in my search string ISN'T *, it's the escaped star character, which I would have thought would work with no problems at all. Have I stumbled across a bug here? Matt
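The pre-processing step described above can be sketched as follows (the class and method names are hypothetical; the special-character list is the one the Lucene QueryParser escapes):

```java
public class QueryEscaper {
    // Characters the Lucene QueryParser treats specially.
    private static final String SPECIALS = "+-&|!(){}[]^\"~*?:\\";

    // Escape every special character, but preserve a single trailing '*'
    // so that prefix searches remain possible.
    public static String escapeKeepTrailingStar(String q) {
        boolean prefix = q.endsWith("*");
        String body = prefix ? q.substring(0, q.length() - 1) : q;
        StringBuilder sb = new StringBuilder();
        for (char c : body.toCharArray()) {
            if (SPECIALS.indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        if (prefix) sb.append('*');
        return sb.toString();
    }
}
```

For the input *ache* this produces \*ache*, the exact string that triggered the ParseException in the thread, since the parser's leading-wildcard check fires on the escaped star as well.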
Re: Possible Bug when Querying?
No, I did not, because I'm not performing a search with a leading wildcard, nor am I intending to allow that behavior. But what I do want to be able to search on is a word that starts with a * by escaping it, because sadly our data contains such things. Matt

Karl Wettin wrote: On 15 May 2008 at 18:33, Matthew Hall wrote:

12:23:05,602 INFO [STDOUT] org.apache.lucene.queryParser.ParseException: Cannot parse '\*ache*': '*' not allowed as first character in PrefixQuery
12:23:05,602 INFO [STDOUT] Failure in QS_MarkerSearch.searchMarkerNomen
12:23:05,602 ERROR [STDERR] java.lang.NullPointerException
12:23:05,602 ERROR [STDERR] at org.jax.mgi.search.model.QS_MarkerSearch.searchInexactMatches(Unknown Source)

Which looks to me a lot like something akin to the AllowLeadingWildcard stuff that comes along with wildcard queries. But the odd thing is that the leading character in my search string ISN'T *, it's the escaped star character, which I would have thought would work with no problems at all. Have I stumbled across a bug here?

Did you setAllowLeadingWildcard(true)?

/**
 * Set to <code>true</code> to allow leading wildcard characters.
 * <p>
 * When set, <code>*</code> or <code>?</code> are allowed as
 * the first character of a PrefixQuery and WildcardQuery.
 * Note that this can produce very slow
 * queries on big indexes.
 * <p>
 * Default: false.
 */
public void setAllowLeadingWildcard(boolean allowLeadingWildcard) {

karl
Quickie Luke Question
Does anyone know how to set the MaxClauseCount in Luke? I'm in a situation where I've had to override it when searching against my indexes, but now I can't use Luke to examine what's going on with my queries anymore. Any help would be appreciated. Matt
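For reference, the override mentioned above is a static, JVM-wide setting on BooleanQuery; whether a given Luke build exposes a way to change it in its UI is a separate question. A minimal sketch of the application-side setting:

```java
import org.apache.lucene.search.BooleanQuery;

public class ClauseLimit {
    public static void main(String[] args) {
        // Static setting, applies to every BooleanQuery in this JVM.
        // The default is 1024; expanded wildcard/prefix queries that
        // produce more clauses than this throw TooManyClauses.
        BooleanQuery.setMaxClauseCount(4096);
        System.out.println(BooleanQuery.getMaxClauseCount());
    }
}
```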
Question About Hits
This is more of a trying-to-understand-the-design sort of question, but it's still something I need to be able to succinctly express to my project manager. I know that Lucene by design does not let us easily see which fields were hit for a given document. Instead it presents us with a collection of hits, with each hit having the total score for the document given all of the fields that you have searched on, with that total score being the scores for the matches on each field combined via the scoring algorithm. The question I'm being asked is: why is the information about how each field matched not easily accessible in Lucene? I know I can go ahead and do a searcher.explain on my hit object, and then parse out the individual fields with their scores, but couldn't this be much more easily accessible from the Hits object itself? The Hits object already has a get method that allows you to pass a string name to the object; couldn't another method be added, such as getScoreByField(String s), that had access to the information that was used to build the total score of the document? I'm sure part of the reason this wasn't included is performance-based; I mean, it would be a fair amount of extra information for the average search to have to carry around. But for my application, and many others I'm sure, it's a very important thing to be able to find out WHY a document was returned, if for nothing else than for display purposes. Anyhow, any insight as to why things are the way they are would be most appreciated, or if someone else has faced the same problems as I have and has gone ahead and modified the Hits object to include such things (and this is no small task), I'd love to hear about it. -Matt
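The searcher.explain route mentioned above can be sketched like this under the Lucene 2.x Hits API (assumes an open Searcher and an already-parsed Query; not a complete program):

```java
// Explanation exposes the nested, per-field/per-term breakdown
// behind each hit's score; parsing it is the only stock way to
// recover field-level contributions.
Hits hits = searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
    Explanation exp = searcher.explain(query, hits.id(i));
    System.out.println("doc " + hits.id(i) + ":");
    System.out.println(exp.toString()); // indented tree of score components
}
```

Note that explain() re-runs the scoring for one document and is intended for debugging; calling it for every hit on every search is expensive, which is part of why this information is not carried on Hits by default.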
Re: Implementing CMS search function using Lucene
You could try something like this, which I use when putting my own documents together:

public Document getDocument() {
    Document doc = new Document();
    doc.add(new Field("db_key", this.getDb_key(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("acc_ids", this.getAcc_ids(), Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(setBoost(new Field("name", this.getName(), Field.Store.YES, Field.Index.TOKENIZED), 1.0f));
    doc.add(setBoost(new Field("symbol", this.getSymbol(), Field.Store.YES, Field.Index.TOKENIZED), 1.0f));
    doc.add(setBoost(new Field("synonyms", this.getSynonyms(), Field.Store.YES, Field.Index.TOKENIZED), .8f));
    doc.add(setBoost(new Field("allele_nomen", this.getAllele_nomen(), Field.Store.YES, Field.Index.TOKENIZED), .6f));
    doc.add(setBoost(new Field("old_nomen", this.getOld_nomen(), Field.Store.YES, Field.Index.TOKENIZED), .4f));
    doc.add(setBoost(new Field("orth_nomen", this.getOrth_nomen(), Field.Store.YES, Field.Index.TOKENIZED), .2f));
    return doc;
}

private static Field setBoost(Field workField, float boost) {
    workField.setBoost(boost);
    return workField;
}

It works out pretty well for me anyhow.

Илья Казначеев wrote: In a message of Thursday 03 April 2008 16:24:15, Илья Казначеев wrote: - Is there a way to set weights for different fields? Let's say content has a weight of 1, title has a weight of 5 and picture caption has a weight of 0.5. If not, can I do that by hand? Already found field.setBoost(). Sorry for asking lame questions :( By the way, if setBoost() returned this, it would be much easier to assemble a document: one line instead of three. Chaining rules.
Re: Highlighter Hits
I suspect you are using a different analyzer to highlight than you are using to search. A couple of things you can check: immediately after your query, simply print out hits.length; this should conclusively tell you that your query is in fact working. After that, ensure that you are using the same analyzer for your highlighter as for your query parser. If you are not, it's entirely possible that the text you are trying to highlight is being transformed differently than it was in the query, and as a result isn't matching against your fields anymore. Hope that helps, Matt

JensBurkhardt wrote: Hello everybody, I have a slight problem using Lucene's highlighter. If I have the highlighter enabled, a query creates 0 hits; if I disable the highlighter, I get the hits. It seems like, when I call searcher.search() and pass my Hits hits to the highlighter function, the program quits. All prints after the highlighter call also do not appear. I have no idea what the problem is. Thanks in advance, Jens Burkhardt
Looking for an example of Using Position Increment Gap
Fellows, I'm working on a project here where we are trying to use our Lucene indexes to return concrete objects. One of the things we want to be able to match by is vocabulary terms annotated to that object, as well as all of the child vocabulary terms of that annotated term. So, what I was thinking of doing is extending my index that returns objects of that type to include a new field, say sub_term. In this field I would put all of the text of these vocabulary sub-terms together and introduce phrase boundaries using some of the techniques described in the Javadoc for the analysis package (basically writing a custom analyzer that introduces a position increment gap between phrases). I am, however, curious whether an example of a usage like that exists somewhere that I could use as a basis for the analyzer I'm going to have to write to handle this case. Does anyone know of a good example? Matt
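The technique described above usually needs no custom tokenization at all, only an override of Analyzer.getPositionIncrementGap. A sketch (the class name and gap size are illustrative; assumes a Lucene version where StandardAnalyzer can be subclassed):

```java
// An analyzer that keeps tokens from different values of the same
// field at least 100 positions apart, so a PhraseQuery or SpanQuery
// cannot match across a value (sub-term) boundary.
public class PhraseBoundaryAnalyzer extends StandardAnalyzer {
    public int getPositionIncrementGap(String fieldName) {
        return 100; // default in Analyzer is 0
    }
}
```

Each sub-term is then added as its own Field instance with the same name ("sub_term"); the gap is applied between successive values, which is what creates the phrase boundaries.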
Re: Suffix search
What you need is to set the allow-leading-wildcard flag:

qp.setAllowLeadingWildcard(true);

(where qp is a QueryParser instance). That will let you do it; be warned, however, that there is most definitely a significant performance degradation associated with doing this. Matt

[EMAIL PROTECTED] wrote: Hi, using WildcardQuery directly, it is possible to search for suffixes like *foo. The QueryParser throws an exception that this is not allowed in a WildcardQuery. Hm, now I'm confused ;) How can I configure the QueryParser to allow a wildcard as the first character? Thank you
Re: Problem in building Lucene
Also, ensure that you didn't inadvertently add an older version of your jar file somewhere in your classpath. Eclipse will take the first one it comes to and skip any others found later in the path. Right-click on your Project -> Properties -> Java Build Path and ensure you don't have an older version in there. Matt

Koji Sekiguchi wrote: Try rereading the jar file in Eclipse. To do it, right-click on your project, then choose Refresh. Thank you, Koji

sandeep chawla wrote: I have to change the Lucene code for some reason. I changed the source code of Lucene and ran the ant command on build.xml. It created a jar file in the build directory; then I added the jar file to my project in Eclipse. I am facing a bizarre problem now: changes I have made in the source code are not reflected in the new jar file. Any help in this regard, please. Thanks, Sandeep

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]