RE: WildCardQuery
Can you be a little more precise about how you process your documents?

1) What's your analyser? SimpleAnalyzer?
2) How do you parse the query? Out-of-the-box QueryParser?

> can we not enter space or do an OR search with two words one of
> which has a wildcard ?

Simple answer: yes. Complicated answer: words are delimited by your tokeniser, which is part of your analyser (hence my question above). The asterisk syntax comes from using a query parser that transforms the query into a PrefixQuery object.

sv

On Fri, 1 Oct 2004, Robinson Raju wrote:

> Hi,
>
> Would there be a problem if one enters a space while using wildcards?
> Say I search for 'abc' and get 100 hits; 'man' gives 200 and
> 'abc man' gives 300. But
>
>   'ab* man'
>   'abc ma*'
>   'ab* ma*'
>   ab* OR ma*
>
> all return 0 results. Can we not enter a space, or do an OR search
> with two words, one of which has a wildcard?
>
> Regards,
> Robin

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
IndexHTML parser + Constructor
Hi,

Apologies. Can somebody please tell me how to add a constructor to 'org.apache.lucene.demo.html.HtmlParser.java' that takes a String argument, strips the HTML tags, and returns the String without tags?

Currently the 'org.apache.lucene.demo.html.HtmlParser.java' method accepts the full path of a file and then reads its content to strip the tags.

Thanks in advance,
Karthik

-----Original Message-----
From: Daniel Naber [mailto:[EMAIL PROTECTED]
Sent: Saturday, September 25, 2004 12:47 AM
To: Lucene Users List
Subject: Re: demo IndexHTML parser breaks unicode?

On Friday 24 September 2004 19:58, Fred Toth wrote:

> I've got unicode in my source HTML. In particular, within meta tags,
> and it's getting broken by the indexer. Note that I'm not trying to
> query on any of this, just store and retrieve document titles with
> unicode characters.

Please try again with the code from CVS. Christoph Goller committed a fix for this problem (at least I think it was this problem) 1-3 weeks ago.

Regards
Daniel

-- 
http://www.danielnaber.de
Re: WildCardQuery
The analyzer is StandardAnalyzer. I use MultiFieldQueryParser to parse.

The flow is this: I have indexed a database view, and now I need to search against a few columns. I take in the search criteria and the search field, construct a wildcard query, and add it to a boolean query:

    WildcardQuery wQuery = new WildcardQuery(new Term(searchFields[0], searchString));
    booleanQuery.add(wQuery, true, false);
    Query queryFilter = MultiFieldQueryParser.parse(filterString, filterFields, flags, analyzer);
    hits = parallelMultiSearcher.search(booleanQuery, queryFilter);

When I don't use wildcards, the query is taken as

    +((ITM_SHRT_DSC:natal ITM_SHRT_DSC:tylenol) (ITM_LONG_DSC:natal ITM_LONG_DSC:tylenol))

But when a wildcard is used, it is taken as

    +ITM_SHRT_DSC:nat* tylenol +ITM_LONG_DSC:nat* Tylenol

The first returns around 300 records; the second returns 0.

Any help would be appreciated.

Thanks,
Robin

On Fri, 1 Oct 2004 02:06:04 -0400 (EDT), Stephane James Vaucher [EMAIL PROTECTED] wrote:

> Can you be a little more precise about how you process your documents?
>
> 1) What's your analyser? SimpleAnalyzer?
> 2) How do you parse the query? Out-of-the-box QueryParser?
>
> Simple answer, yes. Complicated answer, words are delimited by your
> tokeniser. That's included in your analyser (hence my question above).
> The asterisk syntax comes from using a query parser that transforms
> the query into a PrefixQuery object.
>
> sv

-- 
Regards,
Robin
9886394650
The merit of an action lies in finishing it to the end
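
One reading of the 0-hit result above: the whole search string, space included, is handed to a single WildcardQuery. StandardAnalyzer splits text on whitespace at index time, so no indexed term contains a space, and a pattern such as "nat* tylenol" has nothing to match. A minimal sketch of one workaround, assuming each whitespace-separated word should become its own optional (OR) clause. The class and method names are hypothetical, and the Lucene calls appear only in comments so the snippet runs without the Lucene jar:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WildcardClauses {

    /** One clause per user-typed word: its text and whether it contains a wildcard. */
    public static final class Clause {
        public final String text;
        public final boolean wildcard;

        public Clause(String text, boolean wildcard) {
            this.text = text;
            this.wildcard = wildcard;
        }

        @Override
        public String toString() {
            return (wildcard ? "wildcard:" : "term:") + text;
        }
    }

    /** Splits the raw search string on whitespace and classifies each word. */
    public static List<Clause> split(String searchString) {
        List<Clause> clauses = new ArrayList<>();
        for (String word : searchString.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // blank input yields one empty token
            }
            // Terms built by hand bypass the analyzer, so lower-case them here
            // to line up with what StandardAnalyzer put in the index.
            String text = word.toLowerCase(Locale.ROOT);
            boolean wildcard = text.indexOf('*') >= 0 || text.indexOf('?') >= 0;
            clauses.add(new Clause(text, wildcard));
        }
        return clauses;
    }

    public static void main(String[] args) {
        for (Clause c : split("nat* Tylenol")) {
            // With the Lucene 1.4 API each clause would become roughly:
            //   Query q = c.wildcard ? new WildcardQuery(new Term(field, c.text))
            //                        : new TermQuery(new Term(field, c.text));
            //   booleanQuery.add(q, false, false);   // not required, not prohibited
            System.out.println(c);
        }
    }
}
```

Lower-casing by hand is only an approximation of the analyzer; plain words could instead be run through the analyzer itself so that stop words and punctuation are treated the same way as at index time.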
Re: BooleanQuery - Too Many Clauses on date range.
You can use:

    BooleanQuery.setMaxClauseCount(int maxClauseCount);

to increase the limit.

On Sep 30, 2004, at 8:24 PM, Chris Fraschetti wrote:

> I recently read, in regard to my problem, that
>
>   date_field:[0820483200 TO 110448]
>
> is evaluated into a series of boolean queries ... which has a cap of
> 1024 ... Considering my documents will have dates spanning many
> years, and I need the granularity of 'by day' searching, are there
> any recommendations on how to make this work?
>
> Currently with the query
>
>   +content_field:sometext +date_field:[0820483200 TO 110448]
>
> I get the following exception:
>
>   org.apache.lucene.search.BooleanQuery$TooManyClauses
>
> Any suggestions on how I can still keep the granularity of by day,
> but without limiting my search results? Are there any date formats
> that I can change those numbers to that would allow me to complete
> the search (i.e. Feb, 15 2004)? Can Lucene's range do a proper search
> on formatted dates?
>
> Is there a combination of RangeQuery and Query/MultiTermQuery that I
> can use?
>
> Your help is greatly appreciated.
>
> --
> Chris Fraschetti
> e [EMAIL PROTECTED]
Re: BooleanQuery - Too Many Clauses on date range.
On Fri, 01-10-2004 at 07:57 -0500, Scott Ganyo wrote:

> You can use:
> BooleanQuery.setMaxClauseCount(int maxClauseCount);

I had a similar problem with date ranges. Someone on the list suggested a solution that is more clever than the one above, which helps but makes searches slower and is memory hungry (many terms are loaded into memory and then searched).

The suggested solution was to split dates into sub-fields during indexing and use those fields while searching. This is more efficient, but it makes the query harder to create (personally I prefer working with queries built using the Lucene API rather than ones parsed by QueryParser).

For instance, the time stamp 2004-10-01 15:34:26.001 may be split into the following fields:

    some-date_year:  2004
    some-date_month: 10
    some-date_day:   01
    some-date_time:  153426001

These fields should be indexed so they can be searched. They give some nice possibilities, for instance fast and easy querying for all documents that have a date in a particular year, month or day of month. For convenience one could also store weekdays.

A query for a date range from 15th August to 10th October 2004 (in no particular query language, this just gives the idea):

    some-date_year = 2004 AND (
        (some-date_month = 08 AND some-date_day >= 15) OR
        (some-date_month = 09) OR
        (some-date_month = 10 AND some-date_day <= 10)
    )

As you can see, it is easy to build such a query from the Lucene API. The equalities are Term queries. The inequalities are Range queries. The AND and OR operators are provided by Boolean queries.

Have fun implementing the solution. It has only one disadvantage: it makes sorting results less easy. The remedy is to use multiple sort fields, or another stored field containing the full date (one almost surely will need to store a date for each hit anyway, unless you want to write some baroque code to calculate the date from the split field values).

Have fun,
-- 
Damian Gajda
Caltha Sp. j.
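
A minimal sketch of the field splitting described above, using java.time and plain maps instead of Lucene documents. The class and method names are hypothetical. The range check reads the first and last month clauses as "day >= 15" and "day <= 10", which is what the remark that the inequalities are Range queries implies, and it compares zero-padded strings because that mirrors how terms are ordered in the index:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;

public class SplitDateFields {

    private static final DateTimeFormatter YEAR = DateTimeFormatter.ofPattern("uuuu");
    private static final DateTimeFormatter MONTH = DateTimeFormatter.ofPattern("MM");
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("dd");
    private static final DateTimeFormatter TIME = DateTimeFormatter.ofPattern("HHmmssSSS");

    /** Splits a timestamp into zero-padded sub-fields, one map entry per field to index. */
    public static Map<String, String> split(String prefix, LocalDateTime ts) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put(prefix + "_year", ts.format(YEAR));
        fields.put(prefix + "_month", ts.format(MONTH));
        fields.put(prefix + "_day", ts.format(DAY));
        fields.put(prefix + "_time", ts.format(TIME));
        return fields;
    }

    /**
     * Same shape as the boolean query in the message: year matches AND
     * (first month from a given day OR any month strictly between OR last month up to a given day).
     * Assumes the range stays within one year and fromMonth < toMonth.
     */
    public static boolean inRange(Map<String, String> fields, String prefix, String year,
                                  String fromMonth, String fromDay,
                                  String toMonth, String toDay) {
        if (!year.equals(fields.get(prefix + "_year"))) {
            return false;
        }
        String month = fields.get(prefix + "_month");
        String day = fields.get(prefix + "_day");
        boolean inFirstMonth = month.equals(fromMonth) && day.compareTo(fromDay) >= 0;
        boolean inMiddleMonth = month.compareTo(fromMonth) > 0 && month.compareTo(toMonth) < 0;
        boolean inLastMonth = month.equals(toMonth) && day.compareTo(toDay) <= 0;
        return inFirstMonth || inMiddleMonth || inLastMonth;
    }

    public static void main(String[] args) {
        // 2004-10-01 15:34:26.001 (1 ms = 1,000,000 ns)
        LocalDateTime ts = LocalDateTime.of(2004, 10, 1, 15, 34, 26, 1_000_000);
        Map<String, String> fields = split("some-date", ts);
        System.out.println(fields);
        System.out.println(inRange(fields, "some-date", "2004", "08", "15", "10", "10"));
    }
}
```

Each map entry would become one indexed field on the document. However wide the date range is, the query touches at most a handful of month values and 31 day values, which is why it stays far below the clause limit.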
removing duplicate Documents from Hits
Hello,

I've searched previous posts on this topic but couldn't find an answer. I want to query my index (which is a number of 'flattened' Oracle tables) for some criteria, then return Hits such that no Documents duplicate a particular field.

In the case where table A has a one-to-many relationship to table B, I get one Document for each pair (A1-B1, A1-B2, A1-B3...). My index needs to have each of these records, as 'B' is a searchable field in the index. However, after the query is executed, I want my resulting Hits to be unique on 'A'. I'm only returning the Oracle object ID, so once I've seen it once I don't need it again.

It looks like some sort of custom Filter is in order. My fix at the moment is to run the query, then store the unique IDs in a Map to build another query that will return singletons on field 'A'. I could skip this step if there were a way to remove documents from Hits (I didn't see one). Has anyone written a filter that does this?

Are there others using Lucene to mimic a relational DB? I've got a complex SQL search that joins (mostly outer) 40-some tables. Query performance is important, and the tables are relatively static. I find the IDs of the objects that match the users' criteria, then go to the DB to instantiate them.

Any comments are appreciated.
Re: removing duplicate Documents from Hits
Timm, Andy (ETW) wrote:

> Hello, I've searched on previous posts on this topic but couldn't
> find an answer. I want to query my index (which are a number of
> 'flattened' Oracle tables) for some criteria, then return Hits such
> that there are no Documents that duplicate a particular field. In
> the case where table A has a one-to-many relationship to table B, I
> get one Document for each (A1-B1, A1-B2, A1-B3...). My index needs
> to have each of these records as 'B' is a searchable field in the
> index. However, after the query is executed, I want my resulting
> Hits on be unique on 'A'. I'm only returning the Oracle object ID,
> so once I've seen it once I don't need it again. It looks like some
> sort of custom Filter is in order.

I'd suggest a HitCollector that uses a FieldCache of the A values to check for duplicates and collects only the best document id for each value of A. This would use a bit of RAM, but be very fast.

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/HitCollector.html
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/FieldCache.html

Doug
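
A minimal sketch of the bookkeeping such a collector would do, written without the Lucene classes so it runs on its own. The String array stands in for the per-document 'A' values that a FieldCache lookup would supply, collect(int doc, float score) mirrors the HitCollector callback, and the class name and sample data are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class BestPerKeyCollector {

    // Value of field 'A' for each document number; in Lucene this array
    // would come from the FieldCache rather than being passed in by hand.
    private final String[] keyByDoc;
    private final Map<String, Float> bestScore = new HashMap<>();
    private final Map<String, Integer> bestDoc = new TreeMap<>();

    public BestPerKeyCollector(String[] keyByDoc) {
        this.keyByDoc = keyByDoc;
    }

    /** Called once per matching document; keeps only the top-scoring doc per 'A' value. */
    public void collect(int doc, float score) {
        String key = keyByDoc[doc];
        Float best = bestScore.get(key);
        if (best == null || score > best) {
            bestScore.put(key, score);
            bestDoc.put(key, doc);
        }
    }

    /** Map from 'A' value to the document number of its best hit. */
    public Map<String, Integer> bestDocs() {
        return bestDoc;
    }

    public static void main(String[] args) {
        // Docs 0-2 are A1-B1, A1-B2, A1-B3; doc 3 is A2-B1.
        BestPerKeyCollector collector =
                new BestPerKeyCollector(new String[] {"A1", "A1", "A1", "A2"});
        collector.collect(0, 0.4f);
        collector.collect(1, 0.9f);
        collector.collect(2, 0.5f);
        collector.collect(3, 0.3f);
        System.out.println(collector.bestDocs()); // {A1=1, A2=3}
    }
}
```

The RAM cost mentioned above corresponds to the keyByDoc array (one entry per document in the index) plus one map entry per distinct 'A' value among the hits.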
new release: 1.4.2
There's a new release of Lucene, 1.4.2, which mostly fixes bugs in 1.4.1. Details are at http://jakarta.apache.org/lucene/.

Doug
Memory leak in ParallelMultiSearcher?
Hello,

I ran across this post (http://java2.5341.com/msg/77213.html) in the mailing list archives, and am wondering if anyone has any updates on this?

Thanks,
Ed
multiple threads
As I understand it, if two writers try to access the same index for writing, one of them should block waiting for the lock until the lock timeout period expires, and then fail with a lock wait timeout exception.

I have a multithreaded indexing application that writes into one of multiple indexes depending on a hash value, and I intend to merge all the indexes when the indexing finishes. Locking usually works, but sometimes it doesn't and I get IO exceptions such as the following:

    java.io.IOException: Cannot delete _19.fnm
        at org.apache.lucene.store.FSDirectory.deleteFile(FSDirectory.java:198)
        at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java:157)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:100)
        at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:487)
        at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
        at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:389)
        at org.en.global.indexer.IndexGroup.run(IndexGroup.java:387)

Any idea why this could be happening? I am using NFS currently, but the problem appears on the local filesystem as well.
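
A minimal sketch of the layout described above (a hash value picks one of several indexes), with one in-process lock per index so that only one thread at a time writes to a given index. This is not a diagnosis of the exception; it only illustrates serialising writers inside a single JVM instead of relying on the index lock files. The class name is hypothetical, and a counter stands in for the real write to the index:

```java
import java.util.concurrent.locks.ReentrantLock;

public class PartitionedIndexing {

    private final ReentrantLock[] locks;
    private final int[] docsWritten; // stand-in for one index per partition

    public PartitionedIndexing(int partitions) {
        locks = new ReentrantLock[partitions];
        docsWritten = new int[partitions];
        for (int i = 0; i < partitions; i++) {
            locks[i] = new ReentrantLock();
        }
    }

    /** Maps a document key to a partition; floorMod keeps negative hash codes in range. */
    public int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), locks.length);
    }

    /** Writes one document while holding the lock of its partition. */
    public void write(String key) {
        int p = partitionFor(key);
        locks[p].lock();
        try {
            docsWritten[p]++; // the real code would add the document to index p here
        } finally {
            locks[p].unlock();
        }
    }

    /** Total number of documents written across all partitions. */
    public int total() {
        int sum = 0;
        for (int i = 0; i < locks.length; i++) {
            locks[i].lock();
            try {
                sum += docsWritten[i];
            } finally {
                locks[i].unlock();
            }
        }
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        PartitionedIndexing indexing = new PartitionedIndexing(4);
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    indexing.write("doc-" + id + "-" + i);
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.println(indexing.total()); // 4000
    }
}
```

In the same spirit, the final merge step would run only after every worker thread has been joined and every per-index writer has been closed.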
Question regarding using Lucene or not
Hello,

I have a stand-alone Java application. We have a new requirement where there will be around 1000 data files in XML format. Each of them has the same format. Nodes will have values and attributes.

In the application, the user will search for a particular spec (the data file) by defining parameters. The parameters are both string and numeric. For example, the model should be Cargo and its HP value should be 55,000 or near it. If we specify a tolerance value of 5000, then it should search for all the data files where the model node is Cargo (definitive match) and the HP value is between 50,000 and 60,000, with the one having 55,000 coming up as the 100% match.

Do you think Lucene can meet this requirement, or do I need to look into another product? Please let me know.

Thanks.
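
A minimal sketch of the matching rule described above, independent of any search library: the model must match exactly, the HP value must lie within the tolerance, and closeness is turned into a score where an exact HP match gives 1.0 (100%). The linear fall-off towards the edge of the tolerance band is an assumption, since the message only fixes the 100% case, and the class and method names are hypothetical:

```java
import java.util.OptionalDouble;

public class ToleranceMatch {

    /**
     * Returns a score between 0.0 and 1.0 when the spec matches, or empty when it does not.
     * The model is compared exactly; HP must be within +/- tolerance of the target.
     */
    public static OptionalDouble score(String model, double hp,
                                       String wantedModel, double targetHp, double tolerance) {
        if (!wantedModel.equals(model)) {
            return OptionalDouble.empty(); // definitive match on the model failed
        }
        double distance = Math.abs(hp - targetHp);
        if (distance > tolerance) {
            return OptionalDouble.empty(); // outside the allowed HP band
        }
        if (tolerance == 0.0) {
            return OptionalDouble.of(1.0); // exact-only search, avoid dividing by zero
        }
        // Assumed linear fall-off: 1.0 at the target, 0.0 at the edge of the band.
        return OptionalDouble.of(1.0 - distance / tolerance);
    }

    public static void main(String[] args) {
        System.out.println(score("Cargo", 55000, "Cargo", 55000, 5000).getAsDouble()); // 1.0
        System.out.println(score("Cargo", 52500, "Cargo", 55000, 5000).getAsDouble()); // 0.5
        System.out.println(score("Cargo", 61000, "Cargo", 55000, 5000).isPresent());   // false
    }
}
```

With roughly 1000 files, a rule like this could simply be applied to every parsed spec. If an index is used instead, the exact match and the HP band would select the candidates and the closeness score would still be computed in application code.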