Re: search question
Erik,
They both use the StandardAnalyzer, but looking at the toString() output makes everything clearer. In 1.2, a string containing an email address such as [EMAIL PROTECTED] gets split like so:
first.last domain.com
In 1.4 it does not get split. So now we just check whether an index was built with 1.2 or 1.4 and handle each case accordingly.
Thanks for the guidance.
Roy.
On Wed, 22 Dec 2004 18:41:44 -0500, Erik Hatcher wrote:
> What does toString() return for each of those queries? Are you using the same analyzer in both cases?
> Erik
- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
search question
Hi guys,
We have an index with some fields containing email addresses. Searching for an email address in the format [EMAIL PROTECTED] does not bring up any results with Lucene 1.4. The query is:
Field1:[EMAIL PROTECTED]
However, the same query returns results with 1.2. Any ideas?
Roy.
lock file paths
Hey guys, Quick question... is there a way to get the file paths to the lock files? Or do I have to modify the src? Currently I can't find any methods that will return a lock's file path. Roy.
Re: demo HTML parser question
Hi Fred,
We originally tried to use the demo HTML parser (Lucene 1.2), but as you know, it's only a demo. I think it's threaded to save time: the calling thread can grab the title or the top of the message before the entire HTML document has been parsed. That's just a guess; I would love to hear from others about this. Since parsing runs in a separate thread, a token error can kill that thread, and the calling thread has no way of knowing about it.
We had to create our own HTML parser, since we only cared about grabbing the entire text of the HTML document and wanted to avoid the extra thread. We also do a lot of SKIPping to keep EOF errors to a minimum (HTML documents in email almost never follow standards). For your HTML needs, you might want to check out the other JavaCC HTML parsers on the JavaCC web site.
Roy.
On Wed, 22 Sep 2004 22:42:55 -0400, Fred Toth wrote:
> Hi,
> I've been working with the HTML parser demo that comes with Lucene and I'm trying to understand why it's multi-threaded, and, more importantly, how to exit gracefully on errors. I've discovered that if I throw an exception in the front-end static code (main(), etc.), the JVM hangs instead of exiting. Presumably this is because there are threads hanging around doing something. But I'm not sure what! Any pointers? I just want to exit gracefully on an error, such as a required meta tag being missing.
> Thanks,
> Fred
compiling 1.4 source
Hi guys, So we started upgrading to 1.4 and we need to add some of our own custom code. After compiling with ant, I noticed that the 1.4 ant script builds a jar called lucene-1.5-rc1-dev.jar, not lucene-1.4-final.jar. I'm pretty sure I did not download the wrong source. Is this just a wrong name in the properties or does the source code actually contain lucene 1.5 rc1 code? Roy.
Hits.doc(x) and range queries
Hi guys!
I posted previously that Hits.doc(x) was taking a long time. It turns out it has to do with a date range in our query. We usually do date ranges like this:
Date:[(lucene date field) - (lucene date field)]
Sometimes the begin date is 0, which is what we get from DateField.dateToString(new Date(0)). That is when getting our search results from the Hits object takes an absurd amount of time. The delay usually occurs each time the Hits object attempts to get more results from the IndexSearcher (i.e., every 100?). It also takes up more memory. I was wondering why it affects the search so much even though we're only returning 350 or so results.
Does the QueryParser do something similar to the DateFilter on range queries? Would it be better to use a DateFilter? We're using Lucene 1.2 (with plans to upgrade). Do newer versions of Lucene have this problem?
Roy.
Re: Custom filter
On Fri, 20 Aug 2004 20:01:36 -0400, Erik Hatcher wrote:
> On Aug 20, 2004, at 6:48 PM, [EMAIL PROTECTED] wrote:
>> We're currently in lucene 1.2... haven't moved to 1.3 yet.
> Skip 1.3 and go straight to 1.4.1 :) Upgrade - why not?
Well, we have some MASSIVE indexes, so updating needs to be planned out. In the meantime we continue with 1.2.
So, just for curiosity's sake... any clue on the filter? Or perhaps someone could clue me in on what kind of terms the query parser creates (and what the searcher class does with them) when it has something like (From:(blah OR blah2) OR To:(blah OR blah2)). I tried to look at the QueryParser.jj file, but javacc makes my head hurt...
Roy.
Custom filter
Hi guys!
I was hoping someone here could help me out with a custom filter. We have an index of emails and do some searches on the text of an email message and also searches based on the email addresses in a To, From or CC. Since we also do searches on a bunch of emails, we created a custom filter that searches an array of fields for an array of values. [code included below]
The problem we're having is that a query string like this:
Message:viagra AND (From:(email1 OR email2) OR To:(email1 OR email2) OR CC:(email1 OR email2))
returns results, but our filter combined with a query string of Message:viagra sometimes doesn't. One thing I noticed is that when the results do return with the filter, the email has the format of [EMAIL PROTECTED], but the one that doesn't has something like [EMAIL PROTECTED]
It might also have something to do with how the From, To or CC is stored. We don't parse out the email addresses before storing them, so the value of a From/To/CC field might be [EMAIL PROTECTED] or local [EMAIL PROTECTED] or even [EMAIL PROTECTED]. Could the angle brackets be throwing off my filter? I also wouldn't mind any suggestions for doing this filter better.
Here is the bits method from our custom filter:
    final public BitSet bits( IndexReader reader ) throws IOException {
        BitSet bits = new BitSet( reader.maxDoc() );
        for ( int x = 0; x < fields.length; x++ ) {
            for ( int y = 0; y < values.length; y++ ) {
                TermDocs termDocs = reader.termDocs( new Term( fields[x], values[y] ) );
                try {
                    while ( termDocs.next() ) {
                        bits.set( termDocs.doc() );
                    }
                } finally {
                    termDocs.close();
                }
            }
        }
        return bits;
    }
Thanks in advance,
Roy.
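One possible explanation, offered as an assumption and not confirmed anywhere in this thread: the bits method looks up each value as an exact term, whereas a query string is run through the analyzer first. The sketch below is a toy model in plain Java, not Lucene code; the class `ExactTermFilterSketch`, its `addPosting` method, and the sample addresses are all hypothetical. It shows only that a union of exact-term postings misses a document whose stored term differs from the lookup value by as little as a pair of angle brackets.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a term-based filter. NOT Lucene code; every name here is hypothetical. */
final class ExactTermFilterSketch {
    // Hypothetical inverted index: "field:term" -> ids of documents containing that exact term.
    private final Map<String, BitSet> postings = new HashMap<>();
    private int maxDoc = 0;

    /** Records that document docId contains the given exact term in the given field. */
    void addPosting(String field, String term, int docId) {
        postings.computeIfAbsent(field + ":" + term, k -> new BitSet()).set(docId);
        maxDoc = Math.max(maxDoc, docId + 1);
    }

    /** Union of the postings for every (field, value) pair, mirroring the nested loops in bits(). */
    BitSet bits(List<String> fields, List<String> values) {
        BitSet result = new BitSet(maxDoc);
        for (String field : fields) {
            for (String value : values) {
                BitSet docs = postings.get(field + ":" + value);
                if (docs != null) {
                    result.or(docs); // only an exact term match contributes documents
                }
            }
        }
        return result;
    }
}
```

If this is what is happening, the values passed to the filter would need to be normalized the same way the field contents were at index time.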
Re: addIndexes vs addDocument
Otis,
Okay, got it. However, we weren't creating new Document objects; we were just grabbing a document through an IndexReader and calling addDocument on another index. Would that still work with unstored fields? (It's working for us, since we don't have any unstored fields.)
Thanks a lot!
Roy.
On Tue, 6 Jul 2004 19:46:30 -0700 (PDT), Otis Gospodnetic wrote:
> Quick example. Index A has fields 'title' and 'contents'. Field 'contents' is stored in A as Field.UnStored. This means that you cannot retrieve the original content of the 'contents' field, since that value was not stored verbatim in the index. Therefore, you cannot create a new Document instance, pull out the String value of the 'contents' field from A, use it to create another field, add it to the new Document instance, and add that Document to a new index B using the addDocument method. The addIndexes method does not need to pull out the original String field values from Documents, so it will work.
addIndexes and optimize
Hey y'all again, Just wondering why the IndexWriter.addIndexes method calls optimize before and after it starts merging segments together. We would like to create an addIndexes method that doesn't optimize and call optimize on the IndexWriter later. Roy.
moving 1.2 index to 1.4
Hey guys,
We have a couple of giant indexes that were built with Lucene 1.2. We would like to move to Lucene 1.4 at some point. We have heard that we would probably need to re-index to take advantage of certain new features and optimizations in Lucene 1.3/1.4. Is it possible to open our old 1.2 index with an IndexReader, get each Document object, and add it to a new 1.4 index? Would that be the same as re-building an index from scratch?
Thanks!
Roy.
RE: Stress/scalability testing Lucene
Ah, for some reason I thought none of the Lucene methods were thread safe. Or is that only the case when reading and writing at the same time? I thought I read this in the FAQ.
Roy.
-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 20, 2002 5:04 PM
To: Lucene Users List
Subject: Re: Stress/scalability testing Lucene
Justin Greene wrote:
> We created a thread pool to read and parse the email messages. 10 threads seems to be the magic number here for us. We then created a queue of messages to be indexed onto which we push the parsed messages and have a single thread adding messages to the index.
IndexWriter.addDocument(Document) is thread safe, so you don't need a separate indexing thread. So long as your analyzer is thread safe, you can index each message in the thread that parses it, for even greater parallelism.
Doug
This email and any attachments are confidential and may be legally privileged. No confidentiality or privilege is waived or lost by any transmission in error. If you are not the intended recipient you are hereby notified that any use, printing, copying or disclosure is strictly prohibited. Please delete this email and any attachments, without printing, copying, forwarding or saving them and notify the sender immediately by reply e-mail. Zurich Capital Markets and its affiliates reserve the right to monitor all e-mail communications through its networks. Unless otherwise stated, any pricing information in this e-mail is indicative only, is subject to change and does not constitute an offer to enter into any transaction at such price and any terms in relation to any proposed transaction are indicative only and subject to express final confirmation.
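The arrangement Doug describes, where each worker thread parses a message and then adds it straight to a shared thread-safe writer, can be sketched roughly as follows. This is plain Java with no Lucene dependency: `SharedWriter` is a hypothetical stand-in for a thread-safe index writer, `parse` is a placeholder for real message parsing, and the pool size is whatever the caller passes in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Sketch only: parse and index in the same worker thread, sharing one thread-safe writer. */
final class ParallelIndexingSketch {
    /** Hypothetical stand-in for a writer whose addDocument is safe to call concurrently. */
    static final class SharedWriter {
        private final List<String> docs = new ArrayList<>();
        synchronized void addDocument(String parsed) { docs.add(parsed); }
        synchronized int size() { return docs.size(); }
    }

    /** Placeholder for the real parsing step. */
    static String parse(String rawMessage) {
        return rawMessage.trim();
    }

    /** Parses and adds every message using a fixed pool; returns how many documents were added. */
    static int indexAll(List<String> rawMessages, int threads) {
        SharedWriter writer = new SharedWriter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String raw : rawMessages) {
            // No hand-off queue and no dedicated indexing thread: the parsing thread adds directly.
            pool.execute(() -> writer.addDocument(parse(raw)));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for indexing", e);
        }
        return writer.size();
    }
}
```

Compared with the queue-plus-single-indexer design, this removes one hand-off, at the cost of requiring everything the workers share to be thread safe.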
the order of fields in Document.fields()
Quick question about Document.fields(). Lucene provides a method to retrieve the value of a field and a method to grab all fields as an Enumeration. It does not, however, let you grab all values of one field for a document; it only returns the last value added for that field.
For example, I am indexing email messages that might have multiple To/CC/BCC fields in the message header. Currently, to grab all the values when I display an email that has been indexed, I must use the fields() method to get an Enumeration of all fields in a document. I then separate them into different arrays based on the field names. However, I am concerned about the order of the fields, since I consider the first To or CC or BCC to be the main value for each field.
Are the fields returned in the order in which they were added? Or is there no order? If there is no order, can someone suggest a solution?
Thanks!
Roy.
RE: the order of fields in Document.fields()
Shouldn't there be at least one method that returns an array of fields in the correct order?
Roy.
-----Original Message-----
The order is preserved (or reversed, actually), so it's not random. It's the reverse of the order in which the fields were added to the document. This would be easy to test...
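Taking the reply at its word that fields come back in the reverse of the order in which they were added, a workaround along these lines would recover the original order per field name. This is a plain-Java sketch with no Lucene dependency: each field is modelled as a hypothetical two-element array of name and value, and `groupInAddedOrder` is an illustrative helper, not a Lucene method.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: regroup fields that were enumerated in reverse insertion order. */
final class FieldOrderSketch {
    /**
     * Takes {name, value} pairs in reverse insertion order and returns, for each field name,
     * its values in the order they were originally added.
     */
    static Map<String, List<String>> groupInAddedOrder(List<String[]> reversedFields) {
        List<String[]> added = new ArrayList<>(reversedFields);
        Collections.reverse(added); // undo the reversal so the first-added field comes first
        Map<String, List<String>> byName = new LinkedHashMap<>();
        for (String[] field : added) {
            byName.computeIfAbsent(field[0], k -> new ArrayList<>()).add(field[1]);
        }
        return byName;
    }
}
```

With this, the first element of each list is the first To, CC or BCC that was added, which is the "main" value the question asks about.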
Deleting a document found in a search
I am just getting started with Lucene and I think I have a problem understanding some basic concepts. I am using two-part identifiers to uniquely identify a document in the index. So whenever I want to index a document, I first want to find and delete the old form. To find it, I intend to use:
    BooleanQuery findOurs = new BooleanQuery();
    findOurs.add(new TermQuery(new Term("Id", id)), true, false);
    findOurs.add(new TermQuery(new Term("Domain", domain)), true, false);
    System.out.println("Deleting document matching: '" + findOurs.toString() + "'");
    Searcher searcher = new IndexSearcher(directory);
    Hits hits = searcher.search(findOurs);
    // Assert: hits.length() = 1
    for (int i = 0; i < hits.length() && i < 10; i++) {
        Document d = hits.doc(i);
        // Now what can I do to find document id?
        int id = ??
        searcher.delete(id);
    }
But I can't discover how to convert a search result into a document id. It is recorded in the private HitDoc class, but since it is not publicly accessible, there must be a reason why it would not work to add a public getter for it. Is there an alternative way that I can do this?
My first thought is to define a Field.Keyword("composite-key", domain + \u + id). This would allow me to use the delete(Term) interface to delete the key.
-- Thanks, Adrian.
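The composite-key idea at the end of the message comes down to joining the two identifier parts into one string that can be stored as a single untokenized field value and later used as the term to delete by. The plain-Java sketch below covers only the key construction; the class name, method name, and the `|` separator are assumptions for illustration (the separator in the original message is truncated and is not reproduced here).

```java
/** Sketch only: build one unambiguous key from a two-part identifier. */
final class CompositeKeySketch {
    // Assumed separator, chosen arbitrarily for this example; it must not occur in either part.
    static final char SEPARATOR = '|';

    /** Joins domain and id into a single key, rejecting parts that contain the separator. */
    static String compositeKey(String domain, String id) {
        if (domain.indexOf(SEPARATOR) >= 0 || id.indexOf(SEPARATOR) >= 0) {
            throw new IllegalArgumentException("identifier parts must not contain '" + SEPARATOR + "'");
        }
        return domain + SEPARATOR + id;
    }
}
```

Rejecting parts that contain the separator keeps two different (domain, id) pairs from ever producing the same key, which matters if the key is what identifies the document to delete.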
Enumerating all Terms
Is there a way of getting a list of all Terms that have been indexed? I guess it would approximate a wildcard query of the form *:* if that were valid, and instead of returning matching documents, just returning the fields and values. -- Thanks, Adrian.
Re: Deleting a document found in a search
No, I mean HitDoc.id, the document number field stored in the HitDoc class. This number is needed when calling IndexReader.delete(int docnum), but it is not publicly accessible.
-- Adrian
At 06:32 09/10/2002 -0700, Otis Gospodnetic wrote:
> You mean d.get("Id"); ?