Re: Unique Fields
The "problem" is that my unique field is a title, many terms per field. I want to make an index with titles and i don't want to have duplicates. John Erick Erickson wrote: You can easily find whether a term is in the index with TermEnum/TermDocs (I think TermEnum is all you really need). Except, you'll probably also have to keep an internal map of IDs added since the searcher was opened and check against that too. Best Erick On Tue, Mar 11, 2008 at 11:04 AM, Ion Badita <[EMAIL PROTECTED]> wrote: Hi, I want to create an index with one unique field. Before inserting a document i must be sure that "unique field" is unique. John - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Specialized XML handling in Lucene
Indeed it seems like a problematic way. I would also have a problem searching for documents with more than one value: if the query is something simple like "value1 AND value2", I would expect to get all XML docs with both values, but if I use the doc = element method I won't get any result, because each doc contains only value1 or value2 or something else, even if their xml_doc_id is the same.

Back to the drawing board...

On Tue, Mar 11, 2008 at 9:50 PM, Steven A Rowe <[EMAIL PROTECTED]> wrote:
> On 03/11/2008 at 11:48 AM, Steven A Rowe wrote:
>> 5 billion docs is within the range that Lucene can handle. I
>> think you should try doc = element and see how well it works.
>
> Sorry, Eran, I was dead wrong about this assertion. See this thread for
> more information:
>
> http://www.nabble.com/MultiSearcher-to-overcome-the-Integer.MAX_VALUE-limit-td15876190.html
>
> Looks like doc = element is *not* the way to go.
>
> Steve
Re: Searching for null (empty) fields, how to use -field:[* TO *]
Thanks for your suggestion markmiller. When I try this query, I get both documents as hits: the one with the field having a value and also the one with the field not set... Any idea why?

markrmiller wrote:
> You cannot have a purely negative query like you can in Solr.
>
> Try: *:* -MY_FIELD_NAME:[* TO *]

--
View this message in context: http://www.nabble.com/Searching-for-null-%28empty%29-fields%2C-how-to-use--field%3A-*-TO-*--tp15976538p16000127.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: Document ID shuffling under 2.3.x (on merge?)
Daniel Noll wrote:
> I have filtered out lines in the log which indicated an exception adding
> the document; these occur when our Reader throws an IOException and there
> were so many that it bloated the file.

OK, I think very likely this is the issue: when IndexWriter hits an exception while processing a document, the portion of the document already indexed is left in the index, and then its docID is marked for deletion. You can see these deletions in your infoStream:

  flush 0 buffered deleted terms and 30 deleted docIDs on 20 segments

This means you have deletions in your index, by docID, and so when you optimize the docIDs are then compacted.

Mike
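To see the same flush diagnostics Mike quotes in your own logs, the writer can be handed a PrintStream. A minimal sketch, assuming the 2.3 API and a hypothetical index path:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class InfoStreamExample {
    public static void main(String[] args) throws Exception {
        IndexWriter writer =
            new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        // Echo flush/merge diagnostics, including buffered-delete and
        // deleted-docID counts, to stdout.
        writer.setInfoStream(System.out);
        // ... addDocument() calls ...
        writer.close();
    }
}
```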
Using Lucene from scripting language without any java coding
Here is a POC about using Lucene, via Compass, from PHP or Python (other languages will come later), with only XML configuration, object notation, and native use of the scripting language.

http://blog.garambrogne.net/index.php?post/2008/03/11/Using-Compass-without-dirtying-its-hands-with-java

It looks like Solr, but it's different.

Solr: framework: Lucene; serialisation: XML; transport: HTTP via servlet.
Goniometre: framework: Compass + Spring; serialisation: JSON; transport: socket via MINA.

Another difference: Solr is a mature project with admin pages, caching and production-ready stuff. Goniometre is a prototype for Compass fans who like XML and coding in any language except Java. Examples, tests and code are available via svn.

M.
Re: Unique Fields
So, you're tokenizing the title field? If so, I don't understand how you expect this to work. Would the titles "this is one order" and "is one order this" be considered identical? Would capitalization matter? Punctuation?

Throwing all the terms of a title into a tokenized field and expecting some magic to keep duplicates out is beyond the scope of Lucene; you'll have to roll some customized solution. For instance, index your title UN_TOKENIZED in a duplicate field (after applying whatever massaging you want re: punctuation, spaces, etc.). Use TermDocs/TermEnum on that field to detect duplicates. You won't search on this field. Or create a hash of the title, index *that* in a separate field, and check against the hash with TermEnum/TermDocs. Or...

But no, there's no magic that makes Lucene DWIM (Do What I Mean)...

Best
Erick

On Wed, Mar 12, 2008 at 2:01 AM, Ion Badita <[EMAIL PROTECTED]> wrote:
> The "problem" is that my unique field is a title, many terms per field.
> I want to make an index with titles and I don't want to have duplicates.
>
> John
>
> Erick Erickson wrote:
>> You can easily find whether a term is in the index with TermEnum/TermDocs
>> (I think TermEnum is all you really need).
>>
>> Except, you'll probably also have to keep an internal map of IDs added
>> since the searcher was opened and check against that too.
>>
>> Best
>> Erick
>>
>> On Tue, Mar 11, 2008 at 11:04 AM, Ion Badita <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>>
>>> I want to create an index with one unique field.
>>> Before inserting a document I must be sure that "unique field" is unique.
>>>
>>> John
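The "massaging" step Erick mentions can be sketched in plain Java. The normalization rules here (ASCII lower-casing, punctuation stripped, whitespace collapsed) are assumptions you would adapt to your data; you would then index the normalized string, or its hash, UN_TOKENIZED in a separate field and probe that field with TermDocs before adding a document:

```java
import java.security.MessageDigest;

public class TitleDedup {
    // Normalize a title before indexing it UN_TOKENIZED in a
    // duplicate-check field: lower-case, strip punctuation, collapse
    // runs of whitespace.
    public static String normalizeTitle(String title) {
        return title.toLowerCase()
                    .replaceAll("[^a-z0-9\\s]", "")
                    .replaceAll("\\s+", " ")
                    .trim();
    }

    // Optionally hash the normalized title so the duplicate-check term
    // stays short even for very long titles.
    public static String titleHash(String title) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(normalizeTitle(title).getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Note that word order still matters with this scheme: "this is one order" and "is one order this" normalize to different strings, which is usually what you want for titles.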
Re: Searching for null (empty) fields, how to use -field:[* TO *]
Thanks Erick, I ended up following your second suggestion. It has been a bit tricky since I had to plug into a MapConverter, but it works as expected. Thanks to all.

--thogau

Erick wrote:
> You could also think about making a filter, probably when you open your
> searcher. You can use TermDocs/TermEnum to find all of the documents that
> *do* have entries for your field, assemble those into a filter, then
> invert that filter. Keep the filter around and use it whenever you need
> to. Perhaps CachingWrapperFilter would help here (although I've never
> used it).
>
> Another possibility is to index a field only for those documents that
> don't have any value for MY_FIELD_NAME. So when indexing a doc, you have
> something like:
>
> if (has MY_FIELD_NAME) { doc.add("MY_FIELD_NAME", ...); }
> else { doc.add("NO_MY_FIELD_NAME", "no"); }
>
> Now finding docs without your field really is just searching on
> NO_MY_FIELD_NAME:no. Your index would be very slightly bigger in this
> instance FWIW.
>
> Erick

--
View this message in context: http://www.nabble.com/Searching-for-null-%28empty%29-fields%2C-how-to-use--field%3A-*-TO-*--tp15976538p16002412.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
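Erick's sentinel-field idea, written out as compilable Lucene 2.x code. The field names come from the thread; `value` is a hypothetical stand-in for whatever your document actually carries:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class SentinelFieldExample {
    public static Document buildDoc(String value) {
        Document doc = new Document();
        if (value != null) {
            doc.add(new Field("MY_FIELD_NAME", value,
                              Field.Store.YES, Field.Index.TOKENIZED));
        } else {
            // Sentinel marking "field absent". Finding empty docs then
            // becomes the ordinary positive query NO_MY_FIELD_NAME:no
            // instead of a purely negative one.
            doc.add(new Field("NO_MY_FIELD_NAME", "no",
                              Field.Store.NO, Field.Index.UN_TOKENIZED));
        }
        return doc;
    }
}
```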
Highlighter Hits
Hello everybody,

I have a slight problem using Lucene's highlighter. If I have the highlighter enabled, a query returns 0 hits; if I disable the highlighter I get the hits. It seems like when I call searcher.search() and pass my Hits to the highlighter function, the program quits. All prints after the highlighter call also do not appear. I have no idea what the problem is.

Thanks in advance

Jens Burkhardt
--
View this message in context: http://www.nabble.com/Highlighter-Hits-tp16002424p16002424.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: Document ID shuffling under 2.3.x (on merge?)
I certainly found that lazy loading changed my speed dramatically, but that was on a particularly field-heavy index.

I wonder if TermEnum/TermDocs would be fast enough on an indexed (UN_TOKENIZED???) field for a unique id. Mostly, I'm hoping you'll try this and tell me if it works so I don't have to sometime.

Erick

On Tue, Mar 11, 2008 at 9:26 PM, Daniel Noll <[EMAIL PROTECTED]> wrote:
> On Wednesday 12 March 2008 09:53:58 Erick Erickson wrote:
>> But to me, it always seems...er...fraught to even *think* about relying
>> on doc ids. I know you've been around the block with Lucene, but do you
>> have a compelling reason to use the doc ID and not your own unique ID?
>
> From memory it was around a factor of 10 times slower to use a text field
> for this; I haven't tested it recently, and the case of retrieving the
> Document should be slightly faster now that we have FieldSelector, but it
> certainly won't be faster, as to get the document you need the ID in the
> first place.
>
> For single documents it wasn't a problem; the use cases are:
> 1. Bulk database operations based on the matched documents.
> 2. Creating a filter BitSet based on a database query.
>
> Effectively this is required because Lucene offered no way to update a
> Document after it was indexed; if it had that feature we would never have
> needed a database. ;-)
>
> Daniel
Re: Highlighter Hits
What does your stack trace look like? I've never seen Lucene "just quit" without throwing an exception, and printStackTrace() is your friend. Or are you catching exceptions without logging them? If so, shame on you.

Best
Erick

P.S. I can't recommend strongly enough that you get a good IDE and debug in it. I spent far too much of my life debugging with printlns and never, ever want to go back there again... Eclipse is free, if sometimes "interesting" to set up. IntelliJ is sweet. And a unit test or two will help significantly too. Sorry if you know all this, but your comment about prints lights me right up.

On Wed, Mar 12, 2008 at 9:54 AM, JensBurkhardt <[EMAIL PROTECTED]> wrote:
> Hello everybody,
>
> I have a slight problem using Lucene's highlighter. If I have the
> highlighter enabled, a query returns 0 hits; if I disable the highlighter
> I get the hits. It seems like when I call searcher.search() and pass my
> Hits to the highlighter function, the program quits. All prints after the
> highlighter call also do not appear. I have no idea what the problem is.
>
> Thanks in advance
>
> Jens Burkhardt
> --
> View this message in context:
> http://www.nabble.com/Highlighter-Hits-tp16002424p16002424.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
cannot delete cfs files on Windows
Hello,

I can index many times and delete the index files (manually). But if I search once, then the cfs file is locked and cannot be deleted. Subsequent indexings create new cfs files. Even if I undeploy the Tomcat web application which holds the search code, the cfs file cannot be deleted.

O/S: Windows XP

Code to index:

IndexWriter writer = new IndexWriter(PATH, new StandardAnalyzer(), true);
writer.addDocument(doc1);
writer.addDocument(doc2);
writer.addDocument(doc3);
writer.optimize();
writer.close();

Code to search:

Searcher searcher = null;
IndexReader indexReader = null;
try {
    indexReader = IndexReader.open(PATH);
    searcher = new IndexSearcher(indexReader);
    Hits hits = searcher.search(query);
    ...
} finally {
    searcher.close();
    indexReader.close();
}

Am I doing something wrong?

thanks,
Ioannis
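One thing worth ruling out in the search snippet: if IndexReader.open() or the searcher construction throws, `searcher` is still null, so the finally block itself throws NullPointerException, skips indexReader.close(), and leaves the file handle open; on Windows an open handle keeps the .cfs locked. A null-safe version of the same block, as a sketch against the 2.x API:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

public class SafeSearch {
    public static void search(String path, Query query) throws Exception {
        Searcher searcher = null;
        IndexReader indexReader = null;
        try {
            indexReader = IndexReader.open(path);
            searcher = new IndexSearcher(indexReader);
            Hits hits = searcher.search(query);
            // ... use hits here, while the searcher is still open ...
        } finally {
            // Guard both closes so a failure opening one resource cannot
            // leak the file handle of the other.
            if (searcher != null) searcher.close();
            if (indexReader != null) indexReader.close();
        }
    }
}
```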
Re: Highlighter Hits
I suspect you are using a different analyzer to highlight than you are using to search. A couple of things you can check: immediately after your query, simply print out hits.length(); this should conclusively tell you whether your query is in fact working. After that, ensure that you are using the same analyzer for your highlighter that you are for your query parser. If you are not, it's entirely possible that the text you are trying to highlight is being tokenized differently than it was in the query, and as a result isn't matching against your fields anymore.

Hope that helps,

Matt

JensBurkhardt wrote:
> Hello everybody,
>
> I have a slight problem using Lucene's highlighter. If I have the
> highlighter enabled, a query returns 0 hits; if I disable the highlighter
> I get the hits. It seems like when I call searcher.search() and pass my
> Hits to the highlighter function, the program quits. All prints after the
> highlighter call also do not appear. I have no idea what the problem is.
>
> Thanks in advance
>
> Jens Burkhardt
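Matt's advice as a sketch against the Lucene 2.x contrib highlighter (lucene-highlighter jar). The field name "contents" and the analyzer choice are assumptions; the point is that the same analyzer instance feeds both the QueryParser and getBestFragment():

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class HighlightSketch {
    public static String highlight(String queryText, String storedText)
            throws Exception {
        Analyzer analyzer = new StandardAnalyzer();

        // Parse the query with the analyzer...
        Query query = new QueryParser("contents", analyzer).parse(queryText);

        // ...and highlight with the SAME analyzer, so the stored text is
        // tokenized the same way the query terms were.
        Highlighter highlighter = new Highlighter(new QueryScorer(query));

        // Returns null (rather than quitting) when nothing in the text
        // matches, so check the return value before printing it.
        return highlighter.getBestFragment(analyzer, "contents", storedText);
    }
}
```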
Indexing Yes and No
Querying Lucene with includeNews:Yes works fine and brings back expected results. includeNews:No does not work and brings back nothing. There are definitely documents in my index that have the word "No" in the includeNews field. Tested in Luke with all the analyzers.

Any ideas? Any thoughts? Any workarounds? Any help much appreciated.

Thanks
Raq
--
View this message in context: http://www.nabble.com/Indexing-Yes-and-No-tp16012319p16012319.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
IndexReader deleteDocument
Hi,
I am trying to delete a document without using the Hits object. What is the unique field in the index that I can use to delete the document?

I am trying to make a web interface where the index can be modified, a smaller subset of what Luke does but using JSPs and Servlets.

To use deleteDocument(int docNum) I need docNum; how can I get this? Or does it have to come only via Hits?

Thanks,
Varun
Re: IndexReader deleteDocument
Have you seen the work that Mark Harwood has done making a GWT version of Luke? I think it's in the latest release.

varun sood wrote:
> Hi,
> I am trying to delete a document without using the Hits object. What is
> the unique field in the index that I can use to delete the document?
>
> I am trying to make a web interface where the index can be modified, a
> smaller subset of what Luke does but using JSPs and Servlets.
>
> To use deleteDocument(int docNum) I need docNum; how can I get this? Or
> does it have to come only via Hits?
>
> Thanks,
> Varun
Re: Indexing Yes and No
Well, if you're using a stopword list, "no" is likely to be on it and "yes" is not.

Raq wrote:
> Querying Lucene with includeNews:Yes works fine and brings back expected
> results. includeNews:No does not work and brings back nothing. There are
> definitely documents in my index that have the word "No" in the
> includeNews field. Tested in Luke with all the analyzers.
>
> Any ideas? Any thoughts? Any workarounds? Any help much appreciated.
>
> Thanks
> Raq
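If a stopword list is indeed eating "no", one common fix is to keep flag fields away from the analyzer entirely: index them UN_TOKENIZED and query them with a TermQuery, so neither indexing nor searching runs stopword removal. A sketch against the 2.x API, reusing the field name from the thread:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class FlagFieldExample {
    // Index the flag without analysis so "No" survives verbatim.
    public static Document withFlag(Document doc, String yesNo) {
        doc.add(new Field("includeNews", yesNo,
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        return doc;
    }

    // Query with a TermQuery, bypassing query-time analysis as well.
    // Note the value must match exactly, including case.
    public static Query flagQuery(String yesNo) {
        return new TermQuery(new Term("includeNews", yesNo));
    }
}
```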
Re: IndexReader deleteDocument
No, I haven't, but I will, even though I would like to make my own implementation. So, any idea how to get the doc num?

Thanks for replying.

Varun

On Wed, Mar 12, 2008 at 5:15 PM, Mark Miller <[EMAIL PROTECTED]> wrote:
> Have you seen the work that Mark Harwood has done making a GWT version
> of Luke? I think it's in the latest release.
>
> varun sood wrote:
>> Hi,
>> I am trying to delete a document without using the Hits object. What is
>> the unique field in the index that I can use to delete the document?
>>
>> I am trying to make a web interface where the index can be modified, a
>> smaller subset of what Luke does but using JSPs and Servlets.
>>
>> To use deleteDocument(int docNum) I need docNum; how can I get this? Or
>> does it have to come only via Hits?
>>
>> Thanks,
>> Varun
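If you add your own untokenized unique-id field at index time (the field name "uid" here is an assumption), you never need the internal doc num at all: you can delete by term. A 2.x-era sketch:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class DeleteByUid {
    public static int delete(String indexPath, String uid) throws Exception {
        IndexReader reader = IndexReader.open(indexPath);
        try {
            // Marks every document whose "uid" field exactly equals the
            // given value as deleted and returns how many were marked.
            return reader.deleteDocuments(new Term("uid", uid));
        } finally {
            reader.close(); // commits the deletions to the index
        }
    }
}
```

If you do need the raw doc num (e.g. for deleteDocument(int)), reader.termDocs(new Term("uid", uid)) will enumerate the matching doc numbers without going through Hits.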
Re: Document ID shuffling under 2.3.x (on merge?)
On Wednesday 12 March 2008 19:36:57 Michael McCandless wrote:
> OK, I think very likely this is the issue: when IndexWriter hits an
> exception while processing a document, the portion of the document
> already indexed is left in the index, and then its docID is marked
> for deletion. You can see these deletions in your infoStream:
>
>   flush 0 buffered deleted terms and 30 deleted docIDs on 20 segments
>
> This means you have deletions in your index, by docID, and so when
> you optimize the docIDs are then compacted.

Aha. Under 2.2 a failure would result in nothing being added to the text index, so this would explain the problem. It would also explain why smaller data sets are less likely to cause the problem (it's less likely for there to be an error in them.)

Workarounds?
- flush() after any IOException from addDocument() (overhead?)
- use ++ to determine the next document ID instead of
  index.getWriter().docCount() (out of sync after an error, but fixes
  itself on optimize())
- use a field for a separate ID (slower later when reading the index)
- ???

Daniel
indexing api wrt Analyzer
Hi all:

Maybe this has been asked before: I am building an index consisting of multiple languages (stored as a field), and I have different analyzers depending on the language of the content to be indexed. But the IndexWriter takes only an Analyzer.

I was hoping to have IndexWriter take an AnalyzerFactory, where the AnalyzerFactory produces an Analyzer depending on some criteria of the document, e.g. language.

Maybe I am going about this the wrong way. Any suggestions on how to go about it?

Thanks

-John
Re: indexing api wrt Analyzer
On Thu, Mar 13, 2008 at 10:40 AM, John Wang <[EMAIL PROTECTED]> wrote:
> Hi all:
>
> Maybe this has been asked before: I am building an index consisting of
> multiple languages (stored as a field), and I have different analyzers
> depending on the language of the content to be indexed. But the
> IndexWriter takes only an Analyzer.
>
> I was hoping to have IndexWriter take an AnalyzerFactory, where the
> AnalyzerFactory produces an Analyzer depending on some criteria of the
> document, e.g. language.
>
> Maybe I am going about this the wrong way. Any suggestions on how to go
> about it?

Perhaps this is what you are searching for:

http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html

With PerFieldAnalyzerWrapper, you can specify which analyzer to use with each field, as well as a default analyzer.

cheers,
asgeir
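A sketch of what this looks like in practice, assuming Lucene 2.3 plus the contrib analyzers jar for the language-specific analyzers; the field names ("body_de", "body_fr") are hypothetical, the idea being one field per language:

```java
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class MultiLanguageIndexer {
    public static void main(String[] args) throws Exception {
        // StandardAnalyzer is the default for any field not mapped below.
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        analyzer.addAnalyzer("body_de", new GermanAnalyzer());
        analyzer.addAnalyzer("body_fr", new FrenchAnalyzer());

        IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);
        // ... add documents, putting each language's text in its own field ...
        writer.close();
    }
}
```

The same wrapper can also be handed to QueryParser at search time, so each field is analyzed consistently in both directions.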
Re: indexing api wrt Analyzer
On Thursday 13 March 2008 15:21:19 Asgeir Frimannsson wrote:
>> I was hoping to have IndexWriter take an AnalyzerFactory, where the
>> AnalyzerFactory produces an Analyzer depending on some criteria of the
>> document, e.g. language.
>
> With PerFieldAnalyzerWrapper, you can specify which analyzer to use with
> each field, as well as a default analyzer.

Certainly this would work as long as you store each language in a different Lucene field. This is probably a good idea anyway, as it will be easier for the QueryParser, where there won't necessarily be enough text to determine the language easily.

Daniel
Re: Document ID shuffling under 2.3.x (on merge?)
On Thursday 13 March 2008 00:42:59 Erick Erickson wrote:
> I certainly found that lazy loading changed my speed dramatically, but
> that was on a particularly field-heavy index.
>
> I wonder if TermEnum/TermDocs would be fast enough on an indexed
> (UN_TOKENIZED???) field for a unique id.
>
> Mostly, I'm hoping you'll try this and tell me if it works so I don't
> have to sometime

I added a "uid" field to our existing fields. After the load there were some gaps in the values for this field; presumably those were documents where adding the doc failed and adding the fallback doc also failed. The index contains 20004 documents. Each test I ran over 10 iterations, and the times below are an average of the last 5, as it took around 5 rounds to warm up.

Filter building, for a filter returning 1000 documents randomly selected:
  Time to build filter by UID (100% Derby) - 93ms
  Additional time to build filter by DocID - 12ms (13% penalty)

A 13% penalty is acceptable IMO. The problem comes next.

Bulk operation building, for a query returning around 2800 documents:
  Time to build the bulk op by DocID (100% Hits) - 6ms
  Time to fetch the "uid" field from the document - 152ms (2600% penalty)
  Time to do the DB query (not counting commit though) - 10ms

For interest's sake I also timed fetching the document with no FieldSelector; that takes around 410ms for the same documents. So there is still a big benefit in using the field selector, it just isn't anywhere near enough to get it close to the time it takes to retrieve the doc IDs.

Daniel
Re: Good way of Indexing TextFiles
Hi All,

I tried one indexing strategy:

1. I am having unique numbers as the search column. For example, my search query would be: 9840836588 AND dateSc:[13/03/2008 TO 16/03/2008]

While indexing the numbers I divide the number by 3: 9840836588%3 = 26588

I create a folder in the following format: "/200080301-200080316/26588"

I index and store the records in that folder, so while searching I get the modulo and search the records only in that folder. Is it a good way of indexing?

Sebastin wrote:
> Hi All,
> I am going to create a Lucene index store of size 300 GB per month. I
> read the Lucene index performance tips in the wiki. Can anyone suggest
> what steps need to be followed when dealing with big indexes? My index
> store gets updated every second. I search 15 days of records,
> approximately 150 GB, at a time. Can anyone give me a clue what I have to
> set in the JVM, for both index and search, to avoid out-of-memory errors,
> and how I can create an index store for large indexes?

--
View this message in context: http://www.nabble.com/Good-way-of-Indexing-TextFiles-tp15950791p16021739.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.