RE: when indexing, java.io.FileNotFoundException
Increase minMergeDocs and use the compound file format when creating your index:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#setUseCompoundFile(boolean)

-----Original Message-----
From: Chris Lu [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 03, 2005 12:46 PM
To: Lucene Users List
Subject: when indexing, java.io.FileNotFoundException

Hi,

I am getting this exception now and then when I am indexing content. It doesn't always happen, but when it does, I have to delete the index and start over. This is a serious problem for us.

In this email, Doug said it has something to do with win32's lack of atomic renaming:
http://java2.5341.com/msg/1348.html

But how can I prevent this?

Chris Lu

    java.io.FileNotFoundException: C:\data\indexes\customer\_temp\0\_1e.fnm (The system cannot find the file specified)
            at java.io.RandomAccessFile.open(Native Method)
            at java.io.RandomAccessFile.<init>(RandomAccessFile.java:204)
            at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:376)
            at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:405)
            at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
            at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:53)
            at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:109)
            at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:94)
            at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:480)
            at org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:458)
            at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:310)
            at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:294)

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
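A hedged sketch of what that advice looks like in code, against the Lucene 1.4-era IndexWriter API linked above (the directory path and the minMergeDocs value are illustrative assumptions, not from the original mail):

```java
// Hedged sketch of the advice above (Lucene 1.4-era API).
// Path and buffer size are illustrative.
IndexWriter writer = new IndexWriter("C:\\data\\indexes\\customer",
                                     new StandardAnalyzer(), false);
writer.minMergeDocs = 1000;       // buffer more docs in RAM before a segment is written,
                                  // so fewer on-disk merges (and renames) happen
writer.setUseCompoundFile(true);  // pack each segment into a single .cfs file
// ... writer.addDocument(doc) calls ...
writer.close();
```

Buffering more documents in RAM and packing each segment into one .cfs file means fewer on-disk files and fewer merge-time renames, which shrinks the window in which the win32 rename problem can bite.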
literal search in quotes on non-tokenized field
Here is a problem I am experiencing with Lucene searches on non-tokenized fields. A search on a field named Build with the quoted query "origi" does not work, but the unquoted query origi yields 62 hits.

I ran indexing on the field with the following method:

    doc.add(Field.Keyword(data.getColumnName(j), fieldValue.toString().toLowerCase()));

so even though the original data has ORIGI in the Build field, lowercase is not the problem.

Here's a log of the parsed query before going to the searcher:

    Parsed query: (Build:origi)  for the first search
    Parsed query: (Build:origi)  for the second search

Right now we're not using a query parser / analyzer system to build the query; we're building the query up ourselves. The query mentioned above is a TermQuery object.

Thanks
RE: literal search in quotes on non-tokenized field
Erik,

-----Original Message-----
> > Here's a log of the parsed query before going to the searcher:
> >     Parsed query: (Build:origi)  for the first search
> >     Parsed query: (Build:origi)  for the second search
>
> What do you mean by "parsed", since below you say you're not using
> QueryParser/Analyzer?

Sorry, that's residual log text. The lines of code are:

    BooleanQuery totalQuery = new BooleanQuery();
    // ... logic to build totalQuery ...
    log.debug("Parsed query: " + totalQuery.toString());
    dbSearchHits = searcher.search(totalQuery);

> > Right now we're not using a query parser / analyzer system to build the
> > query. We're building the query up. The query mentioned above is a
> > TermQuery object.
>
> Let me hopefully clarify: you've said you've indexed (I'm not using quotes
> on purpose) origi, but you're doing a TermQuery on "origi" (with the
> quotes) and expecting it to match? It doesn't work that way. A TermQuery
> must match *exactly* what was indexed (either directly as a Keyword, or as
> tokens emitted from the analyzer). Since you're building the query up
> yourself from, I'm assuming, user input, you may need to pre-process what
> the user entered to get the right term to query on. Only the term origi
> would match.

Yeah, but it doesn't. The exact text in the database is ORIGI.

Keyword doesn't work if you supply more than one word. In fact we're doing it wrong. Fields with a small number of terms should not be indexed as Keyword, but tokenized. I'm going to change the indexing strategy to only use Keyword when there's one and only one keyword in the data itself. Fields with two to three words will be tokenized with the NoTokenizingTokenizer that was posted earlier, and fields with four or more words will be tokenized with MyTokenizer.

All we need to do for searching keyword fields is remove the double quotes, to be consistent with searching in a tokenized field. Then use QueryParser to parse the tokenized fields with the appropriate parser for the field. This should solve the problem.
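Following Erik's suggestion to pre-process the user input before building the TermQuery: a minimal, Lucene-free sketch (the helper name is hypothetical) that strips phrase quotes and lowercases, matching how the field was indexed with Field.Keyword(..., toLowerCase()):

```java
// Hypothetical helper: normalizes user input so it matches terms that were
// indexed via Field.Keyword(name, value.toLowerCase()), as described above.
class KeywordNormalizer {

    static String normalize(String userInput) {
        String s = userInput.trim();
        // Quotes are phrase syntax for tokenized fields; a Keyword field
        // stores the raw value, so strip them before building the TermQuery.
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            s = s.substring(1, s.length() - 1);
        }
        return s.toLowerCase();
    }

    public static void main(String[] args) {
        // The result would feed: new TermQuery(new Term("Build", term))
        System.out.println(normalize("\"ORIGI\""));  // prints: origi
    }
}
```

The normalized string is then exactly the indexed term, so the TermQuery matches whether or not the user typed quotes.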
Thanks
RE: modifying existing index
To update a document you need to insert the modified document, then delete the old one. Here is some code that I use to get you going in the right direction (it won't compile, but if you follow it closely you will see how I take an array of Lucene documents with new properties, add them, then delete the old ones):

    public void updateDocuments( Document[] documentsToUpdate ) {
        if ( documentsToUpdate.length > 0 ) {
            String updateDate = Dates.formatDate( new Date(), "MMddHHmm" );
            // wait on some other modification to finish
            HashSet failedToAdd = new HashSet();
            waitToModify();
            synchronized (directory) {
                IndexWriter indexWriter = null;
                try {
                    indexWriter = getWriter();
                    // mergeFactor = 2 seems to be needed to accommodate a
                    // Lucene (ver 1.4.2) bug; otherwise the index does not
                    // accurately reflect the change
                    indexWriter.mergeFactor = 2;
                    // load data from new document into old document
                    for ( int i = 0; i < documentsToUpdate.length; i++ ) {
                        try {
                            Document newDoc = modifyDocument( documentsToUpdate[i], updateDate );
                            if ( newDoc != null ) {
                                documentsToUpdate[i] = newDoc;
                                indexWriter.addDocument( newDoc );
                            } else {
                                failedToAdd.add( documentsToUpdate[i].get( "messageid" ) );
                            }
                        } catch ( IOException addDocException ) {
                            // if we fail to add, make a note and don't delete it
                            logger.error( "[" + getContext().getID() + "] error updating message: "
                                + documentsToUpdate[i].get( "messageid" ), addDocException );
                            failedToAdd.add( documentsToUpdate[i].get( "messageid" ) );
                        } catch ( java.lang.IllegalStateException ise ) {
                            // if we fail to add, make a note and don't delete it
                            logger.error( "[" + getContext().getID() + "] error updating message: "
                                + documentsToUpdate[i].get( "messageid" ), ise );
                            failedToAdd.add( documentsToUpdate[i].get( "messageid" ) );
                        }
                    }
                    // if we fail to close the writer, we don't want to continue
                    closeWriter();
                    searcherVersion = -1; // establish that the searcher needs to update
                    IndexReader reader = IndexReader.open( indexPath );
                    int testid = -1;
                    for ( int i = 0; i < documentsToUpdate.length; i++ ) {
                        Document newDoc = documentsToUpdate[i];
                        try {
                            logger.debug( "delete id: " + newDoc.get( "deleteid" )
                                + " messageid: " + newDoc.get( "messageid" ) );
                            reader.delete( Integer.parseInt( newDoc.get( "deleteid" ) ) );
                            testid = Integer.parseInt( newDoc.get( "deleteid" ) );
                        } catch ( NumberFormatException nfe )
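For comparison, the same update pattern in its minimal form, as a hedged sketch against the Lucene 1.4-era API: delete the old copy by its unique key, then add the replacement. "messageid" is the key field used in the code above; indexPath, analyzer, id, and newDoc are assumed to exist.

```java
// Hedged sketch (Lucene 1.4-era API); not the poster's exact code.
// Only one of reader/writer may be modifying the index at a time.
IndexReader reader = IndexReader.open(indexPath);
reader.delete(new Term("messageid", id));  // removes every document with that key
reader.close();

IndexWriter writer = new IndexWriter(indexPath, analyzer, false);
writer.addDocument(newDoc);                // add the updated version
writer.close();
```

The code above instead adds first and deletes afterwards, which avoids a window with no copy of the document visible, at the cost of tracking document numbers for the delete.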
RE: Too many open files issue
If you are on Linux, the number of file handles for a session is much lower than that for the whole machine; ulimit -n will tell you. There are instructions on the web for changing this setting; it involves editing /etc/security/limits.conf and setting the values for nofile (bulkadm is my user):

    bulkadm  soft  nofile  8192
    bulkadm  hard  nofile  65536

Also, if you use the compound file format you will have many fewer files.

-----Original Message-----
From: Neelam Bhatnagar [mailto:[EMAIL PROTECTED]]
Sent: Monday, November 22, 2004 10:02 AM
To: Otis Gospodnetic
Cc: [EMAIL PROTECTED]
Subject: Too many open files issue

Hi,

I had requested help on an issue we have been facing with the "Too many open files" exception garbling the search indexes and crashing the search on the web site. As a suggestion, you had asked us to look at the articles on the O'Reilly Network which had specific context around this exact problem. One of the suggestions was to increase the limit on the number of file descriptors on the file system.

We tried it by first lowering the limit to 200 from 256 in order to reproduce the exception. The exception did get reproduced, but even after increasing the limit to 500 it kept coming, until after several rounds of trying to rebuild the index we finally got it working with the default file descriptor limit of 256. This makes us wonder if your first suggestion of optimizing indexes is a prerequisite to trying this option. Another piece of relevant information is that we have the default merge factor of 10.

Kindly give us pointers to what it is that we are doing wrong, or should we be trying something completely different?

Thanks and regards
Neelam Bhatnagar
RE: Best Implementation of Next and Prev in Lucene
See the demo JSP pages.

-----Original Message-----
From: Ramon Aseniero [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 16, 2004 9:26 PM
To: [EMAIL PROTECTED]
Subject: Best Implementation of Next and Prev in Lucene

Hi All,

What's the best implementation of displaying the Next and Prev search results in Lucene?

Thanks,
Ramon
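For the archives: most next/prev implementations (the demo JSPs included) simply recompute a hit window per request rather than caching page state. A small, Lucene-free sketch of the paging arithmetic (class and method names are illustrative):

```java
// Hypothetical pager: given a 0-based page number and the total hit
// count, compute the half-open window [start, end) of hits to display.
class Pager {
    final int pageSize;

    Pager(int pageSize) { this.pageSize = pageSize; }

    // first hit index shown on the page
    int start(int page) { return page * pageSize; }

    // one past the last hit index, clamped to the total hit count
    int end(int page, int totalHits) {
        return Math.min(start(page) + pageSize, totalHits);
    }

    boolean hasNext(int page, int totalHits) { return end(page, totalHits) < totalHits; }
    boolean hasPrev(int page) { return page > 0; }
}
```

Each request would re-run the search and display hits.doc(i) for i in [start, end); re-running is usually cheap compared with keeping per-user Hits objects alive across requests.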
API request: isOpen on indexwriter and searcher
Could a developer consider adding an isOpen method to the writer and searcher? I have looked at doing it myself, but I'm not sure what I am doing.
RE: java.io.FileNotFoundException: ... (No such file or directory)
I have gotten this a few times. I am also using an NFS mount, but have seen it in cases where a mount wasn't involved. I cannot speak to why this is happening, but I have posted to this forum before a way of repairing your index by modifying the segments file; search for "wallen". The other thing I have done is use code to copy the documents that can still be read by a reader to a new index. I suppose I should submit those tools to open source! Anyway, this error will break the searcher, but the index can still be read with an IndexReader.

-Will

Here is the source of a method that should get you started (logger is a log4j object):

    public void transferDocuments() throws IOException {
        IndexReader reader = IndexReader.open(brokenDir);
        logger.debug("" + reader.numDocs());
        IndexWriter writer = new IndexWriter(newIndexDir, PopIndexer.popAnalyzer(), true);
        writer.minMergeDocs = 50;
        writer.mergeFactor = 200;
        writer.setUseCompoundFile(true);
        int docCount = reader.numDocs();
        Date start = new Date();
        //docCount = Math.min(docCount, 500);
        for (int x = 0; x < docCount; x++) {
            try {
                if (!reader.isDeleted(x)) {
                    Document doc = reader.document(x);
                    if (x % 1000 == 0) {
                        logger.debug(doc.get("subject"));
                    }
                    //remove the new fields if they exist, and add new value
                    //TODO test not having this in
                    /*
                    for (Enumeration newFields = doc.fields(); newFields.hasMoreElements(); ) {
                        Field newField = (Field) newFields.nextElement();
                        doc.removeFields(newField.name());
                        doc.add(newField);
                    }
                    */
                    doc.removeFields("counter");
                    doc.add(Field.Keyword("counter", "counter"));
                    // reinsert old document
                    writer.addDocument(doc);
                }
            } catch (IOException ioe) {
                logger.error("doc: " + x + " failed, " + ioe.getMessage());
            } catch (IndexOutOfBoundsException ioobe) {
                logger.error("INDEX OUT OF BOUNDS! " + ioobe.getMessage());
                ioobe.printStackTrace();
            }
        }
        reader.close();
        //logger.debug("done, about to optimize");
        //writer.optimize();
        writer.close();
        long time = ((new Date()).getTime() - start.getTime()) / 1000;
        logger.info("done optimizing: " + time + " seconds or " + (docCount / time) + " rec/sec");
    }

-----Original Message-----
From: Justin Swanhart [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 18, 2004 5:00 PM
To: Lucene Users List
Subject: java.io.FileNotFoundException: ... (No such file or directory)

I have two index processes. One is an index server, the other is a search server. The processes run on different machines.

The index server is a single-threaded process that reads from the database and adds unindexed rows to the index as needed. It sleeps for a couple of minutes between each batch to allow newly added/updated rows to accumulate.

The searcher process keeps an open cache of IndexSearcher objects and is multithreaded. It accepts connections on a TCP port, runs the query and stores the results in a database. After a set interval, the server checks to see if the index on disk is a newer version. If it is, it loads the index into a new IndexSearcher as a RAMDirectory.
Every once in a while, the index reader process gets a FileNotFoundException:

    20041118 1378 1383 (index number, old version, new version)
    [newer version found]
    Loading index directory into RAM: 20041118
    java.io.FileNotFoundException: /path/omitted/for/obvious/reasons/_4zj6.cfs (No such file or directory)
            at java.io.RandomAccessFile.open(Native Method)
            at java.io.RandomAccessFile.<init>(RandomAccessFile.java:204)
            at org.en.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:376)
            at org.en.lucene.store.FSInputStream.<init>(FSDirectory.java:405)
            at org.en.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
            at org.en.lucene.store.RAMDirectory.<init>(RAMDirectory.java:60)
            at org.en.lucene.store.RAMDirectory.<init>(RAMDirectory.java:89)
            at org.en.global.searchserver.UpdateSearchers.createIndexSearchers(Search.java:89)
            at org.en.global.searchserver.UpdateSearchers.run(Search.java:54)

The code being called at that point is:

    //add the directory to the HashMap of IndexSearchers (dir# = IndexSearcher)
    indexSearchers.put(subDirs[i], new IndexSearcher(new RAMDirectory(indexDir + "/" + subDirs[i])));

The indexes are located on an NFS mountpoint. Could
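Since the RAMDirectory copy races with the index server's merges, one generic mitigation is to retry the whole load when a segment file disappears mid-copy. A hedged, Lucene-free sketch of such a retry wrapper (all names are illustrative, and this only papers over the race rather than fixing it):

```java
import java.util.concurrent.Callable;

// Illustrative retry helper one might wrap around the RAMDirectory load:
// if the indexer replaces segment files mid-copy, retry the whole load a
// few times instead of failing the searcher refresh.
class Retry {
    static <T> T withRetries(Callable<T> task, int attempts, long sleepMs) throws Exception {
        Exception last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return task.call();
            } catch (Exception e) {   // e.g. FileNotFoundException during the copy
                last = e;
                Thread.sleep(sleepMs);
            }
        }
        throw last;                   // give up after the final attempt
    }
}
```

Usage would be something like Retry.withRetries(() -> new RAMDirectory(path), 3, 2000), so a refresh that hits the race simply tries again on the next consistent snapshot.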
RE: QueryParser: [stopword] AND something throws Exception
Holy cow! This does happen!

-----Original Message-----
From: Peter Pimley [mailto:[EMAIL PROTECTED]]
Sent: Friday, November 12, 2004 11:52 AM
To: Lucene Users List
Subject: QueryParser: [stopword] AND something throws Exception

[this is using lucene-1.4-final]

Hello. I have just encountered a way to get the QueryParser to throw an ArrayIndexOutOfBoundsException. It can be recreated with the demo org.apache.lucene.demo.SearchFiles program.

The way to trigger it is to parse a query of the form "a AND b", where 'a' is a stop word. For example, "the AND vector". It only happens when the -first- term is a stop word. You could search for "vector AND the" or "vector AND the AND class", and it works as you would expect (i.e. the stop words are ignored).

Unfortunately I am up against a deadline right now, so I can't fix this myself. I'm just going to filter out stop words before feeding them to the query parser. I'll try to have a look at it in roughly 2 weeks' time if nobody else has solved it.

Peter Pimley, Semantico

Here is the stack trace:

    java.lang.ArrayIndexOutOfBoundsException: -1
            at java.util.Vector.elementAt(Vector.java:434)
            at org.apache.lucene.queryParser.QueryParser.addClause(QueryParser.java:181)
            at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:529)
            at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:561)
            at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:500)
            at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:561)
            at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:500)
            at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:108)
            at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:87)
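Peter's workaround, filtering stop words out of the raw query string before it reaches QueryParser, can be sketched like this (Lucene-free; the stop list and names are illustrative, and a real version would also need to handle quoted phrases, field prefixes, and operators left dangling after a removal):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative pre-filter: drop stop words from the query string before
// parsing, so a leading stop word can never trigger the addClause bug.
class StopWordPrefilter {
    // toy stop list; a real one would mirror the analyzer's
    static final Set<String> STOP = new HashSet<>(Arrays.asList("the", "a", "an", "of"));

    static String strip(String query) {
        StringBuilder out = new StringBuilder();
        for (String tok : query.split("\\s+")) {
            if (STOP.contains(tok.toLowerCase())) continue;  // skip stop words
            if (out.length() > 0) out.append(' ');
            out.append(tok);
        }
        return out.toString();
    }
}
```

Note that uppercase AND/OR/NOT survive the filter because they are operators, not terms; removing a term next to an operator can still leave a dangling clause, which is why this is only a stopgap until the parser itself is fixed.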
RE: Bug in the BooleanQuery optimizer? ..TooManyClauses
Any wildcard search will automatically expand your query to the number of terms it finds in the index that match the wildcard. For example, wild* would become wild OR wilderness OR wildman, etc., for each of the matching terms that exist in your index. It is because of this that you quickly reach the 1024-clause limit. I automatically raise it to the maximum with the following line:

    BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );

-----Original Message-----
From: Sanyi [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 11, 2004 6:46 AM
To: [EMAIL PROTECTED]
Subject: Bug in the BooleanQuery optimizer? ..TooManyClauses

Hi!

First of all, I've read about BooleanQuery$TooManyClauses, so I know that it has a 1024-clause limit by default, which is good enough for me, but I still think it works strangely.

Example: I have an index with about 20 million documents. Let's say there are about 3000 variants in the entire document set matching this word mask: cab*. Let's say about 500 documents contain the word "spectrum". Now, when I search for cab* AND spectrum, I don't expect it to throw an exception. It should first restrict the search to the 500 documents containing the word "spectrum", then collect the variants of cab* within these documents, which turns out to be two or three variants of cab* (cable, cables, maybe some more), and the search should return, let's say, 10 documents.

Similar example: when I search for cab* AND nonexistingword, it still throws a TooManyClauses exception instead of saying "No results", since there is no "nonexistingword" in my document set, so it doesn't even have to start collecting the variations of cab*.

Is there a patch for this issue? Thank you for your time!

Sanyi
(I'm using: lucene 1.4.2)

p.s.: Sorry for re-sending this message; I first sent it as an accidental reply to a wrong thread.
RE: Academic Question About Indexing
I have a servlet that instantiates a MultiSearcher on 6 indexes (du -h):

    7.2G    ./0
    7.2G    ./1
    7.2G    ./2
    7.2G    ./3
    7.2G    ./4
    7.2G    ./5
    43G     .

I recreate the index from scratch each month based upon a 50-gig zip file with all of the 40 million documents. I wanted to keep the number of indexes as low as possible without hurting search performance too much, as each searcher allocates a certain amount of memory proportional to the number of terms it has. A single large index has a lot of overlap in terms, so it needs less memory than multiple indexes.

Anyway, for indexing, I am able to index ~100 documents per second; the total indexing process takes 2.5 days. I have a powerful machine with 2 hyperthreaded processors (Linux sees 4 processors) and 1GB RAM. I also have pretty fast SCSI disks. I perform no updates or deletes on my indexes. The indexing process equally divides the work amongst the indexers. The bottleneck of the indexing process is not memory or CPU, but rather the disk I/O of 6 writers. If I had faster disks, I could create more indexers.

-----Original Message-----
From: Sodel Vazquez-Reyes [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 11, 2004 11:37 AM
To: Lucene Users List
Cc: Will Allen
Subject: Re: Academic Question About Indexing

Will,

could you give more details about your architecture?
- each time, do you update or create new indexes?
- what data is stored at each index?
etc.

because it is quite interesting, and I would like to test it.

Sodel

Quoting Luke Shannon [EMAIL PROTECTED]:

40 million! Wow. Ok, this is the kind of answer I was looking for. The site I am working on indexes maybe 1000 at any given time. I think I am ok with a single index. Thanks.

----- Original Message -----
From: Will Allen [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 7:23 PM
Subject: RE: Academic Question About Indexing

I have an application that I run monthly that indexes 40 million documents into 6 indexes, then uses a MultiSearcher. The advantage for me is that I can have multiple writers each indexing 1/6 of the total data, reducing the time it takes to index by about 5x.

-----Original Message-----
From: Luke Shannon [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 10, 2004 2:39 PM
To: Lucene Users List
Subject: Re: Academic Question About Indexing

Don't worry, regardless of what I learn in this forum I am telling my company to get me a copy of that bad boy when it comes out (which as far as I am concerned can't be soon enough). I will pay for grama's myself.

I think I have reviewed the code you are referring to and have something similar working in my own indexer (using the uid). All is well.

My stupid question for the day is: why would you ever want multiple indexes running if you can build one smart indexer that does everything as efficiently as possible? Does the answer to this question move me to multi-threaded indexing territory?

Thanks,
Luke

----- Original Message -----
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 2:08 PM
Subject: Re: Academic Question About Indexing

Uh, I hate to market it, but it's in the book. But you don't have to wait for it, as there already is a Lucene demo that does what you described. I am not sure if the demo always recreates the index or whether it deletes and re-adds only the new and modified files, but if it's the former, you would only need to modify the demo a little bit to check the timestamps of the File objects and compare them to those stored in the index (if they are being stored - if not, you should add a field to hold that data).

Otis

--- Luke Shannon [EMAIL PROTECTED] wrote:

I am working on debugging an existing Lucene implementation. Before I started, I built a demo to understand Lucene. In my demo I indexed the entire content hierarchy all at once, then optimized this index and used it for queries. It was time consuming but very simple.

The code I am currently trying to fix indexes the content hierarchy by folder, creating a separate index for each one. Thus it ends up with a bunch of indexes. I still don't understand how this works (I am assuming they get merged somewhere that I haven't tracked down yet), but I have noticed it doesn't always index the right folder. This results in the users reporting inconsistent behavior in searching after they make a change to a document.

To keep things simple I would like to remove all the logic that figures out which folder to index and just do them all (usually less than 1000 files) so I end up with one index. Would indexing time be the only area I would be losing out in, or is there something more to the approach of creating multiple indexes and merging them? What is a good approach I can take to indexing a content hierarchy composed primarily of pdf, xsl, doc and xml
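For what it's worth, if the per-folder indexes described above do need combining, the merge step would look something like this (hedged sketch against the Lucene 1.4-era API; directory paths, analyzer, and variable names are illustrative):

```java
// Hedged sketch (Lucene 1.4-era API): merge per-folder indexes into one.
IndexWriter writer = new IndexWriter(mergedDir, analyzer, true);
writer.addIndexes(new Directory[] {
    FSDirectory.getDirectory("indexes/folderA", false),
    FSDirectory.getDirectory("indexes/folderB", false),
});  // merges the source indexes in and optimizes the result
writer.close();
```

If no such merge step exists in the code being debugged, searches are presumably hitting only one per-folder index at a time, which would explain the inconsistent results users see.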
RE: Academic Question About Indexing
I have an application that I run monthly that indexes 40 million documents into 6 indexes, then uses a MultiSearcher. The advantage for me is that I can have multiple writers each indexing 1/6 of the total data, reducing the time it takes to index by about 5x.

-----Original Message-----
From: Luke Shannon [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 10, 2004 2:39 PM
To: Lucene Users List
Subject: Re: Academic Question About Indexing

Don't worry, regardless of what I learn in this forum I am telling my company to get me a copy of that bad boy when it comes out (which as far as I am concerned can't be soon enough). I will pay for grama's myself.

I think I have reviewed the code you are referring to and have something similar working in my own indexer (using the uid). All is well.

My stupid question for the day is: why would you ever want multiple indexes running if you can build one smart indexer that does everything as efficiently as possible? Does the answer to this question move me to multi-threaded indexing territory?

Thanks,
Luke

----- Original Message -----
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 10, 2004 2:08 PM
Subject: Re: Academic Question About Indexing

Uh, I hate to market it, but it's in the book. But you don't have to wait for it, as there already is a Lucene demo that does what you described. I am not sure if the demo always recreates the index or whether it deletes and re-adds only the new and modified files, but if it's the former, you would only need to modify the demo a little bit to check the timestamps of the File objects and compare them to those stored in the index (if they are being stored - if not, you should add a field to hold that data).

Otis

--- Luke Shannon [EMAIL PROTECTED] wrote:

I am working on debugging an existing Lucene implementation. Before I started, I built a demo to understand Lucene. In my demo I indexed the entire content hierarchy all at once, then optimized this index and used it for queries. It was time consuming but very simple.

The code I am currently trying to fix indexes the content hierarchy by folder, creating a separate index for each one. Thus it ends up with a bunch of indexes. I still don't understand how this works (I am assuming they get merged somewhere that I haven't tracked down yet), but I have noticed it doesn't always index the right folder. This results in the users reporting inconsistent behavior in searching after they make a change to a document.

To keep things simple I would like to remove all the logic that figures out which folder to index and just do them all (usually less than 1000 files) so I end up with one index. Would indexing time be the only area I would be losing out in, or is there something more to the approach of creating multiple indexes and merging them? What is a good approach I can take to indexing a content hierarchy composed primarily of pdf, xsl, doc and xml, where any of these documents can be changed several times a day?

Thanks,
Luke
RE: Highlighting in Lucene
There is a highlighting tool in the sandbox (3/4 of the way down the page):

http://jakarta.apache.org/lucene/docs/lucene-sandbox/

-----Original Message-----
From: Ramon Aseniero [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 04, 2004 3:40 PM
To: 'Lucene Users List'
Subject: Highlighting in Lucene

Hi All,

I would like to know if Lucene supports highlighting of the searched text.

Thanks in advance.

Thanks,
Ramon Aseniero
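Basic use of the sandbox highlighter looks roughly like this (a hedged sketch from memory of the 1.4-era sandbox API; the field name, analyzer, userQuery and text variables are assumptions):

```java
// Hedged sketch of the sandbox Highlighter; names are illustrative.
Query query = QueryParser.parse(userQuery, "contents", analyzer);
Highlighter highlighter = new Highlighter(new QueryScorer(query));
TokenStream tokens = analyzer.tokenStream("contents", new StringReader(text));
String fragment = highlighter.getBestFragment(tokens, text);
```

The result is the best-scoring snippet of the stored text with the matching terms wrapped by the highlighter's formatter, ready to drop into a results page.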
Searching for a phrase that contains quote character
I am having this same problem, but cannot find any help! I have a keyword field that sometimes includes double quotes, but I am unable to search on that field because the escape for a quote doesn't work! I have tried a number of things along the lines of:

    myfield:"lucene is \"cool\""  AND  myfield:"lucene is \\cool\\"

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=7351

From: [EMAIL PROTECTED]
Subject: Searching for a phrase that contains quote character
Date: Wed, 24 Mar 2004 21:25:16 +

I'd like to search for a phrase that contains the quote character. I've tried escaping the quote character, but am receiving a ParseException from the QueryParser. For example, to search for the phrase:

    this is a "test"

I'm trying the following:

    QueryParser.parse("field:\"This is a \\\"test\\\"\"", "field", new StandardAnalyzer());

This results in:

    org.apache.lucene.queryParser.ParseException: Lexical error at line 1, column 31.  Encountered: <EOF> after :
            at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:111)
            at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:87)
    ...

What is the proper way to accomplish this?

--Dan
RE: Searching for a phrase that contains quote character
I am using a NullAnalyzer for this field.

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]]
Sent: Thursday, October 28, 2004 2:00 PM
To: Lucene Users List
Subject: Re: Searching for a phrase that contains quote character

On Oct 28, 2004, at 1:03 PM, Justin Swanhart wrote:

> Have you tried making a term query by hand and testing to see if it
> works?
>
>     Term t = new Term("field", "this is a \"test\"");
>     PhraseQuery pq = new PhraseQuery(t);

That's not accurate API, but had you used pq.add(t), it still would presume that text is all a single term. Chances are, though, that even getting the query to have the quotes is not going to work, as you've probably lost the quotes during indexing. Check out the AnalysisParalysis page on the wiki, analyze your Analyzer, and make sure you are indexing the text with the quotes (no built-in analyzer besides WhitespaceAnalyzer would do that for you).

        Erik

...

On Thu, 28 Oct 2004 12:02:48 -0400, Will Allen [EMAIL PROTECTED] wrote:

> I am having this same problem, but cannot find any help! I have a keyword
> field that sometimes includes double quotes, but I am unable to search on
> that field because the escape for a quote doesn't work! ...
RE: Searching for a phrase that contains quote character
The nullanalyzer overrides the isTokenChar method to simply return true in the tokenizer class (http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1703655). The situation is that it seems lucene does not expect you to escape characters that exist inside of a quoted string. So my search [ authorkeyword:MariaMy* ] works, but [ authorkeyword:MariaMy\* ] does not, even though the * character should be escaped (http://jakarta.apache.org/lucene/docs/queryparsersyntax.html#Terms) So, if this is true, then the rule might be, reserved characters must be escaped EXCEPT when they are within double quotes as a phrase. When double quotes are needed within a phrase, they should be escaped with a .. ? -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Thursday, October 28, 2004 3:05 PM To: Lucene Users List Subject: Re: Searching for a phrase that contains quote character On Oct 28, 2004, at 2:02 PM, Will Allen wrote: I am using a NullAnalyzer for this field. Which means that each field is added exactly as-is as a single term? Then trying the PhraseQuery directly is a good first step - if you can get that to work then you can move on to making QueryParser work with escaping. But don't complicate things with QueryParser at first. Start with the queries constructed directly first. Erik -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Thursday, October 28, 2004 2:00 PM To: Lucene Users List Subject: Re: Searching for a phrase that contains quote character On Oct 28, 2004, at 1:03 PM, Justin Swanhart wrote: Have you tried making a term query by hand and testing to see if it works? Term t = new Term(field, this is a \test\); PhraseQuery pq = new PhraseQuery(t); That's not accurate API, but add you used pq.add(t), it still would presume that text is all a single term. Chances are, though, that even getting the query to have the quotes is not going to work as you've probably lost the quotes during indexing. 
Check out the AnalysisParalysis page on the wiki, analyze your Analyzer, and make sure you are indexing the text with the quotes (no built-in analyzer besides WhitespaceAnalyzer would do that for you). Erik ...

On Thu, 28 Oct 2004 12:02:48 -0400, Will Allen [EMAIL PROTECTED] wrote: I am having this same problem, but cannot find any help! I have a keyword field that sometimes includes double quotes, but I am unable to search for that field because the escape for a quote doesn't work! I have tried a number of things: myfield:"lucene is \"cool\"" AND myfield:"lucene is \\cool\\" http://issues.apache.org/eyebrowse/ReadMsg?listName=lucene- [EMAIL PROTECTED]msgNo=7351

From: [EMAIL PROTECTED] [EMAIL PROTECTED] Subject: Searching for a phrase that contains quote character Date: Wed, 24 Mar 2004 21:25:16 + I'd like to search for a phrase that contains the quote character. I've tried escaping the quote character, but am receiving a ParseException from the QueryParser. For example, to search for the phrase: this is a "test" I'm trying the following: QueryParser.parse("field:\"This is a \\\"test\\\"\"", "field", new StandardAnalyzer()); This results in: org.apache.lucene.queryParser.ParseException: Lexical error at line 1, column 31. Encountered: <EOF> after : at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:111) at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:87) ... What is the proper way to accomplish this?
--Dan - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
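Since this thread shows QueryParser choking on escaped quotes, one workaround Erik raises is to construct queries programmatically and keep escaping out of the picture. As a rough sketch of the escaping rule being debated, here is a small, hypothetical helper that backslash-escapes the single-character reserved symbols listed on the query syntax page (the class name and exact character list are my assumptions, not Lucene API; later Lucene versions ship an official QueryParser.escape):

```java
public class QueryEscaper {
    // Single-character reserved symbols from the Lucene query syntax docs.
    // (The multi-character operators && and || are not handled here.)
    private static final String RESERVED = "\\+-!():^[]\"{}~*?";

    public static String escape(String s) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (RESERVED.indexOf(c) >= 0) {
                sb.append('\\'); // prefix each reserved character with a backslash
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```

For example, escape("MariaMy*") yields MariaMy\* - whether the parser then honors that escape inside a quoted phrase is exactly the open question in this thread.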
RE: Multi + Parallel
I am using 6 indexers / indexes to balance the speed of indexing against query performance for 40+ million documents. I came to this number through trial and error, and performance testing on the indexing side with a fast 4-processor machine. The trick is to max out the I/O throughput. -Will

-Original Message- From: Justin Swanhart [mailto:[EMAIL PROTECTED] Sent: Thursday, October 14, 2004 2:43 PM To: Lucene Users List Subject: Re: Multi + Parallel

The overhead of creating that many searcher objects is going to far outweigh any performance benefit you could possibly hope to gain by splitting your index up.

On Thu, 14 Oct 2004 04:42:27 -0700 (PDT), Otis Gospodnetic [EMAIL PROTECTED] wrote: Search a single merged index. Otis

--- Karthik N S [EMAIL PROTECTED] wrote: Hi Apologies.. Can somebody provide me approximate answers [ which is the better choice ]: a search of 10,000 subindexes using MultiSearcher, or a search on one single merged index [ merged from 10,000 subindexes ]? a) Subindexes = 10,000 (future) b) Fields to be searched upon = 4 c) Field types present in indexed format = 15 d) RAM = 1GB e) OS = Linux [ clustered environment ] f) Processor make = AMD [ probably high end ] g) WebServer = Tomcat 5.0.x 1) Which would be faster? 2) If not, what might be the probable solution? Karthik

-Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 13, 2004 3:53 PM To: Lucene Users List Subject: Re: Multi + Parallel

On Oct 13, 2004, at 3:14 AM, Karthik N S wrote: I was curious to know the difference between ParallelMultiSearcher and MultiSearcher. 1) Is the internal functionality of these the same or different? They are different internally. Externally they should return identical results and not appear different at all. Internally, ParallelMultiSearcher searches each index in a separate thread (searches wait until all threads finish before returning). In MultiSearcher, each index is searched serially.
You will not likely see a benefit to using ParallelMultiSearcher unless your environment is specialized to accommodate multi-threading (multiple CPUs, indexes on separate drives that can operate independently, etc.). 2) In terms of the time domain, do these differ when searching the same number of fields / words? 3) What are the features of each API? There is no external difference between the two implementations. Benchmark searches using both and see what is best, but generally MultiSearcher will be better in most environments, as it avoids the overhead of starting up and managing multiple threads. Erik
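Erik's point about thread overhead can be sketched without Lucene at all. The toy searchShard below stands in for a per-index search (every class and method name here is invented for illustration); the parallel variant launches one task per shard and joins them, which is roughly the pattern ParallelMultiSearcher uses with threads:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MultiSearchSketch {
    // Stand-in for searching one index: collect values matching the query.
    static List<Integer> searchShard(int[] shard, int query) {
        List<Integer> hits = new ArrayList<Integer>();
        for (int v : shard) {
            if (v == query) hits.add(v);
        }
        return hits;
    }

    // Serial search across shards (MultiSearcher-style).
    static List<Integer> searchSerial(int[][] shards, int query) {
        List<Integer> all = new ArrayList<Integer>();
        for (int[] s : shards) all.addAll(searchShard(s, query));
        return all;
    }

    // Parallel search: one task per shard, then wait for all to finish
    // (ParallelMultiSearcher-style); pool start-up and join are pure overhead.
    static List<Integer> searchParallel(int[][] shards, final int query)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shards.length);
        List<Future<List<Integer>>> futures = new ArrayList<Future<List<Integer>>>();
        for (final int[] s : shards) {
            futures.add(pool.submit(() -> searchShard(s, query)));
        }
        List<Integer> all = new ArrayList<Integer>();
        for (Future<List<Integer>> f : futures) {
            all.addAll(f.get()); // blocks until that shard's search completes
        }
        pool.shutdown();
        return all;
    }
}
```

Both variants return the same hits; only wall-clock time differs, and only when the shards are large and the hardware can genuinely run them concurrently.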
RE: -- TomCat/Lucene, filesystem
I think you might be referring to the XML files you keep in C:\Program Files\Apache\Tomcat\conf\Catalina\localhost. I have a file (myapp.xml) with the contents: <?xml version='1.0' encoding='utf-8'?> <Context docBase="C:/work/aggregation/myapp/web" path="/myapp" reloadable="true"></Context>

-Original Message- From: Rupinder Singh Mazara [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 31, 2004 12:36 PM To: Lucene Users List; [EMAIL PROTECTED] Subject: RE: -- TomCat/Lucene, filesystem

I have a web application using Lucene via Tomcat; you may need to set the correct permissions in your catalina.policy file. I use a blanket policy of grant { permission java.io.FilePermission "/", "read"; }; to allow access to Lucene.

-Original Message- From: J.Ph DEGLETAGNE [mailto:[EMAIL PROTECTED] Sent: 31 August 2004 17:12 To: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: -- TomCat/Lucene, filesystem

Hello Somebody, ..I beg your pardon... Under Windows XP / Tomcat, how do I customize a Lucene webapp to access filesystem directories that are outside Tomcat? Like this: D:\Program Files\Apache Software Foundation\Tomcat 5.0\.. needs to access E:\Data. Thanks a lot, JPhD
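If a blanket read grant on "/" is broader than you want, the policy entry can be scoped to the webapp and the index directory instead. A hedged sketch, reusing the E:\Data path from the question (the codeBase and webapp name are assumptions to adapt):

```
grant codeBase "file:${catalina.home}/webapps/myapp/-" {
    permission java.io.FilePermission "E:/Data/-", "read,write";
};
```

The "-" suffix in a FilePermission target covers the directory and everything below it recursively; use "*" instead for the directory's immediate contents only.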
RE: too many open files
I suspect it has to do with this change: --- jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java 2004/08/08 13:03:59 1.12 +++ jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java 2004/08/11 17:37:52 1.13 I wouldn't know where to start to reproduce the problem, as it was happening just once a day or so on an index that was being both queried and added to in real time, to the tune of 100,000 docs a day / 50 queries a day. The corruption was always the same thing: the segments file listed an entry for a file that was not there. -Will

-Original Message- From: Daniel Naber [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 07, 2004 1:54 PM To: Lucene Users List Subject: Re: Spam:too many open files

On Tuesday 07 September 2004 17:41, [EMAIL PROTECTED] wrote: A note to developers: the code checked into Lucene CVS ~Aug 15th, post 1.4.1, was causing frequent index corruptions. When I reverted back to version 1.4 I no longer got the corruptions. Here are some changes from around that day: http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentReader.java http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java Could you check which of those might have caused the problem? I guess there's not much the developers can do without the problem being reproducible. regards Daniel -- http://www.danielnaber.de
RE: Spam:too many open files
I will deploy and test through the end of the week and report back Friday if the problem persists. Thank you!

-Original Message- From: Dmitry Serebrennikov [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 07, 2004 8:40 PM To: Lucene Users List Subject: Re: Spam:too many open files

Hi Wallen, Actually, the files Daniel listed were modified on 8/11 and then again on 8/15. In the time between 8/11 and 8/15, I believe there could have been any number of problems, including corrupt indexes and poor multithreaded performance. However, I think after 8/15 the files should be in good working order. If you are not sure whether you saw problems with the pre-8/15 or post-8/15 version of the code, is it possible for you to try the latest CVS and see if the problem exists now? If it does, it will of course require urgent attention. Thanks very much! Dmitry.

Daniel Naber wrote: On Tuesday 07 September 2004 17:41, [EMAIL PROTECTED] wrote: A note to developers: the code checked into Lucene CVS ~Aug 15th, post 1.4.1, was causing frequent index corruptions. When I reverted back to version 1.4 I no longer got the corruptions. Here are some changes from around that day: http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentReader.java http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java Could you check which of those might have caused the problem? I guess there's not much the developers can do without the problem being reproducible. regards Daniel
QueryParser handling a NOT query on its own
The Javadoc spec calls for one or more clauses in a query, but I had trouble with a NOT query on its own. For example, QueryParser.parse("my_field:-exclude") throws a parsing exception. Same with QueryParser.parse("my_field:-(exclude)") and QueryParser.parse("my_field:(* AND -exclude)"). The query QueryParser.parse("my_field:(-(exclude))") gives a legitimate query that brings back no results. What I would expect is the following: if I have an index with 100 total entries, and 20 records with the word exclude in them, then the above queries should give 80 hits. There is no test case for this scenario in TestQueryParser. Please confirm whether this is a bug or not. Thank you, Allen Atamer
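The 100-minus-20 expectation above is plain set difference. As a pure-Java sketch of the semantics Allen expects (no Lucene classes involved; the names are invented for illustration):

```java
import java.util.HashSet;
import java.util.Set;

public class NotQuerySketch {
    // "all documents" minus "documents matching the prohibited term".
    static Set<Integer> allExcept(Set<Integer> allDocs, Set<Integer> matching) {
        Set<Integer> result = new HashSet<Integer>(allDocs);
        result.removeAll(matching);
        return result;
    }
}
```

With 100 docs of which 20 match, allExcept returns the expected 80. A common QueryParser-era workaround for purely negative queries was to AND the prohibited clause with a clause that matches every document, which is effectively what the sketch does.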
Query performance on a 315 Million document index (1TB)
Hi, I am considering a project that would index 315+ million documents. I am comfortable that the indexing will work well in creating an index ~800GB in size, but am concerned about the query performance. (Is this a bad assumption?) What are the bottlenecks of performance as an index scales? Memory? Cost is not a concern, so what would be the shortcomings of a theoretical machine with 16GB of RAM, 4-16 CPUs, and 1-2 terabytes of space? Would it be better to cluster machines to break apart the query? Thank you for your serious responses, Will Allen
termPosition does not iterate properly in Lucene 1.3 rc1
Lucene does not iterate through the termPositions on one of my indexed data sources. It used to iterate properly through this data source, but not anymore. I tried a different indexed data source and it iterates properly. The Lucene index directory does not have any lock files either. My code is as follows:

TermPositions termPos = reader.termPositions(aTerm);
while (termPos.next()) {
    // get doc
    String docID = reader.document(termPos.doc()).get(keyName);
    ...
}

Is there anything wrong with that? Thanks for your help, Allen
RE: implementing a TokenFilter for aliases
Erik, Below are the results of a debug run on the piece of text that I want aliased. The token spitline must be recognized as splitline, i.e. when I do a search for splitline, this record will come up.

1: [173], start:1, end:2
1: [missing], start:1, end:6
2: [hardware], start:9, end:7
3: [for], start:18, end:2
4: [bypass], start:22, end:5
5: [spitline], start:29, end:37

I also added extra debug info after each token's text: the startOffset and the endOffset. Lucene has the first token, 173, only stored; it is not indexed. The remaining terms are tokenized, indexed, and stored. Does this make a difference? Allen
RE: implementing a TokenFilter for aliases
173 is the ID field from a database (which we use as a primary key). For Lucene's purposes, that field is only stored, not indexed. The place where I put the print statements is before the actual filtering. The goal of the AliasFilter is to replace spitline. The debug line is in the Tokenizer, and the filters are run afterwards, so I am not sure what is happening inside Lucene. I can't put the util line into the analyzer after the AliasFilter is run because it would call recursively into tokenStream() and cause a stack overflow. I will try to work on seeing what is happening after the AliasFilter is run. Allen

-Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: December 5, 2003 12:23 PM To: Lucene Users List Subject: Re: implementing a TokenFilter for aliases

On Friday, December 5, 2003, at 11:59 AM, Allen Atamer wrote: Below are the results of a debug run on the piece of text that I want aliased. The token spitline must be recognized as splitline, i.e. when I do a search for splitline, this record will come up.

1: [173], start:1, end:2
1: [missing], start:1, end:6
2: [hardware], start:9, end:7
3: [for], start:18, end:2
4: [bypass], start:22, end:5
5: [spitline], start:29, end:37

I also added extra debug info after the token text, which are the startOffset and the endOffset. Lucene has the first token, 173, only stored; it is not indexed. The remaining terms are tokenized, indexed, and stored. Does this make a difference? I don't understand what you mean by 173 - is that output from a different string being analyzed? Well, it's obvious from this output that you cannot find spitline when splitline is used in a search. Your analyzer isn't working as you expect, I'm guessing. Erik
implementing a TokenFilter for aliases
The FAQ describes implementing a TokenFilter for applying aliases. I am having trouble accomplishing this. This is the code that I have so far for the next() method within AliasFilter. After reading some posts, I also got the idea to call setPositionIncrement(). Neither way works, because when I search for the alias, no search results come back. Thank you for your help, Allen Atamer

public Token next() throws java.io.IOException {
    Token token = tokenStream.next();
    if (aliasMap == null || token == null) {
        return token;
    }
    TermData t = (TermData) aliasMap.get(token.termText());
    if (t == null) {
        return token;
    }
    String tokenText = AliasManager.replaceIgnoreCase(
            token.termText(), t.getTerm(), t.getTeach());
    int increment = tokenText.length() - token.termText().length();
    if (increment > 0) {
        token.setPositionIncrement(increment);
    }
    return new Token(tokenText, token.startOffset(), token.endOffset());
}
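The alias-substitution idea in the code above can be sketched independently of Lucene's TokenStream API. This toy filter (all names invented; a plain Map stands in for the post's TermData/AliasManager machinery) simply emits the replacement text for any token found in the alias map:

```java
import java.util.Iterator;
import java.util.Map;

public class AliasFilterSketch {
    // Wraps a token iterator and swaps in aliases where the map has an entry.
    static Iterator<String> aliased(final Iterator<String> tokens,
                                    final Map<String, String> aliasMap) {
        return new Iterator<String>() {
            public boolean hasNext() { return tokens.hasNext(); }
            public String next() {
                String t = tokens.next();
                String alias = aliasMap.get(t);
                return alias != null ? alias : t; // emit replacement or original
            }
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }
}
```

Mapping spitline to splitline this way makes the substitution visible in isolation, before worrying about offsets or position increments.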