Re: Access Lucene from PHP or Perl
why not use something like XML-RPC? Bernhard Greetings. Can anyone point me to a how-to tutorial on how to access Lucene from a web page generated by PHP or Perl? I've been looking but couldn't find anything. Thanks a lot. And - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Retrieve all documents - possible?
you could use something like: int maxDoc = reader.maxDoc(); for (int i = 0; i < maxDoc; i++) { if (!reader.isDeleted(i)) { Document doc = reader.document(i); } } Bernhard Hi, is it possible to retrieve ALL documents from a Lucene index? This should then actually not be a search... Karl
Re: Disk space used by optimize
However, three times the space sounds a bit too much, or I made a mistake in the book. :) There already was a discussion about disk usage during index optimize. Please have a look at the developers list at: http://mail-archives.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1797569 where i made some measurements about the disk usage within lucene. At that time i proposed a patch which reduced the total used disk size from 3 times to a little more than 2 times the final index size. Together with Christoph we implemented some improvements to the optimization patch and finally committed the changes. Bernhard
Re: English and French documents together / analysis, indexing, searching
i think the easiest way is to use Lucene's StandardAnalyzer. If you want to use the Snowball stemmers, you have to add a language guesser to get the language for the particular document before creating the analyzer. regards Bernhard [EMAIL PROTECTED] schrieb: Greetings everyone. I wonder, is there a solution for analyzing both English and French documents using the same analyzer? The reason being that we have predominantly English documents but there are some French ones, yet it all has to go into the same index and be searchable from the same location during any particular search. Is there a way to analyze both types of documents with the same analyzer (and which one)? I've looked around and I see there's a Snowball analyzer, but you have to specify the language of analysis, and I do not know that ahead of time during indexing, nor do I know it most of the time during searching (users would like to search in both document types). There's also the issue of letter accents in French words and searching for the same (how are they indexed in the first place even)? Has anyone dealt with this before and how did you solve the problem? thanks -pedja
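The language guesser Bernhard mentions is not a Lucene class; a very small sketch of one (entirely hypothetical names, and assuming you only need to separate English from French) is to count stopword hits per language and pick the winner, then choose the matching Snowball analyzer from the guess:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Minimal stopword-based language guesser (hypothetical helper, not part of
// Lucene): whichever language's stopwords appear more often in the text wins.
public class LanguageGuesser {
    private static final Set<String> EN = new HashSet<String>(Arrays.asList(
            "the", "and", "of", "to", "is", "in", "that", "it"));
    private static final Set<String> FR = new HashSet<String>(Arrays.asList(
            "le", "la", "les", "et", "de", "un", "une", "est"));

    public static String guess(String text) {
        int en = 0, fr = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (EN.contains(token)) en++;
            if (FR.contains(token)) fr++;
        }
        return fr > en ? "fr" : "en"; // defaults to English on a tie
    }

    public static void main(String[] args) {
        System.out.println(guess("le chat est sur la table")); // fr
        System.out.println(guess("the cat is on the table"));  // en
    }
}
```

Real corpora would need larger stopword lists (or n-gram statistics), but for document-sized inputs even this crude count is usually decisive.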
Re: TermPositionVector
Siddharth, i tested your code and the return is true and not false as you wrote. I assume that there is something else which is wrong. Bernhard Siddharth Vijayakrishnan schrieb: Hi, I am adding a field to a document in the index as follows: doc.add(new Field("contents", reader, Field.TermVector.WITH_POSITIONS)) Later, I query the index and get the document id of this document. The following code, however, prints false. TermFreqVector tfv = reader.getTermFreqVector(docId, "contents"); System.out.println("Is a TermPositionVector " + (tfv instanceof TermPositionVector)); Using Field.TermVector.WITH_POSITIONS_OFFSETS while creating the field also produces the same result. Can someone tell me why this is happening? Thanks, Siddharth
Re: English and French documents together / analysis, indexing, searching
Right now I am using StandardAnalyzer but the results are not what I'd hoped for. Also, since my understanding is that we should use the same analyzer for searching that was used for indexing, even if I can manage to guess the language during indexing and apply it to the Snowball analyzer, I wouldn't be able to use Snowball for searching because users want to search through both English and French, and I suppose I would not get the same results if it were used with StandardAnalyzer? You could try to create a more complex query and expand it into both languages using different analyzers. Would this solve your problem? Another problem with StandardAnalyzer is that it breaks up some words that should not be broken (in our case document identifiers such as ABC-1234 etc.), but that's a secondary issue... This behaviour is implemented in the StandardTokenizer used by StandardAnalyzer. Look at the documentation of StandardTokenizer: Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer. Bernhard
Re: English and French documents together / analysis, indexing, searching
You could try to create a more complex query and expand it into both languages using different analyzers. Would this solve your problem? Would that mean I would have to actually conduct two searches (one in English and one in French), then merge the results and display them to the user? It sounds to me like a long way around, so then actually writing an analyzer that has the language guesser might be a better solution in the long run? It's no problem to guess the language based on the document corpus. But how do you want to guess the language of a simple term query? What if your users are searching for names like George Bush? You can't guess the language of such a query, and you have to expand it into both languages. I don't see an easier way of solving that problem. This behaviour is implemented in the StandardTokenizer used by StandardAnalyzer. Look at the documentation of StandardTokenizer: Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer. Hmm, I feel writing my own tokenizer is beyond my abilities at the moment, without more in-depth knowledge of everything else. Perhaps I'll try taking the StandardTokenizer and expanding or changing it based on other tokenizers available in Lucene such as WhitespaceTokenizer. What about using the WhitespaceAnalyzer directly? Maybe this fits your requirement better, and you could use it for both languages. Bernhard
Re: problem indexing large document collction on windows xp
Thilo, thanks for your effort. Could you please open a new entry in Bugzilla, mark it as [PATCH] and add the diff file with your changes. This ensures that the sources and the information will not get lost in the huge universe of mailing lists. As soon as there is time, one of the committers will review it and decide if it should be committed. Bernhard Hello, I encountered a problem when i tried to index large document collections (about 20 million documents). The indexing failed with the IOException: Cannot delete deletables. I tried several times (with the same document collection) and always received the error, but after a different number of documents. The exception is thrown after failing to delete the specified file at line 212 in FSDirectory.java. I found the following cure: after the lines if (nu.exists()) if (!nu.delete()) { i replaced throw new IOException("Cannot delete " + to); with while (nu.exists()) { nu.delete(); System.out.println("delete loop"); try { Thread.sleep(5000); } catch (InterruptedException e) { throw new RuntimeException(e); } } That is, i now retry deleting the file until it is successful. After the changes, i was able to index all documents. From the fact that i observed "delete loop" several times on the output console, it can be deduced that the body of the while loop was entered (and left) several times. I am running lucene on windows xp. Regards Thilo
Re: CFS file?
Steve Rajavuori schrieb: Can someone tell me the purpose of the .CFS files? The Index File Formats page does not mention this type of file. uuuh, you're right, it is not documented at fileformats.html. Since Lucene 1.4, the individual index files are by default stored within one single compound file which has the file extension .cfs. You can switch that behaviour off using IndexWriter's setUseCompoundFile(false). Bernhard
Re: Indexing with Lucene 1.4.3
That looks right to me, assuming you have done an optimize. All of your index segments are merged into the one .cfs file (which is large, right?). Try searching -- it should work. Chuck is right, the index looks fine and will be searchable. Since Lucene version 1.4, the index is by default stored using the compound file format. The index files you are missing are merged within one compound file which has the extension .cfs. You can disable the compound file option using IndexWriter's setUseCompoundFile(false). Bernhard -Original Message- From: Hetan Shah [mailto:[EMAIL PROTECTED] Sent: Thursday, December 16, 2004 11:00 AM To: Lucene Users List Subject: Indexing with Lucene 1.4.3 Hello, I have been trying to index around 6000 documents using IndexHTML from 1.4.3, and at the end of indexing my index directory only has 3 files: segments, deletable and _5en.cfs. Can someone tell me what is going on and where the actual index files are? How can I resolve this issue? Thanks. -H
Re: auto-generate uid?
Just to clarify: I have a field 'uid' whose value is a unique integer. I use it as a key to the document stored externally. I don't mean Lucene's internal document number. I was wondering if there is a method to query the highest value of a field, perhaps something like: IndexReader.maxTerm('uid'). What you could do is write your own IndexWriter class by extending the original one found in org.apache.lucene.index.IndexWriter. Then you have direct access to lucene's segment counter, which could provide you a unique id for each document in the index. Those ids would stay sticky even if you modify the index after the initial creation process. Is that the hint you need to get started? regards Bernhard What would the purpose of an auto-generated UID be? But no, Lucene does not generate UIDs for you. Documents are numbered internally by their insertion order. This number changes, however, when documents are deleted in the middle and the index is optimized. Erik On Nov 22, 2004, at 1:50 PM, aurora wrote: Is there a way to auto-generate a uid in Lucene? Even just a way to query the highest uid and let the application add one to it would do. Thanks.
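Since Lucene offers no built-in uid generator, one application-side sketch (all names hypothetical) is to collect the existing 'uid' values once at startup — e.g. by walking IndexReader.terms() for that field — seed a counter with the maximum, and hand out increments from there:

```java
// Hypothetical application-side uid generator (not a Lucene API): seed it
// once from the uids already in the index, then hand out the next value.
public class UidGenerator {
    private int next;

    public UidGenerator(int[] existingUids) {
        int max = -1; // empty index: the first uid handed out will be 0
        for (int i = 0; i < existingUids.length; i++) {
            if (existingUids[i] > max) {
                max = existingUids[i];
            }
        }
        next = max + 1;
    }

    // synchronized so concurrent indexing threads never share a uid
    public synchronized int nextUid() {
        return next++;
    }

    public static void main(String[] args) {
        UidGenerator gen = new UidGenerator(new int[] { 3, 7, 2 });
        System.out.println(gen.nextUid()); // 8
        System.out.println(gen.nextUid()); // 9
    }
}
```

Unlike Lucene's internal document numbers, these uids survive deletes and optimizes because they live in your own stored field.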
Re: Parsing .ppt
Hi, i tested the implementation. It seems to work with basic Powerpoint slides. The problem i have is that it doesn't extract special characters like German umlauts. Has anybody already addressed this problem? thanks Bernhard Magnus Johansson schrieb: There's some code using POI at http://www.mail-archive.com/poi-user@jakarta.apache.org/msg04809.html /magnus Luke Shannon wrote: Hey All; Anyone know a good API for parsing MS Powerpoint files? Luke
Re: about Stemming
Miguel Angel schrieb: Hi, I have used the demos of lucene and I want to know how it is possible to add stemming to my applications. Have a look at the lucene-sandbox. Under contributions there are stemmers for many different languages.
Re: Transaction in Lucene
The message "No tvx file" can appear if you have term vectors enabled during indexing and the documents you are adding have empty fields. As an example, if you try to index html documents where many of them don't have a valid html title, the message will show up. Looking at the term vector relevant code, this is nothing you have to worry about, it is just a status message. Otis is right, it is planned for future releases to avoid System.out.println() statements within lucene. regards Bernhard Otis Gospodnetic schrieb: I'm not sure about the tvx error, but I think I recall somebody changing some code around it a month or two ago. I also believe System.out.println is on the TODO list for elimination. Otis --- commandor [EMAIL PROTECTED] wrote: Hello, I came across the following problem with "No tvx file". How could I manage to get it? I would like to have transaction processes in Lucene. After reading the dev-lucene and user-lucene lists and analysing what people suggested, I made up my own. The problem in my case is that I had to make several changes and only then make a commit. That's why I did the following: 1. Turn off the Lucene lock (setting the corresponding system variable = false) 2. Start the loop (from the first document to the last one to change in the index) 2.1. Open IndexReader 2.2. Get a document by its id 2.3. Store it as a local variable 2.4. IndexReader.delete(document id) 2.5. IndexReader.close() 2.6. Merge new Terms (changes) and old ones in the document I retrieved 2.7. Open IndexWriter 2.8. Add the newly made document 3. end of loop 4. When other actions in my program end, I close the IndexWriter. The Result: Everything works fine but I got "No tvx file". I really worried about it because I read what the tvx file is for... Might anybody explain to me what I did wrong?
In spite of your answer, I'd like to point out the following about the way of logging messages: this message appeared with the help of System.out.println(). Investigating the code of Lucene I found a lot of places using System.out. I guess it is not a very good solution, especially in so beautiful a search/indexing API. I guess Lucene should have a normal log to write its messages. Thanks in advance...
Re: Does lucene makes any compression
The lucene version from CVS head now has an option to store and compress whole text files (binary fields within a lucene document) through GZip. The index itself is not GZip compressed. Due to the nature of how the index is created and stored, it is very effective regarding disk space without the need for additional compression. I have no idea if the new functionality has already been adopted within the c# port. regards Bernhard abdulrahman galal schrieb: i got the c# version of lucene, thanks god @ http://sourceforge.net/projects/nlucene what about the new version that includes the compression facility? you didn't reply to my question: does it compress the original text files and its indexes like the Great MG? thanks alot
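The GZip handling described above is Lucene-independent; a sketch of the round trip such a compressed stored field relies on, using only java.util.zip (class and method names here are illustrative, not the patch's actual API):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch of the GZip round trip behind a compressed stored field.
public class GzipFieldDemo {

    static byte[] compress(byte[] data) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            GZIPOutputStream gz = new GZIPOutputStream(bos);
            gz.write(data);
            gz.close(); // flushes and writes the GZip trailer
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // in-memory streams should not fail
        }
    }

    static byte[] decompress(byte[] data) {
        try {
            GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data));
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                bos.write(buf, 0, n);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Convenience round trip over UTF-8 text.
    static String roundTrip(String text) {
        try {
            return new String(decompress(compress(text.getBytes("UTF-8"))), "UTF-8");
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String text = "some stored document text, repeated text compresses well well well";
        System.out.println(roundTrip(text).equals(text)); // lossless round trip
    }
}
```

The compressed bytes would be what gets stored as the binary field value; the index terms themselves stay uncompressed and searchable.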
Re: LUCENE INDEX STATISTICS
please take a look at http://jakarta.apache.org/lucene/docs/benchmarks.html bernhard Karthik N S schrieb: Hi Guys, Apologies. Can somebody provide approximate statistics about the following factors for development and deployment of Lucene [ it may be useful for pro developers ]: a) Creating indexes: 1) X [ say 100 million ] number of documents of Y [ kilobytes ] with Z no of fields - hardware requirements [ RAM / OS / Processor / HardDisk Space / Other Specific Details ], software [ Jdk Version / Lucene Version / Appserver Version ] 2) X [ say 100 million ] number of merged indexes to create - hardware requirements [ RAM / OS / Processor / HardDisk Space / Other Specific Details ], software [ Jdk Version / Lucene Version / Appserver Version ] b) Searching on indexes [ 2 persons searching per sec ]: 1) X [ say 100 million ] number of documents of Y [ kilobytes ] with Z no of fields - hardware requirements [ RAM / OS / Processor / HardDisk Space / Other Specific Details ], software [ Jdk Version / Lucene Version / Appserver Version ] 2) X [ say 100 million ] number of merged indexes - hardware requirements [ RAM / OS / Processor / HardDisk Space / Other Specific Details ], software [ Jdk Version / Lucene Version / Appserver Version ] Thx in Advance Karthik WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK ]
Re: Searching against index in memory
Ravi schrieb: If I have a document set of 10,000 docs and my merge factor is 1000, then for every 1000 documents Lucene creates a new segment. By the time Lucene has indexed 4500 documents, the index will have 4000 documents on disk and the index for 500 documents is stored in memory. How can I search against this index at the same time from a different JVM? I can access the 4000 docs on disk, but what about those in memory on the indexing box? Is there a way to do this? Currently, i'm not sure there is a solution for this. The easiest way would be to reduce the merge factor so that not too many documents will be in memory, but this will also slow down your indexing process. bernhard Thanks, Ravi.
Re: Seraching in Keyword Field
Hi, try this query: MyKeywordField:"ABC" regards bernhard Rosen Marinov wrote: Hi all, I have a Keyword field in my Lucene docs and i am trying to execute some queries on this field. 1. MyKeywordField:([ABC TO ABC]) - this query is OK and returns the expected result 2. MyKeywordField:(ABC) - but this returns nothing. I am using SimpleAnalyzer - is the problem in the analyzer? If yes, which one do i have to use to make query 2 work? How can i make query 2 work? I know that Keyword fields are not analyzed, so maybe the problem is not in the analyzer. But for QueryParser i use SimpleAnalyzer again, maybe that is my mistake? However, how do i make query 2 work properly (as i expect)? I know that it will find only fields with the exact ABC value, is this the expected behaviour? Best Regards Rosen
Re: indexing size
Dmitry Serebrennikov wrote: Niraj Alok wrote: Hi PA, Thanks for the detail! Since we are using lucene to store the data also, I guess I would not be able to use it. By the way, I could be wrong, but I think the 35% figure you referenced in your first e-mail actually does not include any stored fields. The deal with 35% was, I think, to illustrate that the index data structures used for searching by Lucene are efficient. But Lucene does nothing special with stored content - no compression or anything like that. So you end up with the pure size of your data plus the 35% for the indexed data. There will be a patch available by the end of this week which allows you to store binary values compressed within a lucene index. It means that you will be able to store and retrieve whole documents within lucene in a very efficient way ;-) regards bernhard Cheers. Dmitry. Regards, Niraj - Original Message - From: petite_abeille [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, September 01, 2004 1:14 PM Subject: Re: indexing size Hi Niraj, On Sep 01, 2004, at 06:45, Niraj Alok wrote: If I make some of them Field.UnStored, I can see from the javadocs that it will be indexed and tokenized but not stored. If it is not stored, how can I use it while searching? The different types of fields don't impact how you do your search. This is always the same. Using UnStored fields simply means that you use Lucene as a pure index for search purposes only, not for storing any data. Specifically, the assumption is that your original data lives somewhere else, outside of Lucene. If this assumption is true, then you can index everything as UnStored with the addition of one Keyword per document. The Keyword field holds some sort of unique identifier which allows you to retrieve the original data if necessary (e.g. a primary key, an URI, what not).
Here is an example of this approach: (1) For indexing, check the indexValuesWithID() method http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/SZIndex.java?view=markup Note the addition of a Field.Keyword for each document and the use of Field.UnStored for everything else. (2) For fetching, check objectsWithSpecificationAndHitsInStore() http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/SZFinder.java?view=markup HTH. Cheers, PA.
Re: Full web search engine package using Lucene
Anne Y. Zhang wrote: Thanks, David. But it seems that this is downloadable. Could you please provide me the link for the download? Thank you very much! http://www.nutch.org/release/ Ya - Original Message - From: David Spencer [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, September 08, 2004 2:43 PM Subject: Re: Full web search engine package using Lucene Anne Y. Zhang wrote: Hi, I am assisting a professor with an IR course. We need to provide the students with a fully-functional search engine package, and the professor prefers it being powered by lucene. Since I am new to lucene, can anyone give me some information on where I can get the package? We also want the package to contain the crawling function. Thank you very much! http://www.nutch.org/ Ya
Re: complex searche (newbie)
hi, in general the query parser doesn't allow queries which start with a wildcard. Those queries could end up with very long response times and block your system; this is not what you want. I'm not sure if i understand what you want to do. I expect that you have a field named "type" within a lucene document. For this field you can have different values like "contact", "account" etc. Now you want to search all documents where type is contact. So the query to do this would be type:contact, nothing else is required. Can you try that and give some feedback? best regards Bernhard Wermus Fernando wrote: I am using MultiFieldQueryParser to look up some models. I have several models: account, contacts, tasks, etc. The user chooses models and a query string to look up. Besides the fields for searching, I add some conditions to the query string. If he puts in "john" and chooses contacts, I add the following to the query string: john AND type:contact. But if he wants to look up any contact, MultiFieldQueryParser throws an exception. In this case, the query string is the following: * AND type:contact. Am I choosing the wrong QueryParser, or is there another easy way to look up several fields and at the same time any content?
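On the application side, the advice above can be as small as dropping the free-text part when the user typed nothing, so the query string never starts with a wildcard. A sketch (helper name hypothetical):

```java
// Hypothetical query-string builder: always appends the type filter, and
// only includes the user's text when there is any, so no query ever starts
// with "*".
public class TypeQueryBuilder {
    public static String build(String userText, String type) {
        String filter = "type:" + type;
        if (userText == null || userText.trim().length() == 0) {
            return filter; // no free text: the filter alone, no wildcard
        }
        return "(" + userText.trim() + ") AND " + filter;
    }

    public static void main(String[] args) {
        System.out.println(build("john", "contact")); // (john) AND type:contact
        System.out.println(build("", "contact"));     // type:contact
    }
}
```

For the no-text case you could also skip the parser entirely and build a TermQuery on the type field programmatically.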
Re: weird lock behavior
hi, the IndexReader class provides some public static methods to check if an index is locked. If this is the case, there is also a method to unlock an existing index. You could do something like: Directory dir = FSDirectory.getDirectory(indexDir, false); if (IndexReader.isLocked(dir)) { IndexReader.unlock(dir); } dir.close(); You should also catch the possible IOException in case of an error or if the index can't be unlocked. have fun with it Bernhard [EMAIL PROTECTED] wrote: Hi, I experienced the following situation: Suddenly my query became too slow (c. 10sec instead of c. 1sec) and the number of returned hits changed from c. 2000 to c. 1800. Tracing the case I found the locking file abc...-commit.lck. After deletion of this file everything turned back to normal behavior, i.e. I got my 2000 hits in 1sec. There were no concurrent writing or reading processes running in parallel. Probably the lock file was left over because of an abnormal termination (during development it's ok, but it may happen in production as well). My question is how to handle such a situation, detect it and repair it in case it happens (in real life there are many concurrent processes and I have no idea which lock file to kill).
Re: Advanced timestamp usage (or global value storage)
Avi, i would prefer the second approach. If you already store the date/time when the doc was indexed, you could use the following trick to get the last document added to the index: IndexReader ir = IndexReader.open("/tmp/testindex"); int maxDoc = ir.maxDoc(); while (--maxDoc >= 0) { if (!ir.isDeleted(maxDoc)) { Document doc = ir.document(maxDoc); System.out.println(doc.getField("indexDate")); break; } } What do you think about this implementation? No extra properties, nothing to worry about; all the information is within your index. regards Bernhard Avi Drissman wrote: I've used Lucene for a long time, but only in the most basic way. I have a custom analyzer and a slightly hacked query parser, but in general it's the basic add document/remove document/query documents cycle. In my system, I'm indexing a store of external documents, maintaining an index for full-text querying. However, I might be turned off when documents are added, and then when I'm restarted, I'm going to need to determine the timestamp of the last document added to the index so that I can pick up where I left off. There are three approaches to doing this, two using Lucene. I don't know how I would do the two Lucene approaches, or even if they're possible. 1. Just keep a file in parallel with the index, reading and writing the timestamp of the last indexed document in it. I know how to do this, but I don't like the idea of keeping a separate file. 2. Drop a timestamp onto each document as it's indexed. I've attached timestamp fields to documents in the past so that I could do range queries on them. However, I don't know how to do a query like "the document with the latest timestamp" or even if that's possible. 3. Create a dummy document (with some unique field identifier so you could quickly query for it) with a field "last timestamp". This is a global value storage approach, as you could just store any field with any value on it.
But I'd be updating this timestamp field a lot, which means that every time I updated the index I'd have to remove this special document and reindex it. Is there any way to update the value of a field in a document directly in the index, without removing it and adding it again to the index? The field I'd want to update would just be stored, not indexed or tokenized. Thanks for your help in guiding my exploration into the capabilities of Lucene. Avi
Re: integration of lucene with pdfbox
Santosh, please have a look at the lucene demo package. There are several samples (IndexFiles.java) showing how to add a document to a writer. regards Bernhard Santosh wrote: I don't know how to add a lucene document to the index; i know how to add a given directory. Can anybody please tell me how to add a lucene document to the index? - Original Message - From: Ben Litchfield [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Monday, August 23, 2004 8:13 PM Subject: Re: integration of lucene with pdfbox If you can use lucene on its own then you already know how to add a lucene Document to the index. So you need to be able to take a PDF and get a lucene Document. org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument() does that for you. Ben On Mon, 23 Aug 2004, Santosh wrote: I have downloaded pdfbox and lucene and kept the jar files in the class path. I am able to work with both of them independently, but how can I integrate both? regards Santosh kumar
Re: term frequency data of terms of all documents
Serkan, it's easier to use the IndexReader class to get the information you need. If you just need the doc frequency of each term you could use this sample: IndexReader ir = null; try { if (!IndexReader.indexExists("/tmp/index")) return; ir = IndexReader.open("/tmp/index"); TermEnum termEnum = ir.terms(); while (termEnum.next()) { Term t = termEnum.term(); System.out.println(t.text() + " -- " + ir.docFreq(t)); } } catch (IOException e) { System.out.println(e.toString()); } finally { if (ir != null) { try { ir.close(); } catch (IOException e) { System.err.println("IOException, opened IndexReader can't be closed: " + e.toString()); } } } hope this helps, Bernhard Serkan Oktar wrote: I want to build a list of the terms of all documents and their frequency data. It seems the information I need is in the .tis and .tii files. However I haven't found a way to handle them till now. How can I get the term frequency data? Thanks, Serkan
Re: speeding up queries (MySQL faster)
Yonik, there is another synchronized block in CSInputStream which could lock your second cpu out. Do you think there is a chance to recreate the index (maybe a smaller subset) without the compound file option enabled and run your test again, so that we can see if this helps? regards Bernhard Otis Gospodnetic wrote: Ah, you may be right (no stack trace in the email any more). Somebody recently identified a few bottlenecks that, if I recall correctly, were related to synchronized blocks. I believe Doug committed some improvements, but I can't remember which version of Lucene that is in. It's definitely in 1.4.1. Otis --- Yonik Seeley [EMAIL PROTECTED] wrote: --- Otis Gospodnetic [EMAIL PROTECTED] wrote: The bottleneck seems to be disk IO. But it's not. Linux is caching the whole file, and there really isn't any disk activity at all. Most of the threads are blocked on InputStream.refill, not waiting for the disk, but waiting for their turn in the synchronized block to read from the disk (which is why I asked about caching above that level). CPU is a constant 50% on a dual CPU system (meaning 100% of 1 cpu). -Yonik
Re: Index Size
Rob, as Doug and Paul already mentioned, the index size is definitely too big :-(. What could cause the problem, especially when running on a windows platform, is that an IndexReader is open during the whole indexing process. During indexing, the writer creates temporary segment files which will be merged into bigger segments. When done, the old segment files will be deleted. If there is an open IndexReader, the environment is unable to unlock the files and they stay in the index directory. You will end up with an index several times bigger than the dataset. Can you check your code for any open IndexReaders during indexing, or paste the relevant part to the list so we can have a look at it. hope this helps Bernhard Rob Jose wrote: Hello, I have indexed several thousand (52 to be exact) text files and I keep running out of disk space to store the indexes. The size of the documents I have indexed is around 2.5 GB. The size of the Lucene indexes is around 287 GB. Does this seem correct? I am not storing the contents of the files, just indexing and tokenizing. I am using Lucene 1.3 final. Can you guys let me know what you are experiencing? I don't want to go into production with something that I should be configuring better. I am not sure if this helps, but I have a temp index and a real index. I index the file into the temp index, and then merge the temp index into the real index using the addIndexes method on the IndexWriter. I have also set the production writer's setUseCompoundFile to true. I did not set this on the temp index. The last thing that I do before closing the production writer is to call the optimize method. I would really appreciate any ideas to get the index size smaller if it is at all possible. Thanks Rob