Re: Search PDF ???
--- Eric Chow [EMAIL PROTECTED] wrote:

Hello,
1. Is it possible to use Lucene to search PDF contents?

Yes. You need to use an external tool to extract the text from the PDF file and then pass it to Lucene for indexing. If you do a search of this list you will find a lot of mails related to that.

2. Can it search Chinese-content PDF files?

I have used a tool called xpdf (on Linux) and it works with both traditional and simplified Chinese. It provides language support packages for many languages; please take a look at the URL below.

http://www.foolabs.com/xpdf/download.html

Note that the tool only helps in extracting the text. Whether you can search Chinese text or not depends on the analyzer you use in Lucene. Try CJKAnalyzer for CJK text search.

Thanks, George

___ALL-NEW Yahoo! Messenger - all new features - even more fun! http://uk.messenger.yahoo.com
- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
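The extract-then-index step George describes can be sketched as below: shell out to an external extractor, capture its standard output, and hand the resulting text to the indexer. The `pdftotext doc.pdf -` invocation shown in the comment is how xpdf is commonly used, but treat the exact arguments as an assumption; the demo run uses `echo` so the sketch stays self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Sketch: run an external text extractor and capture its output.
public class ExtractText {

    // Run a command and return its combined stdout/stderr as a String.
    static String run(String... cmd) {
        try {
            Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
            StringBuilder out = new StringBuilder();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    out.append(line).append('\n');
                }
            }
            p.waitFor();
            return out.toString();
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // With xpdf installed, extraction would look like:
        //   String text = run("pdftotext", "doc.pdf", "-");
        // and that text would then be passed to Lucene for indexing.
        // Here we stand in with echo so the example runs anywhere:
        System.out.print(run("echo", "extracted text"));
    }
}
```

The same pattern works for wvWare and other command-line extractors; only the command array changes.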
Re: Corrupted indexes
Andy, please take a look at the following thread; it should help you.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg08976.html

Thanks, George

--- Andy Goodell [EMAIL PROTECTED] wrote:

Recently, I've been getting a lot of corrupted lucene indexes. They appear to return search results normally, but there is really no good way to test whether information is missing. The main problem is that when I try to optimize, I get the following exception:

java.io.IOException: read past EOF
    at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:218)
    at org.apache.lucene.store.InputStream.readBytes(InputStream.java:61)
    at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:356)
    at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:323)
    at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:422)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:94)
    at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:487)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)

This is preventing me from optimizing the indexes, and it also scares me that information might be missing. Does anybody know what's going on here, and what might be wrong?

Thanks for your time, - andy g
Re: Lucene docs
Try these:

http://jakarta.apache.org/lucene/docs/gettingstarted.html
http://www.darksleep.com/lucene/

Thanks, George

--- Ian McDonnell [EMAIL PROTECTED] wrote:

What is the best resource for beginners looking to understand Lucene's functionality, i.e. its use of fields, documents, the index reader and writer, etc.? Is there any web resource that goes into detail on the exact workings of it?

Ian

_ Sign up for FREE email from SpinnersCity Online Dance Magazine Vortal at http://www.spinnerscity.com
RE: Search PhraseQuery
--- Natarajan.T [EMAIL PROTECTED] wrote:

I am trying to extend the current behavior.

You might have already seen a mail from Cocula Remi on this. Please provide more details of the problem for specific comments: basically, the problem you are facing and/or what behavior you are trying to extend. This was not clear from your email. An example will make things clearer.

Thanks Regards, George
PorterStemfilter
Hi, this might be more of a question related to the PorterStemmer algorithm than to Lucene, but if anyone has the knowledge please share.

I am using the PorterStemFilter that comes with Lucene, and it turns out that searching for the word 'printer' does not return a document containing the text 'print'. To narrow down the problem, I have tested the PorterStemFilter in a standalone program, and it turns out that the stem of 'printer' is 'printer' and not 'print'. That is, 'printer' is not treated as 'print' + 'er'; the whole word is the stem. Can somebody explain this behavior?

Thanks Regards, George
RE: Help for text based indexing
You could receive the group name as an input from the user and construct a BooleanQuery internally which will query only the group field based on the user input. That way the user need not append the group name to the search string.

Thanks, George

--- mahaveer jain [EMAIL PROTECTED] wrote:

If I have rightly understood, you mean to say that the query for the search has to be Group1 AND Hello (if hello is what I want to search for)?

Cocula Remi [EMAIL PROTECTED] wrote:

A Keyword field is not tokenized; that's why you won't be able to search over a part of it. You'd rather use a Text field. About creating a special field:

IndexWriter Ir = ...
File f = ...
Document doc = new Document();
if (f.toString().startsWith("C:\\tomcat\\webapps\\Root\\Group1")) {
    doc.add(Field.Text("group", "Group1"));
}
if (f.toString().startsWith("C:\\tomcat\\webapps\\Root\\Group2")) {
    doc.add(Field.Text("group", "Group2"));
}
doc.add(Field.Text("content", getContent(f)));
Ir.addDocument(doc);

Then you can search in group1 with a query like this: group:Group1 AND rest_of_the_query.

-Original Message-
From: mahaveer jain [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 18:03
To: Lucene Users List
Subject: RE: Help for text based indexing

Well, in my case the path is a Keyword field. I had tried that earlier and it does not seem to work in a single index file. Can you explain a bit more about adding group1 and group2?

Cocula Remi wrote:

Well, you could add a field to each of your Documents whose value would be either group1 or group2. Or you could use the path to your files...

-Original Message-
From: mahaveer jain [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 17:49
To: [EMAIL PROTECTED]
Subject: RE: Help for text based indexing

I am clear with looping recursively to index all the files under the Root folder. But the problem is if I want to search only in group1 or group2. Is it possible to search only in one of the group folders?
Cocula Remi wrote:

You just have to loop recursively over the C:\tomcat\webapps\Root tree to create your index.

Yes, you can index databases; you will just have to write a mechanism that is able to create org.apache.lucene.document.Document objects from the database. For instance:
- connect via JDBC
- run a query to obtain a ResultSet
- loop over each row of that ResultSet: create a new org.apache.lucene.document.Document from the ResultSet data and add this document to the index.

For incremental indexing, I suppose you have to store some timestamp field in your index, but it's up to you. Note that Lucene is very fast and I don't think incremental indexing is required for a small or medium amount of data.

-Original Message-
From: mahaveer jain [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 17:22
To: [EMAIL PROTECTED]
Subject: Help for text based indexing

Hi, I have implemented text-based search using lucene. It was wonderful playing around with it. Now I want to enhance the application.

I have a Root folder, and under that many other folders that are group specific, say (group1, group2, and so on). The Root folder is C:\tomcat\webapps\Root, with the group folders within it. Right now I am indexing these groups separately, i.e. I have indexes at C:/index/group1, C:/index/group2, C:/index/group3 and so on. I want to know if I can have only one index for all of these, say C:/index/Root (holding the index for all the folders), and still be able to search using C:\tomcat\webapps\Root\group1 (if I want to search group1), and similarly for the other groups. Let me know if this is possible and whether anybody has tried it.

2nd question: Is lucene good for indexing databases? How do we support incremental indexing? (Right now I am using LIKE for searching.)

Thanks in advance, Mahaveer
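The group-from-path idea in this thread can also be sketched without one if-branch per group: take the first path element below the root as the value of the group field. The helper name and demo paths here are made up, and forward slashes are used so the example runs anywhere; on Windows the same call works with the C:\tomcat\webapps\Root paths from the thread.

```java
import java.nio.file.Paths;

// Sketch: derive the "group" field value from the file's location
// under the root folder, instead of hard-coding each group.
public class GroupFromPath {

    // The first path element below the root is taken as the group name.
    static String groupOf(String root, String file) {
        return Paths.get(root).relativize(Paths.get(file)).getName(0).toString();
    }

    public static void main(String[] args) {
        String group = groupOf("/tomcat/webapps/Root",
                               "/tomcat/webapps/Root/group1/docs/a.txt");
        System.out.println(group); // prints group1
        // The value would then go into the document before indexing, e.g.
        //   doc.add(Field.Text("group", group));
        // and a search restricted to one group becomes:
        //   group:group1 AND rest_of_the_query
    }
}
```

This scales to any number of group folders with no code changes, which is the point of keeping a single index with a group field.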
Re: PorterStemfilter
--- Tea Yu [EMAIL PROTECTED] wrote:

David, for me, I don't want a search for "in print" to give results from "in printer"; I'd consider that over-stemmed otherwise.

Here the "in" won't be considered, as it is a stopword in most of the analyzers; I know it is in StandardAnalyzer. So searching for 'in print' will not return the document containing 'in printer', because stem('printer') is 'printer' and not 'print'; 'printer' is what gets stored in the index. Enclosing the phrase in double quotes does not prevent stemming.

I'm also not that satisfied when "effective" is stemmed to "effect" by Snowball.

Recently I tested this with PorterStemFilter, and there too "effective" is stemmed to "effect". There are more serious problems: "printable" is stemmed to "printabl".

Thanks, George
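The behaviour discussed in this thread follows from how Porter's rules are guarded: suffix rules such as "(m > 1) ER -> (nothing)" only fire when the measure m of the remaining stem (the number of vowel-consonant spans) exceeds 1. For "printer" the stem "print" has m = 1, so "-er" stays; for "effective" the stem "effect" has m = 2, so "-ive" is removed. Below is a minimal sketch of the measure calculation itself; it is my own simplification of the definition in Porter's algorithm, not Lucene's code.

```java
// Sketch: Porter's measure m of a word, i.e. the number of times a
// vowel run is followed by a consonant run. Suffix-stripping rules in
// the Porter algorithm are conditioned on m of the would-be stem.
public class PorterMeasure {

    static boolean isVowel(String w, int i) {
        char c = w.charAt(i);
        if ("aeiou".indexOf(c) >= 0) return true;
        // 'y' counts as a vowel when it follows a consonant.
        return c == 'y' && i > 0 && !isVowel(w, i - 1);
    }

    // Count transitions from a vowel to a consonant.
    static int measure(String word) {
        int m = 0;
        boolean prevVowel = false;
        for (int i = 0; i < word.length(); i++) {
            boolean v = isVowel(word, i);
            if (prevVowel && !v) m++;
            prevVowel = v;
        }
        return m;
    }

    public static void main(String[] args) {
        System.out.println("m(print)  = " + measure("print"));  // 1: -er kept on "printer"
        System.out.println("m(effect) = " + measure("effect")); // 2: -ive removed from "effective"
    }
}
```

So "printer" staying whole is the algorithm working as specified, even if it is surprising for retrieval.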
Re: Existing Parsers
Hi Chris, I do not have stats, but I think the performance is reasonable. I use xpdf for PDF and wvWare for DOC. The size of my index is ~2GB (this is not limited to only pdf/doc). To avoid memory problems, I have set an upper bound on the size of the documents that can be indexed; for example, in my case I do not index documents if the size is more than 4MB. You could try something like that.

Thanks Regards, George

--- Chris Fraschetti [EMAIL PROTECTED] wrote:

Some of the tools listed use cmd-line execs to output a doc of some sort to text, and then I grab the text and add it to a lucene doc, etc. Any stats on the scalability of that? In large-scale applications, I'm assuming this will cause some serious issues... anyone have any input on this?

-Chris Fraschetti

On Thu, 09 Sep 2004 09:54:43 -0700, David Spencer [EMAIL PROTECTED] wrote:

Honey George wrote:

Hi, I know some of them.
1. PDF
+ http://www.pdfbox.org/
+ http://www.foolabs.com/xpdf/download.html - I am using this and found it good. It even supports various languages.
2. word
+ http://sourceforge.net/projects/wvware
3. excel
+ http://www.jguru.com/faq/view.jsp?EID=1074230
-George

My dated experience from 2 years ago was that (the evil, native code) foolabs pdf parser was the best, but obviously things could have changed.
http://www.mail-archive.com/[EMAIL PROTECTED]/msg02912.html

--- [EMAIL PROTECTED] wrote:

Anyone know of any reliable parsers out there for pdf, word, excel or powerpoint?

For powerpoint it's not easy. I've been using this and it has worked fine until recently; it seems to sometimes go into an infinite loop now on some recent PPTs. Native code and a package that seems to be dormant, but to some extent it does the job. The file ppthtml does the work.
http://chicago.sourceforge.net/xlhtml
Case sensitiveness and wildcard searches
Hi, I noticed a behavior with wildcard searches and would like to clarify it. From the FAQ at http://www.jguru.com/faq/view.jsp?EID=538312 on jGuru, the Analyzer is not used for wildcard queries. In my case I have a document which contains the word IMPORTANT. I use PorterStemFilter + StandardAnalyzer for indexing and searching. I am getting the document if I search for the word IM*. But if the analyzer is not used, then who does the conversion of the word to lowercase? My code looks like this:

---
QueryParser qp = new QueryParser("title", new MyAnalyzer());
Query q = qp.parse(text);
---

Though I pass the text in uppercase (IM*), when I print the Query object I can see it in lowercase, something like (title:im*). I am using lucene-1.3-final. Can someone explain this?

Thanks regards, George
Re: Case sensitiveness and wildcard searches
Thanks for the links, René. The mail is not exactly talking about my case, because the StandardAnalyzer which I use does lowercase the input. So it is the same scenario as the FAQ entry.

-George

--- René Hackl [EMAIL PROTECTED] wrote:

Hi George, I'm not sure about v1.3, but you may want to take a look at
http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=9342
or
http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1806371

cheers, René

-- NEW: Up to 10 GB storage for e-mails and files! 1 GB already with GMX FreeMail http://www.gmx.net/de/go/mail
Re: Existing Parsers
Hi, I know some of them.

1. PDF
+ http://www.pdfbox.org/
+ http://www.foolabs.com/xpdf/download.html - I am using this and found it good. It even supports various languages.
2. word
+ http://sourceforge.net/projects/wvware
3. excel
+ http://www.jguru.com/faq/view.jsp?EID=1074230

-George

--- [EMAIL PROTECTED] wrote:

Anyone know of any reliable parsers out there for pdf, word, excel or powerpoint?
Re: too many open files
Patrick, for your second problem, are you seeing a behavior similar to the one discussed in the following thread?

http://www.mail-archive.com/[EMAIL PROTECTED]/msg08952.html

If yes, you can see the solution there.

Thanks, George

--- Patrick Kates [EMAIL PROTECTED] wrote:

I am having two problems with my client's lucene indexes.

One, we are getting a FileNotFound exception (too many open files). This would seem to indicate that I need to increase the number of open files on our Suse 9.0 Pro box. I have our sys admin working on this problem for me.

Two, because of this error and the subsequent restarting of the box, we seem to have lost an index segment or two. My client's tape backups do not contain the segments we know about. I am concerned about the missing index segments, as they seem to be preventing any further updates of the index.

Does anyone have any suggestions as to how to fix this besides a full re-index of the problem indexes? I was wondering if maybe a merge of the index might solve the problem? I could move our nightly merge of the index files to sooner, but I am afraid that the merge might make matters worse. Any ideas or helpful speculation would be greatly appreciated.

Patrick
Re: Index Size
Hi, please check for hidden files in the index folder. If you are using Linux, do something like "ls -al" in the index folder. I am also facing a similar problem where the index size is greater than the data size. In my case there were some hidden temporary files which lucene creates; they were taking half of the total size. My problem is that after deleting the temporary files, the index size is the same as the data size, which again seems to be a problem. I am yet to find out the reason.

Thanks, george

--- Rob Jose [EMAIL PROTECTED] wrote:

Hello, I have indexed several thousand (52 to be exact) text files and I keep running out of disk space to store the indexes. The size of the documents I have indexed is around 2.5 GB. The size of the Lucene indexes is around 287 GB. Does this seem correct? I am not storing the contents of the files, just indexing and tokenizing. I am using Lucene 1.3 final. Can you guys let me know what you are experiencing? I don't want to go into production with something that I should be configuring better.

I am not sure if this helps, but I have a temp index and a real index. I index the file into the temp index, and then merge the temp index into the real index using the addIndexes method on the IndexWriter. I have also set the production writer's setUseCompoundFile to true. I did not set this on the temp index. The last thing that I do before closing the production writer is to call the optimize method. I would really appreciate any ideas to get the index size smaller if it is at all possible.

Thanks, Rob
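George's suggestion to check for hidden leftovers can be automated. Below is a small sketch that reports hidden files in a directory and sums the bytes they occupy; the demo directory and file names are made up for illustration, and on Unix-like systems "hidden" simply means the name starts with a dot.

```java
import java.io.File;
import java.io.FileWriter;

// Sketch: total the disk space taken by hidden files in an index
// directory, to see how much of the footprint is leftover temp files
// rather than live segments.
public class HiddenFiles {

    // Returns the combined size in bytes of hidden regular files in dir.
    static long hiddenBytes(File dir) {
        long total = 0;
        File[] files = dir.listFiles();
        if (files == null) return 0; // not a directory, or unreadable
        for (File f : files) {
            if (f.isFile() && f.isHidden()) {
                System.out.println("hidden: " + f.getName()
                        + " (" + f.length() + " bytes)");
                total += f.length();
            }
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        // Demo setup: a throwaway directory with one hidden file in it.
        File dir = new File(System.getProperty("java.io.tmpdir"),
                "index-check-demo");
        dir.mkdirs();
        File hidden = new File(dir, ".tmp_segment");
        try (FileWriter w = new FileWriter(hidden)) {
            w.write("leftover");
        }
        System.out.println("total hidden bytes: " + hiddenBytes(dir));
        hidden.delete();
        dir.delete();
    }
}
```

Running this against a real index folder makes the comparison between live segment files and hidden leftovers concrete before deleting anything.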
RE: Restoring a corrupt index
This is what I did. There are 2 classes in the lucene source which are not public and therefore cannot be accessed from outside the package. The classes are:

1. org.apache.lucene.index.SegmentInfos - a collection of segments
2. org.apache.lucene.index.SegmentInfo - represents a single segment

I took these two files and moved them to a separate folder, then created a class with the following code fragment:

public void displaySegments(String indexDir) throws Exception {
    Directory dir = (Directory)FSDirectory.getDirectory(indexDir, false);
    SegmentInfos segments = new SegmentInfos();
    segments.read(dir);
    StringBuffer str = new StringBuffer();
    int size = segments.size();
    str.append("Index Dir = " + indexDir);
    str.append("\nTotal Number of Segments " + size);
    str.append("\n--");
    for (int i = 0; i < size; i++) {
        str.append("\n");
        str.append((i + 1) + ". ");
        str.append(((SegmentInfo)segments.get(i)).name);
    }
    str.append("\n--");
    System.out.println(str.toString());
}

public void deleteSegment(String indexDir, String segmentName) throws Exception {
    Directory dir = (Directory)FSDirectory.getDirectory(indexDir, false);
    SegmentInfos segments = new SegmentInfos();
    segments.read(dir);
    int size = segments.size();
    String name = null;
    boolean found = false;
    for (int i = 0; i < size; i++) {
        name = ((SegmentInfo)segments.get(i)).name;
        if (segmentName.equals(name)) {
            found = true;
            segments.remove(i);
            System.out.println("Deleted the segment with name " + name + " from the segments file");
            break;
        }
    }
    if (found) {
        segments.write(dir);
    } else {
        System.out.println("Invalid segment name: " + segmentName);
    }
}

Use the displaySegments() method to display the segments and deleteSegment() to delete the corrupt segment.

Thanks, George

--- Karthik N S [EMAIL PROTECTED] wrote:

Hi Guys, in our situation we would be indexing millions of information documents, with huge gigabytes of data indexed and finally put into a MERGED INDEX, categorized accordingly.
There may be a possibility of corruption, so please do post the code referrals.

Thx, Karthik

-Original Message-
From: Honey George [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 18, 2004 5:51 PM
To: Lucene Users List
Subject: Re: Restoring a corrupt index

Thanks Erik, that worked. I was able to remove the corrupt index and now it looks like the index is OK. I was able to view the number of documents in the index. Before that I was getting the error:

java.io.IOException: read past EOF

I am yet to find out how my index got corrupted. There is another thread going on about this topic:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg03165.html
If anybody is facing a similar problem and is interested in the code, I can post it here.

Thanks, George

--- Erik Hatcher [EMAIL PROTECTED] wrote:

The details of the segments file (and all the others) are freely available here:
http://jakarta.apache.org/lucene/docs/fileformats.html
Also, there is Java code in Lucene, of course, that manipulates the segments file which could be leveraged (although probably package-scoped and not easily usable in a standalone repair tool).

Erik

On Aug 18, 2004, at 6:50 AM, Honey George wrote:

Looks like the problem is not with the hex editor; even in UltraEdit (I had access to a Windows box) I am seeing the same display. The problem is I am not able to identify where a record starts with just 1 record in the file. Need to try some alternate approach.

Thanks, George
Re: searchhelp
Hi, note that Lucene only provides an API to build a search engine; you can use it however you want. You can pass data for indexing in 2 forms:

1. java.lang.String
2. java.io.Reader

What Lucene receives is either of the two objects above. In the case of non-text documents, you need to extract the text information from the documents and either create a text file and convert it to a Reader object, or create a String object (for small files). For indexing database contents, you need to write your own APIs to get the data from the database (using JDBC/EJB etc.), convert the data to a String object and pass it to Lucene for indexing. Again, Lucene is not responsible for getting the data from your application; it only indexes the data given to it.

Also, for extracting contents from pdf/doc files (generally known as text extraction), I know of 2 more tools:

wvWare - for word documents
pdftotext (xpdf) - for pdf documents

Google around and you will get lots of links. Hope this helps.

Thanks, George

--- Santosh [EMAIL PROTECTED] wrote:

I recently joined the list and haven't gone through any previous mails. If you have any mails or related code, please forward them to me.

- Original Message -
From: Chandan Tamrakar [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, August 19, 2004 3:47 PM
Subject: Re: searchhelp

For PDF you need to extract text from pdf files using the pdfbox library, and for word documents you can use Apache POI APIs. There are messages posted on the lucene list related to your queries. About databases, I guess someone must have done it. :)

- Original Message -
From: Santosh [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, August 19, 2004 3:58 PM
Subject: searchhelp

Hi, I am using the lucene search engine for my application. I am able to search through the text files and htmls as specified by lucene. Can you please clarify my doubts:
1. Can lucene search through pdfs and word documents? If yes, then how?
2. Can lucene search through a database?
If yes, then how?

Thank you, Santosh
RE: Restoring a corrupt index
If I understand correctly, you have a situation where you have a large main index, you create small indexes, and finally merge them into the main index. It can happen that halfway through merging the system crashes and the index gets corrupted. I do not think you can use my solution in this case. What I am trying to do is remove a corrupt segment and its associated files from the index folder, not fix a corrupt segment. This way at least I can add new documents to the index. Of course, I am sure I didn't lose anything, because my index file size was actually 0 bytes.

Thanks, George

--- Karthik N S [EMAIL PROTECTED] wrote:

Hi George, do you think the same would work for MERGED indexes? Please can you suggest a solution.

Karthik

-Original Message-
From: Honey George [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 19, 2004 2:08 PM
To: Lucene Users List
Subject: RE: Restoring a corrupt index

This is what I did. There are 2 classes in the lucene source which are not public and therefore cannot be accessed from outside the package. The classes are:

1. org.apache.lucene.index.SegmentInfos - a collection of segments
2. org.apache.lucene.index.SegmentInfo - represents a single segment

I took these two files and moved them to a separate folder, then created a class with the following code fragment:

public void displaySegments(String indexDir) throws Exception {
    Directory dir = (Directory)FSDirectory.getDirectory(indexDir, false);
    SegmentInfos segments = new SegmentInfos();
    segments.read(dir);
    StringBuffer str = new StringBuffer();
    int size = segments.size();
    str.append("Index Dir = " + indexDir);
    str.append("\nTotal Number of Segments " + size);
    str.append("\n--");
    for (int i = 0; i < size; i++) {
        str.append("\n");
        str.append((i + 1) + ". ");
        str.append(((SegmentInfo)segments.get(i)).name);
    }
    str.append("\n--");
    System.out.println(str.toString());
}

public void deleteSegment(String indexDir, String segmentName) throws Exception {
    Directory dir = (Directory)FSDirectory.getDirectory(indexDir, false);
    SegmentInfos segments = new SegmentInfos();
    segments.read(dir);
    int size = segments.size();
    String name = null;
    boolean found = false;
    for (int i = 0; i < size; i++) {
        name = ((SegmentInfo)segments.get(i)).name;
        if (segmentName.equals(name)) {
            found = true;
            segments.remove(i);
            System.out.println("Deleted the segment with name " + name + " from the segments file");
            break;
        }
    }
    if (found) {
        segments.write(dir);
    } else {
        System.out.println("Invalid segment name: " + segmentName);
    }
}

Use the displaySegments() method to display the segments and deleteSegment() to delete the corrupt segment.

Thanks, George

--- Karthik N S [EMAIL PROTECTED] wrote:

Hi Guys, in our situation we would be indexing millions of information documents, with huge gigabytes of data indexed and finally put into a MERGED INDEX, categorized accordingly. There may be a possibility of corruption, so please do post the code referrals.

Thx, Karthik

-Original Message-
From: Honey George [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 18, 2004 5:51 PM
To: Lucene Users List
Subject: Re: Restoring a corrupt index

Thanks Erik, that worked. I was able to remove the corrupt index and now it looks like the index is OK. I was able to view the number of documents in the index. Before that I was getting the error:

java.io.IOException: read past EOF

I am yet to find out how my index got corrupted. There is another thread going on about this topic:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg03165.html
If anybody is facing a similar problem and is interested in the code, I can post it here.

Thanks, George

--- Erik Hatcher [EMAIL PROTECTED] wrote:

The details of the segments file (and all the others) are freely available here:
http://jakarta.apache.org/lucene/docs/fileformats.html
Also, there is Java code in Lucene, of course, that manipulates the segments file which could be leveraged (although probably package-scoped and not easily usable in a standalone repair tool).

Erik

On Aug 18, 2004, at 6:50 AM, Honey George wrote:

Looks like the problem is not with the hex editor; even in UltraEdit (I had access to a Windows box) I am seeing the same display. The problem is I am not able to identify where a record starts with just 1 record in the file. Need to try some alternate approach.

Thanks, George
RE: Restoring a corrupt index
Looks like the problem is not with the hex editor; even in UltraEdit (I had access to a Windows box) I am seeing the same display. The problem is I am not able to identify where a record starts with just 1 record in the file. Need to try some alternate approach.

Thanks, George

--- [EMAIL PROTECTED] wrote:

http://www.ultraedit.com/ is the best! However, I cannot imagine how another hex editor wouldn't work.

-Original Message-
From: Honey George [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 17, 2004 10:35 AM
To: Lucene Users List
Subject: RE: Restoring a corrupt index

Wallen, which hex editor have you used? I am also facing a similar problem. I tried to use KHexEdit and it doesn't seem to help. I am attaching my segments file with this email. I think only the segment with name _6ung is a valid one; I wanted to delete the remaining, but couldn't. Can you help?

-George
Re: Restoring a corrupt index
Thanks Erik, that worked. I was able to remove the corrupt index and now it looks like the index is OK. I was able to view the number of documents in the index. Before that I was getting the error:

java.io.IOException: read past EOF

I am yet to find out how my index got corrupted. There is another thread going on about this topic:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg03165.html
If anybody is facing a similar problem and is interested in the code, I can post it here.

Thanks, George

--- Erik Hatcher [EMAIL PROTECTED] wrote:

The details of the segments file (and all the others) are freely available here:
http://jakarta.apache.org/lucene/docs/fileformats.html
Also, there is Java code in Lucene, of course, that manipulates the segments file which could be leveraged (although probably package-scoped and not easily usable in a standalone repair tool).

Erik

On Aug 18, 2004, at 6:50 AM, Honey George wrote:

Looks like the problem is not with the hex editor; even in UltraEdit (I had access to a Windows box) I am seeing the same display. The problem is I am not able to identify where a record starts with just 1 record in the file. Need to try some alternate approach.

Thanks, George
RE: Restoring a corrupt index
Wallen, which hex editor have you used? I am also facing a similar problem. I tried to use KHexEdit and it doesn't seem to help. I am attaching my segments file with this email. I think only the segment with name _6ung is a valid one; I wanted to delete the remaining, but couldn't. Can you help?

-George

--- [EMAIL PROTECTED] wrote:

I fixed my own problem, but hope this might help someone else in the future: I went into my segments file (with a hex editor), deleted the record for _cu0v and changed the length 0x20 to be 0x1f, and it seems I have most of my index back! Maybe a developer could elaborate on this?
RE: Restoring a corrupt index
I think attachments are filtered. This is what I see when I open the file in the hex editor:

:0000  00 04 e0 af 00 00 00 02 05 5f 36 75 6e 67 00 04   ..à¯....._6ung..
:0010  1e fb 05 5f 36 75 6e 69 00 00 00 01 00 00 00 00   .û._6uni........
:0020  00 00 c1 b4                                       ..Á´

-George

--- Honey George [EMAIL PROTECTED] wrote:

Wallen, which hex editor have you used? I am also facing a similar problem. I tried to use KHexEdit and it doesn't seem to help. I am attaching my segments file with this email. I think only the segment with name _6ung is a valid one; I wanted to delete the remaining, but couldn't. Can you help?

-George

--- [EMAIL PROTECTED] wrote:

I fixed my own problem, but hope this might help someone else in the future: I went into my segments file (with a hex editor), deleted the record for _cu0v and changed the length 0x20 to be 0x1f, and it seems I have most of my index back! Maybe a developer could elaborate on this?
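For the curious, the dump above can be decoded programmatically instead of squinting at a hex editor. The layout assumed here is my reading of the 1.x segments format described at the fileformats page linked earlier in these threads: a counter (int32), a segment count (int32), then per segment a name (one-byte VInt length plus characters) and a document count (int32), with the trailing eight bytes presumably a version long. Treat the field interpretation as a sketch, not a reference.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Sketch: decode a small segments file by hand, assuming the layout
// counter(int32), segCount(int32), then per segment a VInt-length
// name and an int32 document count.
public class SegmentsDump {

    static String[] segmentNames(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int counter = in.readInt();      // segment-name counter
            int segCount = in.readInt();     // number of segments
            String[] names = new String[segCount];
            for (int i = 0; i < segCount; i++) {
                int len = in.readUnsignedByte(); // VInt length; always < 128 here
                byte[] chars = new byte[len];
                in.readFully(chars);
                names[i] = new String(chars, "ISO-8859-1");
                int docCount = in.readInt();
                System.out.println(names[i] + ": " + docCount + " docs");
            }
            return names;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // The 36 bytes shown in the hex dump above.
        byte[] dump = {
            0x00, 0x04, (byte) 0xe0, (byte) 0xaf, 0x00, 0x00, 0x00, 0x02,
            0x05, 0x5f, 0x36, 0x75, 0x6e, 0x67, 0x00, 0x04,
            0x1e, (byte) 0xfb, 0x05, 0x5f, 0x36, 0x75, 0x6e, 0x69,
            0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00,
            0x00, 0x00, (byte) 0xc1, (byte) 0xb4
        };
        segmentNames(dump); // lists _6ung and _6uni, matching the dump
    }
}
```

Under this reading, the two segment names _6ung and _6uni fall out directly, which matches what the hex editor shows in the ASCII column.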
RE: Not deleting temp files after updating/optimising
Hi, I am facing the same problem with temporary index files. I can see a lot of temporary files (hidden) not being deleted (lucene-1.3-final + Linux RH7 + jdk 1.3.1). The size of the temporary files is almost the same as that of the index. I have deleted all the hidden temporary files and now my directory contents are as given below:

_2lok.fdt _2lok.fdx _2lok.fnm _6c2h.fdt _6c2h.fdx _6c2h.fnm
_6hgv.fdt _6hgv.fdx _6hgv.fnm _6hh1.fdt _6hh1.fdx _6hh1.fnm
_7gqr.fdt _7gqr.fdx _7gqr.fnm _918i.fdt _918i.fdx _918i.fnm
deletable segments

Again I see that the index size is bigger than the data size: the data size is 5.3GB but the size of the index is 7GB. I have almost 200,000 documents in the index. Any help with the above 2 problems is much appreciated.

Thanks regards, George
RE: Not deleting temp files after updating/optimising
--- Honey George [EMAIL PROTECTED] wrote:

Hi, I am facing the same problem with temporary index files. I can see a lot of temporary files (hidden) not being deleted (lucene-1.3-final + Linux RH7 + jdk 1.3.1).

Sorry for the spam; I actually use lucene-1.2, and the problem was found in lucene-1.2.