Hebrew support
Hello, do you know anything about Hebrew support in Lucene? Thanks in advance, Alex Kiselevsky Speech Technology Tel: 972-9-776-43-46 R&D, Amdocs - Israel Mobile: 972-53-63 50 38 mailto:[EMAIL PROTECTED] The information contained in this message is proprietary of Amdocs, protected from disclosure, and may be privileged. The information is intended to be conveyed only to the designated recipient(s) of the message. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, use, distribution or copying of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please notify us immediately by replying to the message and deleting it from your computer. Thank you.
Re: online and offline Directory
Ernesto De Santis writes: Hi Aviran, thanks for the response. I forgot important information for you to understand my issue. My process does something like this: the index has contents from different sources, identified by a special field 'source'. So the index has documents with source: S1 or source: S2 ... etc. When I reindex the source S1, I first delete all documents with source: S1, otherwise the index would contain repeated content. Then I add the new index result. In the middle of the process the IndexSearcher uses an incomplete index. Is it possible to do it like a database transaction? It's not like a database transaction, but any index reader/searcher that was opened before the changes won't see them until it is closed and reopened. AFAIK that also applies to deletions, though I never checked that. So you have two options: a) use a second index for indexing, move the indexes after the indexing is done, and make sure index readers/searchers are closed and reopened after the move. b) use one index and make sure that you do not open any index reader/searcher during the update. Searches may only use already opened readers/searchers. I guess it depends on index size, update frequency and so on, which scenario is easier to handle. Given that the index isn't too large and the update frequency is rather low, I'd use a second index. But you'll need to copy that index and should consider the time and disk IO needed for that. Morus - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
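Option (a) above — build into a second directory, then move it into place once all readers are closed — can be sketched with plain java.io.File operations. This is a minimal sketch of the approach, not Lucene API; the helper and directory names are illustrative:

```java
import java.io.File;

public class IndexSwapper {
    /**
     * Replace the live index directory with a freshly built one.
     * The caller must ensure every IndexReader/IndexSearcher on the
     * live directory is closed before calling this, and must reopen
     * searchers afterwards. The old index is parked in "trash" rather
     * than deleted, so a failed rename leaves something recoverable.
     */
    public static boolean swap(File live, File fresh, File trash) {
        // park the current index first (rename is atomic on one filesystem)
        if (live.exists() && !live.renameTo(trash)) return false;
        // promote the freshly built index to the live location
        return fresh.renameTo(live);
    }
}
```

Note that File.renameTo only works reliably when both paths are on the same filesystem; across filesystems you would fall back to a copy.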
Re: Sorting Info
On Sep 27, 2004, at 6:32 PM, [EMAIL PROTECTED] wrote: I'm interested in doing sorting in Lucene. Is there a FAQ or an article that will show me how to do this? I already have my indexing and searching working. From IndexSearcher, use the search(Query, Sort) method (or other variants that take a Sort): http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Searcher.html#search(org.apache.lucene.search.Query,%20org.apache.lucene.search.Sort) Following the Javadocs for Sort should be (hopefully) self-explanatory. Sorting is a pretty new feature, and is only described in the Javadocs and this e-mail list as far as I know. I wrote a pretty extensive section on sorting in Lucene in Action - http://www.manning.com/hatcher2 - which is primarily in Otis' hands right now (*nudge nudge*) and has been technical and copy edited. In other words, LIA should be available in print 6-8 weeks (e-book in PDF format available much sooner) after Otis is done. Erik
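The search(Query, Sort) call Erik refers to looks roughly like the sketch below. The field names ("contents", "date") are illustrative assumptions, not from the thread; the Sort/SortField classes are the 1.4-era Lucene API:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;

public class SortedSearch {
    public static Hits search(IndexSearcher searcher) throws java.io.IOException {
        Query query = new TermQuery(new Term("contents", "lucene"));
        // sort primarily by an untokenized "date" field, then by relevance
        Sort sort = new Sort(new SortField[] {
            new SortField("date", SortField.STRING),
            SortField.FIELD_SCORE
        });
        return searcher.search(query, sort);
    }
}
```

Fields used for sorting must be indexed and untokenized so each document contributes a single term.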
Using lucene in Tomcat
Hi all, I have implemented Lucene search for my documents and db successfully. Now my problem is, the index I created is indexing to my local disk; I want the index to be created with reference to my server. Right now I index C:/tomcat/webapps/jetspeed/document, but I want to index wrt /jetspped/document. Let me know if someone can help.
Re: Using lucene in Tomcat
mahaveer jain wrote: Hi all, I have implemented Lucene search for my documents and db successfully. Now my problem is, the index I created is indexing to my local disk; I want the index to be created with reference to my server. Right now I index C:/tomcat/webapps/jetspeed/document, but I want to index wrt /jetspped/document. Maybe you can create a small method to convert /jetspped/document to a File object, then call its toString() method and pass the result to the indexer.
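A small sketch of that suggestion. Inside a servlet the web app root would typically come from ServletContext.getRealPath("/"); here it is passed in so the logic is container-independent. The helper and names are illustrative, not from the thread:

```java
import java.io.File;

public class IndexLocator {
    /**
     * Resolve an app-relative path such as "jetspped/document" against
     * the web application's root directory, yielding an absolute path
     * string that can be handed to the indexer.
     */
    public static String resolve(File webappRoot, String relativePath) {
        return new File(webappRoot, relativePath).getPath();
    }
}
```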
Accent filter
Hi, I am certainly not the first, and probably not the last, to have had problems with accented characters in my index. But unfortunately I couldn't find anything in either Lucene or the lucene-sandbox to solve the problem. So I wrote an accent filter and thought that I might as well share it with you guys :) -- Bo Gundersen DBA/Software Developer M.Sc.CS. www.atira.dk

package dk.atira.search;

import java.io.IOException;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * This filter converts accented characters to their non-accented versions.
 * It also strips unwanted characters from the tokens, meaning anything
 * but A-Z, a-z, 0-9, ÆÅØæøå and '-'.
 * The valid characters can be changed by adding them to the string validCharsStr.
 *
 * Created by Bo Gundersen at Sep 28, 2004 12:39:04 PM
 *
 * @author Bo Gundersen ([EMAIL PROTECTED])
 */
public class AccentFilter extends TokenFilter {
    private static final Collection validChars = new HashSet();
    private static final String validCharsStr =
        "abcdefghijklmnopqrstuvwxyz\u00E6\u00F8\u00E5"
        + "ABCDEFGHIJKLMNOPQRSTUVWXYZ\u00C6\u00D8\u00C5"
        + "0123456789"
        + "-";
    static {
        for (int i = 0; i < validCharsStr.length(); i++)
            validChars.add(new Character(validCharsStr.charAt(i)));
    }

    private static final Map accents = new HashMap();
    static {
        accents.put(new Character('\u00C0'), "A"); accents.put(new Character('\u00C1'), "A");
        accents.put(new Character('\u00C2'), "A"); accents.put(new Character('\u00C3'), "A");
        accents.put(new Character('\u00E0'), "a"); accents.put(new Character('\u00E1'), "a");
        accents.put(new Character('\u00E2'), "a"); accents.put(new Character('\u00E3'), "a");
        accents.put(new Character('\u00E4'), "a");
        accents.put(new Character('\u00C8'), "E"); accents.put(new Character('\u00C9'), "E");
        accents.put(new Character('\u00CA'), "E"); accents.put(new Character('\u00CB'), "E");
        accents.put(new Character('\u00E8'), "e"); accents.put(new Character('\u00E9'), "e");
        accents.put(new Character('\u00EA'), "e"); accents.put(new Character('\u00EB'), "e");
        accents.put(new Character('\u00CC'), "I"); accents.put(new Character('\u00CD'), "I");
        accents.put(new Character('\u00CE'), "I"); accents.put(new Character('\u00CF'), "I");
        accents.put(new Character('\u00EC'), "i"); accents.put(new Character('\u00ED'), "i");
        accents.put(new Character('\u00EE'), "i"); accents.put(new Character('\u00EF'), "i");
        accents.put(new Character('\u00D1'), "N"); accents.put(new Character('\u00F1'), "n");
        accents.put(new Character('\u00D2'), "O"); accents.put(new Character('\u00D3'), "O");
        accents.put(new Character('\u00D4'), "O"); accents.put(new Character('\u00D5'), "O");
        accents.put(new Character('\u00D6'), "O");
        accents.put(new Character('\u00F2'), "o"); accents.put(new Character('\u00F3'), "o");
        accents.put(new Character('\u00F4'), "o"); accents.put(new Character('\u00F5'), "o");
        accents.put(new Character('\u00F6'), "o");
        accents.put(new Character('\u00D9'), "U"); accents.put(new Character('\u00DA'), "U");
        accents.put(new Character('\u00DB'), "U"); accents.put(new Character('\u00DC'), "U");
        accents.put(new Character('\u00F9'), "u"); accents.put(new Character('\u00FA'), "u");
        accents.put(new Character('\u00FB'), "u"); accents.put(new Character('\u00FC'), "u");
        accents.put(new Character('\u00DD'), "Y"); accents.put(new Character('\u00FD'), "y");
        accents.put(new Character('\u00FF'), "y");
        accents.put(new Character('\u00C6'), "AE"); accents.put(new Character('\u00E6'), "ae");
        accents.put(new Character('\u00D8'), "OE"); accents.put(new Character('\u00F8'), "oe");
        accents.put(new Character('\u00C5'), "AA"); accents.put(new Character('\u00E5'), "aa");
    }
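The posted fragment ends before the class's constructor and next() method. The per-token replacement it implies can be factored into a pure helper like the sketch below (my reconstruction, not Bo's code; only a subset of the map is shown). Inside AccentFilter.next() one would return new Token(normalize(t.termText()), t.startOffset(), t.endOffset()) for each non-null token from the wrapped stream:

```java
import java.util.HashMap;
import java.util.Map;

public class AccentNormalizer {
    // A small subset of the accents map from the post; the full filter
    // maps the complete Latin-1 accented range the same way.
    private static final Map ACCENTS = new HashMap();
    static {
        ACCENTS.put(new Character('\u00E9'), "e");  // é -> e
        ACCENTS.put(new Character('\u00E8'), "e");  // è -> e
        ACCENTS.put(new Character('\u00F6'), "o");  // ö -> o
        ACCENTS.put(new Character('\u00C5'), "AA"); // Å -> AA
    }

    /** Replace each mapped accented character; pass everything else through. */
    public static String normalize(String term) {
        StringBuffer sb = new StringBuffer(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            String mapped = (String) ACCENTS.get(new Character(c));
            if (mapped != null) sb.append(mapped);
            else sb.append(c);
        }
        return sb.toString();
    }
}
```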
Re: Accent filter
Loads of very well thought out ISO-8859 + French/Irish filters available here too (I think they are all GPL'd): http://www.nongnu.org/sdx/ Best Regards, John

Bo Gundersen wrote: Hi, I am certainly not the first, and probably not the last, to have had problems with accented characters in my index. But unfortunately I couldn't find anything in either Lucene or the lucene-sandbox to solve the problem. So I wrote an accent filter and thought that I might as well share it with you guys :)

[...]
        accents.put(new Character('\u00C6'), "AE"); accents.put(new Character('\u00E6'), "ae");
        accents.put(new Character('\u00D8'), "OE"); accents.put(new Character('\u00F8'), "oe");
        accents.put(new Character('\u00C5'), "AA"); accents.put(new Character('\u00E5'), "aa");
    }
RE: Hebrew support
As far as I know there is no Analyzer for Hebrew. Aviran -----Original Message----- From: Alex Kiselevski [mailto:[EMAIL PROTECTED]] Sent: Tuesday, September 28, 2004 3:12 AM To: [EMAIL PROTECTED] Subject: Hebrew support Hello, do you know anything about Hebrew support in Lucene? Thanks in advance, Alex Kiselevsky
PorterStemAnalyzer versus SnowballAnalyser?
I use the PorterStemAnalyzer provided by Otis (see http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=2). In http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/snowball/ there are other implementations of the Porter algorithm. At first glance they look somehow very different. What is the difference/advantage/disadvantage of using Otis' code versus the one from the lucene-sandbox Snowball contribution? Thanks in advance, J.
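For comparison, instantiating the sandbox version looks roughly like the sketch below. The stemmer names and stop-word choice are assumptions based on the Snowball contribution's conventions, where "Porter" selects Snowball's implementation of the original Porter algorithm and "English" selects the revised "porter2" stemmer, which produces different stems for some words:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

public class StemmingChoice {
    // comparable to a hand-rolled PorterStemAnalyzer
    public static Analyzer porter() {
        return new SnowballAnalyzer("Porter", StopAnalyzer.ENGLISH_STOP_WORDS);
    }

    // the newer, revised English stemmer
    public static Analyzer english() {
        return new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);
    }
}
```

Whichever analyzer is chosen, the same one must be used at both index and query time, so switching implementations means reindexing.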
Re: Shouldnt IndexWriter.flushRamSegments() be public? or at least protected?
Christian Rodriguez wrote: Now the problem I have is that I don't have a way to force a flush of the IndexWriter without closing it, and I need to do that before committing a transaction or I would get random errors. Shouldn't that function be public, in case the user wants to force a flush at some point other than when the IndexWriter is closed? If not, I am forced to create a new IndexWriter and close it EVERY TIME I commit a transaction (which in my application is very often). Opening and closing IndexWriters should be a lightweight operation. Have you tried this and found it to be too slow? A flush() would have to do just about the same work. Doug
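Doug's suggestion can be sketched as a per-commit open/close cycle: close() flushes the buffered segments, so a short-lived writer per transaction behaves like an explicit flush. The helper, path handling, and error policy here are illustrative assumptions:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class TransactionalAdd {
    /** Add a batch of documents and flush them by closing the writer. */
    public static void commit(String indexPath, Document[] docs) throws java.io.IOException {
        // create=false: append to the existing index rather than replacing it
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
        try {
            for (int i = 0; i < docs.length; i++) {
                writer.addDocument(docs[i]);
            }
        } finally {
            writer.close(); // flushes buffered in-memory segments to disk
        }
    }
}
```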
How to pull document scoring values
Hi, I'm trying to learn the scoring mechanism of Lucene. I want to fetch each parameter value individually, as they are collectively dumped out by Explanation. I've managed to pull out the tf and idf values using DefaultSimilarity and FilterIndexReader, but I am not sure where to get the fieldNorm and queryNorm. Also, is there any reference about how normalisation has been implemented? Any idea? Thanks, Zia -- Zia Syed [EMAIL PROTECTED] Smartweb Research Center, Robert Gordon University
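One way to get at all the factors, including fieldNorm and queryNorm, is Searcher.explain(Query, int), which returns the nested Explanation tree per document. A minimal sketch (the class and loop are illustrative; explain() is the 1.4-era Lucene API):

```java
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class ScoreInspector {
    /** Print the full scoring breakdown for every hit of a query. */
    public static void dump(IndexSearcher searcher, Query query) throws java.io.IOException {
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            Explanation exp = searcher.explain(query, hits.id(i));
            // the nested tree contains tf, idf, fieldNorm and queryNorm factors
            System.out.println(exp.toString());
        }
    }
}
```

To pull individual values programmatically rather than as text, walk the Explanation tree via its child nodes instead of printing toString().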
Sorting on a long string
I am new to Lucene, and trying to perform a sorted query on a list of people's names. Lucene seems unable to properly sort on the name field of my indexed documents. If I sort by the other (shorter) fields, it seems to work fine. The name sort seems to be close, almost as if the last few iterations through the sort loop are not being done. The records are obviously not in the normally random order, but not fully sorted either. I have tried different ways of sorting, including a SortField array/object with the field cast as a string. The index I am sorting has about 1.2 million documents. Are there known limitations in the sorting functionality that I am running into? I can provide more details if needed. Thanks for any help, -Pete
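A common cause of almost-sorted results like this is sorting on a tokenized field: the sort field must be indexed, untokenized, and present in every document so each document contributes exactly one term, otherwise multi-word names sort by an arbitrary one of their tokens. A sketch of an explicit string sort on the "name" field (the surrounding class is illustrative):

```java
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class NameSort {
    /** Run the query sorted by an untokenized "name" keyword field. */
    public static Hits byName(IndexSearcher searcher, Query query) throws java.io.IOException {
        Sort sort = new Sort(new SortField("name", SortField.STRING));
        return searcher.search(query, sort);
    }
}
```

If "name" is currently added as a tokenized text field, a second untokenized keyword copy of it can serve as the sort field.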
re-indexing
I am having trouble reindexing. Basically what I want to do is: 1. Delete the old index 2. Write the new index. The environment: the index is searched by a web app running from the Orion App Server. This code runs fine and reindexes fine prior to any searches. After the first search against the index is completed, the index ends up being read-only (or not writeable); I cannot reindex and subsequently cannot search because the index is incomplete. 1. Why doesn't IndexReader.delete(i) really delete the file? It seems to just make another 1K file with a .del extension that the IndexWriter still cannot contend with. 2. How can I make this work? Thanks, Jason

The code below produces the following output when run AFTER an initial search against the index has been completed:

IndexerDrug-disableLuceneLocks: true
Directory: [EMAIL PROTECTED]:\lucene_index_drug
Deleted [0]: true
... (output from the for loop confirming deleted items)
Deleted [367]: true
Hit uncaught exception java.io.IOException
java.io.IOException: Cannot delete _ba.cfs
 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:144)
 at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:105)
 at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:193)
 at IndexerDrug.index(IndexerDrug.java:103)
 at IndexerDrug.main(IndexerDrug.java:246)
Exception in thread main

=-=-=-=-=-=-=-=-=-=-=-=-=-
My indexing code (some items have been deleted to protect the innocent)
=-=-=-=-=-=-=-=-=-=-=-=-=-

import java.io.*;
import java.sql.*;
import javax.naming.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;

public class IndexerDrug {
    private String sql = "my query code";
    public static String[] stopWords =
        org.apache.lucene.analysis.standard.StandardAnalyzer.STOP_WORDS;
    public File indexDir = new File("C:\\lucene_index_drug\\");
    public Directory fsDir;

    public void index() throws IOException {
        try {
            // Delete old index
            fsDir = FSDirectory.getDirectory(indexDir, false);
            if (indexDir.list().length > 0) {
                IndexReader reader = IndexReader.open(fsDir);
                System.out.println("Directory: " + reader.directory().toString());
                reader.unlock(fsDir);
                for (int i = 0; i < reader.maxDoc() - 1; i++) {
                    reader.delete(i);
                    System.out.println("Deleted [" + i + "]: " + reader.isDeleted(i));
                }
                reader.close();
            }
        } catch (Exception ex) {
            System.out.println("Error while deleting index: " + ex.getMessage());
        }
        // Write new index
        Analyzer analyzer = new StandardAnalyzer(stopWords);
        IndexWriter writer = new IndexWriter(indexDir, analyzer, true); // fails here *
        writer.mergeFactor = 1000;
        indexDirectory(writer);
        writer.setUseCompoundFile(true);
        writer.optimize();
        writer.close();
    }

    private void indexDirectory(IndexWriter writer) throws IOException {
        Connection c = null;
        ResultSet rs = null;
        Statement stmt = null;
        long startTime = System.currentTimeMillis();
        System.out.println("Start Time: "
            + new java.sql.Timestamp(System.currentTimeMillis()).toString());
        try {
            Class.forName(/* driver class removed */);
            c = DriverManager.getConnection(/* url, user, password removed */);
            stmt = c.createStatement();
            rs = stmt.executeQuery(this.sql);
            System.out.println("Query Completed: "
                + new java.sql.Timestamp(System.currentTimeMillis()).toString());
            int total = 0;
            String resourceID = "";
            String resourceName = "";
            String summary = "";
            String shortSummary = "";
            String hciPick = "";
            String url = "";
            String format = "";
            String orgType = "";
            String holdingType = "";
            String indexText = "";
            String c_indexText = "";
            boolean ready = false;
            Document doc = null;
            String oldResourceID = null;
            String newResourceID = null;
            while (rs.next()) {
                newResourceID = rs.getString("resourceID") != null ? rs.getString("resourceID") : "";
                resourceID = newResourceID;
                resourceName = rs.getString("resourceName") != null ? rs.getString("resourceName") : "";
                summary = rs.getString("summary") != null ? rs.getString("summary") : "";
                if (summary.length() > 300) {
                    shortSummary = summary.substring(0, 300) + "...";
                } else {
                    shortSummary = summary;
                }
                hciPick = rs.getString("hciPick") != null ? rs.getString("hciPick") : "";
                url = rs.getString("url") != null ? rs.getString("url") : "";
                format = rs.getString("format") != null ? rs.getString("format") : "";
                orgType = rs.getString("orgType") != null ? rs.getString("orgType") : "";
                holdingType = rs.getString("holdingType") != null ? rs.getString("holdingType") : "";
                indexText = rs.getString("indexText") != null ? rs.getString("indexText") : "";
                if
Re: re-indexing
Jason wrote: I am having trouble reindexing. Basically what I want to do is: 1. Delete the old index 2. Write the new index. The environment: the index is searched by a web app running from the Orion App Server. This code runs fine and reindexes fine prior to any searches. After the first search against the index is completed, the index ends up being read-only (or not writeable); I cannot reindex and subsequently cannot search because the index is incomplete.

We have several apps running like this, only on Tomcat and JBoss, with no problems...

1. Why doesn't IndexReader.delete(i) really delete the file? It seems to just make another 1K file with a .del extension that the IndexWriter still cannot contend with.

Never tried the IndexReader.delete() method; we generally build the new index in a temporary directory, and when the index is done we delete the current online directory (using java.io.File methods) and then rename the temp directory to online.

2. How can I make this work?

This may just be silly, but do you remember to close your org.apache.lucene.search.IndexSearcher when you are done with your search? -- Bo Gundersen DBA/Software Developer M.Sc.CS. www.atira.dk
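Bo's build-in-temp, delete-online, rename-temp approach can be sketched with plain java.io.File operations. This is a minimal sketch of that flow, not Lucene API; the helper and names are illustrative, and searchers on the old index must be closed before publishing:

```java
import java.io.File;

public class IndexPublisher {
    /** Recursively delete a directory tree, returning true on success. */
    static boolean deleteTree(File dir) {
        File[] kids = dir.listFiles();
        for (int i = 0; kids != null && i < kids.length; i++) {
            if (kids[i].isDirectory()) deleteTree(kids[i]);
            else kids[i].delete();
        }
        return dir.delete();
    }

    /**
     * Publish a freshly built index: remove the online copy and move
     * the temp build into its place with a single rename.
     */
    public static boolean publish(File tempBuild, File online) {
        if (online.exists() && !deleteTree(online)) return false;
        return tempBuild.renameTo(online);
    }
}
```

Because the rename is a single filesystem operation, the window in which no index exists is much smaller than rebuilding in place.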