RE: demo IndexHTML parser breaks unicode?
In org.apache.lucene.demo.HTMLDocument you need to change the input stream to use a different encoding. Replace the fis with this:

    fis = new InputStreamReader(new FileInputStream(f), "UTF-16");

-----Original Message-----
From: Fred Toth [mailto:[EMAIL PROTECTED]
Sent: Friday, September 24, 2004 9:25 PM
To: Lucene Users List
Subject: Re: demo IndexHTML parser breaks unicode?

Sorry, that didn't cure it. Again, anyone want to point me to the quickest replacement HTML parser (that's unicode clean)?

Thanks,
Fred

At 03:17 PM 9/24/2004, you wrote:
> On Friday 24 September 2004 19:58, Fred Toth wrote:
> > I've got unicode in my source HTML. In particular, within meta tags,
> > and it's getting broken by the indexer. Note that I'm not trying to
> > query on any of this, just store and retrieve document titles with
> > unicode characters.
>
> Please try again with the code from CVS, Christoph Goller committed a fix
> for this problem (at least I think it was this problem) 1-3 weeks ago.
>
> Regards
>  Daniel
>
> --
> http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
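The fix above works because the decoder, not the byte stream, determines how multi-byte characters come out. A small illustration in modern Java (StandardCharsets is Java 7+; the 2004-era code would pass the charset name as a String instead): decoding the same bytes with the wrong charset mangles the text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Decode a byte stream with an explicit charset, mirroring the
    // InputStreamReader fix suggested above.
    static String decode(byte[] bytes, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), cs)) {
            int c;
            while ((c = r.read()) != -1) sb.append((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";               // the accent is two bytes in UTF-8
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // The right charset round-trips the text; the wrong one mangles it.
        System.out.println(decode(utf8, StandardCharsets.UTF_8));      // café
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1)); // cafÃ©
    }
}
```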
TopTerms on query results
Can anyone help me with code to get the top terms of a given field for a query result set? Here is code modified from Luke to get the top terms for a field:

public TermInfo[] mostCommonTerms(String fieldName, int numberOfTerms) {
    // make sure min will get a positive number
    if (numberOfTerms < 1) {
        numberOfTerms = Integer.MAX_VALUE;
    }
    numberOfTerms = Math.min(numberOfTerms, 50);
    try {
        IndexReader reader = IndexReader.open(indexPath);
        TermInfoQueue tiq = new TermInfoQueue(numberOfTerms);
        TermEnum terms = reader.terms();
        int minFreq = 0;
        while (terms.next()) {
            if (fieldName.equalsIgnoreCase(terms.term().field())) {
                if (terms.docFreq() > minFreq) {
                    tiq.put(new TermInfo(terms.term(), terms.docFreq()));
                    if (tiq.size() >= numberOfTerms) {            // if tiq overfull
                        tiq.pop();                                // remove lowest in tiq
                        minFreq = ((TermInfo) tiq.top()).docFreq; // reset minFreq
                    }
                }
            }
        }
        TermInfo[] res = new TermInfo[tiq.size()];
        for (int i = 0; i < res.length; i++) {
            res[res.length - i - 1] = (TermInfo) tiq.pop();
        }
        reader.close();
        return res;
    } catch (IOException ioe) {
        logger.error("IOException: " + ioe.getMessage());
    }
    return null;
}
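The TermInfoQueue logic above can be sketched with the JDK's own PriorityQueue: keep a min-heap of at most N entries and evict the lowest document frequency whenever it overflows. (Pure-Java sketch; TermEnum and TermInfo are replaced by a plain frequency map for illustration.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopTerms {
    // Return the n terms with the highest frequencies, most frequent first.
    // A min-heap holds the current top n; the lowest-frequency entry is
    // evicted whenever the heap overflows, just like the TermInfoQueue above.
    static List<String> topTerms(Map<String, Integer> freqs, int n) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Integer> e : freqs.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) heap.poll();   // drop the lowest-frequency term
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(0, heap.poll().getKey()); // highest first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = Map.of("lucene", 42, "index", 17, "query", 29, "field", 5);
        System.out.println(topTerms(freqs, 2)); // [lucene, query]
    }
}
```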
getting most common terms for a smaller set of documents
Dear Lucene Users:

What is the best way to get the most common terms for a subset of the total documents in your index? I know how to get the most common terms for a field across the entire index, but what is the most efficient way to do this for a subset of documents? Here is the code I am using to get the top numberOfTerms common terms for the field fieldName:

public TermInfo[] mostCommonTerms(String fieldName, int numberOfTerms) {
    // make sure min will get a positive number
    if (numberOfTerms < 1) {
        numberOfTerms = Integer.MAX_VALUE;
    }
    numberOfTerms = Math.min(numberOfTerms, 50);
    try {
        IndexReader reader = IndexReader.open(indexPath);
        TermInfoQueue tiq = new TermInfoQueue(numberOfTerms);
        TermEnum terms = reader.terms();
        int minFreq = 0;
        while (terms.next()) {
            if (fieldName.equalsIgnoreCase(terms.term().field())) {
                if (terms.docFreq() > minFreq) {
                    tiq.put(new TermInfo(terms.term(), terms.docFreq()));
                    if (tiq.size() >= numberOfTerms) {            // if tiq overfull
                        tiq.pop();                                // remove lowest in tiq
                        minFreq = ((TermInfo) tiq.top()).docFreq; // reset minFreq
                    }
                }
            }
        }
        TermInfo[] res = new TermInfo[tiq.size()];
        for (int i = 0; i < res.length; i++) {
            res[res.length - i - 1] = (TermInfo) tiq.pop();
        }
        reader.close();
        return res;
    } catch (IOException ioe) {
        logger.error("IOException: " + ioe.getMessage());
    }
    return null;
}
RE: Spam:too many open files
A note to developers: the code checked into Lucene CVS around August 15th, post 1.4.1, was causing frequent index corruptions. When I reverted back to version 1.4 I no longer got the corruptions. I was unable to trace the problem to anything specific, but was using the newer code to take advantage of the sort fixes.

-----Original Message-----
From: Patrick Kates [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 01, 2004 1:30 PM
To: [EMAIL PROTECTED]
Subject: Spam:too many open files

I am having two problems with my client's Lucene indexes.

One, we are getting a FileNotFound exception (too many open files). This would seem to indicate that I need to increase the number of open files on our SuSE 9.0 Pro box. I have our sys admin working on this problem for me.

Two, because of this error and subsequent restarting of the box, we seem to have lost an index segment or two. My client's tape backups do not contain the segments we know about. I am concerned about the missing index segments as they seem to be preventing any further update of the index.

Does anyone have any suggestions as to how to fix this besides a full re-index of the problem indexes? I was wondering if maybe a merge of the index might solve the problem? I could move our nightly merge of the index files to sooner, but I am afraid that the merge might make matters worse?

Any ideas or helpful speculation would be greatly appreciated.

Patrick
RE: Spam:too many open files
I sent out an email to this list a few weeks ago about how to fix a corrupt index. I basically edited the segments file with a hex editor, removing the entry for the missing file, and decremented the total count of files in the file count that is near the beginning of the segments file.

-----Original Message-----
From: Patrick Kates [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 01, 2004 1:30 PM
To: [EMAIL PROTECTED]
Subject: Spam:too many open files

[snip: original message quoted in full above]
RE: Restoring a corrupt index
Change 02 to be 01 and delete the bytes that represent the one record that is bad. It was easier to see what a record was in my file because I had about 30 _files.

-----Original Message-----
From: Honey George [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 17, 2004 10:39 AM
To: Lucene Users List
Subject: RE: Restoring a corrupt index

I think attachments are filtered. This is what I see when I open the segments file in the hex editor:

0000: 00 04 e0 af 00 00 00 02 05 5f 36 75 6e 67 00 04   ..à¯._6ung..
0010: 1e fb 05 5f 36 75 6e 69 00 00 00 01 00 00 00 00   .û._6uni
0020: 00 00 c1 b4                                       ..Á´

-George

--- Honey George <[EMAIL PROTECTED]> wrote:
> Wallen,
> Which hex editor have you used? I am also facing a
> similar problem. I tried to use KHexEdit and it
> doesn't seem to help. I am attaching with this email
> my segments file. I think only the segment with name
> _ung is a valid one; I wanted to delete the
> remaining, but couldn't. Can you help?
>
> -George
>
> --- [EMAIL PROTECTED] wrote:
> > I fixed my own problem, but hope this might help
> > someone else in the future:
> >
> > I went into my segments file (with a hex editor),
> > deleted the record for _cu0v and changed the length
> > 0x20 to be 0x1f, and it seems I have most of my
> > index back!
> >
> > Maybe a developer could elaborate on this?
RE: Restoring a corrupt index
http://www.ultraedit.com/ is the best! However, I cannot imagine how another hex editor wouldn't work.

-----Original Message-----
From: Honey George [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 17, 2004 10:35 AM
To: Lucene Users List
Subject: RE: Restoring a corrupt index

Wallen,
Which hex editor have you used? I am also facing a similar problem. I tried to use KHexEdit and it doesn't seem to help. I am attaching with this email my segments file. I think only the segment with name _ung is a valid one; I wanted to delete the remaining, but couldn't. Can you help?

-George

--- [EMAIL PROTECTED] wrote:
> I fixed my own problem, but hope this might help
> someone else in the future:
>
> I went into my segments file (with a hex editor),
> deleted the record for _cu0v and changed the length
> 0x20 to be 0x1f, and it seems I have most of my
> index back!
>
> Maybe a developer could elaborate on this?
RE: Restoring a corrupt index
I fixed my own problem, but hope this might help someone else in the future:

I went into my segments file (with a hex editor), deleted the record for _cu0v and changed the length 0x20 to be 0x1f, and it seems I have most of my index back!

Maybe a developer could elaborate on this?

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Monday, August 16, 2004 2:16 PM
To: [EMAIL PROTECTED]
Subject: Restoring a corrupt index

Dear fellow Luceners,

I had a disk failure while indexing and am now unable to get ANY of the documents stored in my index. I am interested in restoring as many documents as possible from what is a mostly complete index. Is there something I can alter by hand to at least get most of the data back?

I am getting an EOF error on the file/segment _cu0v, which was presumably the file that was being written when the index crashed. Is there a reference to that file in segments that I could edit out?

I have included what I hope is useful information below.

Thank you,
Will

This is the call stack from an optimize call:

IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), false);
--> writer.optimize();
logger.debug(writer.docCount() + "");
writer.close();

---Call Stack---
java.io.IOException: read past EOF
        at org.apache.lucene.store.InputStream.refill(InputStream.java:154)
        at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
        at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
        at org.apache.lucene.index.CompoundFileReader.<init>(CompoundFileReader.java:66)
        at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:104)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:94)
        at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:480)
        at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
        at TryStuff.tryFixingLuceneIndex(TryStuff.java:60)
        at TryStuff.main(TryStuff.java:49)

---Directory listing---
-rw-rw-r--  1 wallen devs    383461 Jul 27 16:48 _1wtg.cfs
-rw-rw-r--  1 wallen devs 754131765 Jul 27 21:12 _262q.cfs
-rw-rw-r--  1 wallen devs 754345785 Jul 29 11:43 _4c49.cfs
-rw-rw-r--  1 wallen devs 719608798 Jul 31 04:38 _6i6l.cfs
-rw-rw-r--  1 wallen devs 773242798 Aug  2 03:05 _8o79.cfs
-rw-rw-r--  1 wallen devs 791843591 Aug  3 12:13 _au8j.cfs
-rw-rw-r--  1 wallen devs  77665301 Aug  3 14:35 _b21n.cfs
-rw-rw-r--  1 wallen devs  79123000 Aug  3 17:49 _b9uk.cfs
-rw-rw-r--  1 wallen devs  71718714 Aug  3 22:05 _bhnf.cfs
-rw-rw-r--  1 wallen devs  81537292 Aug  4 02:50 _bpga.cfs
-rw-rw-r--  1 wallen devs  80611946 Aug  4 07:44 _bx95.cfs
-rw-rw-r--  1 wallen devs  77923836 Aug  4 13:23 _c523.cfs
-rw-rw-r--  1 wallen devs         0 Aug  4 14:20 _caip.fnm
-rw-rw-r--  1 wallen devs  79987096 Aug  4 15:29 _ccxt.cfs
-rw-rw-r--  1 wallen devs  84966054 Aug  4 16:25 _ckqo.cfs
-rw-rw-r--  1 wallen devs  90829602 Aug  4 19:14 _csjj.cfs
-rw-rw-r--  1 wallen devs   7486317 Aug  4 19:23 _ctbm.cfs
-rw-rw-r--  1 wallen devs   1148765 Aug  4 19:24 _ctef.cfs
-rw-rw-r--  1 wallen devs    958149 Aug  4 19:27 _cth8.cfs
-rw-rw-r--  1 wallen devs    909911 Aug  4 19:28 _ctk1.cfs
-rw-rw-r--  1 wallen devs    918952 Aug  4 19:28 _ctmu.cfs
-rw-rw-r--  1 wallen devs    957856 Aug  4 19:31 _ctpn.cfs
-rw-rw-r--  1 wallen devs    651717 Aug  4 19:32 _ctsg.cfs
-rw-rw-r--  1 wallen devs    790354 Aug  4 19:32 _ctv9.cfs
-rw-rw-r--  1 wallen devs    890058 Aug  4 19:35 _cty2.cfs
-rw-rw-r--  1 wallen devs         0 Aug  4 19:35 _cu0v.cfs
-rw-rw-r--  1 wallen devs    891397 Aug  5 13:36 _cu3o.cfs
-rw-rw-r--  1 wallen devs   1085511 Aug  5 13:40 _cu6h.cfs
-rw-rw-r--  1 wallen devs    754877 Aug  5 13:40 _cu9b.cfs
-rw-rw-r--  1 wallen devs   1610682 Aug  5 13:40 _cuc5.cfs
-rw-rw-r--  1 wallen devs   1039577 Aug  5 13:41 _cuez.cfs
-rw-rw-r--  1 wallen devs    831174 Aug  5 13:41 _cuht.cfs
-rw-rw-r--  1 wallen devs    930858 Aug  5 13:56 _cuko.cfs
-rw-rw-r--  1 wallen devs    911844 Aug  5 13:56 _cuni.cfs
-rw-rw-r--  1 wallen devs       340 Aug  5 13:56 segments
-rw-rw-r--  1 wallen devs         4 Aug  5 13:56 deletable
drwxrwxrwx  2 wallen devs    929792 Aug  5 13:56 .
drwxrwxr-x  5 wallen devs        40 Aug 10 14:13 ..
RE: Finding All?
A range query that covers the full range does the same thing. Of course it is also inefficient with term generation:

    myField:[a TO z]

-----Original Message-----
From: Patrick Burleson [mailto:[EMAIL PROTECTED]
Sent: Friday, August 13, 2004 3:58 PM
To: Lucene Users List
Subject: Re: Finding All?

That is a very interesting idea. I might give that a shot.

Thanks,
Patrick

On Fri, 13 Aug 2004 15:36:11 -0400, Tate Avery <[EMAIL PROTECTED]> wrote:
>
> I had to do this once and I put a field called "all" with a value of "true" for every document.
>
> _doc.addField(Field.Keyword("all", "true"));
>
> Then, if there was an empty query, I would substitute it for the query "all:true". And, of course, every doc would match this.
>
> There might be a MUCH more elegant solution, but this certainly worked for me and was quite easy to incorporate. And, it appears to order the documents by the order in which they were indexed.
>
> T
>
> p.s. You can probably do something using IndexReader directly... but the nice thing about this approach is that you are still just using a simple query.
>
> -----Original Message-----
> From: Patrick Burleson [mailto:[EMAIL PROTECTED]
> Sent: Friday, August 13, 2004 3:25 PM
> To: Lucene Users List
> Subject: Finding All?
>
> Is there a way for lucene to find all documents? Say if I have a
> search input and someone puts nothing in, I want to go ahead and
> return everything. Passing "*" to QueryParser was not pretty.
>
> Thanks,
> Patrick
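One caveat about the full-range workaround quoted above (an observation, not from the thread): a term range is a lexicographic comparison, so an inclusive upper bound of "z" silently excludes any term that sorts after the bare string "z". A pure-Java sketch of the comparison:

```java
import java.util.List;
import java.util.stream.Collectors;

public class RangeCaveat {
    // Emulate an inclusive term range [lo TO hi] over a list of terms.
    static List<String> inRange(List<String> terms, String lo, String hi) {
        return terms.stream()
                .filter(t -> t.compareTo(lo) >= 0 && t.compareTo(hi) <= 0)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> terms = List.of("alpha", "query", "zebra");
        // "zebra" sorts after "z", so the range [a TO z] silently drops it:
        System.out.println(inRange(terms, "a", "z")); // [alpha, query]
    }
}
```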
RE: Question on the minimum value for DateField
The date is stored as a long that is the number of milliseconds since January 1970. Anything before that would be negative.

-----Original Message-----
From: Terence Lai [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 04, 2004 6:25 PM
To: Lucene Users List
Subject: Question on the minimum value for DateField

Hi All,

I realize that the DateField cannot accept a value which is before the year 1970, specifically in the org.apache.lucene.document.DateField.timeToString() method. Is there any technical reason for this limitation?

Thanks,
Terence

--
Get your free email account from http://www.trekspace.com
Your Internet Virtual Desktop!
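To see why the encoding bottoms out at 1970: java.util.Date counts milliseconds from 1970-01-01T00:00:00 UTC, and any earlier instant is a negative number, which an encoding that assumes non-negative values cannot represent. A quick stdlib check:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class EpochDemo {
    public static void main(String[] args) {
        // java.util.Date counts milliseconds since 1970-01-01T00:00:00 UTC;
        // instants before the epoch come out negative.
        Date epoch = new Date(0L);
        Date before = new Date(-86_400_000L);   // one day before the epoch

        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));

        System.out.println(fmt.format(epoch));   // 1970-01-01
        System.out.println(fmt.format(before));  // 1969-12-31
        System.out.println(before.getTime());    // -86400000
    }
}
```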
RE: TermFreqVector Beginner Question
Are you certain that you are storing the field "contents" in your documents, not just tokenizing it? If you use the overloaded method that takes a Reader, you lose the content.

-----Original Message-----
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 28, 2004 5:35 PM
To: [EMAIL PROTECTED]
Subject: Re: TermFreqVector Beginner Question

Can you post the whole section of related code? Sounds like you are doing things right. In the Lucene source code, there is a file called TestTermVectors.java; take a look at that and see how your stuff compares. I ran the test against the HEAD and it worked.

>>> [EMAIL PROTECTED] 07/28/04 04:51PM >>>
Howdy,

I am new to Lucene and thus far I am very impressed. Thanks to all who have worked on this project!

I am working on a project where I want to do the following:
1.) Index a bunch of documents.
2.) Pluck out one of the documents by Lucene document number.
3.) Get a term frequency for that document.

After some digging and playing I came across this method...

    IndexReader.getTermFreqVector(int docNumber, String field)

This is exactly what I want. So I ran the IndexFiles demo program with some test documents and started poking at the index with an IndexReader. But when I called IndexReader.getTermFreqVector(someDocNumber, "contents") I got NULL back. After a little more digging I found that for a TermVector to exist the Field has to have the TermVector flag set. So I changed some lines in the demo FileDocument.Document method to:

    FileInputStream is = new FileInputStream(f);
    Reader reader = new BufferedReader(new InputStreamReader(is));
    doc.add(Field.Text("contents", reader.toString(), true));

with the "true" parameter causing the new Field to turn on the storeTermVector flag, right? So then I reindex and get the same results - getTermFreqVector returns NULL.

So I inspect the field list of the Document from the index:

    Document d = ir.document(td.doc());
    System.out.println(" Path: " + d.get("path"));
    for (Enumeration e = d.fields(); e.hasMoreElements();) {
        System.out.println(((Field) e.nextElement()).toString());
    }

and I discover that there is now NO "contents" Field. If I change the parameter in Field.Text to false, I get a "contents" Field but no TermVector. To date I haven't been able to figure out how to get a TermFreqVector at all. What am I missing?

I have looked at the documents - all the tutorials I have found just cover the basics. I have read the newsgroup postings related to "TermVectors" and "TermFreqVectors" and everybody says stuff like "the new 1.4 Vector stuff is great". So how do they know? Where can I learn about this? Are there any more complete user tutorials/references that cover TermVector features? Oh, I am using the 1.4 Lucene release in case it matters.

Thanks in advance,
Matt Galloway
Tulsa, Oklahoma

(BTW, I also tried Field.UnStored with the same results.)
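One observation about the snippet above (my note, not from the thread): reader.toString() never reads the stream. It is Object's default toString(), so the field would be built from a string like java.io.BufferedReader@1b6d3586 rather than the file contents. A minimal demonstration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class ReaderToStringPitfall {
    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader("some document text"));

        // toString() is inherited from Object: it prints the class name and a
        // hash code; it does NOT read the stream.
        String wrong = reader.toString();
        System.out.println(wrong); // e.g. java.io.BufferedReader@1b6d3586

        // Draining the reader is what actually yields the content.
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) sb.append((char) c);
        System.out.println(sb); // some document text
    }
}
```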
RE: Lucene vs. MySQL Full-Text
I also question whether it could handle extreme volume with such good query speed. Has anyone done numbers with 1+ million documents?

-----Original Message-----
From: Daniel Naber [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 20, 2004 5:44 PM
To: Lucene Users List
Subject: Re: Lucene vs. MySQL Full-Text

On Tuesday 20 July 2004 21:29, Tim Brennan wrote:

> Does anyone out there have
> anything more concrete they can add?

Stemming is still on the MySQL TODO list:
http://dev.mysql.com/doc/mysql/en/Fulltext_TODO.html

Also, for most people it's easier to extend Lucene than MySQL (as MySQL is written in C(++?)), and there are more powerful queries in Lucene, e.g. fuzzy phrase search.

Regards
 Daniel

--
http://www.danielnaber.de
RE: Very slow IndexReader.open() performance
It could also be that your disk space is filling up and the OS runs out of swap room.

-----Original Message-----
From: Mark Florence [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 20, 2004 1:52 PM
To: Lucene Users List
Subject: Very slow IndexReader.open() performance

Hi -- We have a large index (~4m documents, ~14gb) that we haven't been able to optimize for some time, because the JVM throws OutOfMemory after climbing to the maximum we can throw at it, 2gb. In fact, the OutOfMemory condition occurred most recently during a segment merge operation.

maxMergeDocs was set to the default, and we seem to have gotten around this problem by setting it to some lower value, currently 100,000. The index is highly interactive, so I took the hint from earlier posts to set it to this value.

Good news! No more OutOfMemory conditions. Bad news: now, calling IndexReader.open() is taking 20+ seconds, and it is killing performance.

I followed the design pattern in another earlier post from Doug. I take a batch of deletes, open an IndexReader, perform the deletes, then close it. Then I take a batch of adds, open an IndexWriter, perform the adds, then close it. Then I get a new IndexSearcher for searching. But because the index is so interactive, this sequence repeats itself all the time.

My question is, is there a better way? Performance was fine when I could optimize. Can I hold onto a singleton IndexReader/IndexWriter/IndexSearcher to avoid the overhead of the open?

Any help would be most gratefully received.

Mark Florence, CTO, AIRS
[EMAIL PROTECTED]
800-897-7714 x1703
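On the singleton question: a common pattern is to share one searcher across requests and swap in a fresh instance only after a batch of updates, so the open cost is paid per batch rather than per query. A sketch of the sharing mechanics only (Searcher here is a stand-in class, not Lucene's IndexSearcher):

```java
import java.util.concurrent.atomic.AtomicReference;

public class SearcherHolder {
    // Stand-in for Lucene's IndexSearcher; only the sharing pattern matters here.
    static class Searcher {
        final int version;
        Searcher(int version) { this.version = version; }
    }

    private final AtomicReference<Searcher> current =
            new AtomicReference<>(new Searcher(0));

    // Readers share the current instance instead of reopening per request.
    Searcher acquire() { return current.get(); }

    // Writers publish a fresh searcher once, after a whole batch of updates,
    // so the expensive open happens per batch rather than per query.
    void refresh() {
        current.set(new Searcher(current.get().version + 1));
    }

    public static void main(String[] args) {
        SearcherHolder holder = new SearcherHolder();
        System.out.println(holder.acquire() == holder.acquire()); // true: one shared instance
        holder.refresh();
        System.out.println(holder.acquire().version);             // 1
    }
}
```

In real code the writer would also close the old searcher once in-flight queries finish; that lifecycle bookkeeping is omitted here.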
RE: Searching against Database
If you know ahead of time which documents are viewable by a certain user group, you could add a field, such as group, and when you index the document you put in the names of the user groups that are allowed to view that document. Then your query tool can append, for example, "AND group:developers" to the user's query. Then you will not have to merge results.

-Will

-----Original Message-----
From: Sergiu Gordea [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 15, 2004 2:58 AM
To: Lucene Users List
Subject: Re: Searching against Database

Hi,

I have a similar problem. I'm working on a web application in which the users have different permissions. Not all information stored in the index is public for all users. The documents in the index are identified by the same ID that the rows have in the database tables. I can get the IDs of the documents that can be accessed by the user, but if these are 1000, what will happen in Lucene? Is this a valid solution? Can anyone provide a better idea?

Thanks,
Sergiu

lingaraju wrote:

> Hello
>
> Even I am searching for the same code, as all my web display information is
> stored in a database.
> An early response will be very much helpful.
>
> Thanks and regards
> Raju
>
> ----- Original Message -----
> From: "Hetan Shah" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Thursday, July 15, 2004 5:56 AM
> Subject: Searching against Database
>
> > Hello All,
> >
> > I have got all the answers from this fantastic mailing list. I have
> > another question ;)
> >
> > What is the best way (Best Practices) to integrate Lucene with a live
> > database, Oracle to be more specific. Any pointers are really very much
> > appreciated.
> >
> > thanks guys.
> > -H
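The group-field suggestion above amounts to wrapping the user's query with a security clause. A hedged sketch of the string assembly (the field name "group" follows the suggestion; the method name is mine, and real code would also escape user input):

```java
import java.util.List;

public class GroupFilter {
    // Wrap the user's query so only documents indexed with one of the user's
    // groups can match; appended as a single AND clause, no result merging needed.
    static String restrict(String userQuery, List<String> groups) {
        StringBuilder sb = new StringBuilder("(").append(userQuery).append(") AND (");
        for (int i = 0; i < groups.size(); i++) {
            if (i > 0) sb.append(" OR ");
            sb.append("group:").append(groups.get(i));
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        System.out.println(restrict("title:lucene", List.of("developers", "qa")));
        // (title:lucene) AND (group:developers OR group:qa)
    }
}
```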
corrupt indexes?
Has anyone had any experience with their index getting corrupted? Are there any tools to repair it should it get corrupted? I have not had any problems, but was curious how resilient this data store seems to be.

-Will
RE: Field.java -> STORED, NOT_STORED, etc...
I have two suggestions:

1) Use Eclipse, or an IDE that shows the javadoc on mouseover.

2) If you are going to create constants, consider using bitflags. Then your constants can have power-of-two values, i.e.

    STORED = 1
    INDEXED = 2
    TOKENIZED = 4

Then you can have the constructor look like:

    new Field("name", "value", STORED + TOKENIZED)

The constructor would break that down bitwise.

-----Original Message-----
From: Kevin A. Burton [mailto:[EMAIL PROTECTED]
Sent: Sunday, July 11, 2004 5:05 AM
To: Lucene Users List
Subject: Field.java -> STORED, NOT_STORED, etc...

I've been working with the Field class doing index conversions between an old index format and my new external content store proposal (thus the email about the 14M convert).

Anyway... I find the whole Field.Keyword, Field.Text thing confusing. The main problem is that the constructor to Field just takes booleans, and if you forget the ordering of the booleans it's very confusing:

    new Field("name", "value", true, false, true);

So looking at that you have NO idea what it's doing without fetching javadoc. So I added a few constants to my class:

    new Field("name", "value", NOT_STORED, INDEXED, NOT_TOKENIZED);

which IMO is a lot easier to maintain. Why not add these constants to Field.java:

    public static final boolean STORED = true;
    public static final boolean NOT_STORED = false;
    public static final boolean INDEXED = true;
    public static final boolean NOT_INDEXED = false;
    public static final boolean TOKENIZED = true;
    public static final boolean NOT_TOKENIZED = false;

Of course you still have to remember the order, but this becomes a lot easier to maintain.

Kevin

--
Please reply using PGP.

    http://peerfear.org/pubkey.asc

    NewsMonster - http://www.newsmonster.org/

Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
AIM/YIM - sfburtonator, Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
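The bitflag suggestion above can be made concrete; note that combining flags with | rather than + is safer, since adding the same flag twice with + silently sets a different bit. A sketch with hypothetical constants (this is not Lucene's actual Field API):

```java
public class FieldFlags {
    // Hypothetical flag constants, as suggested above; powers of two so
    // they can be combined and tested independently.
    static final int STORED    = 1;
    static final int INDEXED   = 2;
    static final int TOKENIZED = 4;

    final String name, value;
    final boolean stored, indexed, tokenized;

    // The constructor decomposes the combined flags bitwise.
    FieldFlags(String name, String value, int flags) {
        this.name = name;
        this.value = value;
        this.stored    = (flags & STORED)    != 0;
        this.indexed   = (flags & INDEXED)   != 0;
        this.tokenized = (flags & TOKENIZED) != 0;
    }

    public static void main(String[] args) {
        FieldFlags f = new FieldFlags("name", "value", STORED | TOKENIZED);
        System.out.println(f.stored + " " + f.indexed + " " + f.tokenized); // true false true
    }
}
```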
RE: Problem with match on a non tokenized field.
I do not know how to work around that. It is indeed an interesting situation that would require more understanding as to how the analyzer (in this case NullAnalyzer) interacts with the special characters such as the * and ~. You could try using the whitespace analyzer instead of the nullanalyzer! -Will -Original Message- From: Polina Litvak [mailto:[EMAIL PROTECTED] Sent: Friday, July 09, 2004 4:45 PM To: 'Lucene Users List' Subject: RE: Problem with match on a non tokenized field. Thanks a lot for your help. I've done what you suggested and it works great except in this particular case: I am trying to search for something like "abc-ef*" - i.e. I want to find all fields that start with: "abc-ef". I use PerFieldAnalyzerWrapper together with NullAnalyzer to make sure this field doesn't get tokenized on the "-", but at the same time I need the analyzer to realize that '*' is the wildcard search, not part of the field value itself. Would you know how to work around this ? Thank you, Polina -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: July 8, 2004 1:10 PM To: [EMAIL PROTECTED] Subject: RE: Problem with match on a non tokenized field. The PerFieldAnalyzerWrapper is constructed with your default analyzer, suppose this is the analyzer you use to tokenize. You then call the addAnalyzer method for each non-tokenized/keyword fields. In the case below, url is a keyword, all other fields are tokenized: PerFieldAnalyzerWrapper analyzer = new org.apache.lucene.analysis.PerFieldAnalyzerWrapper(new MyAnalyzer()); analyzer.addAnalyzer("url", new NullAnalyzer()); query = QueryParser.parse(searchQuery,"contents",analyzer); -Original Message- From: Polina Litvak [mailto:[EMAIL PROTECTED] Sent: Thursday, July 08, 2004 10:19 AM To: 'Lucene Users List' Subject: RE: Problem with match on a non tokenized field. Thanks a lot for your help. 
I have one more question: how would you handle a query consisting of two fields combined with a Boolean operator, where one field is only indexed and stored (a Keyword) and the other is tokenized, indexed and stored? Is it possible to have parts of the same query analyzed with different analyzers? -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: July 7, 2004 4:38 PM To: [EMAIL PROTECTED] Subject: RE: Problem with match on a non tokenized field. Use org.apache.lucene.analysis.PerFieldAnalyzerWrapper. Here is how I use it: PerFieldAnalyzerWrapper analyzer = new org.apache.lucene.analysis.PerFieldAnalyzerWrapper(new MyAnalyzer()); analyzer.addAnalyzer("url", new NullAnalyzer()); try { query = QueryParser.parse(searchQuery, "contents", analyzer); -Original Message- From: Polina Litvak [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 07, 2004 4:20 PM To: [EMAIL PROTECTED] Subject: Problem with match on a non tokenized field. I have a Lucene Document with a field named Code which is stored and indexed but not tokenized. The value of the field is ABC5-LB. The only way I can match the field when searching is by entering Code:"ABC5-LB", because when I drop the quotes, every Analyzer I've tried breaks my query into Code:ABC5 -Code:LB. I need to be able to match this field by doing something like Code:ABC5-L*, so always using quotes is not an option. How would I go about writing my own analyzer that will not tokenize the query? Thanks, Polina
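To see why the unquoted query breaks, here is a pure-JDK sketch (no Lucene classes; the class and method names are illustrative, not Lucene APIs) contrasting a StandardAnalyzer-style tokenization, which splits on '-' and lowercases, with a NullAnalyzer/keyword-style one that keeps the whole value as a single token:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizeSketch {
    // Mimics an analyzer that lowercases and splits on non-alphanumerics,
    // the way StandardAnalyzer treats "ABC5-LB".
    static List<String> standardLike(String text) {
        return Arrays.asList(text.toLowerCase().split("[^a-z0-9]+"));
    }

    // Mimics a keyword/NullAnalyzer field: the entire value is one token.
    static List<String> keywordLike(String text) {
        return Arrays.asList(text);
    }

    public static void main(String[] args) {
        // The standard-style analyzer breaks the code into two terms,
        // so the single indexed keyword term "ABC5-LB" can never match.
        System.out.println(standardLike("ABC5-LB")); // [abc5, lb]
        System.out.println(keywordLike("ABC5-LB"));  // [ABC5-LB]
    }
}
```

This is why the query only matches when the untokenized field value is queried as one unit, and why the analyzer choice must be made per field.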
QueryParser and Keyword Fields
Can anyone give me advice on the best way to keep your keyword fields from being analyzed by QueryParser? Even though it seems like it would be a common problem, I have read the FAQ and found only this relevant thread, with no real answers: http://issues.apache.org/eyebrowse/[EMAIL PROTECTED] he.org&msgId=1235589 "QueryParser has some nasty habits of analyzing everything." Can't it be smart and not analyze fields that are keywords (i.e. not tokenized by the analyzer)? Thank you, Will
RE: Demo 3 on windows
Use forward slashes / instead of \ for your path: c:/apache/group/index or, if c: is your main drive, /apache/group/index. -Original Message- From: Hetan Shah [mailto:[EMAIL PROTECTED] Sent: Monday, June 21, 2004 5:55 PM To: [EMAIL PROTECTED] Subject: Demo 3 on windows Hello, I have been trying to build the index on my Windows machine with the following syntax and getting this message back from Lucene: java org.apache.lucene.demo.IndexHTML -create -index {index-dir} .. In my case it looks like: java org.apache.lucene.demo.IndexHTML -create -index c:\apache group\index .. and the message that I am getting is: Usage: IndexHTML [-create] [-index <index>] <root_directory> Any idea why I keep getting this message? TIA. -H
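Note that besides the backslashes, the space in "c:\apache group\index" is a likely culprit: an unquoted space makes the shell pass two separate arguments, so the demo's argument parsing fails and it prints its usage line. A small pure-JDK sketch (the helper name splitUnquoted is illustrative) of both points:

```java
import java.io.File;

public class PathArgs {
    // Mimics how a shell splits an unquoted command-line tail into args.
    static String[] splitUnquoted(String commandLineTail) {
        return commandLineTail.split(" ");
    }

    public static void main(String[] args) {
        // The unquoted space yields two args ("c:\apache" and "group\index"),
        // so IndexHTML sees a stray argument and prints its usage message.
        String[] broken = splitUnquoted("c:\\apache group\\index");
        System.out.println(broken.length); // 2

        // java.io.File accepts forward slashes regardless of platform.
        File f = new File("c:/apache/group/index");
        System.out.println(f.getName()); // index
    }
}
```

Quoting the path ("c:\apache group\index") or using a directory without spaces avoids the split; forward slashes then work fine inside Java.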
RE: search "" and ""
This depends on the analyzer you use. http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q13 -Original Message- From: Lynn Li [mailto:[EMAIL PROTECTED] Sent: Friday, June 18, 2004 5:03 PM To: '[EMAIL PROTECTED]' Subject: search "" and "" When I search "" or "", QueryParser parses them into "text". How can I make it not remove the angle brackets and slashes? Thank you in advance, Lynn
RE: help needed in starting lucene
It sounds to me like you need a newer version of Java. -Original Message- From: milind honrao [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 02, 2004 5:36 PM To: [EMAIL PROTECTED] Subject: help needed in starting lucene Hi, I am just a beginner. I installed Lucene according to the instructions provided and made all the changes to the environment variables. When I try to run the test program for building indexes with the following command: java org.apache.lucene.demo.IndexFiles test/Doc I get the following exception: Exception in thread "main" class java.lang.ExceptionInInitializerError: java.lang.RuntimeException: java.security.NoSuchAlgorithmException: MD5: Class not found.
RE: Problem Indexing Large Document Field
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH maxFieldLength (public int maxFieldLength): The maximum number of terms that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. Note that this effectively truncates large documents, excluding from the index terms that occur further in the document. If you know your source documents are large, be sure to set this value high enough to accommodate the expected size. If you set it to Integer.MAX_VALUE, then the only limit is your memory, but you should anticipate an OutOfMemoryError. By default, no more than 10,000 terms will be indexed for a field. -Original Message- From: Gilberto Rodriguez [mailto:[EMAIL PROTECTED] Sent: Wednesday, May 26, 2004 4:04 PM To: [EMAIL PROTECTED] Subject: Problem Indexing Large Document Field I am trying to index a field in a Lucene document with about 90,000 characters. The problem is that it only indexes part of the document. It seems to index only about 65,000 characters. So, if I search on terms that are at the beginning of the text, the search works, but it fails for terms that are at the end of the document. Is there a limitation on how many characters can be stored in a document field? Any help would be appreciated, thanks Gilberto Rodriguez Software Engineer 370 CenterPointe Circle, Suite 1178 Altamonte Springs, FL 32701-3451 407.339.1177 (Ext.112) phone 407.339.6704 fax [EMAIL PROTECTED] email www.conviveon.com web This e-mail contains legally privileged and confidential information intended only for the individual or entity named within the message.
If the reader of this message is not the intended recipient, or the agent responsible to deliver it to the intended recipient, the recipient is hereby notified that any review, dissemination, distribution or copying of this communication is prohibited. If this communication was received in error, please notify me by reply e-mail and delete the original message.
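The truncation behaviour described in the reply above can be sketched with plain JDK code (no Lucene classes; the class and method names are illustrative): only the first maxFieldLength terms of a field get indexed, so terms past the cutoff are simply not searchable.

```java
import java.util.HashSet;
import java.util.Set;

public class TruncatingIndexSketch {
    // Index at most maxFieldLength whitespace-separated terms of a field,
    // mirroring IndexWriter's per-field term limit.
    static Set<String> indexTerms(String text, int maxFieldLength) {
        String[] tokens = text.toLowerCase().split("\\s+");
        Set<String> indexed = new HashSet<>();
        for (int i = 0; i < tokens.length && i < maxFieldLength; i++) {
            indexed.add(tokens[i]);
        }
        return indexed;
    }

    public static void main(String[] args) {
        // 20,000 distinct "terms", but only the first 10,000 get indexed.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20000; i++) sb.append("term").append(i).append(' ');
        Set<String> indexed = indexTerms(sb.toString(), 10000);

        System.out.println(indexed.contains("term42"));    // true
        System.out.println(indexed.contains("term19999")); // false: truncated
    }
}
```

With an average English word around 5-6 characters plus a space, a 10,000-term limit lands near the ~65,000-character cutoff the original poster observed.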
RE: Memory usage
This sounds like a memory leak. If you are using Tomcat, I would suggest you make sure you are on a recent version, as version 4 is known to have some memory leaks. It doesn't make sense that repeated queries would use more memory than the most demanding single query unless objects are not getting freed from memory. -Will -Original Message- From: James Dunn [mailto:[EMAIL PROTECTED] Sent: Wednesday, May 26, 2004 3:02 PM To: [EMAIL PROTECTED] Subject: Memory usage Hello, I was wondering if anyone has had problems with memory usage and MultiSearcher. My index is composed of two sub-indexes that I search with a MultiSearcher. The total size of the index is about 3.7GB, with the larger sub-index being 3.6GB and the smaller being 117MB. I am using Lucene 1.3 Final with the compound file format. Also, I search across about 50 fields, but I don't use wildcard or range queries. Doing repeated searches in this way seems to eventually chew up about 500MB of memory, which seems excessive to me. Does anyone have any ideas where I could look to reduce the memory my queries consume? Thanks, Jim
RE: Performance profile of optimization...
My understanding is that hard drive IO is the main bottleneck, as the operation is mainly a file copy. So, to directly answer your question, I believe the overall file size of your indexes will linearly affect the performance profile of your optimizations. -Original Message- From: Michael Giles [mailto:[EMAIL PROTECTED] Sent: Monday, May 24, 2004 3:13 PM To: Lucene Users List Subject: Performance profile of optimization... What is the performance profile of optimizing an index? By that I mean, what are the primary variables that negatively impact its speed (i.e. index size (bytes, docs), number of adds/deletes since last optimization, etc.)? For example, if I add a single document to a small (i.e. < 10K docs) index and still have that index open (but would otherwise close it until the next update, a few minutes later), what type of a performance hit would optimizing the index be? Does that cost change as the index gets bigger, or is it tied to the number of changes that need to be rolled in? -Mike
RE: Rebuild after corruption
Make sure you close your IndexWriter. http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#close() -Original Message- From: Steve Rajavuori [mailto:[EMAIL PROTECTED] Sent: Friday, May 21, 2004 7:49 PM To: '[EMAIL PROTECTED]' Subject: Rebuild after corruption I have a problem periodically where the process updating my Lucene files terminates abnormally. When I try to open the Lucene files afterward, I get an exception indicating that files are missing. Does anyone know how I can recover at this point without having to rebuild the whole index from scratch?
RE: Searching Microsoft Word , Excel and PPT files for Japanese
I am not sure. See what Google gives you. I would guess you need to get a table of entities and compare it to the Unicode character. So if you parse the Word file you might see something like "&u12312;" (without quotes); this corresponds to a single Unicode character, and you can use the Java API to get that character. -Will -Original Message- From: Ankur Goel [mailto:[EMAIL PROTECTED] Sent: Thursday, May 20, 2004 1:18 PM To: 'Lucene Users List' Subject: RE: Searching Microsoft Word , Excel and PPT files for Japanese Hi, Can you tell me how to convert the Windows-1252 characters/entities to Unicode (UTF-8 or UTF-16)? Sorry, I am new to this. Looks like first I will have to parse the text out of these files. I tried Jakarta POI also, but for Japanese it was also not very good. Regards, Ankur -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Thursday, May 20, 2004 10:43 PM To: [EMAIL PROTECTED] Subject: RE: Searching Microsoft Word , Excel and PPT files for Japanese I believe MS apps store non-ASCII characters as entities internally instead of using Unicode. You can see evidence of this if you save your file as an HTML file and look at the source. You will have to adjust your parser to convert the Windows-1252 characters/entities to Unicode (UTF-8 or UTF-16). -Will -Original Message- From: Ankur Goel [mailto:[EMAIL PROTECTED] Sent: Thursday, May 20, 2004 1:10 PM To: 'Lucene Users List' Subject: Searching Microsoft Word , Excel and PPT files for Japanese Hi, I am using the CJK Tokenizer for searching Japanese documents. I am able to search Japanese documents which are text files, but I am not able to search Microsoft Word and Excel files with content in Japanese. Can you tell me how I can search Japanese content in Microsoft Word, Excel and PPT files?
Thanks, Ankur -Original Message- From: Ankur Goel [mailto:[EMAIL PROTECTED] Sent: Sunday, April 04, 2004 1:36 AM To: 'Lucene Users List' Subject: RE: Boolean Phrase Query question Thanks, Erik, for the solution. I have the fileName field because I have to give the end user the facility to search on file name also. That's why I am using a Text field for fileName as well. "By using true on the finalQuery.add calls, you have said that both fields must have the word "temp" in them. Is that what you meant? Or did you mean an OR type of query?" I need an OR type of query. I mean the word can be in the file name or in the contents of the file, but I am not able to do this. Can you tell me how to do it? Regards, Ankur -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Sunday, April 04, 2004 1:27 AM To: Lucene Users List Subject: Re: Boolean Phrase Query question On Apr 3, 2004, at 12:13 PM, Ankur Goel wrote: > > Hi, > I have to provide a functionality which provides search on both file > name and contents of the file. > > For indexing I use the following code: > > > org.apache.lucene.document.Document doc = new org.apache. > lucene.document.Document(); > doc.add(Field.Keyword("fileId","" + document.getFileId())); > doc.add(Field.Text("fileName", fileName)); > doc.add(Field.Text("contents", new FileReader(new File(fileName)))); I'm not sure what you plan on doing with the fileName field, but you probably want to use a Keyword field for it. And you may want to glue the file name and contents together into a single field to facilitate searches that span both.
(Be sure to put a space in between if you do this.) > For searching a text, say "temp", I use the following code to look both > in the file name and the contents of the file: > > BooleanQuery finalQuery = new BooleanQuery(); > Query titleQuery = QueryParser.parse("temp","fileName",analyzer); > Query mainQuery = QueryParser.parse("temp","contents",analyzer); > > finalQuery.add(titleQuery, true, false); > finalQuery.add(mainQuery, true, false); > > Hits hits = is.search(finalQuery); By using true on the finalQuery.add calls, you have said that both fields must have the word "temp" in them. Is that what you meant? Or did you mean an OR type of query? Erik
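The entity-to-Unicode conversion suggested earlier in this thread can be sketched with plain JDK regex code. Note the "&u12312;" form quoted above is not standard HTML; the sketch below handles only the standard decimal numeric character reference form "&#NNNN;" and is an assumption about what the parser would actually encounter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoder {
    // Matches decimal numeric character references such as "&#12354;".
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d+);");

    // Replace each numeric reference with the Unicode character it names.
    static String decode(String text) {
        Matcher m = DECIMAL_REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            m.appendReplacement(out,
                Matcher.quoteReplacement(new String(Character.toChars(codePoint))));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&#12354;"));    // U+3042, HIRAGANA LETTER A
        System.out.println(decode("plain ascii")); // unchanged
    }
}
```

Once the entities are decoded to real Unicode characters, the CJK tokenizer sees the same text it would get from a plain UTF-8 file.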
RE: AW: Problem indexing Spanish Characters
Here is an example method in org.apache.lucene.demo.html.HTMLParser that uses a different buffered reader for a different encoding:

public Reader getReader() throws IOException {
  if (pipeIn == null) {
    pipeInStream = new MyPipedInputStream();
    pipeOutStream = new PipedOutputStream(pipeInStream);
    pipeIn = new InputStreamReader(pipeInStream);
    pipeOut = new OutputStreamWriter(pipeOutStream);
    // check the first bytes for the FFFE marker; if it is there,
    // we know the input is UTF-16 encoded
    if (useUTF16) {
      try {
        pipeIn = new BufferedReader(new InputStreamReader(pipeInStream, "UTF-16"));
      } catch (Exception e) {
      }
    }
    Thread thread = new ParserThread(this);
    thread.start(); // start parsing
  }
  return pipeIn;
}

-Original Message- From: Martin Remy [mailto:[EMAIL PROTECTED] Sent: Wednesday, May 19, 2004 2:09 PM To: 'Lucene Users List' Subject: RE: AW: Problem indexing Spanish Characters The tokenizers deal with Unicode characters (CharStream, char), so the problem is not there. This problem must be solved at the point where the bytes from your source files are turned into CharSequences/Strings, i.e. by connecting an InputStreamReader to your FileInputStream (or whatever you're using) and specifying "UTF-8" (or whatever encoding is appropriate) in the InputStreamReader constructor. You must either detect the encoding from HTTP headers or XML declarations or, if you know that it's the same for all of your source files, just hardcode UTF-8, for example. Martin -Original Message- From: Hannah c [mailto:[EMAIL PROTECTED] Sent: Wednesday, May 19, 2004 10:35 AM To: [EMAIL PROTECTED] Subject: RE: AW: Problem indexing Spanish Characters Hi, I had a quick look at the sandbox, but my problem is that I don't need a Spanish stemmer. However, there must be a replacement tokenizer that supports foreign characters to go along with the foreign-language Snowball stemmers. Does anyone know where I could find one? In answer to Peter's question: yes, I'm also using "UTF-8" encoded XML documents as the source.
I also put below an example of what happens when I tokenize the text using the StandardTokenizer. Thanks, Hannah --- text I'm trying to index: century palace known as la "Fundación Hospital de Na. Señora del Pilar" --- tokens output from StandardTokenizer: century | palace | known | as | la | Fundaci | n | Hospital | de | Na | Se | ora | del | Pilar (the words are cut at the accented characters) --- >From: "Peter M Cipollone" <[EMAIL PROTECTED]> >To: <[EMAIL PROTECTED]> >Subject: Re: Problem indexing Spanish Characters >Date: Wed, 19 May 2004 11:41:28 -0400 > >Could you send some sample text that causes this to happen? > >- Original Message - >From: "Hannah c" <[EMAIL PROTECTED]> >To: <[EMAIL PROTECTED]> >Sent: Wednesday, May 19, 2004 11:30 AM >Subject: Problem indexing Spanish Characters > > Hi, > > I am indexing a number of English articles on Spanish resorts. As such, there are a number of Spanish characters throughout the text; most of these are in the place names, which are the type of words I would like to use as queries. My problem is with the StandardTokenizer class, which cuts the word in two when it comes across any of the Spanish characters. I had a look at the source, but the code was generated by JavaCC and so is not very readable. I was wondering if there was a way around this problem, or which area of the code I would need to change to avoid this. > > Thanks > > Hannah Cumming >From: PEP AD Server Administrator ><[EMAIL PROTECTED]> >Reply-To: "Lucene Users List" <[EMAIL PROTECTED]> >To: "'Lucene Users List'" <[EMAIL PROTECTED]> >Subject: AW: Problem indexing Spanish Characters >Date: Wed, 19 May 2004 18:08:56 +0200 > >Hi Hannah, Otis >I cannot help, but I have exactly the same problems with special German >characters.
I used the Snowball analyser, but this does not help because the >problem (tokenizing) appears before the analyser comes into action. >I just posted the question "Problem tokenizing UTF-8 with german umlauts" >some minutes ago, which describes my problem; it and Hannah's seem to be similar. >Do you also have UTF-8 encoded pages? > >Pet
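The BOM check the HTMLParser snippet above alludes to can be sketched with plain JDK streams: peek at the first two bytes, and if they are a UTF-16 byte-order mark (FE FF or FF FE), open the reader as UTF-16; otherwise fall back to a default encoding. Class and method names here are illustrative, not the demo's actual API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class BomSniffer {
    // Peek at the first two bytes; choose UTF-16 if a BOM is present.
    static Reader openReader(InputStream in, String fallbackEncoding)
            throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 2);
        byte[] bom = new byte[2];
        int n = pin.read(bom);
        if (n > 0) pin.unread(bom, 0, n); // push the bytes back either way
        boolean utf16 = n == 2
            && ((bom[0] == (byte) 0xFE && bom[1] == (byte) 0xFF)
             || (bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE));
        return new InputStreamReader(pin, utf16 ? "UTF-16" : fallbackEncoding);
    }

    // Convenience wrapper for demonstration: first decoded char of a buffer.
    static int firstChar(byte[] bytes, String fallbackEncoding) {
        try {
            return openReader(new ByteArrayInputStream(bytes), fallbackEncoding).read();
        } catch (IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        // UTF-16LE bytes for "A" with a BOM; the decoder consumes the BOM.
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'A', 0};
        System.out.println((char) firstChar(utf16le, "UTF-8")); // A
    }
}
```

Java's "UTF-16" charset itself detects and consumes the BOM, so the reader hands back clean characters; the same idea extends to sniffing the UTF-8 BOM (EF BB BF) if your sources carry one.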
Can documents be appended to?
Is it possible to append to an existing document? Judging by my own tests and this thread, no: http://issues.apache.org/eyebrowse/[EMAIL PROTECTED] he.org&msgNo=3971 Wouldn't it be possible to look up an individual document (based upon a uid of sorts), load the Fields off the old one, delete it, then add the new document? Is there any hope of doing this efficiently? This would run into problems when merging indexes: you would get duplicates if a document existed in more than one of your original indexes. Thank you, Will
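The delete-then-add pattern described above can be sketched with a plain JDK map standing in for the index (no Lucene classes; all names are illustrative): an update keyed on a uid removes the old document first, which is also what prevents the duplicate problem when the same uid would otherwise end up in the store twice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UidStoreSketch {
    // uid -> document fields; stands in for an index keyed on a uid field.
    private final Map<String, Map<String, String>> docsByUid =
        new LinkedHashMap<>();

    // "Update" = delete the old version (if any), then add the new one.
    void update(String uid, Map<String, String> fields) {
        docsByUid.remove(uid);
        docsByUid.put(uid, fields);
    }

    int size() {
        return docsByUid.size();
    }

    public static void main(String[] args) {
        UidStoreSketch store = new UidStoreSketch();
        store.update("doc-1", Map.of("title", "first version"));
        store.update("doc-1", Map.of("title", "second version"));
        System.out.println(store.size()); // 1: no duplicate for the same uid
    }
}
```

In a real index the delete-by-uid step is what a merge of independent indexes lacks, which is exactly why duplicates appear there: nothing removes the older copy before the newer one is added.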