Re: UNIX command-line indexing script?
Have a look at the Ant <index> task in the Lucene sandbox. You're on your own, currently, to build this and understand it, but I use it frequently. In fact, the sample index from our book is generated with this:

    <index index="${build.dir}/index"
           documenthandler="lia.common.TestDataDocumentHandler">
      <fileset dir="${data.dir}"/>
      <config basedir="${data.dir}"/>
    </index>

You can plug in your own DocumentHandler implementation to index different document types however you like. The default one indexes .txt and .html files, but a custom implementation can do its own thing. Again, writing a DocumentHandler that knows about various document types is not hard, but you will have to write your own at the moment.

Despite the (minor) amount of work you'll have to do to start using <index>, the infrastructure adds a lot of value: an incremental file system indexer (only new docs get indexed on successive runs). Plugging this into cron would be trivial.

Erik

On Mar 13, 2004, at 11:45 AM, Charlie Smith wrote:
> Anyone written a simple UNIX command-line indexing script which will read a bunch of different kinds of docs and index them? I'd like to make a cron job out of this so as to be able to come back and read it later during a search. PERL or JAVA script would be fine.

- To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
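For readers wondering what "only new docs get indexed on successive runs" could look like, here is a minimal sketch of the idea only. It is not the Ant task's code, and the thread does not say how the task decides which files are new; the sketch simply assumes a comparison of each file's modification time against the time of the previous run. The class name IncrementalScan and the method changedSince are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of Lucene or of the sandbox Ant task. */
final class IncrementalScan {

    /**
     * Collects the files under dir that were modified after lastRunMillis,
     * recursing into subdirectories. Assumption: "new" means "modified since
     * the previous run".
     */
    static List<File> changedSince(File dir, long lastRunMillis) {
        List<File> changed = new ArrayList<>();
        File[] entries = dir.listFiles();
        if (entries == null) {
            return changed; // not a directory, or not readable
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                changed.addAll(changedSince(entry, lastRunMillis));
            } else if (entry.lastModified() > lastRunMillis) {
                changed.add(entry);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        File dataDir = new File(args.length > 0 ? args[0] : ".");
        long lastRun = args.length > 1 ? Long.parseLong(args[1]) : 0L;
        // A real indexer would hand each of these files to a document handler.
        for (File f : changedSince(dataDir, lastRun)) {
            System.out.println(f.getPath());
        }
    }
}
```

Run from cron with the data directory and the previous run's timestamp in milliseconds, this prints only the files that would need indexing again.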
Re: Reader Text input as field for HTML data text leading to null retrieval
Re-directing this message to lucene-user list.

That is the correct behaviour. Use
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.lang.String)
if you want to be able to retrieve the original value of the indexed text.

Otis

--- jitender ahuja [EMAIL PROTECTED] wrote:
> I am working to make an index using Lucene over HTML files. I intend to use the Reader as the type of the text field so as to not store the Html files verbatim in the index. But the data retrieval yields null as the text retrieved. However, if I do not use the Reader class as the Text field type, then I get whole file back. Also, the index directory size is nearly four times more now.
>
> The indexer code that deals with the Reader data type is:
>
>     public class IndexData {
>
>         protected static final String INDEX_FOLDER = "C:\\Temp\\DB_GT11";
>
>         public static void main(String[] args) {
>             try {
>                 IndexData objDBdex = new IndexData();
>                 boolean createDex = !objDBdex.indexExists();
>                 IndexWriter writ = new IndexWriter(INDEX_FOLDER, new StandardAnalyzer(), createDex);
>                 for (int i = 0; i < args.length; i++) {
>                     System.out.println("Indexing File " + args[i]);
>                     InputStream is = new FileInputStream(args[i]);
>                     Document doc = new Document();
>                     doc.add(Field.UnIndexed("path", args[i]));
>                     BufferedReader rdr = new BufferedReader((Reader) new InputStreamReader(is));
>                     StringBuffer fileBuffer = new StringBuffer();
>                     String line;
>                     while ((line = rdr.readLine()) != null) {
>                         fileBuffer.append(line);
>                     }
>                     System.out.println("File contents from buffer: ");
>                     System.out.println(fileBuffer.toString());
>                     StringReader ab = new StringReader(fileBuffer.toString());
>                     doc.add(Field.Text("body", (Reader) ab));
>                     writ.addDocument(doc);
>                     is.close();
>                 }
>                 writ.close();
>             } catch (IOException ex) {
>                 ex.printStackTrace();
>             }
>         }
>
>         public boolean indexExists() {
>             return false;
>         }
>     }
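Following Otis's pointer to Field.Text(String, String), the posted indexer would need the file contents as a String rather than a Reader. The sketch below shows one way to build that String using only the JDK; the class name ReadBody and the method readAll are hypothetical, and the Lucene call appears only as a comment so the example runs without Lucene on the classpath. It also keeps a newline between lines, on the assumption that the last word of one line should not run into the first word of the next, which plain append(line) would cause.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical helper for turning a file's contents into a String. */
final class ReadBody {

    /**
     * Reads everything from the reader, appending a newline after each line
     * so that words on adjacent lines stay separate tokens.
     */
    static String readAll(Reader in) {
        BufferedReader reader = new BufferedReader(in);
        StringBuilder text = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: ReadBody <file>");
            return;
        }
        try (Reader in = new FileReader(args[0])) {
            String body = readAll(in);
            // With Lucene on the classpath, the String variant from the thread
            // would be used here instead of the Reader variant:
            // doc.add(Field.Text("body", body));
            System.out.println(body.length() + " characters read");
        }
    }
}
```

The trade-off is the one the original poster already observed: keeping the text retrievable means the index holds a copy of it, so the index directory grows.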
Re: UNIX command-line indexing script?
To add to this: the upcoming Lucene in Action book has ready-to-use code that will handle and index files in most popular file formats.

Otis
Re: UNIX command-line indexing script?
So, how upcoming is this book going to be?

[EMAIL PROTECTED] 3/15/2004 3:39:39 AM
> To add to this. The upcoming Lucene in Action book has ready to use code that will handle and index files in most popular file formats.
>
> Otis
java.io.IOException: Lock obtain timed out
I am using Lucene 1.3 final and am having an error that I can't seem to shake. Basically, I am updating a Document in the index incrementally by calling an IndexReader to remove the document. This works. Then, I close the IndexReader with the following code:

    reader.unlock(reader.directory());
    reader.close();

I put the first of the two lines in to try to force the lock to release. According to the logging, this code is being called and the IndexReader is being closed. However, when I then open a writer to add the document, I get the following:

    java.io.IOException: Lock obtain timed out
        at org.apache.lucene.store.Lock.obtain(Lock.java:97)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:173)
        at ...

I open the writer by calling:

    return new IndexWriter(INDEX_DIR, analyzer, false);

where analyzer = new StandardAnalyzer();

I get the reader by calling:

    IndexReader reader = IndexReader.open(INDEX_DIR);

Thanks for any help,
Gabe
Re: java.io.IOException: Lock obtain timed out
There is no need for that .unlock call, just .close()

Otis
Re: UNIX command-line indexing script?
Erik and I are putting finishing touches on it, so by Summer (this one ;)).

Otis

--- Charlie Smith [EMAIL PROTECTED] wrote:
> So, how upcoming is this book going to be?
Re: java.io.IOException: Lock obtain timed out
Otis,

I only put the unlock call in because I had the error in the first place. Removing it, the IOException still occurs, when trying to instantiate the IndexWriter.

Thanks, Gabe

--- Otis Gospodnetic [EMAIL PROTECTED] wrote:
> There is no need for that .unlock call, just .close()
>
> Otis
RE: java.io.IOException: Lock obtain timed out
Did you close your writer if an Exception occurred? I had a similar problem, but it was fixed when I closed the writer in a finally block.

Below is my original code (which generates "java.io.IOException: Lock obtain timed out" when an Exception is thrown):

    public static void index(File indexDir, List cList, boolean overwrite) throws Exception {
        IndexWriter writer = null;
        try {
            writer = new IndexWriter(indexDir, new MyAnalyzer(), overwrite);
            // index documents
        } catch (Exception e) {
            writer = new IndexWriter(indexDir, new MyAnalyzer(), true);
            try {
                // index documents
            } catch (Exception ee) {
                throw ee;
            }
        }
        writer.close(); // never reaches this statement if the catch block is called.
    }

    // revised code to force a close on the IndexWriter
    public static void index(File indexDir, List cList, boolean overwrite) throws Exception {
        IndexWriter writer = null;
        try {
            writer = new IndexWriter(indexDir, new MyAnalyzer(), overwrite);
            // index documents
            writer.close();
        } catch (Exception e) {
            writer = new IndexWriter(indexDir, new MyAnalyzer(), true);
            try {
                // index documents
            } catch (Exception ee) {
                throw ee;
            } finally {
                writer.close();
            }
        }
    }

-----Original Message-----
From: Gabe [mailto:[EMAIL PROTECTED]
Sent: Monday, March 15, 2004 1:53 PM
To: Lucene Users List
Subject: Re: java.io.IOException: Lock obtain timed out

Otis, I only put the unlock call in because I had the error in the first place. Removing it, the IOException still occurs, when trying to instantiate the IndexWriter.

Thanks, Gabe
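The close-in-finally idea from the message above can be tried out without Lucene. The following is a minimal sketch under stated assumptions: LockingWriter is a hypothetical stand-in for an index writer that holds a lock file from construction until close(), and the file name demo.lock and the error text are made up for the demonstration. It does not reproduce how Lucene manages its locks; it only shows that closing in a finally block releases the lock on every path, so a later writer can be opened.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

/** Demonstrates releasing a lock in finally; all names here are hypothetical. */
final class LockDemo {

    /** Stand-in for an index writer: owns a lock file until close() is called. */
    static final class LockingWriter {
        private final File lock;

        LockingWriter(File indexDir) throws IOException {
            lock = new File(indexDir, "demo.lock");
            if (!lock.createNewFile()) {
                // Another writer was never closed, so the lock is still held.
                throw new IOException("Lock obtain timed out: " + lock);
            }
        }

        void add(String doc) throws IOException {
            if (doc == null) {
                throw new IOException("cannot index a null document");
            }
        }

        void close() {
            lock.delete();
        }
    }

    /** Indexes all docs and closes the writer whether or not indexing fails. */
    static void index(File indexDir, String[] docs) throws IOException {
        LockingWriter writer = new LockingWriter(indexDir);
        try {
            for (String doc : docs) {
                writer.add(doc);
            }
        } finally {
            writer.close(); // runs on success and on failure
        }
    }

    public static void main(String[] args) throws IOException {
        File indexDir = Files.createTempDirectory("lockdemo").toFile();
        try {
            index(indexDir, new String[] {"first", null}); // second document fails
        } catch (IOException e) {
            System.out.println("indexing failed: " + e.getMessage());
        }
        // The lock was released in finally, so opening a new writer works.
        index(indexDir, new String[] {"first", "second"});
        System.out.println("second run completed");
    }
}
```

Removing the finally block and calling close() only after the loop makes the second call to index fail with the simulated lock error, which mirrors the symptom described in this thread.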
RE: java.io.IOException: Lock obtain timed out
I notice that in your catch clause you always create the writer with the create flag set to true (i.e. new IndexWriter(INDEX_DIR, analyzer, true)). If I am not mistaken in reading the docs, this overwrites the entire index, no? That is why I was setting that flag to false when doing an incremental update. When I reindex all documents, I have had no problem.

Gabe
RE: java.io.IOException: Lock obtain timed out
I figured it out: an errant open IndexWriter.
Can Lucene index both Big5 and GB2312 encoded characters?
If I have both Big5 and GB2312 encoded HTML files in two separate directories, is Lucene able to distinguish the character sets when I build the index? Or does Lucene only work with a single encoding?

Thank you.