Re: Storing whole documents in the index

2007-03-19 Thread Karel Tejnora
Storing documents (especially large ones) outside the index is better than storing them in the index: every segment merge or optimize copies that data. Storing them in the index is possible, but it takes 1-4x more space, and depending on the read/write speed of the filesystem, merges and optimizes take longer. Karel
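
A minimal sketch of that pattern (Lucene 2.x-era API, untested; the field names and index path are illustrative): only a small pointer into the external store is kept as a stored field, so merges and optimizes never copy the large body.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class ExternalStoreIndexer {
        public static void addDoc(IndexWriter writer, String key, String body)
                throws Exception {
            Document doc = new Document();
            // stored, not analyzed: the pointer into the external store
            doc.add(new Field("key", key, Field.Store.YES, Field.Index.UN_TOKENIZED));
            // analyzed but NOT stored: merges/optimizes never copy the body
            doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }

        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/tmp/idx", new StandardAnalyzer(), true);
            addDoc(writer, "doc-42", "...large document text...");
            writer.close();
        }
    }

At display time the "key" field is read from the hit and the body is fetched from the filesystem or database.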

Re: StandardAnalyzer Problem with Apostrophes

2006-11-14 Thread Karel Tejnora
The problem is in StandardTokenizer, so use an Analyzer with this method:

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new LowerCaseTokenizer(reader);
        result = new StopFilter(result, stopSet);
        return result;
    }

If you need everything the standard analyzer does
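
A self-contained version of that analyzer might look like this (Lucene 2.x-era API; the class name is illustrative and the stop set is assumed to be the default English one):

    import java.io.Reader;
    import java.util.Set;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseTokenizer;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class NoApostropheAnalyzer extends Analyzer {
        // reuse the default English stop words
        private static final Set stopSet =
            StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);

        public TokenStream tokenStream(String fieldName, Reader reader) {
            // LowerCaseTokenizer splits on non-letters, so an apostrophe
            // simply ends the token instead of being kept inside it
            TokenStream result = new LowerCaseTokenizer(reader);
            result = new StopFilter(result, stopSet);
            return result;
        }
    }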

Re: StandardAnalyzer Problem with Apostrophes

2006-11-14 Thread Karel Tejnora
The apostrophe is recognized as part of a word; the standard analyzer is mostly English-oriented. One way around it is to swap apostrophes, replacing the "normal" one with an unusual character, before analysis. See StandardAnalyzer.java line 40-44: APOSTROPHE: token = jj_consume_token(APOSTROPHE);
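
A minimal sketch of the pre-analysis swap (this uses a plain space for predictability rather than the "unusual" character suggested above; field name is illustrative):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class ApostropheSwap {
        public static TokenStream analyze(StandardAnalyzer analyzer, String text) {
            // swap the "normal" apostrophe before tokenizing, so it is
            // never matched by the tokenizer's APOSTROPHE rule
            String swapped = text.replace('\'', ' ');
            return analyzer.tokenStream("body", new StringReader(swapped));
        }
    }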

Re: Searching Problem

2006-10-26 Thread Karel Tejnora
No. An IndexReader works on a snapshot of the index; never closing and reopening the IndexReader means old files are not deleted (on Windows you get an exception, on Linux they simply are not freed). Is it possible to get all the matching documents in the result without restarting the Searcher program?
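
A sketch of the close-and-reopen pattern implied here (Lucene 2.x-era API; the holder class and index path are illustrative, and thread coordination is simplified):

    import org.apache.lucene.search.IndexSearcher;

    public class SearcherHolder {
        private IndexSearcher searcher;
        private final String indexPath;

        public SearcherHolder(String indexPath) throws Exception {
            this.indexPath = indexPath;
            this.searcher = new IndexSearcher(indexPath);
        }

        // call after the index has been updated to see the new documents
        public synchronized void reopen() throws Exception {
            IndexSearcher old = searcher;
            searcher = new IndexSearcher(indexPath);
            old.close();  // without this, deleted files linger (esp. on Windows)
        }

        public synchronized IndexSearcher get() { return searcher; }
    }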

Re: Advantage of putting lucene index in RDBMS

2006-10-06 Thread Karel Tejnora
One thought: generally, using an RDBMS for the STORED fields is a good idea, because every segment merge / optimize copies that data once or twice (cfs). I'm thinking about putting STORED fields in an extra file and putting pointers in the cfs. Delete would just mark the document as deleted. And a new operation optimize_

Re: Sudden FileNotFoundException

2006-10-04 Thread Karel Tejnora
I once got the same problem and, judging by Jira, I am not alone. I deleted the index and rebuilt it from source, and the problem was gone. I'm unable to reproduce it. Are you able to reproduce the problem? Karel java.io.FileNotFoundException: /lucene-indexes/mediafragments/_8km.fnm (No such file or directory)

Re: best way indexing user queries

2006-09-07 Thread Karel Tejnora
This was discussed before; it's more a relational-DB task than a Lucene one. A simple approach is to get the list of terms from your queries and store the document - query - term relation. I have around 1.6e10 query-terms in PostgreSQL, and with a proper index a select takes around 0.6 ms (clustered, vacuumed, analyzed), 300
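
A minimal sketch of that approach (the table layout, JDBC wiring, and analyzer-based term extraction are assumptions, not the poster's actual schema):

    import java.io.StringReader;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class QueryTermStore {
        // store one row per (query, term); documents can be joined in later
        public static void storeQueryTerms(Connection con, long queryId, String queryText)
                throws Exception {
            PreparedStatement ps = con.prepareStatement(
                "INSERT INTO query_term (query_id, term) VALUES (?, ?)");
            TokenStream ts = new StandardAnalyzer()
                .tokenStream("q", new StringReader(queryText));
            // old-style TokenStream API: next() returns null at end
            for (Token t = ts.next(); t != null; t = ts.next()) {
                ps.setLong(1, queryId);
                ps.setString(2, t.termText());
                ps.executeUpdate();
            }
            ps.close();
        }
    }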

Re: Highlighting "really" found terms

2006-09-05 Thread Karel Tejnora
Not for now, but I'd like to contribute span support soon. Karel An alternative highlighter implementation was recently contributed here: http://issues.apache.org/jira/browse/LUCENE-644?page=all I've not had the time to study this alternative in detail (I hope to soon) so I can't say if it wi

Re: Change index structure

2006-08-23 Thread Karel Tejnora
Yes, it is possible. But UNSTORED fields can only become UNSTORED again, and you cannot change the TERMs in them. If you have an SQL db, I have neat code for doing this.

Re: Split an existing index into smaller segments without a re-index?

2006-08-17 Thread Karel Tejnora
Depends.
0) optimize the big index
1) in the big index, delete all documents except those for one part of the index
2) use addIndexes on an IndexWriter on the destination dir (empty)
3) delete segments.del in the big index directory (segments.del is just a serialized BitVector)
4) repeat for another set
do not mak
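
A rough sketch of steps 1-2 (Lucene 2.x-era API; the keepInPart() predicate and "part" field are placeholders you would supply, and note that step 1 really deletes from the big index until step 3 restores it):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class IndexSplitter {
        public static void copyPart(String bigIndex, String destDir) throws Exception {
            // step 1: delete everything that does NOT belong to this part
            IndexReader reader = IndexReader.open(bigIndex);
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (!reader.isDeleted(i) && !keepInPart(reader.document(i))) {
                    reader.deleteDocument(i);
                }
            }
            reader.close();

            // step 2: merge the surviving documents into the empty destination;
            // the merge drops documents marked as deleted
            IndexWriter writer = new IndexWriter(destDir, new StandardAnalyzer(), true);
            writer.addIndexes(new Directory[] { FSDirectory.getDirectory(bigIndex, false) });
            writer.close();
            // step 3 (undoing the deletions in the big index) is the
            // file-level trick described above
        }

        private static boolean keepInPart(Document doc) {
            return "1".equals(doc.get("part"));  // placeholder predicate
        }
    }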

Re: updating document

2006-08-15 Thread Karel Tejnora
I'm sending a snippet of code showing how to reconstruct UNSTORED fields. It has two parts: DB + terms.

    Class.forName("org.postgresql.Driver").newInstance();
    con = DriverManager.getConnection("jdbc:postgresql:lucene", "lucene", "lucene");
    PreparedStatement psCompany = con.prepareStatemen
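
The snippet above is cut off; a fuller sketch of the idea (re-supplying the unstored text from the database when re-adding a document; schema and field names are illustrative, not the poster's code):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class UnstoredUpdate {
        public static void update(Connection con, String indexPath, String id)
                throws Exception {
            // 1) fetch the full text back from the DB
            PreparedStatement ps =
                con.prepareStatement("SELECT body FROM docs WHERE id = ?");
            ps.setString(1, id);
            ResultSet rs = ps.executeQuery();
            rs.next();
            String body = rs.getString(1);
            rs.close();
            ps.close();

            // 2) delete the old version, then re-add with the recovered text
            IndexReader reader = IndexReader.open(indexPath);
            reader.deleteDocuments(new Term("id", id));
            reader.close();

            IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
            Document doc = new Document();
            doc.add(new Field("id", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.close();
        }
    }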

Re: updating document

2006-08-15 Thread Karel Tejnora
Well, you can have it! :-) Though I have not tested it, it's just an idea. You can get the document id after the add (numDocs()) and then insert into the DB; if the DB insert fails, you can delete the document from the RAMDir. Or, in my case of batches, I'm adding documents to the DB with a savepoint, then create a clear index (create=true), and at the end if
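
A sketch of that first idea (untested, as the poster says; insertIntoDb() is a placeholder, and the RAMDirectory is assumed to already hold an index):

    import java.sql.Connection;
    import java.sql.SQLException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.RAMDirectory;

    public class TxAdd {
        public static void addBoth(RAMDirectory dir, Connection con, Document doc)
                throws Exception {
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
            writer.addDocument(doc);
            int docId = writer.docCount() - 1;  // id of the document just added
            writer.close();
            try {
                insertIntoDb(con, doc);  // placeholder DB insert
            } catch (SQLException e) {
                // DB failed: remove the document from the RAM index again
                IndexReader reader = IndexReader.open(dir);
                reader.deleteDocument(docId);
                reader.close();
                throw e;
            }
        }

        private static void insertIntoDb(Connection con, Document doc)
                throws SQLException {
            // placeholder for the real INSERT
        }
    }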

Re: updating document

2006-08-11 Thread Karel Tejnora
Jason is right. I think, though I'm not an expert on Lucene either, that your newly added document can't recreate terms for a field with an analyzer, because the field text is empty. There is a very hairy solution: hack an IndexReader and FieldInfosWriter and use addIndexes. Lucene is "only" a fulltext search library, n

Re: updating document

2006-08-10 Thread Karel Tejnora
Hi, I'm facing a similar problem. I found a possible way to copy a part of an index (without copying the whole index, deleting, optimizing), but I don't know how to change/add/remove a field (or add a term vector, in my case) in an existing index. To copy a part of an index, override methods in IndexReader: /** Returns

Re: IndexWriter.addIndexes & optimizatio

2006-06-27 Thread Karel Tejnora
It depends on the document type; look at the setOmitNorms method in the Field class. heritrix.lucene wrote: Hi, Approx 50 million I have processed up to now. I kept maxMergeFactor and maxBufferedDocs value 1000. This value I got after several rounds of test runs. Indexing rate for each document in 50 M, is
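
For reference, turning norms off per field looks roughly like this (Lucene 2.x-era API; field name is illustrative; skipping norms saves one byte per document per indexed field):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class NoNormsField {
        public static void addBody(Document doc, String text) {
            Field body = new Field("body", text, Field.Store.NO, Field.Index.TOKENIZED);
            body.setOmitNorms(true);  // no length/boost norms for this field
            doc.add(body);
        }
    }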

Re: IndexSearcher in Servlet

2006-06-27 Thread Karel Tejnora
A singleton pattern is better; then you can extend it to a proxy pattern. existing IndexReader really isn't that expensive and does get around
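
A minimal sketch of such a singleton for a servlet environment (the index path is illustrative and error handling is omitted):

    import org.apache.lucene.search.IndexSearcher;

    public class SearcherSingleton {
        private static IndexSearcher instance;

        // one shared searcher for all servlet threads
        public static synchronized IndexSearcher get() throws Exception {
            if (instance == null) {
                instance = new IndexSearcher("/path/to/index");
            }
            return instance;
        }
    }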

Re: Searching multiple indexes

2006-06-23 Thread Karel Tejnora
Hi, there are two ways. The first is to use MultiFieldQueryParser http://lucene.apache.org/java/docs/api/org/apache/lucene/queryParser/MultiFieldQueryParser.html or do an extra step at indexing time to build a new field as the join of those fields (e.g. StringBuffer append f1 append f2 ...). Benefits of the
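
A sketch of both options (Lucene 2.x-era API; the field names are examples):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.queryParser.MultiFieldQueryParser;
    import org.apache.lucene.search.Query;

    public class MultiFieldExamples {
        // option 1: search several fields at query time
        public static Query parse(String userQuery) throws Exception {
            MultiFieldQueryParser parser = new MultiFieldQueryParser(
                new String[] { "title", "body" }, new StandardAnalyzer());
            return parser.parse(userQuery);
        }

        // option 2: build one combined field at index time
        public static void addCombined(Document doc, String title, String body) {
            StringBuffer all = new StringBuffer();
            all.append(title).append(' ').append(body);
            doc.add(new Field("all", all.toString(),
                              Field.Store.NO, Field.Index.TOKENIZED));
        }
    }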

Re: spring & lucene

2006-06-07 Thread Karel Tejnora
Not closing explicitly can lead, especially when the JVM is allowed a lot of memory but only a small amount is used, to old files staying on the disk on linux. A solution is to use a ReentrantReadWriteLock, where the re-open method opens a new indexreader (in a ThreadLocal), acquires the write lock, saves the old reference
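
A sketch of that locking scheme (searches hold the read lock, the swap holds the write lock; details beyond the truncated description above are assumptions):

    import java.util.concurrent.locks.ReentrantReadWriteLock;
    import org.apache.lucene.index.IndexReader;

    public class GuardedReader {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private IndexReader reader;
        private final String path;

        public GuardedReader(String path) throws Exception {
            this.path = path;
            this.reader = IndexReader.open(path);
        }

        // reads run under the read lock, so a swap cannot close the
        // reader out from under them
        public int numDocs() {
            lock.readLock().lock();
            try {
                return reader.numDocs();
            } finally {
                lock.readLock().unlock();
            }
        }

        // re-open: swap in a fresh reader under the write lock, then close
        // the old one so its files can actually be freed
        public void reopen() throws Exception {
            IndexReader fresh = IndexReader.open(path);
            IndexReader old;
            lock.writeLock().lock();
            try {
                old = reader;
                reader = fresh;
            } finally {
                lock.writeLock().unlock();
            }
            old.close();
        }
    }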

Re: Enforcing Primary key uniqueness in lucene index

2006-05-30 Thread Karel Tejnora
You can use jdbm.sf.net for holding the your_id to lucene_id relation in a transactional hashtable on disk. Also, Yonik will say that Solr, at incubator.apache.org/solr, has this constraint check implemented.
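
Within plain Lucene (apart from the jdbm approach above), a common way to enforce a unique key is the delete-then-add idiom, sketched here (field name and index path are illustrative):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class UniqueKeyAdd {
        // delete any existing document with the same key, then add the new one
        public static void addUnique(String indexPath, Document doc, String key)
                throws Exception {
            IndexReader reader = IndexReader.open(indexPath);
            reader.deleteDocuments(new Term("id", key));
            reader.close();

            IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
            writer.addDocument(doc);
            writer.close();
        }
    }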

Re: Seeing what's occupying all the space in the index

2006-05-26 Thread Karel Tejnora
Or you can use ssh -X for X11 forwarding. I don't know how it works on windows (some X client app), but it's great on linux(es) with huge bandwidth.