Re: Scalability of Lucene indexes
We are doing the exact same thing. We haven't tested with that many documents, though. The most we have tested so far is 3 million documents with a 3GB index size. I would be interested in seeing how you maintain replicated indices that are in sync. The way we did it was to run the indexer on each server independently. If the data changes, one server will know about the change. That server updates its Lucene index and notifies the other servers (using multicast). Glad to know someone else is doing a similar thing, and happier still to know that the solution works even for 100 million documents. I was a little worried about the index size growing higher and higher, but it looks like we shouldn't have to worry anymore :) Thanks Praveen

- Original Message - From: Bryan McCormick [EMAIL PROTECTED] To: Chris D [EMAIL PROTECTED] Cc: lucene-user@jakarta.apache.org Sent: Friday, February 18, 2005 3:45 PM Subject: Re: Scalability of Lucene indexes

Hi Chris, I'm responsible for the webshots.com search index and we've had very good results with Lucene. It currently indexes over 100 million documents and performs 4 million searches / day. We initially tested running multiple small copies with a MultiSearcher and merging results, as compared to running a very large single index. We actually found that the single large instance performed better. To improve load handling we clustered multiple identical copies together, then session-bind a user to a particular server and cache the results, but each server is running a single index. Bryan McCormick

On Fri, 2005-02-18 at 08:01, Chris D wrote: Hi all, I have a question about scaling Lucene across a cluster, and good ways of breaking up the work. We have a very large index and searches sometimes take more time than they're allowed. What we have been doing is, during indexing, index into 256 separate indexes (depending on the md5sum), then distribute the indexes to the search machines. So if a machine has 128 indexes it would have to do 128 searches.
I gave ParallelMultiSearcher a try and it was significantly slower than simply iterating through the indexes one at a time. Our new plan is to somehow have only one index per search machine and a larger main index stored on the master. What I'm interested to know is whether having one extremely large index on the master and then splitting it into several smaller indexes (if this is possible) would be better than having several smaller indexes and merging them on the search machines into one index. I would also be interested to know how others have divided up search work across a cluster. Thanks, Chris

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
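[Editor's note] Chris's md5-based routing into 256 shards can be sketched with the standard library alone. This is an illustrative sketch, not his actual code; the class and method names here are hypothetical:

```java
import java.math.BigInteger;
import java.security.MessageDigest;

class ShardRouter {
    private final int numShards;

    ShardRouter(int numShards) { this.numShards = numShards; }

    /** Map a document key to a shard index via its MD5 digest,
     *  so the same document always lands in the same shard. */
    int shardFor(String docKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(docKey.getBytes("UTF-8"));
            // Interpret the digest as a non-negative integer, reduce mod numShards.
            return new BigInteger(1, digest).mod(BigInteger.valueOf(numShards)).intValue();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

At index time each document goes to the index whose number is `shardFor(id)` with 256 shards; at search time the shards on one machine are either searched in turn or merged into a single index.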
Re: Lucene in the Humanities
Good work, Erik (even though the UI could be made prettier). We use Lucene, so I have some knowledge of it. I could see the features you are using with Lucene (like paging, highlighting, different kinds of phrases). Overall, good stuff. Praveen

- Original Message - From: Erik Hatcher [EMAIL PROTECTED] To: Lucene User lucene-user@jakarta.apache.org Sent: Friday, February 18, 2005 2:46 PM Subject: Lucene in the Humanities

It's about time I actually did something real with Lucene :) I have been working with the Applied Research in Patacriticism group at the University of Virginia for a few months and am finally ready to present what I've been doing. The primary focus of my group is working with the Rossetti Archive - poems, artwork, interpretations, collections, and so on of Dante Gabriel Rossetti. I was initially brought on to build a collection and exhibit system, though I got detoured a bit as I got involved in applying Lucene to the archive to replace their existing search system. The existing system used an old version of Tamino with XPath queries. Tamino is not at fault here, at least not entirely, because our data is in a very complicated set of XML files with a lot of non-normalized and legacy metadata - getting at things via XPath is challenging and practically impossible in many cases. My work is now presentable at http://www.rossettiarchive.org/rose (rose is for ROsetti SEarch). This system is implicitly designed for academics who are delving into Rossetti's work, so it may not be all that interesting for most of you. Have fun and send me any interesting things you discover, especially any issues you may encounter. Here are some numbers to give you a sense of what is going on underneath... There are currently 4,983 XML files, totaling about 110MB. Without getting into a lot of details of the confusing domain, there are basically 3 types of XML files (works, pictures, and transcripts). It is important that there be case-sensitive and case-insensitive searches.
To accomplish that, a custom analyzer is used in two different modes, one applying a LowerCaseFilter and one not, with the same documents written to two different indexes. There is one particular type of XML file that gets indexed as two different types of documents (a specialized summary/header type). In this first set of indexes, it is basically a one-to-one mapping of XML file to Lucene Document (with one type being indexed twice in different ways) - all said, there are 5,539 documents in each of the two main indexes. The transcript type gets sliced into another set of original-case and lowercased indexes, with each document in that index representing a document division (a div element in the XML). There are 12,326 documents in each of these div-level indexes. All said, the 4 indexes built total about 3GB in size - I'm storing several fields in order to hit-highlight. Only one of these indexes is hit at a time - which index is used depends on the parameters you query with. Lucene brought the search times into a usable, and impressive to the scholars, state. The previous search solution often timed the browser out! Search results are now in the milliseconds range. The amount of data is tiny compared to most usages of Lucene, but things are getting interesting in other ways. There has been little tuning in terms of ranking quality so far, but this is the next area of work. There is one document type that is more important than the others, and it is being boosted during indexing. There is now a growing interest in tinkering with all the new knobs and dials that are now possible. Similarity and more-like-this features are desired and will be relatively straightforward to implement. I'm currently using a catch-all aggregate field as the default field for QueryParser searching, though a multi-field expansion would be desirable instead.
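[Editor's note] The dual-index approach Erik describes - the same text run through an analyzer chain with and without a LowerCaseFilter - can be illustrated with a stdlib-only sketch. This is not his actual analyzer; the tokenization rule here (split on non-letters) is a simplifying assumption:

```java
import java.util.ArrayList;
import java.util.List;

class DualCaseTokenizer {
    /** Split on non-letter characters; optionally lowercase each token,
     *  mimicking an analyzer chain with and without a LowerCaseFilter.
     *  The same document is tokenized both ways and written to two indexes:
     *  one for case-sensitive and one for case-insensitive search. */
    static List<String> tokenize(String text, boolean lowercase) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (t.isEmpty()) continue;
            tokens.add(lowercase ? t.toLowerCase() : t);
        }
        return tokens;
    }
}
```

For example, `tokenize("Dante Gabriel Rossetti", true)` feeds the case-insensitive index while the `false` variant preserves the original case for exact-case queries.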
So, I've got my homework to do, catching up on all the goodness that has been mentioned on this list recently regarding all of these techniques. An area where I'd like to solicit more help from the community relates to something akin to personalization. The scholars would like to be able to tune results based on the role (such as art historian) of the person searching the site. This would involve some type of training or continual learning process, so that someone searching implicitly feeds back preferences for their queries by visiting the documents that are of interest. Now that the scholars have seen what is possible (I showed them the cool SearchMorph comparison page searching Wikipedia for rossetti), they want more and more! So - here's where I'm soliciting feedback - who's doing these types of things in the realm of the Humanities? Where should we go from here in terms of researching and applying the types of features dreamed about here? How would you recommend
Best way to find if a document exists, using Reader ...
Hi Luceners, Using Reader, what's the best (fastest) way to find if a document exists with a given term? The term is a unique ID, meaning that with that term, at most one document can exist. I have seen 2 appropriate methods on Reader: docFreq(Term) and termDocs(Term). docFreq should return either 0 or 1 in my case, and termDocs should return TermDocs of size 0 or 1. But I was not sure which method is faster. All I want to find is whether a document exists. The actual reason I want to do this is that I want to delete a document with a given GUID. It looks like delete(Term) has some overhead, so I thought I could look up the document and delete it only if it exists. I will be dealing with millions of documents, most of which are new, but I don't know whether a document already exists in the Lucene index. So I was calling Reader.delete(Term) on each document before adding it. This means I am calling the delete method millions of times, even though possibly 99.9% of that million docs are new documents. Does it make sense to call docFreq or termDocs (whichever is faster) before calling delete? Any help is appreciated. Thanks, Praveen ** Praveen Peddi Sr Software Engg, Context Media, Inc. email:[EMAIL PROTECTED] Tel: 401.854.3475 Fax: 401.861.3596 web: http://www.contextmedia.com ** Context Media - The Leader in Enterprise Content Integration
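[Editor's note] The check-then-delete pattern being asked about (call `reader.docFreq(term)` first, and only call `reader.delete(term)` when the count is nonzero) can be mirrored with a toy in-memory "index" keyed by the unique GUID term. This sketch only illustrates the control flow; it is not the Lucene API:

```java
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory index keyed by a unique GUID term, mirroring the
 *  check-then-delete pattern: docFreq(term) before delete(term). */
class GuidIndex {
    private final Map<String, String> docs = new HashMap<String, String>();

    void add(String guid, String doc) { docs.put(guid, doc); }

    /** docFreq analogue: always 0 or 1 because the GUID is unique. */
    int docFreq(String guid) { return docs.containsKey(guid) ? 1 : 0; }

    /** Delete only when the document exists; returns true if a delete happened.
     *  With mostly-new documents, the cheap existence check short-circuits
     *  the vast majority of delete calls. */
    boolean deleteIfExists(String guid) {
        if (docFreq(guid) == 0) return false;
        docs.remove(guid);
        return true;
    }
}
```

Whether the existence check is actually cheaper than an unconditional delete depends on the relative cost of the two operations in the real index, which is exactly the question posed above.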
sorting on a field that can have null values (resend)
I sent this mail yesterday but had no luck receiving responses. Trying it again.

Hi all, I am getting a NullPointerException when I sort on a field that has a null value for some documents. ORDER BY in SQL does work on such fields, and I think it puts all results with null values at the end of the list. Shouldn't Lucene also do the same thing instead of throwing a NullPointerException? Is this expected behaviour? Is Lucene always expecting some value in the sortable fields? I thought of putting empty strings instead of null values, but I think empty strings are put first in the list while sorting, which is the reverse of what anyone would want. Following is the exception I saw in the error log:

java.lang.NullPointerException
 at org.apache.lucene.search.SortComparator$1.compare(SortComparator.java:36)
 at org.apache.lucene.search.FieldSortedHitQueue.lessThan(FieldSortedHitQueue.java:95)
 at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:120)
 at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:47)
 at org.apache.lucene.util.PriorityQueue.insert(PriorityQueue.java:58)
 at org.apache.lucene.search.IndexSearcher$2.collect(IndexSearcher.java:130)
 at org.apache.lucene.search.Scorer.score(Scorer.java:38)
 at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:125)
 at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
 at org.apache.lucene.search.Hits.init(Hits.java:51)
 at org.apache.lucene.search.Searcher.search(Searcher.java:41)

If it's a bug in Lucene, will it be fixed in the next release? Any suggestions would be appreciated. Praveen ** Praveen Peddi Sr Software Engg, Context Media, Inc. email:[EMAIL PROTECTED] Tel: 401.854.3475 Fax: 401.861.3596 web: http://www.contextmedia.com ** Context Media - The Leader in Enterprise Content Integration
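[Editor's note] The nulls-last behavior being asked for (the SQL ORDER BY convention) is just a comparator policy. This stdlib sketch shows that policy on plain strings; it is not a Lucene SortComparator, only an illustration of the null handling a custom comparator would need:

```java
import java.util.Comparator;

/** Sort normally, but push null field values to the end of the results,
 *  the way ORDER BY does in most SQL databases - instead of throwing NPE. */
class NullsLastComparator implements Comparator<String> {
    public int compare(String a, String b) {
        if (a == null && b == null) return 0;
        if (a == null) return 1;   // nulls sort after everything else
        if (b == null) return -1;
        return a.compareTo(b);
    }
}
```

For example, sorting `{null, "b", "a"}` with this comparator yields `"a", "b", null`, whereas a comparator that dereferences the values unconditionally fails exactly as in the stack trace above.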
Re: Lucene appreciation
The product looks great. Are you separately indexing by reading info from all the sites, or just issuing a federated search to all job sites? I am impressed by the speed. It's surely faster than Dice and all the other job search sites. I understand it's a beta version, but adding an advanced search option would help the users a lot. Just a suggestion. Praveen

- Original Message - From: Rony Kahan [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, December 16, 2004 11:26 AM Subject: Lucene appreciation

Hello fellow Lucene users, I'd like to introduce myself and say thanks. We've recently launched http://www.indeed.com, a search engine for jobs based on Lucene. I'm consistently impressed with the quality, professionalism and support of the Lucene project and the Lucene community. This mailing list has been a great help. I'd also like to give mention to some of the consultants who had a big hand in making our project a reality ... Thank you Otis, Aviran, Sergiu, Dawid. As for our project, we're in beta and would love to get your feedback. The index size is currently ~1.8m jobs. My personal email address is rony a_t indeed.com. If you are interested in Lucene work you can set up an RSS feed or email alert from here: http://www.indeed.com/search?q=lucene&sort=date Is it possible to be added to the Wiki Powered By page? Thanks everyone, Rony Indeed.com - one search. all Jobs. http://www.indeed.com
Re: Opinions: Using Lucene as a thin database
Hmm. So far all our fields are just strings, but I would guess you should be able to use Integer.MAX_VALUE or something as the upper bound. Or there might be a better way of doing it. Praveen

- Original Message - From: Akmal Sarhan [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Tuesday, December 14, 2004 10:23 AM Subject: Re: Opinions: Using Lucene as a thin database

that sounds very interesting, but how do you handle queries like select * from MY_TABLE where MY_NUMERIC_FIELD > 80? As far as I know you have only the range query, so you would have to say my_numeric_field:[80 TO ??], but this would not work in the a/m example, or am I missing something? regards Akmal

On Tue, 14.12.2004 at 16:07, Praveen Peddi wrote: Even we use Lucene for a similar purpose, except that we index and store quite a few fields. In fact I also update partial documents as people suggested. I store all the indexed fields so I don't have to build the whole document again while updating a partial document. The reason we do this is speed. I found the Lucene search on a million objects is 4 to 5 times faster than our Oracle queries (of course this might be due to our pitiful database design :) ). It works great so far. The only caveat we had till now was incremental updates, but now I am implementing real-time updates so that the data in the Lucene index is almost always in sync with the data in the database. So now, our search does not go to the database at all. Praveen

- Original Message - From: Kevin L. Cobb [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Tuesday, December 14, 2004 9:40 AM Subject: Opinions: Using Lucene as a thin database

I use Lucene as a legitimate search engine, which is cool. But I am also using it as a simple database too. I build an index with a couple of keyword fields that allows me to retrieve values based on exact matches in those fields. This is all I need to do, so it works just fine for my needs. I also love the speed.
The index is small enough that it is wicked fast. Was wondering if anyone out there was doing the same, or if there are any dissenting opinions on using Lucene for this purpose.
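[Editor's note] Akmal's numeric-range problem comes from range queries on string fields comparing lexicographically ("9" sorts after "80"). A common workaround is to zero-pad numbers to a fixed width at index time so lexicographic order matches numeric order; this stdlib sketch shows the padding, with the width chosen here (10) being an arbitrary example:

```java
class NumericPad {
    /** Left-pad a non-negative number to a fixed width so that
     *  lexicographic (string) order matches numeric order, making
     *  range queries on string fields behave numerically. */
    static String pad(long value, int width) {
        if (value < 0) throw new IllegalArgumentException("non-negative values only");
        String s = Long.toString(value);
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) sb.append('0');
        return sb.append(s).toString();
    }
}
```

With padded values, a query like my_numeric_field:[0000000080 TO 9999999999] selects everything greater than or equal to 80, since "0000000080" < "0000000123" lexicographically, matching 80 < 123 numerically.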
Re: Opinions: Using Lucene as a thin database
Even we use Lucene for a similar purpose, except that we index and store quite a few fields. In fact I also update partial documents as people suggested. I store all the indexed fields so I don't have to build the whole document again while updating a partial document. The reason we do this is speed. I found the Lucene search on a million objects is 4 to 5 times faster than our Oracle queries (of course this might be due to our pitiful database design :) ). It works great so far. The only caveat we had till now was incremental updates, but now I am implementing real-time updates so that the data in the Lucene index is almost always in sync with the data in the database. So now, our search does not go to the database at all. Praveen

- Original Message - From: Kevin L. Cobb [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Tuesday, December 14, 2004 9:40 AM Subject: Opinions: Using Lucene as a thin database

I use Lucene as a legitimate search engine, which is cool. But I am also using it as a simple database too. I build an index with a couple of keyword fields that allows me to retrieve values based on exact matches in those fields. This is all I need to do, so it works just fine for my needs. I also love the speed. The index is small enough that it is wicked fast. Was wondering if anyone out there was doing the same, or if there are any dissenting opinions on using Lucene for this purpose.
Re: sorting tokenized field
If it's not added to the release code already, is there any reason for it not being added? Seems like many people agree that this is an important piece of sorting functionality. It's just that I can't get permission to use customized libraries in our company. Either we have to use the library as is or implement our own stuff. We don't want to go through the pain of maintaining 3rd-party library code whenever we migrate from one version to another. I would assume everyone would have the same problem. Is there any possibility this patch contributed by Aviran can be added to the actual release branch? Thanks Praveen

- Original Message - From: Aviran [EMAIL PROTECTED] To: 'Lucene Users List' [EMAIL PROTECTED] Sent: Monday, December 13, 2004 11:30 AM Subject: RE: sorting tokenized field

The patch is very simple. What it does is check whether the field you want to sort on is tokenized. If it is, it loads the values from the documents into the sorting table. The only con in this approach is that loading the values this way is much slower than if the values were Keywords, but other than that it should work just fine. Aviran http://www.aviransplace.com

-Original Message- From: Praveen Peddi [mailto:[EMAIL PROTECTED] Sent: Monday, December 13, 2004 10:48 AM To: lucenelist Subject: Fw: sorting tokenized field

Hi all, I am forwarding the same email I sent before. Just wanted to try my luck again :). Thanks in advance. Praveen

- Original Message - From: Praveen Peddi [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Friday, December 10, 2004 3:33 PM Subject: Re: sorting tokenized field

Since I am not that familiar with the Lucene code, I couldn't make much out of your patch. But has this patch already been tested and proven to be efficient? If so, why can't it be merged into the Lucene code and made part of the release? I think the bug is valid. It's very likely that people want to sort on tokenized fields.
If I apply this patch to the Lucene code and use it for myself, I will have a hard time managing it in the future (while upgrading the Lucene library). If the patch is applied to the Lucene release code, it would be very easy for Lucene users. If possible, can someone explain what the patch does? I am trying to understand what exactly changed but could not figure it out. Praveen

- Original Message - From: Aviran [EMAIL PROTECTED] To: 'Lucene Users List' [EMAIL PROTECTED] Sent: Friday, December 10, 2004 2:30 PM Subject: RE: sorting tokenized field

I have suggested a solution for this problem ( http://issues.apache.org/bugzilla/show_bug.cgi?id=30382 ). You can use the patch suggested there and recompile Lucene. Aviran http://www.aviransplace.com

-Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Friday, December 10, 2004 13:53 PM To: Lucene Users List Subject: Re: sorting tokenized field

On Dec 10, 2004, at 1:40 PM, Praveen Peddi wrote: I read that tokenised fields cannot be sorted. In order to sort a tokenized field, the application either has to duplicate the field with a different name and not tokenize it, or come up with something else. But shouldn't the search engine take care of this? Are there any plans to build this functionality into Lucene? It would be wasteful for Lucene to assume any field you add should be available for sorting. Adding one more line to your indexing code to accommodate your sorting needs seems a pretty small price to pay. Do you have suggestions to improve how this works? Or how it is documented?
Erik
Re: sorting tokenized field
Hi Erik, Thanks a lot for your kind response. I appreciate the details. What I meant by custom library is applying Aviran's patch to Lucene and maintaining it, not adding an extra field. Adding an extra field was my last option if I couldn't use the patch. I did look at the extensible search, and in fact I wrote my own comparators (IgnoreCaseStringComparator and another custom comparator) and they work just fine. But I am not sure if the extensible search feature helps me sort on a tokenized field without adding the extra field. For now, I will just go for the extra-field option, and later, if a more optimized solution is built into Lucene, I can use that. Praveen

- Original Message - From: Erik Hatcher [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Monday, December 13, 2004 3:01 PM Subject: Re: sorting tokenized field

On Dec 13, 2004, at 2:22 PM, Praveen Peddi wrote: If it's not added to the release code already, is there any reason for it not being added? As noted, there is a performance issue with sorting by tokenized fields. It would seem far more advisable for you to simply add another field, used for sorting, which is untokenized. Why has it not been added? There have been several committers quite active in the codebase (myself excluded). If you wish for changes to be committed, perseverance and patience are key. Keep lobbying, but do so kindly. When there are viable alternatives (such as adding an untokenized field for sorting) then certainly there is less incentive to commit changes. Lucene's codebase is pretty clean and tight - it is wise for us to be very selective about changes to it. Seems like many people agree that this is an important piece of sorting functionality. Many do, but not all. I'm -0 on this change, meaning I'm not veto'ing it, but I'm not actually for it given the performance issue. It's just that I can't get permission to use customized libraries in our company.
No custom library is needed for you to add an untokenized field for sorting purposes. Also, sorting is extensible. Check out the Lucene in Action code, specifically the lia.extsearch.sorting.DistanceSortingTest class. Maybe you could add your own custom sorting code that could do what you want without patching Lucene. Is there any possibility this patch contributed by Aviran can be added to the actual release branch? Keep lobbying - other committers may feel differently than I do about it and add it. Erik
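[Editor's note] Erik's recommended alternative - keep the tokenized field for matching and add an untokenized duplicate purely for sorting - can be sketched with the standard library. This is an illustrative model, not Lucene code; the field names are hypothetical:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Sketch of the "extra untokenized field" idea: each document keeps
 *  its tokenized terms for matching, plus one raw untokenized sort key
 *  used only for ordering results. */
class SortableDoc {
    final List<String> titleTerms;  // what a tokenized "title" field contributes
    final String titleSort;         // untokenized duplicate, used only for sorting

    SortableDoc(String title) {
        this.titleTerms = Arrays.asList(title.toLowerCase().split("\\s+"));
        this.titleSort = title.toLowerCase();
    }

    static void sortByTitle(List<SortableDoc> docs) {
        Collections.sort(docs, new Comparator<SortableDoc>() {
            public int compare(SortableDoc a, SortableDoc b) {
                return a.titleSort.compareTo(b.titleSort);
            }
        });
    }
}
```

The "one more line at indexing time" Erik mentions is the line that writes the extra untokenized field; searching still goes against the tokenized terms.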
sorting tokenized field
I read that tokenised fields cannot be sorted. In order to sort a tokenized field, the application either has to duplicate the field with a different name and not tokenize it, or come up with something else. But shouldn't the search engine take care of this? Are there any plans to build this functionality into Lucene? Praveen ** Praveen Peddi Sr Software Engg, Context Media, Inc. email:[EMAIL PROTECTED] Tel: 401.854.3475 Fax: 401.861.3596 web: http://www.contextmedia.com ** Context Media - The Leader in Enterprise Content Integration
Re: sorting tokenized field
I was only thinking in terms of other search engines. I have worked with other search engines and didn't see this requirement before. I think you are right that it's wasteful to duplicate all tokenized fields. Not sure if there is a smart way of dealing with it. Praveen

- Original Message - From: Erik Hatcher [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Friday, December 10, 2004 1:53 PM Subject: Re: sorting tokenized field

On Dec 10, 2004, at 1:40 PM, Praveen Peddi wrote: I read that tokenised fields cannot be sorted. In order to sort a tokenized field, the application either has to duplicate the field with a different name and not tokenize it, or come up with something else. But shouldn't the search engine take care of this? Are there any plans to build this functionality into Lucene? It would be wasteful for Lucene to assume any field you add should be available for sorting. Adding one more line to your indexing code to accommodate your sorting needs seems a pretty small price to pay. Do you have suggestions to improve how this works? Or how it is documented? Erik
Re: sorting tokenized field
Since I am not that familiar with the Lucene code, I couldn't make much out of your patch. But has this patch already been tested and proven to be efficient? If so, why can't it be merged into the Lucene code and made part of the release? I think the bug is valid. It's very likely that people want to sort on tokenized fields. If I apply this patch to the Lucene code and use it for myself, I will have a hard time managing it in the future (while upgrading the Lucene library). If the patch is applied to the Lucene release code, it would be very easy for Lucene users. If possible, can someone explain what the patch does? I am trying to understand what exactly changed but could not figure it out. Praveen

- Original Message - From: Aviran [EMAIL PROTECTED] To: 'Lucene Users List' [EMAIL PROTECTED] Sent: Friday, December 10, 2004 2:30 PM Subject: RE: sorting tokenized field

I have suggested a solution for this problem ( http://issues.apache.org/bugzilla/show_bug.cgi?id=30382 ). You can use the patch suggested there and recompile Lucene. Aviran http://www.aviransplace.com

-Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Friday, December 10, 2004 13:53 PM To: Lucene Users List Subject: Re: sorting tokenized field

On Dec 10, 2004, at 1:40 PM, Praveen Peddi wrote: I read that tokenised fields cannot be sorted. In order to sort a tokenized field, the application either has to duplicate the field with a different name and not tokenize it, or come up with something else. But shouldn't the search engine take care of this? Are there any plans to build this functionality into Lucene? It would be wasteful for Lucene to assume any field you add should be available for sorting. Adding one more line to your indexing code to accommodate your sorting needs seems a pretty small price to pay. Do you have suggestions to improve how this works? Or how it is documented?
Erik
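The "one more line of indexing code" Erik mentions can be sketched against the Lucene 1.4 API: index the same value twice, once tokenized for searching and once as an untokenized keyword purely for sorting. The field names here are illustrative, not from the original thread.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch: duplicate-field workaround for sorting a tokenized field.
public class TitleFields {
    public static Document makeDoc(String title) {
        Document doc = new Document();
        doc.add(Field.Text("title", title));        // tokenized, searchable
        doc.add(Field.Keyword("titleSort", title)); // untokenized, sortable
        return doc;
    }
}
```

Searches would then run against "title" while a `new Sort("titleSort")` is passed to the searcher.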
Re: partial updating of lucene
But when I am searching, it only searches the index. Stored fields are only used to display the results, not to search. Why would it lose the terms in the index when I retrieve the document? The first solution is not possible (I can't create a new document) since I only have the modified fields. When I get a document, don't the fields have the indexed terms along with them? Is there no way to get a full document (along with indexed terms), clone it and add it to the index? Is there any way I can update a document with just one field (because I only have data for that one field)? Praveen - Original Message - From: Justin Swanhart [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, December 08, 2004 5:59 PM Subject: Re: partial updating of lucene Your unstored fields were not stored in the index, only their terms were indexed. When you get the document from the index and modify it, those terms are lost when you add the document again. You can either simply create a new document, populate all the fields and add that document to the index, or you can add the unstored fields to the document retrieved in step 1. On Wed, 8 Dec 2004 17:53:26 -0500, Praveen Peddi [EMAIL PROTECTED] wrote: Hi all, I have a question about updating a Lucene document. I know that there is no API to do that now. So this is what I am doing in order to update the document with the field title. 1) Get the document from the Lucene index 2) Remove a field called title and add the same field with a modified value 3) Remove the document (based on one of our fields) using a Reader and then close the Reader. 4) Add the document that was obtained in 1 and modified in 2. I am not sure if this is the right way of doing it, but I am having problems searching for that document after updating it. The problem is only with the unstored fields. For example, I search for description:boy where description is an unstored, indexed, tokenized field in the document. I find 1 document.
Now I update the document's title as described above and repeat the same search description:boy and now I don't find any results. I have not touched the field description at all. I just updated the field title. Is this expected behaviour? If not, is it a bug? If I change the field description to stored, indexed and tokenized, the search works fine before and after updating. Praveen ** Praveen Peddi Sr Software Engg, Context Media, Inc. email:[EMAIL PROTECTED] Tel: 401.854.3475 Fax: 401.861.3596 web: http://www.contextmedia.com ** Context Media- The Leader in Enterprise Content Integration
Re: partial updating of lucene
If I store all the fields I am indexing, is it safe to get the document, update a field and add it again to the search index? I do not want to lose anything and I want to make sure the document is the same before and after updating (except for the updated fields). Praveen - Original Message - From: Erik Hatcher [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, December 09, 2004 10:00 AM Subject: Re: partial updating of lucene On Dec 9, 2004, at 9:48 AM, Praveen Peddi wrote: But when I am searching, it only searches the index. Stored fields are only used to display the results, not to search. Why would it lose the terms in the index when I retrieve the document? The first solution is not possible (I can't create a new document) since I only have the modified fields. When I get a document, don't the fields have the indexed terms along with them? Is there no way to get a full document (along with indexed terms), clone it and add it to the index? Is there any way I can update a document with just one field (because I only have data for that one field)? A Document only carries along its *stored* fields. Fields that are indexed, but not stored, are not retrievable from Document. Have a look at the tool Luke (Google for luke lucene :) and see how it does its Reconstruct and Edit facility. It is possible, though potentially lossy, to reconstruct a document and add it again. Erik
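The delete-and-re-add cycle discussed in this thread can be sketched with the Lucene 1.4 API, assuming every field is stored as well as indexed (so nothing is lost on re-add). The "id", "title" and "description" field names and the `updateTitle` helper are hypothetical illustrations, not the poster's actual code.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

// Sketch: "update" = fetch stored fields, rebuild, delete old, add new.
public class UpdateDoc {
    public static void updateTitle(String indexDir, String id, String newTitle)
            throws Exception {
        // 1) fetch the stored document by its unique id field
        IndexSearcher searcher = new IndexSearcher(indexDir);
        Hits hits = searcher.search(new TermQuery(new Term("id", id)));
        Document old = hits.doc(0);
        searcher.close();

        // 2) rebuild it with the modified field; stored values survive
        Document doc = new Document();
        doc.add(Field.Keyword("id", id));
        doc.add(Field.Text("title", newTitle));
        doc.add(Field.Text("description", old.get("description")));

        // 3) delete the old document, then 4) add the rebuilt one
        IndexReader reader = IndexReader.open(indexDir);
        reader.delete(new Term("id", id));
        reader.close();
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        writer.addDocument(doc);
        writer.close();
    }
}
```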
Lucene Vs Ixiasoft
Does anyone know about the Ixiasoft server? It's an XML repository/search engine. If anyone knows about it, do they also know how it compares to Lucene? Which is faster? Praveen ** Praveen Peddi Sr Software Engg, Context Media, Inc. email:[EMAIL PROTECTED] Tel: 401.854.3475 Fax: 401.861.3596 web: http://www.contextmedia.com ** Context Media- The Leader in Enterprise Content Integration
partial updating of lucene
Hi all, I have a question about updating a Lucene document. I know that there is no API to do that now. So this is what I am doing in order to update the document with the field title. 1) Get the document from the Lucene index 2) Remove a field called title and add the same field with a modified value 3) Remove the document (based on one of our fields) using a Reader and then close the Reader. 4) Add the document that was obtained in 1 and modified in 2. I am not sure if this is the right way of doing it, but I am having problems searching for that document after updating it. The problem is only with the unstored fields. For example, I search for description:boy where description is an unstored, indexed, tokenized field in the document. I find 1 document. Now I update the document's title as described above and repeat the same search description:boy and now I don't find any results. I have not touched the field description at all. I just updated the field title. Is this expected behaviour? If not, is it a bug? If I change the field description to stored, indexed and tokenized, the search works fine before and after updating. Praveen ** Praveen Peddi Sr Software Engg, Context Media, Inc. email:[EMAIL PROTECTED] Tel: 401.854.3475 Fax: 401.861.3596 web: http://www.contextmedia.com ** Context Media- The Leader in Enterprise Content Integration
Re: False Locking Conflict?
If you have more than one Lucene application running on the same machine, do they all share the same temp file? At least I had this problem when I ran my application in two different instances of WebLogic on the same machine. Praveen - Original Message - From: Otis Gospodnetic [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Friday, November 19, 2004 2:13 PM Subject: Re: False Locking Conflict? It is possible, but it's not likely, as other users are not reporting this. Otis --- Luke Shannon [EMAIL PROTECTED] wrote: Hey All; Is it possible for there to be a situation where the locking file is in place after the reader has been closed? I have extra logging in place and have followed the code execution. The reader finishes deleting old content and closes (I know this for sure). This is the only reader instance I have for the class (it is a static member). The reader is not re-opened. I try to open the writer and I get my old friend: java.io.IOException: Lock obtain timed out: Lock@/usr/tomcat/jakarta-tomcat-5.0.19/temp/lucene-398fbd170a5457d05e2f4d43210f7fe8-write.lock This code is synchronized so I am sure there is no other process trying to do the same thing. It looks to me like the reader is closing and the lock file is not being removed. Is this possible? Luke
Using Shared directory as lucene index in cluster
Hi all, This topic has been discussed on the mailing list before, but I could not find an answer to the problem I am having. I am trying to decide whether a shared-directory-based index or a local index per server is the better approach in a clustered application. First I am evaluating the shared directory option. I am trying to use a shared directory (NFS) and test the performance difference compared to a local directory. I didn't see much difference on the search side, but indexing I think is a little slower on the shared directory. I think we can live with this. But I could not make indexing run in cluster mode. We cache an IndexSearcher on each server in the cluster (for faster search). We make sure that the cached IndexSearcher is always up to date. When I run our full indexer, it cleans the index directory and re-indexes all the objects from the DB to the Lucene index directory. But it looks like the IndexSearcher holds the file handles, so I cannot delete the directory. This means I cannot run the full indexer in cluster mode since each server holds file handles on some of the index files and those files cannot be deleted. One solution is to make sure the searchers on all servers are closed before running the full indexer, but there is no direct way to signal this in a cluster. So my question is: is there any other solution that doesn't require closing the searchers to clean the index files? Note: This is in fact not specific to the shared directory; it is true for a local directory also. Praveen ** Praveen Peddi Sr Software Engg, Context Media, Inc. email:[EMAIL PROTECTED] Tel: 401.854.3475 Fax: 401.861.3596 web: http://www.contextmedia.com ** Context Media- The Leader in Enterprise Content Integration
Re: sorting and score ordering
Use SortField.FIELD_SCORE as the first element in the SortField[] when you pass it to the sort method. Praveen - Original Message - From: Chris Fraschetti [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, October 13, 2004 3:19 PM Subject: Re: sorting and score ordering Will do. My other question was: the 'score' for a page, as far as I know, is only accessible post-search... and is not contained in a field. How can I specify the score as a sort field when there is no field 'score'? -Chris On Wed, 13 Oct 2004 21:06:14 +0200, Daniel Naber [EMAIL PROTECTED] wrote: On Wednesday 13 October 2004 20:44, Chris Fraschetti wrote: I haven't seen an example on how to apply two sorts to a search.. can you help me out with that? Check out the documentation for Sort(SortField[] fields) and SortField. Regards Daniel -- http://www.danielnaber.de -- ___ Chris Fraschetti, Student CompSci System Admin University of San Francisco e [EMAIL PROTECTED] | http://meteora.cs.usfca.edu
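The advice above (score first, then a field) can be sketched with the Lucene 1.4 Sort API. The "title" tie-break field is an assumed example; any untokenized, indexed field would do.

```java
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// Sketch: sort primarily by relevance score, then by a field value.
public class ScoreThenField {
    public static Hits search(Searcher searcher, Query query) throws Exception {
        Sort sort = new Sort(new SortField[] {
            SortField.FIELD_SCORE,   // primary: relevance score
            new SortField("title")   // secondary: untokenized field
        });
        return searcher.search(query, sort);
    }
}
```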
Making lucene work in weblogic cluster
While I was going through the mailing list trying to solve the Lucene cluster problem, I came across this thread. Does anyone know if David Townsend submitted the patch he was talking about? http://www.mail-archive.com/[EMAIL PROTECTED]/msg06252.html I am interested in looking at the NFS solution (mounting the shared drive on each server in the cluster). I don't know if anyone has used this solution in a cluster, but it seems to be a better approach than the RemoteSearchable interface or a DB-based index (SQLDirectory). I am currently looking at 2 options: Index on shared drive: Use a single index dir on a shared drive (NFS, etc.), which is mounted on each app server. All the servers in the cluster write to this shared drive when objects are modified. Problems: 1) Known problems like file locking etc. (The above thread talks about moving the locking mechanism to the DB but I have no idea how.) 2) Performance. Index per server: Create copies of the index dir for each machine. Requires regular updates, etc. Each server maintains its own index and searches its own index. Problems: 1) Modifying the index is complex. When objects are modified on a server1 that does not run the search system, server1 needs to notify all servers in the cluster about these modifications so that each server can update its own index. This may involve some kind of remote communication mechanism, which will perform badly since our index is modified a lot. So I am still reviewing both options and trying to figure out which one is best and how to solve the above problems. If you have any ideas, please share them. I would appreciate any help in making Lucene clusterable (both indexing and searching). Praveen ** Praveen Peddi Sr Software Engg, Context Media, Inc. email:[EMAIL PROTECTED] Tel: 401.854.3475 Fax: 401.861.3596 web: http://www.contextmedia.com ** Context Media- The Leader in Enterprise Content Integration
Re: displaying 'pages' of search results...
PROTECTED] Sent: Wednesday, September 22, 2004 2:53 AM Subject: displaying 'pages' of search results... Hi Can u share the searcher.search(query, hitCollector); [light weight paging api ] Code on the form ,may be somebody like me need's it. ; ) Karthik -Original Message- From: Praveen Peddi [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 22, 2004 1:24 AM To: Lucene Users List Subject: Re: displaying 'pages' of search results... The way we do it is: Get all the document ids, cache them and then get the first 50, second 50 documents etc. We wrote a light weight paging api on top of lucene. We call searcher.search(query, hitCollector); Our HitCollectorImpl implements collect method and just collects the document id only. Praveen - Original Message - From: Chris Fraschetti [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Tuesday, September 21, 2004 3:33 PM Subject: displaying 'pages' of search results... I was wondering was the best way was to go about returning say 1,000,000 results, divided up into say 50 element sections and then accessing them via the first 50, second 50, etc etc. Is there a way to keep the query around so that lucene doesn't need to search again, or would the search be cached and no delay arise? Just looking for some ideas and possibly some implementational issues... -- ___ Chris Fraschetti e [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: displaying 'pages' of search results...
The way we do it is: Get all the document ids, cache them and then get the first 50, second 50 documents etc. We wrote a light weight paging api on top of lucene. We call searcher.search(query, hitCollector); Our HitCollectorImpl implements collect method and just collects the document id only. Praveen - Original Message - From: Chris Fraschetti [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Tuesday, September 21, 2004 3:33 PM Subject: displaying 'pages' of search results... I was wondering was the best way was to go about returning say 1,000,000 results, divided up into say 50 element sections and then accessing them via the first 50, second 50, etc etc. Is there a way to keep the query around so that lucene doesn't need to search again, or would the search be cached and no delay arise? Just looking for some ideas and possibly some implementational issues... -- ___ Chris Fraschetti e [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
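The light-weight paging idea described above can be sketched against the Lucene 1.4 HitCollector API. This is a hypothetical helper, not the poster's actual class: it collects only document ids during the search, caches them, and serves page-sized slices on demand (the stored documents for one page would be loaded later via IndexReader.document(id)).

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.HitCollector;

// Sketch: collect doc ids only, then page through the cached id list.
public class DocIdCollector extends HitCollector {
    private final List ids = new ArrayList();

    public void collect(int doc, float score) {
        ids.add(new Integer(doc)); // keep the id only; no Document loading
    }

    public List page(int pageSize, int pageNum) {
        int from = pageNum * pageSize;
        int to = Math.min(from + pageSize, ids.size());
        return ids.subList(from, to);
    }
}
```

Usage would be `searcher.search(query, collector)` once, then `collector.page(50, n)` for each requested page.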
Re: problem with SortField[] in search method (newbie)
Does it mean you indexed all non-null fields? I think you should change your code so that you always index the fields you want to sort on. In any case, it looks like some of your documents have shortName not null and not indexed. If you did not have any non-indexed shortNames in the index, I don't think you would have gotten that error. But I may be wrong. Praveen - Original Message - From: Wermus Fernando [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, September 15, 2004 1:53 PM Subject: RE: problem with SortField[] in search method (newbie) Aviran, I can search on non-indexed fields without any exception, but I can't order by the same fields. Besides, I can't know in advance if they are indexed in my app, because I only index fields that have some value; if a field doesn't, I don't add it to the document. What if I don't have any document indexed? -Original Message- From: Aviran [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 15, 2004 02:35 p.m. To: 'Lucene Users List' Subject: RE: problem with SortField[] in search method (newbie) You can only sort on indexed fields. (Even more than that, it'll work properly only on untokenized fields, i.e. keywords.) Aviran -Original Message- From: Wermus Fernando [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 15, 2004 13:13 PM To: [EMAIL PROTECTED] Subject: problem with SortField[] in search method (newbie) Luceners, My search looks up whole entities. My entities are accounts, contacts, tasks, etc. My search looks at a group of entity fields. This works fine even though I don't have every entity field indexed in a document. But if I sort by some fields from different entities, I get the following error: field shortName does not appear to be indexed The account fields I have indexed are shortName, number, location, fax, phone, symbol, and if I order by shortName it works fine.
I don't understand the behavior, because if I don't order the search and I don't have any document indexed, it works fine, but if I add an order I get a RuntimeException and I can't catch the exception to solve the problem. The only solution is to index all the entities' fields once in a document, but to me that's a hack. Any idea could help me out. Thanks in advance.
Re: Moving from a single server to a cluster
We went through the same scenario as yours. We recently made our application clusterable and I wrote our own version of a JDBC directory (similar to the SQLDirectory posted by someone) with our own caching. It was great for searching, but indexing had become a real bottleneck. So we have decided to move back to the file system for non-clustered apps. I am still trying to figure out the best way (whether to use a RemoteSearcher or manage multiple indexes). I already tried multiple indexes and we didn't really like the solution of maintaining multiple copies. It requires more space, more maintenance, all indexes need to be in sync, etc. I will be glad if I can get the best answer for this. Did anyone try RemoteSearchable, and how does it compare to the multiple-index solution? Nader: I would appreciate it if you could send me the docs. Praveen - Original Message - From: David Townsend [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Wednesday, September 08, 2004 10:42 AM Subject: RE: Moving from a single server to a cluster Would it be cheeky to ask you to post the docs to the group? It would be interesting to read how you've tackled this.
-Original Message- From: Nader Henein [mailto:[EMAIL PROTECTED] Sent: 08 September 2004 13:57 To: Lucene Users List Subject: Re: Moving from a single server to a cluster Hey Ben, We've been using a distributed environment with three servers and three separate indecies for the past 2 years since the first stable Lucene release and it has been great, recently and for the past two months I've been working on a redesign for our Lucene App and I've shared my findings and plans with Otis, Doug and Erik, they pointed out a few faults in my logic which you will probably come across soon enough that mainly have to do with keeping you updates atomic (not too hard) and your deletes atomic (a little more tricky), give me a few days and I'll send you both the early document and the newer version that deals squarely with Lucene in a distributed environment with high volume index. Regards. Nader Henein Ben Sinclair wrote: My application currently uses Lucene with an index living on the filesystem, and it works fine. I'm moving to a clustered environment soon and need to figure out how to keep my indexes together. Since the index is on the filesystem, each machine in the cluster will end up with a different index. I looked into JDBC Directory, but it's not tested under Oracle and doesn't seem like a very mature project. What are other people doing to solve this problem? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene for Indian Languages
In fact, the CJKAnalyzer also works well with Indian languages. Since the CJKAnalyzer treats multi-byte characters as a special case, it works with most Asian multi-byte characters. I introduced the CJKAnalyzer for Japanese text search and we also tested with the Hindi and Telugu languages. All our search test cases passed. Give the CJKAnalyzer a try. You will find it a better analyzer than the StandardAnalyzer (for any Asian language). Praveen - Original Message - From: Satish Kagathare [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Monday, August 23, 2004 9:20 AM Subject: Re: Lucene for Indian Languages Hi Srinivasa, Use the StandardAnalyzer for indexing and parsing queries for Indian-language docs. It will work. Right now we are searching on Hindi and Marathi, but without specific stemmers and filters. We are planning to develop a Marathi morphological analyzer. Thanks, Satish. On Sun, 22 Aug 2004, srinivasa raghavan wrote: Hi all, Is the Lucene API implemented for Indian contexts? I know that Lucene has stemmers and filters for the German and Russian languages. I would like to know whether there are stemmers and filters available/being developed for Indian languages. Thanks, Rahavan.
Re: lucene and ejb applications
In fact, we do exactly the same thing. A session bean method called search() delegates to a POJO SearchService. We lazy-load the IndexSearcher, cache it in memory and invalidate that object when someone else modifies the index. This trick works wonderfully for us. Search has become faster after caching the searcher. Praveen - Original Message - From: Erik Hatcher [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Friday, August 20, 2004 12:02 PM Subject: Re: lucene and ejb applications On Aug 20, 2004, at 7:54 AM, Rupinder Singh Mazara wrote: hi erik thanks for the warning and the code. Let me re-phrase the question: I have an index generated by Lucene, and I need the search capability to have high availability. What solutions would be the most optimal? I'm guessing from your descriptions that you want a search server that multiple applications can access. Correct? Is that what you mean by high availability? Take a look at Nutch for examples of doing this kind of thing. And also... Currently I have two scenarios in mind: a) set up an RMI-based app that on start-up initializes an IndexSearcher object and waits for invocation of a method like Vector executeQuery(Query) Lucene has built-in RMI capability, so you don't need to recreate this yourself. Look at RemoteSearchable (and the test cases that use it). b) create a web-based app (jsp/servlet or struts) that initializes the IndexSearcher object and stores it in the servletContext on initialization, and all requests invoke Hits search(Query q) This is ok, but you have the same issues with servlet context (application scope or even session scope) with distributed applications. IndexSearcher, at the very least, should be transient and lazy initialized, perhaps nested under a controller object of your making. With scenario a) I can have more control over updates, inserts, and deletes, whereas scenario b) has higher availability. I disagree with your analysis of those scenarios.
Neither has more or less control or availability than the other. I want to create and store the IndexSearcher object during initialization to save on multiple opens and reads. Once updates are ready, a signal can be sent to block further searches while the updates are integrated into the existing index. It is a good thing to keep an IndexSearcher instance around for big indexes to save on that I/O, I completely agree. A simple IndexSearcher-encapsulating Java object which lazy initializes and keeps IndexSearcher as a transient would be quite sufficient, I think. Store that object wherever you like - application scope seems to be appropriate for your web application scenario. Erik
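The cached, lazily initialized searcher both posters describe could be sketched like this. The class, the index path, and the invalidate() hook are all assumptions for illustration; invalidate() is what a cluster notification handler would call when another node changes the index.

```java
import org.apache.lucene.search.IndexSearcher;

// Sketch: lazily opened, shared IndexSearcher with explicit invalidation.
public class SearcherHolder {
    private static IndexSearcher searcher;
    private static final String INDEX_DIR = "/path/to/index"; // assumed config

    public static synchronized IndexSearcher get() throws Exception {
        if (searcher == null) {
            searcher = new IndexSearcher(INDEX_DIR); // open once, then reuse
        }
        return searcher;
    }

    public static synchronized void invalidate() throws Exception {
        if (searcher != null) {
            searcher.close(); // next get() reopens against the updated index
            searcher = null;
        }
    }
}
```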
merge factor and minMergeDocs
Has anything changed in Lucene 1.4 regarding mergeFactor? I recently ported to Lucene 1.4 final and my indexing time does not change with changes in the merge factor. Increasing minMergeDocs improves my indexing as expected, but changing mergeFactor makes no difference. If this is the case, I can always go with the default merge factor of 10, so I won't run into the too-many-open-files problem, and just vary minMergeDocs to tune indexing performance. Currently I tested with 25K objects and the indexing time is almost the same with a mergeFactor of 10 and a mergeFactor of 100 (kept minMergeDocs=100 in both cases). I am confident that my indexing time used to vary with changes in the merge factor before (with Lucene 1.3 RC3 I think). Praveen ** Praveen Peddi Sr Software Engg, Context Media, Inc. email:[EMAIL PROTECTED] Tel: 401.854.3475 Fax: 401.861.3596 web: http://www.contextmedia.com ** Context Media- The Leader in Enterprise Content Integration
No change in the indexing time after increase the merge factor
I performed Lucene indexing with 25,000 documents. We feel that indexing is slow, so I am trying to tune it. My configuration is as follows: Machine: Windows XP, 1GB RAM, 3GHz # of documents: 25,000 App Server: Weblogic 7.0 Lucene version: lucene 1.4 final I ran the indexer with merge factors of 10 and 50. Both times, the total indexing time (Lucene time only) was almost the same (27.92 mins for mergeFactor=10 and 28.11 mins for mergeFactor=50). From the Lucene mails and Lucene-related articles I read, I thought increasing the merge factor would improve indexing performance. Am I wrong? Praveen ** Praveen Peddi Sr Software Engg, Context Media, Inc. email:[EMAIL PROTECTED] Tel: 401.854.3475 Fax: 401.861.3596 web: http://www.contextmedia.com ** Context Media- The Leader in Enterprise Content Integration
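For reference, the two tuning knobs discussed in these posts were public fields on IndexWriter in Lucene 1.4. A minimal sketch (the values shown are examples, not recommendations):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

// Sketch: setting the Lucene 1.4 indexing tuning fields on a writer.
public class TunedWriter {
    public static IndexWriter open(String dir) throws Exception {
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        writer.mergeFactor = 10;    // segments accumulated before a merge
        writer.minMergeDocs = 1000; // docs buffered in RAM before flushing
        return writer;
    }
}
```

Raising minMergeDocs trades RAM for fewer disk flushes, which matches the observation that it helped more than mergeFactor.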
Re: Problems indexing Japanese with CJKAnalyzer
If it's a web application, you have to call request.setCharacterEncoding("UTF-8") before reading any parameters. Also make sure the HTML page encoding is specified as UTF-8 in the meta tag. Most web app servers decode the request parameters in the system's default encoding. If you call the above method, I think it will solve your problem. Praveen - Original Message - From: Bruno Tirel [EMAIL PROTECTED] To: 'Lucene Users List' [EMAIL PROTECTED] Sent: Thursday, July 15, 2004 6:15 AM Subject: RE: Problems indexing Japanese with CJKAnalyzer Hi All, I am also trying to localize everything for a French application, using UTF-8 encoding. I have already applied what Jon described. I fully confirm his recommendation for the HTML Parser and HTML Document changes with UNICODE and UTF-8 encoding specification. In my case, I still have one case that is not functional: using meta-data from an HTML document, as in the demo3 example. Trying to convert to UTF-8, or ISO-8859-1, it is still not correctly encoded when I check with Luke. The word Propriété is seen either as Propri?t? with a square, or as Propriã©tã©. My local codepage is Cp1252, so it should be viewed as ISO-8859-1. Same result when I use the local FileEncoding parameter. All the other fields are correctly encoded into UTF-8, tokenized and successfully searched through a JSP page. Is anybody else facing this issue? Any help available? Best regards, Bruno -Original Message- From: Jon Schuster [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 14, 2004 22:51 To: 'Lucene Users List' Subject: RE: Problems indexing Japanese with CJKAnalyzer Hi all, Thanks for the help on indexing Japanese documents. I eventually got things working, and here's an update so that other folks might have an easier time in similar situations. The problem I had was indeed with the encoding, but it was more than just the encoding on the initial creation of the HTMLParser (from the Lucene demo package).
In HTMLDocument, doing this: InputStreamReader reader = new InputStreamReader( new FileInputStream(f), SJIS); HTMLParser parser = new HTMLParser( reader ); creates the parser and feeds it Unicode from the original Shift-JIS encoding document, but then when the document contents is fetched using this line: Field fld = Field.Text(contents, parser.getReader() ); HTMLParser.getReader creates an InputStreamReader and OutputStreamWriter using the default encoding, which in my case was Windows 1252 (essentially Latin-1). That was bad. In the HTMLParser.jj grammar file, adding an explicit encoding of UTF8 on both the Reader and Writer got things mostly working. The one missing piece was in the options section of the HTMLParser.jj file. The original grammar file generates an input character stream class that treats the input as a stream of 1-byte characters. To have JavaCC generate a stream class that handles double-byte characters, you need the option UNICODE_INPUT=true. So, there were essentially three changes in two files: HTMLParser.jj - add UNICODE_INPUT=true to options section; add explicit UTF8 encoding on Reader and Writer creation in getReader(). As far as I can tell, this changes works fine for all of the languages I need to handle, which are English, French, German, and Japanese. HTMLDocument - add explicit encoding of SJIS when creating the Reader used to create the HTMLParser. (For western languages, I use encoding of ISO8859_1.) And of course, use the right language tokenizer. --Jon earlier responses snipped; see the list archive - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
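The request-side half of the fix (forcing UTF-8 before any parameter is read, so the container does not fall back to its default decoding) can be sketched as a servlet. The class and parameter names are illustrative.

```java
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sketch: decode form parameters as UTF-8 instead of the platform default.
public class SearchServlet extends HttpServlet {
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        req.setCharacterEncoding("UTF-8");       // must precede getParameter()
        String query = req.getParameter("query"); // hypothetical form field
        resp.setContentType("text/html; charset=UTF-8");
        // ... run the Lucene search with 'query' ...
    }
}
```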
Re: Sorting and tokenization
The solution you suggested is exactly what I expected, and I already thought about implementing it. But the problem is the memory inefficiency. Sometimes titles are huge. And with i18n, a title can be in Japanese, Chinese or any language which takes more memory than English. OK, how about taking the first token of the title and using it just for the sake of sorting? Does anyone see any problem with that? This solution saves at least some memory compared to the other solution. Praveen - Original Message - From: John Moylan [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, July 01, 2004 10:24 AM Subject: Re: Sorting and tokenization Hi, You just need to have another title field that is not tokenized - for sorting purposes. Best, John On Thu, 2004-07-01 at 15:15, Praveen Peddi wrote: Hello all, Now that lucene 1.4 rc3 has sorting functionality built in, I am adding sorting functionality to our searching. Before posting any question to this mailing list, I have been going through most of the email responses in this mailing list related to sorting. I have found that I cannot tokenize the fields that I want to sort on. Let's take the example I have. I use lucene 1.3 final for searching. Sorting is in fact a very important feature in our application. But we found that lucene does not support it out of the box; we had to implement sorting by score and doc id programmatically, which is kind of useless for us. So I thought lucene's new sorting feature would best suit us now. But unfortunately, the field called title is tokenized currently. And this is done purposefully because users would want to search partial matches (or rather search on multiple words of the title). So if we make it untokenized we may lose an important functionality. My question is: is there any way I can achieve sorting the objects by title while keeping title tokenized? Thanks in advance. Praveen ** Praveen Peddi Sr Software Engg, Context Media, Inc.
email:[EMAIL PROTECTED] Tel: 401.854.3475 Fax: 401.861.3596 web: http://www.contextmedia.com ** Context Media- The Leader in Enterprise Content Integration -- John Moylan -- ePublishing Radio Telefis Eireann, Montrose House, Donnybrook, Dublin 4, Eire t:+353 1 2083564 e:[EMAIL PROTECTED]
Languages Lucene can support
I have read many emails on the Lucene mailing list regarding analyzers. Following is the list of languages Lucene supports out of the box, so they will be supported with no change in our code, just a configuration change: English, German, Russian. Following is the list of languages that are available as external downloads on Lucene's site: Chinese, Japanese, Korean (all of the above come as a single download), Brazilian, Czech, French, Dutch. I also read that Lucene's StandardAnalyzer supports most of the European languages. Does that mean it supports Spanish also, or is there a separate analyzer for that? I didn't see any Spanish analyzer in the sandbox or in the Lucene release. Another question regarding FrenchAnalyzer: I downloaded FrenchAnalyzer and some methods do not throw IOException where they are supposed to, for example the constructor. I am using 1.4 final (I know it was released only today :)). What's the fix for it? Praveen
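The "configuration change" in the post is typically a per-language analyzer lookup. A minimal sketch of such a registry follows; the class names are the standard Lucene 1.4 core/sandbox ones, but the mapping itself and the fallback policy are assumptions, not anything Lucene ships.

```java
import java.util.HashMap;
import java.util.Map;

public class AnalyzerRegistry {
    // ISO language code -> fully qualified analyzer class name.
    // Core ships English (Standard), German and Russian; the others
    // come from the sandbox downloads mentioned in the post.
    static final Map<String, String> ANALYZERS = new HashMap<>();
    static {
        ANALYZERS.put("en", "org.apache.lucene.analysis.standard.StandardAnalyzer");
        ANALYZERS.put("de", "org.apache.lucene.analysis.de.GermanAnalyzer");
        ANALYZERS.put("ru", "org.apache.lucene.analysis.ru.RussianAnalyzer");
        ANALYZERS.put("fr", "org.apache.lucene.analysis.fr.FrenchAnalyzer");
        ANALYZERS.put("ja", "org.apache.lucene.analysis.cjk.CJKAnalyzer");
    }

    // Languages without a dedicated analyzer (e.g. Spanish in 1.4) fall
    // back to StandardAnalyzer, which handles most European text tolerably.
    static String analyzerFor(String lang) {
        String cls = ANALYZERS.get(lang);
        return cls != null ? cls : "org.apache.lucene.analysis.standard.StandardAnalyzer";
    }

    public static void main(String[] args) {
        System.out.println(analyzerFor("ja")); // org.apache.lucene.analysis.cjk.CJKAnalyzer
        System.out.println(analyzerFor("es")); // falls back to StandardAnalyzer
    }
}
```

The class name would then be instantiated reflectively (or via a factory) when building the IndexWriter for documents in that language.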
Do we really need CJKAnalyzer to search Japanese characters
Hello all, You will have to excuse me if the question looks dumb ;) I didn't use CJKAnalyzer and I could still search Japanese characters. Actually I used it first, but then I thought of testing with just the StandardAnalyzer, and it worked with StandardAnalyzer also: I was able to search the metadata of our objects that has Chinese and Japanese characters. I think Lucene is internally storing Unicode characters, so should it matter whether it's StandardAnalyzer or CJKAnalyzer? When do we really have to use CJKAnalyzer? Praveen
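The practical difference is tokenization, not Unicode storage: StandardAnalyzer effectively emits one token per CJK character (unigrams), while CJKAnalyzer emits overlapping two-character tokens (bigrams). A standalone sketch of the two schemes in plain Java (not the actual analyzer code):

```java
import java.util.ArrayList;
import java.util.List;

public class CjkTokenDemo {
    // One token per character, roughly what StandardAnalyzer does for CJK text.
    static List<String> unigrams(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(s.substring(i, i + 1));
        }
        return out;
    }

    // Overlapping two-character tokens, the CJKAnalyzer approach.
    static List<String> bigrams(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            out.add(s.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "\u6771\u4eac\u90fd"; // Tokyo-to, three kanji
        System.out.println(unigrams(text)); // [東, 京, 都]
        System.out.println(bigrams(text));  // [東京, 京都]
        // With unigrams, any document containing the query's characters
        // anywhere can match; bigrams require the characters to be adjacent,
        // which usually gives better precision for Japanese and Chinese.
    }
}
```

So both analyzers can "find" Japanese text, which is why the simple test worked; the quality difference shows up in precision on multi-character queries over larger corpora.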