Re: Re-Indexing a moving target???
details?

Yousef Ourabi wrote: Saad, here is what I got. I will post again and be more specific. -Y

Nader Henein [EMAIL PROTECTED] wrote: We'll need a little more detail to help you: what are the sizes of your updates, and how often are they run? 1) No, just re-open the IndexWriter every time you re-index. Since yours is, as you say, a moderately changing index, just keep a flag on the rows and batch the indexing every so often. 2) It all comes down to your needs; more detail would help us help you. Nader Henein

Yousef Ourabi wrote: Hey, we are using Lucene to index a moderately changing database, and I have a couple of questions about performance strategy. 1) Should we just keep one IndexWriter open until the system comes down, or create a new IndexWriter each time we re-index our data set? 2) Does anyone have any thoughts on multi-threading and segments instead of one index? Thanks for your time and help. Best, Yousef

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]

-- Nader S. Henein, Senior Applications Developer, Bayt.com
Re: QUERYPARSING BOOSTING
From the text on the Jakarta Lucene site: http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

"Lucene provides the relevance level of matching documents based on the terms found. To boost a term, use the caret symbol, ^, with a boost factor (a number) at the end of the term you are searching. The higher the boost factor, the more relevant the term will be. Boosting allows you to control the relevance of a document by boosting its term. For example, if you are searching for jakarta apache and you want the term jakarta to be more relevant, boost it using the ^ symbol along with the boost factor next to the term. You would type: jakarta^4 apache. This will make documents with the term jakarta appear more relevant. You can also boost phrase terms, as in the example: "jakarta apache"^4 "jakarta lucene". By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2)."

Regards, Nader Henein

Karthik N S wrote: Hi guys, apologies; this question may have been asked a million times on this forum, but I need some clarification. 1) FieldType = keyword, name = vendor. 2) FieldType = text, name = contents. Questions: 1) How do I construct a query so that hits available for the VENDOR appear first? 2) If boosting is to be applied, how? 3) Is the query constructed below correct?

    +contents:shoes +((vendor:nike)^10)

Please advise. Thx in advance. WITH WARM REGARDS, HAVE A NICE DAY [N.S.KARTHIK]
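The syntax quoted above can be exercised with a small helper that renders a required clause with an optional boost suffix. This is a plain-Java sketch (the helper and class names are invented for illustration, not a Lucene API); QueryParser simply parses a string shaped like the one built here.

```java
/**
 * Sketch: build query strings in the QueryParser syntax quoted above.
 * The names here are illustrative only; QueryParser also accepts
 * fractional boosts such as 0.2.
 */
public class BoostedQuery {
    /** Render a required (+) clause, appending ^boost when boost != 1. */
    public static String clause(String field, String term, int boost) {
        String c = "+" + field + ":" + term;
        return boost == 1 ? c : c + "^" + boost;
    }
}
```

For the vendor question above, `clause("contents", "shoes", 1) + " " + clause("vendor", "nike", 10)` yields `+contents:shoes +vendor:nike^10`, which requires both terms but ranks matching vendor documents higher.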
Re: time of indexer
Download Luke; it makes life easy when you inspect the index, so you can actually look at what you've indexed, as opposed to what you may think you indexed. Nader

Daniel Cortes wrote: Hi everybody, and merry Christmas to all (especially people who, like me, are working today instead of staying with the family). I don't understand why my search in the index gives these bad results. I index 112 PHP files as txt, on this machine: Pentium 4 2.4 GHz, 512 MB RAM, running Windows XP and Eclipse during the indexing. Total search time (tiempo de búsqueda total): 80882 ms. The fields that I use are:

    doc.add(Field.Keyword("filename", file.getCanonicalPath()));
    doc.add(Field.UnStored("body", bodyText));
    doc.add(Field.Text("titulo", title));

What am I doing wrong? thks
Re: index question
It comes down to your searching needs: do you need your documents to be searchable by these fields, or do you need a general search of the whole document? Your decision will impact the size of the index and the speed of indexing and searching, so give it due thought. Start from your GUI requirements and design the index that responds to your users' needs best. Nader

Daniel Cortes wrote: I want to know, in the case that you use Lucene to index files as a general searcher, what fields (or keys) do you use. For example, in my case they are html, pdf, doc, ppt and txt, and I'm thinking of using Field author, Field title, Field url, Field content, Field modification date. Anything more? Any recommendation? thks, and Merry Xmas to all.
Re: index question
OK, so you can index the whole document in one shot, but you should store certain fields in the index (like what you display in the search results) to avoid a round trip to the DB. So, for example, you would store title, synopsis, link, doc_id and date, and then index only what you want to be searchable. The reason you would have title stored in one field and indexed again in another is that if you stem that field, it becomes useless for display purposes. So the logical representation of your index would look something like this:

    document id:              stored / indexed
    title:                    stored / un-indexed
    synopsis:                 stored / un-indexed
    date:                     stored / indexed
    full document (stemmed):  indexed / un-stored

Enjoy, Nader Henein

Daniel Cortes wrote: thks Nader. I need a general search of documents; that's why I ask for your recommendations, because the fields are only for info in the search results. Typically, a search on Google, for example, for "casa" returns something like: "La casa roja ... había una vez una casa roja que tenía ..." (the red house ... once upon a time there was a red house that had ...), http://go.to/casa, Modification date: 25-12-04. To do this, what fields and options (keyword, text, unindexed, unstored) should I use? thks

Nader Henein wrote: It comes down to your searching needs: do you need your documents to be searchable by these fields, or do you need a general search of the whole document? Your decision will impact the size of the index and the speed of indexing and searching, so give it due thought. Start from your GUI requirements and design the index that responds to your users' needs best. Nader

Daniel Cortes wrote: I want to know, in the case that you use Lucene to index files as a general searcher, what fields (or keys) do you use. For example, in my case they are html, pdf, doc, ppt and txt, and I'm thinking of using Field author, Field title, Field url, Field content, Field modification date. Anything more? Any recommendation? thks, and Merry Xmas to all.
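The stored/indexed layout in the reply above maps directly onto the Lucene 1.4-era Field factory methods. The sketch below is plain Java (the class, enum and field names are invented for illustration); the comments record what each factory method does in that API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IndexLayout {
    /** Lucene 1.4-era Field factory methods, reduced to their stored/indexed flags. */
    public enum FieldType {
        KEYWORD(true, true),    // Field.Keyword: stored, indexed as one untokenized term
        TEXT(true, true),       // Field.Text: stored, indexed, tokenized
        UNSTORED(false, true),  // Field.UnStored: indexed and tokenized, never stored
        UNINDEXED(true, false); // Field.UnIndexed: stored for display only

        public final boolean stored, indexed;
        FieldType(boolean stored, boolean indexed) { this.stored = stored; this.indexed = indexed; }
    }

    /** The layout suggested in the reply above, as field name -> treatment. */
    public static Map<String, FieldType> layout() {
        Map<String, FieldType> m = new LinkedHashMap<>();
        m.put("doc_id", FieldType.KEYWORD);     // searchable exact identifier
        m.put("title", FieldType.UNINDEXED);    // unstemmed display copy
        m.put("synopsis", FieldType.UNINDEXED); // display copy
        m.put("link", FieldType.UNINDEXED);     // display copy
        m.put("date", FieldType.KEYWORD);       // searchable / filterable
        m.put("body", FieldType.UNSTORED);      // stemmed full text, search only
        return m;
    }
}
```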
Re: MergerIndex + Searchables
As obvious as it may seem, you could always store the ID of the index in which you are indexing the document in the document itself, and have that fetched with the search results. Or is there something stopping you from doing that? Nader Henein

Karthik N S wrote: Hi guys, apologies. I have several merged indexes [MGR1, MGR2, MGR3]. For searching across these merged indexes I use the following code:

    IndexSearcher[] indexToSearch = new IndexSearcher[CNTINDXDBOOK];
    for (int all = 0; all < CNTINDXDBOOK; all++) {
        indexToSearch[all] = new IndexSearcher(INDEXEDBOOKS[all]);
        System.out.println(all + " ADDED TO SEARCHABLES " + INDEXEDBOOKS[all]);
    }
    MultiSearcher searcher = new MultiSearcher(indexToSearch);

Question: during the search process, how do I display which MRG a relevant document ID originated from? [Something like: search word 'ISBN12345' is available from MRGx.] WITH WARM REGARDS, HAVE A NICE DAY [N.S.KARTHIK]
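Besides storing the index name in each document, as suggested above, the sub-index can be recovered from the hit's document number: a multi-index searcher numbers documents globally, so keeping the starting doc number of each sub-index lets you binary-search for the owner. This plain-Java sketch (invented names, not a Lucene API) mirrors what Lucene's MultiSearcher does internally.

```java
import java.util.Arrays;

/**
 * Sketch: map a global document number back to the sub-index it came from,
 * given the size of each sub-index. Illustrative only.
 */
public class SubIndexLookup {
    private final int[] starts; // starts[i] = first global doc number of sub-index i

    public SubIndexLookup(int[] subIndexSizes) {
        starts = new int[subIndexSizes.length];
        int total = 0;
        for (int i = 0; i < subIndexSizes.length; i++) {
            starts[i] = total;
            total += subIndexSizes[i];
        }
    }

    /** Which sub-index (MGR1 = 0, MGR2 = 1, ...) owns this global doc number? */
    public int subIndex(int globalDoc) {
        int pos = Arrays.binarySearch(starts, globalDoc);
        return pos >= 0 ? pos : -pos - 2; // insertion point minus one
    }
}
```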
Re: LUCENE1.4.1 - LUCENE1.4.2 - LUCENE1.4.3 Exception
This is an OS file-system error, not a Lucene issue (so not one for this board). Google it for Gentoo specifically and you get a whole bunch of results, one of which is this thread on the Gentoo forums: http://forums.gentoo.org/viewtopic.php?t=9620 Good luck, Nader Henein

Karthik N S wrote: Hi guys, can somebody tell me what this exception is that I am getting, please. Sys specifications: O/S Linux Gentoo, appserver Apache Tomcat/4.1.24, JDK build 1.4.2_03-b02, Lucene 1.4.1, .2, .3. Note: this exception is displayed on every 2nd query after Tomcat is started.

    java.io.IOException: Stale NFS file handle
        at java.io.RandomAccessFile.readBytes(Native Method)
        at java.io.RandomAccessFile.read(RandomAccessFile.java:307)
        at org.apache.lucene.store.FSInputStream.readInternal(FSDirectory.java:420)
        at org.apache.lucene.store.InputStream.readBytes(InputStream.java:61)
        at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:220)
        at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
        at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
        at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
        at org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:142)
        at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:115)
        at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:143)
        at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:137)
        at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:253)
        at org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:69)
        at org.apache.lucene.search.Similarity.idf(Similarity.java:255)
        at org.apache.lucene.search.TermQuery$TermWeight.sumOfSquaredWeights(TermQuery.java:47)
        at org.apache.lucene.search.Query.weight(Query.java:86)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:85)
        at org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:251)

WITH WARM REGARDS, HAVE A NICE DAY [N.S.KARTHIK]
Re: Opinions: Using Lucene as a thin database
How big do you expect it to get, and how often do you expect to update it? We've been using Lucene for about 1M records (19 fields each) with incremental updates every 10 minutes. The performance during updates wasn't wonderful, so it took some seriously intense code to sort that out. As you mentioned, it comes down to what you need the thin DB for: Lucene is a wonderful search engine, but if I were looking for a fast and dirty relational DB, MySQL wins hands down. Put them both together and you've really got something. My 2 cents, Nader Henein

Kevin L. Cobb wrote: I use Lucene as a legitimate search engine, which is cool. But I am also using it as a simple database too. I build an index with a couple of keyword fields that allows me to retrieve values based on exact matches in those fields. This is all I need to do, so it works just fine for my needs. I also love the speed; the index is small enough that it is wicked fast. I was wondering if anyone out there was doing the same, or if there are any dissenting opinions on using Lucene for this purpose.
Re: HITCOLLECTOR+SCORE+DELIMMA
Dude, and I say this with love: it's open source, you've got the code. Take the initiative, DIY, be creative, and share your findings with the rest of us. Personally I would be interested to see how you do this; keep your changes documented and share. Nader Henein

Karthik N S wrote: Hi Erik, apologies, I got confused with the last mail. Iterating over Hits returns large hit counts, and iterating over Hits for scores consumes time, so how do I limit my search to between [X.xf and Y.yf] prior to getting the Hits? Note: the search is being done on a field of type Text consisting of 'contents' from various HTML documents. Please advise me. Karthik

Erik Hatcher wrote (Monday, December 13, 2004): On Dec 13, 2004, at 1:16 AM, Karthik N S wrote: "So u say I have to build a Filter to collect all the scores between the 2 ranges [0.2f to 1.0f]." My message is being misinterpreted. I said filter as a verb, not a noun. :) In other words, I was not intending to mean write a Filter; a Filter would not be able to filter on score. "So the API for the same would be Hits hit = search(Query query, Filter filtertoGetScore). But while writing the Filter, score again depends on Hits: score = hits.score(x);" Again, you cannot write a Filter (capital 'F') to deal with score. Please re-read what I said below: Hits are in descending score order, so you may just want to use Hits and filter based on the score provided by hits.score(i). Iterate over Hits; when you encounter scores below your desired range, stop iterating. Why is this simple procedure not good enough for what you are trying to achieve? Erik
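Erik's suggestion can be sketched with a plain array standing in for Hits: because results arrive in descending score order, iteration can stop the moment a score drops below the lower bound, so the expensive part of a large result set is never touched. The class and method names here are invented for illustration.

```java
import java.util.*;

/**
 * Sketch: collect the positions of hits whose score lies in [min, max],
 * exploiting the descending score order of Hits to stop early.
 */
public class ScoreWindow {
    public static List<Integer> inRange(float[] descendingScores, float min, float max) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < descendingScores.length; i++) {
            float s = descendingScores[i];
            if (s < min) break;        // everything after this is lower still
            if (s <= max) kept.add(i); // skip hits above the upper bound
        }
        return kept;
    }
}
```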
Re: SEARCH CRITERIA
They probably create a list of similar results by doing some sort of data mining on the search criteria that people use in succession. For example, they have a list of searches that are too general (a search for the word 'kid' is at best far too broad), but you can't call your users stupid, so you try to guess what they're searching for based on other searches conducted (kid rock, kid games, star wars kid, karate kid) that contain the initial search string 'kid'. You can use fuzzy search in Lucene, but that won't really do this; the short answer is DIY, depending on your needs. My two galleons, Nader Henein

Karthik N S wrote: Hi guys, apologies. On Yahoo and AltaVista, a search for a word like 'kid' returns results with suggestions like: Also try: kid rock, kid games, star wars kid, karate kid, More... How do I obtain similar search suggestions using Lucene? Thx in advance. Warm regards, Karthik
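The query-log mining described above can be sketched in a few lines: collect past queries that contain the (too general) search string and surface the most frequent ones. Everything here (class name, log contents) is invented for illustration; a production version would mine a real query log.

```java
import java.util.*;

/**
 * Toy "Also try" feature: suggest frequent past queries that contain
 * the user's overly general search term.
 */
public class RelatedSearches {
    public static List<String> suggest(List<String> queryLog, String term, int max) {
        Map<String, Integer> counts = new HashMap<>();
        for (String q : queryLog) {
            String norm = q.toLowerCase().trim();
            if (!norm.equals(term) && norm.contains(term)) counts.merge(norm, 1, Integer::sum);
        }
        List<String> out = new ArrayList<>(counts.keySet());
        // Most frequent first; break ties alphabetically so output is stable.
        out.sort((a, b) -> {
            int d = counts.get(b) - counts.get(a);
            return d != 0 ? d : a.compareTo(b);
        });
        return out.subList(0, Math.min(max, out.size()));
    }
}
```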
Re: disadvantages
You may singe your fingers if you touch the keyboard during indexing. Nader

Miguel Angel wrote: What are the disadvantages of Lucene?
Re: Optimized??
The down and dirty answer is that it's like defragmenting your hard drive: you're basically compacting and sorting out index references. What you need to know is that it makes searching much faster after you've updated the index. Nader Henein

Miguel Angel wrote: What does an optimized index mean in Lucene?
Re: Backup strategies
We've recently implemented something similar, with the backup process creating a file (much like the lock files during indexing) that the IndexWriter recognizes (a small tweak) so that it doesn't attempt to start an indexing run or a delete while the file is there. It wasn't that much work, actually. Nader

Doug Cutting wrote: Christoph Kiehl wrote: "I'm curious about your strategy for backing up indexes based on FSDirectory. If I do a file-based copy, I suspect I will get corrupted data because of concurrent write access. My current favorite is to create an empty index and use IndexWriter.addIndexes() to copy the current index state, but I'm not sure about the performance of this solution. How do you make your backups?" A safe way to back up is to have your indexing process, when it knows the index is stable (e.g., just after calling IndexWriter.close()), make a checkpoint copy of the index by running a shell command like 'cp -lpr index index.YYYYMMDDHHmmSS'. This is very fast and requires little disk space, since it creates only a new directory of hard links. Then you can separately back this up and subsequently remove it. This is also a useful way to replicate indexes: on the master indexing server, periodically perform 'cp -lpr' as above; then search slaves can use rsync to pull down the latest version of the index. If a very small merge factor is used (e.g., 2), the index will have only a few segments, so searches are fast. On the slave, periodically find the latest index.YYYYMMDDHHmmSS, use 'cp -lpr index/ index.YYYYMMDDHHmmSS' and 'rsync --delete master:index.YYYYMMDDHHmmSS index.YYYYMMDDHHmmSS' to efficiently get a local copy, and finally 'ln -fsn index.YYYYMMDDHHmmSS index' to publish the new version of the index. Doug
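Doug's 'cp -lpr' checkpoint can also be done from Java with NIO hard links, which is handy if the indexing process itself should take the snapshot right after IndexWriter.close(). This is a minimal sketch (flat directory only, invented names); since only links are created, no segment data is copied.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

/**
 * Sketch: checkpoint an index directory with hard links, mirroring
 * `cp -lpr index index.TIMESTAMP`. Assumes a flat directory of files.
 */
public class IndexSnapshot {
    public static Path snapshot(Path src, String stamp) throws IOException {
        Path dst = src.resolveSibling(src.getFileName() + "." + stamp);
        Files.createDirectory(dst);
        try (Stream<Path> files = Files.list(src)) {
            for (Path f : (Iterable<Path>) files::iterator) {
                if (Files.isRegularFile(f)) {
                    // Hard link: no data copied, so the checkpoint is fast and cheap.
                    Files.createLink(dst.resolve(f.getFileName()), f);
                }
            }
        }
        return dst;
    }
}
```

The snapshot directory can then be rsynced or archived at leisure and deleted afterwards, just as in the shell version.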
Re: _4c.fnm missing
What kind of incremental updates are you doing? We update our index every 15 minutes with 100~200 documents, writing to a 6 GB memory-resident index, and the IndexWriter runs one instance at a time. So what kind of increments are we talking about? It takes a bit of doing to overwhelm Lucene. What's your update schedule, how big is the index, and after how many updates does the system crash? Nader Henein

Luke Shannon wrote: It consistently breaks when I run more than 10 concurrent incremental updates. I can post the code on Bugzilla (hopefully when I get to the site it will be obvious how I can post things). Luke

Otis Gospodnetic wrote (Tuesday, November 16, 2004): Field names are stored in the field info file, with suffix .fnm; see http://jakarta.apache.org/lucene/docs/fileformats.html The .fnm should be inside the .cfs file (.cfs files are compound files that contain all the index files described at the above URL). Maybe you can provide the code that causes this error in Bugzilla for somebody to look at. Does it consistently break? Otis

Luke Shannon wrote: I received the error below when I was attempting to overwhelm my system with incremental update requests. What is this file it is looking for? I checked the index; it contains: _4c.del, _4d.cfs, deletable, segments. Where does _4c.fnm come from? Here is the error: Unable to create the writer and/or index new content: /usr/tomcat/fb_hub/WEB-INF/index/_4c.fnm (No such file or directory). Thanks, Luke
Re: _4c.fnm missing
That's it: you need to batch your updates. It comes down to whether you really need to give your users search accuracy to the second. Take your database, put an is_dirty column on the master table of the object you're indexing, run a scheduled task every x minutes, and have your process read the objects that are flagged dirty and then reset the flag once they've been indexed correctly. My two cents, Nader

Otis Gospodnetic wrote: 'Concurrent' and 'updates' in the same sentence sounds like a possible source of the problem. You have to use a single IndexWriter, and it should not overlap with an IndexReader that is doing deletes. Otis

Luke Shannon wrote: It consistently breaks when I run more than 10 concurrent incremental updates. I can post the code on Bugzilla (hopefully when I get to the site it will be obvious how I can post things). Luke

Otis Gospodnetic wrote (Tuesday, November 16, 2004): Field names are stored in the field info file, with suffix .fnm; see http://jakarta.apache.org/lucene/docs/fileformats.html The .fnm should be inside the .cfs file (.cfs files are compound files that contain all the index files described at the above URL). Maybe you can provide the code that causes this error in Bugzilla for somebody to look at. Does it consistently break? Otis

Luke Shannon wrote: I received the error below when I was attempting to overwhelm my system with incremental update requests. What is this file it is looking for? I checked the index; it contains: _4c.del, _4d.cfs, deletable, segments. Where does _4c.fnm come from? Here is the error: Unable to create the writer and/or index new content: /usr/tomcat/fb_hub/WEB-INF/index/_4c.fnm (No such file or directory).
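The is_dirty batching advice above reduces many concurrent update requests to one periodic single-writer pass. This schematic uses plain Maps as stand-ins for the database table and the Lucene index (all names invented); in real code the "index" step would be a delete-then-add through a single IndexWriter/IndexReader pair inside the scheduled task.

```java
import java.util.*;

/**
 * Schematic of is_dirty batching: callers only flag rows; one scheduled
 * pass re-indexes the flagged rows and clears the flags.
 */
public class DirtyBatchIndexer {
    public final Map<Integer, String> table = new HashMap<>(); // row id -> content
    public final Set<Integer> dirty = new HashSet<>();         // rows flagged is_dirty
    public final Map<Integer, String> index = new HashMap<>(); // stand-in for the Lucene index

    public void update(int id, String content) {
        table.put(id, content);
        dirty.add(id); // cheap: no writer contention at update time
    }

    /** Run every x minutes: re-index only flagged rows, then reset the flags. */
    public void reindexDirty() {
        for (int id : dirty) index.put(id, table.get(id)); // delete+add in real Lucene
        dirty.clear();
    }
}
```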
Thanks, Luke
Re: Need help with filtering
Well, if the document ID is a number (even if it isn't stored as one), you could use a range query, or just rebuild your index using that specific field as a sorted field. But if it is numeric, be aware that if you use an integer, it limits how high your numbers can get. Nader

Edwin Tang wrote: Hello, I have been using DateFilter to limit my search results to a certain date range. I am now asked to replace this filter with one where my search results have document IDs greater than a given document ID. This document ID is assigned during indexing and is a Keyword field. I've browsed the FAQs and archives and see that I can use either QueryFilter or BooleanQuery. I've tried both approaches to limit the document ID range, but am getting the BooleanQuery.TooManyClauses exception in both cases. I've also tried bumping the max number of clauses via setMaxClauseCount(), but that number has gotten pretty big. Is there another approach to this? Or am I setting this up incorrectly? A snippet of one of my approaches follows:

    queryFilter = new QueryFilter(new RangeQuery(new Term("id", sLastSearchedId), null, false));
    docs = searcher.search(parser.parse(sSearchPhrase), queryFilter, utility.iMaxResults, new Sort(sortFields));

Thanks in advance, Ed
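One standard workaround for the numeric-ID caveat above (not spelled out in the reply) is to left-pad the IDs to a fixed width at index time, so that the lexicographic term order a range query operates on matches numeric order. The class name and width below are illustrative.

```java
/**
 * Sketch: zero-pad numeric IDs so string (term) order equals numeric order.
 * WIDTH is an arbitrary example; pick one wide enough for your largest ID.
 */
public class PaddedId {
    public static final int WIDTH = 10;

    public static String pad(long id) {
        return String.format("%0" + WIDTH + "d", id); // e.g. 123 -> "0000000123"
    }
}
```

Without padding, "99" sorts after "123" as a string, which is exactly why a raw keyword ID field misbehaves in range queries.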
Re: How to efficiently get # of search results, per attribute
It depends on how many results they're looking through. Here are the two scenarios I see: 1] If you don't have that many records, you can fetch all the results and then do a post-parsing step to determine the totals. 2] If you have a lot of entries in each category and you're worried about fetching thousands of records every time, you can have separate indices per category and search them in parallel (not Lucene parallel search); you can fetch up to 100 hits for each one (for efficiency) but still have the total from each search to display. Either way, you can boost speed using a RAMDirectory if you need more from the search, but whichever approach you choose, I would recommend that you sit down and do some number crunching to figure out which way to go. Hope this helps, Nader Henein

Chris Lamprecht wrote: I'd like to implement a search across several types of entities, let's say classes, professors, and departments. I want the user to be able to enter a simple, single query and not have to specify what they're looking for. Then I want the search results to be something like this:

    Search results for: philosophy boyer
    Found: 121 classes - 5 professors - 2 departments
    search results here...

I know I could iterate through every hit returned and count them up myself, but that seems inefficient if there are lots of results. Is there some other way to get this kind of information from the search result set? My other ideas are: doing a separate search per result type, or storing different types in different indexes. Any suggestions? Thanks for your help! -Chris
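Scenario 1] above amounts to one pass over the combined results, totalling by a stored "type" field. A minimal sketch (the list of type strings stands in for real hits; names are invented):

```java
import java.util.*;

/**
 * Sketch: post-parse the result set once to get per-category totals,
 * e.g. "121 classes - 5 professors - 2 departments".
 */
public class CategoryCounts {
    public static Map<String, Integer> count(List<String> resultTypes) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String type : resultTypes) totals.merge(type, 1, Integer::sum);
        return totals;
    }
}
```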
Re: UPDATION+MERGERINDEX
Well, if you do all the steps in one run, I would guess that optimizing once at the end is faster overall, but all you have to do is test it out and time it. Performance-wise, I don't think that step 3 (OPTIMISE) in scenario (a) will really improve the performance of the new index merge. My 2 cents, Nader Henein

Karthik N S wrote: Hi guys, apologies.

a)
1) SEARCH FOR SUBINDEX IN AN OPTIMISED MERGED INDEX
2) DELETE THE FOUND SUBINDEX FROM THE OPTIMISED MERGERINDEX
3) OPTIMISE THE MERGERINDEX
4) ADD A NEW VERSION OF THE SUBINDEX TO THE MERGERINDEX
5) OPTIMISE THE MERGERINDEX

b)
1) SEARCH FOR SUBINDEX IN AN OPTIMISED MERGED INDEX
2) DELETE THE FOUND SUBINDEX FROM THE OPTIMISED MERGERINDEX
3) ADD A NEW VERSION OF THE SUBINDEX TO THE MERGERINDEX
4) OPTIMISE THE MERGERINDEX

WHICH IS THE BETTER CHOICE, a OR b? THX IN ADVANCE. WITH WARM REGARDS, HAVE A NICE DAY [N.S.KARTHIK]
Re: Atomicity in Lucene operations
As soon as I've cleaned up the code I'll publish it; it needs a little more documentation as well. Nader

Roy Shan wrote: Maybe you can contribute it to the sandbox?

Yonik Seeley wrote (Mon, 18 Oct 2004): Hi Nader, I would greatly appreciate it if you could CC me on the docs or the code. Thanks! Yonik

Nader Henein wrote: It's pretty integrated into our system at this point. I'm working on packaging it and cleaning up my documentation, and then I'll make it available. I can give you the documents, and if you still want the code I'll slap together a rough copy for you and ship it across. Nader Henein

Roy Shan wrote: Hello Nader, I am very interested in how you implement the atomicity. Could you send me a copy of your code? Thanks in advance. Roy
Re: Atomicity in Lucene operations
It's pretty integrated into our system at this point. I'm working on packaging it and cleaning up my documentation, and then I'll make it available. I can give you the documents, and if you still want the code I'll slap together a rough copy for you and ship it across. Nader Henein

Roy Shan wrote: Hello Nader, I am very interested in how you implement the atomicity. Could you send me a copy of your code? Thanks in advance. Roy

Nader Henein wrote (Sat, 16 Oct 2004): We use Lucene over 4 replicated indices, and we have to maintain atomicity on deletions and updates with multiple fallback points. I'll send you the write-up; it's too big to CC to the entire board. Nader Henein

Christian Rodriguez wrote: Hello guys, I need additions and deletions of documents to the index to be ATOMIC (they either happen to completion or not at all). On top of this, I need updates (which I currently implement as a deletion of the document followed by an addition) to be ATOMIC and DURABLE (once I return from the update function, it is because the operation happened to completion and stays in the index). Notice that I don't really need all the ACID properties for all the operations. I have tried to solve the problem by using the Lucene + BDB package written by Andi Vajda and using transactions, but the BDB database gets corrupted if I insert random System.exit() calls to simulate a crash of the application before aborting or committing transactions. So I have two questions: 1. Has anyone been able to use Lucene + BDB WITH transactions, simulated random crashes at different points in the process of adding items, and found it to be robust (especially, have you been able to always recover after a crash, with uncommitted txns rolled back and committed ones present in the DB)? 2. Can anyone suggest other solutions (besides using BDB) that may work? For example: are any of these operations already atomic in Lucene (using an FSDirectory)?
Thanks for any help you can give me! Xtian
Re: simultanous search and indexing
You can do both at the same time; it's thread safe. You will face different issues depending on the frequency of your indexing and the load on the search, but that shouldn't come into play until your index gets nice and heavy. So basically: code on. Nader Henein

Miro Max wrote: hi, I'm using a servlet to search my index, and I wish to be able to create an index at the same time. Do I have to use threads? I'm a beginner. thx
Re: Atomicity in Lucene operations
We use Lucene over 4 replicated indices, and we have to maintain atomicity on deletions and updates with multiple fallback points. I'll send you the write-up; it's too big to CC to the entire board. Nader Henein

Christian Rodriguez wrote: Hello guys, I need additions and deletions of documents to the index to be ATOMIC (they either happen to completion or not at all). On top of this, I need updates (which I currently implement as a deletion of the document followed by an addition) to be ATOMIC and DURABLE (once I return from the update function, it is because the operation happened to completion and stays in the index). Notice that I don't really need all the ACID properties for all the operations. I have tried to solve the problem by using the Lucene + BDB package written by Andi Vajda and using transactions, but the BDB database gets corrupted if I insert random System.exit() calls to simulate a crash of the application before aborting or committing transactions. So I have two questions: 1. Has anyone been able to use Lucene + BDB WITH transactions, simulated random crashes at different points in the process of adding items, and found it to be robust (especially, have you been able to always recover after a crash, with uncommitted txns rolled back and committed ones present in the DB)? 2. Can anyone suggest other solutions (besides using BDB) that may work? For example: are any of these operations already atomic in Lucene (using an FSDirectory)? Thanks for any help you can give me! Xtian
Re: Encrypted indexes
Well, are you storing any data for retrieval from the index? Because you could encrypt the actual stored data, and then encrypt the search string, public-key style. Nader Henein Weir, Michael wrote: We need to have index files that can't be reverse engineered, etc. An obvious approach would be to write an 'FSEncryptedDirectory' class, but that sounds like a performance killer. Does anyone have experience in making an index secure? Thanks for any help, Michael Weir
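One concrete reading of Nader's suggestion: leave the inverted index alone, but encrypt the *stored* field values before they go into the index, so a stolen index yields tokens but no readable document bodies. The sketch below uses symmetric AES from the JDK's own javax.crypto purely to stay self-contained (Nader suggests public-key style; swapping in an RSA transformation would be the analogous move). The class and method names are hypothetical.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.util.Base64;

// Sketch: the Base64 ciphertext is what you would place in a stored
// (not indexed) Lucene field; decryption happens at display time.
public class StoredFieldCrypto {
    static String encrypt(SecretKey key, String plaintext) throws Exception {
        Cipher c = Cipher.getInstance("AES");
        c.init(Cipher.ENCRYPT_MODE, key);
        return Base64.getEncoder().encodeToString(c.doFinal(plaintext.getBytes("UTF-8")));
    }

    static String decrypt(SecretKey key, String ciphertext) throws Exception {
        Cipher c = Cipher.getInstance("AES");
        c.init(Cipher.DECRYPT_MODE, key);
        return new String(c.doFinal(Base64.getDecoder().decode(ciphertext)), "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = KeyGenerator.getInstance("AES").generateKey();
        String stored = encrypt(key, "confidential report body");
        System.out.println(decrypt(key, stored)); // round-trips to the original
    }
}
```

Note this protects only stored content; the indexed terms themselves remain in the clear, which is exactly the trade-off the thread is weighing against a full encrypted-Directory implementation.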
Re: sorting and score ordering
As far as my testing showed, the sort will take priority, because it's basically an opt-in sort, as opposed to the default score sort. So you're basically displaying a sorted set over all your results, as opposed to sorting the most relevant results. Hope this helps. Nader Henein Chris Fraschetti wrote: If I use a Sort instance on my searcher, what will have priority, score or sort? Assuming I have pages with .9, .9, and .5 scores: if the .5 page has a higher 'sort' value, will it return higher than one of the .9 pages?
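Nader's answer can be illustrated with plain Java: once an explicit sort key is supplied, result order comes from that key alone and the relevance score is never consulted. The `Hit` record below is a hypothetical stand-in for a search result, not a Lucene class.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch: an opt-in sort over all hits; the score field exists but plays
// no part in the ordering, mirroring the behavior described above.
public class SortPriority {
    record Hit(String id, float score, int sortValue) {}

    static List<Hit> sortedResults(List<Hit> hits) {
        // highest sortValue first, score deliberately ignored
        return hits.stream()
                .sorted(Comparator.comparingInt(Hit::sortValue).reversed())
                .toList();
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("a", 0.9f, 1),
                new Hit("b", 0.9f, 2),
                new Hit("c", 0.5f, 3));
        // the 0.5-score hit comes first because its sort value is highest
        System.out.println(sortedResults(hits).get(0).id()); // prints "c"
    }
}
```

This matches Chris's scenario exactly: the .5-score page with the highest sort value outranks both .9-score pages.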
Re: Arabic analyzer
There is a way of writing an Arabic stemmer; it's just not a weekend project. I've seen the translate/stem option as well, and even tried it with Lucene. We've implemented Lucene on our database, and we have about a million records in our DB, with 19 indexed fields (some of which are clobs) in each record; the free-text fields are in many cases Arabic. We do not provide stemming on those, simply because I couldn't find a valid stemming or translation option that held up to proper testing. Some were OK, but after collecting data from user searches (averaging out at 5 searches per second), the Arabic stemming options were not able to manage user expectations, which is what it comes down to; sometimes theory does not translate well to practice. Nader Henein Dawid Weiss wrote: nothing to do with each other furthermore, Arabic uses phonetic indicators on each letter, called diacritics, that change the way you pronounce the word, which in turn changes the word's meaning, so two words spelled exactly the same way with different diacritics will mean two separate things. Just to point out a fact: most Slavic languages also use diacritic marks (above, like the 'acute' or 'dot' marks, or below, like the Polish 'ogonek' mark). Some people argue that they can be stripped off the text upon indexing and that the queries usually disambiguate the context of the word. It is just a digression. Now back to the Arabic stemmer -- there has to be a way of doing it. I know Vivisimo has clustering options for Arabic. They must be using a stemmer (and an English translation dictionary), although it might be a commercial one. Take a look: http://vivisimo.com/search?v:file=cnnarabic D.
Re: Arabic analyzer
I'd be happy to help anyone test this out; my Arabic is pretty good. Nader Andrzej Bialecki wrote: Dawid Weiss wrote: [same diacritics discussion quoted in the previous message] Hmm. This brings up a question: the algorithmic stemmer package from Egothor works quite well for Polish (http://www.getopt.org/stempel); wouldn't it work well for Arabic, too? I lack the necessary expertise to evaluate the results (knowing only two or three Arabic words ;-) ), but I can certainly help someone get started with testing...
Re: Moving from a single server to a cluster
Hey Ben, We've been using a distributed environment with three servers and three separate indices for the past two years, since the first stable Lucene release, and it has been great. For the past two months I've been working on a redesign of our Lucene app, and I've shared my findings and plans with Otis, Doug and Erik. They pointed out a few faults in my logic, which you will probably come across soon enough; they mainly have to do with keeping your updates atomic (not too hard) and your deletes atomic (a little more tricky). Give me a few days, and I'll send you both the early document and the newer version that deals squarely with Lucene in a distributed environment with a high-volume index. Regards, Nader Henein Ben Sinclair wrote: My application currently uses Lucene with an index living on the filesystem, and it works fine. I'm moving to a clustered environment soon and need to figure out how to keep my indexes together. Since the index is on the filesystem, each machine in the cluster will end up with a different index. I looked into JDBC Directory, but it's not tested under Oracle and doesn't seem like a very mature project. What are other people doing to solve this problem?
Re: Moving from a single server to a cluster
It'd be a pleasure; I just didn't want to lead someone down the wrong path. Give me a few days and I'll have the new version up. Nader
Re: Devnagari Search?
Have faith in the Unicode standard; it's well thought out. If you have any internationalization queries, there was an excellent article on JavaWorld entitled "End-to-end internationalization"; here's the link: http://www.javaworld.com/javaworld/jw-05-2004/jw-0524-i18n_p.html. Have a read; it helps clear up some myths. Nader Henein
RE: read only file system
I hate to speak after Otis, but the way we deal with this is by clearing locks on server restart, in case a server crash occurred mid-indexing; we also optimize on server restart. It doesn't happen often (God bless Resin), but when it has, we have faced no problems from Lucene. Just for the record, we have a validate function that LuceneInit calls; it looks something like this:

try {
    Directory directory = FSDirectory.getDirectory(indexPath, false);
    if (directory.list().length == 0) clear();
    Lock writeLock = directory.makeLock(writeFileName);
    if (!writeLock.obtain()) {
        IndexReader.unlock(directory);
    } else {
        writeLock.release();
    }
} catch (IOException e) {
    logger.error("Index Validate", e);
}

Nader -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Friday, April 30, 2004 4:09 PM To: Lucene Users List; [EMAIL PROTECTED] Subject: Re: read only file system If you have a very recent Lucene, then you can disable locks with command-line parameters. I believe a page describing the various command-line parameters is on Lucene's Wiki. Otis --- Supun Edirisinghe [EMAIL PROTECTED] wrote: I think I'm a little confused about how an index is put into use on a read-only file system. I'm using Lucene in my web application. Our indexes are built off our database nightly and copied onto our web app servers. I think our web app dies from time to time, and sometimes a lock is left behind from Lucene in /tmp/. I have read that there is a disableLuceneLocks system property (is that the full name, or is it something like org.apache.jakarta...disableLuceneLocks?). But I'm still not sure how I can set that. Do I give it as a command-line arg to the java VM? thanks
RE: Disappearing segments
Could you share your indexing code? And just to make sure, is there anything running on your machine that could delete these files, like a cron job that backs up the index? You could go by process of elimination: shut down your server and see if the files still disappear, because if the problem is contained within the server, you know you can safely go on the DEBUG rampage. Nader -Original Message- From: Kelvin Tan [mailto:[EMAIL PROTECTED] Sent: Friday, April 30, 2004 9:15 AM To: Lucene Users List Subject: Re: Disappearing segments An update: Daniel Naber suggested using IndexWriter.setUseCompoundFile() to see if it happens with the compound index format. Before I had a chance to try it out, this happened:

java.io.FileNotFoundException: C:\index\segments (The system cannot find the file specified)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.init(RandomAccessFile.java:200)
    at org.apache.lucene.store.FSInputStream$Descriptor.init(FSDirectory.java:321)
    at org.apache.lucene.store.FSInputStream.init(FSDirectory.java:329)
    at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:71)
    at org.apache.lucene.index.IndexWriter$1.doBody(IndexWriter.java:154)
    at org.apache.lucene.store.Lock$With.run(Lock.java:116)
    at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:149)
    at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:131)

so even the segments file somehow got deleted. Hoping someone can shed some light on this... Kelvin On Thu, 29 Apr 2004 11:45:36 +0800, Kelvin Tan said: Errr, sorry for the cross-post to lucene-dev as well, but I realized this mail really belongs on lucene-user...
I've been experiencing intermittently disappearing segments, which result in the following stacktrace:

Caused by: java.io.FileNotFoundException: C:\index\_1ae.fnm (The system cannot find the file specified)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.init(RandomAccessFile.java:200)
    at org.apache.lucene.store.FSInputStream$Descriptor.init(FSDirectory.java:321)
    at org.apache.lucene.store.FSInputStream.init(FSDirectory.java:329)
    at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
    at org.apache.lucene.index.FieldInfos.init(FieldInfos.java:78)
    at org.apache.lucene.index.SegmentReader.init(SegmentReader.java:104)
    at org.apache.lucene.index.SegmentReader.init(SegmentReader.java:95)
    at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:112)
    at org.apache.lucene.store.Lock$With.run(Lock.java:116)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:103)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:91)
    at org.apache.lucene.search.IndexSearcher.init(IndexSearcher.java:75)

The segment file that disappears (_1ae.fnm) varies. I can't seem to reproduce this error consistently, so I don't have a clue what might cause it, but it usually happens after the application has been running for some time. Has anyone experienced something similar, or can anyone point me in the right direction? When this occurs, I need to rebuild the entire index for it to be usable. Very troubling indeed... Kelvin
Re: Multi-Threading
Why do you have concurrency problems? Are you trying to have each user initiate the indexing himself? Because that will create issues. How about you put all the new files you want to index in a directory, and then have a scheduled procedure on the webserver run the Lucene indexer on that directory? Our application hasn't had any concurrency problems at all, because we index based on a pull system, rather than the users pushing documents to the indexer. I hope I understood your problem correctly, so that the answer is useful. Nader On Tue, 19 Aug 2003 12:55:09 +0200, Damien Lust wrote: Hello, I developed a client-server application on the web, with a search module using Lucene. In the same application, users can index new text, so multiple sessions can access the index, and concurrency problems are possible. I used threads in Java. Is that the best solution? I call:

IndexFiles indexFiles = new IndexFiles();
indexFiles.run();

Here is an extract of my code. Thanks.

public class IndexFiles extends Thread {
    public IndexFiles() {
    }

    public void run() {
        SynchronizedIndexWriter.insertDocument(currentIndexDocument(),
                "tmp/IndexPath", new MainAnalyser());
    }
}

public class SynchronizedIndexWriter {
    static synchronized void insertDocument(IndexDocument document,
            String indexLocValue, Analyzer analyzerValue) {
        File f = new File(indexLocValue);
        if (f.exists())
            addDocumentToIndex(document, indexLocValue, analyzerValue, false);
        else
            addDocumentToIndex(document, indexLocValue, analyzerValue, true);
    }

    static synchronized void addDocumentToIndex(IndexDocument document,
            String indexLocValue, Analyzer analyzerValue, boolean createNewIndex) {
        try {
            IndexWriter indexWriter =
                    new IndexWriter(indexLocValue, analyzerValue, createNewIndex);
            indexWriter.addDocument(document.getDocument());
            indexWriter.optimize();
            indexWriter.close();
        } catch (IOException io) {
            // If the IndexWriter can't write to the index because it's locked,
            // recall the function -- this is not very safe
            addDocumentToIndex(document, indexLocValue, analyzerValue, createNewIndex);
        } catch (Exception e) {
        }
    }
}
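The pull model Nader recommends can be sketched with the JDK's own scheduling primitives: user requests only enqueue documents, and one scheduled task drains the queue, so the indexer is the only writer and no lock-retry recursion (as in Damien's catch block) is needed. The class names below are hypothetical, and the `indexed` list is a placeholder for real IndexWriter calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: servlet threads only enqueue; a single scheduled task is the sole
// writer, which removes the lock contention entirely.
public class PullIndexer {
    final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
    final List<String> indexed = new ArrayList<>(); // stands in for the index

    void submit(String doc) {        // called from any servlet thread
        pending.add(doc);
    }

    void drain() {                   // called only from the scheduler thread
        List<String> batch = new ArrayList<>();
        for (String doc; (doc = pending.poll()) != null; )
            batch.add(doc);
        indexed.addAll(batch);       // real code: open IndexWriter, add docs, close
    }

    public static void main(String[] args) throws Exception {
        PullIndexer indexer = new PullIndexer();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(indexer::drain, 0, 50, TimeUnit.MILLISECONDS);
        indexer.submit("doc-1");
        indexer.submit("doc-2");
        Thread.sleep(200);           // give the scheduled task time to run
        scheduler.shutdown();
        System.out.println(indexer.indexed.size());
    }
}
```

Batching also lets you call optimize() once per batch rather than after every document, which Damien's extract currently does on each add.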