Re: Index Locking Issues Resolved...I hope

2004-11-17 Thread jeichels

I was thinking that perhaps I could pre-stem words with the Lucene stemming 
code before storing them in a search field in the database, then try the 
natural language search found in MySQL 4.1.1.   I am confident the MySQL 
product can't keep up with Lucene yet, but at least they have improved it 
some.  I am not even sure my hosting company will upgrade to 4.1.1, though.  
I am still looking for ways to make Lucene stay in sync more nicely with 
MySQL as the main database...aka an easy to use way of handling 



- Original Message -
From: Chris Lamprecht [EMAIL PROTECTED]
Date: Wednesday, November 17, 2004 1:38 am
Subject: Re: Index Locking Issues Resolved...I hope

 MySQL does offer a basic fulltext search (with MyISAM tables), but it
 doesn't really approach the functionality of Lucene, such as pluggable
 tokenizers, stemming, etc.  I think MS SQL server has fulltext search
 as well, but I have no idea if it's any good.
 
 See 
 http://www.google.com/search?hl=en&lr=&safe=off&c2coff=1&q=mysql+fulltext
 
 
 ---
 --
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 





Considering intermediary solution before Lucene question

2004-11-17 Thread jeichels

Is there a way to use Lucene stemming and stop word removal without using the 
rest of the tool?   I am downloading the code now, but I imagine the answer 
might be deeply buried.  I would like to be able to send in a phrase and get 
back a collection of keywords if possible.

I am thinking of using an intermediary solution before moving fully to Lucene.  
I don't have time to spend a month making a carefully tested, administratable 
Lucene solution for my site yet, but I intend to do so over time.  The funny 
thing is that the Lucene code would likely take up only a couple hundred lines, 
but integration and administration would take me much more time.

In the meantime, I am thinking I could perhaps use Lucene stemming and parsing 
of words, then store each search word along with the associated primary key in 
an indexed MySQL table.   Each record I would need to do this for is small, 
with maybe only 15 useful words on average.   I would have an in-database 
solution, though ranking, etc. would not exist.   This is better than the exact 
word searching I have currently, which is really bad.
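
The keyword-table idea above can be sketched roughly as follows. This is only a minimal illustration using the Java standard library: the stop list and the suffix-stripping stem() are crude placeholders standing in for a real Lucene analyzer, and the table name search_keyword(keyword, record_id) is a hypothetical name, not something from this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Turns a record's text into (keyword, primaryKey) rows for an indexed lookup table. */
final class KeywordRows {

    /** Tiny illustrative stop list; a real one would come from the analyzer. */
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to", "in", "is"));

    /** One row destined for a hypothetical table search_keyword(keyword, record_id). */
    static final class Row {
        final String keyword;
        final long recordId;

        Row(String keyword, long recordId) {
            this.keyword = keyword;
            this.recordId = recordId;
        }
    }

    /** Placeholder stemmer: strips a few common suffixes. This is NOT the Porter algorithm. */
    static String stem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 4 && word.endsWith("ed")) {
            return word.substring(0, word.length() - 2);
        }
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /** Lower-cases, splits on non-alphanumerics, drops stop words, stems, de-duplicates. */
    static Set<String> keywords(String text) {
        Set<String> result = new LinkedHashSet<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) {
                continue;
            }
            result.add(stem(raw));
        }
        return result;
    }

    /** Builds one row per distinct keyword for the given record. */
    static List<Row> rowsFor(long recordId, String text) {
        List<Row> rows = new ArrayList<>();
        for (String keyword : keywords(text)) {
            rows.add(new Row(keyword, recordId));
        }
        return rows;
    }
}
```

The search side would run the user's phrase through the same keywords() method and look each result up by exact match on the keyword column, so both sides agree on the stemmed form.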

By the way, MySQL 4.1.1 has some Lucene-type handling, but it too does not have 
stemming and I am sure it is very slow compared to Lucene.   cPanel is still 
stuck on MySQL 4.0.*, so many people would not have access to even this basic 
ability in production systems for some time yet.

JohnE






Re: Considering intermediary solution before Lucene question

2004-11-17 Thread jeichels
This is so cool Otis.  I was just about to write this based on something in 
the FAQ, but this is better than what I was doing.

This rocks!!!  Thank you.

JohnE

P.S.:  I am assuming you use org.apache.lucene.analysis.Token?   There are 
three Token classes under Lucene.



- Original Message -
From: Otis Gospodnetic [EMAIL PROTECTED]
Date: Wednesday, November 17, 2004 7:17 pm
Subject: Re: Considering intermediary solution before Lucene question

 Yes, you can use just the Analysis part.  For instance, I use this for
 http://www.simpy.com and I believe we also have this in the Lucene 
 book as part of the source code package:
 
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

/**
 * Gets Tokens extracted from the given text, using the specified Analyzer.
 *
 * @param analyzer the <code>Analyzer</code> to use
 * @param text the text to analyze
 * @param field the field to pass to the Analyzer for tokenization
 * @return an array of <code>Token</code>s
 * @exception IOException if an error occurs
 */
public static Token[] getTokens(Analyzer analyzer, String text, String field)
    throws IOException
{
    TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
    ArrayList tokenList = new ArrayList();
    // next() returns null once the stream is exhausted
    while (true) {
        Token token = stream.next();
        if (token == null)
            break;
        tokenList.add(token);
    }
    return (Token[]) tokenList.toArray(new Token[0]);
}
 
 Otis
 



Re: Considering intermediary solution before Lucene question

2004-11-17 Thread jeichels
I thank you both.  I have it already partly implemented here.   It seems easy.

At least this should carry my product through until I can really get to use 
Lucene.  I am not sure how far I can take MySQL with stemmed, indexed 
keywords, but it should give me maybe 6 months at least of something useful as 
opposed to impossible searching.  I need time and this might just do the trick.

I always fight for simplicity, but it is hard when you have 2 databases that 
have to be kept in sync.  If accuracy is important (people paying money), then 
handling all of the edge cases (such as the question that was just asked about 
what happens if the machine goes down) is very important.  I understand this 
is beyond the scope of Lucene.

Thank you for the help.  This really is an interesting project.

JohnE



- Original Message -
From: Chris Lamprecht [EMAIL PROTECTED]
Date: Wednesday, November 17, 2004 7:08 pm
Subject: Re: Considering intermediary solution before Lucene question

 John,
 
 It actually should be pretty easy to use just the parts of Lucene you
 want (the analyzers, etc) without using the rest.  See the example of
 the PorterStemmer from this article:
 
 http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=2
 
 You could feed a Reader to the tokenStream() method of
 PorterStemAnalyzer, and get back a TokenStream, from which you pull
 the tokens using the next() method.
 
 
 



Re: Lucene : avoiding locking (incremental indexing)

2004-11-16 Thread jeichels

I am interested in hearing experienced people's understanding, as I have the 
queue approach half developed already.

I am not following why you don't like the queue approach, Sergiu.  From what I 
gathered from this list, if you do lots of updates, opening the IndexWriter is 
very expensive and should be done in batches rather than one update at a time.  
In some cases on this list, people describe it being so overwhelming that they 
add forced delays so the Java engine can catch up.  Using a queueing approach, 
you may take a hit every 30 seconds or every minute or...whatever you choose 
as your timeframe, but that should be enough of a delay to keep the Java engine 
from being overwhelmed.  I would like this not to be necessary with Lucene and 
would like to be able to update the index every time an update occurs, but this 
does not seem the right approach right now.  As I said before, this seems like 
a wish item for Lucene.  I don't really know if the wish is feasible.

So far the biggest problem I have faced with this approach, however, is getting 
feedback from the archiving process to the main database that the archiving 
change has actually happened, and happened correctly, even if the server goes 
down.
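
One possible way to get that feedback is to keep a version counter per record and have the indexer acknowledge the version it actually indexed. The sketch below is an assumption-laden illustration, not anything from Lucene: the class is an in-memory stand-in for two hypothetical database columns, "version" and "indexed_version", and a record counts as pending whenever the two differ. If the server dies before the acknowledgement is written, the record simply stays pending and is indexed again after restart.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * In-memory stand-in for two hypothetical columns on each database row:
 * "version" (bumped on every update) and "indexed_version" (the last version
 * the indexer confirmed). A row is pending while the two differ.
 */
final class IndexAckTracker {
    private final Map<Long, Long> version = new LinkedHashMap<>();
    private final Map<Long, Long> indexedVersion = new LinkedHashMap<>();

    /** Called by the application whenever a record changes. */
    synchronized void recordChanged(long id) {
        version.merge(id, 1L, Long::sum);
    }

    /** Snapshot of pending work: id mapped to the version seen at snapshot time. */
    synchronized Map<Long, Long> pending() {
        Map<Long, Long> out = new LinkedHashMap<>();
        for (Map.Entry<Long, Long> e : version.entrySet()) {
            Long done = indexedVersion.get(e.getKey());
            if (done == null || done < e.getValue()) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    /** Called only after the index change has been safely written. */
    synchronized void acknowledge(long id, long observedVersion) {
        indexedVersion.merge(id, observedVersion, Long::max);
    }
}
```

Acknowledging the observed version rather than "the current one" matters: a record that changes again while the indexer is busy remains pending and is picked up on the next pass.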

JohnE





 Personally I don't like the queue approach... because I already
 implemented multithreading in our application to improve its performance.
 In our application indexing is not a high priority, but it happens quite
 often.  Search is a priority.
 
 Lucene allows several searches at one time. When you have a big index
 and many users, the queue approach can slow down your application too
 much. I think it will be a bottleneck.
 
 I know that the lock problem is annoying, but I also think that the right
 way is to identify the source of the locking.  Our application is a
 web-based application built on Turbine, and when we want to restart
 Tomcat, we just kill the process (otherwise we need to restart 2 times
 because of some log4j initialization problem), so the index is locked
 after the Tomcat restart. In my case it makes sense to check once at
 startup whether the index is locked. I'm also logging all errors that I
 get in the system, which helps me find their source more easily.
 
 All the best,
 
 Sergiu
 






Re: Index Locking Issues Resolved...I hope

2004-11-16 Thread jeichels
Very cool Luke.  I am not quite there yet.  I am halfway through implementing 
the queue approach, but I have hit walls that are making me sit back and figure 
out my strategy.   I have a Struts/Tomcat/OJB/MySQL project that can 
potentially have a million records, growing over time, with perhaps 100,000 
updates/day.  This is not today, but it is what I am building for.

My concern is not just Lucene itself, but its surrounding effects, as follows.  
I am finding that edge-case scenarios make things difficult because there are 
two databases instead of one.

- How to know the index on this huge database is always in sync.
- What happens if the server crashes or is brought down.  (Solution might 
be a last-modified date in the db.)
- Backups of the database and the index handled in an efficient, safe 
manner on a live system.
- How to reindex while the system is in place.  (Solution might be building 
the new index in a different location as a separate tool.)
- How to handle the fact that the IndexWriter is not very good at 
incremental updates in a high-volume update/query system.  (Solution might 
be to query the database every 45 seconds or so for records that have changed 
and apply the changes.)
- How the IndexWriter solution above might frequently cause bad lag on 
queries.  (No solution.)
- How to get Tomcat to start up a thread to run this updater at startup 
and not have a problem with memory management.
- How to make this all work in my startup business so that I feel I 
can sleep at night.
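
For the "every 45 seconds or so" updater thread in the list above, one option in the Java standard library is a single-threaded scheduled executor. The following is a minimal sketch under assumptions: IndexPoller and applyChanges are hypothetical names, and the Runnable passed in is expected to query the database for changed records and apply them to the index. How this gets hooked into Tomcat startup is left out.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Runs an index-update task on one background thread at a fixed delay. */
final class IndexPoller {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "index-poller");
                t.setDaemon(true); // a daemon thread does not block JVM shutdown
                return t;
            });

    /**
     * Runs applyChanges repeatedly, waiting delayMillis between the end of
     * one run and the start of the next, so runs never overlap.
     */
    void start(Runnable applyChanges, long delayMillis) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                applyChanges.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs, so log and continue.
                System.err.println("index update failed: " + e);
            }
        }, 0L, delayMillis, TimeUnit.MILLISECONDS);
    }

    /** Stops scheduling and interrupts a run in progress. */
    void stop() {
        scheduler.shutdownNow();
    }
}
```

A call such as new IndexPoller().start(task, 45000L) would match the 45-second interval mentioned above. Because only one thread ever runs the task, it is also the only writer to the index.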


In general, things just got much more complicated than I was hoping for, though 
I don't know how I can do without Lucene or something like it.  This has been 
done so many times before that I would have expected it to be easy, but I 
cannot see clearly yet because it is all new.   I wish a database Text field 
could have this sort of mechanism built into it.   MySQL (what I am using) does 
not do this, but I am going to check into other databases now.  OJB will work 
with almost all of them, so that would help if there is a database type of 
solution that will allow that sleep at night thing to happen!!!

If you have input on these things: I have found some answers in the mailing 
list, but not really a concept of how to manage the whole thing.  Is there a 
big open source project out there that uses Lucene incrementally alongside a 
database?  I don't think so.

If you have any code or ideas I would appreciate both!!!  Also, having a FAQ 
that handles lots of these common problems, though they are a bit off topic, 
might really help people choose to use Lucene.

Thanks,

JohnE




- Original Message -
From: Luke Shannon [EMAIL PROTECTED]
Date: Tuesday, November 16, 2004 10:51 pm
Subject: Index Locking Issues Resolved...I hope

 Hello;
 
 I think I have solved my locking issues. I just made it through the set of
 test cases that previously resulted in index locking errors. I just removed
 the method from my code that checks for an index lock and forcefully removes
 it after 1 minute. Hopefully it never needs to be put back in.
 
 Here is what I changed:
 
 I moved all my Indexer logic into a class called Index.java that implements
 Runnable. Index's start() calls a method named go(), which is static and
 synchronized. go() kicks off all the logic to update the index (the reader,
 writer and other members involved with incremental updates are also static).
 I put logging in place that logs when a thread has executed the method and
 what the thread's name is.
 
 Every time a client class changes the content it can create a thread
 reference and pass it the runnable Index. The convention I have requested
 for naming the thread is a toString() of the current date. Then they start
 the thread.
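
The arrangement described above might look roughly like the sketch below. It is not Luke's actual code: IndexTask is a hypothetical class name, the body of go() is a placeholder, and the counters exist only to make the one-thread-at-a-time behaviour observable.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** A Runnable whose work is funnelled through one static synchronized method. */
final class IndexTask implements Runnable {
    static final AtomicInteger active = new AtomicInteger();    // threads currently inside go()
    static final AtomicInteger maxActive = new AtomicInteger(); // highest value of active ever seen
    static final AtomicInteger completed = new AtomicInteger(); // finished go() calls

    @Override
    public void run() {
        go();
    }

    /** static + synchronized locks on IndexTask.class, so only one thread is in here at a time. */
    private static synchronized void go() {
        int now = active.incrementAndGet();
        maxActive.accumulateAndGet(now, Math::max);
        try {
            System.out.println("indexing started by " + Thread.currentThread().getName());
            // Placeholder: open the writer, apply the update, close the writer.
        } finally {
            active.decrementAndGet();
            completed.incrementAndGet();
        }
    }
}
```

A client would then start it with something like new Thread(new IndexTask(), new java.util.Date().toString()).start(), following the naming convention mentioned above. Threads that arrive while go() is busy simply block until the lock is free.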
 
 How it worked:
 
 A few users just tested the system; half added documents to the system while
 the other half deleted documents at the same time. No locking issues were
 seen and the index was current with the changes a short time after the last
 operation (in my previous code this test resulted in an issue with index
 locking).
 
 I was able to go through the log file and find the start of the synchronized
 go() method and the successful completion of the indexing operations for
 every request made.
 
 The only performance issue I noticed was that if someone added a very large
 PDF, it took a while before the thread handling the request could finish. If
 this is the first operation of many, the operations following this large
 file take that much longer. Luckily for me search results don't need to be
 instant.
 
 Things are looking much better. For now...
 
 Thanks to all that helped me up till now.
 
 Luke
 
 - Original Message - 
 From: Otis Gospodnetic [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Tuesday, November 16, 2004 4:01 PM
 Subject: Re: _4c.fnm missing
 
 
  'Concurrent' and 'updates' in the same 

Re: Lucene : avoiding locking

2004-11-15 Thread jeichels

I am new to Lucene, but have a large project in production on the web using 
other Apache software including Tomcat, Struts, OJB, and others.

The database I need to support will hopefully grow to millions of records.  
Right now it only has thousands but it is growing.   These documents get 
updated by users regularly, but not frequently.   When you have 100k users 
though, infrequently means you still have to deal with lock types of issues.

When they update their record, their search criteria will have to be updated 
and they will expect to see results somewhat immediately.

In moving to Lucene from exact matching, which is very poor for searches, this 
locking is the only thing that has me nervous.   I would really like a well 
thought out scheme for incremental changes, as I won't generally need batch 
indexing unless I have to delete/recreate the database for some reason.

Thinking about most online forums, I think incremental is the way they would 
like to be able to go for searching.

I have lots to learn about this project, but I really like what I see besides 
that locking issue.   If I get into this more and understand details maybe I 
will have something to offer later.   Lots to learn first though.

Thank you for your hard work,

JohnE





I am curious, though, how many people on this list are using Lucene in
the incremental update case. Most examples I've seen all assume batch
indexing.

Regards,

Luke Francl







Re: Lucene : avoiding locking (incremental indexing)

2004-11-15 Thread jeichels
It really seems like I am not the only person having this issue.

So far I am seeing 2 solutions and honestly I don't totally love either.  I am 
thinking that, without changes to Lucene itself, the best general way to 
implement this might be to have a queue of changes and have Lucene work off 
this queue in a single thread using a batch method with a configurable time 
interval.   This is similar to what you are using below, but I don't like that 
you forcibly unlock Lucene if it shows itself locked.   Using the queue 
approach, only that one thread could be accessing Lucene for writes/deletes 
anyway, so there should be no unknown locking.
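
A rough sketch of such a queue follows, using only the Java standard library. The class and parameter names (BatchingIndexQueue, applyBatch, batchIntervalMillis) are made up for illustration, and applyBatch stands in for whatever code opens the index, applies the queued adds and deletes, and closes it again.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Collects changes from any thread and applies them in batches on one writer thread. */
final class BatchingIndexQueue<T> {
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();
    private final Thread worker;
    private volatile boolean running = true;

    BatchingIndexQueue(long batchIntervalMillis, Consumer<List<T>> applyBatch) {
        worker = new Thread(() -> {
            while (running || !queue.isEmpty()) {
                if (running) {
                    try {
                        Thread.sleep(batchIntervalMillis); // the configurable batch interval
                    } catch (InterruptedException e) {
                        // shutdown() woke us early; fall through and flush what is queued
                    }
                }
                List<T> batch = new ArrayList<>();
                queue.drainTo(batch); // take everything queued since the last pass
                if (!batch.isEmpty()) {
                    try {
                        applyBatch.accept(batch);
                    } catch (RuntimeException e) {
                        System.err.println("batch failed: " + e); // keep the worker alive
                    }
                }
            }
        }, "index-writer");
        worker.start();
    }

    /** Safe to call from any thread; never touches the index directly. */
    void submit(T change) {
        queue.add(change);
    }

    /** Flushes remaining changes, then stops the worker. */
    void shutdown() throws InterruptedException {
        running = false;
        worker.interrupt();
        worker.join();
    }
}
```

Since applyBatch only ever runs on the "index-writer" thread, no other thread should hold the write lock, which is the point made above about avoiding unknown locking. The trade-off is that a change can wait up to one interval before it becomes searchable.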

I can imagine this being a very good addition to Lucene - creating a high level 
interface to Lucene that manages incremental updates in such a manner.  If 
anybody has such a general piece of code, please post it!!!   I would use it 
tonight rather than create my own.

I am not sure if there is anything that can be done to Lucene itself to help 
with this need people seem to be having.  I realize the likely reasons why 
Lucene might need to have only one IndexWriter, and the additional load that 
might be caused by locking off pieces of the database rather than the whole 
database.  I think I need to look in the developer archives.

JohnE



- Original Message -
From: Luke Shannon [EMAIL PROTECTED]
Date: Monday, November 15, 2004 5:14 pm
Subject: Re: Lucene : avoiding locking (incremental indexing)

 Hi Luke;
 
 I have a similar system (except people don't need to see results
 immediately). The approach I took is a little different.
 
 I made my Indexer a thread with the indexing operations occurring in the run
 method. When the IndexWriter is to be created or the IndexReader needs to
 execute a delete, I call the following method:
 
 private void manageIndexLock() {
     try {
         // check if the index is locked and deal with it if it is
         if (index.exists() && IndexReader.isLocked(indexFileLocation)) {
             System.out.println("INDEXING INFO: There is more than one process trying"
                     + " to write to the index folder. Will wait for index to become available.");
             // perform this loop until the lock is released or 3 mins has expired
             int indexChecks = 0;
             while (IndexReader.isLocked(indexFileLocation) && indexChecks < 6) {
                 // increment the number of times we check the index files
                 indexChecks++;
                 try {
                     // sleep for 30 seconds
                     Thread.sleep(30000L);
                 } catch (InterruptedException e2) {
                     System.out.println("INDEX ERROR: There was a problem waiting for the"
                             + " lock to release. " + e2.getMessage());
                 }
             } // closes the while loop for checking on the index directory
             // if we are still locked we need to do something about it
             if (IndexReader.isLocked(indexFileLocation)) {
                 System.out.println("INDEXING INFO: Index locked after 3 minutes of"
                         + " waiting. Forcefully releasing lock.");
                 IndexReader.unlock(FSDirectory.getDirectory(index, false));
                 System.out.println("INDEXING INFO: Index lock released");
             } // close the if that actually releases the lock
         } // close the if that ensures the file exists
     } // closes the try for all the above operations
     catch (IOException e1) {
         System.out.println("INDEX ERROR: There was a problem waiting for the lock"
                 + " to release. " + e1.getMessage());
     }
 } // close the manageIndexLock method
 
 Do you think this is a bad approach?
 
 Luke
 
 - Original Message - 
 From: Luke Francl [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Monday, November 15, 2004 5:01 PM
 Subject: Re: Lucene : avoiding locking (incremental indexing)
 
 
 This is how I implemented incremental indexing. If anyone sees anything
 wrong, please let me know.
 
 Our motivation is similar to John Eichel's. We have a digital asset
 management system and when users update, delete or create a new asset,
 they need to see their results immediately.
 
 The most important thing to know about incremental indexing is that
 multiple threads cannot share the same IndexWriter, and only one
 IndexWriter can be open on an index at a time.
 
 Therefore, what I did was control access to the IndexWriter through a
 singleton wrapper class that synchronizes access to the IndexWriter and
 IndexReader (for deletes). After finishing writing to the index, you
 must close the IndexWriter to flush the changes to the index.
 
 If you do this you will be fine.
 
 However, opening and closing the index takes time so we had to look for
 some ways to speed up the indexing.
 
 The most obvious thing is that you should do as much work as possible
 outside of the synchronized block. For example, in my application, the
 creation of Lucene Document objects is not synchronized. Only the part
 of the code between opening the IndexWriter and calling
 IndexWriter.close() needs to be synchronized.
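
The shape of such a wrapper might be something like the sketch below. It is an illustration only, not Luke Francl's code: IndexGate is a hypothetical name, the list stands in for the on-disk index, and the string clean-up stands in for building a Lucene Document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Singleton that serializes writes while leaving document preparation unsynchronized. */
final class IndexGate {
    private static final IndexGate INSTANCE = new IndexGate();

    private final Object writeLock = new Object();
    private final List<String> index = new ArrayList<>(); // stand-in for the on-disk index

    private IndexGate() {
    }

    static IndexGate getInstance() {
        return INSTANCE;
    }

    void add(String rawText) {
        // Expensive preparation (parsing, building the document) runs outside the lock,
        // so several threads can do it at the same time.
        String document = rawText.trim().toLowerCase(Locale.ROOT);

        synchronized (writeLock) {
            // Only the open-write-close section is serialized.
            index.add(document);
        }
    }

    int size() {
        synchronized (writeLock) {
            return index.size();
        }
    }
}
```

Keeping the synchronized section this small means a slow document, such as a large file being parsed, delays only its own thread and not every other writer.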
 
  The other easy thing I did to improve performance was batch 
 changes in a
  transaction together