RE: Which searched words are found in a document

2004-05-26 Thread Nader S. Henein
Take a look at the highlighter code; you could implement this on the front
end while processing the page.
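
If you would rather not pull in the highlighter, here is a minimal
sketch of the same idea against the Lucene 1.x API (the field name
"contents", the docId variable and the extracted term array are
assumptions, not part of the original mail):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

// check each query term for presence in one specific result document
IndexReader reader = IndexReader.open(indexPath);
String[] words = {"cat", "dog"};   // terms extracted from the parsed query
for (int i = 0; i < words.length; i++) {
    TermDocs td = reader.termDocs(new Term("contents", words[i]));
    // skipTo() advances to the first doc >= docId; confirm it is the same doc
    if (td.skipTo(docId) && td.doc() == docId) {
        System.out.println(words[i] + " occurs in document " + docId);
    }
    td.close();
}
reader.close();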

Nader

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, May 25, 2004 10:51 AM
To: [EMAIL PROTECTED]
Subject: Which searched words are found in a document


Hi,

I have the following question:
Is there an easy way to see which words from a query were found in a
resulting document?

So if I search for 'cat OR dog' and get a result document with only 'cat' in
it, I would like to ask the searcher object (or something similar) to tell me
that for that result document 'cat' was the only word found.

I did see it is somehow possible with the explain method, but this does not
give a clean answer. I could also get the contents of the document and do an
indexOf for each search term, but there could be quite a lot of terms in our case.

Any suggestions?

Thanks,

Edvard Scheffers



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







RE: SELECTIVE Indexing

2004-05-26 Thread Nader S. Henein
So you basically only want to index the parts of your document within
<table> Foo Bar </table> tags.

I'm not sure if there's an easier way, but here's what I do:
1)  Parse the XML files using JDOM (or any XML parser that floats your boat)
into a Map or an ArrayList
2)  Create a Lucene document and loop through the aforementioned structure
(Map or ArrayList), adding field/value pairs to it like so:
contentDoc.add(new Field(fieldName, fieldValue, true, true, true));

So all you would need to do is put an if statement around the latter
call, to the effect of:

if (fieldName.equalsIgnoreCase("table")) {
contentDoc.add(new Field(fieldName, fieldValue, true, true, true));
}
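
Put together, a minimal self-contained sketch of steps 1 and 2 with that
filter applied (JDOM plus the Lucene 1.x Field constructor used above;
the element name "table" and the method name buildTableOnlyDoc are just
illustrative):

import java.io.File;
import java.util.Iterator;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;

public static Document buildTableOnlyDoc(File xmlFile) throws Exception {
    org.jdom.Document xml = new SAXBuilder().build(xmlFile);   // step 1: parse
    Document contentDoc = new Document();                      // step 2: build
    Iterator it = xml.getRootElement().getChildren().iterator();
    while (it.hasNext()) {
        Element e = (Element) it.next();
        if (e.getName().equalsIgnoreCase("table")) {           // the filter
            // stored, indexed, tokenized -- same flags as the call above
            contentDoc.add(new Field(e.getName(), e.getText(), true, true, true));
        }
    }
    return contentDoc;
}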


This may be overkill, someone feel free to correct me if I'm wrong

Nader

-Original Message-
From: Karthik N S [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, May 19, 2004 1:01 PM
To: Lucene Users List
Subject: RE: SELECTIVE Indexing


Hey Lucene Users

My original intention for indexing was to
index certain portions of the HTML [not the whole document];
if JTidy does not support this, then what are my options?

Karthik

-Original Message-
From: Viparthi, Kiran (AFIS) [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 19, 2004 1:43 PM
To: 'Lucene Users List'
Subject: RE: SELECTIVE Indexing


I doubt if it can be used as a plug in.
Would be good to know if it can be used as a plug in.

Regards,
Kiran.

-Original Message-
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: 17 May 2004 12:30
To: Lucene Users List
Subject: RE: SELECTIVE Indexing


Hi

Can I Use TIDY [as plug in ] with Lucene ...


with regards
Karthik

-Original Message-
From: Viparthi, Kiran (AFIS) [mailto:[EMAIL PROTECTED]
Sent: Monday, May 17, 2004 3:27 PM
To: 'Lucene Users List'
Subject: RE: SELECTIVE Indexing



Try using Tidy.
It creates a Document from the HTML and allows you to apply XPath. Hope this
helps.

Kiran.

-Original Message-
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: 17 May 2004 11:59
To: Lucene Users List
Subject: SELECTIVE Indexing



Hi all

   Can somebody tell me how to index a CERTAIN PORTION of the HTML file only?

   ex:-
<table> ...
   ...
</table>


with regards
Karthik




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: change directory

2004-05-03 Thread Nader S. Henein
When my server restarts, I have a little procedure that validates and sorts
out the index in case the server crashed mid-indexing/optimizing. It checks
for locks and frees them if need be, then optimizes the whole thing (as a
precaution). Here's the code I use; try it out in your Lucene init:


import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Lock;

try {
    Directory directory = FSDirectory.getDirectory(indexPath, false);
    if (directory.list().length == 0) clear();   // create a new index
    // writeFileName is the name of Lucene's write-lock file
    Lock writeLock = directory.makeLock(writeFileName);
    if (!writeLock.obtain()) {
        // the lock is stale (left over from a crash); force it open
        IndexReader.unlock(directory);
    } else {
        // nobody held the lock; release it again immediately
        writeLock.release();
    }
} catch (IOException e) {
    logger.error("Index Validate", e);
}

Try it out, hope it helps.

Nader Henein


-Original Message-
From: Rosen Marinov [mailto:[EMAIL PROTECTED] 
Sent: Monday, May 03, 2004 5:52 PM
To: Lucene Users List
Subject: change directory


Hi all,

I have a good working index, about 3 GB, in one directory,
for example in c:/index1.

Now I want to change the computer and directory, for example
to d:/index2 (is this possible???)

When I copy it to the new PC and directory, on
IndexReader(indexpath) I get:

  java.io.IOException: Lock obtain timed out
at org.apache.lucene.store.Lock.obtain(Lock.java:97)
at org.apache.lucene.store.Lock$With.run(Lock.java:147)
at org.apache.lucene.index.IndexReader.open

Before copying I closed all Java applications; the index was closed for
writers, readers, searchers, terms, etc. ... I have finally clauses to close
all of this and a shutdown function, and all my methods which work with the
index are synchronized.

10x for help in advance



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Disappearing segments

2004-05-02 Thread Nader S. Henein
You're catching an exception and acting on it, but you're not reporting it.
For now, comment out the deletion and copy-from-backup code and report the
errors instead; if the batch is failing on a regular basis you want to know
about it. Also watch out: if you back up the index during an indexing run you
could end up with a limp index missing a few files, hence the missing
segments, so I would check for write and commit locks pre-backup to avoid
that. This is probably caused by two unrelated errors: first batchindex()
fails, then the backup restores a version that may not have all the index
files there (depending on when it was backed up),
thereby giving you the feeling that segments are disappearing randomly.
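
As a minimal sketch of that pre-backup check (indexPath and the
backupindex() routine are Kelvin's names from the mail below;
IndexReader.isLocked() is the 1.x call):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// skip the backup while the index is locked, so the copy never
// captures a half-written segment set
Directory dir = FSDirectory.getDirectory(indexPath, false);
if (!IndexReader.isLocked(dir)) {
    backupindex();
}
// else: an IndexWriter/IndexReader is mid-operation; defer this round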

Hope this helps.

Nader Henein

-Original Message-
From: Kelvin Tan [mailto:[EMAIL PROTECTED] 
Sent: Monday, May 03, 2004 6:52 AM
To: Lucene Users List
Subject: RE: Disappearing segments


Thanks for responding Nader.

Hmm... you've hit the nail on the head. I do have a cron job which backs up
the index. It's run as a scheduled batch-index job.

The logic is basically

backupindex()
try
{
batchindex()
}
catch(Exception e)
{
deleteindex();
copyfrombackuptoindex()
deletebackup();
}

I assume that the original index before backing up was complete and
'working'. 
I'm also deleting the index that failed, instead of just overwriting. Where
did 
I go wrong? 

I'm not checking that the index isn't write-locked before backing up, but I 
don't think that's the problem (though it very well can be a separate
problem).

Kelvin

On Fri, 30 Apr 2004 23:20:42 +0400, Nader Henein said:
 Could you share your indexing code, and just to make sure, is there
 anything running on your machine that could delete these files, like
 a cron job that'll back up the index?
 
 You could go by process of elimination and shut down your server and 
 see if the files disappear, coz if the problem is contained within the 
 server you know that you can safely go on the DEBUG rampage.
 
 Nader
 
 -Original Message-
 From: Kelvin Tan [mailto:[EMAIL PROTECTED]
 Sent: Friday, April 30, 2004 9:15 AM
 To: Lucene Users List
 Subject: Re: Disappearing segments
 
 An update:
 
 Daniel Naber suggested using IndexWriter.setUseCompoundFile() to see 
 if it happens with the compound index format. Before I had a chance to 
 try it out, this happened:
 
 java.io.FileNotFoundException: C:\index\segments (The system cannot
 find the file specified)
 at java.io.RandomAccessFile.open(Native Method)
 at java.io.RandomAccessFile.<init>(RandomAccessFile.java:200)
 at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:321)
 at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:329)
 at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
 at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:71)
 at org.apache.lucene.index.IndexWriter$1.doBody(IndexWriter.java:154)
 at org.apache.lucene.store.Lock$With.run(Lock.java:116)
 at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:149)
 at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:131)
 
 so even the segments file somehow got deleted. Hoping someone can shed 
 some light on this...
 
 Kelvin
 
 On Thu, 29 Apr 2004 11:45:36 +0800, Kelvin Tan said:
 Errr, sorry for the cross-post to lucene-dev as well, but I realized 
 this mail really belongs on lucene-user...
 
 I've been experiencing intermittent disappearing segments which 
 result in the following stacktrace:
 
 Caused by: java.io.FileNotFoundException: C:\index\_1ae.fnm (The
 system cannot find the file specified)
 at java.io.RandomAccessFile.open(Native Method)
 at java.io.RandomAccessFile.<init>(RandomAccessFile.java:200)
 at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:321)
 at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:329)
 at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
 at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:78)
 at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:104)
 at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:95)
 at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:112)
 at org.apache.lucene.store.Lock$With.run(Lock.java:116)
 at org.apache.lucene.index.IndexReader.open(IndexReader.java:103)
 at org.apache.lucene.index.IndexReader.open(IndexReader.java:91)
 at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:75)
 
 The segment that disappears (_1ae.fnm) varies.
 
 I can't seem to reproduce this error consistently, so don't have a 
 clue what might cause it, but it usually happens after the 
 application has been running for some time. Has anyone experienced 
 something similar, or can anyone point
 me
 in the right direction?
 
 When this occurs, I need to rebuild the entire index for it to be 
 usable. Very troubling indeed...
 
 Kelvin
 

RE: Documents the same search is done many times.

2004-04-29 Thread Nader S. Henein

The short answer is: it's up to you :-)  Lucene doesn't know which field is
your primary key (you're thinking like a DB programmer). If you add the new
document with ID=one without deleting the old one from the index, then when
you search you'll get two documents, "pig" and "mongoose"; but if you delete
all documents with ID=one and then index your new document, you'll only get
"mongoose". From a DBA perspective, Lucene is like a table with a unique ID
on each document (that being the Lucene-assigned doc ID, which changes every
time you optimize but nevertheless remains unique), and all other columns,
whether indexed, tokenized, stored or not, can bear repetition. So if you
want to implement a unique key like ID on your Lucene index, you'll have to
do a little delete based on that ID field every time you insert a new
document into the index. It's quite simple, and I've been doing it for a few
years now without fail.
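
A minimal sketch of that delete-then-insert against the 1.x API (the
field names and values follow the example above; StandardAnalyzer is
just a stand-in for whatever analyzer you index with):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// 1) delete every document carrying this application-level key
IndexReader reader = IndexReader.open(indexPath);
reader.delete(new Term("ID", "one"));
reader.close();

// 2) index the replacement document (create=false appends to the index)
IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
Document doc = new Document();
doc.add(Field.Keyword("ID", "one"));       // untokenized, so the Term delete matches
doc.add(Field.Text("animal", "mongoose"));
writer.addDocument(doc);
writer.close();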

Hope this helps

Nader Henein



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Count for a keyword occurance in a file

2004-04-29 Thread Nader S. Henein
Tricky. Scoring has to do with the frequency of the occurrence of the word,
as opposed to the number of words in the file in general (somebody correct
me if I'm wrong), so short of an educated approximation, you could hack the
indexer to dynamically store the frequency of a word (oh so unadvisable).
Personally I recommend the educated approximation: you could index the
document with the number of words in it (you would have to make sure
you're not using the stop-word analyzer or the Porter stemmer) and then,
based on the score, reverse-engineer the result you want.

Nader Henein

-Original Message-
From: hemal bhatt [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, April 28, 2004 5:50 PM
To: Lucene Users List
Subject: Count for a keyword occurance in a file


Hi,

How can I get a count behind the score given by Hits.score()?
I.e., I want to know how many times a keyword occurs in a file. Any help on
this would be appreciated.
  
regards
Hemal Bhatt






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Count for a keyword occurance in a file

2004-04-29 Thread Nader S. Henein
So even an educated calculation won't do it because you'd need to know how
many documents the word occurs in (you could do a search, but that would be
overkill and impractical).

Cool

-Original Message-
From: Ype Kingma [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 29, 2004 10:57 AM
To: Lucene Users List
Subject: Re: Count for a keyword occurance in a file


On Thursday 29 April 2004 08:14, Nader S. Henein wrote:
 Tricky, scoring has to do with the frequency of the occurrence of the 
 word as opposed to the amount of words in the file in general 
 (Somebody correct me if I'm wrong) , so short of an educated 
 approximation, you could hack

Lucene uses two frequencies for a term: the nr. of docs in which it occurs
in an index (basis for IDF), and the nr of times a term occurs in a
document.

 the indexer to dynamically store the frequency of a word (oh so 
 unadvisable). Personally I recommend the educated approximation, 
 because you could index the document with the number of words in it ( 
 you would have to make sure you're not using Stop Word Analyzer or 
 Port Stemmer) and then based on the score reverse engineer the result 
 you want.

 Nader Henein

 -Original Message-
 From: hemal bhatt [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, April 28, 2004 5:50 PM
 To: Lucene Users List
 Subject: Count for a keyword occurance in a file


 Hi,

 How can I get a count of the score given by Hits.Score().
 i.e I want to know how many times a keyword occurs in a file. Any help 
 on this would be appreciated.

The easiest way is to use IndexReader. I don't know what you mean by file
(index or document), but you can get both frequencies I mentioned above
from an IndexReader, optionally using skipTo() to go to the document. The
methods are docFreq(Term) and termDocs(Term).
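
A minimal sketch of those two calls (the field name "contents" and the
docId variable are placeholders, not from the original mail):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

IndexReader reader = IndexReader.open(indexPath);
Term term = new Term("contents", "keyword");
int docFreq = reader.docFreq(term);        // nr. of documents containing the term
TermDocs td = reader.termDocs(term);
if (td.skipTo(docId) && td.doc() == docId) {
    int freqInDoc = td.freq();             // nr. of occurrences within that document
}
td.close();
reader.close();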

Regards,
Ype








-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: sorting by date (XML)

2004-04-27 Thread Nader S. Henein

Here's my two cents on this:
Both ways you will need to combine the date into one field, but if you use a
millisecond representation you will not be able to use the FLOAT sort type
and you'll have to use the STRING sort (slower), because the millisecond
representation is longer than FLOAT allows. So you have three options:

1) Use YYYYMMDD and sort by FLOAT type
2) Use the millisecond representation and sort by STRING type
3) If the date you're entering here is the date of indexing then you can
just sort by DOC type (which is the DOC ID) and save yourself the pain
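
A minimal sketch of option 1 with the 1.4 Sort API (the field name
"date" is illustrative; Field.Keyword keeps the value untokenized):

import org.apache.lucene.document.Field;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// at index time: flatten <year><month><day> into one sortable field
doc.add(Field.Keyword("date", "20040427"));   // YYYYMMDD

// at search time: sort numerically on that field
Sort byDate = new Sort(new SortField("date", SortField.FLOAT));
Hits hits = searcher.search(query, byDate);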

Hope this helps.

Nader Henein

-Original Message-
From: Michael Wechner [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, April 27, 2004 3:52 PM
To: Lucene Users List
Subject: sorting by date (XML)


my XML files contain something like

<date>
  <year>2004</year><month>04</month><day>27</day>...
</date>

and I would like to sort by this date.

So I guess I need to modify the Documentparser and generate something like a
millisecond field and then sort by this, correct?

Has anyone done something like this yet?

Thanks

Michi

-- 
Michael Wechner
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
http://www.wyona.com  http://cocoon.apache.org/lenya/
[EMAIL PROTECTED][EMAIL PROTECTED]







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: searching only part of an index

2004-04-27 Thread Nader S. Henein
You may be able to jimmy the built-in filter to produce the most recent 100,
but really, keeping your fetch count at 100 and ordering by DOC should be
sufficient.
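
A minimal sketch of that suggestion with the 1.4 Sort API (newly added
documents get the highest doc ids, so reverse index order puts them
first):

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// reverse index order: most recently added documents come back first
Sort newestFirst = new Sort(new SortField(null, SortField.DOC, true));
Hits hits = searcher.search(query, newestFirst);
int n = Math.min(100, hits.length());      // fetch count capped at 100
for (int i = 0; i < n; i++) {
    // hits.doc(i) is the i-th newest matching document
}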

-Original Message-
From: Alan Smith [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, April 27, 2004 4:03 PM
To: [EMAIL PROTECTED]
Subject: searching only part of an index


Hi

I wondered if anyone knows whether it is possible to search ONLY the 100 (or
whatever) most recently added documents to a Lucene index? I know that once
I have all my results ordered by ID number in Hits I could then just display
the required amount, but I wondered if there is a way to avoid searching all
documents in the index in the first place?

Many thanks

Alan

_
Express yourself with cool new emoticons http://www.msn.co.uk/specials/myemo







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: searching only part of an index

2004-04-27 Thread Nader S. Henein
Are the DOC ids sequential, or just unique and ascending? I'm thinking like
a good little Oracle boy, so does anyone know?

-Original Message-
From: Ioan Miftode [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, April 27, 2004 4:55 PM
To: Lucene Users List
Subject: Re: searching only part of an index




If you know the id of the last document in the index (I don't know the best
way to get it), you could probably use a range query: something like find
all docs with the id in [lastId-100 TO lastId]. Maybe you should make sure
that the first limit is non-negative, though.

just a thought

ioan

At 08:02 AM 4/27/2004, you wrote:
Hi

I wondered if anyone knows whether it is possible to search ONLY the 100
(or whatever) most recently added documents to a lucene index? I know that
once I have all my results ordered by ID number in Hits I could then just
display the required amount, but I wondered if there is a way to avoid
searching all documents in the index in the first place?

Many thanks

Alan

_
Express yourself with cool new emoticons 
http://www.msn.co.uk/specials/myemo







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: searching only part of an index

2004-04-27 Thread Nader S. Henein
So if Alan wants to limit it to the most recent 100, he can't really use a
range search unless he can guarantee that the index is optimized after
deletes; but then, if his deletion rounds are anything like mine (every 2
mins), optimizing at each delete will make searching the index really slow.
Right?

Nader

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, April 27, 2004 5:15 PM
To: Lucene Users List
Subject: Re: searching only part of an index


On Apr 27, 2004, at 9:00 AM, Nader S. Henein wrote:
 Are the DOC ids sequential? Or just unique and ascending, I'm thinking
 like
 a good little Oracle boy, so does anyone know?

They are unique and ascending.

Gaps in id's exist when documents are removed, and then the id's are 
squeezed back to completely sequential with no holes during an 
optimize.

Erik







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Segments file get deleted?!

2004-04-25 Thread Nader S. Henein
Can you give us a bit of background? We've been using Lucene since the first
stable release two years ago, and I've never had segments disappear on me.
First of all, can you provide some background on your setup; secondly, when
you say a certain period of time, how much time are we talking about here,
and does that interval coincide with your indexing schedule? You may have
the create flag on the indexer set to true, so it simply recreates the index
at every update and deletes whatever was there; of course, if there are no
files to index at any point, it will just give you a blank index.
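
For reference, the create flag in question is the third argument to the
1.x IndexWriter constructor; a minimal sketch:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

// create=true wipes and recreates the index; create=false appends.
// An updater that accidentally passes true deletes everything on every run.
IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);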


Nader Henein

-Original Message-
From: Surya Kiran [mailto:[EMAIL PROTECTED] 
Sent: Monday, April 26, 2004 7:48 AM
To: [EMAIL PROTECTED]
Subject: Segments file get deleted?!


Hi all, we have implemented our portal search using Lucene. It works fine,
but after a certain period of time the Lucene segments file gets deleted.
Eventually all searches fail. Can anyone guess where the error could be?

Thanks a lot.

Regards
Surya.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: converting text/doc to XML

2003-07-08 Thread Nader S. Henein
We read from the database and parse the data into valid XML, then hand
the XML file over to Lucene, which in turn digests it and indexes
the information.

N.

-Original Message-
From: Jagdip Singh [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, July 08, 2003 10:39 AM
To: 'Lucene Users List'; [EMAIL PROTECTED]
Subject: RE: converting text/doc to XML


Hi Nader,
You talked about using Lucene for your http://www.bayt.com web site.
Do you convert CVs or any other documents to XML format before
submitting them to Lucene for indexing?

Regards, 
Jagdip

-Original Message-
From: Nader S. Henein [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, July 08, 2003 1:55 AM
To: 'Lucene Users List'
Subject: RE: converting text/doc to XML

XML is an organized, standardized format, so let's say your document has
the following characteristics:

File name : foobar.doc
First line title : Foo Bar
File content :
Blah blah blah blah 
Blah blah blah blah 
Blah blah blah blah 
Blah blah blah blah 

Then you have to read the file (a simple file read; Java can do this in
about ten different ways, pick one), put each of the file's
characteristics in a variable,

and then parse it into valid XML:
<doc doc_id="1">
<file_name>foobar.doc</file_name>
<title>Foo Bar</title>
<content>
Blah blah blah blah 
Blah blah blah blah 
Blah blah blah blah 
Blah blah blah blah 
</content>
</doc>


There are probably packages that will do this for you, but it's so simple
you could pull it off in under a hundred lines; it's also good exercise
to familiarize yourself with XML (if you haven't played around with it
before).



-Original Message-
From: Jagdip Singh [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, July 08, 2003 9:41 AM
To: 'Lucene Users List'
Subject: converting text/doc to XML


Hi,
How can I convert text/doc to XML?
Please help.
 
Regards, 
Jagdip






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: converting text/doc to XML

2003-07-07 Thread Nader S. Henein
XML is an organized, standardized format, so let's say your document has
the following characteristics:

File name : foobar.doc
First line title : Foo Bar
File content :
Blah blah blah blah 
Blah blah blah blah 
Blah blah blah blah 
Blah blah blah blah 

Then you have to read the file (a simple file read; Java can do this in
about ten different ways, pick one), put each of the file's
characteristics in a variable,

and then parse it into valid XML:
<doc doc_id="1">
<file_name>foobar.doc</file_name>
<title>Foo Bar</title>
<content>
Blah blah blah blah 
Blah blah blah blah 
Blah blah blah blah 
Blah blah blah blah 
</content>
</doc>


There are probably packages that will do this for you, but it's so simple
you could pull it off in under a hundred lines; it's also good exercise
to familiarize yourself with XML (if you haven't played around with it
before).
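
A minimal sketch of that read-and-wrap step (plain Java I/O; XML
escaping of &, < and > is omitted for brevity and would be needed in
real use):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public static String docToXml(File f) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(f));
    String title = in.readLine();                     // first line = title
    StringBuffer content = new StringBuffer();
    for (String line; (line = in.readLine()) != null;) {
        content.append(line).append('\n');
    }
    in.close();
    return "<doc doc_id=\"1\">\n"
         + "  <file_name>" + f.getName() + "</file_name>\n"
         + "  <title>" + title + "</title>\n"
         + "  <content>\n" + content + "</content>\n"
         + "</doc>";
}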



-Original Message-
From: Jagdip Singh [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, July 08, 2003 9:41 AM
To: 'Lucene Users List'
Subject: converting text/doc to XML


Hi,
How can I convert text/doc to XML?
Please help.
 
Regards, 
Jagdip


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: commercial websites powered by Lucene?

2003-06-24 Thread Nader S. Henein
I handle updates or inserts the same way: first I delete the document
from the index and then I insert it (better safe than sorry). I batch my
updates/inserts every twenty minutes; I would use smaller intervals,
but since I have to sync the XML files created from the DB to three
machines (I maintain three separate Lucene indices on my three separate
web-servers) it takes a little longer. You have to batch your changes
because updating the index takes time, as opposed to deletes, which I
batch every two minutes. You won't have a problem updating the index and
searching at the same time, because Lucene updates the index on a
separate set of files and then, when it's done, it overwrites the old
version. I've had to provide for backups and things like server crashes
mid-indexing, but I was using Oracle Intermedia before and Lucene BLOWS
IT AWAY.

-Original Message-
From: news [mailto:[EMAIL PROTECTED] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 12:06 PM
To: [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?


Hi Nader,

I was wondering if you'd mind me asking you a couple of questions about
your implementation?

The main thing I'm interested in is how you handle updates to Lucene's
index. I'd imagine you have a fairly high turnover of CVs and jobs, so
index updates must place a reasonable load on the CPU/disk. Do you keep
CVs and jobs in the same index or two different ones? And what is the
process you use to update the index(es) - do you batch-process updates
or do you handle them in real-time as changes are made?

Any insight you can offer would be much appreciated as I'm about to
implement something similar and am a little unsure of the best approach
to take. We need to be able to handle indexing about 60,000
documents/day, while allowing (many) searches to continue operating
alongside.

Thanks!
Chris

Nader S. Henein [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
 We use Lucene http://www.bayt.com , we're basically an on-line 
 Recruitment site and up until now we've got around 500 000 CVs and 
 documents indexed with results that stump Oracle Intermedia.

 Nader Henein
 Senior Web Dev

 Bayt.com

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, June 04, 2003 6:09 PM
 To: [EMAIL PROTECTED]
 Subject: commercial websites powered by Lucene?



 Hello All,

 I've been trying to find examples of large commercial websites that 
 use Lucene to power their search.  Having such examples would make 
 Lucene an easy sell to management

 Does anyone know of any good examples?  The bigger the better, and the

 more the better.

 TIA,
 -John







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: commercial websites powered by Lucene?

2003-06-24 Thread Nader S. Henein
I have to store the information I am indexing in the database because
the nature of our application requires it. On update of certain columns
in a table I create an XML file, which is then copied to directories on
each of my web servers; then separate Lucene apps, running on separate
machines, digest the information into separate indices. You also have to
provide procedures that run periodically to ensure that all your
indices are in sync with each other and in sync with the DB (I run this
once every three days, when the CPU usage on the machines is low).

To update the index I have a servlet running off a scheduler in Resin
(you could use any webserver; Orion's cool too). The up-side to
distributing your search engines like this is that you have three active
back-ups in case one gets corrupted (hasn't happened in two years), and
the load on each machine is pretty low, even during updates/optimizations
every 20 minutes.

If the server crashes, it's not a problem unless it happens
mid-indexing; then you have to somehow remove the write locks created in
the index directory (I just delete them, optimize, and re-start the
update that crashed).

Lucene destroyed Oracle on speed tests. We used to have to use our
single DB monster machine for all the searching and indexing, which made
the load on it pretty high, but now I have 0.5 loads on all my CPUs and
no need to buy new hardware.

-Original Message-
From: news [mailto:[EMAIL PROTECTED] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 1:12 PM
To: [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?


So you have a holding table in a database (or directory on disk?) where
you store the incoming documents, correct? Does each webserver run its
own indexing thread which grabs any new documents every 20 minutes, or
is there a central process that manages that? I'm trying to understand
how you know when you can safely clean out the holding table.

Did you look at having just a single process that was responsible for
updating the index, and then pushing copies out to all the webservers?
I'm wondering if that might be worth investigating (since it would take
a lot of load off the webservers that are running the searches), or if
it will be too troublesome in practice.

Also, I'm interested to see how you handle the situation when a server
gets shut down/restarted - does it just take a copy of the index from one
of the other servers (since its own index is likely out of date)? I
take it it's not safe to copy an index while it is being updated, so you
have to block on that somehow?

PS: It's great to hear Lucene blows Oracle out of the water! I've got
some skeptical management that need some convincing, hearing stories
like this helps a lot :-)

Nader S. Henein [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
 I handle updates or inserts the same way first I delete the document 
 from the index and then I insert it (better safe than sorry), I batch 
 my updates/inserts every twenty minutes, I would do it in smaller 
 intervals but since I have to sync the XML files created from the DB 
 to three machines (I maintain three separate Lucene indices on my 
 three separate
 web-servers) it takes a little longer. You have to batch your changes
 because Updating the index takes time as opposed to deleted which I
 batch every two minutes. You won't have a problem updating the index
and
 searching at the same time because lucene updates the index on a
 separate set of files and then when It's done it overwrites the old
 version. I've had to provide for Backups, and things like server
crashes
 mid-indexing, but I was using Oracle Intermedia before and Lucene
BLOWS
 IT AWAY.








-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: commercial websites powered by Lucene?

2003-06-24 Thread Nader S. Henein
Because I've set up Lucene as a webapp with a centralized init file and
setup properties file, I do my sanity check in the init: if the server
crashed mid-indexing, I have to delete the lock files, optimize, and
re-index the files that were being indexed when the crash occurred. There
was a long discussion about this back in August; search for Crash /
Recovery Scenario in the lucene-dev archived discussions. It should answer
all your questions.
Nader Henein

-Original Message-
From: Gareth Griffiths [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 24, 2003 1:11 PM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?


Nader,
You say you have to cope with server crashes mid-indexing. I think I'm
seeing lots of garbage files created by a server crash mid merge/optimise
while Lucene is creating a new index. Did you write code specifically to
handle this, or is there something more automated? (I was thinking of
writing a sanity check for before start-up that looked in 'segments' and
'deletable' and got rid of any files in the catalog directory that are
not referenced.)

Did you do something similar or have I missed something...

TIA

Gareth


- Original Message -
From: Nader S. Henein [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Tuesday, June 24, 2003 9:30 AM
Subject: RE: commercial websites powered by Lucene?


 I handle updates or inserts the same way first I delete the document 
 from the index and then I insert it (better safe than sorry), I batch 
 my updates/inserts every twenty minutes, I would do it in smaller 
 intervals but since I have to sync the XML files created from the DB 
 to three machines (I maintain three separate Lucene indices on my 
 three separate
 web-servers) it takes a little longer. You have to batch your changes
 because Updating the index takes time as opposed to deleted which I
 batch every two minutes. You won't have a problem updating the index
and
 searching at the same time because lucene updates the index on a
 separate set of files and then when It's done it overwrites the old
 version. I've had to provide for Backups, and things like server
crashes
 mid-indexing, but I was using Oracle Intermedia before and Lucene
BLOWS
 IT AWAY.

 -Original Message-
 From: news [mailto:[EMAIL PROTECTED] On Behalf Of Chris Miller
 Sent: Tuesday, June 24, 2003 12:06 PM
 To: [EMAIL PROTECTED]
 Subject: Re: commercial websites powered by Lucene?


 Hi Nader,

 I was wondering if you'd mind me asking you a couple of questions 
 about your implementation?

 The main thing I'm interested in is how you handle updates to Lucene's

 index. I'd imagine you have a fairly high turnover of CVs and jobs, so

 index updates must place a reasonable load on the CPU/disk. Do you 
 keep CVs and jobs in the same index or two different ones? And what is

 the process you use to update the index(es) - do you batch-process 
 updates or do you handle them in real-time as changes are made?

 Any insight you can offer would be much appreciated as I'm about to 
 implement something similar and am a little unsure of the best 
 approach to take. We need to be able to handle indexing about 60,000 
 documents/day, while allowing (many) searches to continue operating 
 alongside.

 Thanks!
 Chris

 Nader S. Henein [EMAIL PROTECTED] wrote in message 
 news:[EMAIL PROTECTED]
  We use Lucene http://www.bayt.com , we're basically an on-line 
  Recruitment site and up until now we've got around 500 000 CVs and 
  documents indexed with results that stump Oracle Intermedia.
 
  Nader Henein
  Senior Web Dev
 
  Bayt.com
 
  -Original Message-
  From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
  Sent: Wednesday, June 04, 2003 6:09 PM
  To: [EMAIL PROTECTED]
  Subject: commercial websites powered by Lucene?
 
 
 
  Hello All,
 
  I've been trying to find examples of large commercial websites that 
  use Lucene to power their search.  Having such examples would make 
  Lucene an easy sell to management
 
  Does anyone know of any good examples?  The bigger the better, and 
  the

  more the better.
 
  TIA,
  -John
 
 
 
  




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: commercial websites powered by Lucene?

2003-06-24 Thread Nader S. Henein
The search is a little sluggish because our initial architecture was
based on TCL, not Java. Until we complete the full Java overhaul, every
time I perform a search the AOLserver (TCL) has to call the servlet in
Resin (where Lucene is) and then perform the search; then, and this is
the killer, I have to parse all the results from a Java Collection into
a TCL list. The most intense search, with thousands of results, takes
less than a second; it's all the things I have to do around it that
take time.

Nader

-Original Message-
From: John Takacs [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 24, 2003 1:52 PM
To: Lucene Users List
Subject: RE: commercial websites powered by Lucene?


Hi Nader,

This thread is by far one of the best, and most practical.  It will only
be topped when someone provides benchmarks for a DMOZ.org type directory
of 3 million plus urls.  I would love to, but the whole JavaCC thing is
a show stopper.

Questions:

I noticed that search is a little slow.  What has been your experience?
Perhaps it was a bandwidth issue, but I'm living in a country with the
greatest internet connectivity and penetration in the world (South
Korea), so I don't think that is an issue on my end.

You have 500,000 resumes.  Based on the steps you took to get to
500,000, do you think your current setup will scale to millions, like
say, 3 million or so?

What is your hardware like?  CPU/RAM?

Warm regards, and thanks for sharing.  If I can ever get past the
Lucene/JavaCC installation failure, I'll share my benchmarks on the
above directory scenario.

John



-Original Message-
From: Nader S. Henein [mailto:[EMAIL PROTECTED]
Sent: Tuesday, June 24, 2003 5:30 PM
To: 'Lucene Users List'
Subject: RE: commercial websites powered by Lucene?


 I handle updates or inserts the same way first I delete the document
from the index and then I insert it (better safe than sorry), I batch my
updates/inserts every twenty minutes, I would do it in smaller intervals
but since I have to sync the XML files created from the DB to three
machines (I maintain three separate Lucene indices on my three separate
web-servers) it takes a little longer. You have to batch your changes
because Updating the index takes time as opposed to deleted which I
batch every two minutes. You won't have a problem updating the index and
searching at the same time because lucene updates the index on a
separate set of files and then when It's done it overwrites the old
version. I've had to provide for Backups, and things like server crashes
mid-indexing, but I was using Oracle Intermedia before and Lucene BLOWS
IT AWAY.

-Original Message-
From: news [mailto:[EMAIL PROTECTED] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 12:06 PM
To: [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?


Hi Nader,

I was wondering if you'd mind me asking you a couple of questions about
your implementation?

The main thing I'm interested in is how you handle updates to Lucene's
index. I'd imagine you have a fairly high turnover of CVs and jobs, so
index updates must place a reasonable load on the CPU/disk. Do you keep
CVs and jobs in the same index or two different ones? And what is the
process you use to update the index(es) - do you batch-process updates
or do you handle them in real-time as changes are made?

Any insight you can offer would be much appreciated as I'm about to
implement something similar and am a little unsure of the best approach
to take. We need to be able to handle indexing about 60,000
documents/day, while allowing (many) searches to continue operating
alongside.

Thanks!
Chris

Nader S. Henein [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
 We use Lucene http://www.bayt.com , we're basically an on-line 
 Recruitment site and up until now we've got around 500 000 CVs and 
 documents indexed with results that stump Oracle Intermedia.

 Nader Henein
 Senior Web Dev

 Bayt.com

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, June 04, 2003 6:09 PM
 To: [EMAIL PROTECTED]
 Subject: commercial websites powered by Lucene?



 Hello All,

 I've been trying to find examples of large commercial websites that 
 use Lucene to power their search.  Having such examples would make 
 Lucene an easy sell to management

 Does anyone know of any good examples?  The bigger the better, and the

 more the better.

 TIA,
 -John



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: commercial websites powered by Lucene?

2003-06-24 Thread Nader S. Henein
About 100 documents every twenty minutes, but it fluctuates depending on
how much traffic is on the site

-Original Message-
From: news [mailto:[EMAIL PROTECTED] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 3:28 PM
To: [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?


Hmm, good point about the cost of copying indices in a distributed
environment, although that is unlikely to affect us in the foreseeable
future. But, noted!

Do you have any rough statistics on how many documents you index/day, or
how many every 20 minutes?

This discussion is fantastic by the way, lots of great experience and
comments coming out here. Thanks, it's really appreciated.

Nader S. Henein [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
 We thought of that in the beginning and then we became more
 comfortable with multiple indices for simple backup purposes, and now
 our indices are in excess of 100 megs, and transferring that kind of
 data between three machines sitting in the same data center is
 passable, but once you start thinking of distributed webservers in
 different hosting facilities, copying 100 megs every 20 minutes, or
 even every hour, becomes financially expensive.

 Our webservers are Single Processor Sun UltraSPARC III 400 MHz machines
 with two gigs of memory, and I've never seen the CPU usage go over 0.8
 at peak time with the indexer running. Try it out first; take your
 time to gather your own numbers so you can really get a feel of what
 setup fits you best.

 Nader








-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: commercial websites powered by Lucene?

2003-06-24 Thread Nader S. Henein
We were using Oracle Intermedia before we switched to Lucene. Lucene has
been much faster, and it has allowed us to distribute our search
functionality over multiple servers; Intermedia, which is supposedly one
of the best in the business, couldn't hold a candle to Lucene, and our
Oracle installation and setup is impeccable. We spent years perfecting
it before we decided to separate from Intermedia and use Oracle as a DBMS,
not a search engine. Also, when you use Lucene and not a proprietary
product like Intermedia, you can switch databases at will if licensing
fees become too high to ignore.

Nader

-Original Message-
From: news [mailto:[EMAIL PROTECTED] On Behalf Of Ulrich Mayring
Sent: Tuesday, June 24, 2003 3:40 PM
To: [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?


Chris Miller wrote:
 Thanks for your commments Ulrich. I just posted a message asking if 
 anyone had attempted this approach! Sounds like you have, and it works

 :-)  Thanks for information, this sounds pretty close to what my 
 preferred approach would be.

This is a good approach if the number of total documents doesn't grow 
too much. There's obviously a limit to full index runs at some point.

 You say you get 2000 docs/minute. I've done some benchmarking and
 managed to get our data indexing at ~1000/minute on an Athlon 1800+
 (and most of that speed was achieved by bumping the
 IndexWriter.mergeFactor up to 100 or so). Our data is coming from a
 database table, each record contains about 40 fields, and I'm indexing
 8 of those fields (an ID, 4 number fields, 3 text fields including one
 that has ~2k text). Does this sound reasonable to you, or do you have
 any tips that might improve that performance?

You need to find out where you lose most of the time:

a) in data access (like your database could be too slow; in my case I am
scanning the local filesystem)
b) in parsing (probably not an issue when reading from a DB, but in my
case it is, I have HTML files)
c) in indexing

I haven't gone to the trouble to find that out for my app, because it is
fast enough the way it is.

However, what I wonder: if you have your data in a database anyway, why 
not use the database's indexing features? It seems like Lucene is an 
additional layer on top of your data, which you don't really need.

cheers,

Ulrich







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: commercial websites powered by Lucene?

2003-06-05 Thread Nader S. Henein
We use Lucene http://www.bayt.com , we're basically an on-line
Recruitment site and up until now we've got around 500 000 CVs and
documents indexed with results that stump Oracle Intermedia.

Nader Henein
Senior Web Dev

Bayt.com

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 04, 2003 6:09 PM
To: [EMAIL PROTECTED]
Subject: commercial websites powered by Lucene?



Hello All,

I've been trying to find examples of large commercial websites that
use Lucene to power their search.  Having such examples would
make Lucene an easy sell to management

Does anyone know of any good examples?  The bigger the better, and
the more the better.

TIA,
-John







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Size limit for indexing ?

2002-10-09 Thread Nader S. Henein

The size of the document is limited only by OS constraints, and 500 KB is
really small; I have documents in the hundreds of megs and it's fine. Check
your indexing and searching; you might find the problem there. Also, are you
using wildcard searches? They don't work from both sides.


Nader Henein

-Original Message-
From: Christophe GOGUYER DESSAGNES [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, October 09, 2002 12:08 PM
To: [EMAIL PROTECTED]
Subject: Size limit for indexing ?


Hi,

I use Lucene 1.2 and I index a text document whose size is near 500 KB.
(I use the Field.UnStored method.)
It seems that only the beginning of this document is indexed!
If I search for a term that is at the end of this document, I don't find it
(but I find terms at the beginning).
So, I split my document into 2 parts and indexed them, and now it works fine.

Is there a size limit for indexing a document?

Thx.
-
Christophe






--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Lucene and RDBMS.

2002-09-25 Thread Nader S. Henein

We had to do the same thing: we moved from an Oracle Intermedia search to
Lucene (much better); the data is stored in the database. What we did is
produce XML files on an interval (15 minutes), and those files would be
picked up by the indexer, which would delete any previous occurrence of
the same entry, re-index the new one, and then optimize the index. You
could do the whole process in one shot: retrieve a stream from the DB and
pass it directly to Lucene, but the stream should be in field/value pairs
(so XML makes sense).

The answer to your question is no, you don't have to use files to create
the index. The index itself is file-based, though.

Nader Henein

-Original Message-
From: Rehan Syed [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 25, 2002 10:51 AM
To: [EMAIL PROTECTED]
Subject: Lucene and RDBMS.



Hi,

I am in the process of implementing a Knowledge base for internal use by my
company.
The contents of this Knowledge base will be stored in one or more database
table(s).  I am evaluating Lucene for performing text searches on this
Knowledge base. I understand that Lucene has two components, indexing and
searching, but both these components work on files, not on text data stored
in an RDBMS.

In order for me to use Lucene, would I need to develop a process that will
extract text data out of the database, create text files and then do the
indexing and searching?  Are there any other approaches to this problem?
Comments/suggestions would be greatly appreciated.





--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Lucene and RDBMS.

2002-09-25 Thread Nader S. Henein

The initial motivation behind switching from Intermedia to Lucene was a
first step in achieving DB abstraction, because if you rely on Intermedia
for your indexing and searching purposes you're pretty much stuck with
Oracle, an excellent DB, but if your business is growing the licensing fees
become massive. Another thing is that I don't maintain one index on the
database server; I maintain an index on each webserver, which allowed me to
reduce the average load on the DB machine by 78%. It's a little bit of a
synchronization nightmare, but we've had it in place for the past three
months without incident, plus you have redundant indexes in case one
becomes corrupted. Furthermore, the traffic between the DB machine and the
webserver, which was inflated by having to pass search results back and
forth, has been dwarfed.

Now the true joy behind using Lucene is the performance boost you'll get:
we had Intermedia customized and tuned to our needs, yet Lucene was able to
give a 200% increase in performance, a huge asset to our site, which is
mainly search driven.

PS: the reason why we create XML files and then hand them to Lucene is that
the files are then used for display purposes and caching purposes; once
they are transmitted to the webserver machines they save me the hassle of
retrieving them from the database, since they are the most recent version
of the documents.

Nader Henein


-Original Message-
From: Mariusz Dziewierz [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 25, 2002 4:23 PM
To: Lucene Users List
Subject: Re: Lucene and RDBMS.


Nader S. Henein wrote:
 We had to do the same thing, we moved from an Oracle Intermedia search to
 Lucene (much better) the data is stored in the database.

Could you give some reasons which lead you to the conclusion that Lucene is
much better than Oracle Intermedia in terms of searching data stored in a
database? I'm currently reviewing technologies related to text mining,
and I am very curious about your motives, because I haven't had the
opportunity to evaluate both technologies yet.

--
Mariusz Dziewierz






--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Lucene is not closing connections to index files

2002-08-12 Thread Nader S. Henein

I know it's not the most efficient way, but I do close the searcher after
every search using:
searcher.close() ;
This saves me the hassle of worrying about memory problems, and the search
on my system is quite intensive, about half a million searches a day. I
haven't faced any problems with the opening and closing of searchers;
anyway, if you keep it open you're in a way working against the garbage
collector, slowly creating your own structured memory leak.
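
The pattern described above, as a minimal sketch (the try/finally
guarantees the files are released even if the search throws):

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Searcher;

Searcher searcher = new IndexSearcher(indexPath);
try {
    Hits hits = searcher.search(query);
    // ... render the hits ...
} finally {
    searcher.close();   // release the open index files
}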

Memory is cheap .. any of my clients would rather pay for a couple of gigs
of memory than have a team come two months after launch to troubleshoot a
memory leak, somewhat reminiscent of the days of C dangling pointers.
Granted, one shouldn't go around using memory liberally, but some trade-offs
do pay off.


Nader Henein

-Original Message-
From: Halácsy Péter [mailto:[EMAIL PROTECTED]]
Sent: Monday, August 12, 2002 11:18 AM
To: Lucene Users List
Subject: RE: Lucene is not closing connections to index files




 -Original Message-
 From: Jason Coleman [mailto:[EMAIL PROTECTED]]
 Sent: Monday, August 12, 2002 12:25 AM
 To: [EMAIL PROTECTED]
 Subject: Lucene is not closing connections to index files


 Lucene is not letting go of (closing) index files that are being
 searched.

 I have not traced exactly where the problem is occurring, so I thought
 I would get some ideas first from the board.  It appears that when a
 user does a search against the Lucene index files, the connections to
 these files are not released.  It continues to maintain a connection
 until the JVM runs out of file space.

yes, you are right. you have to close the searcher to release opened files.

 This is how I am querying the index:

 Searcher searcher = new IndexSearcher(index_path);
 Query query = QueryParser.parse(queryString, "body", new StandardAnalyzer());
 hits = searcher.search(query);

 index_path is just the location of the Lucene index files.  I am sure
 that a Reader class somewhere is not being closed properly.  Has anyone
 experienced this problem when querying the index?

it's not a bug but a feature. Lucene doesn't close files after searching
unless you call the close() method. The cause: it's very slow to reopen the
files.

you should check the discussion about searcher cache (see mailing list
archive)

peter




--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Hit Navigation in Lucene?

2002-08-03 Thread Nader S. Henein

Here's the highlighting JavaScript, ready with copyright and all.

-Original Message-
From: Peter Carlson [mailto:[EMAIL PROTECTED]]
Sent: Friday, August 02, 2002 5:11 AM
To: Lucene Users List
Subject: Re: Hit Navigation in Lucene?


This clicking to the next highlighted term is all done in JavaScript, not by
the backend system.

So if you get permission, you can use their code and hook this in with the
Lucene highlighting. I'll bet that the highlighting is being done via
JavaScript too, so you don't need the Lucene highlighting code.

Although, the Lucene highlighting code works with wildcards.

--Peter


On 8/1/02 12:36 PM, Bruce Best (CRO) [EMAIL PROTECTED] wrote:

 I am looking at Lucene as the search engine for our office's legal
research
 site. We have been looking at some of the commercial offerings, but Lucene
 seems to offer most of what we need, and we may end up using it and
spending
 money on paying someone to customize it to our needs.

 For our purposes, one feature that is probably indispensable is hit
 highlighting and hit navigation. I see the former has already been added to
 the contributions section.

 With respect to hit navigation, the kind of thing I am looking at is along
 the lines of that used by the Fulcrum search engine; if anyone is not
 familiar with Fulcrum, a good example site is the Government of Canada
 Employment Insurance Jurisprudence Library at
 http://www.ei-ae.gc.ca/easyk/search.asp. Do a search for any term (try
 fired), then click on any of the resulting documents. The resulting page
 has the search terms highlighted, much as they would be in Lucene with the
 hit highlighting added, with a narrow frame at the top of the window with
 hit navigation buttons to allow users to jump to the next search term in
the
 document.

 Would it be difficult to implement something similar with Lucene? I am not
 familiar with the technologies involved (I am not a coder), so do not know
 if this is trivial or impossible or somewhere in between.

 Any thoughts would be appreciated,

 Bruce






--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]


RE: Size Capabilities of Lucene Index

2002-07-31 Thread Nader S. Henein

Since it's a file-system-based index I don't see any limitations other than
the OS max file size, and if your data is 3 terabytes I imagine you have
monster machines with monster memory (you'll need it); you'll also need to
max out the file handle setup on the OS and probably use a high MERGE_FACTOR.

PS: I'm hypothesizing here, so please anyone feel free to jump in

Nader Henein
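
(For what it's worth, the merge factor in the 1.x API is just a public field
on IndexWriter; a sketch, with a hypothetical path:)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
writer.mergeFactor = 100; // more segments buffered before merging: faster, but more file handles
// ... addDocument() calls ...
writer.optimize();
writer.close();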

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 31, 2002 6:32 PM
To: Lucene Users List
Subject: Size Capabilities of Lucene Index


Can anyone tell me the amount of data that Lucene is able to index?  Can it
handle up to 3 terabytes? How large are the indexes it creates (1/2 the
size of the data)?

Thanks,

Scott




The information contained in this message may be privileged and confidential
and protected from disclosure.  If the reader of this message is not the
intended recipient, or an employee or agent responsible for delivering this
message to the intended recipient, you are hereby notified that any
dissemination, distribution or copying of this communication is strictly
prohibited. If you have received this communication in error, please notify
us immediately by replying to the message and deleting it from your
computer.  Thank you.  Ernst & Young LLP

--
To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]




--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Using Filters in Lucene

2002-07-31 Thread Nader S. Henein

My index changes (updates every 15 minutes and deletes every 2 minutes), so
using the filter is not going to work for me, because the order of the
documents might change between the time the initial search is done and the
time the filter is applied. I'm currently using a crude method ( ...
doc_id:(23 AND 78 .. ) ) to filter, and it works surprisingly well: I
thought the query parser would cave, but it's doing great even when
filtering within sets as large as 2000 documents.

-Original Message-
From: Scott Ganyo [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 31, 2002 10:24 PM
To: 'Lucene Users List'
Subject: RE: Using Filters in Lucene


Cool.  But instead of adding a new class, why not change Hits to inherit
from Filter and add the bits() method to it?  Then one could pipe the
output of one Query into another search without modifying the Queries...

Scott

 -Original Message-
 From: Doug Cutting [mailto:[EMAIL PROTECTED]]
 Sent: Monday, July 29, 2002 12:03 PM
 To: Lucene Users List
 Subject: Re: Using Filters in Lucene


 Peter Carlson wrote:
  Would you suggest that search in selection type
 functionality use filters or
  redo the search with an AND clause?

 I'm not sure I fully understand the question.

 If you have a condition that is likely to recur commonly in subsequent
 queries, then using a Filter which caches its bit vector is
 much faster
 than using an AND clause.  However, you probably cannot
 afford to keep a
 large number of such filters around, as the cached bit vectors use a
 fair amount of memory--one bit per document in the index.

 Perhaps the ultimate filter is something like the attached class,
 QueryFilter.  This caches the results of an arbitrary query in a bit
 vector.  The filter can then be reused with multiple queries, and (so
 long as the index isn't altered) that part of the query
 computation will
 be cached.  For example, RangeQuery could be used with this,
 instead of
 using DateFilter, which does not cache (yet).

 Caution: I have not yet tested this code.  If someone does try it,
 please send a message to the list telling how it goes.  If this is
 useful, I can document it better and add it to Lucene.

 Doug
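
(A rough sketch of what such a QueryFilter could look like against the 1.x
API, without the caching Doug describes; untested here as well:)

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class QueryFilter extends Filter {
  private Query query;

  public QueryFilter(Query query) {
    this.query = query;
  }

  public BitSet bits(IndexReader reader) throws IOException {
    final BitSet bits = new BitSet(reader.maxDoc());
    new IndexSearcher(reader).search(query, new HitCollector() {
      public void collect(int doc, float score) {
        bits.set(doc); // mark every document the query matches
      }
    });
    return bits;
  }
}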




--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: is this possible in a query?

2002-07-31 Thread Nader S. Henein

This is a long shot, but if you want your search to yield exact results
alone on that specific field, you might want to think about replacing the
spaces between words with underscores (make sure the analyzer doesn't split
them up) and then applying that same rule to the query string, in the sense
that "Cathflo OrthoMed" will become "Cathflo_OrthoMed" and "OrthoMed" will
stay the same; so when you search for "OrthoMed" you'll only get exact
results. This does not save you from re-indexing (unfortunately), but it
does save you from writing a whole new analyzer.

Nader Henein
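
(A sketch of that substitution, using the product names from the thread; the
analyzer must not split on underscores, so something like WhitespaceAnalyzer
is the safe choice:)

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Apply the same rule at index time and at query time.
String joined = "Cathflo OrthoMed".replace(' ', '_'); // Cathflo_OrthoMed

Document doc = new Document();
doc.add(Field.Text("product", joined));

String queryString = "product:" + joined; // product:Cathflo_OrthoMed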

-Original Message-
From: Robert A. Decker [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 01, 2002 6:35 AM
To: Lucene Users List
Subject: Re: is this possible in a query?


I think this may be what I end up doing... Unfortunately this means
reindexing the documents...

thanks,
rob

http://www.robdecker.com/
http://www.planetside.com/

On Wed, 31 Jul 2002 [EMAIL PROTECTED] wrote:

 if you make the product name a type Field.Keyword, it will still be
 indexed and searchable, but will not be tokenized.
 --dmg


 - Original Message -
 From: Robert A. Decker [EMAIL PROTECTED]
 Date: Wednesday, July 31, 2002 5:07 pm
 Subject: is this possible in a query?

  I have a Text Field named "product". Two of the products are:
  "Cathflo OrthoMed"
  "OrthoMed"
 
  When I search for "Cathflo OrthoMed", I correctly only get items
  that have
  the product "Cathflo OrthoMed". However, when I search for
  "OrthoMed", not
  only do I get all "OrthoMed" products, but I also get all "Cathflo
  OrthoMed" products.
 
  Is there a way, when searching on a Field.Text type, to limit the
  above "OrthoMed" search to only "OrthoMed", and to exclude "Cathflo
  OrthoMed"? The solution has to be generic enough to work with any
  combination of product names.
 
  thanks,
  rob
 
  http://www.robdecker.com/
  http://www.planetside.com/
 
 
  --
  To unsubscribe, e-mail:   mailto:lucene-user-
  [EMAIL PROTECTED]For additional commands, e-mail:
  mailto:[EMAIL PROTECTED]
 
 


 --
 To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
 For additional commands, e-mail:
mailto:[EMAIL PROTECTED]



--
To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]




--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Autonomy vs Lucene, etc..

2002-07-24 Thread Nader S. Henein

Could you explain a little bit what Autonomy does for you and what
requirements you have that need to be met?

Nader Henein

-Original Message-
From: Anoop Kumar V [mailto:[EMAIL PROTECTED]]
Sent: Thursday, July 25, 2002 9:05 AM
To: 'Lucene Users List'
Subject: Autonomy vs Lucene, etc..


Hi,

I have a very basic question. We have been using Autonomy until now, but we
are now looking for alternative tools to substitute for Autonomy. We
decided this as we have now shifted to a more internal database/site search
rather than the external search offered by Autonomy. What I want to know is
whether Lucene (or any other search engine) can substitute for Autonomy,
and what the impacts are. Can you also guide me to any other search engine
(ok, if it is not open source) that is suitable in terms of ease of
installation and integration?

thanx in advance..
-anoop


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: I need help

2002-07-24 Thread Nader S. Henein

If you're talking about the ranking (scoring) scheme of the search results,
I imagine that you could use a vectorial model (a lot of changes), but why
do that when an algebraic ranking method is more accurate?

Nader Henein

-Original Message-
From: ilma barbosa [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 24, 2002 10:41 PM
To: [EMAIL PROTECTED]
Subject: I need help


I would like to know whether Lucene makes queries using the
vectorial model.

___
Yahoo! Encontros
O lugar certo para encontrar a sua alma gêmea.
http://br.encontros.yahoo.com/

--
To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]




--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Replication of indexes

2002-07-15 Thread Nader S. Henein

I maintain the index on multiple machines, UPDATING/DELETING/OPTIMIZING on
all three machines; it's hard to make sure that everything is synchronized,
but it provides a fallback in case anything happens to the index. What
you're doing is mainly a copy, which is probably like my backup, which is
quite simple now: I check the index directory for *.lock files, and if none
are present (the index isn't being edited/optimized) I create a write.lock
file, which tells the indexer not to run, and I read the file list using a
shell script and copy the files to a different directory. It's a hack, but
it works fine. I'm currently working on a backup and rollback API for
Lucene which should work for copying the index across.

Nader Henein
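
(In code, the "is it safe to copy" test is just a lock-file scan; a sketch,
with a hypothetical index path:)

import java.io.File;
import java.io.FilenameFilter;

File indexDir = new File("/path/to/index");
String[] locks = indexDir.list(new FilenameFilter() {
  public boolean accept(File dir, String name) {
    return name.endsWith(".lock"); // write.lock / commit.lock
  }
});
boolean safeToCopy = (locks == null || locks.length == 0);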

-Original Message-
From: Harpreet S Walia [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 15, 2002 3:39 AM
To: Lucene Users List
Subject: Replication of indexes


Hello Everybody,

I have a requirement where I need to replicate the index files generated by
Lucene to another server at a remote location. What I have observed is that
Lucene keeps changing the file names of the index files. Does it follow any
specific pattern in doing this? Is anybody doing something like this?

From what I understand it will be best if I optimize the index before I
replicate it, and also make a local copy so that the index is not updated
while it is being replicated. What other issues can there be if I try
something like this?

TIA

Regards
Harpreet





--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: simultaneous searching and indexing

2002-07-10 Thread Nader S. Henein

I'm not sure as to the status of the FAQ, but I've had this discussion
before, and I've tested Lucene heavily during the last few months: I've
searched it during many of my repeated full indexing sessions (which are
extremely exhaustive) and it has not failed me once. About a month ago I
had a related discussion about concurrent indexing and backup that might
shed some light on your issue:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg01709.html

Nader Henein

-Original Message-
From: Harpreet S Walia [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 10, 2002 8:09 PM
To: Lucene Users List
Subject: simultaneous searching and indexing


Hi

I was going through the FAQ and found a mention of the thread safety of
Lucene. From what I understand Lucene is not fully thread safe. The FAQ is
dated 2001; have there been any improvements on this since then?

Is it safe to perform search and index operations on the same index
simultaneously?

Please pardon me if this question has been asked before.

Thanks and Regards,

Harpreet




--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Crash / Recovery Scenario

2002-07-09 Thread Nader S. Henein

I'm not worried about my hardware: I've been blessed with an 8-CPU Sun
machine and two 2-CPU Sun machines with gigs of memory, and I do run Lucene
with 15 threads. I've set my merge factor at 1000, so a lot of work is done
in memory (speed). My current concerns are recovery related, as I'm a few
days from deployment. On Windows-based machines I'm not too familiar with
the threading setup; the beauty of unix is you can do anything. I'm worried
about Lucene hanging mid-indexing: how do I monitor that?

-Original Message-
From: none none [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 08, 2002 11:05 PM
To: [EMAIL PROTECTED]
Subject: RE: Crash / Recovery Scenario


 If you tell me the computer doesn't crash, and the only thing is that you
want to stop the process safely: in this case the Manager will not stop
until the task is complete. Because I am running the Manager as an NT
service I have a little problem here: you cannot stop a thread while it is
doing an I/O operation like a recursive scan of a directory; you have to
wait a little bit.
 I see that you are looking for software stability, but software is
strictly related to hardware; you need good hardware too. Think about a
RAID structure (0 or 5, it depends), think about a clustered system.
This depends on what you want from your search engine.
Also, I think it is good to focus on having a good cache status, e.g.: if I
have a bad error and I can't recover the index, I rebuild it by calling a
method that scans all my cache. It is not great, but better than nothing.
Also, I never had that kind of problem.
Also, adopting multiple threads will improve the actual speed by 40%; you
need to merge all the segments at the end. (I tested with just 2 threads on
Win2K.)
If you are looking for a search engine like google, there is a lot of work
to do, A LOT.
My opinion is to split index and cache across 'n' machines, but the only
thing I don't know how to do is run a search on multiple indexes on
multiple machines. With sockets it will not work: sockets become really
slow with heavy traffic. I was thinking of a Java-compatible DLL able to
merge multiple machines into a logical unit.

ciao.


--

On Mon, 8 Jul 2002 21:07:32
 Nader S. Henein wrote:
brilliant .. I was thinking along the same lines, a new issue that I'm
facing is just lucene dying on me, in the middle of indexing .. no server
crash .. nothing .. what do you do if it just stops mid-indexing ?

-Original Message-
From: none none [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 08, 2002 8:42 PM
To: [EMAIL PROTECTED]
Subject: Re: Crash / Recovery Scenario


 Hi, I perform the same things as you do, but I do that every time I get a
NullPointerException when I try to run a search. If this happens I try to
reopen the index searcher; if I get an exception here I sleep for 500 ms,
then I try again; after 5 tries I generate a servlet exception. Concerning
the deletion of write.lock and commit.lock, I use a manager; what it does
is execute different kinds of operations in blocks, like 100 or 1000.
Each operation can be:
1. Delete documents
2. Add documents
3. Search document/s

A combination of these 3 operations allows me to update the index with
searches still running. But there is a versioning problem between the
current cache of documents and the current version of INDEXED documents:
during an update you can search for something that is found in the index
but that has been updated in the cache, so I have a bunch of duplicate
documents during that, and at the end I notify all the clients connected to
that Manager, using an RMI callback, to reopen the index; then I clean up
all the duplicates. At this stage I still have an error in case the Manager
dies, because I have everything in memory, but I did a little workaround to
handle that. My next step is to make these transactions persistent, so I
can recover the previous status.

Every time I run an operation as listed above I check whether write.lock or
commit.lock exists; in that case I call the unlock() method, I delete them
(if the unlock() method doesn't), then I optimize the index.

Until now everything seems to work fine.
ciao.

--

On Mon, 8 Jul 2002 09:40:10
 Nader S. Henein wrote:

I'm currently using Lucene to sift through about a million documents, I've
written a servlet to do the indexing and the searching, the servlets are
ran
through resin, The Crash scenario I'm thinking of is a web server crash (
for a million possible reasons ) while the index is being updated or
optimized, what I've noticed is the creation of write.lock and commit.lock
files witch stop further indexing because the application thinks that the
previously scheduled indexer is still running (witch could very well be
true
depending on the size of the update). This is the recovery I have in mind
but I think it might be somewhat of a hack, On restart of the web server
I've written an Init function that checks for write.lock or commit.lock ,
and if either exist it deletes both of them and optimizes the index. Am I
forgetting anything

RE: Crash / Recovery Scenario

2002-07-09 Thread Nader S. Henein

Karl, what if I copy the index in memory or to another directory prior to
indexing, thereby assuring a working index in the case of a crash? I want
to stay away from DB interaction, as I am trying to move out of an Oracle
Intermedia search solution (if you saw the Oracle price list, you would
too). I have a backup process which:
1) Checks if the index is being updated
2) Does a small trial search (to ensure that the index is not corrupt)
3) Tars the index and moves the file to another disk

I'm thinking of writing a full backup/restore add-on to Lucene so all of
this can be jarred together as part of the package.

Nader

-Original Message-
From: Karl Øie [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, July 09, 2002 1:49 PM
To: Lucene Users List
Subject: Re: Crash / Recovery Scenario


 only deletes the old one while it's working on the new one, so is there a
 way of checking for the .lock files in the case
 of a crash and rolling back to the old index image?

 Nader Henein

I have some thoughts about crash/recovery/rollback that I haven't found any
good solutions for.

If a crash happens during writing, there is no good way to know whether the
index is intact; removing lock files doesn't help this fact, as we really
don't know. So providing rollback functionality is a good but expensive way
of compensating for the lack of recovery.

To provide rollback I have used a RAMDirectory and serialized it to a SQL
table. By doing this I can catch any exceptions and ask the database to
roll back if required. This works great for small indexes, but if the index
grows you will have performance problems, as the whole RAMDir has to be
serialized/deserialized into the BLOB all the time.

A better solution would be to hack the FSDirectory to store each file it
would store in a file directory as a serialized byte array in a blob of a
SQL table. This would increase performance, because the whole Directory
doesn't have to change each time and it doesn't have to read the whole
directory into memory. I also suspect Lucene sorts its records into these
different files for increased performance (like: I KNOW that record will be
in segment xxx if it is there at all).

I have looked at the source for the RAMDirectory and the FSDirectory, and
they could both be altered to store their internal buffers in a BLOB, but I
haven't managed to do this successfully. The problem I have been pounding
on is the lucene InputStream's seek() function. This really requires the
underlying impl to be either a file or an array in memory. For a BLOB this
would mean that the blob has to be fetched, then read/seek-ed/written, then
stored back again. (Is this correct?!? And if so, is there a way to know
WHEN it is required to fetch/store the array?)

I would really appreciate any tips on this as i would think
crash/recovery/rollback functionality to benefit lucene greatly.

I have indexes that uses 5 days to build, and it's really bad to receive
exceptions during a long index run, and no recovery/rollback functionality.

Mvh Karl Øie

--
To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]




--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Crash / Recovery Scenario

2002-07-08 Thread Nader S. Henein

I understand that these files are there for a reason, but in the case of a
web server crash Lucene will not be able to update/delete/optimize the
index while these files exist. The existence of these two files after a web
server restart means that the crash occurred while the web server was
editing the index, and since there is no way to roll back (is there? that
would be a cool feature) I have to cut my losses and continue.

Sorry for thinking out loud, but speaking of rollback, I asked a question a
while back about backing up the index while it's being written to:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg01711.html
and Peter told me that it's no problem, especially on a Unix machine,
because the Lucene writer creates a new index and only deletes the old one
while it's working on the new one. So is there a way of checking for the
.lock files in the case of a crash and rolling back to the old index image?

Nader Henein

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 08, 2002 9:43 PM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: Re: Crash / Recovery Scenario


Nader,

I don't have a solution for you, but just removing these two files is
probably not a good idea.  There is a reason for their existence.
Actually, check the jGuru Lucene FAQ for more information about them.

Otis
P.S.
s/witch/which/gi :)
witch = the ugly woman flying around on a broom stick :)

--- Nader S. Henein [EMAIL PROTECTED] wrote:

 I'm currently using Lucene to sift through about a million documents,
 I've
 written a servlet to do the indexing and the searching, the servlets
 are ran
 through resin, The Crash scenario I'm thinking of is a web server
 crash (
 for a million possible reasons ) while the index is being updated or
 optimized, what I've noticed is the creation of write.lock and
 commit.lock
 files witch stop further indexing because the application thinks that
 the
 previously scheduled indexer is still running (witch could very well
 be true
 depending on the size of the update). This is the recovery I have in
 mind
 but I think it might be somewhat of a hack, On restart of the web
 server
 I've written an Init function that checks for write.lock or
 commit.lock ,
 and if either exist it deletes both of them and optimizes the index.
 Am I
 forgetting anything ? is this wrong ? is there a Lucene specific way
 of
 doing this like running the optimizer with a specific setup.

 Nader S. Henein
 Bayt.com , Dubai Internet City
 Tel. +9714 3911900
 Fax. +9714 3911915
 GSM. +9715 05659557
 www.bayt.com


 --
 To unsubscribe, e-mail:
 mailto:[EMAIL PROTECTED]
 For additional commands, e-mail:
 mailto:[EMAIL PROTECTED]



__
Do You Yahoo!?
Sign up for SBC Yahoo! Dial - First Month Free
http://sbc.yahoo.com

--
To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]




--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Crash / Recovery Scenario

2002-07-07 Thread Nader S. Henein


I'm currently using Lucene to sift through about a million documents, I've
written a servlet to do the indexing and the searching, the servlets are ran
through resin, The Crash scenario I'm thinking of is a web server crash (
for a million possible reasons ) while the index is being updated or
optimized, what I've noticed is the creation of write.lock and commit.lock
files witch stop further indexing because the application thinks that the
previously scheduled indexer is still running (witch could very well be true
depending on the size of the update). This is the recovery I have in mind
but I think it might be somewhat of a hack, On restart of the web server
I've written an Init function that checks for write.lock or commit.lock ,
and if either exist it deletes both of them and optimizes the index. Am I
forgetting anything ? is this wrong ? is there a Lucene specific way of
doing this like running the optimizer with a specific setup.

Nader S. Henein
Bayt.com , Dubai Internet City
Tel. +9714 3911900
Fax. +9714 3911915
GSM. +9715 05659557
www.bayt.com
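
(A bare-bones sketch of the init check described above; in the 1.x layout the
lock files live in the index directory itself, and the path is hypothetical:)

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class IndexInit {
  public static void recover(String indexPath) throws Exception {
    File write = new File(indexPath, "write.lock");
    File commit = new File(indexPath, "commit.lock");
    if (write.exists() || commit.exists()) {
      // a previous run died mid-update: clear the stale locks, then optimize
      write.delete();
      commit.delete();
      IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
      writer.optimize();
      writer.close();
    }
  }
}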


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Stress Testing Lucene

2002-06-29 Thread Nader S. Henein

That's the weird thing: I wasn't writing to the index at the time I was
searching (hardcore searching), 20 clients each issuing 20 simultaneous
search requests .. it was going fine until it started throwing errors at
me, and when I looked at the logs I found a set of "Too many files open"
errors. Previously this only happened if there was a crash on the server
while indexing, leaving an un-optimized index with 800+ files.

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 27, 2002 7:36 PM
To: Lucene Users List
Subject: Re: Stress Testing Lucene


It's very hard to leave an index in a bad state.  Updating the
"segments" file atomically updates the index.  So the only way to
corrupt things is to only partly update the "segments" file.  But that too
is hard, since it's first written to a temporary file, which is then
renamed "segments".  The only vulnerability I know of is that in Java on
Win32 you can't atomically rename a file to something that already
exists, so Lucene has to first remove the old version.  So if you were
to crash between the time that the old version of "segments" is removed
and the new version is moved into place, then the index would be
corrupt, because it would have no "segments" file.

Doug

Scott Ganyo wrote:
 Which came first--the out of file handles error or the corruption?  I
 haven't looked, but I would guess that if you ran into the file handles
 exception while writing, that might leave Lucene in a bad state.  Lucene
 isn't transactional and doesn't really have the ACID properties of a
 database...


-Original Message-
From: Nader S. Henein [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 26, 2002 11:45 PM
To: Lucene Users List
Subject: RE: Stress Testing Lucene


 I rebooted my machine and still the same issue .. if I knew what caused
 that to happen, I would be able to solve it with some source tweaking,
 and it's not the file handles on the machine; I got over that problem
 months ago. Let's consider the worst case scenario, that corruption did
 occur: what could be the reasons? I'm going to need some insider help to
 get through this one.

N.

-Original Message-
From: Scott Ganyo [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 26, 2002 7:15 PM
To: 'Lucene Users List'
Subject: RE: Stress Testing Lucene


1) Are you sure that the index is corrupted?  Maybe the file
handles just
haven't been released yet.  Did you try to reboot and try again?

2) To avoid the too-many-files problem: a) increase the system file handle
limits, b) make sure that you reuse IndexReaders as much as you can across
requests and clients rather than opening and closing them.


-Original Message-
From: Nader S. Henein [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 26, 2002 10:11 AM
To: [EMAIL PROTECTED]
Subject: Stress Testing Lucene
Importance: High



Hey people,

 I'm running a Lucene (v1.2) servlet on resin and I must say, compared to
 Oracle Intermedia, it's working beautifully. BUT today I started stress
 testing: I downloaded a program called Web Roller, which simulates
 clients, requests, multi-threading .. the works, and I was doing
 something like 50 simultaneous requests, repeating that 10 times in a row.

 But then something happened and the index got corrupted: every time I try
 opening the index with the reader to search, or open it with the writer
 to optimize, I get that damned too-many-files-open error. I can imagine
 that every application on the market has a breaking point and these
 breaking points have side effects. So is the corruption of the index a
 side effect, and if so, is there a way to configure my web server to
 crash before the corruption occurs? I'd rather re-start the web server
 and throw some people off whack than have to re-build the index or
 revert to an older version.

 Do you know of any way to safeguard against this ?

General Info:
The index is about 45 MB with 60 000 XML files each
containing 18-25 fields.


Nader S. Henein
Bayt.com , Dubai Internet City
Tel. +9714 3911900
Fax. +9714 3911915
GSM. +9715 05659557
www.bayt.com


--
To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]


--
To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]





--
To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]




--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Internationalization - Arabic Language Support

2002-06-29 Thread Nader S. Henein

I'm indexing Arabic in my index, and to make it searchable I had to switch
character sets (not fun). The problem lies in the weak standards
surrounding Arabic character sets: between ISO 8859-6, win-1256 and UTF-8
you can have three different representations of the same exact thing. UTF-8
stores Arabic in numeric form (the code that represents each letter), and
the Lucene analyzer isn't too friendly with numbers, especially if you use
a stemmer. When it comes to the other two encodings, they are different but
both come back to the same results: Lucene views them as if they were
European character sets and tries to apply the same rules to them. So take
care when you're indexing Arabic. I only figured it out when I started
experimenting with different unix charset settings while encoding, because
I have an Oracle DB that spits out the XML files on a Solaris OS, and then
Lucene picks them up for indexing; and since my core application isn't in
Java I have to contend with two web servers: the main application (AOL
Server) and the search application (Lucene on Resin).

When trying to figure out encoding issues, you need to convert everything
to its most simple form and compare and contrast as it passes through your
application.

Nader
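
(A sketch of pinning the charset explicitly when feeding files to the index,
rather than trusting the platform default; path and field name hypothetical:)

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Reader reader = new BufferedReader(
    new InputStreamReader(new FileInputStream("/data/doc.xml"), "UTF-8"));
Document doc = new Document();
doc.add(Field.Text("contents", reader)); // tokenized from the charset-aware reader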

-Original Message-
From: W. Eliot Kimber [mailto:[EMAIL PROTECTED]]
Sent: Friday, June 28, 2002 6:59 PM
To: Lucene Users List
Subject: Re: Internationalization - Arabic Language Support


Peter Carlson wrote:

 The biggest part that is usually changed per language is the analyzer.
This
 is the part of Lucene which transforms and breaks up a string into
distinct
 terms.

I have only the smallest understanding of Arabic as a language, but I
have done some work to implement back-of-the-book indexing of Arabic
(and other languages) for XSL/XSLT. Based on that experience, I think
that the main challenges in implementing an Arabic analyzer would be:

1. Understanding the stemming rules for Arabic. Our research into Arabic
collation revealed that the rules for how Arabic words are formed is not
nearly as simple as in English and other Western languages. At this
point we haven't stepped up to trying to implement (or find an
implementation for) Arabic stemming for collation (words are collated
first by their roots, which are not necessarily at the start of the
words, so simple lexical collation won't work for Arabic and I'm
assuming that full-text indexing by word roots would have the same
problem). So I don't know more than that the problem is hard, even for
native speakers of Arabic.

2. Handling different letter forms in queries--Semitic languages often
have different forms for the same abstract character for different
positions in a word: initial forms, final forms, and base forms. These
different forms have different Unicode code points (although initial and
final forms are identified as such in the Unicode database). Often a
word will be stored with the base forms but the presented word will be
transformed to use the appropriate initial or final form. This means,
for example, that cutting and pasting a word from, say, a PDF document
into a query might require rationalization of variant forms to base
forms before performing the search (assuming that the analyzer also
reduces all letters to their base forms for indexing).

3. Right-to-left entry of queries and presentation of results. Mixing
right-to-left data with left-to-right data can get pretty tricky at the
user interface level (it's not an issue at the data storage level, where
all characters are stored in order of occurrence regardless of
presentation direction). Good support for bidirectional input and
presentation is hit and miss at best. For example, we could not figure
out how to get Internet Explorer to correctly present mixed English and
Arabic where there were lots of special characters (as opposed to simple
flowed prose, which seems to work OK).  I would expect Arabic localized
Web browsers to handle input OK, but it might be hard to find GUI
toolkits that do it well.

IBM's ICU4J package, a collection of national language support utilities
and libraries, might offer some solutions to this problem but I have not
yet investigated its support for Arabic and similar languages (we used
it for its Thai word breaker, which would be needed to implement a Thai
analyzer for Lucene).

Cheers,

Eliot
--
W. Eliot Kimber, [EMAIL PROTECTED]
Consultant, ISOGEN International

1016 La Posada Dr., Suite 240
Austin, TX  78752 Phone: 512.656.4139

--
To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]




--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Stress Testing Lucene

2002-06-26 Thread Nader S. Henein

I rebooted my machine and still the same issue .. if I knew what caused
that to happen, I would be able to solve it with some source tweaking,
and it's not the file handles on the machine; I got over that problem
months ago. Let's consider the worst case scenario, that corruption did
occur: what could be the reasons? I'm going to need some insider help to
get through this one.

N.

-Original Message-
From: Scott Ganyo [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 26, 2002 7:15 PM
To: 'Lucene Users List'
Subject: RE: Stress Testing Lucene


1) Are you sure that the index is corrupted?  Maybe the file handles just
haven't been released yet.  Did you try to reboot and try again?

2) To avoid the too-many-files problem: a) increase the system file handle
limits, b) make sure that you reuse IndexReaders as much as you can across
requests and clients rather than opening and closing them.

 -Original Message-
 From: Nader S. Henein [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, June 26, 2002 10:11 AM
 To: [EMAIL PROTECTED]
 Subject: Stress Testing Lucene
 Importance: High



 Hey people,

 I'm running a Lucene (v1.2) servlet on resin and I must say, compared to
 Oracle Intermedia, it's working beautifully. BUT today I started stress
 testing: I downloaded a program called Web Roller, which simulates
 clients, requests, multi-threading .. the works, and I was doing
 something like 50 simultaneous requests, repeating that 10 times in a row.

 But then something happened and the index got corrupted: every time I try
 opening the index with the reader to search, or open it with the writer
 to optimize, I get that damned too-many-files-open error. I can imagine
 that every application on the market has a breaking point and these
 breaking points have side effects. So is the corruption of the index a
 side effect, and if so, is there a way to configure my web server to
 crash before the corruption occurs? I'd rather re-start the web server
 and throw some people off whack than have to re-build the index or
 revert to an older version.

 Do you know of any way to safeguard against this ?

 General Info:
 The index is about 45 MB with 60 000 XML files each
 containing 18-25 fields.


 Nader S. Henein
 Bayt.com , Dubai Internet City
 Tel. +9714 3911900
 Fax. +9714 3911915
 GSM. +9715 05659557
 www.bayt.com


 --
 To unsubscribe, e-mail:
 mailto:[EMAIL PROTECTED]
 For additional commands, e-mail:
 mailto:[EMAIL PROTECTED]



--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Stress Testing Lucene

2002-06-26 Thread Nader S. Henein

sorry .. but still the same problem. I've saved the index in a separate
directory and I've re-indexed overnight, so testing (which is currently
underway) on the system can resume. Like I said in my previous email, worst
case scenario the index is corrupted: any ideas as to why? I'll gladly go
into the source, but some guidance as to a starting point would be nice.


-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 26, 2002 7:33 PM
To: Lucene Users List
Subject: RE: Stress Testing Lucene



--- Scott Ganyo [EMAIL PROTECTED] wrote:
 1) Are you sure that the index is corrupted?  Maybe the file handles
 just
 haven't been released yet.  Did you try to reboot and try again?

You can also do something like this:

# lsof | wc -l
   8727

# lsof | grep -c java
   5382

# lsof | grep java | head
mozilla-b 8428   otis  memREG3,5  1242726   1287892
/usr/local/.version/IBMJava2-13/jre/bin/libjavaplugin_oji.so
mozilla-b 8453   otis  memREG3,5  1242726   1287892
/usr/local/.version/IBMJava2-13/jre/bin/libjavaplugin_oji.so
mozilla-b 8454   otis  memREG3,5  1242726   1287892
/usr/local/.version/IBMJava2-13/jre/bin/libjavaplugin_oji.so
mozilla-b 8455   otis  memREG3,5  1242726   1287892
/usr/local/.version/IBMJava2-13/jre/bin/libjavaplugin_oji.so
mozilla-b 8457   otis  memREG3,5  1242726   1287892
/usr/local/.version/IBMJava2-13/jre/bin/libjavaplugin_oji.so
mozilla-b 8471   otis  memREG3,5  1242726   1287892
/usr/local/.version/IBMJava2-13/jre/bin/libjavaplugin_oji.so

 2) To avoid the too-many-files problem: a) increase the system file handle
 limits, b) make sure that you reuse IndexReaders as much as you can across
 requests and clients rather than opening and closing them.

  -Original Message-
  From: Nader S. Henein [mailto:[EMAIL PROTECTED]]
  Sent: Wednesday, June 26, 2002 10:11 AM
  To: [EMAIL PROTECTED]
  Subject: Stress Testing Lucene
  Importance: High
 
 
 
  Hey people,
 
  I'm running a Lucene (v1.2) servlet on resin and I must say, compared to
  Oracle Intermedia, it's working beautifully. BUT today I started stress
  testing: I downloaded a program called Web Roller, which simulates
  clients, requests, multi-threading .. the works, and I was doing
  something like 50 simultaneous requests, repeating that 10 times in a row.

  But then something happened and the index got corrupted: every time I try
  opening the index with the reader to search, or open it with the writer
  to optimize, I get that damned too-many-files-open error. I can imagine
  that every application on the market has a breaking point and these
  breaking points have side effects. So is the corruption of the index a
  side effect, and if so, is there a way to configure my web server to
  crash before the corruption occurs? I'd rather re-start the web server
  and throw some people off whack than have to re-build the index or
  revert to an older version.

  Do you know of any way to safeguard against this ?
 
  General Info:
  The index is about 45 MB with 60 000 XML files each
  containing 18-25 fields.
 
 
  Nader S. Henein
  Bayt.com , Dubai Internet City
  Tel. +9714 3911900
  Fax. +9714 3911915
  GSM. +9715 05659557
  www.bayt.com
 
 
  --
  To unsubscribe, e-mail:
  mailto:[EMAIL PROTECTED]
  For additional commands, e-mail:
  mailto:[EMAIL PROTECTED]
 



__
Do You Yahoo!?
Yahoo! - Official partner of 2002 FIFA World Cup
http://fifaworldcup.yahoo.com

--
To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]




--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




IndexReader Pool

2002-06-26 Thread Nader S. Henein


I was going through the lucene-user posts on the web and I came across
a posting by Scott Oshima:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00693.html

which talks about creating an IndexReader pool to speed up the search.
I've looked into that, but I can't figure out what to use for a DataSource,
like in creating a pool for DB connections. Is there an equivalent in the
Lucene architecture, or should one just take the initiative?

Nader S. Henein
Bayt.com , Dubai Internet City
Tel. +9714 3911900
Fax. +9714 3911915
GSM. +9715 05659557
www.bayt.com
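
(There is no DataSource equivalent in Lucene itself; the usual shape of such
a pool is just a shared, synchronized holder. A sketch against the 1.x API,
class name and path hypothetical:)

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Searcher;

public class SearcherPool {
  private static Searcher searcher;
  private static final String INDEX_PATH = "/path/to/index";

  public static synchronized Searcher get() throws IOException {
    if (searcher == null) {
      searcher = new IndexSearcher(INDEX_PATH);
    }
    return searcher;
  }

  // call after the index has been updated, so new segments are picked up
  public static synchronized void reopen() throws IOException {
    if (searcher != null) {
      searcher.close();
    }
    searcher = new IndexSearcher(INDEX_PATH);
  }
}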

--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Updating Documents in the index

2002-06-24 Thread Nader S. Henein

As there is no update in Lucene, this is exactly what you need to do, and I
would advise you to batch your updates and optimize after you update,
because the number of files balloons if you don't.
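
(A sketch of the batched delete-then-re-add cycle against the 1.x API; the
"id" field and the variables are hypothetical:)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// 1) delete the stale versions
IndexReader reader = IndexReader.open(indexPath);
reader.delete(new Term("id", "doc-42"));
reader.close();

// 2) re-add the updated documents in one batch, then optimize
IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
writer.addDocument(updatedDoc);
writer.optimize(); // keeps the file count down after a batch
writer.close();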

-Original Message-
From: Harpreet S Walia [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, June 25, 2002 8:26 AM
To: Lucene Users List
Subject: Updating Documents in the index


Hi,

My application needs to provide a feature for updating documents in the
index.
I am thinking of doing this by deleting the original document and indexing
the updated one again. I think this is possible using the delete methods in
the IndexReader class.

Is there some other better way to achieve this with Lucene?

Thanks and Regards
Harpreet


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Boolean Query + Memory Monster

2002-06-15 Thread Nader S. Henein

I'm all ears .. I'm running the search from a servlet on 
a resin web server, any suggestions as to increasing the heap
size in this case ?



-Original Message-
From: Scott Ganyo [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 13, 2002 9:47 PM
To: 'Lucene Users List'
Subject: RE: Boolean Query + Memory Monster


Use the java -Xmx option to increase your heap size.

Scott

 -Original Message-
 From: Nader S. Henein [mailto:[EMAIL PROTECTED]]
 Sent: Thursday, June 13, 2002 12:20 PM
 To: [EMAIL PROTECTED]
 Subject: Boolean Query + Memory Monster
 
 
 
 I have 1 Geg of memory on the machine with the application 
 when I use a normal query it goes well, but when I use a range 
 query it sucks the memory out of the machine and throws a servlet 
 out of memory error, 
 I have 80 000 records in the index and it's 43 MB large
 
 anything people ?
 
 
 Nader S. Henein
 Bayt.com , Dubai Internet City
 Tel. +9714 3911900
 Fax. +9714 3911915
 GSM. +9715 05659557
 www.bayt.com
 
 --
 To unsubscribe, e-mail:   
 mailto:[EMAIL PROTECTED]
 For additional commands, e-mail: 
 mailto:[EMAIL PROTECTED]
 


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




DateField issues

2002-06-13 Thread Nader S. Henein


I managed to index according to the date field no problem, but then
when I search using a date filter, the search is slightly slower and
the results do not seem to be constrained by any date.

The following code segment shows how I'm searching :
(I basically want all the records indexed with dates after start )


// Current time in millis
long currInMills = System.currentTimeMillis();
// start = currInMills - ( a number of days * ( length of a day in millis ) )
long start = currInMills - ( freshness * dayInMillis ) ;

Query query = QueryParser.parse(queryString, "title", new SuperStandardAnalyzer());
// DateFilter.After is a static factory in the 1.x API, not a constructor
Filter filter = DateFilter.After("datemodified", start);
Searcher searcher = new IndexSearcher(indexPath);
Hits hits = searcher.search(query, filter);


I know I'm indexing the dates correctly because I encode them, then I
decode them and print them, and they seem to be accurate. Just to be sure,
time in millis is measured since 01 01 1970, right?

if anyone has any idea why this isn't working please feel free to contribute
, oh and if you're wondering
yes I also tried the date filter with start and end .. nada



Nader S. Henein
Bayt.com , Dubai Internet City
Tel. +9714 3911900
Fax. +9714 3911915
GSM. +9715 05659557
www.bayt.com
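
(One thing worth double-checking: DateFilter compares against terms encoded
by DateField, so the field has to be indexed the same way; a sketch, using
the field name from the post:)

import org.apache.lucene.document.DateField;
import org.apache.lucene.document.Field;

doc.add(Field.Keyword("datemodified", DateField.timeToString(lastModifiedMillis)));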


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Creating indexes

2002-06-12 Thread Nader S. Henein

Depending on the build of the document, but I guess not:
I had to write my own XML parser. You get better results when
you customize something like that to your needs.
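
(The per-record half is straightforward once your parser has split the file;
a sketch, using the field names from the question:)

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document makeDoc(String author, String date, String title, String text) {
  Document doc = new Document();
  doc.add(Field.Keyword("author", author)); // stored, not tokenized
  doc.add(Field.Keyword("date", date));
  doc.add(Field.Text("title", title));      // stored and tokenized
  doc.add(Field.Text("text", text));
  return doc;
}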

-Original Message-
From: Chris Sibert [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 12, 2002 10:27 AM
To: Lucene Users List
Subject: Creating indexes


I have a big ( 40 MB or so) file to index. The file contains a whole bunch
of documents, which are each pretty small, about a few typewritten pages
long. There's a title, date, and author for each document, in addition to
the documents' actual text.

I'm not quite sure how you index this in Lucene. For each document in the
original file, I assume that I create a separate Lucene Document object in
the index with author, date, title, and text fields. If so, my question is
that when I'm reading in the original file for indexing, does Lucene know
where each document begins and ends in the original file ? Or do I have to
write a parser or filter or something for the InputStream that's reading the
file ?

Chris Sibert



--
To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]



--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Wildcard Search Issues

2002-05-28 Thread Nader S. Henein


I'm using the new Lucene 1.5 release, and I remember a message in the
lucene-user mailing list that talked about a wildcard issue: if you index
something like this:

<resloc>CCsa</resloc>

and search using the following query string: resloc:CCsa*, it will yield no
results. Then there was a reply saying that the issue had been resolved in
the nightly builds; this was about two weeks before rc1.5 (which I'm
using), and according to the rc1.5 mailer that went out, wildcard issues
were hammered out. But I still have this problem: if I search using
resloc:CCsa I get 5 results, but when I add the star to the right-hand side
of the query string, like so: resloc:CCsa*, I get no results.

Anyone care to shed some light on this issue ?

Nader S. Henein
Bayt.com , Dubai Internet City
Tel. +9714 3911900
Fax. +9714 3911915
GSM. +9715 05659557
www.bayt.com


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Filtering in Lucene

2002-05-13 Thread Nader S. Henein

For those of you who have worked with the BitSet concept to use Lucene for
searching within a subset: just to make sure that I got this right, if I
have 100 000 documents to search, my bit vector will be 100 000 bits long,
and just to save that vector for repeated use I'll have to use a clob! Am I
thinking right, or have I misunderstood the concept?

thanks

Nader S. Henein
Bayt.com , Dubai Internet City
Tel. +9714 3911900
Fax. +9714 3911915
GSM. +9715 05659557
www.bayt.com


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Italian web sites

2002-04-24 Thread Nader S. Henein

Sniff the IP and then, using the database at the internet topology website
http://netgeo.caida.org/perl/netgeo.cgi, you can find the country of origin
(use that to populate your own DB, so retrieval time decreases as you
accumulate IPs). But that will give you the websites hosted in Italy, not
Italian-language websites. Unfortunately, unless Italian used a different
encoding for the page, picking the language up from the page (JavaScript)
won't help much.




-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, April 24, 2002 1:03 PM
To: [EMAIL PROTECTED]
Subject: Italian web sites


Hi all,

I'm using Jobo for spidering web sites and Lucene for indexing. The
problem is that I'd like to spider only Italian web sites.
How can I discover the country of a web site?

Do you know some method that you can suggest to me?

Thanks


Laura



--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: too many open files in system

2002-04-09 Thread Nader S. Henein

It's not a matter of releasing the handles; it needs to keep them open.
This tricked me as well: I thought it kept the file handles of the source
XML files open, but if you look at the code it actually reads the contents
of the files from an HTTP request. The file handles are consumed by the
files that Lucene creates to store the index results; that's why you get
the same error when you try to search as well: it tries to open all the
files but runs out of handles in the process. You have to increase your
unix file handles and reboot the system (how to depends on your OS); this
solves one problem.

I just hit another one, but I'm convinced it's worth it: I've gotten
excellent results after indexing 20 000 files, very fast and very
responsive, and if it's going to take some tweaking to get it over this
problem, so be it; that's the joy of open source.

cheers .. I hope that was useful

-Original Message-
From: root [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 09, 2002 4:02 PM
To: [EMAIL PROTECTED]
Subject: too many open files in system


Hi List!

Doesn't Lucene release the file handles??

because I get "too many open files in system" after running Lucene a while!

I use the 1.2 rc4 version!


regards

--
To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]



--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: too many open files in system

2002-04-09 Thread Nader S. Henein

That depends on how many files you're indexing .. I still have to figure
out what logic the LuceneCocoonIndexer adheres to when it is creating the
index files.


-Original Message-
From: root [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 09, 2002 4:50 PM
To: Lucene Users List
Subject: Re: too many open files in system


On Tuesday, 9. April 2002 14:08, you wrote:
 root wrote:
  Doesn't Lucene releases the filehandles??
 
  because I get too many open files in system after running lucene a
  while!

 Are you closing the readers and writers after you've finished using them?

 cheers,

 Chris


Yes I close the readers and writers!


@Nader S. Henein

If I increase the filehandles, to what count should I increase them?


--
To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]



--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: too many open files in system

2002-04-09 Thread Nader S. Henein

That might be the case: I'm indexing 200 000 files, each one has about 30
XML fields, and each one has a set of attributes .. could that be it?

-Original Message-
From: Karl Øie [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 09, 2002 7:03 PM
To: Lucene Users List
Subject: Re: too many open files in system


I have worked a little with the cocoon indexer and it indexes each
xml-attribute in a Field. I have done some indexing on both plaintext and
xml sources, and I think the "Too many open files" problem is directly
related to the number of fields stored in a document in an index.

The reason for this is that I have never encountered "Too many open files"
when indexing clean text into one large field, but when creating many, many
fields as required by indexing xml I got "Too many open files", until I had
to use a ram-dir to index document batches into..

mvh karl øie
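
(A sketch of that ram-dir batching with the 1.x addIndexes call; paths
hypothetical:)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// build each batch in memory, then merge it into the disk index in one go
RAMDirectory ramDir = new RAMDirectory();
IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
// ... addDocument() for one batch ...
ramWriter.close();

IndexWriter fsWriter = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
fsWriter.addIndexes(new Directory[] { ramDir });
fsWriter.close();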

On Tuesday 09 April 2002 16:42, you wrote:
 This sounds like a question for Cocoon people, as what you are asking
 about seems to be related to Cocoon's usage of Lucene, not the core
 Lucene API.

 Otis

--
To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]



--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Index problem

2002-04-08 Thread Nader S. Henein

I'm currently working on indexing 200 000 documents with
index updates every half hour on three separate webservers.

So you can see my ordeal: I have to update the index (delete and add) on
three separate machines. How many files are you indexing? The first issue I
faced was the "Too many files open" error. And are you indexing your files
from the webapp, or did you write the indexer to run from the command line?
Sorry about all the questions, but there are very few people on the dev
mailers talking about the Lucene/Cocoon issues, so it's a joy when a new
voice props up.
Nader



-Original Message-
From: Flavio Arruda [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 08, 2002 7:03 PM
To: [EMAIL PROTECTED]
Subject: Index problem


Hi everybody,

All documents of my application (indexed by Lucene) come from a Web Form
whose fields the application's Administrator can change/remove/add
regularly.

Researching Lucene's FAQs I gathered that the only way to alter an indexed
document (adding indexes, deleting indexes, modifying fields) is to delete
the given document and then add the modified version. Unfortunately this
looks very slow for my application, because I have thousands of documents
for each Form.

My questions are:
  - Is there any efficient way to do what I need using Lucene?
  - If not, where is the best place to modify the Lucene code? Is someone
working on this?

 Thanks by advance and best wishes,

Flavio

Flavio Regis de Arruda
[EMAIL PROTECTED]

PROMON*INTELLIGENS
Av. Pres. Juscelino Kubitschek, 1830/6º andar - T3
CEP: 04543-900, São Paulo, SP
Tel.: 55.11.3847 1173, Fax: 55.11.3847 4546
www.promoninteligens.com.br




--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]