Re: Re-Indexing a moving target???

2005-02-01 Thread Nader Henein
details?
Yousef Ourabi wrote:
Saad,
Here is what I got. I will post again, and be more
specific.
-Y
--- Nader Henein [EMAIL PROTECTED] wrote:
 

We'll need a little more detail to help you: what are the sizes of your
updates, and how often are they applied?

1) No, just re-open the index writer every time you re-index. Since it is,
as you describe it, a moderately changing index, just keep a flag on the
rows and batch-index every so often.
2) It all comes down to your needs; more detail would help us help you.

Nader Henein
Yousef Ourabi wrote:
   

Hey,
We are using Lucene to index a moderately changing database, and I have a
couple of questions on a performance strategy.
1) Should we just have one index writer open until the system comes down,
or create a new index writer each time we re-index our data-set?
2) Does anyone have any thoughts on multi-threading and segments instead
of one index?
Thanks for your time and help.
Best,
Yousef
 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

--
Nader S. Henein
Senior Applications Developer
Bayt.com
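The flag-and-batch approach Nader suggests can be sketched in plain Java; the Row class and the indexRow() placeholder below are hypothetical stand-ins for the real database table and the actual Lucene indexing call:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the "dirty flag" batching Nader describes: rows that
// change are marked dirty, and a scheduled task periodically re-indexes
// only those rows, then clears the flag.
public class DirtyBatchIndexer {
    static class Row {
        final int id;
        boolean dirty;
        Row(int id, boolean dirty) { this.id = id; this.dirty = dirty; }
    }

    /** Re-index every dirty row, clear its flag, return how many were processed. */
    static int runBatch(List<Row> rows) {
        int indexed = 0;
        for (Row r : rows) {
            if (r.dirty) {
                indexRow(r);      // placeholder: delete the old document from
                r.dirty = false;  // the index, add the new one, close the writer
                indexed++;
            }
        }
        return indexed;
    }

    static void indexRow(Row r) { /* Lucene add/update would go here */ }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(1, true));
        rows.add(new Row(2, false));
        rows.add(new Row(3, true));
        System.out.println(runBatch(rows)); // 2
        System.out.println(runBatch(rows)); // 0 (flags were cleared)
    }
}
```

The point of the flag is that the scheduled task touches only changed rows, so the IndexWriter is opened briefly and predictably rather than held open forever.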

   



Re: QUERYPARSIN BOOSTING

2005-01-11 Thread Nader Henein
From the text on the Lucene Jakarta Site : 
http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

Lucene provides the relevance level of matching documents based on the 
terms found. To boost a term use the caret, ^, symbol with a boost 
factor (a number) at the end of the term you are searching. The higher 
the boost factor, the more relevant the term will be.

   Boosting allows you to control the relevance of a document by
   boosting its term. For example, if you are searching for


jakarta apache


   and you want the term jakarta to be more relevant boost it using
   the ^ symbol along with the boost factor next to the term. You would
   type:


jakarta^4 apache


   This will make documents with the term jakarta appear more relevant.
   You can also boost Phrase Terms as in the example:


jakarta apache^4 jakarta lucene


   By default, the boost factor is 1. Although the boost factor must be
   positive, it can be less than 1 (e.g. 0.2)
Regards.
Nader Henein
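A rough way to see what the boost factor does: in a simplified model of scoring, each matched term's contribution is multiplied by its boost, so jakarta^4 weighs four times as heavily as an unboosted term. Real Lucene scoring also involves tf, idf and length normalization; this sketch only illustrates the ranking effect:

```java
// Simplified model of query-time boosting: each matched term contributes
// (boost * termScore) to the document score. This is not the full Lucene
// formula, just an illustration of why a boosted term dominates ranking.
public class BoostSketch {
    static float score(float[] termScores, float[] boosts) {
        float total = 0f;
        for (int i = 0; i < termScores.length; i++) {
            total += boosts[i] * termScores[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // Two terms with equal raw match quality.
        float[] raw = {0.5f, 0.5f};
        float unboosted = score(raw, new float[]{1f, 1f});  // jakarta apache
        float boosted   = score(raw, new float[]{4f, 1f});  // jakarta^4 apache
        System.out.println(boosted > unboosted); // true
    }
}
```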
Karthik N S wrote:
Hi Guys

Apologies...
This question may have been asked a million times on this forum; I need
some clarifications.
1) FieldType = Keyword, name = vendor
2) FieldType = Text, name = contents
Questions:
1) How do I construct a query that makes the hits available for the
VENDOR appear first?
2) If boosting is to be applied, how?
3) Is the query constructed below correct?
+Contents:shoes +((vendor:nike)^10)

Please advise.
Thx in advance.
WITH WARM REGARDS
HAVE A NICE DAY
[ N.S.KARTHIK]



Re: time of indexer

2004-12-28 Thread Nader Henein
Download Luke; it makes life easy when you inspect the index, so you can
actually look at what you've indexed, as opposed to what you may think
you indexed.

Nader
Daniel Cortes wrote:
Hi to everybody, and merry Christmas to all (especially those who, like
me, are working today instead of staying with the family).

I don't understand why searching my index gives such bad results:
I indexed 112 PHP files as plain text,
on this machine:
Pentium 4 2.4 GHz, 512 MB RAM, running Windows XP and Eclipse during indexing
Total search time (Tiempo de búsqueda total): 80882 ms
The fields that I use are:
doc.add(Field.Keyword("filename", file.getCanonicalPath()));
doc.add(Field.UnStored("body", bodyText));
doc.add(Field.Text("titulo", title));
What am I doing wrong?
thks


Re: index question

2004-12-27 Thread Nader Henein
It comes down to your searching needs: do you need your documents
searchable by these fields, or do you need a general search of the whole
document? Your decision will affect the size of the index and the speed
of indexing and searching, so give it due thought; start from your GUI
requirements and design the index that best responds to your users'
needs.

Nader
Daniel Cortes wrote:
I want to know, for the case where you use Lucene to index files as a
general searcher, which fields (or keys) you use to index.
For example, in my case the files are html, pdf, doc, ppt and txt, and
I'm thinking of using fields for author, title, url, content and
modification date.
Anything more? Any recommendations?
thks
and Merry Xmas to all.



Re: index question

2004-12-27 Thread Nader Henein
OK, so you can index the whole document in one shot, but you should
store certain fields in the index, like whatever you display in the
search results, to avoid a round trip to the DB.

So, for example, you would store title, synopsis, link, doc_id and date,
and then index only what you want to be searchable. The reason you would
have the title stored in one field and indexed again in another is that
if you stem the indexed copy, it becomes useless for display purposes. So
the logical representation of your index would look something like this:

<document>
   id: stored / indexed
   title: stored / un-indexed
   synopsis: stored / un-indexed
   date: stored / indexed
   full document: stemmed, indexed / un-stored
</document>
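The layout above can be modeled in plain Java to make the stored/indexed trade-off explicit. In the Lucene 1.x API these combinations roughly correspond to Field.Keyword (stored and indexed), Field.UnIndexed (stored only) and Field.UnStored (indexed only); the FieldSpec record here is only an illustration, not Lucene API:

```java
// Plain-Java model of the index layout above: each field declares whether
// it is stored (retrievable for display) and/or indexed (searchable).
public class IndexLayout {
    record FieldSpec(String name, boolean stored, boolean indexed) {}

    public static void main(String[] args) {
        FieldSpec[] layout = {
            new FieldSpec("id", true, true),
            new FieldSpec("title", true, false),
            new FieldSpec("synopsis", true, false),
            new FieldSpec("date", true, true),
            new FieldSpec("full_document", false, true), // stemmed copy, search only
        };
        for (FieldSpec f : layout) {
            System.out.println(f.name() + ": "
                + (f.stored() ? "stored" : "un-stored") + " / "
                + (f.indexed() ? "indexed" : "un-indexed"));
        }
    }
}
```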
Enjoy
Nader Henein
Daniel Cortes wrote:
thks nader
I need a general search of documents; that's why I ask for your
recommendations, because the fields are only for info shown in the
results. Typically you search on Google, for example:

search: casa
La casa roja
...había una vez una casa roja que tenía...
http://go.to/casa    Modification date: 25-12-04

To do this, which fields and options (Keyword, Text, UnIndexed, UnStored)
should I use?

thks
Nader Henein wrote:
It comes down to your searching needs: do you need your documents
searchable by these fields, or do you need a general search of the whole
document? Your decision will affect the size of the index and the speed
of indexing and searching, so give it due thought; start from your GUI
requirements and design the index that best responds to your users'
needs.

Nader
Daniel Cortes wrote:
I want to know, for the case where you use Lucene to index files as a
general searcher, which fields (or keys) you use to index.
For example, in my case the files are html, pdf, doc, ppt and txt, and
I'm thinking of using fields for author, title, url, content and
modification date.
Anything more? Any recommendations?
thks
and Merry Xmas to all.



Re: MergerIndex + Searchables

2004-12-21 Thread Nader Henein
As obvious as it may seem, you could always store, in the document
itself, the ID of the index into which you are indexing the document,
and have that fetched with the search results. Or is there something
stopping you from doing that?

Nader Henein
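Nader's suggestion, sketched without Lucene specifics: add a stored field naming the source index at indexing time, then read it back from each hit. The makeDoc() helper and the field names are hypothetical stand-ins for a Lucene Document with stored fields:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// At indexing time, tag each document with the name of the merger index it
// went into; at search time, read that stored field back from each hit.
public class SourceTaggedIndexing {
    static Map<String, String> makeDoc(String isbn, String sourceIndex) {
        Map<String, String> doc = new HashMap<>();
        doc.put("isbn", isbn);
        doc.put("source_index", sourceIndex); // stored, so it comes back with hits
        return doc;
    }

    public static void main(String[] args) {
        List<Map<String, String>> hits = new ArrayList<>();
        hits.add(makeDoc("ISBN12345", "MGR1"));
        hits.add(makeDoc("ISBN67890", "MGR3"));
        for (Map<String, String> hit : hits) {
            System.out.println(hit.get("isbn")
                + " is available from " + hit.get("source_index"));
        }
    }
}
```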
Karthik N S wrote:
Hi Guys
Apologies...
I have several MERGERINDEXES [MGR1, MGR2, MGR3].
For searching across these MERGERINDEXES I use the following code:
IndexSearcher[] indexToSearch = new IndexSearcher[CNTINDXDBOOK];
for (int all = 0; all < CNTINDXDBOOK; all++) {
   indexToSearch[all] = new IndexSearcher(INDEXEDBOOKS[all]);
   System.out.println(all + " ADDED TO SEARCHABLES " + INDEXEDBOOKS[all]);
}
MultiSearcher searcher = new MultiSearcher(indexToSearch);
Question:
When searching, how do I display which MRG a relevant document ID
originated from?
[Something like: the search word 'ISBN12345' is available from
MRGx]

 WITH WARM REGARDS
 HAVE A NICE DAY
 [ N.S.KARTHIK]



Re: LUCENE1.4.1 - LUCENE1.4.2 - LUCENE1.4.3 Exception

2004-12-15 Thread Nader Henein
This is an OS file-system error, not a Lucene issue (so not one for this
board). Google it for Gentoo specifically and you get a whole bunch of
results, one of which is this thread on the Gentoo forums:
http://forums.gentoo.org/viewtopic.php?t=9620

Good Luck
Nader Henein
Karthik N S wrote:
Hi Guys
Could somebody tell me what this exception I am getting means, please?
Sys Specifications
O/S: Linux Gentoo
Appserver: Apache Tomcat/4.1.24
JDK: build 1.4.2_03-b02
Lucene: 1.4.1, 1.4.2, 1.4.3
Note: This exception is displayed on every 2nd query after Tomcat is
started
java.io.IOException: Stale NFS file handle
   at java.io.RandomAccessFile.readBytes(Native Method)
   at java.io.RandomAccessFile.read(RandomAccessFile.java:307)
   at org.apache.lucene.store.FSInputStream.readInternal(FSDirectory.java:420)
   at org.apache.lucene.store.InputStream.readBytes(InputStream.java:61)
   at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:220)
   at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
   at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
   at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
   at org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:142)
   at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:115)
   at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:143)
   at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:137)
   at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:253)
   at org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:69)
   at org.apache.lucene.search.Similarity.idf(Similarity.java:255)
   at org.apache.lucene.search.TermQuery$TermWeight.sumOfSquaredWeights(TermQuery.java:47)
   at org.apache.lucene.search.Query.weight(Query.java:86)
   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:85)
   at org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:251)


 WITH WARM REGARDS
 HAVE A NICE DAY
 [ N.S.KARTHIK]



Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread Nader Henein
How big do you expect it to get, and how often do you expect to update
it? We've been using Lucene for about 1M records (19 fields each) with
incremental updates every 10 minutes. The performance during updates
wasn't wonderful, so it took some seriously intense code to sort that
out. As you mentioned, it comes down to what you need the thin DB for:
Lucene is a wonderful search engine, but if I were looking for a fast and
dirty relational DB, MySQL wins hands down. Put them both together and
you've really got something.

My 2 cents
Nader Henein
Kevin L. Cobb wrote:
I use Lucene as a legitimate search engine which is cool. But, I am also
using it as a simple database too. I build an index with a couple of
keyword fields that allows me to retrieve values based on exact matches
in those fields. This is all I need to do so it works just fine for my
needs. I also love the speed. The index is small enough that it is
wicked fast. I was wondering if anyone out there is doing the same, or if
there are any dissenting opinions on using Lucene for this purpose.




 



Re: HITCOLLECTOR+SCORE+DELIMMA

2004-12-13 Thread Nader Henein
Dude, and I say this with love, it's open source, you've got the code, 
take the initiative, DIY, be creative and share your findings with the 
rest of us.

Personally I would be interested to see how you do this, keep your 
changes documented and share.

Nader Henein
Karthik N S wrote:
Hi Erik
Apologies..
I got Confused with the last mail.
 

Iterating over Hits returns a large number of hits, and iterating over
Hits for scores consumes time, so how do I limit my search to scores
between [X.xf and Y.yf] before getting the Hits?
Note: the search is being done on a field of type Text, consisting of the
'contents' of various HTML documents.
Please advise me
Karthik

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Monday, December 13, 2004 5:05 PM
To: Lucene Users List
Subject: Re: HITCOLLECTOR+SCORE+DELIMA

On Dec 13, 2004, at 1:16 AM, Karthik N S wrote:

So you say I have to build a Filter to collect all the scores between
the two ranges [0.2f to 1.0f]?

My message is being misinterpreted.  I said filter as a verb, not a
noun.  :)  In other words, I was not intending to mean write a Filter;
a Filter would not be able to filter on score.

so the API for the same would be
Hits hit = search(Query query, Filter filterToGetScore)
but while writing the Filter, the score again depends on Hits:
score = hits.score(x);

Again, you cannot write a Filter (capital 'F') to deal with score.
Please re-read what I said below...

Hits are in descending score
order, so you may just want to use Hits and filter based on the score
provided by hits.score(i).

Iterate over Hits... when you encounter scores below your desired
range, stop iterating.  Why is this simple procedure not good enough
for what you are trying to achieve?
Erik
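Erik's advice can be sketched as follows: since Hits come back in descending score order, walk the results and stop at the first score below the lower bound. The float array below stands in for the values hits.score(i) would return:

```java
import java.util.ArrayList;
import java.util.List;

// Keep only hits whose score lies in [min, max]. Because scores arrive in
// descending order, the first score below min ends the scan early.
public class ScoreRangeFilter {
    static List<Integer> inRange(float[] scoresDescending, float min, float max) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < scoresDescending.length; i++) {
            float s = scoresDescending[i];
            if (s < min) break;       // everything after this is lower still
            if (s <= max) kept.add(i);
        }
        return kept;
    }

    public static void main(String[] args) {
        float[] scores = {0.9f, 0.7f, 0.4f, 0.1f};
        System.out.println(inRange(scores, 0.2f, 1.0f)); // [0, 1, 2]
    }
}
```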


Re: SEARCH CRITERIA

2004-11-30 Thread Nader Henein
They probably create a list of similar results by doing some sort of
data mining on the search criteria that people use in succession. Or
they keep a list of searches that are too general (a search for the
word kid is, at best, hopeless), and since you can't call your users
stupid, you try to guess what they're searching for based on other
searches conducted (kid rock, kid games, star wars kid, karate kid)
that contain the initial search string kid. You can use fuzzy search
in Lucene, but that won't really do this; the short answer is DIY,
depending on your needs.

My two galiuns
Nader Henein
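The DIY approach Nader describes might look like this: keep a log of popular past queries and suggest the ones that contain the user's overly general search string. The hard-coded query log is, of course, a stand-in for real mined data:

```java
import java.util.ArrayList;
import java.util.List;

// "Also try" suggestions: from a log of popular queries, return those that
// contain the user's query as a substring (excluding the query itself).
public class RelatedSearches {
    static List<String> suggest(String query, List<String> popularQueries) {
        List<String> out = new ArrayList<>();
        for (String q : popularQueries) {
            if (!q.equals(query) && q.contains(query)) {
                out.add(q);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> log = List.of("kid rock", "kid games",
            "star wars kid", "karate kid", "lucene");
        System.out.println(suggest("kid", log));
        // [kid rock, kid games, star wars kid, karate kid]
    }
}
```

A production version would rank suggestions by popularity rather than returning them in log order, but the substring match is the core of the trick.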
Karthik N S wrote:
Hi Guys
Apologies.
On Yahoo and AltaVista, searching for a word like 'kid' returns
similar searches, as below:
  Also try: kid rock, kid games, star wars kid, karate kid   More...

How do I obtain similar search criteria using Lucene?
Thx in advance
Warm regards
Karthik


Re: disadvantages

2004-11-21 Thread Nader Henein
You may singe your fingers if you touch the keyboard during indexing
Nader
Miguel Angel wrote:
What are the disadvantages of Lucene?
 



Re: Optimized??

2004-11-20 Thread Nader Henein
The down and dirty answer is that it's like defragmenting your hard
drive: you're basically compacting the index and sorting out index
references. What you need to know is that it makes searching much
faster after you've updated the index.

Nader Henein
Miguel Angel wrote:
What does it mean for a Lucene index to be optimized?
 



Re: Backup strategies

2004-11-16 Thread Nader Henein
We've recently implemented something similar, with the backup process
creating a file (much like the lock files used during indexing) that the
IndexWriter recognizes (a small tweak) so that it doesn't attempt to
start an indexing run or a delete while the file is there; it wasn't
that much work, actually.

Nader
Doug Cutting wrote:
Christoph Kiehl wrote:
I'm curious about your strategy to backup indexes based on 
FSDirectory. If I do a file based copy I suspect I will get corrupted 
data because of concurrent write access.
My current favorite is to create an empty index and use 
IndexWriter.addIndexes() to copy the current index state. But I'm not 
sure about the performance of this solution.

How do you make your backups?

A safe way to backup is to have your indexing process, when it knows 
the index is stable (e.g., just after calling IndexWriter.close()), 
make a checkpoint copy of the index by running a shell command like 
cp -lpr index index.YYYMMDDHHmmSS.  This is very fast and requires 
little disk space, since it creates only a new directory of hard 
links.  Then you can separately back this up and subsequently remove it.

This is also a useful way to replicate indexes.  On the master 
indexing server periodically perform cp -lpr as above.  Then search 
slaves can use rsync to pull down the latest version of the index.  If 
a very small mergefactor is used (e.g., 2) then the index will have 
only a few segments, so that searches are fast.  On the slave, 
periodically find the latest index.YYYMMDDHHmmSS, use cp -lpr index/ 
index.YYYMMDDHHmmSS and 'rsync --delete master:index.YYYMMDDHHmmSS 
index.YYYMMDDHHmmSS' to efficiently get a local copy, and finally ln 
-fsn index.YYYMMDDHHmmSS index to publish the new version of the index.

Doug
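Doug's cp -lpr checkpoint can also be done from Java with hard links, which may be handy if the indexing process wants to snapshot right after IndexWriter.close() without shelling out. This is a sketch of the same idea, not part of Lucene:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Java equivalent of "cp -lpr index index.TS": create a snapshot directory
// whose entries are hard links to the current index files. Fast and cheap
// on disk, since no file contents are copied.
public class HardLinkCheckpoint {
    static void checkpoint(Path index, Path snapshot) throws IOException {
        Files.createDirectories(snapshot);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(index)) {
            for (Path f : files) {
                if (Files.isRegularFile(f)) {
                    Files.createLink(snapshot.resolve(f.getFileName()), f);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path index = Files.createTempDirectory("index");
        Files.write(index.resolve("segments"), "data".getBytes());
        Path snap = index.resolveSibling(index.getFileName() + ".20041116");
        checkpoint(index, snap);
        System.out.println(Files.readAllLines(snap.resolve("segments")).get(0)); // data
    }
}
```

As with cp -l, this only works when index and snapshot live on the same filesystem, and the snapshot is consistent only if taken while no writer is modifying the index.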


Re: _4c.fnm missing

2004-11-16 Thread Nader Henein
What kind of incremental updates are you doing? We update our index
every 15 minutes with 100-200 documents, writing to a 6 GB memory-resident
index, and the IndexWriter runs one instance at a time; it takes a bit of
doing to overwhelm Lucene.
What's your update schedule, how big is the index, and after how many
updates does the system crash?
Nader Henein

Luke Shannon wrote:
It consistently breaks when I run more than 10 concurrent incremental
updates.
I can post the code on Bugzilla (hopefully when I get to the site it will
be obvious how I can post things).
Luke
- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, November 16, 2004 3:20 PM
Subject: Re: _4c.fnm missing

 

Field names are stored in the field info file, with suffix .fnm. - see
http://jakarta.apache.org/lucene/docs/fileformats.html
The .fnm should be inside the .cfs file (cfs files are compound files
that contain all index files described at the above URL).  Maybe you
can provide the code that causes this error in Bugzilla for somebody to
look at.  Does it consistently break?
Otis
--- Luke Shannon [EMAIL PROTECTED] wrote:
   

I received the error below when I was attempting to overwhelm my
system with incremental update requests.
What is this file it is looking for? I checked the index. It
contains:
_4c.del
_4d.cfs
deletable
segments
Where does _4c.fnm come from?
Here is the error:
Unable to create the create the writer and/or index new content
/usr/tomcat/fb_hub/WEB-INF/index/_4c.fnm (No such file or directory).
Thanks,
Luke
 



Re: _4c.fnm missing

2004-11-16 Thread Nader Henein
That's it: you need to batch your updates. It comes down to whether you
need to give your users search accuracy to the second. Take your
database, put an is_dirty column on the master table of the object you're
indexing, run a scheduled task every x minutes, have your process read
the objects that are flagged dirty, and then reset the flag once they've
been indexed correctly.
my two cents
Nader

Otis Gospodnetic wrote:
'Concurrent' and 'updates' in the same sentence sounds like a possible
source of the problem.  You have to use a single IndexWriter and it
should not overlap with an IndexReader that is doing deletes.
Otis
--- Luke Shannon [EMAIL PROTECTED] wrote:
 

It consistently breaks when I run more than 10 concurrent incremental
updates.
I can post the code on Bugzilla (hopefully when I get to the site it
will be
obvious how I can post things).
Luke
- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, November 16, 2004 3:20 PM
Subject: Re: _4c.fnm missing

   

Field names are stored in the field info file, with suffix .fnm - see
http://jakarta.apache.org/lucene/docs/fileformats.html
The .fnm should be inside the .cfs file (cfs files are compound files
that contain all index files described at the above URL).  Maybe you
can provide the code that causes this error in Bugzilla for somebody to
look at.  Does it consistently break?
Otis
--- Luke Shannon [EMAIL PROTECTED] wrote:
I received the error below when I was attempting to overwhelm my
system with incremental update requests.
What is this file it is looking for? I checked the index. It
contains:
_4c.del
_4d.cfs
deletable
segments
Where does _4c.fnm come from?
Here is the error:
Unable to create the create the writer and/or index new content
/usr/tomcat/fb_hub/WEB-INF/index/_4c.fnm (No such file or directory).
Thanks,
Luke


Re: Need help with filtering

2004-11-16 Thread Nader Henein
Well, if the document ID is a number (even if it isn't stored as one),
you could use a range query, or just rebuild your index using that
specific field as a sorted field; if it is numeric, be aware that using
an integer limits how high your numbers can get.

nader
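If the IDs are numeric, one detail worth noting for Nader's range-query suggestion: a Keyword field compares strings lexicographically, so zero-padding the IDs at indexing time keeps string order in step with numeric order. The width of 12 below is an arbitrary choice:

```java
// Zero-pad numeric IDs so that string (lexicographic) comparison, which is
// what a range query over a Keyword field effectively does, agrees with
// numeric order. Without padding, "99" sorts after "100".
public class PaddedIds {
    static String pad(long id) {
        return String.format("%012d", id); // width must exceed the largest expected ID
    }

    public static void main(String[] args) {
        System.out.println(pad(99));                         // 000000000099
        System.out.println(pad(99).compareTo(pad(100)) < 0); // true
        System.out.println("99".compareTo("100") < 0);       // false
    }
}
```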
Edwin Tang wrote:
Hello,
I have been using DateFilter to limit my search results to a certain date
range. I am now asked to replace this filter with one where my search results
have document IDs greater than a given document ID. This document ID is
assigned during indexing and is a Keyword field.
I've browsed around the FAQs and archives and see that I can either use
QueryFilter or BooleanQuery. I've tried both approaches to limit the document
ID range, but am getting the BooleanQuery.TooManyClauses exception in both
cases. I've also tried bumping max number of clauses via setMaxClauseCount(),
but that number has gotten pretty big.
Is there another approach to this? Or am I setting this up incorrectly? Snippet
of one of my approaches follows:
queryFilter = new QueryFilter(new RangeQuery(new Term(id, sLastSearchedId),
null, false));
docs = searcher.search(parser.parse(sSearchPhrase), queryFilter,
utility.iMaxResults, new Sort(sortFields));
Thanks in advance,
Ed
		




Re: How to efficiently get # of search results, per attribute

2004-11-13 Thread Nader Henein
It depends on how many results they're looking through, here are two 
scenarios I see:

1] If you don't have that many records you can fetch all the results and 
then do a post parsing step the determine totals

2] If you have a lot of entries in each category and you're worried
about fetching thousands of records every time, you can keep separate
indices per category and search them in parallel (not Lucene's parallel
search) and fetch up to 100 hits from each one (for efficiency), while
still getting each search's total to display.

Either way you can boost speed by using a RAMDirectory if you need more
speed from the search, but whichever approach you choose I would
recommend that you sit down and do some number crunching to figure out
which way to go.

Hope this helps
Nader Henein
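Scenario 1 above, post-parsing the fetched results, amounts to a single counting pass over a stored type field; a stdlib-only sketch, where the field values are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Fetch all hits once and tally them by a stored "type" field
// (class / professor / department) in a single pass, instead of
// running one search per entity type.
public class ResultTypeCounter {
    static Map<String, Integer> countByType(String[] hitTypes) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : hitTypes) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] types = {"class", "class", "professor", "department", "class"};
        Map<String, Integer> counts = countByType(types);
        System.out.println(counts.get("class"));      // 3
        System.out.println(counts.get("professor"));  // 1
    }
}
```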

Chris Lamprecht wrote:
I'd like to implement a search across several types of entities,
let's say, classes, professors, and departments.  I want the user to
be able to enter a simple, single query and not have to specify what
they're looking for.  Then I want the search results to be something
like this:
Search results for: philosophy boyer
Found: 121 classes - 5 professors - 2 departments
search results here...
I know I could iterate through every hit returned and count them up
myself, but that seems inefficient if there are lots of results.  Is
there some other way to get this kind of information from the search
result set?  My other ideas are: doing a separate search each result
type, or storing different types in different indexes.  Any
suggestions?  Thanks for your help!
-Chris


Re: UPDATION+MERGERINDEX

2004-11-07 Thread Nader Henein
Well, if you do all the steps in one run, I guess optimizing once at the
end would be faster overall, but all you have to do is test both and time
them. Performance-wise, I don't think that step 3 (OPTIMISE) in
scenario (a) will really improve the performance of the new index merge.
my 2 cents
Nader Henein
Karthik N S wrote:
Hi Guys
Apologies.
a) 

1) SEARCH FOR SUBINDEX IN A  OPTIMISED MERGED INDEX
2) DELETE THE FOUND SUBINDEX FROM THE OPTIMISED MERGERINDEX
3) OPTIMISE THE MERGERINDEX
4) ADD A NEW VERSION OF THE SUBINDEX TO THE MERGER INDEX
5) OPTIMISE THE MERGERINDEX

b)
1) SEARCH FOR SUBINDEX IN A  OPTIMISED MERGED INDEX
2) DELETE THE FOUND SUBINDEX FROM THE OPTIMISED MERGERINDEX
3) ADD A NEW VERSION OF THE SUBINDEX TO THE MERGER INDEX
4) OPTIMISE THE MERGERINDEX
a  OR  b  WHICH IS BETTER CHOICE 


THX IN ADVANCE
   
 WITH WARM REGARDS 
 HAVE A NICE DAY 
 [ N.S.KARTHIK] 





Re: Atomicity in Lucene operations

2004-10-19 Thread Nader Henein
As soon as I've cleaned up the code I'll publish it; it needs a little
more documentation as well.

Nader
Roy Shan wrote:
Maybe you can contribute it to sandbox?
On Mon, 18 Oct 2004 08:31:30 -0700 (PDT), Yonik Seeley
[EMAIL PROTECTED] wrote:

Hi Nader,
I would greatly appreciate it if you could CC me on
the docs or the code.
Thanks!
Yonik

--- Nader Henein [EMAIL PROTECTED] wrote:
It's pretty integrated into our system at this point. I'm working on
packaging it and cleaning up my documentation, and then I'll make it
available. I can give you the documents, and if you still want the code
I'll slap together a rough copy for you and ship it across.
Nader Henein

Roy Shan wrote:
Hello, Nader:
I am very interested in how you implement the atomicity. Could you
send me a copy of your code?
Thanks in advance.
Roy


Re: Atomicity in Lucene operations

2004-10-17 Thread Nader Henein
It's pretty integrated into our system at this point. I'm working on
packaging it and cleaning up my documentation, and then I'll make it
available. I can give you the documents, and if you still want the code
I'll slap together a rough copy for you and ship it across.
Nader Henein
Roy Shan wrote:
Hello, Nader:
I am very interested in how you implement the atomicity. Could you
send me a copy of your code?
Thanks in advance.
Roy

On Sat, 16 Oct 2004 01:20:09 +0400, Nader Henein [EMAIL PROTECTED] wrote:

We use Lucene over 4 replicated indices and we have to maintain
atomicity on deletion and updates with multiple fallback points. I'll
send you the write-up; it's too big to CC the entire board.
nader henein

Christian Rodriguez wrote:

Hello guys,
I need additions and deletions of documents to the index to be ATOMIC
(they either happen to completion or not at all).
On top of this, I need updates (which I currently implement with a
deletion of the document followed by an addition) to be ATOMIC and
DURABLE (once I return from the update function, it's because the
operation happened to completion and stays in the index).
Notice that I don't really need all the ACID properties for all the operations.
I have tried to solve the problem by using the Lucene + BDB package
written by Andi Vajda and using transactions, but the BDB database
gets corrupted if I insert random System.exit() calls to simulate a
crash of the application before aborting or committing transactions.
So I have two questions:
1. Has anyone been able to use Lucene + BDB WITH transactions and
simulated random crashes at different points in the process of adding
items and found it to be robust (specifically, have you been able to
always recover after a crash, with uncommitted txns rolled back and
committed ones present in the DB)?
2. Can anyone suggest other solutions (besides using BDB) that may
work? For example: are any of these operations already atomic in
Lucene (using an FSDirectory)?
Thanks for any help you can give me!
Xtian


Re: simultaneous search and indexing

2004-10-17 Thread Nader Henein
You can do both at the same time; it's thread-safe. You will face
different issues depending on the frequency of your indexing and the
load on the search, but that shouldn't come into play until your index
gets nice and heavy. So basically, code on.

Nader Henein
Miro Max wrote:
hi,
I'm using a servlet to search my index and I wish to be able to update
the index at the same time.
Do I have to use threads? I'm a beginner.
thx





Re: Atomicity in Lucene operations

2004-10-15 Thread Nader Henein
We use Lucene over 4 replicated indices and we have to maintain
atomicity on deletion and updates with multiple fallback points. I'll
send you the write-up; it's too big to CC the entire board.

nader henein
Christian Rodriguez wrote:
Hello guys,
I need additions and deletions of documents to the index to be ATOMIC
(they either happen to completion or not at all).
On top of this, I need updates (which I currently implement with a
deletion of the document followed by an addition) to be ATOMIC and
DURABLE (once I return from the update function, it's because the
operation happened to completion and stays in the index).
Notice that I don't really need all the ACID properties for all the operations.
I have tried to solve the problem by using the Lucene + BDB package
written by Andi Vajda and using transactions, but the BDB database
gets corrupted if I insert random System.exit() to simulate a crash of
the application before aborting or committing transactions.
So I have two questions:
1. Has anyone been able to use the Lucene + BDB WITH transactions and
simulate random crashes at different points in the process of adding
items and found it to be robust (especially, have you been able to
always recover after a crash, with uncommitted txns rolled back and
committed ones present in the DB)?
2. Can anyone suggest other solutions (beside using BDB) that may
work? For example: are any of these operations already atomic in
Lucene (using an FSDirectory)?
Thanks for any help you can give me!
Xtian
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
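The durability half of the question has a well-known building block that works on a plain `FSDirectory`-style filesystem: write the new state to a temporary file, then atomically rename it over the target, so a crash leaves either the old version or the new one, never a torn file. A minimal sketch (hypothetical helper, not part of Lucene or BDB):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Atomic file replacement: the visible file is always a complete
// version, because the rename either happens entirely or not at all.
class AtomicWrite {
    public static void replace(Path target, String content) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
        // ATOMIC_MOVE makes the swap all-or-nothing on POSIX filesystems.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE,
                   StandardCopyOption.REPLACE_EXISTING);
    }
}
```

This is the same idea behind keeping a flag on rows and re-pointing to a freshly built index only once it is complete; a random `System.exit()` between the write and the rename simply leaves the old version in place.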


Re: Encrypted indexes

2004-10-13 Thread Nader Henein
Well, are you storing any data for retrieval from the index? Because 
you could encrypt the actual data and then encrypt the search string, 
public-key style.

Nader Henein
Weir, Michael wrote:
We need to have index files that can't be reverse engineered, etc. An
obvious approach would be to write a 'FSEncryptedDirectory' class, but
sounds like a performance killer.
Does anyone have experience in making an index secure?
Thanks for any help,
Michael Weir 
 
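One way to read Nader's suggestion loosely (this scheme is an assumption, not from the thread): replace each term with a keyed HMAC token before indexing. The index then never stores plaintext, yet the same term always maps to the same token, so it stays searchable. Note that deterministic tokens leak term equality, which is weaker than true encryption:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical searchable-token scheme: terms are indexed and queried
// as HMAC-SHA256 digests under a secret key, never as plaintext.
class TermTokenizer {
    private final SecretKeySpec key;

    public TermTokenizer(byte[] secret) {
        this.key = new SecretKeySpec(secret, "HmacSHA256");
    }

    public String token(String term) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(key);
            byte[] digest = mac.doFinal(
                term.toLowerCase().getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Queries go through the same `token()` call, so exact-match search works without ever exposing the terms; stored fields for display would be encrypted separately.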


Re: sorting and score ordering

2004-10-12 Thread Nader Henein
As far as my testing showed, the sort will take priority, because it's 
an opt-in sort as opposed to the default score sort. So you're 
basically displaying a sorted set over all your results, as opposed to 
sorting the most relevant results.

Hope this helps
Nader Henein
Chris Fraschetti wrote:
If I use a Sort instance on my searcher, what will have priority?
Score or Sort? Assuming I have pages with .9, .9, and .5 scores:
if the .5 page has a higher 'sort' value, will it return higher than one of
the .9 pages, even though its Lucene score is lower?
 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
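The behavior Nader describes can be sketched with an ordinary comparator (the `Hit` class here is hypothetical, not a Lucene API): the sort field decides the order, and score only breaks ties, so a 0.5-score hit with a higher sort value outranks a 0.9-score hit:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of what an opt-in sort does to results: sort field first
// (descending), score (descending) only as a tie-breaker.
class SortVsScore {
    static class Hit {
        final String id; final int sortValue; final double score;
        Hit(String id, int sortValue, double score) {
            this.id = id; this.sortValue = sortValue; this.score = score;
        }
    }

    static List<Hit> order(List<Hit> hits) {
        List<Hit> out = new ArrayList<>(hits);
        out.sort(Comparator.comparingInt((Hit h) -> h.sortValue).reversed()
            .thenComparing(Comparator.comparingDouble((Hit h) -> h.score).reversed()));
        return out;
    }
}
```

With hits (sortValue=1, score=0.9), (sortValue=2, score=0.5), (sortValue=1, score=0.9), the 0.5-score hit comes first, matching the "sorted set over all your results" description.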


Re: Arabic analyzer

2004-10-07 Thread Nader Henein
There is a way of writing an Arabic stemmer; it's just not a weekend 
project. I've seen the translate/stem option as well, and even tried it 
with Lucene. We've implemented Lucene on our database, and we have about 
a million records in our DB with 19 indexed fields (some of which are 
CLOBs) in each record. The free-text fields in each record are in many 
cases Arabic, and we do not provide stemming on those simply because I 
couldn't find a valid stemming or translation option that held up to 
proper testing. Some were OK, but after collecting data from user 
searches (averaging out at 5 searches per second), the Arabic stemming 
options were not able to manage user expectations, which is what it 
comes down to: sometimes theory does not translate well to practice.

Nader Henein
Dawid Weiss wrote:

nothing to do with each other. Furthermore, Arabic uses phonetic 
indicators on each letter, called diacritics, that change the way you 
pronounce the word, which in turn changes the word's meaning, so two 
words spelled exactly the same way with different diacritics will mean 
two separate things. 

Just to point out the fact: most slavic languages also use diacritic 
marks (above, like 'acute', or 'dot' marks, or below, like the Polish 
'ogonek' mark). Some people argue that they can be stripped off the 
text upon indexing and that the queries usually disambiguate the 
context of the word.

It is just a digression. Now back to the Arabic stemmer -- there has 
to be a way of doing it. I know Vivisimo has clustering options for 
Arabic. They must be using a stemmer (and an English translation 
dictionary), although it might be a commercial one. Take a look:

http://vivisimo.com/search?v:file=cnnarabic
D.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
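The strip-diacritics-at-indexing idea Dawid mentions can be sketched with Unicode normalization: decompose to NFD, then drop combining marks. This handles Latin-script marks like the Polish acute or dot; Arabic harakat are also combining marks, but as the thread warns, stripping them conflates words whose meanings differ only by diacritics:

```java
import java.text.Normalizer;

// Diacritic stripping via Unicode NFD decomposition: base letters stay,
// combining marks (accents, harakat, ...) are removed.
class DiacriticStripper {
    static String strip(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // \p{M} matches Unicode combining marks.
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

One caveat for Polish specifically: letters like the stroked l (ł) are not base-plus-mark compositions and survive NFD unchanged, so a production analyzer needs an extra mapping table beyond this sketch.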


Re: Arabic analyzer

2004-10-07 Thread Nader Henein
I'd be happy to help anyone test this out, my Arabic is pretty good.
Nader
Andrzej Bialecki wrote:
Dawid Weiss wrote:

nothing to do with each other. Furthermore, Arabic uses phonetic 
indicators on each letter, called diacritics, that change the way you 
pronounce the word, which in turn changes the word's meaning, so two 
words spelled exactly the same way with different diacritics will 
mean two separate things. 

Just to point out the fact: most slavic languages also use diacritic 
marks (above, like 'acute', or 'dot' marks, or below, like the Polish 
'ogonek' mark). Some people argue that they can be stripped off the 
text upon indexing and that the queries usually disambiguate the 
context of the word.

Hmm. This brings up a question: the algorithmic stemmer package from 
Egothor works quite well for Polish (http://www.getopt.org/stempel), 
wouldn't it work well for Arabic, too?

I lack the necessary expertise to evaluate the results (knowing only two 
or three Arabic words ;-) ), but I can certainly help someone get 
started with testing...

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Moving from a single server to a cluster

2004-09-08 Thread Nader Henein
Hey Ben,
We've been using a distributed environment with three servers and three 
separate indices for the past 2 years, since the first stable Lucene 
release, and it has been great. For the past two months I've been 
working on a redesign for our Lucene app, and I've shared my findings 
and plans with Otis, Doug and Erik. They pointed out a few faults in my 
logic which you will probably come across soon enough, mainly to do 
with keeping your updates atomic (not too hard) and your deletes atomic 
(a little more tricky). Give me a few days and I'll send you both the 
early document and the newer version that deals squarely with Lucene in 
a distributed environment with a high-volume index.

Regards.
Nader Henein
Ben Sinclair wrote:
My application currently uses Lucene with an index living on the
filesystem, and it works fine. I'm moving to a clustered environment
soon and need to figure out how to keep my indexes together. Since the
index is on the filesystem, each machine in the cluster will end up
with a different index.
I looked into JDBC Directory, but it's not tested under Oracle and
doesn't seem like a very mature project.
What are other people doing to solve this problem?
 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Moving from a single server to a cluster

2004-09-08 Thread Nader Henein
It would be a pleasure; I just didn't want to mislead anyone down the wrong path.
Give me a few days and I'll have the new version up.
Nader
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Devnagari Search?

2004-06-10 Thread Nader Henein
Have faith in the Unicode standard; it's well thought out. If you have 
any internationalization queries, there was an excellent article on 
JavaWorld entitled "End-to-end internationalization"; here's the link: 
http://www.javaworld.com/javaworld/jw-05-2004/jw-0524-i18n_p.html  Have 
a read; it helps clear up some myths.

Nader Henein
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: read only file system

2004-04-30 Thread Nader Henein
I hate to speak after Otis, but the way we deal with this is by clearing
locks on server restart, in case a server crash occurs mid-indexing; we
also optimize on server restart. It doesn't happen often (God bless Resin),
but when it has, we have faced no problems from Lucene.

Just for the record, we have a validate function that the LuceneInit calls;
it looks something like this:

try {
    Directory directory = FSDirectory.getDirectory(indexPath, false);
    if (directory.list().length == 0) clear();
    Lock writeLock = directory.makeLock(writeFileName);
    if (!writeLock.obtain()) {
        IndexReader.unlock(directory);
    } else {
        writeLock.release();
    }
} catch (IOException e) {
    logger.error("Index Validate", e);
}


Nader 

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Friday, April 30, 2004 4:09 PM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: Re: read only file system

If you have a very recent Lucene, then you can disable locks with command
line parameters.  I believe a page describing various command line
parameters is on Lucene's Wiki.

Otis

--- Supun Edirisinghe [EMAIL PROTECTED] wrote:
 I think I'm a little confused on how an index is put into use on a 
 readonly file system
 
 I'm using Lucene in my web application. Our indexes are built off our 
 database nightly and copied into our web app servers.
 
 I think our web app dies from time to time and sometimes a lock is 
 left behind from Lucene in /tmp/.
 
 I have read that there is a disableLuceneLocks System Property (is that 
 the full name, or is it something like 
 org.apache.jakarta...disableLuceneLocks?). But I'm still not sure how 
 I can set that. Do I give it as a command-line arg to the Java VM?
 
 thanks
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
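For reference, the command-line flag Otis mentions is a JVM system property set with `-D`. I believe the 1.x-era FSDirectory read a boolean `disableLuceneLocks` property; assuming that name, the invocation would look like this (the jar name is hypothetical, and this is only safe on a truly read-only index):

```shell
# Disable Lucene's lock files for a read-only filesystem deployment.
# "mywebapp.jar" is a placeholder for your actual application.
java -DdisableLuceneLocks=true -jar mywebapp.jar
```

In an app server, the same property goes into the server's JVM arguments rather than a direct `java` invocation.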



RE: Disappearing segments

2004-04-30 Thread Nader Henein
Could you share your indexing code? And just to make sure: is there
anything running on your machine that could delete these files, like a
cron job that backs up the index?

You could go by process of elimination: shut down your server and see if
the files disappear, because if the problem is contained within the server
you know that you can safely go on the DEBUG rampage.

Nader 

-Original Message-
From: Kelvin Tan [mailto:[EMAIL PROTECTED] 
Sent: Friday, April 30, 2004 9:15 AM
To: Lucene Users List
Subject: Re: Disappearing segments

An update:

Daniel Naber suggested using IndexWriter.setUseCompoundFile() to see if it
happens with the compound index format. Before I had a chance to try it out,
this happened: 

java.io.FileNotFoundException: C:\index\segments (The system cannot find the file specified)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:200)
        at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:321)
        at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:329)
        at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:71)
        at org.apache.lucene.index.IndexWriter$1.doBody(IndexWriter.java:154)
        at org.apache.lucene.store.Lock$With.run(Lock.java:116)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:149)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:131)

so even the segments file somehow got deleted. Hoping someone can shed some
light on this...

Kelvin

On Thu, 29 Apr 2004 11:45:36 +0800, Kelvin Tan said:
 Errr, sorry for the cross-post to lucene-dev as well, but I realized 
 this mail really belongs on lucene-user...
 
 I've been experiencing intermittent disappearing segments which result 
 in the following stacktrace:
 
 Caused by: java.io.FileNotFoundException: C:\index\_1ae.fnm (The system cannot find the file specified)
         at java.io.RandomAccessFile.open(Native Method)
         at java.io.RandomAccessFile.<init>(RandomAccessFile.java:200)
         at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:321)
         at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:329)
         at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
         at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:78)
         at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:104)
         at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:95)
         at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:112)
         at org.apache.lucene.store.Lock$With.run(Lock.java:116)
         at org.apache.lucene.index.IndexReader.open(IndexReader.java:103)
         at org.apache.lucene.index.IndexReader.open(IndexReader.java:91)
         at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:75)
 
 The segment that disappears (_1ae.fnm) varies.
 
 I can't seem to reproduce this error consistently, so I don't have a 
 clue what might cause it, but it usually happens after the application 
 has been running for some time. Has anyone experienced something 
 similar, or can anyone point me in the right direction?
 
 When this occurs, I need to rebuild the entire index for it to be 
 usable. Very troubling indeed...
 
 Kelvin
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Multi-Threading

2003-08-19 Thread Nader Henein
Why do you have concurrency problems? Are you trying to have each user
initiate the indexing himself? Because that will create issues. How
about you put all the new files you want to index in a directory, and
then have a scheduled procedure on the webserver run the Lucene
indexer on that directory? Our application hasn't had any concurrency
problems at all, because we index based on a pull system, rather than
the user pushing documents to the indexer.

I hope I understood your problem correctly, so that the answer is
useful.

Nader
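The pull system described above can be sketched in plain Java (a hypothetical stand-in; `indexed` stands for the real index): users only enqueue work, and a single scheduled task drains the queue, so exactly one thread ever touches the index writer:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Pull-based indexing: user requests go into a queue; one scheduled
// task drains it, so indexing is never initiated concurrently.
class PullIndexer {
    private final Queue<String> pending = new ConcurrentLinkedQueue<>();
    private final List<String> indexed =
        Collections.synchronizedList(new ArrayList<>());

    public void submit(String doc) { pending.add(doc); }

    // One drain pass; in production this would be registered via
    // ScheduledExecutorService.scheduleAtFixedRate(this::drain, ...).
    public void drain() {
        String doc;
        while ((doc = pending.poll()) != null) indexed.add(doc);
    }

    public int indexedCount() { return indexed.size(); }
}
```

Because submissions never block on the writer, user-facing requests stay fast, and the indexing frequency is just the scheduler interval.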

On Tue, 19 Aug 2003 12:55:09 +0200, Damien Lust wrote:

 
 Hello,

 I developed a client-server application on the web, with a search
 module using Lucene. In the same application, the users can index new
 text.

 So, multiple sessions can access the index, and concurrency problems
 are possible.

 I used threads in Java. Is that the best solution?

 I call:

 IndexFiles indexFiles = new IndexFiles();
 indexFiles.run();

 Here is an extract of my code.

 Thanks.
 
 public class IndexFiles extends Thread {
     public IndexFiles() {
     }

     public void run() {
         SynchronizedIndexWriter.insertDocument(currentIndexDocument(),
             "tmp/IndexPath", new MainAnalyser());
     }
 }


 public class SynchronizedIndexWriter {

     static synchronized void insertDocument(IndexDocument document,
             String indexLocValue, Analyzer analyzerValue) {
         File f = new File(indexLocValue);
         if (f.exists())
             addDocumentToIndex(document, indexLocValue, analyzerValue, false);
         else
             addDocumentToIndex(document, indexLocValue, analyzerValue, true);
     }

     static synchronized void addDocumentToIndex(IndexDocument document,
             String indexLocValue, Analyzer analyzerValue,
             boolean createNewIndex) {
         try {
             IndexWriter indexWriter =
                 new IndexWriter(indexLocValue, analyzerValue, createNewIndex);
             indexWriter.addDocument(document.getDocument());
             indexWriter.optimize();
             indexWriter.close();
         } catch (IOException io) {
             // If the IndexWriter can't write to the index because it's
             // locked, re-call the function => it's not very safe.
             addDocumentToIndex(document, indexLocValue, analyzerValue,
                 createNewIndex);
         } catch (Exception e) {
         }
     }
 }
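The recursive retry in the catch block above never gives up, which can loop forever on a stuck lock. A safer sketch (a hypothetical helper, not from the original post) bounds the number of attempts and backs off between them:

```java
// Bounded retry with linear backoff: attempts the operation up to
// maxTries times, sleeping longer after each failure, and reports
// whether it ever succeeded instead of recursing indefinitely.
class BoundedRetry {
    interface Op { void run() throws Exception; }

    static boolean attempt(Op op, int maxTries, long backoffMillis) {
        for (int i = 0; i < maxTries; i++) {
            try {
                op.run();
                return true;
            } catch (Exception e) {
                try {
                    Thread.sleep(backoffMillis * (i + 1));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

The caller can then log and surface a failure after `maxTries` instead of silently spinning on a locked index.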
