Re: Scalability of Lucene indexes

2005-02-19 Thread Praveen Peddi
We are doing the exact same thing, though we haven't tested with that many documents. 
The most we have tested so far is 3 million documents with a 3GB index size.
I would be interested in seeing how you keep the replicated indices 
in sync. The way we did it was to run the indexer on each server independently. When 
the data changes, one server learns of the change. That server updates its 
Lucene index and notifies the other servers (using multicast).
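The notify-on-change step described above could be sketched as below; the multicast group address, port, and message format are illustrative assumptions, not the actual values used.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.nio.charset.StandardCharsets;

class IndexChangeNotifier {
    // Hypothetical multicast group/port; the "<op>|<docId>" wire format
    // is likewise an assumption for illustration.
    static final String GROUP = "230.0.0.1";
    static final int PORT = 4446;

    // Encode a change notification, e.g. "UPDATE|doc-42".
    static String encode(String op, String docId) {
        return op + "|" + docId;
    }

    // Split a received message back into { op, docId }.
    static String[] decode(String message) {
        return message.split("\\|", 2);
    }

    // Broadcast a change so peer servers can update their local index copies.
    static void notifyPeers(String op, String docId) throws IOException {
        byte[] payload = encode(op, docId).getBytes(StandardCharsets.UTF_8);
        try (MulticastSocket socket = new MulticastSocket()) {
            InetAddress group = InetAddress.getByName(GROUP);
            socket.send(new DatagramPacket(payload, payload.length, group, PORT));
        }
    }
}
```

Each peer would join the same group, decode incoming messages, and apply the corresponding add/delete to its local index copy.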

Glad to know someone else is doing a similar thing, and even happier to know 
that the solution works for 100 million documents. I was a little 
worried about the index size growing higher and higher, but it looks like we 
shouldn't have to worry anymore :)

Thanks
Praveen
- Original Message - 
From: Bryan McCormick [EMAIL PROTECTED]
To: Chris D [EMAIL PROTECTED]
Cc: lucene-user@jakarta.apache.org
Sent: Friday, February 18, 2005 3:45 PM
Subject: Re: Scalability of Lucene indexes


Hi Chris,
I'm responsible for the webshots.com search index and we've had very
good results with lucene. It currently indexes over 100 Million
documents and performs 4 Million searches / day.
We initially tested running multiple small copies with a
MultiSearcher and merging the results, as compared to running a very
large single index. We actually found that the single large instance
performed better. To improve load handling we clustered multiple
identical copies together, then session-bind a user to a particular server
and cache the results, but each server is running a single index.
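The session-binding idea (pin each user to one of the identical copies so cached results stay on the same machine) can be sketched minimally; the names here are illustrative, not from the actual webshots.com setup.

```java
class SessionBinder {
    // Pin a session to one of N identical index servers so its cached
    // results stay warm on a single machine. floorMod keeps the result
    // non-negative even when hashCode() is negative.
    static int serverFor(String sessionId, int numServers) {
        return Math.floorMod(sessionId.hashCode(), numServers);
    }
}
```

Any stable hash works here; the only requirement is that the same session always lands on the same server.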
Bryan McCormick
On Fri, 2005-02-18 at 08:01, Chris D wrote:
Hi all,
I have a question about scaling lucene across a cluster, and good ways
of breaking up the work.
We have a very large index and searches sometimes take more time than
they're allowed. What we have been doing is, during indexing, we index
into 256 separate indexes (depending on the md5sum) and then distribute
the indexes to the search machines. So if a machine has 128 indexes it
would have to do 128 searches. I gave ParallelMultiSearcher a try and
it was significantly slower than simply iterating through the indexes
one at a time.
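The md5sum-based partitioning described above could look like the following sketch; the exact scheme isn't spelled out in the post, so taking the first digest byte as the shard number (0–255) is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ShardRouter {
    static final int NUM_SHARDS = 256;

    // Route a document to one of 256 index shards using the first byte
    // of its MD5 digest, so documents spread uniformly across shards.
    static int shardFor(String docId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(docId.getBytes(StandardCharsets.UTF_8));
            return digest[0] & 0xFF; // 0..255
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 unavailable", e);
        }
    }
}
```

Because the routing is deterministic, the indexer and the search tier agree on which shard holds a given document without any coordination.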
Our new plan is to somehow have only one index per search machine and
a larger main index stored on the master.
What I'm interested to know is whether having one extremely large
index for the master then splitting the index into several smaller
indexes (if this is possible) would be better than having several
smaller indexes and merging them on the search machines into one
index.
I would also be interested to know how others have divided up search
work across a cluster.
Thanks,
Chris
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene in the Humanities

2005-02-18 Thread Praveen Peddi
Good work Erik (even though the UI could be made prettier). We use Lucene, so I 
have some knowledge of it. I could see the features you are using with 
Lucene (like paging, highlighting, and different kinds of phrases). Overall, 
good stuff.

Praveen
- Original Message - 
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene User lucene-user@jakarta.apache.org
Sent: Friday, February 18, 2005 2:46 PM
Subject: Lucene in the Humanities


It's about time I actually did something real with Lucene  :)
I have been working with the Applied Research in Patacriticism group at 
the University of Virginia for a few months and am finally ready to present 
what I've been doing.  The primary focus of my group is working with the 
Rossetti Archive - poems, artwork, interpretations, collections, and so on 
of Dante Gabriel Rossetti.  I was initially brought on to build a 
collection and exhibit system, though I got detoured a bit as I got 
involved in applying Lucene to the archive to replace their existing 
search system.  The existing system used an old version of Tamino with 
XPath queries.  Tamino is not at fault here, at least not entirely, 
because our data is in a very complicated set of XML files with a lot of 
non-normalized and legacy metadata - getting at things via XPath is 
challenging and practically impossible in many cases.

My work is now presentable at
http://www.rossettiarchive.org/rose
(rose is for ROssetti SEarch)
This system is implicitly designed for academics who are delving into 
Rossetti's work, so it may not be all that interesting for most of you. 
Have fun and send me any interesting things you discover, especially any 
issues you may encounter.

Here are some numbers to give you a sense of what is going on 
underneath... There are currently 4,983 XML files, totaling about 110MB. 
Without getting into a lot of details of the confusing domain, there are 
basically 3 types of XML files (works, pictures, and transcripts).  It is 
important that there be both case-sensitive and case-insensitive searches.  To 
accomplish that, a custom analyzer is used in two different modes, one 
applying a LowerCaseFilter and one not, with the same documents written to 
two different indexes.  There is one particular type of XML file that gets 
indexed as two different types of documents (a specialized summary/header 
type).  In this first set of indexes, there is basically a one-to-one mapping 
of XML file to Lucene Document (with one type being indexed twice in 
different ways) - all said, there are 5539 documents in each of the two 
main indexes.  The transcript type gets sliced into another set of 
original-case and lowercased indexes, with each document in that index 
representing a document division (a div element in the XML).  There are 
12326 documents in each of these div-level indexes.  All said, the 4 
indexes total about 3GB in size - I'm storing several fields in 
order to hit-highlight.  Only one of these indexes is hit at a 
time - which index is used depends on the query parameters.
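Why two physical indexes rather than one? Lucene stores terms post-analysis, so whether case is folded is baked in at index time. A stdlib-only sketch of the two analysis modes (the real implementation would wrap Lucene's LowerCaseFilter inside a custom Analyzer, as described above):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TwoModeAnalysis {
    // Toy stand-in for the two analyzer modes: the same text is tokenized
    // twice, once preserving case (for the case-sensitive index) and once
    // lowercased (for the case-insensitive index).
    static List<String> analyze(String text, boolean lowercase) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\W+")) {
            if (t.isEmpty()) continue;
            tokens.add(lowercase ? t.toLowerCase(Locale.ROOT) : t);
        }
        return tokens;
    }
}
```

Each document is then written to both indexes, and the query layer picks the index that matches the requested search mode.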

Lucene brought the search times into a usable, and impressive to the 
scholars, state.  The previous search solution often timed the browser 
out!  Search results now are in the milliseconds range.

The amount of data is tiny compared to most usages of Lucene, but things 
are getting interesting in other ways.  There has been little tuning in 
terms of ranking quality so far, but this is the next area of work.  There 
is one document type that is more important than the others, and it is 
being boosted during indexing.  There is now a growing interest in 
tinkering with all the new knobs and dials that are now possible.  
Similar-document and more-like-this features are desired and will be relatively 
straightforward to implement.  I'm currently using a 
catch-all aggregate field as the default field for QueryParser 
searching, though multi-field expansion would be preferable.  
So, I've got my homework to do, catching up on all the 
goodness that has been mentioned on this list recently regarding all of 
these techniques.

An area where I'd like to solicit more help from the community relates to 
something akin to personalization.  The scholars would like to be able to 
tune results based on the role (such as art historian) that is searching 
the site.  This would involve some type of training or continual learning 
process so that someone searching feeds back preferences implicitly for 
their queries by visiting the actual documents that are of interest.  Now 
that the scholars have seen what is possible (I showed them the cool 
SearchMorph comparison page searching Wikipedia for rossetti), they want 
more and more!

So - here's where I'm soliciting feedback - who's doing these types of 
things in the realm of the Humanities?  Where should we go from here in terms 
of researching and applying the types of features dreamed about here? 
How would you recommend 

Best way to find if a document exists, using Reader ...

2005-01-14 Thread Praveen Peddi
Hi Luceners,
Using IndexReader, what's the best (fastest) way to find whether a document exists with a 
given term? The term is a unique ID, meaning at most one document 
can exist with that term.

I have seen 2 appropriate methods on IndexReader: docFreq(Term) and termDocs(Term). 
docFreq should return either 0 or 1 in my case, and termDocs should return 
TermDocs of size 0 or 1. But I was not sure which method is faster. All I want 
to find is whether a document exists.

The actual reason I want to do this is that I want to delete a document with a given 
GUID. It looks like delete(Term) has some overhead, so I thought I could look up 
the document and delete it only if it exists. I will be dealing with 
millions of documents, most of which are new, but I don't know whether a 
document already exists in the Lucene index. So I was calling Reader.delete(Term) 
on each document before adding it. This means I am calling the delete method 
millions of times, even though possibly 99.9% of that million docs are new documents.

Does it make sense to call docFreq or termDocs (whichever is faster) before 
calling delete?
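One point worth noting: if I recall the Lucene 1.x API correctly, IndexReader.delete(Term) itself returns the number of documents deleted, so the delete call doubles as the existence check and a prior docFreq probe is redundant work. A toy in-memory model of that pattern (not Lucene code; the method names mirror the API for illustration only):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal stand-in for the check-then-delete question: deleteByTerm
// reports how many documents matched, so probing docFreq first just
// pays for an extra term lookup.
class TinyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void add(String term, int docId) {
        postings.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
    }

    // Analogous to IndexReader.docFreq(Term).
    int docFreq(String term) {
        Set<Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    // Analogous to IndexReader.delete(Term), which returns the count deleted.
    int deleteByTerm(String term) {
        Set<Integer> docs = postings.remove(term);
        return docs == null ? 0 : docs.size();
    }
}
```

With 99.9% new documents, skipping the separate existence check saves a term lookup per document in the common case.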

Any help is appreciated.

Thanks,
Praveen

 ** 
Praveen Peddi
Sr Software Engg, Context Media, Inc. 
email:[EMAIL PROTECTED] 
Tel:  401.854.3475 
Fax:  401.861.3596 
web: http://www.contextmedia.com 
** 
Context Media- The Leader in Enterprise Content Integration 


sorting on a field that can have null values (resend)

2004-12-21 Thread Praveen Peddi
I sent this mail yesterday but had no luck receiving responses. Trying it 
again.

Hi all,
I am getting a NullPointerException when I sort on a field that has a null 
value for some documents. ORDER BY in SQL does work on such fields, and I 
think it puts all results with null values at the end of the list. Shouldn't 
Lucene do the same thing instead of throwing a NullPointerException? Is 
this expected behaviour? Does Lucene always expect some value in the 
sortable fields?

I thought of putting empty strings instead of null values, but I think empty 
strings are sorted to the front of the list, which is the reverse of what 
anyone would want. 
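The SQL-style behaviour being asked for (nulls sorted last) is straightforward to express as a comparator; a custom sort comparator along these lines, instead of dereferencing the missing term value, would avoid the NPE. A stdlib sketch of the ordering rule:

```java
import java.util.Arrays;
import java.util.Comparator;

class NullsLastSort {
    // ORDER BY-like semantics: documents with a null sort key go to the
    // end of the result list rather than causing a NullPointerException.
    static final Comparator<String> NULLS_LAST =
        Comparator.nullsLast(Comparator.naturalOrder());

    static String[] sorted(String[] values) {
        String[] copy = values.clone();
        Arrays.sort(copy, NULLS_LAST);
        return copy;
    }
}
```

The same rule could be applied inside a custom comparator plugged into Lucene's extensible sorting, so missing field values compare greater than any present value.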

Following is the exception I saw in the error log:

java.lang.NullPointerException
 at 
org.apache.lucene.search.SortComparator$1.compare(Lorg.apache.lucene.search.ScoreDoc;Lorg.apache.lucene.search.ScoreDoc;)I(SortComparator.java:36)
 at 
org.apache.lucene.search.FieldSortedHitQueue.lessThan(Ljava.lang.Object;Ljava.lang.Object;)Z(FieldSortedHitQueue.java:95)
 at org.apache.lucene.util.PriorityQueue.upHeap()V(PriorityQueue.java:120)
 at 
org.apache.lucene.util.PriorityQueue.put(Ljava.lang.Object;)V(PriorityQueue.java:47)
 at 
org.apache.lucene.util.PriorityQueue.insert(Ljava.lang.Object;)Z(PriorityQueue.java:58)
 at 
org.apache.lucene.search.IndexSearcher$2.collect(IF)V(IndexSearcher.java:130)
 at 
org.apache.lucene.search.Scorer.score(Lorg.apache.lucene.search.HitCollector;)V(Scorer.java:38)
 at 
org.apache.lucene.search.IndexSearcher.search(Lorg.apache.lucene.search.Query;Lorg.apache.lucene.search.Filter;ILorg.apache.lucene.search.Sort;)Lorg.apache.lucene.search.TopFieldDocs;(IndexSearcher.java:125)
 at org.apache.lucene.search.Hits.getMoreDocs(I)V(Hits.java:64)
 at 
org.apache.lucene.search.Hits.init(Lorg.apache.lucene.search.Searcher;Lorg.apache.lucene.search.Query;Lorg.apache.lucene.search.Filter;Lorg.apache.lucene.search.Sort;)V(Hits.java:51)
 at 
org.apache.lucene.search.Searcher.search(Lorg.apache.lucene.search.Query;Lorg.apache.lucene.search.Sort;)Lorg.apache.lucene.search.Hits;(Searcher.java:41)

If it's a bug in Lucene, will it be fixed in the next release? Any suggestions would 
be appreciated.

Praveen



sorting on a field that can have null values

2004-12-20 Thread Praveen Peddi
Hi all,
I am getting a NullPointerException when I sort on a field that has a null 
value for some documents. ORDER BY in SQL does work on such fields, and I 
think it puts all results with null values at the end of the list. Shouldn't 
Lucene do the same thing instead of throwing a NullPointerException? Is 
this expected behaviour? Does Lucene always expect some value in the 
sortable fields?

I thought of putting empty strings instead of null values, but I think empty 
strings are sorted to the front of the list, which is the reverse of what 
anyone would want. 

Following is the exception I saw in the error log:

java.lang.NullPointerException
 at 
org.apache.lucene.search.SortComparator$1.compare(Lorg.apache.lucene.search.ScoreDoc;Lorg.apache.lucene.search.ScoreDoc;)I(SortComparator.java:36)
 at 
org.apache.lucene.search.FieldSortedHitQueue.lessThan(Ljava.lang.Object;Ljava.lang.Object;)Z(FieldSortedHitQueue.java:95)
 at org.apache.lucene.util.PriorityQueue.upHeap()V(PriorityQueue.java:120)
 at 
org.apache.lucene.util.PriorityQueue.put(Ljava.lang.Object;)V(PriorityQueue.java:47)
 at 
org.apache.lucene.util.PriorityQueue.insert(Ljava.lang.Object;)Z(PriorityQueue.java:58)
 at 
org.apache.lucene.search.IndexSearcher$2.collect(IF)V(IndexSearcher.java:130)
 at 
org.apache.lucene.search.Scorer.score(Lorg.apache.lucene.search.HitCollector;)V(Scorer.java:38)
 at 
org.apache.lucene.search.IndexSearcher.search(Lorg.apache.lucene.search.Query;Lorg.apache.lucene.search.Filter;ILorg.apache.lucene.search.Sort;)Lorg.apache.lucene.search.TopFieldDocs;(IndexSearcher.java:125)
 at org.apache.lucene.search.Hits.getMoreDocs(I)V(Hits.java:64)
 at 
org.apache.lucene.search.Hits.init(Lorg.apache.lucene.search.Searcher;Lorg.apache.lucene.search.Query;Lorg.apache.lucene.search.Filter;Lorg.apache.lucene.search.Sort;)V(Hits.java:51)
 at 
org.apache.lucene.search.Searcher.search(Lorg.apache.lucene.search.Query;Lorg.apache.lucene.search.Sort;)Lorg.apache.lucene.search.Hits;(Searcher.java:41)

If it's a bug in Lucene, will it be fixed in the next release? Any suggestions would 
be appreciated.

Praveen






Re: Lucene appreciation

2004-12-16 Thread Praveen Peddi
The product looks great. Are you separately indexing by reading info from 
all the sites, or just issuing a federated search to all job sites? I am 
impressed by the speed - it's surely faster than Dice and all the other job search 
sites. I understand it's a beta version, but adding an advanced search option 
would help users a lot. Just a suggestion.

Praveen
- Original Message - 
From: Rony Kahan [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, December 16, 2004 11:26 AM
Subject: Lucene appreciation


Hello fellow Lucene users,
I'd like to introduce myself and say thanks. We've recently launched
http://www.indeed.com, a search engine for jobs based on Lucene.  I'm
consistently impressed with the quality, professionalism and support of the
Lucene project and the Lucene community. This mailing list has been a great
help. I'd also like to mention some of the consultants who had a big
hand in making our project a reality ... Thank you Otis, Aviran, Sergiu &
Dawid.
As for our project, we're in beta and would love to get your feedback. The
index size is currently ~1.8m jobs. My personal email address is rony a_t
indeed.com. If you are interested in Lucene work you can set up an RSS feed
or email alert from here: http://www.indeed.com/search?q=lucene&sort=date
Is it possible to be added to the Wiki Powered By page?
Thanks Everyone,
Rony
Indeed.com - one search. all Jobs.
http://www.indeed.com


Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread Praveen Peddi
Hmm. So far all our fields are just strings, but I would guess you should be 
able to use Integer.MAX_VALUE or something as the upper bound. Or there 
might be a better way of doing it.
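A common workaround for this (an assumption here, not something the posters tested): left-pad numeric values to a fixed width at index time, so that lexicographic term order matches numeric order and open-ended ranges become expressible.

```java
class NumericPadding {
    // Left-pad to 11 digits so string comparison matches numeric
    // comparison for non-negative ints (Integer.MAX_VALUE has 10 digits).
    // A range like  my_numeric_field:[00000000081 TO 99999999999]
    // then behaves like "MY_NUMERIC_FIELD > 80".
    // Note: this sketch ignores negative numbers, which need an offset.
    static String pad(int n) {
        return String.format("%011d", n);
    }
}
```

Without padding, "9" sorts after "80" lexicographically, which is why raw numbers break RangeQuery.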

Praveen
- Original Message - 
From: Akmal Sarhan [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, December 14, 2004 10:23 AM
Subject: Re: Opinions: Using Lucene as a thin database


That sounds very interesting, but how do you handle queries like
select * from MY_TABLE where MY_NUMERIC_FIELD > 80
As far as I know you have only the range query, so you would have to say
my_numeric_field:[80 TO ??]
but this would not work in the aforementioned example - or am I missing something?
regards
Akmal
Am Di, den 14.12.2004 schrieb Praveen Peddi um 16:07:
Even we use Lucene for a similar purpose, except that we index and store
quite a few fields. In fact I also update partial documents, as people suggested.
I store all the indexed fields so I don't have to rebuild the whole document
when updating a partial document. The reason we do this is speed:
I found that Lucene search on a million objects is 4 to 5 times
faster than our Oracle queries (of course this might be due to our pitiful
database design :) ). It works great so far. The only caveat we had
till now was incremental updates, but now I am implementing real-time
updates so that the data in the Lucene index is almost always in sync with the data
in the database. So now, our search does not go to the database at all.

Praveen
- Original Message - 
From: Kevin L. Cobb [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, December 14, 2004 9:40 AM
Subject: Opinions: Using Lucene as a thin database

I use Lucene as a legitimate search engine, which is cool. But I am also
using it as a simple database too. I build an index with a couple of
keyword fields that allows me to retrieve values based on exact matches
in those fields. This is all I need to do, so it works just fine for my
needs. I also love the speed: the index is small enough that it is
wicked fast. I was wondering if anyone out there was doing the same, or if
there are any dissenting opinions on using Lucene for this purpose.






Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread Praveen Peddi
Even we use Lucene for a similar purpose, except that we index and store quite 
a few fields. In fact I also update partial documents, as people suggested. I 
store all the indexed fields so I don't have to rebuild the whole document 
when updating a partial document. The reason we do this is speed: 
I found that Lucene search on a million objects is 4 to 5 times 
faster than our Oracle queries (of course this might be due to our pitiful 
database design :) ). It works great so far. The only caveat we had 
till now was incremental updates, but now I am implementing real-time 
updates so that the data in the Lucene index is almost always in sync with the data 
in the database. So now, our search does not go to the database at all.

Praveen
- Original Message - 
From: Kevin L. Cobb [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, December 14, 2004 9:40 AM
Subject: Opinions: Using Lucene as a thin database

I use Lucene as a legitimate search engine, which is cool. But I am also
using it as a simple database too. I build an index with a couple of
keyword fields that allows me to retrieve values based on exact matches
in those fields. This is all I need to do, so it works just fine for my
needs. I also love the speed: the index is small enough that it is
wicked fast. I was wondering if anyone out there was doing the same, or if
there are any dissenting opinions on using Lucene for this purpose.






Re: sorting tokenized field

2004-12-13 Thread Praveen Peddi
If it's not already in the release code, is there any reason it has not 
been added? It seems like many people agree that this is important 
sorting functionality.

It's just that I can't get permission to use customized libraries in our 
company. Either we use the library as-is or implement our own stuff. 
We don't want to go through the pain of maintaining 3rd-party library code 
whenever we migrate from one version to another. I would assume everyone 
has the same problem.

Is there any possibility that this patch contributed by Aviran can be added to 
the actual release branch?

Thanks
Praveen
- Original Message - 
From: Aviran [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Monday, December 13, 2004 11:30 AM
Subject: RE: sorting tokenized field

The patch is very simple.
What it does is check whether the field you want to sort on is tokenized. If
it is, it loads the values from the documents into the sorting table.
The only con of this approach is that loading the values this way is much
slower than if the values were Keywords, but other than that it should work
just fine.
Aviran
http://www.aviransplace.com
-Original Message-
From: Praveen Peddi [mailto:[EMAIL PROTECTED]
Sent: Monday, December 13, 2004 10:48 AM
To: lucenelist
Subject: Fw: sorting tokenized field
Hi all,
I am forwarding the same email I sent before. Just wanted to try my luck again
:).
Thanks in advance.
Praveen
- Original Message - 
From: Praveen Peddi [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, December 10, 2004 3:33 PM
Subject: Re: sorting tokenized field


Since I am not aware of the Lucene code much, I couldn't make much out of
your patch. But is this patch already tested and proved to be efficient?
If so, why can't it be merged into the Lucene code and made part of the
release? I think the bug is valid. It's very likely that people want to
sort on tokenized fields.
If I apply this patch to the Lucene code and use it for myself, I will have a
hard time managing it in the future (while upgrading the Lucene library). If the
patch were applied to the Lucene release code, it would be very easy for
Lucene users.
If possible, can someone explain what the patch does? I am trying to
understand what exactly changed but could not figure it out.
Praveen
- Original Message -
From: Aviran [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Friday, December 10, 2004 2:30 PM
Subject: RE: sorting tokenized field

I have suggested a solution for this problem (
http://issues.apache.org/bugzilla/show_bug.cgi?id=30382 ) you can use
the  patch suggested there and recompile lucene.
Aviran
http://www.aviransplace.com
-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, December 10, 2004 13:53 PM
To: Lucene Users List
Subject: Re: sorting tokenized field

On Dec 10, 2004, at 1:40 PM, Praveen Peddi wrote:
I read that tokenised fields cannot be sorted. In order to sort a
tokenized field, the application either has to duplicate the field with
a diff name and not tokenize it, or come up with something else. But
shouldn't the search engine take care of this? Are there any plans to
put this functionality into Lucene?
It would be wasteful for Lucene to assume any field you add should be
available for sorting.
Adding one more line to your indexing code to accommodate your
sorting needs seems a pretty small price to pay.  Do you have suggestions
to
improve how this works?   Or how it is documented?
Erik


Re: sorting tokenized field

2004-12-13 Thread Praveen Peddi
Hi Erik,
Thanks a lot for your kind response. I appreciate the details.
What I meant by custom library is applying Aviran's patch to Lucene and 
maintaining it, not adding an extra field. Adding an extra field was my last 
option if I couldn't use the patch.

I did look at the extensible search, and in fact I wrote my own comparators 
(IgnoreCaseStringComparator and another custom comparator) and they work 
just fine. But I am not sure the extensible search feature helps me 
sort on a tokenized field without adding the extra field. For now, I will just 
go for the extra-field option, and later, if a more optimized solution is 
built into Lucene, I can use that.

Praveen
- Original Message - 
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, December 13, 2004 3:01 PM
Subject: Re: sorting tokenized field


On Dec 13, 2004, at 2:22 PM, Praveen Peddi wrote:
If it's not already in the release code, is there any reason it has not 
been added?
As noted, there is a performance issue with sorting by tokenized fields. 
It would seem far more advisable for you to simply add another field used 
for sorting which is untokenized.
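The advice above — index a second, untokenized field used only for sorting — amounts to deriving a normalized sort key per document. A sketch (the field names are illustrative, and the era-specific Field.Keyword/Field.Text indexing calls are shown only as comments):

```java
import java.util.Locale;

class SortKeys {
    // Derive an untokenized, case-folded key to index alongside the
    // tokenized searchable field, e.g. (Lucene 1.4-era API, illustrative):
    //   doc.add(Field.Text("title", title));                  // searchable
    //   doc.add(Field.Keyword("titleSort", sortKey(title)));  // sortable
    static String sortKey(String value) {
        return value.trim().toLowerCase(Locale.ROOT);
    }
}
```

Searches then run against the tokenized field while Sort is pointed at the untokenized one, avoiding both the patch and the performance issue of sorting on tokenized terms.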

Why has it not been added?  There have been several committers quite 
active in the codebase (myself excluded).  If you wish for changes to be 
committed, perseverance and patience are key.  Keep lobbying, but do so 
kindly.  When there are viable alternatives (such as adding an untokenized 
field for sorting) then certainly there is less incentive to commit 
changes.  Lucene's codebase is pretty clean and tight - it is wise for us 
to be very selective about changes to it.


 Seems like many people agree that this is important sorting 
functionality.
Many do, but not all.  I'm -0 on this change, meaning I'm not veto'ing it, 
but I'm not actually for it given the performance issue.

It's just that I can't get permission to use customized libraries in our 
company.
No custom library is needed for you to add an untokenized field for 
sorting purposes.

Also, sorting is extensible.  Check out the Lucene in Action code, 
specifically the lia.extsearch.sorting.DistanceSortingTest class.

Maybe you could add your own custom sorting code that could do what you 
want without patching Lucene.

Is there any possibility that this patch contributed by Aviran can be added to 
the actual release branch?
Keep lobbying - other committers may feel differently than I do about it 
and add it.

Erik


sorting tokenized field

2004-12-10 Thread Praveen Peddi
I read that tokenised fields cannot be sorted. In order to sort a tokenized 
field, the application either has to duplicate the field with a diff name and not 
tokenize it, or come up with something else. But shouldn't the search engine 
take care of this? Are there any plans to put this functionality into 
Lucene?

Praveen


Re: sorting tokenized field

2004-12-10 Thread Praveen Peddi
I was only thinking in terms of other search engines. I have worked with other 
search engines and didn't see this requirement before. I think you are 
right that it's wasteful to duplicate all tokenized fields. I'm not sure if there 
is a smart way of dealing with it.

Praveen
- Original Message - 
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, December 10, 2004 1:53 PM
Subject: Re: sorting tokenized field


On Dec 10, 2004, at 1:40 PM, Praveen Peddi wrote:
I read that tokenised fields cannot be sorted. In order to sort a 
tokenized field, the application either has to duplicate the field with a diff 
name and not tokenize it, or come up with something else. But shouldn't 
the search engine take care of this? Are there any plans to put this 
functionality into Lucene?
It would be wasteful for Lucene to assume any field you add should be 
available for sorting.

Adding one more line to your indexing code to accommodate your sorting 
needs seems a pretty small price to pay.  Do you have suggestions to 
improve how this works?   Or how it is documented?

Erik


Re: sorting tokenized field

2004-12-10 Thread Praveen Peddi
Since I am not aware of the Lucene code much, I couldn't make much out of 
your patch. But is this patch already tested and proved to be efficient? If 
so, why can't it be merged into the Lucene code and made part of the 
release? I think the bug is valid. It's very likely that people want to sort 
on tokenized fields.

If I apply this patch to the Lucene code and use it for myself, I will have a hard 
time managing it in the future (while upgrading the Lucene library). If the patch were 
applied to the Lucene release code, it would be very easy for Lucene users.

If possible, can someone explain what the patch does? I am trying to 
understand what exactly changed but could not figure it out.
Praveen
- Original Message - 
From: Aviran [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Friday, December 10, 2004 2:30 PM
Subject: RE: sorting tokenized field


I have suggested a solution for this problem (
http://issues.apache.org/bugzilla/show_bug.cgi?id=30382 ) you can use the
patch suggested there and recompile lucene.
Aviran
http://www.aviransplace.com
-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, December 10, 2004 13:53 PM
To: Lucene Users List
Subject: Re: sorting tokenized field

On Dec 10, 2004, at 1:40 PM, Praveen Peddi wrote:
I read that tokenized fields cannot be sorted. In order to sort a
tokenized field, the application either has to duplicate the field under a
different name and not tokenize it, or come up with something else. But
shouldn't the search engine take care of this? Are there any plans to
build this functionality into Lucene?
It would be wasteful for Lucene to assume any field you add should be
available for sorting.
Adding one more line to your indexing code to accommodate your sorting
needs seems a pretty small price to pay.  Do you have suggestions to
improve how this works?   Or how it is documented?
Erik


Re: partial updating of lucene

2004-12-09 Thread Praveen Peddi
But when I am searching, it only searches the index. Stored fields are 
only used to display the results, not to search. Why would it lose the terms 
in the index when I retrieve the document?

The first solution is not possible (I can't create a new document) since I only 
have the modified fields.

When I get a document, don't the fields carry the indexed terms along with them? 
Is there no way to get a full document (along with indexed terms), clone 
it, and add it back to the index?

Well, is there any way I can update a document with just one field (because I 
only have data for that one field)?

Praveen
- Original Message - 
From: Justin Swanhart [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, December 08, 2004 5:59 PM
Subject: Re: partial updating of lucene


Your unstored fields were not stored in the index; only their terms
were indexed. When you get the document from the index, modify it, and
add it again, those terms are lost.
You can either simply create a new document, populate all the
fields, and add that document to the index, or you can add the unstored
fields back to the document retrieved in step 1.
On Wed, 8 Dec 2004 17:53:26 -0500, Praveen Peddi
[EMAIL PROTECTED] wrote:
Hi all,
I have a question about updating a Lucene document. I know that there 
is no API to do that now. So this is what I am doing in order to update 
the document's title field:

1) Get the document from the Lucene index.
2) Remove the field called title and add the same field with a modified 
value.
3) Delete the document (matched on one of our fields) using an IndexReader, and then 
close the reader.
4) Add the document that was obtained in 1 and modified in 2.

I am not sure if this is the right way of doing it, but I am having 
problems searching for that document after updating it. The problem is 
only with the unstored fields.

For example, I search for description:boy, where description is an 
unstored, indexed, tokenized field in the document. I find 1 document. 
Now I update the document's title as described above and 
repeat the same search description:boy, and now I don't find any 
results. I have not touched the field description at all. I just 
updated the field title.

Is this the expected behaviour? If not, is it a bug?
If I change the field description to stored, indexed, and tokenized, the 
search works fine before and after updating.

Praveen
**
Praveen Peddi
Sr Software Engg, Context Media, Inc.
email:[EMAIL PROTECTED]
Tel:  401.854.3475
Fax:  401.861.3596
web: http://www.contextmedia.com
**
Context Media- The Leader in Enterprise Content Integration



Re: partial updating of lucene

2004-12-09 Thread Praveen Peddi
If I store all the fields I am indexing, is it safe to get the document, 
update a field, and add it back to the index? I do not want to lose 
anything, and I want to make sure the document is the same before and after 
updating (except for the updated fields).

Praveen
- Original Message - 
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, December 09, 2004 10:00 AM
Subject: Re: partial updating of lucene


On Dec 9, 2004, at 9:48 AM, Praveen Peddi wrote:
But when I am searching, it only searches the index. Stored fields are 
only used to display the results, not to search. Why would it lose the 
terms in the index when I retrieve the document?

The first solution is not possible (I can't create a new document) since I 
only have the modified fields.

When I get a document, don't the fields carry the indexed terms along with 
them? Is there no way to get a full document (along with indexed terms), 
clone it, and add it to the index?

Well, is there any way I can update a document with just one field (because 
I only have data for that one field)?
A Document only carries along its *stored* fields.  Fields that are 
indexed, but not stored, are not retrievable from Document.

Have a look at the tool Luke (Google for luke lucene :) and see how it 
does its Reconstruct and Edit facility.  It is possible, though 
potentially lossy, to reconstruct a document and add it again.

Erik
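Assuming every field is stored (as proposed above), the delete-then-re-add round trip can be sketched against the Lucene 1.4 API roughly like this. The field names (id, title, description) and the index path are illustrative assumptions, and unstored fields would still be silently lost by this approach:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class UpdateSketch {
    // Update the "title" field of a document whose fields are ALL stored.
    static void updateTitle(String indexPath, int docNum, String newTitle)
            throws java.io.IOException {
        IndexReader reader = IndexReader.open(indexPath);
        Document old = reader.document(docNum);       // stored fields only
        reader.delete(new Term("id", old.get("id"))); // remove stale copy
        reader.close();

        Document updated = new Document();
        updated.add(Field.Keyword("id", old.get("id")));
        updated.add(Field.Text("title", newTitle));                      // changed
        updated.add(Field.Text("description", old.get("description"))); // carried over
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
        writer.addDocument(updated);
        writer.close();
    }
}
```

A sketch only: it assumes the Lucene 1.4 jar is available and that "id" uniquely identifies a document.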


Lucene Vs Ixiasoft

2004-12-08 Thread Praveen Peddi
Does anyone know about the Ixiasoft server? It's an XML repository/search engine. If 
anyone knows about it, how does it compare to Lucene? 
Which is faster? 

Praveen


partial updating of lucene

2004-12-08 Thread Praveen Peddi
Hi all,
I have a question about updating a Lucene document. I know that there is no 
API to do that now. So this is what I am doing in order to update the document's 
title field:

1) Get the document from the Lucene index.
2) Remove the field called title and add the same field with a modified value.
3) Delete the document (matched on one of our fields) using an IndexReader, and then close 
the reader.
4) Add the document that was obtained in 1 and modified in 2.

I am not sure if this is the right way of doing it, but I am having problems 
searching for that document after updating it. The problem is only with the 
unstored fields.

For example, I search for description:boy, where description is an unstored, 
indexed, tokenized field in the document. I find 1 document. Now I update the 
document's title as described above and repeat the same search 
description:boy, and now I don't find any results. I have not touched the 
field description at all. I just updated the field title.

Is this the expected behaviour? If not, is it a bug?

If I change the field description to stored, indexed, and tokenized, the search 
works fine before and after updating.

Praveen


Re: False Locking Conflict?

2004-11-19 Thread Praveen Peddi
If you have more than one Lucene application running on the same machine, 
do they all share the same temp file? At least I had this problem when I ran my 
application in 2 different instances of WebLogic on the same machine.
Praveen
- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, November 19, 2004 2:13 PM
Subject: Re: False Locking Conflict?


It is possible, but it's not likely, as other users are not reporting
this.
Otis
--- Luke Shannon [EMAIL PROTECTED] wrote:
Hey All;
Is it possible for there to be a situation where the locking file is
in place after the reader has been closed?
I have extra logging in place and have followed the code execution.
The reader finishes deleting old content and closes (I know this for
sure). This is the only reader instance I have for the class (it is a
static member). The reader is not re-opened. I try to open the writer
and I get my old friend:
java.io.IOException: Lock obtain timed out:
Lock@/usr/tomcat/jakarta-tomcat-5.0.19/temp/lucene-398fbd170a5457d05e2f4d43210f7fe8-write.lock
This code is synchronized so I am sure there is no other processes
trying to do the same thing. It looks to me like the reader is
closing and the lock file is not being removed. Is this possible?
Luke



Using Shared directory as lucene index in cluster

2004-10-14 Thread Praveen Peddi
Hi all,
This topic has been discussed in the mailing list before but I could not find an 
answer to the problem I am having.

I am trying to decide whether a shared-directory-based index or a local index per 
server is the better approach in a clustered application. First I am evaluating the 
shared directory option. I am trying to use a shared directory (NFS) and test the 
performance difference compared to a local directory. I didn't see much difference on 
the search side, but indexing, I think, is a little slower on the shared directory. I think we 
can live with this. But I could not make indexing run in cluster mode. We cache an 
IndexSearcher on each server in the cluster (for faster search). We make sure that the 
cached IndexSearcher is always up to date. 

When I run our full indexer, it cleans the index directory and re-indexes all the 
objects from the DB to the Lucene index directory. But it looks like the IndexSearcher holds the 
file handles, so I cannot delete the directory. This means I cannot run the full 
indexer in cluster mode, since each server holds file handles on some of the index files 
and those files cannot be deleted. 

One solution is to make sure the searchers on all servers are closed before running the full 
indexer. But there is no direct way to signal this in a cluster. So my question is: is 
there any other solution that doesn't require closing the searchers in order to clean the index files?

Note: this is in fact not specific to a shared directory; it is true for a local directory 
as well.

Praveen





Re: sorting and score ordering

2004-10-13 Thread Praveen Peddi
Use SortField.FIELD_SCORE as the first element in the SortField[] when you 
pass it to the sort method.

Praveen
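In code, putting the relevance score first and a field second against the Lucene 1.4 API looks roughly like this; the "date" field name is an illustrative assumption:

```java
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

class ScoreThenFieldSort {
    // Primary key: relevance score; secondary key: a keyword field
    // used as the tie-breaker.
    static Sort build() {
        return new Sort(new SortField[] {
            SortField.FIELD_SCORE,
            new SortField("date", SortField.STRING)
        });
    }
    // Usage: Hits hits = searcher.search(query, build());
}
```

A sketch only; it assumes the Lucene 1.4 jar is on the classpath.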
- Original Message - 
From: Chris Fraschetti [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, October 13, 2004 3:19 PM
Subject: Re: sorting and score ordering


Will do.
My other question was: the 'score' for a page, as far as I know, is
only accessible post-search and is not contained in a field. How
can I specify the score as a sort field when there is no 'score'
field?
-Chris
On Wed, 13 Oct 2004 21:06:14 +0200, Daniel Naber
[EMAIL PROTECTED] wrote:
On Wednesday 13 October 2004 20:44, Chris Fraschetti wrote:
 I haven't seen an example on how to apply two sorts to a search.. can
 you help me out with that?
Check out the documentation for Sort(SortField[] fields) and SortField.

Regards
Daniel
--
http://www.danielnaber.de


--
___
Chris Fraschetti, Student CompSci System Admin
University of San Francisco
e [EMAIL PROTECTED] | http://meteora.cs.usfca.edu


Making lucene work in weblogic cluster

2004-10-08 Thread Praveen Peddi
While I was going through the mailing list looking for a solution to the Lucene cluster problem, I 
came across this thread. Does anyone know if David Townsend submitted the patch 
he was talking about?
http://www.mail-archive.com/[EMAIL PROTECTED]/msg06252.html

I am interested in looking at the NFS solution (mounting the shared drive on each 
server in the cluster). I don't know if anyone has used this solution in a cluster, but it 
seems a better approach than the RemoteSearchable interface and a DB-based index 
(SQLDirectory).


I am currently looking at 2 options:
Index on a shared drive: use a single index dir on a shared drive (NFS, etc.), which is 
mounted on each app server. All the servers in the cluster write to this shared drive 
when objects are modified.
Problems:
1) Known problems like file locking, etc. (The above thread talks about moving the locking 
mechanism to the DB, but I have no idea how.)
2) Performance.

Index per server: create a copy of the index dir for each machine. Requires regular 
updates, etc. Each server maintains and searches its own index.
Problems:
1) Modifying the index is complex. When objects are modified on server1, which does 
not run the search system, server1 needs to notify all servers in the cluster about 
these modifications so that each server can update its own index. This may involve 
some kind of remote communication mechanism, which will perform badly since our index 
is modified a lot.

So I am still reviewing both options and trying to figure out which one is the best 
and how to solve the above problems.

If you guys have any ideas, please shoot them my way. I would appreciate any help in 
making Lucene clusterable (both indexing and searching).

Praveen



Re: displaying 'pages' of search results...

2004-09-22 Thread Praveen Peddi
Sent: Wednesday, September 22, 2004 2:53 AM
Subject: displaying 'pages' of search results...


 Hi
 
 Can you share the searcher.search(query, hitCollector) [lightweight paging
 API]
 code on the forum? Maybe somebody like me needs it.
 
 
    ; )
 
 Karthik
 
 -Original Message-
 From: Praveen Peddi [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, September 22, 2004 1:24 AM
 To: Lucene Users List
 Subject: Re: displaying 'pages' of search results...
 
 
 The way we do it is: get all the document ids, cache them, and then fetch the
 first 50, second 50 documents, etc. We wrote a lightweight paging API on top
 of Lucene. We call searcher.search(query, hitCollector); our
 HitCollector implementation's collect method just collects the document id.
 
 Praveen
 
 
 - Original Message -
 From: Chris Fraschetti [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Tuesday, September 21, 2004 3:33 PM
 Subject: displaying 'pages' of search results...
 
 
I was wondering what the best way is to return, say,
 1,000,000 results, divided into 50-element sections, and then
 access them via the first 50, second 50, etc.

 Is there a way to keep the query around so that lucene doesn't need to
 search again, or would the search be cached and no delay arise?

 Just looking for some ideas and possibly some implementational issues...



 --
 ___
 Chris Fraschetti
 e [EMAIL PROTECTED]



Re: displaying 'pages' of search results...

2004-09-21 Thread Praveen Peddi
The way we do it is: get all the document ids, cache them, and then fetch the 
first 50, second 50 documents, etc. We wrote a lightweight paging API on top 
of Lucene. We call searcher.search(query, hitCollector); our 
HitCollector implementation's collect method just collects the document id.

Praveen
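The cache-the-ids half of such a paging API can be sketched in plain Java. The PageCache name and shape are invented for illustration; in the real service the ids would be gathered by a HitCollector's collect(doc, score) callback rather than passed to the constructor:

```java
import java.util.ArrayList;
import java.util.List;

class PageCache {
    private final int[] docIds; // all matching ids, cached once per query

    PageCache(int[] docIds) {
        this.docIds = docIds;
    }

    /** Returns the ids for one page (0-based page index). */
    List<Integer> page(int pageIndex, int pageSize) {
        int from = Math.max(pageIndex * pageSize, 0);
        int to = Math.min(from + pageSize, docIds.length);
        List<Integer> slice = new ArrayList<Integer>();
        for (int i = from; i < to; i++) {
            slice.add(Integer.valueOf(docIds[i])); // fetch Document lazily later
        }
        return slice;
    }
}
```

Only the ids on the requested page need to be turned into Lucene Documents, which is what keeps the approach cheap for large result sets.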
- Original Message - 
From: Chris Fraschetti [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, September 21, 2004 3:33 PM
Subject: displaying 'pages' of search results...


I was wondering what the best way is to return, say,
1,000,000 results, divided into 50-element sections, and then
access them via the first 50, second 50, etc.
Is there a way to keep the query around so that lucene doesn't need to
search again, or would the search be cached and no delay arise?
Just looking for some ideas and possibly some implementational issues...

--
___
Chris Fraschetti
e [EMAIL PROTECTED]


Re: problem with SortField[] in search method (newbie)

2004-09-15 Thread Praveen Peddi
Does it mean you indexed all non-null fields? I think you should change
your code so that you always index the fields you want to sort on.

In any case, it looks like some of your documents have shortName values that are not null
and not indexed. If you did not have any non-indexed shortNames in the index,
I don't think you would have got that error. But I may be wrong.

Praveen

- Original Message - 
From: Wermus Fernando [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, September 15, 2004 1:53 PM
Subject: RE: problem with SortField[] in search method (newbie)


Aviran,
I can search on non-indexed fields without any exception, but I can't order
by those same fields.
Besides, I can't know in advance whether they are indexed in my app, because I
only index fields that have some value; if a field doesn't, I don't add it to the
document.

What if I don't have any documents indexed?



-Original Message-
From: Aviran [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 15, 2004 02:35 PM
To: 'Lucene Users List'
Subject: RE: problem with SortField[] in search method (newbie)

You can only sort on indexed fields. (Even more than that, it will work
properly only on untokenized fields, i.e. keywords.)

Aviran

-Original Message-
From: Wermus Fernando [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 15, 2004 13:13 PM
To: [EMAIL PROTECTED]
Subject: problem with SortField[] in search method (newbie)


Luceners,
My search looks up whole entities. My entities are accounts, contacts,
tasks, etc. The search looks across a group of entity fields. This works
fine even though not every entity field is indexed in every document. But if I sort
by some fields from different entities, I get the following error.

field shortName does not appear to be indexed

The account's field I have indexed are

shortName,number,location,fax,phone,symbol

and I order by

shortName

but without any ordering on

shortName,number,location,fax,phone,symbol

it works fine.

I don't understand the behavior: if I don't order the search and don't have every
field indexed in a document, it works fine, but if I add an ordering I get
a RuntimeException, and I can't catch the exception to work around the problem.
The only solution is to always index all of the entities' fields in a
document, but to me that's a hack.

Any idea,  it could help me out.

Thanks in advance.






Re: Moving from a single server to a cluster

2004-09-08 Thread Praveen Peddi
We went through the same scenario as yours. We recently made our application
clusterable, and I wrote our own version of a JDBC directory (similar to the
SQLDirectory posted by someone) with our own caching. It was great for
searching, but indexing became a real bottleneck. So we have decided to
move back to the file system for non-clustered apps. I am still trying to figure
out the best way (whether to use a RemoteSearchable or manage multiple indexes). I
already tried multiple indexes, and we didn't really like the solution of
maintaining multiple copies. It requires more space, more maintenance, all
indexes need to be in sync, etc.

I would be glad to get the best answer for this. Did anyone try
RemoteSearchable, and how does it compare to the multiple-index solution?

 Nader: I would appreciate it if you could send me the docs.

Praveen

- Original Message - 
From: David Townsend [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Wednesday, September 08, 2004 10:42 AM
Subject: RE: Moving from a single server to a cluster


Would it be cheeky to ask you to post the docs to the group?  It would be
interesting to read how you've tackled this.

-Original Message-
From: Nader Henein [mailto:[EMAIL PROTECTED]
Sent: 08 September 2004 13:57
To: Lucene Users List
Subject: Re: Moving from a single server to a cluster


Hey Ben,

We've been using a distributed environment with three servers and three
separate indices for the past 2 years, since the first stable Lucene
release, and it has been great. Recently, for the past two months, I've
been working on a redesign of our Lucene app, and I've shared my
findings and plans with Otis, Doug, and Erik. They pointed out a few
faults in my logic, which you will probably come across soon enough; they
mainly have to do with keeping your updates atomic (not too hard) and
your deletes atomic (a little more tricky). Give me a few days and I'll
send you both the early document and the newer version that deals
squarely with Lucene in a distributed environment with a high-volume index.

Regards.

Nader Henein

Ben Sinclair wrote:

My application currently uses Lucene with an index living on the
filesystem, and it works fine. I'm moving to a clustered environment
soon and need to figure out how to keep my indexes together. Since the
index is on the filesystem, each machine in the cluster will end up
with a different index.

I looked into JDBC Directory, but it's not tested under Oracle and
doesn't seem like a very mature project.

What are other people doing to solve this problem?







Re: Lucene for Indian Languages

2004-08-23 Thread Praveen Peddi
In fact the CJK analyzer also works well with Indian languages. Since CJKAnalyzer
treats multi-byte characters as a special case, it works with most
Asian multi-byte characters. I introduced CJKAnalyzer for Japanese text
search, and we also tested with the Hindi and Telugu languages. All our search
test cases passed.
Give CJKAnalyzer a try. You will find it a better analyzer than the standard one
(for any Asian language).

Praveen

- Original Message - 
From: Satish Kagathare [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, August 23, 2004 9:20 AM
Subject: Re: Lucene for Indian Languages



 Hi Srinivasa,

 Use StandardAnalyzer for indexing and for parsing queries for Indian-language docs.
 It will work. Right now we are searching on Hindi and Marathi,
 but without specific stemmers and filters. We are planning to develop a
 Marathi morphological analyzer.

 Thanks,
 Satish.

 On Sun, 22 Aug 2004, srinivasa raghavan wrote:

  Hi all,
 
   Is the Lucene API implemented for Indian-language contexts? I know
  that Lucene has stemmers and filters for the German and
  Russian languages. I would like to know whether there
  are stemmers and filters available or being developed for
  Indian languages.
 
  Thanks,
  Rahavan.
 
 
 
 
 
  ___
  Do you Yahoo!?
  Express yourself with Y! Messenger! Free. Download now.
  http://messenger.yahoo.com
 



Re: lucene and ejb applications

2004-08-20 Thread Praveen Peddi
In fact we do the same exact thing. A session bean method called search()
delegates to a POJO SearchService. We lazily load the IndexSearcher, cache it in
memory, and invalidate that object when someone else modifies the index. This
trick works wonderfully for us. Search has become faster after caching
the searcher.
Praveen
- Original Message - 
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, August 20, 2004 12:02 PM
Subject: Re: lucene and ejb applications


 On Aug 20, 2004, at 7:54 AM, Rupinder Singh Mazara wrote:
  hi erik
 
   thanks for the warning and the code.
   Let me re-phrase the question,
 
   I have an index generated by Lucene; I need the search
   capability to have high availability. What solutions would be the most
   optimal?

 I'm guessing from your descriptions that you want a search server that
 multiple applications can access.  Correct?  Is that what you mean by
 high availability?

 Take a look at Nutch for examples of doing this kind of thing.  And
 also...

 
   Currently I have two scenarios in mind:
 a) Set up an RMI-based app that on start-up initializes an
  IndexSearcher object
   and waits for invocations of a method like Vector
  executeQuery(Query)

 Lucene has built-in RMI capability, so you don't need to recreate this
 yourself.  Look at RemoteSearchable (and the test cases that use it).

b) Create a web-based app (JSP/servlet or Struts) that initializes the
  IndexSearcher object, stores it in the ServletContext on
  initialization, and
  all requests invoke Hits search(Query q)

 This is ok, but you have the same issues with servlet context
 (application scope or even session scope) with distributed
 applications.  IndexSearcher, at the very least, should be transient
 and lazy initialized, perhaps nested under a controller object of your
 making.

with scenario a) I can have more control over updates, inserts, and
  deletes,
whereas scenario b) has higher availability

 I disagree with your analysis of those scenarios.  Neither has more or
 less control or availability than the other.

  I want to create and store the IndexSearcher object during
  initialization
  to save on
  multiple opens and reads. Once updates are ready, a signal can be sent to
  block further searches while the updates are integrated into the
  existing
  index.

 It is a good thing to keep an IndexSearcher instance around for big
 indexes to save on that I/O, I completely agree.  A simple
 IndexSearcher-encapsulating Java object which lazy initializes and
 keeps IndexSearcher as a transient would be quite sufficient, I think.
 Store that object wherever you like - application scope seems to be
 appropriate for your web application scenario.

 Erik





merge factor and minMergeDocs

2004-07-23 Thread Praveen Peddi
Has anything changed in Lucene 1.4 regarding the merge factor?
I recently ported to Lucene 1.4 final, and my indexing time does not change with a change 
in the merge factor. Increasing minMergeDocs improves my indexing as expected, but 
changing mergeFactor makes no difference.

If this is the case, I can always go with the default merge factor of 10, so I won't 
run into the too-many-open-files problem, and just vary minMergeDocs to tune indexing 
performance.

Currently I tested with 25K objects, and the indexing time is almost the same with a 
mergeFactor of 10 and a mergeFactor of 100 (I kept minMergeDocs=100 in both cases). I am 
confident that my indexing time used to vary with the merge factor before 
(with Lucene 1.3 RC3, I think).


Praveen
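For reference, the two knobs being compared are public fields on the Lucene 1.4 IndexWriter; a minimal sketch, with the path and values chosen purely for illustration:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

class TuningSketch {
    static IndexWriter openWriter(String path) throws java.io.IOException {
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
        writer.mergeFactor = 10;     // how many segments accumulate before a merge
        writer.minMergeDocs = 1000;  // docs buffered in RAM before flushing a segment
        return writer;
    }
}
```

A sketch only, assuming the Lucene 1.4 jar; raising minMergeDocs trades memory for fewer disk flushes, which matches the behavior described above.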


No change in the indexing time after increase the merge factor

2004-07-20 Thread Praveen Peddi
I performed Lucene indexing with 25,000 documents.
We feel that indexing is slow, so I am trying to tune it.
My configuration is as follows:
Machine: Windows XP, 1GB RAM, 3GHz
# of documents: 25,000
App Server: Weblogic 7.0
lucene version: lucene 1.4 final

I ran the indexer with merge factors of 10 and 50. Both times, the total indexing time 
(Lucene time only) was almost the same (27.92 min for mergeFactor=10 and 28.11 min 
for mergeFactor=50).

From the Lucene mails and Lucene-related articles I read, I thought increasing the merge 
factor would improve indexing performance. Am I wrong?


Praveen




Re: Problems indexing Japanese with CJKAnalyzer

2004-07-15 Thread Praveen Peddi
If it's a web application, you have to call request.setCharacterEncoding("UTF-8")
before reading any parameters. Also make sure the HTML page encoding is
specified as UTF-8 in the meta tag. Most web app servers decode the request
parameters using the system's default encoding; if you call the above
method, I think it will solve your problem.

Praveen
- Original Message - 
From: Bruno Tirel [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Thursday, July 15, 2004 6:15 AM
Subject: RE: Problems indexing Japanese with CJKAnalyzer


Hi All,

I am also trying to localize everything for a French application, using UTF-8
encoding. I have already applied what Jon described. I fully confirm his
recommendation for the HTMLParser and HTMLDocument changes with the Unicode and
UTF-8 encoding specification.

In my case, I still have one case that is not functional: using meta-data from an HTML
document, as in the demo3 example. Whether I try converting to UTF-8 or
ISO-8859-1, it is still not correctly encoded when I check with Luke.
The word Propriété is seen either as Propri?t? with a square, or as
Propriã©tã©.
My local codepage is Cp1252, so it should be viewed as ISO-8859-1. I get the same result
when I use the local FileEncoding parameter.
All the other fields are correctly encoded in UTF-8, tokenized, and
successfully searched through the JSP page.

Is anybody else facing this issue? Is any help available?
Best regards,

Bruno
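Bruno's symptom (each accent turning into two strange characters) is the classic signature of UTF-8 bytes being decoded with a single-byte charset such as ISO-8859-1/Cp1252. A small self-contained illustration, independent of Lucene:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Decode UTF-8 bytes with the wrong single-byte charset: every
    // accented character becomes two junk characters.
    static String garble(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8),
                          StandardCharsets.ISO_8859_1);
    }

    // The damage is reversible as long as it is only a mis-decoding,
    // not a lossy conversion (no '?' replacement characters yet).
    static String repair(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "Propriété";
        System.out.println(word + " -> " + garble(word));
    }
}
```

The "Propri?t? with a square" variant, by contrast, suggests a lossy decode where the bytes were already replaced and cannot be recovered.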


-Original Message-
From: Jon Schuster [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 14, 2004 10:51 PM
To: 'Lucene Users List'
Subject: RE: Problems indexing Japanese with CJKAnalyzer

Hi all,

Thanks for the help on indexing Japanese documents. I eventually got things
working, and here's an update so that other folks might have an easier time
in similar situations.

The problem I had was indeed with the encoding, but it was more than just
the encoding on the initial creation of the HTMLParser (from the Lucene demo
package). In HTMLDocument, doing this:

    InputStreamReader reader = new InputStreamReader(new FileInputStream(f), "SJIS");
    HTMLParser parser = new HTMLParser(reader);

creates the parser and feeds it Unicode from the original Shift-JIS-encoded
document, but then the document contents are fetched using this line:

    Field fld = Field.Text("contents", parser.getReader());

HTMLParser.getReader creates an InputStreamReader and OutputStreamWriter
using the default encoding, which in my case was Windows 1252 (essentially
Latin-1). That was bad.

In the HTMLParser.jj grammar file, adding an explicit encoding of UTF8 on
both the Reader and Writer got things mostly working. The one missing piece
was in the options section of the HTMLParser.jj file. The original grammar
file generates an input character stream class that treats the input as a
stream of 1-byte characters. To have JavaCC generate a stream class that
handles double-byte characters, you need the option UNICODE_INPUT=true.

So, there were essentially three changes in two files:

HTMLParser.jj - add UNICODE_INPUT=true to the options section; add an explicit
UTF8 encoding on Reader and Writer creation in getReader(). As far as I
can tell, this change works fine for all of the languages I need to handle,
which are English, French, German, and Japanese.

HTMLDocument - add explicit encoding of SJIS when creating the Reader used
to create the HTMLParser. (For western languages, I use encoding of
ISO8859_1.)

And of course, use the right language tokenizer.
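The HTMLDocument side of the fix boils down to never letting a Reader pick the platform default charset. A self-contained sketch with the plain JDK (no Lucene classes; the file name and the Japanese sample text are made up):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;

public class SjisReadDemo {
    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("sjis", ".html");
        String japanese = "日本語のテスト";
        // Write the file in Shift-JIS, as a Japanese HTML page might be stored
        Files.write(f, japanese.getBytes("SJIS"));

        // Name the encoding explicitly ("SJIS" is a JDK alias for Shift_JIS);
        // relying on the platform default would mangle the bytes on, say,
        // a Windows-1252 machine
        try (Reader r = new InputStreamReader(new FileInputStream(f.toFile()), "SJIS")) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) sb.append((char) c);
            System.out.println(sb.toString().equals(japanese)); // true
        }
        Files.delete(f);
    }
}
```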

--Jon

earlier responses snipped; see the list archive

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







Re: Sorting and tokenization

2004-07-01 Thread Praveen Peddi
The solution you suggested is exactly what I expected, and I had already thought
about implementing it. But the problem is the memory inefficiency. Sometimes
titles are huge. And with i18n, a title can be in Japanese, Chinese, or
any other language that takes more memory than English.

OK, how about taking just the first token of the title and using it for the
sake of sorting? Does anyone see any problem with that? This solution saves
at least some memory compared to the other solution.
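A sketch of the first-token idea in plain Java (the sortKey helper is hypothetical; in Lucene you would index its result as a separate untokenized keyword field and sort on that). Note the trade-off it buys the memory savings with: titles sharing a first token keep only their index order relative to each other:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class FirstTokenSort {
    // Hypothetical helper: derive a compact sort key from a tokenized title
    // by keeping only the first whitespace-delimited token
    static String sortKey(String title) {
        String trimmed = title.trim();
        int space = trimmed.indexOf(' ');
        return space < 0 ? trimmed : trimmed.substring(0, space);
    }

    public static void main(String[] args) {
        List<String> titles = new ArrayList<>(Arrays.asList(
                "zebra crossing guide", "apple pie recipes", "apple tree care"));
        // Sorting by key is stable, so the two "apple" titles keep their order
        titles.sort(Comparator.comparing(FirstTokenSort::sortKey));
        System.out.println(titles);
        // [apple pie recipes, apple tree care, zebra crossing guide]
    }
}
```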

Praveen

- Original Message - 
From: John Moylan [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, July 01, 2004 10:24 AM
Subject: Re: Sorting and tokenization


 Hi,

 You just need to have another title field that is not tokenized - for
 sorting purposes.

 Best,
 John

 On Thu, 2004-07-01 at 15:15, Praveen Peddi wrote:
  Hello all,
  Now that lucene 1.4 rc3 has sorting functionality built in, I am adding
sorting functionality to our searching. Before posting any question to this
mailing list, I have been going through most of the email responses in this
mailing list related to sorting. I have found that I cannot tokenize the
fields that I want to sort on.
 
  Lets take the example I have.
  I use lucene 1.3 final for searching. Sorting is in fact a very
important feature in our application. But we found that lucene does not
support it out of the box; we had to implement sorting by score and doc id
programmatically, which is kind of useless for us. So I thought lucene's new
sorting feature would suit us best now. But unfortunately, the field called
title is currently tokenized. And this is done purposefully, because users
would want to search partial matches (or rather search on multiple words of
the title). So if we make it untokenized we may lose an important piece of
functionality.
 
  My question is, is there any way I can achieve sorting the objects by
title and keeping title as tokenized?
 
  Thanks in advance.
 
  Praveen
 
 
  **
  Praveen Peddi
  Sr Software Engg, Context Media, Inc.
  email:[EMAIL PROTECTED]
  Tel:  401.854.3475
  Fax:  401.861.3596
  web: http://www.contextmedia.com
  **
  Context Media- The Leader in Enterprise Content Integration
 -- 
 John Moylan
 --
 ePublishing
 Radio Telefis Eireann,
 Montrose House,
 Donnybrook,
 Dublin 4,
 Eire
 t:+353 1 2083564
 e:[EMAIL PROTECTED]




**
 The information in this e-mail is confidential and may be legally
privileged.
 It is intended solely for the addressee. Access to this e-mail by anyone
else
 is unauthorised. If you are not the intended recipient, any disclosure,
 copying, distribution, or any action taken or omitted to be taken in
reliance
 on it, is prohibited and may be unlawful.
 Please note that emails to, from and within RTÉ may be subject to the
Freedom
 of Information Act 1997 and may be liable to disclosure.


**

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]







languages lucene can support

2004-07-01 Thread Praveen Peddi
I have read many emails in lucene mailing list regarding analyzers.

Following is the list of languages lucene supports out of the box, so they will be
supported with no change in our code, just a configuration change:
English
German
Russian

Following is the list of languages that are available as external downloads on 
lucene's site:

Chinese
Japanese
Korean (all of the above come as single download)
Brazilian
Czech
French
Dutch

I also read that lucene's StandardAnalyzer supports most of the European languages. 
Does that mean it supports Spanish also, or is there a separate analyzer for that? I 
didn't see any Spanish analyzer in the sandbox or the lucene release.

Another question regarding the FrenchAnalyzer. I downloaded the FrenchAnalyzer, and 
some methods do not declare IOException where they are supposed to (for example, the 
constructor). I am using 1.4 final (I know it was released only today :)). What's the 
fix for it?

Praveen

** 
Praveen Peddi
Sr Software Engg, Context Media, Inc. 
email:[EMAIL PROTECTED] 
Tel:  401.854.3475 
Fax:  401.861.3596 
web: http://www.contextmedia.com 
** 
Context Media- The Leader in Enterprise Content Integration 

Do we really need CJKAnalyzer to search japanese characters

2004-06-28 Thread Praveen Peddi
Hello all,
You will have to excuse me if the question looks dumb ;)

I didn't use CJKAnalyzer and I could still search Japanese characters.
Actually I used it first, but then I thought of testing with just the
standard analyzer. It worked with the standard analyzer also.

I was able to search the metadata of our objects that contains Chinese and
Japanese characters.

I think lucene stores Unicode characters internally. So should it matter
whether it's the standard analyzer or the CJK analyzer?

When do we really need to use CJKAnalyzer?
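As I understand it, the difference shows up in tokenization, not storage: the StandardAnalyzer of that era emits each CJK character as its own single-character token, so any document containing the individual characters matches, while CJKAnalyzer emits overlapping two-character tokens, so only documents where the characters appear adjacent match. A sketch of the bigram scheme (assumption: this mirrors CJKAnalyzer's handling of runs of CJK text only; the real analyzer also deals with Latin text, stopwords, and halfwidth forms):

```java
import java.util.ArrayList;
import java.util.List;

public class CjkBigrams {
    // Overlapping two-character tokens, in the style of CJKAnalyzer
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Unigram indexing would produce 日, 本, 語 and match far more loosely
        System.out.println(bigrams("日本語")); // [日本, 本語]
    }
}
```

So both analyzers "work" for a basic search; the bigrams mainly reduce false matches and shrink result sets for CJK queries.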

Praveen

**
Praveen Peddi
Sr Software Engg, Context Media, Inc.
email:[EMAIL PROTECTED]
Tel:  401.854.3475
Fax:  401.861.3596
web: http://www.contextmedia.com
**
Context Media- The Leader in Enterprise Content Integration


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


