RE: Full Text Search with multiple index and complex requirements

Jonathan Rochkind Sun, 06 Mar 2011 21:00:51 -0800

While it might be possible to work things out, not just one but several of your 
requirements are things that are difficult for Solr to do, or that Solr isn't 
really optimized to do. Are you sure you need an inverted indexing tool like 
Solr at all, as opposed to some kind of store (RDBMS or NoSQL), for all or some 
parts of your data?
________________________________________
From: Shrinath M [shrinat...@webyog.com]
Sent: Sunday, March 06, 2011 11:49 PM
To: rajini maski
Cc: solr-user@lucene.apache.org
Subject: Re: Full Text Search with multiple index and complex requirements


On Mon, Mar 7, 2011 at 9:56 AM, rajini maski <rajinima...@gmail.com> wrote:

> I just tried to answer your many questions; I like this type of question.
> My answers are attached to the questions.
>
Thank you Rajini, for your interest :)

>
> A) The data for every user is totally unrelated to every other user. This
> gives us few advantages:
>
>   1. we can keep our indexes small in size.
>  (using cores)
>   2. merging/compacting fragmented index will take less time.
> (merging is simple, one query)
>   3. if some indexes become inaccessible for whatever reason
>   (corruption?), only those users get affected. Other users are unaffected
>   and the service is available for them.
> Yes, it affects only that index; the others are unaffected.
>
>
How many cores can we safely have on a machine? How many is "too many" in
this case?


> B) Each user can have few different types of data.
>
> So, our index hierarchy will look something like:
> /user1/type1/<index files>
> /user1/type2/<index files>
> /user2/type1/<index files>
> /user3/type3/<index files>
>
> I am not clear on this point.
> For example, say you have 2 users:
> user1
>  types: Name, Email address, Phone number
> user2
>  types: Name, Email address, ID
> So you want 3 indexes for user1 plus 3 indexes for user2, 6 indexes in total?
> If the user1 "phone number" index holds only that one type of data, then its
> schema will have only one data type, a number type.
>
>
>
I just meant something like this:

/myself/docs/index_docs
/myself/spreadsheets/index_spreads
/yourself/docs/index_docs
/yourself/spreadsheets/index_spreads

You get the idea, right?
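
One way to map a layout like this onto Solr is one core per user/type pair. The sketch below only builds the CoreAdmin CREATE request URL and sends nothing. The `user_type` core naming scheme, the base URL, and the helper names are assumptions made for illustration, not something taken from this thread.

```python
from urllib.parse import urlencode


def core_name(user: str, doc_type: str) -> str:
    """Hypothetical naming scheme: one core per (user, type) pair."""
    return f"{user}_{doc_type}"


def create_core_url(base_url: str, user: str, doc_type: str) -> str:
    """Build a CoreAdmin CREATE URL whose instanceDir mirrors /user/type."""
    params = {
        "action": "CREATE",
        "name": core_name(user, doc_type),
        "instanceDir": f"{user}/{doc_type}",
    }
    return f"{base_url}/admin/cores?{urlencode(params)}"


if __name__ == "__main__":
    # Assumed base URL of a local Solr instance.
    print(create_core_url("http://localhost:8983/solr", "myself", "docs"))
```

Whether a single machine can comfortably hold that many cores is a separate question that this sketch does not answer.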

> C) Often, probably with every iteration, we'll add "types" of data that can
> be indexed.
> So we want to have an efficient/programmatic way to add schemas for
> different "types". We would like to avoid having fixed schema for indexing.
>
> Say you add a type, DATE.
> Before you start indexing this "date" type, you need to update your
> schema with this data type to enable indexing, correct?
> So this won't need a fixed schema defined beforehand; you can add the type
> only when you need it. But this requires a service restart.
> It won't affect the current index other than adding to it.
>
>
Today I am adding only docs and spreadsheets, tomorrow I may want to add
something else, something from RDBMS for example, then I don't want
to sit tinkering with schema.xml and I wouldn't like a service restart
either...
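
A possible way around editing schema.xml for every new type is to rely on dynamicField patterns and pick the field suffix on the client side. This is only a sketch: it assumes the schema already declares patterns such as `*_t`, `*_i`, `*_f` and `*_b` (as Solr's example schema does), and the helper name and suffix table are made up for illustration.

```python
# Suffixes assumed to match dynamicField patterns declared in schema.xml.
# bool is listed before int because bool is a subclass of int in Python.
SUFFIXES = [(bool, "_b"), (int, "_i"), (float, "_f"), (str, "_t")]


def to_solr_doc(doc_id: str, fields: dict) -> dict:
    """Rename each field with a type suffix so no per-field schema entry is needed."""
    doc = {"id": doc_id}
    for name, value in fields.items():
        for py_type, suffix in SUFFIXES:
            if isinstance(value, py_type):
                doc[name + suffix] = value
                break
        else:
            raise TypeError(f"no dynamic field suffix for {name!r}")
    return doc


if __name__ == "__main__":
    print(to_solr_doc("doc-1", {"title": "Q1 report", "pages": 12}))
```

A genuinely new field type would still need a schema change; this only covers new fields of already declared types.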


>
> D) The users can fire search queries which will search either:
>   - Within a specific "type" for that user
>   - Across all types for that user: in this case we want to fire a parallel
>     query like Lucene has (ParallelMultiSearcher:
>     http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/ParallelMultiSearcher.html)
>
>
> Sharding in Solr works like this:
> You have phone number details in one index and phone number details
> in another index too.
> You can search across both indexes by firing a query such as Ph:9999 across
> index1 and index2.
> You cannot fire a single query such as Name:xyz AND Ph:9999 across index1
> and index2 when index1 has a field defined only for name and index2 only
> for phone number. This can only be done if the schema defines both fields
> (which brings back the problem of having the same, fixed schema).
>
>
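
For reference, a cross-core query of this kind is usually expressed with Solr's `shards` request parameter. The sketch below only assembles such a URL; the host, the core names, and the function name are assumptions for illustration, and the listed cores are assumed to share the queried fields and the unique key field.

```python
from urllib.parse import urlencode


def distributed_query_url(host: str, cores: list, query: str) -> str:
    """Build a select URL that fans one query out over several cores."""
    if not cores:
        raise ValueError("at least one core is required")
    # Each shard entry is host/solr/core, without the http:// scheme.
    shards = ",".join(f"{host}/solr/{core}" for core in cores)
    params = {"q": query, "shards": shards}
    # The first core receives the request and merges the shard results.
    return f"http://{host}/solr/{cores[0]}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(distributed_query_url(
        "localhost:8983", ["myself_docs", "myself_spreadsheets"], "text:report"))
```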
> E) We require real time update for the index. *This is a must.*
> This is possible. Schedule indexing to run every minute and check whether
> updates were made. If so, re-index, and maintain uniqueness with the
> userid.
>
>
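
An alternative to polling is to let each update carry a `commitWithin` deadline, which bounds how stale the index can get without being truly real-time. The sketch below builds the XML update message only; the field names, the 60000 ms value, and the function name are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def build_add_message(doc: dict, commit_within_ms: int) -> str:
    """Build a Solr XML <add> message asking for a commit within the given time."""
    add = ET.Element("add", {"commitWithin": str(commit_within_ms)})
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", {"name": name})
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # Re-sending a document with the same unique key replaces the old version.
    print(build_add_message({"id": "user1-42", "title_t": "hello"}, 60000))
```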
>
> We were considering Lucene, Sphinx and Solr to do this. This is what we
> found:
>
>   - Sphinx: No efficient way to do A, B, C, F. Or is there?
>   - Lucene: Everything looks possible, as it is very low level. But we have
>   to write wrappers to do F and build a communication layer between the web
>   server and the search server.
>   - Solr: Not sure if we can do A, B, C easily. Can we?
>
> So, my question is what is the best software for the above requirements? I
> am inclined more towards Solr and then Lucene if we get all the
> requirements.
>
>
> Regards,
> Rajani Maski
>
> On Fri, Mar 4, 2011 at 7:16 PM, Shrinath M <shrinat...@webyog.com> wrote:
>
>> We are building an application which will require us to index data for each
>> of our users so that we can provide full text search on their data. Here are
>> some notable things about the application:
>>
>> A) The data for every user is totally unrelated to every other user. This
>> gives us few advantages:
>>
>>   1. we can keep our indexes small in size.
>>   2. merging/compacting fragmented index will take less time.
>>   3. if some indexes become inaccessible for whatever reason
>>   (corruption?), only those users get affected. Other users are unaffected
>>   and the service is available for them.
>>
>> B) Each user can have few different types of data. We want to keep each type
>> in separate folders, for the same reasons as above.
>>
>> So, our index hierarchy will look something like:
>> /user1/type1/<index files>
>> /user1/type2/<index files>
>> /user2/type1/<index files>
>> /user3/type3/<index files>
>>
>> C) Often, probably with every iteration, we'll add "types" of data that can
>> be indexed.
>> So we want to have an efficient/programmatic way to add schemas for
>> different "types". We would like to avoid having fixed schema for indexing.
>> I like Lucene's schema-less way of indexing stuff.
>>
>> D) The users can fire search queries which will search either:
>>   - Within a specific "type" for that user
>>   - Across all types for that user: in this case we want to fire a parallel
>>     query like Lucene has (ParallelMultiSearcher:
>>     http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/ParallelMultiSearcher.html)
>>
>> E) We require real time update for the index. *This is a must.*
>>
>> F) We are planning to shard our index across multiple machines. For this
>> also, we want: if a shard becomes inaccessible, only those users whose data
>> reside in that shard get affected. Other users get uninterrupted service.
>>
>> We were considering Lucene, Sphinx and Solr to do this. This is what we
>> found:
>>
>>   - Sphinx: No efficient way to do A, B, C, F. Or is there?
>>   - Lucene: Everything looks possible, as it is very low level. But we have
>>   to write wrappers to do F and build a communication layer between the web
>>   server and the search server.
>>   - Solr: Not sure if we can do A, B, C easily. Can we?
>>
>> So, my question is what is the best software for the above requirements? I
>> am inclined more towards Solr and then Lucene if we get all the
>> requirements.
>>
>> --
>> Regards
>> Shrinath.M
>>
>
>


--
Regards
Shrinath.M
