I've worked on something similar: the data set was 100M documents with
thousands of users. Ranking is relative within each index, e.g. whatever ranks
#1, #2, #3 is only 1, 2, 3 in that index, so positions and scores from two
separate indexes are not directly comparable.

Your challenge will be in the result display in the user interface: how to
merge results so that relevant results are shown before non-relevant ones.

There are numerous ways to merge (you could even retrieve, merge, index, and
retrieve from that), but computing power aside, that's not efficient.
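
One of those ways, as a rough sketch only: min-max normalize the scores from
each index, drop public hits for documents the user has checked out, and sort
the union. The function names, the (doc_id, score) tuple shape, and the
checked_out set are assumptions for illustration, not anything Solr provides,
and min-max scaling is a crude stand-in for a real score-fusion strategy.

```python
from typing import Dict, List, Set, Tuple

Hit = Tuple[str, float]  # (doc_id, raw score from one index)


def normalize(hits: List[Hit]) -> Dict[str, float]:
    """Min-max scale raw scores to [0, 1] so two indexes can be compared."""
    if not hits:
        return {}
    scores = [score for _, score in hits]
    low, high = min(scores), max(scores)
    span = high - low
    # If every score is identical, treat all hits as equally relevant.
    return {doc_id: 1.0 if span == 0 else (score - low) / span
            for doc_id, score in hits}


def merge_results(
    public_hits: List[Hit],
    personal_hits: List[Hit],
    checked_out: Set[str],
) -> List[Tuple[str, float, str]]:
    """Merge two ranked lists; checked-out docs never come from public."""
    personal = normalize(personal_hits)
    public = normalize(public_hits)
    merged = [(doc_id, score, "personal") for doc_id, score in personal.items()]
    merged += [(doc_id, score, "public")
               for doc_id, score in public.items()
               if doc_id not in checked_out]
    # Highest normalized score first; ties broken by doc id for stable output.
    return sorted(merged, key=lambda hit: (-hit[1], hit[0]))
```

Note that the lowest hit in each list always normalizes to 0, which is one
reason this kind of merge is only an approximation of a single-index ranking.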

You could consider the two indexes not as public and private but as a metadata
index (fields indexed only, not stored) and a data index (indexed and stored
values). This way you'll get your ranking without having to compromise. Once
you have your doc IDs, you can retrieve the documents from a data index /
read-only Solr cluster or a scalable persistent store (Cassandra, Mongo, etc.)
that would scale far better than Solr itself for thousands if not millions of
users (please let's not start a debate about this).
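
A minimal sketch of that two-step flow, assuming the search step returns only
ranked doc IDs and the store supports a bulk lookup by ID. Both callables and
the in-memory STORE are hypothetical stand-ins; in practice they would wrap a
Solr query and a Cassandra or Mongo read.

```python
from typing import Callable, Dict, List


def search_then_fetch(
    search_ids: Callable[[str], List[str]],
    fetch_docs: Callable[[List[str]], Dict[str, dict]],
    query: str,
) -> List[dict]:
    """Rank in the metadata index, then load bodies from the data store."""
    ranked_ids = search_ids(query)       # stand-in for an ids-only index query
    docs_by_id = fetch_docs(ranked_ids)  # stand-in for a bulk get from the store
    # Preserve the ranking from the index; skip ids the store no longer has.
    return [docs_by_id[doc_id] for doc_id in ranked_ids if doc_id in docs_by_id]


# Hypothetical in-memory stand-ins, for illustration only.
STORE = {
    "d1": {"id": "d1", "title": "Design notes"},
    "d2": {"id": "d2", "title": "Release plan"},
}


def fake_search(query: str) -> List[str]:
    return ["d2", "d1", "d9"]  # "d9" is indexed but missing from the store


def fake_fetch(doc_ids: List[str]) -> Dict[str, dict]:
    return {doc_id: STORE[doc_id] for doc_id in doc_ids if doc_id in STORE}


results = search_then_fetch(fake_search, fake_fetch, "plan")
```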

This way your users would have relevant results and fast access to the index,
and the data would be protected, provided you filter by the doc owner ID as an
"or" query in addition to doc owner ID = 'public'. What you lose by not getting
the document data from the initial query you can retrieve asynchronously, or
maybe "join" with another collection, which I've not done but I know is
possible.
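
The owner filter could be built along these lines. This is a sketch only: the
field name owner_id and the helper names are assumptions, while q, fq, fl and
rows are standard Solr request parameters. The user ID should come from the
authenticated session, never from the client request.

```python
from typing import Dict


def quote_term(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_search_params(user_query: str, user_id: str, rows: int = 10) -> Dict[str, str]:
    """Build request params that restrict hits to public docs or the user's own."""
    # owner_id is a hypothetical field holding either "public" or a user id.
    owner_filter = f"owner_id:({quote_term('public')} OR {quote_term(user_id)})"
    return {
        "q": user_query,
        "fq": owner_filter,   # filter query: restricts hits without affecting score
        "fl": "id,score",     # fetch ids only; document bodies come from the data store
        "rows": str(rows),
    }


params = build_search_params("solr ranking", "steve")
```

Keeping the restriction in fq rather than in q means it does not influence
relevance scoring.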

You may also want to consider the CQRS pattern for doc check-in / check-out
actions to keep indexing and query time scalable. It may be more work but it's
more scalable. Go big or go home. ;)
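
A toy illustration of the CQRS idea, assuming an in-process queue between the
write side and the read side. All class and field names here are hypothetical;
a real setup would put a durable queue or log in the middle and have the read
side update the search index.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict


@dataclass(frozen=True)
class DocEvent:
    kind: str      # "checked_out" or "checked_in"
    doc_id: str
    user_id: str


class CommandSide:
    """Write path: validates the action and records an event, nothing else."""

    def __init__(self, events: Deque[DocEvent]) -> None:
        self.events = events
        self.checked_out_by: Dict[str, str] = {}

    def check_out(self, doc_id: str, user_id: str) -> None:
        if doc_id in self.checked_out_by:
            raise ValueError(f"{doc_id} is already checked out")
        self.checked_out_by[doc_id] = user_id
        self.events.append(DocEvent("checked_out", doc_id, user_id))

    def check_in(self, doc_id: str, user_id: str) -> None:
        if self.checked_out_by.get(doc_id) != user_id:
            raise ValueError(f"{doc_id} is not checked out by {user_id}")
        del self.checked_out_by[doc_id]
        self.events.append(DocEvent("checked_in", doc_id, user_id))


class QuerySide:
    """Read path: a projection of document ownership, updated from events."""

    def __init__(self) -> None:
        self.owner_of: Dict[str, str] = {}

    def apply_pending(self, events: Deque[DocEvent]) -> None:
        # Drain the queue; in a real system this runs asynchronously.
        while events:
            event = events.popleft()
            checked_out = event.kind == "checked_out"
            self.owner_of[event.doc_id] = event.user_id if checked_out else "public"

    def owner(self, doc_id: str) -> str:
        return self.owner_of.get(doc_id, "public")
```

The read side only catches up when apply_pending runs, so queries can briefly
see stale ownership; that lag is the price paid for keeping writes cheap.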

Hope it helps

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Mar 18, 2018, 11:14 AM -0400, Steven White <swhite4...@gmail.com>, wrote:
> Hi everyone,
>
> I have a design problem that I'm not sure how best to solve, so I figured
> I'd share it here and see what ideas others may have.
>
> I have a DB that holds documents (over 1 million and growing). This is
> known as the "Public" DB that holds documents visible to all of my end
> users.
>
> My application lets users "check-out" one or more documents at a time off
> this "Public" DB, edit them and "check-in" back into the "Public" DB. When
> a document is checked-out, it goes into a "Personal" DB for that user (and
> the document in the "Public" DB is flagged as such to alert other users.)
> The owner of this checked-out document in the "Personal" DB can make
> changes to the document and save it back into the "Personal" DB as often as
> he wants to. Sometimes the document lives in the "Personal" DB for a few
> minutes before it is checked-in back into the "Public" DB and sometimes it
> can live in the "Personal" DB for 1 day or 1 month. When a document is
> saved into the "Personal" DB, only the owner of that document can see it.
>
> Currently there are 100 users but this will grow to at least 500 or maybe
> even 1000.
>
> I'm looking at a solution on how to enable a full text search on those
> documents, both in the "Public" and "Personal" DB so that:
>
> 1) Documents in the "Public" DB are searchable by all users. This is the
> easy part.
>
> 2) Documents in the "Personal" DB of each user are searchable by the owner
> of that "Personal" DB. This is easy too.
>
> 3) A user can search both the "Public" and "Personal" DB at any time but if
> a document is in the "Personal" DB, we will not search it in the "Public" --
> i.e.: whatever is in "Personal" DB takes over what's in the "Public" DB.
>
> Item #3 is important and is what I'm trying to solve. The goal is to give
> hits to the user on documents that they are editing (in their "Personal"
> DB) instead of that in the "Public".
>
> The way I'm thinking to solve this problem is to create 2 Solr indexes (do
> we call those "cores"?):
>
> 1) The "Public" DB is indexed into the "Public" Solr index.
>
> 2) The "Personal" DB is indexed into the "Personal" Solr index with a field
> indicating the owner of that document.
>
> With the above 2 indexes, I can now send the user's search syntax to both
> indexes but for the "Public", I will also send a list of IDs (those
> documents in the user's "Personal" DB) to exclude from the result set.
> This way, I let a user search both the "Public" and "Personal" DB such that
> the documents in the "Personal" DB are included in the search and are
> excluded from the "Public" DB.
>
> Did I make sense? If so, is this doable? Will ranking be affected given
> that I'm searching 2 indexes?
>
> Let me know what issues I might be overlooking with this solution.
>
> Thanks
>
> Steve
