"If output connectors have access to the access tokens then I am presuming a custom output connector could look and say, "oh this document is accessible to these specific people", but is that a reasonable assumption?"
The problem is that you don't know what is in those access tokens. If you knew beyond question that the only thing you'd ever index was stuff that (for instance) came from SharePoint, maybe you could make it work. But if you add other connection types, then you'd have to modify your output connector for each one.

The other thing you should think about is that access tokens usually correspond to *groups* of users rather than individual users. There is then no obvious mapping that you can use to turn them into a list of corresponding users. I believe that when the SharePoint connector is configured for "Active Directory" authorization, it maps to individual SIDs, but, as you might expect, the list of SIDs for a given document can be quite large, which is why we went to the SharePoint/Native authorization model as our default.

Karl

On Thu, Mar 19, 2015 at 2:43 PM, hank williams <[email protected]> wrote:

> This is *super* helpful. I think perhaps I am seeing how to handle this.
>
> Regarding #2, since our database is proprietary, there would be no
> existing output connection type so in any case we would need to create our
> own.
>
> But #1 is clearly an issue. My first thought is that the answer would be
> to just read everything (not limited by permissions) and then to use a
> custom output connector to "place" copies in the right accounts. If output
> connectors have access to the access tokens then I am presuming a custom
> output connector could look and say, "oh this document is accessible to
> these specific people", but is that a reasonable assumption?
>
> On Thu, Mar 19, 2015 at 2:26 PM, Karl Wright <[email protected]> wrote:
>
>> "So my question is, notwithstanding that this is not the "typical" way
>> ManifoldCF works, can we use it in the way that I am describing? Is it
>> malleable enough to work or is it designed to do something so different
>> from what we need that it would be useless?
>> I guess the key question is
>> really, can we tell ManifoldCF to limit results to those visible to a
>> specific user and would there be any performance or other unexpected
>> downsides to doing that?"
>>
>> Hi Hank,
>>
>> There is nothing specific about the ManifoldCF *framework* that prevents
>> you from doing what you suggest. But there are problems, as follows:
>>
>> (1) Most out-of-the-box repository connection types, including the
>> SharePoint type, do not give you any ability to limit crawls to a specific
>> user. Instead, because they are intended to support a very different
>> security model, they fetch a document's access tokens, which are described
>> by the book chapter I pointed you to.
>> (2) If you modified the SharePoint repository connection type in the
>> manner you suggest, you would still need to create a custom output
>> connection type to drop the content into your per-user database instances.
>> The alternative would be to use an appropriate out-of-the-box output
>> connection type, if there is one, and have N jobs for N users.
>>
>> Hope that answers your question.
>>
>> Karl
>>
>> On Thu, Mar 19, 2015 at 2:15 PM, hank williams <[email protected]> wrote:
>>
>>> Thanks Karl.
>>>
>>> I will most certainly be reading the document you linked to in great
>>> detail. It looks like stuff I need to know.
>>>
>>> That said, we have a given technology that we have developed and that we
>>> will be using. It creates a separate index for each user. The technology
>>> has vastly greater utility than just for SharePoint and it's been in
>>> development for about six years. (In fact this SharePoint thing is a
>>> recent add-on request.)
>>>
>>> So my question is, notwithstanding that this is not the "typical" way
>>> ManifoldCF works, can we use it in the way that I am describing? Is it
>>> malleable enough to work or is it designed to do something so different
>>> from what we need that it would be useless?
>>> I guess the key question is
>>> really, can we tell ManifoldCF to limit results to those visible to a
>>> specific user and would there be any performance or other unexpected
>>> downsides to doing that?
>>>
>>> Hank
>>>
>>> On Thu, Mar 19, 2015 at 1:53 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Hank,
>>>>
>>>> "Our project involves a database that has a private secure user space
>>>> for each user. Our database is built on Lucene and indexes every object in
>>>> the database. Each user presumably has some number of SharePoint sites that
>>>> they have access to. We want to index each SharePoint object (file or
>>>> SharePoint page) as we find it, for each user. The user then ends up with
>>>> an index of just the objects that they have permissions for. But to do
>>>> that we need to, for each user, crawl all of the SharePoint sites that they
>>>> have access to. Permissions to each SharePoint site are managed by
>>>> Kerberos."
>>>>
>>>> This is not the typical ManifoldCF model. In the typical case, there
>>>> is ONE Lucene search engine (not N), and any searches that take place apply
>>>> security restrictions internally based on the user's security information,
>>>> as obtained from the ManifoldCF authority service, which is in turn
>>>> querying SharePoint.
>>>>
>>>> You can read more about the standard authorization setup here:
>>>>
>>>> https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs/MCFiA%20CH%2004.pdf
>>>>
>>>> Karl
>>>>
>>>> On Thu, Mar 19, 2015 at 1:44 PM, hank williams <[email protected]>
>>>> wrote:
>>>>
>>>>> I am embarking on an effort for which ManifoldCF may be an
>>>>> appropriate tool. I am a total noob, having just discovered this project
>>>>> and have a few questions that I am hoping someone can answer so that I can
>>>>> begin to gain some confidence about the way things work. Basically I am
>>>>> trying to make sure I understand, at a top level, how ManifoldCF works.
>>>>>
>>>>> Our project involves a database that has a private secure user space
>>>>> for each user. Our database is built on Lucene and indexes every object in
>>>>> the database. Each user presumably has some number of SharePoint sites
>>>>> that they have access to. We want to index each SharePoint object (file or
>>>>> SharePoint page) as we find it, for each user. The user then ends up with
>>>>> an index of just the objects that they have permissions for. But to do
>>>>> that we need to, for each user, crawl all of the SharePoint sites that they
>>>>> have access to. Permissions to each SharePoint site are managed by
>>>>> Kerberos.
>>>>>
>>>>> So the questions are:
>>>>>
>>>>> a. Can I, with ManifoldCF, take a list of SharePoint sites and a list of
>>>>> users and relevant Kerberos-appropriate authentication tokens or keys
>>>>> (just learning about Kerberos), and get back a list of indexable
>>>>> objects/URIs (HTML, .docx, pptx, etc.)?
>>>>>
>>>>> b. Is this the right way to think about it?
>>>>>
>>>>> c. If so, is there any example code or documentation that would
>>>>> explain how I do this?
>>>>>
>>>>> d. Does ManifoldCF provide any information to help indicate whether
>>>>> the given object has changed, or is that something we need to figure out
>>>>> by manually comparing the old and new documents in our code?
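
Karl's point at the top of the thread, that access tokens usually name groups rather than individual users, can be illustrated with a small sketch. This is not ManifoldCF code and uses no ManifoldCF API. The class `TokenExpansion`, the `expand` method, the token strings, and the group membership map are all hypothetical, and the sketch assumes the output side somehow has a group-to-user map, which the thread suggests it normally would not. Tokens that the map does not cover are reported as unresolved, which is the situation a custom output connector would be in for any connection type it was not written for.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of ManifoldCF: tries to turn a document's
// access tokens into a list of individual users.
final class TokenExpansion {

    /** Users found for a document, plus the tokens that could not be mapped. */
    static final class Result {
        final Set<String> users = new LinkedHashSet<>();
        final List<String> unresolvedTokens = new ArrayList<>();
    }

    /**
     * Expands each access token through a group-to-members map (an assumed
     * input). Tokens missing from the map cannot be turned into users.
     */
    static Result expand(List<String> accessTokens,
                         Map<String, List<String>> groupMembers) {
        Result result = new Result();
        for (String token : accessTokens) {
            List<String> members = groupMembers.get(token);
            if (members == null) {
                // Unknown token format or unknown group: no user mapping.
                result.unresolvedTokens.add(token);
            } else {
                result.users.addAll(members);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up membership data and made-up token strings.
        Map<String, List<String>> groupMembers = Map.of(
            "group:engineering", List.of("alice", "bob"),
            "group:finance", List.of("carol"));
        Result r = expand(
            List.of("group:engineering", "token-from-another-connector"),
            groupMembers);
        System.out.println("users=" + r.users
            + " unresolved=" + r.unresolvedTokens);
    }
}
```

Even in this toy form, the per-document user list depends entirely on membership data that lives outside the crawl, and it goes stale whenever group membership changes, which is one reason the standard model resolves the user's tokens at search time instead.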
