Re: Hopefully a few simple questions about ManifoldCF and SharePoint

Karl Wright Mon, 23 Mar 2015 07:13:21 -0700

Hi Hank,

I can't really recommend any consulting firms specifically skilled with
using bits and pieces of ManifoldCF to build a whole new solution.  If you
are indexing into Solr, maybe you can contact a Solr consulting firm, e.g.
LucidImagination etc.  You *could* try a firm like Zaizi (based in London),
but I can't be sure they'd find the job amenable either.


Karl

On Mon, Mar 23, 2015 at 9:43 AM, hank williams <[email protected]> wrote:

> Karl,
>
> At this point it seems like perhaps ManifoldCF may not be the right tool.
>
> I think the best solution is to have our server log into SharePoint using
> Kerberos or OAuth, and to provide our engine with links to the content
> available to the logged-in user. This is, in essence, a single-user crawl of
> a SharePoint site, I guess (we are not interested in other data sources).
> From what I gather based on your responses, ManifoldCF wouldn't help much
> here, but this does not seem like an extraordinarily complicated task (at
> least from the perspective of someone who's never played with any of this
> stuff!).
>
> So my question is: is my assumption that it's not "an extraordinarily
> complicated task" correct? And if not, are there folks in the ManifoldCF
> community (or other communities) that you know of who might be available as
> consultants to create that module?
>
> Best,
> Hank
>
> On Thu, Mar 19, 2015 at 3:43 PM, Karl Wright <[email protected]> wrote:
>
>> "If output connectors have access to the access tokens then I am
>> presuming a custom output connector could look and say, "oh this document
>> is accessible to these specific people", but is that a reasonable
>> assumption?"
>>
>> The problem is that you don't know what is in those access tokens.  If
>> you knew beyond question that the only thing you'd ever index was stuff
>> that (for instance) came from SharePoint, maybe you could make it work.
>> But if you add other connection types, then you'd have to modify your
>> output connector for each one.
>>
>> The other thing you should think about is that usually access tokens
>> correspond to *groups* of users rather than individual users.  There is no
>> obvious mapping then that you can use to turn that into a list of
>> corresponding users.  I believe that when the SharePoint connector is
>> configured for "Active Directory" authorization, it maps to individual
>> SIDs, but as you might expect the list of SIDs for a given document can be
>> quite large, which is why we went to the SharePoint/Native authorization
>> model as our default.
>>
>> Karl
>>
>>
>> On Thu, Mar 19, 2015 at 2:43 PM, hank williams <[email protected]> wrote:
>>
>>> This is *super* helpful. I think perhaps I am seeing how to handle this.
>>>
>>> Regarding #2, since our database is proprietary, there would be no
>>> existing output connection type so in any case we would need to create our
>>> own.
>>>
>>> But #1 is clearly an issue. My first thought is that the answer would be
>>> to just read everything (not limited by permissions) and then to use a
>>> custom output connector to "place" copies in the right accounts. If output
>>> connectors have access to the access tokens then I am presuming a custom
>>> output connector could look and say, "oh this document is accessible to
>>> these specific people", but is that a reasonable assumption?
>>>
>>>
>>> On Thu, Mar 19, 2015 at 2:26 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> "So my question is, notwithstanding that this is not the "typical" way
>>>> ManifoldCF works, can we use it in the way that I am describing? Is it
>>>> malleable enough to work, or is it designed to do something so different
>>>> from what we need that it would be useless? I guess the key question is
>>>> really: can we tell ManifoldCF to limit results to those visible to a
>>>> specific user, and would there be any performance or other unexpected
>>>> downsides to doing that?"
>>>>
>>>> Hi Hank,
>>>>
>>>> There is nothing specific about the ManifoldCF *framework* that
>>>> prevents you from doing what you suggest.  But there are problems, as
>>>> follows:
>>>>
>>>> (1) Most out-of-the-box repository connection types, including the
>>>> SharePoint type, do not give you any ability to limit crawls to a specific
>>>> user.  Instead, because they are intended to support a very different
>>>> security model, they fetch a document's access tokens, which are described
>>>> by the book chapter I pointed you to.
>>>> (2) If you modified the SharePoint repository connection type in the
>>>> manner you suggest, you would still need to create a custom output
>>>> connection type to drop the content into your per-user database instances.
>>>> The alternative would be to use an appropriate out-of-the-box output
>>>> connection type, if there is one, and have N jobs for N users.
>>>>
>>>> Hope that answers your question.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Thu, Mar 19, 2015 at 2:15 PM, hank williams <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks Karl.
>>>>>
>>>>> I will most certainly be reading the document you linked to in great
>>>>> detail. It looks like stuff I need to know.
>>>>>
>>>>> That said, we have a given technology that we have developed and that
>>>>> we will be using. It creates a separate index for each user. The
>>>>> technology has vastly greater utility than just for SharePoint, and it's
>>>>> been in development for about six years. (In fact, this SharePoint thing
>>>>> is a recent add-on request.)
>>>>>
>>>>> So my question is, notwithstanding that this is not the "typical" way
>>>>> ManifoldCF works, can we use it in the way that I am describing? Is it
>>>>> malleable enough to work, or is it designed to do something so different
>>>>> from what we need that it would be useless? I guess the key question is
>>>>> really: can we tell ManifoldCF to limit results to those visible to a
>>>>> specific user, and would there be any performance or other unexpected
>>>>> downsides to doing that?
>>>>>
>>>>> Hank
>>>>>
>>>>>
>>>>> On Thu, Mar 19, 2015 at 1:53 PM, Karl Wright <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Hank,
>>>>>>
>>>>>> "Our project involves a database that has a private secure user
>>>>>> space for each user. Our database is built on Lucene and indexes every
>>>>>> object in the database. Each user presumably has some number of
>>>>>> SharePoint sites that they have access to. We want to index each
>>>>>> SharePoint object (file or SharePoint page) as we find it, for each
>>>>>> user. The user then ends up with an index of just the objects that they
>>>>>> have permissions for. But to do that we need to, for each user, crawl
>>>>>> all of the SharePoint sites that they have access to. Permissions to
>>>>>> each SharePoint site are managed by Kerberos."
>>>>>>
>>>>>> This is not the typical ManifoldCF model.  In the typical case, there
>>>>>> is ONE Lucene search engine (not N), and any searches that take place
>>>>>> apply security restrictions internally based on the user's security
>>>>>> information, as obtained from the ManifoldCF authority service, which
>>>>>> is in turn querying SharePoint.
>>>>>>
>>>>>> You can read more about the standard authorization setup here:
>>>>>>
>>>>>>
>>>>>> https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs/MCFiA%20CH%2004.pdf
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 19, 2015 at 1:44 PM, hank williams <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I am embarking on an effort for which ManifoldCF may be an
>>>>>>> appropriate tool. I am a total noob, having just discovered this
>>>>>>> project, and I have a few questions that I am hoping someone can answer
>>>>>>> so that I can begin to gain some confidence about the way things work.
>>>>>>> Basically I am trying to make sure I understand, at a top level, how
>>>>>>> ManifoldCF works.
>>>>>>>
>>>>>>> Our project involves a database that has a private secure user space
>>>>>>> for each user. Our database is built on Lucene and indexes every object
>>>>>>> in the database. Each user presumably has some number of SharePoint
>>>>>>> sites that they have access to. We want to index each SharePoint object
>>>>>>> (file or SharePoint page) as we find it, for each user. The user then
>>>>>>> ends up with an index of just the objects that they have permissions
>>>>>>> for. But to do that we need to, for each user, crawl all of the
>>>>>>> SharePoint sites that they have access to. Permissions to each
>>>>>>> SharePoint site are managed by Kerberos.
>>>>>>>
>>>>>>> So the questions are:
>>>>>>>
>>>>>>> a. Can I, with ManifoldCF, take a list of SharePoint sites and a list
>>>>>>> of users with the appropriate Kerberos authentication tokens or keys
>>>>>>> (just learning about Kerberos), and get back a list of indexable
>>>>>>> objects/URIs (HTML, .docx, .pptx, etc.)?
>>>>>>>
>>>>>>> b. Is this the right way to think about it?
>>>>>>>
>>>>>>> c. If so, is there any example code or documentation that would
>>>>>>> explain how I do this?
>>>>>>>
>>>>>>> d. Does ManifoldCF provide any information to help indicate whether
>>>>>>> the given object has changed, or is that something we need to figure
>>>>>>> out by manually comparing the old and new documents in our code?
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
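
The two approaches discussed in this thread can be contrasted with a small
sketch: the typical model, where one shared index checks a user's access
tokens at query time, and the proposed fan-out, where an output connector
decides at index time which per-user indexes receive a copy. Everything here
is hypothetical and for illustration only. None of the class or method names
are ManifoldCF APIs, the token strings are made up, and the allow/deny check
is simplified to a single level on the assumption that a matching deny token
overrides any allow token.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; these names are NOT part of ManifoldCF.
final class AccessTokenSketch {

    // A crawled document together with the opaque access tokens attached to it.
    static final class Doc {
        final String uri;
        final Set<String> allowTokens;
        final Set<String> denyTokens;

        Doc(String uri, Set<String> allowTokens, Set<String> denyTokens) {
            this.uri = uri;
            this.allowTokens = allowTokens;
            this.denyTokens = denyTokens;
        }
    }

    // Query-time check (typical model): the document is visible when the user
    // holds no deny token and at least one allow token. Assumed semantics.
    static boolean isVisible(Doc doc, Set<String> userTokens) {
        for (String token : doc.denyTokens) {
            if (userTokens.contains(token)) {
                return false;
            }
        }
        for (String token : doc.allowTokens) {
            if (userTokens.contains(token)) {
                return true;
            }
        }
        return false;
    }

    // Index-time fan-out (proposed model): list the users whose private index
    // should receive the document. This needs the tokens of EVERY user up
    // front, because tokens usually name groups rather than individuals.
    static Set<String> usersWhoCanSee(Doc doc, Map<String, Set<String>> tokensByUser) {
        Set<String> users = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : tokensByUser.entrySet()) {
            if (isVisible(doc, entry.getValue())) {
                users.add(entry.getKey());
            }
        }
        return users;
    }

    public static void main(String[] args) {
        Doc doc = new Doc("http://sharepoint.example/sites/a/report.docx",
                Set.of("group:finance"), Set.of("group:contractors"));
        Map<String, Set<String>> tokensByUser = Map.of(
                "alice", Set.of("group:finance"),
                "bob", Set.of("group:finance", "group:contractors"),
                "carol", Set.of("group:sales"));
        System.out.println(usersWhoCanSee(doc, tokensByUser)); // prints [alice]
    }
}
```

The fan-out variant only works if something can supply the per-user token
map, and that map changes whenever group membership changes, which is one
reason the thread describes it as a poor fit for the out-of-the-box
connectors.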
