Re: Hopefully a few simple questions about ManifoldCF and SharePoint

hank williams Mon, 23 Mar 2015 08:00:07 -0700

Thanks Karl.

We're not indexing into Solr. It's our own technology. What we are really
looking for, it sounds like to me, is "SharePoint-from-Java" experience and
experience writing web apps that talk to SharePoint.


Well, I'll just keep looking.

Best,
Hank

On Mon, Mar 23, 2015 at 10:10 AM, Karl Wright <[email protected]> wrote:

> Hi Hank,
>
> I can't really recommend any consulting firms specifically skilled with
> using bits and pieces of ManifoldCF to build a whole new solution.  If you
> are indexing into Solr, maybe you can contact a Solr consulting firm, e.g.
> LucidImagination etc.  You *could* try a firm like Zaizi (based in London),
> but I can't be sure they'd find the job amenable either.
>
> Karl
>
> On Mon, Mar 23, 2015 at 9:43 AM, hank williams <[email protected]> wrote:
>
>> Karl,
>>
>> At this point it seems like perhaps ManifoldCF may not be the right tool.
>>
>> I think the best solution is to have our server log into SharePoint using
>> Kerberos or OAuth, and to provide our engine links to the content available
>> to the logged in user. This is, in essence, a single-user crawl of a
>> SharePoint site, I guess (we are not interested in other data sources). From
>> what I gather based on your responses, ManifoldCF wouldn't help much here,
>> but this does not seem like an extraordinarily complicated task (at least
>> from the perspective of someone who's never played with any of this stuff!).
>>
>> So my question is: is my assumption that it's not "an extraordinarily
>> complicated task" correct? And if not, are there folks in the ManifoldCF
>> community (or other communities) that you know of who might be available as
>> consultants to create that module?
>>
>> Best,
>> Hank
>>
>> On Thu, Mar 19, 2015 at 3:43 PM, Karl Wright <[email protected]> wrote:
>>
>>> "If output connectors have access to the access tokens then I am
>>> presuming a custom output connector could look and say, "oh this document
>>> is accessible to these specific people", but is that a reasonable
>>> assumption?"
>>>
>>> The problem is that you don't know what is in those access tokens.  If
>>> you knew beyond question that the only thing you'd ever index was stuff
>>> that (for instance) came from SharePoint, maybe you could make it work.
>>> But if you add other connection types, then you'd have to modify your
>>> output connector for each one.
>>>
>>> The other thing you should think about is that usually access tokens
>>> correspond to *groups* of users rather than individual users.  There is no
>>> obvious mapping then that you can use to turn that into a list of
>>> corresponding users.  I believe that when the SharePoint connector is
>>> configured for "Active Directory" authorization, it maps to individual
>>> SIDs, but as you might expect the list of SIDs for a given document can be
>>> quite large, which is why we went to the SharePoint/Native authorization
>>> model as our default.
>>>
>>> Karl
>>>
>>>
>>> On Thu, Mar 19, 2015 at 2:43 PM, hank williams <[email protected]>
>>> wrote:
>>>
>>>> This is *super* helpful. I think perhaps I am seeing how to handle this.
>>>>
>>>> Regarding #2, since our database is proprietary, there would be no
>>>> existing output connection type so in any case we would need to create our
>>>> own.
>>>>
>>>> But #1 is clearly an issue. My first thought is that the answer would
>>>> be to just read everything (not limited by permissions) and then to use a
>>>> custom output connector to "place" copies in the right accounts. If output
>>>> connectors have access to the access tokens then I am presuming a custom
>>>> output connector could look and say, "oh this document is accessible to
>>>> these specific people", but is that a reasonable assumption?
>>>>
>>>>
>>>> On Thu, Mar 19, 2015 at 2:26 PM, Karl Wright <[email protected]>
>>>> wrote:
>>>>
>>>>> "So my question is, notwithstanding that this is not the "typical"
>>>>> way ManifoldCF works, can we use it in the way that I am describing? Is it
>>>>> malleable enough to work, or is it designed to do something so different
>>>>> from what we need that it would be useless? I guess the key question is
>>>>> really, can we tell ManifoldCF to limit results to those visible to a
>>>>> specific user, and would there be any performance or other unexpected
>>>>> downsides to doing that?"
>>>>>
>>>>> Hi Hank,
>>>>>
>>>>> There is nothing specific about the ManifoldCF *framework* that
>>>>> prevents you from doing what you suggest.  But there are problems, as
>>>>> follows:
>>>>>
>>>>> (1) Most out-of-the-box repository connection types, including the
>>>>> SharePoint type, do not give you any ability to limit crawls to a specific
>>>>> user.  Instead, because they are intended to support a very different
>>>>> security model, they fetch a document's access tokens, which are described
>>>>> by the book chapter I pointed you to.
>>>>> (2) If you modified the SharePoint repository connection type in the
>>>>> manner you suggest, you would still need to create a custom output
>>>>> connection type to drop the content into your per-user database instances.
>>>>> The alternative would be to use an appropriate out-of-the-box output
>>>>> connection type, if there is one, and have N jobs for N users.
>>>>>
>>>>> Hope that answers your question.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Mar 19, 2015 at 2:15 PM, hank williams <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks Karl.
>>>>>>
>>>>>> I will most certainly be reading the document you linked to in great
>>>>>> detail. It looks like stuff I need to know.
>>>>>>
>>>>>> That said, we have a given technology that we have developed and that
>>>>>> we will be using. It creates a separate index for each user. The
>>>>>> technology has vastly greater utility than just for SharePoint, and it
>>>>>> has been in development for about six years. (In fact, this SharePoint
>>>>>> thing is a recent add-on request.)
>>>>>>
>>>>>> So my question is, notwithstanding that this is not the "typical" way
>>>>>> ManifoldCF works, can we use it in the way that I am describing? Is it
>>>>>> malleable enough to work, or is it designed to do something so different
>>>>>> from what we need that it would be useless? I guess the key question is
>>>>>> really, can we tell ManifoldCF to limit results to those visible to a
>>>>>> specific user, and would there be any performance or other unexpected
>>>>>> downsides to doing that?
>>>>>>
>>>>>> Hank
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 19, 2015 at 1:53 PM, Karl Wright <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Hank,
>>>>>>>
>>>>>>> "Our project involves a database that has a private secure user
>>>>>>> space for each user. Our database is built on Lucene and indexes every
>>>>>>> object in the database. Each user presumably has some number of
>>>>>>> SharePoint sites that they have access to. We want to index each
>>>>>>> SharePoint object (file or SharePoint page) as we find it, for each
>>>>>>> user. The user then ends up with an index of just the objects that they
>>>>>>> have permissions for. But to do that we need to, for each user, crawl
>>>>>>> all of the SharePoint sites that they have access to. Permissions to
>>>>>>> each SharePoint site are managed by Kerberos."
>>>>>>>
>>>>>>> This is not the typical ManifoldCF model.  In the typical case,
>>>>>>> there is ONE Lucene search engine (not N), and any searches that take
>>>>>>> place apply security restrictions internally based on the user's
>>>>>>> security information, as obtained from the ManifoldCF authority service,
>>>>>>> which is in turn querying SharePoint.
>>>>>>>
>>>>>>> You can read more about the standard authorization setup here:
>>>>>>>
>>>>>>>
>>>>>>> https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs/MCFiA%20CH%2004.pdf
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Mar 19, 2015 at 1:44 PM, hank williams <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I am embarking on an effort for which ManifoldCF may be an
>>>>>>>> appropriate tool. I am a total noob, having just discovered this
>>>>>>>> project, and have a few questions that I am hoping someone can answer
>>>>>>>> so that I can begin to gain some confidence about the way things work.
>>>>>>>> Basically I am trying to make sure I understand, at a top level, how
>>>>>>>> ManifoldCF works.
>>>>>>>>
>>>>>>>> Our project involves a database that has a private secure user
>>>>>>>> space for each user. Our database is built on Lucene and indexes every
>>>>>>>> object in the database. Each user presumably has some number of
>>>>>>>> SharePoint sites that they have access to. We want to index each
>>>>>>>> SharePoint object (file or SharePoint page) as we find it, for each
>>>>>>>> user. The user then ends up with an index of just the objects that they
>>>>>>>> have permissions for. But to do that we need to, for each user, crawl
>>>>>>>> all of the SharePoint sites that they have access to. Permissions to
>>>>>>>> each SharePoint site are managed by Kerberos.
>>>>>>>>
>>>>>>>> So the questions are:
>>>>>>>>
>>>>>>>> a. Can I, with ManifoldCF, take a list of SharePoint sites and a list
>>>>>>>> of users and the relevant Kerberos authentication tokens or keys
>>>>>>>> (just learning about Kerberos), and get back a list of indexable
>>>>>>>> objects/URIs (HTML, .docx, .pptx, etc.)?
>>>>>>>>
>>>>>>>> b. Is this the right way to think about it?
>>>>>>>>
>>>>>>>> c. If so, is there any example code or documentation that would
>>>>>>>> explain how I do this?
>>>>>>>>
>>>>>>>> d. Does ManifoldCF provide any information to help indicate whether
>>>>>>>> the given object has changed, or is that something we need to figure
>>>>>>>> out by manually comparing the old and new documents in our code?
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
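
For readers following the thread: the "typical" model Karl describes (one
shared index, with security applied at search time from access tokens) can be
illustrated roughly as below. This is only a sketch under stated assumptions.
It does not use the ManifoldCF API; the class name, method name, and token
strings are all hypothetical, and it assumes a simplified rule in which a
document carries "allow" and "deny" token sets and a user is described by the
set of tokens returned by an authority lookup.

```java
import java.util.Set;

// Hypothetical illustration only; none of these names come from the ManifoldCF API.
class AccessTokenFilter {

    // Assumed rule: a document is visible when the user holds at least one of
    // its "allow" tokens and none of its "deny" tokens.
    static boolean isVisible(Set<String> userTokens,
                             Set<String> allowTokens,
                             Set<String> denyTokens) {
        boolean allowed = allowTokens.stream().anyMatch(userTokens::contains);
        boolean denied = denyTokens.stream().anyMatch(userTokens::contains);
        return allowed && !denied;
    }

    public static void main(String[] args) {
        // Tokens usually stand for groups rather than individual users.
        Set<String> userTokens = Set.of("group:engineering", "group:staff");

        // Document readable by engineering, with no deny tokens.
        System.out.println(isVisible(userTokens, Set.of("group:engineering"), Set.of()));

        // Document readable by staff but explicitly denied to engineering.
        System.out.println(isVisible(userTokens, Set.of("group:staff"), Set.of("group:engineering")));
    }
}
```

Because the check runs per query against one index, nothing needs to be
copied per user. That is the main contrast with the index-per-user design
discussed in the thread, where tokens that stand for groups would first have
to be expanded into lists of individual users.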
