Hi Hank,

I can't really recommend any consulting firms specifically skilled at using bits and pieces of ManifoldCF to build a whole new solution. If you are indexing into Solr, maybe you can contact a Solr consulting firm, e.g. LucidImagination. You *could* try a firm like Zaizi (based in London), but I can't be sure they'd find the job amenable either.

Karl

On Mon, Mar 23, 2015 at 9:43 AM, hank williams <[email protected]> wrote:

> Karl,
>
> At this point it seems like perhaps ManifoldCF may not be the right tool.
>
> I think the best solution is to have our server log into SharePoint using Kerberos or OAuth, and to provide our engine links to the content available to the logged-in user. This is, in essence, a single-user crawl of a SharePoint site, I guess (we are not interested in other data sources). From what I gather based on your responses, ManifoldCF wouldn't help much here, but this does not seem like an extraordinarily complicated task (at least from the perspective of someone who's never played with any of this stuff!).
>
> So my question is, is my assumption that it's not "an extraordinarily complicated task" correct, and if not, are there folks in the ManifoldCF community (or other communities) that you know of who might be available as consultants to create that module?
>
> Best,
> Hank
>
> On Thu, Mar 19, 2015 at 3:43 PM, Karl Wright <[email protected]> wrote:
>
>> "If output connectors have access to the access tokens then I am presuming a custom output connector could look and say, "oh this document is accessible to these specific people", but is that a reasonable assumption?"
>>
>> The problem is that you don't know what is in those access tokens. If you knew beyond question that the only thing you'd ever index was stuff that (for instance) came from SharePoint, maybe you could make it work. But if you add other connection types, then you'd have to modify your output connector for each one.
>>
>> The other thing you should think about is that usually access tokens correspond to *groups* of users rather than individual users. There is no obvious mapping then that you can use to turn that into a list of corresponding users.
>> I believe that when the SharePoint connector is configured for "Active Directory" authorization, it maps to individual SIDs, but as you might expect the list of SIDs for a given document can be quite large, which is why we went to the SharePoint/Native authorization model as our default.
>>
>> Karl
>>
>> On Thu, Mar 19, 2015 at 2:43 PM, hank williams <[email protected]> wrote:
>>
>>> This is *super* helpful. I think perhaps I am seeing how to handle this.
>>>
>>> Regarding #2, since our database is proprietary, there would be no existing output connection type so in any case we would need to create our own.
>>>
>>> But #1 is clearly an issue. My first thought is that the answer would be to just read everything (not limited by permissions) and then to use a custom output connector to "place" copies in the right accounts. If output connectors have access to the access tokens then I am presuming a custom output connector could look and say, "oh this document is accessible to these specific people", but is that a reasonable assumption?
>>>
>>> On Thu, Mar 19, 2015 at 2:26 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> "So my question is, notwithstanding that this is not the "typical" way ManifoldCF works, can we use it in the way that I am describing. Is it malleable enough to work or is it designed to do something so different from what we need that it would be useless. I guess the key question is really, can we tell ManifoldCF to limit results to those visible to a specific user and would there be any performance or other unexpected downsides to doing that."
>>>>
>>>> Hi Hank,
>>>>
>>>> There is nothing specific about the ManifoldCF *framework* that prevents you from doing what you suggest.
>>>> But there are problems, as follows:
>>>>
>>>> (1) Most out-of-the-box repository connection types, including the SharePoint type, do not give you any ability to limit crawls to a specific user. Instead, because they are intended to support a very different security model, they fetch a document's access tokens, which are described by the book chapter I pointed you to.
>>>> (2) If you modified the SharePoint repository connection type in the manner you suggest, you would still need to create a custom output connection type to drop the content into your per-user database instances. The alternative would be to use an appropriate out-of-the-box output connection type, if there is one, and have N jobs for N users.
>>>>
>>>> Hope that answers your question.
>>>>
>>>> Karl
>>>>
>>>> On Thu, Mar 19, 2015 at 2:15 PM, hank williams <[email protected]> wrote:
>>>>
>>>>> Thanks Karl.
>>>>>
>>>>> I will most certainly be reading the document you linked to in great detail. It looks like stuff I need to know.
>>>>>
>>>>> That said, we have a given technology that we have developed and that we will be using. It creates a separate index for each user. The technology has vastly greater utility than just for SharePoint and it's been in development for about six years (in fact this SharePoint thing is a recent add-on request).
>>>>>
>>>>> So my question is, notwithstanding that this is not the "typical" way ManifoldCF works, can we use it in the way that I am describing. Is it malleable enough to work or is it designed to do something so different from what we need that it would be useless. I guess the key question is really, can we tell ManifoldCF to limit results to those visible to a specific user and would there be any performance or other unexpected downsides to doing that.
>>>>>
>>>>> Hank
>>>>>
>>>>> On Thu, Mar 19, 2015 at 1:53 PM, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Hi Hank,
>>>>>>
>>>>>> "Our project involves a database that has a private secure user space for each user. Our database is built on Lucene and indexes every object in the database. Each user presumably has some number of SharePoint sites that they have access to. We want to index each SharePoint object (file or SharePoint page) as we find it, for each user. The user then ends up with an index of just the objects that they have permissions for. But to do that we need to, for each user, crawl all of the SharePoint sites that they have access to. Permissions to each SharePoint site are managed by Kerberos."
>>>>>>
>>>>>> This is not the typical ManifoldCF model. In the typical case, there is ONE Lucene search engine (not N), and any searches that take place apply security restrictions internally based on the user's security information, as obtained from the ManifoldCF authority service, which is in turn querying SharePoint.
>>>>>>
>>>>>> You can read more about the standard authorization setup here:
>>>>>>
>>>>>> https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs/MCFiA%20CH%2004.pdf
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Thu, Mar 19, 2015 at 1:44 PM, hank williams <[email protected]> wrote:
>>>>>>
>>>>>>> I am embarking on an effort for which ManifoldCF may be an appropriate tool. I am a total noob, having just discovered this project, and have a few questions that I am hoping someone can answer so that I can begin to gain some confidence about the way things work. Basically I am trying to make sure I understand, at a top level, how ManifoldCF works.
>>>>>>>
>>>>>>> Our project involves a database that has a private secure user space for each user. Our database is built on Lucene and indexes every object in the database. Each user presumably has some number of SharePoint sites that they have access to. We want to index each SharePoint object (file or SharePoint page) as we find it, for each user. The user then ends up with an index of just the objects that they have permissions for. But to do that we need to, for each user, crawl all of the SharePoint sites that they have access to. Permissions to each SharePoint site are managed by Kerberos.
>>>>>>>
>>>>>>> So the questions are:
>>>>>>>
>>>>>>> a. Can I, with ManifoldCF, take a list of SharePoint sites and a list of users and relevant Kerberos appropriate authentication tokens or keys (just learning about Kerberos), and get back a list of indexable objects/URIs (HTML, .docx, .pptx, etc.)?
>>>>>>>
>>>>>>> b. Is this the right way to think about it?
>>>>>>>
>>>>>>> c. If so, is there any example code or documentation that would explain how I do this?
>>>>>>>
>>>>>>> d. Does ManifoldCF provide any information to help indicate whether the given object has changed, or is that something we need to figure out by manually comparing the old and new documents in our code?
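
The point made in the thread, that access tokens usually name *groups* of users rather than individual users, can be illustrated with a minimal sketch. Everything in it is hypothetical: the token strings, `DOC_TOKENS`, `GROUP_MEMBERS`, `fan_out`, and `visible_docs` are invented for illustration and are not part of the ManifoldCF or SharePoint APIs, and deny tokens are ignored for simplicity. It contrasts copying documents into per-user indexes (which needs a group-to-user mapping from somewhere else) with the single-index model, where results are filtered at query time by the user's own tokens.

```python
from collections import defaultdict

# Hypothetical data; none of these names come from ManifoldCF or SharePoint.
# Each document carries "allow" access tokens, which typically name groups.
DOC_TOKENS = {
    "sites/hr/policy.docx": {"group:hr", "group:managers"},
    "sites/eng/design.pptx": {"group:eng"},
}

# A group-to-user mapping. An output connector would not get this for free;
# it is assumed here to come from some separate directory lookup.
GROUP_MEMBERS = {
    "group:hr": {"alice"},
    "group:managers": {"alice", "bob"},
    "group:eng": {"bob", "carol"},
}


def fan_out(doc_tokens, group_members):
    """Per-user model: expand each document's group tokens into users."""
    per_user = defaultdict(set)
    for doc, tokens in doc_tokens.items():
        for token in tokens:
            # A token with no known membership cannot be expanded, so the
            # document is not placed anywhere on its account.
            for user in group_members.get(token, ()):
                per_user[user].add(doc)
    return dict(per_user)


def visible_docs(user_tokens, doc_tokens):
    """Single-index model: filter at query time using the user's tokens."""
    return {doc for doc, tokens in doc_tokens.items() if tokens & user_tokens}


if __name__ == "__main__":
    print(fan_out(DOC_TOKENS, GROUP_MEMBERS))
    print(visible_docs({"group:eng"}, DOC_TOKENS))
```

In the per-user model, any change in group membership means documents have to be re-placed, whereas in the single-index model only the user's token set changes at query time.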
