Thanks Karl. We're not indexing into Solr; it's our own technology. It sounds to me like what we are really looking for is "SharePoint-from-Java" experience, i.e. someone who has written web apps that talk to SharePoint.
Well, I'll just keep looking.

Best,
Hank

On Mon, Mar 23, 2015 at 10:10 AM, Karl Wright <[email protected]> wrote:

> Hi Hank,
>
> I can't really recommend any consulting firms specifically skilled with
> using bits and pieces of ManifoldCF to build a whole new solution. If you
> are indexing into Solr, maybe you can contact a Solr consulting firm, e.g.
> LucidImagination etc. You *could* try a firm like Zaizi (based in London),
> but I can't be sure they'd find the job amenable either.
>
> Karl
>
> On Mon, Mar 23, 2015 at 9:43 AM, hank williams <[email protected]> wrote:
>
>> Karl,
>>
>> At this point it seems like perhaps ManifoldCF may not be the right tool.
>>
>> I think the best solution is to have our server log into SharePoint using
>> Kerberos or OAuth, and to provide our engine links to the content available
>> to the logged in user. This is, in essence, a single user crawl of a
>> sharepoint site I guess (we are not interested in other data sources). From
>> what I gather based on your responses, ManifoldCF wouldn't help much here,
>> but this does not seem like an extraordinarily complicated task (at least
>> from the perspective of someone who's never played with any of this stuff!).
>>
>> So my question is, is my assumption that it's not "an extraordinarily
>> complicated task" correct, and if not, are there folks in the ManifoldCF
>> community (or other communities) that you know of who might be available as
>> consultants to create that module?
>>
>> Best,
>> Hank
>>
>> On Thu, Mar 19, 2015 at 3:43 PM, Karl Wright <[email protected]> wrote:
>>
>>> "If output connectors have access to the access tokens then I am
>>> presuming a custom output connector could look and say, "oh this document
>>> is accessible to these specific people", but is that a reasonable
>>> assumption?"
>>>
>>> The problem is that you don't know what is in those access tokens.
>>> If you knew beyond question that the only thing you'd ever index was stuff
>>> that (for instance) came from SharePoint, maybe you could make it work.
>>> But if you add other connection types, then you'd have to modify your
>>> output connector for each one.
>>>
>>> The other thing you should think about is that usually access tokens
>>> correspond to *groups* of users rather than individual users. There is no
>>> obvious mapping then that you can use to turn that into a list of
>>> corresponding users. I believe that when the SharePoint connector is
>>> configured for "Active Directory" authorization, it maps to individual
>>> SIDs, but as you might expect the list of SIDs for a given document can be
>>> quite large, which is why we went to the SharePoint/Native authorization
>>> model as our default.
>>>
>>> Karl
>>>
>>>
>>> On Thu, Mar 19, 2015 at 2:43 PM, hank williams <[email protected]> wrote:
>>>
>>>> This is *super* helpful. I think perhaps I am seeing how to handle this.
>>>>
>>>> Regarding #2, since our database is proprietary, there would be no
>>>> existing output connection type so in any case we would need to create our
>>>> own.
>>>>
>>>> But #1 is clearly an issue. My first thought is that the answer would
>>>> be to just read everything (not limited by permissions) and then to use a
>>>> custom output connector to "place" copies in the right accounts. If output
>>>> connectors have access to the access tokens then I am presuming a custom
>>>> output connector could look and say, "oh this document is accessible to
>>>> these specific people", but is that a reasonable assumption?
>>>>
>>>>
>>>> On Thu, Mar 19, 2015 at 2:26 PM, Karl Wright <[email protected]> wrote:
>>>>
>>>>> "So my question is, notwithstanding that this is not the "typical"
>>>>> way ManifoldCF works, can we use it in the way that I am describing.
>>>>> Is it malleable enough to work or is it designed to do something so
>>>>> different from what we need that it would be useless. I guess the key
>>>>> question is really, can we tell ManifoldCF to limit results to those
>>>>> visible to a specific user and would there be any performance or other
>>>>> unexpected downsides to doing that."
>>>>>
>>>>> Hi Hank,
>>>>>
>>>>> There is nothing specific about the ManifoldCF *framework* that
>>>>> prevents you from doing what you suggest. But there are problems, as
>>>>> follows:
>>>>>
>>>>> (1) Most out-of-the-box repository connection types, including the
>>>>> SharePoint type, do not give you any ability to limit crawls to a specific
>>>>> user. Instead, because they are intended to support a very different
>>>>> security model, they fetch a document's access tokens, which are described
>>>>> by the book chapter I pointed you to.
>>>>> (2) If you modified the SharePoint repository connection type in the
>>>>> manner you suggest, you would still need to create a custom output
>>>>> connection type to drop the content into your per-user database instances.
>>>>> The alternative would be to use an appropriate out-of-the-box output
>>>>> connection type, if there is one, and have N jobs for N users.
>>>>>
>>>>> Hope that answers your question.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Mar 19, 2015 at 2:15 PM, hank williams <[email protected]> wrote:
>>>>>
>>>>>> Thanks Karl.
>>>>>>
>>>>>> I will most certainly be reading the document you linked to in great
>>>>>> detail. It looks like stuff I need to know.
>>>>>>
>>>>>> That said, we have a given technology that we have developed and that
>>>>>> we will be using. It creates a separate index for each user. The
>>>>>> technology has vastly greater utility than just for sharepoint and it's
>>>>>> been in development for about six years. (in fact this sharepoint thing
>>>>>> is a recent add-on request.)
>>>>>>
>>>>>> So my question is, notwithstanding that this is not the "typical" way
>>>>>> ManifoldCF works, can we use it in the way that I am describing. Is it
>>>>>> malleable enough to work or is it designed to do something so different
>>>>>> from what we need that it would be useless. I guess the key question is
>>>>>> really, can we tell ManifoldCF to limit results to those visible to a
>>>>>> specific user and would there be any performance or other unexpected
>>>>>> downsides to doing that.
>>>>>>
>>>>>> Hank
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 19, 2015 at 1:53 PM, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Hank,
>>>>>>>
>>>>>>> "Our project involves a database that has a private secure user
>>>>>>> space for each user. Our database is built on Lucene and indexes every
>>>>>>> object in the database. Each user presumably has some number of
>>>>>>> SharePoint sites that they have access to. We want to index each
>>>>>>> sharepoint object (file or sharepoint page) as we find it, for each
>>>>>>> user. The user then ends up with an index of just the objects that they
>>>>>>> have permissions for. But to do that we need to, for each user crawl all
>>>>>>> of the sharepoint sites that they have access to. Permissions to each
>>>>>>> sharepoint site are managed by Kerberos."
>>>>>>>
>>>>>>> This is not the typical ManifoldCF model. In the typical case,
>>>>>>> there is ONE lucene search engine (not N), and any searches that take
>>>>>>> place apply security restrictions internally based on the user's
>>>>>>> security information, as obtained from the ManifoldCF authority service,
>>>>>>> which is in turn querying SharePoint.
>>>>>>>
>>>>>>> You can read more about the standard authorization setup here:
>>>>>>>
>>>>>>> https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs/MCFiA%20CH%2004.pdf
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Mar 19, 2015 at 1:44 PM, hank williams <[email protected]> wrote:
>>>>>>>
>>>>>>>> I am embarking on an effort for which ManifoldCF may be an
>>>>>>>> appropriate tool. I am a total noob, having just discovered this
>>>>>>>> project and have a few questions that I am hoping someone can answer so
>>>>>>>> that I can begin to gain some confidence about the way things work.
>>>>>>>> Basically I am trying to make sure I understand, at a top level, how
>>>>>>>> ManifoldCF works.
>>>>>>>>
>>>>>>>> Our project involves a database that has a private secure user
>>>>>>>> space for each user. Our database is built on Lucene and indexes every
>>>>>>>> object in the database. Each user presumably has some number of
>>>>>>>> SharePoint sites that they have access to. We want to index each
>>>>>>>> sharepoint object (file or sharepoint page) as we find it, for each
>>>>>>>> user. The user then ends up with an index of just the objects that they
>>>>>>>> have permissions for. But to do that we need to, for each user crawl
>>>>>>>> all of the sharepoint sites that they have access to. Permissions to
>>>>>>>> each sharepoint site are managed by Kerberos.
>>>>>>>>
>>>>>>>> So the questions are:
>>>>>>>>
>>>>>>>> a. Can I, with ManifoldCF take a list of sharepoint sites and a list
>>>>>>>> of users and relevant Kerberos appropriate authentication tokens or
>>>>>>>> keys (just learning about Kerberos), and get back a list of indexable
>>>>>>>> objects/URIs (HTML, .docx, pptx, etc.)?
>>>>>>>>
>>>>>>>> b. Is this the right way to think about it?
>>>>>>>>
>>>>>>>> c. If so, is there any example code or documentation that would
>>>>>>>> explain how I do this?
>>>>>>>>
>>>>>>>> d. Does manifoldCF provide any information to help indicate whether
>>>>>>>> the given object has changed, or is that something we need to figure
>>>>>>>> out by manually comparing the old and new documents in our code?
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
