See CONNECTORS-1115. I looked into this; it looked relatively easy to add a method to IProcessActivity that does what you request. Please give it a try and let me know how it works for you.
Thanks,
Karl

On Tue, Nov 25, 2014 at 6:22 PM, Karl Wright <[email protected]> wrote:

> Hi Markus,
>
> >>>>>>
> noDocument() removes the document or the specified component from the
> output but keeps track of the version in the status queue. The decision of
> not indexing the document/component is considered persistent as long as the
> version string does not change.
>
> deleteDocument() removes the document and all its components from output
> and the status queue. The decision of not indexing the document will have
> to be made again when the document is processed the next time (version
> string is irrelevant)
>
> removeDocument() removes the primary document from the output and from the
> status queue but keeps components in the output. The decision of not
> indexing the document will have to be made again when the document is
> processed the next time (version string is irrelevant)
>
> Is this correct?
> <<<<<<
>
> Yes.
>
> >>>>>>
> The scenario is indexing documents with embedded documents. The embedded
> documents are ingested as components.
>
> We assume a document with multiple components was ingested. For the next
> processing the version does not change.
> So the whole document should not be refetched.
> But how i can prevent the deletion of the components when the document is
> not re-fetched?
> I saw the method "retainDocument" which seems to be the way to go, but the
> problem is that without fetching the document
> i have no knowledge about the available components.
> Is there any other way to retain all components without knowing them?
> <<<<<<
>
> Not at present; the assumption for components is that the processing of a
> primary document will allow your connector to determine disposition of all
> components of the primary document every time that processDocuments() is
> called for it. Effectively that means that the assumption is that
> determining what components are in a document is a relatively inexpensive
> operation. It's necessary to make that assumption, because that's the only
> way the bookkeeping can work - MCF needs to know what happens with the
> components, when all it has is a processDocuments() call. I'll look into
> how hard it would be to add the functionality you are looking for though,
> and get back to you.
>
> >>>>>>
> About a patch for a Test Connector:
> I think i could contribute something.
> Do you have general requirements/guideline for test connectors?
> Are there examples of a similar test connector?
> <<<<<<
>
> Look at
> framework/pull-agent/src/test/java/org/apache/manifoldcf/crawler/tests.
> There are a number of test connectors there, and tests that use them.
>
> Thanks,
> Karl
>
>
> On Tue, Nov 25, 2014 at 5:53 PM, Markus Schuch <[email protected]>
> wrote:
>
>> Hi Karl,
>>
>> thanks for the clarification about primary document disposition.
>>
>> I'm still not 100% sure if i understand the differences... i try to
>> explain it in my own words:
>>
>> noDocument() removes the document or the specified component from the
>> output but keeps track of the version in the status queue. The decision of
>> not indexing the document/component is considered persistent as long as the
>> version string does not change.
>>
>> deleteDocument() removes the document and all its components from output
>> and the status queue. The decision of not indexing the document will have
>> to be made again when the document is processed the next time (version
>> string is irrelevant)
>>
>> removeDocument() removes the primary document from the output and from
>> the status queue but keeps components in the output. The decision of not
>> indexing the document will have to be made again when the document is
>> processed the next time (version string is irrelevant)
>>
>> Is this correct?
>>
>> -----------------------------
>>
>> An new question i have:
>>
>> The scenario is indexing documents with embedded documents. The embedded
>> documents are ingested as components.
>>
>> We assume a document with multiple components was ingested. For the next
>> processing the version does not change.
>> So the whole document should not be refetched.
>> But how i can prevent the deletion of the components when the document is
>> not re-fetched?
>> I saw the method "retainDocument" which seems to be the way to go, but
>> the problem is that without fetching the document
>> i have no knowledge about the available components.
>> Is there any other way to retain all components without knowing them?
>>
>> ----------------------------
>>
>> About a patch for a Test Connector:
>> I think i could contribute something.
>> Do you have general requirements/guideline for test connectors?
>> Are there examples of a similar test connector?
>>
>> Regards,
>> Markus
>>
>
>
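
For anyone reading this thread later, the three dispositions summarized above can be illustrated with a small toy model. This is only a sketch of the behavior as described in the thread, not ManifoldCF code: the class DispositionModel, its fields, the ingest() helper, the "docId#component" key format, and all method signatures are hypothetical and do not reflect the real IProcessActivity API. It also assumes that removeDocument() leaves the status entries of components untouched, which the thread does not state.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy in-memory model of the dispositions discussed above (hypothetical, not ManifoldCF code). */
class DispositionModel {
    // Keys: "docId" for the primary document, "docId#component" for a component.
    // Assumes document ids do not themselves contain '#'.
    final Set<String> output = new HashSet<>();
    final Map<String, String> statusQueue = new HashMap<>();

    static String key(String docId, String component) {
        return component == null ? docId : docId + "#" + component;
    }

    static boolean belongsTo(String key, String docId) {
        return key.equals(docId) || key.startsWith(docId + "#");
    }

    /** Helper for the sketch: index a document or component and record its version. */
    void ingest(String docId, String component, String version) {
        String k = key(docId, component);
        output.add(k);
        statusQueue.put(k, version);
    }

    /** Remove from the output, but remember the version so the decision persists. */
    void noDocument(String docId, String component, String version) {
        String k = key(docId, component);
        output.remove(k);
        statusQueue.put(k, version);
    }

    /** Remove the document and all its components from output and status queue. */
    void deleteDocument(String docId) {
        output.removeIf(k -> belongsTo(k, docId));
        statusQueue.keySet().removeIf(k -> belongsTo(k, docId));
    }

    /** Remove only the primary document; components stay in the output. */
    void removeDocument(String docId) {
        output.remove(docId);
        statusQueue.remove(docId);
    }
}
```

In this model, calling noDocument("doc1", "a", "v2") leaves a version entry behind for the component, while deleteDocument("doc1") leaves nothing, so the indexing decision has to be made again on the next pass.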
