I have done enough research to confirm that at least one of the MCF shipped connectors also relies on the empty version string: the JDBC connector.
I've therefore opened CONNECTORS-1283 and attached a patch.

Karl


On Fri, Mar 4, 2016 at 7:42 AM, Karl Wright <[email protected]> wrote:

> Hi Markus,
>
> I agree that is one key bit of code, and I agree with your analysis.
>
> There obviously needs to be a way to signal "I don't have a meaningful
> document version string", and an empty string is not unreasonable for this
> purpose.  However, there's more to it than that.
>
> Specifically, the pipeline code is designed to make intelligent decisions
> on an output connection by output connection basis whether to index the
> document in that connector.  There is also an API concern: specifically, we
> *expect* that the caller will have checked whether a document needs to be
> indexed or not at the root level.  So the whole clause you have mentioned
> is, theoretically, unnecessary, if the connector is written right.  But we
> can't count on that.
>
> I will look through other connectors to see if there is any problem with
> an empty string being used as a signal for "don't care".  I will get back
> to you.
>
> Karl
>
>
> On Fri, Mar 4, 2016 at 7:32 AM, Markus Schuch <[email protected]>
> wrote:
>
>> Hi Karl,
>>
>> Yes, I am sure ingestDocumentWithException is called twice. The first call
>> is in the first run, the second call in the second run. Both calls happen
>> with the same arguments.
>>
>> I think the interesting part is in the IncrementalIngester:
>> The old version and the new version are compared. And an empty string is
>> treated like any other version.
>>
>>     boolean needToReindex = (oldDocumentVersion == null);
>>     if (needToReindex == false)
>>     {
>>       needToReindex = (!oldDocumentVersion.equals(newDocumentVersion) ||
>>         !oldOutputVersion.equals(fullSpec.getStageDescriptionString(outputStage).getVersionString()) ||
>>         !oldAuthorityName.equals((newAuthorityNameString==null)?"":newAuthorityNameString));
>>     }
>>     if (needToReindex == false)
>>     {
>>       needToReindex =
>>         (!oldTransformationVersion.equals(newTransformationVersion));
>>     }
>>
>> In my case old version and new version both are "" and needToReindex
>> stays false.
>>
>> I think this comparison had the same result in 1.7, but due
>> to CONNECTORS-1153 the outputVersion check behind needToReindex was buggy.
>>
>> The question remains: shouldn't an empty version trigger reingestion?
>>
>> Regards
>> Markus
>>
>> *Sent:* Friday, March 4, 2016 at 13:21
>> *From:* "Karl Wright" <[email protected]>
>> *To:* "[email protected]" <[email protected]>
>> *Subject:* Re: Re: Should a document with an empty version string always
>> be reingested?
>> Hi Markus,
>>
>> If you called ingestDocumentWithVersions() more than once, you should
>> have seen two indexing attempts.
>>
>> Are you sure this is indeed getting called twice?
>>
>> I've looked briefly at the code and can find no reason why there would be
>> version-sensitive incremental behavior in this method call.  I will go back
>> and look more carefully and get back to you.
>>
>> Karl
>>
>>
>> On Fri, Mar 4, 2016 at 6:40 AM, Markus Schuch <[email protected]>
>> wrote:
>>>
>>>
>>> Hi Karl,
>>>
>>> thanks for the fast response.
>>>
>>> We have a simple connector (written before 1.7) that produces documents
>>> from an XML file, and we use the empty version string to trigger ingestion
>>> on every job run. Meaning the empty version string is considered as
>>> "alwaysRefetch" and the created document is always sent down the pipeline
>>> along with this empty version string.
>>> (the connector was relying on the 1.x BaseRepositoryConnector)
>>>
>>> I noticed the backward compatibility code in the BaseRepositoryConnector
>>> in 1.7+ and I used this code to wire our custom connector code to the new
>>> 2.3 interface.
>>> I debugged the document processing and - as expected -
>>> ingestDocumentWithException is still called every time, as before, since an
>>> empty version string is still considered as alwaysRefetch. But the sent
>>> document is only ingested into the output repository the first time the
>>> job runs. On consecutive runs the output step stays inactive.
>>>
>>> I think we can boil my issue down to a specific question about one
>>> method of the IProcessActivity interface:
>>>
>>>     ingestDocumentWithException(String documentIdentifier, String version,
>>>         String documentURI, RepositoryDocument data)
>>>
>>>
>>> Let's assume the following example flow (starting from an empty and
>>> clean MCF 2.3 system):
>>>
>>> (1) In a first run of my job
>>>
>>>     ingestDocumentWithException( "identiferX", "", "documentUriX",
>>>         repoDoc) // second param is empty version string
>>>
>>> is called. This leads to ingestion of the document with the URI
>>> "documentUriX".
>>>
>>> (2) In a second run of my job
>>>
>>>     ingestDocumentWithException( "identiferX", "", "documentUriX",
>>>         repoDoc) // second param is empty version string
>>>
>>> is called again (with the same arguments).
>>>
>>> What is the expected behavior here?
>>> Should the document be ingested again or not?
>>> And if not, how should I trigger ingestion? By always sending a null
>>> version down the pipeline?
>>>
>>> The actual behavior:
>>> - In 1.7 it is ingested again.
>>> - In 2.3 it is _not_ ingested again.
>>>
>>> Regards,
>>> Markus
>>>
>>>
>>>
>>>
>>>
>>> Sent: Friday, March 4, 2016 at 12:11
>>> From: "Karl Wright" <[email protected]>
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Should a document with an empty version string always be
>>> reingested?
>>>
>>> Hi Markus,
>>>
>>> The canonical way that a connector handles incrementality changed from
>>> 1.7 to 1.10.  We maintained backwards compatibility through the inclusion
>>> of legacy base connector methods.  CONNECTORS-1153 reported a problem in
>>> one of those base connector methods, which has been fixed by 1.10.  I can't
>>> tell whether this applies to your situation.
>>>
>>> On 2.x the base connector no longer has the legacy base connector
>>> methods at all, so if you have a custom connector you will need
>>> to rework your connector class to adhere to the newer model.  Specifically,
>>> there is no such method anymore as "getDocumentVersions()".  Instead, your
>>> connector must signal its disposition of any document using the
>>> IProcessActivity methods available for that purpose.
>>>
>>> Can you describe in more detail what you are doing here?
>>> (a) Is this a custom connector?
>>> (b) Was it developed on 1.7 or before?
>>> (c) Are you trying to run it on 1.10 or on 2.x?
>>>
>>> That will help me give you better responses.
>>>
>>> Karl
>>>
>>>
>>> On Fri, Mar 4, 2016 at 5:28 AM, Markus Schuch <[email protected]>
>>> wrote:
>>>
>>> Hi,
>>>
>>> we ran on MCF 1.7 for quite a while, and in this environment a document
>>> sent to the ingestion pipeline together with an empty version string was
>>> always reingested.
>>> On MCF 2.3 this is no longer the case.
>>>
>>> I found
>>> https://issues.apache.org/jira/browse/CONNECTORS-1153
>>> and maybe the 1.7 behavior we were relying on was always a bug.
>>>
>>> Question:
>>> Is the new 2.3 behavior the expected way for the ingestion pipeline
>>> to handle an empty version string?
>>> And how can "always reingestion" be triggered?
>>>
>>> Thanks in Advance,
>>> Markus
>>>
>>
>
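
To make the behavior discussed in this thread concrete, here is a minimal Java sketch of just the document-version part of the comparison Markus quoted from the IncrementalIngester. The class and method names are hypothetical and the output-version, authority and transformation checks are left out. The second method is only one possible reading of "an empty string means no meaningful version"; it is an assumption for illustration and not the patch attached to CONNECTORS-1283.

```java
public class ReindexDecisionSketch {

    // Document-version part of the quoted comparison: an empty string is
    // compared like any other version, so "" vs "" means "unchanged".
    public static boolean needToReindexAsQuoted(String oldDocumentVersion,
                                                String newDocumentVersion) {
        if (oldDocumentVersion == null) {
            return true; // nothing recorded yet, so the document must be indexed
        }
        return !oldDocumentVersion.equals(newDocumentVersion);
    }

    // Hypothetical variant (assumption): a null or empty new version is taken
    // as "no meaningful version string" and always forces reindexing.
    public static boolean needToReindexEmptyMeansAlways(String oldDocumentVersion,
                                                        String newDocumentVersion) {
        if (oldDocumentVersion == null
                || newDocumentVersion == null
                || newDocumentVersion.isEmpty()) {
            return true;
        }
        return !oldDocumentVersion.equals(newDocumentVersion);
    }

    public static void main(String[] args) {
        // First job run: no old version recorded.
        System.out.println(needToReindexAsQuoted(null, ""));          // true
        // Second job run: old and new version are both "".
        System.out.println(needToReindexAsQuoted("", ""));            // false
        // Same second run under the hypothetical variant.
        System.out.println(needToReindexEmptyMeansAlways("", ""));    // true
        // A real, unchanged version string is still skipped.
        System.out.println(needToReindexEmptyMeansAlways("v1", "v1")); // false
    }
}
```

The second println corresponds to the case reported for 2.3: both versions are "" and the document is not sent to the output again on the second run.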
