Thanks Karl,
I don't think I accidentally clicked on the button to forget the history, and
I still have all the other events. But I will investigate by debugging the
part of the code that you mentioned and will keep you informed.
Regards,
Julien
On 26.04.2017 17:10, Karl Wright wrote:
> Hi Julien,
>
> The delete logic in the connector is as follows:
>
>>>>>>>
>
> // Now, go through the original id's, and see which ones are still in the map.  These
> // did not appear in the result and are presumed to be gone from the database, and thus must be deleted.
> for (String documentIdentifier : documentIdentifiers)
> {
>   if (fetchDocuments.contains(documentIdentifier))
>   {
>     String documentVersion = map.get(documentIdentifier);
>     if (documentVersion != null)
>     {
>       // This means we did not see it (or data for it) in the result set.  Delete it!
>       activities.noDocument(documentIdentifier,documentVersion);
>       activities.recordActivity(null, ACTIVITY_FETCH,
>         null, documentIdentifier, "NOTFETCHED", "Document was not seen by processing query", null);
>     }
>   }
> }
> <<<<<<
>
> For a JDBC job without a version query, fetchDocuments contains all the
> documents, and map has the entries removed that were actually fetched.
> Documents that were *not* fetched for whatever reason are therefore left in
> the map, and the delete loop above picks them up. Here's the code that
> determines which entries get removed from the map:
>
>>>>>>>
>
> String version = map.get(id);
> if (version == null)
>   // Does not need refetching
>   continue;
>
> // This document was marked as "not scan only", so we expect to find it.
> if (Logging.connectors.isDebugEnabled())
>   Logging.connectors.debug("JDBC: Document data result found for '"+id+"'");
> o = row.getValue(JDBCConstants.urlReturnColumnName);
> if (o == null)
> {
>   Logging.connectors.debug("JDBC: Document '"+id+"' has a null url - skipping");
>   errorCode = activities.NULL_URL;
>   errorDesc = "Excluded because document had a null URL";
>   activities.noDocument(id,version);
>   continue;
> }
>
> // This is not right - url can apparently be a BinaryInput
> String url = JDBCConnection.readAsString(o);
> boolean validURL;
> try
> {
>   // Check to be sure url is valid
>   new java.net.URI(url);
>   validURL = true;
> }
> catch (java.net.URISyntaxException e)
> {
>   validURL = false;
> }
>
> if (!validURL)
> {
>   Logging.connectors.debug("JDBC: Document '"+id+"' has an illegal url: '"+url+"' - skipping");
>   errorCode = activities.BAD_URL;
>   errorDesc = "Excluded because document had illegal URL ('"+url+"')";
>   activities.noDocument(id,version);
>   continue;
> }
>
> // Process the document itself
> Object contents = row.getValue(JDBCConstants.dataReturnColumnName);
> // Null data is allowed; we just ignore these
> if (contents == null)
> {
>   Logging.connectors.debug("JDBC: Document '"+id+"' seems to have null data - skipping");
>   errorCode = "NULLDATA";
>   errorDesc = "Excluded because document had null data";
>   activities.noDocument(id,version);
>   continue;
> }
>
> // We will ingest something, so remove this id from the map in order that we know what we still
> // need to delete when all done.
> map.remove(id);
> <<<<<<
>
> As you see, activities.noDocument() is called in every case except the one
> where the document version is null (which cannot happen here, since without a
> version query every document version is the empty string). So I am at a loss
> to understand why the delete is not happening.
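>
> To make that bookkeeping concrete, here is a minimal, self-contained sketch
> (illustrative only, not the connector's actual code; the class name and the
> document ids are invented) of how the map and fetchDocuments are expected to
> behave when no version query is configured: every id starts out in the map
> with an empty-string version, ids that yield an ingestable row are removed,
> and whatever is left over is handed to activities.noDocument().
>
>>>>>>>
>
> import java.util.*;
>
> public class DeleteBookkeepingSketch
> {
>   public static void main(String[] args)
>   {
>     // Hypothetical ids handed to the processing pass; "doc3" never appears in the result set.
>     String[] documentIdentifiers = { "doc1", "doc2", "doc3" };
>     Set<String> resultSetIds = new HashSet<>(Arrays.asList("doc1", "doc2"));
>
>     // With no version query, every id is marked for fetching and mapped to the empty-string version.
>     Map<String,String> map = new HashMap<>();
>     Set<String> fetchDocuments = new HashSet<>();
>     for (String id : documentIdentifiers)
>     {
>       map.put(id, "");
>       fetchDocuments.add(id);
>     }
>
>     // Data query pass: each id that produces an ingestable row is removed from the map.
>     for (String id : resultSetIds)
>       map.remove(id);
>
>     // Delete pass: anything still in the map was never seen and should be deleted.
>     for (String id : documentIdentifiers)
>     {
>       if (fetchDocuments.contains(id) && map.get(id) != null)
>         System.out.println("would call activities.noDocument(\"" + id + "\", \"\")");
>     }
>     // Prints only: would call activities.noDocument("doc3", "")
>   }
> }
> <<<<<<
>
> In your case, the ids that disappear from the database play the role of doc3
> here, so they are exactly the ones that should reach the delete path.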
>
> The only explanation I can think of is that you clicked one of the buttons on
> the output connection's view page that told MCF to "forget" all the history
> for that connection.
>
> Thanks,
> Karl
>
> On Wed, Apr 26, 2017 at 10:42 AM, <[email protected]> wrote:
>
> Hi Karl,
>
> I was manually starting the job for testing purposes, but even when I schedule
> it with job invocation "Complete" and "Scan every document once", the IDs that
> are missing from the database are not deleted from my Solr index (no trace of
> any 'document deletion' event in the history).
> I should mention that I only use the 'Seeding query' and the 'Data query', and
> I am not using the $(STARTTIME) and $(ENDTIME) variables in my seeding query.
>
> Julien
>
> On 26.04.2017 16:05, Karl Wright wrote:
> Hi Julien,
>
> How are you starting the job? If you use "Start minimal", deletion would not
> take place. If your job is a continuous one, this is also the case.
>
> Thanks,
> Karl
>
> On Wed, Apr 26, 2017 at 9:52 AM, <[email protected]> wrote:
> Hi the MCF community,
>
> I am using MCF 2.6 with the JDBC connector to crawl an Oracle Database and
> index the data into a Solr server, and it works very well. However, when I
> perform a delta re-crawl, the new IDs are correctly retrieved from the
> database, but those that have been deleted are not "detected" by the connector
> and thus are still present in my Solr index.
> I would like to know whether this should normally work and I have perhaps
> missed something in the job configuration, or whether it is simply not
> implemented.
> The only way I have found to work around this is to reset the job's seeding,
> but that is very time- and resource-consuming.
>
> Best regards,
> Julien Massiera