Oh OK so I finally don't have to investigate :) Thanks Karl!

Julien

On 26.04.2017 17:20, Karl Wright wrote:
> Oh, never mind. I see the issue, which is that without the version query,
> documents that don't appear in the result list *at all* are never removed
> from the map. I'll create a ticket.
>
> Karl
>
> On Wed, Apr 26, 2017 at 11:10 AM, Karl Wright <[email protected]> wrote:
>
> Hi Julien,
>
> The delete logic in the connector is as follows:
>
> >>>>>>
>
> // Now, go through the original id's, and see which ones are still in the map. These
> // did not appear in the result and are presumed to be gone from the database, and thus must be deleted.
> for (String documentIdentifier : documentIdentifiers)
> {
>   if (fetchDocuments.contains(documentIdentifier))
>   {
>     String documentVersion = map.get(documentIdentifier);
>     if (documentVersion != null)
>     {
>       // This means we did not see it (or data for it) in the result set. Delete it!
>       activities.noDocument(documentIdentifier,documentVersion);
>       activities.recordActivity(null, ACTIVITY_FETCH,
>         null, documentIdentifier, "NOTFETCHED", "Document was not seen by processing query", null);
>     }
>   }
> }
> <<<<<<
>
> For a JDBC job without a version query, fetchDocuments contains all the
> documents. But map has the entries removed that were actually fetched.
> Documents that were *not* fetched for whatever reason therefore will not be
> cleaned up. Here's the code that determines that:
>
> >>>>>>
>
> String version = map.get(id);
> if (version == null)
>   // Does not need refetching
>   continue;
>
> // This document was marked as "not scan only", so we expect to find it.
> if (Logging.connectors.isDebugEnabled())
>   Logging.connectors.debug("JDBC: Document data result found for '"+id+"'");
> o = row.getValue(JDBCConstants.urlReturnColumnName);
> if (o == null)
> {
>   Logging.connectors.debug("JDBC: Document '"+id+"' has a null url - skipping");
>   errorCode = activities.NULL_URL;
>   errorDesc = "Excluded because document had a null URL";
>   activities.noDocument(id,version);
>   continue;
> }
>
> // This is not right - url can apparently be a BinaryInput
> String url = JDBCConnection.readAsString(o);
> boolean validURL;
> try
> {
>   // Check to be sure url is valid
>   new java.net.URI(url);
>   validURL = true;
> }
> catch (java.net.URISyntaxException e)
> {
>   validURL = false;
> }
>
> if (!validURL)
> {
>   Logging.connectors.debug("JDBC: Document '"+id+"' has an illegal url: '"+url+"' - skipping");
>   errorCode = activities.BAD_URL;
>   errorDesc = "Excluded because document had illegal URL ('"+url+"')";
>   activities.noDocument(id,version);
>   continue;
> }
>
> // Process the document itself
> Object contents = row.getValue(JDBCConstants.dataReturnColumnName);
> // Null data is allowed; we just ignore these
> if (contents == null)
> {
>   Logging.connectors.debug("JDBC: Document '"+id+"' seems to have null data - skipping");
>   errorCode = "NULLDATA";
>   errorDesc = "Excluded because document had null data";
>   activities.noDocument(id,version);
>   continue;
> }
>
> // We will ingest something, so remove this id from the map in order that we know what we still
> // need to delete when all done.
> map.remove(id);
> <<<<<<
>
> As you see, activities.noDocument() is called for all cases, except the one
> where the document version is null (which cannot happen since all document
> versions for this case will be the empty string). So I am at a loss to
> understand why the delete is not happening.
>
> The only way I can think of is if you clicked one of the buttons on the
> output connection's view page that told MCF to "forget" all the history for
> that connection.
>
> Thanks,
> Karl
>
> On Wed, Apr 26, 2017 at 10:42 AM, <[email protected]> wrote:
>
> Hi Karl,
>
> I was manually starting the job for testing purposes, but even if I schedule it
> with job invocation "Complete" and "Scan every document once", the missing
> IDs from the database are not deleted in my Solr index (no trace of any
> 'document deletion' event in the history).
> I should mention that I only use the 'Seeding query' and 'Data query' and I
> am not using the $(STARTTIME) and $(ENDTIME) variables in my seeding query.
>
> Julien
>
> On 26.04.2017 16:05, Karl Wright wrote:
> Hi Julien,
>
> How are you starting the job? If you use "Start minimal", deletion would not
> take place. If your job is a continuous one, this is also the case.
>
> Thanks,
> Karl
>
> On Wed, Apr 26, 2017 at 9:52 AM, <[email protected]> wrote:
> Hi the MCF community,
>
> I am using MCF 2.6 with the JDBC connector to crawl an Oracle Database and
> index the data into a Solr server, and it works very well. However, when I
> perform a delta re-crawl, the new IDs are correctly retrieved from the
> Database but those that have been deleted are not "detected" by the connector
> and thus are still present in my Solr index.
> I would like to know whether this should normally work and I have maybe missed
> something in the configuration of the job, or whether this is not implemented.
> The only way I found to solve this issue is to reset the seeding of the job,
> but it is very time and resource consuming.
>
> Best regards,
> Julien Massiera
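
For readers following the thread, the bookkeeping that the quoted delete logic relies on can be sketched as a small standalone program. This is only an illustration under stated assumptions: `DeleteSketch`, `findDeletions` and `resultSetIds` are hypothetical names that do not exist in ManifoldCF, and the sketch does not try to reproduce the defect Karl mentions. It shows only the intended idea: ids that were ingested are removed from the map, and whatever is still in the map at the end is treated as deleted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DeleteSketch {
    // Hypothetical helper (not a ManifoldCF API): returns the ids that the
    // quoted loop would pass to activities.noDocument().
    static List<String> findDeletions(List<String> documentIdentifiers,
                                      Set<String> fetchDocuments,
                                      Map<String, String> versions,
                                      Set<String> resultSetIds) {
        // Work on a copy so the caller's version map is left untouched.
        Map<String, String> map = new HashMap<>(versions);

        // Stand-in for the data-query loop: every id that came back and was
        // ingested is removed from the map.
        for (String id : resultSetIds) {
            map.remove(id);
        }

        // Same shape as the quoted loop: anything we wanted to fetch that is
        // still in the map was not seen in the result set.
        List<String> deletions = new ArrayList<>();
        for (String id : documentIdentifiers) {
            if (fetchDocuments.contains(id) && map.get(id) != null) {
                deletions.add(id);
            }
        }
        return deletions;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("1", "2", "3");
        Set<String> fetch = new HashSet<>(ids);

        // Without a version query, every version is the empty string (not null).
        Map<String, String> versions = new HashMap<>();
        for (String id : ids) {
            versions.put(id, "");
        }

        // Assume row "2" has disappeared from the database.
        Set<String> returned = new HashSet<>(Arrays.asList("1", "3"));

        System.out.println(findDeletions(ids, fetch, versions, returned)); // prints [2]
    }
}
```

In this toy run, id "2" is the only one left in the map after the result set is processed, so it is the only one reported for deletion.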
