Re: Delete IDs with JDBC connector

julien . massiera Wed, 26 Apr 2017 08:24:48 -0700

Oh OK, so I don't have to investigate after all :)

Thanks, Karl!


Julien 

On 26.04.2017 17:20, Karl Wright wrote:

> Oh, never mind.  I see the issue, which is that without the version query, 
> documents that don't appear in the result list *at all* are never removed 
> from the map.  I'll create a ticket. 
> 
> Karl 
> 
> On Wed, Apr 26, 2017 at 11:10 AM, Karl Wright <[email protected]> wrote:
> 
> Hi Julien, 
> 
> The delete logic in the connector is as follows: 
> 
>>>>>>> 
> 
> // Now, go through the original id's, and see which ones are still in the map.  These 
> // did not appear in the result and are presumed to be gone from the database, and thus must be deleted. 
> for (String documentIdentifier : documentIdentifiers) 
> { 
> if (fetchDocuments.contains(documentIdentifier)) 
> { 
> String documentVersion = map.get(documentIdentifier); 
> if (documentVersion != null) 
> { 
> // This means we did not see it (or data for it) in the result set.  Delete it! 
> activities.noDocument(documentIdentifier,documentVersion); 
> activities.recordActivity(null, ACTIVITY_FETCH, 
> null, documentIdentifier, "NOTFETCHED", "Document was not seen by processing query", null); 
> } 
> } 
> } 
> <<<<<< 
> 
> For a JDBC job without a version query, fetchDocuments contains all the 
> documents.  But map has the entries removed that were actually fetched.  
> Documents that were *not* fetched for whatever reason therefore will not be 
> cleaned up.  Here's the code that determines that: 
> 
>>>>>>> 
> 
> String version = map.get(id); 
> if (version == null) 
> // Does not need refetching 
> continue; 
> 
> // This document was marked as "not scan only", so we expect to find it. 
> if (Logging.connectors.isDebugEnabled()) 
> Logging.connectors.debug("JDBC: Document data result found for '"+id+"'"); 
> o = row.getValue(JDBCConstants.urlReturnColumnName); 
> if (o == null) 
> { 
> Logging.connectors.debug("JDBC: Document '"+id+"' has a null url - skipping"); 
> errorCode = activities.NULL_URL; 
> errorDesc = "Excluded because document had a null URL"; 
> activities.noDocument(id,version); 
> continue; 
> } 
> 
> // This is not right - url can apparently be a BinaryInput 
> String url = JDBCConnection.readAsString(o); 
> boolean validURL; 
> try 
> { 
> // Check to be sure url is valid 
> new java.net.URI(url); 
> validURL = true; 
> } 
> catch (java.net.URISyntaxException e) 
> { 
> validURL = false; 
> } 
> 
> if (!validURL) 
> { 
> Logging.connectors.debug("JDBC: Document '"+id+"' has an illegal url: '"+url+"' - skipping"); 
> errorCode = activities.BAD_URL; 
> errorDesc = "Excluded because document had illegal URL ('"+url+"')"; 
> activities.noDocument(id,version); 
> continue; 
> } 
> 
> // Process the document itself 
> Object contents = row.getValue(JDBCConstants.dataReturnColumnName); 
> // Null data is allowed; we just ignore these 
> if (contents == null) 
> { 
> Logging.connectors.debug("JDBC: Document '"+id+"' seems to have null data - skipping"); 
> errorCode = "NULLDATA"; 
> errorDesc = "Excluded because document had null data"; 
> activities.noDocument(id,version); 
> continue; 
> } 
> 
> // We will ingest something, so remove this id from the map in order that we know what we still 
> // need to delete when all done. 
> map.remove(id); 
> <<<<<< 
> 
> As you see, activities.noDocument() is called for all cases, except the one 
> where the document version is null (which cannot happen since all document 
> versions for this case will be the empty string).  So I am at a loss to 
> understand why the delete is not happening. 
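
The intended bookkeeping described above can be illustrated with a minimal, self-contained sketch. This is not the connector's code: the class name, the method name idsToDelete, and the use of plain lists in place of a JDBC result set are all assumptions made for illustration. It only shows the idea that every requested id starts in the map, ids that are ingested are removed, and whatever remains is treated as deleted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the JDBC connector.
public class DeleteBookkeepingSketch {

    // Returns the ids that would be passed to noDocument() under the
    // logic described in the thread (assumed simplification).
    static List<String> idsToDelete(List<String> requestedIds,
                                    List<String> idsInResultSet) {
        // Without a version query, every requested id is given the
        // empty string as its version, so no version is ever null.
        Map<String, String> map = new HashMap<>();
        for (String id : requestedIds) {
            map.put(id, "");
        }

        // Each id that comes back from the data query is ingested,
        // so it is removed from the map.
        for (String id : idsInResultSet) {
            map.remove(id);
        }

        // Anything still in the map was not seen in the result set
        // and is presumed to be gone from the database.
        List<String> toDelete = new ArrayList<>();
        for (String id : requestedIds) {
            if (map.get(id) != null) {
                toDelete.add(id);
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        List<String> requested = Arrays.asList("1", "2", "3");
        List<String> returned = Arrays.asList("1", "3");
        System.out.println(idsToDelete(requested, returned)); // prints [2]
    }
}
```

In this sketch, id "2" is the only one missing from the result set, so it is the only one reported for deletion.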
> 
> The only explanation I can think of is that you clicked one of the buttons on the 
> output connection's view page that told MCF to "forget" all the history for 
> that connection. 
> 
> Thanks, 
> Karl 
> 
> On Wed, Apr 26, 2017 at 10:42 AM, <[email protected]> wrote:
> 
> Hi Karl, 
> 
> I was starting the job manually for testing purposes, but even if I schedule it 
> with job invocation "Complete" and "Scan every document once", the IDs 
> missing from the database are not deleted from my Solr index (no trace of any 
> 'document deletion' event in the history).
> I should mention that I only use the 'Seeding query' and 'Data query', and I 
> am not using the $(STARTTIME) and $(ENDTIME) variables in my seeding query. 
> 
> Julien
> 
> On 26.04.2017 16:05, Karl Wright wrote: 
> Hi Julien, 
> 
> How are you starting the job?  If you use "Start minimal", deletion would not 
> take place.  If your job is a continuous one, this is also the case. 
> 
> Thanks, 
> Karl 
> 
> On Wed, Apr 26, 2017 at 9:52 AM, <[email protected]> wrote:
> Hi MCF community,
> 
> I am using MCF 2.6 with the JDBC connector to crawl an Oracle database and 
> index the data into a Solr server, and it works very well. However, when I 
> perform a delta re-crawl, the new IDs are correctly retrieved from the 
> database, but those that have been deleted are not "detected" by the connector 
> and thus are still present in my Solr index.
> I would like to know whether this should normally work and I have perhaps missed 
> something in the job configuration, or whether it is not implemented.
> The only way I found to solve this issue is to reset the seeding of the job, 
> but that is very time- and resource-consuming.
> 
> Best regards,
> Julien Massiera
