Oh yes indeed, never mind, I missed that point. So all is ok 

Thanks Karl 

On 27.04.2017 13:28, Karl Wright wrote:

> Hi Julien, 
> 
> The JDBC connector uses MODEL_ADD_CHANGE.  The requirement for 
> MODEL_ADD_CHANGE is that seeding include at least all documents within the 
> specified time range (it may include more).  That requirement is met. 
> 
> For document processing, the connector receives a list of document 
> identifiers to process.  The documents are *all* queried for, and no start or 
> end time figures into that at all.  The code changes simply note which 
> documents were queried for but were not actually found, and delete those. 
> 
> So I believe the logic to be correct. 
> 
> Karl 
> 
> On Thu, Apr 27, 2017 at 6:32 AM, <[email protected]> wrote:
> 
> Hi Karl, 
> 
> Yes, your fix works. However, doesn't it break the logic of the delta feature 
> provided by a seeding query that makes use of the $(STARTTIME) and 
> $(ENDTIME) variables? 
> 
> For example, let's assume that the docs in my database have a timestamp that 
> indicates their last modification date, and that I set the following 'Seeding 
> query': 
> 
> Select doc.id as "$(IDCOLUMN)"
> From doctable doc
> Where doc.lastmod > $(STARTTIME) 
> 
> The advantage is that the first crawl will retrieve all my docs from the 
> database, and the subsequent ones will only retrieve those that are new or 
> have been modified since the last crawl.
> 
> Now if I combine that with a 'Version check query', each execution of the job 
> will also check the version of all the crawled docs since the very first 
> crawl, and delete those that have disappeared from the database. 
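> 
> For reference, the kind of 'Version check query' I have in mind would look 
> roughly like this (reusing the hypothetical doctable/lastmod columns from the 
> seeding example above, and assuming the connector substitutes $(IDLIST) with 
> the list of ids being checked): 
> 
> Select doc.id as "$(IDCOLUMN)", doc.lastmod as "$(VERSIONCOLUMN)"
> From doctable doc
> Where doc.id in $(IDLIST) 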
> 
> I think that with your modification, this logic is completely broken, because 
> during a 'delta' crawl, all the docs that were crawled previously but do not 
> appear in the delta will be deleted, even though they may still be present 
> in the database.
> I would change your fix to only apply the 'seenDocuments' condition when the 
> $(STARTTIME) and $(ENDTIME) variables are not present in the 'Seeding query' 
> and the 'Version check query' is empty. 
> 
> What do you think ? 
> 
> Anyway thanks for your quick fix,
> Julien
> 
> On 26.04.2017 19:12, Karl Wright wrote: 
> I committed a fix to trunk, and also uploaded a patch to the ticket.  Please 
> let me know if it works for you. 
> 
> Thanks, 
> Karl 
> 
> On Wed, Apr 26, 2017 at 11:24 AM, <[email protected]> wrote:
> 
> Oh OK, so I don't have to investigate after all :)
> 
> Thanks Karl ! 
> 
> Julien
> 
> On 26.04.2017 17:20, Karl Wright wrote: 
> Oh, never mind.  I see the issue, which is that without the version query, 
> documents that don't appear in the result list *at all* are never removed 
> from the map.  I'll create a ticket. 
> 
> Karl 
> 
> On Wed, Apr 26, 2017 at 11:10 AM, Karl Wright <[email protected]> wrote:
> 
> Hi Julien, 
> 
> The delete logic in the connector is as follows: 
> 
>>>>>>> 
> 
> // Now, go through the original id's, and see which ones are still in the map.
> // These did not appear in the result and are presumed to be gone from the
> // database, and thus must be deleted.
> for (String documentIdentifier : documentIdentifiers)
> {
>   if (fetchDocuments.contains(documentIdentifier))
>   {
>     String documentVersion = map.get(documentIdentifier);
>     if (documentVersion != null)
>     {
>       // This means we did not see it (or data for it) in the result set.  Delete it!
>       activities.noDocument(documentIdentifier,documentVersion);
>       activities.recordActivity(null, ACTIVITY_FETCH,
>         null, documentIdentifier, "NOTFETCHED",
>         "Document was not seen by processing query", null);
>     }
>   }
> }
> <<<<<< 
> 
> For a JDBC job without a version query, fetchDocuments contains all the 
> documents, but map has had the entries removed for documents that were 
> actually fetched.  Documents that were *not* fetched for whatever reason will 
> therefore not be cleaned up.  Here's the code that determines that: 
> 
>>>>>>> 
> 
> String version = map.get(id);
> if (version == null)
>   // Does not need refetching
>   continue;
> 
> // This document was marked as "not scan only", so we expect to find it.
> if (Logging.connectors.isDebugEnabled())
>   Logging.connectors.debug("JDBC: Document data result found for '"+id+"'");
> o = row.getValue(JDBCConstants.urlReturnColumnName);
> if (o == null)
> {
>   Logging.connectors.debug("JDBC: Document '"+id+"' has a null url - skipping");
>   errorCode = activities.NULL_URL;
>   errorDesc = "Excluded because document had a null URL";
>   activities.noDocument(id,version);
>   continue;
> }
> 
> // This is not right - url can apparently be a BinaryInput
> String url = JDBCConnection.readAsString(o);
> boolean validURL;
> try
> {
>   // Check to be sure url is valid
>   new java.net.URI(url);
>   validURL = true;
> }
> catch (java.net.URISyntaxException e)
> {
>   validURL = false;
> }
> 
> if (!validURL)
> {
>   Logging.connectors.debug("JDBC: Document '"+id+"' has an illegal url: '"+url+"' - skipping");
>   errorCode = activities.BAD_URL;
>   errorDesc = "Excluded because document had illegal URL ('"+url+"')";
>   activities.noDocument(id,version);
>   continue;
> }
> 
> // Process the document itself
> Object contents = row.getValue(JDBCConstants.dataReturnColumnName);
> // Null data is allowed; we just ignore these
> if (contents == null)
> {
>   Logging.connectors.debug("JDBC: Document '"+id+"' seems to have null data - skipping");
>   errorCode = "NULLDATA";
>   errorDesc = "Excluded because document had null data";
>   activities.noDocument(id,version);
>   continue;
> }
> 
> // We will ingest something, so remove this id from the map in order that we
> // know what we still need to delete when all done.
> map.remove(id);
> <<<<<< 
> 
> As you see, activities.noDocument() is called for all cases, except the one 
> where the document version is null (which cannot happen since all document 
> versions for this case will be the empty string).  So I am at a loss to 
> understand why the delete is not happening. 
> 
> The only explanation I can think of is that you clicked one of the buttons on 
> the output connection's view page that tells MCF to "forget" all the history 
> for that connection. 
> 
> Thanks, 
> Karl 
> 
> On Wed, Apr 26, 2017 at 10:42 AM, <[email protected]> wrote:
> 
> Hi Karl, 
> 
> I was manually starting the job for testing purposes, but even if I schedule it 
> with job invocation "Complete" and "Scan every document once", the missing 
> IDs from the database are not deleted from my Solr index (there is no trace of 
> any 'document deletion' event in the history).
> I should mention that I only use the 'Seeding query' and the 'Data query', and 
> I am not using the $(STARTTIME) and $(ENDTIME) variables in my seeding query. 
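> 
> For illustration, my two queries are roughly of the following shape (doctable 
> and its columns stand in for my real schema, so treat the exact names as 
> placeholders): 
> 
> Seeding query: 
> Select doc.id as "$(IDCOLUMN)" 
> From doctable doc 
> 
> Data query: 
> Select doc.id as "$(IDCOLUMN)", doc.url as "$(URLCOLUMN)", doc.content as "$(DATACOLUMN)" 
> From doctable doc 
> Where doc.id in $(IDLIST) 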
> 
> Julien
> 
> On 26.04.2017 16:05, Karl Wright wrote: 
> Hi Julien, 
> 
> How are you starting the job?  If you use "Start minimal", deletion would not 
> take place.  If your job is a continuous one, this is also the case. 
> 
> Thanks, 
> Karl 
> 
> On Wed, Apr 26, 2017 at 9:52 AM, <[email protected]> wrote:
> Hi the MCF community,
> 
> I am using MCF 2.6 with the JDBC connector to crawl an Oracle database and 
> index the data into a Solr server, and it works very well. However, when I 
> perform a delta re-crawl, the new IDs are correctly retrieved from the 
> database, but those that have been deleted are not "detected" by the connector 
> and thus are still present in my Solr index.
> I would like to know whether this should normally work and I have perhaps 
> missed something in the configuration of the job, or whether this is simply 
> not implemented.
> The only way I have found to solve this issue is to reset the seeding of the 
> job, but it is very time- and resource-consuming.
> 
> Best regards,
> Julien Massiera

 
