Oh yes indeed, never mind, I missed that point. So all is OK. Thanks Karl!
On 27.04.2017 13:28, Karl Wright wrote:

Hi Julien,

The JDBC connector uses MODEL_ADD_CHANGE. The contract for MODEL_ADD_CHANGE is that seeding must include at least all documents added or changed within the specified time range; returning more than that is allowed. That requirement is met.

For document processing, the connector receives a list of document identifiers to process. The documents are *all* queried for, and no start or end time figures into that at all. The code changes simply note which documents were queried for but were not actually found, and delete those.

So I believe the logic to be correct.

Karl

On Thu, Apr 27, 2017 at 6:32 AM, <[email protected]> wrote:

Hi Karl,

Yes, your fix works. However, doesn't it break the logic of the delta feature provided by the seeding query, which makes good use of the $(STARTTIME) and $(ENDTIME) variables?

For example, let's assume the docs in my database have a timestamp that indicates their last modification date. If I set the following 'Seeding query':

SELECT doc.id AS "$(IDCOLUMN)"
FROM doctable doc
WHERE doc.lastmod > $(STARTTIME)

the advantage is that the first crawl will retrieve all my docs from the database, and subsequent ones will only retrieve those that are new or have been modified since the last crawl.

Now, if I combine that with a 'Version check query', each execution of the job will also check the version of all the docs crawled since the very first crawl, and delete those that have disappeared from the database.

I think that with your modification this logic is completely broken, because during a 'delta' crawl all the docs that were previously crawled but do not appear in the delta will be deleted, even though they may still be present in the database.

I would change your fix to only apply the 'seenDocuments' condition when the $(STARTTIME) and $(ENDTIME) variables are not present in the 'Seeding query' and the 'Version check query' is empty.

What do you think?

Anyway, thanks for your quick fix,
Julien
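(For reference, a 'Version check query' of the kind described above typically takes the following shape, using the JDBC connector's standard substitution variables. This is a sketch only: the doctable and lastmod names are carried over from the example seeding query, and the modification timestamp simply doubles as the version value.)

SELECT doc.id AS $(IDCOLUMN), doc.lastmod AS $(VERSIONCOLUMN)
FROM doctable doc
WHERE doc.id IN $(IDLIST)

(With a query like this, documents whose returned version matches the stored one are not re-fetched, and it is this per-run comparison of the full set of known documents against the database that the delta discussion above hinges on.)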
On 26.04.2017 19:12, Karl Wright wrote:

I committed a fix to trunk, and also uploaded a patch to the ticket. Please let me know if it works for you.

Thanks,
Karl

On Wed, Apr 26, 2017 at 11:24 AM, <[email protected]> wrote:

Oh OK, so I finally don't have to investigate :)

Thanks Karl!

Julien

On 26.04.2017 17:20, Karl Wright wrote:

Oh, never mind. I see the issue, which is that without the version query, documents that don't appear in the result list *at all* are never removed from the map. I'll create a ticket.

Karl

On Wed, Apr 26, 2017 at 11:10 AM, Karl Wright <[email protected]> wrote:

Hi Julien,

The delete logic in the connector is as follows:

>>>>>>>

// Now, go through the original id's, and see which ones are still in the map.
// These did not appear in the result and are presumed to be gone from the
// database, and thus must be deleted.
for (String documentIdentifier : documentIdentifiers)
{
  if (fetchDocuments.contains(documentIdentifier))
  {
    String documentVersion = map.get(documentIdentifier);
    if (documentVersion != null)
    {
      // This means we did not see it (or data for it) in the result set.  Delete it!
      activities.noDocument(documentIdentifier,documentVersion);
      activities.recordActivity(null, ACTIVITY_FETCH,
        null, documentIdentifier, "NOTFETCHED", "Document was not seen by processing query", null);
    }
  }
}

<<<<<<

For a JDBC job without a version query, fetchDocuments contains all the documents, while map has had the entries removed that were actually fetched. Documents that were *not* fetched, for whatever reason, therefore will not be cleaned up. Here's the code that determines that:

>>>>>>>

String version = map.get(id);
if (version == null)
  // Does not need refetching
  continue;

// This document was marked as "not scan only", so we expect to find it.
if (Logging.connectors.isDebugEnabled())
  Logging.connectors.debug("JDBC: Document data result found for '"+id+"'");
o = row.getValue(JDBCConstants.urlReturnColumnName);
if (o == null)
{
  Logging.connectors.debug("JDBC: Document '"+id+"' has a null url - skipping");
  errorCode = activities.NULL_URL;
  errorDesc = "Excluded because document had a null URL";
  activities.noDocument(id,version);
  continue;
}

// This is not right - url can apparently be a BinaryInput
String url = JDBCConnection.readAsString(o);
boolean validURL;
try
{
  // Check to be sure url is valid
  new java.net.URI(url);
  validURL = true;
}
catch (java.net.URISyntaxException e)
{
  validURL = false;
}

if (!validURL)
{
  Logging.connectors.debug("JDBC: Document '"+id+"' has an illegal url: '"+url+"' - skipping");
  errorCode = activities.BAD_URL;
  errorDesc = "Excluded because document had illegal URL ('"+url+"')";
  activities.noDocument(id,version);
  continue;
}

// Process the document itself
Object contents = row.getValue(JDBCConstants.dataReturnColumnName);
// Null data is allowed; we just ignore these
if (contents == null)
{
  Logging.connectors.debug("JDBC: Document '"+id+"' seems to have null data - skipping");
  errorCode = "NULLDATA";
  errorDesc = "Excluded because document had null data";
  activities.noDocument(id,version);
  continue;
}

// We will ingest something, so remove this id from the map in order that we
// know what we still need to delete when all done.
map.remove(id);

<<<<<<

As you can see, activities.noDocument() is called in every case except the one where the document version is null (which cannot happen here, since all document versions for this case will be the empty string). So I am at a loss to understand why the delete is not happening.

The only explanation I can think of is that you clicked one of the buttons on the output connection's view page that told MCF to "forget" all the history for that connection.

Thanks,
Karl

On Wed, Apr 26, 2017 at 10:42 AM, <[email protected]> wrote:

Hi Karl,

I was manually starting the job for test purposes, but even if I schedule it with job invocation "Complete" and "Scan every document once", the IDs missing from the database are not deleted in my Solr index (there is no trace of any 'document deletion' event in the history).

I should mention that I only use the 'Seeding query' and 'Data query', and I am not using the $(STARTTIME) and $(ENDTIME) variables in my seeding query.

Julien
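(For context, a minimal 'Data query' for a setup like the one described above might look as follows. This is a sketch: the url and content columns are illustrative and not taken from the actual schema, while the $(...) names are the connector's standard substitution variables.)

SELECT doc.id AS $(IDCOLUMN), doc.url AS $(URLCOLUMN), doc.content AS $(DATACOLUMN)
FROM doctable doc
WHERE doc.id IN $(IDLIST)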
On 26.04.2017 16:05, Karl Wright wrote:

Hi Julien,

How are you starting the job? If you use "Start minimal", deletion would not take place. If your job is a continuous one, this is also the case.

Thanks,
Karl

On Wed, Apr 26, 2017 at 9:52 AM, <[email protected]> wrote:

Hi MCF community,

I am using MCF 2.6 with the JDBC connector to crawl an Oracle database and index the data into a Solr server, and it works very well. However, when I perform a delta re-crawl, the new IDs are correctly retrieved from the database, but those that have been deleted are not "detected" by the connector and are thus still present in my Solr index.

I would like to know whether this should normally work and I have perhaps missed something in the configuration of the job, or whether it is simply not implemented.

The only way I have found to solve this issue is to reset the seeding of the job, but that is very time- and resource-consuming.

Best regards,
Julien Massiera
