Hi Julien,

The JDBC connector uses MODEL_ADD_CHANGE.  The requirement for
MODEL_ADD_CHANGE is that seeding include at least all documents added or
changed within the specified time range; seeding more than that is allowed.
That requirement is met.
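
To make that concrete, here is a minimal plain-Java sketch of the
MODEL_ADD_CHANGE seeding contract (a hypothetical helper written for
illustration, not ManifoldCF code): the seeded identifiers must be a superset
of everything added or changed in the requested window, and seeding extra
documents beyond the window is fine.

>>>>>>
import java.util.Map;
import java.util.Set;

public class AddChangeContractSketch
{
  // Hypothetical check: true if a set of seeded identifiers satisfies the
  // MODEL_ADD_CHANGE requirement for the window [startTime, endTime).
  public static boolean satisfiesContract(Set<String> seededIds,
    Map<String,Long> lastModifiedById, long startTime, long endTime)
  {
    for (Map.Entry<String,Long> entry : lastModifiedById.entrySet())
    {
      long lastMod = entry.getValue();
      // Every document modified inside the window must have been seeded...
      if (lastMod >= startTime && lastMod < endTime && !seededIds.contains(entry.getKey()))
        return false;
      // ...but seeding documents outside the window is perfectly fine.
    }
    return true;
  }
}
<<<<<<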

For document processing, the connector receives a list of document
identifiers to process.  The documents are *all* queried for, and no start
or end time figures into that at all.  The code changes simply note which
documents were queried for but were not actually found, and delete those.
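
In other words, the deletion decision is driven purely by which of the
queried-for identifiers actually came back from the database.  A minimal,
self-contained sketch of that idea (hypothetical helper, not the connector's
actual code):

>>>>>>
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

public class NotFoundDeletionSketch
{
  // Hypothetical helper: given the identifiers the connector was asked to
  // process and the identifiers the data query actually returned, compute
  // the identifiers that should be removed from the index.
  public static Set<String> identifiersToDelete(String[] documentIdentifiers,
    Collection<String> identifiersReturnedByQuery)
  {
    Set<String> notFound = new HashSet<>(Arrays.asList(documentIdentifiers));
    // Anything the database returned is still present...
    notFound.removeAll(identifiersReturnedByQuery);
    // ...and whatever remains was queried for but not found, so it is deleted.
    return notFound;
  }
}
<<<<<<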

So I believe the logic to be correct.

Karl


On Thu, Apr 27, 2017 at 6:32 AM, <[email protected]> wrote:

> Hi Karl,
>
> Yes, your fix works. However, doesn't it break the logic of the delta
> feature provided by a seeding query that makes use of the $(STARTTIME) and
> $(ENDTIME) variables?
>
> For example, let's assume that the docs in my database have a timestamp
> that indicates their last modification date. If I set the following
> 'Seeding query':
>
> Select doc.id as "$(IDCOLUMN)"
> From doctable doc
> Where doc.lastmod > $(STARTTIME)
>
> the advantage is that the first crawl will retrieve all my docs from the
> database and the next ones will only retrieve those that are new or have
> been modified since the last crawl.
>
> Now if I combine that with a 'Version check query', each execution of the
> job will also check the version of all the crawled docs since the very
> first crawl, and delete those that have disappeared from the database.
>
> I think that with your modification, this logic is completely broken,
> because during a 'delta' crawl all the docs that were crawled previously but
> do not appear in the delta will be deleted, even though they may still be
> present in the database.
> I would just change your fix to only apply the 'seenDocuments' condition
> when the $(STARTTIME) and $(ENDTIME) variables are not present in the
> 'Seeding query' and the 'Version check query' is empty.
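>
> As an illustrative sketch of that condition (the class, method, and
> parameter names here are hypothetical, not the connector's actual code):
>
> public class UnseenDeletionPolicySketch
> {
>   // Hypothetical policy: only treat "queried for but not seen" documents as
>   // deleted when the seeding query is not a delta query and there is no
>   // 'Version check query' to do the cleanup instead.
>   public static boolean shouldDeleteUnseenDocuments(String seedingQuery, String versionQuery)
>   {
>     boolean seedingUsesTimeWindow = seedingQuery.contains("$(STARTTIME)")
>       || seedingQuery.contains("$(ENDTIME)");
>     boolean hasVersionQuery = versionQuery != null && versionQuery.trim().length() > 0;
>     return !seedingUsesTimeWindow && !hasVersionQuery;
>   }
> }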
>
> What do you think?
>
> Anyway, thanks for your quick fix,
> Julien
>
> On 26.04.2017 19:12, Karl Wright wrote:
>
> I committed a fix to trunk, and also uploaded a patch to the ticket.
> Please let me know if it works for you.
>
> Thanks,
> Karl
>
>
> On Wed, Apr 26, 2017 at 11:24 AM, <[email protected]> wrote:
>
>> Oh OK so I finally don't have to investigate :)
>>
>> Thanks Karl!
>>
>> Julien
>>
>> On 26.04.2017 17:20, Karl Wright wrote:
>>
>> Oh, never mind.  I see the issue, which is that without the version
>> query, documents that don't appear in the result list *at all* are never
>> removed from the map.  I'll create a ticket.
>>
>> Karl
>>
>>
>> On Wed, Apr 26, 2017 at 11:10 AM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Julien,
>>>
>>> The delete logic in the connector is as follows:
>>>
>>> >>>>>>
>>>     // Now, go through the original id's, and see which ones are still in the map.  These
>>>     // did not appear in the result and are presumed to be gone from the database, and thus must be deleted.
>>>     for (String documentIdentifier : documentIdentifiers)
>>>     {
>>>       if (fetchDocuments.contains(documentIdentifier))
>>>       {
>>>         String documentVersion = map.get(documentIdentifier);
>>>         if (documentVersion != null)
>>>         {
>>>           // This means we did not see it (or data for it) in the result set.  Delete it!
>>>           activities.noDocument(documentIdentifier,documentVersion);
>>>           activities.recordActivity(null, ACTIVITY_FETCH,
>>>             null, documentIdentifier, "NOTFETCHED", "Document was not seen by processing query", null);
>>>         }
>>>       }
>>>     }
>>> <<<<<<
>>>
>>> For a JDBC job without a version query, fetchDocuments contains all the
>>> documents, but map has had the entries that were actually fetched removed.
>>> Documents that were *not* fetched for whatever reason therefore will not be
>>> cleaned up.  Here's the code that determines that:
>>>
>>> >>>>>>
>>>             String version = map.get(id);
>>>             if (version == null)
>>>               // Does not need refetching
>>>               continue;
>>>
>>>             // This document was marked as "not scan only", so we expect to find it.
>>>             if (Logging.connectors.isDebugEnabled())
>>>               Logging.connectors.debug("JDBC: Document data result found for '"+id+"'");
>>>             o = row.getValue(JDBCConstants.urlReturnColumnName);
>>>             if (o == null)
>>>             {
>>>               Logging.connectors.debug("JDBC: Document '"+id+"' has a null url - skipping");
>>>               errorCode = activities.NULL_URL;
>>>               errorDesc = "Excluded because document had a null URL";
>>>               activities.noDocument(id,version);
>>>               continue;
>>>             }
>>>
>>>             // This is not right - url can apparently be a BinaryInput
>>>             String url = JDBCConnection.readAsString(o);
>>>             boolean validURL;
>>>             try
>>>             {
>>>               // Check to be sure url is valid
>>>               new java.net.URI(url);
>>>               validURL = true;
>>>             }
>>>             catch (java.net.URISyntaxException e)
>>>             {
>>>               validURL = false;
>>>             }
>>>
>>>             if (!validURL)
>>>             {
>>>               Logging.connectors.debug("JDBC: Document '"+id+"' has an illegal url: '"+url+"' - skipping");
>>>               errorCode = activities.BAD_URL;
>>>               errorDesc = "Excluded because document had illegal URL ('"+url+"')";
>>>               activities.noDocument(id,version);
>>>               continue;
>>>             }
>>>
>>>             // Process the document itself
>>>             Object contents = row.getValue(JDBCConstants.dataReturnColumnName);
>>>             // Null data is allowed; we just ignore these
>>>             if (contents == null)
>>>             {
>>>               Logging.connectors.debug("JDBC: Document '"+id+"' seems to have null data - skipping");
>>>               errorCode = "NULLDATA";
>>>               errorDesc = "Excluded because document had null data";
>>>               activities.noDocument(id,version);
>>>               continue;
>>>             }
>>>
>>>             // We will ingest something, so remove this id from the map in order that we know what we still
>>>             // need to delete when all done.
>>>             map.remove(id);
>>> <<<<<<
>>>
>>> As you see, activities.noDocument() is called for all cases, except the
>>> one where the document version is null (which cannot happen since all
>>> document versions for this case will be the empty string).  So I am at a
>>> loss to understand why the delete is not happening.
>>>
>>> The only explanation I can think of is that you clicked one of the buttons
>>> on the output connection's view page that tell MCF to "forget" all the
>>> history for that connection.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>>
>>> On Wed, Apr 26, 2017 at 10:42 AM, <[email protected]>
>>> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> I was manually starting the job for testing purposes, but even if I
>>>> schedule it with job invocation "Complete" and "Scan every document once",
>>>> the IDs missing from the database are not deleted from my Solr index (no
>>>> trace of any 'document deletion' event in the history).
>>>> I should mention that I only use the 'Seeding query' and the 'Data query',
>>>> and I am not using the $(STARTTIME) and $(ENDTIME) variables in my seeding
>>>> query.
>>>>
>>>> Julien
>>>>
>>>> On 26.04.2017 16:05, Karl Wright wrote:
>>>>
>>>> Hi Julien,
>>>>
>>>> How are you starting the job?  If you use "Start minimal", deletion
>>>> would not take place.  If your job is a continuous one, this is also the
>>>> case.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>> On Wed, Apr 26, 2017 at 9:52 AM, <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi the MCF community,
>>>>>
>>>>> I am using MCF 2.6 with the JDBC connector to crawl an Oracle database
>>>>> and index the data into a Solr server, and it works very well. However,
>>>>> when I perform a delta re-crawl, the new IDs are correctly retrieved from
>>>>> the database, but those that have been deleted are not "detected" by the
>>>>> connector and thus are still present in my Solr index.
>>>>> I would like to know whether this should normally work and I have perhaps
>>>>> missed something in the job configuration, or whether this is simply not
>>>>> implemented.
>>>>> The only way I have found to solve this issue is to reset the seeding of
>>>>> the job, but that is very time- and resource-consuming.
>>>>>
>>>>> Best regards,
>>>>> Julien Massiera
>>>>
>>>>
>>>>
>>
>
