Re: Logging and Document filter transformation connector

Olivier Tavard Wed, 17 Oct 2018 07:35:33 -0700

Hi Karl,

I opened a ticket on JIRA; it will be simpler to discuss it there:
https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1547
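
To make the request concrete, here is a minimal sketch of the behaviour I am asking for: every rejection path records an explicit result code, instead of one path setting a code and the other leaving it null. It does not use the ManifoldCF API. The class, the enum values, and the two filter predicates are all hypothetical names made up for this illustration.

```java
import java.util.function.Predicate;

public class RejectionHistorySketch {

    // Hypothetical result codes for this sketch; not the real ManifoldCF constants.
    enum ResultCode { OK, EXCLUDED_URL, EXCLUDED_MIMETYPE }

    // One entry of a simplified stand-in for the simple history.
    static final class HistoryEntry {
        final String documentId;
        final ResultCode code;
        final String description;

        HistoryEntry(String documentId, ResultCode code, String description) {
            this.documentId = documentId;
            this.code = code;
            this.description = description;
        }
    }

    // Checks the URL first, then the content type, and always returns an entry
    // with an explicit code, so no rejection ends up with a null result code.
    static HistoryEntry process(String documentId, String mimeType,
                                Predicate<String> urlAllowed,
                                Predicate<String> mimeTypeAllowed) {
        if (!urlAllowed.test(documentId)) {
            return new HistoryEntry(documentId, ResultCode.EXCLUDED_URL,
                "Rejected due to URL ('" + documentId + "')");
        }
        if (!mimeTypeAllowed.test(mimeType)) {
            return new HistoryEntry(documentId, ResultCode.EXCLUDED_MIMETYPE,
                "Rejected due to content type ('" + mimeType + "')");
        }
        return new HistoryEntry(documentId, ResultCode.OK, "Accepted");
    }

    public static void main(String[] args) {
        // Hypothetical filters: reject .html URLs, accept only Word documents.
        Predicate<String> urlAllowed = u -> !u.endsWith(".html");
        Predicate<String> mimeTypeAllowed = m -> m.equals("application/msword");

        String[][] documents = {
            {"https://example.com/page.html", "text/html"},
            {"https://example.com/image.png", "image/png"},
            {"https://example.com/report.doc", "application/msword"},
        };
        for (String[] doc : documents) {
            HistoryEntry entry = process(doc[0], doc[1], urlAllowed, mimeTypeAllowed);
            System.out.println(entry.documentId + " -> " + entry.code
                + " (" + entry.description + ")");
        }
    }
}
```

With something like this, the png file would appear in the history with its own rejection code, just as the html file does today.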


Thanks,

Olivier 


> On 11 Oct 2018, at 19:25, Karl Wright <[email protected]> wrote:
> 
> The fact that the history is different for the two suggests that the 
> mechanism is different.  You can turn on connector logging and that should 
> help figure out why the png is being rejected.  Once we know that, it should 
> be possible to consider improvements to the history.
> 
> Karl
> 
> On Thu, Oct 11, 2018, 10:41 AM Olivier Tavard <[email protected]> wrote:
> Hello Karl,
> 
> OK, thanks for the detailed explanation.
> So I understand that we cannot add a distinct result code if the repository 
> connector has no knowledge of the pipeline.
> My problem is that sometimes we do not have any activity status about an 
> excluded file.
> 
> To be more precise, I created a job that only keeps doc and docx extensions 
> (web repository connector and document filter transformation connector). If 
> you look at the screenshot, you will see that the html and the png files are 
> excluded by the repository connector as expected, but only the html file has 
> a specific activity log entry with an explicit result code (EXCLUDEURL).
> 
> The png file only has a "fetch" activity, with a 200 result code. I had to 
> enable debug mode to find a log line about the exclusion of the png file:
> "Removing url 'https://www.datafari.com/assets/img/img_feature_phone_list.png' 
> because it had the wrong content type ('image/png')"
> The related code is at line 902 of the WebcrawlerConnector, and it contains 
> only:
> activityResultCode = null; 
> 
> On the other hand, for the html file, the relevant section is at line 1366, 
> and it has explicit code to handle the exclusion:
> 
> errorCode = activities.EXCLUDED_URL;
> errorDesc = "Rejected due to URL ('"+documentIdentifier+"')";
> activities.noDocument(documentIdentifier,versionString);
> 
> I do not understand why the html file gets an activity log entry with a 
> specific result code while the png file does not. Would it be possible to 
> have the same kind of log entry for all excluded files?
> 
> Thanks,
> Best regards,
> 
> Olivier 
> 
>> On 11 Oct 2018, at 16:00, Karl Wright <[email protected]> wrote:
>> 
>> Hi Olivier,
>> 
>> The Repository connector has no knowledge of what the pipeline looks like.  
>> It simply asks the framework whether the mime type, length, etc. is 
>> acceptable to the downstream pipeline.  It's the connector's responsibility 
>> to note the reason for the rejection in the simple history, but it does not 
>> have any knowledge whatsoever of which connector rejected the document, and 
>> therefore cannot say which transformer or output rejected the document.
>> 
>> Transformation and output connectors which respond to document mime type 
>> or length checks likewise do not have any knowledge of the upstream 
>> connector that is doing the checking.
>> 
>> Karl
>> 
>> 
>> 
>> On Thu, Oct 11, 2018 at 9:31 AM Olivier Tavard <[email protected]> wrote:
>> Hello,
>> 
>> I have a question regarding the Document filter transformation connector and 
>> its logging.
>> I would like to see all the documents excluded by the rules configured in 
>> the Document filter transformation connector, either in the Simple history 
>> or in the MCF log, but so far this is not easy.
>> 
>> Let’s say that I want to crawl a website and index html pages only. I 
>> configure a web repository connector with a Document filter transformation 
>> connector, and I create a rule with only one allowed mime type and one file 
>> extension. So far so good, the job works well. But if I want to see, in the 
>> MCF log or in the simple history, all the files that were excluded by the 
>> transformation connector, it quickly gets complicated: I have to search 
>> manually for all the files that were fetched but not processed by the Tika 
>> transformation connector or ingested by the output connector.
>> 
>> From my understanding of the code, the document filter transformation 
>> connector can communicate its exclusion rules directly to the repository 
>> connector, so the documents that need to be excluded are not processed in 
>> the Document filter transformation connector but are excluded directly by 
>> the web repository connector.
>> So in the simple history, I can see that a document that will be excluded 
>> has a "fetch" activity and that’s it; there is no additional information 
>> about it.
>> Would it be possible to add a log entry with an explicit result code such 
>> as excluded by "document filter connector", or something similar to what is 
>> logged when the document is excluded by the repository connector?
>>  
>> Thank you,
>> Best regards,
>> Olivier 
>> 
> 
> <simple_history_web_job_document_filter.jpg>
