Hi Karl,

I opened a ticket in JIRA; it will be simpler to discuss it there: https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1547
Thanks,
Olivier

> On 11 Oct 2018, at 19:25, Karl Wright <[email protected]> wrote:
>
> The fact that the history is different for the two suggests that the
> mechanism is different. You can turn on connector logging, and that should
> help figure out why the png is being rejected. Once we know that, it should
> be possible to consider improvements to the history.
>
> Karl
>
> On Thu, Oct 11, 2018, 10:41 AM Olivier Tavard <[email protected]> wrote:
> Hello Karl,
>
> OK, thanks for the detailed explanation.
> So I understand that we cannot add a distinct result code if the repository
> connector has no knowledge of the pipeline.
> My problem is that sometimes we do not have any activity status about an
> excluded file.
>
> To be more precise, I created a job that only keeps doc and docx extensions
> (web repository connector and document filter transformation connector). If
> you look at the screenshot, you will see that the html and the png files are
> excluded by the repository connector as expected, but only the html file has
> a specific activity log entry with an explicit result code (EXCLUDEURL):
>
> The png file has only a "fetch" activity with a 200 result code. I had to
> activate the debug mode to find a log line about the exclusion of the png
> file:
> "Removing url 'https://www.datafari.com/assets/img/img_feature_phone_list.png'
> because it had the wrong content type ('image/png')"
> The code related to this is located at l. 902 in the WebcrawlerConnector and
> it contains only:
> activityResultCode = null;
>
> On the other hand, for the html file, the section is l.
> 1366 and it has explicit code to handle that:
>
> errorCode = activities.EXCLUDED_URL;
> errorDesc = "Rejected due to URL ('"+documentIdentifier+"')";
> activities.noDocument(documentIdentifier,versionString);
>
> I do not understand why, for the html file, the activity log entry is
> present with a specific result code, but not for the png file, for example.
> Would it be possible to have the same log entry for all the files?
>
> Thanks,
> Best regards,
>
> Olivier
>
>> On 11 Oct 2018, at 16:00, Karl Wright <[email protected]> wrote:
>>
>> Hi Olivier,
>>
>> The Repository connector has no knowledge of what the pipeline looks like.
>> It simply asks the framework whether the mime type, length, etc. is
>> acceptable to the downstream pipeline. It's the connector's responsibility
>> to note the reason for the rejection in the simple history, but it does not
>> have any knowledge whatsoever of which connector rejected the document, and
>> therefore cannot say which transformer or output rejected the document.
>>
>> Transformation and output connectors which respond to checks for document
>> mime type or length likewise do not have any knowledge of the upstream
>> connector that is doing the checking.
>>
>> Karl
>>
>> On Thu, Oct 11, 2018 at 9:31 AM Olivier Tavard <[email protected]> wrote:
>> Hello,
>>
>> I have a question regarding the Document filter transformation connector
>> and its logging.
>> I would like to see all the documents excluded by the rules configured in
>> the Document filter transformation connector by looking at the Simple
>> history or the MCF log, but it is not easy so far.
>>
>> Let’s say that I want to crawl a website and I want to index html pages
>> only. So I configure a web repository connector with a Document filter
>> transformation connector, and I create the rule with only one allowed mime
>> type and one file extension.
>> So far so good, the job works well, but if I want to see in the MCF log or
>> in the simple history all the files that were excluded by the
>> transformation connector, it quickly gets complicated: I have to search
>> manually for all the files that were fetched but not processed by the Tika
>> transformation connector or ingested by the output connector.
>>
>> From my understanding of the code, the document filter transformation
>> connector can communicate its exclusion rules directly to the repository
>> connector, so the documents that need to be excluded are not processed in
>> the Document filter transformation connector but are excluded directly by
>> the web repository connector.
>> So in the simple history, I can see that a document that will be excluded
>> has only a "fetch" activity and that’s it; there is no additional
>> information about it.
>> Would it be possible to add a log entry with an explicit result code, such
>> as excluded by "document filter connector", or something like the entry
>> written when the document is excluded by the repository connector?
>>
>> Thank you,
>> Best regards,
>> Olivier
>>
>
> <simple_history_web_job_document_filter.jpg>
