Re: web connector : links extraction issues

Olivier Tavard Thu, 15 Nov 2018 03:53:37 -0800

Hi Karl,

Do you think I should create a Jira issue for this bug, i.e. that link 
extraction does not work when the code inside <script> tags contains 
special characters such as '<' or '>'?
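
If it helps for the Jira issue, here is a minimal sketch of the behaviour I would expect. It does not use the web connector's own parser. It uses Python's standard html.parser as a stand-in reference, and the names LinkCollector, extract_links and PAGE_TEMPLATE are mine, chosen for illustration. It rebuilds the two test pages from my earlier message and checks that the same link is found whether or not the script content contains '<'.

```python
from html.parser import HTMLParser

# Hypothetical helper, not part of ManifoldCF: collects href values of <a> tags.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href of every <a> tag found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Same structure as the two test pages; only the script content differs.
PAGE_TEMPLATE = """<!DOCTYPE html>
<head>
<title>test</title>
<meta charset="utf-8" />
<script type="text/javascript">{script}</script>
</head>
<body>
<a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
</body>
"""

page_empty_script = PAGE_TEMPLATE.format(script="")
page_with_less_than = PAGE_TEMPLATE.format(script="a<b")

links_empty = extract_links(page_empty_script)
links_less_than = extract_links(page_with_less_than)
print(links_empty)
print(links_less_than)
```

A parser that treats the content of <script> as raw text until the closing </script> tag should report the same link for both pages, which is what I would expect from the web connector as well.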


Thanks,
Best regards,

Olivier



> On 30 Oct 2018, at 12:05, Olivier Tavard <[email protected]> wrote:
> 
> Hi Karl,
> 
> Thanks for your answer.
> I kept looking into this and found the problem. The JavaScript code inside 
> the <script></script> tags contained the character '<'. When it does, link 
> extraction does not work with the web connector.
> 
> To reproduce it, I created the page below, hosted it on a local Apache 
> server, and indexed it with MCF 2.11 out of the box.
> 
> In the first example the page was:
> <!DOCTYPE html>
> 
> <head>
> <title>test</title>
> <meta charset="utf-8" />
> <script type="text/javascript"></script>
> 
> </head>
> <body>
> 
> <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
> </body>
> 
> Link extraction was correct. The debug log shows:
> DEBUG 2018-10-30T11:46:12,584 (Worker thread '33') - WEB: Waiting for an HttpClient object
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Got an HttpClient object after 1 ms.
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Get method for '/testjs/test.html'
>  INFO 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896372585+75|200|223|
> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: In html document 'http://localhost:8888/testjs/test.html', found link to 'https://manifoldcf.apache.org/en_US/index.html'
> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: no content exclusion rule supplied... returning
> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'
> —
> In the second example, the code was almost the same, except that I 
> included the character '<' in the content of the script tags:
> <!DOCTYPE html>
> 
> <head>
> <title>test</title>
> <meta charset="utf-8" />
> <script type="text/javascript">a<b</script>
> 
> </head>
> <body>
> 
>     <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
>     
> </body>
> 
> Link extraction was not successful. The debug log shows no "found link" entry:
> DEBUG 2018-10-30T11:48:13,474 (Worker thread '36') - WEB: Waiting for an HttpClient object
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Got an HttpClient object after 1 ms.
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Get method for '/testjs/test.html'
>  INFO 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896493475+76|200|226|
> DEBUG 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
> DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: no content exclusion rule supplied... returning
> DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'
> —
> So special characters such as the less-than sign should be escaped in the 
> web connector's code so that link extraction keeps working.
> 
> Thanks,
> Best regards,
> 
> 
> Olivier 
> 
>> On 29 Oct 2018, at 19:39, Karl Wright <[email protected]> wrote:
>> 
>> Hi Olivier,
>> 
>> Javascript inclusion in the Web Connector is not evaluated.  In fact, no 
>> Javascript is executed at all.  Therefore it should not matter what is 
>> included via javascript.
>> 
>> Thanks,
>> Karl
>> 
>> 
>> On Mon, Oct 29, 2018 at 1:39 PM Olivier Tavard <[email protected]> wrote:
>> Hi,
>> 
>> Regarding the web connector, I noticed that on specific websites some 
>> JavaScript code can prevent the web connector from correctly fetching all 
>> the links present on the page. Specifically, this affects websites that 
>> include a deprecated version of the New Relic browser agent such as 
>> js-agent.newrelic.com/nr-1071.min.js.
>> After downloading the page locally and removing the reference to the New 
>> Relic browser agent, the web connector fetched the links in the page 
>> correctly. So it seems that the JavaScript injected by the New Relic agent 
>> was the reason the links in the page were not fetched.
>> This case is rare and concerns only old versions of the New Relic agent. 
>> But more generally, would it be possible to block JavaScript injection 
>> at the connector level during indexing?
>>  
>> Thanks,
>> Best regards,
>> Olivier 
>> 
>> 
>
