Hi Karl,

Thanks for your answer. I kept looking into this and found the problem: the JavaScript code inside the <script></script> tags contained the character '<'. When that is the case, link extraction does not work with the web connector.
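For illustration only, here is a minimal Python sketch of a link extractor that treats the content of script tags as raw text, so that a '<' inside the script does not disturb parsing. This is not the MCF web connector code (which is Java); it relies on Python's standard html.parser module, and the names LinkExtractor, extract_links and PAGE_WITH_LT are made up for this example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values of <a> tags (hypothetical helper, not MCF code)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser reads <script> content as raw text up to </script>,
        # so a '<' inside the script is not mistaken for the start of a tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href values of all <a> tags found in html_text."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Sample page (an assumption for this sketch): a script containing '<'.
PAGE_WITH_LT = (
    '<!DOCTYPE html> <head> <title>test</title> <meta charset="utf-8" /> '
    '<script type="text/javascript">a<b</script> </head> <body> '
    '<a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a> '
    '</body>'
)

if __name__ == "__main__":
    print(extract_links(PAGE_WITH_LT))
```

The point of the sketch is only that a parser which switches to a raw-text mode inside script tags still reaches the <a> element that follows.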
To reproduce it, I created the page below, hosted it on a local Apache server, and indexed it with MCF 2.11 out of the box.

In the first example, the page was:

    <!DOCTYPE html>
    <head>
    <title>test</title>
    <meta charset="utf-8" />
    <script type="text/javascript"></script>
    </head>
    <body>
    <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
    </body>

Link extraction was correct, as the debug log shows:

    DEBUG 2018-10-30T11:46:12,584 (Worker thread '33') - WEB: Waiting for an HttpClient object
    DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
    DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Got an HttpClient object after 1 ms.
    DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Get method for '/testjs/test.html'
    INFO 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896372585+75|200|223|
    DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
    DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: In html document 'http://localhost:8888/testjs/test.html', found link to 'https://manifoldcf.apache.org/en_US/index.html'
    DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: no content exclusion rule supplied... returning
    DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'

In the second example, the page was almost the same, except that I included the character '<' in the content of the script tags:

    <!DOCTYPE html>
    <head>
    <title>test</title>
    <meta charset="utf-8" />
    <script type="text/javascript">a<b</script>
    </head>
    <body>
    <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
    </body>

Link extraction was not successful. This time the debug log has no "found link to" entry:

    DEBUG 2018-10-30T11:48:13,474 (Worker thread '36') - WEB: Waiting for an HttpClient object
    DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
    DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Got an HttpClient object after 1 ms.
    DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Get method for '/testjs/test.html'
    INFO 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896493475+76|200|226|
    DEBUG 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
    DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: no content exclusion rule supplied... returning
    DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'

So special characters such as the less-than sign should be escaped in the web connector code to preserve link extraction.

Thanks,
Best regards,
Olivier

> On 29 Oct 2018, at 19:39, Karl Wright <[email protected]> wrote:
>
> Hi Olivier,
>
> Javascript inclusion in the Web Connector is not evaluated. In fact, no
> Javascript is executed at all. Therefore it should not matter what is
> included via javascript.
>
> Thanks,
> Karl
>
>
> On Mon, Oct 29, 2018 at 1:39 PM Olivier Tavard <[email protected]> wrote:
> Hi,
>
> Regarding the web connector, I noticed that for specific websites, some
> Javascript code can prevent the web connector to fetch correctly all the
> links present on the page. Specifically, for websites that contain a
> deprecated version of New relic web agent as
> js-agent.newrelic.com/nr-1071.min.js
> <http://js-agent.newrelic.com/nr-1071.min.js>.
> After downloading the page locally and removing the reference to the new
> relic agent browser, the links were correctly fetched in the page by the web
> connector. So it seems that the Javascript injection here caused by the new
> relic agent was the cause of the links not fetched in the page.
> This case is rare and concerns only old versions of New Relic agent. But in a
> more generic way, would it be possible to block the javascript injection at
> the connector level during the indexation ?
>
> Thanks,
> Best regards,
> Olivier
