Hi Karl,

Do you think I should create a Jira issue for this bug, i.e. that link extraction does not work when the JavaScript code inside script tags contains special characters such as '>' or '<'?
Thanks,
Best regards,
Olivier

> On 30 Oct 2018 at 12:05, Olivier Tavard <[email protected]> wrote:
>
> Hi Karl,
>
> Thanks for your answer.
> I kept looking into this and found the problem. The JavaScript code
> inside the <script></script> tags contained the character '<'. When it
> does, link extraction does not work with the web connector.
>
> To reproduce it, I created the page below, hosted it on a local Apache
> server, and indexed it with MCF 2.11 out of the box.
>
> In the first example the page was:
> <!DOCTYPE html>
>
> <head>
> <title>test</title>
> <meta charset="utf-8" />
> <script type="text/javascript"></script>
>
> </head>
> <body>
>
> <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
> </body>
>
> Link extraction was correct. In the debug log:
> DEBUG 2018-10-30T11:46:12,584 (Worker thread '33') - WEB: Waiting for an HttpClient object
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Got an HttpClient object after 1 ms.
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Get method for '/testjs/test.html'
> INFO 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896372585+75|200|223|
> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: In html document 'http://localhost:8888/testjs/test.html', found link to 'https://manifoldcf.apache.org/en_US/index.html'
> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: no content exclusion rule supplied... returning
> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'
>
> In the second example, the code was almost the same, except that I
> included the character '<' in the content of the script tags:
> <!DOCTYPE html>
>
> <head>
> <title>test</title>
> <meta charset="utf-8" />
> <script type="text/javascript">a<b</script>
>
> </head>
> <body>
>
> <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
>
> </body>
>
> Link extraction was not successful. The debug log shows:
> DEBUG 2018-10-30T11:48:13,474 (Worker thread '36') - WEB: Waiting for an HttpClient object
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Got an HttpClient object after 1 ms.
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Get method for '/testjs/test.html'
> INFO 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896493475+76|200|226|
> DEBUG 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
> DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: no content exclusion rule supplied... returning
> DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'
>
> So special characters such as the less-than sign should be escaped in the
> code of the web connector to preserve link extraction.
>
> Thanks,
> Best regards,
>
> Olivier
>
>> On 29 Oct 2018 at 19:39, Karl Wright <[email protected]> wrote:
>>
>> Hi Olivier,
>>
>> Javascript inclusion in the Web Connector is not evaluated. In fact, no
>> Javascript is executed at all. Therefore it should not matter what is
>> included via javascript.
>>
>> Thanks,
>> Karl
>>
>>
>> On Mon, Oct 29, 2018 at 1:39 PM Olivier Tavard <[email protected]> wrote:
>> Hi,
>>
>> Regarding the web connector, I noticed that for specific websites, some
>> JavaScript code can prevent the web connector from correctly fetching all
>> the links present on the page. Specifically, this happens for websites that
>> contain a deprecated version of the New Relic web agent, such as
>> js-agent.newrelic.com/nr-1071.min.js.
>> After downloading the page locally and removing the reference to the New
>> Relic browser agent, the links on the page were correctly fetched by the
>> web connector.
>> So it seems that the JavaScript injection caused by the New Relic agent
>> was the reason the links on the page were not fetched.
>> This case is rare and concerns only old versions of the New Relic agent. But
>> more generally, would it be possible to block JavaScript injection at the
>> connector level during indexing?
>>
>> Thanks,
>> Best regards,
>> Olivier
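
For comparison, here is a minimal sketch in Python (not ManifoldCF code) that feeds the two test pages from this thread to the standard library's html.parser, which treats the content of a script element as raw text until the closing tag. The names LinkCollector, extract_links and PAGE_TEMPLATE are made up for this illustration. It only shows the behaviour one would expect from a link extractor on these pages, and assumes nothing about where the web connector's own parser goes wrong.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper for this sketch)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return every <a href="..."> value found in html_text, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


# The test page from the thread, with the script body left as a placeholder.
PAGE_TEMPLATE = """<!DOCTYPE html>
<head>
<title>test</title>
<meta charset="utf-8" />
<script type="text/javascript">{script_body}</script>
</head>
<body>
<a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
</body>
"""

EMPTY_SCRIPT_PAGE = PAGE_TEMPLATE.format(script_body="")   # first example
LESS_THAN_PAGE = PAGE_TEMPLATE.format(script_body="a<b")   # second example

if __name__ == "__main__":
    for label, page in (("empty script", EMPTY_SCRIPT_PAGE),
                        ("script with '<'", LESS_THAN_PAGE)):
        print(label, extract_links(page))
```

With this parser the link should be reported for both pages, because the '<' in "a<b" is never interpreted as the start of a tag while the parser is inside the script element.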
