Hi Olivier,

You can create a ticket, but I don't have a good solution for you in any case.
Karl

On Thu, Nov 15, 2018 at 6:53 AM Olivier Tavard <[email protected]> wrote:

> Hi Karl,
>
> Do you think I need to create a Jira issue for this bug, i.e. that
> link extraction does not work if the code inside JavaScript tags
> contains special characters like '>' or '<'?
>
> Thanks,
> Best regards,
>
> Olivier
>
> On Oct 30, 2018, at 12:05, Olivier Tavard <[email protected]> wrote:
>
> Hi Karl,
>
> Thanks for your answer.
> I kept looking into this and found the problem. The JavaScript code
> inside the <script></script> tags contained the character '<'. In that
> case, link extraction does not work with the web connector.
>
> To reproduce it, I created the page below, hosted it on a local Apache
> server, and indexed it with MCF 2.11 out of the box.
>
> In the first example, the page was:
> <!DOCTYPE html>
>
> <head>
> <title>test</title>
> <meta charset="utf-8" />
> *<script type="text/javascript"></script>*
>
> </head>
> <body>
>
> <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
> </body>
>
> Link extraction was correct, as the debug log shows:
> DEBUG 2018-10-30T11:46:12,584 (Worker thread '33') - WEB: Waiting for an HttpClient object
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Got an HttpClient object after 1 ms.
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Get method for '/testjs/test.html'
> INFO 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896372585+75|200|223|
> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: In html document 'http://localhost:8888/testjs/test.html', found link to 'https://manifoldcf.apache.org/en_US/index.html'
> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: no content exclusion rule supplied... returning
> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'
> --
> In the second example, the code was almost the same, except that I
> included the character '<' in the content of the script tags:
> <!DOCTYPE html>
>
> <head>
> <title>test</title>
> <meta charset="utf-8" />
> *<script type="text/javascript">a<b</script>*
>
> </head>
> <body>
>
> <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
>
> </body>
>
> Link extraction failed: the debug log shows no "found link" entry:
> DEBUG 2018-10-30T11:48:13,474 (Worker thread '36') - WEB: Waiting for an HttpClient object
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Got an HttpClient object after 1 ms.
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Get method for '/testjs/test.html'
> INFO 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896493475+76|200|226|
> DEBUG 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
> DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: no content exclusion rule supplied... returning
> DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'
> --
> So special characters like the less-than sign should be escaped in the
> code of the web connector so that link extraction keeps working.
>
> Thanks,
> Best regards,
>
> Olivier
>
> On Oct 29, 2018, at 19:39, Karl Wright <[email protected]> wrote:
>
> Hi Olivier,
>
> JavaScript inclusion in the Web Connector is not evaluated. In fact, no
> JavaScript is executed at all. Therefore it should not matter what is
> included via JavaScript.
>
> Thanks,
> Karl
>
> On Mon, Oct 29, 2018 at 1:39 PM Olivier Tavard <[email protected]> wrote:
>
>> Hi,
>>
>> Regarding the web connector, I noticed that on specific websites, some
>> JavaScript code can prevent the web connector from correctly fetching
>> all the links present on the page. Specifically, this happens on
>> websites that contain a deprecated version of the New Relic web agent,
>> such as js-agent.newrelic.com/nr-1071.min.js.
>> After downloading the page locally and removing the reference to the
>> New Relic browser agent, the web connector fetched the links in the
>> page correctly. So it seems that the JavaScript injected by the New
>> Relic agent was the reason the links were not fetched.
>> This case is rare and concerns only old versions of the New Relic agent.
>> But more generally, would it be possible to block JavaScript injection
>> at the connector level during indexing?
>>
>> Thanks,
>> Best regards,
>> Olivier
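
For reference, the behaviour expected in the second example can be sketched outside ManifoldCF. The snippet below is only an illustration, not the web connector's parser: it uses Python's standard html.parser, which treats the content of <script> as raw text, so the '<' in "a<b" is not read as the start of a tag and the link after it is still found. The names LinkExtractor, extract_links, and PAGE are hypothetical and were chosen for this example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (illustrative, not MCF code)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value.strip())


# The page from the second example in this thread: '<' inside <script>.
PAGE = """<!DOCTYPE html>
<head>
<title>test</title>
<meta charset="utf-8" />
<script type="text/javascript">a<b</script>
</head>
<body>
<a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
</body>
"""


def extract_links(html):
    """Return all <a href> targets found in the given HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    print(extract_links(PAGE))
```

The point of the sketch is the design choice rather than the library: a link extractor that switches to a raw-text mode between <script> and </script> does not need the page author to escape '<' or '>' in inline JavaScript.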
