Re: web connector : links extraction issues

Olivier Tavard Tue, 30 Oct 2018 04:05:42 -0700

Hi Karl,

Thanks for your answer.
I kept looking into this and found the problem: the JavaScript code inside 
the <script></script> tags contained the character '<'. When it does, link 
extraction does not work with the web connector.

To reproduce it, I created the page below, hosted it on a local Apache server, 
and indexed it with MCF 2.11 out of the box.

In the first example the page was:
<!DOCTYPE html>

<head>
<title>test</title>
<meta charset="utf-8" />
<script type="text/javascript"></script>

</head>
<body>

<a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
</body>

Link extraction worked correctly, as the debug log shows:
DEBUG 2018-10-30T11:46:12,584 (Worker thread '33') - WEB: Waiting for an HttpClient object
DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Got an HttpClient object after 1 ms.
DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Get method for '/testjs/test.html'
 INFO 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896372585+75|200|223|
DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: In html document 'http://localhost:8888/testjs/test.html', found link to 'https://manifoldcf.apache.org/en_US/index.html'
DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: no content exclusion rule supplied... returning
DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'
—
In the second example, the code was almost the same, except that I included 
the character '<' in the content of the script tags:
<!DOCTYPE html>

<head>
<title>test</title>
<meta charset="utf-8" />
<script type="text/javascript">a<b</script>

</head>
<body>

    <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>

</body>

This time link extraction failed. The debug log shows no "found link" entry:
DEBUG 2018-10-30T11:48:13,474 (Worker thread '36') - WEB: Waiting for an HttpClient object
DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Got an HttpClient object after 1 ms.
DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Get method for '/testjs/test.html'
 INFO 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896493475+76|200|226|
DEBUG 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: no content exclusion rule supplied... returning
DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'
—
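To look for the same symptom in a larger log, here is a minimal Python sketch 
that lists documents for which link extraction started but no link was logged. 
The two message patterns are copied from the debug output above; the function 
names are hypothetical, and the sketch assumes one log entry per line, as in 
the actual log file rather than the wrapped form an email client produces.

```python
import re

# Patterns taken from the web connector debug messages quoted in this thread.
START_RE = re.compile(
    r"WEB: Document '([^']+)' is text, with encoding '[^']*'; "
    r"link extraction starting"
)
LINK_RE = re.compile(r"WEB: In html document '([^']+)', found link to '([^']+)'")


def links_per_document(log_lines):
    """Map each document whose link extraction started to the links logged for it."""
    found = {}
    for line in log_lines:
        match = START_RE.search(line)
        if match:
            found.setdefault(match.group(1), [])
            continue
        match = LINK_RE.search(line)
        if match:
            found.setdefault(match.group(1), []).append(match.group(2))
    return found


def documents_without_links(log_lines):
    """Return documents that were scanned for links but produced none."""
    return [doc for doc, links in links_per_document(log_lines).items() if not links]
```

A page that really has no links is reported as well, so the result is only a 
list of candidates to inspect by hand.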
So it looks like the web connector code needs to escape special characters 
such as the less-than sign in script content, so that link extraction keeps 
working.
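
As a baseline for comparison, the sketch below runs both test pages through 
Python's standard html.parser, which treats script content as raw text. It is 
not the web connector's parser and says nothing about how MCF is implemented; 
it only shows what a parser that skips over script content reports for these 
two pages. LinkCollector, extract_links and PAGE_TEMPLATE are hypothetical 
names, and the template is a whitespace-trimmed copy of the test page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags; script content is handled as raw text."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Trimmed copy of the test page; {script} is the content of the script tags.
PAGE_TEMPLATE = """<!DOCTYPE html>
<head>
<title>test</title>
<meta charset="utf-8" />
<script type="text/javascript">{script}</script>
</head>
<body>
<a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
</body>
"""

if __name__ == "__main__":
    for script in ("", "a<b"):
        print(repr(script), extract_links(PAGE_TEMPLATE.format(script=script)))
```

Both pages yield the same single link here, which is the result I expected 
from the connector in the second example as well.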

Thanks,
Best regards,

Olivier 

> On 29 Oct 2018, at 19:39, Karl Wright <[email protected]> wrote:
> 
> Hi Olivier,
> 
> Javascript inclusion in the Web Connector is not evaluated.  In fact, no 
> Javascript is executed at all.  Therefore it should not matter what is 
> included via javascript.
> 
> Thanks,
> Karl
> 
> 
> On Mon, Oct 29, 2018 at 1:39 PM Olivier Tavard <[email protected]> wrote:
> Hi,
> 
> Regarding the web connector, I noticed that on some websites, JavaScript 
> code can prevent the web connector from correctly fetching all the links 
> present on the page. Specifically, this happens on websites that include a 
> deprecated version of the New Relic web agent, such as 
> js-agent.newrelic.com/nr-1071.min.js.
> After downloading the page locally and removing the reference to the New 
> Relic browser agent, the web connector fetched the links on the page 
> correctly. So it seems that the JavaScript injected by the New Relic agent 
> was the reason the links were not fetched.
> This case is rare and concerns only old versions of the New Relic agent. But 
> more generally, would it be possible to block JavaScript injection at the 
> connector level during indexing?
>  
> Thanks,
> Best regards,
> Olivier 
> 
>
