Re: about the state of parse-js and extracting links from js in general

Andrzej Bialecki Thu, 25 Nov 2010 03:22:49 -0800

On 2010-11-25 11:27, Claudio Martella wrote:
> Hello list,
> 
> I played around with parse-js about 6 months ago and found some problems
> with bogus URL link extraction and some infinite loops. I remember that
> some of the code was considered legacy and there was some effort and
> discussion about providing a new implementation (maybe in Tika? I can't
> find the discussion).
> 
> What is the state of parse-js? Is it still considered kind of a hack?


Pretty much, yes. Since we can't afford to run a full JavaScript
interpreter, we need to use some other method to extract likely URL
strings. Currently we use a simple regex, which is probably too simplistic.
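
To illustrate the regex approach and why it tends to be too simplistic,
here is a minimal Python sketch. It is not the pattern parse-js actually
uses; the URL_LIKE pattern and the extract_url_candidates name are
assumptions made up for this example.

```python
import re

# Hypothetical pattern (NOT the one parse-js uses): a quoted string literal
# that starts with http(s):// or a slash, or that ends in a common page
# extension.
URL_LIKE = re.compile(
    r"""(["'])((?:https?://|/)[^"'\s<>]+|[^"'\s<>]+\.(?:html?|php|jsp|aspx?))\1""",
    re.IGNORECASE,
)


def extract_url_candidates(js_source):
    """Return URL-like string literals in order of appearance, de-duplicated."""
    seen = []
    for match in URL_LIKE.finditer(js_source):
        candidate = match.group(2)
        if candidate not in seen:
            seen.append(candidate)
    return seen


if __name__ == "__main__":
    sample = (
        '// old link: "http://old.example.com/"\n'
        'window.location = "http://example.com/a.html";\n'
        "var img = '/images/x.png'; var s = \"hello\";\n"
    )
    for url in extract_url_candidates(sample):
        print(url)
```

Note that the regex has no idea what a comment is, so the commented-out
link in the sample is reported along with the live ones. That is the kind
of bogus extraction a pure regex is prone to.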

Another possible way of attacking the problem would be to parse the
JavaScript using a lightweight parser (e.g. ANTLR + JS grammar) and then
traverse the AST to extract string literals that look like URLs.
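
As a rough illustration of the direction, the Python sketch below uses a
small hand-written scanner as a stand-in for a real parser. It is not an
AST traversal and does not use ANTLR: it only skips comments, collects
string literals, and keeps those shaped like URLs. It does not understand
regex literals or template strings. The names URL_SHAPE, string_literals
and extract_urls are hypothetical.

```python
import re

# Hypothetical filter for "looks like a URL": absolute http(s) URL, a path
# starting with a slash, or a name ending in a common page extension.
URL_SHAPE = re.compile(
    r"^(?:https?://\S+|/\S+|\S+\.(?:html?|php|jsp|aspx?)(?:[?#]\S*)?)$",
    re.IGNORECASE,
)


def string_literals(js_source):
    """Yield contents of quoted string literals, skipping // and /* */ comments."""
    i, n = 0, len(js_source)
    while i < n:  # every branch below advances i, so the loop terminates
        two = js_source[i:i + 2]
        if two == "//":
            end = js_source.find("\n", i)
            i = n if end == -1 else end + 1
        elif two == "/*":
            end = js_source.find("*/", i + 2)
            i = n if end == -1 else end + 2
        elif js_source[i] in "\"'":
            quote, j, chars = js_source[i], i + 1, []
            while j < n and js_source[j] != quote and js_source[j] != "\n":
                if js_source[j] == "\\" and j + 1 < n:
                    j += 1  # simplification: keep the escaped char, drop the backslash
                chars.append(js_source[j])
                j += 1
            if j < n and js_source[j] == quote:
                yield "".join(chars)  # only properly terminated literals
            i = j + 1
        else:
            i += 1


def extract_urls(js_source):
    """Return string literals from js_source that match URL_SHAPE."""
    return [s for s in string_literals(js_source) if URL_SHAPE.match(s)]


if __name__ == "__main__":
    sample = (
        '// old link: "http://old.example.com/"\n'
        'var a = "http:\\/\\/example.com\\/a.html"; /* "x.php" */\n'
        "var b = '/img/x.png'; var c = \"hello world\";\n"
    )
    print(extract_urls(sample))
```

Because strings are recognized before their contents are inspected, the
"//" inside "http://..." is not mistaken for a comment, and links that
appear only in comments are dropped. A real grammar-based parser would
additionally give access to the surrounding syntax (assignments to
location, call arguments, and so on).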

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
