On 2010-11-25 11:27, Claudio Martella wrote:
> Hello list,
>
> I played around with parse-js about 6 months ago and found some problems
> with bogus url links extraction and some infinite loops. I remember that
> some of the code was considered legacy and there was some effort and
> discussion about providing a new implementation (maybe in tika? i can't
> find the discussion).
>
> How is the state of parse-js? Is it still considered kind of a hack?

Pretty much yes. Since we can't afford to run a full JavaScript
interpreter, we need to use some other method to extract likely URL
strings. Currently we use a simple regex - probably too simplistic.
Another possible way of attacking the problem would be to parse the
JavaScript using a lightweight parser (e.g. ANTLR + JS grammar) and then
traverse the AST to extract strings that look like URLs.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
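
P.S. To make the "extract strings that look like URLs" idea concrete, here is a rough sketch. It is not the parse-js code and it does not use ANTLR: it swaps in a small hand-written scanner that only understands comments and quoted string literals, which sits somewhere between the current regex and a real parser. The class name JsUrlSketch, the method extractLikelyUrls and the URL_LIKE heuristic are all made up for illustration, and the heuristic is an assumption that would certainly need tuning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch only; not taken from the parse-js plugin. */
final class JsUrlSketch {

    // Assumed heuristic for "looks like a URL": an absolute http(s) URL or a
    // path starting with '/', with no whitespace, quotes or angle brackets.
    private static final Pattern URL_LIKE =
        Pattern.compile("^(?:https?://|/)[^\\s\"'<>]+$");

    /** Collects quoted string literals in the script that match URL_LIKE. */
    static List<String> extractLikelyUrls(String js) {
        List<String> urls = new ArrayList<>();
        int n = js.length();
        int i = 0;
        // The index only ever moves forward, so the scan ends on any input.
        while (i < n) {
            char c = js.charAt(i);
            if (c == '/' && i + 1 < n && js.charAt(i + 1) == '/') {
                // Line comment: skip to the end of the line.
                int eol = js.indexOf('\n', i);
                i = (eol < 0) ? n : eol + 1;
            } else if (c == '/' && i + 1 < n && js.charAt(i + 1) == '*') {
                // Block comment: skip past the closing "*/".
                int end = js.indexOf("*/", i + 2);
                i = (end < 0) ? n : end + 2;
            } else if (c == '"' || c == '\'') {
                // String literal: read up to the matching, unescaped quote.
                StringBuilder literal = new StringBuilder();
                boolean closed = false;
                i++;
                while (i < n) {
                    char d = js.charAt(i);
                    if (d == '\\' && i + 1 < n) {
                        // Simplification: drop the backslash and keep the next
                        // char, so "http:\/\/x" becomes "http://x".
                        literal.append(js.charAt(i + 1));
                        i += 2;
                    } else if (d == c) {
                        closed = true;
                        i++;
                        break;
                    } else if (d == '\n') {
                        break; // unterminated literal, give up on it
                    } else {
                        literal.append(d);
                        i++;
                    }
                }
                if (closed && URL_LIKE.matcher(literal).matches()) {
                    urls.add(literal.toString());
                }
            } else {
                i++;
            }
        }
        return urls;
    }
}
```

Known gaps in this sketch: it does not recognise regex literals or template strings, and it cannot see URLs that are built by concatenation. Those are the cases where a real grammar and an AST walk would pay off.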

