On 3 September 2013 19:08, Jordi Carretero <[email protected]> wrote:
> Hi
>
> I'm building a spider using a regular expression extractor and a
> for-each-controller and works pretty well but..
>
> I'm using <a href="[.]*/([^"]+)" as a expression extractor , and works well
> to extract links like:
> <a href="../rel/c/items" >
> <a href="/professions.html" >
> but I can not find any expression that will work at the same time for
> expressions found in some sites like:
>
> <a
> href="http://www.mysite.es/index.php?main_page=page&id=20"
>
> that include the full domain at the beginning (and has to be removed)
>
> It's a matter of working with the perl expression but after some days I
> could not manage to make it work, so any help will be appreciated

If you want to ignore an optional string, use something like:

(?:http://www\.mysite\.es)?

The form (abc)? means "abc or nothing"; the (?:...) form groups without
capturing, so its contents are not saved as a match group.

In your case, if you want to ignore any of ".", ".." and
"http://www.mysite.es", you could use:

(?:http://www\.mysite\.es|\.\.?)?

BTW, rather than use "[.]" to escape the meta-character ".", the usual
method is "\.".

> Thanks
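
For what it's worth, here is a minimal sketch of how the optional prefix group could be combined with the rest of your original expression. It is written in Python only so it can be run outside JMeter; the constructs used ((?:...), ?, \.) are the same in Perl5-style syntax, but I have not checked this exact pattern in the Regular Expression Extractor. The names LINK_RE and extract_paths are made up for the illustration, and the \s+ between "a" and "href" is my own assumption, added to cope with the line break in your third example.

```python
import re

# Hypothetical combined pattern: the optional non-capturing group swallows
# either the site prefix or a leading "." / "..", then group 1 captures
# everything after the following "/" up to the closing quote.
LINK_RE = re.compile(r'<a\s+href="(?:http://www\.mysite\.es|\.\.?)?/([^"]+)"')


def extract_paths(html):
    """Return the captured path of every matching <a href="..."> link."""
    return LINK_RE.findall(html)


# The three link shapes from the question.
sample = '''
<a href="../rel/c/items" >
<a href="/professions.html" >
<a
href="http://www.mysite.es/index.php?main_page=page&id=20" >
'''

paths = extract_paths(sample)
print(paths)
# ['rel/c/items', 'professions.html', 'index.php?main_page=page&id=20']
```

All three links yield a path without the leading "../", "/" or domain, so the same template should work for each of them.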
