Author: Julien D.
Email: jul...@clustaar.com
Message:

> Hello,
>
> > Hello,
> >
> > I couldn't find any information on this subject.
> > As people start using HTTPS, I get more and more problems when crawling with
> > links that don't use a specific protocol.
> >
> > Let's take this example of a link from http://www.example.com/page-a.html :
> > <a href="//www.example.com/page-b.html">text</a>
> >
> > Will be seen as : http://www.example.com/www.example.com/page-b.html
> > And of course will cause a 404 error.
> >
> > Any idea on how to get the right links ?
> >
> > Thanks.
>
> The crawler stores full URLs in the database.
> But you can remove the protocol at search time,
> using the search template language functionality.
>
> In 3.4.x use regex_substr:
> http://www.mnogosearch.org/doc34/msearch-templates.html#template-functions
>
> In 3.3.x use the EREG template operator:
> http://www.mnogosearch.org/doc33/msearch-templates-oper.html#templates-oper-misc
>

Hello Alexander,

Thanks for the answer.

However, the problem occurs during the indexing phase: the crawler tries to index
http://www.example.com/www.example.com/page-b.html (which does not exist)
instead of http://www.example.com/page-b.html

Can I prevent those 404 errors?

Thanks!

Reply: <http://www.mnogosearch.org/board/message.php?id=21810>
_______________________________________________
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general
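
For reference, a link such as //www.example.com/page-b.html is a "network-path reference" in RFC 3986 (section 4.2): it names a host but inherits the scheme of the page it appears on. The sketch below uses Python's standard urllib.parse.urljoin to show the resolution expected for the example above. It is only an illustration of the expected result, not mnoGoSearch code, and the helper name resolve_link is hypothetical.

```python
from urllib.parse import urljoin


def resolve_link(base_url: str, href: str) -> str:
    """Resolve href against the URL of the page it was found on.

    resolve_link is a hypothetical helper for illustration; the actual
    resolution is done by the standard library's urljoin.
    """
    return urljoin(base_url, href)


# The link from the example, found on an http:// page.
resolved = resolve_link(
    "http://www.example.com/page-a.html",
    "//www.example.com/page-b.html",
)

# The same link found on an https:// page inherits https instead.
https_resolved = resolve_link(
    "https://www.example.com/page-a.html",
    "//www.example.com/page-b.html",
)

print(resolved)
print(https_resolved)
```

In both cases the leading // is read as the start of a host name rather than as a path on the current host, so no www.example.com/www.example.com/... URL is produced.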