Hi,

Yes, that seems to be the reason. In:

https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/rss/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/rss/Robots.java

there is the following code sequence:

          else if (lowercaseLine.startsWith("sitemap:"))
          {
            // We don't complain about this, but right now we don't listen to it either.
          }

But if I look at:

https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java

a sitemap document with a urlset (or sitemapindex) root element does seem to be handled:

      else if (localName.equals("urlset") || localName.equals("sitemapindex"))
      {
        // Sitemap detected
        outerTagCount++;
        return new UrlsetContextClass(theStream,namespace,localName,qName,atts,documentURI,handler);
      }

So, my question is: is there another way to handle sitemaps inside the Web Crawler?
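
To make clear what I mean by handling the robots.txt line, here is a minimal sketch of how the referenced sitemap URLs could be collected. This is not ManifoldCF code: the class RobotsSitemapSketch and the method extractSitemaps are hypothetical names, and treating everything after '#' as a comment is an assumption on my part.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of ManifoldCF.
public class RobotsSitemapSketch {

  private static final String PREFIX = "sitemap:";

  /** Collects the URLs of all "Sitemap:" lines found in a robots.txt body. */
  public static List<String> extractSitemaps(String robotsTxt) {
    List<String> sitemaps = new ArrayList<>();
    for (String rawLine : robotsTxt.split("\\r?\\n")) {
      // Assumption: '#' starts a comment, so drop it and anything after it.
      int hash = rawLine.indexOf('#');
      String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
      // Case-insensitive prefix match, like the lowercaseLine check above.
      if (line.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
        String url = line.substring(PREFIX.length()).trim();
        if (!url.isEmpty()) {
          sitemaps.add(url);
        }
      }
    }
    return sitemaps;
  }

  public static void main(String[] args) {
    String robots = "User-Agent: *\nDisallow:\n"
        + "Sitemap: https://www.example.de/sitemap/de-sitemap.xml.gz\n"
        + "Sitemap: https://www.example.de/sitemap/en-sitemap.xml.gz\n";
    // Prints one sitemap URL per line.
    extractSitemaps(robots).forEach(System.out::println);
  }
}
```

The collected URLs could then be queued like any other discovered link, so that the existing urlset/sitemapindex handling sees them; whether that fits the connector's design is a question for the list.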

Cheers Sebastian





On 07.07.2021 at 12:23, Karl Wright wrote:

The robots parsing does not recognize the "Sitemap:" line, which was likely not in the robots.txt spec when this connector was written.

Karl

On Wed, Jul 7, 2021 at 3:31 AM h0444xk8 <h0444...@posteo.de> wrote:

Hi,

I have a general question. Does the Web connector support sitemap files referenced by robots.txt? In my use case, the robots.txt is stored in
the root of the website and references two compressed sitemaps.

Example of robots.txt
------------------------
User-Agent: *
Disallow:
Sitemap: https://www.example.de/sitemap/de-sitemap.xml.gz [1]
Sitemap: https://www.example.de/sitemap/en-sitemap.xml.gz [2]

When the crawl starts, "Simple History" shows an error log entry as
follows:

Unknown robots.txt line: 'Sitemap: https://www.example.de/sitemap/en-sitemap.xml.gz [2]'

Is there a general problem with sitemaps, or only with sitemaps
referenced in robots.txt, or only with compressed sitemaps?
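
To illustrate the compressed case: as far as I understand it, a .xml.gz sitemap is just the sitemap XML wrapped in gzip, so reading it would mean decompressing first and then collecting the <loc> entries. A minimal sketch using only the JDK follows; the class GzipSitemapSketch and its methods are hypothetical names and have nothing to do with the connector's actual code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not part of ManifoldCF.
public class GzipSitemapSketch {

  /** Decompresses a gzipped sitemap stream and returns the text of every loc element. */
  public static List<String> readLocs(InputStream gzipped)
      throws IOException, XMLStreamException {
    List<String> locs = new ArrayList<>();
    XMLInputFactory factory = XMLInputFactory.newInstance();
    // Sitemaps need no DTD, so DTD processing is switched off.
    factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
    try (InputStream xml = new GZIPInputStream(gzipped)) {
      XMLStreamReader reader = factory.createXMLStreamReader(xml);
      while (reader.hasNext()) {
        // loc appears in both urlset and sitemapindex documents.
        if (reader.next() == XMLStreamConstants.START_ELEMENT
            && "loc".equals(reader.getLocalName())) {
          locs.add(reader.getElementText().trim());
        }
      }
      reader.close();
    }
    return locs;
  }

  /** Gzips a string in memory so the sketch can run without network access. */
  public static byte[] gzip(String text) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
      out.write(text.getBytes(StandardCharsets.UTF_8));
    }
    return buffer.toByteArray();
  }

  public static void main(String[] args) throws Exception {
    String sitemap = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
        + "<url><loc>https://www.example.de/de/index.html</loc></url>"
        + "</urlset>";
    // Prints each page URL listed in the sitemap.
    readLocs(new ByteArrayInputStream(gzip(sitemap))).forEach(System.out::println);
  }
}
```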

Best regards

Sebastian


Links:
------
[1] https://www.example.de/sitemap/de-sitemap.xml.gz
[2] https://www.example.de/sitemap/en-sitemap.xml.gz
