Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread h0444xk8
I had a quick look at Jira. I think there is already a ticket which
covers the requirement of using a sitemap.xml file that is referenced
by robots.txt:

https://issues.apache.org/jira/browse/CONNECTORS-1657

I'll update this ticket with information from the sitemap protocol page:
https://www.sitemaps.org/protocol.html#submit_robots

I'll also try to file further tickets requesting the ability to submit a
sitemap directly to the 'search engine' (in our case this role is played
by the ManifoldCF Web connector).
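
For context: per the protocol page above, direct submission is an HTTP GET
against a search engine's ping endpoint, roughly
<searchengine_URL>/ping?sitemap=<URL-encoded sitemap URL>. A minimal,
illustrative Java sketch of that submission format (the ping host is a
placeholder, not a real endpoint):

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class SitemapPing {
  public static void main(String[] args) throws Exception {
    // URL-encode the sitemap location, as the protocol requires.
    String sitemap = URLEncoder.encode(
      "https://www.example.de/sitemap/de-sitemap.xml.gz", StandardCharsets.UTF_8);
    // Placeholder ping endpoint; real engines publish their own URLs.
    URI ping = URI.create("https://searchengine.example/ping?sitemap=" + sitemap);
    HttpResponse<Void> resp = HttpClient.newHttpClient().send(
      HttpRequest.newBuilder(ping).GET().build(),
      HttpResponse.BodyHandlers.discarding());
    System.out.println("Ping returned HTTP " + resp.statusCode());
  }
}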

Sebastian

On 2021-07-07 16:00, Karl Wright wrote:

> If you wish to add a feature request, please create a CONNECTORS ticket that 
> describes the functionality you think the connector should have. 
> 
> Karl 
> 
> On Wed, Jul 7, 2021 at 9:29 AM h0444xk8  wrote: 
> 
>> Hi,
>> 
>> yes, that seems to be the reason. In:
>> 
>> https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/rss/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/rss/Robots.java
>> 
>> there is the following code sequence:
>> 
>> else if (lowercaseLine.startsWith("sitemap:"))
>> {
>>   // We don't complain about this, but right now we don't listen to it either.
>> }
>> 
>> But if I have a look at:
>> 
>> https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java
>> 
>> a sitemap containing an urlset seems to be handled:
>> 
>> else if (localName.equals("urlset") || localName.equals("sitemapindex"))
>> {
>>   // Sitemap detected
>>   outerTagCount++;
>>   return new UrlsetContextClass(theStream,namespace,localName,qName,atts,documentURI,handler);
>> }
>> 
>> So, my question is: is there another way to handle sitemaps inside the 
>> Web Crawler?
>> 
>> Cheers, Sebastian
>> 
>> On 2021-07-07 12:23, Karl Wright wrote:
>> 
>>> The robots parsing does not recognize the "sitemaps" line, which was 
>>> likely not in the spec for robots when this connector was written.
>>> 
>>> Karl
>>> 
>>> On Wed, Jul 7, 2021 at 3:31 AM h0444xk8  wrote:
>>> 
 Hi,
 
 I have a general question. Does the Web connector support sitemap
 files referenced by robots.txt? In my use case the robots.txt is
 stored in the root of the website and references two compressed
 sitemaps.
 
 Example of robots.txt:
 
 User-Agent: *
 Disallow:
 Sitemap: https://www.example.de/sitemap/de-sitemap.xml.gz
 Sitemap: https://www.example.de/sitemap/en-sitemap.xml.gz
 
 When the crawl starts, there is an error log entry in „Simple History"
 as follows:
 
 Unknown robots.txt line: 'Sitemap:
 https://www.example.de/sitemap/en-sitemap.xml.gz'
 
 Is the problem with sitemaps in general, with sitemaps referenced in
 robots.txt, or specifically with compressed sitemaps?
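
For what it's worth, a .xml.gz sitemap is plain gzip, which standard Java can
decompress before XML parsing. A minimal sketch, assuming a plain URL fetch
rather than ManifoldCF's own fetch pipeline:

import java.io.InputStream;
import java.net.URL;
import java.util.zip.GZIPInputStream;

public class SitemapStream {
  // Open a sitemap URL, transparently decompressing gzipped (.gz) files.
  public static InputStream open(String url) throws Exception {
    InputStream raw = new URL(url).openStream();
    return url.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
  }
}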
 
 Best regards
 
 Sebastian

