Re: [QGIS-Developer] GeoSeer ogc services data harvesting

2020-06-09 Thread Andreas Neumann

On 2020-06-09 13:46, Jonathan Moules wrote:

> Hi Andreas,
>
> Interesting.
>
> Behind the scenes, GeoSeer one-way hashes the GetCapabilities documents and
> that hash is used as the document key. Identical GetCapabilities documents
> therefore get the same key and thus only appear once in the final index. But
> one single character different in the entire document and it's a completely
> different hash.


Yes - they are not 100% identical. E.g. the "OnlineResource" URLs would be
different, but all the rest would be identical. I know there are exceptions
here and there ... it's a difficult topic.





Andreas

___
QGIS-Developer mailing list
QGIS-Developer@lists.osgeo.org
List info: https://lists.osgeo.org/mailman/listinfo/qgis-developer
Unsubscribe: https://lists.osgeo.org/mailman/listinfo/qgis-developer

Re: [QGIS-Developer] GeoSeer ogc services data harvesting

2020-06-09 Thread Jonathan Moules

Hi Andreas,

Interesting.

Behind the scenes, GeoSeer one-way hashes the GetCapabilities documents 
and that hash is used as the document key. Identical GetCapabilities 
documents therefore get the same key and thus only appear once in the 
final index. But one single character different in the entire document 
and it's a completely different hash.
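
A minimal sketch of the keying idea described above. The thread does not say which hash function GeoSeer uses, so SHA-256 is an assumption here, as are the function name and the toy documents:

```python
import hashlib


def document_key(capabilities_xml: bytes) -> str:
    """Return a hex digest to use as the key for one GetCapabilities document.

    SHA-256 is an assumption; any one-way hash would illustrate the same point.
    """
    return hashlib.sha256(capabilities_xml).hexdigest()


# Toy documents: doc_a and doc_b are byte-for-byte identical,
# doc_c differs from them by a single character ("R" vs "r").
doc_a = b"<WMS_Capabilities><Service><Title>Roads</Title></Service></WMS_Capabilities>"
doc_b = b"<WMS_Capabilities><Service><Title>Roads</Title></Service></WMS_Capabilities>"
doc_c = b"<WMS_Capabilities><Service><Title>roads</Title></Service></WMS_Capabilities>"

# Index keyed by hash: identical documents collapse into a single entry.
index = {}
for doc in (doc_a, doc_b, doc_c):
    index.setdefault(document_key(doc), doc)
```

Here the index ends up with two entries: one shared by the identical documents and one for the document that differs by a single character.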


There's also de-duplication at the endpoint, service, and dataset levels 
using a similar mechanism. GeoSeer also de-duplicates across services. 
I.e. if something is served from the same place as both WMS and WFS, we 
glue them together.
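
As a rough illustration of the cross-service gluing, assuming that "the same place" means the same host and path and that datasets are matched by name (both are guesses, the thread does not spell out the rule), a sketch could look like this. All names and URLs below are hypothetical:

```python
from collections import defaultdict
from urllib.parse import urlsplit


def endpoint_key(url: str) -> str:
    """Hypothetical key: lower-cased host plus path, ignoring scheme and query."""
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path


def glue_services(records):
    """Group (url, service_type, dataset_name) records by endpoint and name.

    Returns a dict mapping (endpoint, dataset_name) to the set of service
    types that offer that dataset.
    """
    merged = defaultdict(set)
    for url, service_type, name in records:
        merged[(endpoint_key(url), name)].add(service_type)
    return dict(merged)


records = [
    ("https://example.org/ows?SERVICE=WMS", "WMS", "fire_stations"),
    ("https://example.org/ows?SERVICE=WFS", "WFS", "fire_stations"),
    ("https://example.org/ows?SERVICE=WMS", "WMS", "roads"),
]
glued = glue_services(records)
```

In this toy input, "fire_stations" is offered via both WMS and WFS from the same endpoint and so becomes one entry carrying both service types.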


The problem with using DNS is that you get organisations the size of 
NOAA/USGS and they have deployments across various subdomains that are 
doing different (but similar) things. You also get a kind of opposite - 
a single domain belonging to a geospatial "cloud" hosting provider that 
has lots of layers that have the same names and similar metadata because 
all their local-government customers are sharing their own 
fire-stations/roads etc.


There are all manner of ways in which server admins and data custodians 
make this more complicated than it seems. :-)


Cheers,

Jonathan



Re: [QGIS-Developer] GeoSeer ogc services data harvesting

2020-06-09 Thread Andreas Neumann

Hi Jonathan,

Thanks for sharing this information. I don't know anything better. 


While looking at some services that I know personally, I also found out
that other services are listed twice, because a machine might have a
DNS alias. That is also something to consider - perhaps sort out
machines that have identical GetCapabilities responses where just the DNS
name varies.
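
One way this suggestion could be sketched: blank out the host part of every xlink:href before hashing, so that two responses differing only in their DNS name get the same key. This is an illustration only, not how GeoSeer works; the regular expression, the placeholder host, and the sample snippets are assumptions, and it would also merge genuinely different servers that happen to publish identical documents:

```python
import hashlib
import re

# Matches the scheme and host of any xlink:href attribute value.
_HREF_HOST = re.compile(r'(xlink:href=")https?://[^/"?]+')


def host_independent_key(capabilities_xml: str) -> str:
    """Hash a capabilities document after replacing href hosts with a placeholder."""
    normalised = _HREF_HOST.sub(r"\1http://HOST", capabilities_xml)
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


# Hypothetical fragments: same content served under two DNS names,
# plus a third document with a different layer.
doc_main = '<OnlineResource xlink:href="https://geo.example.org/wms?"/><Layer><Name>roads</Name></Layer>'
doc_alias = '<OnlineResource xlink:href="https://maps.example.org/wms?"/><Layer><Name>roads</Name></Layer>'
doc_other = '<OnlineResource xlink:href="https://geo.example.org/wms?"/><Layer><Name>rivers</Name></Layer>'

same = host_independent_key(doc_main) == host_independent_key(doc_alias)
different = host_independent_key(doc_main) != host_independent_key(doc_other)
```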

I agree, the numbers probably wouldn't change significantly. 

Thanks and greetings, 

Andreas 



Re: [QGIS-Developer] GeoSeer ogc services data harvesting

2020-06-09 Thread Jonathan Moules

Hi Andreas,
Sure, happy to share.
There's a little on the About page: https://www.geoseer.net/about.php 
and more scattered around the blog posts (the ones with the "GeoSeer" tag 
are probably best for that: https://www.geoseer.net/blog/?t=GeoSeer ). 
Put simply: we scrape a lot of different sources and metadata 
catalogs and get the services from them. Then we request not only the 
GetCapabilities that was declared, but also make educated guesses as to 
what else might be on the box and request those too.
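
To make the "educated guesses" step concrete, here is a sketch that takes one declared endpoint and builds GetCapabilities URLs for several OGC service types on the same host and path. The list of service types and the function name are assumptions; the real harvester may guess quite differently:

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

# Service types to probe; this list is an assumption, not GeoSeer's actual one.
SERVICE_TYPES = ("WMS", "WFS", "WCS", "WMTS")


def candidate_capabilities_urls(declared_url: str) -> list[str]:
    """Build one GetCapabilities URL per service type for the declared endpoint.

    Note: any existing query parameters on the declared URL are dropped,
    which is a simplification.
    """
    parts = urlsplit(declared_url)
    urls = []
    for service in SERVICE_TYPES:
        query = urlencode({"SERVICE": service, "REQUEST": "GetCapabilities"})
        urls.append(urlunsplit((parts.scheme, parts.netloc, parts.path, query, "")))
    return urls


urls = candidate_capabilities_urls(
    "https://example.org/ows?SERVICE=WMS&REQUEST=GetCapabilities"
)
```

Each candidate would then be requested, and only those that answer with a valid capabilities document would be kept.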


It's not perfect, but to the best of my knowledge it's by far the 
largest such index in the world and, more importantly, it's *current*. 
Everything in there responded with a valid GetCapabilities document 
containing at least one meaningful named dataset when it was last scraped, 
which was within the last few weeks.
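
A sketch of what such an inclusion check might look like for a WMS response. What counts as "meaningful" is not defined in this thread, so the sketch simply treats any Layer with a non-empty Name as a named dataset; the function name and sample document are made up:

```python
import xml.etree.ElementTree as ET


def named_layers(capabilities_xml: str) -> list[str]:
    """Return the names of all named Layer elements, or [] if the XML is invalid."""
    try:
        root = ET.fromstring(capabilities_xml)
    except ET.ParseError:
        return []
    names = []
    for element in root.iter():
        # Strip any XML namespace, e.g. "{http://www.opengis.net/wms}Layer".
        if element.tag.rsplit("}", 1)[-1] != "Layer":
            continue
        for child in element:
            if child.tag.rsplit("}", 1)[-1] == "Name" and (child.text or "").strip():
                names.append(child.text.strip())
    return names


sample = """<WMS_Capabilities xmlns="http://www.opengis.net/wms" version="1.3.0">
  <Capability>
    <Layer>
      <Title>Root</Title>
      <Layer><Name>roads</Name><Title>Roads</Title></Layer>
    </Layer>
  </Capability>
</WMS_Capabilities>"""

layers = named_layers(sample)
keep_in_index = len(layers) > 0
```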


Regarding your services, GeoSeer knows of 
http://geoweb.so.ch/wms/sogis_natgef.wms? and a few others on that 
subdomain, as well as some on another subdomain: 
http://www.sogis1.so.ch/cgi-bin/sogis/sogis_natgef.wms? Both are now 
defunct, I see, which is why they're not in the database.


Thanks for the URL, I've added it for scraping.

> So I wonder how many other QGIS server installations may not be in 
> your database?

Alas, that's an "unknown unknown"; there's no way to know (I can't think 
of a way to find out anyway; suggestions welcome). However the vast 
majority of the time when I come across a new service manually (i.e. 
from following various mailing lists like this), it turns out it's 
already in the index, so I think it's reasonably comprehensive at this 
point.


While missing servers may change the absolute number of QGIS 
Installations, they're very unlikely to change the proportions. For a 
sample-size this large I'd expect the proportions to remain largely the 
same, certainly for deployments.


Hope that's of interest and answers the question,
Cheers,
Jonathan




[QGIS-Developer] GeoSeer ogc services data harvesting

2020-06-09 Thread Andreas Neumann

Hi Jonathan,


Can you share with us how you harvest your information on available
public OGC services? You probably have that information published
somewhere - so if you could point me towards this URL, it would help. 


I noticed that all of the services of our province (my employer) can't
be found, as an example. 

Here is the start point: 


https://so.ch/verwaltung/bau-und-justizdepartement/amt-fuer-geoinformation/geoportal/geodienste/wms-web-map-service/


and the GetCapabilities link: 


https://geo.so.ch/api/wms?SERVICE=WMS&REQUEST=GetCapabilities&VERSION=1.3.0


So I wonder how many other QGIS server installations may not be in your
database? Of course I know you don't claim full coverage, but it would
still be good to know how you harvest your data. 

Thanks for clarifying and greetings, 


Andreas