Re: [Wikidata] Wikidata SPARQL query logs available

2018-08-28 Thread Finn Aarup Nielsen

I was wondering why our research section was number 8!?

Then I recalled our dashboard running from
"http://people.compute.dtu.dk/faan/cognitivesystemswikidata1.html". It
updates about every 3 minutes, all day long :)


/Finn

On 08/23/2018 09:57 PM, Daniel Mietchen wrote:

> I just ran Max' one-liner over one of the dump files, and it worked
> smoothly. Not sure where the best place would be to store such things,
> so I simply put it in my sandbox for now:
> https://www.wikidata.org/w/index.php?title=User:Daniel_Mietchen/sandbox=732396160
> .
> d.
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata






Re: [Wikidata] Wikidata SPARQL query logs available

2018-08-23 Thread Markus Kroetzsch

On 23/08/18 23:10, Stas Malyshev wrote:
> Hi!
>
> On 8/23/18 2:07 PM, Daniel Mietchen wrote:
>> On Thu, Aug 23, 2018 at 10:44 PM  wrote:
>>> I was wondering why our research section was number 8. Then I recalled
>>> our dashboard running from
>>> "http://people.compute.dtu.dk/faan/cognitivesystemswikidata1.html". It
>>> updates about every 3 minutes, all day long...
>>
>> Such automated queries should not be in the organic query file that I looked at.
>
> If it's a browser page and the underlying code does not set a distinctive
> user agent, I think they will be. It'd be hard to identify such cases
> otherwise (ccing Markus in case he knows more on the topic).

Yes, the "organic" file is a subset of the queries from agents that 
pretended to be a browser. We filtered agents and query patterns that 
were clearly not "human-like" but a tool that asks one query every 3 min 
would not be recognised at this level.


Such a tool would also not strongly affect most statistics, but it can 
in cases of statistics that have an extremely high number of possible 
values (e.g., items used in the query). In such cases, normal "organic" 
traffic is usually so diverse, that no individual value receives much 
prominence, so that even a rather small number of queries from one 
source could have an impact.
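
To get a feel for the size of this effect, here is a back-of-the-envelope sketch. A page that re-runs its query once every 3 minutes (the interval Finn mentions) sends 480 requests a day. The 92-day span is an assumption (the full June-August 2017 window from the announcement), as is one query per refresh; the dashboard may have run for less time or sent several queries per refresh.

```python
# Rough estimate of how many requests one periodically refreshing page sends.
# Assumptions (not taken from the logs): one query per refresh, and the page
# runs without interruption for the whole period.

def queries_from_poller(interval_minutes, days):
    """Number of requests sent by a client that queries every `interval_minutes`."""
    per_day = int(24 * 60 / interval_minutes)
    return per_day * days

per_day = queries_from_poller(3, 1)
whole_period = queries_from_poller(3, 92)  # 1 June to 31 August is 92 days
print(per_day, whole_period)
```

Against item counts that are spread thinly over a very large number of items, a few tens of thousands of requests for the same handful of items would be enough to move them up a ranking.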


In general, popularity measures based on query traffic, even on the 
organic part, must be taken with caution, because of the many effects 
that lead to skewed query volumes from a particular source (without this 
necessarily indicating real "popularity"). It is an open question how 
one should best evaluate the traffic in the presence of these skews. Our 
two-class system of "robotic" [massively skewed] and "organic" [less 
skewed] is only a first step there.


Best,

Markus









Re: [Wikidata] Wikidata SPARQL query logs available

2018-08-23 Thread Lucas Werkmeister
Ah, and I think I found a bug in your command: by grepping for
Q[1-9][0-9]+, you’re excluding single-digit item IDs. I’m going to
speculate that if you fix that, Q5 will comfortably beat all other items :)
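
A quick way to see the difference, sketched with Python's `re` module rather than grep: `[0-9]+` demands at least one more digit after the leading one, so only IDs with two or more digits match, while `[0-9]*` also accepts Q5. The sample string is made up for illustration and is not taken from the logs.

```python
import re

PREFIX = r"%3Chttp%3A%2F%2Fwww\.wikidata\.org%2Fentity%2FQ"
original = re.compile(PREFIX + r"([1-9][0-9]+)%3E")  # pattern used in the one-liner
fixed = re.compile(PREFIX + r"([1-9][0-9]*)%3E")     # also matches single-digit IDs

# Hypothetical URL-encoded fragment mentioning Q5 and Q1339.
sample = (
    "%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ5%3E+"
    "%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ1339%3E"
)

print(original.findall(sample))  # ['1339']
print(fixed.findall(sample))     # ['5', '1339']
```

In the shell one-liner the corresponding change would be replacing the `+` after `[0-9]` with `*`.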

On 8/23/18 11:31 PM, Lucas Werkmeister wrote:
>
> The top result freaks me out, to be honest. Are /that many/ people
> running the first query from the SPARQL tutorial, or is there
> some other reason why Bach might be so overwhelmingly popular?
>


Re: [Wikidata] Wikidata SPARQL query logs available

2018-08-23 Thread Lucas Werkmeister
The top result freaks me out, to be honest. Are /that many/ people
running the first query from the SPARQL tutorial, or is there
some other reason why Bach might be so overwhelmingly popular?

On 8/23/18 9:57 PM, Daniel Mietchen wrote:
> I just ran Max' one-liner over one of the dump files, and it worked
> smoothly. Not sure where the best place would be to store such things,
> so I simply put it in my sandbox for now:
> https://www.wikidata.org/w/index.php?title=User:Daniel_Mietchen/sandbox=732396160
> .
> d.


Re: [Wikidata] Wikidata SPARQL query logs available

2018-08-23 Thread Stas Malyshev
Hi!

On 8/23/18 2:07 PM, Daniel Mietchen wrote:
> On Thu, Aug 23, 2018 at 10:44 PM  wrote:
>> I was wondering why our research section was number 8. Then I recalled
>> our dashboard running from
>> "http://people.compute.dtu.dk/faan/cognitivesystemswikidata1.html". It
>> updates about every 3 minutes, all day long...
> 
> Such automated queries should not be in the organic query file that I looked 
> at.

If it's a browser page and the underlying code does not set a distinctive
user agent, I think they will be. It'd be hard to identify such cases
otherwise (ccing Markus in case he knows more on the topic).

-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] Wikidata SPARQL query logs available

2018-08-23 Thread Daniel Mietchen
On Thu, Aug 23, 2018 at 10:44 PM  wrote:
> I was wondering why our research section was number 8. Then I recalled
> our dashboard running from
> "http://people.compute.dtu.dk/faan/cognitivesystemswikidata1.html". It
> updates about every 3 minutes, all day long...

Such automated queries should not be in the organic query file that I looked at.

d.



Re: [Wikidata] Wikidata SPARQL query logs available

2018-08-23 Thread fn



I was wondering why our research section was number 8. Then I recalled
our dashboard running from
"http://people.compute.dtu.dk/faan/cognitivesystemswikidata1.html". It
updates about every 3 minutes, all day long...


/Finn

On 08/23/2018 09:57 PM, Daniel Mietchen wrote:

> I just ran Max' one-liner over one of the dump files, and it worked
> smoothly. Not sure where the best place would be to store such things,
> so I simply put it in my sandbox for now:
> https://www.wikidata.org/w/index.php?title=User:Daniel_Mietchen/sandbox=732396160
> .
> d.


Re: [Wikidata] Wikidata SPARQL query logs available

2018-08-23 Thread Daniel Mietchen
Hi Stas,
I had thought about putting it on Commons as tabular data, but did not
know how to reuse it from there for multilingual display using the Q
template on Wikidata, so went the simpler route.
Can you (or someone else) perhaps demo that briefly?
Thanks,
d.
On Thu, Aug 23, 2018 at 10:03 PM Stas Malyshev  wrote:
>
> Hi!
>
> > I just ran Max' one-liner over one of the dump files, and it worked
> > smoothly. Not sure where the best place would be to store such things,
> > so I simply put it in my sandbox for now:
> > https://www.wikidata.org/w/index.php?title=User:Daniel_Mietchen/sandbox=732396160
>
> If you think it's a dataset others may want to reuse, tabular data on
> Commons may be a venue: https://www.mediawiki.org/wiki/Help:Tabular_Data
>
> --
> Stas Malyshev
> smalys...@wikimedia.org



Re: [Wikidata] Wikidata SPARQL query logs available

2018-08-23 Thread Stas Malyshev
Hi!

> I just ran Max' one-liner over one of the dump files, and it worked
> smoothly. Not sure where the best place would be to store such things,
> so I simply put it in my sandbox for now:
> https://www.wikidata.org/w/index.php?title=User:Daniel_Mietchen/sandbox=732396160

If you think it's a dataset others may want to reuse, tabular data on
Commons may be a venue: https://www.mediawiki.org/wiki/Help:Tabular_Data

-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] Wikidata SPARQL query logs available

2018-08-07 Thread David Cuenca Tudela
If someone could post the 10 (or 50!) most popular items, I would really
appreciate it :-)

Cheers,
Micru

On Tue, Aug 7, 2018 at 5:59 PM Maximilian Marx <
maximilian.m...@tu-dresden.de> wrote:

>
> Hi,
>
> On Tue, 7 Aug 2018 17:37:34 +0200, Markus Kroetzsch <
> markus.kroetz...@tu-dresden.de> said:
> > If you want a sorted list of "most popular" items, this is a bit more
> > work and would require at least some Python script, or some less
> > obvious combination of sed (extracting all URLs of entities), and
> > sort.
>
>   zgrep -Eoe '%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ[1-9][0-9]+%3E'
> dump.gz | cut -d 'Q' -f 2 | cut -d '%' -f 1 | sort | uniq -c | sort -nr
>
> should do the trick.
>
> Best,
>
> Maximilian
> --
> Dipl.-Math. Maximilian Marx
> Knowledge-Based Systems Group
> Faculty of Computer Science
> TU Dresden
> +49 351 463 43510
> https://kbs.inf.tu-dresden.de/max
>
>


-- 
Etiamsi omnes, ego non


Re: [Wikidata] Wikidata SPARQL query logs available

2018-08-07 Thread Maximilian Marx

Hi,

On Tue, 7 Aug 2018 17:37:34 +0200, Markus Kroetzsch 
 said:
> If you want a sorted list of "most popular" items, this is a bit more
> work and would require at least some Python script, or some less
> obvious combination of sed (extracting all URLs of entities), and
> sort.

  zgrep -Eoe '%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ[1-9][0-9]+%3E' dump.gz | cut -d 'Q' -f 2 | cut -d '%' -f 1 | sort | uniq -c | sort -nr
  
should do the trick.

Best,

Maximilian
-- 
Dipl.-Math. Maximilian Marx
Knowledge-Based Systems Group
Faculty of Computer Science
TU Dresden
+49 351 463 43510
https://kbs.inf.tu-dresden.de/max



Re: [Wikidata] Wikidata SPARQL query logs available

2018-08-07 Thread Markus Kroetzsch

Hi Micru,

On 07/08/18 17:26, David Cuenca Tudela wrote:
> Hi Markus,
>
> Thanks for making the logs available. Personally I would be interested
> in knowing how often a certain item pops up in queries. That would make
> it easier to know the popularity of certain items.
>
> Do you think it's something that could be accomplished?


This would be quite easy to do: since each query is one line in the 
files, and since we have expanded all URLs (meaning they close with ">", 
which is URL-encoded as "%3E"), you can simply do a zgrep -c over the 
files to count the queries that mention the item (and make sure to use 
the closing "%3E" to avoid Q1234 matching a search for Q123). One such 
grep over any of the three larger files takes less than a minute.
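
For anyone without zgrep at hand, the same count can be sketched in Python. This assumes, as described above, one query per line with URL-encoded entity URLs. The function name is a placeholder, and the sketch has not been run against the real files.

```python
import gzip

def count_queries_mentioning(path, item_id):
    """Count lines (queries) in a gzipped log that mention the given item.

    The closing "%3E" (an encoded ">") keeps Q123 from matching Q1234.
    """
    needle = f"%2Fentity%2F{item_id}%3E"
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as log:
        return sum(1 for line in log if needle in line)
```

A call such as `count_queries_mentioning("dump.gz", "Q42")` (file name hypothetical) counts each query once, however often it mentions the item, which is also what grep -c does.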


If you want a sorted list of "most popular" items, this is a bit more 
work and would require at least some Python script, or some less obvious 
combination of sed (extracting all URLs of entities), and sort.
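
A minimal sketch of what such a Python script might look like, using only the standard library. It counts every mention of an item, so an item used twice in one query counts twice. The function name and the `top` parameter are illustrative.

```python
import gzip
import re
from collections import Counter

# Entity URLs end with an encoded ">" ("%3E"); IDs have one or more digits.
ITEM = re.compile(r"%2Fentity%2F(Q[1-9][0-9]*)%3E")

def most_mentioned_items(path, top=10):
    """Return the `top` most frequently mentioned items in a gzipped log."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as log:
        for line in log:
            counts.update(ITEM.findall(line))
    return counts.most_common(top)
```

To count queries rather than mentions, one could update the counter with `set(ITEM.findall(line))` instead.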


Best,

Markus




Re: [Wikidata] Wikidata SPARQL query logs available

2018-08-07 Thread David Cuenca Tudela
Hi Markus,

Thanks for making the logs available. Personally I would be interested in
knowing how often a certain item pops up in queries. That would make it
easier to know the popularity of certain items.

Do you think it's something that could be accomplished?

Regards,
Micru





[Wikidata] Wikidata SPARQL query logs available

2018-08-07 Thread Markus Kroetzsch

Dear all,

I am happy to announce that as part of an ongoing research collaboration
between TU Dresden researchers and Wikimedia [1], we can now release
pre-processed logs from the Wikidata SPARQL Query Service [2]. You can
find details and download links on the following page:


https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en

The data so far comprises over 200 million queries answered in 
June-August 2017. There is also an accompanying publication that 
describes the workings of and practical experiences with the SPARQL 
query service [3].


The logs have been pre-processed to remove information that could 
potentially be used for identifying individual users (e.g., comments 
were removed, geo-coordinates coarsened, and query strings reformatted 
completely -- see above page for details). Nevertheless, one can still 
learn many interesting things from the logs, e.g., which properties and 
entities are used in queries, which SPARQL features are most prominent, 
or which languages are requested.


We have also preserved some user agent information, but without overly
detailed software versions and only in cases where the agents occurred
many times across several weeks. This can at least be used to recognise
the significant number of queries generated, e.g., by Magnus' tools, or
to do a rough analysis of which software platforms are most often used
to send queries. We used #TOOL comments found in queries to refine user
agent information in some cases.
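
As a hedged illustration of such a rough analysis: assuming the files are tab-separated with the user agent label in the last column, a tally per agent could look like the sketch below. The column position and the treatment of "#" lines are guesses, so check the download page for the actual layout.

```python
import gzip
from collections import Counter

def agent_counts(path, agent_column=-1):
    """Tally queries per user agent label in a gzipped, tab-separated log.

    `agent_column` is an assumption about the file layout, not a documented fact.
    Lines starting with "#" are treated as header lines and skipped.
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as log:
        for line in log:
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            counts[fields[agent_column]] += 1
    return counts
```

Calling `agent_counts(path).most_common(20)` would then list the twenty most frequent agent labels.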


We also made an effort to identify those queries that come from browser 
agents *and* also behave like one would expect from a browser (not all 
"browsers" did). We called such queries "organic" and provide this 
classification with the logs (there is also a filtered dump of only 
organic queries, which is much smaller and therefore nicer to process, 
also for testing). See the paper for details on our methodology.


Finally, the data contains the time of each request, so one can 
reconstruct query loads over time.
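
One way to reconstruct load over time is to bucket the request times by hour. The sketch assumes the timestamps have already been extracted as ISO 8601 strings; the actual timestamp format in the files may differ, and the sample values are made up.

```python
from collections import Counter
from datetime import datetime

def queries_per_hour(timestamps):
    """Count requests per hour from an iterable of ISO 8601 timestamp strings."""
    load = Counter()
    for ts in timestamps:
        # Truncate each timestamp to the start of its hour.
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        load[hour] += 1
    return sorted(load.items())

# Made-up sample values, not taken from the logs.
sample = ["2017-06-12 08:05:00", "2017-06-12 08:59:59", "2017-06-12 09:00:00"]
for hour, n in queries_per_hour(sample):
    print(hour.isoformat(sep=" "), n)
```

Hours without any request do not appear in the result, so a plot of the series would need those gaps filled with zeros.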


Feedback is very welcome, both in terms of comments on the data (is it 
useful to you? would you like to see more? do you have concerns?) and in 
terms of insights that you can get from it (we did some analyses but one 
can surely do more).


Cheers,

Markus

[1] https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries
[2] https://query.wikidata.org/ (or rather the web service that powers 
this UI and many other applications).
[3] Stanislav Malyshev, Markus Krötzsch, Larry González, Julius Gonsior, 
Adrian Bielefeldt: Getting the Most out of Wikidata: Semantic Technology 
Usage in Wikipedia’s Knowledge Graph. In Proceedings of the 17th 
International Semantic Web Conference (ISWC-18), Springer 2018. 
https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en


--
Prof. Dr. Markus Kroetzsch
Knowledge-Based Systems Group
Center for Advancing Electronics Dresden (cfaed)
Faculty of Computer Science
TU Dresden
+49 351 463 38486
https://kbs.inf.tu-dresden.de/


