Re: [Wikitech-l] API attribute ID for querying wikipedia pages

2014-04-24 Thread Daan Kuijsten


On 23-Apr-14 21:29, wikitech-l-requ...@lists.wikimedia.org wrote:

Re: API attribute ID for querying wikipedia pages


@Matma Rex: This is way too general; I think it would be a lot better 
if it went into more detail. For example, when I want to fetch a 
table with all currencies on 
https://en.wikipedia.org/wiki/List_of_circulating_currencies, I would 
make an API call like this: 
https://en.wikipedia.org/w/api.php?action=parse&page=List%20of%20circulating%20currencies&prop=sections&format=jsonfm 
This returns 5 sections with numbers which I can use as reference 
points, but I would rather have a number for the table in the section. 
A section can have multiple tables.
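
To illustrate the point about tables: with only section numbers available, a client has to take the HTML of a section and pick tables by position. The sketch below counts the top-level tables in an HTML fragment with Python's standard html.parser. The names TableCounter and count_tables are hypothetical, and the sample fragment is hand-written, not real API output.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count top-level <table> elements in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # current nesting depth inside <table> elements
        self.count = 0   # number of top-level tables seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth == 0:
                self.count += 1
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def count_tables(html: str) -> int:
    """Return how many top-level tables the fragment contains."""
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hand-written stand-in for the HTML of one section (assumption, not API output).
sample_html = (
    "<h2>Currencies</h2>"
    "<table><tr><td>Euro</td></tr></table>"
    "<p>Some text</p>"
    "<table><tr><td><table><tr><td>nested</td></tr></table></td></tr></table>"
)

print(count_tables(sample_html))  # 2: the nested table is not counted separately
```

Picking "the second table in section 3" this way is still positional, so it has the same weakness as the section numbers themselves.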


Querying specific (structured) data from Wikipedia is still very 
difficult in my opinion. My suggestion is that every paragraph, image, 
link and table gets a uniquely identifiable number. This way Wikipedia 
becomes more machine-readable.


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] API attribute ID for querying wikipedia pages

2014-04-24 Thread Martijn Hoekstra
On Thu, Apr 24, 2014 at 2:24 PM, Daan Kuijsten daankuijs...@gmail.com wrote:


> On 23-Apr-14 21:29, wikitech-l-requ...@lists.wikimedia.org wrote:
>
> Re: API attribute ID for querying wikipedia pages
>
> @Matma Rex: This is way too general; I think it would be a lot better if
> it went into more detail. For example, when I want to fetch a table with
> all currencies on
> https://en.wikipedia.org/wiki/List_of_circulating_currencies, I would
> make an API call like this:
> https://en.wikipedia.org/w/api.php?action=parse&page=List%20of%20circulating%20currencies&prop=sections&format=jsonfm
> This returns 5 sections with numbers which I can use as reference points,
> but I would rather have a number for the table in the section. A section
> can have multiple tables.
>
> Querying specific (structured) data from Wikipedia is still very difficult
> in my opinion. My suggestion is that every paragraph, image, link and table
> gets a uniquely identifiable number. This way Wikipedia becomes more
> machine-readable.


I see where you are coming from, but this implies that these are stable
properties over multiple revisions, which they aren't. If I have a table in
revision 1, remove it in revision 2, and add it back in in revision 3, is
it still the same table? What if I slightly change it? How much do I have
to change it before its identity changes?

A wiki(pedia) page is by its very nature a dynamic construct, which makes
assigning stable identifiers to its elements extremely impractical at the
very least.
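
The identity question above can be made concrete with one naive scheme: deriving an element's ID from a hash of its wikitext. This is only an assumption for illustration; nobody in the thread proposes it, and content_id is a hypothetical helper. Under this scheme a table that is removed and restored unchanged keeps its ID, but any edit, however small, produces a new one.

```python
import hashlib


def content_id(wikitext: str) -> str:
    """Derive a short ID from the exact text of an element (naive scheme)."""
    return hashlib.sha1(wikitext.encode("utf-8")).hexdigest()[:12]


# Hand-written example table; revision 2 (table removed) has nothing to hash.
rev1_table = '{| class="wikitable"\n| Euro || EUR\n|}'
rev3_table = rev1_table                          # added back unchanged
rev4_table = rev1_table.replace("EUR", "EURO")   # slightly changed

same_after_restore = content_id(rev1_table) == content_id(rev3_table)
same_after_edit = content_id(rev1_table) == content_id(rev4_table)

print(same_after_restore)  # True
print(same_after_edit)     # False
```

A hash answers "how much do I have to change it" with "any amount", which is rarely what a client fetching the table wants.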



Re: [Wikitech-l] API attribute ID for querying wikipedia pages

2014-04-24 Thread Bartosz Dziewoński

On Thu, 24 Apr 2014 14:24:08 +0200, Daan Kuijsten daankuijs...@gmail.com 
wrote:


> Querying specific (structured) data from Wikipedia is still very difficult in
> my opinion. My suggestion is that every paragraph, image, link and table gets a
> uniquely identifiable number. This way Wikipedia becomes more machine-readable.


You want Semantic MediaWiki[1] then (which the Wikipedias don't use) or 
Wikidata[2], which is one of Wikipedia's sister projects and has been growing 
very fast. Wikipedia was never intended to be machine-readable in the way you 
propose (although it does provide access to MediaWiki's awesome API).

[1] https://www.mediawiki.org/wiki/Extension:Semantic_MediaWiki
[2] https://www.wikidata.org/

--
Matma Rex


Re: [Wikitech-l] API attribute ID for querying wikipedia pages

2014-04-24 Thread Gerard Meijssen
Hoi,
I totally agree that you should be able to do this. However, would it not
make more sense to get structured information from Wikidata?
Thanks,
GerardM



Re: [Wikitech-l] API attribute ID for querying wikipedia pages

2014-04-24 Thread Gabriel Wicke
On 04/24/2014 05:24 AM, Daan Kuijsten wrote:
 
> On 23-Apr-14 21:29, wikitech-l-requ...@lists.wikimedia.org wrote:
> Re: API attribute ID for querying wikipedia pages
>
> @Matma Rex: This is way too general; I think it would be a lot better if
> it went into more detail. For example, when I want to fetch a table with
> all currencies on
> https://en.wikipedia.org/wiki/List_of_circulating_currencies, I would make
> an API call like this:
> https://en.wikipedia.org/w/api.php?action=parse&page=List%20of%20circulating%20currencies&prop=sections&format=jsonfm
> This returns 5 sections with numbers which I can use as reference points,
> but I would rather have a number for the table in the section. A section
> can have multiple tables.
>
> Querying specific (structured) data from Wikipedia is still very difficult
> in my opinion. My suggestion is that every paragraph, image, link and table
> gets a uniquely identifiable number. This way Wikipedia becomes more
> machine-readable.

We (the Parsoid team) are actually working on this, see
https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec/Element_IDs

Besides making it possible to reference content, our goal is to use these
ids as a key that lets us associate additional metadata with each element in
the DOM.

We expect stable element ids to be available in Parsoid output by this summer.

Gabriel


[Wikitech-l] API attribute ID for querying wikipedia pages

2014-04-23 Thread Daan Kuijsten
Currently we are experiencing problems when we try to query Wikipedia. 
Fetching content via the Wikipedia API could be a lot easier in our 
opinion. The problem we have is that content is fetched via the 
rvsection parameter, which accepts a number representing the position 
of the section, counted from the top section to the bottom section. 
This is a very fragile way of fetching content: when another section 
is inserted at the top of the page, all section numbers shift up by 
one.
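
The shift can be shown without touching the API at all. In the minimal sketch below, a page is modelled as a plain list of headings numbered from 1; the heading names and the helper section_numbers are made up for illustration.

```python
def section_numbers(headings):
    """Number sections from top to bottom, starting at 1."""
    return {title: n for n, title in enumerate(headings, start=1)}


before = ["History", "Currencies", "References"]
after = ["Overview"] + before  # someone inserts a new section at the top

# A client looked up "Currencies" earlier and stored its number.
stored = section_numbers(before)["Currencies"]

# Using the stored number against the edited page hits a different section.
now_at_stored = after[stored - 1]

print(stored)                                # 2
print(now_at_stored)                         # History
print(section_numbers(after)["Currencies"])  # 3
```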


A better way of fetching content via an API is to assign a unique ID to 
a section, a paragraph, a table, an image etc. This way we could simply 
fetch a part of the content of Wikipedia via this ID.


I would like to know if this problem is shared by other developers 
on the Wikipedia API team.


Kind regards,
Daan Kuijsten

Re: [Wikitech-l] API attribute ID for querying wikipedia pages

2014-04-23 Thread Brad Jorsch (Anomie)
On Wed, Apr 23, 2014 at 3:48 AM, Daan Kuijsten daankuijs...@gmail.com wrote:

> A better way of fetching content via an API is to assign a unique ID to a
> section, a paragraph, a table, an image etc. This way we could simply fetch
> a part of the content of Wikipedia via this ID.


That doesn't sound much better. Say a vandal blanks a page then someone
reverts, and probably all your unique ID numbers will have changed. Or
someone renames a section or edits a paragraph, or combines two sections,
or splits a section into two, etc.


-- 
Brad Jorsch (Anomie)
Software Engineer
Wikimedia Foundation

Re: [Wikitech-l] API attribute ID for querying wikipedia pages

2014-04-23 Thread Bartosz Dziewoński

On Wed, 23 Apr 2014 09:48:17 +0200, Daan Kuijsten daankuijs...@gmail.com 
wrote:


> A better way of fetching content via an API is to assign a unique ID to
> a section, a paragraph, a table, an image etc. This way we could simply
> fetch a part of the content of Wikipedia via this ID.


Such ids already exist, and they are present in the page HTML as 'id' attributes on 
the headings. They are constructed simply based on heading text, with unique 
identifiers appended if duplicates happen. You can access these via the API too, 
using action=parse&prop=sections [1] (the 'anchor' property), then map them to 
the numerical identifiers other API modules use (the 'number' property).

[1] 
https://en.wikipedia.org/w/api.php?action=parse&page=Main%20Page&prop=sections&format=jsonfm
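
The anchor-to-number mapping described above could be sketched roughly as follows. The request URL is built but not sent, and the sample response is hand-written in the shape the sections list is understood to have, so treat both as assumptions. In particular, the sketch assumes each entry also carries an 'index' field and that this, rather than the table-of-contents style 'number', is the value rvsection accepts; check this against real output. The helper names sections_url and anchor_to_index are hypothetical.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def sections_url(page: str) -> str:
    """Build the action=parse request that lists a page's sections."""
    params = {
        "action": "parse",
        "page": page,
        "prop": "sections",
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


def anchor_to_index(response: dict) -> dict:
    """Map each heading anchor to its section index (assumed field name)."""
    return {s["anchor"]: s["index"] for s in response["parse"]["sections"]}


# Hand-written sample, not real API output.
sample = {
    "parse": {
        "title": "Example",
        "sections": [
            {"line": "History", "anchor": "History", "number": "1", "index": "1"},
            {"line": "See also", "anchor": "See_also", "number": "2", "index": "2"},
        ],
    }
}

lookup = anchor_to_index(sample)
print(sections_url("Main Page"))
print(lookup["See_also"])  # 2
```

A client would store the anchor, refresh this lookup before each fetch, and pass the resulting value as the section number. This survives insertions elsewhere on the page, but not a renamed heading.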

--
Matma Rex
