Hi L.,

There is unfortunately no table that tracks how links are included on a page (hard-coded via wikitext or transcluded via templates/Lua). It sounds like you're already aware of the pagelinks <https://www.mediawiki.org/wiki/Manual:Pagelinks_table> table, which can be easily parsed with the mwsql <https://pypi.org/project/mwsql/> library if you're working in Python. That generally leaves you with two options:
- Extract the links from the raw wikitext XML dumps, which would achieve what you want. In Python, the easiest way is via mwxml <https://pypi.org/project/mwxml/> for the dumps and mwparserfromhell <https://github.com/earwig/mwparserfromhell> for extracting the links. You could then use the mwconstants <https://pypi.org/project/mwconstants/> library to filter down to just the namespace you're interested in.
- Extract the links from the HTML dumps <https://dumps.wikimedia.org/other/enterprise_html/>. This gives you all the links in the article, and you can separate the transcluded ones from the non-transcluded ones. There's a work-in-progress Python library for this too called mwparserfromhtml <https://pypi.org/project/mwparserfromhtml/> (see the blog post about it <https://techblog.wikimedia.org/2023/02/24/from-hell-to-html/>).

Sorry that doesn't solve your problem, but I hope it helps. If you're curious about why this isn't supported, you can read through some of the past discussion <https://phabricator.wikimedia.org/T278236> around adding this sort of functionality (essentially, the links tables are already massive, so adding more information is not desirable at the moment).

Best,
Isaac

On Thu, Feb 16, 2023 at 5:13 PM Luigi Assom <[email protected]> wrote:

> Hi All,
>
> for an applied research work, I am working on extracting links from the
> Wikipedia corpus.
>
> I've been using in the past the XML streams, but now I was hoping to speed
> up and handle better the situation by parsing the sql tables.
>
> However, I am stuck on this:
>
> I could not find a way to filter the relevant links.
>
> I can only filter by namespace apparently, while I want to only keep the
> links that were mentioned in the main text, still namespace 0, but not
> belonging to the infoboxes and navboxes menu.
>
> How could I do that?
> Is there any information that a link belongs to a menu or to the main
> content, beyond the namespace?
>
> Thanks All for your help,
> L.
> _______________________________________________
> Wiki-research-l mailing list -- [email protected]
> To unsubscribe send an email to [email protected]

--
Isaac Johnson (he/him/his) -- Senior Research Scientist -- Wikimedia Foundation
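
P.S. As a rough illustration of the second option above (HTML dumps), here is a minimal sketch of how transcluded and non-transcluded links might be told apart using only the Python standard library. It does not use the mwparserfromhtml API. It assumes the article HTML follows the Parsoid conventions as I understand them: article links carry rel="mw:WikiLink", the first node of template output carries typeof="mw:Transclusion", and all nodes produced by the same template share one "about" id. The names LinkSplitter, split_links and SAMPLE are hypothetical and only for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LinkSplitter(HTMLParser):
    """Hypothetical helper: split wiki links by whether they sit in template output."""

    def __init__(self):
        super().__init__()
        self.stack = []                # (tag, inside_transclusion) per open element
        self.transclusion_ids = set()  # "about" ids seen on mw:Transclusion nodes
        self.body_links = []
        self.transcluded_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        about = attrs.get("about")
        typeof = (attrs.get("typeof") or "").split()
        # Assumption: the first node of a template's output is marked this way.
        if about and "mw:Transclusion" in typeof:
            self.transclusion_ids.add(about)
        inherited = self.stack[-1][1] if self.stack else False
        inside = inherited or about in self.transclusion_ids
        rel = (attrs.get("rel") or "").split()
        if tag == "a" and "mw:WikiLink" in rel and attrs.get("href"):
            bucket = self.transcluded_links if inside else self.body_links
            bucket.append(attrs["href"])
        if tag not in VOID_TAGS:
            self.stack.append((tag, inside))

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag name, if any.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def split_links(html):
    """Return (body_links, transcluded_links) for one article's HTML."""
    parser = LinkSplitter()
    parser.feed(html)
    parser.close()
    return parser.body_links, parser.transcluded_links


# Made-up HTML fragment in the assumed Parsoid style, not real dump content.
SAMPLE = (
    '<p>Rome is the capital of <a rel="mw:WikiLink" href="./Italy">Italy</a>.</p>'
    '<table about="#mwt1" typeof="mw:Transclusion" class="infobox">'
    '<tr><td><a rel="mw:WikiLink" href="./Lazio">Lazio</a></td></tr></table>'
    '<div about="#mwt1"><a rel="mw:WikiLink" href="./Tiber">Tiber</a></div>'
)

if __name__ == "__main__":
    body, transcluded = split_links(SAMPLE)
    print("body:", body)
    print("transcluded:", transcluded)
```

Note that under this approach any link produced by a template counts as transcluded, including small inline templates in the running text, so you may still want to filter further (for example by the infobox/navbox class) depending on what you need.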
