I'm confused. I downloaded the 2012-12-01 dump files, but when I look up known categories I'm not finding what I expect. For example:

SELECT * FROM `categorylinks` WHERE `cl_to` = 'Humanities'

yields 9 rows, but

http://en.wikipedia.org/wiki/Category:Humanities

lists 26 subcategories and 71 pages.

I'm wondering if maybe I downloaded the wrong files, or if they didn't import completely. Here are the files and row counts, as reported by phpMyAdmin:

1. enwiki-20121201-category.sql.gz - ~1,544,750 rows
   <http://dumps.wikimedia.org/enwiki/20121201/enwiki-20121201-category.sql.gz>

2. enwiki-20121201-categorylinks.sql.gz - ~1,380,956 rows
   <http://dumps.wikimedia.org/enwiki/20121201/enwiki-20121201-categorylinks.sql.gz>

3. enwiki-20121201-page.sql.gz - ~1,492,392 rows
   <http://dumps.wikimedia.org/enwiki/20121201/enwiki-20121201-page.sql.gz>

4. enwiki-20121201-page_props.sql.gz - ~5,415,922 rows
   <http://dumps.wikimedia.org/enwiki/20121201/enwiki-20121201-page_props.sql.gz>

The MD5 checksums match.

What am I doing wrong?

Thanks,

Robert

-----Original Message-----
From: Ariel T. Glenn [mailto:ar...@wikimedia.org]
Sent: Thursday, January 10, 2013 10:50 AM
To: Robert Crowe
Cc: xmldatadumps-l@lists.wikimedia.org
Subject: RE: [Xmldatadumps-l] Which files do I need?

You want the page_props table (*-page_props.sql.gz); look for entries with the string 'hiddencat' in pp_propname.

Ariel

On 10-01-2013, Thu, at 09:58 -0800, Robert Crowe wrote:
> Perfect! Thanks Ariel. What is the best way to distinguish hidden
> categories? I see that the category table used to have a cat_hidden
> column, but that's been removed.
>
> Robert
>
> -----Original Message-----
> From: Ariel T. Glenn [mailto:ar...@wikimedia.org]
> Sent: Thursday, January 10, 2013 3:34 AM
> To: Robert Crowe
> Cc: xmldatadumps-l@lists.wikimedia.org
> Subject: Re: [Xmldatadumps-l] Which files do I need?
>
> If you are just trying to get at the structure from the various dump
> files, the page table has page ids, titles, and whether the page is a
> redirect or not (*-page.sql.gz), the category table has category
> names, ids, and summary information (*-category.sql.gz), and
> categorylinks has the list of all category links in a page, with the
> page id and the category name (*-categorylinks.sql.gz). You can find
> details on the tables here:
> http://www.mediawiki.org/wiki/Manual:Categorylinks_table
> (here's the category:
> http://www.mediawiki.org/wiki/Category:MediaWiki_database_tables )
>
> Hopefully this should get you started.
>
> Ariel
>
> On 09-01-2013, Wed, at 10:51 -0800, Robert Crowe wrote:
> > I'd like to mirror just the category structure of the English
> > Wikipedia, and I'm wondering which of the dump files I need to start
> > with.
> >
> > I don't need the page content, just the page names, and only for the
> > most current revision. I need the categories and category members,
> > and I'd like to exclude hidden categories. I also need to
> > distinguish redirects, because I don't want to treat them as
> > separate pages. As much as possible I'd like to work with SQL
> > files, but I can crunch through XML if necessary.
> >
> > So which files do I need to download? I may also need some help in
> > understanding the schemas.
> >
> > Thanks,
> >
> > Robert
> >
> > _______________________________________________
> > Xmldatadumps-l mailing list
> > Xmldatadumps-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
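
For anyone following along, the lookups described in this thread (category members from categorylinks, redirects from page, hidden categories from page_props) can be sketched against a toy database. This is only an illustration under stated assumptions: it uses Python's sqlite3 with invented rows rather than the real MySQL import, it assumes the categorylinks dump carries the cl_type column ('page', 'subcat', 'file'), and it assumes category pages sit in namespace 14. The helper names members_by_type and hidden_categories are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER,
                   page_title TEXT, page_is_redirect INTEGER);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_type TEXT);
CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value TEXT);

-- Toy rows, invented for illustration only (not taken from the dump).
INSERT INTO page VALUES
  (1, 14, 'Humanities', 0),
  (2, 14, 'Arts', 0),
  (3, 0,  'Philology', 0),
  (4, 0,  'Philologie', 1),
  (5, 14, 'Articles_needing_cleanup', 0);
INSERT INTO categorylinks VALUES
  (2, 'Humanities', 'subcat'),
  (3, 'Humanities', 'page'),
  (4, 'Humanities', 'page'),
  (3, 'Articles_needing_cleanup', 'page');
INSERT INTO page_props VALUES (5, 'hiddencat', '');
""")


def members_by_type(conn, category):
    """Count non-redirect members of a category, grouped by cl_type."""
    rows = conn.execute(
        """
        SELECT cl.cl_type, COUNT(*)
        FROM categorylinks AS cl
        JOIN page AS p ON p.page_id = cl.cl_from
        WHERE cl.cl_to = ? AND p.page_is_redirect = 0
        GROUP BY cl.cl_type
        """,
        (category,),
    ).fetchall()
    return dict(rows)


def hidden_categories(conn):
    """Titles of category pages that carry the 'hiddencat' page property."""
    rows = conn.execute(
        """
        SELECT p.page_title
        FROM page_props AS pp
        JOIN page AS p ON p.page_id = pp.pp_page
        WHERE pp.pp_propname = 'hiddencat' AND p.page_namespace = 14
        ORDER BY p.page_title
        """
    ).fetchall()
    return [title for (title,) in rows]


print(members_by_type(conn, "Humanities"))
print(hidden_categories(conn))
```

Running the same grouped count against the imported tables and comparing the 'subcat' and 'page' totals with the live category page is one way to tell whether rows are missing from the import. Note that titles in these tables use underscores where the site shows spaces.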
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l