Re: [Xmldatadumps-l] Which files do I need?

Robert Crowe Sun, 13 Jan 2013 15:41:13 -0800

I'm confused.  I downloaded the 2012-12-01 dump files, but looking for known 
categories I'm not finding what I expect.  For example:


 

SELECT *
FROM `categorylinks`
WHERE `cl_to` = 'Humanities'

yields 9 rows, but

http://en.wikipedia.org/wiki/Category:Humanities

lists 26 subcategories and 71 pages.  I'm wondering if maybe I downloaded the
wrong files, or if they didn't import completely.  Here are the files and row
counts, as reported by phpMyAdmin:

 

1. enwiki-20121201-category.sql.gz - ~1,544,750 rows
   <http://dumps.wikimedia.org/enwiki/20121201/enwiki-20121201-category.sql.gz>

2. enwiki-20121201-categorylinks.sql.gz - ~1,380,956 rows
   <http://dumps.wikimedia.org/enwiki/20121201/enwiki-20121201-categorylinks.sql.gz>

3. enwiki-20121201-page.sql.gz - ~1,492,392 rows
   <http://dumps.wikimedia.org/enwiki/20121201/enwiki-20121201-page.sql.gz>

4. enwiki-20121201-page_props.sql.gz - ~5,415,922 rows
   <http://dumps.wikimedia.org/enwiki/20121201/enwiki-20121201-page_props.sql.gz>

 

The MD5 checksums match.  What am I doing wrong?
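
One way to narrow down a mismatch like this is to compare, per category, the number of categorylinks rows that were actually imported with the member count stored in the category table's summary columns. The following is only a sketch: it uses Python's sqlite3 with an in-memory database as a stand-in for the MySQL import, the rows are invented toy data, and the helper name link_count_gaps is hypothetical. It assumes the MediaWiki column names cat_title, cat_pages, cl_from and cl_to, and that cat_pages counts all members of a category (pages and subcategories together).

```python
import sqlite3

def link_count_gaps(conn, min_gap=1):
    """Return (title, expected, found) for every category whose imported
    categorylinks rows fall short of category.cat_pages by >= min_gap."""
    query = """
        SELECT c.cat_title, c.cat_pages, COUNT(cl.cl_from) AS found
        FROM category AS c
        LEFT JOIN categorylinks AS cl ON cl.cl_to = c.cat_title
        GROUP BY c.cat_title, c.cat_pages
        HAVING c.cat_pages - COUNT(cl.cl_from) >= ?
        ORDER BY c.cat_title
    """
    return conn.execute(query, (min_gap,)).fetchall()

# Toy data only (assumption): a real check would run against the imported tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (cat_title TEXT PRIMARY KEY,
                           cat_pages INTEGER, cat_subcats INTEGER);
    CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_type TEXT);
    INSERT INTO category VALUES ('Humanities', 5, 2), ('Arts', 2, 0);
    INSERT INTO categorylinks VALUES
        (1, 'Humanities', 'page'), (2, 'Humanities', 'subcat'),
        (3, 'Arts', 'page'), (4, 'Arts', 'page');
""")

gaps = link_count_gaps(conn)
print(gaps)  # prints [('Humanities', 5, 2)] for the toy rows above
```

If many categories show a large gap, an incomplete categorylinks import is a plausible explanation; if only a few do, the difference may simply reflect edits made between the dump and the live page.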

 

Thanks,

 

Robert

 

 

 

-----Original Message-----
From: Ariel T. Glenn [mailto:ar...@wikimedia.org] 
Sent: Thursday, January 10, 2013 10:50 AM
To: Robert Crowe
Cc: xmldatadumps-l@lists.wikimedia.org
Subject: RE: [Xmldatadumps-l] Which files do I need?

 

You want the page_props table, and look for entries with the string 'hiddencat' 
for pp_propname.  (*-page_props.sql.gz)
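
As a rough illustration of that lookup, the sketch below lists category pages that have no 'hiddencat' entry in page_props. It is written in Python against an in-memory SQLite database rather than the MySQL import, the rows are invented toy data, and the helper name visible_categories is hypothetical. It assumes the MediaWiki column names page_id, page_namespace, page_title, pp_page and pp_propname, and that category pages live in namespace 14.

```python
import sqlite3

CATEGORY_NS = 14  # namespace number MediaWiki assigns to Category: pages

def visible_categories(conn):
    """Return titles of category pages that carry no 'hiddencat' page prop."""
    query = """
        SELECT p.page_title
        FROM page AS p
        LEFT JOIN page_props AS pp
               ON pp.pp_page = p.page_id AND pp.pp_propname = 'hiddencat'
        WHERE p.page_namespace = ? AND pp.pp_page IS NULL
        ORDER BY p.page_title
    """
    return [row[0] for row in conn.execute(query, (CATEGORY_NS,))]

# Toy data only (assumption): one visible category, one hidden category,
# and one ordinary article that must not appear in the result.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER,
                       page_title TEXT, page_is_redirect INTEGER);
    CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value TEXT);
    INSERT INTO page VALUES
        (10, 14, 'Humanities', 0),
        (11, 14, 'All_stub_articles', 0),
        (12, 0, 'Philosophy', 0);
    INSERT INTO page_props VALUES
        (11, 'hiddencat', ''),
        (10, 'defaultsort', 'Humanities');
""")

print(visible_categories(conn))  # prints ['Humanities'] for the toy rows above
```

Putting the pp_propname filter in the join condition rather than the WHERE clause keeps categories that have other page props (such as the defaultsort row above) from being dropped by mistake.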

 

Ariel

 

On 10-01-2013, Thu, at 09:58 -0800, Robert Crowe wrote:

> Perfect!  Thanks Ariel.  What is the best way to distinguish hidden
> categories?  I see that the category table used to have a cat_hidden column,
> but that's been removed.
>
> Robert
>

 

> -----Original Message-----
> From: Ariel T. Glenn [mailto:ar...@wikimedia.org]
> Sent: Thursday, January 10, 2013 3:34 AM
> To: Robert Crowe
> Cc: xmldatadumps-l@lists.wikimedia.org
> Subject: Re: [Xmldatadumps-l] Which files do I need?

> 

> If you are just trying to get at the structure from the various dump
> files, the page table has page ids, titles, and whether the page is a
> redirect or not (*-page.sql.gz), the category table has category
> names, ids, and summary information (*-category.sql.gz), and
> categorylinks has the list of all category links in a page, with the
> page id and the category name (*-categorylinks.sql.gz).  You can find
> details on the tables here:
> http://www.mediawiki.org/wiki/Manual:Categorylinks_table
> (here's the category:
> http://www.mediawiki.org/wiki/Category:MediaWiki_database_tables )
>
> Hopefully this should get you started.
>
> Ariel
>

> On 09-01-2013, Wed, at 10:51 -0800, Robert Crowe wrote:

> > I'd like to mirror just the category structure of the English
> > Wikipedia, and I'm wondering which of the dump files I need to start
> > with.
> >
> > I don't need the page content, just the page names, and only for the
> > most current revision.  I need the categories and category members,
> > and I'd like to exclude hidden categories.  I also need to
> > distinguish redirects, because I don't want to treat them as
> > separate pages.  As much as possible I'd like to work with SQL
> > files, but I can crunch through XML if necessary.
> >
> > So which files do I need to download?  I may also need some help in
> > understanding the schemas.
> >
> > Thanks,
> >
> > Robert
> >
> > _______________________________________________
> > Xmldatadumps-l mailing list
> > Xmldatadumps-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l

