On Sat, Nov 21, 2009 at 9:48 PM, Jona Christopher Sahnwaldt
<[email protected]> wrote:
> The one thing that is slow is building the indexes after
> the data has been imported (eight hours or so). Maybe
> we could omit some indexes that are not used in our
> application, but I haven't really looked into that.

MyISAM should be able to build keys quite quickly; it can do so by
sorting if all goes well.  That's commonly ten times as fast as
rebuilding by keycache, or more.  I don't know offhand how mwdumper
works, but you might want to try increasing the value of
myisam_max_sort_file_size to something larger than all your indexes,
if you have enough disk space.  Or you could do something like this,
at your own risk, of course  :)

1) Import into a table with no keys at all.
2) Create a table with the same definition but with keys and no data.
3) Shut down MySQL.
4) Copy the new table's .frm and .MYI over the old table's.
5) Run myisamchk --sort-recover on the old table (with the data and
   the copied .frm and .MYI).
6) Restart MySQL.

On Sat, Nov 21, 2009 at 10:36 PM, Anthony <[email protected]> wrote:
> Quite a lot?  The theoretical max would be about twice as fast, as
> (page_is_redirect, page_namespace, page_title) is going to take up at
> least half as much space as the whole page table.

Yes, that's true.  Twice as fast is still a pretty good improvement, though.  :)

> I know it wouldn't be reached by PostgreSQL, which would still do a
> sequential scan through the table.

MySQL skips the table and just looks at the index, in both InnoDB and MyISAM.

mysql> EXPLAIN SELECT page_namespace, page_title FROM page WHERE
page_is_redirect=0;
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
|  1 | SIMPLE      | page  | ALL  | NULL          | NULL | NULL    | NULL | 43603 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT page_namespace, page_title FROM page2 WHERE
page_is_redirect=0;
+----+-------------+-------+------+------------------+------------------+---------+-------+-------+-------------+
| id | select_type | table | type | possible_keys    | key              | key_len | ref   | rows  | Extra       |
+----+-------------+-------+------+------------------+------------------+---------+-------+-------+-------------+
|  1 | SIMPLE      | page2 | ref  | page_is_redirect | page_is_redirect | 1       | const | 22048 | Using index |
+----+-------------+-------+------+------------------+------------------+---------+-------+-------+-------------+
1 row in set (0.00 sec)

Note "Using index" in the second query; it never looks at the table
data, only the index.  (Those queries are on a local copy of Simple,
not enwiki, thus the small row counts.)
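If you want to play with index-only scans without a MySQL install, the
same behavior can be sketched in Python with the standard library's
SQLite bindings (the table and column names here mimic MediaWiki's page
table; the index name idx_redirect is made up).  SQLite's planner
reports "COVERING INDEX" when, like MySQL's "Using index", the query is
answered from the index alone without touching the table rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE page (
    page_namespace INTEGER,
    page_title TEXT,
    page_is_redirect INTEGER)""")
# Index covers every column the query below needs.
cur.execute("""CREATE INDEX idx_redirect
    ON page (page_is_redirect, page_namespace, page_title)""")
cur.executemany("INSERT INTO page VALUES (?, ?, ?)",
                [(0, "Foo", 0), (0, "Bar", 1), (1, "Baz", 0)])

plan = cur.execute("""EXPLAIN QUERY PLAN
    SELECT page_namespace, page_title FROM page
    WHERE page_is_redirect = 0""").fetchall()
# The detail column mentions COVERING INDEX: no table access needed.
print(plan[0][3])
```

Dropping a non-indexed column into the SELECT list makes the COVERING
INDEX note disappear, since the engine then has to visit the row data.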

> I'm actually in the process of trying to import enwiki into a
> postgresql database right now.  Attempt 1 was to import everything
> into a MySQL database (with no indexes), export it as a TSV file, then
> import from the TSV file into Postgresql.  Hit a snag with some data
> that Postgres is saying isn't valid UTF8, which is probably due to
> something I did wrong with the import, but I can't figure out what it
> is.

The most likely problem is that it's not actually valid UTF-8.  There
are some places where we end up truncating things bytewise rather than
characterwise, like edit summaries, leaving only the first byte or two
of a multibyte character.  Last I checked, edit summaries were just
passed to the database with no hard length check, so MySQL in
non-strict mode with a binary schema (like Wikipedia) will just
truncate them to fit.  (In strict mode it will raise an error, and if
we used utf8 rather than binary it would presumably truncate by
characters.)
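The bytewise-versus-characterwise distinction is easy to demonstrate (a
sketch in Python; the sample summary text is made up):

```python
# Truncating UTF-8 by bytes can split a multibyte character, leaving
# invalid UTF-8 -- which is what a binary-schema column gets when a
# too-long value is cut to fit.
summary = "naïve edit"            # 'ï' is two bytes in UTF-8 (0xC3 0xAF)
raw = summary.encode("utf-8")     # 11 bytes for 10 characters

truncated = raw[:3]               # cut mid-character: b'na\xc3'
try:
    truncated.decode("utf-8")
except UnicodeDecodeError:
    print("bytewise truncation left invalid UTF-8")

# Truncating by characters instead keeps the result valid:
print(summary[:3].encode("utf-8"))  # b'na\xc3\xaf' -- the whole 'ï'
```

A strict importer like PostgreSQL rejects the first kind of value
outright, which matches the error Anthony is seeing.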

On Sat, Nov 21, 2009 at 10:40 PM, Ryan Chan <[email protected]> wrote:
> Is the source available in the svn? Can you point me to the right direction?

http://svn.wikimedia.org/viewvc/mysql/

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l