Halfak added a comment.

@daniel, it seems that normalizing the property references out of that table would save considerable amounts of space.

E.g. with the current scheme, we have

  • eu_row_id -- bigint(20) -- 8 bytes
  • eu_entity_id -- varbinary(255) -- 256 bytes
  • eu_aspect_id -- varbinary(37) -- 38 bytes
  • eu_page_id -- int(11) -- 4 bytes
  • eu_touched -- binary(14) -- 14 bytes

I ran a query to find out the average size of the eu_entity_id and eu_aspect_id fields.

[enwiki]> select avg(length(eu_entity_id)), avg(length(eu_aspect)) from wbc_entity_usage;
+---------------------------+------------------------+
| avg(length(eu_entity_id)) | avg(length(eu_aspect)) |
+---------------------------+------------------------+
|                    7.9961 |                 1.1235 |
+---------------------------+------------------------+
1 row in set (3 min 43.88 sec)

So that cuts the respective byte usage down to 9 bytes and 3 bytes respectively.

8 + 9 + 3 + 4 + 14 = 38 bytes per row.

38 bytes * 11167312 rows (in enwiki) = 404.69 MB

OK, so now if we make a schema change and update the wbc_entity_usage table, we'd get something like this:

  • eu_row_id -- bigint(20) -- 8 bytes
  • eu_entity_id -- varbinary(255) -- ~9 bytes
  • eu_aspect_id -- varbinary(37) -- ~3 bytes
  • eu_property_id -- unsignedint -- 4 bytes
  • eu_page_id -- int(11) -- 4 bytes
  • eu_touched -- binary(14) -- 14 bytes

8 + 9 + 3 + 4 + 14 = 42 bytes per row. But now, we'll have multiple rows per entity. Right now, there are 3927150 that use "X" or "O" as an aspect. Let's say that, on average, each entity/page relationship involves 5 properties.

(3927150 * 4) + 11167312 rows = 26875912 rows.
42 bytes * 26875912 rows = 1.13 GB

If instead, we store the separate table that @Hall1467 proposes, that will add an additional 3927150 * 5 = 19635750 rows of a much smaller table.

  • page_id -- unsigned int -- 4 bytes
  • entity_id -- varbinary(255) -- ~9 bytes
  • property_id -- unsigned int -- 4 bytes

4 + 9 + 4 = 17 bytes

17 bytes * 19635750 = 333.8 MB

333.8 MB + 404.69 = 738 MB

1.13 GB - 738 MB = 392 MB storage savings.


TASK DETAIL
https://phabricator.wikimedia.org/T151717

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Halfak
Cc: Halfak, jcrespo, TomT0m, Hall1467, hoo, zhuyifei1999, Eloquence, Lydia_Pintscher, Sannita, Ainali, Liuxinyu970226, MZMcBride, Ricordisamoa, Micru, jayvdb, Daniel_Mietchen, Tobi_WMDE_SW, Legoktm, Abraham, Wikidata-bugs, liangent, jeremyb, aude, Candalua, Bianjiang, Aklapper, DixonD, daniel, D3r1ck01, Izno, Mbch331
_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

Reply via email to