[Wikitech-l] Re: Reducing size of pageviews dump (shared link on the article)

2022-11-22 Thread Dušan Kreheľ
For me, the article is now done. D. K. 2022-11-08 21:32 GMT+01:00, Dušan Kreheľ: > [Fix]: A link to the source code has been added. @Dan Andreescu: The format is correct. The annual summary is a typical basic statistical interval, and we save time by merging. The file size problem

[Wikitech-l] Re: Reducing size of pageviews dump (shared link on the article)

2022-11-08 Thread Dušan Kreheľ
[Fix]: A link to the source code has been added. @Dan Andreescu: The format is correct. The annual summary is a typical basic statistical interval, and we save time by merging. The file size problem disappears if the file is split by local wikis. And the skwiki is only 49MB for the year 2021,

[Wikitech-l] Re: Reducing size of pageviews dump (shared link on the article)

2022-11-08 Thread Dušan Kreheľ
A link to the source code has been added. @Dan Andreescu: The format is correct now. The annual summary is a typical basic statistical interval, and we save time by merging. The file size problem disappears if the file is split by wiki. And the skwiki has only 49 MB for the year 2021, which does

[Wikitech-l] Re: Reducing size of pageviews dump (shared link on the article)

2022-10-06 Thread Dan Andreescu
@Dušan Kreheľ: I think there's a misunderstanding. I read your re-written article. In it, you say that the current format is "domain_code page_title count_views total_response_size". As an example, you give "sk Kreheľ 2 0". But, actually, that format is deprecated and the new format is
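
To make the four-field layout quoted in this message concrete, here is a minimal Python sketch that parses one such line. It covers only the layout the message describes as deprecated; the newer format is cut off in this preview and is not modelled. The names `PageviewRecord` and `parse_line` are hypothetical, chosen for illustration, and the sketch assumes that fields are separated by single spaces and that page titles contain no spaces.

```python
from typing import NamedTuple


class PageviewRecord(NamedTuple):
    """One line of the four-field layout quoted in the message (hypothetical type)."""
    domain_code: str
    page_title: str
    count_views: int
    total_response_size: int


def parse_line(line: str) -> PageviewRecord:
    """Parse 'domain_code page_title count_views total_response_size'.

    Assumption: fields are separated by single spaces and the title
    itself contains no spaces.
    """
    fields = line.rstrip("\n").split(" ")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    domain_code, page_title, count_views, total_response_size = fields
    return PageviewRecord(
        domain_code, page_title, int(count_views), int(total_response_size)
    )


# The example line given in the message.
record = parse_line("sk Kreheľ 2 0")
print(record)
```

A named tuple keeps the field names from the message attached to the values, so later code can read `record.count_views` instead of indexing by position.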

[Wikitech-l] Re: Reducing size of pageviews dump (shared link on the article)

2022-10-01 Thread Dušan Kreheľ
The big update of the article is done. Please take a look. Gergő Tisza: The current fresh hourly format can remain. It can be converted to another format later, and thus be more suitable for others. 2022-09-18 22:35 GMT+02:00, Dušan Kreheľ: > I have updated the document. I added the export of human

[Wikitech-l] Re: Reducing size of pageviews dump (shared link on the article)

2022-09-18 Thread Dušan Kreheľ
I have updated the document. I added the export of human pageviews for the year 2021. The statistics are in the article. A download link has been added. Dan Andreescu: It was no problem to understand you. 2022-09-05 21:48 GMT+02:00, Dan Andreescu: > Hi Dušan, I added the details on

[Wikitech-l] Re: Reducing size of pageviews dump (shared link on the article)

2022-09-05 Thread Dan Andreescu
Hi Dušan, I added the details on pageviews_complete to the talk page on your proposal. Please let me know if it's still confusing.

[Wikitech-l] Re: Reducing size of pageviews dump (shared link on the article)

2022-09-05 Thread Dušan Kreheľ
Thiemo and all: I also added tests for the binary version. Dušan. 2022-09-05 2:45 GMT+02:00, Dan Andreescu: > Our pageview dumps were in the middle of a refactor when our team changed a lot. We haven't been able to finish it, but we do actually have a well-compressed version that we just

[Wikitech-l] Re: Reducing size of pageviews dump (shared link on the article)

2022-09-04 Thread Dan Andreescu
Our pageview dumps were in the middle of a refactor when our team changed a lot. We haven't been able to finish it, but we do actually have a well-compressed version that we just haven't properly launched as a new dataset. I'm working on prioritizing that. On Sun, Sep 4, 2022 at 02:58 Gergő

[Wikitech-l] Re: Reducing size of pageviews dump (shared link on the article)

2022-09-04 Thread Gergő Tisza
I'd imagine the current format is optimized for being able to output hourly dumps (and thus reducing data latency and data processing costs), not so much for storage space

[Wikitech-l] Re: Reducing size of pageviews dump (shared link on the article)

2022-09-03 Thread Dušan Kreheľ
Hello Thiemo. I updated the document. Please look at the document or at its changes. I think that low numeric values are better stored as text. One reason, for example, is that the raw data take less memory. For example, for the input "1 15 85" the text size is 7 B, but in memory format would be minimal 3
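
The size comparison in this message can be checked with a short Python sketch. The preview is cut off before it says which binary width the author has in mind, so the widths below (1-byte and 4-byte unsigned integers) are assumptions for illustration only, and `fixed_width_size` is a hypothetical helper.

```python
import struct

# The three values from the example input "1 15 85".
values = [1, 15, 85]

# Text form: decimal digits separated by single spaces.
text_bytes = " ".join(str(v) for v in values).encode("ascii")


def fixed_width_size(numbers, type_code):
    """Size in bytes when every number is packed with the same struct type code.

    'B' is an unsigned 1-byte integer and 'I' an unsigned 4-byte integer
    in struct's standard (little-endian, unpadded) mode.
    """
    return len(struct.pack(f"<{len(numbers)}{type_code}", *numbers))


print(len(text_bytes))                # 7
print(fixed_width_size(values, "B"))  # 3
print(fixed_width_size(values, "I"))  # 12
```

Under these assumptions the text form (7 B) is smaller than a 4-byte-per-value layout (12 B) but larger than a 1-byte-per-value layout (3 B), which holds values only up to 255.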

[Wikitech-l] Re: Reducing size of pageviews dump (shared link on the article)

2022-09-03 Thread Thiemo Kreuz
Hello Dušan, I find this really fascinating. Unfortunately, it looks like the article doesn't explain the proposed format. Where is the domain in the new format? What does "DAY_HOUR" mean? What's the difference between "DAY_HOUR2", "DAY2_HOUR", and "DAY2_HOUR2"? What is the file naming scheme for