Aaron,
this seems kind of redundant, as I already agreed that there is an overall high
correlation and you posted this (almost) identical analysis 7 months ago. I
don't know if you missed my later emails on the topic, but I already wrote that
this "mistake", as you repeatedly put it, was a result
Hi Fabian,
I think that the primary reason articles with smaller byte counts show
less consistency is templates. A lot of stub- and start-class articles are
created with a collection of templates that consume few bytes of wikitext
but balloon into lots of HTML content. Regardless, there doesn't
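For illustration, a quick way to see this ballooning for a given page (a rough
sketch using the MediaWiki parse API; "Berlin" is just an example title):

    import requests

    API = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "parse",
        "page": "Berlin",        # example page; any title works
        "prop": "text|wikitext",
        "format": "json",
        "formatversion": "2",
    }
    parse = requests.get(API, params=params).json()["parse"]
    print("wikitext bytes:", len(parse["wikitext"].encode("utf-8")))
    print("rendered HTML characters:", len(parse["text"]))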
Hi,
Aaron, I tend to agree with your conclusion, and personally have little
interest in the relationship between actual size and readable size.
But from a technical point of view, I guess you should plot your scatter plot
on a log-log scale and also calculate the correlation between the logarithm
of
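Something like this is what I mean (a minimal sketch with placeholder numbers,
where byte_size and readable stand for the per-article measurements):

    import numpy as np
    from scipy.stats import pearsonr

    # placeholder data: (wikitext bytes, readable characters) per article
    byte_size = np.array([512.0, 2300.0, 5900.0, 48000.0, 310000.0])
    readable = np.array([200.0, 1400.0, 3600.0, 30000.0, 190000.0])

    mask = (byte_size > 0) & (readable > 0)   # logarithms need positive values
    r, p = pearsonr(np.log10(byte_size[mask]), np.log10(readable[mask]))
    print("correlation of the logarithms: r=%.3f (p=%.3g)" % (r, p))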
Hi Fabian,
I can honestly say I had never seen an article like Timeline of
architectural styles 1000–present.
But even with that one, after removing everything I could interpret as hidden
or code-generated, I wound up with a lot more than 95 bytes:
6000BC–1000AD • 1000–1750 • 1750–1900 • 1900–Present
Update:
Not surprisingly, and congruent with Aaron's results, I also get a high linear
correlation of 0.96 (random sample of 5000 articles) outside the 5800-6000
sample, even if I filter out disambiguation articles. See scatterplot [1].
So, first of all, it can be fairly safely concluded that our methods
From: Giovanni Luca Ciampaglia [glciamp...@gmail.com]
Sent: Wednesday, August 07, 2013 3:55 PM
To: Floeck, Fabian (AIFB)
Subject: Re: [Wiki-research-l] Readable characters vs. size in bytes of articles
Hi Fabian,
in principle you should be able to recover the same correlation also in the
range 5800-6000 bytes
@Jonathan: Good point, but I'm actually not stripping the content of tables,
just the mark-up of the tables. (Also, I leave the whitespace in and count
it, and just remove line breaks, as the cleaning leaves a lot of empty lines.)
I checked the results manually in over 50 cases, and what my script
Fabian,
I suspect that your primary mistake is in only looking at pages with byte
length between 5800 and 6000 bytes. You've severely limited the range of
your regressor and therefore invalidated a set of assumptions for the
correlation. If you still don't think that you made a mistake, I
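To illustrate the point with synthetic data (my own sketch, not your data):
even a near-perfect linear relationship over the full range of sizes shows
almost no correlation once you condition on a 200-byte window:

    import numpy as np

    rng = np.random.default_rng(0)
    size = rng.uniform(100, 100000, 50000)                  # article size in bytes
    readable = 0.6 * size + rng.normal(0, 2000, size.size)  # noisy linear relation

    full_r = np.corrcoef(size, readable)[0, 1]
    band = (size >= 5800) & (size <= 6000)                  # the restricted window
    band_r = np.corrcoef(size[band], readable[band])[0, 1]
    print("full range: r=%.3f   5800-6000 bytes: r=%.3f" % (full_r, band_r))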
Thanks to both of you,
I suspect that you two are using very different rules to define readable
characters; for Aaron to get a close correlation and for Fabian to get no
correlation at all implies to me that Fabian is stripping out the things that
are not linked to article size, and that Aaron may be
I am removing all HTML tags and comments to include only those characters
that are shown on the screen. This includes the content of tables
without including the markup contained within. In other words, I strip
anything out of the HTML that looks like a tag (e.g. <foo> and </bar>)
or a
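For what it's worth, this is roughly what that stripping amounts to (my own
approximation in Python, not the actual script used):

    import re

    def readable_chars(html):
        # drop HTML comments first, then anything that looks like a tag
        text = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
        text = re.sub(r"<[^>]+>", "", text)
        return len(text)  # count only what is shown on screen

    print(readable_chars("<p>Hello <b>world</b><!-- hidden --></p>"))  # -> 11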
Hello,
When in 2008 I made some observations on language versions, it struck me
that in some cases the wiki syntax and the meta article information took up
more kilobytes than the whole encyclopedic content of an article. For example,
the wikicode of the article Berlin in Upper Sorbian consisted of more than
Hi Aaron,
I'm not sure how Fabian limiting his byte length to 5,800-6,000 would make
a difference. But as you've confirmed that your formula includes both the
whitespace and the contents of tables, I suspect we just need Fabian to
confirm that he ignores both, and we have an explanation for the
(Note that I posted this yesterday, but the message bounced due to the
attached scatter plot. I have just uploaded the plot to Commons and re-sent.)
I just replicated this analysis. I think you might have made some
mistakes.
I took a random sample of non-redirect articles from English Wikipedia and
Hi,
to whoever is interested in this (and I hope I didn't just repeat someone
else's experiments on this):
I wanted to know whether the length of an article, in terms of how much
readable material (excluding pictures) is presented to the reader in the
front end, is correlated with the byte size of the