Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

2014-03-15 Thread Floeck, Fabian (AIFB)
Aaron, this seems kind of redundant, as I already agreed that there is an overall high correlation and you posted this (almost) identical analysis 7 months ago. I don't know if you missed my later emails on the topic, but I already wrote that this "mistake", as you repeatedly put it, was a result

Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

2014-03-15 Thread Aaron Halfaker
Hi Fabian, I think that the primary reason that articles with smaller byte counts show less consistency is due to templates. A lot of stubs and starts are created with a collection of templates that consume few bytes of wikitext, but balloon into lots of HTML/content. Regardless, there doesn't

Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

2014-03-15 Thread Taha Yasseri
Hi Aaron, I tend to agree with your conclusion, and personally have little interest in the relationship between actual size and readable size. But from a technical point of view, I guess you should draw your scatter plot on a log-log scale and also calculate the correlation between the logarithm of
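
A minimal sketch of the log-log correlation suggested above, assuming the two measurements are available as plain lists of positive numbers. The function names and the sample values are made up for illustration and are not data from the thread.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log_log_correlation(sizes, readable):
    """Correlate log10(size) with log10(readable characters).

    Pairs with a non-positive value are skipped because their
    logarithm is undefined.
    """
    pairs = [(math.log10(s), math.log10(r))
             for s, r in zip(sizes, readable) if s > 0 and r > 0]
    log_sizes, log_readable = zip(*pairs)
    return pearson(log_sizes, log_readable)


# Hypothetical values following a power law: readable = 2 * size ** 0.5
sizes = [10, 100, 1000, 10000]
readable = [2 * s ** 0.5 for s in sizes]
raw_r = pearson(sizes, readable)
log_r = log_log_correlation(sizes, readable)
print(f"raw r = {raw_r:.3f}, log-log r = {log_r:.3f}")
```

On skewed data such as article sizes, the raw coefficient tends to be dominated by the few largest articles, while the log-log coefficient weights all orders of magnitude more evenly.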

Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

2013-08-10 Thread WereSpielChequers
Hi Fabian, I can honestly say I had never seen an article like Timeline of architectural styles 1000–present. But even with that one, and removing everything I could interpret as hidden or code generated, I wound up with a lot more than 95 bytes: 6000BC–1000AD • 1000–1750 • 1750–1900 1900–Present

Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

2013-08-07 Thread Floeck, Fabian (AIFB)
Update: Not surprisingly, and congruent with Aaron's results, I also get a high linear correlation of 0.96 (random sample of 5000 articles) outside the 5800-6000 sample, even if I filter out Disamb articles. See scatterplot [1]. So first of all, it can be fairly certainly concluded that our methods

Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

2013-08-07 Thread Floeck, Fabian (AIFB)
Luca Ciampaglia [glciamp...@gmail.com] Sent: Wednesday, August 07, 2013 3:55 PM To: Floeck, Fabian (AIFB) Subject: Re: [Wiki-research-l] Readable characters vs. size in bytes of articles Hi Fabian, in principle you should be able to recover the same correlation also in the range 5800-6000 Kb

Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

2013-08-07 Thread Floeck, Fabian (AIFB)
From: Giovanni Luca Ciampaglia [glciamp...@gmail.com] Sent: Wednesday, August 07, 2013 3:55 PM To: Floeck, Fabian (AIFB) Subject: Re: [Wiki-research-l] Readable characters vs. size in bytes of articles Hi Fabian, in principle you should be able to recover

Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

2013-08-06 Thread Federico Leva (Nemo)
Ziko van Dijk, 06/08/2013 02:12: Hello, When in 2008 I made some observations on language versions, it struck me that in some cases the wikisyntax and the meta article information was more KB than the whole encyclopedic content of an article. For example, the wikicode of the article Berlin in

Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

2013-08-06 Thread Floeck, Fabian (AIFB)
@Jonathan: Good point, but I'm actually not stripping the content of tables, just the mark-up of the tables. (Also I leave the whitespaces in and count them, just remove line breaks, as the cleaning leaves a lot of empty lines) I checked the results manually in over 50 cases and what my script

Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

2013-08-05 Thread Aaron Halfaker
Fabian, I suspect that your primary mistake is in only looking at pages with byte length between 5800 and 6000 bytes. You've severely limited the range of your regressor and therefore invalidated a set of assumptions for the correlation. If you still don't think that you made a mistake, I
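
The range-restriction effect described above can be illustrated with a small simulation. This is only a sketch on synthetic data: the linear relation (0.6 readable characters per byte), the noise level, and the size range are invented for illustration and are not estimates from Wikipedia.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(n=100_000, seed=0):
    """Return (r over the full range, r inside 5800-6000 bytes, band size)."""
    rng = random.Random(seed)
    # Assumed model: readable size is linear in byte size plus Gaussian noise.
    sizes = [rng.uniform(500, 100_000) for _ in range(n)]
    readable = [0.6 * s + rng.gauss(0, 3000) for s in sizes]
    r_full = pearson(sizes, readable)
    # Keep only the articles whose byte size falls in the narrow band.
    band = [(s, r) for s, r in zip(sizes, readable) if 5800 <= s <= 6000]
    band_sizes, band_readable = zip(*band)
    r_band = pearson(band_sizes, band_readable)
    return r_full, r_band, len(band)


r_full, r_band, n_band = simulate()
print(f"full range:     r = {r_full:.2f}")
print(f"5800-6000 band: r = {r_band:.2f} (n = {n_band})")
```

Inside the band the byte size varies by at most 200 bytes, so the noise swamps the trend and the coefficient falls toward zero even though the underlying relation is the same in both samples.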

Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

2013-08-05 Thread WereSpielChequers
Thanks both of you, I suspect that you two are using very different rules to define readable characters, and for Aaron to get a close correlation and Fabian not to get any correlation implies to me that Fabian is stripping out the things that are not linked to article size, and that Aaron may be

Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

2013-08-05 Thread Aaron Halfaker
I am removing all HTML tags and comments to include only those characters that are shown on the screen. This will include the content of tables without including the markup contained within. In other words, I stripped anything out of the HTML that looked like a tag (e.g. foo and /bar) or a
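
A rough sketch of the counting approach described above, assuming the rendered HTML of an article is already available as a string. The regular expressions and the name `readable_chars` are illustrative assumptions, not the script used in the thread, and a regex is only an approximation of real HTML parsing (for example, the contents of `script` and `style` elements would still be counted).

```python
import re
from html import unescape

# Comments can span several lines, hence DOTALL.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Anything that looks like a tag: "<", then non-">" characters, then ">".
TAG_RE = re.compile(r"<[^>]+>")


def readable_chars(html_text: str) -> int:
    """Count the characters left after removing comments and tags.

    Table cell contents and whitespace are kept; only the markup
    around them is dropped. Entities such as &amp; count as one
    character each.
    """
    without_comments = COMMENT_RE.sub("", html_text)
    without_tags = TAG_RE.sub("", without_comments)
    return len(unescape(without_tags))


sample = "<p>Hello <b>world</b><!-- hidden note --></p>"
print(readable_chars(sample))
```

Comments are removed before tags so that a `>` inside a comment cannot end a tag match early.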

Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

2013-08-05 Thread Ziko van Dijk
Hello, When in 2008 I made some observations on language versions, it struck me that in some cases the wikisyntax and the meta article information was more KB than the whole encyclopedic content of an article. For example, the wikicode of the article Berlin in Upper Sorabian consisted of more than

Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

2013-08-05 Thread WereSpielChequers
Hi Aaron, I'm not sure how Fabian limiting his byte length to 5,500-6,000 would make a difference. But as you've confirmed that your formula includes both the whitespace and the contents of tables, I suspect we just need Fabian to confirm that he ignores both and we have an explanation for the

Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

2013-08-04 Thread Aaron Halfaker
(note that I posted this yesterday, but the message bounced due to the attached scatter plot. I just uploaded the plot to commons and re-sent) I just replicated this analysis. I think you might have made some mistakes. I took a random sample of non-redirect articles from English Wikipedia and

[Wiki-research-l] Readable characters vs. size in bytes of articles

2013-08-02 Thread Floeck, Fabian (AIFB)
Hi, to whoever is interested in this (and I hope I didn't just repeat someone else's experiments on this): I wanted to know if a long or short article in terms of how much readable material (excluding pictures) is presented to the reader in the front-end is correlated to the byte size of the