Re: [Wiki-research-l] Verifying claims about ENWP project size

2015-09-16 Thread WereSpielChequers
I'm pretty sure that English Wikipedia is the largest English language encyclopaedia, but there are some humongous ones in China. Baidu Baike, with almost 12.5 million articles, is way bigger than any one language version of Wikipedia, and Baike.com, formerly Hudong, is about a million bigger

Re: [Wiki-research-l] Verifying claims about ENWP project size

2015-09-16 Thread Oliver Keyes
Search is a Discovery team focus, rather than a Readership focus. I'd suggest reaching out to Dan Garry (we have been talking about project integration very recently, actually).
On 16 September 2015 at 15:32, Pine W wrote:
> I was thinking in terms of GB of text.
>
> I too

Re: [Wiki-research-l] Verifying claims about ENWP project size

2015-09-16 Thread Pine W
I was thinking in terms of GB of text. I too have wondered about creating closer ties between Wiktionary, Wikipedia and Wikisource so that it's easier for someone to start their search on one site and quickly find relevant pages on the other sites. This might (among other things) lead to an

Re: [Wiki-research-l] Verifying claims about ENWP project size

2015-09-16 Thread Pine W
Oh, ok! Now pinging Discovery Dan. (:
On Sep 16, 2015 1:04 PM, "Oliver Keyes" wrote:
> Search is a Discovery team focus, rather than a Readership focus. I'd
> suggest reaching out to Dan Garry (we have been talking about project
> integration very recently, actually).
>

[Wiki-research-l] Verifying claims about ENWP project size

2015-09-15 Thread Pine W
Hi researchers,
I could use a little help with understanding these dumps:
https://dumps.wikimedia.org/enwikisource/latest/
https://dumps.wikimedia.org/enwiki/20150901/
I'm trying to verify the claim that ENWP is the world's largest open text project, and to do that I need to verify that ENWP

Re: [Wiki-research-l] Verifying claims about ENWP project size

2015-09-15 Thread Jonathan Morgan
Hi Pine, TL;DR: best to just say it's the largest encyclopedia ever. That should be safe. Claims like this are hard to make because terms that seem concrete from afar tend to break down up close. For example: What do you mean by largest? Largest in bytes? Words? Content "units" (articles vs.