I'd like to point out that the increasingly technical nature of this conversation probably belongs either on wikitech-l or off-list, and that the strident tone of the comments is fast approaching inappropriate.
Alex
Wikimedia-l list administrator

2012/5/17 Anthony <wikim...@inbox.org>

> On Thu, May 17, 2012 at 2:06 AM, John <phoenixoverr...@gmail.com> wrote:
> > On Thu, May 17, 2012 at 1:52 AM, Anthony <wikim...@inbox.org> wrote:
> >> On Thu, May 17, 2012 at 1:22 AM, John <phoenixoverr...@gmail.com> wrote:
> >> > Anthony, the process is linear: you have a PHP script inserting X rows
> >> > per Y time frame.
> >>
> >> Amazing. I need to switch all my databases to MySQL. It can insert X
> >> rows per Y time frame, regardless of whether the database is 20
> >> gigabytes or 20 terabytes in size, regardless of whether the average
> >> row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
> >> RAID array or a cluster of servers, etc.
> >
> > When referring to X over Y time, it's an average of, say, 1000 revisions
> > per minute. Any X over Y period must be considered with averages in
> > mind, or getting a count wouldn't be possible.
>
> The *average* en.wikipedia revision is more than twice the size of the
> *average* simple.wikipedia revision. The *average* performance of a
> 20 gig database is faster than the *average* performance of a 20
> terabyte database. The *average* performance of your laptop's thumb
> drive is different from the *average* performance of a(n array of)
> drive(s) which can handle 20 terabytes of data.
>
> > If you set up your server/hardware correctly, it will compress the text
> > information during insertion into the database.
>
> Is this how you set up your simple.wikipedia test? How long does it
> take to import the data if you're using the same compression mechanism
> as WMF (which, you didn't answer, but I assume is concatenation and
> compression)? How exactly does this work "during insertion" anyway?
> Does it intelligently group sets of revisions together to avoid
> decompressing and recompressing the same revision several times? I
> suppose it's possible, but that would introduce quite a lot of
> complication into the import script, slowing things down dramatically.
>
> What about the answers to my other questions?
>
> >> If you want to put your money where your mouth is, import
> >> en.wikipedia. It'll only take 5 days, right?
> >
> > If I actually had a server or the disk space to do it I would, just to
> > prove your smartass comments as stupid as they actually are. However,
> > given my current resource limitations (fairly crappy internet
> > connection, older laptops, and a lack of HDD space), I tried to select
> > something that could give reliable benchmarks. If you're willing to
> > foot the bill for the new hardware, I'll gladly prove my point.
>
> What you seem to be saying is that you're *not* putting your money
> where your mouth is.
>
> Anyway, if you want, I'll make a deal with you. A neutral third party
> rents the hardware at Amazon Web Services (AWS). We import the
> simple.wikipedia full history (concatenating and compressing during
> import). We take the ratio of the number of revisions in
> simple.wikipedia to the number of revisions in en.wikipedia. We import
> the en.wikipedia full history (concatenating and compressing during
> import). If the ratio of the time it takes to import en.wikipedia vs.
> simple.wikipedia is greater than or equal to twice the ratio of
> revisions, then you reimburse the third party. If the ratio of import
> times is less than twice the ratio of revisions (you claim it is
> linear, therefore it'll be the same ratio), then I reimburse the third
> party.
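The "concatenation and compression" scheme referred to above works, very roughly, by grouping consecutive revisions of a page and compressing each group as a single blob, so that adding a revision means rebuilding the blob. The following is a minimal sketch under that assumption, with an invented batch layout; it is not WMF's actual import or external-storage code.

```python
# Rough sketch (not WMF's code) of concatenation-and-compression:
# consecutive revisions of a page are grouped into a batch and the batch
# is compressed as one blob; appending a revision rebuilds the blob.
import zlib

def compress_batch(revision_texts):
    """Concatenate a batch of revision texts and compress them together.

    Compressing adjacent, nearly identical revisions as one unit is what
    keeps the storage small.
    """
    joined = "\x00".join(revision_texts).encode("utf-8")
    return zlib.compress(joined)

def append_revision(blob, new_text):
    """Add one revision to an existing compressed blob: decompress the
    whole batch, extend it, recompress -- the per-insert overhead being
    argued about in the thread."""
    texts = zlib.decompress(blob).decode("utf-8").split("\x00")
    texts.append(new_text)
    return compress_batch(texts)

if __name__ == "__main__":
    revisions = ["First draft of the article.",
                 "First draft of the article, with a typo fixed."]
    blob = compress_batch(revisions)
    blob = append_revision(blob, "First draft of the article, expanded a bit.")
    print(len(blob), "bytes compressed")
```

The sketch is only meant to make the trade-off concrete: batching revisions before compressing gives the space savings, but doing it "during insertion" either requires the importer to group revisions intelligently or forces repeated decompress/recompress cycles on the same data.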
> Either way, we save the new dump, with the processing already done,
> and send it to archive.org (and to the WMF, if they're willing to host
> it). So we actually get a useful result out of this. It's not just
> for the purpose of settling an argument.
>
> Either of us can concede defeat at any point and stop the experiment.
> At that point, if the neutral third party wishes to pay to continue
> the job, s/he would be responsible for the additional costs.
>
> Shouldn't be too expensive. If you concede defeat after 5 days, then
> your CPU-time costs are $54 (assuming an Extra Large High-Memory
> Instance). Including 4 terabytes of EBS (which should be enough if you
> compress on the fly) for 5 days should be less than $100.
>
> I'm tempted to do it even if you don't take the bet.

_______________________________________________
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l
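For anyone following the arithmetic, here is a minimal sketch of the proposed bet's decision rule and the quoted CPU-time figure. The hourly rate is back-calculated from the email's own numbers ($54 over 5 days), and the revision counts and import times in the example are placeholders, not measurements.

```python
# Hedged sketch of the bet's decision rule and the quoted CPU-time figure.
# The hourly rate is inferred from the email itself ($54 for 5 days, i.e.
# $0.45/hour); all inputs in the example are made-up placeholders.

HOURS = 5 * 24                 # the 5-day window mentioned in the email
rate_per_hour = 54.0 / HOURS   # => $0.45/hour, consistent with the $54 figure
print(f"CPU-time for 5 days at ${rate_per_hour:.2f}/hour: ${rate_per_hour * HOURS:.0f}")

def who_pays(time_en_hours, time_simple_hours, revs_en, revs_simple):
    """Apply the bet's rule, taking both ratios in the same en:simple
    direction: if the import-time ratio is at least twice the revision
    ratio, John (the "linear" claim) reimburses the third party;
    otherwise Anthony does."""
    time_ratio = time_en_hours / time_simple_hours
    rev_ratio = revs_en / revs_simple
    return "John reimburses" if time_ratio >= 2 * rev_ratio else "Anthony reimburses"

# Placeholder example (made-up numbers, purely to show how the rule applies):
print(who_pays(time_en_hours=120, time_simple_hours=0.5,
               revs_en=500_000_000, revs_simple=2_000_000))
```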