Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Tei
On Wed, Jan 28, 2009 at 8:28 AM, Tei  wrote:
> On Wed, Jan 28, 2009 at 1:41 AM, Aryeh Gregor
>  wrote:
>> On Tue, Jan 27, 2009 at 7:37 PM, George Herbert
>>  wrote:
>>> Right, but a live mirror is a very different thing than a search box link.
>>
>> Well, as far as I can tell, we have no idea whether the original
>> poster meant either of those, or perhaps something else altogether.
>> Obviously nobody minds a search box link, that's just a *link*.  You
>> can't stop people from linking to you.
>>
>
> This code doesn't even need to use
> http://en.wiktionary.org/wiki/Special:Search
>
> 
> 
> 
> 
>
> 
>
> function $(name){
>  return document.getElementById(name);
> }
>
> function searchWiktionary(){
>  var word = $("word").value;
>  $("form1").setAttribute("action","http://en.wiktionary.org/wiki/"+
> escape(word) );
>  $("form1").submit();
> }
> 
>
>

PS:
I know the OP was talking about OpenSearch. This snippet of code is
something different.
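
For the OpenSearch route itself, MediaWiki's api.php exposes an OpenSearch
suggestion action; a rough sketch of querying it (endpoint and parameters
from memory, so double-check against the live API):

  curl 'http://en.wiktionary.org/w/api.php?action=opensearch&search=word'
  # returns JSON along the lines of ["word", ["word", "wordbook", ...]]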

-- 
--
ℱin del ℳensaje.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Tei
On Wed, Jan 28, 2009 at 1:41 AM, Aryeh Gregor
 wrote:
> On Tue, Jan 27, 2009 at 7:37 PM, George Herbert
>  wrote:
>> Right, but a live mirror is a very different thing than a search box link.
>
> Well, as far as I can tell, we have no idea whether the original
> poster meant either of those, or perhaps something else altogether.
> Obviously nobody minds a search box link, that's just a *link*.  You
> can't stop people from linking to you.
>

This code doesn't even need to use
http://en.wiktionary.org/wiki/Special:Search

<!-- the list archive stripped the original markup; a minimal reconstruction
     consistent with the script below: -->
<html>
<body>

<form id="form1" method="get">
  <input type="text" id="word">
  <input type="button" value="Search Wiktionary" onclick="searchWiktionary()">
</form>

<script type="text/javascript">
function $(name){
  return document.getElementById(name);
}

function searchWiktionary(){
  var word = $("word").value;
  $("form1").setAttribute("action","http://en.wiktionary.org/wiki/"+
escape(word) );
  $("form1").submit();
}
</script>

</body>
</html>


-- 
--
ℱin del ℳensaje.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Toolserver-l] Crawling deWP

2009-01-27 Thread Marco Schuster
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Wed, Jan 28, 2009 at 1:13 AM, Daniel Kinzler  wrote:
> Marco Schuster wrote:
>>> Fetch them from the toolserver (there's a tool by duesentrieb for that).
>>> It will catch almost all of them from the toolserver cluster, and make a
>>> request to wikipedia only if needed.
>> I highly doubt this is "legal" use for the toolserver, and I would
>> guess that fetching 800k revisions would be a huge resource load.
>>
>> Thanks, Marco
>>
>> PS: CC-ing toolserver list.
>
> It's a legal use; the only problem is that the tool I wrote for it is quite
> slow. You shouldn't hit it at full speed, so it might actually be better to
> query the main server cluster; they can distribute the load more nicely.
What is the best speed, actually? 2 requests per second? Or can I go up to 4?

> One day I'll rewrite WikiProxy and everything will be better :)
:)

> But by then, I do hope we have revision flags in the dumps, because that
> would be The Right Thing to use.
Still, using the dumps would require me to get the full history dump
because I only want flagged revisions and not current revisions
without the flag.

Marco
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (MingW32)
Comment: Use GnuPG with Firefox : http://getfiregpg.org (Version: 0.7.2)

iD8DBQFJgAIpW6S2GapJUuQRAuY/AJ47eppKPbBqjz0l4HllCPolMWz9KACfRurR
Lod/wkd4ZM0ee+cPTfaO7yg=
=zB26
-END PGP SIGNATURE-

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] MediaWiki Slow, what to look for?

2009-01-27 Thread Jason Schulz
http://svn.wikimedia.org/viewvc/mediawiki/trunk/tools/jobs-loop/run-jobs.c?revision=22101&view=markup&sortby=date

As mentioned, it is just a sample script. For sites with just one 
master/slave cluster, any simple script that keeps looping to run 
maintenance/runJobs.php will do.
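
Something along these lines would do; a minimal sketch (the wiki path and
the sleep interval are placeholders to adjust for your setup):

#!/bin/sh
# run any queued jobs; runJobs.php exits when the queue is empty,
# so just pause briefly and check again
while true; do
    php /var/www/wiki/maintenance/runJobs.php
    sleep 5
done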

-Aaron

--
From: "Marco Schuster" 
Sent: Tuesday, January 27, 2009 6:56 PM
To: "Wikimedia developers" 
Subject: Re: [Wikitech-l] MediaWiki Slow, what to look for?

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> On Tue, Jan 27, 2009 at 6:56 PM, Jason Schulz  wrote:
>> Also, see
>> http://www.mediawiki.org/wiki/User:Aaron_Schulz/How_to_make_MediaWiki_fast
> The shell script you mention in step 2 has some stuff in it that makes
> it unusable outside Wikimedia:
> 1) lots of hard-coded paths
> 2) what is "/usr/local/bin/run-jobs"?
>
> I'd put "0 0 * * * /usr/bin/php /var/www/wiki/maintenance/runJobs.php >
> /var/log/runJobs.log 2>&1" as a crontab entry in your guide, as it's a
> bit more compatible with non-Wikimedia environments ;)
>
> Marco
> -BEGIN PGP SIGNATURE-
> Version: GnuPG v1.4.7 (MingW32)
> Comment: Use GnuPG with Firefox : http://getfiregpg.org (Version: 0.7.2)
>
> iD8DBQFJf59oW6S2GapJUuQRAvYCAJ4vWBAHSTHlJljfnnUSF7IpZlechQCcCY5A
> Zb5SMJz146sM5HalNQuA/9k=
> =Ie27
> -END PGP SIGNATURE-
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> 

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] MediaWiki Slow, what to look for?

2009-01-27 Thread Platonides
Dawson wrote:
> Modified config file as follows:
> 
> $wgUseDatabaseMessage = false;
> $wgUseFileCache = true;
> $wgMainCacheType = "CACHE_ACCEL";

This should be $wgMainCacheType = CACHE_ACCEL; (constant) not
$wgMainCacheType = "CACHE_ACCEL"; (string)


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Aryeh Gregor
On Tue, Jan 27, 2009 at 7:37 PM, George Herbert
 wrote:
> Right, but a live mirror is a very different thing than a search box link.

Well, as far as I can tell, we have no idea whether the original
poster meant either of those, or perhaps something else altogether.
Obviously nobody minds a search box link, that's just a *link*.  You
can't stop people from linking to you.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread George Herbert
On Tue, Jan 27, 2009 at 3:54 PM, Aryeh Gregor

> wrote:

> Anyway, the reason live mirrors are prohibited is not for load
> reasons.  I believe it's because if a site does nothing but stick up
> some ads and add no value, Wikimedia is going to demand a cut of the
> profit for using its trademarks and so on.  Some sites pay Wikimedia
> for live mirroring.  So the others, in principle, get blocked.


Right, but a live mirror is a very different thing than a search box link.


-- 
-george william herbert
george.herb...@gmail.com
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] [Toolserver-l] Crawling deWP

2009-01-27 Thread Daniel Kinzler
Marco Schuster wrote:
>> Fetch them from the toolserver (there's a tool by duesentrieb for that).
>> It will catch almost all of them from the toolserver cluster, and make a
>> request to wikipedia only if needed.
> I highly doubt this is "legal" use for the toolserver, and I would
> guess that fetching 800k revisions would be a huge resource load.
> 
> Thanks, Marco
> 
> PS: CC-ing toolserver list.

It's a legal use; the only problem is that the tool I wrote for it is quite
slow. You shouldn't hit it at full speed, so it might actually be better to
query the main server cluster; they can distribute the load more nicely.

One day I'll rewrite WikiProxy and everything will be better :)

But by then, I do hope we have revision flags in the dumps, because that
would be The Right Thing to use.

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Crawling deWP

2009-01-27 Thread Marco Schuster
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Wed, Jan 28, 2009 at 12:53 AM, Platonides  wrote:
> Marco Schuster wrote:
>> Hi all,
>>
>> I want to crawl around 800.000 flagged revisions from the German
>> Wikipedia, in order to make a dump containing only flagged revisions.
>> For this, I obviously need to spider Wikipedia.
>> What are the limits (rate!) here, what UA should I use and what
>> caveats do I have to take care of?
>>
>> Thanks,
>> Marco
>>
>> PS: I already have a revisions list, created with the Toolserver. I
>> used the following query: "select fp_stable,fp_page_id from
>> flaggedpages where fp_reviewed=1;". Is it correct this one gives me a
>> list of all articles with flagged revs, fp_stable being the revid of
>> the most current flagged rev for this article?
>
> Fetch them from the toolserver (there's a tool by duesentrieb for that).
> It will catch almost all of them from the toolserver cluster, and make a
> request to wikipedia only if needed.
I highly doubt this is "legal" use for the toolserver, and I would
guess that fetching 800k revisions would be a huge resource load.

Thanks, Marco

PS: CC-ing toolserver list.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (MingW32)
Comment: Use GnuPG with Firefox : http://getfiregpg.org (Version: 0.7.2)

iD8DBQFJf6AjW6S2GapJUuQRAvBuAJ46G0qhk+e2axFddbHFMUqzScH4PgCeIMBL
L9WWNeZaA/6vHyzSoKrGN54=
=p/R+
-END PGP SIGNATURE-

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Crawling deWP

2009-01-27 Thread Marco Schuster
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Wed, Jan 28, 2009 at 12:49 AM, Rolf Lampa  wrote:
> Marco Schuster wrote:
>> I want to crawl around 800.000 flagged revisions from the German
>> Wikipedia, in order to make a dump containing only flagged revisions.
> [...]
>> flaggedpages where fp_reviewed=1;". Is it correct this one gives me a
>> list of all articles with flagged revs,
>
>
> Don't the XML dumps contain the flag for flagged revs?

The XML dumps are no use for me, way too much overhead (especially
as they are old, and I want to use single files; it's easier to process
these than one huge XML file). And they don't contain the flagged
revision flags :(

Marco
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (MingW32)
Comment: Use GnuPG with Firefox : http://getfiregpg.org (Version: 0.7.2)

iD8DBQFJf5/cW6S2GapJUuQRAj1KAJ9feF3ElQTQbuENa2xfDoXJE5pq5QCfYtRd
x8lfmVHMzmVOqtO39MCfieQ=
=8YJP
-END PGP SIGNATURE-

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] MediaWiki Slow, what to look for?

2009-01-27 Thread Marco Schuster
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Tue, Jan 27, 2009 at 6:56 PM, Jason Schulz  wrote:
> Also, see
> http://www.mediawiki.org/wiki/User:Aaron_Schulz/How_to_make_MediaWiki_fast
The shell script you mention in step 2 has some stuff in it that makes
it unusable outside Wikimedia:
1) lots of hard-coded paths
2) what is "/usr/local/bin/run-jobs"?

I'd put "0 0 * * * /usr/bin/php /var/www/wiki/maintenance/runJobs.php >
/var/log/runJobs.log 2>&1" as a crontab entry in your guide, as it's a
bit more compatible with non-Wikimedia environments ;)

Marco
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (MingW32)
Comment: Use GnuPG with Firefox : http://getfiregpg.org (Version: 0.7.2)

iD8DBQFJf59oW6S2GapJUuQRAvYCAJ4vWBAHSTHlJljfnnUSF7IpZlechQCcCY5A
Zb5SMJz146sM5HalNQuA/9k=
=Ie27
-END PGP SIGNATURE-

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Aryeh Gregor
On Tue, Jan 27, 2009 at 6:43 PM, George Herbert
 wrote:
> Google switching to use our search would crush us, obviously.

Doubtful.  It wouldn't be terribly pleasant, but I doubt it would take
down the site so easily.  Alexa says google.com gets about ten times
as much traffic as wikipedia.org.  If google.com/ redirected to
wikipedia.org, I don't know if that would crash the site by itself.

> As would AOL.

Wikipedia is far bigger than AOL.  That would only be a 20% or 30%
spike in traffic.  I'm pretty sure we could handle that.


Anyway, the reason live mirrors are prohibited is not for load
reasons.  I believe it's because if a site does nothing but stick up
some ads and add no value, Wikimedia is going to demand a cut of the
profit for using its trademarks and so on.  Some sites pay Wikimedia
for live mirroring.  So the others, in principle, get blocked.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Crawling deWP

2009-01-27 Thread Platonides
Marco Schuster wrote:
> Hi all,
> 
> I want to crawl around 800.000 flagged revisions from the German
> Wikipedia, in order to make a dump containing only flagged revisions.
> For this, I obviously need to spider Wikipedia.
> What are the limits (rate!) here, what UA should I use and what
> caveats do I have to take care of?
> 
> Thanks,
> Marco
> 
> PS: I already have a revisions list, created with the Toolserver. I
> used the following query: "select fp_stable,fp_page_id from
> flaggedpages where fp_reviewed=1;". Is it correct this one gives me a
> list of all articles with flagged revs, fp_stable being the revid of
> the most current flagged rev for this article?

Fetch them from the toolserver (there's a tool by duesentrieb for that).
It will catch almost all of them from the toolserver cluster, and make a
request to wikipedia only if needed.


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Crawling deWP

2009-01-27 Thread Daniel Kinzler
Rolf Lampa wrote:
> Marco Schuster wrote:
>> I want to crawl around 800.000 flagged revisions from the German
>> Wikipedia, in order to make a dump containing only flagged revisions.
> [...]
>> flaggedpages where fp_reviewed=1;". Is it correct this one gives me a
>> list of all articles with flagged revs, 
> 
> 
> Don't the XML dumps contain the flag for flagged revs?
> 
They don't. And that's very sad.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Crawling deWP

2009-01-27 Thread Rolf Lampa
Marco Schuster wrote:
> I want to crawl around 800.000 flagged revisions from the German
> Wikipedia, in order to make a dump containing only flagged revisions.
[...]
> flaggedpages where fp_reviewed=1;". Is it correct this one gives me a
> list of all articles with flagged revs, 


Don't the XML dumps contain the flag for flagged revs?

// Rolf Lampa

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Brion Vibber
On 1/27/09 2:55 PM, Robert Rohde wrote:
> On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber  wrote:
>> On 1/27/09 2:35 PM, Thomas Dalton wrote:
>>> The way I see it, what we need is to get a really powerful server
>> Nope, it's a software architecture issue. We'll restart it with the new
>> arch when it's ready to go.
>
> I don't know what your timetable is, but what about doing something to
> address the other aspects of the dump (logs, stubs, etc.) that are in
> limbo while full history chugs along.  All the other enwiki files are
> now 3 months old and that is already enough to inconvenience some
> people.
>
> The simplest solution is just to kill the current dump job if you have
> faith that a new architecture can be put in place in less than a year.

We'll probably do that.

-- brion

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread George Herbert
On Tue, Jan 27, 2009 at 11:29 AM, Steve Summit  wrote:

> Jeff Ferland wrote:
> > You'll need a quite impressive machine to host even just the current
> > revisions of the wiki. Expect to expend 10s to even hundreds of
> > gigabytes on the database alone for Wikipedia using only the current
> > versions.
>
> No, no, no.  You're looking at it all wrong.  That's the sucker's
> way of doing it.
>
> If you're smart, you put up a simple page with a text box labeled
> "Wikipedia search", and whenever someone types a query into
> the box and submits it, you ship the query over to the Wikimedia
> servers, and then slurp back the response, and display it back
> to the original submitter.  That way only Wikimedia has to worry
> about all those pesky gigabyte-level database hosting requirements,
> while you get all the glory.
>
> This appears to be what the questioner is asking about.
>

Let's AGF a bit...

Even if someone whose goals have little to do with Wikipedia links to one
of our searches from their page, all of the resulting search result links
point back into Wikipedia.

If people have a question about something, and want to look it up, does it
really matter if they go to Wikipedia's front page and click "search" versus
doing so in another context?

We're providing an information resource - other sites can and often do link
to our articles (quite appropriately).  Why not link to our search?

The search link should in fairness tell people what they're getting, sure,
but that's more of a website-to-end-user disclosure problem than a problem
for us.

Google switching to use our search would crush us, obviously.  As would
AOL.  But J. Random site?  Seems like an ok thing, to me.


-- 
-george william herbert
george.herb...@gmail.com
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Crawling deWP

2009-01-27 Thread Marco Schuster
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi all,

I want to crawl around 800.000 flagged revisions from the German
Wikipedia, in order to make a dump containing only flagged revisions.
For this, I obviously need to spider Wikipedia.
What are the limits (rate!) here, what UA should I use and what
caveats do I have to take care of?

Thanks,
Marco

PS: I already have a revisions list, created with the Toolserver. I
used the following query: "select fp_stable,fp_page_id from
flaggedpages where fp_reviewed=1;". Is it correct this one gives me a
list of all articles with flagged revs, fp_stable being the revid of
the most current flagged rev for this article?
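
For what it's worth, once the revision list exists the fetch loop itself can
be very simple; a rough sketch (file name, pause and User-Agent string are
placeholders pending an answer on the actual rate limits, and action=raw
assumes only the wikitext is needed):

#!/bin/sh
# revids.txt: one fp_stable value per line
while read revid; do
  curl -s -A 'deWP-flagged-dump-bot/0.1 (contact address here)' \
    "http://de.wikipedia.org/w/index.php?oldid=$revid&action=raw" \
    -o "rev_$revid.wiki"
  sleep 2
done < revids.txt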
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (MingW32)
Comment: Use GnuPG with Firefox : http://getfiregpg.org (Version: 0.7.2)

iD8DBQFJf5wcW6S2GapJUuQRAl8NAJ0Xs+ImyTqmoX2Vtj6k6PK9ntlS5wCeJjsl
M5kMETB3URYni5TilIOt8Fs=
=j7Og
-END PGP SIGNATURE-

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Robert Rohde
On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber  wrote:
> On 1/27/09 2:35 PM, Thomas Dalton wrote:
>> The way I see it, what we need is to get a really powerful server
>
> Nope, it's a software architecture issue. We'll restart it with the new
> arch when it's ready to go.

I don't know what your timetable is, but what about doing something to
address the other aspects of the dump (logs, stubs, etc.) that are in
limbo while full history chugs along.  All the other enwiki files are
now 3 months old and that is already enough to inconvenience some
people.

The simplest solution is just to kill the current dump job if you have
faith that a new architecture can be put in place in less than a year.

-Robert Rohde

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Brion Vibber
On 1/27/09 2:35 PM, Thomas Dalton wrote:
> The way I see it, what we need is to get a really powerful server

Nope, it's a software architecture issue. We'll restart it with the new 
arch when it's ready to go.

-- brion

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Thomas Dalton
> Whether we want to let the current process continue to try and finish
> or not, I would seriously suggest someone look into redumping the rest
> of the enwiki files (i.e. logs, current pages, etc.).  I am also among
> the people that care about having reasonably fresh dumps and it really
> is a problem that the other dumps (e.g. stubs-meta-history) are frozen
> while we wait to see if the full history dump can run to completion.

Even if we do let it finish, I'm not sure a dump of what Wikipedia was
like 13 months ago is much use... The way I see it, what we need is to
get a really powerful server to do the dump just once at a reasonable
speed; then we'll have a previous dump to build on, so future ones
would be more reasonable.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Robert Rohde
The problem, as I understand it (and Brion may come by to correct me)
is essentially that the current dump process is designed in a way that
can't be sustained given the size of enwiki.  It really needs to be
re-engineered, which means that developer time is needed to create a
new approach to dumping.

The main target for improvement is almost certainly parallelizing the
process so that it wouldn't be a single monolithic dump process, but
rather a lot of little processes working in parallel.  That would also
ensure that if a single process gets stuck and dies, the entire dump
doesn't need to start over.


By way of observation, dewiki's full history dumps in 26 hours
with 96% prefetched (i.e. loaded from previous dumps).  That suggests,
assuming the non-prefetched revisions dominate the runtime, that even
starting from scratch (prefetch = 0%) it should dump in ~25
days under the current process.  enwiki is perhaps 3-6 times larger
than dewiki depending on how you do the accounting, which implies
dumping the whole thing from scratch would take ~5 months if the
process scaled linearly.  Of course it doesn't scale linearly, and we
end up with a prediction for completion that is currently 10 months
away (which amounts to a 13 month total execution).  And of course, if
there is any serious error in the next ten months the entire process
could die with no result.
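
Spelled out, that back-of-the-envelope estimate is roughly:

  echo '26 / 0.04 / 24' | bc -l   # ~27 days for a dewiki dump with 0% prefetch
  echo '27 * 5.5 / 30'  | bc -l   # ~5 months if enwiki is ~5.5x dewiki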


Whether we want to let the current process continue to try and finish
or not, I would seriously suggest someone look into redumping the rest
of the enwiki files (i.e. logs, current pages, etc.).  I am also among
the people that care about having reasonably fresh dumps and it really
is a problem that the other dumps (e.g. stubs-meta-history) are frozen
while we wait to see if the full history dump can run to completion.

-Robert Rohde


On Tue, Jan 27, 2009 at 11:24 AM, Christian Storm  wrote:
>>> On 1/4/09 6:20 AM, yegg at alum.mit.edu wrote:
>>> The current enwiki database dump 
>>> (http://download.wikimedia.org/enwiki/20081008/
>>> ) has been crawling along since 10/15/2008.
>> The current dump system is not sustainable on very large wikis and
>> is being replaced. You'll hear about it when we have the new one in
>> place. :)
>> -- brion
>
> Following up on this thread:  
> http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040841.html
>
> Brion,
>
> Can you offer any general timeline estimates (weeks, months, 1/2
> year)?  Are there any alternatives to retrieving the article data
> beyond directly crawling
> the site?  I know this is verboten but we are in dire need of
> retrieving this data and don't know of any alternatives.  The current
> estimate of end of year is
> too long for us to wait.  Unfortunately, wikipedia is a favored source
> for students to plagiarize from which makes out of date content a real
> issue.
>
> Is there any way to help this process along?  We can donate disk
> drives, developer time, ...?  There is another possibility
> that we could offer but I would need to talk with someone at the
> wikimedia foundation offline.  Is there anyone I could
> contact?
>
> Thanks for any information and/or direction you can give.
>
> Christian
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Make upload headings changeable

2009-01-27 Thread Marcus Buck
Chad wrote:
> Should be done with a wiki's content language as of  r46372.
>
> -Chad
Thanks! That's already a big improvement, but why content language? As I
pointed out in response to your question, it needs to be user language
on Meta, Incubator, Wikispecies, Beta Wikiversity, old Wikisource, and
all the multilingual wikis of third-party users. It's not actually
necessary on non-multilingual wikis, but it does no harm either. So why
content language?
This could be solved with a setting in LocalSettings.php such as
"isMultilingual", but that's a separate matter, and as long as it does not
exist, we should use user language.

Marcus Buck

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Make upload headings changeable

2009-01-27 Thread Chad
On Mon, Jan 26, 2009 at 12:44 PM, Ilmari Karonen  wrote:

> Chad wrote:
> > I was going to provide a specific parameter for it. That entire key sucks
> > though anyway, I should probably ditch the md5()'d URL in favor of using
> > the actual name. Fwiw: I've got a patch working, but I'm not quite ready
> > to commit it yet. While we're at it, are we sure we want to use $wgLang
> and
> > not $wgContLang? Image description pages are "content", not a part of
> > the interface. That being said, I would think it would be best to fetch
> the
> > information using the wiki's content language.
>
> Well, if you actually visit the description page on Commons, you'll see
> the templates in your interface language -- that's kind of the _point_
> of the autotranslated templates.
>
> Then again, Commons is kind of a special case, since, being a
> multilingual project, it doesn't _have_ a real content language; in a
> technical sense its content language is English, but that's only because
> MediaWiki requires one language to be specified as a content language
> even if the actual content is multilingual.  So I can see arguments
> either way.
>
> What language is the "shareduploadwiki-desc" message shown in, anyway?
> Seems to be $wgLang, which would seem to suggest that the actual
> description should be shown in the interface language too, for consistency.
>
> --
> Ilmari Karonen
>

Should be done with a wiki's content language as of  r46372.

-Chad
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Bilal Abdul Kader
I have a decent server that is dedicated to a Wikipedia project that
depends on the fresh dumps. Can this be used in any way to speed up the
process of generating the dumps?

bilal


On Tue, Jan 27, 2009 at 2:24 PM, Christian Storm wrote:

> >> On 1/4/09 6:20 AM, yegg at alum.mit.edu wrote:
> >> The current enwiki database dump (
> http://download.wikimedia.org/enwiki/20081008/
> >> ) has been crawling along since 10/15/2008.
> > The current dump system is not sustainable on very large wikis and
> > is being replaced. You'll hear about it when we have the new one in
> > place. :)
> > -- brion
>
> Following up on this thread:
> http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040841.html
>
> Brion,
>
> Can you offer any general timeline estimates (weeks, months, 1/2
> year)?  Are there any alternatives to retrieving the article data
> beyond directly crawling
> the site?  I know this is verboten but we are in dire need of
> retrieving this data and don't know of any alternatives.  The current
> estimate of end of year is
> too long for us to wait.  Unfortunately, wikipedia is a favored source
> for students to plagiarize from which makes out of date content a real
> issue.
>
> Is there any way to help this process along?  We can donate disk
> drives, developer time, ...?  There is another possibility
> that we could offer but I would need to talk with someone at the
> wikimedia foundation offline.  Is there anyone I could
> contact?
>
> Thanks for any information and/or direction you can give.
>
> Christian
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Steve Summit
Jeff Ferland wrote:
> You'll need a quite impressive machine to host even just the current  
> revisions of the wiki. Expect to expend 10s to even hundreds of  
> gigabytes on the database alone for Wikipedia using only the current  
> versions.

No, no, no.  You're looking at it all wrong.  That's the sucker's
way of doing it.

If you're smart, you put up a simple page with a text box labeled
"Wikipedia search", and whenever someone types a query into
the box and submits it, you ship the query over to the Wikimedia
servers, and then slurp back the response, and display it back
to the original submitter.  That way only Wikimedia has to worry
about all those pesky gigabyte-level database hosting requirements,
while you get all the glory.

This appears to be what the questioner is asking about.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Christian Storm
>> On 1/4/09 6:20 AM, yegg at alum.mit.edu wrote:
>> The current enwiki database dump 
>> (http://download.wikimedia.org/enwiki/20081008/ 
>> ) has been crawling along since 10/15/2008.
> The current dump system is not sustainable on very large wikis and  
> is being replaced. You'll hear about it when we have the new one in  
> place. :)
> -- brion

Following up on this thread:  
http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040841.html

Brion,

Can you offer any general timeline estimates (weeks, months, 1/2  
year)?  Are there any alternatives to retrieving the article data  
beyond directly crawling
the site?  I know this is verboten but we are in dire need of  
retrieving this data and don't know of any alternatives.  The current  
estimate of end of year is
too long for us to wait.  Unfortunately, Wikipedia is a favored source
for students to plagiarize from, which makes out-of-date content a real
issue.

Is there any way to help this process along?  We can donate disk  
drives, developer time, ...?  There is another possibility
that we could offer but I would need to talk with someone at the  
wikimedia foundation offline.  Is there anyone I could
contact?

Thanks for any information and/or direction you can give.

Christian


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Jeff Ferland
I'll try to weigh in with a bit of useful information, but it probably  
won't help that much.

You'll need a quite impressive machine to host even just the current  
revisions of the wiki. Expect to expend 10s to even hundreds of  
gigabytes on the database alone for Wikipedia using only the current  
versions.

There are instructions for how to load the data, which can be found by
googling "wikipedia dump".

Several others have inquired for more information about your goal, and  
I'm going to echo that. The mechanics of hosting this kind of data  
(volume, really) are highly related to the associated task.

This data used for academic research would be handled differently than
for a live website, for example.

Nobody likes to be told they can't do something, or get a bunch of  
useless responses to a request for help. Very sincerely, though,  
unless you find enough information from the dump instruction pages to  
point you on the right direction and are able to ask more specific  
questions, you are in over your head. Your solution at that point  
would be to hire somebody.

Sent from my phone,
Jeff

On Jan 27, 2009, at 12:34 PM, Stephen Dunn  wrote:

> Hi Folks:
>
> I am a newbie so I apologize if I am asking basic questions. How  
> would I go about hosting wiktionary allowing search queries via the  
> web using opensearch. I am having trouble finding info on how to set
> this up. Any assistance is greatly appreciated.
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Tei
Maybe this is what this guy needs:

<!-- the archive stripped the original tags; a reconstructed version: -->
<form action="http://en.wiktionary.org/wiki/Special:Search">
  <input type="text" name="search">
  <input type="submit" value="Search Wiktionary">
</form>

test:
http://zerror.com/unorganized/wika/test.htm

It doesn't seem that Wiktionary blocks external searches right now (via
the Referer header), but they may change the policy in the future, or
change the parameters needed.

On Tue, Jan 27, 2009 at 7:18 PM, Stephen Dunn  wrote:
> refer to reference. com website and do a search
>
>> yes, website. so a web page has a search box that passes the input to 
>> wiktionary and results are provided on a results page. an example may be 
>> reference.com
>
> How would this differ from the search box on en.wiktionary.org? What
> are you actually trying to achieve?
>


-- 
--
ℱin del ℳensaje.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Stephen Dunn
Refer to the reference.com website and do a search.



- Original Message 
From: Thomas Dalton 
To: Wikimedia developers 
Sent: Tuesday, January 27, 2009 1:07:36 PM
Subject: Re: [Wikitech-l] hosting wikipedia

2009/1/27 Stephen Dunn :
> yes, website. so a web page has a search box that passes the input to 
> wiktionary and results are provided on a results page. an example may be 
> reference.com

How would this differ from the search box on en.wiktionary.org? What
are you actually trying to achieve?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Thomas Dalton
2009/1/27 Stephen Dunn :
> yes, website. so a web page has a search box that passes the input to 
> wiktionary and results are provided on a results page. an example may be 
> reference.com

How would this differ from the search box on en.wiktionary.org? What
are you actually trying to achieve?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Stephen Dunn
Yes, website. So a web page has a search box that passes the input to
Wiktionary and results are provided on a results page. An example may be
reference.com



- Original Message 
From: Thomas Dalton 
To: Wikimedia developers 
Sent: Tuesday, January 27, 2009 12:50:18 PM
Subject: Re: [Wikitech-l] hosting wikipedia

2009/1/27 Stephen Dunn :
> I am working on a project to host wiktionary on one web page and wikipedia on 
> another. So both, sorry..

You mean web *site*, surely? They are both far too big to fit on a
single page. I think you need to work out precisely what it is you're
trying to do before we can help you.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] MediaWiki Slow, what to look for?

2009-01-27 Thread Jason Schulz
To use filecache, you need to set $wgShowIPinHeader = false;

Also, see 
http://www.mediawiki.org/wiki/User:Aaron_Schulz/How_to_make_MediaWiki_fast
-Aaron

--
From: "Dawson" 
Sent: Tuesday, January 27, 2009 9:52 AM
To: "Wikimedia developers" 
Subject: Re: [Wikitech-l] MediaWiki Slow, what to look for?

> Modified config file as follows:
>
> $wgUseDatabaseMessage = false;
> $wgUseFileCache = true;
> $wgMainCacheType = "CACHE_ACCEL";
>
> I also installed xcache and eaccelerator. The improvement in speed is 
> huge.
>
> 2009/1/27 Aryeh Gregor
> 
>>
>
>> On Tue, Jan 27, 2009 at 5:31 AM, Dawson  wrote:
>> > Hello, I have a couple of mediawiki installations on two different 
>> > slices
>> at
>> > Slicehost, both of which run websites on the same slice with no speed
>> > problems, however, the mediawiki themselves run like dogs!
>> > http://wiki.medicalstudentblog.co.uk/ Any ideas what to look for or 
>> > ways
>> to
>> > optimise them? I still can't get over they need a 100mb ini_set in
>> settings
>> > to just load due to the messages or something.
>>
>> If you haven't already, you should set up an opcode cache like APC or
>> XCache, and a variable cache like APC or XCache (if using one
>> application server) or memcached (if using multiple application
>> servers).  Those are essential for decent performance.  If you want
>> really snappy views, at least for logged-out users, you should use
>> Squid too, although that's probably overkill for a small site.  It
>> also might be useful to install wikidiff2 and use that for diffs.
>>
>> Of course, none of this works if you don't have root access.  (Well,
>> maybe you could get memcached working with only shell . . .)  In that
>> case, I'm not sure what advice to give.
>>
>> MediaWiki is a big, slow package, though.  For large sites, it has
>> scalability features that are almost certainly unparalleled in any
>> other wiki software, but it's probably not optimized as much for quick
>> loading on small-scale, cheap hardware.  It's mainly meant for
>> Wikipedia.  If you want to try digging into what's taking so long, you
>> can try enabling profiling:
>>
>> http://www.mediawiki.org/wiki/Profiling#Profiling
>>
>> If you find something that helps a lot, it would be helpful to mention 
>> it.
>>
>> ___
>> Wikitech-l mailing list
>> Wikitech-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> 

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Thomas Dalton
2009/1/27 Stephen Dunn :
> I am working on a project to host wiktionary on one web page and wikipedia on 
> another. So both, sorry..

You mean web *site*, surely? They are both far too big to fit on a
single page. I think you need to work out precisely what it is you're
trying to do before we can help you.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Stephen Dunn
I am working on a project to host Wiktionary on one web page and Wikipedia on
another. So both, sorry.



- Original Message 
From: Thomas Dalton 
To: Wikimedia developers 
Sent: Tuesday, January 27, 2009 12:43:49 PM
Subject: Re: [Wikitech-l] hosting wikipedia

2009/1/27 Stephen Dunn :
> Hi Folks:
>
> I am a newbie so I apologize if I am asking basic questions. How would I go 
> about hosting wiktionary allowing search queries via the web using 
> opensearch. I am having trouble finding info on how to set this up. Any
> assistance is greatly appreciated.

Why do you want to host Wiktionary? It's already hosted at
en.wiktionary.org. And do you mean Wiktionary (as you said in the body
of your email) or Wikipedia (as you said in the subject line)? Or do
you actually mean your own wiki, unrelated to either of those?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hosting wikipedia

2009-01-27 Thread Thomas Dalton
2009/1/27 Stephen Dunn :
> Hi Folks:
>
> I am a newbie so I apologize if I am asking basic questions. How would I go 
> about hosting wiktionary allowing search queries via the web using 
> opensearch. I am having trouble finding info on how to set this up. Any
> assistance is greatly appreciated.

Why do you want to host Wiktionary? It's already hosted at
en.wiktionary.org. And do you mean Wiktionary (as you said in the body
of your email) or Wikipedia (as you said in the subject line)? Or do
you actually mean your own wiki, unrelated to either of those?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] hosting wikipedia

2009-01-27 Thread Stephen Dunn
Hi Folks:

I am a newbie so I apologize if I am asking basic questions. How would I go
about hosting Wiktionary allowing search queries via the web using OpenSearch?
I am having trouble finding info on how to set this up. Any assistance is
greatly appreciated.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] MediaWiki Slow, what to look for?

2009-01-27 Thread Dawson
Modified config file as follows:

$wgUseDatabaseMessage = false;
$wgUseFileCache = true;
$wgMainCacheType = "CACHE_ACCEL";

I also installed xcache and eaccelerator. The improvement in speed is huge.

2009/1/27 Aryeh Gregor

>

> On Tue, Jan 27, 2009 at 5:31 AM, Dawson  wrote:
> > Hello, I have a couple of mediawiki installations on two different slices
> at
> > Slicehost, both of which run websites on the same slice with no speed
> > problems, however, the mediawiki themselves run like dogs!
> > http://wiki.medicalstudentblog.co.uk/ Any ideas what to look for or ways
> to
> > optimise them? I still can't get over they need a 100mb ini_set in
> settings
> > to just load due to the messages or something.
>
> If you haven't already, you should set up an opcode cache like APC or
> XCache, and a variable cache like APC or XCache (if using one
> application server) or memcached (if using multiple application
> servers).  Those are essential for decent performance.  If you want
> really snappy views, at least for logged-out users, you should use
> Squid too, although that's probably overkill for a small site.  It
> also might be useful to install wikidiff2 and use that for diffs.
>
> Of course, none of this works if you don't have root access.  (Well,
> maybe you could get memcached working with only shell . . .)  In that
> case, I'm not sure what advice to give.
>
> MediaWiki is a big, slow package, though.  For large sites, it has
> scalability features that are almost certainly unparalleled in any
> other wiki software, but it's probably not optimized as much for quick
> loading on small-scale, cheap hardware.  It's mainly meant for
> Wikipedia.  If you want to try digging into what's taking so long, you
> can try enabling profiling:
>
> http://www.mediawiki.org/wiki/Profiling#Profiling
>
> If you find something that helps a lot, it would be helpful to mention it.
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] MediaWiki Slow, what to look for?

2009-01-27 Thread Aryeh Gregor
On Tue, Jan 27, 2009 at 5:31 AM, Dawson  wrote:
> Hello, I have a couple of mediawiki installations on two different slices at
> Slicehost, both of which run websites on the same slice with no speed
> problems, however, the mediawiki themselves run like dogs!
> http://wiki.medicalstudentblog.co.uk/ Any ideas what to look for or ways to
> optimise them? I still can't get over they need a 100mb ini_set in settings
> to just load due to the messages or something.

If you haven't already, you should set up an opcode cache like APC or
XCache, and a variable cache like APC or XCache (if using one
application server) or memcached (if using multiple application
servers).  Those are essential for decent performance.  If you want
really snappy views, at least for logged-out users, you should use
Squid too, although that's probably overkill for a small site.  It
also might be useful to install wikidiff2 and use that for diffs.
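
On a typical Debian/Ubuntu box with root, that whole stack boils down to
something like the following (a sketch only; package names are from memory
and vary by distro and PHP version):

  # opcode + variable cache (APC), plus memcached for multiple app servers
  apt-get install php-apc memcached php5-memcache
  # then point MediaWiki at the cache in LocalSettings.php, e.g.
  #   $wgMainCacheType = CACHE_ACCEL;   # or CACHE_MEMCACHED
  /etc/init.d/apache2 restart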

Of course, none of this works if you don't have root access.  (Well,
maybe you could get memcached working with only shell . . .)  In that
case, I'm not sure what advice to give.

MediaWiki is a big, slow package, though.  For large sites, it has
scalability features that are almost certainly unparalleled in any
other wiki software, but it's probably not optimized as much for quick
loading on small-scale, cheap hardware.  It's mainly meant for
Wikipedia.  If you want to try digging into what's taking so long, you
can try enabling profiling:

http://www.mediawiki.org/wiki/Profiling#Profiling

If you find something that helps a lot, it would be helpful to mention it.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] MediaWiki Slow, what to look for?

2009-01-27 Thread Dawson
Hello, I have a couple of MediaWiki installations on two different slices at
Slicehost, both of which run websites on the same slice with no speed
problems; however, the MediaWiki installations themselves run like dogs!
http://wiki.medicalstudentblog.co.uk/ Any ideas what to look for or ways to
optimise them? I still can't get over that they need a 100 MB ini_set in
settings just to load, due to the messages or something.

Thank you, Dawson
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l