On Sat, Oct 25, 2014 at 7:16 AM, MZMcBride <z...@mzmcbride.com> wrote:

> Labs is a playground and Galleries, Libraries, Archives, and Museums are
> serious enough to warrant a proper investment of resources, in my view.
> Magnus and many others develop magnificent tools, but my sense is that
> they're largely proofs of concept, not final implementations.

Far from being treated as mere proofs of concept, Magnus' GLAM tools
[1] have been used to measure and report success in the context of
project grant and annual plan proposals and reports, ongoing project
performance measurements, blog posts and press releases, etc. Daniel
Mietchen has, to my knowledge, been the main person doing any
systematic auditing or verification of the reports generated by these
tools, and results can be found in his tool testing reports, the last
one of which is unfortunately more than a year old. [2]

Integration with MediaWiki should IMO not be viewed as a runway that
all useful developments must be pushed towards. Rather, we should seek
to establish clearer criteria by which to decide that functionality
benefits from this level of integration, to such an extent that it
justifies the cost. Functionality that is not integrated in this
manner should, then, not be dismissed as "proofs of concept" but
rather judged on its own merits.

GWToolset [3] is a good example. It was built as a MediaWiki extension
to manage GLAM batch uploads, but we should not regard this decision
as sacrosanct, or the only correct way to develop this kind of
functionality. The functionality it provides is of highly
specialized interest: the number of potential users to date is 47
according to [4], most of whom have not performed significant uploads
yet. Its user interface is likewise specialized, and special
permissions and detailed instructions are required to use it. At the
same time, it has been used to upload 322,911 files overall, an
amazing number even without going into the quality and value of the
individual files.

So, why does it need to be a MediaWiki extension at all? When
development began in 2012, OAuth support in MediaWiki did not exist,
so it was impossible for an external tool (then running on toolserver)
to manage an upload on the user's behalf without asking for the user's
password, which would have been in violation of policy. But today, we
have other options. It's possible that storage requirements or other
specific desired integration points would make it impossible to create
this as a Tool Labs tool -- but if we created the same tool today, we
should carefully consider that.
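To make the OAuth point concrete, here is a minimal sketch of how an
external tool could assemble an upload request against the MediaWiki
action API on a user's behalf. This is not GWToolset's implementation:
the function, file name, and token values are placeholders, and a real
tool would additionally sign each request with the OAuth consumer and
access tokens (e.g. via a library such as requests-oauthlib) and send
the file as a multipart attachment.

```python
# Illustrative sketch (not GWToolset's code): an external tool
# prepares a MediaWiki action=upload request on a user's behalf.
# With OAuth, the tool holds an access token the user authorized,
# instead of the user's password. All names are placeholders.

def build_upload_request(filename, csrf_token, comment="Batch upload"):
    """Return the POST parameters for a MediaWiki action=upload call.

    In a real tool, the HTTP request would also be signed with the
    OAuth consumer/access tokens, and the file contents attached as
    a multipart upload.
    """
    return {
        "action": "upload",
        "format": "json",
        "filename": filename,
        "comment": comment,
        "token": csrf_token,  # CSRF token fetched via meta=tokens
    }

params = build_upload_request("Example_scan.tif", "abc123+\\")
```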

Indeed, highly specialized tools for the cultural and education sector
_are_ being developed and hosted inside Tool Labs or externally.
Looking at the current OAuth consumer requests [5], there are
submissions for a metadata editor developed by librarians at the
University of Miami Libraries in Coral Gables, Florida, and an
assignment creation wizard developed by the Wiki Education Foundation.
There's nothing "improper" about that, as Marc-André pointed out.

As noted before, for tools like the ones used for GLAM reporting to
get better, WMF has its role to play in providing more datasets and
improved infrastructure. But there's nothing inherent in the
development of those tools that forces them to live in production
land, or that requires large development teams to move them forward.
Auditing of numbers, improved scheduling/queuing of database requests,
optimization of API calls and DB queries -- all of this can be done by
individual contributors, making it suitable work even for chapters
with limited experience managing technical projects.

On the analytics side, we're well aware that many users have asked for
better access to the pageview data, either through MariaDB, or through
a dedicated API. We have said for some time that our focus is on
modernizing the infrastructure for log analysis and collection,
because the numbers collected by the old webstatscollector code were
incomplete, and the infrastructure was subject to frequent packet
loss. In addition, our ability to meet additional requirements on the
basis of that simple pageview aggregation code was inherently limited.

To this end, we have put into production use infrastructure to collect
and analyze site traffic using Kafka/Hadoop/Hive. At our scale, this
has been a tremendously complex infrastructure project which has
included custom development such as varnishkafka [6]. While it's taken
longer than we've wanted, this new infrastructure is being used to
generate a public page count dataset as of this month, including
article-level mobile traffic for the first time [7]. Using
Hadoop/Hive, we'll be able to compile many more specialized reports,
and this is only just beginning.
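For community developers, the practical unit of this dataset is the
hourly pagecount dump file. The classic webstatscollector output was
four whitespace-separated fields per line -- project, page title,
view count, bytes transferred -- and the sketch below assumes the new
dataset follows the same basic shape; check the dataset documentation
[7] for the exact layout.

```python
# Hypothetical sketch: parsing one line of an hourly pagecount dump.
# Assumes the classic four-field webstatscollector-style layout:
#   project page_title count bytes
# Consult the pagecounts-all-sites docs [7] for the real format.

def parse_pagecount_line(line):
    project, title, count, size = line.split(" ")
    return {
        "project": project,
        "title": title,
        "views": int(count),
        "bytes": int(size),
    }

# "en.m" would denote English Wikipedia mobile traffic in this scheme.
rec = parse_pagecount_line("en.m Main_Page 4217 0")
```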

Giving community developers better access to this data needs to be
prioritized relative to other ongoing analytics work, including but
not limited to:

- Continued development and maintenance of the above infrastructure foundations;

- Development of "Vital Signs": public reports on editor activity,
content contribution, sign-ups and other metrics. This tool gives us
more timely access to key measures than Wikistats [9] (or the
reportcard [10], which to date still consumes Wikistats data). Rather
than having to wait 4-6 weeks to know what's happening with regard to
editor numbers, we can see continuous updates on a day-to-day basis.

- Development of Wikimetrics, which analyzes the editing activity of a
group of editors, and which is essential for measuring all movement
work that aims to increase activity by a specific group (e.g.
edit-a-thons), and is a key tool used for grants evaluation (was a
funded program worth the $$?). A lot of thought has gone into the
development of standardized global metrics [12] for program work, much
of it using this technology and dependent on its continued development.

- Measurement (instrumentation) of site actions and
development/maintenance of associated infrastructure. As an example,
in-depth data collection for features like Media Viewer (see the
dashboards at [13]) is only possible because of the EventLogging
extension developed by Ori Livneh, and the increasing use of this
technology by WMF developers. EventLogging requires significant
management, maintenance and teaching effort from the analytics team.

Lila is requesting visibility into all primary funnels on Wikimedia
sites (e.g. sign-ups, edits/saves through wikitext, edits/saves
through VisualEditor, etc.), and this will require lots of sustained
effort from lots of people to get done. What it will give us is a
better sense of where people succeed and fail to complete an action --
by way of example, see the initial UploadWizard funnel analysis here:

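Conceptually, this kind of funnel analysis reduces to measuring
drop-off between consecutive steps. A toy illustration (the step
names and counts below are invented, not real UploadWizard data):

```python
# Toy funnel computation: given ordered event counts per step,
# compute the share of users surviving each transition.
# Numbers are invented for illustration only.

def funnel_rates(steps):
    """steps: ordered list of (step_name, user_count) pairs."""
    rates = []
    for (_, prev), (name, cur) in zip(steps, steps[1:]):
        rates.append((name, cur / prev if prev else 0.0))
    return rates

steps = [("opened_wizard", 1000), ("selected_files", 640),
         ("added_metadata", 410), ("completed_upload", 350)]
rates = funnel_rates(steps)
# The weakest transition is where instrumentation pays off most.
```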
- Improved software and infrastructure support for A/B testing,
possibly including adoption of existing open source tooling such as
Facebook's PlanOut library/interpreter [14].

- Improved readership metrics, possibly including a privacy-sensitive
approach to estimating Unique Visitors, and better geographic
breakdowns for readers/editors.
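On the A/B testing point above, the core mechanism behind assignment
tooling like PlanOut is deterministic hashing of (experiment, user)
into a variant, so no server-side assignment state is needed. A
minimal sketch of that idea -- this is not PlanOut's actual API, and
the experiment/user names are placeholders:

```python
# Sketch of deterministic experiment bucketing, the idea underlying
# A/B assignment tools such as PlanOut (not PlanOut's real API).
# Hashing (experiment, user) means a given user always lands in the
# same variant, with no assignment database required.
import hashlib

def assign_variant(experiment, user_id, variants):
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha1(key).hexdigest(), 16) % len(variants)
    return variants[bucket]

v = assign_variant("upload-wizard-cta", "user-42", ["control", "treatment"])
```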

These are all complex problems, most of which are dependent on the
small analytics team, and feedback on projects and priorities is very
much welcome on the analytics mailing list:

With regard to better embedding of graphs in wikis specifically, Yuri
Astrakhan has led the development of a new extension, inspired by work
by Dan Andreescu, to visualize data directly in wikis. This extension
has already been deployed to Meta and MediaWiki.org and can be used
for dynamic graphs where it's acceptable not to have a static image
fallback, for example in grant reports. See:

I agree this is the kind of functionality that should make its way
into Wikipedia. Again, we need to weigh throwing a full team behind
it against the relative priority of other work. In the meantime,
Yuri and others will continue to push it along and may even be able to
get it all the way there in due time. The main blockers, from what I
can tell, are generation of static fallback images for users without
JavaScript, and a better way to manage the data sources.

In general, the point of my original message was this: All
organizations that seek to improve Wikipedia and the other Wikimedia
projects ultimately depend on technology to do so; to view WMF as the
sole "tech provider" does not scale. Larger, well-funded chapters can
take on big, hairy challenges like Wikidata; smaller, less-funded orgs
are better positioned to work on specialized technical support for
programmatic work.

I would caution against requesting WMF to work on highly specialized
solutions for highly specialized problems. If such solutions are
needed, I would caution against building them into MediaWiki unless
they can be generalized to benefit a larger number of users, at which
point it's appropriate to seek partnership with WMF, or to ask WMF for
the relative priority of such work. But often, it's perfectly fine
(and much faster) to build such tools and reports independently, and
to ask WMF for help in providing APIs/services/data/infrastructure to
get it done.


[1] http://tools.wmflabs.org/glamtools/
[3] https://www.mediawiki.org/wiki/Extension:GWToolset
[6] https://github.com/wikimedia/varnishkafka
[7] https://wikitech.wikimedia.org/wiki/Analytics/Pagecounts-all-sites
[8] https://metrics.wmflabs.org/static/public/dash/
[9] http://stats.wikimedia.org/
[10] http://reportcard.wmflabs.org/
[11] https://metrics.wmflabs.org/
[13] http://multimedia-metrics.wmflabs.org/dashboards/mmv
[14] https://github.com/facebook/planout
Erik Möller
VP of Product & Strategy, Wikimedia Foundation

Wikimedia-l mailing list, guidelines at: 
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, 
