Re: [PHP-DEV] [RFC] PHP.net analytics

2024-11-13 Thread John Coggeshall
> As the person that will end up having to maintain this, I wasn't aware
> that https://github.com/matomo-org/matomo-php-tracker is pretty much the
> same as the JS tracker. And as it does HTTPS requests from the PHP
> application to Matomo, instead of a JS tracker, this seems like a
> better solution, which is also more customisable.
>

I'm personally a big fan of Matomo as well.
Coogle


Re: [PHP-DEV] [RFC] PHP.net analytics

2024-11-13 Thread Gina P. Banyard
On Wednesday, 13 November 2024 at 19:06, Jonathan Vollebregt wrote:

> On 11/11/24 8:11 PM, Larry Garfield wrote:
> 
> > Metrics that go into a black box with a 3rd party are bad. That's not what 
> > is being proposed.
> 
> 
> From my perspective it's still minified (obfuscated) code going into a
> black box. Just because you own the box doesn't make it any more
> transparent to me.

So what would be transparent? That we publish all the data we collect on a 
minute-by-minute basis?
Minified is not obfuscated, so this argument is complete nonsense and not 
admissible.

Best regards,

Gina P. Banyard


Re: [PHP-DEV] [RFC] PHP.net analytics

2024-11-13 Thread Larry Garfield
On Wed, Nov 13, 2024, at 1:06 PM, Jonathan Vollebregt wrote:
> On 11/11/24 8:11 PM, Larry Garfield wrote:
>> Metrics that go into a black box with a 3rd party are bad.  That's not what 
>> is being proposed.
>
>  From my perspective it's still minified (obfuscated) code going into a 
> black box. Just because you own the box doesn't make it any more 
> transparent to me.
>
> This is a tradeoff between users privacy and useful metrics. You can get 
> more than half of the way there without any client-side code, with only 
> the information users were already sending you.
>
> If server side tracking isn't an option, and minimal (As opposed to 
> minified) client side tracking isn't an option either, then it sounds 
> like you already made up your mind.

I don't understand why the JS being minified (which basically everything does 
for performance) is an issue.  In concept, there's no reason we couldn't link 
from the Privacy Policy page or footer or something to the Matomo website, or 
even deep link to the code file, though that seems excessive.  Just identifying 
it and letting people go look up the code themselves if they want should be 
sufficient for 99% of the people who would even notice or care, which is 
already less than 1% of visitors.

And data going into a black box is the same no matter where it's collected. 
It's a GPL analytics server, but of course we're not going to release raw data, 
so that's no different no matter where the collection happens.

--Larry Garfield


Re: [PHP-DEV] [RFC] PHP.net analytics

2024-11-13 Thread Jonathan Vollebregt

On 11/11/24 8:11 PM, Larry Garfield wrote:

> Metrics that go into a black box with a 3rd party are bad.  That's not what is 
> being proposed.


From my perspective it's still minified (obfuscated) code going into a 
black box. Just because you own the box doesn't make it any more 
transparent to me.


This is a tradeoff between users privacy and useful metrics. You can get 
more than half of the way there without any client-side code, with only 
the information users were already sending you.


If server side tracking isn't an option, and minimal (As opposed to 
minified) client side tracking isn't an option either, then it sounds 
like you already made up your mind.


Re: [PHP-DEV] [RFC] PHP.net analytics

2024-11-11 Thread Larry Garfield
On Tue, Nov 5, 2024, at 3:46 PM, Jonathan Vollebregt wrote:

> For the first there's a user agent (Again, matomo-php-tracker) as well 
> as media queries for transparent tracking with  or CSS

The browser user-agent is widely recognized as basically useless in the vast 
majority of cases.  Most browsers load so much crap in there to try and emulate 
each other that it rarely tells you anything useful, in addition to being 
trivially spoofable.

> Transparency is a big deal. Server side analytics are ok because PHP 
> devs know what goes into an HTTP request. (And it's fairly limited in 
> scope by definition) We don't know what goes into a request sent from a 
> black box blob of minified JS.

Matomo was selected precisely for this reason.  It's GPLv3 licensed.  100% of 
the code is available to review and audit.  Here's the unminified JS code:

https://github.com/matomo-org/matomo/blob/5.x-dev/js/piwik.js

It cannot get more transparent than that.  Using the server-side library would 
be no more transparent.  Using log ingestion would be no more transparent.  
Potentially it would be less.

> If your JS just consisted of `if(wasm) fetch()` I would be fine with 
> that, but it's actually a 66kb minified JS file.
>
> Perhaps you could just start with server side tracking and see how it 
> goes? I'd be much happier with client side tracking in future if it's 
> voted on one metric at a time rather than a big opaque file.

Just to nip this part in the bud: an RFC for every config change on the servers 
is a doomed idea that should never even be considered.  Infra-RFCs are very rare, 
and they should be.  Infra should by and large be handled by dedicated people, 
not by direct democracy.  Eg, the move to GitHub issues was an RFC 
(https://wiki.php.net/rfc/github_issues), but tweaks to, say, issue templates 
or permissions or other configuration have not gone through an RFC, nor should 
they.

We have looked into Matomo's server library.  It's potentially useful, but it 
doesn't give the same data that a client-side tracker would.  They'd give 
overlapping but distinct information, so it's potentially useful to have both.  
That said, it would also require integrating into the server-side PHP code for 
the website, and triggering IO (database calls at least) in the web process.  
That can only slow down the page loading process.  That would in turn mean we 
should really make better use of HTTP caching (which we currently do not use at 
all for HTML pages), which would in turn make server-side metrics even less 
reliable.  (I'm of the mind that we should be aggressively caching pages 
anyway, especially as pages are virtually static in practice, but that's a 
separate matter.)
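(For concreteness, the kind of HTTP caching meant here is mostly a matter of
emitting the right response headers; a minimal sketch, where the max-age value
and function names are purely illustrative, not a proposal for php.net's actual
policy:)

```php
<?php
// Build Cache-Control / ETag headers for a nearly-static docs page.
// The max-age value here is illustrative only.
function cache_headers(string $body, int $maxAge = 3600): array
{
    $etag = '"' . sha1($body) . '"';
    return [
        'Cache-Control' => 'public, max-age=' . $maxAge,
        'ETag'          => $etag,
    ];
}

// Compare against the conditional request header so the web process can
// short-circuit with 304 Not Modified while the cached copy is still valid.
function is_not_modified(array $headers, ?string $ifNoneMatch): bool
{
    return $ifNoneMatch !== null && $ifNoneMatch === $headers['ETag'];
}
```

Note that caching like this is exactly why server-side metrics get less
reliable: any server-side counter only fires on cache misses.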

Really, the order of ease for the various collection mechanisms is:

1. Client-side JS
2. Server component
3. Log ingestion

So saying "start with the hard one, then maybe do the easy one" is frankly 
backwards, and just creates more work and problems.

To those that seem uncomfortable about using a JS-based metrics tool, I need to 
ask... why?  I have yet to see anyone put forward a practical reason why 
JS-based metrics are bad.

Metrics that go into a black box with a 3rd party are bad.  That's not what is 
being proposed.
Metrics that collect PII are bad.  That's not what is being proposed.
Metrics that collect unnecessary telemetry for advertising, etc. are bad.  
That's not what is being proposed.
Closed source/non-free code is bad.  That's not what is being proposed.

I also run an ad blocker myself, and I share the general concern about the 
enshittification of the Internet through advertising stalkers.  That's not what 
is being proposed.  The tools being proposed are precisely to avoid that.  (It 
would be easier still to just toss Google Analytics on the site and be done 
with it, but we're very deliberately not doing that.)

So what practical, non-knee-jerk reason is there why the easiest to implement, 
easiest-to-onboard-people, least-unreliable-data option is not the best 
solution?

Serious question, because I cannot think of one.

--Larry Garfield


Re: [PHP-DEV] [RFC] PHP.net analytics

2024-11-07 Thread Derick Rethans
On Sat, 2 Nov 2024, Jonathan Vollebregt wrote:

> On 11/2/24 12:10 AM, Bob Weinand wrote:
> > What percentage of users get to the docs through direct links vs the 
> > home page
> > 
> > That's something you can generally infer from server logs - was the 
> > home page accessed from that IP right before another page was 
> > opened? It's not as accurate, but for a general understanding of 
> > orders of magnitude it's good enough.
> 
> Even better: If we're talking about internal navigation you can check 
> the referrer header and know for sure, since the docs don't add 
> rel=noreferrer on links or anything.
> 
> You shouldn't need server logs _or_ client side JS. A lot of this 
> tracking stuff could be done by just putting down a proxy or shim that 
> checks request headers. It looks like matomo offers exactly this via 
> matomo/matomo-php-tracker.
> 
> I second bob's general sentiment: There's no need for client side 
> tracking.

As the person that will end up having to maintain this, I wasn't aware 
that https://github.com/matomo-org/matomo-php-tracker is pretty much the 
same as the JS tracker. And as it does HTTPS requests from the PHP 
application to Matomo, instead of a JS tracker, this seems like a
better solution, which is also more customisable.
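Roughly, the server-side tracker boils down to the PHP application firing an
HTTPS request at Matomo's tracking endpoint itself. A hedged sketch of what one
page-view hit looks like (the analytics host and site id are placeholders; the
parameter names come from Matomo's Tracking HTTP API):

```php
<?php
// Build the query string for one page-view hit against Matomo's
// Tracking HTTP API (the matomo.php endpoint). Sending it is then a
// plain HTTPS request from the PHP app -- no client-side JS involved.
function matomo_hit_url(string $matomoBase, int $idSite, string $pageUrl, string $title): string
{
    $params = [
        'idsite'      => $idSite,   // which site in Matomo
        'rec'         => 1,         // required: actually record the hit
        'apiv'        => 1,         // tracking API version
        'url'         => $pageUrl,  // the page being viewed
        'action_name' => $title,    // page title shown in reports
        'rand'        => mt_rand(), // cache buster
    ];
    return rtrim($matomoBase, '/') . '/matomo.php?' . http_build_query($params);
}

// e.g. fired with curl or file_get_contents() during request shutdown:
$hit = matomo_hit_url('https://analytics.example.org', 1,
                      'https://www.php.net/manual/en/function.strpos.php', 'strpos');
```

The matomo-php-tracker library wraps exactly this kind of request in a
friendlier API.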

cheers,
Derick

-- 
https://derickrethans.nl | https://xdebug.org | https://dram.io

Author of Xdebug. Like it? Consider supporting me: https://xdebug.org/support

mastodon: @derickr@phpc.social @xdebug@phpc.social


Re: [PHP-DEV] [RFC] PHP.net analytics

2024-11-07 Thread Derick Rethans
On Sat, 2 Nov 2024, Bob Weinand wrote:

> On 1.11.2024 22:41:29, Larry Garfield wrote:
> > In a similar vein to approving the use of software, Roman Pronskiy 
> > asked for my help putting together an RFC on collecting analytics 
> > for PHP.net.
> > 
> > https://wiki.php.net/rfc/phpnet-analytics

[snip]

> Let's see what the RFC names:
> 
>     Time-on-page
>     Whether they read the whole page or just a part
>     Whether they even saw comments
> 
> Yes, these need a client side tracker. But I doubt the usefulness of 
> the signal. You don't know who reads that. Is it someone who is 
> already familiar with PHP and searches a detail? He'll quickly just 
> find one part. Is it someone who is new to PHP and tries to understand 
> PHP? He may well read the whole page. But you don't know that.

The Matomo JS tracker also does not collect all of this data; it only 
improves the "time spent on each page" measurement:
https://developer.matomo.org/guides/tracking-javascript-guide#accurately-measure-the-time-spent-on-each-page

I make no comments about how useful this is.

>     How much are users using the search function? Is it finding what they
> want, or is it just a crutch?
> 
> How much is probably greppable from the server logs as well.

The PHP client tracker that was mentioned in this thread has a specific 
feature for this:
https://github.com/matomo-org/matomo-php-tracker/blob/master/MatomoTracker.php#L910

The JS version can't track our internal search systems, and I am not 
sure it uses the HTTP referrer to see whether the entry came through a 
Google (or other third-party) search.
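Concretely, site-search hits are just ordinary tracking requests with a few
extra parameters (`search`, plus the optional `search_cat` and `search_count`
in Matomo's Tracking HTTP API); a sketch of what that linked helper boils down
to:

```php
<?php
// Extra query parameters that turn a plain hit into a "site search" hit
// in Matomo's Tracking HTTP API; doTrackSiteSearch() wraps these.
function matomo_search_params(string $keyword, ?string $category = null, ?int $resultCount = null): array
{
    $params = ['search' => $keyword];
    if ($category !== null) {
        $params['search_cat'] = $category;      // optional search category
    }
    if ($resultCount !== null) {
        $params['search_count'] = $resultCount; // 0 shows up as "no results"
    }
    return $params;
}
```

These get merged into the same idsite/rec/url parameter set as any other hit.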

> And yeah, server logs are more locked down. But that's something you 
> can fix. I hope that the raw analytics data is just as locked down as 
> the server logs...

The raw server logs will definitely not be opened up, and neither will 
any other raw data.

cheers,
Derick

-- 
https://derickrethans.nl | https://xdebug.org | https://dram.io

Author of Xdebug. Like it? Consider supporting me: https://xdebug.org/support

mastodon: @derickr@phpc.social @xdebug@phpc.social

Re: [PHP-DEV] [RFC] PHP.net analytics

2024-11-05 Thread Jonathan Vollebregt

On 11/5/24 6:29 PM, Larry Garfield wrote:

>> Overall I feel like the signal we can get from using a JS tracker
>> specifically is comparatively low to the point it's not actually worth it.


> Some more things a client-side tracker could do that logs cannot:
>
> * How many people are accessing the site from a desktop vs mobile?
> * What speed connection do people have?
> * How many people are using the in-browser Wasm code runner that is currently 
> being worked on?  cf: https://github.com/php/web-php/pull/1097


For the first there's a user agent (Again, matomo-php-tracker) as well 
as media queries for transparent tracking with  or CSS



> Even if someone wanted to block it, meh.  We'd still be getting enough signal 
> to make informed decisions.


Firefox famously keeps killing features thinking no-one uses them 
because the people who use them are savvy enough to turn off tracking. 
Even if you do use client side tracking that's a good argument to have 
server side tracking anyway as a fallback.


Transparency is a big deal. Server side analytics are ok because PHP 
devs know what goes into an HTTP request. (And it's fairly limited in 
scope by definition) We don't know what goes into a request sent from a 
black box blob of minified JS.


If your JS just consisted of `if(wasm) fetch()` I would be fine with 
that, but it's actually a 66kb minified JS file.


Perhaps you could just start with server side tracking and see how it 
goes? I'd be much happier with client side tracking in future if it's 
voted on one metric at a time rather than a big opaque file.


Re: [PHP-DEV] [RFC] PHP.net analytics

2024-11-05 Thread Larry Garfield
On Fri, Nov 1, 2024, at 6:10 PM, Bob Weinand wrote:
> On 1.11.2024 22:41:29, Larry Garfield wrote:
>> In a similar vein to approving the use of software, Roman Pronskiy asked for 
>> my help putting together an RFC on collecting analytics for PHP.net.
>>
>> https://wiki.php.net/rfc/phpnet-analytics
>>
>> Of particular note:
>>
>> * This is self-hosted, first-party only.  No third parties get data, so no 
>> third parties can do evil things with it.
>> * There is no plan to collect any PII.
>> * The goal is to figure out how to most efficiently spend Foundation money 
>> improving php.net, something that is sorely needed.
>>
>> Ideally we'd have this in place by the 8.4 release or shortly thereafter, 
>> though I realize that's a little tight on the timeline.
>
> Hey Larry,
>
> I have a couple concerns and questions:
>
> Is there a way to track analytics with only transient data? As in, data 
> actually stored is always already anonymized enough that it would be 
> unproblematic to share it with everyone?
> Or possibly, is there a retention period for the raw data after which 
> only anonymized data remains?

The plan is to configure Matomo to not collect anything non-anonymous to begin 
with, to the extent possible.  We're absolutely not talking about user-stalking 
like ad companies do, or anything even remotely close to that.

I'm not convinced that publishing raw, even anonymized data, is valuable or 
responsible.  I don't know of any other sites off hand that publish their raw 
analytics, and I don't know what purpose that would serve other than just a 
principled "radical transparency" stance, which I generally don't agree with.

However, having an automated aggregate dashboard similar to 
https://analytics.bookstackapp.com/bookstackapp.com (made by a different tool, 
but same idea) that we could make public is the goal, but we don't want to do 
that until it's been running a while and we're sure that nothing personally 
identifiable could leak through that way.

> Do you actually have a plan what to use that data for? The RFC mostly 
> talks about "high traffic". But does that mean anything? I do look at a 
> documentation page, because I need to look something specific up (what 
> was the order of arguments of strpos again?). I may only look shortly at 
> it. Maybe even often. But it has absolutely zero signal on whether the 
> documentation page is good enough. In that case I don't look at the 
> comments either. Comments are something you rarely look at, mostly the 
> first time you want to even use a function.

Right now, the key problem is that there's a lot of "we don't know what we 
don't know."  We want to improve the site and docs, the Foundation wants to 
spend money on doing so, but other than "fill in the empty pages" we have no 
definition of "improve" to work from.  The intent is that better data will give 
us a better sense of what "improve" even means.  

It would also be useful for marketing campaigns, even on-site.  Eg, if we spend 
the time to write a "How to convince your boss to use PHP" page... how useful 
is it?  From logs, all we could get is page count.  That's it.  Or the 
PHP-new-release landing page that we've put up for the last several releases.  
Do people actually get value of that?  Do they bother to scroll down through 
each section or do they just look at the first one or two and leave, meaning 
the time we spent on any other items is wasted?  Right now, we have no idea if 
the time spent on those is even useful.  

Another example off the top of my head: Right now, the enum documentation is 
spread across a dozen sub-pages.  I don't know why I did that exactly in the 
first place rather than one huge page, other than "huge pages bad."  But are 
they bad?  Would it be better to combine enums back into fewer pages, or to 
split the visibility long-page up into smaller ones?  I have no idea.  We need 
data to answer that.

It's also absolutely true that analytics are not the end of data collection.  
User surveys, usability tests, etc. are also highly valuable, and can get you a 
different kind of data.  We should likely do those at some point, but that 
doesn't make automated analytics not useful.


Another concern with just using raw logs is that it would be more work to 
set up, and have more moving parts to break.  Let's be honest, PHP has an 
absolutely terrible track record when it comes to keeping our moving parts 
working, and the Infra Team right now is tiny.  The bus factor there is a 
concern.  Using a client-side tracker is the more-supported and 
fewer-custom-scripts approach, which makes it easier for someone new to pick it 
up when needed.

Logs also will fold anyone behind a NAT together into a single IP, and thus 
"user."  IP address is in general a pretty poor way of uniquely identifying 
people with the number of translation layers on the Internet these days.

> Overall I feel like the signal we can get from using a JS tracker 
> specifically is comparatively low to the point it's not actually worth it.

Re: [PHP-DEV] [RFC] PHP.net analytics

2024-11-02 Thread Rob Landers
On Sat, Nov 2, 2024, at 00:54, Jonathan Vollebregt wrote:
> On 11/2/24 12:10 AM, Bob Weinand wrote:
> >  What percentage of users get to the docs through direct links vs 
> > the home page
> > 
> > That's something you can generally infer from server logs - was the home 
> > page accessed from that IP right before another page was opened? It's 
> > not as accurate, but for a general understanding of orders of magnitude 
> > it's good enough.
> 
> Even better: If we're talking about internal navigation you can check 
> the referrer header and know for sure, since the docs don't add 
> rel=noreferrer on links or anything.
> 
> You shouldn't need server logs _or_ client side JS. A lot of this 
> tracking stuff could be done by just putting down a proxy or shim that 
> checks request headers. It looks like matomo offers exactly this via 
> matomo/matomo-php-tracker.
> 
> I second bob's general sentiment: There's no need for client side tracking.
> 

Further, most (all?) devs I know generally tend to use pi-holes and other 
tracking blockers. Devs are notoriously hard people to track via client-side 
analytics. If we went with a client side solution, I would hope that we use a 
dedicated domain for ingestion so that this tracking can be easily blocked. It 
will still be blocked, but some people would rather block the entire domain 
(e.g., go to other mirrors/sites with the documentation) than be tracked.

For the case of whether comments are viewed via server-side, you could always 
load the comments div async once the scroll position goes past a certain point, 
and inject them into the DOM (see: htmx). This has really crappy usability, but 
works and might create a faster page load for pages with lots of comments. For 
people not using javascript, a simple button to reload the page with comments 
(`?comments=1`?) should be enough and provide the desired analytics as well.
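As a sketch of that no-JS fallback (function names and the exact parameter are
hypothetical, following the `?comments=1` idea above):

```php
<?php
// Decide server-side whether to render the comments section. Requests
// carrying ?comments=1 both render the comments and leave a countable
// trace in ordinary server-side analytics.
function should_render_comments(array $query): bool
{
    return isset($query['comments']) && $query['comments'] === '1';
}

// In the page template, the no-JS fallback is then just a link/button:
function comments_button(string $pageUrl): string
{
    return '<a href="' . htmlspecialchars($pageUrl) . '?comments=1">Show comments</a>';
}
```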

— Rob

Re: [PHP-DEV] [RFC] PHP.net analytics

2024-11-01 Thread Jonathan Vollebregt

On 11/2/24 12:10 AM, Bob Weinand wrote:
> > What percentage of users get to the docs through direct links vs 
> > the home page


> That's something you can generally infer from server logs - was the home 
> page accessed from that IP right before another page was opened? It's 
> not as accurate, but for a general understanding of orders of magnitude 
> it's good enough.


Even better: If we're talking about internal navigation you can check 
the referrer header and know for sure, since the docs don't add 
rel=noreferrer on links or anything.


You shouldn't need server logs _or_ client side JS. A lot of this 
tracking stuff could be done by just putting down a proxy or shim that 
checks request headers. It looks like matomo offers exactly this via 
matomo/matomo-php-tracker.


I second Bob's general sentiment: There's no need for client side tracking.
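The referrer check described above is cheap to do server-side; a minimal
sketch (the host name is a placeholder):

```php
<?php
// Classify a hit as internal navigation, external referral, or direct,
// using only the Referer request header the browser already sends.
function classify_referrer(?string $referrer, string $siteHost = 'www.php.net'): string
{
    if ($referrer === null || $referrer === '') {
        return 'direct';   // typed URL, bookmark, or stripped referrer
    }
    $host = parse_url($referrer, PHP_URL_HOST);
    return $host === $siteHost ? 'internal' : 'external';
}
```

Caveat: rel=noreferrer or a strict Referrer-Policy strips the header, so
"direct" will overcount somewhat, which is the accuracy trade-off Bob mentions.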


Re: [PHP-DEV] [RFC] PHP.net analytics

2024-11-01 Thread Bob Weinand

On 1.11.2024 22:41:29, Larry Garfield wrote:

> In a similar vein to approving the use of software, Roman Pronskiy asked for my 
> help putting together an RFC on collecting analytics for PHP.net.
>
> https://wiki.php.net/rfc/phpnet-analytics
>
> Of particular note:
>
> * This is self-hosted, first-party only.  No third parties get data, so no 
> third parties can do evil things with it.
> * There is no plan to collect any PII.
> * The goal is to figure out how to most efficiently spend Foundation money 
> improving php.net, something that is sorely needed.
>
> Ideally we'd have this in place by the 8.4 release or shortly thereafter, 
> though I realize that's a little tight on the timeline.


Hey Larry,

I have a couple concerns and questions:

Is there a way to track analytics with only transient data? As in, data 
actually stored is always already anonymized enough that it would be 
unproblematic to share it with everyone?
Or possibly, is there a retention period for the raw data after which 
only anonymized data remains?



Do you actually have a plan what to use that data for? The RFC mostly 
talks about "high traffic". But does that mean anything? I do look at a 
documentation page, because I need to look something specific up (what 
was the order of arguments of strpos again?). I may only look shortly at 
it. Maybe even often. But it has absolutely zero signal on whether the 
documentation page is good enough. In that case I don't look at the 
comments either. Comments are something you rarely look at, mostly the 
first time you want to even use a function.



Also, I don't buy the argument that none of that can be derived from server 
logs. Let's see what the RFC names:


    Time-on-page
    Whether they read the whole page or just a part
    Whether they even saw comments

Yes, these need a client side tracker. But I doubt the usefulness of the 
signal. You don't know who reads that. Is it someone who is already 
familiar with PHP and searches a detail? He'll quickly just find one 
part. Is it someone who is new to PHP and tries to understand PHP? He 
may well read the whole page. But you don't know that.


Quality of documentation is measured in whether it's possible to grasp 
the information easily. Not in how long or how completely a page is 
being read.


    What percentage of users get to the docs through direct links vs 
the home page


That's something you can generally infer from server logs - was the home 
page accessed from that IP right before another page was opened? It's 
not as accurate, but for a general understanding of orders of magnitude 
it's good enough.


    If users are hitting a single page per browser window or navigating 
through the site, and if the latter, how?


Number of windows needs a client side tracker too. Knowing whether the 
cross-referencing links (e.g. "See also") are used is possibly relevant. 
And also "what functions are looked up after this function".


    How much are users using the search function? Is it finding what 
they want, or is it just a crutch?


How much is probably greppable from the server logs as well. Whether 
they find what they want - I'm not sure how you'd determine that. I 
search something ... and possibly open a page. If that's not what I 
wanted, I'll leave the site and e.g. use google. If that's what I 
wanted, I'll also stop looking after that page.


    Do people use the translations alone, or do they use both the 
English site and other languages in tandem?

    Does anyone use multiple translations?

That's likely also determinable by server logs.


And yeah, server logs are more locked down. But that's something you can 
fix. I hope that the raw analytics data is just as locked down as the 
server logs...


I get that "cached by another proxy" is a possible problem, but it's a 
strawman I think. You don't need to be able to track all users, but just 
many.



Overall I feel like the signal we can get from using a JS tracker 
specifically is comparatively low to the point it's not actually worth it.



Bob