Re: [webkit-dev] Spam and indexing

2019-05-02 Thread Alexey Proskuryakov

I posted a tool that I used for this today to 
https://bugs.webkit.org/show_bug.cgi?id=197537. There is probably a lot to 
improve, but it works.

- Alexey


> On May 2, 2019, at 2:32 PM, Darin Adler wrote:
> 
> Should we post instructions somewhere for people dealing with spam? I believe 
> the instructions are:
> 
> 1) Look up the email address of the account that posted the spam and disable 
> it first, so spammers don’t get email about other steps. Do this by clicking 
> on Administration, Users, finding the user and putting the word “Spam” into 
> the disable text.
> 
> 2) Move any spam bugs into the Spam component.
> 
> 3) Mark any spam comments as Private and also add the tag “spam”.
> 
> But maybe there’s more to it than that. For example, can someone without 
> administration privileges do the right thing? Should we make a small tool to 
> make this easier to do correctly?
> 
> I like the idea of having instructions so this isn’t oral tradition.
> 
> — Darin




Re: [webkit-dev] Spam and indexing

2019-05-02 Thread Michael Catanzaro

On Thu, May 2, 2019 at 4:32 PM, Darin Adler wrote:
> For example, can someone without administration privileges do the
> right thing?

Nope.




Re: [webkit-dev] Spam and indexing

2019-05-02 Thread Darin Adler
Should we post instructions somewhere for people dealing with spam? I believe 
the instructions are:

1) Look up the email address of the account that posted the spam and disable it 
first, so spammers don’t get email about other steps. Do this by clicking on 
Administration, Users, finding the user and putting the word “Spam” into the 
disable text.

2) Move any spam bugs into the Spam component.

3) Mark any spam comments as Private and also add the tag “spam”.

But maybe there’s more to it than that. For example, can someone without 
administration privileges do the right thing? Should we make a small tool to 
make this easier to do correctly?
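
For concreteness, here is a minimal, untested sketch of what such a tool could 
look like against the stock Bugzilla 5.0 REST API. The endpoint shapes are taken 
from the upstream documentation; the API key, account name, and ids are 
placeholders, and bugs.webkit.org may be configured differently:

    import requests

    BUGZILLA = "https://bugs.webkit.org/rest"
    API_KEY = "..."  # a real admin API key goes here

    def disable_account(login):
        # Step 1: disable the account first, so the spammer gets no more mail.
        requests.put(f"{BUGZILLA}/user/{login}",
                     params={"api_key": API_KEY},
                     json={"login_denied_text": "Spam"}).raise_for_status()

    def move_to_spam_component(bug_id):
        # Step 2: move the bug into the Spam component.
        requests.put(f"{BUGZILLA}/bug/{bug_id}",
                     params={"api_key": API_KEY},
                     json={"component": "Spam"}).raise_for_status()

    def tag_comment_as_spam(comment_id):
        # Step 3, in part: add the "spam" tag. Stock Bugzilla exposes no REST
        # call for marking a comment Private, so that step stays in the web UI.
        requests.put(f"{BUGZILLA}/bug/comment/{comment_id}/tags",
                     params={"api_key": API_KEY},
                     json={"add": ["spam"]}).raise_for_status()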

I like the idea of having instructions so this isn’t oral tradition.

— Darin


Re: [webkit-dev] Spam and indexing

2019-05-02 Thread Alexey Proskuryakov

One change that I'm going to make is to mark spam comments as private instead 
of simply tagging. That way, bugs will look cleaner, and there will be no doubt 
about whether search engines index hidden comments or not.

I'll also mark old spam comments as private. I think that this will generate 
e-mail notifications, so apologies for the upcoming e-mail storm. It should be 
possible to delete these all at once in most e-mail clients.

- Alexey



Re: [webkit-dev] Spam and indexing

2019-04-22 Thread Michael Catanzaro
On Mon, Apr 22, 2019 at 11:06 AM, Konstantin Tokarev wrote:
> Another possible way is to disable self-registration for new users,
> similarly to what the LLVM project did [1].


GCC Bugzilla did this a long time ago.

It will make it really hard to convince users to report bugs. I would 
try deindexing first, since it's a smaller hammer. Then we could try 
this if that fails.


Michael




Re: [webkit-dev] Spam and indexing

2019-04-22 Thread Konstantin Tokarev



22.04.2019, 18:58, "Michael Catanzaro" wrote:
> Not indexing bugs.webkit.org will be sad for people who won't be able
> to find bugs they may be interested in via search engines... but those
> people are probably not WebKit developers working with WebKit on a
> daily basis. For us, it's just annoying to deal with the spam. I would
> turn off the indexing if we think it could make a difference.

Another possible way is to disable self-registration for new users, similarly
to what the LLVM project did [1].

[1] https://bugs.llvm.org/
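
For reference, stock Bugzilla can already do this without code changes: the 
createemailregexp parameter restricts which addresses may self-register, so a 
never-matching pattern shuts self-registration off entirely. A sketch (the 
parameter lives in the Administration > Parameters UI):

    # Bugzilla parameter sketch: permit no address to self-register.
    createemailregexp = a^    # a regex that can never match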

-- 
Regards,
Konstantin



Re: [webkit-dev] Spam and indexing

2019-04-22 Thread Michael Catanzaro
Not indexing bugs.webkit.org will be sad for people who won't be able 
to find bugs they may be interested in via search engines... but those 
people are probably not WebKit developers working with WebKit on a 
daily basis. For us, it's just annoying to deal with the spam. I would 
turn off the indexing if we think it could make a difference.






Re: [webkit-dev] Spam and indexing

2019-03-29 Thread Konstantin Tokarev


29.03.2019, 19:30, "Michael Catanzaro" wrote:
> On Thu, Mar 28, 2019 at 3:57 PM, Alexey Proskuryakov wrote:
>>  2. Block indexing completely.
>>
>>  Seems like no one was bothered by lack of indexing on new bugs so far.
>
> The spam problem seems worse than not being indexed.
>
> If you want to search for WebKit bugs, you can do that on WebKit
> Bugzilla, right?

1. If an issue is referenced from external web sites (e.g., StackOverflow), it 
is ranked higher in search engine results, so for issues affecting many people 
a search engine may surface the right bug faster.

2. Search engines let you search the whole web for an answer, which may be 
useful if one is not sure whether the bug is in WebKit at all.

-- 
Regards,
Konstantin



Re: [webkit-dev] Spam and indexing

2019-03-29 Thread Konstantin Tokarev


29.03.2019, 19:16, "Alexey Proskuryakov" wrote:
>> On March 28, 2019, at 2:10 PM, Konstantin Tokarev wrote:
>>
>> 28.03.2019, 23:58, "Alexey Proskuryakov" wrote:
>>> Hello,
>>>
>>> The robots.txt file that we have on bugs.webkit.org currently allows search 
>>> engines access to individual bug pages, but not to any bug lists. As a 
>>> result, search engines and the Internet Archive only index bugs that were 
>>> filed before robots.txt changes a few years ago, and bugs that are directly 
>>> linked from webpages elsewhere. These bugs are where most spam content 
>>> naturally ends up.
>>>
>>> This is quite wrong, as indexing just a subset of bugs is not beneficial to 
>>> anyone other than spammers. So we can go in either direction:
>>>
>>> 1. Allow indexers to enumerate bugs, thus indexing all of them.
>>>
>>> Seems reasonable that people should be able to find bugs using search 
>>> engines.
>>
>> Yes, and it may give even better results than searching Bugzilla directly
>>
>>> On the other hand, we'll need to do something to ensure that indexers don't 
>>> destroy Bugzilla performance,
>>
>> This can be solved by caching
>
> Is this something that other Bugzilla instances do? I'm actually not sure how 
> caching can be meaningfully applied to Bugzilla. One wants to always see the 
> latest updates, and our automation in particular won't be OK with stale data.

I'm not sure whether HTTP-level caching can be used here, but a quick search 
turns up this:
https://www.bugzilla.org/releases/5.0.4/release-notes.html#feat_caching_performance

If we can update Bugzilla, it should at least be possible to reduce the number 
of database hits when pages are rendered.
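
Concretely, the feature behind that link is Bugzilla 5.0's memcached support; a 
sketch of the two parameters involved (placeholder values, assuming a memcached 
daemon is running next to the web server):

    # Bugzilla 5.0 parameters, set via Administration > Parameters.
    memcached_servers   = 127.0.0.1:11211
    memcached_namespace = bugzilla: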

> - Alexey
>
>>> and of course spammers will love having more flexibility.
>>
>> rel="nofollow" on all links in comments should be enough to make spamming 
>> useless
>>
>>> 2. Block indexing completely.
>>>
>>> Seems like no one was bothered by lack of indexing on new bugs so far.
>>
>> That's survival bias - if nobody can find relevant bugs, nobody will ever 
>> complain
>>
>>> Thoughts?
>>>
>>> For reference, here is the current robots.txt content:
>>>
>>> $ curl https://bugs.webkit.org/robots.txt
>>> User-agent: *
>>> Allow: /index.cgi
>>> Allow: /show_bug.cgi
>>> Disallow: /
>>> Crawl-delay: 20
>>>
>>> - Alexey
>>>
>>
>> --
>> Regards,
>> Konstantin


-- 
Regards,
Konstantin


Re: [webkit-dev] Spam and indexing

2019-03-29 Thread Michael Catanzaro
On Thu, Mar 28, 2019 at 3:57 PM, Alexey Proskuryakov wrote:
> 2. Block indexing completely.
>
> Seems like no one was bothered by lack of indexing on new bugs so far.


The spam problem seems worse than not being indexed.

If you want to search for WebKit bugs, you can do that on WebKit 
Bugzilla, right?


Michael




Re: [webkit-dev] Spam and indexing

2019-03-29 Thread Alexey Proskuryakov


> On March 28, 2019, at 2:10 PM, Konstantin Tokarev wrote:
> 
> 
> 
> 28.03.2019, 23:58, "Alexey Proskuryakov" wrote:
>> Hello,
>> 
>> The robots.txt file that we have on bugs.webkit.org currently allows search 
>> engines access to individual bug pages, but not to any bug lists. As a 
>> result, search engines and the Internet Archive only index bugs that were 
>> filed before robots.txt changes a few years ago, and bugs that are directly 
>> linked from webpages elsewhere. These bugs are where most spam content 
>> naturally ends up.
>> 
>> This is quite wrong, as indexing just a subset of bugs is not beneficial to 
>> anyone other than spammers. So we can go in either direction:
>> 
>> 1. Allow indexers to enumerate bugs, thus indexing all of them.
>> 
>> Seems reasonable that people should be able to find bugs using search 
>> engines.
> 
> Yes, and it may give even better results than searching Bugzilla directly
> 
>> On the other hand, we'll need to do something to ensure that indexers don't 
>> destroy Bugzilla performance,
> 
> This can be solved by caching

Is this something that other Bugzilla instances do? I'm actually not sure how 
caching can be meaningfully applied to Bugzilla. One wants to always see the 
latest updates, and our automation in particular won't be OK with stale data.

- Alexey

>> and of course spammers will love having more flexibility.
> 
> rel="nofollow" on all links in comments should be enough to make spamming 
> useless
> 
>> 
>> 2. Block indexing completely.
>> 
>> Seems like no one was bothered by lack of indexing on new bugs so far.
> 
> That's survival bias - if nobody can find relevant bugs, nobody will ever 
> complain
> 
>> 
>> Thoughts?
>> 
>> For reference, here is the current robots.txt content:
>> 
>> $ curl https://bugs.webkit.org/robots.txt
>> User-agent: *
>> Allow: /index.cgi
>> Allow: /show_bug.cgi
>> Disallow: /
>> Crawl-delay: 20
>> 
>> - Alexey
>> 
> 
> -- 
> Regards,
> Konstantin
> 





Re: [webkit-dev] Spam and indexing

2019-03-28 Thread Lucas Forschler


> On Mar 28, 2019, at 2:10 PM, Konstantin Tokarev wrote:
> 
> 
> 
> 28.03.2019, 23:58, "Alexey Proskuryakov" wrote:
>> Hello,
>> 
>> The robots.txt file that we have on bugs.webkit.org currently allows search 
>> engines access to individual bug pages, but not to any bug lists. As a 
>> result, search engines and the Internet Archive only index bugs that were 
>> filed before robots.txt changes a few years ago, and bugs that are directly 
>> linked from webpages elsewhere. These bugs are where most spam content 
>> naturally ends up.
>> 
>> This is quite wrong, as indexing just a subset of bugs is not beneficial to 
>> anyone other than spammers. So we can go in either direction:
>> 
>> 1. Allow indexers to enumerate bugs, thus indexing all of them.
>> 
>> Seems reasonable that people should be able to find bugs using search 
>> engines.
> 
> Yes, and it may give even better results than searching Bugzilla directly
> 
>> On the other hand, we'll need to do something to ensure that indexers don't 
>> destroy Bugzilla performance,
> 
> This can be solved by caching
> 
>> and of course spammers will love having more flexibility.
> 
> rel="nofollow" on all links in comments should be enough to make spamming 
> useless

Theoretically yes… but a couple of Google searches say it doesn’t make a 
difference. Here is one of many:
https://www.seroundtable.com/google-nofollow-link-attribute-failed-comments-26959.html

I expect that spammers don’t really care whether they get a nofollow or not; 
they are mostly unattended scripts anyway.

I’m not opposed to adding this; I just don’t expect it will solve the problem. 
We could measure and see.

Lucas


> 
>> 
>> 2. Block indexing completely.
>> 
>> Seems like no one was bothered by lack of indexing on new bugs so far.
> 
> That's survival bias - if nobody can find relevant bugs, nobody will ever 
> complain
> 
>> 
>> Thoughts?
>> 
>> For reference, here is the current robots.txt content:
>> 
>> $ curl https://bugs.webkit.org/robots.txt
>> User-agent: *
>> Allow: /index.cgi
>> Allow: /show_bug.cgi
>> Disallow: /
>> Crawl-delay: 20
>> 
>> - Alexey
>> 
> 
> -- 
> Regards,
> Konstantin
> 


Re: [webkit-dev] Spam and indexing

2019-03-28 Thread Konstantin Tokarev



28.03.2019, 23:58, "Alexey Proskuryakov" wrote:
> Hello,
>
> The robots.txt file that we have on bugs.webkit.org currently allows search 
> engines access to individual bug pages, but not to any bug lists. As a 
> result, search engines and the Internet Archive only index bugs that were 
> filed before robots.txt changes a few years ago, and bugs that are directly 
> linked from webpages elsewhere. These bugs are where most spam content 
> naturally ends up.
>
> This is quite wrong, as indexing just a subset of bugs is not beneficial to 
> anyone other than spammers. So we can go in either direction:
>
> 1. Allow indexers to enumerate bugs, thus indexing all of them.
>
> Seems reasonable that people should be able to find bugs using search engines.

Yes, and it may give even better results than searching Bugzilla directly

> On the other hand, we'll need to do something to ensure that indexers don't 
> destroy Bugzilla performance,

This can be solved by caching

> and of course spammers will love having more flexibility.

rel="nofollow" on all links in comments should be enough to make spamming 
useless
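
In other words, every URL a spammer drops into a comment would be rendered along 
these lines (illustrative markup only, not the actual Bugzilla template output):

    <a href="http://spam.example/" rel="nofollow">http://spam.example/</a>

so search engines would pass no ranking credit to the linked site.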

>
> 2. Block indexing completely.
>
> Seems like no one was bothered by lack of indexing on new bugs so far.

That's survival bias - if nobody can find relevant bugs, nobody will ever 
complain

>
> Thoughts?
>
> For reference, here is the current robots.txt content:
>
> $ curl https://bugs.webkit.org/robots.txt
> User-agent: *
> Allow: /index.cgi
> Allow: /show_bug.cgi
> Disallow: /
> Crawl-delay: 20
>
> - Alexey
>

-- 
Regards,
Konstantin



[webkit-dev] Spam and indexing

2019-03-28 Thread Alexey Proskuryakov
Hello,

The robots.txt file that we have on bugs.webkit.org currently allows search 
engines access to individual bug pages, but not to any bug lists. As a result, 
search engines and the Internet Archive only index bugs that were filed before 
robots.txt changes a few years ago, and bugs that are directly linked from 
webpages elsewhere. These bugs are where most spam content naturally ends up.

This is quite wrong, as indexing just a subset of bugs is not beneficial to 
anyone other than spammers. So we can go in either direction:

1. Allow indexers to enumerate bugs, thus indexing all of them.

Seems reasonable that people should be able to find bugs using search engines. 
On the other hand, we'll need to do something to ensure that indexers don't 
destroy Bugzilla performance, and of course spammers will love having more 
flexibility.

2. Block indexing completely.

Seems like no one was bothered by lack of indexing on new bugs so far.

Thoughts?


For reference, here is the current robots.txt content:

$ curl https://bugs.webkit.org/robots.txt
User-agent: *
Allow: /index.cgi
Allow: /show_bug.cgi
Disallow: /
Crawl-delay: 20
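
For comparison, option 2 would be a one-line policy, while option 1 would 
additionally need to open the bug list script; two sketches (the exact set of 
paths worth allowing depends on the Bugzilla version):

    # Option 1: let crawlers enumerate bugs through bug lists as well.
    User-agent: *
    Allow: /index.cgi
    Allow: /show_bug.cgi
    Allow: /buglist.cgi
    Disallow: /
    Crawl-delay: 20

    # Option 2: block indexing completely.
    User-agent: *
    Disallow: /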

- Alexey

