Maybe it should be made mandatory to disclose any use of LLMs when opening PRs.
Banning LLM usage completely is a bit extreme, but it may become necessary if vibe spammers keep flooding GitHub with useless PRs.

On Thursday, October 30, 2025 at 7:50:55 p.m. UTC+1 [email protected] wrote:

> I don't think it is a terrible idea to simply have a "no LLMs" policy at
> this point in time. We can always change it in the future as things get
> clearer. People will still use them in their LLM-enhanced editors, of
> course, and we can never detect the basic uses of the tools. But if people
> submit large chunks of text and code that have the hallmarks of full
> generation by an LLM, then we can reject them and point to the policy.
>
> As for the smell test and misconceptions about what an LLM can produce,
> this may depend on whether you think only a literal copy of something
> violates copyright, or whether a derivative of something also violates
> copyright. I think the essential question is whether the code an LLM
> produces is a derivative of copyrighted code. There are many past court
> cases ruling that derivatives are copyright violations in the US, and the
> OSS licenses almost all state that derivatives fall under the license. I
> doubt an LLM could produce a fix to the polynomials module if its only
> training data were the polynomials module. An LLM relies entirely on
> training on a vast corpus of works and generates code from all of that
> large body. Now, is that output then a derivative of one, some, or all of
> the training data? That is to be determined by those who rule on laws
> (hopefully). Given that we have spent about 40 years collectively trying
> to protect open source code with copyright licenses, it seems terribly
> wrong that if you make your copied source large enough, you no longer
> have to abide by the licenses.
> Paul Ivanov and Matthew Brett have done a good job explaining this nuance
> here: https://github.com/matthew-brett/sp-ai-post/blob/main/notes.md
>
> My personal opinion is that LLMs should honor the licenses of the
> training set, and if they did, then all would be good. I have no idea how
> they could solve that from a technical perspective, but the companies are
> simply ignoring copyright, claiming they are above such laws and that all
> they do is fair use. We plebes do not get that same ruling.
>
> Jason
> moorepants.info
> +01 530-601-9791
>
> On Thu, Oct 30, 2025 at 7:08 PM Aaron Meurer <[email protected]> wrote:
>
>> I like the Ghostty policy, which is that AI coding assistance is
>> allowed, but it must be disclosed:
>> https://github.com/ghostty-org/ghostty/blob/main/CONTRIBUTING.md#ai-assistance-notice
>> It should also be our policy that the person submitting the code is
>> ultimately responsible for it, regardless of what tools were used to
>> create it.
>>
>> I think it would be a mistake to ban AI usage entirely, because AI can
>> be very useful if used properly, i.e., you review the code it writes
>> before submitting it.
>>
>> For me the copyright question doesn't really pass the smell test, at
>> least for the majority of the use cases where I would use AI in SymPy.
>> For example, if I use AI to generate a fix for some part of SymPy,
>> say the polynomials module, then where would that fix have "come from"
>> for it to be a copyright violation? Where else in the world is there
>> code that looks like the SymPy polynomials module? Most code in SymPy
>> is very unique to SymPy. The only place it could have possibly come
>> from is SymPy itself, but if SymPy already had it then the code
>> wouldn't be needed in the first place (and anyway that wouldn't be a
>> copyright violation).
>> I think there's a misconception that LLMs can only generate text that
>> they've already seen before, and if you believe that misconception then
>> it would be easy to believe that everything generated by an LLM is a
>> copyright violation. But this is very easily seen to not be true if you
>> spend any amount of time using coding tools.
>>
>> As for PR descriptions, I agree those should always be hand-written.
>> But that's always been a battle, even before AI. And similarly almost
>> no one writes real commit messages anymore.
>>
>> Aaron Meurer
>>
>> On Sun, Oct 26, 2025 at 1:30 PM Oscar Benjamin
>> <[email protected]> wrote:
>>>
>>> Yes, the copyright is a big problem. I don't think I would say that
>>> LLMs universally violate copyright, e.g. if used for autocompleting an
>>> obvious line of code or for many other tasks. There are certain basic
>>> things like x += 1 that cannot reasonably be considered to be under
>>> copyright even if they do appear in much code. Clearly, though, an LLM
>>> can produce a large body of code where the only meaningful
>>> interpretation is that the code has been "copied" from one or two
>>> publicly available codebases.
>>>
>>> The main difficulty, I think, with having a policy about the use of
>>> LLMs is that unless it begins by saying "no LLMs", it somehow needs to
>>> begin by acknowledging what a reasonable use can be, which means
>>> confronting the copyright issue up front.
>>>
>>> On Sun, 26 Oct 2025 at 06:15, Jason Moore <[email protected]> wrote:
>>>>
>>>> Hi Oscar,
>>>>
>>>> Thanks for raising this. I agree, this problem will grow and it is
>>>> not good. I think we should have a policy about LLM-generated
>>>> contributions. It would be nice if a SYMPEP were drafted for one.
>>>>
>>>> Having a standard way to reject spam PRs would be helpful.
>>>> If we could close a PR and add a label that triggers sympybot to
>>>> leave a comment saying "This PR does not meet SymPy's quality
>>>> standards for AI-generated code and comments, see policy <link>",
>>>> that could be helpful. It still requires manual steps from reviewers.
>>>>
>>>> I also share the general concern expressed by some in the scipy
>>>> ecosystem here:
>>>>
>>>> https://github.com/scientific-python/summit-2025/issues/35#issuecomment-3038587497
>>>>
>>>> which is that LLMs universally violate the copyright licenses of open
>>>> source code. If this is true, then PRs with LLM-generated code are
>>>> polluting SymPy's codebase with copyright violations.
>>>>
>>>> Jason
>>>> moorepants.info
>>>> +01 530-601-9791
>>>>
>>>> On Sun, Oct 26, 2025 at 12:46 AM Oscar Benjamin
>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I am increasingly seeing pull requests in the SymPy repo that were
>>>>> written by AI, e.g. by something like Claude Code or ChatGPT. I
>>>>> don't think that any of these PRs are written by actual AI bots, but
>>>>> rather that they are "written" by contributors who are using AI
>>>>> tooling.
>>>>>
>>>>> There are two separate categories:
>>>>>
>>>>> - Some contributors are making reasonable changes to the code and
>>>>>   then using LLMs to write things like the PR description or
>>>>>   comments on issues.
>>>>> - Some contributors are basically just vibe coding by having an LLM
>>>>>   write all the code for them and then opening PRs, usually with
>>>>>   very obvious problems.
>>>>>
>>>>> In the first case some people use LLMs to write things like PR
>>>>> descriptions because English is not their first language. I can
>>>>> understand this and I think it is definitely possible to do this
>>>>> with LLMs in a way that is fine, but it needs to amount to using
>>>>> them like Google Translate rather than asking them to write the
>>>>> text.
>>>>> The problems are that:
>>>>>
>>>>> - LLM summaries for something like a PR are too verbose and include
>>>>>   lots of irrelevant information, making it harder to see what the
>>>>>   actual point is.
>>>>> - LLMs often include information that is just false, such as "fixes
>>>>>   issue #12345" when the issue is not fixed.
>>>>>
>>>>> I think some people are doing this in a way that is not good and I
>>>>> would prefer for them to just write in broken English or use Google
>>>>> Translate or something, but I don't see this as a major problem.
>>>>>
>>>>> For the vibe coding case I think that there is a real problem. Many
>>>>> SymPy contributors are novices at programming and are nowhere near
>>>>> experienced enough to be able to turn vibe coding into outputs that
>>>>> can be included in the codebase. This means that there are just
>>>>> spammy PRs with false claims about what they do, like "fixes X",
>>>>> "10x faster", etc., where the code has not even been lightly tested
>>>>> and clearly does not work or possibly does not even do anything.
>>>>>
>>>>> I think what has happened is that the combination of user-friendly
>>>>> editors with easy git/GitHub integration and LLM agent plugins has
>>>>> brought us to the point where there are pretty much no technical
>>>>> barriers preventing someone from opening gibberish spam PRs while
>>>>> having no real idea what they are doing.
>>>>>
>>>>> Really this is just inexperienced people using the tools badly,
>>>>> which is not new. Low-quality spammy PRs are not new either. There
>>>>> are some significant differences though:
>>>>>
>>>>> - I think that the number of low-quality PRs is going to explode. It
>>>>>   was already bad last year in the run-up to GSoC (January to March)
>>>>>   and I think it will be much worse this year.
>>>>> - I don't think that it is reasonable to give meaningful feedback
>>>>>   on PRs where this happens, because the contributor has not spent
>>>>>   any time studying the code that they are changing and any feedback
>>>>>   is just going to be fed into an LLM.
>>>>>
>>>>> I'm not sure what we can do about this, so for now I am regularly
>>>>> closing low-quality PRs without much feedback, but some contributors
>>>>> will just go on to open new PRs. The "anyone can submit a PR" model
>>>>> has been under threat for some time, but I worry that the whole idea
>>>>> is going to become unsustainable.
>>>>>
>>>>> In the context of the Russia-Ukraine war I have often seen
>>>>> references to the "cost-exchange problem". This refers to the fact
>>>>> that while both sides have a lot of anti-air defence capability,
>>>>> they can still be overrun by cheap drones, because million-dollar
>>>>> interceptor missiles are just too expensive to be used against any
>>>>> large number of incoming thousand-dollar drones. The solution there
>>>>> would be some kind of cheap interceptor, like an automatic AA gun,
>>>>> that can take out many cheap drones efficiently even if it is much
>>>>> less effective against fancier targets like enemy planes.
>>>>>
>>>>> The first time I heard about ChatGPT was when I got an email from
>>>>> StackOverflow saying that any use of ChatGPT was banned. Looking
>>>>> into it, the reason given was that it was just too easy to generate
>>>>> superficially reasonable text that was low-quality spam, and then
>>>>> too much effort for real humans to filter that spam out manually. In
>>>>> other words, bad/incorrect answers were nothing new, but large
>>>>> numbers of inexperienced people using ChatGPT had ruined the
>>>>> cost-exchange ratio of filtering them out.
>>>>>
>>>>> I think in the case of SymPy pull requests there is an analogous
>>>>> "effort-exchange problem".
>>>>> The effort PR reviewers put in to help with PRs is not reasonable
>>>>> if the author of the PR is not putting in a lot more effort
>>>>> themselves, because there are many times more people trying to
>>>>> author PRs than to review them. I don't think that it can be
>>>>> sustainable, in the face of this spam, to review PRs in the same
>>>>> way as if they had been written by humans who are at least trying
>>>>> to understand what they are doing (and therefore learning from
>>>>> feedback). Even just closing PRs and not giving any feedback needs
>>>>> to become more efficient somehow.
>>>>>
>>>>> We need some sort of clear guidance or policy on the use of AI that
>>>>> sets out clear expectations like "you still need to understand the
>>>>> code". I think we will also need to ban people for spam if they are
>>>>> doing things like opening AI-generated PRs without even testing the
>>>>> code. The hype spun by AI companies probably has many novice
>>>>> programmers believing that it actually is reasonable to behave like
>>>>> this, but it really is not, and that needs to be clearly stated
>>>>> somewhere. I don't think any of this is malicious, but I think that
>>>>> it has the potential to become very harmful to open source projects.
>>>>>
>>>>> The situation right now is not so bad, but if you project forwards
>>>>> a bit to when the repo gets a lot busier after Christmas, I think
>>>>> this is going to be a big problem, and I think it will only get
>>>>> worse in future years as well.
>>>>>
>>>>> It is very unfortunate that right now AI is being used in all the
>>>>> wrong places. It can do a student's homework because it knows the
>>>>> answers to all the standard homework problems, but it can't do the
>>>>> more complicated, more realistic things, and then students haven't
>>>>> learned anything from doing their homework.
>>>>> In the context of SymPy it would be so much more useful to have AI
>>>>> doing other things like reviewing the code, finding bugs, etc.,
>>>>> rather than helping novices to get a PR merged without actually
>>>>> investing the time to learn anything from the process.
>>>>>
>>>>> --
>>>>> Oscar
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "sympy" group. To unsubscribe from this group and stop
>>>>> receiving emails from it, send an email to
>>>>> [email protected]. To view this discussion visit
>>>>> https://groups.google.com/d/msgid/sympy/CAHVvXxQ1ntG0EWBGihrXErLhGuABHH7Kt5RmGJvp9bHcqaC5%3DQ%40mail.gmail.com
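[Editorial note: the label-triggered sympybot comment that Jason suggests upthread could be prototyped without any bespoke bot, using a plain GitHub Actions workflow. This is only a rough sketch under assumptions: the label name `ai-policy` and the workflow file name are made up, SymPy's actual bot infrastructure may work differently, and the policy link placeholder `<link>` is left as in the original message.]

```yaml
# .github/workflows/ai-policy-comment.yml (hypothetical)
# When a maintainer adds the (made-up) "ai-policy" label to a PR,
# post a canned comment pointing at the AI contribution policy.
name: AI policy reply

on:
  pull_request_target:
    types: [labeled]

jobs:
  comment:
    if: github.event.label.name == 'ai-policy'
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write  # needed to create the comment
    steps:
      - uses: actions/github-script@v7
        with:
          script: |
            // Post the standard rejection text on the labeled PR.
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: "This PR does not meet SymPy's quality standards " +
                    "for AI-generated code and comments. See the AI " +
                    "policy: <link>",
            });
```

A reviewer would still close the PR manually, so this only automates the comment half of the "close and label" step Jason describes.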
