Maybe it should be made mandatory to disclose any use of LLMs when opening PRs.
Banning LLM usage completely is a bit extreme, but it may become necessary if vibe spammers keep flooding GitHub with useless PRs.

On Thursday, October 30, 2025 at 7:50:55 p.m. UTC+1 [email protected] wrote:

> I don't think it is a terrible idea to simply have a "no LLMs" policy at
> this point in time. We can always change it in the future as things get
> clearer. People will still use them in their LLM-enhanced editors, of
> course, and we can never detect the basic uses of the tools. But if people
> submit large chunks of text and code that have the hallmarks of full
> generation by an LLM, then we can reject them and point to the policy.
>
> As for the smell test and misconceptions about what an LLM can produce,
> this may depend on whether you think only a literal copy of something
> violates copyright, or whether a derivative of something also violates
> copyright. I think the essential question is whether the code an LLM
> produces is a derivative of copyrighted code. There are many past court
> cases ruling that derivatives are copyright violations in the US, and the
> OSS licenses almost all state that derivatives fall under the license. I
> doubt an LLM could produce a fix to the polynomials module if its only
> training data were the polynomials module. An LLM relies entirely on
> training on a vast corpus of works and generates code from all of that
> large body. Now, is that output then a derivative of one, some, or all of
> the training data? That is to be determined by those who rule on laws
> (hopefully). Given that we have spent about 40 years collectively trying
> to protect open source code with copyright licenses, it seems terribly
> wrong that if you make your copied source large enough, you no longer
> have to abide by the licenses.
> Paul Ivanov and Matthew Brett have done a good job explaining this nuance
> here: https://github.com/matthew-brett/sp-ai-post/blob/main/notes.md
>
> My personal opinion is that LLMs should honor the licenses of the
> training set, and if they did, then all would be good. I have no idea how
> they could solve that from a technical perspective, but the companies are
> simply ignoring copyright, claiming they are above such laws and that all
> they do is fair use. We plebes do not get that same ruling.
>
> Jason
> moorepants.info
> +01 530-601-9791
>
> On Thu, Oct 30, 2025 at 7:08 PM Aaron Meurer <[email protected]> wrote:
>
>> I like the Ghostty policy, which is that AI coding assistance is
>> allowed, but it must be disclosed:
>> https://github.com/ghostty-org/ghostty/blob/main/CONTRIBUTING.md#ai-assistance-notice
>> It should also be our policy that the person submitting the code is
>> ultimately responsible for it, regardless of what tools were used to
>> create it.
>>
>> I think it would be a mistake to ban AI usage entirely, because AI can
>> be very useful if used properly, i.e., you review the code it writes
>> before submitting it.
>>
>> For me the copyright question doesn't really pass the smell test, at
>> least for the majority of the use cases where I would use AI in SymPy.
>> For example, if I use AI to generate a fix for some part of SymPy,
>> say the polynomials module, then where would that fix have "come from"
>> for it to be a copyright violation? Where else in the world is there
>> code that looks like the SymPy polynomials module? Most code in SymPy
>> is very unique to SymPy. The only place it could have possibly come
>> from is SymPy itself, but if SymPy already had it then the code
>> wouldn't be needed in the first place (and anyway that wouldn't be a
>> copyright violation).
>> I think there's a misconception that LLMs can only generate text that
>> they've already seen before, and if you believe that misconception then
>> it would be easy to believe that everything generated by an LLM is a
>> copyright violation. But this is very easily seen to not be true if you
>> spend any amount of time using coding tools.
>>
>> As for PR descriptions, I agree those should always be hand-written.
>> But that's always been a battle, even before AI. And similarly almost
>> no one writes real commit messages anymore.
>>
>> Aaron Meurer
>>
>> On Sun, Oct 26, 2025 at 1:30 PM Oscar Benjamin
>> <[email protected]> wrote:
>>>
>>> Yes, the copyright is a big problem. I don't think I would say that
>>> LLMs universally violate copyright, e.g. if used for autocompleting an
>>> obvious line of code or for many other tasks. There are certain basic
>>> things like x += 1 that cannot reasonably be considered to be under
>>> copyright even if they do appear in much code. Clearly, though, an LLM
>>> can produce a large body of code where the only meaningful
>>> interpretation is that the code has been "copied" from one or two
>>> publicly available codebases.
>>>
>>> The main difficulty, I think, with having a policy about the use of
>>> LLMs is that unless it begins by saying "no LLMs", it somehow needs to
>>> begin by acknowledging what a reasonable use can be, which means
>>> confronting the copyright issue up front.
>>>
>>> On Sun, 26 Oct 2025 at 06:15, Jason Moore <[email protected]> wrote:
>>>>
>>>> Hi Oscar,
>>>>
>>>> Thanks for raising this. I agree, this problem will grow and it is
>>>> not good. I think we should have a policy about LLM-generated
>>>> contributions. It would be nice if a SYMPEP were drafted for one.
>>>>
>>>> Having a standard way to reject spam PRs would be helpful.
>>>> If we could close a PR and add a label that triggers sympybot to
>>>> leave a comment saying "This PR does not meet SymPy's quality
>>>> standards for AI-generated code and comments, see policy <link>",
>>>> that could be helpful. It still requires manual steps from reviewers.
>>>>
>>>> I also share the general concern expressed by some in the scipy
>>>> ecosystem here:
>>>>
>>>> https://github.com/scientific-python/summit-2025/issues/35#issuecomment-3038587497
>>>>
>>>> which is that LLMs universally violate the copyright licenses of open
>>>> source code. If this is true, then PRs with LLM-generated code are
>>>> polluting SymPy's codebase with copyright violations.
>>>>
>>>> Jason
>>>> moorepants.info
>>>> +01 530-601-9791
>>>>
>>>> On Sun, Oct 26, 2025 at 12:46 AM Oscar Benjamin
>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I am increasingly seeing pull requests in the SymPy repo that were
>>>>> written by AI, e.g. by something like Claude Code or ChatGPT. I
>>>>> don't think that any of these PRs are written by actual AI bots, but
>>>>> rather that they are "written" by contributors who are using AI
>>>>> tooling.
>>>>>
>>>>> There are two separate categories:
>>>>>
>>>>> - Some contributors are making reasonable changes to the code and
>>>>>   then using LLMs to write things like the PR description or
>>>>>   comments on issues.
>>>>> - Some contributors are basically just vibe coding by having an LLM
>>>>>   write all the code for them and then opening PRs, usually with
>>>>>   very obvious problems.
>>>>>
>>>>> In the first case some people use LLMs to write things like PR
>>>>> descriptions because English is not their first language. I can
>>>>> understand this and I think it is definitely possible to do this
>>>>> with LLMs in a way that is fine, but it needs to amount to using
>>>>> them like Google Translate rather than asking them to write the
>>>>> text.
>>>>> The problems are that:
>>>>>
>>>>> - LLM summaries for something like a PR are too verbose and include
>>>>>   lots of irrelevant information, making it harder to see what the
>>>>>   actual point is.
>>>>> - LLMs often include information that is just false, such as "fixes
>>>>>   issue #12345" when the issue is not fixed.
>>>>>
>>>>> I think some people are doing this in a way that is not good and I
>>>>> would prefer for them to just write in broken English or use Google
>>>>> Translate or something, but I don't see this as a major problem.
>>>>>
>>>>> For the vibe coding case I think that there is a real problem. Many
>>>>> SymPy contributors are novices at programming and are nowhere near
>>>>> experienced enough to be able to turn vibe coding into outputs that
>>>>> can be included in the codebase. This means that there are just
>>>>> spammy PRs with false claims about what they do, like "fixes X",
>>>>> "10x faster", etc., where the code has not even been lightly tested
>>>>> and clearly does not work or possibly does not even do anything.
>>>>>
>>>>> I think what has happened is that the combination of user-friendly
>>>>> editors with easy git/GitHub integration and LLM agent plugins has
>>>>> brought us to the point where there are pretty much no technical
>>>>> barriers preventing someone from opening gibberish spam PRs while
>>>>> having no real idea what they are doing.
>>>>>
>>>>> Really this is just inexperienced people using the tools badly,
>>>>> which is not new. Low-quality spammy PRs are not new either. There
>>>>> are some significant differences though:
>>>>>
>>>>> - I think that the number of low-quality PRs is going to explode. It
>>>>>   was already bad last year in the run-up to GSoC (January to March)
>>>>>   and I think it will be much worse this year.
>>>>> - I don't think that it is reasonable to give meaningful feedback
>>>>>   on PRs where this happens, because the contributor has not spent
>>>>>   any time studying the code that they are changing and any feedback
>>>>>   is just going to be fed into an LLM.
>>>>>
>>>>> I'm not sure what we can do about this, so for now I am regularly
>>>>> closing low-quality PRs without much feedback, but some contributors
>>>>> will just go on to open new PRs. The "anyone can submit a PR" model
>>>>> has been under threat for some time, but I worry that the whole idea
>>>>> is going to become unsustainable.
>>>>>
>>>>> In the context of the Russia-Ukraine war I have often seen
>>>>> references to the "cost-exchange problem". This refers to the fact
>>>>> that while both sides have a lot of anti-air defence capability,
>>>>> they can still be overrun by cheap drones, because million-dollar
>>>>> interceptor missiles are just too expensive to be used against any
>>>>> large number of incoming thousand-dollar drones. The solution there
>>>>> would be some kind of cheap interceptor, like an automatic AA gun,
>>>>> that can take out many cheap drones efficiently even if it is much
>>>>> less effective against fancier targets like enemy planes.
>>>>>
>>>>> The first time I heard about ChatGPT was when I got an email from
>>>>> StackOverflow saying that any use of ChatGPT was banned. Looking
>>>>> into it, the reason given was that it was just too easy to generate
>>>>> superficially reasonable text that was low-quality spam, and then
>>>>> too much effort for real humans to filter that spam out manually. In
>>>>> other words, bad/incorrect answers were nothing new, but large
>>>>> numbers of inexperienced people using ChatGPT had ruined the
>>>>> cost-exchange ratio of filtering them out.
>>>>>
>>>>> I think in the case of SymPy pull requests there is an analogous
>>>>> "effort-exchange problem".
>>>>> The effort PR reviewers put in to help with PRs is not reasonable
>>>>> if the author of the PR is not putting in a lot more effort
>>>>> themselves, because there are many times more people trying to
>>>>> author PRs than to review them. I don't think that it can be
>>>>> sustainable, in the face of this spam, to review PRs in the same
>>>>> way as if they had been written by humans who are at least trying
>>>>> to understand what they are doing (and therefore learning from
>>>>> feedback). Even just closing PRs and not giving any feedback needs
>>>>> to become more efficient somehow.
>>>>>
>>>>> We need some sort of clear guidance or policy on the use of AI that
>>>>> sets out clear expectations like "you still need to understand the
>>>>> code". I think we will also need to ban people for spam if they are
>>>>> doing things like opening AI-generated PRs without even testing the
>>>>> code. The hype spun by AI companies probably has many novice
>>>>> programmers believing that it actually is reasonable to behave like
>>>>> this, but it really is not, and that needs to be clearly stated
>>>>> somewhere. I don't think any of this is malicious, but I think that
>>>>> it has the potential to become very harmful to open source projects.
>>>>>
>>>>> The situation right now is not so bad, but if you project forwards
>>>>> a bit to when the repo gets a lot busier after Christmas, I think
>>>>> this is going to be a big problem, and I think it will only get
>>>>> worse in future years as well.
>>>>>
>>>>> It is very unfortunate that right now AI is being used in all the
>>>>> wrong places. It can do a student's homework because it knows the
>>>>> answers to all the standard homework problems, but it can't do the
>>>>> more complicated, more realistic things, and then students haven't
>>>>> learned anything from doing their homework.
>>>>> In the context of SymPy it would be so much more useful to have AI
>>>>> doing other things like reviewing the code, finding bugs, etc.,
>>>>> rather than helping novices to get a PR merged without actually
>>>>> investing the time to learn anything from the process.
>>>>>
>>>>> --
>>>>> Oscar
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "sympy" group. To unsubscribe from this group and stop
>>>>> receiving emails from it, send an email to
>>>>> [email protected]. To view this discussion visit
>>>>> https://groups.google.com/d/msgid/sympy/CAHVvXxQ1ntG0EWBGihrXErLhGuABHH7Kt5RmGJvp9bHcqaC5%3DQ%40mail.gmail.com
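[Editorial note: the label-triggered sympybot comment that Jason suggests upthread could be prototyped without any bespoke bot, using a plain GitHub Actions workflow. This is only a rough sketch under assumptions: the label name `ai-policy` and the workflow file name are made up, SymPy's actual bot infrastructure may work differently, and the policy link placeholder `<link>` is left as in the original message.]

```yaml
# .github/workflows/ai-policy-comment.yml (hypothetical)
# When a maintainer adds the (made-up) "ai-policy" label to a PR,
# post a canned comment pointing at the AI contribution policy.
name: AI policy reply

on:
  pull_request_target:
    types: [labeled]

jobs:
  comment:
    if: github.event.label.name == 'ai-policy'
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write  # needed to create the comment
    steps:
      - uses: actions/github-script@v7
        with:
          script: |
            // Post the standard rejection text on the labeled PR.
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: "This PR does not meet SymPy's quality standards " +
                    "for AI-generated code and comments. See the AI " +
                    "policy: <link>",
            });
```

A reviewer would still close the PR manually, so this only automates the comment half of the "close and label" step Jason describes.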
