Yes, the copyright is a big problem. I don't think I would say that LLMs universally violate copyright, e.g. when used to autocomplete an obvious line of code, or for many other tasks. There are certain basic things like x += 1 that cannot reasonably be considered to be under copyright even if they do appear in much code. Clearly, though, an LLM can produce a large body of code where the only meaningful interpretation is that the code has been "copied" from one or two publicly available codebases.
The main difficulty I think with having a policy about the use of LLMs is that unless it begins by saying "no LLMs", it needs to acknowledge what a reasonable use can be, which means confronting the copyright issue up front.

On Sun, 26 Oct 2025 at 06:15, Jason Moore <[email protected]> wrote:
>
> Hi Oscar,
>
> Thanks for raising this. I agree, this problem will grow and it is not good.
> I think we should have a policy about LLM-generated contributions. It would
> be nice if a SYMPEP was drafted for one.
>
> Having a standard way to reject spam PRs would be helpful. If we could close
> a PR and add a label to trigger sympybot to leave a comment that says "This
> PR does not meet SymPy's quality standards for AI generated code and
> comments, see policy <link>", that could be helpful. It still requires manual
> steps from reviewers.
>
> I also share the general concern expressed by some in the scipy ecosystem
> here:
>
> https://github.com/scientific-python/summit-2025/issues/35#issuecomment-3038587497
>
> which is that LLMs universally violate copyright licenses of open source
> code. If this is true, then PRs with LLM-generated code are polluting SymPy's
> codebase with copyright violations.
>
> Jason
> moorepants.info
> +01 530-601-9791
>
>
> On Sun, Oct 26, 2025 at 12:46 AM Oscar Benjamin <[email protected]>
> wrote:
>>
>> Hi all,
>>
>> I am increasingly seeing pull requests in the SymPy repo that were
>> written by AI, e.g. something like Claude Code or ChatGPT. I don't
>> think that any of these PRs are written by actual AI bots, but rather
>> that they are "written" by contributors who are using AI tooling.
>>
>> There are two separate categories:
>>
>> - Some contributors are making reasonable changes to the code and then
>> using LLMs to write things like the PR description or comments on
>> issues.
>> - Some contributors are basically just vibe coding by having an LLM
>> write all the code for them and then opening PRs, usually with very
>> obvious problems.
>>
>> In the first case some people use LLMs to write things like PR
>> descriptions because English is not their first language. I can
>> understand this and I think it is definitely possible to do this with
>> LLMs in a way that is fine, but it needs to amount to using them like
>> Google Translate rather than asking them to write the text. The
>> problems are that:
>>
>> - LLM summaries for something like a PR are too verbose and include
>> lots of irrelevant information, making it harder to see what the
>> actual point is.
>> - LLMs often include information that is just false, such as "fixes
>> issue #12345" when the issue is not fixed.
>>
>> I think some people are doing this in a way that is not good and I
>> would prefer for them to just write in broken English or use Google
>> Translate or something, but I don't see this as a major problem.
>>
>> For the vibe coding case I think that there is a real problem. Many
>> SymPy contributors are novices at programming and are nowhere near
>> experienced enough to be able to turn vibe coding into outputs that
>> can be included in the codebase. This means that there are just spammy
>> PRs with false claims about what they do, like "fixes X" or "10x
>> faster", where the code has not even been lightly tested and clearly
>> does not work, or possibly does not even do anything.
>>
>> I think what has happened is that the combination of user-friendly
>> editors with easy git/GitHub integration and LLM agent plugins has
>> brought us to the point where there are pretty much no technical
>> barriers preventing someone from opening gibberish spam PRs while
>> having no real idea what they are doing.
>>
>> Really this is just inexperienced people using the tools badly, which
>> is not new. Low quality spammy PRs are not new either.
>> There are some significant differences though:
>>
>> - I think that the number of low quality PRs is going to explode. It
>> was already bad last year in the run up to GSOC (January to March
>> time) and I think it will be much worse this year.
>> - I don't think that it is reasonable to give meaningful feedback on
>> PRs where this happens because the contributor has not spent any time
>> studying the code that they are changing, and any feedback is just
>> going to be fed into an LLM.
>>
>> I'm not sure what we can do about this, so for now I am regularly
>> closing low quality PRs without much feedback, but some contributors
>> will just go on to open up new PRs. The "anyone can submit a PR" model
>> has been under threat for some time, but I worry that the whole idea
>> is going to become unsustainable.
>>
>> In the context of the Russia-Ukraine war I have often seen references
>> to the "cost-exchange problem". This refers to the fact that while
>> both sides have a lot of anti-air defence capability they can still be
>> overrun by cheap drones, because million dollar interceptor missiles
>> are just too expensive to be used against any large number of incoming
>> thousand dollar drones. The solution there would be to have some kind
>> of cheap interceptor, like an automatic AA gun, that can take out many
>> cheap drones efficiently even if it is much less effective against
>> fancier targets like enemy planes.
>>
>> The first time I heard about ChatGPT was when I got an email from
>> StackOverflow saying that any use of ChatGPT was banned. Looking into
>> it, the reason given was that it was just too easy to generate
>> superficially reasonable text that was low quality spam, and then too
>> much effort for real humans to filter that spam out manually. In other
>> words, bad/incorrect answers were nothing new, but large numbers of
>> inexperienced people using ChatGPT had ruined the cost-exchange ratio
>> of filtering them out.
>>
>> I think in the case of SymPy pull requests there is an analogous
>> "effort-exchange problem". The effort PR reviewers put in to help with
>> PRs is not reasonable if the author of the PR is not putting in a lot
>> more effort themselves, because there are many times more people
>> trying to author PRs than review them. I don't think that it can be
>> sustainable in the face of this spam to review PRs in the same way as
>> if they had been written by humans who are at least trying to
>> understand what they are doing (and therefore learning from feedback).
>> Even just closing PRs and not giving any feedback needs to become more
>> efficient somehow.
>>
>> We need some sort of clear guidance or policy on the use of AI that
>> sets clear expectations like "you still need to understand the code".
>> I think we will also need to ban people for spam if they are doing
>> things like opening AI-generated PRs without even testing the code.
>> The hype that is spun by AI companies probably has many novice
>> programmers believing that it actually is reasonable to behave like
>> this, but it really is not, and that needs to be clearly stated
>> somewhere. I don't think any of this is malicious, but I think that it
>> has the potential to become very harmful to open source projects.
>>
>> The situation right now is not so bad, but if you project forwards a
>> bit to when the repo gets a lot busier after Christmas I think this is
>> going to be a big problem, and I think it will only get worse in
>> future years as well.
>>
>> It is very unfortunate that right now AI is being used in all the
>> wrong places. It can do a student's homework because it knows the
>> answers to all the standard homework problems, but it can't do the
>> more complicated, more realistic things, and then students haven't
>> learned anything from doing their homework.
>> In the context of SymPy it would
>> be so much more useful to have AI doing other things like reviewing
>> the code, finding bugs, etc., rather than helping novices to get a PR
>> merged without actually investing the time to learn anything from the
>> process.
>>
>> --
>> Oscar
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "sympy" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion visit
>> https://groups.google.com/d/msgid/sympy/CAHVvXxQ1ntG0EWBGihrXErLhGuABHH7Kt5RmGJvp9bHcqaC5%3DQ%40mail.gmail.com.
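The label-triggered sympybot comment that Jason suggests could be sketched roughly as below. This is only an illustration of the mechanism, not SymPy's actual bot: the label name `llm-policy` is a hypothetical placeholder, the policy link is left as `<link>` because the policy does not exist yet, and the script assumes it would run in a GitHub Actions job fired on a `pull_request` event of type `labeled`.

```python
import json
import os
import urllib.request
from typing import Optional

TRIGGER_LABEL = "llm-policy"  # hypothetical label name, chosen for this sketch
POLICY_URL = "<link>"  # the policy document has not been written yet

def comment_for(label: str) -> Optional[str]:
    """Return the canned rejection comment if the added label is the trigger,
    otherwise None so that unrelated labels do nothing."""
    if label != TRIGGER_LABEL:
        return None
    return ("This PR does not meet SymPy's quality standards for AI generated "
            f"code and comments, see policy {POLICY_URL}")

def post_comment(repo: str, number: int, body: str, token: str) -> None:
    """Post the comment via the GitHub REST API; PR comments use the
    issues endpoint because PRs are issues in the API."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{repo}/issues/{number}/comments",
        data=json.dumps({"body": body}).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
        method="POST",
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    # In a real GitHub Actions job the event payload and token would come
    # from the runner environment; LABEL_EVENT is an assumed variable name.
    event = json.loads(os.environ.get("LABEL_EVENT", "{}"))
    body = comment_for(event.get("label", {}).get("name", ""))
    if body and "GITHUB_TOKEN" in os.environ:
        post_comment(os.environ["GITHUB_REPOSITORY"],
                     event["pull_request"]["number"],
                     body,
                     os.environ["GITHUB_TOKEN"])
```

As noted in the thread, this still requires a manual step: a reviewer has to close the PR and apply the label; only the canned comment is automated.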
