Quote:
Originally Posted by clevermax
I do not see a copyright issue here because of these two points:
1. An LLM does not store word-to-word copies of its training data set anywhere within the model. The training data set is used to adjust the weights of several millions/billions of parameters within the model, so that it can be said that it 'learned' all those huge amounts of data it is trained with.
So when it generates an output for you, every word (token) that comes out of it is actually given out by a sampling algorithm that picks the next token (~word) from a creamy layer of a probability distribution that comes at the final layer of the transformer's neural network, in its sequence of generation of the completion, until the end of sequence token is encountered or the token limit is hit.
That's a great, succinct explanation of how an LLM works. However, this copyrighted material is part of the LLM's training dataset, so the AI companies (and potentially their RLHF annotators) used your copyrighted code to train and improve their model; arguably, the copyright was already infringed when the foundation models were trained. In the case Samurai illustrated, where someone directly copies this code and integrates it into their codebase, I feel Boost's copyright license has been broken, because it states that "all derivative works of the Software" must include a copy of this license, and this is clearly a derivative work. In software, I don't think a word-for-word copy matters as much as the underlying algorithm/patterns/logic.
When training a model, you aim to minimize its loss/error. But how can you tell, especially with software/math/physics, whether the output is correct or not? The presence of certain words/tokens, or any other such heuristic measure, falls short here. Enter RLHF, a billion-dollar industry. (This is the problem that popular voices claim newer AI models have solved, btw, despite synthetic data pipelines having existed for a while now and the RLHF industry doing better than ever. Either way, we won't know for a while.)
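To make the "minimize its loss" bit concrete, here's a toy sketch in Python (the tiny vocabulary and probabilities are made up, this is not any lab's actual pipeline): pre-training pushes the model to assign high probability to whatever token actually came next in the training text, and that text can include copyrighted code.

[code]
# Toy sketch (hypothetical numbers, not a real pipeline): next-token training
# minimises cross-entropy loss, i.e. the model is rewarded for putting high
# probability on whatever token actually followed in the training text.
import numpy as np

vocab = ["BOOST_STATIC_ASSERT", "(", "true", ")", ";"]  # hypothetical tiny vocabulary

def cross_entropy(predicted_probs: np.ndarray, target_index: int) -> float:
    """Loss is low only when the model puts high probability on the real next token."""
    return -float(np.log(predicted_probs[target_index] + 1e-12))

# Suppose the model currently thinks "(" is only 20% likely after "BOOST_STATIC_ASSERT"
predicted = np.array([0.30, 0.20, 0.25, 0.15, 0.10])
target = vocab.index("(")                  # the token that actually followed in the training data
print(cross_entropy(predicted, target))    # ~1.61 -- gradient descent will push this down
[/code]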
With RLHF, annotators can "correct" the model enough times that, despite the probability distribution you spoke of, a model with a low enough temperature, asked a specific coding question with no tricks in the prompt, almost always produces the same set of results (barring small errors like missing characters or slightly incorrect logic). One reason could be that certain answers are rated correct so many times that the weights reflect this, and the tokens leading to them become so highly probable that the top-k/top-p candidates after such training all lead to the same set of answers (like the saying, "all roads lead to Rome"). You can try this yourself with LeetCode problems: the models' solutions are, conveniently, always similar to the most popular solutions that are openly available!
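Here's a rough illustration of the temperature/top-k effect I mean, using made-up logits rather than a real model's output: once a few continuations have been reinforced enough, a low temperature lets them soak up nearly all the probability mass and decoding becomes close to deterministic.

[code]
# Minimal sketch (assumed logits, not from a real model): how temperature and
# top-k narrow the "creamy layer" of the distribution. With a low temperature,
# the most-reinforced continuation dominates almost every sample.
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([4.0, 2.5, 2.0, 0.5, -1.0])    # hypothetical scores for 5 candidate tokens

def sample(logits: np.ndarray, temperature: float, top_k: int) -> int:
    scaled = logits / temperature                 # low temperature sharpens the distribution
    top = np.argsort(scaled)[-top_k:]             # keep only the k most likely tokens
    probs = np.exp(scaled[top] - scaled[top].max())
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

print([sample(logits, temperature=1.0, top_k=3) for _ in range(10)])  # some variety across picks
print([sample(logits, temperature=0.2, top_k=3) for _ in range(10)])  # almost always token 0
[/code]

Turn the temperature down and the "creamy layer" collapses to essentially one path, which is consistent with everyone getting near-identical LeetCode solutions out of these models.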
Copyright law in AI is a huge debate; an ex-OpenAI whistleblower who raised this very issue died mysteriously just last month. I don't know enough about it, as I am neither a lawyer nor have I worked at a frontier model lab, but it's definitely a problem, and I feel it shouldn't be dismissed just because the AI isn't reproducing an exact, one-to-one copy of any material in its training set.
I'd say that if staying clear of this type of legal issue matters to you, use these tools with caution!
EDIT: I missed @NetfreakBombay's response. Thanks for the clarification! This is great to know for IDE applications. Any idea what happens w.r.t. code produced in a chat with the models, such as ChatGPT's or Claude's web interface?