Quote:
Originally Posted by clevermax
I do not see a copyright issue here because of these two points:
1. An LLM does not store word-to-word copies of its training data set anywhere within the model. The training data set is used to adjust the weights of several millions/billions of parameters within the model, so that it can be said that it 'learned' all those huge amounts of data it is trained with.
So when it generates an output for you, every word (token) that comes out of it is actually given out by a sampling algorithm that picks the next token (~word) from a creamy layer of a probability distribution that comes at the final layer of the transformer's neural network, in its sequence of generation of the completion, until the end of sequence token is encountered or the token limit is hit.
That's a great, succinct explanation of how an LLM works. However, this copyrighted material is part of the LLM's training dataset, so the AI companies (and potentially their RLHF annotators) used your copyrighted code to train and improve their model; arguably, the copyright was already infringed when the foundation models were trained. In the case Samurai illustrated, where someone directly copies this code and integrates it into their codebase, I feel Boost's copyright license has been broken, because it states that "all derivative works of the Software" must include a copy of this license, and this is clearly a derivative work. In software, I don't think a word-for-word copy matters as much as the underlying algorithm/patterns/logic.
When training a model, you aim to minimize its loss/error. But how can you tell, especially with software/math/physics, whether the output is correct or not? The presence of certain words/tokens, or any other such heuristic measure, falls short here. Enter RLHF, a billion-dollar industry. (This is the problem that popular voices claim newer AI models have solved, btw, despite synthetic data pipelines having existed for a while now and the RLHF industry doing better than ever. Either way, we won't know for a while.)
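To make the "minimize its loss" bit concrete, here's a toy sketch in Python (the tiny vocabulary and probabilities are made up, this is not any lab's actual pipeline): pre-training pushes the model to assign high probability to whatever token actually came next in the training text, and that text can include copyrighted code.

[code]
# Toy sketch (hypothetical numbers, not a real pipeline): next-token training
# minimises cross-entropy loss, i.e. the model is rewarded for putting high
# probability on whatever token actually followed in the training text.
import numpy as np

vocab = ["BOOST_STATIC_ASSERT", "(", "true", ")", ";"]  # hypothetical tiny vocabulary

def cross_entropy(predicted_probs: np.ndarray, target_index: int) -> float:
    """Loss is low only when the model puts high probability on the real next token."""
    return -float(np.log(predicted_probs[target_index] + 1e-12))

# Suppose the model currently thinks "(" is only 20% likely after "BOOST_STATIC_ASSERT"
predicted = np.array([0.30, 0.20, 0.25, 0.15, 0.10])
target = vocab.index("(")                  # the token that actually followed in the training data
print(cross_entropy(predicted, target))    # ~1.61 -- gradient descent will push this down
[/code]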
With RLHF, annotators can "correct" the model enough times that, despite the probability distribution you spoke of, a model with a low enough temperature, asked a specific coding question with no tricks in the prompt, almost always produces the same set of results (barring small errors like missing characters or slightly incorrect logic). One reason could be that certain answers are rated correct so many times that the weights reflect this, and the tokens leading to them become so highly probable that the top-k/top-p candidates after such training all lead to the same set of answers (like the saying, "all roads lead to Rome"). You can try this yourself with LeetCode problems: the models' solutions are, conveniently, always similar to the most popular solutions that are openly available!
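Here's a rough illustration of the temperature/top-k effect I mean, using made-up logits rather than a real model's output: once a few continuations have been reinforced enough, a low temperature lets them soak up nearly all the probability mass and decoding becomes close to deterministic.

[code]
# Minimal sketch (assumed logits, not from a real model): how temperature and
# top-k narrow the "creamy layer" of the distribution. With a low temperature,
# the most-reinforced continuation dominates almost every sample.
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([4.0, 2.5, 2.0, 0.5, -1.0])    # hypothetical scores for 5 candidate tokens

def sample(logits: np.ndarray, temperature: float, top_k: int) -> int:
    scaled = logits / temperature                 # low temperature sharpens the distribution
    top = np.argsort(scaled)[-top_k:]             # keep only the k most likely tokens
    probs = np.exp(scaled[top] - scaled[top].max())
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

print([sample(logits, temperature=1.0, top_k=3) for _ in range(10)])  # some variety across picks
print([sample(logits, temperature=0.2, top_k=3) for _ in range(10)])  # almost always token 0
[/code]

Turn the temperature down and the "creamy layer" collapses to essentially one path, which is consistent with everyone getting near-identical LeetCode solutions out of these models.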
Copyright law in AI is a huge debate; an ex-OpenAI whistleblower who raised this very issue died mysteriously just last month. I don't know enough about it, as I am neither a lawyer nor have I worked at a frontier model lab, but it's definitely a problem, and I feel it shouldn't be dismissed just because the AI isn't reproducing an exact, one-to-one copy of any material in its training set.
I'd say that if staying clear of this type of legal issue matters to you, use these tools with caution!
EDIT: I missed @NetfreakBombay's response. Thanks for the clarification! This is great to know for IDE applications. Any idea what happens w.r.t. code produced in a chat with the models, such as ChatGPT's or Claude's web interface?