cross-posted from: https://nom.mom/post/121481
OpenAI could be fined up to $150,000 for each piece of infringing content. https://arstechnica.com/tech-policy/2023/08/report-potential-nyt-lawsuit-could-force-openai-to-wipe-chatgpt-and-start-over/#comments
Good
AI should not be given free rein to train on anything and everything we’ve ever created. Copyright holders should be able to decide if their works are allowed to be used for model training, especially commercial model training. We’re not going to stop a hobbyist, but Google/Microsoft/OpenAI should be paying for materials they’re using and compensating the creators.
While that’s understandable, I think it’s important to recognize that this is something where we’re going to have to tread pretty carefully.
If a human wants to become a writer, we tell them to read. If you want to write science fiction, you should both study the craft of writing ranging from plots and storylines to character development to Stephen King’s advice on avoiding adverbs. You also have to read science fiction so you know what has been done, how the genre handles storytelling, what is allowed versus shunned, and how the genre evolved and where it’s going. The point is not to write exactly like Heinlein (god forbid), but to throw Heinlein into the mix with other classic and contemporary authors.
Likewise, if you want to study fine art, you do so by studying other artists. You learn about composition, perspective, and color by studying works of other artists. You study art history, broken down geographically and by period. You study Da Vinci’s subtle use of shading and Mondrian’s bold colors and geometry. Art students will sit in museums for hours reproducing paintings or working from photographs.
Generative AI is similar. Being software (and at a fairly early stage at that), it’s both more naive and in some ways more powerful than human artists. Once trained, it can crank out a hundred paintings or short stories per hour, but some of the people will have 14 fingers and the stories might be formulaic and dull. AI art is always better when glanced at on your phone than when looked at in detail on a big screen.
In both the cases of human learners and generative AI, a neural network(-like) structure is being conditioned to associate weights between concepts, whether it’s how to paint a picture or how to create one by using 1000 words.
A friend of mine who was an attorney used to say “bad facts make bad law.” It means that misinterpretation, over-generalization, politicization, and a sense of urgency can make for both bad legislation and bad court decisions. That’s especially true when the legislators and courts aren’t well educated in the subjects they’re asked to judge.
In a sense, it’s a new technology that we don’t fully understand - and by “we” I’m including the researchers. It’s theoretically and in some ways mechanically grounded in old technology that we also don’t understand - biological neural networks and complex adaptive systems.
We wouldn’t object to a journalism student reading articles online to learn how to write like a reporter, and we rightfully feel anger over the situation of someone like Aaron Swartz. As a scientist, I want my papers read by as many people as possible. I’ve paid thousands of dollars per paper to make sure they’re freely available and not stuck behind a paywall. On the other hand, I was paid while writing those papers. I am not paid for the paper, but writing the paper was part of my job.
I realize that is a case of the copyright holder (me) opening up my work to whoever wants a copy. On the other other hand, we would find it strange if an author forbade their work being read by someone who wants to learn from it, even if they want to learn how to write. We live in a time where technology makes things like DRM possible, which attempts to make it difficult or impossible to create a copy of that work. We live in societies that will send people to prison for copying literal bits of information without a license to do so. You can play a game, and you can make a similar game. You can play a thousand games, and make one that blends different elements of all of them. But if you violate IP, you can be sued.
I think that’s what it comes down to. We need to figure out what constitutes intellectual property and what rights go with it. What constitutes cultural property, and what rights do people have to works made available for reading or viewing? It’s easy to say that a company shouldn’t be able to hack open a paywall to get at WSJ content, but does that also go for people posting open access to Medium?
I don’t have the answers, and I do want people treated fairly. I recognize the tremendous potential for abuse of LLMs in generating viral propaganda, and I recognize that in another generation they may start making a real impact on the economy in terms of dislocating people. I’m not against legislation. I don’t expect the industry to regulate itself, because that’s not how the world works. I’d just like for it to be done deliberately and realistically and with the understanding that we’re not going to get it right and will have to keep tuning the laws as the technology and our understanding continue to evolve.
Sorry this is a bit too level-headed for me, can you please repeat with a bullhorn, and use 4-letter words instead? I need to know who to blame here.
This is an astonishingly well written, nuanced, and level headed response. Really on a level I’m not used to seeing on this platform.
Well written sir.
Both an AI and an art student are complex webs of weights that take inputs and return an output. Agreed.
But the inputs are vastly different. An art student has all the inputs of every moment leading up to the point of putting paint to canvas. Emotion, hunger, pain, and every moment that life has thrown at them. All of them lead to very different results. Every art piece affects the subsequent ones.
The AI on the other hand is purely derivative. It’s only ever told about pre-existing art and a brief interpretation of it. It does not feel emotion. It does not worry about paying its bills or falling in love. It builds a map of weights once and that is that. Every input repeated however many times will yield exactly the same output.
And yes, you have the artists who are professional plagiarists, making hand-painted Picasso imitations of someone’s chihuahua for $20 over the internet. But they’re not mass producing derivative work by the thousands.
I fully agree with the shit-in, shit-out sentiment, and researchers should be free to train their models on whatever data they need.
But monetising their models, which by definition are generating derivative works, is another matter.
How do you know it is purely derivative? Are you saying an AI can’t write a sentence that has never been written before or are you saying that it can’t have an original thought? If it is writing a brand new sentence that is an amalgamation of many other writings how is that violating a copyright (or any different than a human doing it)? The copyright claims are absurd.
Not trying to argue or troll, but I really don’t get this take, maybe I’m just naive though.
Like yea, fuck Big Data, but…
Humans do this naturally, we consume data, we copy data, sometimes for profit. When a program does it, people freak out?
edit well fuck me for taking 10 minutes to write my comment, seems this was already said and covered as I was typing mine lol
It’s just a natural extension of the concept that entities have some kind of ownership of their creation and thus some say over how it’s used. We already do this for humans and human-based organizations, so why would a program not need to follow the same rules?
Because we don’t already do this. In fact, the raw knowledge contained in a copyrighted work is explicitly not copyrighted and can be done with as people please. Only the specific expression of that knowledge can be copyrighted.
An AI model doesn’t contain the copyrighted works that went into training it. It only contains the concepts that were learned from it.
There’s no learning of concepts. That’s why models hallucinate so frequently. They don’t “know” anything, they’re doing a lot of math based on what they’ve seen before and essentially taking the best guess at what the next word is.
There very much is learning of concepts. This is completely provable. You can give it problems it has never seen before and it will come up with good solutions.
Very much like humans do. Many people think that somehow their brain is special, but really, you’re just neurons behaving as neurons do, which can be modeled mathematically.
This take often denies that entropy soul or not is critically important for the types of intelligence that’s not controlled by reward and punishment with an iron fist.
It sounds like you know english words but cannot compose them. I honestly cannot parse what you said.
We can’t even map the entirety of the brain of a mouse due to the scale of how neurons work. Mapping a human brain 1:1 will eventually happen, and that’s likely going to coincide with when I’m convinced AI is capable of individual thought and actual intelligence
Just saw this today. You should check it out, nitwit: https://www.theguardian.com/science/2023/aug/15/scientists-reconstruct-pink-floyd-song-by-listening-to-peoples-brainwaves
Edit: “nitwit” was uncalled for, but I do think you are an ignorant person.
You aren’t magical. You don’t have a soul that talks to Jesus. You’re a bunch of organized electrical signals—a machine. Because your machine is carbon-based doesn’t make you special.
Edit: Downvote all you want, but we’re all still animals. Most people don’t even believe that simple fact. Then again, most people don’t even understand how their cellphone works.
I fundamentally disagree and if that’s your take on humanity I’m scared for our future.
There is a human element to us. I’m not spiritual at all. I believe when we die the lights just go out and we cease to exist. But there is undoubtedly a part of us that is still far from being replicated in a machine. I’m not saying it won’t happen, I’m saying we’re a long way from it and what we’re seeing out of current AI is nothing even close to resembling intelligence.
It might be nice if we reserve some things just for humans 🤷🏻‍♂️
That doesn’t sit well with me. I agree that, to some extent, artists and writers should be compensated for their work, even if it just means those interested in creating training sets have to buy a copy of each work they intend to use in a training set so long as it can’t be legally acquired for free (similar to how a human has to buy, “”“buy”“” and/or borrow a book if they want to study it).
However, at the same time, this mindset opens the door to actual racism, not the silly “hurr, my skin color’s better than your skin color” bullshit we call racism, but the much nastier, “there are actual differences between you and I which I will use to justify my poor behavior” kinda racism; and when your academic partner has the potential to outclass you in nearly every way (assuming most general AI would decide to work in STEM fields), it’s much easier to justify your bigotry. That bigotry may then be learned by the AI and spit back at you; but this time, the accusations of inferiority may truly be justified.
I mean, think of it this way, what if someone created a general AI that displays all the characteristics of a human to the point of being seemingly indistinguishable from one? Should they not be considered a person? Should they not then be given the same rights as any other person?
Maybe it’s not possible to create a general AI, but maybe we eventually encounter aliens; the universe is a big place after all. Should they not also be given the same rights as a person?
The AI problem is so much larger than I think most people realize. The people making these are trying to create life, even if they don’t realize it. Just because it’s a program or amalgamation of programs that run on silicon and copper doesn’t mean it’s any less alive than an amalgamation of programs running on chemical reactions and electric impulses. It’s just a different kind of alive, like how a car that uses electricity and a car that uses internal combustion are both cars, they just have different ways of doing the same thing. That’s not to say current AI is anywhere near as intelligent as a dog, cat or human, but it has the potential to one day become truly intelligent.
It’s also easy to assume that these are all issues that will be solved in the future, but we have plenty of examples even now of how kicking the can down the street isn’t really an intelligent strategy. Look at how well that can-kicking is turning out in regards to climate change, wealth inequality, healthcare, LGBT+ and BIPOC rights, etc. Regarding AI, I believe there are a lot of hang-ups that we as humans have, whether conscious or unconscious, when it comes to tolerating beings unlike us (we’re still struggling with the skin-color racism); and that it’s better to start working on them now than to wait until Mr. Roboto has his chassis smashed by a bunch of neo-luddites who insist that he’s just a bunch of circuits formed into a crude imitation of humanity.
Edit: you could also make the argument that choosing not to extend personhood to an intelligent machine opens the door to prejudice and bigotry in regards to transhumanism. At some point, we as humans will start modifying ourselves, via either meat or circuitry, and when that happens, there’ll be plenty of people trying to argue that Joe isn’t a human because he’s had his whole brain replaced with a computer. It doesn’t matter if the surgeons replaced his brain step-by-step to ensure Joe himself wasn’t lost in the process; they’ll argue that since the brain is what makes Joe, “Joe”, then he must not be human because his brain is no longer organic.
Edit 2: Also, I apologize if I misinterpreted your statement. I’ve seen way too many people saying that AI should never, ever be treated as a person.
this is a super-reach, why don’t we deal with AI indistinguishable from human when it happens.
right now what we have is a language model that is very distinguishable from human so it doesn’t get any human considerations.
if a monkey or chicken created an artwork, it doesn’t have copyrights, because it’s not human either.
I like that argument as it applies to our AI, which isn’t meant to reject bad ideas or motifs but to never have a bad idea in the first place. This setup results in the bot’s path of least resistance being to copy someone’s homework. Nobody wants the bot to do that.
Someday we may have AI that argument is harder to apply to
I’ll attempt to explain, though it’s irrelevant:
Text generators have a “most correct” output that looks and behaves similar to pressing the first of the keyboard’s suggested words repeatedly. We add noise, where the bot is, on a dice roll, forced to add a random letter to its output. It’s like the above example if you typed a 5-letter word every so often instead.
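For anyone who wants to see the "first keyboard suggestion" versus "dice roll" difference concretely, here is a minimal sketch. It assumes the noise is applied by sampling from the suggestions (often called temperature sampling) rather than by inserting random letters as described above, and the `pick_next` helper and its scores are made up for illustration, not taken from any real model.

```python
import random

def pick_next(suggestions, noise=0.0, rng=random):
    """Pick the next word from a dict of candidate word -> positive score.

    noise == 0 always takes the top suggestion (the "keep pressing the first
    keyboard suggestion" behaviour). noise > 0 rolls weighted dice instead.
    """
    if noise == 0:
        return max(suggestions, key=suggestions.get)
    words = list(suggestions)
    # Raising each score to 1/noise flattens the odds as noise grows,
    # so unlikely words get picked more often.
    weights = [suggestions[w] ** (1.0 / noise) for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

# Hypothetical scores for the word following "the cat sat on the".
suggestions = {"mat": 0.6, "sofa": 0.3, "moon": 0.1}

print(pick_next(suggestions))              # always the top suggestion
print(pick_next(suggestions, noise=1.0))   # varies from run to run
```

With `noise=0` the same input always gives the same output. Any positive value makes repeated runs diverge, which is roughly why the same prompt can give different answers.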
So if I make an AI of the Google name and logo, it’s cool? I’m pretty sure it’s not.
No.
.
We 100% need to ensure that automation and AI benefits everyone, not a few select companies. But copyright is totally the wrong mechanism for that.
A pen is not a creative work. A creative work is much different than something that’s mass produced.
Nobody is limiting how people can use their pc. This would be regulations targeted at commercial use and monetization.
Writers can already do that. Commercial licensing is a thing.
… Google’s proposed Web Integrity API seems like a move in that direction to me.
But that’s beside the point, I was trying to establish the principle that people who make things shouldn’t be able to impose limitations on how these things are used later on.
Why should that difference matter, in particular when it comes to the principle I mentioned?
Because creative works are rather obviously fundamentally different from physical objects, in spite of a number of shared qualities.
Like physical objects, they can be distinguished one from another - the text of Moby Dick is notably different from the text of Waiting for Godot, for instance
More to the point, like physical objects, they’re products of applied labor - the text of Moby Dick exists only because Herman Melville labored to bring it into existence.
However, they’re notably different from physical objects insofar as they’re quite simply NOT physical objects. The text of Moby Dick - the thing that Melville labored to create - really exists only conceptually. It’s of course presented in a physical form - generally as a printed book - but that physical form is not really the thing under consideration, and more importantly, the thing to which copyright law applies (or in the case of Moby Dick, used to apply). The thing under consideration is more fundamental than that - the original composition.
And, bluntly, that distinction matters and has to be stipulated because selectively ignoring it in order to equivocate on the concept of rightful property is central to the NoIP position, as illustrated by your inaccurate comparison to a pen.
Nobody is trying to control the use of pens (or computers, as they were being compared to). The dispute is over the use of original compositions - compositions that are at least arguably, and certainly under the law, somebody else’s property.
It’s not like AI is using works to create something new. Chatgpt is similar to if someone were to buy 10 copies of different books, put them into 1 book as a collection of stories, then mass produce and sell the “new” book. It’s the same thing but much more convoluted.
Edit: to reply to your main point, people who make things should absolutely be able to impose limitations on how they are used. That’s what copyright is. Someone else made a song, can you freely use that song in your movie since you listened to it once? Not without their permission. You wrote a book, can I buy a copy and then use it to make more copies and sell? Not without your permission.
it’s not even close to that black and white… i’d say it’s a much more grey area:
possibly that you buy a bunch of books by the same author and emulate their style… that’s perfectly acceptable until you start using their characters
if you wrote a research paper about the linguistic and statistical information that makes an author’s style, that also wouldn’t be a problem
so there’s something beyond just the author’s “style” that they think is being infringed. we need to sort out exactly where the line is. what’s the extension to these 2 ideas that makes training an LLM a problem?
No, someone emulating someone else’s style is still going to have their own experiences, style, and creativity make their way into the book. They have an entire lifetime of “training data” to draw from. An AI that would “emulate” someone else’s style would really only be able to refer to the author’s books, or someone else’s books, therefore it’s stealing. Another example: if someone decided to remix different parts of a musician’s catalogue into one song, that would be a copyright infringement. AI adds nothing beyond what it’s trained on, therefore whatever it spits out is just other people’s works in a different way.
we output nothing other than what we’re trained on; the only difference is that we’re allowed to roam the world freely and consume whatever information we stumble on
what you say would be true if the LLM were only trained on content by the author seeking to say that their works had been infringed, however these LLMs include a lot of other data from public domain sources
one could consider these public domain sources and our experience of the world to be synonymous (and if you don’t i’d love to hear the distinction), in which case there’s some kind of a line that you seem to be drawing, and again i’d love to hear where you think that line is
is it just ratio? there’s precedent to that for sure: current law has fair use rules which stipulate things like “amount and substantiality”. in that case the question becomes one of defining the ratio. certainly the ratio of content that the author is referring to vs the content not trained by the author is minuscule
I agree with what you’re saying, and a model that is only trained on public domain would be fine. I think the very obvious line is that it’s a computer program. There seems to be a want for computers to be human but they aren’t. They don’t consume media for their own enjoyment, they are forced to do it so someone can sell the output as a product. You can’t compare the public domain to life.
Except it’s not a collection of stories, it’s an amalgamation - and at a very granular level at that. For instance, take the beginning of a sentence from the middle of the first book, then switch to a sentence in the 3rd, then finish with another part of the original sentence. Change some words here and there, add one for good measure (based on some sentence in the 7th book). Then fix the grammar. All the while, keeping track that there’s some continuity between the sentences you’re stringing together.
That counts as “new” for me. And a lot of stuff humans do isn’t more original.
The maybe bigger argument against free-reign training is that you’re attributing personal rights to a language model. Also even people aren’t completely free to derive things from memory (legally) which is why clean-room-design is a thing.
That is not even close to correct. LLMs are little more than massively complex webs of statistics. Here’s a basic primer:
https://arstechnica.com/science/2023/07/a-jargon-free-explanation-of-how-ai-large-language-models-work/
I’ve coded LLMs, I was just simplifying it because at its base level it’s not that different. It’s just much more convoluted as I said. They’re essentially selling someone else’s work as their own, it’s just run through a computer program first.
it’s nothing like that at all… if someone bought a book and produced a big table of words and the likelihood that each word would be followed by another word, that’s what we’re talking about: it’s abstract statistics
actually, that’s not even what we’re talking about… we then take that word table and then combine it with hundreds of thousands of other tables until the result is so far from the original as to be completely untraceable back to the original work
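A toy version of that "table of words" idea, as a rough sketch only: real LLMs are neural networks over tokens and do not store literal word tables, so this illustrates the analogy and not how any actual model is built. The two "books" and all function names are invented for the example.

```python
from collections import Counter, defaultdict

def next_word_table(text):
    """Count, for each word, which words follow it and how often."""
    words = text.lower().split()
    table = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        table[current][following] += 1
    return table

def merge_tables(tables):
    """Add the counts from many per-book tables into one combined table."""
    merged = defaultdict(Counter)
    for table in tables:
        for word, followers in table.items():
            merged[word].update(followers)
    return merged

def probabilities(table, word):
    """Turn the raw counts for one word into next-word likelihoods."""
    followers = table[word]
    total = sum(followers.values())
    return {nxt: count / total for nxt, count in followers.items()}

# Two tiny stand-in "books" (hypothetical text).
book_a = "the cat sat on the mat"
book_b = "the dog sat on the log"

combined = merge_tables([next_word_table(book_a), next_word_table(book_b)])
print(probabilities(combined, "the"))
```

After merging, the entry for "the" spreads its weight evenly over "cat", "mat", "dog" and "log", and nothing in the combined table records which book contributed which count.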
If it were trained on a single book, the output would be the book. That’s the base level without all the convolution and that’s what we should be looking at. Do you also think someone should be able to train a model on your appearance and use it to sell images and videos, even though it’s technically not your likeness?
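The single-book claim can be checked on a toy next-word model. This is a sketch under heavy assumptions: the "model" is a lookup from the previous two words to the words that followed them, which is far simpler than an LLM, and the "book" is just the opening words of Moby Dick (public domain). The `train` and `generate` helpers are hypothetical names.

```python
def train(text, context=2):
    """Map each run of `context` words to the list of words seen right after it."""
    words = text.split()
    model = {}
    for i in range(len(words) - context):
        key = tuple(words[i:i + context])
        model.setdefault(key, []).append(words[i + context])
    return model

def generate(model, start, length):
    """Repeatedly append the most frequent follower of the last few words."""
    out = list(start)
    context = len(start)
    while len(out) < length:
        followers = model.get(tuple(out[-context:]))
        if not followers:
            break  # this context was never seen in training
        out.append(max(set(followers), key=followers.count))
    return " ".join(out)

book = "call me ishmael some years ago never mind how long precisely"
words = book.split()

model = train(book)
regenerated = generate(model, words[:2], len(words))
print(regenerated == book)
```

Because every two-word context in this one text has exactly one continuation, the toy model can only replay the text it saw. Once many texts share the same contexts the continuations start to compete, so the sketch shows the single-source case only and does not settle what happens at scale.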
I can see your argument; it’s just that your metaphor wasn’t very strong, and I think it made things a bit confusing
Google web integrity is very much different than what I’m proposing. “Nobody” was more in relation to regulating this.
I hold the opposite opinion in that creatives (I’d almost say individuals only, no companies) own all rights to their work and can impose any limitations they’d like on (edit: commercial) use. Current copyright law doesn’t extend quite that far though.
A creative work is not a reproducible, quantifiable product. No two are exactly alike until they’re mass produced.
Your analogy works more with a person rather than a pen, in that why is it ok when a person reads something and uses it as inspiration and not a computer? This comes back around to my argument about transformative works. An AI cannot add anything new, only guess based on historical knowledge. One of the best traits of the human race is our ability to be creative and bring completely new ideas.
Edit: added in a commercial use specifier after it was pointed out that the rules over individuals would be too restrictive.
I think that point’s worth discussing by itself - leaving aside the AI - as you wrote it quite generally.
I came up with some examples:
Taking your statement at face value - the answers should be: no (I can’t decorate), yes (it’s a valid restriction), and no (I can’t use it to illustrate my argument). But maybe you didn’t mean it quite that strict? What do you think on each example and why?
Fair points. I think the restrictions in most part would have to be in place for commercial use primarily.
So under your examples
Yes, you should. As there’s no commercial usage you’re not profiting off of their work, you’re simply using your copy of it to decorate a personal space
If we restrict the copyright protections to only apply to commercial use then this becomes a non-issue. The copyright extends to reproduction (or performance in the case of music) of the work in any kind, but does not extend to complete control over personal usage.
Personal interpretation is fine. If you start using that argument in some kind of publication or “performance”, then you end up with fair use being called into question. Quoting, with appropriate attribution is fine, but say you print a chapter of the book, then a chapter of critique. Where is that line drawn? Right now it’s ambiguous at best, downright invisible at most times.
I appreciate the well thought out response. I hold strong views on copyright of an individual’s creative work as a musician and developer, and believe that they should have control over how their products are used to make money. These views probably are a little too restrictive for the general public, and probably won’t ever garner a huge amount of support.
I dropped the ball on making sure to specify use as in commercial use, I’ll put an edit at the bottom of the op to clarify it too
All of the examples you listed have nothing to do with how OpenAI was created and set up. It was trained on copyrighted work, how is that remotely comparable to purchasing a pen?
Would a more apt comparison be a band posting royalties to all of their influences?
i think that’s a pretty good analogy that i haven’t heard before!
You made two arguments for why they shouldn’t be able to train on the work for free and then said that they can with the third?
Did openai pay for the material? If not, then it’s illegal.
Additionally, copyright and trademarks and patents are about reproduction, not use.
If you bought a pen that was patented, then made a copy of the pen and sold it as yours, that’s illegal. This is the analogy of what OpenAI is doing with books.
Plagiarism and reproduction of text is the part that is illegal. If you take the “ai” part out, what openai is doing is blatantly illegal.
Just now, I tried to get Llama-2 (I’m not using OpenAI’s stuff cause they’re not open) to reproduce the first few paragraphs of Harry Potter and the Philosopher’s Stone, and it didn’t work at all. It created something vaguely resembling it, but with lots of made-up stuff that doesn’t make much sense. I certainly can’t use it to read the book or pirate it.
Openai:
That doesn’t mean the copyrighted material isn’t in there. It also doesn’t mean that the unrestricted model can’t.
Edit: I didn’t get it to tell me that it does have the verbatim text in its data.
Here we go, I can get chat gpt to give me sentence by sentence:
Most publicly available/hosted models (self-hosted models are an exception to this) have an absolute laundry list of extra parameters and checks that are run on every query to limit the model as much as possible and tailor the outputs.
This wasn’t even hard… I got it spitting out random verbatim bits of Harry Potter. It won’t do the whole thing, and some of it is garbage, but these are pretty clear copyright violations.
Maybe it’s trained not to repeat JK Rowling’s horseshit verbatim. I’d probably put that in my algorithm. “No matter how many times a celebrity is quoted in these articles, do not take them seriously. Especially JK Rowling. But especially especially Kanye West.”
It’s not repeating its training data verbatim because it can’t do that. It doesn’t have the training data stored away inside itself. If it did the big news wouldn’t be AI, it would be the insanely magical compression algorithm that’s been discovered that allows many terabytes of data to be compressed down into just a few gigabytes.
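To put rough numbers on the compression point, here is a back-of-the-envelope sketch. Both sizes are hypothetical placeholders standing in for "many terabytes" and "a few gigabytes", not measurements of any real model or dataset.

```python
# All figures are assumed for illustration only.
TRAINING_DATA_BYTES = 10 * 10**12  # assume 10 TB of training text
MODEL_BYTES = 5 * 10**9            # assume a 5 GB model file

def implied_compression_ratio(data_bytes, model_bytes):
    """How many bytes of training data there are per byte of model."""
    return data_bytes / model_bytes

ratio = implied_compression_ratio(TRAINING_DATA_BYTES, MODEL_BYTES)
print(f"{ratio:.0f}:1")  # prints "2000:1" for the assumed sizes
```

General-purpose lossless compressors usually shrink plain English text by roughly a single-digit factor, so a ratio in the thousands could not hold everything verbatim. The arithmetic only limits how much can be stored on average, though. On its own it says nothing about whether particular, frequently repeated passages end up memorised.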
Do you remember quotes in english ascii /s
Tokens are even denser than ASCII, similar to word “chunking”. My guess is it’s like lossy video compression but for text: [Attacked] with [lasers] by [death eaters] upon [margret]; [has flowery language]; word [margret] [comes first] (Theoretical example has 7 “tokens”)
It may have actually formed a really good impression of that book, as it’s likely read it lots of times.
If it’s lossy enough then it’s just a high-level conceptual memory, and that’s not copyrightable.
It varies based on how much time it’s been given with the media.
You are reading my comment right now. In my comment, I am letting you know that Sidehill Gougers come in both clockwise and counterclockwise breeds.
Oh no! You just learned that fact for free! I didn’t give you permission to learn from my comments, even though I deliberately published it here for you to read. I demand that you either pay me or wipe that ill-gotten knowledge from your mind.
Don’t you dare tell anyone else about Sidehill Gougers. That’s illegal.
Keep on that strawman, my good man.
But it was so funny :shrug: /s
Good job demonstrating you don’t understand the underlying point.
Computer manufacturers aren’t making AI software. If someone uses an HP copier to make illegal copies of a book and then distributes those pages to other people for free, the person that used the copier is breaking the law, not the company that made the copier.
A pen manufacturer isn’t repurposing other peoples’ work to make their pens.
A computer manufacturer has to license the intellectual property that they use to make their computers.
They didn’t pay the writers though, that’s the whole point
True - but I don’t think the goal here is to establish that AI companies must purchase 1 copy of each book they use. Rather, the point seems to be that they should need separate, special permission for AI training.
100% this! there are separate licenses for personal listening, public performance, use in another work (movie and TV)… there will likely be a license added for AI training to which some authors will opt into, some will opt out of… it’ll likely start very expensive, nobody will pay, someone will offer up old works that aren’t selling well for bargain basement prices, make a killing, then others will see the success and slowly prices will follow and eventually prices will sit at a happy medium where AI companies can tolerate and copyright holders aren’t feeling screwed… well, i mean, they’ll be being screwed but their publishers will be making bank
that’s my totally out of thin air prediction anyway
I believe this is where it’ll inevitably go. However I’m not sure it’ll be just AI, rather hopefully more protections around individual creative work and how that can be used by corporations for internal or external data collection.
This really does depend on privacy laws as well and probably data collection, retention and usage too.
There is no law that requires them to be paid for this.
That probably depends a lot on definitions of terms of legalese, but there should be a law explicitly for this in every civilised country.
With that mindset, only the powerful will have access to these models.
Places like Reddit, Google, Facebook, etc, places that can rope you into giving away rights to your data with TOS stipulations.
Locking down everything available on the Internet by piling more bullshit onto already draconian copyright rules isn’t the answer and it surprises the shit out of me how quickly fellow artists, writers, and creatives piled onto the side with Disney, the RIAA, and other former enemies the second they started perceiving ML as a threat to their livelihood.
I do believe restrictions should be looked into when it comes to large organizations and industries replacing creators with ML, but attacking open ML models directly is going to result in the common folk losing access to the tools and corporations continuing to work exactly as they are right now by paying for access to locked-down ML based on content from companies who trade in huge amounts of data.
Not to mention it’s going to give the giants who have been leveraging their copyright powers against just about everyone on the internet more power to do just that. That’s the last thing we need.
What’s the basis for this? Why can a human read a thing and base their knowledge on it, but not a machine?
Because a human understands and transforms the work. The machine runs statistical analysis and regurgitates a mix of what it was given. There’s no understanding or transformation, it’s just what is statistically the 3rd most correct word that comes next. Humans add to the work, LLMs don’t.
Machines do not learn. LLMs do not “know” anything. They make guesses based on their inputs. The reason they appear to be so right is the scale of data they’re trained on.
This is going to become a crazy copyright battle that will likely lead to the entirety of copyright law being rewritten.
I don’t know if I agree with everything you wrote but I think the argument about llms basically transforming the text is important.
Converting written text into numbers doesn’t fundamentally change the text. It’s still the author’s original work, just translated into a vector format. Reproduction of that vector format is still reproduction without citation.
But it’s not just converting them into a different format. It’s not even storing that information at all. It can’t actually reproduce anything from the dataset unless it is really small or completely overfitted, neither of which apply to GPT with how massive it is.
Each neuron, which represents a word or a phrase, is a set of weights. One source makes a neuron go up by 0.000001% and then another source makes it go down by 0.000001%. And then you repeat that millions and millions of times. The model has absolutely zero knowledge of any specific source in its training data, it only knows how often different words and phrases occur next to each other. Or for images it only knows that certain pixels are weighted to be certain colors. Etc.
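For anyone who wants to see the "tiny nudges" idea concretely, here is a minimal sketch, assuming a single weight fitted by plain stochastic gradient descent. It is nowhere near how a GPT-scale model is trained; the function name `sgd_fit`, the learning rate, and the data points are all made up for illustration.

```python
def sgd_fit(examples, lr=0.01, epochs=200):
    """Fit y ~ w * x with one weight, nudging it slightly for every example."""
    w = 0.0
    for _ in range(epochs):
        for x, y in examples:
            error = w * x - y      # how far off the current guess is
            w -= lr * error * x    # small push up or down, depending on the sign
    return w

# Hypothetical training data: four noisy points roughly following y = 2x.
examples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
w = sgd_fit(examples)
print(round(w, 2))
```

The final `w` lands close to 2, the trend shared by all four examples, and none of the individual (x, y) pairs can be read back out of that one number. Whether the same holds for a model with billions of weights is exactly what this thread is arguing about.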
This is a misunderstanding on your part. While some neurons are trained this way, word2vec and doc2vec are not these mechanisms. The llms are extensions of these models and while there are certainly some aspects of what you are describing, there is a transcription into vector formats.
This is the power of vectorization of language (among other things). The one-to-one mapping between vectors and words, sentences, documents, and so forth allows models to describe the distance between words or phrases using Euclidean geometry.
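A rough sketch of what "distance between words" means here, assuming hand-written 3-dimensional vectors. Real word2vec embeddings are learned from text and have hundreds of dimensions; the numbers in `toy_vectors` are invented purely for illustration.

```python
import math

# Hypothetical embeddings, not taken from any real model.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

cat_dog = euclidean(toy_vectors["cat"], toy_vectors["dog"])
cat_car = euclidean(toy_vectors["cat"], toy_vectors["car"])
print(cat_dog < cat_car)  # True: "cat" sits closer to "dog" than to "car"
```

In practice cosine similarity is often used instead of raw Euclidean distance, but the idea is the same: related words end up near each other.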
I was trying to make it as simple as possible. The format is irrelevant. The model is still storing nothing but weights at the end of the day. Storing the relationships between words and sentences is not the same thing as storing works in a different format which is what your original comment implied.
I’m sorry you failed to grasp how it works in this context.
You made me really interested in this concept so I asked GPT-4 what the furthest word away from the word “vectorization” would be.
Interesting game! If we’re aiming for a word that’s conceptually, contextually, and semantically distant from “vectorization,” I’d pick “marshmallow.” While “vectorization” pertains to complex computational processes and mathematics, “marshmallow” is a soft, sweet confectionery. They’re quite far apart in terms of their typical contexts and meanings.
It honestly never ceases to surprise me. I’m gonna play around with some more. I do really like the idea that it’s essentially a word calculator.
Try asking it how the vectorization of king and queen are related.
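The king/queen relationship can be sketched with toy numbers, assuming two made-up dimensions (royalty, maleness). The `vectors` table and the `analogy` helper are hypothetical; a real model learns its vectors and the arithmetic only works approximately.

```python
import math

# Hypothetical 2-D embeddings: (royalty, maleness). Invented for illustration.
vectors = {
    "king":  [0.95, 0.90],
    "queen": [0.90, 0.10],
    "man":   [0.10, 0.95],
    "woman": [0.05, 0.10],
    "apple": [0.00, 0.50],
}

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(target, vectors[w]))

result = analogy("king", "man", "woman")
print(result)
```

With these numbers, king - man + woman lands right next to "queen", which is the classic word2vec party trick.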
At some level, isn’t what a human brain does also effectively some form of very very complicated mathematical algorithm, just based not on computer modeling but on the behavior of the physical systems (the neurons in the brain interacting in various ways) involved under the physical laws the universe presents? We don’t yet know everything about how the brain works, but we do at least know that it is a physical object that does something with the information given as inputs (senses). Given that we don’t know for sure how exactly things like understanding and learning work in humans, can we really be absolutely sure what these machines do doesn’t qualify?
To be clear, I’m not really trying to argue that what we have is a true AI or anything, or that what these models do isn’t just some very convoluted statistics, I’ve just had a nagging feeling in the back of my head ever since chatGPT and such started getting popular along the lines of “can we really be sure that this isn’t (a very simple form of) what our brains, or at least a part of it, actually do, and we just can’t see it that way because that’s not how it internally “feels” like?” Or, assuming it is not, if someone made a machine that really did exhibit knowledge and creativity, using the same mechanism as humans or one similar, how would we recognize it, and in what way would it look different from what we have (assuming it’s not a sci-fi style artificial general intelligence that’s essentially just a person, and instead some hypothetical dumb machine that nevertheless possesses genuine creativity or knowledge.) It feels somewhat strange to declare with certainty that a machine that mimics the symptoms of understanding (in the way that they can talk at least somewhat humanlike, and explain subjects in a manner that sometimes appears thought out. It can also be dead wrong of course but then again, so can humans), definitely does not possess anything close to actual understanding, when we don’t even know entirely what understanding physically entails in the first place.
It’s also the scale of their context, not just the data. More (good) data and lots of (good) varied data is obviously better, but the perceived cleverness isn’t owed to data alone.
I do hope copyright law gets rewritten. It is dated and hasn’t kept up with society or technology at all.
I think this is very unlikely. All of law is precedent.
Google uses copyrighted works for many things that are “algorithmic” but not AI and people aren’t shitting themselves over it.
Why would AI be different? So long as copyright isn’t infringed at least.
That machine is a commercial product. Quite unlike a human being, in essence, purpose and function. So I do not think the comparison is valid here unless it were perhaps a sentient artificial being, free to act of its own accord. But that is not what we’re talking about here. We must not be carried away by our imaginations, these language models are (often proprietary and for profit) products.
I don’t see how that’s relevant. A company can pay someone to read copyrighted work, learn from it, and then perform a task for the benefit of the company related to the learning.
But how did that person acquire the copyrighted work? Was the copyrighted material paid for?
That’s the crux of the issue, Open AI isn’t paying for the copyrighted work they are “reading”, are they?
What does paying for anything have to do with what we’re talking about here? They’re ingesting freely available content that anyone with a web browser could read.
Bullshit. If I learn engineering from a textbook, or a website, and then go on to design a cool new widget that makes millions, the copyright holder of the textbook or website should get zero dollars from me.
It should be no different for an AI.
Agreed. Royalties are a capitalist invention
While I agree, corporations shouldn’t make bucks on knowledge (sorta) that they basically eavesdropped on and violated the privacy of millions of people to get.
AI solutions are made from people’s ideas, and should be freely accessible by the people by definition. It not being sustainable as a business model is also a feature in this case, since there’d be no intrinsic incentive to steal data and violate privacy.
Yes, but what about you going into teaching engineering, and writing a text book for it that is awfully close to the ones you have used? Current AI is at a stage where it just “remixes” content it gobbled in, and not (yet) advanced enough to actually learn and derive from it.
Last time I looked, textbooks were fucking expensive. You might be able to borrow one from the library, of course. But most people who study something pay up front for the information they’re studying on
Every time I see this argument it reminds me of how little people understand how copyright works.
The crux is fair compensation. The rights holder has to agree to the usage, with clear terms and conditions for their creative works, in exchange for a monetary sum (single or recurring) and/or a service of similar or equal value with a designated party. That’s why AI continues to be in hot water. Just because you can suck up the data does not mean the data is public domain. Nor does it mean the license used between interested parties transfers to an AI company during collection. If AI companies want to monetize their services, they’re going to have to provide fair compensation for the non-public domain works used.
Human experience considers context, experience, and relation to previous works
‘AI’ has the words verbatim in its database and will occasionally spit them out verbatim
It doesn’t. The original data is nowhere in its dataset. Words are nowhere in its dataset. It stores how often certain tokens (numbers computationally equivalent to language fragments; not even words, but just a few letters or punctuation, often chunks of words) are found together in sentences written by humans, and uses that to generate human-sounding sentences. The sentences it returns are thereby a massaged average of what it predicts a human would say in that situation.
If you say “It was the best of times,” and it returns “it was the worst of times.”, it’s not because “it was the best of times, it was the worst of times.” is literally in its dataset, it’s because after converting what you said to tokens, its dataset shows that the latter almost always follows the former. From the AI’s perspective, it’s like you said the token string (03)(153)(3181)(359)(939)(3)(10)(108), and it found that the most common response to that by far is (03)(153)(3181)(359)(61013)(12)(10)(108).
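To make the token-ID picture above concrete, here is a toy sketch, assuming whole words as tokens and a plain table of "what followed what" counts. That is a big simplification: real LLMs use sub-word tokens and learned weights rather than a lookup table, and the names `vocab`, `encode`, and `predict_next` are made up for this example.

```python
from collections import Counter, defaultdict

text = "it was the best of times it was the worst of times"

# Assign each distinct word an integer ID in order of first appearance.
vocab = {}
for word in text.split():
    vocab.setdefault(word, len(vocab))
id_to_word = {i: w for w, i in vocab.items()}

def encode(s):
    """Turn a string into a list of token IDs."""
    return [vocab[w] for w in s.split()]

# Count which token ID follows which.
follow_counts = defaultdict(Counter)
ids = encode(text)
for current, nxt in zip(ids, ids[1:]):
    follow_counts[current][nxt] += 1

def predict_next(word):
    """Return the word that most often followed `word` in the text."""
    best_id, _count = follow_counts[vocab[word]].most_common(1)[0]
    return id_to_word[best_id]

print(encode("it was the"))   # [0, 1, 2]
print(predict_next("was"))    # the
```

The table only knows that token 1 is usually followed by token 2. With a corpus this small you could still rebuild the original sentence from it, which is roughly the memorization worry raised elsewhere in the thread.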
Impressioning and memorization: it memorised the impression (“sensation”) of what it’s like to have the text “It was the best of times,” in the buffer, and “instinctively” outputs its impression, “it was the worst of times,” knowing each letter it added was the most “correct” and rewarding.
Sorry, wrong reply
I disagree. I think that there should be zero regulation of the datasets as long as the produced content is noticeably derivative, in the same way that humans can produce derivative works using other tools.
Good in theory. The problem is that if your bot is given too much exposure to a specific piece of media, and the “creativity” value that adds random noise (and for some setups forces it to improvise) is too low, you get whatever impression the content made on the AI, like an imperfect photocopy (non-expert here; this is what gets called “memorization”). Too high and you get random noise.
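A minimal sketch of that trade-off, assuming the “creativity” value refers to sampling temperature (my assumption; the comment doesn’t name it). The scores in `logits` are invented, and `softmax_with_temperature` is a hypothetical helper, not any particular library’s API.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities; temperature rescales the scores first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
logits = [4.0, 2.0, 1.0]

low = softmax_with_temperature(logits, 0.1)     # nearly all probability on the top token
high = softmax_with_temperature(logits, 100.0)  # close to uniform
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At low temperature the model picks its strongest impression almost every time, which is where the photocopy-like outputs come from; at very high temperature every token is about equally likely and the output drifts toward noise.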
Then it’s a cheap copy, not noticeably derivative, and whoever is hosting the trained bot should probably take it down.
Then the bot is trash. Legal and non-infringing, but trash.
There is a happy medium where SD, MJ, and many other text-to-image generators currently exist. You can prompt in such a way (or exploit other vulnerabilities) to create “imperfect photocopies,” but you can also create cheap, infringing works with any number of digital and physical tools.
LLMs are not human, the process to train an LLM is not human-like, and LLMs don’t have human needs or desires, or rights for that matter.
comparing it to humans has been a flawed analogy since day 1.
Llm no desires = no derivative works? Let llm handle your comments they will make more sense
I think any LLM should be required to be free to use. They can pay for extra bells and whistles like document upload but the core model must be free. They’re free to make their billions, but it shouldn’t be on a model built by scraping all the information of humanity for free.
I think this is an even better solution than making them scrap it or pay everyone some token amount.
I understand the sentiment (and agree on moral grounds) but I think this would put us at an extreme disadvantage in the development of this technology compared to competing nations. Unless you can get all countries to agree and somehow enforce this, I think it dramatically hinders our ability to push forward in this space.
They pay for it, simple.
Think about code that an expert Samsung developer wrote, where understanding and executing it flawlessly took 20 years of his/her experience. That person is the only one skilled enough to write it, but an LLM stole it and is suggesting it to every dev around the world.
That’s a good thing if the dev gets paid to teach the model and then we pay to subscribe to it. Right now it’s breaking the economy. Organisations and startups are abusing the knowledge and laying off skilled workers.
Nope, you’re looking at it wrong. The Dev got paid to write that code and for all of their 20 years experience. The code was freely given away after that. Nobody loses when knowledge is shared, humanity wins. It gets hairy when you have businesses whose model relies on giving some content away for free and locking some behind a pay wall. Obviously using all of that to train a model without paying anything implies that they never had a subscription, but if they did have one and gave the model access? What’s the difference between that and paying someone to read all those articles? What’s the difference between training a model and paying an employee while training them to expertise? We’re acting like these models are some kind of machine that chops up text and regurgitates it, but that could describe your average college freshman just as well. We’re fast approaching the point where the distinction is meaningless. We can’t treat model training any different from teaching a student.
It should be available for everyone to learn, agreed. But intellectual property and copyright still means something. Artists don’t post anything online for others to steal. They want to share their work and humans look at those to learn and take inspiration.
Obviously I’m talking at a philosophical level and everyone is allowed to have their opinion on it, but I strongly believe that they should also have to follow etiquettes and only use open source and
I disagree. However, I believe the models should be open sourced by law.
Open sourcing the models does absolutely nothing. The fact of the matter is that the people who create these models aren’t able to quantifiably show how they work, because those levels have been abstracted so far into code that there’s no way to understand them.
I understand that.
What I am trying to express is that the models should not be hoarded by large corporations. Because they used the open information of the internet to train, they should be available to everyone. Sort of like a library.
On a side note, models can definitely be open sourced, there are several already.
You sound like an old man who’s scared of changing times.
Or a creative who hates to see the entire soul of the human race boiled down to a computer doing a whole lot of math.
AI isn’t going to put office workers out of a job, not just yet, but it’s sure going to end the careers of a whole lot of artists who won’t get entry level opportunities anymore because an AI is able to do 90% of the job and all they need is someone to sort the outputs.
Yeah! Let’s burn fair use to the ground! Technology is scary! Destroy it all!
I don’t think AI is criticising or parodying that content. Also ChatGPT is a glorified chatbot that can just make its answers seem human, it’s not some world saving technology.