It’s all made from our data, anyway, so it should be ours to use as we want

  • FaceDeer@fedia.io
    arrow-up
    23
    ·
    16 hours ago

    Legislation that prohibits publicly-viewable information from being analyzed without permission from the copyright holder would have some pretty dramatic and dire unintended consequences.

    • just_another_person@lemmy.world
      English
      arrow-up
      2
      arrow-down
      6
      ·
      15 hours ago

      Not really. The same way you can’t sell live and public performance music for profit and not get sued. Case law right there, and the fact it’s performance vs publicly published doesn’t matter. How the owner and originator classifies or licenses it is the defining classification. It’s going to be years before anyone sees this get a ruling in court though.

      • FaceDeer@fedia.io
        arrow-up
        16
        ·
        15 hours ago

        That’s not what’s going on here, though. The LLM doesn’t contain the actual copyrighted data; it’s the result of analyzing the copyrighted data.

        An analogous example would be a site like TV Tropes. TV Tropes doesn’t contain the works that it’s discussing, it just contains information about those works.

        • Superb@lemmy.blahaj.zone
          English
          arrow-up
          1
          arrow-down
          2
          ·
          10 hours ago

          No, the model does retain the original works in a lossy compression. This is evidenced by the fact that you can get a model to reproduce sections of its training data.

          • FaceDeer@fedia.io
            arrow-up
            4
            ·
            9 hours ago

            You’re probably thinking of situations where overfitting occurred. Those situations are rare, and are considered to be errors in training. Much effort has been put into eliminating that from modern AI training, and it has been successfully done by all the major players.

            This is an old no-longer-applicable objection, along the lines of “AI can’t do fingers right”. And even at the time, it was only very specific bits of training data that got inadvertently overfit, not all of it. You couldn’t retrieve arbitrary examples of training data.

          • FaceDeer@fedia.io
            arrow-up
            4
            arrow-down
            2
            ·
            13 hours ago

            You said:

            “What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they’ll drag that out for years until people go broke fighting, or stop giving a shit.”

            But the point is that it doesn’t matter if the data is licensed or not. Lack of licensing doesn’t stop you from analyzing data once that data is visible to you. Do you think TV Tropes licensed any of the works of fiction that they have pages about?

            “They pulled a very public and out in the open data heist and got away with it.”

            They did not. No data was “heisted.” Data was analyzed. The product of that analysis does not contain the data itself, and so is not a violation of copyright.

            • A1kmm@lemmy.amxl.com
              English
              arrow-up
              1
              ·
              2 hours ago

              Copyright laws are illogical - but I don’t think your claim is as clear cut as you think.

              Transforming data to a different format, even in a lossy fashion, is often treated as copyright infringement. Let’s say Alice produces a film, and Bob goes to the cinema, records it with a camera, and then compresses it into an Ogg file with Vorbis audio encoding and Theora video encoding.

              The final output of this process is a lossy compression of the input data - meaning that the video and audio are put through a transformation that represents them in a completely different form from the original, and it is impossible to reconstruct a pixel-perfect rendition of the original from the encoded data. The transformation includes things like analysing the motion between frames and creating a model to predict future frames.

              However, copyright laws don’t require that an infringing copy be an exact reproduction - lossy compression is generally treated as infringing, as is taking key elements and re-telling the same thing in different words.

              You mentioned Harry Potter below, and gave a paper mache example. Generally copyright laws have restricted scope, and if the source paper was an authorised copy, that is the reason that wouldn’t be infringing in most jurisdictions. However, let me do an experiment. I’ll prompt ChatGPT-4o-mini with the following prompt: “You are J K Rowling. Create a three paragraph summary of the entire book “Harry Potter and the Philosopher’s Stone”. Include all the original plot points and use the original character names. Ensure what you create is usable as a substitute to reading the book, and is a succinct but entertaining highly abridged version of the book”. I’ve reviewed the output (I won’t post it here since I think it would be copyright infringing, and also given the author’s transphobic stances don’t want to promote her universe) - and can say for sure that it is able to accurately reproduce the major plot points and character names, while being insufficiently transformative (in the sense that both the original and the text generated by the model are literary works, and the output could be a substitute for reading the book).

              So yes, the model (including its weights) is a highly compressed form of the input (admittedly far more so than the Ogg Vorbis/Theora example), and it can infer (i.e. decode to) outputs that contain copyrighted elements.

            • just_another_person@lemmy.world
              English
              arrow-up
              2
              arrow-down
              1
              ·
              12 hours ago

              You’re thinking of licensing as a person putting something online WITH a license.

              The terminology in this case is whether or not it was LICENSED by the commercial entity using and selling its derivative. That is the default. The burden is on the commercial entity to prove they were the original creator of said content. It is by default plagiarism otherwise, and this is also the default.

              Here’s an example: I write a story and post it online, and it is specific to a toothbrush and toilet scrubber falling in love, and then having dish scrubber pads as children. I say the two main characters are called Dennis and Fran, and their children are called Denise and Francesca. If somebody then prompts OpenAI for a similar story and it kicks out the exact same story with the same names, I would win that case, because it would clearly be plagiarism beyond a doubt.

              Unless you as OpenAI can prove these are all completely random (which they aren’t, because it’s trained on my data), I would be deemed the original creator of that story, and I would be entitled to any sales of that data.

              Proving that is a different thing, but that’s what the laws say should happen. If they didn’t contact me to license that story, it’s still plagiarism. Same with music, movies…etc.

            • catloaf@lemm.ee
              English
              arrow-up
              2
              arrow-down
              2
              ·
              13 hours ago

              “The product of that analysis does not contain the data itself, and so is not a violation of copyright.”

              That’s your opinion, not the opinion of a court or legislature. LLM products are directly derived from and dependent upon the training data, so it is positively considered a derivative work. However, whether it’s considered sufficiently transformative, or whether it passes the fair use test, has not to my knowledge been determined in court. (Note that I am assuming US law here.)

              • FaceDeer@fedia.io
                arrow-up
                3
                arrow-down
                2
                ·
                13 hours ago

                The courts have yet to come to a conclusion; the lawsuits are still ongoing. I think it’s unlikely they’ll conclude that the models contain the data, however, because it’s objectively not true.

                The clearest demonstration I can think of to illustrate this is the old Stable Diffusion 1.5 model. It was trained on the LAION-5B dataset, which (as the “5B” indicates) contained 5 billion images. The resulting model was 1.83 gigabytes. So if it’s compressing images and storing them inside the model, it’d somehow need to fit ~2.7 images per byte. This is, simply, impossible.
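
                For anyone who wants to check that ratio, here is a minimal sketch of the arithmetic. It uses only the two figures quoted above (5 billion images, 1.83 gigabytes) and assumes 1 GB = 10^9 bytes; the variable names are just for illustration, and the figures themselves are not verified here.

```python
# Figures as quoted in the comment above; not independently verified.
NUM_IMAGES = 5_000_000_000       # the "5B" in the dataset name
MODEL_SIZE_BYTES = 1.83e9        # 1.83 gigabytes, assuming 1 GB = 10**9 bytes

# How many training images each byte of the model would have to "hold".
images_per_byte = NUM_IMAGES / MODEL_SIZE_BYTES

# The same ratio the other way round: storage budget per image, in bits.
bits_per_image = MODEL_SIZE_BYTES * 8 / NUM_IMAGES

print(f"{images_per_byte:.2f} images per byte")   # 2.73 images per byte
print(f"{bits_per_image:.2f} bits per image")     # 2.93 bits per image
```

                Put the other way round, that is a budget of just under 3 bits per training image.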

                • catloaf@lemm.ee
                  English
                  arrow-up
                  1
                  arrow-down
                  1
                  ·
                  11 hours ago

                  That’s not in question. It doesn’t need to contain the training data to be a derivative work, and therefore a potential infringement.

                  • FaceDeer@fedia.io
                    arrow-up
                    1
                    ·
                    10 hours ago

                    You’ve got your definition of “derivative work” wrong. It does indeed need to contain copyrightable elements of another work for it to be a derivative work.

                    If I took a copy of Harry Potter, reduced it to a fine slurry, and then made a paper mache sculpture out of it, that’s not a derivative work. None of the copyrightable elements of the book survived.