• CapuccinoCoretto@lemmy.world
    51 upvotes · 4 days ago

    One thing I want to see is poisoned wells. When you detect scrapers, don’t stop them; feed them pseudo-content designed to COST them. Make their training data poisonous and damaging. Make it costly to purge, and difficult and expensive to identify.

    • Agent641@lemmy.world
      11 upvotes · 3 days ago

      We need to host the data version of asbestos. Very appealing and useful, a miracle material in fact, and you don’t realise until 30 years later and well after it’s too late that it’s causing an incurable disease in your lungs.

      Get that poisonous data so deep in the databases of these AIs that it festers and spawns billions of tumors.

      I wish I was smart enough to devise a practical way to weaponise data like this.

    • TheOctonaut@piefed.zip
      English · 10 upvotes · 4 days ago

      Unless a significant portion of the internet does this, and we’re talking hundreds of millions of pages, the only cost here is to you.

      LLMs are statistics. They don’t “remember” their training. They just know what statistically speaking the next words should be. But sure, be the web dev version of þorn guy.

      • nlgranger@lemmy.world
        1 upvote · 2 days ago

        That is not entirely true in theory. It is possible to engineer content to have a disproportionate impact on model performance. But we are talking state-of-the-art research, and it’s a moving target since the models evolve quite fast.

      • algernon@lemmy.ml
        4 upvotes · 3 days ago

        Unless a significant portion of the internet does this, and we’re talking hundreds of millions of pages, the only cost here is to you.

        Fun twist: no! There’s a very neat trick you can do when you serve the crawlers poison: you can hide an identifier in the URLs you serve them, and then recognise that ID when they come back riding on the back of remote-controlled Chrome instances. By serving them garbage, you can overload their queue with poisoned URLs, which helps you block crawlers that you wouldn’t otherwise be able to block.

        Generating and serving garbage is incredibly cheap (cheaper than serving a file from a filesystem on SSD, in most cases), and once you have requests landing on poisoned URLs, you can firewall them off for a day or so, and reduce your costs even more.

        We may not be able to poison the models, but we can poison their crawling queues. I have a year’s worth of data to support that. They still haven’t caught on.
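A minimal sketch of the URL-tagging trick described above, not algernon's actual setup. Everything named here is an assumption made for illustration: the `/maze/` prefix, the HMAC-based tag, the helper names, and the one-day in-memory block list standing in for a real firewall. As a simplification it blocks any client that requests a tagged URL, whereas the comment describes serving garbage first and blocking when the URLs come back.

```python
import hashlib
import hmac
import secrets

# Hypothetical values: a real deployment would persist the key across restarts.
SECRET_KEY = secrets.token_bytes(32)
TRAP_PREFIX = "/maze/"
BLOCK_SECONDS = 24 * 60 * 60  # "firewall them off for a day or so"

# Stand-in for a firewall rule set: client IP -> time at which the block expires.
blocked_until: dict[str, float] = {}


def _tag(seed: str) -> str:
    """Keyed tag that only this server can produce for a given seed."""
    return hmac.new(SECRET_KEY, seed.encode(), hashlib.sha256).hexdigest()[:16]


def make_trap_url(seed: str) -> str:
    """Build a poisoned URL that carries a hidden identifier."""
    return f"{TRAP_PREFIX}{seed}-{_tag(seed)}"


def is_trap_url(path: str) -> bool:
    """True if the path carries a tag this server issued earlier."""
    if not path.startswith(TRAP_PREFIX):
        return False
    seed, sep, tag = path[len(TRAP_PREFIX):].rpartition("-")
    if not sep or not seed:
        return False
    return hmac.compare_digest(tag.encode(), _tag(seed).encode())


def handle_request(client_ip: str, path: str, now: float) -> str:
    """Return 'drop', 'block' or 'serve' for one incoming request."""
    if blocked_until.get(client_ip, 0.0) > now:
        return "drop"
    if is_trap_url(path):
        # Only something that ingested the poisoned links would request this.
        blocked_until[client_ip] = now + BLOCK_SECONDS
        return "block"
    return "serve"
```

Because the tag is keyed, a crawler cannot guess valid trap URLs, and a request for one is strong evidence that the client, real browser or not, is working through a queue filled from the garbage pages.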

        • TheOctonaut@piefed.zip
          English · 1 upvote, 1 downvote · 3 days ago

          They still haven’t caught on

          I admire the optimism to see it this way and not “it’s still not worth it to them to bother blacklisting the domain”

          • algernon@lemmy.ml
            2 upvotes · 3 days ago

            I wonder too, why they didn’t, because they’re happily crawling domains that never had anything but junk on them. To me, that suggests they have no idea they’re trapped. Not at crawling time at least.

      • ATPA9@feddit.org
        7 upvotes · 4 days ago

        Remember the glue on pizza? Sometimes it takes just one stupid post somewhere to poison an LLM.

        • TheOctonaut@piefed.zip
          English · 4 upvotes · 4 days ago

          Glue on pizza was the result of an early version of an agent tool: built-in search. It wasn’t an output of the LLM model (yes I know, ATM machine) itself. It was an LLM using a tool to find a search result from a site considered reputable (yes, I know) and presenting it to the user as fact. That is an instructions problem, not a statistical one.

        • TheOctonaut@piefed.zip
          English · 1 upvote · 3 days ago

          I don’t think you understand the scale of the amount of data that has been fed into these models. Already fed in, as in the models are already created, the baseline already established, the dataset responsible for the output they want already retained.

          Any attempt to “poison” them is attempting to add one, ten, a thousand, a million confounding data points against every webpage 1993-2026, every book ever digitised, every social media post made public, every transcript of every video on YouTube, every code comment made public, every post on this federated platform.

          For news articles alone, that’s about 20 billion non-poisoned articles. Do you know what the difference between a million poisoned pages and 20 billion is? 20 billion.

          The Daily Mail (vomit) alone publishes 1,500 articles a day. How many do you plan on publishing?

          • algernon@lemmy.ml
            1 upvote · 3 days ago

            The Daily Mail (vomit) alone publishes 1,500 articles a day. How many do you plan on publishing?

            I have an automatically generated infinite maze. It produces roughly a million unique pages each day. It used to produce ~60 million pages / day, but a few months ago I decided to firewall some of the crawlers off instead of serving them garbage.

            And I run niche sites. A site with more lucrative traffic than mine (e.g., Codeberg, which uses the same software I do) likely generates a lot more garbage.

            There was also a paper, commissioned by Anthropic, I believe, which concluded that as few as 250 malicious pages left in the training set are enough to poison even the largest model. Now, I do not trust anything Anthropic says. But even if we’d need a billion pages to poison a model… I alone served that much in the past year.
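For a sense of why an "infinite maze" like the one mentioned above can be so cheap to serve, here is a rough sketch. It is an illustration under stated assumptions, not the software algernon runs: the word list, `render_maze_page`, and the link format are all hypothetical. The idea is that each page is derived purely from its own URL, so nothing is stored and every link leads to another freshly generated page.

```python
import hashlib
import random

# Hypothetical filler vocabulary; a real trap would use something more plausible.
WORDS = ["data", "model", "river", "signal", "copper", "garden",
         "theory", "window", "orbit", "ledger", "harbor", "thread"]


def render_maze_page(path: str, links: int = 5, words: int = 40) -> str:
    """Render a garbage page that is fully determined by its URL path."""
    # Seed a private RNG from the path so the same URL always yields the same page.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    text = " ".join(rng.choice(WORDS) for _ in range(words))
    base = path.rstrip("/")
    # Every link points one level deeper, so the maze never runs out of pages.
    hrefs = [f"{base}/{rng.getrandbits(32):08x}" for _ in range(links)]
    anchors = "\n".join(f'<a href="{h}">{h}</a>' for h in hrefs)
    return f"<html><body><p>{text}</p>\n{anchors}</body></html>"
```

Serving a page costs one hash and a few dozen random draws, with no disk access, which is consistent with the claim that garbage can be cheaper to serve than a real file.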

            • TheOctonaut@piefed.zip
              English · 1 upvote, 1 downvote · 3 days ago

              As you’ve said elsewhere, you’ve created a crawler trap, not a way to poison a model. You’re wasting… some resources I guess? Both theirs and your own. Fascinating to think that you’ve served a billion http requests to no benefit to anyone and you believe this is you winning somehow.

              • algernon@lemmy.ml
                1 upvote · 3 days ago

                Yes, it does have a cost. It has a far smaller cost than serving the real thing. It also allows me to firewall them off and stop serving them, even if they come at me with real browsers. That’s a very definitive win: I saved CPU time, I saved RAM, I saved network bandwidth, and I stopped them from accessing my stuff. How is that not a win?

            • TheOctonaut@piefed.zip
              English · 1 upvote · 3 days ago

              Ok, suppose that I’ve made it to my 40s without realising that time is in linear motion.

              Explain to me what relevance that has to LLMs?

    • hansolo@lemmy.today
      8 upvotes · 4 days ago

      I really want a tutorial on how to do this. I think it’s a great way to practice self-aggrandizement by making myself the pretend king of a pretend country.