Using model-generated content in training causes irreversible defects, a team of researchers says. “The tails of the original content distribution disappears,” writes co-author Ross Anderson from the University of Cambridge in a blog post. “Within a few generations, text becomes garbage, as Gaussian distributions converge and may even become delta functions.”

Here’s is the study: http://web.archive.org/web/20230614184632/https://arxiv.org/abs/2305.17493

  • FaceDeer@kbin.social
    link
    fedilink
    arrow-up
    9
    ·
    1 year ago

    So the only Reddit data that’s really valuable to AI companies is the stuff that’s already been archived, and the stuff that will be gated behind their paid API is less useful now that bots are becoming prevalent?

    Good move, Reddit.