OK, so I have some code that crawls the postings of a community and compares two servers for missing comments. It looks bad today. Both of these servers are on version 0.18.0 and have been running it for several days.

missing 0 unequal 0 11 on https://lemmy.ml/ vs. 11 on https://sh.itjust.works/
missing 35 unequal 1 48 on https://lemmy.ml/ vs. 14 on https://sh.itjust.works/
missing 4 unequal 0 9 on https://lemmy.ml/ vs. 5 on https://sh.itjust.works/
missing 6 unequal 0 9 on https://lemmy.ml/ vs. 3 on https://sh.itjust.works/
missing 1 unequal 0 1 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 6 unequal 0 12 on https://lemmy.ml/ vs. 6 on https://sh.itjust.works/
missing 3 unequal 0 8 on https://lemmy.ml/ vs. 5 on https://sh.itjust.works/
missing 3 unequal 0 6 on https://lemmy.ml/ vs. 4 on https://sh.itjust.works/
missing 22 unequal 0 42 on https://lemmy.ml/ vs. 20 on https://sh.itjust.works/
missing 5 unequal 0 15 on https://lemmy.ml/ vs. 10 on https://sh.itjust.works/
missing 8 unequal 2 17 on https://lemmy.ml/ vs. 9 on https://sh.itjust.works/
missing 3 unequal 0 3 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 0 unequal 0 10 on https://lemmy.ml/ vs. 10 on https://sh.itjust.works/
missing 11 unequal 0 24 on https://lemmy.ml/ vs. 13 on https://sh.itjust.works/
missing 1 unequal 0 2 on https://lemmy.ml/ vs. 1 on https://sh.itjust.works/
missing 13 unequal 0 37 on https://lemmy.ml/ vs. 24 on https://sh.itjust.works/
missing 3 unequal 0 7 on https://lemmy.ml/ vs. 4 on https://sh.itjust.works/
missing 0 unequal 0 10 on https://lemmy.ml/ vs. 10 on https://sh.itjust.works/
missing 60 unequal 2 186 on https://lemmy.ml/ vs. 126 on https://sh.itjust.works/
missing 10 unequal 2 51 on https://lemmy.ml/ vs. 41 on https://sh.itjust.works/
missing 16 unequal 0 51 on https://lemmy.ml/ vs. 36 on https://sh.itjust.works/
missing 31 unequal 3 128 on https://lemmy.ml/ vs. 97 on https://sh.itjust.works/
missing 0 unequal 0 4 on https://lemmy.ml/ vs. 4 on https://sh.itjust.works/
missing 2 unequal 0 5 on https://lemmy.ml/ vs. 3 on https://sh.itjust.works/
missing 15 unequal 1 67 on https://lemmy.ml/ vs. 52 on https://sh.itjust.works/
missing 4 unequal 0 53 on https://lemmy.ml/ vs. 49 on https://sh.itjust.works/
missing 0 unequal 0 5 on https://lemmy.ml/ vs. 5 on https://sh.itjust.works/
missing 0 unequal 0 0 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 1 unequal 0 19 on https://lemmy.ml/ vs. 18 on https://sh.itjust.works/
missing 0 unequal 0 2 on https://lemmy.ml/ vs. 2 on https://sh.itjust.works/
missing 0 unequal 0 22 on https://lemmy.ml/ vs. 22 on https://sh.itjust.works/
missing 0 unequal 0 16 on https://lemmy.ml/ vs. 18 on https://sh.itjust.works/
missing 0 unequal 0 7 on https://lemmy.ml/ vs. 7 on https://sh.itjust.works/
missing 3 unequal 0 27 on https://lemmy.ml/ vs. 24 on https://sh.itjust.works/
missing 2 unequal 0 32 on https://lemmy.ml/ vs. 30 on https://sh.itjust.works/
missing 3 unequal 0 21 on https://lemmy.ml/ vs. 18 on https://sh.itjust.works/
missing 3 unequal 1 16 on https://lemmy.ml/ vs. 13 on https://sh.itjust.works/
missing 3 unequal 1 47 on https://lemmy.ml/ vs. 44 on https://sh.itjust.works/
missing 1 unequal 0 24 on https://lemmy.ml/ vs. 23 on https://sh.itjust.works/

The comment counts come from actually loading the comments on each server, not from the count shown at the top of each posting.
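
For readers who want to reproduce a line of the report above, here is a minimal sketch of the comparison step. It is not the crawler used for these numbers. It assumes the comments from each server have already been loaded into a dict keyed by some identifier that is the same on both servers, and it assumes "missing" means present on the first server but absent on the second, which matches lines such as the one showing 16 vs. 18 with 0 missing. The names `compare_comments` and `report_line` are hypothetical.

```python
from typing import Dict, Tuple


def compare_comments(home: Dict[str, str], remote: Dict[str, str]) -> Tuple[int, int]:
    """Compare two servers' copies of one posting's comments.

    Both arguments map a shared comment identifier (an assumption, see the
    lead-in) to the comment body. Returns (missing, unequal):
    missing = comments on `home` that `remote` does not have,
    unequal = comments on both whose bodies differ.
    """
    missing = sum(1 for key in home if key not in remote)
    unequal = sum(
        1 for key, body in home.items() if key in remote and remote[key] != body
    )
    return missing, unequal


def report_line(home_url: str, home: Dict[str, str],
                remote_url: str, remote: Dict[str, str]) -> str:
    """Format one line in the same layout as the report in this post."""
    missing, unequal = compare_comments(home, remote)
    return (f"missing {missing} unequal {unequal} "
            f"{len(home)} on {home_url} vs. {len(remote)} on {remote_url}")
```

Comments that exist only on the second server are deliberately not counted as missing, so the two totals can differ by more or less than the missing count.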

  • RoundSparrow@lemmy.ml (OP) · 2 upvotes · 1 year ago

    have you looked for example at lemmy.world (very large but performant) and another relatively small instance?

    Yes, and I run my own instance that is on a high performance server and it has no users other than myself for testing Lemmy.

    Lemmy.ml is having the most problems, but really all of the big servers are failing to replicate some comments and posts. The biggest problem is that server operators have no way to know this is happening other than reading raw Linux error logs. Post ID numbers are unique to each server, so it is not easy to tell which links on two servers point to the same posting.
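
Because the numeric post IDs differ per server, pairing the same posting across two servers needs some identifier that travels with the post. The sketch below assumes each post record carries such a federation-wide identifier next to its server-local `id`. The field name `ap_id`, the record shape, and the function name `pair_posts` are assumptions made for illustration, not something stated in this thread.

```python
from typing import Dict, List, Tuple


def pair_posts(posts_a: List[Dict], posts_b: List[Dict]) -> Tuple[List[Tuple[int, int]], List[str]]:
    """Pair posts from two servers by a shared identifier.

    Each post is a dict with a server-local 'id' and an 'ap_id' that is
    assumed to be identical on every server holding a copy of the post.
    Returns (pairs, unmatched): pairs of (id on A, id on B), and the
    shared identifiers of posts on A that B does not have.
    """
    # Index server B by the shared identifier for constant-time lookup.
    local_id_on_b = {post["ap_id"]: post["id"] for post in posts_b}

    pairs: List[Tuple[int, int]] = []
    unmatched: List[str] = []
    for post in posts_a:
        other_id = local_id_on_b.get(post["ap_id"])
        if other_id is None:
            unmatched.append(post["ap_id"])
        else:
            pairs.append((post["id"], other_id))
    return pairs, unmatched
```

The unmatched list is itself useful: a posting that never arrived on the second server shows up there rather than as a posting with every comment missing.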

    • DaEagle@lemmy.ml · 4 upvotes · 1 year ago

      I think I saw one of your earlier posts and I really appreciate you chasing this down and raising awareness. As a relatively savvy user, I’ve definitely noticed this, and I share your concern that it will slowly erode users’ trust in the concept of federation.

      Technically, can you trace where the comments are dropped? Does the target receive the response but fail to process it, or does it break somewhere at the network layer? If so, is there no receiver “ack” built into the protocol? Sorry for asking a bunch of questions, and feel free to ignore them (I’m an engineer but I don’t know much about the federation protocol…)

      • RoundSparrow@lemmy.ml (OP) · 6 upvotes · 1 year ago

        Technically, can you trace where the comments are dropped?

        There are multiple timing and resource issues with the way content is sent. Every single vote and comment carries a lot of overhead, and the Lemmy servers are slowing each other down with the overhead of it all. There are even very tight security timings that have been hit, causing rejection. And the system has no automated way to repair missing content; it just tries to keep up with each new posting, comment, and vote.

        • maegul (he/they)@lemmy.ml · 1 upvote · 1 year ago

          How much of this is just the nature of ActivityPub and federation over it?

          Putting aside whether lemmy is doing a good or bad job, it seems like an issue at the protocol level? For instance, if lemmy were to implement some additional procedures as you hint at, would they work for federation outside of lemmy or would they even cause problems or bugs?

          I’m pretty sure I’ve seen things get dropped in similar ways on other micro-blog platforms. It could be that the community/group-based structure just surfaces the issues more because you can compare whole communities instead of individual reply threads in microblogging. Maybe ActivityPub isn’t appropriate for federated group activity?

          Have you looked at similar issues with kbin?

          Also, thanks for this work. High level end-to-end testing like this is probably invaluable!!

          • RoundSparrow@lemmy.ml (OP) · 3 upvotes · 1 year ago

            I haven’t had time to look at kbin.

            The Lemmy servers are logging errors, but the messages need organization and are rather difficult for server operators to access. Growing pains.

          • haganbmj@lemmy.ml · 1 upvote · 1 year ago

            Just to pose these in a similar thread, I have a few questions as a casual observer, some of which I’m unclear if they’re handled at the protocol or Lemmy level.

            • As I understand it servers subscribe to other servers and everything is then push based?
            • I assume ordering is not a guarantee. So there’s probably no concept of offset tracking on subscriptions or replaying a time range?
            • If ordering is not a requirement how do likes/comments handle out of order receipt? Everything seems to have a local ID, so can content get pre-liked before the root message arrives? Unclear if ID generation is based on any identifiers you’d have to work with or not - or whether remote content retains its origin IDs?
            • Lemmy at least appears to have some retry mechanism, but I’m unclear on how it behaves - seems annoying with 1000+ subscribing servers.
            • I seem to recall reading ActivityPub has some pattern for batching, but reading the spec again I’m not seeing it. Is that a thing?

            • maegul (he/they)@lemmy.ml · 1 upvote · 1 year ago

              Sorry I don’t know the answers to these questions. Also, I don’t think the OP will get a notification for your content (?) unless you reply to them directly, just in case you want to ask them.

        • maegul (he/they)@lemmy.ml · 1 upvote · 1 year ago

          Could you do a little bit more analysis to determine the typical failure mode?

          For instance, my first guess would be that information is more likely to get lost when a comment/vote needs to go through multiple servers.

          That is, a thread is on a community on server A, with people commenting from servers A, B and C. AFAIU, all data synchronisation goes through server A, as that’s where the community lives. So a comment from server B, even if it’s in response to a comment/post from server C, needs to go through server A and then on to server C.

          If these multi-server trips get dropped the most, you’d expect views of the thread to be most inconsistent between servers B and C, and the most-dropped content to be from server B or C when viewed from the other of the two.
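
One way to test the guess above would be to group the missing comments by the server they were written on. This is only a sketch under stated assumptions: each comment is represented by a federation-wide identifier that is a URL on its author's home server, so the host part of the URL names the origin. The function name `missing_by_origin` and the example hosts are hypothetical.

```python
from collections import Counter
from typing import Iterable
from urllib.parse import urlparse


def missing_by_origin(home_ids: Iterable[str], remote_ids: Iterable[str]) -> Counter:
    """Count comments present on the home server but absent on the remote one,
    grouped by the host part of each comment's identifier URL.

    Assumes the identifier is a URL on the server where the comment was
    written, so its host stands for the origin server.
    """
    seen_on_remote = set(remote_ids)
    return Counter(
        urlparse(comment_id).netloc
        for comment_id in home_ids
        if comment_id not in seen_on_remote
    )


# Viewed from server A (the community's home), compared with server C's copy.
on_a = [
    "https://a.example/comment/1",
    "https://b.example/comment/7",
    "https://c.example/comment/3",
    "https://b.example/comment/8",
]
on_c = ["https://a.example/comment/1", "https://c.example/comment/3"]
print(missing_by_origin(on_a, on_c))  # Counter({'b.example': 2})
```

If the multi-server hypothesis holds, comments from third-party servers should dominate the tally; a roughly even spread across origins would point toward general load instead.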

          • RoundSparrow@lemmy.ml (OP) · 3 upvotes · 1 year ago

            The failures have more to do with server performance under the quantity of new comments and activities than with anything else. There are periods when it fails worse than at others, and even times when web browsers visiting the servers show errors.