Ok, so I have some code to crawl the postings of a community and compare two servers for missing comments. It looks bad today. Both of these servers are on version 0.18.0 and were upgraded several days ago.
missing 0 unequal 0 11 on https://lemmy.ml/ vs. 11 on https://sh.itjust.works/
missing 35 unequal 1 48 on https://lemmy.ml/ vs. 14 on https://sh.itjust.works/
missing 4 unequal 0 9 on https://lemmy.ml/ vs. 5 on https://sh.itjust.works/
missing 6 unequal 0 9 on https://lemmy.ml/ vs. 3 on https://sh.itjust.works/
missing 1 unequal 0 1 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 6 unequal 0 12 on https://lemmy.ml/ vs. 6 on https://sh.itjust.works/
missing 3 unequal 0 8 on https://lemmy.ml/ vs. 5 on https://sh.itjust.works/
missing 3 unequal 0 6 on https://lemmy.ml/ vs. 4 on https://sh.itjust.works/
missing 22 unequal 0 42 on https://lemmy.ml/ vs. 20 on https://sh.itjust.works/
missing 5 unequal 0 15 on https://lemmy.ml/ vs. 10 on https://sh.itjust.works/
missing 8 unequal 2 17 on https://lemmy.ml/ vs. 9 on https://sh.itjust.works/
missing 3 unequal 0 3 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 0 unequal 0 10 on https://lemmy.ml/ vs. 10 on https://sh.itjust.works/
missing 11 unequal 0 24 on https://lemmy.ml/ vs. 13 on https://sh.itjust.works/
missing 1 unequal 0 2 on https://lemmy.ml/ vs. 1 on https://sh.itjust.works/
missing 13 unequal 0 37 on https://lemmy.ml/ vs. 24 on https://sh.itjust.works/
missing 3 unequal 0 7 on https://lemmy.ml/ vs. 4 on https://sh.itjust.works/
missing 0 unequal 0 10 on https://lemmy.ml/ vs. 10 on https://sh.itjust.works/
missing 60 unequal 2 186 on https://lemmy.ml/ vs. 126 on https://sh.itjust.works/
missing 10 unequal 2 51 on https://lemmy.ml/ vs. 41 on https://sh.itjust.works/
missing 16 unequal 0 51 on https://lemmy.ml/ vs. 36 on https://sh.itjust.works/
missing 31 unequal 3 128 on https://lemmy.ml/ vs. 97 on https://sh.itjust.works/
missing 0 unequal 0 4 on https://lemmy.ml/ vs. 4 on https://sh.itjust.works/
missing 2 unequal 0 5 on https://lemmy.ml/ vs. 3 on https://sh.itjust.works/
missing 15 unequal 1 67 on https://lemmy.ml/ vs. 52 on https://sh.itjust.works/
missing 4 unequal 0 53 on https://lemmy.ml/ vs. 49 on https://sh.itjust.works/
missing 0 unequal 0 5 on https://lemmy.ml/ vs. 5 on https://sh.itjust.works/
missing 0 unequal 0 0 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 1 unequal 0 19 on https://lemmy.ml/ vs. 18 on https://sh.itjust.works/
missing 0 unequal 0 2 on https://lemmy.ml/ vs. 2 on https://sh.itjust.works/
missing 0 unequal 0 22 on https://lemmy.ml/ vs. 22 on https://sh.itjust.works/
missing 0 unequal 0 16 on https://lemmy.ml/ vs. 18 on https://sh.itjust.works/
missing 0 unequal 0 7 on https://lemmy.ml/ vs. 7 on https://sh.itjust.works/
missing 3 unequal 0 27 on https://lemmy.ml/ vs. 24 on https://sh.itjust.works/
missing 2 unequal 0 32 on https://lemmy.ml/ vs. 30 on https://sh.itjust.works/
missing 3 unequal 0 21 on https://lemmy.ml/ vs. 18 on https://sh.itjust.works/
missing 3 unequal 1 16 on https://lemmy.ml/ vs. 13 on https://sh.itjust.works/
missing 3 unequal 1 47 on https://lemmy.ml/ vs. 44 on https://sh.itjust.works/
missing 1 unequal 0 24 on https://lemmy.ml/ vs. 23 on https://sh.itjust.works/
The number of comments is based on loading comments, not the counts at the top of the posting.
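For anyone wanting to reproduce counts like these, here is a minimal sketch of the comparison step, not the actual script. It assumes the comments for one posting have already been loaded from each server as lists of dicts, that each comment carries an `ap_id` (its federation-wide URL) and a `content` field, and that "unequal" means the comment exists on both servers but the text differs. The function name `compare_comments` is made up for illustration.

```python
def compare_comments(home, remote):
    """Compare two comment lists for the same posting.

    Assumption: every comment dict has "ap_id" and "content" keys.
    Returns (missing, unequal) as sorted lists of ap_id values:
      missing  - present on the home server, absent on the remote one
      unequal  - present on both, but with different content
    """
    home_by_id = {c["ap_id"]: c for c in home}
    remote_by_id = {c["ap_id"]: c for c in remote}

    missing = sorted(set(home_by_id) - set(remote_by_id))
    unequal = sorted(
        ap_id
        for ap_id in home_by_id.keys() & remote_by_id.keys()
        if home_by_id[ap_id]["content"] != remote_by_id[ap_id]["content"]
    )
    return missing, unequal


# Hypothetical data, only to show the call shape.
home = [
    {"ap_id": "https://a.example/comment/1", "content": "first"},
    {"ap_id": "https://a.example/comment/2", "content": "second"},
    {"ap_id": "https://a.example/comment/3", "content": "third"},
]
remote = [
    {"ap_id": "https://a.example/comment/1", "content": "first"},
    {"ap_id": "https://a.example/comment/2", "content": "second (edited)"},
]
missing, unequal = compare_comments(home, remote)
print(f"missing {len(missing)} unequal {len(unequal)} "
      f"{len(home)} on home vs. {len(remote)} on remote")
```

Keying on `ap_id` rather than the numeric comment ID matters because the numeric IDs differ from server to server.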
Yes, and I run my own instance on a high-performance server. It has no users other than myself and exists for testing Lemmy.
Lemmy.ml is having the most problems, but really all of the big servers are dropping comment and post replication. The biggest problem is that server operators have no way to know this is happening other than to read raw Linux error logs. The post ID numbers are unique to each server, so it is not easy to identify which links point to the same posting.
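One possible way around the per-server ID problem, as a sketch only: pair postings by their federation-wide URL instead of the local number. This assumes each post dict exposes a local `id` and an `ap_id` that is identical on every server; `pair_posts` is a hypothetical helper name, and the sample data is invented.

```python
def pair_posts(server_a_posts, server_b_posts):
    """Map each ap_id seen on server A to (local id on A, local id on B).

    Assumption: post dicts have "id" (server-local) and "ap_id" (shared).
    The B-side id is None when server B has no copy of the posting.
    """
    b_ids = {p["ap_id"]: p["id"] for p in server_b_posts}
    return {
        p["ap_id"]: (p["id"], b_ids.get(p["ap_id"]))
        for p in server_a_posts
    }


# Hypothetical data: the same posting has a different local id on each server.
posts_a = [
    {"id": 101, "ap_id": "https://a.example/post/101"},
    {"id": 102, "ap_id": "https://a.example/post/102"},
]
posts_b = [
    {"id": 7, "ap_id": "https://a.example/post/101"},
]
pairs = pair_posts(posts_a, posts_b)
for ap_id, (id_a, id_b) in pairs.items():
    print(ap_id, id_a, id_b)
```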
I think I saw one of your earlier posts and I really appreciate you chasing this down and raising awareness. As a relatively savvy user, this is definitely something I’ve noticed and I share your concern that it will slowly erode users’ trust in the concept of federation.
Technically, can you trace where the comments are dropped? Does the target receive the response but fail to process it, or does it break somewhere at the network layer? If so, is there no receiver “ack” built into the protocol? Sorry for asking a bunch of questions and feel free to ignore (I’m an engineer but I don’t know much about the federation protocol…)
There are multiple timing and resource issues with the way content is sent. Every single vote and comment has a lot of overhead, and the Lemmy servers are causing each other to slow down with the overhead of it all. There are even very tight security timings that have been hit, causing rejection. And the system has no automated way to repair missing content; it just tries to keep up with each new posting, comment, and vote.
How much of this is just the nature of activity pub and federation over it?
Putting aside whether lemmy is doing a good or bad job, it seems like an issue at the protocol level? For instance, if lemmy were to implement some additional procedures as you hint at, would they work for federation outside of lemmy or would they even cause problems or bugs?
I’m pretty sure I’ve seen things get dropped in similar ways on other micro-blog platforms. It could be that the community/group based structure just surfaces the issues more because you can compare whole communities instead of individual reply threads in microblogging. Maybe activity pub isn’t appropriate for federated group activity?
Have you looked at similar issues with kbin?
Also, thanks for this work. High level end-to-end testing like this is probably invaluable!!
I haven’t had time to look at kbin.
The Lemmy servers are logging errors, but the messages need organization and are rather difficult for server operators to access. Growing pains.
Just to pose these in a similar thread, I have a few questions as a casual observer, some of which I’m unclear if they’re handled at the protocol or Lemmy level.
Sorry I don’t know the answers to these questions. Also, I don’t think the OP will get a notification for your content (?) unless you reply to them directly, just in case you want to ask them.
Could you do a little bit more analysis to determine the typical failure mode?
For instance, my first guess would be that information is more likely to get lost when a comment/vote needs to go through multiple servers.
That is, a thread is on a community on server A, with people commenting from servers A, B and C. AFAIU, all data synchronisation goes through server A, as that’s where the community lives. So a comment from server B, even if it’s in response to a comment/post from server C, needs to go through server A and then to server C.
If these multi-server trips get dropped the most, you’d expect views of the thread to be most inconsistent between servers B and C, and the most-dropped content to be from server B or C when viewed from the other of the two.
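A rough way to check this guess, sketched under assumptions: take the list of comments found missing on one server and tally them by the instance they originated from. This assumes each missing comment is identified by an `ap_id` URL whose hostname is the author's home instance; `missing_by_origin` is a made-up name, and the URLs are placeholders rather than real results.

```python
from collections import Counter
from urllib.parse import urlparse


def missing_by_origin(missing_ap_ids):
    """Count missing comments per originating host.

    Assumption: the hostname of a comment's ap_id is the instance
    where the comment was written.
    """
    return Counter(urlparse(ap_id).hostname for ap_id in missing_ap_ids)


# Placeholder ap_ids for comments that server C is missing.
missing_on_c = [
    "https://b.example/comment/11",
    "https://b.example/comment/12",
    "https://a.example/comment/40",
]
tally = missing_by_origin(missing_on_c)
for host, count in tally.most_common():
    print(host, count)
```

If the multi-server theory holds, hosts other than the community's home server would dominate the tally. If the home server's own comments are missing at a similar rate, the cause is more likely something else.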
The failures have more to do with server performance related to the quantity of new comments and activities than anything. There are periods when it fails worse than at others, including times when even web browsers visiting the servers show errors.