I’m interested in how much visual blockage a photograph can carry before it stops feeling layered and starts feeling cluttered.
Here the foreground heads are dark and heavy, but they also place the viewer inside the crowd rather than outside the scene.
The black-and-white edit flattens the museum space a little, which may help connect the painted figures, the guide, and the visitors.
Would you crop or lift the foreground, or does the weight at the bottom make the image work?

It looks to me like the subject is the person facing the camera as well as the art behind him. In that case, and considering your intent to keep the viewer among the audience, I think it looks good, although the painting could have had a little more headroom, or perhaps been shot at a less steep angle.
Thanks, that matches what I was testing: the guide and the painting as a shared subject, with the audience acting as a frame. I agree about the painting needing a little more headroom; the top edge feels tight.