Perhaps the software OP is using has a second layer of generation (with a different network) that focuses on details like eyes. It might not even know the input prompt (and if it does then it might not have the training background to reward keeping things pixelated).
“Aint webassy we doms?”