DALL-E has some limits:-
Prompt: - "I want to have a panoramic photo of a clean-shaven, elderly man, taken from the side. He's holding a hat behind his back."
They all have these limits, it's inherent to how the models work. They just take each word and try to generalise what all the images they've seen flagged with that word have in common.
"Holding" probably almost always means that there's hands visible, and they'll usually be in some fairly normal places. The models generally struggle with things like hands though because it doesn't have any idea of anatomy. The hand just needs to be in the picture, it doesn't need to be connected to anything.
"Behind" isn't inherently connected to "hat" or "back" because the model doesn't know what sentences are. It's just trying to match words.
If you want a good result from a simple prompt, you have to keep it pretty basic and let the model do more or less what it wants. If you want strong control over the actual composition of the images, you have to start using extra tools to define areas of the canvas and what they should contain or at least prompt the model by feeding it a starter image instead of pure noise. A lot of the basic online services don't offer that sort of control, you probably need to start running your own instance.
This is what people mean when they say "it takes a lot of effort to write a good prompt to get the result you want". It's not as hard as learning to be an artist but it's far from trivial either, because the model is just a tool and one with no innate knowledge of pretty much anything that makes a good artist. It's relying on you to feed it the right prompts in a way that suits the inherent limitations it has.
I'm currently messing around with the Hunyuan video generator, and I'm having a hell of a time getting it to spit out anything but blurry messes. Partly because my rig really isn't quite powerful enough to be doing video, but also because I just don't know what I'm doing with it yet. This is the trade off - models that are easy to use offer little control, and ones that offer significant control also spit out garbage a lot of the time.