Photo by YuriArcursPeopleimages on Envato Elements

Artificial Intelligence (AI) for image generation has advanced by leaps and bounds, creating photorealistic landscapes and unique individuals from a simple text description. However, if you’ve asked your favorite image generator to create a group of people, you’ve probably encountered strange results: hands with six fingers, merging limbs, or inconsistent, low-quality faces.

Why does this happen? If AI is so good at generating a single person, what confuses it when there’s more than one? 🤔 The answer lies in how these models have been trained and the inherent complexity of human interaction.

🧠 The Training Problem: Individuality vs. Context

Image generation AI models, like DALL-E or Midjourney, are trained on vast amounts of data, including billions of images and their corresponding captions.
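To make “images and their corresponding captions” concrete, here is a purely illustrative Python sketch of what one training record might look like. The field names, file names, and captions are invented for the example and don’t correspond to any real dataset.

```python
# Illustrative only: a toy representation of an image-caption training pair.
# Real datasets are far larger and messier, but the basic pairing is the same.
from dataclasses import dataclass

@dataclass
class TrainingExample:
    image_path: str  # the picture the model sees
    caption: str     # the text it learns to associate with that picture

examples = [
    TrainingExample("img_000001.jpg", "a woman smiling at the camera"),
    TrainingExample("img_000002.jpg", "a red apple on a wooden table"),
    TrainingExample("img_000003.jpg", "a group of friends at a concert"),
    # ...billions more pairs like these
]

for ex in examples:
    print(ex.image_path, "->", ex.caption)
```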

1. Focus on the Main Subject

A large part of the training set consists of images with a clear main subject (a person, an object, a landscape). This makes the AI an expert at recreating a face and anatomy in isolation, since these are the most common and distinct patterns in its data.

2. Confusion When Multiplying Elements

When asked to generate a group, the AI must duplicate its understanding of “person” multiple times and, hardest of all, do so in a coherent spatial context (the short sketch after this list shows how this plays out in a typical prompt). The model struggles with:

  • Anatomical Consistency: Maintaining the correct anatomy (five fingers, two arms) for each individual simultaneously, without the elements “blending” or corrupting each other.
  • Positioning and Occlusion: Deciding how the bodies should overlap or interact. Who is in front? Who is behind? This understanding of depth and social context is much harder to learn than the individual elements.
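To see where this tends to surface in practice, here’s a minimal sketch that generates a single subject and then a group with a text-to-image pipeline. It assumes the Hugging Face diffusers library, a CUDA GPU, and a Stable Diffusion checkpoint; the model name and prompts are placeholders, and this is a sketch rather than a recipe.

```python
# Minimal sketch, assuming the Hugging Face `diffusers` library and a GPU.
# The checkpoint name is illustrative; other text-to-image models behave similarly.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# A single, clearly framed subject usually comes out clean...
single = pipe("studio portrait of a smiling woman, sharp focus").images[0]
single.save("single.png")

# ...while a group prompt is where anatomy and spatial coherence tend to break:
# extra fingers, merged limbs, inconsistent faces in the background.
group = pipe(
    "photo of five friends hugging at a party, all faces visible",
    num_inference_steps=30,
).images[0]
group.save("group.png")
```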

📐 Social Geometry and Data Biases

Another crucial factor is that scenes with groups of people involve a social geometry and a set of spatial relationships that the AI doesn’t always manage to capture.

1. Difficult Interactions to Model

Unlike inanimate objects (like three apples on a table), interactions between humans are dynamic and complex. The AI often fails at small details that a human notices immediately:

  • Hands and Limbs: These are notoriously difficult because they have many joints and can overlap or be in unusual positions when interacting with others. Hands are one of AI’s best-known “weak points,” and this problem is amplified when multiple bodies are involved.
  • Gaze and Expressions: It’s difficult for the AI to ensure that a group’s gazes point in a coherent direction, or that their facial expressions reflect a logical group interaction (for example, everyone looking and smiling at an invisible photographer).

2. Dataset Bias

Finally, biases in the training data also play a role. If images of groups in the training set lack diversity or are dominated by certain poses or environments, the AI will have trouble generalizing to a natural-looking, diverse group in a new scene. The result is often a repetition of the most common patterns: individuals who look like slightly modified copies cast from the same mold.

In short, generating groups isn’t just about “generating one person and then copying them.” It requires a deep geometric and contextual understanding that goes beyond individual visual patterns, and that’s where AI still needs to improve to achieve perfect group interactions.

If you’re interested in learning more about how AI is trained and the challenges it faces, check out this video on why image generators struggle to keep faces consistent (the Spanish-language video linked in the original blog post). The video explains that the consistency problem, even for individual faces, stems from the fact that the AI works with probabilities rather than a fixed understanding.
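As a toy illustration of “probabilities, not a fixed understanding,” the sketch below repeatedly samples from a made-up distribution over face variants: each run can land on a different outcome, just as each generation can land on a slightly different face. The variants and weights are invented purely for the example.

```python
# Toy illustration (not the real model): generation is sampling, not retrieval.
import random

# Pretend these are the model's learned probabilities over four plausible
# "face" variants for the same prompt (numbers are made up for illustration).
face_variants = ["variant A", "variant B", "variant C", "variant D"]
weights = [0.4, 0.3, 0.2, 0.1]

for run in range(3):
    choice = random.choices(face_variants, weights=weights, k=1)[0]
    print(f"run {run}: sampled {choice}")

# There is no single stored face to look up; every generation is a fresh draw,
# which is why consistency breaks down even for a single person.
```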
