Back Side of the Moon - Pink Floyd x Sailor Moon

Copyright infringement by way of ControlNet, IP-Adapter, and refined taste

The Idea

Pink Floyd’s “Back Catalogue” poster has a great composition:

1990s Sailor Moon has a great aesthetic:

What if I combined the two?

The Tools

The Process

I wanted to recreate the Pink Floyd poster with Sailor Moon characters in a retro 1990s anime style. I achieved this in a controllable way using Stable Diffusion.

There are three problems which had to be solved:

  1. How to match the composition of the Pink Floyd poster
  2. How to replace the real women with Sailor Moon characters
  3. How to apply a 1990s Sailor Moon anime aesthetic

1. Composition Matching

The final image should have six characters sitting along an edge with their poses identical to the reference image. To achieve this I used ControlNets.

One ControlNet (Canny) controlled the edges of the generated characters and environment, and the other (Depth) controlled the relative depth of generated elements (i.e. foreground, midground, background).

To begin, I preprocessed the source image to produce two control images which were later used to guide the ControlNets:

The control image produced by the Canny Edge preprocessor effectively highlighted the edges of the input image:
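To give a feel for what the Canny control image encodes, here is a toy gradient-threshold edge detector. This is a deliberate simplification: a real preprocessor runs the full Canny algorithm (Gaussian smoothing, non-maximum suppression, hysteresis thresholding, e.g. cv2.Canny), but the output format is the same idea, white edges on a black background:

```python
import numpy as np

def edge_control_image(img: np.ndarray, threshold: float = 0.2) -> np.ndarray:
    """Crude edge map from a grayscale image with values in [0, 1].

    Keeps only the core gradient-magnitude step of edge detection to
    show what a Canny-style control image encodes.
    """
    gy, gx = np.gradient(img.astype(float))   # per-pixel vertical / horizontal gradients
    magnitude = np.hypot(gx, gy)              # gradient strength
    return (magnitude > threshold).astype(np.uint8)

# Synthetic 8x8 image: dark left half, bright right half.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
edges = edge_control_image(img)   # edges appear along the brightness boundary
```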

The control image produced by the MiDaS Depth Map preprocessor created a reasonably accurate depth map of the input image:

These two control images were fed into their associated ControlNets:

This ensured that generated images would closely match the composition of the source image.
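The write-up doesn't say which toolchain ran the ControlNets, so as an illustration, here is a minimal sketch of how the two control images get paired with per-net strengths in a multi-ControlNet setup. The scales are invented, and the commented diffusers call is one common way to consume them:

```python
def controlnet_kwargs(canny_image, depth_image,
                      canny_scale: float = 0.8, depth_scale: float = 0.5) -> dict:
    """Pair each control image with a conditioning strength.

    Multi-ControlNet setups take parallel lists: control image i is
    routed to ControlNet i and weighted by conditioning scale i.
    """
    return {
        "image": [canny_image, depth_image],
        "controlnet_conditioning_scale": [canny_scale, depth_scale],
    }

# With Hugging Face diffusers, a pipeline built with
# controlnet=[canny_net, depth_net] would consume this as:
#   pipe(prompt=..., **controlnet_kwargs(canny_img, depth_img)).images[0]
kwargs = controlnet_kwargs("canny.png", "depth.png")
```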

2. Character Replacement

Sailor Moon is widely known in popular culture and is thus well-represented in the base SDXL model. This meant I did not need to train a model to understand what is meant by “Sailor Moon characters” (although I could have if I wanted to).

Instead, detailed prompting got close enough:

six beautiful sailor moon characters wearing frilly dainty dresses sitting side by side at the edge of a japanese wooden onsen hot pool outdoors with their legs in the water, sailor moon, sailor neptune, sailor uranus, sailor pluto, sailor jupiter, sailor mars, sailor venus, long hair, short hair, textured hair, pigtail hair, ponytail hair, buns in hair

Although not perfect, those are clearly Sailor-Moon-like characters in the desired composition.

3. Style Matching

Next I needed to apply a retro anime aesthetic to the generated images. It was simple to obtain reference screenshots from retro anime, so the goal was to use these reference screenshots to guide the generated imagery. To achieve this I used an Image Prompt Adapter (IP-Adapter).

IP-Adapter is a technique which utilizes input images as “visual prompts” for image generation. In addition to steering generated imagery towards the desired aesthetic, this approach made the characters more Sailor-Moon-like.
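Conceptually, IP-Adapter encodes the reference screenshot with a CLIP image encoder and projects that embedding into a handful of extra conditioning tokens that sit alongside the text prompt's tokens. Here is a shape-level sketch of that projection step; all dimensions and matrices are illustrative stand-ins, not the real learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def image_tokens(clip_image_embedding: np.ndarray, proj: np.ndarray,
                 num_tokens: int = 4) -> np.ndarray:
    """Project a reference image's CLIP embedding into a short sequence
    of 'visual prompt' tokens (IP-Adapter learns this projection; here
    proj is a random stand-in matrix)."""
    return (clip_image_embedding @ proj).reshape(num_tokens, -1)

text_tokens = rng.normal(size=(77, 768))   # usual SD text-conditioning shape
clip_embed = rng.normal(size=(1024,))      # embedding of a retro anime screenshot
proj = rng.normal(size=(1024, 4 * 768))    # stand-in for the learned projector

# For illustration, append the image tokens to the text tokens; the real
# IP-Adapter routes them through a separate ("decoupled") cross-attention
# branch whose output is added back with a tunable scale.
cond = np.concatenate([text_tokens, image_tokens(clip_embed, proj)], axis=0)
```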

The choice of reference imagery had a big effect on the generated imagery. Using screenshots from different eras of animation achieved different aesthetics:

I could now generate candidate images.

Image Generation

As I was generating thousands of candidate images, I used dynamic prompts to add variety to the generated images. For each generated image, one option was chosen at random from each group of prompt alternatives.

I varied the clothing worn by the characters, the environment, and the lighting:

wearing frilly dainty dresses|wearing modest bathing suits

japanese wooden onsen hot pool outdoors|high school swimming pool indoors|japanese garden pond outdoors with sky above|wooden dock at the oceanfront

golden hour lighting|sunset lighting|sunrise lighting
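The `a|b|c` groups above follow the syntax used by dynamic-prompt tools (which exact tool was used is my assumption). The mechanism is simple enough to re-implement in a few lines; the template below is stitched together from the option groups above purely for illustration:

```python
import random

def expand(template: str, rng: random.Random) -> str:
    """Collapse each {a|b|c} group to one randomly chosen option."""
    out = template
    while "{" in out:
        start = out.index("{")
        end = out.index("}", start)
        options = out[start + 1:end].split("|")
        out = out[:start] + rng.choice(options) + out[end + 1:]
    return out

# Illustrative template built from the option groups above.
template = (
    "sailor moon characters {wearing frilly dainty dresses|wearing modest bathing suits} "
    "sitting at a {japanese wooden onsen hot pool outdoors|high school swimming pool indoors}, "
    "{golden hour lighting|sunset lighting|sunrise lighting}"
)
prompt = expand(template, random.Random(42))
```

Each call with a fresh seed yields a different combination, which is what kept a multi-day batch run from producing thousands of near-identical images.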

I left this running for a few days and went digging for my favorite image:

I had a winner, but a problem common with Stable Diffusion appeared: the characters' faces and hands were deformed:

Next, I fixed that.

Bonus: Face and Hand Fixing

I was happy with most of the generated image and wanted to regenerate only the faces and hands. This was achieved with inpainting (i.e. generating new imagery only in selected areas of an existing image).

To automatically locate the faces and fingers for masking, I used Bounding Box Detectors (BBOX): a detector finds each face and hand, those areas are masked out, and new imagery is generated only within the masked regions:
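The glue step between detection and inpainting is turning the detector's boxes into a binary mask. A minimal sketch, with invented coordinates standing in for real detector output:

```python
import numpy as np

def bbox_mask(height: int, width: int,
              boxes: list[tuple[int, int, int, int]]) -> np.ndarray:
    """Build an inpainting mask from detector bounding boxes.

    White (1) regions are regenerated; black (0) regions are kept.
    Boxes are (x0, y0, x1, y1) pixel coordinates, the format most
    detectors emit.
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = 1
    return mask

# e.g. two detected faces in a 512x512 image (coordinates invented)
mask = bbox_mask(512, 512, [(100, 50, 160, 120), (300, 60, 360, 130)])
```

An inpainting pipeline then regenerates only the white pixels (in diffusers, via the mask_image argument), leaving the rest of the winning image untouched.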

The Result

Putting it all together, I had the final poster: