Japanese Citypop Landscapes via Style Transfer

Creating imagery in genres that do not exist

The Idea

I was scrolling through my Instagram feed one morning and was presented with a series of Japanese citypop images:

Beautiful images, but they are all of urban environments. What would landscape images look like in this style?

The Tools

Stable Diffusion (sdxl_1.0, ComfyUI)
LoRA training (Kohya’s GUI)

The Process

My goal was to take an established painting style and create imagery that is not typically found in that style.

My process was:

Figure out a prompting style which recreates selected reference images
Train a Stable Diffusion LoRA on the target style
Modify the content of the prompt to generate unusual imagery in the trained style

1. Prompt Design

I started by getting as close as possible to the target style using the base SDXL model. The base model was trained (at great expense) on a wide variety of styles and techniques, so taking advantage of this was the obvious place to start. This approach had the benefit of easing the demands on LoRA training, which was only needed to “finish the job”.

After some experimentation, I found a prompting style which worked reasonably well:

an acrylic painting on canvas of a small silhouetted private jet flying over a large city from high altitude at sunset, buildings lit up, moody, dramatic, metropolis, masterpiece, flat application, minimal brushwork, flat color blocks, sharp clean lines, vibrant pastels, retro 80s aesthetic, acrylic look, no brush marks, graphic quality, stylized architecture, exaggerated shadows, pop art influence, art deco elements, clear composition, minimalistic, minimalism, simple, japanese citypop style

With the prompt settled, I moved on to finding which sampler and CFG value worked best.

I created a grid to investigate:

Based purely on aesthetics, I went with DPM++ 2S Ancestral as the sampler with a CFG value of 7.5. The choice of an ancestral sampler (where noise is injected at each step, causing the image to change from one step to the next) is unusual, but worked fine for this project.
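A sampler/CFG comparison grid like this one can be enumerated programmatically. The sketch below only builds the list of grid cells; it does not render anything. The candidate samplers, CFG values, and seed are hypothetical placeholders, since only the winning combination (DPM++ 2S Ancestral at CFG 7.5) is reported here. Holding the seed fixed across cells is an assumption that keeps the comparison about sampler and CFG rather than about the starting noise.

```python
from itertools import product

# Hypothetical candidates: the write-up reports only the chosen pair,
# not the full set of samplers and CFG values that were compared.
SAMPLERS = ["Euler", "Euler Ancestral", "DPM++ 2M", "DPM++ 2S Ancestral"]
CFG_VALUES = [5.0, 7.5, 10.0]
SEED = 1234  # assumed fixed seed so cells differ only by sampler and CFG


def grid_cells(samplers=SAMPLERS, cfg_values=CFG_VALUES, seed=SEED):
    """Return one settings dict per grid cell (rows = samplers, cols = CFG)."""
    cells = []
    for (row, sampler), (col, cfg) in product(
        enumerate(samplers), enumerate(cfg_values)
    ):
        cells.append(
            {"row": row, "col": col, "sampler": sampler, "cfg": cfg, "seed": seed}
        )
    return cells


for cell in grid_cells():
    print(f"[{cell['row']},{cell['col']}] {cell['sampler']} @ CFG {cell['cfg']}")
```

Each dict can then be handed to whatever generation front end is in use (ComfyUI in this project) to render the corresponding image.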

2. Style Training

The success of LoRA training depends on the quality of the training data. Images must be high resolution and representative of the target style, and captions must be highly detailed and match the prompting style planned for use with the fine-tuned model.

The style I was interested in was primarily the work of Japanese artist Hiroshi Nagai. It was simple to gather a collection of high-resolution reference images of his work:

I then created detailed captions for each image in the following format (referring to the target style as ‘japanese citypop style’):

an acrylic painting of a curved pool with still water, ladder entering the pool, pool chairs and a table surrounding the pool, spherical lamp in foreground with rose bush, green bushes with yellow flowers with tall tropical trees in the midground, ocean in the background with breaking waves, gradient blue sky, motel to the left with a fence in front of it, pastel colors, calm, serene, luxury, japanese citypop style
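Captions in this format are usually stored as text files next to the images. The sketch below assumes the common Kohya dataset layout, in which each image has a same-named `.txt` caption file inside a folder named `<repeats>_<concept>`; check that layout against the version of the trainer in use. The helper name `write_captions`, the concept name `citypop`, and the image file names are hypothetical.

```python
from pathlib import Path

TRIGGER = "japanese citypop style"  # style phrase used at the end of every caption


def write_captions(dataset_dir, captions, repeats=20, concept="citypop"):
    """Write one sidecar .txt caption per image name and return the paths.

    `captions` maps an image file name to its caption text. The folder
    name "<repeats>_<concept>" is an assumed Kohya-style convention.
    """
    folder = Path(dataset_dir) / f"{repeats}_{concept}"
    folder.mkdir(parents=True, exist_ok=True)
    written = []
    for image_name, caption in captions.items():
        text = caption.strip()
        # Make sure every caption ends with the style trigger phrase.
        if not text.endswith(TRIGGER):
            text = f"{text}, {TRIGGER}"
        path = folder / Path(image_name).with_suffix(".txt").name
        path.write_text(text + "\n", encoding="utf-8")
        written.append(path)
    return written
```

Calling `write_captions("train/img", {"pool.jpg": "an acrylic painting of a curved pool with still water"})` would create `train/img/20_citypop/pool.txt` with the trigger phrase appended.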

With the training data prepared, I used Kohya’s GUI to generate the LoRA. After considerable experimentation, the following settings worked best:

With 40 training images, 20 repeats, a batch size of 1, and 15 epochs, I trained for a total of 12,000 steps, saving a checkpoint at the end of each epoch (every 800 steps). This gave a wide cross-section of checkpoints with varying degrees of training. More training does not necessarily produce a better result; overtraining is always a risk.
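The step count follows directly from the dataset settings: images × repeats ÷ batch size gives the steps per epoch, and multiplying by the number of epochs gives the total. The short sketch below reproduces that arithmetic; the function name `training_schedule` is illustrative and not part of any training tool.

```python
import math


def training_schedule(num_images, repeats, batch_size, epochs):
    """Return (steps_per_epoch, list of step counts at each epoch checkpoint)."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    checkpoints = [steps_per_epoch * epoch for epoch in range(1, epochs + 1)]
    return steps_per_epoch, checkpoints


steps_per_epoch, checkpoints = training_schedule(
    num_images=40, repeats=20, batch_size=1, epochs=15
)
print(steps_per_epoch)   # 800
print(checkpoints[-1])   # 12000
print(checkpoints[3])    # 3200 (checkpoint saved after epoch 4)
```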

Once I had the 15 epoch checkpoints, I needed to determine which one worked best. To do this I generated a grid of test images using the prompting style developed earlier:

an acrylic painting on canvas of an alaskan winter wilderness scene, frozen river in a forest with snow covered evergreens, denali in background, bright sunny day, cold, masterpiece, flat application, minimal brushwork, Flat color blocks, sharp clean lines, vibrant pastels, retro 80s aesthetic, acrylic look, no brush marks, graphic quality, stylized architecture, exaggerated shadows, pop art influence, art deco elements, clear composition, japanese citypop style

Overtraining artifacts were first visible at epoch 7. Epoch 4 produced the best results.

The Result

With a trained LoRA, tuned parameters, and a custom prompt style, I could now reliably generate images in the target style. I generated different landscape environments by changing only the content portion of the prompt:

an alaskan winter wilderness scene, frozen river in a forest with snow covered evergreens, denali in background, bright sunny day, cold

a vast empty desert landscape, steep cliffs, mesa, bright sunny day, hot

a thick jungle in africa, hot
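Swapping only the content portion is easy to express as a small template. The sketch below is one way to do it, assuming the prompt is split into a fixed medium prefix, a variable content description, and a fixed list of style tags. The tags are copied (lowercased) from the test prompt in the Style Training section; the names `MEDIUM`, `STYLE_TAGS`, and `build_prompt` are illustrative.

```python
MEDIUM = "an acrylic painting on canvas of"

# Style tags taken from the test prompt used for checkpoint selection.
STYLE_TAGS = [
    "masterpiece", "flat application", "minimal brushwork",
    "flat color blocks", "sharp clean lines", "vibrant pastels",
    "retro 80s aesthetic", "acrylic look", "no brush marks",
    "graphic quality", "stylized architecture", "exaggerated shadows",
    "pop art influence", "art deco elements", "clear composition",
    "japanese citypop style",
]

# Content descriptions from the results above.
CONTENTS = [
    "an alaskan winter wilderness scene, frozen river in a forest with "
    "snow covered evergreens, denali in background, bright sunny day, cold",
    "a vast empty desert landscape, steep cliffs, mesa, bright sunny day, hot",
    "a thick jungle in africa, hot",
]


def build_prompt(content):
    """Combine the fixed medium, a content description, and the style tags."""
    return f"{MEDIUM} {content.strip()}, {', '.join(STYLE_TAGS)}"


for content in CONTENTS:
    print(build_prompt(content))
    print()
```

Keeping the style tags in one place means every landscape is generated with exactly the same style wording, so differences between images come from the content description alone.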

BONUS: Validation

It turns out that Hiroshi Nagai did in fact paint landscapes, so I can verify my work:

My results seem to be a reasonable approximation!