Attempts at sprite generation in ComfyUI, Pt. 1

I'm doing a lot of exploration of ComfyUI as a means of generating sprites for 2D fighting games. Let's go through some of what I've found. 

As a starting point for learning, I began with this video tutorial series by Pixaroma. Specifically, I watched episode 1 on the basics of image generation, episode 10 on Flux GGUF (an image generation model), episode 14 on ControlNet OpenPose (which can identify and replicate poses), and episode 28 on creating consistent characters with Flux and online LoRAs.


Here are the basics of how AI image generation works in ComfyUI. The most important node is the KSampler. The KSampler requires the following inputs: an AI model, a positive prompt describing what you want to see in the image, a negative prompt describing what you don't want to see, and a latent image.


Starting from the latent image (which is either supplied by the user or left empty), the KSampler carries out a process called "denoising," in which it makes small adjustments to the latent image. To figure out what adjustments to make, the KSampler consults the prompts and the model. You can tweak various parameters of the KSampler, including the denoise strength (a value from 0 to 1 that determines how much the KSampler should modify the latent image), the number of steps in the denoising process, and the classifier-free guidance (CFG), which controls how closely the process adheres to the prompt.
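To make this concrete, here's a rough sketch of what a KSampler node looks like in a workflow exported in ComfyUI's API format (a JSON file, written here as a Python dict). The node IDs and parameter values are placeholders, not the exact settings I used.

```python
# Rough sketch of a KSampler node as it appears in a ComfyUI API-format
# workflow export. The node IDs ("3", "4", "5", "6", "7") are arbitrary
# placeholders pointing at other nodes in the workflow.
ksampler_node = {
    "3": {
        "class_type": "KSampler",
        "inputs": {
            "model": ["4", 0],        # output 0 of the model loader node
            "positive": ["6", 0],     # encoded positive prompt
            "negative": ["7", 0],     # encoded negative prompt
            "latent_image": ["5", 0], # e.g. an Empty Latent Image node
            "seed": 42,
            "steps": 20,              # number of denoising steps
            "cfg": 3.5,               # classifier-free guidance strength
            "sampler_name": "euler",
            "scheduler": "normal",
            "denoise": 1.0,           # 1.0 = start from pure noise
        },
    }
}
```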

It is possible to generate images purely using text prompts and unmodified models. However, if you want more precise control over your results, you can use additional nodes that modify the prompts and the model. For this project, I used a feature called ControlNet OpenPose.


OpenPose is a system designed for detecting human poses. When given an image, OpenPose can identify any humans in the image, as well as the positions and angles of their limbs. It can also optionally identify finger positions and certain facial features. OpenPose encodes this information into an image file, which can then be used with compatible models to replicate the pose with different characters.
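Within ComfyUI, I generated the pose image with an OpenPose preprocessor node. The sketch below assumes the comfyui_controlnet_aux custom node pack; the node and field names are my best recollection and may differ slightly in your install.

```python
# Sketch of an OpenPose preprocessor node (from the comfyui_controlnet_aux
# custom node pack -- names are from memory and may differ). It takes a
# loaded image and outputs the stick-figure pose image that ControlNet uses.
openpose_node = {
    "10": {
        "class_type": "OpenposePreprocessor",
        "inputs": {
            "image": ["9", 0],        # output of a Load Image node
            "detect_body": "enable",  # limb positions and angles
            "detect_hand": "enable",  # optional: finger positions
            "detect_face": "disable", # optional: facial keypoints
            "resolution": 512,        # resolution the detector runs at
        },
    }
}
```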

An example of an OpenPose output image file (left). This pose data has been used to generate images of actors Henry Cavill (center) and Summer Glau (right). From https://openposes.com/

OpenPose data can be applied to image generation prompts in ComfyUI using something called ControlNet. ControlNet is a "helper" model that modifies prompts as they're being sent into the main model. To use ControlNet, you need a ControlNet model that is compatible with your image generation model. In this case, I was using a model called Flux.1-dev, so I had to find a ControlNet model that was compatible with Flux.
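In the workflow, that translates to two more nodes: one that loads the ControlNet model and one that applies it to the prompt conditioning. Again, this is a sketch in the same API-dict style, with a placeholder file name rather than the exact model file I used.

```python
# Sketch of wiring a ControlNet into the prompt conditioning, using ComfyUI's
# built-in ControlNetLoader and ControlNetApplyAdvanced nodes. The model file
# name is a placeholder for whichever Flux-compatible ControlNet you download.
controlnet_nodes = {
    "11": {
        "class_type": "ControlNetLoader",
        "inputs": {"control_net_name": "flux-controlnet-openpose.safetensors"},
    },
    "12": {
        "class_type": "ControlNetApplyAdvanced",
        "inputs": {
            "positive": ["6", 0],      # conditioning from the positive prompt
            "negative": ["7", 0],      # conditioning from the negative prompt
            "control_net": ["11", 0],  # the ControlNet model loaded above
            "image": ["10", 0],        # the OpenPose pose image
            "strength": 1.0,           # how strongly the pose is enforced
            "start_percent": 0.0,      # apply from the first denoising step
            "end_percent": 0.2,        # the default I inherited; more on this later
        },
    },
}
# The two conditioning outputs of node "12" then replace the raw prompt
# conditioning as the KSampler's "positive" and "negative" inputs.
```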


For this project, I initially attempted to use this sprite sheet depicting Makoto from Street Fighter III: Third Strike performing her "Fukiage" special move as my pose reference image.


However, I found that OpenPose was having a lot of trouble identifying any poses from this image. This is probably for a few reasons: the sprites are pixel art (which I imagine is relatively low-resolution compared to the images OpenPose expects), Makoto is not proportioned entirely like an actual human, her clothing somewhat obscures her body shape, and there are some instances of smear frames.

For my next attempt, I prepared a sprite sheet of Da Lan from the fighting game Dong Dong Never Die doing his standing heavy kick.


All of the sprites in Dong Dong Never Die are digitized images of actual people, so I imagined that OpenPose would have an easier time with them. As expected, this worked quite a bit better: OpenPose was mostly able to identify the poses accurately, although it did struggle with some of the poses in the middle. This might be because the sprites are still relatively low-resolution, making it hard to pick out details. 


Just for completeness's sake, here are all of the models I used for this workflow. All of these models can be found on Hugging Face, a platform for sharing machine learning models and datasets.


Note that my main generation model is called "Flux.1-dev-Q3_K_S.gguf," as opposed to all of the other models, which have the file type ".safetensors." A GGUF is essentially a compressed, quantized version of a machine learning model that is much smaller and faster to load. The complete Flux.1-dev model is a 23.8 GB file, while this GGUF is only 5.23 GB. In order to use the GGUF model, I apparently also needed two "CLIP" models. From what I understand, these are the two text encoders (CLIP-L and T5-XXL) that convert the prompt into conditioning the image generation model can actually use.
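Here's roughly how those loaders fit into the workflow. The UnetLoaderGGUF node comes from the ComfyUI-GGUF custom node pack, while DualCLIPLoader and VAELoader are built in; the text encoder and VAE file names below are the commonly distributed ones and may not match yours exactly.

```python
# Sketch of the model-loading nodes for the GGUF version of Flux. Node and
# field names are my best recollection; file names other than the GGUF are
# the commonly distributed ones and may differ on your machine.
loader_nodes = {
    "4": {
        "class_type": "UnetLoaderGGUF",   # from the ComfyUI-GGUF custom nodes
        "inputs": {"unet_name": "Flux.1-dev-Q3_K_S.gguf"},
    },
    "14": {
        "class_type": "DualCLIPLoader",   # loads the two text encoders
        "inputs": {
            "clip_name1": "clip_l.safetensors",
            "clip_name2": "t5xxl_fp8_e4m3fn.safetensors",
            "type": "flux",
        },
    },
    "15": {
        "class_type": "VAELoader",        # decodes latents back into pixels
        "inputs": {"vae_name": "ae.safetensors"},
    },
}
```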

For my prompts, I used long, prose-style descriptions of the character design and art style I wanted, as well as a description of the final image layout.
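The prompts themselves are just the text inputs of CLIP Text Encode nodes, whose outputs become the conditioning fed to ControlNet and the KSampler. The prompt below is a made-up example of the style I used, not my actual prompt.

```python
# The prompt is just the "text" input of a CLIPTextEncode node. The prompt
# shown here is a made-up example in the style I used, not my actual prompt.
prompt_node = {
    "6": {
        "class_type": "CLIPTextEncode",
        "inputs": {
            "clip": ["14", 0],  # the text encoders loaded by DualCLIPLoader
            "text": (
                "A sprite sheet of a martial artist in a red gi with short "
                "black hair, drawn in a clean cel-shaded anime style, "
                "performing a high kick, arranged in a grid on a plain "
                "white background."
            ),
        },
    }
}
```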


I got some fairly disappointing results from this method:


The AI was fairly good at maintaining character design consistency between poses (of course, this consistency was not maintained between generations). However, it was rather poor at replicating the character's poses precisely. It seemed especially reluctant to have the character face away from the camera during the spin at the start of the kick animation. Because of the lack of precise adherence to the provided poses, the character's limbs jump around randomly between frames, which I imagine would make these sprites look very jerky in motion.

I think I could've gotten much better results if I'd played around with some of the settings associated with ControlNet and the KSampler. I believe I didn't modify the default ControlNet settings at all. Notably, the default settings set the end_percent parameter to 0.2, which essentially means that ControlNet stops applying 20% of the way into the generation. This is probably largely to blame for the lack of adherence to the provided pose data.
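If I revisit this workflow, the first thing I'd try is a change like the following to the ControlNetApplyAdvanced inputs (hypothetical values, untested):

```python
# Hypothetical tweak (untested): keep the pose constraint active through the
# whole denoising process instead of only the first 20% of it.
controlnet_apply_inputs = {
    "strength": 0.9,       # relax slightly if the pose over-constrains the image
    "start_percent": 0.0,  # apply from the first denoising step...
    "end_percent": 1.0,    # ...to the last (the default workflow stopped at 0.2)
}
```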


Also, each generation took a fairly long time: about 10 minutes. From later testing with other image generation models, this seems to be an issue exclusive to Flux: other models ran much more quickly, and produced better results as well. I'm not sure whether it's a problem inherent to Flux, or if it just doesn't play nicely with my hardware for some reason.

I stuck with Flux even though it was running poorly on my end because, somehow, Pixaroma's videos had given me the idea that OpenPose was only compatible with Flux. That's not true: OpenPose can be used with any image generation model, as long as you have a compatible ControlNet model. In future testing, I tried models other than Flux and did a lot more structured investigation into all of the node parameters and what they affect. I think I'll save that for future posts, though.
