I'm doing a lot of exploration of ComfyUI as a means of generating sprites for 2D fighting games. Let's go through some of what I've found.
As a starting point for learning, I began with this
video tutorial series by Pixaroma. Specifically, I watched episode 1 on the basics of image generation, episode 10 on
Flux GGUF (an image generation model), episode 14 on ControlNet OpenPose (which can
identify and replicate poses), and episode 28 on creating consistent characters
with Flux and online LoRAs.
Here are the basics of how AI image generation works in ComfyUI. The most important node is the KSampler. The KSampler requires the following inputs: an AI model, a positive prompt (which describes what you want to see in the image), a negative prompt (which describes what you don't want to see), and a latent image.
Starting from the latent image (which is either supplied by the user or left empty), the KSampler carries out a process called "denoising," in which it makes small adjustments to the latent image, using the prompts and the model to decide what adjustments to make. You can tweak various parameters of the KSampler, including the denoise strength (a value from 0 to 1 that determines how much the KSampler modifies the latent image), the number of denoising steps, and the classifier-free guidance (CFG) scale, which controls how closely the result follows the prompt.
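To make that wiring concrete, here is a rough sketch of the same graph written out in ComfyUI's API (JSON) format and queued on a locally running server. The node IDs, checkpoint filename, prompts, and parameter values are placeholders I picked for illustration; the node class names and input sockets follow ComfyUI's default text-to-image template.

```python
import json
import urllib.request

# A minimal text-to-image workflow in ComfyUI's API format. Each key is a
# node ID; each value names a node type and wires its inputs, either to a
# literal value or to another node's output as [node_id, output_index].
workflow = {
    # Loads the generation model, plus the CLIP text encoder and VAE that ship with it.
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "some_model.safetensors"}},
    # Positive prompt: what you want to see.
    "2": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["1", 1],
                     "text": "a martial artist in a fighting stance, full body"}},
    # Negative prompt: what you don't want to see.
    "3": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["1", 1], "text": "blurry, extra limbs"}},
    # An empty latent image for the sampler to start denoising from.
    "4": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 512, "height": 512, "batch_size": 1}},
    # The KSampler ties everything together and performs the denoising.
    "5": {"class_type": "KSampler",
          "inputs": {"model": ["1", 0],
                     "positive": ["2", 0],
                     "negative": ["3", 0],
                     "latent_image": ["4", 0],
                     "seed": 42,
                     "steps": 20,        # number of denoising steps
                     "cfg": 7.0,         # classifier-free guidance scale
                     "sampler_name": "euler",
                     "scheduler": "normal",
                     "denoise": 1.0}},   # 1.0 = fully repaint the empty latent
    # Decode the finished latent back into a normal image and save it.
    "6": {"class_type": "VAEDecode",
          "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
    "7": {"class_type": "SaveImage",
          "inputs": {"images": ["6", 0], "filename_prefix": "sprite"}},
}

# Queue the workflow on ComfyUI's default local endpoint.
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```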
It is possible to generate images purely using text prompts and unmodified models. However, if you want to have a more precise level of control over your results, you can use additional nodes that modify the text prompts and model. For this project, I used a feature called ControlNet OpenPose.
An example of an OpenPose output image file (left). This pose data has been used to generate images of actors Henry Cavill (center) and Summer Glau (right). From https://openposes.com/
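For reference, here's roughly how pose guidance slots into the workflow sketch above: the positive and negative conditioning are routed through a ControlNet node before they reach the KSampler. The filenames and strength value below are placeholders; the node class names are the stock ComfyUI ControlNet nodes.

```python
# Extend the earlier sketch with pose guidance (placeholder filenames).
controlnet_nodes = {
    # The pose image, e.g. an OpenPose skeleton rendered from a reference frame.
    "8": {"class_type": "LoadImage",
          "inputs": {"image": "pose_frame_01.png"}},
    # A ControlNet model trained to follow OpenPose input.
    "9": {"class_type": "ControlNetLoader",
          "inputs": {"control_net_name": "openpose_controlnet.safetensors"}},
    # Mixes the pose information into both the positive and negative conditioning.
    "10": {"class_type": "ControlNetApplyAdvanced",
           "inputs": {"positive": ["2", 0],
                      "negative": ["3", 0],
                      "control_net": ["9", 0],
                      "image": ["8", 0],
                      "strength": 0.9,       # how strongly the pose is enforced
                      "start_percent": 0.0,  # apply over the entire
                      "end_percent": 1.0}},  # denoising process
}
workflow.update(controlnet_nodes)

# The KSampler now takes its conditioning from the ControlNet node instead.
workflow["5"]["inputs"]["positive"] = ["10", 0]
workflow["5"]["inputs"]["negative"] = ["10", 1]
```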
All of the sprites in Dong Dong Never Die are digitized images of actual people, so I imagined that OpenPose would have an easier time with them. As expected, this worked quite a bit better: OpenPose was mostly able to identify the poses accurately, although it did struggle with some of the poses in the middle. This might be because the sprites are still relatively low-resolution, making it hard to pick out details.
Note that my main generation model is called "Flux.1-dev-Q3_K_S.gguf," as opposed to all of the other models, which have the file extension ".safetensors." A GGUF is essentially a compressed (quantized) version of a machine-learning model that is smaller and faster to load and run. The complete Flux.1-dev model is a 23.8 GB file, while this GGUF is only 5.23 GB. In order to use the GGUF model, I apparently also needed two CLIP models. From what I understand, these are the two text encoders (CLIP-L and T5) that turn the text prompts into something the image generation model can work with.
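For completeness, here's a sketch of how that changes the loading side of the workflow, assuming the ComfyUI-GGUF custom nodes are installed. The GGUF filename is the one above; the CLIP and VAE filenames, the "flux" type setting, and the exact node class names are my assumptions about a typical Flux GGUF setup, so treat them as illustrative.

```python
# Swap the all-in-one checkpoint loader for separate GGUF / CLIP / VAE loaders.
flux_gguf_nodes = {
    # The quantized Flux model replaces the checkpoint's MODEL output.
    "11": {"class_type": "UnetLoaderGGUF",
           "inputs": {"unet_name": "Flux.1-dev-Q3_K_S.gguf"}},
    # Flux needs its two text encoders (CLIP-L and T5) loaded separately.
    "12": {"class_type": "DualCLIPLoader",
           "inputs": {"clip_name1": "clip_l.safetensors",
                      "clip_name2": "t5xxl_fp16.safetensors",
                      "type": "flux"}},
    # The VAE also has to be loaded on its own.
    "13": {"class_type": "VAELoader",
           "inputs": {"vae_name": "ae.safetensors"}},
}
workflow.update(flux_gguf_nodes)

# Repoint the rest of the graph at the new loaders.
workflow["5"]["inputs"]["model"] = ["11", 0]   # KSampler's model
workflow["2"]["inputs"]["clip"] = ["12", 0]    # positive prompt encoder
workflow["3"]["inputs"]["clip"] = ["12", 0]    # negative prompt encoder
workflow["6"]["inputs"]["vae"] = ["13", 0]     # VAE decode
```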
I got some fairly disappointing results from this method:
The AI was fairly good at maintaining the character's design consistency between poses (of course, this consistency was not maintained between generations). However, it was rather poor at replicating the character's poses precisely. It seemed especially reluctant to have the character face away from the camera during the spin at the start of the kick animation. Because of this lack of precise adherence to the provided poses, the character's limbs jump around randomly between frames, which I imagine would make these sprites look very jerky in motion.
The reason I stuck with Flux even though it ran poorly on my machine is that Pixaroma's videos had somehow given me the idea that OpenPose was only compatible with Flux. That's not true: it can be used with any image generation model, as long as you have a compatible ControlNet model. In later testing, I tried models other than Flux and did a much more structured investigation of the node parameters and what each one affects. I think I'll save that for future posts, though.