It's been a while since I've made a post here. Long story short, I got overwhelmed by my workload and lost momentum. Also, I find the process of working with AI to be kind of unpleasant. The work is interesting and I see the value in it, but the lack of predictable results can be frustrating: if something goes wrong, it's hard to trace it back to an identifiable cause.
The plan at the time was to continue investigating OpenPose in combination with a character LoRA. If OpenPose can replicate a pose accurately from a reference image, and if a LoRA can replicate a character design accurately and consistently, then in theory, I'd be able to generate an entire image set for a character.
In my previous work, I added the VTuber Korone to the game Dong Dong Never Die, creating the sprites myself using a 3D model (made by someone else) and translating the 3D renders to 2D.
I decided that I would attempt to replicate this sprite, both as a fun callback to my previous work, and because this is a recognizable character that I figured somebody had already made a LoRA of.
I took a picture of myself approximating this pose:
Here's a screenshot of my workflow:
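If it helps to see this outside of a node graph, here's a rough sketch of the same idea using the diffusers library: an OpenPose ControlNet conditioning the sampler on my reference photo, with a character LoRA loaded on top. To be clear, this is an SD 1.5 analogue rather than my exact Flux workflow, and the file names and prompt are just placeholders.

```python
import torch
from controlnet_aux import OpenposeDetector
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Extract an OpenPose skeleton from the reference photo (placeholder filename).
openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
pose_map = openpose(load_image("my_pose_photo.jpg"))

# Base model + OpenPose ControlNet, with a character LoRA layered on top.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("loras/korone_character.safetensors")  # placeholder path

image = pipe(
    "inugami korone, 1girl, fighting stance, full body",  # placeholder prompt
    image=pose_map,
    num_inference_steps=25,
).images[0]
image.save("korone_pose_test.png")
```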
Like before, my generations were taking a long time. Also, the image quality in general was kind of low. There was another issue: the LoRA appeared to be trained on multiple different outfits (the author commentary on Civitai implied it was kind of thrown together).
While this could perhaps be fixed if I were to use a LoRA trained on a specific outfit, I still wanted to test if it was possible to maintain that consistency. So, after a few more tries, I decided to add in a latent image. Up to this point, I'd just been using an empty latent.
The latent image is essentially the "starting point" for the image generation process: the KSampler starts from the latent and carries out a number of denoising steps in order to get a final image. The amount by which the image deviates from the latent can be controlled by the "denoise" parameter in the KSampler: setting it to 0 will not modify the latent at all, setting it to 1 will completely diverge from the latent, and anywhere in the middle will retain some qualities of the latent, but not replicate it perfectly.
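To make that concrete: in the diffusers library the equivalent knob on the img2img pipeline is called "strength", and sweeping it shows the trade-off directly. This is only an illustrative sketch, with a placeholder image name and prompt.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The reference sprite serves as the non-empty latent (placeholder filename).
init_image = load_image("reference_sprite.png").resize((512, 512))

# strength plays the same role as the KSampler's denoise:
# near 0 keeps the input mostly intact, 1.0 ignores it completely.
for strength in (0.3, 0.6, 0.9):
    out = pipe(
        "inugami korone, fighting stance, white jacket",  # placeholder prompt
        image=init_image,
        strength=strength,
        num_inference_steps=30,
    ).images[0]
    out.save(f"denoise_{strength:.1f}.png")
```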
I tried a bunch more generations, tweaking various parameters along the way, including the KSampler's denoise and the ControlNet strength, start percent, and end percent. The results were not fantastic:
I was never really able to find a tuning of the parameters that hit all the marks I wanted. If I made the ControlNet parameters stronger, it would increase the pose accuracy but mess up the anatomy. If I increased the denoise amount, the outfit would deviate more from the reference image, but if I reduced it, the pose would not be replicated as accurately. Basically, what I got from this is that if I really want both pose accuracy and outfit consistency, I have to handle that at the LoRA creation stage.
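For reference, those three ControlNet knobs have direct counterparts in diffusers, so the sweep I was doing by hand in ComfyUI looks roughly like this in code. The model names are real Hugging Face repos, but the pose map, prompt, and value grid are placeholders I made up for the sketch.

```python
import itertools

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
pose_map = load_image("pose_map.png")  # pre-extracted OpenPose skeleton (placeholder)

# ComfyUI's "strength" maps to controlnet_conditioning_scale; "start percent"
# and "end percent" map to control_guidance_start / control_guidance_end.
for scale, end in itertools.product((0.6, 0.8, 1.0), (0.5, 0.75, 1.0)):
    img = pipe(
        "1girl, fighting stance, full body",  # placeholder prompt
        image=pose_map,
        controlnet_conditioning_scale=scale,
        control_guidance_start=0.0,
        control_guidance_end=end,
        num_inference_steps=25,
    ).images[0]
    img.save(f"cn_scale{scale}_end{end}.png")
```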
At some point, I decided to try out different base models, as I was unhappy with the results I was getting from Flux. I switched to Stable Diffusion 1.5 and immediately started getting much better outcomes: not only was the image quality much higher, but the images also took way less time to generate, pretty consistently under 30 seconds.
Additionally, there seem to be a lot more LoRAs out there for Stable Diffusion, and the quantity and quality of documentation and tutorials seem higher as well.
I was immediately able to get some pretty decent-looking results by using the same pose image with a Stable Diffusion model and no character LoRA, just a text description of the image:
The generated images are quite close to the original pose, and the anatomy looks pretty solid overall. Some of the face details are a little questionable, but I think it's fair to call this a success.
I decided that the next target would be to see if I could generate some consistent character images from scratch. This would allow me to get further experience with the capabilities of AI image generation, and if successful, would give me some materials that I could use to train a LoRA.
Specifically, I wanted to see if I could get the AI to create a reasonable image of a Black man, since I figured that would be a blind spot of most AI models, and I don't want my game to be filled with nothing but generically attractive light-skinned women. I especially anticipated that this would be difficult with the models I was using, which were more anime-styled. I started with the following positive prompt:
Masterpiece, best quality, 4k, anime character sheet, 1 boy, reference sheet, black male, 24 years old, slim build, full body, simple background, front view, side view, back view, multiple views, shirt, black fingerless gloves, standing, low-cut t-shirt, black shirt, yellow jacket tied around waist, amber eyes, black hair, yellow highlights in hair, mid-length dreadlocks, ponytail, long baggy pants, black pants, barefoot
And this negative prompt:
blurry, low quality, verybadimagenegative_v1.3, worst quality, bad-hands-5, EasyNegative, bad anatomy, depth of field
My negative prompt contains a few embeddings (textual inversions). These are tiny helper models that go into their own embeddings folder in ComfyUI.
You can invoke embeddings by including their name in either your positive or negative prompt. Some embeddings, such as "bad-hands-5," are trained on things that are generally undesirable (such as bad hands), and including those in the negative prompt will help stop those things from appearing in your image.
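Outside of ComfyUI, the same idea looks like this in diffusers, where you load each embedding explicitly and then reference its trigger word in the prompt. The file paths below are placeholders for the files downloaded from Civitai.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Each embedding file adds a new trigger word to the text encoder's vocabulary.
# The paths are placeholders for the downloaded .pt/.safetensors files.
pipe.load_textual_inversion("embeddings/bad-hands-5.pt", token="bad-hands-5")
pipe.load_textual_inversion("embeddings/EasyNegative.safetensors", token="EasyNegative")

image = pipe(
    prompt="masterpiece, best quality, anime character sheet, 1boy, full body, standing",
    negative_prompt="blurry, low quality, bad-hands-5, EasyNegative, bad anatomy",
    num_inference_steps=25,
).images[0]
image.save("embedding_test.png")
```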
Anyway, here's what I got:
Clearly, the AI abjectly failed to produce an image of a Black person. In addition to not looking like dreads, the hair is inconsistent across images and kind of nonsensical in some of them. The sash position is also inconsistent.
I tried adjusting the tags, but kept getting pretty poor results. By using the tag "african male" (which seems to be how images of Black people are generally tagged for SD models), I was able to get a character with darker skin, but it still didn't look much like what I wanted.
I went searching on Civitai for LoRAs focused on replicating Black hairstyles. I was pretty disappointed by the selection. I found one that seemed to work okay, but it appeared to be designed for realistic images, so it changed the visual style of the entire rest of the image considerably:
After many rounds of tweaking my prompt and various parameters, I was able to get some generations that looked stylistically passable.
No matter what I tried, though, there would always be some amount of inconsistency in the outfit. Basically, I could paint certain colors into the outfit, but it was a total crapshoot where they showed up in each pose. If I just kept the prompt very plain and had it generate a monochromatic outfit, then I could avoid the inconsistency issue, but obviously that greatly restricts what I can do with the character designs. The AI particularly seemed to struggle with the concept of a sash: for fighting games especially, it helps to have some kind of clear divider between the character's upper and lower body for readability purposes, and I really struggled with getting that to manifest with this method.
Anyway, my main takeaways from all of this were: A. if I want really good results, I probably have to train my own LoRA, and B. I probably have to create the training images for the LoRA manually if I want the level of control over the final product that I'm looking for.