For the past week and a half or so, I've been focusing on exploring the process of LoRA training. My goal is to produce a LoRA capable of consistently generating full-body images of a given person and their outfit, to be used as frames for animation.
I started with this article: Generate Photos of Yourself by Training a LoRA for Stable Diffusion Privately on AWS | by Paulo Carvalho | Medium, but got distracted about halfway through by this video: LORA Training - for HYPER Realistic Results, which ended up being my main resource for figuring out the training process.
The video provides a detailed walkthrough of every step of the training process: what technology to use, the ideal conditions for capturing training photos, how to process and tag those photos, and what settings to use in the LoRA trainer itself.
I decided I would follow along with the video and, as an initial test, create a LoRA of myself. I'll now talk through all the steps of the process as described by the video, and how I approached them.
First, the technology. I don't have access to a proper photography camera, so I just used a phone camera. I think this is mostly fine, though: the training images don't need to be particularly high-resolution, because they get resized to a set resolution at the start of the training process anyway. What matters is the level of detail: the video recommends shooting raw (uncompressed) pictures. I missed this on my first watch-through; I'm not experienced with photography, and I was kind of assuming that a lot of the recommendations wouldn't be possible with my level of equipment or experience and could be safely ignored.
Ideally, you want a diverse photoset, where the parts you want the LoRA to pick up on are consistent between images and everything else varies. You also want some full-body shots, some close-ups of the face at different angles, and some shots from the waist/shoulders up. The video also recommends taking steps to reduce shadows and reflections on the face, but implies that this is more of a personal preference than something required to get consistent results. The video provides some examples of what an ideal training set looks like:
However, I decided that I wouldn't put too much effort into optimizing my image set right from the start. I wanted to see what kinds of results I would get from putting in the bare minimum effort, in order to set a baseline to improve upon.
With some assistance, I took an image set of myself in an indoor office building setting, as I figured that would provide some nice, consistent lighting.
We took the pictures at various spots around the building, with me turning completely around and varying my pose slightly at each spot. Having every picture taken in the same overall setting probably isn't ideal, because the LoRA will pick up on the location, but I thought it would be fine for an initial test.
All of the pictures were taken on an iPhone, and were saved in Apple's HEIC format, a compressed file format similar to a JPEG. In order for the photos to be usable, I had to convert them to another format. I couldn't figure out an easy way to do that on my machine, so I used an online tool: Convert Heic to JPEG for free | Made by JPEGmini. This appeared to alter the colors slightly, but I didn't notice any new compression artifacts, and I believe the service claimed to be lossless.
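In hindsight, this conversion would have been easy enough to script locally. Here's a rough sketch of what that could look like, assuming the third-party pillow-heif package is installed; the folder names are just placeholders:

```python
# Rough local HEIC-to-PNG conversion sketch (pip install pillow-heif).
# Folder names are placeholders.
from pathlib import Path

from PIL import Image
from pillow_heif import register_heif_opener

register_heif_opener()  # lets Pillow open .heic files directly

src_dir = Path("photos_heic")
dst_dir = Path("photos_png")
dst_dir.mkdir(exist_ok=True)

for path in src_dir.iterdir():
    if path.suffix.lower() != ".heic":
        continue
    img = Image.open(path)
    # Save as PNG so no extra round of lossy compression gets added.
    img.save(dst_dir / (path.stem + ".png"))
```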
The primary tool I used for LoRA training is kohya_ss. It has a browser-based GUI that can be launched from the command line. From what I can tell, you need a network connection for the GUI to function, but all of the actual training is done locally on your machine, using your GPU.
First, though, I needed to add tags to the images. To get the best possible results, you need to include a text file alongside each training image describing what's in it.
I did some searching around to see if there was an explanation of what the tags actually did and how they affected the output, but didn't find too much concrete information. From what I gather, the tags describe what is in your image, and the training process takes everything in the image that isn't described by the tags and associates that with your LoRA. For best results, then, it seems like you want to tag everything in the background and provide a general description of the image, but you don't have to describe your subject in detail.
kohya_ss comes with several autotagger models. The video suggested using the WD14 tagger to generate the initial tags, then a program called BDTM (Booru Dataset Tag Manager) to manage and edit them.
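For what it's worth, the caption format itself is simple: each image just needs a sibling .txt file containing comma-separated tags. Here's a tiny sketch of that convention, with placeholder paths and tags (the real tags came from the autotagger and were then edited by hand):

```python
# Minimal sketch of the caption-file convention: every training image gets a
# sibling .txt file with comma-separated tags. Paths and tags are placeholders.
from pathlib import Path

train_dir = Path("training_images")

for img_path in sorted(train_dir.glob("*.png")):
    caption_path = img_path.with_suffix(".txt")
    if not caption_path.exists():
        # Placeholder tags; in practice these come from WD14 and get curated in BDTM.
        caption_path.write_text("solo, standing, indoors, office, looking at viewer")
    print(img_path.name, "->", caption_path.read_text().strip())
```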
Then, it was time to actually train the LoRA. I got a little stuck at this stage, because kohya_ss requires a very specific folder structure and naming convention, which the video didn't properly communicate. The path to the training image folder can't contain any spaces, and you need to include a number at the start of the folder name, which tells the trainer how many "repetitions" it should do per image (I'm still not 100% sure what that entails).
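For anyone else who gets stuck here, this is roughly the layout that ended up working for me, sketched out with placeholder names:

```python
# Sketch of the folder layout kohya_ss expects. Folder names are placeholders;
# the important parts are "no spaces anywhere in the path" and the leading
# number on the subfolder, which is the per-image repetition count.
import shutil
from pathlib import Path

src = Path("training_images")                    # images plus their .txt captions
subject_dir = Path("lora_train/img/10_myname")   # 10 repetitions per image
subject_dir.mkdir(parents=True, exist_ok=True)

for f in src.iterdir():
    if f.suffix.lower() in {".png", ".jpg", ".txt"}:
        shutil.copy(f, subject_dir / f.name)
```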
I experimented with different values for the epoch count, resolution, and network rank/alpha. However, I found that at the higher settings, the training process took an exorbitant amount of time: I didn't set a timer or note down my parameters, but I would come back hours later to find that only around 400 of 1600 steps had been completed. Eventually, I decided to set the resolution relatively low (384x512), keep the network rank (dimension) at its default, and just be okay with getting mediocre results. With those settings, it took ~30 minutes to do a 10-epoch run with 41 images and 10 repetitions per image.
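If I understand kohya's math correctly, the step count for that run works out roughly like this (assuming a batch size of 1, which I believe is what I was using):

```python
# Back-of-the-envelope step count for the ~30 minute run, assuming batch size 1.
images = 41
repeats = 10
epochs = 10
batch_size = 1

steps_per_epoch = (images * repeats) // batch_size
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 410 steps per epoch, 4100 total
```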
Now that my LoRA was ready, I could test it out in ComfyUI. The LoRA was trained on Stable Diffusion 1.5, so that was the model I used for generation as well. Somewhat predictably, the results weren't too great:
So, I suppose that's good news. If I could somehow fix the weird garbling issues with the face and proportions by having a better training set, then there could be some potential here.
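For reference, the generation step itself is simple. Here's roughly the equivalent of my ComfyUI graph expressed with the diffusers library instead; the model ID, LoRA path, trigger word, and prompt are all placeholders:

```python
# Rough stand-in for my ComfyUI graph, using the diffusers library instead.
# Model ID, LoRA path, and prompt are placeholders, not my actual setup.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # placeholder for whichever SD 1.5 checkpoint you have
    torch_dtype=torch.float16,
).to("cuda")

# Apply the trained LoRA on top of the base model.
pipe.load_lora_weights("lora_train/output", weight_name="myname.safetensors")

image = pipe(
    "full body photo of myname standing in an office",
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
image.save("lora_test.png")
```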
I decided to have another go at the training process, this time with a better set of training images, and with more effort put into properly cropping and tagging them. I enlisted the help of my mom, and we took a bunch of images in a variety of locations. This time, instead of having the subject rotate to capture different angles, we rotated the camera around the subject, in the hopes of getting a greater variety of backgrounds. I also took a few close-up shots of the face.
The images were taken late in the afternoon in areas of uniform shade (such as the shade cast by buildings), to ensure plenty of light without harsh shadows or reflections on the subject.
From these, I selected a few and cropped them, including some full-body images, some of the face, and some from the waist/shoulders up. I generated the tags automatically again, but put much more effort into manually curating them this time.
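I did the selection and cropping by hand, but a scripted version of the resizing pass might look something like this (a Pillow-based sketch with placeholder folder names):

```python
# Pillow-based resizing sketch with placeholder folder names. ImageOps.fit
# center-crops each image to the target aspect ratio and then resizes it;
# I actually chose the crops by hand, so this is only a rough equivalent.
from pathlib import Path

from PIL import Image, ImageOps

src = Path("selected")
dst = Path("cropped")
dst.mkdir(exist_ok=True)

target = (512, 512)  # the resolution I trained this set at

for p in src.glob("*.png"):
    img = ImageOps.exif_transpose(Image.open(p))  # respect phone rotation metadata
    ImageOps.fit(img, target, method=Image.Resampling.LANCZOS).save(dst / p.name)
```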
This time, my total image count was smaller, so I used a much higher repeat number to compensate.
I trained a LoRA using this image set. I don't recall my exact settings: I believe I had the resolution at 512x512, which made the training take over an hour at 5 epochs. I then went and generated some images in ComfyUI. The results were mildly horrifying:
The face is kind of uncanny when you look at it too closely, the pose is a little funky-looking, and there's a weird artifact near the right hand, but other than that it looks mostly fine.
However, training at higher resolutions takes way, way more time. This could be remedied with better hardware: in particular, a more powerful GPU (or cloud computing) that could handle training at a higher resolution. Guidelines suggest a GPU with at least 12 GB of VRAM, and all I have is 6 GB.
But if I'm being honest, I don't particularly want to do any of that, because it's becoming clear to me that I find the process of training AI to be categorically miserable. The main problem is that it doesn't feel like I'm learning a skill. It's pure trial and error, a total black box, and it doesn't feel like there are any opportunities to leverage my own intelligence. It's also very unclear how much progress I've actually made towards my goal at any given time, or whether my goal is even possible to realize.
With, say, 3D modeling and animation, there are a lot of steps in the process to learn and get better at. At each of those steps it's usually pretty easy to evaluate how close I am to my vision, because it's a well-documented system that's actually designed for the task, and any change I make has an immediate, consistent, and noticeable effect on the end product. Also, with 3D modeling, the model is the thing: it's the asset that will be used going forward to produce more things for the user to experience.
With AI image generation, though, results are always off from my vision in some way, but I have no way of knowing what steps could get me there. Also, the image is not the thing, the workflow/models are. If there's a specific change that I want to see reflected in an image, then I have to go all the way back to modifying the workflow, which is frustrating because it's a black box. And even if one image turns out looking fine, there's no guarantee that the others will: the workflow has to do what I want consistently in order to be viable for content production. And given how new the technology is and how underexplored it seems to be by people who are actually trying to produce interesting art, there's almost certainly going to be some surprise blind spots where the AI is just totally incapable of parsing certain types of requests.
While I don't doubt that somebody could get this workflow to a point where it's usable, I don't think that somebody has to be me. An AI-based sprite generation approach might well reduce my total work time, but that time would be spent doing something I don't enjoy very much, and the approach comes with a whole host of ethical ramifications.
I might give this LoRA training thing another shot, but I think it might be time to move on. If I do continue to use generative AI tools in the future, it'll probably be to produce things that I actually have the skills to create and modify myself, not for entirely novel applications where I'm relying heavily on AI as a core part of the workflow and don't really have a fallback option. I always want to have final say in whatever ends up getting produced.
At this point, I think I would much rather be exploring avenues for manually producing 3D assets for fighting games, with an emphasis on finding a unique visual style that stands out from the rest of the field while keeping the overall workload low. I already have an idea lined up for something that might look cool, stand out, and be feasible for me to produce. I'll begin exploring that direction, and may update this blog with my findings.