For the past week and a half or so, I've been focusing on exploring the process of LoRA training. My goal is to produce a LoRA capable of consistently generating full-body images of a given person and their outfit, to be used as frames for animation.
I started with this article: Generate Photos of Yourself by Training a LoRA for Stable Diffusion Privately on AWS | by Paulo Carvalho | Medium, but got distracted about halfway through by this video: LORA Training - for HYPER Realistic Results, which ended up being my main resource for figuring out the training process.
The video provides a detailed walkthrough of every step of the training process: what technology to use, the ideal conditions for capturing training photos, how to process and tag those photos, and what settings to use in the LoRA trainer itself.
I decided I would follow along with the video and, as an initial test, create a LoRA of myself. I'll now talk through all the steps of the process as described by the video, and how I approached them.
First, the technology. I don't have access to a proper photography camera, so I just used a phone camera. I think this is mostly fine, though: the training images don't need to be particularly high-resolution, because they get resized to a set resolution at the start of the training process anyway. What matters is the level of detail: the video recommends shooting raw (uncompressed) pictures. I missed this on my first watch-through; I'm not experienced with photography, and I was kind of assuming that a lot of the recommendations wouldn't be possible with my level of equipment or experience and could be safely ignored.
Ideally, you want a diverse photoset, where the parts you want the LoRA to pick up on are consistent between images and everything else varies. You also want some full-body shots, some close-ups of the face at different angles, and some shots from the waist/shoulders up. The video also recommends taking steps to reduce shadows and reflections on the face, but implies that this is more of a personal preference than something required to get consistent results. The video provides some examples of what an ideal training set looks like:
However, I decided that I wouldn't put too much effort into optimizing my image set right from the start. I wanted to see what kinds of results I would get from putting in the bare minimum effort, in order to set a baseline to improve upon.
With some assistance, I took an image set of myself in an indoor office building setting, as I figured that would provide some nice, consistent lighting.
We took the pictures at various spots around the building, with me turning completely around and varying my pose slightly at each spot. Having every picture taken in the same overall setting probably isn't ideal, because the LoRA will pick up on the location, but I thought it would be fine for an initial test.
All of the pictures were taken on an iPhone, and were saved in Apple's HEIC format, a compressed file format similar to a JPEG. In order for the photos to be usable, I had to convert them to another format. I couldn't figure out an easy way to do that on my machine, so I used an online tool: Convert Heic to JPEG for free | Made by JPEGmini. This appeared to alter the colors slightly, but I didn't notice any new compression artifacts, and I believe the service claimed to be lossless.
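In hindsight, this conversion would have been easy enough to script locally. Here's a rough sketch of what that could look like, assuming the third-party pillow-heif package is installed; the folder names are just placeholders:

```python
# Rough local HEIC-to-PNG conversion sketch (pip install pillow-heif).
# Folder names are placeholders.
from pathlib import Path

from PIL import Image
from pillow_heif import register_heif_opener

register_heif_opener()  # lets Pillow open .heic files directly

src_dir = Path("photos_heic")
dst_dir = Path("photos_png")
dst_dir.mkdir(exist_ok=True)

for path in src_dir.iterdir():
    if path.suffix.lower() != ".heic":
        continue
    img = Image.open(path)
    # Save as PNG so no extra round of lossy compression gets added.
    img.save(dst_dir / (path.stem + ".png"))
```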
The primary tool I used for LoRA training is kohya_ss. It has a browser-based GUI that can be launched from the command line. From what I can tell, you need a network connection for the GUI to function, but all of the actual training is done locally on your machine, using your GPU.
First, though, I needed to add tags to the images. To get the best possible results, you need to include a text file alongside each training image describing what's in it.
I did some searching around to see if there was an explanation of what the tags actually did and how they affected the output, but didn't find too much concrete information. From what I gather, the tags describe what is in your image, and the training process takes everything in the image that isn't described by the tags and associates that with your LoRA. For best results, then, it seems like you want to tag everything in the background and provide a general description of the image, but you don't have to describe your subject in detail.
kohya_ss comes with several autotagger models. The video suggested using the WD14 tagger to generate the initial tags, then a program called BDTM (Booru Dataset Tag Manager) to manage and edit them.
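For what it's worth, the caption format itself is simple: each image just needs a sibling .txt file containing comma-separated tags. Here's a tiny sketch of that convention, with placeholder paths and tags (the real tags came from the autotagger and were then edited by hand):

```python
# Minimal sketch of the caption-file convention: every training image gets a
# sibling .txt file with comma-separated tags. Paths and tags are placeholders.
from pathlib import Path

train_dir = Path("training_images")

for img_path in sorted(train_dir.glob("*.png")):
    caption_path = img_path.with_suffix(".txt")
    if not caption_path.exists():
        # Placeholder tags; in practice these come from WD14 and get curated in BDTM.
        caption_path.write_text("solo, standing, indoors, office, looking at viewer")
    print(img_path.name, "->", caption_path.read_text().strip())
```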
Then, it was time to actually train the LoRA. I got a little stuck at this stage, because kohya_ss requires a very specific folder structure and naming convention, which the video didn't properly communicate. The path to the training image folder can't contain any spaces, and you need to include a number at the start of the folder name, which tells the trainer how many "repetitions" it should do per image (I'm still not 100% sure what that entails).
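For anyone else who gets stuck here, this is roughly the layout that ended up working for me, sketched out with placeholder names:

```python
# Sketch of the folder layout kohya_ss expects. Folder names are placeholders;
# the important parts are "no spaces anywhere in the path" and the leading
# number on the subfolder, which is the per-image repetition count.
import shutil
from pathlib import Path

src = Path("training_images")                    # images plus their .txt captions
subject_dir = Path("lora_train/img/10_myname")   # 10 repetitions per image
subject_dir.mkdir(parents=True, exist_ok=True)

for f in src.iterdir():
    if f.suffix.lower() in {".png", ".jpg", ".txt"}:
        shutil.copy(f, subject_dir / f.name)
```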
I experimented with different values for the epoch count, resolution, and network rank/alpha. However, I found that at the higher settings, the training process took an exorbitant amount of time: I didn't set a timer or note down my parameters, but I would come back hours later to find that only around 400 of 1600 steps had been completed. Eventually, I decided to set the resolution relatively low (384x512), keep the network rank (dimension) at its default, and just be okay with getting mediocre results. With those settings, it took ~30 minutes to do a 10-epoch run with 41 images and 10 repetitions per image.
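If I understand kohya's math correctly, the step count for that run works out roughly like this (assuming a batch size of 1, which I believe is what I was using):

```python
# Back-of-the-envelope step count for the ~30 minute run, assuming batch size 1.
images = 41
repeats = 10
epochs = 10
batch_size = 1

steps_per_epoch = (images * repeats) // batch_size
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 410 steps per epoch, 4100 total
```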
Now that my LoRA was ready, I could test it out in ComfyUI. The LoRA was trained on Stable Diffusion 1.5, so that was the model I used for generation as well. Somewhat predictably, the results weren't too great:
So, I suppose that's good news. If I could somehow fix the weird garbling issues with the face and proportions by having a better training set, then there could be some potential here.
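For reference, the generation step itself is simple. Here's roughly the equivalent of my ComfyUI graph expressed with the diffusers library instead; the model ID, LoRA path, trigger word, and prompt are all placeholders:

```python
# Rough stand-in for my ComfyUI graph, using the diffusers library instead.
# Model ID, LoRA path, and prompt are placeholders, not my actual setup.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # placeholder for whichever SD 1.5 checkpoint you have
    torch_dtype=torch.float16,
).to("cuda")

# Apply the trained LoRA on top of the base model.
pipe.load_lora_weights("lora_train/output", weight_name="myname.safetensors")

image = pipe(
    "full body photo of myname standing in an office",
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
image.save("lora_test.png")
```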
I decided to have another go at the training process, this time with a better set of training images, and with more effort put into properly cropping and tagging them. I enlisted the help of my mom, and we took a bunch of images in a variety of locations. This time, instead of having the subject rotate to capture different angles, we rotated the camera around the subject, in the hopes of getting a greater variety of backgrounds. I also took a few close-up shots of the face.
The images were taken late in the afternoon in areas of uniform shade (such as the shade cast by buildings), to ensure plenty of light without harsh shadows or reflections on the subject.
From these, I selected a few and cropped them, including some full-body images, some of the face, and some from the waist/shoulders up. I generated the tags automatically again, but put much more effort into manually curating them this time.
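I did the selection and cropping by hand, but a scripted version of the resizing pass might look something like this (a Pillow-based sketch with placeholder folder names):

```python
# Pillow-based resizing sketch with placeholder folder names. ImageOps.fit
# center-crops each image to the target aspect ratio and then resizes it;
# I actually chose the crops by hand, so this is only a rough equivalent.
from pathlib import Path

from PIL import Image, ImageOps

src = Path("selected")
dst = Path("cropped")
dst.mkdir(exist_ok=True)

target = (512, 512)  # the resolution I trained this set at

for p in src.glob("*.png"):
    img = ImageOps.exif_transpose(Image.open(p))  # respect phone rotation metadata
    ImageOps.fit(img, target, method=Image.Resampling.LANCZOS).save(dst / p.name)
```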
This time, my total image count was smaller, so I used a much higher repeat number to compensate.
I trained a LoRA using this image set. I don't recall my exact settings: I believe I had the resolution at 512x512, which made the training take over an hour at 5 epochs. I then went and generated some images in ComfyUI. The results were mildly horrifying:
The face is kind of uncanny when you look at it too closely, the pose is a little funky-looking, and there's a weird artifact near the right hand, but other than that it looks mostly fine.
However, training at higher resolutions takes way, way more time. This could be remedied with better hardware: in particular, a more powerful GPU (or cloud computing) that could handle training at a higher resolution. Guidelines suggest a GPU with at least 12 GB of VRAM, and all I have is 6 GB.
But if I'm being honest, I don't particularly want to do any of that, because it's becoming clear to me that I find the process of training AI to be categorically miserable. The main problem is that it doesn't feel like I'm learning a skill. It's pure trial and error, a total black box, and it doesn't feel like there are any opportunities to leverage my own intelligence. It's also very unclear how much progress I've actually made towards my goal at any given time, or whether my goal is even possible to realize.
With, say, 3D modeling and animation, there are a lot of steps in the process to learn and get better at. At each of those steps it's usually pretty easy to evaluate how close I am to my vision, because it's a well-documented system that's actually designed for the task, and any change I make has an immediate, consistent, and noticeable effect on the end product. Also, with 3D modeling, the model is the thing: it's the asset that will be used going forward to produce more things for the user to experience.
With AI image generation, though, results are always off from my vision in some way, but I have no way of knowing what steps could get me there. Also, the image is not the thing, the workflow/models are. If there's a specific change that I want to see reflected in an image, then I have to go all the way back to modifying the workflow, which is frustrating because it's a black box. And even if one image turns out looking fine, there's no guarantee that the others will: the workflow has to do what I want consistently in order to be viable for content production. And given how new the technology is and how underexplored it seems to be by people who are actually trying to produce interesting art, there's almost certainly going to be some surprise blind spots where the AI is just totally incapable of parsing certain types of requests.
While I don't doubt that somebody could get this workflow to a point where it's usable, I don't think that somebody has to be me. An AI-based sprite generation approach might well reduce my total work time, but that time would be spent doing something I don't enjoy very much, and the approach comes with a whole host of ethical ramifications.
I might give this LoRA training thing another shot, but I think it might be time to move on. If I do continue to use generative AI tools in the future, it'll probably be to produce things that I actually have the skills to create and modify myself, not for entirely novel applications where I'm relying heavily on AI as a core part of the workflow and don't really have a fallback option. I always want to have final say in whatever ends up getting produced.
At this point, I think I would much rather be exploring avenues for manually producing 3D assets for fighting games, with an emphasis on finding a unique visual style that stands out from the rest of the field while keeping the overall workload low. I already have an idea lined up for something that might look cool, stand out, and be feasible for me to produce. I'll begin exploring that direction, and may update this blog with my findings.