The job search continues.
By the way, did I ever mention that I’ve never passed a single interview in my life? I suppose that’s something that would improve over time as a person experiences more interviews, but well, I don’t even get interviews. For reference, I got two in-person interviews and one phone interview in the entire year of 2023.
I’ve had people review my resume, but the general feedback has been positive aside from some minor wording suggestions, so I suppose the fundamental issue is my experience and skills. I simply don’t have the breadth or depth of experience that people are looking for.
I must say though, with every month that goes by, my self-confidence takes a hit. I’m honestly not sure if I’ll ever get a job in my lifetime.
On the topic of job searching though, I’ve heard time and time again that networking is far more effective than simply applying for jobs on job boards. And to be honest, it makes sense. Connections are a powerful thing. There’s a big difference between some unknown person you’ve never seen the face of, and someone you’ve interacted with and hold goodwill towards.
Well, not that I’d know, since I can count the total number of people I actually know on one hand. I can barely hold a conversation on a normal day. I could point out a number of shortcomings and reasons for my incompetence, but no one wants to hear all that doom and gloom, so anywho, moving on.
Stable Diffusion XL (SDXL) and LoRA1
Recently, I’ve successfully trained a LoRA model for Stable Diffusion XL, or more specifically, a LoRA model trained on a finetune of SDXL 1.0.
Since AI has been a pretty hot topic these days, most people have probably heard of Stable Diffusion at least, alongside other famous AI models like ChatGPT, Midjourney, and so on.
So, as you know, Stable Diffusion is a text-to-image AI model that generates images based on a text prompt. SDXL offers a number of improvements over its predecessors 1.4 and 1.5. From the ordinary end user’s point of view, without delving into the technical details, the main differences are a larger native image resolution (1024 by 1024 compared to 512 by 512) and better depictions of text and hands in generated images. Well, hands are debatable depending on who you ask, I suppose.
Anyway, I was browsing LoRA models recently and found there was a distinct lack of SDXL finetunes and LoRA models. I’m not quite sure why that is. Perhaps the cost of training was too high, or maybe SD 1.4 and 1.5 were good enough for most people’s purposes? Or maybe the hype has died down a bit? I’m not sure, but personally I’d been using a finetune of SD 1.5 until recently, mainly because that finetune has a specific art style that I particularly like. Still, I figured it was time to see what SDXL models had to offer… but there weren’t many options.
LoRA Model Training
So I thought, why not try and make a LoRA model that can create the style I want? And it’d probably be a good way to see for myself what training a LoRA model is like. And the good thing was that I already had an idea of a style I liked, and knew where to gather the training data, so I took a crack at it.
I obtained 180 training images and about 500 regularization images for training the LoRA model.
The training images were fairly self-explanatory: they represented the style I wanted the LoRA model to imitate. The regularization images were images in the same category as the training images, but in a different style.
From my understanding, regularization images are used to reduce overfitting to the training data. This is probably best explained with an example:
Say there’s an AI model used to classify images of objects into categories based on the pixel data of the images (e.g. the RGB values of each pixel and their positions). There may come a point in training where the model overfits the training data: it has “learnt” features of the training data that are irrelevant to classifying the image (e.g. the background) and incorporated them into the way it categorizes images. An extreme example of overfitting would be a model that only categorizes an image correctly when every single pixel matches the training image representing that category. But obviously, that kind of classification model is useless, because it’s just checking whether two images are identical, not classifying the objects in them.
A more realistic example of overfitting would be a dog only being categorized correctly when it’s in a park, but not when it’s inside a house. The model has overfit the training data and considers the park background to be an important feature in determining whether the subject is a dog or not.
So in my case, my LoRA training images were anime images with my desired style, while my regularization images were other anime images with different styles.
As for the actual training process, I used kohya_ss2 and followed this YouTube tutorial to prepare the datasets and train the model: https://www.youtube.com/watch?v=y2J7EZUk_a0
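For anyone curious, the dataset preparation mostly comes down to arranging the images into the folder structure that kohya_ss expects, where the number prefixed to each image folder controls how many times those images are repeated per epoch. Roughly speaking, it looks something like this (the folder names and repeat counts here are illustrative placeholders, not my exact setup):

```
training/
├── img/
│   └── 10_mystyle anime    <- training images in the target style
├── reg/
│   └── 1_anime             <- regularization images in other styles
├── model/                  <- trained LoRA files get saved here
└── log/
```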
Review and Findings
The process itself wasn’t too complicated, although I did have to deviate from the tutorial when setting the training parameters, because the training wanted far more than the 16 GB of VRAM my GPU had and was giving me an estimated training time of over 300 hours, haha! Based on my observations, I believe the training wanted about 40-50 GB of VRAM in total, but since my GPU didn’t have enough, it spilled over into system RAM, which is much slower than VRAM and pushed the training time up significantly.
After tweaking some settings to reduce memory usage so that everything fit into 16 GB of VRAM, I managed to push the time down to 40 hours. Cutting memory usage further then let me increase my batch size to 3, which brought the training time down to 10 hours. Now that was a much more acceptable training time.
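For context, the memory-reducing settings I’m talking about are the usual suspects: gradient checkpointing, mixed precision, caching latents, memory-efficient attention, and an 8-bit optimizer. If you were running the kohya_ss training scripts directly, the relevant options would look roughly like the sketch below. This is a rough illustration rather than my exact command; the paths, network dimensions, and epoch counts are placeholders, and flag names may vary between versions:

```
accelerate launch sdxl_train_network.py \
  --pretrained_model_name_or_path="/path/to/sdxl_finetune.safetensors" \
  --train_data_dir="training/img" \
  --reg_data_dir="training/reg" \
  --output_dir="training/model" \
  --resolution="1024,1024" \
  --network_module=networks.lora \
  --network_dim=32 \
  --network_alpha=16 \
  --train_batch_size=3 \
  --mixed_precision="fp16" \
  --gradient_checkpointing \
  --cache_latents \
  --xformers \
  --optimizer_type="AdamW8bit" \
  --max_train_epochs=10 \
  --save_every_n_epochs=1
```

Gradient checkpointing trades extra compute for a large drop in VRAM usage, caching latents keeps the VAE out of memory during training, and the 8-bit optimizer shrinks the optimizer state, which together can make room for a larger batch size.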
My final LoRA model was actually quite good and usable! I was pleasantly surprised at the results of my first attempt. Well, technically my third attempt, since I encountered some technical issues with my PC deciding to shut down and restart overnight twice while training, but I digress.
One thing I will note is that while the tutorial suggests comparing the LoRA results after each epoch to choose the best one, I personally found that earlier epochs often have issues with accurately depicting human bodies, while LoRA models produced in later epochs do not. I suspect that the earlier epochs are simply undertrained, and would suggest using the LoRA from the final epoch, even if the results look overtrained (in my case, it leaned into the style a bit too much).
This is because when using a LoRA in something like the AUTOMATIC1111 Stable Diffusion web UI3, you can simply set the LoRA model’s weight lower to compensate and reduce its effect on the image.
From my personal experience, it is generally better to reduce a LoRA model’s weight when generating images than to increase it. Increasing the weight excessively often causes strange and unusual deformities, on top of potentially conflicting with and overriding the effects of other LoRA models you’ve applied. This is much less of a problem when reducing weights.
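To make that concrete: in the AUTOMATIC1111 web UI, the LoRA weight is set directly in the prompt, so dialing it down is just a matter of editing one number. The model name and prompt below are made-up placeholders:

```
masterpiece, 1girl, cherry blossoms <lora:my_style_sdxl:0.7>
```

A weight of 1.0 applies the LoRA at full strength, while something like 0.6-0.8 tones its effect down without throwing the style away entirely.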
Final Comments
I’m happy with the final LoRA model produced, and it was interesting to go through the process for myself. Obviously, a lot of the more technical and theoretical details fly right over my head, but I’ve done my best to convey my understanding of the concepts throughout this post.
It’s quite nice being able to create something that you can personally use for yourself. I think that is the best motivation for starting any sort of project. I suppose that’s why people like employees that are invested in their work, huh?
Gah, I feel the negativity creeping back in. Anyway, I should stop getting distracted and actually learn C# and dotnet soon. But there’s a nagging voice in the back of my mind that’s asking me whether it’ll even make a difference…