bigASP 🐍

 

bigASP 🐍 is a photorealistic SDXL model that works with danbooru-style tags! This experimental model was finetuned from base SDXL on almost 1.5 MILLION high quality photos for 30 million training samples. Every photo was tagged using JoyTag, a state-of-the-art tagging model that extends the danbooru2021 dataset to photographic content. This imbues bigASP 🐍 with all the expressive prompting power of an anime model, while generating high quality photographic results.

 

This is my first foray into base SDXL models. I'm excited to see how the community uses this model and to learn its strengths and weaknesses. Please share your gens and feedback!

 

Features

 

  • Danbooru style tag prompting: e.g. photo (medium), 1girl, spread legs, small breasts, nipples, pussy, pantyhose, sheer legwear, couch

  • Aesthetic/quality score: e.g. score_8_up, score_7, you name it, you got it.

  • Diversity: Want something other than generic white women in your photo gens? Get ready for male, female, trans, dark skin, light skin, big breasts, small breasts, and everything in between. Paint the world in colors.

  • Aspect ratio bucketing: Widescreen, square, portrait, bigASP 🐍 is ready to take it all.

  • High quality training data: Most of the training data consists of high quality, professional grade photos with resolutions well beyond SDXL's native resolution, all downloaded in their original quality with no additional compression. bigASP 🐍 won't miss a single pixel.

  • Large prompt support: Trained with support for up to 225 tokens in the prompt. It is BIG asp, after all.

 

Prompting

 

Building on the shoulders of giants like Pony XL, bigASP 🐍 uses a custom aesthetic/quality model to rate the training dataset and assign scores from 1 to 9. However, you shouldn't need a long "score_9, score_8, score_7, ..." prompt. A simple "score_8_up" in your positive prompt should suffice, along with a low score or two in your negative, like "score_1, score_2".

 

Score tags should always come first, but all other tags can come in any order (tags were shuffled during training). Underscores or spaces, doesn't matter: e.g. "small breasts" or "small_breasts" both work.

 

"photo (medium)" should likely be included in every prompt, as all of the training data was photographic. NOTE: You will might need to use "photo \(medium\)" instead if your UI uses parentheses to weight words (Comfy, Auto1111, etc)! I made the same mistake...

 

Here is a list of all tags the model saw in its training data, with how many images have that tag:

https://gist.github.com/fpgaminer/8c1b488aa81f9713efdb8f9245e8a0e8

Or with aliases added: https://gist.github.com/fpgaminer/e835328c7883b1929419273f6e73c3aa

 

This includes most standard danbooru tags, along with things like "reddit", "onlyfans", "fansly", some subreddits (e.g. asstastic), etc. Explore and see what bigASP 🐍 can do!

 

WARNING: In testing, it looks like there is a soft "bug" in this model version. Prompting with anything not listed in the tag lists linked above can cause the model's quality to degrade severely. I recommend sticking to known tags and avoiding natural language in prompts, unless you're adventurous.

 

Example prompts:

 

score_7_up, photo (medium), 1girl, spread legs, small breasts, puffy nipples, cumshot, shocked

 

score_7_up, photo \(medium\), 1boy, male focus, penis, muscular, testicles, spread legs

 

score_7_up, photo (medium), long hair, standing, thighhighs, reddit, 1girl, r/asstastic, kitchen, dark skin

 

score_8_up, photo \(medium\), black shirt, pussy, long hair, spread legs, thighhighs, miniskirt, outside

 

Settings

 

I recommend most of the DPM samplers, though Euler A seems to work as well, with a Karras or exponential schedule. The normal schedule will result in garbage outputs. DPM++ 3M SDE is my go-to. These are just my recommendations, so please explore and see what works best for you! (The AYS schedule also works well, and I've seen PAG mentioned as being helpful.)
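
For diffusers users, here's a minimal sketch of those settings, using the multistep DPM++ SDE solver with Karras sigmas as a stand-in for DPM++ 3M SDE (the checkpoint path and prompt are placeholders):

```python
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler

# Load the model from a local safetensors file (hypothetical path).
pipe = StableDiffusionXLPipeline.from_single_file(
    "bigasp_v1.safetensors",
    torch_dtype=torch.float16,
).to("cuda")

# DPM++ SDE with Karras sigmas; the default ("normal") schedule gives garbage outputs.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config,
    algorithm_type="sde-dpmsolver++",
    use_karras_sigmas=True,
)

image = pipe(
    prompt="score_7_up, photo (medium), 1girl, long hair, kitchen",
    negative_prompt="score_1, score_2",
    width=832,
    height=1216,
).images[0]
image.save("out.png")
```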

 

Supported resolutions (with training image count for reference):

 

832 x 1216: 372164 images
1216 x 832: 289428 images
832 x 1152: 174153 images
1152 x 896: 64223 images
896 x 1152: 56959 images
768 x 1344: 30512 images
1024 x 1024: 26565 images
896 x 1088: 16070 images
704 x 1408: 15656 images
1344 x 768: 15149 images
768 x 1280: 13564 images
704 x 1344: 11833 images
960 x 1088: 10300 images
1152 x 832: 7051 images
960 x 1024: 6601 images
1088 x 896: 5067 images
1024 x 960: 4842 images
1088 x 960: 4454 images
1280 x 768: 2719 images
1344 x 704: 1819 images
1472 x 704: 1334 images
1408 x 704: 1293 images
1600 x 640: 840 images
1728 x 576: 743 images
1536 x 640: 326 images
1664 x 576: 51 images

 

 

Limitations

 

  • No offset noise: Sorry, maybe next version.

  • No anime: This is a photo model through and through. Perhaps in a future version anime will be included to expand the model's concepts.

  • Faces/hands/feet: You know the deal. I can't work miracles.

  • Undertrained: In my usage so far I've found bigASP to be at least slightly undertrained. I think another 10M or more training samples could get the model to where I really want it.

  • VAE issues: SDXL's VAE sucks. We all know it. But nowhere is it more apparent than in photos. And because bigASP generates such high levels of detail, it really exposes the weaknesses of the VAE. Not much I can do about that.

  • An obsession with stone backgrounds?: Dunno why, but a lot of my gens end up with a stone background if I don't prompt for something else. Your guess is as good as mine.

  • Scoring model: I trained the quality/scoring model from scratch. I find that it works a lot better than the old aesthetic model used by SD and SDXL. It gets images into a general trend of "blurry, lowres reddit selfie garbage" at score_1 up to professional photoshoot at score_9. But it could still benefit from more work so it's less biased towards that photoshoot look.

 

 

Training Details

 

Details on how the big SDXL finetunes were trained are scarce, to say the least. So I'm sharing all my details here to help the community.

 

bigASP was trained on about 1,440,000 photos, all with resolutions larger than their respective aspect ratio bucket. Each image is about 1MB on disk, making the database about 1TB per million images.

 

Every image goes through: the quality model, to rate it from 0 to 9; JoyTag, to tag it; and OWLv2 with the prompt "a watermark", to detect watermarks in the images. I found OWLv2 to perform better than even a finetuned vision model, and it has the added benefit of providing bounding boxes for the watermarks. Accuracy is about 92%. While it wasn't done for this version, it's possible in the future that the bounding boxes could be used to do "loss masking" during training, which basically hides the watermarks from SD. For now, if a watermark is detected, a "watermark" tag is included in the training prompt.
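
As a rough illustration of the watermark step, here's what the OWLv2 check could look like with HuggingFace transformers; the score threshold and checkpoint name are assumptions, not the exact values used for bigASP:

```python
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

def watermark_boxes(path: str, threshold: float = 0.1):
    """Return bounding boxes where OWLv2 detects 'a watermark' (threshold is a guess)."""
    image = Image.open(path).convert("RGB")
    inputs = processor(text=[["a watermark"]], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    return results["boxes"]

# If anything is detected, add the "watermark" tag to the training prompt.
tags = ["photo (medium)", "1girl"]
if len(watermark_boxes("sample.jpg")) > 0:
    tags.append("watermark")
```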

 

Images with a score of 0 are dropped entirely. I did a lot of work specifically training the scoring model to put certain images down in this score bracket. You'd be surprised at how much junk comes through in datasets, and even a hint of it can really throw off training. Thumbnails, video preview images, ads, etc.

 

bigASP uses the same aspect ratio buckets that SDXL's paper defines. Each image is assigned to the bucket it best fits while remaining at least as large as that bucket in both dimensions when scaled down. After scaling, images are randomly cropped. The original resolution and crop data are recorded alongside the VAE-encoded image on disk for conditioning SDXL, and finally the latent is gzipped. I found gzip to provide a nice 30% space savings. This reduces the training dataset down to about 100GB per million images.
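
A condensed sketch of that preprocessing, with an abbreviated bucket list and illustrative helper names (the real run used the full bucket set from the SDXL paper):

```python
import gzip
import io
import random

import torch
from PIL import Image

BUCKETS = [(832, 1216), (1216, 832), (1024, 1024), (896, 1152), (1152, 896)]  # abbreviated

def bucket_and_crop(image: Image.Image):
    w, h = image.size
    # Only consider buckets the image can cover without upscaling, then pick the
    # one whose aspect ratio is closest to the image's.
    candidates = [b for b in BUCKETS if w >= b[0] and h >= b[1]]
    bw, bh = min(candidates, key=lambda b: abs((w / h) - (b[0] / b[1])))
    # Scale down so the image still covers the bucket, then randomly crop to bucket size.
    scale = max(bw / w, bh / h)
    sw, sh = round(w * scale), round(h * scale)
    image = image.resize((sw, sh), Image.LANCZOS)
    left = random.randint(0, sw - bw)
    top = random.randint(0, sh - bh)
    crop = image.crop((left, top, left + bw, top + bh))
    # Original size and crop offsets are kept for SDXL's size/crop conditioning.
    return crop, {"original_size": (h, w), "crop_coords_top_left": (top, left)}

def save_latent_gzipped(latent: torch.Tensor, path: str):
    # gzip the serialized VAE latent; roughly a 30% space saving in practice.
    buf = io.BytesIO()
    torch.save(latent, buf)
    with gzip.open(path, "wb") as f:
        f.write(buf.getvalue())
```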

 

Training was done using a custom training script based off the diffusers library. I used a custom training script so that I could fully understand all the inner mechanics and implement any tweaks I wanted. Plus I had my training scripts from SD1.5 training, so it wasn't a huge leap. The downside is that a lot of time had to be spent debugging subtle issues that cropped up after several bugged runs. Those are all expensive mistakes. But, for me, mistakes are the cost of learning.

 

I think the training prompts are really important to the performance of the final model in actual usage. The custom Dataset class is responsible for doing a lot of heavy lifting when it comes to generating the training prompts. People prompt with everything from short prompts to long prompts, to prompts with all kinds of commas, underscores, typos, etc.

 

I pulled a large sample of AI images that included prompts to analyze the statistics of typical user prompts. The distribution of prompts followed a mostly normal distribution, with a mean of 32 tags and a std of 19.8. So my Dataset class reflects this. For every training sample, it picks a random integer in this distribution to determine how many tags it should use for this training sample. It shuffles the tags on the image and then truncates them to that number.
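
In code, the tag-count sampling looks something like this (the clamping bounds are my own assumption for illustration):

```python
import random

def sample_training_tags(tags: list[str], mean: float = 32.0, std: float = 19.8) -> list[str]:
    # Draw a prompt length from roughly Normal(32, 19.8), clamp it to something sane,
    # then shuffle the image's tags and truncate.
    n = int(round(random.gauss(mean, std)))
    n = max(1, min(n, len(tags)))
    shuffled = random.sample(tags, len(tags))
    return shuffled[:n]
```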

 

This means that during training the model sees everything from just "1girl" to a huge 224 token prompt! And thus, hopefully, learns to fill in the details for the user.

 

Certain tags, like watermark, are given priority and always included if present, so the model learns those tags strongly.

 

The tag alias list from danbooru is used to randomly mutate tags to synonyms so that bigASP understands all the different ways people might refer to a concept. Hopefully.

 

And, of course, the score tags. Just like Pony XL, bigASP encodes the score of a training sample as a range of tags of the form "score_X" and "score_X_up". However, to avoid the issues Pony XL ran into, only a random number of score tags are included in the training prompt. That way the model doesn't require "score_8, score_7, score_6," etc in the prompt to work correctly. It's already used to just a single, or a couple score tags being present.
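
A hedged sketch of that score-tag trick; the exact number of score tags sampled per example is an assumption:

```python
import random

def score_tags(score: int) -> list[str]:
    # An image with score s satisfies "score_s" and every "score_X_up" with X <= s.
    pool = [f"score_{score}"] + [f"score_{x}_up" for x in range(1, score + 1)]
    k = random.randint(1, 3)  # usually just one or a couple of tags
    return random.sample(pool, min(k, len(pool)))

# e.g. score_tags(8) might return ["score_8_up"] or ["score_8", "score_5_up"]
```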

 

10% of the time the prompt is dropped completely, being set to an empty string. UCG, you know the deal. N.B.!!! I noticed in Stability's training scripts, and even HuggingFace's scripts, that instead of setting the prompt to an empty string, they set it to "zero" in the embedded space. This is different from how SD1.5 was trained. And it's different from how most of the SD front-ends do inference on SD. My theory is that it can actually be a big problem if SDXL is trained with "zero" dropping instead of empty prompt dropping. That means that during inference, if you use an empty prompt, you're telling the model to move away not from the "average image", but away from only images that happened to have no caption during training. That doesn't sound right. So for bigASP I opt to train with empty prompt dropping.
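
The caption dropout itself is tiny; the point is that the dropped prompt is an empty string that still goes through the text encoders, not a zeroed embedding:

```python
import random

UCG_RATE = 0.10

def build_caption(tags: list[str]) -> str:
    if random.random() < UCG_RATE:
        return ""  # empty prompt dropping: the "" is still tokenized and encoded
    return ", ".join(tags)
```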

 

Additionally, Stability's training scripts include dropping of SDXL's other conditionings: original_size, crop, and target_size. I didn't see this behavior present in kohya's scripts, so I didn't use it. I'm not entirely sure what benefit it would provide.

 

I made sure that during training, the model gets a variety of batched prompt lengths. What I mean is, the prompts themselves for each training sample are certainly different lengths, but they all have to be padded to the longest example in a batch. So it's important to ensure that the model still sees a variety of lengths even after batching, otherwise it might overfit and only work on a specific range of prompt lengths. A quick Python Notebook to scan the training batches helped to verify a good distribution: 25% of batches were 225 tokens, 66% were 150, and 9% were 75 tokens. Though in future runs I might try to balance this more.
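
Here's roughly what that notebook check could look like, assuming captions are tokenized with the SDXL CLIP tokenizer and padded up in 75-token chunks (inferred from the 75/150/225 numbers above); the dataloader is a placeholder:

```python
import math
from collections import Counter

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

def batch_length_histogram(batches_of_captions) -> Counter:
    hist = Counter()
    for captions in batches_of_captions:
        longest = max(len(tokenizer(c).input_ids) for c in captions)
        padded = math.ceil(longest / 75) * 75  # batch pads to the longest caption's chunk
        hist[padded] += 1
    return hist
```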

 

The rest of the training process is fairly standard. I found min-snr loss to work best in my experiments. Pure fp16 training did not work for me, so I had to resort to mixed precision with the model in fp32. Since the latents are already encoded, the VAE doesn't need to be loaded, saving precious memory. For generating sample images during training, I use a separate machine which grabs the saved checkpoints and generates the sample images. Again, that saves memory and compute on the training machine.

 

The final run uses an effective batch size of 2048, no EMA, no offset noise, PyTorch's AMP with just float16 (not bfloat16), 1e-4 learning rate, AdamW, min-snr loss, 0.1 weight decay, cosine annealing with linear warmup for 100,000 training samples, 10% UCG rate, text encoder 1 training enabled, text encoder 2 kept frozen, min_snr_gamma=5, PyTorch GradScaler with an initial scaling of 65k, 0.9 beta1, 0.999 beta2, 1e-8 eps. Everything is initialized from SDXL 1.0.
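
In PyTorch terms, the optimizer and schedule boil down to something like the following; `unet` and the total sample count are placeholders (the run targeted roughly 30M samples):

```python
import math

import torch

optimizer = torch.optim.AdamW(
    unet.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.1
)
scaler = torch.cuda.amp.GradScaler(init_scale=65536.0)  # ~65k initial scaling

warmup_samples = 100_000
total_samples = 30_000_000  # placeholder horizon
batch_size = 2048

def lr_lambda(step: int) -> float:
    samples = step * batch_size
    if samples < warmup_samples:
        return samples / warmup_samples  # linear warmup
    progress = (samples - warmup_samples) / max(1, total_samples - warmup_samples)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))  # cosine anneal

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```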

 

A validation dataset of 2048 images is used. Validation is performed every 50,000 samples to ensure that the model is not overfitting and to help guide hyperparameter selection. To help compare runs with different loss functions, validation is always performed with the basic loss function, even if training is using e.g. min-snr. A checkpoint is saved every 500,000 samples. I find that it's really only helpful to look at sample images every million samples, so that process is run on every other checkpoint.

 

A stable training loss is also logged (I use Wandb to monitor my runs). Stable training loss is calculated at the same time as validation loss (one after the other). It's basically like a validation pass, except instead of using the validation dataset, it uses the first 2048 images from the training dataset, and uses a fixed seed. This provides a, well, stable training loss. SD's training loss is incredibly noisy, so this metric provides a much better gauge of how training loss is progressing.
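
Conceptually it's just the validation loop pointed at a frozen slice of the training set with a fixed seed; all of the names here are illustrative:

```python
import torch

@torch.no_grad()
def stable_training_loss(unet, stable_loader, compute_loss, device="cuda", seed=42):
    # stable_loader yields the first 2048 training images in a fixed order;
    # compute_loss is the same basic loss used for validation.
    generator = torch.Generator(device=device).manual_seed(seed)
    losses = []
    for batch in stable_loader:
        loss = compute_loss(unet, batch, generator=generator)  # same noise/timesteps every eval
        losses.append(loss.item())
    return sum(losses) / len(losses)
```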

 

The batch size I use is quite large compared to the few values I've seen online for finetuning runs. But it's informed by my experience with training other models. Large batch size wins in the long run, but is worse in the short run, so its efficacy can be challenging to measure on small scale benchmarks. Hopefully it was a win here. Full runs on SDXL are far too expensive for much experimentation here. But one immediate benefit of a large batch size is that iteration speed is faster, since optimization and gradient sync happens less frequently.

 

Training was done on an 8xH100 sxm5 machine rented in the cloud. On this machine, iteration speed is about 70 images/s. That means the whole run took about 5 solid days of computing. A staggering number for a hobbyist like me. Please send hugs. I hurt.

 

Training being done in the cloud was a big motivator for the use of precomputed latents. It takes me about an hour to get the data over to the machine to begin training. Theoretically the code could be set up to start training immediately, with the training data streamed in for the first pass. It takes even the 8xH100 four hours to work through a million images, so data can be streamed in faster than it's consumed. That way the machine isn't sitting idle burning money.

 

One disadvantage of precomputed latents is, of course, the lack of regularization from varying the latents between epochs. The model still sees a very large variety of prompts between epochs, but it won't see different crops of images or variations in VAE sampling. In future runs what I might do is have my local GPUs re-encoding the latents constantly and streaming those updated latents to the cloud machine. That way the latents change every few epochs. I didn't detect any overfitting on this run, so it might not be a big deal either way.
