Training Klein 9B/4B LoRAs with Musubi-Tuner on AMD Strix Halo

This guide walks you through setting up the tools and workflow for training a custom LoRA (Low-Rank Adaptation) using musubi-tuner on AMD Strix Halo APUs. It covers both Flux.2 Klein 4B and 9B models. For a similar guide on Z-Image training, see Training Z-Image LoRAs. For other models, small adjustments will be needed and will be addressed in future guides.

The goal of this guide is to help you set up the tools and environment for LoRA training. The actual training process requires experimentation to find settings that work best for your specific use case. These are the steps I followed to train a LoRA for Klein 9B and 4B based on my photographic style, with examples from my own training where applicable.

Hardware Requirements

This guide assumes a Strix Halo with 128GB of RAM for the default path. Refer to the Out of VRAM section if you run into problems.

Prerequisites

Make sure you have the following installed:

  • uv - A fast Python package installer and manager
  • git - Version control system

You can install uv and git on most Linux distributions:

# Arch Linux
sudo pacman -S uv git

# Ubuntu/Debian
sudo apt install uv git

Installation

Download the musubi wrapper script and copy it where you want. This is a small script I created to simplify the procedure. The script is easy to read, so if you are curious, have a look at what it does!

Now cd to the directory where the script is located. From there, you will need to:

# Make the script executable
chmod +x musubi.sh

# Install musubi-tuner
./musubi.sh setup

By default this will install musubi-tuner in your home directory. You can override the install directory:

export MUSUBI_TUNER_INSTALL_DIR="/musubi/installation/path"

The script defaults to downloading dependencies for Strix Halo. This can also be overridden:

# You can check all available architectures here: https://rocm.nightlies.amd.com/v2-staging/
# The example below is for Strix Point
export GFX_NAME="gfx1150"

All overrides must be performed before running the setup step.

Downloading Models

We need to download the Flux 2 Klein 4B or 9B model and its VAE (Autoencoder). The Klein repo’s VAE is not compatible with musubi-tuner, so we need the VAE from the main Flux 2 dev repository.

Note that we download the base model instead of the distilled 4-step model. This is intentional. The created LoRAs will work with the distilled model as well.

Klein 4B

  1. Go to https://huggingface.co/black-forest-labs/FLUX.2-klein-base-4B and download flux-2-klein-base-4b.safetensors + all .safetensors files from text_encoder
  2. Go to https://huggingface.co/black-forest-labs/FLUX.2-dev and download ae.safetensors
  3. Place all files in appropriate directories

Klein 9B

  1. Go to https://huggingface.co/black-forest-labs/FLUX.2-klein-base-9B and download flux-2-klein-base-9b.safetensors + all .safetensors files from text_encoder
  2. Go to https://huggingface.co/black-forest-labs/FLUX.2-dev and download ae.safetensors
  3. Place all files in appropriate directories

After downloading, note the paths to your model files. You’ll need them in the next steps.

Note: If you already have a VAE and/or text encoder (either sharded or single file) that you are using with the distilled version of the model (for example in ComfyUI), you can use those files for the following steps instead of downloading them again.
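If you prefer the command line, the same files can be fetched with huggingface-cli. This is an optional shortcut: it assumes the huggingface_hub package is installed and that you have accepted the model licenses on Hugging Face, and the --local-dir paths are just examples:

```shell
# Klein 4B DiT model (use FLUX.2-klein-base-9B for the 9B variant)
huggingface-cli download black-forest-labs/FLUX.2-klein-base-4B \
  flux-2-klein-base-4b.safetensors --local-dir models/dit

# Text encoder shards from the same repo
huggingface-cli download black-forest-labs/FLUX.2-klein-base-4B \
  --include "text_encoder/*.safetensors" --local-dir models/te

# VAE from the main FLUX.2 dev repo
huggingface-cli download black-forest-labs/FLUX.2-dev \
  ae.safetensors --local-dir models/vae
```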

Project Creation

We will now create the LoRA project. Once again, we will rely on the script to create the initial directory structure and a standard musubi-tuner configuration for Klein 4B/9B.

# Set the model version (e.g., klein-base-4b or klein-base-9b)
export MODEL_VERSION="klein-base-4b"

# Set the path of the diffusion model (flux-2-klein-base-4b.safetensors or flux-2-klein-base-9b.safetensors)
export DIT_MODEL="/path/to/dit/flux-2-klein-base-4b.safetensors"

# Set the path to the first shard of the text_encoder (e.g., model-00001-of-00002.safetensors or model-00001-of-00004.safetensors) 
# or to the merged model file (e.g., qwen_3_4b.safetensors or qwen_3_8b.safetensors)
export TEXT_ENCODER="/path/to/te/model-00001-of-00002.safetensors"

# Set the path to the VAE (e.g., ae.safetensors)
export VAE_MODEL="/path/to/vae/ae.safetensors"

# Set the project name. A folder with this name will be created
export PROJECT_NAME="my-lora"

# Create the project
./musubi.sh create
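For reference, based on the files used later in this guide, the created project should look roughly like this (the exact layout may differ):

```
my-lora/
├── dataset/                 # training images and captions go here
├── dataset.toml             # dataset configuration
├── training.toml            # training configuration
├── reference_prompts.txt    # prompts for sample images
└── output/                  # checkpoints are written here
```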

Dataset Preparation

A good dataset is crucial for training a useful LoRA. The following are technical guidelines and rules of thumb to get you started. However, experimentation is key - different datasets and styles may require different approaches.

Adding Images

Place your training images in the dataset directory of your project. Each image should have:

  • File format: PNG, JPG, or WEBP
  • Resolution: Aim for 1024x1024. Using a single resolution for all images reduces the number of buckets, and therefore of partially filled batches, which speeds up training
  • Aspect ratio: Square images work best but other ratios will work too

As a rule of thumb, anywhere from 20 to 200 images will work. The quality of the images is more important than the number.

Adding Captions

For each image, create a corresponding text file with the same name but .txt extension:

dataset/
├── image1.jpg
├── image1.txt          # Caption for image1.jpg
├── image2.png
├── image2.txt          # Caption for image2.png
└── ...

Caption rules of thumb:

  • For styles: describe the scene but not the style itself. I had good results even with completely empty captions.
  • For characters: a trigger word + a short description seems to work well.
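If you go the empty-caption route for a style, a tiny helper script can stamp out the caption files. This is a hypothetical convenience script, not part of musubi-tuner:

```python
# Hypothetical helper: create an empty .txt caption next to every image
# in the dataset directory (for the "empty captions for styles" approach).
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}

def create_empty_captions(dataset_dir="dataset"):
    # sorted() materializes the listing before we start creating files
    for img in sorted(Path(dataset_dir).iterdir()):
        if img.suffix.lower() in IMAGE_EXTS:
            caption = img.with_suffix(".txt")
            if not caption.exists():  # don't clobber captions you wrote by hand
                caption.touch()
```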

Editing the Dataset Config

In the project directory, you will find a dataset.toml file. It is usable as-is, but here is an explanation of some of its parameters:

  • resolution: Target resolution for training
  • batch_size: How many images to process at once (reduce if you run out of VRAM)
  • enable_bucket: Groups images by aspect ratio into resolution buckets, so non-square images can be trained on without heavy cropping (keeps more detail)
  • num_repeats: How many times to cycle through the dataset per epoch (higher = more training)
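As a sketch of how these parameters fit together, a minimal dataset.toml might look like the following (key names follow musubi-tuner's dataset config format as I understand it; keep whatever the create step generated and only tweak values):

```toml
[general]
resolution = [1024, 1024]
batch_size = 1
enable_bucket = true
caption_extension = ".txt"

[[datasets]]
image_directory = "dataset"
num_repeats = 2
```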

Creating Reference Prompts

Another notable file is called reference_prompts.txt. Reference prompts are used to generate sample images during training, so you can see how the LoRA is progressing.

Example with a single prompt:

A close-up portrait by mikkoph of a very young woman with fair skin and striking blue eyes, looking directly at the camera with a soft, serene expression. Her blonde hair is styled in an elegant updo, adorned with numerous small white flowers, possibly daisies, nestled throughout the curls. She wears a floral-patterned blouse with black, white, and gold flowers, a pearl earring in her right ear, and has a manicure with white nail polish. Her hands are gently cupped around her face, with her fingers lightly touching her cheeks. The background is a deep, dark blue, creating a dramatic contrast that highlights her features and the delicate details of her look. --w 1024 --h 1024 --d 42 --s 40

Each line is a separate prompt that will be sampled during training. The trailing flags set the sample's width (--w), height (--h), seed (--d), and number of steps (--s). Add as many prompts as you need, but keep in mind that each one makes the training session longer.

Training Configuration

The following explains the most relevant parameters from the training.toml file in your project directory:

  • network_dim: Dimension of the LoRA (8-16 is usually suitable for simple styles, 32-64 for more complex concepts or characters)
  • learning_rate: 1e-4 is a good starting point (adjust if the loss doesn’t decrease)
  • max_train_epochs: How many training cycles (10-50, depending on dataset size)
  • save_every_n_epochs: How often to save checkpoints
  • save_state: Saves the training state with each checkpoint. This allows stopping and resuming training. It consumes more disk space and VRAM

Important: All the settings mentioned above are starting defaults. There is no one-size-fits-all configuration. You will need to experiment with these values to find what works best for your specific dataset and goals. The values provided are based on my experience, but your results may vary significantly.
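Putting the starting defaults above together, the relevant portion of a training.toml might look like this (a sketch, not a complete file; keep the other keys your generated config already contains):

```toml
network_dim = 16
learning_rate = 1e-4
max_train_epochs = 30
save_every_n_epochs = 2
save_state = true
```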

Running Training

# Cache latents and prompts. This speeds up training considerably
./musubi.sh cache

# Run the training
./musubi.sh train

Note: You are likely to see many warnings when running this command. They are harmless and can be ignored.

Resuming Training

If you stopped training early, or feel that the LoRA is undertrained even after finishing, you can resume training, provided save_state was set to true (the default) in training.toml. To resume, simply run:

./musubi.sh train

This will automatically find the latest saved state and restart training from there.

Monitoring Training

During training, you’ll see:

  1. Loss values - Should decrease over time. If it stays flat or increases, your learning rate may be too high. However, do not rely too much on this value.
  2. Sample images - Generated every 2 epochs (or whatever you set) showing how the LoRA is learning
  3. Checkpoint files - Saved to the output directory of the project every 2 epochs (or whatever you set)

What to Look For

Epoch   What to Check
1-5     Loss should start decreasing
5-10    Sample images should show the style emerging
10-20   Check for overfitting (samples look too much like training images)
20+     If loss is still decreasing, consider more epochs

Here are some examples from my training:

  • Epoch 0 (start) - Training loss and sample images at the beginning of training
  • Epoch 6 - Early learning phase showing initial style emergence
  • Epoch 12 - Mid-training with clearer style development
  • Epoch 20 - Best checkpoint with optimal style quality
  • Epoch 30 - Overtrained checkpoint with degraded quality

The last checkpoint is clearly overtrained: quality has degraded significantly. I used checkpoint 20 instead.

However, I still tested all of the most promising checkpoints with the distilled model using ComfyUI. That’s the only way to be sure.

Using Your Trained LoRA

After training completes, you’ll have checkpoint files in the output directory of your project:

output/
├── my-lora-000002.safetensors    # Epoch 2 checkpoint
├── my-lora-000004.safetensors    # Epoch 4 checkpoint
├── my-lora-000006.safetensors    # Epoch 6 checkpoint
└── ...

The final checkpoint won’t have a sequence number.

In ComfyUI

  1. Place the .safetensors file in your ComfyUI models/loras/ directory
  2. Add a Load LoRA node to your workflow
  3. Connect it to your Flux model nodes
  4. Adjust the LoRA strength. Start with 1.0, but don’t be afraid to push it significantly higher or lower.

In other tools

Most tools that support Stable Diffusion LoRAs will work with Flux LoRAs. Look for a “Load LoRA” or similar node/module.

Results

Here are some example images generated using my LoRA. They are generated with the same prompt and seed using Klein 4B.

[Five before/after image pairs: same prompt and seed, without and with the LoRA]

Merging Checkpoints with EMA

The musubi.sh script includes an ema command that performs Exponential Moving Average (EMA) merging of LoRA checkpoints. This post-training technique combines multiple checkpoints into a single, often improved checkpoint.

EMA works by applying a weighted average to checkpoint parameters, where a beta value (default: 0.95) controls the weighting between earlier and later checkpoints. This can be useful when:

  • You have multiple promising checkpoints but are unsure which is optimal
  • Training showed consistent improvement across epochs
  • You want to avoid testing each checkpoint individually

Usage:

./musubi.sh ema output/my-lora-000016.safetensors output/my-lora-000018.safetensors output/my-lora-000020.safetensors --output_file my-lora-ema.safetensors

You can also adjust the beta parameter (e.g., --beta 0.97) to give more or less weight to earlier checkpoints. Note that EMA works best when checkpoints are from similar training phases and show consistent improvement.
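For the curious, the underlying technique can be sketched in a few lines of Python. This is my reading of EMA merging in general, not necessarily musubi-tuner's exact implementation, and real checkpoints hold tensors rather than the plain floats used here for clarity:

```python
# Sketch of EMA checkpoint merging: checkpoints are folded together
# oldest-first, so each later checkpoint contributes weight (1 - beta)
# while earlier ones decay geometrically.
def ema_merge(checkpoints, beta=0.95):
    """checkpoints: list (oldest first) of dicts mapping param name -> value."""
    merged = dict(checkpoints[0])
    for ckpt in checkpoints[1:]:
        for name, value in ckpt.items():
            merged[name] = beta * merged[name] + (1 - beta) * value
    return merged
```

A higher beta keeps more of the accumulated (earlier) average, which matches the note above about weighting earlier checkpoints more heavily.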

Troubleshooting

Out of VRAM

If you get “out of memory” errors:

  • Reduce batch_size in dataset.toml
  • Try optimizer_type = "adamw8bit" in training.toml
  • Reduce resolution (try 768x768)
  • Reduce max_data_loader_n_workers (try 1)
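Concretely, the low-memory overrides above look like this in the respective config files (the values are starting suggestions, not guaranteed fits):

```toml
# In dataset.toml:
resolution = [768, 768]
batch_size = 1

# In training.toml:
optimizer_type = "adamw8bit"
max_data_loader_n_workers = 1
```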

Samples Look Bad

  • Train for more epochs (style may need time to emerge)
  • Check your dataset quality (images should be clear, captions should be good)
  • Try a different network_dim (higher for complex styles)

Conclusion

You now have a complete workflow for setting up and training custom LoRAs with musubi-tuner on AMD GPUs. Start with a small dataset (50-100 images) and experiment with different settings to find what works best for your use case.

Happy training!