Trainable Quantum Kernel — KTA Optimization
What You'll Learn:
- How adding learnable parameters to a quantum feature map lets the kernel adapt to a specific classification task
- Why kernel-target alignment (KTA) is a differentiable proxy for classification performance
- How PennyLane's autograd computes gradients through a full kernel matrix
- How to compare trained vs random kernels and verify the optimization worked
Level: Advanced | Time: 30 minutes | Qubits: 4 | Framework: PennyLane
Prerequisites
- Quantum Kernel — kernel trick, feature maps, kernel matrices
- Fidelity Kernel — inversion test, SWAP test, kernel properties
- Bell State — entanglement, measurement
The Idea
Standard quantum kernels use a fixed feature map to encode classical data into quantum states. The kernel value K(x, y) = |<phi(x)|phi(y)>|^2 depends entirely on the choice of encoding circuit. If the encoding is poorly suited to the data, the kernel will not separate the classes well — no matter how powerful the downstream classifier.
Trainable quantum kernels solve this by introducing learnable parameters theta into the feature map:
K_theta(x, y) = |<phi_theta(x)|phi_theta(y)>|^2
Instead of hoping the fixed encoding works, we optimize theta to maximize how well the kernel matrix aligns with the classification labels. This is analogous to how neural networks learn feature representations — except here the "features" are quantum states.
The key insight: you do not need to train a full variational classifier. By optimizing the kernel itself, you can use proven classical methods (SVM, Gaussian processes) on top, getting the best of both worlds.
How It Works
Kernel-Target Alignment (KTA)
KTA measures how well a kernel matrix K matches the ideal kernel for a binary classification task. The ideal kernel is yy^T, where y is the label vector (+1/-1):
- Same-class pairs: y_i * y_j = +1 (kernel should be high)
- Different-class pairs: y_i * y_j = -1 (kernel should be low)
KTA is the cosine similarity between K and yy^T under the Frobenius inner product — a single number that summarizes kernel quality.
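KTA itself is a few lines of NumPy. The helper below is an illustrative sketch (not necessarily identical to the tutorial's `circuit.py` implementation):

```python
import numpy as np

def kernel_target_alignment(K, y):
    """Cosine similarity between kernel matrix K and the ideal kernel yy^T."""
    y = np.asarray(y, dtype=float)
    ideal = np.outer(y, y)        # yy^T: +1 for same-class pairs, -1 otherwise
    inner = np.sum(K * ideal)     # Frobenius inner product <K, yy^T>_F
    return inner / (np.linalg.norm(K) * np.linalg.norm(ideal))

# The ideal kernel aligns perfectly with itself:
y = np.array([1, 1, -1, -1])
print(kernel_target_alignment(np.outer(y, y), y))  # -> 1.0
```

For contrast, the identity kernel (every point similar only to itself) scores 0.5 on this label vector — informative diagonal, but no class structure off-diagonal.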
Training Loop
- Initialize random parameters theta_0 ~ Uniform(0, 2pi)
- Build kernel matrix K_theta by evaluating K_theta(x_i, x_j) for all training pairs
- Compute KTA(K_theta, y) — how well does this kernel separate the classes?
- Differentiate through the quantum circuit: PennyLane's autograd computes dKTA/dtheta
- Update theta via gradient ascent (we maximize KTA): theta <- theta + eta * dKTA/dtheta
- Repeat until convergence
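The loop above can be sketched end-to-end on a toy problem. The sketch below uses a hypothetical single-qubit feature map (the trainable rotations are sandwiched between two encoding layers, so they do not cancel in the fidelity) and finite-difference gradients with a backtracking step in place of PennyLane's autograd — a minimal illustration, not the tutorial's `circuit.py`:

```python
import numpy as np

def ry(a): return np.array([[np.cos(a/2), -np.sin(a/2)],
                            [np.sin(a/2),  np.cos(a/2)]])
def rz(a): return np.diag([np.exp(-1j*a/2), np.exp(1j*a/2)])

def feature_state(x, theta):
    # Trainable block between two encoding layers: if it came last,
    # it would cancel in |<phi(x)|phi(y)>|^2 and training would do nothing.
    psi = ry(np.pi * x) @ np.array([1.0, 0.0])
    psi = ry(theta[1]) @ rz(theta[0]) @ psi
    return ry(np.pi * x) @ psi

def kernel_matrix(X, theta):
    S = [feature_state(x, theta) for x in X]
    return np.array([[abs(np.vdot(a, b))**2 for b in S] for a in S])

def kta(K, y):
    ideal = np.outer(y, y)
    return np.sum(K * ideal) / (np.linalg.norm(K) * np.linalg.norm(ideal))

X = np.array([0.1, 0.2, 0.8, 0.9])                       # two clusters of 1-d inputs
y = np.array([1.0, 1.0, -1.0, -1.0])
theta = np.random.default_rng(0).uniform(0, 2*np.pi, 2)  # step 1: random init

eps = 1e-6
history = [kta(kernel_matrix(X, theta), y)]
for _ in range(25):                                      # steps 2-6
    grad = np.array([                                    # finite differences stand in
        (kta(kernel_matrix(X, theta + eps*np.eye(2)[k]), y)   # for autograd here
         - kta(kernel_matrix(X, theta - eps*np.eye(2)[k]), y)) / (2*eps)
        for k in range(2)])
    step = 0.5
    while step > 1e-6:                                   # backtracking gradient ascent
        if kta(kernel_matrix(X, theta + step*grad), y) > history[-1]:
            theta = theta + step*grad
            break
        step /= 2
    history.append(kta(kernel_matrix(X, theta), y))

print(f"KTA: {history[0]:.3f} -> {history[-1]:.3f}")
```

The accept-only-if-improved step keeps the KTA history monotone, which is a convenient sanity check; plain fixed-rate ascent (as in the tutorial) works too but can wiggle.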
Feature Map Architecture
Each repetition applies three layers:
```
Layer (per rep):

     ┌─────────┐┌────────┐┌────────┐
q_0: ┤ RY(x₀π) ├┤ RZ(θ₀) ├┤ RY(θ₁) ├──■───────
     ├─────────┤├────────┤├────────┤┌─┴─┐
q_1: ┤ RY(x₁π) ├┤ RZ(θ₂) ├┤ RY(θ₃) ├┤ X ├──■──
     ├─────────┤├────────┤├────────┤└───┘┌─┴─┐
q_2: ┤ RY(x₀π) ├┤ RZ(θ₄) ├┤ RY(θ₅) ├─────┤ X ├
     ├─────────┤├────────┤├────────┤     └───┘
q_3: ┤ RY(x₁π) ├┤ RZ(θ₆) ├┤ RY(θ₇) ├──────────
     └─────────┘└────────┘└────────┘
```

Repeat for n_reps layers, then apply the adjoint for the second data point.
- Data layer: RY(x_i * pi) — angle encoding maps features to Bloch sphere rotations
- Trainable layer: RZ(theta) RY(theta) — two free rotations per qubit that the optimizer tunes
- Entangling layer: CNOT chain — creates correlations between encoded features
The kernel is computed via the inversion test: apply phi_theta(x) then phi_theta^dagger(y), and measure the probability of the all-zeros state.
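Numerically, the inversion test and the direct state fidelity give the same number. A sketch with a hypothetical one-qubit encoding U(x) = RY(x·pi) (not the tutorial's 4-qubit map):

```python
import numpy as np

def ry(a): return np.array([[np.cos(a/2), -np.sin(a/2)],
                            [np.sin(a/2),  np.cos(a/2)]])

# Hypothetical 1-qubit encoding U(x) = RY(x*pi), for illustration only.
def U(x): return ry(np.pi * x)

x1, x2 = 0.3, 0.7
e0 = np.array([1.0, 0.0])

# Inversion test: apply U(x1), then U(x2)^dagger, read off P(|0>).
final = U(x2).conj().T @ U(x1) @ e0
p_zero = abs(final[0])**2

# Same number as the state fidelity |<phi(x2)|phi(x1)>|^2:
fidelity = abs(np.vdot(U(x2) @ e0, U(x1) @ e0))**2
print(p_zero, fidelity)
```

Both equal cos^2(0.2·pi) here, since RY(x1·pi) followed by RY(-x2·pi) is just RY((x1-x2)·pi).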
The Math
KTA Formula
KTA(K, y) = <K, yy^T>_F / (||K||_F * ||yy^T||_F)
where:
- <A, B>_F = sum_ij A_ij * B_ij = tr(A^T B)   (Frobenius inner product)
- ||A||_F = sqrt(sum_ij A_ij^2)   (Frobenius norm)
KTA Gradient
The gradient dKTA/dtheta flows through:
```
dKTA/dtheta = d/dtheta [ sum_ij K_ij(theta) * y_i * y_j ] / (||K||_F * ||yy^T||_F)
              + normalization correction terms
```
Each kernel element K_ij(theta) = |<0|U_theta^dagger(x_j) U_theta(x_i)|0>|^2 is differentiable via PennyLane's parameter-shift rule:
dK_ij/dtheta_k = [K_ij(theta_k + pi/2) - K_ij(theta_k - pi/2)] / 2
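For a gate generated by a Pauli operator, this two-point shift is the exact gradient, not a finite-difference approximation. A quick check on the scalar function f(theta) = |<0|RY(theta)|0>|^2 = cos^2(theta/2):

```python
import numpy as np

# f(theta) = |<0|RY(theta)|0>|^2 = cos^2(theta/2): a one-parameter "kernel element".
f = lambda t: np.cos(t/2)**2

theta = 0.7
shift = (f(theta + np.pi/2) - f(theta - np.pi/2)) / 2   # parameter-shift rule
exact = -np.sin(theta) / 2                              # analytic derivative
print(shift, exact)
```

The two values agree to machine precision, at *any* theta and with macroscopic shifts of pi/2 — which is why the rule is hardware-friendly.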
Parameter Count
For n_qubits qubits and n_reps repetitions:
n_params = n_reps * n_qubits * 2
With defaults (4 qubits, 2 reps): 16 trainable parameters.
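In code, with the defaults above:

```python
n_qubits, n_reps = 4, 2
n_params = n_reps * n_qubits * 2   # one RZ angle + one RY angle per qubit per rep
print(n_params)  # -> 16
```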
Expected Output
| Metric | Random Kernel | Trained Kernel |
|---|---|---|
| KTA | ~0.05 - 0.30 | ~0.40 - 0.80 |
| Improvement | — | +0.2 to +0.5 |
| Diagonal K(x,x) | 1.0 | 1.0 |
| Kernel symmetry | K = K^T | K = K^T |
Exact values depend on the random seed and number of iterations. More iterations and more samples yield higher trained KTA.
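The two invariants in the table (unit diagonal, symmetry) hold for any fidelity kernel and can be checked directly. A sketch using a toy one-qubit encoding RY(x·pi) (an assumption for illustration, not the tutorial's circuit):

```python
import numpy as np

def ry(a): return np.array([[np.cos(a/2), -np.sin(a/2)],
                            [np.sin(a/2),  np.cos(a/2)]])

# Toy fidelity kernel over a hypothetical 1-qubit encoding RY(x*pi).
X = np.array([0.1, 0.4, 0.9])
states = [ry(np.pi * x) @ np.array([1.0, 0.0]) for x in X]
K = np.array([[abs(np.vdot(a, b))**2 for b in states] for a in states])

assert np.allclose(np.diag(K), 1.0)   # K(x, x) = 1 for normalized states
assert np.allclose(K, K.T)            # |<a|b>|^2 = |<b|a>|^2, so K is symmetric
```

These checks make good unit tests: they must pass for every theta, trained or random.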
Running the Circuit
```python
from circuit import run_circuit, verify_trainable_kernel

# Train the kernel and compare with random baseline
result = run_circuit(n_samples=12, n_iterations=15)
print(f"Random KTA: {result['random_kta']:.4f}")
print(f"Trained KTA: {result['trained_kta']:.4f}")
print(f"Improvement: {result['improvement']:.4f}")

# Run verification suite
v = verify_trainable_kernel()
for check in v["checks"]:
    status = "PASS" if check["passed"] else "FAIL"
    print(f"[{status}] {check['name']}: {check['detail']}")
```
Try It Yourself
- More iterations: Increase `n_iterations=50` and watch KTA climb. Plot `result['kta_history']` to see the convergence curve.
- Harder data: Move the class centers closer (edit `CLASS_0_CENTER` and `CLASS_1_CENTER` in `circuit.py`). Does the trained kernel still improve? How many iterations does it need?
- Fewer qubits: Try `n_qubits=2`. With fewer parameters, can the kernel still adapt? Compare final KTA with the 4-qubit version.
- More reps: Increase `n_reps=4` (more circuit depth). Does deeper = better, or does the optimizer struggle with more parameters?
- Learning rate sweep: Try `learning_rate` values from 0.01 to 0.5. What happens when it is too high? Too low?
What's Next
- Quantum Kernel SVM — Use the trained kernel for classification with support vector machines
- Fidelity Kernel — Compare with the fixed SWAP test kernel (no training)
- Projected Quantum Kernel — Classical post-processing of quantum measurements as an alternative
Applications
| Domain | Use case |
|---|---|
| Task-specific kernels | Optimize the quantum embedding for a particular dataset |
| Few-shot learning | KTA training works with small datasets where neural networks overfit |
| Transfer learning | Pre-train kernel parameters on related tasks, fine-tune on target |
| Kernel ensembles | Combine multiple trained kernels for robust classification |
| Quantum advantage studies | Compare trainable quantum kernels vs classical RBF/polynomial kernels |
References
- Hubregtsen, T. et al. (2022). "Training quantum embedding kernels on near-term quantum devices." Physical Review A 106, 042431. DOI: 10.1103/PhysRevA.106.042431
- Glick, J.R. et al. (2024). "Covariant quantum kernels for data with group structure." Nature Physics 20, 1027-1036. DOI: 10.1038/s41567-023-02288-w
- Cristianini, N. et al. (2001). "On Kernel-Target Alignment." Advances in Neural Information Processing Systems 14.
- Schuld, M. & Killoran, N. (2019). "Quantum Machine Learning in Feature Hilbert Spaces." Physical Review Letters 122, 040504. DOI: 10.1103/PhysRevLett.122.040504