Trainable Quantum Kernel — KTA Optimization
What You'll Learn:
- How adding learnable parameters to a quantum feature map lets the kernel adapt to a specific classification task
- Why kernel-target alignment (KTA) is a differentiable proxy for classification performance
- How PennyLane's autograd computes gradients through a full kernel matrix
- How to compare trained vs random kernels and verify the optimization worked
Level: Advanced | Time: 30 minutes | Qubits: 4 | Framework: PennyLane
Prerequisites
- Quantum Kernel — kernel trick, feature maps, kernel matrices
- Fidelity Kernel — inversion test, SWAP test, kernel properties
- Bell State — entanglement, measurement
The Idea
Standard quantum kernels use a fixed feature map to encode classical data into quantum states. The kernel value K(x, y) = |<phi(x)|phi(y)>|^2 depends entirely on the choice of encoding circuit. If the encoding is poorly suited to the data, the kernel will not separate the classes well — no matter how powerful the downstream classifier.
Trainable quantum kernels solve this by introducing learnable parameters theta into the feature map:
K_theta(x, y) = |<phi_theta(x)|phi_theta(y)>|^2
Instead of hoping the fixed encoding works, we optimize theta to maximize how well the kernel matrix aligns with the classification labels. This is analogous to how neural networks learn feature representations — except here the "features" are quantum states.
The key insight: you do not need to train a full variational classifier. By optimizing the kernel itself, you can use proven classical methods (SVM, Gaussian processes) on top, getting the best of both worlds.
How It Works
Kernel-Target Alignment (KTA)
KTA measures how well a kernel matrix K matches the ideal kernel for a binary classification task. The ideal kernel is yy^T, where y is the label vector (+1/-1):
- Same-class pairs: y_i * y_j = +1 (kernel should be high)
- Different-class pairs: y_i * y_j = -1 (kernel should be low)
KTA is the cosine similarity between K and yy^T under the Frobenius inner product — a single number that summarizes kernel quality.
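KTA itself is a few lines of NumPy. The helper below is an illustrative sketch (not necessarily identical to the tutorial's `circuit.py` implementation):

```python
import numpy as np

def kernel_target_alignment(K, y):
    """Cosine similarity between kernel matrix K and the ideal kernel yy^T."""
    y = np.asarray(y, dtype=float)
    ideal = np.outer(y, y)        # yy^T: +1 for same-class pairs, -1 otherwise
    inner = np.sum(K * ideal)     # Frobenius inner product <K, yy^T>_F
    return inner / (np.linalg.norm(K) * np.linalg.norm(ideal))

# The ideal kernel aligns perfectly with itself:
y = np.array([1, 1, -1, -1])
print(kernel_target_alignment(np.outer(y, y), y))  # -> 1.0
```

For contrast, the identity kernel (every point similar only to itself) scores 0.5 on this label vector — informative diagonal, but no class structure off-diagonal.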
Training Loop
- Initialize random parameters theta_0 ~ Uniform(0, 2pi)
- Build kernel matrix K_theta by evaluating K_theta(x_i, x_j) for all training pairs
- Compute KTA(K_theta, y) — how well does this kernel separate the classes?
- Differentiate through the quantum circuit: PennyLane's autograd computes dKTA/dtheta
- Update theta via gradient ascent (we maximize KTA): theta <- theta + eta * dKTA/dtheta
- Repeat until convergence
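The loop above can be sketched end-to-end on a toy problem. The sketch below uses a hypothetical single-qubit feature map (the trainable rotations are sandwiched between two encoding layers, so they do not cancel in the fidelity) and finite-difference gradients with a backtracking step in place of PennyLane's autograd — a minimal illustration, not the tutorial's `circuit.py`:

```python
import numpy as np

def ry(a): return np.array([[np.cos(a/2), -np.sin(a/2)],
                            [np.sin(a/2),  np.cos(a/2)]])
def rz(a): return np.diag([np.exp(-1j*a/2), np.exp(1j*a/2)])

def feature_state(x, theta):
    # Trainable block between two encoding layers: if it came last,
    # it would cancel in |<phi(x)|phi(y)>|^2 and training would do nothing.
    psi = ry(np.pi * x) @ np.array([1.0, 0.0])
    psi = ry(theta[1]) @ rz(theta[0]) @ psi
    return ry(np.pi * x) @ psi

def kernel_matrix(X, theta):
    S = [feature_state(x, theta) for x in X]
    return np.array([[abs(np.vdot(a, b))**2 for b in S] for a in S])

def kta(K, y):
    ideal = np.outer(y, y)
    return np.sum(K * ideal) / (np.linalg.norm(K) * np.linalg.norm(ideal))

X = np.array([0.1, 0.2, 0.8, 0.9])                       # two clusters of 1-d inputs
y = np.array([1.0, 1.0, -1.0, -1.0])
theta = np.random.default_rng(0).uniform(0, 2*np.pi, 2)  # step 1: random init

eps = 1e-6
history = [kta(kernel_matrix(X, theta), y)]
for _ in range(25):                                      # steps 2-6
    grad = np.array([                                    # finite differences stand in
        (kta(kernel_matrix(X, theta + eps*np.eye(2)[k]), y)   # for autograd here
         - kta(kernel_matrix(X, theta - eps*np.eye(2)[k]), y)) / (2*eps)
        for k in range(2)])
    step = 0.5
    while step > 1e-6:                                   # backtracking gradient ascent
        if kta(kernel_matrix(X, theta + step*grad), y) > history[-1]:
            theta = theta + step*grad
            break
        step /= 2
    history.append(kta(kernel_matrix(X, theta), y))

print(f"KTA: {history[0]:.3f} -> {history[-1]:.3f}")
```

The accept-only-if-improved step keeps the KTA history monotone, which is a convenient sanity check; plain fixed-rate ascent (as in the tutorial) works too but can wiggle.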
Feature Map Architecture
Each repetition applies three layers:
```
Layer (per rep):

     ┌─────────┐┌────────┐┌────────┐
q_0: ┤ RY(x₀π) ├┤ RZ(θ₀) ├┤ RY(θ₁) ├──■───────
     ├─────────┤├────────┤├────────┤┌─┴─┐
q_1: ┤ RY(x₁π) ├┤ RZ(θ₂) ├┤ RY(θ₃) ├┤ X ├──■──
     ├─────────┤├────────┤├────────┤└───┘┌─┴─┐
q_2: ┤ RY(x₀π) ├┤ RZ(θ₄) ├┤ RY(θ₅) ├─────┤ X ├
     ├─────────┤├────────┤├────────┤     └───┘
q_3: ┤ RY(x₁π) ├┤ RZ(θ₆) ├┤ RY(θ₇) ├──────────
     └─────────┘└────────┘└────────┘
```

Repeat for n_reps layers, then apply the adjoint for the second data point.
- Data layer: RY(x_i * pi) — angle encoding maps features to Bloch sphere rotations
- Trainable layer: RZ(theta) RY(theta) — two free rotations per qubit that the optimizer tunes
- Entangling layer: CNOT chain — creates correlations between encoded features
The kernel is computed via the inversion test: apply phi_theta(x) then phi_theta^dagger(y), and measure the probability of the all-zeros state.
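Numerically, the inversion test and the direct state fidelity give the same number. A sketch with a hypothetical one-qubit encoding U(x) = RY(x·pi) (not the tutorial's 4-qubit map):

```python
import numpy as np

def ry(a): return np.array([[np.cos(a/2), -np.sin(a/2)],
                            [np.sin(a/2),  np.cos(a/2)]])

# Hypothetical 1-qubit encoding U(x) = RY(x*pi), for illustration only.
def U(x): return ry(np.pi * x)

x1, x2 = 0.3, 0.7
e0 = np.array([1.0, 0.0])

# Inversion test: apply U(x1), then U(x2)^dagger, read off P(|0>).
final = U(x2).conj().T @ U(x1) @ e0
p_zero = abs(final[0])**2

# Same number as the state fidelity |<phi(x2)|phi(x1)>|^2:
fidelity = abs(np.vdot(U(x2) @ e0, U(x1) @ e0))**2
print(p_zero, fidelity)
```

Both equal cos^2(0.2·pi) here, since RY(x1·pi) followed by RY(-x2·pi) is just RY((x1-x2)·pi).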
The Math
KTA Formula
KTA(K, y) = <K, yy^T>_F / (||K||_F * ||yy^T||_F)
where:
- <A, B>_F = sum_ij A_ij * B_ij = tr(A^T B)   (Frobenius inner product)
- ||A||_F = sqrt(sum_ij A_ij^2)   (Frobenius norm)
KTA Gradient
The gradient dKTA/dtheta flows through:
```
dKTA/dtheta = d/dtheta [ sum_ij K_ij(theta) * y_i * y_j ] / (||K||_F * ||yy^T||_F)
              + normalization correction terms
```
Each kernel element K_ij(theta) = |<0|U_theta^dagger(x_j) U_theta(x_i)|0>|^2 is differentiable via PennyLane's parameter-shift rule:
dK_ij/dtheta_k = [K_ij(theta_k + pi/2) - K_ij(theta_k - pi/2)] / 2
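For a gate generated by a Pauli operator, this two-point shift is the exact gradient, not a finite-difference approximation. A quick check on the scalar function f(theta) = |<0|RY(theta)|0>|^2 = cos^2(theta/2):

```python
import numpy as np

# f(theta) = |<0|RY(theta)|0>|^2 = cos^2(theta/2): a one-parameter "kernel element".
f = lambda t: np.cos(t/2)**2

theta = 0.7
shift = (f(theta + np.pi/2) - f(theta - np.pi/2)) / 2   # parameter-shift rule
exact = -np.sin(theta) / 2                              # analytic derivative
print(shift, exact)
```

The two values agree to machine precision, at *any* theta and with macroscopic shifts of pi/2 — which is why the rule is hardware-friendly.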
Parameter Count
For n_qubits qubits and n_reps repetitions:
n_params = n_reps * n_qubits * 2
With defaults (4 qubits, 2 reps): 16 trainable parameters.
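In code, with the defaults above:

```python
n_qubits, n_reps = 4, 2
n_params = n_reps * n_qubits * 2   # one RZ angle + one RY angle per qubit per rep
print(n_params)  # -> 16
```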
Expected Output
| Metric | Random Kernel | Trained Kernel |
|---|---|---|
| KTA | ~0.05 - 0.30 | ~0.40 - 0.80 |
| Improvement | — | +0.2 to +0.5 |
| Diagonal K(x,x) | 1.0 | 1.0 |
| Kernel symmetry | K = K^T | K = K^T |
Exact values depend on the random seed and number of iterations. More iterations and more samples yield higher trained KTA.
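The two invariants in the table (unit diagonal, symmetry) hold for any fidelity kernel and can be checked directly. A sketch using a toy one-qubit encoding RY(x·pi) (an assumption for illustration, not the tutorial's circuit):

```python
import numpy as np

def ry(a): return np.array([[np.cos(a/2), -np.sin(a/2)],
                            [np.sin(a/2),  np.cos(a/2)]])

# Toy fidelity kernel over a hypothetical 1-qubit encoding RY(x*pi).
X = np.array([0.1, 0.4, 0.9])
states = [ry(np.pi * x) @ np.array([1.0, 0.0]) for x in X]
K = np.array([[abs(np.vdot(a, b))**2 for b in states] for a in states])

assert np.allclose(np.diag(K), 1.0)   # K(x, x) = 1 for normalized states
assert np.allclose(K, K.T)            # |<a|b>|^2 = |<b|a>|^2, so K is symmetric
```

These checks make good unit tests: they must pass for every theta, trained or random.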
Running the Circuit
```python
from circuit import run_circuit, verify_trainable_kernel

# Train the kernel and compare with random baseline
result = run_circuit(n_samples=12, n_iterations=15)
print(f"Random KTA: {result['random_kta']:.4f}")
print(f"Trained KTA: {result['trained_kta']:.4f}")
print(f"Improvement: {result['improvement']:.4f}")

# Run verification suite
v = verify_trainable_kernel()
for check in v["checks"]:
    status = "PASS" if check["passed"] else "FAIL"
    print(f"[{status}] {check['name']}: {check['detail']}")
```
Try It Yourself
- More iterations: Increase `n_iterations=50` and watch KTA climb. Plot `result['kta_history']` to see the convergence curve.
- Harder data: Move the class centers closer (edit `CLASS_0_CENTER` and `CLASS_1_CENTER` in `circuit.py`). Does the trained kernel still improve? How many iterations does it need?
- Fewer qubits: Try `n_qubits=2`. With fewer parameters, can the kernel still adapt? Compare final KTA with the 4-qubit version.
- More reps: Increase `n_reps=4` (more circuit depth). Does deeper = better, or does the optimizer struggle with more parameters?
- Learning rate sweep: Try `learning_rate` values from 0.01 to 0.5. What happens when it is too high? Too low?
What's Next
- Quantum Kernel SVM — Use the trained kernel for classification with support vector machines
- Fidelity Kernel — Compare with the fixed SWAP test kernel (no training)
- Projected Quantum Kernel — Classical post-processing of quantum measurements as an alternative
Applications
| Domain | Use case |
|---|---|
| Task-specific kernels | Optimize the quantum embedding for a particular dataset |
| Few-shot learning | KTA training works with small datasets where neural networks overfit |
| Transfer learning | Pre-train kernel parameters on related tasks, fine-tune on target |
| Kernel ensembles | Combine multiple trained kernels for robust classification |
| Quantum advantage studies | Compare trainable quantum kernels vs classical RBF/polynomial kernels |
References
- Hubregtsen, T. et al. (2022). "Training quantum embedding kernels on near-term quantum devices." Physical Review A 106, 042431. DOI: 10.1103/PhysRevA.106.042431
- Glick, J.R. et al. (2024). "Covariant quantum kernels for data with group structure." Nature Physics 20, 1027-1036. DOI: 10.1038/s41567-023-02288-w
- Cristianini, N. et al. (2001). "On Kernel-Target Alignment." Advances in Neural Information Processing Systems 14.
- Schuld, M. & Killoran, N. (2019). "Quantum Machine Learning in Feature Hilbert Spaces." Physical Review Letters 122, 040504. DOI: 10.1103/PhysRevLett.122.040504