Alpha Values

Learned \(\alpha\) vs. Similarity

*(Figure: scatter plot of learned \(\alpha\) values against similarity scores \(S_{lm}\).)*

The scatter plot above compares the optimized merge weights \(\alpha\) learned via gradient descent with the original similarity scores \(S_{lm}\) used in the MKA heuristic. While the MKA method assumes a direct relationship \(\alpha = S_{lm}\), our results show that the true optimal \(\alpha\) values learned from data are noticeably more variable, ranging from approximately \(0.50\) to \(0.71\), rather than clustering tightly around the heuristic value of approximately \(0.62\).
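To make the setup concrete, the sketch below learns a scalar merge weight by gradient descent on a toy fused layer \(W = \alpha W_l + (1-\alpha) W_m\). The weights, calibration activations, and target are synthetic stand-ins invented for illustration, not the experiment's actual data or model.

```python
import numpy as np

# Hypothetical weights of two layers to be fused, plus synthetic calibration data.
rng = np.random.default_rng(0)
W_l, W_m = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
X = rng.normal(size=(16, 4))               # stand-in calibration activations
Y = X @ (0.6 * W_l + 0.4 * W_m)            # synthetic target built with alpha = 0.6

alpha, lr = 0.5, 0.02                      # start at the midpoint, plain gradient descent
for _ in range(300):
    W = alpha * W_l + (1 - alpha) * W_m    # fused layer: alpha*W_l + (1-alpha)*W_m
    resid = X @ W - Y
    # d/d_alpha of the mean squared error: residual projected onto X @ (W_l - W_m)
    grad = 2 * np.mean(resid * (X @ (W_l - W_m)))
    alpha -= lr * grad
# alpha converges toward the value (0.6) used to construct Y
```

Because the loss is quadratic in \(\alpha\), this 1-D descent converges quickly; in the real experiment the loss is the model's task loss, so the optimum need not match the similarity heuristic.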

The moderate Pearson correlation of \(0.560\) indicates that similarity does influence the optimal merge weight; layers that appear more similar generally prefer a higher \(\alpha\). However, the relationship is far from linear or deterministic. Many learned \(\alpha\) values deviate substantially from what the heuristic would predict, which suggests that factors beyond the similarity score also shape the optimal fusion ratio.
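The Pearson statistic quoted above can be reproduced on any pair of \((\alpha, S_{lm})\) vectors; the snippet below shows the computation on toy values (invented for illustration, not the paper's measurements, so the resulting \(r\) will not equal 0.560).

```python
import numpy as np

# Toy stand-ins for learned merge weights and their layer-similarity scores.
alphas = np.array([0.50, 0.55, 0.58, 0.62, 0.65, 0.68, 0.71])
sims   = np.array([0.48, 0.60, 0.55, 0.63, 0.61, 0.70, 0.72])

# Pearson correlation: covariance normalized by both standard deviations.
r = np.corrcoef(alphas, sims)[0, 1]
print(f"Pearson r = {r:.3f}")
```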

This supports a key conclusion of Experiment 1. Although the MKA heuristic performs surprisingly well in terms of final accuracy, it is an oversimplification of the underlying relationship that drives optimal layer fusion. The merge weight depends not only on similarity but also on nuanced interactions between the specific layers being merged, task-specific activation patterns, and the local loss landscape. As a result, \(\alpha\) cannot be reliably predicted from similarity alone, even though similarity remains directionally meaningful.

In short, similarity is a useful but incomplete predictor of \(\alpha\). Learned \(\alpha\) values capture additional structure that the heuristic cannot represent. This explains why optimized \(\alpha\) values can differ substantially from the heuristic's predictions while producing similar final accuracy.

These findings highlight an important insight for future layer merging research. To fully exploit merge weight optimization, we need a richer model of \(\alpha\), one that goes beyond similarity and accounts for deeper functional and task-dependent properties of the layers.

Average MMLU accuracy for different \(\alpha\) optimization strategies

The table below reports average MMLU accuracy after merging 13 layers using different strategies for choosing the merge weight \(\alpha\).

| Model / Method | Average MMLU Accuracy |
| --- | --- |
| Base (Heuristic) | 0.64710 |
| Bayesian Optimization | 0.64888 |
| Gradient Descent | 0.64753 |
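Since \(\alpha\) is a single bounded scalar, even an exhaustive grid sweep is a viable optimization strategy alongside Bayesian optimization and gradient descent. The sketch below, using the same kind of synthetic stand-in data as before (not the experiment's model or loss), compares a heuristic-style \(\alpha\) of 0.62 against a grid optimum.

```python
import numpy as np

# Synthetic stand-in layers and calibration data (illustration only).
rng = np.random.default_rng(1)
W_l, W_m = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
X = rng.normal(size=(32, 4))
Y = X @ (0.62 * W_l + 0.38 * W_m)   # target built with a heuristic-like alpha

def loss(alpha):
    """Mean squared error of the fused layer at a given merge weight."""
    W = alpha * W_l + (1 - alpha) * W_m
    return float(np.mean((X @ W - Y) ** 2))

# One-dimensional grid sweep: cheap, and a useful baseline for fancier optimizers.
grid = np.linspace(0.0, 1.0, 101)
best = grid[int(np.argmin([loss(a) for a in grid]))]
print(f"heuristic loss {loss(0.62):.4f} vs grid-optimal alpha {best:.2f}")
```

On the real task loss, the near-identical accuracies in the table above suggest the objective is fairly flat around the heuristic value, which is exactly the regime where simple strategies remain competitive.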