Weight Exchange and Weight Space Geometry
Research notes for the “Neural Societies” series. Compiled 2026-04-05.
The central question: what does it mean for neural nets to “exchange parameters with each other”? The geometry and topology of weight space determine whether this is trivial, difficult, or impossible.
1. Mode Connectivity
The Discovery
Two independent groups in 2018 overturned the conventional picture of isolated local minima in neural network loss landscapes.
Garipov, Izmailov, Podoprikhin, Vetrov, Wilson. “Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs.” NeurIPS 2018. (arXiv:1802.10026)
Key finding: optima found by independent SGD runs are connected by simple curves of near-constant loss and accuracy. These are not arbitrary paths – they can be found as low-degree parametric curves.
Draxler, Veschgini, Salmhofer, Hamprecht. “Essentially No Barriers in Neural Network Energy Landscape.” ICML 2018. (arXiv:1803.00885)
Key finding: the paths between minima are “essentially flat” in both training and test loss. Minima are best understood not as isolated valleys but as points on a single connected manifold of low loss.
Mathematical Formulation
Garipov et al. parameterize a path phi_theta(t) connecting two trained weight vectors w_1 and w_2, then minimize the expected loss along the path:
L(theta) = E_{t ~ U[0,1]}[ Loss(phi_theta(t)) ]
Two parameterizations:
Quadratic Bezier curve:
phi(t) = (1-t)^2 * w_1 + 2t(1-t) * theta + t^2 * w_2
where theta is a trainable control point with the same dimensionality as the weight vectors. The entire curve lies in a 2D affine subspace of weight space.
Polygonal chain: a piecewise linear path with one or more “bends.” Surprisingly, a single-bend polychain (two line segments) suffices for many architectures.
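The curve-training procedure can be sketched in a few lines. This is a toy illustration only: the ring-shaped loss, step sizes, and Monte Carlo loop below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def bezier(t, w1, theta, w2):
    """Quadratic Bezier curve from w1 to w2 with control point theta."""
    return (1 - t) ** 2 * w1 + 2 * t * (1 - t) * theta + t ** 2 * w2

# Toy loss with a ring of minima at radius 1; w1 and w2 sit on the ring,
# and the straight chord between them passes through the high-loss center.
loss = lambda w: (w @ w - 1.0) ** 2
grad_loss = lambda w: 4.0 * (w @ w - 1.0) * w

w1, w2 = np.array([-1.0, 0.0]), np.array([1.0, 0.0])
theta = np.array([0.0, 0.5])  # control point, nudged off the chord

# Minimize E_{t ~ U[0,1]}[loss(bezier(t))] by SGD on theta alone;
# by the chain rule, d(bezier)/d(theta) = 2t(1-t).
rng = np.random.default_rng(0)
for _ in range(2000):
    t = rng.uniform()
    w = bezier(t, w1, theta, w2)
    theta -= 0.1 * 2 * t * (1 - t) * grad_loss(w)

ts = np.linspace(0.0, 1.0, 101)
line_max = max(loss((1 - t) * w1 + t * w2) for t in ts)      # ~1.0 at the midpoint
curve_max = max(loss(bezier(t, w1, theta, w2)) for t in ts)  # much lower
```

The trained control point bulges the curve out along the ring of minima, so the worst loss along the Bezier path is far below the worst loss along the straight segment.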
What This Means for Breeding Neural Nets
If two independently trained nets live on the same connected low-loss manifold, then in principle you can interpolate between them and get something functional. The path is not a straight line (linear interpolation typically fails), but a slightly curved path works. This is the geometric precondition for meaningful parameter exchange.
However, mode connectivity alone does not mean linear interpolation works. The curves found by Garipov et al. require optimization to discover. The question of when straight-line paths suffice is the subject of linear mode connectivity (Section 2).
Fast Geometric Ensembling (FGE)
Practical payoff: by collecting weight vectors along a mode-connecting curve (using cyclical learning rates to traverse it), one can build ensembles in the time it takes to train a single model. The different points along the curve correspond to genuinely different functions despite having similar loss.
2. Linear Mode Connectivity and Permutation Symmetry
The Barrier Problem
Given two trained models w_1 and w_2, define the linear interpolation barrier as:
barrier(w_1, w_2) = max_{0 <= alpha <= 1} [ Loss(alpha * w_1 + (1-alpha) * w_2) ] - (1/2)(Loss(w_1) + Loss(w_2))
If the barrier is zero or near-zero, the models are “linearly mode connected” (LMC). Naive linear interpolation between independently trained models almost always has a large barrier. The loss spikes in the middle.
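The barrier is straightforward to estimate by sampling alpha on a grid. A minimal sketch, using a one-dimensional double-well loss as an illustrative stand-in for a trained network's loss:

```python
import numpy as np

def barrier(loss_fn, w1, w2, n=51):
    """Linear interpolation barrier: max loss on the segment minus
    the mean of the endpoint losses, as defined above."""
    alphas = np.linspace(0.0, 1.0, n)
    path = [loss_fn(a * w1 + (1 - a) * w2) for a in alphas]
    return max(path) - 0.5 * (loss_fn(w1) + loss_fn(w2))

# Toy example: two minima of (w^2 - 1)^2 at w = -1 and w = +1.
loss = lambda w: (w @ w - 1.0) ** 2
w1, w2 = np.array([-1.0]), np.array([1.0])
print(barrier(loss, w1, w2))  # 1.0: the segment passes through w = 0
```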
The Permutation Symmetry Insight
The reason linear interpolation fails is not that the models found genuinely different solutions. It is that they represent the same solution with neurons in different orders.
A feedforward network with hidden layers of widths n_1, …, n_{L-1} is invariant under the permutation group:
G = S_{n_1} x S_{n_2} x ... x S_{n_{L-1}}
Any permutation of neurons within a hidden layer, with corresponding rearrangement of incoming and outgoing weights, gives a functionally identical network. For a network with d-1 hidden layers of width n, there are (n!)^{d-1} equivalent weight configurations.
Linear interpolation between two equivalent-but-permuted configurations creates a Frankenstein: neuron 3 in model A is averaged with neuron 7 in model B, producing nonsense.
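Both facts check out directly in numpy: permuting hidden neurons preserves the function exactly, while naively averaging a model with a permuted copy of itself does not. Layer sizes and the seed below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # input(3) -> hidden(4)
W2 = rng.normal(size=(2, 4))                          # hidden(4) -> output(2)

def forward(x, W1, b1, W2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0)  # one ReLU hidden layer

# Permute the hidden neurons: reorder rows of W1/b1 and columns of W2.
perm = np.array([2, 0, 3, 1])
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = rng.normal(size=3)
same = np.allclose(forward(x, W1, b1, W2), forward(x, W1p, b1p, W2p))  # True

# Averaging the original with its own permuted copy mixes unrelated
# neurons (the "Frankenstein") and changes the function:
W1a, b1a, W2a = (W1 + W1p) / 2, (b1 + b1p) / 2, (W2 + W2p) / 2
scrambled = not np.allclose(forward(x, W1, b1, W2),
                            forward(x, W1a, b1a, W2a))  # True
```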
The Entezari Conjecture
Entezari, Sedghi, Saukh, Neyshabur. “The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks.” 2021. (arXiv:2110.06296)
Conjecture: after accounting for permutation symmetries, most SGD solutions lie in a single basin. That is, for any two independently trained models w_1 and w_2, there exists a permutation pi in G such that:
barrier(pi(w_1), w_2) approx 0
This would mean the loss landscape has essentially one basin (modulo permutations), radically simplifying the picture.
Git Re-Basin
Ainsworth, Hayase, Srinivasa. “Git Re-Basin: Merging Models Modulo Permutation Symmetries.” ICLR 2023. (arXiv:2209.04836)
Three algorithms to find the permutation that aligns two models:
- Weight matching. Solve a bipartite linear assignment problem: maximize the inner product between corresponding weight rows of the two models. Uses block-coordinate descent, iteratively solving layer-by-layer assignments until convergence.
- Activation matching. Run both models on a batch of data, match neurons by their activation patterns using the Hungarian algorithm on correlation matrices.
- Straight-through estimator (STE). Parameterize permutations continuously (via soft assignments), minimize the midpoint interpolation loss directly. Uses a straight-through estimator for gradients through the discrete projection step.
Results: achieved zero-barrier LMC for ResNets on CIFAR-10. First demonstration that independently trained models can be merged by simple averaging after permutation alignment. Key finding: model width matters – sufficiently wide models are required for LMC to hold.
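A minimal single-layer version of weight matching can be written as follows. This is a sketch under heavy simplifying assumptions: it brute-forces permutations of a tiny layer (Git Re-Basin uses a linear-assignment solver and alternates over all layers), and the layer sizes and noise level are made up for illustration.

```python
import itertools
import numpy as np

def match_hidden_layer(W1a, W1b):
    """Find the permutation of model B's hidden neurons maximizing the
    total inner product with model A's weight rows (brute force here;
    at scale this is a linear assignment problem)."""
    n = W1a.shape[0]
    score = W1a @ W1b.T  # score[i, j] = <row i of A, row j of B>
    best = max(itertools.permutations(range(n)),
               key=lambda p: sum(score[i, p[i]] for i in range(n)))
    return np.array(best)

# Model B is a hidden-neuron permutation of model A plus small noise.
rng = np.random.default_rng(1)
W1a = rng.normal(size=(5, 3))
true_perm = np.array([3, 1, 4, 0, 2])
W1b = W1a[true_perm] + 0.01 * rng.normal(size=(5, 3))

p = match_hidden_layer(W1a, W1b)
# p aligns B's rows with A's: W1b[p] is approximately W1a again,
# so averaging W1b[p] with W1a is now meaningful.
```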
Frankle’s Stability Analysis
Frankle, Dziugaite, Roy, Carbin. “Linear Mode Connectivity and the Lottery Ticket Hypothesis.” ICML 2020. (arXiv:1912.05671)
Discovered that neural networks become “stable” to SGD noise early in training – after a critical point, different random seeds converge to the same linearly connected basin. For ResNet-20 on CIFAR-10 this happens after only 3% of training; for ResNet-50 on ImageNet, after 20%.
This suggests a phase transition: early training is a chaotic exploration phase where the network finds its basin, followed by a deterministic refinement phase within a single basin. The timing of this transition connects to lottery ticket rewinding – subnetworks only work when rewound to post-stability checkpoints.
Implications for Neural Societies
The permutation alignment problem is the fundamental obstacle to “breeding” neural networks. Two networks trained independently will have their neurons in different orders, making direct parameter exchange destructive. But once you solve the alignment problem (Git Re-Basin, activation matching, etc.), averaging becomes viable.
The width requirement is interesting: wider networks are easier to align. This parallels the biological observation that genetic recombination works better with redundancy.
3. Model Soups and Weight Averaging
Model Soups
Wortsman, Ilharco, Gadre, …, Schmidt. “Model Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy Without Increasing Inference Time.” ICML 2022. (arXiv:2203.05482)
Key insight: when multiple models are fine-tuned from the same pretrained initialization with different hyperparameters, they tend to remain in the same loss basin. This means you can simply average their weights and get a model that is better than any individual.
Two recipes:
- Uniform soup: average all fine-tuned models. Simple but can include bad models that hurt the average.
- Greedy soup: sort models by validation accuracy, then greedily add each model to the soup only if it improves validation performance. By construction, the greedy soup is never worse than the single best model on the held-out validation set.
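The greedy recipe can be sketched in a few lines. The quadratic "validation accuracy" and the toy weight vectors below are illustrative assumptions standing in for real fine-tuned checkpoints and a real validation set.

```python
import numpy as np

def greedy_soup(models, val_acc_fn):
    """Greedy soup: sort candidates by held-out accuracy, then add each
    to the running average only if the averaged model is no worse.
    `models` is a list of weight vectors; `val_acc_fn` scores one."""
    models = sorted(models, key=val_acc_fn, reverse=True)
    soup, n = models[0].copy(), 1
    best = val_acc_fn(soup)
    for w in models[1:]:
        trial = (soup * n + w) / (n + 1)  # candidate average
        acc = val_acc_fn(trial)
        if acc >= best:                   # keep only if it does not hurt
            soup, n, best = trial, n + 1, acc
    return soup

# Toy "validation accuracy": peaks at w = (1, 1). Three fine-tuned
# models sit near the peak; one bad outlier gets rejected.
target = np.array([1.0, 1.0])
acc = lambda w: -np.sum((w - target) ** 2)
models = [np.array([1.1, 0.9]), np.array([0.9, 1.1]),
          np.array([1.0, 1.2]), np.array([5.0, -3.0])]
soup = greedy_soup(models, acc)
```

Here the first two models average to the peak itself, the third is rejected because adding it would hurt, and the outlier never enters the soup.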
Results: ViT-G model soup achieved 90.94% top-1 on ImageNet, a new state-of-the-art at the time. Improvements also on OOD robustness and zero-shot transfer.
The theoretical grounding connects to loss landscape flatness: weight averaging works when the models live in a flat region of the loss landscape, and a shared pretrained initialization ensures they do.
Stochastic Weight Averaging (SWA)
Izmailov, Podoprikhin, Garipov, Vetrov, Wilson. “Averaging Weights Leads to Wider Optima and Better Generalization.” UAI 2018. (arXiv:1803.05407)
SWA averages weight vectors collected along an SGD trajectory (with cyclical or constant learning rate). The averaged model finds a wider, flatter minimum than SGD alone.
Geometric intuition: in high-dimensional weight space, most of the volume of a flat region is concentrated near its boundary. SGD, being stochastic, settles near the boundary. SWA averages multiple boundary points, moving the solution toward the center of the flat region, where it generalizes better.
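The boundary-versus-center intuition is easy to reproduce in a toy setting. The low-curvature quadratic below is an illustrative stand-in for a flat basin, and the noise scale and burn-in are arbitrary choices, not SWA's actual schedule.

```python
import numpy as np

rng = np.random.default_rng(0)
opt = np.array([2.0, -1.0])         # center of a flat basin
grad = lambda w: 0.1 * (w - opt)    # small curvature = flat region

w = np.zeros(2)
swa_sum, swa_n = np.zeros(2), 0
for step in range(2000):
    w -= 0.5 * (grad(w) + 0.2 * rng.normal(size=2))  # noisy "SGD" step
    if step >= 1000:                # average iterates after a burn-in
        swa_sum += w
        swa_n += 1
w_swa = swa_sum / swa_n

# The raw iterate rattles around the basin; the running average sits
# near the center, mimicking SWA's move toward the flat region's middle.
sgd_dist = np.linalg.norm(w - opt)
swa_dist = np.linalg.norm(w_swa - opt)
```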
Task Arithmetic
Ilharco, Ribeiro, Wortsman, …, Farhadi. “Editing Models with Task Arithmetic.” ICLR 2023. (arXiv:2212.04089)
A step beyond averaging: define a “task vector” as:
tau_task = w_finetuned - w_pretrained
These task vectors can be added, subtracted, and scaled:
- Addition: w_pretrained + tau_A + tau_B gives multi-task capabilities.
- Negation: w_pretrained - tau_A removes task A capabilities.
- Scaling: w_pretrained + lambda * tau_A controls the strength.
This works because fine-tuned models from a shared initialization exhibit LMC, so task vectors are approximately linear directions in weight space. Weight space near a pretrained model behaves roughly like a vector space, at least locally.
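In code, the protocol is just vector arithmetic on checkpoints. The three-parameter "models" below are made-up numbers purely to show the operations:

```python
import numpy as np

# Task vectors are deltas from a shared pretrained checkpoint.
w_pre = np.array([0.0, 0.0, 0.0])
w_ft_A = np.array([1.0, 0.0, 0.2])  # fine-tuned on task A (illustrative)
w_ft_B = np.array([0.0, 1.0, 0.1])  # fine-tuned on task B (illustrative)

tau_A = w_ft_A - w_pre
tau_B = w_ft_B - w_pre

w_multitask = w_pre + tau_A + tau_B   # addition: combine capabilities
w_forget_A = w_ft_A - tau_A           # negation: remove task A (back to pretrained)
w_scaled = w_pre + 0.5 * tau_A        # scaling: lambda controls strength
```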
Connection to Neural Societies
Model soups are a primitive form of “social learning” – a population of models trained independently, then combined. The critical requirement is a shared starting point (pretrained initialization), which ensures all models stay in the same basin. Without this, averaging destroys information.
This is the simplest case of parameter exchange: all agents start from the same point and are averaged at the end. The question for neural societies is whether you can do something more dynamic – ongoing exchange during training, between agents that may have diverged further.
4. Federated Learning as Implicit Society
FedAvg
McMahan, Moore, Ramage, Hampson, Agüera y Arcas. “Communication-Efficient Learning of Deep Networks from Decentralized Data.” AISTATS 2017. (arXiv:1602.05629)
FedAvg is literally a society of neural networks that learn privately and share publicly:
- Server broadcasts global model to K clients.
- Each client trains locally on its own data for E epochs.
- Clients send updated weights back to the server.
- Server averages the weights (weighted by dataset size).
- Repeat.
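The loop above can be sketched directly. The toy quadratic client objectives, step counts, and dataset sizes are illustrative assumptions; real FedAvg runs SGD on each client's actual data.

```python
import numpy as np

def fedavg_round(global_w, clients, local_steps=5, lr=0.1):
    """One FedAvg round: each client runs local gradient steps from the
    broadcast global weights, then the server averages the results,
    weighted by dataset size. Each toy client minimizes ||w - c||^2."""
    updates, sizes = [], []
    for center, n_samples in clients:
        w = global_w.copy()
        for _ in range(local_steps):
            w -= lr * 2 * (w - center)  # local training step
        updates.append(w)
        sizes.append(n_samples)
    return np.average(updates, axis=0, weights=np.array(sizes, dtype=float))

# Two clients with different data (non-IID): local optima disagree.
clients = [(np.array([1.0, 0.0]), 100), (np.array([0.0, 1.0]), 300)]
w = np.zeros(2)
for _ in range(50):
    w = fedavg_round(w, clients)
print(w)  # converges to the size-weighted compromise, (0.25, 0.75)
```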
Achieves 10-100x communication reduction over synchronized SGD. Robust to non-IID data distributions across clients.
The Permutation Problem in Federated Learning
Wang, Yurochkin, Sun, Papailiopoulos, Khazaeni. “Federated Learning with Matched Averaging.” ICLR 2020. (arXiv:2002.06440)
FedMA recognizes that naive element-wise averaging in FedAvg ignores permutation invariance. Different clients, training on different data from different initializations, will learn neurons in different orders. FedMA aligns neurons across clients before averaging, using Bayesian nonparametric matching (for FC layers) and feature-similarity matching (for Conv layers).
PFNM (Probabilistic Federated Neural Matching): uses Bayesian nonparametric methods (Beta-Bernoulli process priors) to simultaneously align and merge client models, adapting the global model size to the client population.
Convergence Issues
FedAvg with fixed learning rate converges to a neighborhood of the optimum, not the optimum itself. The gap is Omega(eta * (E - 1)), where eta is the learning rate and E is local epochs. Learning rate decay is necessary for exact convergence.
FedProx (Li et al., 2020) adds a proximal term penalizing deviation from the global model, stabilizing convergence under heterogeneous data. Improves absolute accuracy by ~22% in highly heterogeneous settings.
SCAFFOLD (Karimireddy et al., 2020) uses variance reduction to correct local update directions, achieving convergence guarantees independent of data heterogeneity.
The Society Analogy
Federated learning is the closest existing paradigm to a “neural society”:
- Individual learning: each client trains on its own data (private experience).
- Social sharing: periodic averaging aggregates individual knowledge (public discourse).
- Non-IID data: different clients see different slices of reality (subjective perspectives).
- Communication constraints: sharing is expensive (bounded rationality).
- Drift: without periodic synchronization, clients diverge (cultural drift).
The key tension is between local adaptation (specialization to each client’s data) and global coherence (maintaining a shared model that works for everyone). FedAvg resolves this crudely by averaging; more sophisticated approaches (FedMA, SCAFFOLD) try to be smarter about what gets shared and how.
5. Loss Landscape Topology
Visualization: Li et al.
Li, Xu, Taylor, Studer, Goldstein. “Visualizing the Loss Landscape of Neural Nets.” NeurIPS 2018. (arXiv:1712.09913)
Introduced “filter normalization” for meaningful 2D visualizations of loss landscapes. Key findings:
- Shallow networks have smooth, convex-looking landscapes.
- Deep networks without skip connections have highly chaotic, non-convex landscapes.
- Skip connections (ResNets) dramatically smooth the landscape, even for very deep networks.
- Wider networks have smoother landscapes than narrow ones.
- Smaller batch sizes produce wider minima.
Saddle Points Dominate in High Dimensions
Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio. “Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization.” NeurIPS 2014. (arXiv:1406.2572)
The crucial insight from random matrix theory and statistical physics: in high dimensions, critical points are overwhelmingly saddle points, not local minima.
At a critical point, the Hessian has a distribution of eigenvalues. In high dimensions:
- Critical points far from the global minimum are exponentially likely to have many negative eigenvalues (saddle points with many escape directions).
- The fraction of negative eigenvalues decreases as you approach the global minimum.
- True local minima (all eigenvalues positive) concentrate near the global minimum value.
This is why SGD works: it almost never gets stuck in local minima, because in million-dimensional space, true local minima are exponentially rare above the global minimum.
The Spin Glass Analogy
Choromanska, Henaff, Mathieu, Arous, LeCun. “The Loss Surfaces of Multilayer Networks.” AISTATS 2015. (arXiv:1412.0233)
Under simplifying assumptions (variable independence, redundancy, uniformity), the loss function of a neural network is analogous to the Hamiltonian of a spherical p-spin glass. This connection yields:
- Critical values form a layered band structure above the global minimum.
- The number of local minima outside this band diminishes exponentially with network size.
- Below the “energy barrier,” all critical points are local minima of near-optimal quality.
- Above the energy barrier, critical points are high-index saddle points.
- Finding the exact global minimum becomes harder with network size, but is also irrelevant (it typically overfits).
Limitations: the independence assumption is unrealistic (many paths share inputs). But the qualitative picture – a band of good local minima near the global minimum, separated from a wilderness of saddle points – matches empirical observations.
Sewall Wright’s Adaptive Landscape vs. Neural Loss Landscapes
Wright (1932) introduced the metaphor of a “fitness landscape” for evolutionary biology: allele frequencies define coordinates, fitness defines height, evolution is hill-climbing on this surface. The metaphor has been influential but controversial for 90 years.
Key parallels with neural loss landscapes:
- Both are high-dimensional surfaces over parameter spaces.
- Both involve populations navigating these surfaces.
- Both care about the connectivity structure (can you get from one peak to another?).
- Both involve noise-driven exploration (genetic drift / SGD noise).
Key differences:
| Feature | Wright’s Landscape | Neural Loss Landscape |
|---|---|---|
| Dimensionality | ~10^3 to 10^4 (genes) | ~10^6 to 10^12 (parameters) |
| Objective | Maximize fitness | Minimize loss |
| High-dim effect | Debated; may reduce stable peaks (Fisher) | Eliminates isolated local minima (Dauphin) |
| Connectivity | Unknown; neutral ridges may exist | Established – mode connectivity |
| Landscape is… | Fixed (by environment) | Defined by data (changes with distribution) |
| Symmetry | Minimal (genes have distinct roles) | Massive (permutation symmetry) |
| Navigation | Blind (mutation + selection) | Gradient-informed (backprop) |
The most profound difference: high dimensionality is a curse in biology but a blessing in optimization. In low dimensions, local optima are isolated traps. In millions of dimensions, there are so many escape directions that true local minima become vanishingly rare. The very problem Wright worried about (getting stuck on suboptimal peaks) essentially dissolves in the neural case.
However, both landscapes share the difficulty of visualization. Wright estimated that the space of gene combinations would number ~10^1000. From the start, the “landscape” was always a low-dimensional metaphor for a high-dimensional reality. The same caveat applies to those beautiful 2D loss landscape plots – they are projections, and projections can be misleading.
Morse Theory and Loss Landscape Connectedness
Akhtiamov et al. “Connectedness of Loss Landscapes via the Lens of Morse Theory.” ALT 2023.
Morse theory provides tools for understanding the topology of sublevel sets { w : L(w) <= c } as the threshold c varies. Critical points of the loss function (where grad L = 0) change the topology: minima add connected components, saddle points merge them. If all saddle points between two minima have loss below threshold c, then the sublevel set at c is connected – the two minima are mode-connected at loss level c.
This gives a rigorous topological framework for mode connectivity: it’s about the Morse-theoretic structure of the loss function, specifically whether index-1 saddle points (with one negative Hessian eigenvalue) exist at sufficiently low loss to connect the basins.
6. Population-Based Training (PBT)
Jaderberg, Dalibard, Osindero, Czarnecki, Donahue, Razavi, Vinyals, Green, Dunning, Simonyan, Fernando, Kavukcuoglu. “Population Based Training of Neural Networks.” 2017. (arXiv:1711.09846)
The Algorithm
PBT is literally a society of neural networks with parameter exchange:
- Initialize a population of N models with random hyperparameters.
- Train all models in parallel.
- Periodically, each model evaluates its fitness (validation performance).
- Exploit: if a model is underperforming, it copies the weights and hyperparameters of a better-performing model.
- Explore: the copied hyperparameters are perturbed (multiplied by 0.8 or 1.2 for continuous values, shifted to adjacent value for discrete ones).
- Continue training from the new weights with the new hyperparameters.
- Repeat.
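The loop above reduces to a few lines on a toy problem. Everything below is a deliberately scaled-down sketch: one scalar "weight" per member, a quadratic fitness, and only the learning rate as a hyperparameter; the 0.8x/1.2x perturbation factors are the ones from the paper.

```python
import random

rng = random.Random(0)

# A member = weights (one scalar here) + hyperparameters (learning rate).
population = [{"w": rng.uniform(-3, 3), "lr": 10 ** rng.uniform(-3, -1)}
              for _ in range(4)]

fitness = lambda m: -(m["w"] - 1.0) ** 2           # toy objective, peak at w = 1

def train(m, steps=10):
    for _ in range(steps):
        m["w"] -= m["lr"] * 2.0 * (m["w"] - 1.0)   # private gradient descent

for generation in range(30):
    for m in population:
        train(m)                                   # parallel local training
    population.sort(key=fitness)                   # evaluate fitness
    worst, best = population[0], population[-1]
    worst["w"] = best["w"]                         # exploit: copy weights
    worst["lr"] = best["lr"] * rng.choice([0.8, 1.2])  # explore: perturb hparams
```

After a few generations the population clusters near the optimum, with learning rates drifting toward values that converge quickly.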
This is a hybridization of gradient-based training (each model uses backprop) and evolutionary dynamics (the population shares information through weight copying and hyperparameter mutation).
Key Properties
- Adaptive schedules: PBT discovers schedules of hyperparameters, not just fixed values. The learning rate, regularization strength, etc. evolve over training.
- No computational overhead: uses the same total compute as training N independent models, but directs resources toward promising configurations.
- Asynchronous: models need not synchronize; exploitation and exploration happen whenever a model completes an evaluation.
Applications
- Reinforcement learning: stabilized training on DeepMind Lab, Atari, StarCraft II. State-of-the-art results.
- GANs: improved Inception Score from 6.45 to 6.9 on CIFAR-10. GANs are notoriously sensitive to hyperparameters, making PBT especially valuable.
- Machine translation: automatically found hyperparameter schedules matching hand-tuned baselines.
PBT as a Neural Society
PBT has the clearest “society” interpretation of any existing method:
- Individuals: each model in the population.
- Private learning: each model trains by gradient descent on its own trajectory.
- Social comparison: fitness evaluation relative to the population.
- Knowledge transfer: copying weights from a better model (imitation/apprenticeship).
- Innovation: hyperparameter mutation (trying new strategies).
- Selection pressure: underperformers adopt the strategies of the successful.
The “exploit” step is the most radical form of parameter exchange: wholesale copying of another model’s weights. This works because models in the population share the same architecture and (usually) similar training histories, so they remain in approximately the same basin.
7. Crossover in Weight Space
The Competing Conventions Problem
The fundamental difficulty with genetic crossover in neural networks, identified in the neuroevolution literature:
Two networks can compute identical functions with neurons in different orders. If you cross them by swapping corresponding weight indices, the offspring inherits neuron 3 from parent A and neuron 3 from parent B – but these may represent completely different features. The child is a random scramble.
Example: parents [A, B, C] and [C, B, A] represent the same function (just neurons permuted). Positional crossover produces offspring like [A, B, A] or [C, B, C] – duplicating one feature and losing another, so a third of the information is gone.
This is the same permutation symmetry problem that plagues model merging and federated averaging, rediscovered independently in the neuroevolution community.
NEAT’s Solution: Historical Markers
Stanley, Miikkulainen. “Evolving Neural Networks through Augmenting Topologies.” Evolutionary Computation, 2002.
NEAT avoids the competing conventions problem by tracking the evolutionary history of every gene (connection) via global innovation numbers. Crossover aligns genes by their historical origin, not their position. Two genes with the same innovation number necessarily represent the same structural element (though possibly with different weights).
This is essentially a manual solution to the alignment problem: instead of trying to align neurons after the fact, NEAT maintains alignment by construction throughout the evolutionary process.
Safe Crossover via Neuron Alignment
Uriot, Izzo. “Safe Crossover of Neural Networks Through Neuron Alignment.” GECCO 2020. (arXiv:2003.10306)
A post-hoc approach: before crossover, align the neurons of the two parents by computing correlations between their activation patterns. Two methods:
- Pairwise Correlation (PwC): match neurons by how well their outputs correlate on a dataset.
- Canonical Correlation Analysis (CCA): match neurons using CCA, which can capture linear combinations.
After alignment, standard arithmetic crossover (averaging corresponding weights) works effectively. The method is computationally fast and transmits information from parents to offspring.
Connection to Git Re-Basin: this is essentially the same idea (permutation alignment before merging), discovered independently in the evolutionary computation community.
Stitching for Neuroevolution
Guijt, Thierens, Alderliesten, Bosman. “Stitching for Neuroevolution: Recombining Deep Neural Networks without Breaking Them.” GECCO 2024. (arXiv:2403.14224)
A more ambitious approach: instead of permuting neurons, insert trainable “stitching layers” (small linear or 1x1 conv layers) at crossover points. These layers learn the transformation between parent representations. This avoids the competing conventions problem entirely because the stitching layer adapts to whatever alignment exists.
Steps:
- Find compatible layer pairs between parents (matching tensor shapes).
- Ensure the resulting computational graph is acyclic (branch-and-bound).
- Construct a “supernetwork” with switches at each crossover point.
- Train stitching layers to minimize representation mismatch.
Results: recombined networks can achieve novel performance/cost tradeoffs, sometimes dominating individual parents. Works across different architectures, not just identical ones.
Evolution Strategies as an Alternative
Salimans, Ho, Chen, Sidor, Sutskever. “Evolution Strategies as a Scalable Alternative to Reinforcement Learning.” 2017. (arXiv:1703.03864)
OpenAI showed that evolution strategies (ES) – black-box optimization via random perturbation of weights – can match RL performance on Atari and MuJoCo. ES treats the network as a black box: perturb the million-dimensional weight vector by Gaussian noise, evaluate the reward, update in the direction of successful perturbations.
ES avoids the crossover problem entirely by working only with mutation. But it does use a population: each worker evaluates a perturbed version of the shared weight vector. The communication is remarkably efficient – only scalar rewards need to be shared, because workers share random seeds and can reconstruct perturbations locally.
This is a “society” with a very different structure: no crossover, no weight copying, just collective gradient estimation through correlated perturbations.
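The core ES update is compact enough to show in full. The toy reward, population size, and step sizes below are illustrative; the real system distributes the reward evaluations and regenerates the perturbations from shared seeds.

```python
import numpy as np

def es_step(w, reward, rng, pop=200, sigma=0.1, lr=0.02):
    """One Evolution Strategies update: estimate the gradient of the
    expected reward from Gaussian perturbations of the weights. Reward
    normalization ("fitness shaping") stabilizes the estimate."""
    eps = rng.normal(size=(pop, w.size))           # perturbation directions
    rewards = np.array([reward(w + sigma * e) for e in eps])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad_est = (rewards[:, None] * eps).mean(axis=0) / sigma
    return w + lr * grad_est

# Maximize a toy reward peaked at (1, 2) without ever taking a gradient.
reward = lambda w: -np.sum((w - np.array([1.0, 2.0])) ** 2)
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(300):
    w = es_step(w, reward, rng)
print(w)  # approaches (1, 2)
```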
8. Theoretical Frameworks
Weight Space Symmetries: The Full Picture
Hecht-Nielsen. “On the Algebraic Structure of Feedforward Network Weight Spaces.” 1990.
Pioneer work establishing that the set of weight transformations leaving the input-output function invariant forms a group. For an MLP with hidden layers of widths n_1, …, n_{L-1}:
The symmetry group contains at minimum:
- Permutation symmetries: S_{n_1} x S_{n_2} x … x S_{n_{L-1}} (reordering neurons within each layer).
- Sign-flip symmetries: Z_2^{n_k} for odd-symmetric activations like tanh (flipping the sign of a neuron’s input and output weights).
The full discrete symmetry group for a tanh network is a wreath product:
G = (Z_2 wr S_{n_1}) x (Z_2 wr S_{n_2}) x ... x (Z_2 wr S_{n_{L-1}})
where Z_2 wr S_n = Z_2^n rtimes S_n is the hyperoctahedral group (signed permutations).
For ReLU networks, the sign-flip symmetry is replaced by positive scaling: R_{>0}^{n_k} (rescaling a neuron’s output by lambda and its incoming weights by 1/lambda). This is a continuous symmetry, making the quotient space more complex.
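The ReLU scaling symmetry can be verified in a few lines (layer sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

relu_net = lambda W1, W2: W2 @ np.maximum(W1 @ x, 0.0)

# Rescale hidden neuron k: incoming weights / lambda, outgoing * lambda.
# ReLU is positively homogeneous (relu(a/lam) = relu(a)/lam for lam > 0),
# so the function is unchanged -- a continuous symmetry with no tanh analogue.
lam, k = 3.7, 2
W1s, W2s = W1.copy(), W2.copy()
W1s[k, :] /= lam
W2s[:, k] *= lam
print(np.allclose(relu_net(W1, W2), relu_net(W1s, W2s)))  # True
```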
Simsek, Ged, Jacot, Spadaro, Hongler, Gerstner, Brea. “Geometry of the Loss Landscape in Overparameterized Neural Networks: Symmetries and Invariances.” ICML 2021.
Key results:
- Permutation symmetries create “symmetry-induced” critical points (saddle points and plateaus connecting equivalent minima).
- Adding one extra neuron per layer suffices to connect all discrete permutation-equivalent minima into a single manifold.
- In the mildly overparameterized regime, symmetry-induced critical points outnumber the points on the global minima manifold; the ratio reverses in the vastly overparameterized regime.
Recent Taxonomy (2025)
“Symmetry in Neural Network Parameter Spaces.” 2025. (arXiv:2506.13018)
A comprehensive survey defining a hierarchy of symmetry types:
- Functional symmetry: f(g . theta, x) = f(theta, x) for all g in G, theta, x. The network computes the same function.
- Loss symmetry: L(g . theta) = L(theta). Weaker – the loss is preserved but the function may differ on individual inputs.
- Distribution symmetry: E_{x ~ D}[L(g . theta, x)] = E_{x ~ D}[L(theta, x)]. Even weaker – only the expected loss over the data distribution is preserved.
- Data-dependent symmetry: symmetry holds only on specific data subsets.
Architecture-specific symmetry groups:
- Two-layer linear networks: GL_h(R) (full general linear group), acting as g . (W_2, W_1) = (W_2 g^{-1}, g W_1).
- ReLU networks: R_{>0}^h (positive scaling group).
- Tanh networks: Z_2^h (sign-flip group).
- Transformers: (GL_{d_k}(R))^h from attention heads, plus S_h from head permutation, plus (GL_{d_v}(R))^h from output projections.
The fundamental challenge: the full symmetry group G_{Theta,L} is typically impossible to characterize. The known groups (permutations, sign flips, scaling) are subgroups. The quotient space Theta/G has not been fully characterized topologically for any nontrivial architecture. This is an open problem.
Weight Space as a Manifold
The parameter space Theta = R^d is trivially a manifold. The interesting question is the structure of the quotient Theta/G, which is typically an orbifold (a manifold with singularities at points with non-trivial stabilizers).
The singularities occur at weight configurations where two or more neurons are identical – the stabilizer subgroup is larger than generic, making the quotient space singular. These are precisely the “permutation points” identified by Brea and Simsek (2019): points where neuron weight vectors collide, creating flat directions in the loss landscape.
Open question: What is the topology (fundamental group, homology) of the quotient space Theta/G? Is it simply connected? Does it have non-trivial cycles? The answers would determine whether there are topological obstructions to parameter exchange.
Information Geometry
Amari. “Natural Gradient Works Efficiently in Learning.” Neural Computation, 1998.
The parameter space of a neural network is a Riemannian manifold with metric given by the Fisher information matrix:
g_{ij}(theta) = E[ (d log p(x|theta)/d theta_i) * (d log p(x|theta)/d theta_j) ]
The natural gradient (the steepest descent direction on this manifold) is:
theta_new = theta - eta * F^{-1} * grad L
where F is the Fisher information matrix.
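A natural-gradient step can be sketched with the empirical Fisher (the covariance of per-sample score vectors, a common practical approximation). The ill-conditioned quadratic and simulated scores below are illustrative assumptions:

```python
import numpy as np

def natural_gradient_step(theta, per_sample_scores, grad_loss,
                          eta=0.5, damping=1e-3):
    """theta - eta * F^{-1} grad L, with F estimated as the second
    moment of per-sample score vectors. Damping keeps F invertible."""
    G = np.stack(per_sample_scores)     # shape (n_samples, n_params)
    F = G.T @ G / G.shape[0]            # empirical Fisher
    F += damping * np.eye(F.shape[0])
    return theta - eta * np.linalg.solve(F, grad_loss)

# Ill-conditioned quadratic loss 0.5 * theta^T A theta whose curvature
# matches the Fisher: the preconditioned step contracts both directions
# at the same rate, unlike plain gradient descent.
rng = np.random.default_rng(0)
A = np.diag([100.0, 1.0])
scores = [np.sqrt(np.diag(A)) * rng.normal(size=2) for _ in range(5000)]
theta = np.array([1.0, 1.0])
for _ in range(20):
    theta = natural_gradient_step(theta, scores, A @ theta)
print(theta)  # both coordinates shrink at the same rate
```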
Amari, Karakida. “Fisher Information and Natural Gradient Learning of Random Deep Networks.” AISTATS 2019.
For deep networks, the Fisher matrix has a unit-wise block-diagonal structure (up to small off-diagonal terms). This means the information geometry decomposes approximately layer by layer.
Relevance to neural societies: if you want to merge models in a geometrically principled way, you should not simply average their parameters (which assumes Euclidean geometry). You should interpolate along geodesics of the Fisher information manifold. SLERP (spherical linear interpolation) is a crude approximation to this. A proper merging procedure would account for the curvature of the statistical manifold.
Permutation Saddles and Valley Structure
Brea, Simsek, Illing, Gerstner. “Weight-Space Symmetry in Deep Networks Gives Rise to Permutation Saddles, Connected by Equal-Loss Valleys Across the Loss Landscape.” 2019. (arXiv:1907.02911)
Concrete geometric results:
- Between any two permutation-equivalent minima, there exist smooth paths through “permutation points” where two neurons’ weight vectors collide and interchange.
- At each permutation point, the Hessian has at least n_{k+1} zero eigenvalues (where n_{k+1} is the width of the next layer).
- These zero eigenvalues correspond to flat directions – the permutation saddles extend into flat valleys of dimension n_{k+1}.
- Higher-order permutation points (where K neurons simultaneously collide) increase the total critical point count multiplicatively.
The picture: the weight space of a neural network is carved up by a web of flat valleys connecting equivalent minima, with permutation saddles acting as low-dimensional corridors between them. The topology is far richer than “a bunch of isolated basins.”
9. The Model Merging Ecosystem (2023-2026)
The theoretical insights above have spawned a thriving practical ecosystem, especially in the LLM community.
Key Methods
- TIES-Merging: resolves sign conflicts in task vectors before averaging. Eliminates redundant parameters and resolves sign disagreements.
- DARE (Drop And REscale): randomly drops 90% of delta parameters and rescales the rest. Reduces interference during merging.
- SLERP: spherical linear interpolation between weight vectors. Preserves directional structure that naive linear averaging distorts. Treats weights as vectors on a hypersphere and interpolates along the geodesic.
- Task Arithmetic: addition and subtraction of task vectors (see Section 3).
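SLERP, for instance, is only a few lines: treat the two flattened weight vectors as points on a hypersphere and interpolate along the great-circle arc between them, falling back to linear interpolation when they are nearly parallel. A sketch:

```python
import numpy as np

def slerp(w1, w2, t, eps=1e-7):
    """Spherical linear interpolation between flattened weight vectors."""
    u1 = w1 / np.linalg.norm(w1)
    u2 = w2 / np.linalg.norm(w2)
    cos_omega = np.clip(np.dot(u1, u2), -1.0, 1.0)
    omega = np.arccos(cos_omega)      # angle between the two directions
    if omega < eps:                   # nearly parallel: fall back to lerp
        return (1 - t) * w1 + t * w2
    s = np.sin(omega)
    return (np.sin((1 - t) * omega) / s) * w1 + (np.sin(t * omega) / s) * w2

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mid = slerp(a, b, 0.5)  # lies on the arc between a and b; norm preserved here
```

For equal-norm endpoints the interpolant keeps the norm constant along the path, which is exactly the directional structure that naive averaging shrinks.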
Weight Disentanglement
Recent finding (2024): large-scale instruction tuning “disentangles” model weights, meaning different directions in weight space correspond to functional changes in disjoint regions of the input space. This makes merging easier because the task vectors don’t interfere.
MergeKit and Community Practice
The open-source MergeKit library has enabled thousands of community-created merged LLMs, many ranking at or near the top of the Open LLM Leaderboard. Model merging has become a form of combinatorial innovation – practitioners combine specialist models (coding, math, conversation) without any training, achieving capabilities beyond any single parent.
This is, in effect, a “neural society” emerging organically in the open-source community: a population of models that exchange parameters (via merging recipes), compete (on leaderboards), and evolve (through iterated merge-and-evaluate cycles).
10. Open Questions and Mathematical Directions
For the Neural Societies Concept
- Dynamic alignment during training. Git Re-Basin works post-hoc. Can you maintain alignment continuously during co-training, so that models in a population can exchange parameters at any time? This would require tracking the permutation mapping as it evolves.
- Partial exchange. PBT copies entire weight vectors. Model soups average everything. Is there a principled way to exchange parts of weight spaces – e.g., sharing early-layer features while keeping late-layer specializations?
- Communication topology. In federated learning, there is a star topology (all clients talk to one server). What happens with peer-to-peer exchange? Ring topologies? Does the communication graph structure affect convergence the way social network structure affects cultural evolution?
- Competitive dynamics. All current approaches are cooperative (shared objective). What happens when models have conflicting objectives but must still exchange information? This is the game-theoretic frontier.
- Diversity maintenance. In PBT, exploitation drives the population toward homogeneity. How do you maintain diversity? Evolutionary biology uses speciation, geographic isolation, frequency-dependent selection. What are the neural network equivalents?
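The partial-exchange question can at least be stated operationally. A hypothetical sketch (the layer names and the choice of which prefixes count as "shared" are illustrative assumptions, not an established recipe): copy early-layer tensors from a donor model and keep the recipient's late layers.

```python
import numpy as np

def partial_exchange(donor, recipient, shared_prefixes):
    """Build a child state dict: donor's tensors for shared layers,
    recipient's tensors elsewhere. Keys/prefixes are hypothetical."""
    child = {}
    for name, tensor in recipient.items():
        if any(name.startswith(p) for p in shared_prefixes):
            child[name] = donor[name].copy()  # inherit shared early layers
        else:
            child[name] = tensor.copy()       # keep late-layer specialization
    return child

parent_a = {"layer0.w": np.zeros((2, 2)), "layer1.w": np.zeros((2, 2))}
parent_b = {"layer0.w": np.ones((2, 2)), "layer1.w": np.ones((2, 2))}
child = partial_exchange(parent_a, parent_b, shared_prefixes=["layer0."])
# child carries layer0 from parent_a and layer1 from parent_b
```

The open problem is not the bookkeeping but the principle: which cut points preserve function, and whether the swapped layers need permutation alignment first.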
Mathematical Concepts Worth Exploring
- Fiber bundles over the quotient space. The projection from Theta to Theta/G has the structure of a principal G-bundle. The topology of this bundle (characteristic classes, holonomy) determines what happens when you try to “parallel transport” a model along a path in the quotient space. Model averaging is attempting to find a section of this bundle.
- Optimal transport for model alignment. Instead of solving the linear assignment problem (as in Git Re-Basin), you could formulate alignment as an optimal transport problem, finding the minimum-cost mapping between the “neuron distributions” of two models. This connects to Wasserstein geometry.
- Persistent homology of loss landscapes. Using topological data analysis to characterize the topological features (connected components, loops, voids) of sublevel sets of the loss function as the threshold varies. Recent work has begun using merge trees and persistence diagrams, but the theory is underdeveloped.
- The Riemannian geometry of weight averaging. Simple averaging is a Euclidean midpoint. The Riemannian midpoint (Frechet mean) on the Fisher information manifold would be more principled. How much does this matter in practice?
- Symmetry breaking during training. Early in training, all neurons are approximately equivalent (high symmetry). As training progresses, neurons specialize and the effective symmetry group reduces. This is spontaneous symmetry breaking. What determines the pattern of symmetry breaking? Does it connect to the phase transition Frankle observed?
- Population dynamics on loss landscapes. A population of models moving on a shared loss landscape under selection pressure, with occasional parameter exchange. This is a stochastic process on a high-dimensional manifold. Are there results from statistical mechanics or evolutionary dynamics that apply?
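The linear assignment baseline that optimal transport would generalize can be written down directly: match the hidden units of two models by maximizing total inner-product similarity between their weight vectors, using scipy's assignment solver. A sketch of the single-layer weight-matching step (weight-matching methods like Git Re-Basin iterate a step of this kind across layers):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_neurons(W_a, W_b):
    """Permutation of model B's hidden units best aligning them with
    model A's, by maximizing total inner-product similarity.
    W_a, W_b: (n_hidden, n_in) weight matrices of the same layer."""
    similarity = W_a @ W_b.T                       # (n_hidden, n_hidden)
    _, cols = linear_sum_assignment(similarity, maximize=True)
    return cols  # A's unit i is matched to B's unit cols[i]

rng = np.random.default_rng(1)
W_a = rng.standard_normal((4, 3))
hidden_perm = rng.permutation(4)
W_b = W_a[hidden_perm]             # model B = model A with shuffled units
perm = match_neurons(W_a, W_b)
aligned_B = W_b[perm]              # undoes the shuffle, recovering W_a
```

An optimal transport formulation would relax the hard permutation to a soft coupling matrix, trading exactness for differentiability and a Wasserstein interpretation.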
Key References Summary
| Paper | Year | Key Contribution |
|---|---|---|
| Hecht-Nielsen, “Algebraic Structure of Weight Spaces” | 1990 | Symmetry group of feedforward networks |
| Amari, “Natural Gradient” | 1998 | Fisher information Riemannian metric |
| Stanley & Miikkulainen, NEAT | 2002 | Historical markers solve competing conventions |
| Dauphin et al., “Saddle Point Problem” | 2014 | Saddle points dominate in high dimensions |
| Choromanska et al., “Loss Surfaces of Multilayer Networks” | 2015 | Spin glass analogy for loss landscapes |
| McMahan et al., FedAvg | 2017 | Federated averaging of distributed models |
| Jaderberg et al., PBT | 2017 | Population-based training with exploit/explore |
| Salimans et al., Evolution Strategies | 2017 | ES as scalable alternative to RL |
| Garipov et al., Mode Connectivity | 2018 | Low-loss curves between optima |
| Draxler et al., “Essentially No Barriers” | 2018 | Flat paths between minima |
| Li et al., Loss Landscape Visualization | 2018 | Filter normalization, skip connection effects |
| Izmailov et al., SWA | 2018 | Weight averaging finds wider optima |
| Brea & Simsek, Permutation Saddles | 2019 | Geometric structure of symmetry-induced critical points |
| Wang et al., FedMA | 2020 | Permutation-aware federated averaging |
| Frankle et al., LMC + Lottery Ticket | 2020 | Stability phase transition early in training |
| Uriot & Izzo, Safe Crossover | 2020 | Neuron alignment for evolutionary crossover |
| Entezari et al., Single Basin Conjecture | 2021 | After permutation alignment, one basin |
| Simsek et al., “Geometry of Loss Landscape” | 2021 | Symmetry-induced critical subspaces |
| Wortsman et al., Model Soups | 2022 | Averaging fine-tuned models improves accuracy |
| Ainsworth et al., Git Re-Basin | 2023 | Three algorithms for permutation alignment |
| Ilharco et al., Task Arithmetic | 2023 | Linear operations on task vectors |
| Guijt et al., Stitching for Neuroevolution | 2024 | Stitching layers for architecture-crossing recombination |
| “Symmetry in Neural Network Parameter Spaces” (survey) | 2025 | Taxonomy of weight space symmetries |