Weight Exchange and Weight Space Geometry

Research notes for the “Neural Societies” series. Compiled 2026-04-05.

The central question: what does it mean for neural nets to “exchange parameters with each other”? The geometry and topology of weight space determine whether this is trivial, difficult, or impossible.


1. Mode Connectivity

The Discovery

Two independent groups in 2018 overturned the conventional picture of isolated local minima in neural network loss landscapes.

Garipov, Izmailov, Podoprikhin, Vetrov, Wilson. “Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs.” NeurIPS 2018. (arXiv:1802.10026)

Key finding: optima found by independent SGD runs are connected by simple curves of near-constant loss and accuracy. These are not arbitrary paths – they can be found as low-degree parametric curves.

Draxler, Veschgini, Salmhofer, Hamprecht. “Essentially No Barriers in Neural Network Energy Landscape.” ICML 2018. (arXiv:1803.00885)

Key finding: the paths between minima are “essentially flat” in both training and test loss. Minima are best understood not as isolated valleys but as points on a single connected manifold of low loss.

Mathematical Formulation

Garipov et al. parameterize a path phi_theta(t) connecting two trained weight vectors w_1 and w_2, then minimize the expected loss along the path:

L(theta) = E_{t ~ U[0,1]}[ Loss(phi_theta(t)) ]

Two parameterizations:

Quadratic Bezier curve:

phi(t) = (1-t)^2 * w_1 + 2t(1-t) * theta + t^2 * w_2

where theta is a trainable control point with the same dimensionality as the weight vectors. The entire curve lies in a 2D affine subspace of weight space.

Polygonal chain: a piecewise linear path with one or more “bends.” Surprisingly, a single-bend polychain (two line segments) suffices for many architectures.
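
The Bezier parameterization is easy to state in code. A minimal numpy sketch with toy vectors and a hypothetical control point:

```python
import numpy as np

def bezier_path(t, w1, w2, theta):
    """Quadratic Bezier curve phi(t) through trained endpoints w1, w2
    with trainable control point theta (the Garipov et al. parameterization)."""
    return (1 - t) ** 2 * w1 + 2 * t * (1 - t) * theta + t ** 2 * w2

# The endpoints are recovered exactly at t=0 and t=1.
w1, w2 = np.zeros(4), np.ones(4)
theta = np.full(4, 0.5)  # hypothetical control point, same shape as the weights
assert np.allclose(bezier_path(0.0, w1, w2, theta), w1)
assert np.allclose(bezier_path(1.0, w1, w2, theta), w2)
```

In the full method, theta is optimized by sampling t ~ U[0,1] and backpropagating Loss(phi(t)) through this expression.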

What This Means for Breeding Neural Nets

If two independently trained nets live on the same connected low-loss manifold, then in principle you can interpolate between them and get something functional. The path is not a straight line (linear interpolation typically fails), but a slightly curved path works. This is the geometric precondition for meaningful parameter exchange.

However, mode connectivity alone does not mean linear interpolation works. The curves found by Garipov et al. require optimization to discover. The question of when straight-line paths suffice is the subject of linear mode connectivity (Section 2).

Fast Geometric Ensembling (FGE)

Practical payoff: by collecting weight vectors along a mode-connecting curve (using cyclical learning rates to traverse it), one can build ensembles in the time it takes to train a single model. The different points along the curve correspond to genuinely different functions despite having similar loss.


2. Linear Mode Connectivity and Permutation Symmetry

The Barrier Problem

Given two trained models w_1 and w_2, define the linear interpolation barrier as:

barrier(w_1, w_2) = max_{0 <= alpha <= 1} [ Loss(alpha * w_1 + (1-alpha) * w_2) ] - (1/2)(Loss(w_1) + Loss(w_2))

If the barrier is zero or near-zero, the models are “linearly mode connected” (LMC). Naive linear interpolation between independently trained models almost always has a large barrier. The loss spikes in the middle.
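
A sketch of the barrier computation, assuming a generic loss_fn over weight vectors; a toy double-well loss stands in for a real network:

```python
import numpy as np

def interpolation_barrier(loss_fn, w1, w2, n_points=25):
    """Linear interpolation barrier: max loss along the segment minus
    the mean of the endpoint losses (definition above)."""
    alphas = np.linspace(0.0, 1.0, n_points)
    path_losses = [loss_fn(a * w1 + (1 - a) * w2) for a in alphas]
    return max(path_losses) - 0.5 * (loss_fn(w1) + loss_fn(w2))

# Toy loss with two symmetric minima at +1 and -1: the straight line
# between them passes over the bump at 0, so the barrier is large.
loss = lambda w: float((w[0] ** 2 - 1.0) ** 2)
w1, w2 = np.array([1.0]), np.array([-1.0])
assert interpolation_barrier(loss, w1, w2) == 1.0
```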

The Permutation Symmetry Insight

The reason linear interpolation fails is not that the models found genuinely different solutions. It is that they represent the same solution with neurons in different orders.

A feedforward network with hidden layers of widths n_1, …, n_{L-1} is invariant under the permutation group:

G = S_{n_1} x S_{n_2} x ... x S_{n_{L-1}}

Any permutation of neurons within a hidden layer, with corresponding rearrangement of incoming and outgoing weights, gives a functionally identical network. For a network with d-1 hidden layers of width n, there are (n!)^{d-1} equivalent weight configurations.

Linear interpolation between two equivalent-but-permuted configurations creates a Frankenstein: neuron 3 in model A is averaged with neuron 7 in model B, producing nonsense.
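
The invariance itself is easy to verify numerically. A minimal one-hidden-layer sketch in numpy:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer MLP: y = W2 @ relu(W1 @ x), hidden width 5.
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(2, 5))
x = rng.normal(size=3)

relu = lambda z: np.maximum(z, 0.0)
y = W2 @ relu(W1 @ x)

# Permute the hidden neurons: rows of W1 and columns of W2 together.
perm = rng.permutation(5)
y_perm = W2[:, perm] @ relu(W1[perm] @ x)

assert np.allclose(y, y_perm)  # functionally identical network
```

Averaging W1 with W1[perm] element-wise, by contrast, mixes unrelated neurons, which is exactly the failure mode described above.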

The Entezari Conjecture

Entezari, Sedghi, Saukh, Neyshabur. “The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks.” 2021. (arXiv:2110.06296)

Conjecture: after accounting for permutation symmetries, most SGD solutions lie in a single basin. That is, for any two independently trained models w_1 and w_2, there exists a permutation pi in G such that:

barrier(pi(w_1), w_2) approx 0

This would mean the loss landscape has essentially one basin (modulo permutations), radically simplifying the picture.

Git Re-Basin

Ainsworth, Hayase, Srinivasa. “Git Re-Basin: Merging Models Modulo Permutation Symmetries.” ICLR 2023. (arXiv:2209.04836)

Three algorithms to find the permutation that aligns two models:

  1. Weight matching. Solve a bipartite linear assignment problem: maximize the inner product between corresponding weight rows of the two models. Uses block-coordinate descent, iteratively solving layer-by-layer assignments until convergence.

  2. Activation matching. Run both models on a batch of data, match neurons by their activation patterns using the Hungarian algorithm on correlation matrices.

  3. Straight-through estimator (STE). Parameterize permutations continuously (via soft assignments), minimize the midpoint interpolation loss directly. Uses straight-through estimator for gradients through the discrete projection step.
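
For a single hidden layer, the weight-matching step reduces to one linear assignment problem. A sketch (not the paper's full block-coordinate algorithm) using scipy:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_hidden_layer(W1_a, W2_a, W1_b, W2_b):
    """Find the permutation of model A's hidden units that maximizes
    the inner product with model B's weights (one assignment step)."""
    # cost[i, j] = similarity of A's unit i to B's unit j, using both
    # the incoming rows (W1) and the outgoing columns (W2).
    cost = W1_a @ W1_b.T + W2_a.T @ W2_b
    row, col = linear_sum_assignment(cost, maximize=True)
    perm = np.empty_like(col)
    perm[col] = row  # perm[j] = index of A's unit assigned to B's unit j
    return perm

# Sanity check: a permuted copy of a model is matched back exactly.
rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(6, 4)), rng.normal(size=(3, 6))
p = rng.permutation(6)
perm = match_hidden_layer(W1[p], W2[:, p], W1, W2)
assert np.allclose(W1[p][perm], W1)
assert np.allclose(W2[:, p][:, perm], W2)
```

With multiple hidden layers the assignments become coupled, which is why the paper iterates layer-by-layer until convergence.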

Results: achieved zero-barrier LMC for ResNets on CIFAR-10. First demonstration that independently trained models can be merged by simple averaging after permutation alignment. Key finding: model width matters – sufficiently wide models are required for LMC to hold.

Frankle’s Stability Analysis

Frankle, Dziugaite, Roy, Carbin. “Linear Mode Connectivity and the Lottery Ticket Hypothesis.” ICML 2020. (arXiv:1912.05671)

Discovered that neural networks become “stable” to SGD noise early in training – after a critical point, different random seeds converge to the same linearly connected basin. For ResNet-20 on CIFAR-10 this happens after only 3% of training; for ResNet-50 on ImageNet, after 20%.

This suggests a phase transition: early training is a chaotic exploration phase where the network finds its basin, followed by a deterministic refinement phase within a single basin. The timing of this transition connects to lottery ticket rewinding – subnetworks only work when rewound to post-stability checkpoints.

Implications for Neural Societies

The permutation alignment problem is the fundamental obstacle to “breeding” neural networks. Two networks trained independently will have their neurons in different orders, making direct parameter exchange destructive. But once you solve the alignment problem (Git Re-Basin, activation matching, etc.), averaging becomes viable.

The width requirement is interesting: wider networks are easier to align. This parallels the biological observation that genetic recombination works better with redundancy.


3. Model Soups and Weight Averaging

Model Soups

Wortsman, Ilharco, Gadre, …, Schmidt. “Model Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy Without Increasing Inference Time.” ICML 2022. (arXiv:2203.05482)

Key insight: when multiple models are fine-tuned from the same pretrained initialization with different hyperparameters, they tend to remain in the same loss basin. This means you can simply average their weights and get a model that is better than any individual.

Two recipes:

Uniform soup: simply average the weights of all fine-tuned models.

Greedy soup: sort models by validation accuracy, then add them to the soup one at a time, keeping each addition only if it improves the soup’s validation accuracy.

Results: ViT-G model soup achieved 90.94% top-1 on ImageNet, a new state-of-the-art at the time. Improvements also on OOD robustness and zero-shot transfer.

The theoretical grounding connects to loss landscape flatness: weight averaging works when the models live in a flat region of the loss landscape, and a shared pretrained initialization ensures they do.

Stochastic Weight Averaging (SWA)

Izmailov, Podoprikhin, Garipov, Vetrov, Wilson. “Averaging Weights Leads to Wider Optima and Better Generalization.” UAI 2018. (arXiv:1803.05407)

SWA averages weight vectors collected along an SGD trajectory (with cyclical or constant learning rate). The averaged model finds a wider, flatter minimum than SGD alone.

Geometric intuition: in high-dimensional weight space, most of the volume of a flat region is concentrated near its boundary. SGD, being stochastic, settles near the boundary. SWA averages multiple boundary points, moving the solution toward the center of the flat region, where it generalizes better.
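
A minimal sketch of the SWA update, with a toy simulation standing in for SGD iterates scattered near the boundary of a flat region:

```python
import numpy as np

class SWA:
    """Running average of weight snapshots collected along a trajectory
    (minimal sketch of Stochastic Weight Averaging)."""
    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, w):
        self.n += 1
        if self.avg is None:
            self.avg = np.array(w, dtype=float)
        else:
            self.avg += (w - self.avg) / self.n  # incremental mean

# Snapshots scattered around a flat region's center: averaging pulls
# the solution toward the center, as described above.
rng = np.random.default_rng(0)
center = np.array([2.0, -1.0])
swa = SWA()
for _ in range(1000):
    swa.update(center + 0.5 * rng.normal(size=2))  # noisy boundary iterate
assert np.linalg.norm(swa.avg - center) < 0.1
```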

Task Arithmetic

Ilharco, Ribeiro, Wortsman, …, Farhadi. “Editing Models with Task Arithmetic.” ICLR 2023. (arXiv:2212.04089)

A step beyond averaging: define a “task vector” as:

tau_task = w_finetuned - w_pretrained

These task vectors can be added (composing skills from several fine-tunes), negated (subtracting a task vector removes the corresponding behavior), and scaled (a coefficient controls the strength of each edit). The paper also demonstrates task analogies of the form tau_D approx tau_C + (tau_B - tau_A).

This works because fine-tuned models from a shared initialization exhibit LMC, so task vectors are approximately linear directions in weight space. Weight space near a pretrained model behaves roughly like a vector space, at least locally.
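
A toy numpy sketch of task-vector arithmetic. The fine-tunes are hypothetical random perturbations; in practice the scaling coefficients are tuned on validation data:

```python
import numpy as np

rng = np.random.default_rng(0)
w_pre = rng.normal(size=8)                  # shared pretrained weights
w_math = w_pre + 0.1 * rng.normal(size=8)   # hypothetical fine-tune A
w_code = w_pre + 0.1 * rng.normal(size=8)   # hypothetical fine-tune B

# Task vectors: differences from the pretrained initialization.
tau_math = w_math - w_pre
tau_code = w_code - w_pre

# Addition composes skills; negation removes a behavior.
w_multi = w_pre + 0.5 * tau_math + 0.5 * tau_code
w_forget = w_pre - 1.0 * tau_code

assert np.allclose(w_pre + tau_math, w_math)  # tau recovers the fine-tune
```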

Connection to Neural Societies

Model soups are a primitive form of “social learning” – a population of models trained independently, then combined. The critical requirement is a shared starting point (pretrained initialization), which ensures all models stay in the same basin. Without this, averaging destroys information.

This is the simplest case of parameter exchange: all agents start from the same point and are averaged at the end. The question for neural societies is whether you can do something more dynamic – ongoing exchange during training, between agents that may have diverged further.


4. Federated Learning as Implicit Society

FedAvg

McMahan, Moore, Ramage, Hampson, Agüera y Arcas. “Communication-Efficient Learning of Deep Networks from Decentralized Data.” AISTATS 2017. (arXiv:1602.05629)

FedAvg is literally a society of neural networks that learn privately and share publicly:

  1. Server broadcasts global model to K clients.
  2. Each client trains locally on its own data for E epochs.
  3. Clients send updated weights back to the server.
  4. Server averages the weights (weighted by dataset size).
  5. Repeat.
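
The server-side aggregation (step 4) is a dataset-size-weighted average; in numpy:

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """FedAvg server step: weighted average of client weight vectors,
    with weights proportional to each client's dataset size."""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()
    return sum(c * w for c, w in zip(coeffs, client_weights))

# Three clients; the larger client pulls the average toward its weights.
clients = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([1.0, -1.0])]
global_w = fedavg_aggregate(clients, client_sizes=[100, 100, 200])
assert np.allclose(global_w, [0.75, -0.25])
```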

Achieves 10-100x communication reduction over synchronized SGD. Robust to non-IID data distributions across clients.

The Permutation Problem in Federated Learning

Wang, Yurochkin, Sun, Papailiopoulos, Khazaeni. “Federated Learning with Matched Averaging.” ICLR 2020. (arXiv:2002.06440)

FedMA recognizes that naive element-wise averaging in FedAvg ignores permutation invariance. Different clients, training on different data from different initializations, will learn neurons in different orders. FedMA aligns neurons across clients before averaging, using Bayesian nonparametric matching (for FC layers) and feature-similarity matching (for Conv layers).

PFNM (Probabilistic Federated Neural Matching): Uses Bayesian nonparametric methods (Beta process priors) to simultaneously align and merge client models, adapting the global model size to the population.

Convergence Issues

FedAvg with fixed learning rate converges to a neighborhood of the optimum, not the optimum itself. The gap is Omega(eta * (E - 1)), where eta is the learning rate and E is local epochs. Learning rate decay is necessary for exact convergence.

FedProx (Li et al., 2020) adds a proximal term penalizing deviation from the global model, stabilizing convergence under heterogeneous data. Improves absolute accuracy by ~22% in highly heterogeneous settings.

SCAFFOLD (Karimireddy et al., 2020) uses variance reduction to correct local update directions, achieving convergence guarantees independent of data heterogeneity.

The Society Analogy

Federated learning is the closest existing paradigm to a “neural society”:

The key tension is between local adaptation (specialization to each client’s data) and global coherence (maintaining a shared model that works for everyone). FedAvg resolves this crudely by averaging; more sophisticated approaches (FedMA, SCAFFOLD) try to be smarter about what gets shared and how.


5. Loss Landscape Topology

Visualization: Li et al.

Li, Xu, Taylor, Studer, Goldstein. “Visualizing the Loss Landscape of Neural Nets.” NeurIPS 2018. (arXiv:1712.09913)

Introduced “filter normalization” for meaningful 2D visualizations of loss landscapes. Key findings: skip connections dramatically smooth the landscape (very deep networks without them transition to chaotic, shattered surfaces); wider networks produce flatter landscapes; and the visual flatness of a minimum correlates with generalization.

Saddle Points Dominate in High Dimensions

Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio. “Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization.” NeurIPS 2014. (arXiv:1406.2572)

The crucial insight from random matrix theory and statistical physics: in high dimensions, critical points are overwhelmingly saddle points, not local minima.

At a critical point, the Hessian has a spectrum of eigenvalues, and in high dimensions the fraction of negative eigenvalues (the index) grows with the loss: critical points well above the global minimum almost surely have descent directions and are therefore saddles, while genuine local minima concentrate near the bottom of the landscape.

This is why SGD works: it almost never gets stuck in bad local minima, because in million-dimensional space, critical points with no escape direction are exponentially rare except close to the global minimum.

The Spin Glass Analogy

Choromanska, Henaff, Mathieu, Arous, LeCun. “The Loss Surfaces of Multilayer Networks.” AISTATS 2015. (arXiv:1412.0233)

Under simplifying assumptions (variable independence, redundancy, uniformity), the loss function of a neural network is analogous to the Hamiltonian of a spherical p-spin glass. This connection yields: critical points organized into bands by index (number of negative Hessian eigenvalues); local minima concentrated in a narrow band just above the global minimum; and a probability of encountering a bad (high-loss) local minimum that decreases rapidly with network size.

Limitations: the independence assumption is unrealistic (many paths share inputs). But the qualitative picture – a band of good local minima near the global minimum, separated from a wilderness of saddle points – matches empirical observations.

Sewall Wright’s Adaptive Landscape vs. Neural Loss Landscapes

Wright (1932) introduced the metaphor of a “fitness landscape” for evolutionary biology: allele frequencies define coordinates, fitness defines height, evolution is hill-climbing on this surface. The metaphor has been influential but controversial for 90 years.

Key parallels with neural loss landscapes: both map an enormous configuration space to a single scalar objective; both are explored by local, incremental search rather than global optimization; and both are rugged because of interactions between components (epistasis between genes, coupling between weights).

Key differences:

Feature | Wright’s Landscape | Neural Loss Landscape
Dimensionality | ~10^3 to 10^4 (genes) | ~10^6 to 10^12 (parameters)
Objective | Maximize fitness | Minimize loss
High-dim effect | Debated; may reduce stable peaks (Fisher) | Eliminates isolated local minima (Dauphin)
Connectivity | Unknown; neutral ridges may exist | Established – mode connectivity
Landscape is… | Fixed (by environment) | Defined by data (changes with distribution)
Symmetry | Minimal (genes have distinct roles) | Massive (permutation symmetry)
Navigation | Blind (mutation + selection) | Gradient-informed (backprop)

The most profound difference: high dimensionality is a curse in biology but a blessing in optimization. In low dimensions, local optima are isolated traps. In millions of dimensions, there are so many escape directions that true local minima become vanishingly rare. The very problem Wright worried about (getting stuck on suboptimal peaks) essentially dissolves in the neural case.

However, both landscapes share the difficulty of visualization. Wright estimated that the space of gene combinations would number ~10^1000. From the start, the “landscape” was always a low-dimensional metaphor for a high-dimensional reality. The same caveat applies to those beautiful 2D loss landscape plots – they are projections, and projections can be misleading.

Morse Theory and Loss Landscape Connectedness

Akhtiamov et al. “Connectedness of Loss Landscapes via the Lens of Morse Theory.” ALT 2023.

Morse theory provides tools for understanding the topology of sublevel sets { w : L(w) <= c } as the threshold c varies. Critical points of the loss function (where grad L = 0) change the topology: minima add connected components, saddle points merge them. If all saddle points between two minima have loss below threshold c, then the sublevel set at c is connected – the two minima are mode-connected at loss level c.

This gives a rigorous topological framework for mode connectivity: it’s about the Morse-theoretic structure of the loss function, specifically whether index-1 saddle points (with one negative Hessian eigenvalue) exist at sufficiently low loss to connect the basins.


6. Population-Based Training (PBT)

Jaderberg, Dalibard, Osindero, Czarnecki, Donahue, Razavi, Vinyals, Green, Dunning, Simonyan, Fernando, Kavukcuoglu. “Population Based Training of Neural Networks.” 2017. (arXiv:1711.09846)

The Algorithm

PBT is literally a society of neural networks with parameter exchange:

  1. Initialize a population of N models with random hyperparameters.
  2. Train all models in parallel.
  3. Periodically, each model evaluates its fitness (validation performance).
  4. Exploit: if a model is underperforming, it copies the weights and hyperparameters of a better-performing model.
  5. Explore: the copied hyperparameters are perturbed (multiplied by 0.8 or 1.2 for continuous values, shifted to adjacent value for discrete ones).
  6. Continue training from the new weights with the new hyperparameters.
  7. Repeat.

This is a hybridization of gradient-based training (each model uses backprop) and evolutionary dynamics (the population shares information through weight copying and hyperparameter mutation).
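
A minimal sketch of one exploit/explore round, assuming each population member is a plain dict; truncation selection on the bottom quartile is one common variant, not the paper's only scheme:

```python
import random

def pbt_step(population, perturb_factors=(0.8, 1.2), bottom_frac=0.25):
    """One exploit/explore round of PBT (sketch). Each member is a dict
    with 'weights', continuous 'hyperparams', and a 'fitness' score."""
    ranked = sorted(population, key=lambda m: m["fitness"])
    cutoff = max(1, int(len(ranked) * bottom_frac))
    for loser in ranked[:cutoff]:
        winner = random.choice(ranked[-cutoff:])
        # Exploit: wholesale copy of weights and hyperparameters.
        loser["weights"] = list(winner["weights"])
        # Explore: perturb each continuous hyperparameter by 0.8 or 1.2.
        loser["hyperparams"] = {
            k: v * random.choice(perturb_factors)
            for k, v in winner["hyperparams"].items()
        }
    return population

pop = [{"weights": [i], "hyperparams": {"lr": 0.1}, "fitness": i}
       for i in range(8)]
pbt_step(pop)
worst = min(pop, key=lambda m: m["fitness"])
assert worst["weights"][0] >= 6  # copied from a top-quartile member
```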

Key Properties

PBT is asynchronous (members train independently and consult a shared performance record only at exploit/explore points), it tunes hyperparameters online within a single training run rather than across runs, and it discovers hyperparameter schedules – values that change over the course of training – rather than fixed settings.

Applications

The paper demonstrates PBT on deep reinforcement learning (DeepMind Lab, Atari, StarCraft II), GAN training, and neural machine translation, matching or beating hand-tuned baselines in each domain.

PBT as a Neural Society

PBT has the clearest “society” interpretation of any existing method: the population members are individuals, validation performance is fitness, the exploit step is imitation of successful peers, the explore step is mutation, and the overwriting of underperformers is selection.

The “exploit” step is the most radical form of parameter exchange: wholesale copying of another model’s weights. This works because models in the population share the same architecture and (usually) similar training histories, so they remain in approximately the same basin.


7. Crossover in Weight Space

The Competing Conventions Problem

The fundamental difficulty with genetic crossover in neural networks, identified in the neuroevolution literature:

Two networks can compute identical functions with neurons in different orders. If you cross them by swapping corresponding weight indices, the offspring inherits neuron 3 from parent A and neuron 3 from parent B – but these may represent completely different features. The child is a random scramble.

Example: parents [A, B, C] and [C, B, A] represent the same function (just neurons permuted). Crossover produces [A, B, A] or [C, B, C] – losing 1/3 of the information.

This is the same permutation symmetry problem that plagues model merging and federated averaging, rediscovered independently in the neuroevolution community.
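
The information loss is visible even in the three-neuron toy example above:

```python
# Two parents computing the same function with neurons in different order.
parent_a = ["A", "B", "C"]   # neuron features, by position
parent_b = ["C", "B", "A"]   # same function, neurons permuted

# Positional one-point crossover ignores the permutation entirely:
child = parent_a[:2] + parent_b[2:]

assert child == ["A", "B", "A"]
assert set(child) != {"A", "B", "C"}  # feature "C" was lost outright
```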

NEAT’s Solution: Historical Markers

Stanley, Miikkulainen. “Evolving Neural Networks through Augmenting Topologies.” Evolutionary Computation, 2002.

NEAT avoids the competing conventions problem by tracking the evolutionary history of every gene (connection) via global innovation numbers. Crossover aligns genes by their historical origin, not their position. Two genes with the same innovation number necessarily represent the same structural element (though possibly with different weights).

This is essentially a manual solution to the alignment problem: instead of trying to align neurons after the fact, NEAT maintains alignment by construction throughout the evolutionary process.

Safe Crossover via Neuron Alignment

Uriot, Izzo. “Safe Crossover of Neural Networks Through Neuron Alignment.” GECCO 2020. (arXiv:2003.10306)

A post-hoc approach: before crossover, align the neurons of the two parents by computing correlations between their activation patterns. Two methods:

After alignment, standard arithmetic crossover (averaging corresponding weights) works effectively. The method is computationally cheap and, unlike naive crossover, produces offspring that retain the parents’ behavior.

Connection to Git Re-Basin: this is essentially the same idea (permutation alignment before merging), discovered independently in the evolutionary computation community.

Stitching for Neuroevolution

Guijt, Thierens, Alderliesten, Bosman. “Stitching for Neuroevolution: Recombining Deep Neural Networks without Breaking Them.” GECCO 2024. (arXiv:2403.14224)

A more ambitious approach: instead of permuting neurons, insert trainable “stitching layers” (small linear or 1x1 conv layers) at crossover points. These layers learn the transformation between parent representations. This avoids the competing conventions problem entirely because the stitching layer adapts to whatever alignment exists.

Steps:

  1. Find compatible layer pairs between parents (matching tensor shapes).
  2. Ensure the resulting computational graph is acyclic (branch-and-bound).
  3. Construct a “supernetwork” with switches at each crossover point.
  4. Train stitching layers to minimize representation mismatch.
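
Step 4 can be sketched as fitting a linear stitching map between parent activations. Here least squares stands in for gradient training, and the change-of-basis matrix R is a synthetic stand-in for the parents' representational mismatch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Parent A's activations at the cut point, and a synthetic change of
# basis R modeling the mismatch with what parent B's layers expect.
h_a = rng.normal(size=(256, 16))
R = rng.normal(size=(16, 16))
h_b = h_a @ R

# Fit the stitching matrix S so that h_a @ S matches h_b (least squares
# here; the paper trains stitching layers by gradient descent).
S, *_ = np.linalg.lstsq(h_a, h_b, rcond=None)

assert S.shape == (16, 16)
assert np.allclose(h_a @ S, h_b, atol=1e-8)  # mismatch fully absorbed
```

Because the stitching layer absorbs whatever linear transformation separates the two representations, no explicit permutation alignment is needed.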

Results: recombined networks can achieve novel performance/cost tradeoffs, sometimes dominating individual parents. Works across different architectures, not just identical ones.

Evolution Strategies as an Alternative

Salimans, Ho, Chen, Sidor, Sutskever. “Evolution Strategies as a Scalable Alternative to Reinforcement Learning.” 2017. (arXiv:1703.03864)

OpenAI showed that evolution strategies (ES) – black-box optimization via random perturbation of weights – can match RL performance on Atari and MuJoCo. ES treats the network as a black box: perturb the million-dimensional weight vector by Gaussian noise, evaluate the reward, update in the direction of successful perturbations.

ES avoids the crossover problem entirely by working only with mutation. But it does use a population: each worker evaluates a perturbed version of the shared weight vector. The communication is remarkably efficient – only scalar rewards need to be shared, because workers share random seeds and can reconstruct perturbations locally.

This is a “society” with a very different structure: no crossover, no weight copying, just collective gradient estimation through correlated perturbations.
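
A minimal sketch of the ES update with antithetic perturbation pairs, on a toy concave reward; the real system distributes the perturbations across workers via shared random seeds:

```python
import numpy as np

def es_step(w, reward_fn, sigma=0.1, lr=0.05, n_workers=200, seed=0):
    """One evolution-strategies update (sketch): perturb w with Gaussian
    noise, weight each perturbation by its reward, and step in that
    direction. Antithetic pairs (eps, -eps) reduce variance."""
    rng = np.random.default_rng(seed)  # shared seed: workers can rebuild eps
    eps = rng.normal(size=(n_workers, w.size))
    rewards = np.array([reward_fn(w + sigma * e) - reward_fn(w - sigma * e)
                        for e in eps])
    grad_est = (rewards[:, None] * eps).mean(axis=0) / (2 * sigma)
    return w + lr * grad_est  # ascend the estimated reward gradient

# Maximize a concave reward with its optimum at [1, -1].
reward = lambda w: -np.sum((w - np.array([1.0, -1.0])) ** 2)
w = np.zeros(2)
for i in range(200):
    w = es_step(w, reward, seed=i)
assert np.linalg.norm(w - np.array([1.0, -1.0])) < 0.05
```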


8. Theoretical Frameworks

Weight Space Symmetries: The Full Picture

Hecht-Nielsen. “On the Algebraic Structure of Feedforward Network Weight Spaces.” 1990.

Pioneer work establishing that the set of weight transformations leaving the input-output function invariant forms a group. For an MLP with hidden layers of widths n_1, …, n_{L-1}:

The symmetry group contains at minimum: permutations of units within each hidden layer, and, for odd activations such as tanh, sign flips of individual units (negating a unit’s incoming and outgoing weights simultaneously).

The full discrete symmetry group for a tanh network is a wreath product:

G = (Z_2 wr S_{n_1}) x (Z_2 wr S_{n_2}) x ... x (Z_2 wr S_{n_{L-1}})

where Z_2 wr S_n = Z_2^n rtimes S_n is the hyperoctahedral group (signed permutations).

For ReLU networks, the sign-flip symmetry is replaced by positive scaling: R_{>0}^{n_k} (rescaling a neuron’s output by lambda and its incoming weights by 1/lambda). This is a continuous symmetry, making the quotient space more complex.

Simsek, Ged, Jacot, Spadaro, Hongler, Gerstner, Brea. “Geometry of the Loss Landscape in Overparameterized Neural Networks: Symmetries and Invariances.” ICML 2021.

Key results: symmetries stamp out whole manifolds of copies of every critical point; in sufficiently overparameterized networks the symmetric copies of the global minimum become connected, forming a single manifold of zero loss; and the number of symmetry-induced critical subspaces grows combinatorially faster than the number of genuinely distinct solutions.

Recent Taxonomy (2025)

“Symmetry in Neural Network Parameter Spaces.” 2025. (arXiv:2506.13018)

A comprehensive survey defining a hierarchy of symmetry types and cataloguing the symmetry groups of specific architectures.

The fundamental challenge: the full symmetry group G_{Theta,L} is typically impossible to characterize. The known groups (permutations, sign flips, scaling) are subgroups. The quotient space Theta/G has not been fully characterized topologically for any nontrivial architecture. This is an open problem.

Weight Space as a Manifold

The parameter space Theta = R^d is trivially a manifold. The interesting question is the structure of the quotient Theta/G, which is typically an orbifold (a manifold with singularities at points with non-trivial stabilizers).

The singularities occur at weight configurations where two or more neurons are identical – the stabilizer subgroup is larger than generic, making the quotient space singular. These are precisely the “permutation points” identified by Brea and Simsek (2019): points where neuron weight vectors collide, creating flat directions in the loss landscape.

Open question: What is the topology (fundamental group, homology) of the quotient space Theta/G? Is it simply connected? Does it have non-trivial cycles? The answers would determine whether there are topological obstructions to parameter exchange.

Information Geometry

Amari. “Natural Gradient Works Efficiently in Learning.” Neural Computation, 1998.

The parameter space of a neural network is a Riemannian manifold with metric given by the Fisher information matrix:

g_{ij}(theta) = E[ (d log p(x|theta)/d theta_i) * (d log p(x|theta)/d theta_j) ]

The natural gradient (the steepest descent direction on this manifold) is:

theta_new = theta - eta * F^{-1} * grad L

where F is the Fisher information matrix.
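
A worked one-parameter example: for a Bernoulli model the Fisher information is the scalar 1/(theta(1-theta)), and the natural gradient step becomes independent of the parameterization:

```python
# Natural gradient on a Bernoulli model p(x=1) = theta (toy sketch).
# Fisher information: F(theta) = 1 / (theta * (1 - theta)).
def natural_gradient_step(theta, grad, eta=0.1):
    F = 1.0 / (theta * (1.0 - theta))
    return theta - eta * grad / F  # theta - eta * F^{-1} * grad

# Negative log-likelihood gradient for observed mean p_data:
# grad = (theta - p_data) / (theta * (1 - theta)), so the natural step
# simplifies to theta - eta * (theta - p_data).
p_data, theta = 0.8, 0.3
for _ in range(100):
    grad = (theta - p_data) / (theta * (1.0 - theta))
    theta = natural_gradient_step(theta, grad)
assert abs(theta - p_data) < 1e-3
```

Note how the Fisher factor cancels the curvature of the likelihood: the resulting update contracts toward the optimum at a uniform rate regardless of where theta sits in (0, 1).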

Amari, Karakida. “Fisher Information and Natural Gradient Learning of Random Deep Networks.” AISTATS 2019.

For deep networks, the Fisher matrix has a unit-wise block-diagonal structure (up to small off-diagonal terms). This means the information geometry decomposes approximately layer by layer.

Relevance to neural societies: if you want to merge models in a geometrically principled way, you should not simply average their parameters (which assumes Euclidean geometry). You should interpolate along geodesics of the Fisher information manifold. SLERP (spherical linear interpolation) is a crude approximation to this. A proper merging procedure would account for the curvature of the statistical manifold.
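
A sketch of SLERP between two weight vectors, illustrating the property that makes it preferable to straight-line averaging in some merging recipes:

```python
import numpy as np

def slerp(w1, w2, t):
    """Spherical linear interpolation: move along the great circle
    between w1 and w2 rather than along the chord."""
    cos_omega = np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1 - t) * w1 + t * w2  # nearly parallel: fall back to lerp
    return (np.sin((1 - t) * omega) * w1 + np.sin(t * omega) * w2) / np.sin(omega)

# Unlike linear interpolation, SLERP preserves the norm of unit vectors.
w1, w2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mid = slerp(w1, w2, 0.5)
assert np.isclose(np.linalg.norm(mid), 1.0)
assert np.linalg.norm(0.5 * w1 + 0.5 * w2) < 1.0  # lerp shrinks the norm
```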

Permutation Saddles and Valley Structure

Brea, Simsek, Ganguli. “Weight-Space Symmetry in Deep Networks Gives Rise to Permutation Saddles, Connected by Equal-Loss Valleys Across the Loss Landscape.” 2019. (arXiv:1907.02911)

Concrete geometric results: at “permutation points,” where two units of a layer share identical incoming and outgoing weights, the symmetry forces flat directions in the loss; critical points at such configurations are generically saddles (permutation saddles); and the symmetric copies of a minimum are linked through these saddles by valleys of equal loss.

The picture: the weight space of a neural network is carved up by a web of flat valleys connecting equivalent minima, with permutation saddles acting as low-dimensional corridors between them. The topology is far richer than “a bunch of isolated basins.”


9. The Model Merging Ecosystem (2023-2026)

The theoretical insights above have spawned a thriving practical ecosystem, especially in the LLM community.

Key Methods

Widely used recipes include SLERP (spherical interpolation between two checkpoints), TIES-Merging (trim small parameter changes, resolve sign conflicts, then merge task vectors), DARE (randomly drop task-vector entries and rescale the survivors before merging), and Fisher-weighted averaging (weight each parameter’s contribution by its Fisher information).

Weight Disentanglement

Recent finding (2024): large-scale instruction tuning “disentangles” model weights, meaning different directions in weight space correspond to functional changes in disjoint regions of the input space. This makes merging easier because the task vectors don’t interfere.

MergeKit and Community Practice

The open-source MergeKit library has enabled thousands of community-created merged LLMs, many ranking at or near the top of the Open LLM Leaderboard. Model merging has become a form of combinatorial innovation – practitioners combine specialist models (coding, math, conversation) without any training, achieving capabilities beyond any single parent.

This is, in effect, a “neural society” emerging organically in the open-source community: a population of models that exchange parameters (via merging recipes), compete (on leaderboards), and evolve (through iterated merge-and-evaluate cycles).


10. Open Questions and Mathematical Directions

For the Neural Societies Concept

  1. Dynamic alignment during training. Git Re-Basin works post-hoc. Can you maintain alignment continuously during co-training, so that models in a population can exchange parameters at any time? This would require tracking the permutation mapping as it evolves.

  2. Partial exchange. PBT copies entire weight vectors. Model soups average everything. Is there a principled way to exchange parts of weight spaces – e.g., sharing early-layer features while keeping late-layer specializations?

  3. Communication topology. In federated learning, there is a star topology (all clients talk to one server). What happens with peer-to-peer exchange? Ring topologies? Does the communication graph structure affect convergence the way social network structure affects cultural evolution?

  4. Competitive dynamics. All current approaches are cooperative (shared objective). What happens when models have conflicting objectives but must still exchange information? This is the game-theoretic frontier.

  5. Diversity maintenance. In PBT, exploitation drives the population toward homogeneity. How do you maintain diversity? Evolutionary biology uses speciation, geographic isolation, frequency-dependent selection. What are the neural network equivalents?

Mathematical Concepts Worth Exploring

Drawing on the sections above: the orbifold structure and topology of the quotient space Theta/G (Section 8); Morse theory of loss sublevel sets (Section 5); geodesic interpolation on the Fisher information manifold (Section 8); and wreath products as the algebraic language of weight space symmetry.


Key References Summary

Paper | Year | Key Contribution
Hecht-Nielsen, “Algebraic Structure of Weight Spaces” | 1990 | Symmetry group of feedforward networks
Amari, “Natural Gradient” | 1998 | Fisher information Riemannian metric
Stanley & Miikkulainen, NEAT | 2002 | Historical markers solve competing conventions
Dauphin et al., “Saddle Point Problem” | 2014 | Saddle points dominate in high dimensions
Choromanska et al., “Loss Surfaces of Multilayer Networks” | 2015 | Spin glass analogy for loss landscapes
McMahan et al., FedAvg | 2017 | Federated averaging of distributed models
Jaderberg et al., PBT | 2017 | Population-based training with exploit/explore
Salimans et al., Evolution Strategies | 2017 | ES as scalable alternative to RL
Garipov et al., Mode Connectivity | 2018 | Low-loss curves between optima
Draxler et al., “Essentially No Barriers” | 2018 | Flat paths between minima
Li et al., Loss Landscape Visualization | 2018 | Filter normalization, skip connection effects
Izmailov et al., SWA | 2018 | Weight averaging finds wider optima
Brea & Simsek, Permutation Saddles | 2019 | Geometric structure of symmetry-induced critical points
Wang et al., FedMA | 2020 | Permutation-aware federated averaging
Frankle et al., LMC + Lottery Ticket | 2020 | Stability phase transition early in training
Uriot & Izzo, Safe Crossover | 2020 | Neuron alignment for evolutionary crossover
Entezari et al., Single Basin Conjecture | 2021 | After permutation alignment, one basin
Simsek et al., “Geometry of Loss Landscape” | 2021 | Symmetry-induced critical subspaces
Wortsman et al., Model Soups | 2022 | Averaging fine-tuned models improves accuracy
Ainsworth et al., Git Re-Basin | 2023 | Three algorithms for permutation alignment
Ilharco et al., Task Arithmetic | 2023 | Linear operations on task vectors
Guijt et al., Stitching for Neuroevolution | 2024 | Stitching layers for architecture-crossing recombination
“Symmetry in Neural Network Parameter Spaces” (survey) | 2025 | Taxonomy of weight space symmetries