Notes

meta-learning redux

1ea1a35f687fd461da1181bad3965dfb82c82128362cb8d0ff077f9e94b88d04


cardinal and ordinal utilities

A common mistake is to take one's formalism as metaphysics. This is especially true in domains tangentially related to the study of human behavior: "just because you can be described by a coherence theorem does not mean you are a coherence theorem."

I note that the difference between cardinal and ordinal utilities is not as deep as it may seem. Cardinalists use a utility function $u: X \to \mathbb{R}$ to describe preferences, while ordinalists restrict themselves to defining only an ordering over $X.$

Under natural conditions, orderings over $X$ can be represented by utility functions on $X.$ If the ordering is complete, transitive, continuous[1], and admits a countable order-dense subset[2], then there exists a continuous function $u$ such that $u(x) \leq u(y)$ if and only if $x \leq y.$

As an example: if $X$ is any convex subset of $\mathbb{R}^n$ and $\leq$ is continuous, then this holds and there exists a corresponding continuous utility function.
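Conversely, the order-density condition has teeth. A classic counterexample (standard, reconstructed here from memory): the lexicographic order on $\mathbb{R}^2,$

```latex
(x_1, y_1) \preceq (x_2, y_2)
  \iff
  x_1 < x_2 \ \text{ or } \ \bigl( x_1 = x_2 \text{ and } y_1 \leq y_2 \bigr)
```

is complete and transitive, but admits no utility representation: any order-dense $Z$ must contain a point with first coordinate $x$ for every $x \in \mathbb{R}$ (to sit between $(x, 0)$ and $(x, 1)$), so $Z$ is necessarily uncountable.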

[1] Give $X$ topological structure. If, for every $x \in X,$ the upper and lower contour sets $\{y : x \leq y\}$ and $\{y : y \leq x\}$ are closed in $X,$ then $\leq$ is continuous.

[2] $Z \subseteq X$ is order-dense if for every pair $x, y \in X$ with $x < y,$ there exists $z \in Z$ such that $x \leq z \leq y.$


thoughts on VeLO

[paper] [code]

[1] hypernetwork architecture. there's a tensor-level LSTM that takes in bulk loss statistics and outputs a weighting vector $c_{hyper}$ used to interpolate between a bank of pre-trained per-parameter MLPs; the interpolated MLP is then fed the per-parameter loss statistics to compute quantities $d, m$ such that $\Delta p = 10^{-3} \cdot d \cdot \exp(10^{-3} \cdot c_{lr}) \cdot \|p\|_2$ (where $c_{lr}$ is also produced by the tensor-level LSTM). not immediately obvious to me where these pre-trained MLPs come from
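the update rule itself is simple enough to write down. a minimal numpy sketch of just that final step (names are mine, not the paper's; `d` and `c_lr` stand in for the hypernetwork outputs):

```python
import numpy as np

def velo_style_step(p, d, c_lr, alpha=1e-3):
    """Apply the update Δp = 1e-3 · d · exp(1e-3 · c_lr) · ||p||_2.

    p    : parameter tensor
    d    : per-parameter direction (output of the interpolated MLP bank)
    c_lr : scalar learning-rate signal from the tensor-level LSTM
    """
    # exp(·) lets the LSTM modulate step size multiplicatively, while
    # ||p||_2 makes the step scale with the parameter tensor's magnitude
    delta = alpha * d * np.exp(alpha * c_lr) * np.linalg.norm(p)
    return p + delta
```

note the $\|p\|_2$ factor: the step size is tied to the norm of the whole tensor, not of individual entries, which is what makes the update roughly scale-invariant per tensor.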

[2] meta-training distribution is surprisingly diverse. image classification / image generation / language modeling, across MLP / CNN / vision transformer / autoencoder / VAE / transformer architectures, with different initializations. + task augmentation! (weight reparameterization / gradient renormalization / etc.). 4k TPU-months of compute. these tasks all involved relatively small models: VeLO's generalization to a 100M-param LLM was not good (compared to a tuned Adafactor baseline), but "within distribution" VeLO seems to outperform tuned Adam / Shampoo / other classical optimizers.

[3] ES as a gradient update method. meta-objective was to minimize end-of-training loss (vs. average loss per-step), and gradients were estimated with the classical antithetic scheme: perturb the parameters with sampled Gaussian noise and use the estimator $$\frac{1}{2m\sigma} \sum_{i=1}^m (L(\theta + \sigma \epsilon_i) - L(\theta - \sigma \epsilon_i))\epsilon_i,$$ where $\epsilon_i$ is drawn from a standard normal distribution. I wonder how much there is to be gained by iterating on this approach? the ES literature hasn't quite stagnated
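the estimator is a few lines of numpy. a minimal sketch (function names and the quadratic test function below are mine):

```python
import numpy as np

def es_gradient(L, theta, sigma=0.1, m=256, rng=None):
    """Antithetic ES estimate of ∇L(theta):

        (1 / (2·m·σ)) · Σ_i (L(θ + σ·ε_i) - L(θ - σ·ε_i)) · ε_i
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    grad = np.zeros_like(theta)
    for _ in range(m):
        eps = rng.standard_normal(theta.shape)
        # antithetic pair: the +/- evaluations cancel even-order terms
        # of the Taylor expansion, reducing variance vs. one-sided ES
        grad += (L(theta + sigma * eps) - L(theta - sigma * eps)) * eps
    return grad / (2 * m * sigma)
```

for a quadratic like $L(x) = \|x\|^2$ the per-sample estimate is already unbiased for the true gradient $2\theta,$ so the estimate concentrates quickly as $m$ grows.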

[4] Adam is really quite good? VeLO outperforms it by (eyeballing) ~2-5x with intense resource input, and even then isn't Pareto-dominant over tuned Adam OOD. (VeLO is also much more expensive, & doesn't generalize to RL or GNNs (as far as they tested))

[5] convinced me that truly there's a lot of performance one could eke out of meta-training a learned optimizer, provided one doesn't care about resource constraints. many cool directions to extend this to


morley rank as dimension

[notes]

Let $\mathcal{L}$ be a first order language and let $T$ be a complete $\mathcal{L}$ theory. Given

  • $M \models T$ a model of $T$,
  • $x$ a variable context,
  • $\phi \in \mathcal{L}_x(M)$ a formula with parameters from $M$ and free variables ranging in $x,$ and
  • $\alpha$ an ordinal

we define the Morley rank $\text{RM}(\phi)$ of a formula $\phi$ by transfinite recursion on the condition $\text{RM}(\phi) \geq \alpha.$ The recursion satisfies:

  • $\text{RM}(\phi) \geq 0$ if and only if $M \models (\exists x)\phi$
  • $\text{RM}(\phi) \geq \alpha + 1$ if and only if there is some elementary extension $N \succeq M$ and a sequence of formulas $(\psi_i)_{i \in \omega}$ in $\mathcal{L}_x(N)$ such that for each $i,$ both $N \models \psi_i \to \phi$ and $\text{RM}(\psi_i) \geq \alpha$ hold, and for $i \neq j$ we have $N \models \psi_i \to \neg \psi_j.$
  • $\text{RM}(\phi) \geq \lambda$ for a limit ordinal $\lambda$ if and only if $\text{RM}(\phi) \geq \alpha$ for all $\alpha < \lambda.$

We define $\text{RM}(\phi) := \alpha$ if $\text{RM}(\phi) \geq \alpha$ but $\text{RM}(\phi) \not\geq \alpha + 1;$ inconsistent formulas are given Morley rank $-1,$ and formulas satisfying $\text{RM}(\phi) \geq \alpha$ for every ordinal $\alpha$ are given rank $\infty.$ Every formula in a totally transcendental theory has ordinal-valued Morley rank.

When taking $T = \text{ACF}_p$ (the theory of algebraically closed fields of characteristic $p$), the Morley rank and Krull dimension agree (on constructible sets). Examples:

  • Take finite nonempty $X \subset K^n.$ This has dimension 0, and can be specified by some formula $\phi.$ One can verify that $\text{RM}(\psi) = 0$ exactly when the solution set of $\psi$ in $M$ is nonempty and finite, so $\text{RM}(\phi) = 0.$
  • The affine line $\mathbb{A}^1(K) = K$ has dimension 1, and $\text{RM}(K) \geq 1.$ However, the rank cannot reach 2: that would require infinitely many pairwise disjoint definable subsets of rank $\geq 1,$ but every definable subset of $K$ is either finite or cofinite, a subset of rank $\geq 1$ must be cofinite, and two cofinite sets always intersect. So $\text{RM}(K) = 1.$ (Similar arguments hold for $\mathbb{A}^k.$)
  • . . .
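As a sketch of the "similar argument" one dimension up (standard, reconstructed here): for $\mathbb{A}^2(K)$ one can witness $\text{RM} \geq 2$ with the vertical lines

```latex
\psi_a(x, y) \;:=\; (x = a), \qquad a \in K
```

Each $\psi_a$ defines a set in definable bijection with $\mathbb{A}^1(K),$ so $\text{RM}(\psi_a) \geq 1;$ the $\psi_a$ are pairwise inconsistent, and $K$ is infinite, so the successor clause of the recursion gives $\text{RM}(\mathbb{A}^2) \geq 2.$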

on ferrante

A few points I'm not sure Ferrante tried to make but I understood regardless:

[1] "Volitional strength" does not increase with age so much as "volitional complexity" does.

[2] Reminds me of my mom's stories about las colonias. From the basics (community units are families led by patriarchs) to particulars (the Lina / Melina friendship, fancy car as status symbol, loan-shark as resident, the names).

[3] The "respect for the interiority of self" that Lina possesses --- is this an authorial artifact? Faithful developmental description? Does it look like "respect" from the inside as well as the outside?

[4] Characters' relational generators flummox me.

[5] First fifty pages were the most intense reading experience of my life?