meta-learning redux
A common mistake is to take one's formalism as metaphysics. This is especially true in domains tangentially related to the study of human behavior: "just because you can be described by a coherence theorem does not mean you are a coherence theorem."
I note that the difference between cardinal and ordinal utilities is not as deep as it may seem. Cardinalists use a utility function $u: X \to \mathbb{R}$ to describe preferences, while ordinalists restrict themselves to only defining an ordering over $X.$
Under natural conditions, orderings over $X$ can be described as utility functions over $X.$ If the ordering is complete, transitive, continuous [1], and admits an order-dense subset [2], there exists a continuous function $u$ such that $u(x) \leq u(y)$ if and only if $x \leq y.$
As an example: if $X$ is any convex subset of $\mathbb{R}^n$ and $\leq$ is continuous, then this holds and there exists a corresponding continuous utility function.
[1] Give $X$ topological structure. The ordering $\leq$ is continuous if, for every $x \in X,$ the upper and lower contour sets $\{y : x \leq y\}$ and $\{y : y \leq x\}$ are closed in $X.$
[2] There exists $Z \subseteq X$ such that for every pair $x, y \in X$ with $x < y,$ there exists $z \in Z$ such that $x \leq z \leq y.$
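For intuition, here is the classical construction behind the representation theorem, sketched in Python for a finite $X$ (the function names are mine, not from any library):

```python
# Toy version of the classical construction: with a countable order-dense
# subset Z = {z_1, z_2, ...}, set u(x) = sum of 2^{-i} over the z_i with
# z_i <= x. For finite X we may take Z = X. `leq` is any complete,
# transitive comparison function.

def utility_from_order(X, leq):
    Z = list(X)
    return {x: sum(2.0 ** -(i + 1) for i, z in enumerate(Z) if leq(z, x))
            for x in X}

# Example: the usual order on a few reals.
X = [0.5, 2.0, 1.0]
u = utility_from_order(X, lambda a, b: a <= b)
assert all((u[x] <= u[y]) == (x <= y) for x in X for y in X)
```

The geometric weights $2^{-i}$ guarantee the sum converges even when $Z$ is countably infinite, which is exactly where the order-dense-subset hypothesis earns its keep.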
[1] hypernetwork architecture. there's a tensor-level LSTM that takes in bulk loss statistics and outputs a weighting vector $c_{hyper}$ used to interpolate between a bank of pre-trained per-parameter MLPs; the interpolated MLP is then passed the per-parameter loss statistics to compute parameters $d, m$ such that $\Delta p = 10^{-3} \cdot d \cdot \exp(10^{-3} \cdot c_{lr}) \cdot \|p\|_2$ (where $c_{lr}$ is also produced by the tensor-level LSTM). not immediately obvious to me where these pre-trained MLPs come from
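a minimal numpy sketch of that update rule, to make the arithmetic concrete (this is not VeLO's actual implementation; everything named `_stub` is a placeholder I invented for the learned networks, and only the final $\Delta p$ arithmetic follows the note):

```python
# Sketch of the per-parameter update described above, with stub networks.
import numpy as np

def tensor_lstm_stub(bulk_stats):
    """Placeholder for the tensor-level LSTM: returns (c_hyper, c_lr)."""
    return np.tanh(bulk_stats), float(np.mean(bulk_stats))

def mlp_bank_stub(per_param_stats, c_hyper):
    """Placeholder for the interpolated bank of per-parameter MLPs:
    returns a bounded per-parameter direction d."""
    return np.tanh(per_param_stats @ c_hyper)

def velo_style_update(p, bulk_stats, per_param_stats):
    c_hyper, c_lr = tensor_lstm_stub(bulk_stats)
    d = mlp_bank_stub(per_param_stats, c_hyper)
    # Delta p = 1e-3 * d * exp(1e-3 * c_lr) * ||p||_2
    return 1e-3 * d * np.exp(1e-3 * c_lr) * np.linalg.norm(p)

p = np.ones(4)
dp = velo_style_update(p, np.array([0.1, 0.2, 0.3]), np.ones((4, 3)))
```

note the built-in scale invariance: the step is proportional to $\|p\|_2$, with $d$ bounded, so the fixed $10^{-3}$ prefactor sets a relative (not absolute) step size.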
[2] meta-training distribution is surprisingly diverse. image classification / image generation / language modeling, across MLP / CNN / vision transformer / autoencoder / VAE / transformer architectures, with different initializations. + task augmentation! (weight reparameterization / gradient renormalization / etc.). 4k TPU-months of compute. these tasks all dealt with relatively small models: VeLO generalization to 100M param LLM was not good (compared to a tuned Adafactor baseline), but "within distribution" VeLO seems to outperform tuned Adam / Shampoo / other classical optimizers.
[3] ES as a gradient update method. the meta-objective was to minimize end-of-training loss (vs. average per-step loss), and the classical antithetic estimator was used to estimate gradients: perturb the parameters with sampled Gaussian noise and use $$\frac{1}{2m\sigma} \sum_{i=1}^m \big(L(\theta + \sigma \epsilon_i) - L(\theta - \sigma \epsilon_i)\big)\epsilon_i,$$ where each $\epsilon_i$ is drawn from a standard normal distribution. I wonder how much there is to be gained by iterating on this approach? the ES literature hasn't quite stagnated
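a direct transcription of that estimator, sanity-checked on a toy quadratic loss (`es_gradient` is my naming, not from the paper):

```python
# Antithetic ES gradient estimator from the note.
import numpy as np

def es_gradient(L, theta, sigma=0.1, m=2048, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    grad = np.zeros_like(theta)
    for _ in range(m):
        eps = rng.standard_normal(theta.shape)
        # antithetic pair: (L(theta + sigma*eps) - L(theta - sigma*eps)) * eps
        grad += (L(theta + sigma * eps) - L(theta - sigma * eps)) * eps
    return grad / (2 * m * sigma)

# Sanity check: L(theta) = ||theta||^2 has true gradient 2*theta.
theta = np.array([1.0, -2.0, 0.5])
g = es_gradient(lambda t: float(t @ t), theta)
# g approximates 2*theta up to Monte Carlo noise
```

the antithetic pairing cancels the zeroth-order term exactly (for a quadratic, each finite difference equals $2\theta \cdot \epsilon_i$ with no bias), which is part of why it remains the workhorse estimator.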
[4] Adam is really quite good? VeLO outperforms it by (eyeballing) ~2-5x with intense resource input, and even then isn't Pareto-dominant compared to tuned Adam OOD. (VeLO is also much more expensive, & cannot generalize to RL or GNNs, as far as they tested)
[5] convinced me that there truly is a lot of performance one could eke out of meta-training a learned optimizer, provided one doesn't care about resource constraints. many cool directions in which to extend this
[notes]
Let $\mathcal{L}$ be a first-order language and let $T$ be a complete $\mathcal{L}$-theory. Given a formula $\phi$ (with parameters in a sufficiently saturated model of $T$), we define the Morley rank $\text{RM}(\phi)$ by transfinite recursion on the condition $\text{RM}(\phi) \geq \alpha.$ The recursion satisfies: $\text{RM}(\phi) \geq 0$ iff $\phi$ is consistent; $\text{RM}(\phi) \geq \alpha + 1$ iff there are pairwise inconsistent formulas $(\psi_i)_{i < \omega},$ each implying $\phi,$ with $\text{RM}(\psi_i) \geq \alpha$ for all $i$; and for limit $\lambda,$ $\text{RM}(\phi) \geq \lambda$ iff $\text{RM}(\phi) \geq \alpha$ for all $\alpha < \lambda.$
We define $\text{RM}(\phi) := \alpha$ if $\text{RM}(\phi) \geq \alpha$ but $\text{RM}(\phi) \not\geq \alpha + 1.$ Every formula in a totally transcendental theory has ordinal-valued Morley rank. Inconsistent formulas are given Morley rank of $-1.$
When taking $T = \text{ACF}_p$ (the theory of algebraically closed fields of characteristic $p$), the Morley rank and Krull dimension agree (on constructible sets). For example, an affine variety $V$ has $\text{RM}(V) = \dim V$; in particular $\text{RM}(\mathbb{A}^n) = n.$
A few points I'm not sure Ferrante tried to make but I understood regardless:
[1] "Volitional strength" does not increase with age so much as "volitional complexity" does.
[2] Reminds me of my mom's stories about las colonias (the neighborhoods). From the basics (community units are families led by patriarchs) to particulars (the Lina / Melina friendship, the fancy car as status symbol, the loan-shark as resident, the names).
[3] The "respect for the interiority of self" that Lina possesses --- is this an authorial artifact? Faithful developmental description? Does it look like "respect" from the inside as well as the outside?
[4] Characters' relational generators flummox me.
[5] First fifty pages were the most intense reading experience of my life?