
An early glimpse at A2

  • Steve
  • 2 minutes ago
  • 5 min read

This post is part of a series on Architecture A2.


I've been working on designing NAM's new default architecture, "A2". In this blog post, I want to give stakeholders some insight into what I've seen so far in Stage 4 ("Optimize A2", from the original blog post linked above). The work isn't done yet--but I do want to share some of what I've found and how it's influencing my thinking. If you're a builder looking to prioritize your work, then this may be useful.


Test models, "set 2"


First off, if you're looking to adopt A2, I've got a second round of models for you to test and give me feedback on. Please see this previous post for the basic process; you'll want to repeat it and, as before, please share your results back to me so that I can be sure that A2 is on the right track.


As before, you can download set-2 from Google Drive.


The observations


Here are some general observations that you'll see reflected in set-2:


LeakyReLU is a nice activation


A1 uses the hyperbolic tangent ("Tanh") function as an element-wise activation. It happens to be a surprisingly nice activation function--models that use it are typically quite accurate. When I was designing A1, I compared it against some other typical activations and was pretty happy with it in terms of basic sample-wise mean-square error (cf. "ESR" and "null tests"). It's also a very good activation in terms of frequency-space metrics, with (pseudo*-)aliasing artifacts being rather well-managed. One can see this by swapping the Tanh out for something else and watching A1's accuracy drop.


What this doesn't take into account is that Tanh is quite expensive to compute relative to the rest of a very small neural network. When I designed A1 (standard), it was a "very, very small" network in my mind. Usually, the activation takes a negligible amount of compute; for a model as small as A1-standard, that's no longer really true.


At even smaller sizes, the activation can easily take the majority of the compute time, so its cost can't be ignored. When I started looking at the CPU-accuracy trade-off, I found that I was able to use larger models (in terms of parameter count) with lighter activations at the same CPU usage. What's surprising is that, at this same CPU usage, accuracy can be improved.


In NeuralAmpModelerCore v0.4.0, we introduced a lot of new activations. Of them, the leaky ReLU activation is particularly cheap to compute, enabling me to increase model size (and accuracy) while decreasing CPU usage.
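To make the cost difference concrete, here's a minimal plain-Python sketch of the two activations (not NAM's actual implementation; the 0.01 negative slope is just a common default):

```python
import math

def tanh_act(xs):
    # Hyperbolic tangent: smooth and accurate, but each evaluation
    # involves exponentials, which is relatively costly on a CPU.
    return [math.tanh(x) for x in xs]

def leaky_relu(xs, negative_slope=0.01):
    # Leaky ReLU: one compare and (for negative inputs) one multiply
    # per sample -- much cheaper than computing exponentials.
    return [x if x > 0.0 else negative_slope * x for x in xs]

samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
print(leaky_relu(samples))
```

For a tiny real-time model processing every audio sample, saving an exponential per channel per sample adds up quickly.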


The result has held over several experiments so far, which makes me suspect that this will be a key ingredient in A2 regardless of what other choices are made.


Bottlenecks don't seem very helpful


I introduced a "bottleneck" parameter in NAM's WaveNet layers that lets the number of channels shrink (or grow) before the activation. My rough intuition was that if a heavy activation (like Tanh) mattered not in quantity but merely in being present at all, then we could reduce how many of them need to be computed--allowing the network to grow where it matters more for performance--since not all parts of the network are likely to be equally valuable.


NAM used this core insight in the past--which is one reason why A1 isn't actually a WaveNet.**


What I've found in some limited experiments is that slimming bottlenecks appear to harm accuracy far more than they're worth, but that expanding bottlenecks at least don't hurt. For this reason, I'm growing skeptical that bottlenecks will appear in the final model.


The "layer 1x1" is the most helpful "1x1"


There are convolution layers that the WaveNet paper calls "1x1"s (a bit of a misnomer, but consistent with their use in computer vision on 2D inputs like images). These are convolutions with kernel size 1--just a basic linear layer that's applied to each sample.
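In other words, a kernel-size-1 convolution mixes channels but not time--the same linear map is applied independently at every sample. A minimal sketch:

```python
def conv_1x1(signal, w):
    # signal: [T] list of per-sample channel vectors, each of length C_in
    # w:      [C_out][C_in] weight matrix
    # Kernel size 1 means no mixing across time: the same linear map
    # is applied independently at every time step.
    return [[sum(wi * xi for wi, xi in zip(row, x)) for row in w]
            for x in signal]
```

With an identity weight matrix this is a no-op; with anything else it just remixes channels sample by sample.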


When I coded up the WaveNet originally, I assumed that these would go on the pathway that exits to the next layer (I'm calling it the "layer 1x1"), but they could also conceivably exist on the "skip-output" to the WaveNet head (a "head 1x1"). In fact, the figure in van den Oord et al., 2016 is (maybe?) a little ambiguous about what exactly they meant for it to do.


For A2, I decided to double-check this and see whether my choice was actually the best one. What I'm more convinced of so far is that the "layer 1x1" (heading upwards to the residual summing op in the figure) is more helpful than a "head 1x1" out to the "skip[-out] connections". Having both helps accuracy, but I haven't carefully checked it with CPU figured in, so the jury is still out on "both". However, "none" and "head only" appear less promising.
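To make the two placements concrete, here's a scalar-valued sketch of one layer's dataflow (all sub-modules are hypothetical callables standing in for the real convolutions, with single floats standing in for channel vectors):

```python
def wavenet_layer(x, dilated_conv, act, layer_1x1, head_1x1):
    # One layer's dataflow. The "layer 1x1" sits on the residual path
    # to the next layer; the "head 1x1" sits on the skip path to the
    # model's head.
    z = act(dilated_conv(x))
    to_next_layer = x + layer_1x1(z)  # residual path: "layer 1x1"
    to_head = head_1x1(z)             # skip-out path: "head 1x1"
    return to_next_layer, to_head
```

"None", "head only", "layer only", and "both" correspond to replacing one or both of those 1x1s with the identity.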


FiLMs aren't very useful in most places


I introduced the ability for A1's "conditioning input" (i.e. the input audio signal, for A1) not just to enter each WaveNet layer via a mixing projection that sums with the layer-to-layer information, but also to steer affine operations (scale & shift) at various points. The literature calls this "feature-wise linear modulation" (FiLM), and it has shown up in some papers in this (neural effect modeling) field, e.g. Steinmetz et al., 2021. I've seen it help in some of my own other experiments, so I figured it was worth a closer look for A2.
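The operation itself is tiny--a per-feature scale and shift, both derived from the conditioning signal. A minimal sketch:

```python
def film(x, gamma, beta):
    # Feature-wise linear modulation: each feature x[i] is scaled by
    # gamma[i] and shifted by beta[i], where gamma and beta are
    # produced (elsewhere) from the conditioning signal.
    return [g * xi + b for xi, g, b in zip(x, gamma, beta)]
```

The design question isn't whether FiLM is cheap (it is), but whether inserting it at a given point in the layer actually buys accuracy.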


As might be expected, most of this didn't really do much of anything*** in early tests. The FiLM right after the activation looks a little useful--but so does the head 1x1 module--and they do similar things, so it makes sense that they might benefit the model similarly. If there's a single FiLM that's most likely to show up in A2, it's the post-activation FiLM. However, it remains to be seen whether it's useful enough.


The conditioning module isn't obviously useful


I thought this one was a really clever idea--take part of the original network and move it "up" into a position where its outputs are directly connected as input to every layer in the main model--either as the "condition" input (previously just the input audio) or via the FiLMs above.
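A sketch of the dataflow (hypothetical names; scalars stand in for signals): the conditioner runs once on the raw input, and its output is broadcast to every main-model layer.

```python
def run_with_conditioner(x, conditioner, layers):
    # The conditioner transforms the raw input once; its output is fed
    # to every main-model layer as the conditioning signal (instead of
    # each layer seeing the raw audio directly).
    cond = conditioner(x)
    h = x
    for layer in layers:
        h = layer(h, cond)
    return h
```

Dropping the conditioner just means `cond = x`--which is exactly the simplification on the table.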


Unfortunately for my idea, it's just not clear that this helps at all. If that's the case, then leaving it out simplifies things quite a bit. Expect me to leave it out in the event of a tie (or even slight evidence in the module's favor--simplicity is valuable).


Conclusions


As I emphasized at the top, these aren't definitive findings--things may still change. That said, these are early signs, and based on how consistently the evidence points a given way, my intuition says they're OK to share. Please take them with a grain of salt.


Next steps


Once I hear back on set-2 above, I should have enough to go on to proceed with listening tests (Stage 5).


Stay tuned!


Footnotes


*There's an oft-repeated claim that neural modeling can fall short due to "the aliasing artifacts stemming from nonlinear activation functions in neural networks" (Sato et al., 2025). I have a subtle mathematical objection to this framing that deserves a careful explanation (and review), but the short version is that while aliasing-like errors can be observed in neural predictions, their source is distinct from aliasing proper. This seems to have consequences for how the errors are mitigated.


**It's a stack of WaveNet-like modules (with further differences within).


***I'm sharing negative results!

 
 

NEURAL AMP MODELER

©2025 by Steven Atkinson
