>This paper discovered that applying dropout everywhere except the hidden state dependency worked really well.
I think this is the most relevant sentence in this post, and yet it is not very clear. Do you mean that only connections from one layer to another are dropped, and connections within the same layer are left intact?
Correct. Connections in the same layer are left intact.
This makes some amount of intuitive sense. Dropout on the hidden state within a single layer inevitably causes signal decay over long context lengths: a single break in the chain within one layer is fatal, because later parts of the sequence have no way to get gradients from the parts that come before the break. That's not true for breaks across layers.
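To make that concrete, here's a minimal sketch (assuming a PyTorch-style stacked LSTM; the class name and shapes are mine, not from the post or paper) showing which connections get dropout and which stay intact:

```python
import torch
import torch.nn as nn

class StackedLSTMWithNonRecurrentDropout(nn.Module):
    """Sketch: dropout only on connections *between* layers,
    never on the hidden-state (time-step to time-step) path."""

    def __init__(self, input_size, hidden_size, num_layers, p_drop=0.5):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.LSTMCell(input_size if i == 0 else hidden_size, hidden_size)
            for i in range(num_layers)
        )
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):
        # x: (seq_len, batch, input_size)
        batch = x.size(1)
        states = [
            (x.new_zeros(batch, cell.hidden_size),
             x.new_zeros(batch, cell.hidden_size))
            for cell in self.layers
        ]
        outputs = []
        for x_t in x:                            # step through time
            inp = x_t
            for i, cell in enumerate(self.layers):
                h, c = cell(inp, states[i])      # recurrent path: (h, c) are NOT dropped
                states[i] = (h, c)
                inp = self.drop(h)               # layer-to-layer path: dropout applied here
            outputs.append(inp)
        return torch.stack(outputs), states
```

The mask on `h` only affects what the next layer (and the output) sees at that time step; the untouched `(h, c)` carried forward in time is what keeps gradients flowing across long spans.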
Great series of posts!