>This paper discovered that applying dropout everywhere except the hidden state dependency worked really well.
I think this is the most relevant sentence in this post, and yet it is not very clear. Do you mean that only connections from one layer to another are dropped, and connections within the same layer are left intact?
Correct. Connections in the same layer are left intact.
This makes some amount of intuitive sense. Dropout on the hidden state within a single layer inevitably causes signal decay over long context lengths: a single break in the chain within one layer is fatal, because later parts of the sequence have no way to get gradients from the parts that come before the break. That's not true for breaks across layers.
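To make that concrete, here's a minimal sketch (assuming a PyTorch-style stacked LSTM; the class name and shapes are mine, not from the post or paper) showing which connections get dropout and which stay intact:

```python
import torch
import torch.nn as nn

class StackedLSTMWithNonRecurrentDropout(nn.Module):
    """Sketch: dropout only on connections *between* layers,
    never on the hidden-state (time-step to time-step) path."""

    def __init__(self, input_size, hidden_size, num_layers, p_drop=0.5):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.LSTMCell(input_size if i == 0 else hidden_size, hidden_size)
            for i in range(num_layers)
        )
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):
        # x: (seq_len, batch, input_size)
        batch = x.size(1)
        states = [
            (x.new_zeros(batch, cell.hidden_size),
             x.new_zeros(batch, cell.hidden_size))
            for cell in self.layers
        ]
        outputs = []
        for x_t in x:                            # step through time
            inp = x_t
            for i, cell in enumerate(self.layers):
                h, c = cell(inp, states[i])      # recurrent path: (h, c) are NOT dropped
                states[i] = (h, c)
                inp = self.drop(h)               # layer-to-layer path: dropout applied here
            outputs.append(inp)
        return torch.stack(outputs), states
```

The mask on `h` only affects what the next layer (and the output) sees at that time step; the untouched `(h, c)` carried forward in time is what keeps gradients flowing across long spans.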
Great series of posts!