How Smooth Is Attention? – Apple Machine Learning Research

Self-attention and masked self-attention are at the heart of Transformers’ outstanding success. Still, our mathematical understanding of attention, in particular of its Lipschitz properties — which are key when it comes to analyzing robustness and expressive power — is incomplete. We provide a detailed study of the Lipschitz constant of self-attention in several practical scenarios, discussing the impact of the sequence length and layer normalization on the local Lipschitz constant of both unmasked and masked self-attention. In particular, we show that for inputs of length n in any compact set, the Lipschitz constant of self-attention is bounded by sqrt(n) up to a constant factor and that this bound is tight for reasonable sequence lengths. When the sequence length n is too large for the previous bound to be tight, which we refer to as the mean-field regime, we provide an upper bound and a matching lower bound which are independent of n. Our mean-field framework for masked self-attention is novel and of independent interest. Our experiments on pretrained and randomly initialized BERT and GPT-2 support our theoretical findings.
Figure 1: Regularity of the attention layer as a function of sequence length for different architectures.
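
To make the notion of a local Lipschitz constant concrete, the following is a minimal sketch, not the authors' experimental setup. It assumes a single-head self-attention layer with identity query, key, and value matrices (a simplification chosen here for illustration) and probes it with random perturbations. The largest finite-difference ratio found this way is only a lower bound on the local Lipschitz constant at the input X, since random directions may miss the worst one. The function names, the cube-shaped input set, and all parameter values are hypothetical.

```python
import math
import random


def self_attention(X):
    """Single-head self-attention with identity query/key/value matrices
    (an illustrative assumption): output row i is
    sum_j softmax_j(<x_i, x_j> / sqrt(d)) * x_j."""
    n, d = len(X), len(X[0])
    out = []
    for i in range(n):
        scores = [sum(a * b for a, b in zip(X[i], X[j])) / math.sqrt(d)
                  for j in range(n)]
        m = max(scores)  # subtract the max for numerical stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[j] * X[j][k] for j in range(n)) / z
                    for k in range(d)])
    return out


def frob(A):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in A for v in row))


def local_lipschitz_lower_bound(X, trials=200, eps=1e-4, seed=0):
    """Largest ratio ||f(X + eps*D) - f(X)||_F / ||eps*D||_F over random
    Gaussian directions D. This is a lower bound on the local Lipschitz
    constant at X, not its exact value."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    fX = self_attention(X)
    best = 0.0
    for _ in range(trials):
        D = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
        Xp = [[X[i][k] + eps * D[i][k] for k in range(d)] for i in range(n)]
        fXp = self_attention(Xp)
        diff = [[fXp[i][k] - fX[i][k] for k in range(d)] for i in range(n)]
        best = max(best, frob(diff) / (eps * frob(D)))
    return best


def random_input(n, d, radius, seed):
    """n tokens drawn uniformly from the cube [-radius, radius]^d,
    a compact set chosen here purely for illustration."""
    rng = random.Random(seed)
    return [[rng.uniform(-radius, radius) for _ in range(d)]
            for _ in range(n)]


if __name__ == "__main__":
    # Print the estimate for a few sequence lengths n.
    for n in (2, 4, 8, 16):
        X = random_input(n, d=4, radius=2.0, seed=n)
        print(n, round(local_lipschitz_lower_bound(X), 3))
```

Running the script prints one estimate per sequence length. Because the inputs are random and the probe only gives a lower bound, the printed values should not be read as a verification of the sqrt(n) bound stated above. An exact local constant would require the spectral norm of the Jacobian at X, for example via automatic differentiation.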