Deep Learning Diagrams

Despite being mathematical systems, deep learning models are rarely analyzed systematically. Proper analysis requires working across several abstractions. Mathematically, we are interested in how functions feed into each other, how they are parallelized, and how simple linear operations can be rearranged. Practically, we are interested in the resource costs of achieving a given mathematical goal, which lets us find the optimal execution strategy on parallelized GPU hardware. Typical approaches struggle to reconcile these different lenses.

Category theory's tools for studying abstractions allow us to relate these approaches. Furthermore, category theory lets us develop a rigorous diagrammatic language which reflects these abstractions. Our methods have successfully expressed a variety of models in full detail, exactly describing the constituent functions and their associated parallelization and linearity properties. Additionally, in FlashAttention on a Napkin, we used diagrams to quickly derive optimized execution strategies and performance models that would otherwise take years of laborious research to obtain.
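
As an informal illustration of these lenses, the sketch below (plain PyTorch written here for exposition; it is not the diagrammatic notation itself) decomposes scaled dot-product attention into small functions, with comments marking which steps are linear maps and which axes are simply broadcast over in parallel.

    # Illustrative sketch only: scaled dot-product attention decomposed into
    # small functions, annotating linear maps and broadcast (parallel) axes.
    # This mirrors the kind of structure the diagrams make explicit.
    import torch

    def project(x, w):
        # A linear map, applied independently at every sequence position.
        return x @ w

    def scores(q, k, d):
        # Bilinear in (q, k); the batch and head axes are purely broadcast.
        return (q @ k.transpose(-2, -1)) / d ** 0.5

    def attention(x, wq, wk, wv, d):
        q, k, v = project(x, wq), project(x, wk), project(x, wv)  # three parallel linear maps
        a = torch.softmax(scores(q, k, d), dim=-1)                # the only non-linear step
        return a @ v                                              # linear in v

    # Shapes are (batch, heads, sequence, feature); batch and heads are pure
    # parallel axes, broadcast unchanged through every operation above.
    x = torch.randn(2, 4, 128, 64)
    wq, wk, wv = (torch.randn(64, 64) for _ in range(3))
    print(attention(x, wq, wk, wv, d=64).shape)  # torch.Size([2, 4, 128, 64])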

Works

This research opens many branches for further work at the intersection of category theory and the practical aspects of deep learning design, including optimizing resource usage. So far, we have published FlashAttention on a Napkin in addition to Vincent Abbott's previous works.

  1. V. Abbott and G. Zardini, “FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness,” Transactions on Machine Learning Research (In Press), 2025.
    @article{abbott24-tmlr,
      title = {FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness},
      author = {Abbott, Vincent and Zardini, Gioele},
      year = {2025},
      journal = {Transactions on Machine Learning Research (In Press)},
      url = {https://zardini.mit.edu/diagrams/}
    }
    

    Abstract: Optimizing deep learning algorithms currently requires slow, manual derivation, potentially leaving much performance untapped. Methods like FlashAttention have achieved a ×6 performance improvement over native PyTorch by avoiding unnecessary data transfers, but required three iterations over three years. Automated compiled methods have consistently lagged behind. GPUs are limited by both transfers to processors and available compute, with transfer bandwidth having improved at a far slower pace. Already, transfer bandwidth accounts for 46% of GPU energy costs. This indicates the future of energy and capital-efficient algorithms relies on improved consideration of transfer costs (IO-awareness) and a systematic method for deriving optimized algorithms. In this paper, we present a diagrammatic approach to deep learning models which, with simple relabelings, derive optimal implementations and performance models that consider low-level memory. Diagrams generalize down the GPU hierarchy, providing a universal performance model for comparing hardware and quantization choices. Diagrams generate pseudocode, which reveals the application of hardware-specific features such as coalesced memory access, tensor core operations, and overlapped computation. We present attention algorithms for Ampere, which fits 13 warps per SM (FlashAttention fits 8), and for Hopper, which has improved overlapping and may achieve 1.32 PFLOPs.
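
To make the IO-awareness argument in the abstract concrete, the back-of-the-envelope estimate below (our own illustration, not a result from the paper) compares compute time with memory-transfer time for a single attention layer. The peak FLOP rate and bandwidth are assumed A100-class figures, and the byte count assumes a naive kernel that streams the full N × N attention matrix through off-chip memory.

    # Rough roofline-style estimate under assumed hardware figures; the numbers
    # below are nominal A100-class values, not measurements from the paper.
    PEAK_FLOPS = 312e12   # assumed FP16 tensor-core peak, FLOP/s
    BANDWIDTH = 2.0e12    # assumed HBM bandwidth, bytes/s
    FP16_BYTES = 2

    def naive_attention_times(n, d, heads, batch):
        flops = 4 * batch * heads * n * n * d              # the QK^T and AV matmuls
        # a naive kernel writes and re-reads the n x n score/softmax matrix in HBM
        io_bytes = 4 * batch * heads * n * n * FP16_BYTES
        return flops / PEAK_FLOPS, io_bytes / BANDWIDTH

    t_compute, t_io = naive_attention_times(n=4096, d=64, heads=32, batch=8)
    print(f"compute: {t_compute * 1e3:.2f} ms, transfers: {t_io * 1e3:.2f} ms")
    # With these assumptions, transfer time dwarfs compute time (~17 ms vs ~0.4 ms),
    # which is the overhead IO-aware algorithms avoid by keeping the attention
    # matrix in on-chip memory instead of round-tripping it through HBM.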

Future work will encompass:

  • Formalizing the category theory further, developing a symbolic framework which captures diagrams and the graphical "moves" for deriving optimizations.
  • Developing an automated framework which uses a categorical data structure to convert from standard PyTorch implementations to the symbolic framework, and from there to optimized code and accurate performance models (a toy sketch of what such a representation might look like follows this list).
  • Creating a graphical dashboard which incorporates these tools, allowing deep learning engineers to work directly with diagrams and be automatically notified of various resource usage considerations.
  • Integrating our work on resource analysis of deep learning algorithms with categorical co-design, allowing us to optimize the full deep learning stack, including hardware.
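
As a purely hypothetical sketch of the symbolic representation mentioned in the second bullet above, the snippet below shows one way sequential and parallel composition of operations could be recorded as a data structure; the class names and fields are invented for illustration and do not correspond to any existing framework.

    # Hypothetical sketch only: one possible shape for a symbolic intermediate
    # representation recording how operations compose. The names here are
    # invented for illustration and are not part of any existing framework.
    from dataclasses import dataclass
    from typing import Union

    @dataclass
    class Op:
        # An atomic operation with labelled input/output shapes.
        name: str
        in_shape: tuple
        out_shape: tuple

    @dataclass
    class Seq:
        # Sequential composition: feed one diagram into the next.
        first: "Diagram"
        then: "Diagram"

    @dataclass
    class Par:
        # Parallel composition: independent diagrams placed side by side.
        left: "Diagram"
        right: "Diagram"

    Diagram = Union[Op, Seq, Par]

    # e.g. apply the query and key projections in parallel, then feed both
    # results into a score computation.
    qk_scores = Seq(Par(Op("W_Q", (64,), (64,)), Op("W_K", (64,), (64,))),
                    Op("scores", (64, 64), (1,)))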

The aim, then, is to use category theory to create an indispensable tool for innovating efficient deep learning models.

Sample Diagrams of Complete Architectures

Diagram of the original transformer architecture from Attention Is All You Need (Jun 2017).
Mixtral-8x7B (Dec 2023) is an open-source model which beat the original release of ChatGPT (Nov 2022). The attention block captures five years of innovation, while the feed-forward layer is completely changed from a fully-connected layer to a Mixture-of-Experts, which holds a vast number of parameters, only some of which are used in each pass.
Diagram of DeepSeek-V3 (Dec 2024), an open-source model which caught up to the latest models from OpenAI, Google, and others. Note the innovations in the attention block and the wide Mixture-of-Experts feed-forward layer.