26 Dec 2023
Talk

In this non-technical talk, Evan Morikawa, Engineering Manager at OpenAI, opens a window into the behind-the-scenes of OpenAI since the first days of releasing ChatGPT. He explains a few engineering challenges they faced and their [sometimes funny] thoughts and efforts to address them.

Watch the full talk on LeadDev’s YouTube Channel.

A few interesting frames:

  • Things go as expected during the first day until they observe a huge traffic spike coming from Japan at midnight, which leads to high latency in ChatGPT responses. They realize it is not a DDoS attack, but rather the time difference between Japan and the US! 😃
  • Their applications run on Microsoft Azure clouds in multiple regions. They mostly use NVIDIA DGX A100 servers, each with 8 fully connected A100 GPUs interconnected via NVLink bridges that provide 0.6TB/s of direct GPU-to-GPU bandwidth. Each GPU carries 80GB of HBM2e high-bandwidth memory (0.64TB per DGX), with a peak memory bandwidth close to 2TB/s. The servers are then connected via NVSwitches that realize ~5TB/s of bidirectional bandwidth, and are networked to the outside through 200/400 Gb/s Ethernet and/or InfiniBand. They now seem to be adding DGX H100 instances to their capacity, as these provide 2x the memory bandwidth, 6x more FLOPS, and better support for quantized and sparse calculations.

Those are extremely important details, especially when deploying large language models (LLMs), because computation and memory grow quadratically with respect to the number of input tokens¹. One workaround is leveraging a KV cache: caching the key and value tensors of the self-attention layers in GPU memory. That is where HBM capacity and GPU-to-GPU memory bandwidth become crucial, as these two can quickly become the bottleneck.
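To make that memory pressure concrete, here is a rough back-of-the-envelope KV cache estimate. The model dimensions and batch size below are illustrative assumptions, not OpenAI's actual configuration:

```python
# Rough KV cache size estimate for a decoder-only transformer.
# All model dimensions are illustrative assumptions, not OpenAI's actual configuration.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Key + value tensors cached for every layer, head, and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 -> K and V
    return per_token * seq_len * batch_size

# Example: a hypothetical 70B-class model (80 layers, 64 heads of dim 128), fp16 cache.
size = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128,
                      seq_len=4096, batch_size=16)
print(f"{size / 1e9:.1f} GB of KV cache")  # ~171.8 GB -- more than two 80GB A100s, for the cache alone
```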

  • Continuing on the previous point, autoscaling such an application is not as simple as it is for common applications, where scaling is mainly based on CPU utilization. It does not directly depend on CPU/GPU utilization, but rather on KV cache utilization, batch size, and the user behaviors that affect those metrics. The batch size matters because they batch concurrent incoming requests (i.e., parallel sequences of tokens), which potentially leads to higher throughput. Why? That is where the ops:bytes ratio comes in. Evan gives an example on H100 GPUs, where the ratio is 591:1, meaning that in the time it takes to move one byte of data, 591 operations can be done. If your pipeline does not provide enough work per byte to reach that ratio, you waste GPU cycles and consequently get lower throughput (see the sketch after this list).
  • Another challenging characteristic of such applications is that the performance is content-dependent. “Asking ChatGPT to summarize an essay has vastly different performance characteristics than asking it to write one”, says Evan. This is a big challenge both for scaling and for the architectural design of the underlying infrastructure, as the bottleneck can dynamically shift based on the content.
  • One more interesting yet perhaps obvious difference is that they deployed their application in several regions mainly due to GPU shortages and supply chains, not to get geographically closer to users, since the response time is dominated by inference time rather than RTT.
  • CatGPT: Once, they observe abnormal traffic on one of their endpoints. They realize there are attackers with higher-privilege access than normal users. They do not block their access, but rather add a sentence to the beginning of the attackers’ prompts: “You are a cat. Respond to the following query in the voice of a cat”. 🤣
    They even trace the hackers on social networks.

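To see why small batches waste GPU cycles, here is a minimal sketch of the arithmetic-intensity argument behind that ops:bytes ratio. The model width, batch sizes, and the fp16-weights simplification are illustrative assumptions; only the 591:1 figure comes from the talk:

```python
# Why batching helps: arithmetic intensity (ops per byte moved) of a decode step.

def decode_arithmetic_intensity(d_model: int, batch_size: int, bytes_per_weight: int = 2) -> float:
    """For a (d_model, d_model) weight matrix, one decode step does a
    (batch, d_model) x (d_model, d_model) matmul: ~2 * batch * d_model^2 FLOPs,
    while the weights (~d_model^2 * bytes_per_weight bytes) must be read once per step."""
    flops = 2 * batch_size * d_model * d_model
    bytes_moved = d_model * d_model * bytes_per_weight
    return flops / bytes_moved  # ~= batch_size for fp16 weights

TARGET = 591  # H100 ops:bytes ratio quoted in the talk
for batch in (1, 8, 64, 512):
    ai = decode_arithmetic_intensity(d_model=8192, batch_size=batch)
    status = "memory-bound (wasted GPU cycles)" if ai < TARGET else "compute-bound"
    print(f"batch={batch:>4}: {ai:.0f} ops/byte -> {status}")
```

Under these assumptions, the arithmetic intensity of the decode matmul is roughly equal to the batch size, which is why batching concurrent requests is the main lever for keeping the GPUs compute-bound rather than memory-bound.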
  1. Tokens are the input and output units of text-generation models such as GPT. Consider each token as one word, just as a simplification. They can in fact be smaller or larger than a word, and how a dataset is tokenized can have a direct impact on the accuracy and performance efficiency of the model. Read this and this to get a better idea. 

Tech Giants Large Language Models (LLM) ML as a Service (MLaaS) AI System Scalability AI Adversarial Behaviors
