How many concurrent conversations can you sustain before you hit OpenAI's rate limits?

Vikram Oberoi · August 11, 2023 · vikramoberoi.com · 4 min read

[Cartoon: a woman admiring a robot, captioned “Sure, he didn’t pass the Turing test. But what was so great about being human anyway?”]
This bot can unfortunately sustain at most one concurrent conversation. Credit: katiebcartoons.com

Over at Baxter we’ve been working with a client to build an LLM-powered chat bot that we’re deploying into Fortune 100 enterprises.

Employees at these enormous companies will chat with this bot and, uh, unfortunately I cannot share much more than that.

I recently had to ballpark how many concurrent chats we could support before we’d hit our OpenAI rate limit. I came up with a basic model and am sharing it here in case it comes in handy for others.

Here’s the kind of bot for which this ballpark estimate makes sense: a chat interface where users have back-and-forth conversations with an LLM over the course of a single sitting.

This seems like it holds for gobs of products and features I see in the market.

Here are the assumptions the model makes:

  1. Every user message triggers exactly one ChatCompletion request.
  2. Each request includes the entire conversation so far, so the tokens per request grow cumulatively.
  3. Conversations run their course in one sitting rather than trickling on over days.

Deriving a ballpark estimate

OpenAI’s rate limits consist of two values:

  1. TPM: tokens per minute.
  2. RPM: requests per minute.

GPT-4’s base values are 40,000 TPM and 200 RPM. You are rate limited the moment you hit either one.

This means that if 200 people log in and send a message, your service falls over immediately. Everyone will send one message, and then their next message will fail.
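A quick back-of-envelope check against GPT-4’s base limits makes this concrete (the per-request token count is a made-up, illustrative figure):

```python
# GPT-4's base rate limits.
TPM_LIMIT = 40_000   # tokens per minute
RPM_LIMIT = 200      # requests per minute

users = 200
tokens_per_first_request = 300  # assumption: system prompt plus first message, illustrative

# 200 users sending one message each consumes the entire request budget for that minute...
print(users >= RPM_LIMIT)                # True: the next request gets rate limited
# ...and likely blows through the token budget as well.
print(users * tokens_per_first_request)  # 60,000 tokens, already past the 40,000 TPM limit
```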

What we want to know is: what is the maximum # of users who can sustain a conversation without getting rate limited?

That number, whatever it is, is far less than 200.

To get a rough estimate, you want to grab these numbers for your bot:

  1. Chat RPM: the average number of ChatCompletion requests per minute for a single conversation.
  2. Chat TPM: the average number of tokens per minute you send for a single conversation.

I recommend computing these from real chat sessions your users have had; there’s a sketch of this computation below.

Between OpenAI’s response times and the time a user spends pausing, thinking, and typing their next message, you’re likely to get somewhere between 1 and 3 requests per minute per chat.

Meanwhile, because the number of tokens you send to OpenAI grows cumulatively over the course of a conversation, Chat TPM is your likely bottleneck, and it may surprise you how big it gets.

I recommend capping the length of these conversations at what you believe is a reasonable duration. You want to remove outliers – for example, if a user had a conversation on day 1 and then came back on day 2 to send another message.

(Unless that’s the norm for your app, in which case you might not want to use this model.)
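Here’s a minimal sketch of that computation. The session format is made up for illustration: assume you’ve logged each conversation’s wall-clock duration and the token count of every ChatCompletion request it made.

```python
from dataclasses import dataclass

@dataclass
class ChatSession:
    duration_minutes: float           # wall-clock length of the conversation
    request_token_counts: list[int]   # tokens sent for each ChatCompletion request

def per_chat_rates(sessions: list[ChatSession]) -> tuple[float, float]:
    """Return (chat_rpm, chat_tpm): average requests and tokens per minute per conversation."""
    rpms = [len(s.request_token_counts) / s.duration_minutes for s in sessions]
    tpms = [sum(s.request_token_counts) / s.duration_minutes for s in sessions]
    return sum(rpms) / len(rpms), sum(tpms) / len(tpms)

# Example: a single 10-minute conversation whose requests grow as the history accumulates.
sessions = [ChatSession(10.0, [500, 900, 1400, 2000, 2700, 3500])]
chat_rpm, chat_tpm = per_chat_rates(sessions)
print(chat_rpm, chat_tpm)  # 0.6 requests/min, 1100 tokens/min
```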

Once you have these numbers, the # of users who can sustain a conversation concurrently is the lesser of:

  1. Your TPM limit divided by your Chat TPM.
  2. Your RPM limit divided by your Chat RPM.

That’s it!
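In code, the whole estimate comes down to two divisions and a min(). Here’s a sketch with GPT-4’s base limits and illustrative per-chat numbers; plug in your own:

```python
TPM_LIMIT = 40_000   # GPT-4 base tokens-per-minute limit
RPM_LIMIT = 200      # GPT-4 base requests-per-minute limit

chat_tpm = 1_100     # assumption: average tokens per minute per conversation
chat_rpm = 2         # assumption: average requests per minute per conversation

max_concurrent_chats = min(TPM_LIMIT / chat_tpm, RPM_LIMIT / chat_rpm)
print(int(max_concurrent_chats))  # ~36 conversations; the TPM limit is the binding one here
```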

The longer your users’ conversations last, the fewer of them you’ll be able to sustain concurrently. Again, this is because of the cumulative growth in the tokens you send over the course of a conversation: every ChatCompletion request carries the entire conversation so far.
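To see that growth concretely, here’s a sketch with made-up per-message token counts. The point is that each request carries the running total of the whole conversation:

```python
# Hypothetical token counts for each message in a short chat,
# alternating user message / assistant reply.
message_tokens = [120, 300, 80, 250, 150, 400]

history_tokens = 0
prompt_tokens_per_request = []
for i, tokens in enumerate(message_tokens):
    history_tokens += tokens
    if i % 2 == 0:  # even index = a user message, which triggers a ChatCompletion request
        # The request carries the whole conversation so far, not just the new message.
        prompt_tokens_per_request.append(history_tokens)

print(prompt_tokens_per_request)       # [120, 500, 900]: each request is bigger than the last
print(sum(prompt_tokens_per_request))  # 1520 tokens sent against your TPM, for one chat so far
```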