Over at Baxter we've been working with a client to build an LLM-powered chat bot that we're deploying into Fortune 100 enterprises.
Employees at these enormous companies will chat with this bot and, uh, unfortunately I cannot share much more than that.
I recently had to ballpark how many concurrent chats we could support before we'd hit our OpenAI rate limit. I came up with a basic model and am sharing it here in case it comes in handy for others.
Here's the kind of bot for which this ballpark estimate makes sense:
- It uses OpenAI's ChatCompletion APIs.
- The user talks, the bot talks, the user talks, the bot talks, and so on...
- Every time a user sends a chat, that chat is sent to a ChatCompletion endpoint along with the entire message stack: all messages up to and including the user's latest message.
This seems like it holds for gobs of products and features I see in the market.
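In code, that per-turn flow looks roughly like this. A minimal sketch: the helper name and the conversation contents are made up for illustration, not our actual implementation.

```python
# Minimal sketch of the chat loop described above. Every user turn
# sends the ENTIRE message history to the ChatCompletion endpoint,
# not just the newest message.

def build_request_messages(history, user_message):
    """Append the user's new message to the history and return the
    full message stack to send to the ChatCompletion endpoint."""
    history.append({"role": "user", "content": user_message})
    return list(history)  # a snapshot of the whole conversation so far

# Hypothetical usage: each turn re-sends everything.
history = [{"role": "system", "content": "You are a helpful bot."}]
first = build_request_messages(history, "Hi!")
# ... call ChatCompletion with `first`, then append the reply ...
history.append({"role": "assistant", "content": "Hello!"})
second = build_request_messages(history, "Tell me more.")
```

Note that `second` contains four messages even though the user only typed one new thing; that re-sending is what the rest of this model hinges on.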
Here are assumptions the model makes:
- The concurrent chats are all fresh, new conversations as opposed to a long conversation being re-entered.
- They all start at the same time.
- We're looking for the number of these conversations that can run concurrently...
- ... without any of them running into one of OpenAI's rate limits.
Deriving a ballpark estimate
OpenAI's rate limits consist of two values:
- TPM: tokens per minute.
- RPM: requests per minute.
GPT-4’s base values for these are 40,000 TPM, 200 RPM. You are rate limited the moment you hit one of these.
This means that if 200 people log in and send a message, your service falls over immediately. Everyone will send one message, and then their next message will fail.
What we want to know is: what is the maximum # of users that can sustain a conversation without getting rate limited?
That number, whatever it is, is far less than 200.
To get a rough estimate, you want to grab these numbers for your bot:
- The average Chat RPM: The average # of requests a regular chat session makes per minute.
- The average Chat TPM: The average # of tokens a regular chat session sends per minute.
I recommend computing these from real chat sessions your users have done:
- Chat RPM maps to the number of `user`-role messages sent per minute in a session.
- Chat TPM takes more work. Whenever a user sends a message, you have to sum up the tokens of all messages up to and including the user's message. Add up these per-request token counts and divide by the # of minutes the chat lasted.
- Use tiktoken to count tokens.
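As a sketch, here's how you might compute both averages from one logged session. The function name and transcript format are my own; the token counter is injectable so you can pass in tiktoken (or anything else) without hardcoding it:

```python
def chat_rpm_and_tpm(messages, duration_minutes, count_tokens):
    """Estimate one session's Chat RPM and Chat TPM.

    messages: the session transcript in ChatCompletion format,
              e.g. [{"role": "user", "content": "..."}, ...]
    duration_minutes: how long the chat lasted
    count_tokens: callable mapping a string to a token count
    """
    requests = 0       # one ChatCompletion request per user message
    tokens_sent = 0    # cumulative tokens across all requests
    for i, msg in enumerate(messages):
        if msg["role"] != "user":
            continue
        requests += 1
        # Each request carries every message up to and including
        # this user message, so sum the whole prefix.
        tokens_sent += sum(count_tokens(m["content"])
                           for m in messages[: i + 1])
    return requests / duration_minutes, tokens_sent / duration_minutes
```

With real data you'd pass something like `lambda s: len(tiktoken.encoding_for_model("gpt-4").encode(s))` as the counter, then average these two numbers over many sessions.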
Between OpenAI's response times, a user pausing/thinking, and then sending a message, you're likely to get somewhere between 1 and 3 requests per minute per chat.
Meanwhile, because the number of tokens you send to OpenAI every time a user chats grows cumulatively, Chat TPM is your likely bottleneck and it might even surprise you how big it gets.
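To see why, here's a quick back-of-the-envelope sketch. The 100 tokens-per-turn figure is purely illustrative:

```python
# If every turn adds roughly the same number of tokens, the tokens
# sent to OpenAI grow quadratically with conversation length: the
# k-th request re-sends all k turns so far.

def total_tokens_sent(turns, tokens_per_turn):
    """Total tokens sent to OpenAI across a whole conversation."""
    return sum(k * tokens_per_turn for k in range(1, turns + 1))

# 10 turns at ~100 tokens each is only 1,000 "new" tokens, but
# re-sending the growing history means far more tokens actually sent.
sent = total_tokens_sent(turns=10, tokens_per_turn=100)
```

At 10 turns you've sent 5.5x the raw conversation length; at 20 turns, 10.5x. That multiplier is why Chat TPM surprises people.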
I recommend capping the length of these conversations at what you believe is a reasonable duration. You want to remove outliers – for example, if a user had a conversation on day 1 and then came back on day 2 to send another message.
(Unless that's the norm for your app, in which case you might not want to use this model.)
Once you have these numbers, the # of users who can sustain a conversation concurrently is the lesser of:
- OpenAI RPM Rate Limit / the average Chat RPM
- OpenAI TPM Rate Limit / the average Chat TPM
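Under the model's assumptions, that's just the min of two ratios. The limit values below are GPT-4's base limits from earlier; the per-chat averages are made-up illustrative numbers:

```python
def max_concurrent_chats(rpm_limit, tpm_limit, chat_rpm, chat_tpm):
    """Ballpark # of conversations that fit under both rate limits."""
    return min(rpm_limit / chat_rpm, tpm_limit / chat_tpm)

# Illustrative numbers: 2 requests/min and 4,000 tokens/min per chat
# against GPT-4's base limits of 200 RPM and 40,000 TPM.
# RPM alone would allow 100 chats, but TPM allows only 10.
estimate = max_concurrent_chats(rpm_limit=200, tpm_limit=40_000,
                                chat_rpm=2, chat_tpm=4_000)
```

In this made-up example TPM is the binding constraint, which matches what I'd expect for most chat bots of this shape.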
The longer your users' conversations last, the fewer of them you'll be able to sustain concurrently. Again, this is because of the cumulative growth in the tokens you send over the course of a conversation: every ChatCompletion request carries the entire conversation so far.
Identified a major issue? Have a question?
Holler over at Twitter, or uh, I mean X: https://twitter.com/voberoi.