
AI deployment strategies: balancing efficiency and environmental impact

AI, sustainability, green, deployment, performance
07 November 2024
Ori Pekelman
Chief Strategy Officer

Video transcript

So, the actual title of the talk is "LLM Deployment Strategies: Balancing Efficiency and Environmental Impact." I know, nobody really wants to hear "the environmental talk." But don't worry, this is going to be a feel-good talk. Here's a kitten! Generating it cost only about 30 grams of CO₂. And if I'm getting this right, the CO₂ emissions from all LLM applications on Earth currently amount to less than 0.0274% of global emissions. That's really not much.

In fact, we should give ourselves a pat on the back. The CO₂ emissions from LLMs are probably less than 0.03 megatons per day—barely 11 megatons a year. As an industry, we actually started caring. We got better. I can’t see you all, but show of hands—how many of you know if your company has a climate pledge or is carbon neutral? A few of you, maybe?

The thing is, while we’ve gotten better, we still have a long way to go. We’ve started burning coal again like there’s no tomorrow, which ironically leads to there being no tomorrow. Every ton of CO₂ emitted stays in the atmosphere—it’s just thermodynamics; you can’t argue with science.

This isn’t about being a tech optimist or a progressive. Science doesn’t work like that. CO₂ doesn’t just go away. Right now, LLM emissions are about ten times what bombs emit. And bombs are, well, not great—some people disagree, but we’ll leave it at that. And it doesn’t really matter if you’re sitting next to a hydroelectric plant or not. Most of the emissions are embedded in the processes we use.

So, let’s take a look at the scale here. About two million H100s were sold this year, along with a few Grace Hopper superchips, each card drawing around 700 watts. That works out to roughly 5,600 watts for an eight-GPU machine, and once you add a power usage effectiveness (PUE) overhead on top, we’re looking at about 12 megatons added to emissions this year alone. And that’s on top of the 11 megatons I just mentioned.
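If you want to redo that arithmetic, here is a minimal back-of-envelope sketch. The GPU count and the 700-watt figure come from the talk; the PUE and grid carbon intensity are assumptions plugged in for illustration, and they dominate the uncertainty.

```python
# Back-of-envelope estimate of annual emissions from the new GPUs.
# GPU count and per-card draw are from the talk; PUE and grid carbon
# intensity are assumptions, not measurements.
H100_COUNT = 2_000_000            # cards sold this year
WATTS_PER_GPU = 700               # per-card draw
PUE = 1.5                         # assumed data-center overhead
KG_CO2_PER_KWH = 0.45             # assumed average grid carbon intensity
HOURS_PER_YEAR = 24 * 365

energy_kwh = H100_COUNT * WATTS_PER_GPU / 1000 * PUE * HOURS_PER_YEAR
megatons = energy_kwh * KG_CO2_PER_KWH / 1e9   # kg -> Mt

print(f"~{energy_kwh / 1e9:.0f} TWh/year, ~{megatons:.0f} Mt CO2/year")
# With these assumptions: roughly 18 TWh and 8 Mt per year; add the embodied
# emissions from manufacturing and you land in the 12 Mt ballpark quoted above.
```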

We’re more than doubling our emissions annually, which means, if we keep up this trajectory, we’re looking at around half a gigaton of emissions in the next five years. For context, the world emits about 40 gigatons a year right now. This trend is not sustainable, and it’s clear on the graphs.
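The trajectory claim is just compounding. A quick sketch, taking this year's roughly 12 additional megatons as the starting point and assuming the doubling continues:

```python
# Project the "more than doubling every year" trend for five years.
# Starting point and growth rate are the talk's figures, nothing more.
annual_mt = 12.0    # this year's additional emissions, in Mt CO2
total_gt = 0.0
for year in range(1, 6):
    total_gt += annual_mt / 1000          # Mt -> Gt
    print(f"year {year}: +{annual_mt:.0f} Mt, cumulative ~{total_gt:.2f} Gt")
    annual_mt *= 2
# Ends around 0.4 Gt cumulative; start from the full ~23 Mt (11 + 12) and you
# overshoot half a gigaton, which is the order of magnitude in the talk.
```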

Commercial Break! This talk is sponsored by Reality and Wily. So, I actually licensed a cartoon for this presentation—it cost me 40 bucks. And yes, I also pay for a couple of LLM subscriptions, like ChatGPT and Claude from Anthropic, costing around 40 bucks a month. It’s all-you-can-eat, so I could technically ask them to generate a similar cartoon for free, but it wouldn’t be quite the same.

But hey, I did the math, and the CO₂ compensation we’d need for this 12-megaton footprint would cost about 0.7 billion dollars, a number that sounds like “chump change” for some in this industry. The largest direct air capture facility on Earth, Orca in Iceland, only manages about 4,000 tons of CO₂ per year. Compare that to 12 megatons, and, well, you get the picture.
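For what it's worth, the arithmetic behind both comparisons is short. The offset price per ton below is an assumption chosen to match the figure quoted; Orca's capacity is as stated in the talk.

```python
# Sanity check on the compensation figure and the Orca comparison.
emissions_t = 12_000_000          # ~12 Mt CO2, from the talk
offset_usd_per_t = 58             # assumed offset price per ton
orca_t_per_year = 4_000           # Orca's annual capture, from the talk

print(f"offset bill: ~${emissions_t * offset_usd_per_t / 1e9:.1f} billion")
print(f"Orca-years to capture it: {emissions_t / orca_t_per_year:,.0f}")
```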

All right, can I stop yelling about CO₂ for a moment and dive into the technical part? Almost, I promise! Being the scientific, technical person I am, I asked a few LLMs how much CO₂ they generated to answer the question of how much CO₂ they generate. And, of course, they didn’t give me any answers. Reinforcement learning through marketing feedback—RLFM—is what I call it. It’s not about alignment; it’s more about sidestepping. They’re basically trained to avoid talking about their carbon footprint.

Just try it—ask Gemini or another LLM how much CO₂ they emit. They’ll send you to a nice blog post, assuring you that no CO₂ is emitted, at least according to them. But let’s get technical. You need a graph in this kind of talk, right? So, I set up a basic graph with "Predictability to Risk" on one axis and "Legacy to Innovation" on the other. This graph can represent LLM deployment strategies quite effectively.

In terms of machine learning model deployment, we’re talking about everything from simple SQL queries to training our own foundational models. And, for the most part, these applications are about information retrieval: you have some data, you want to ask questions, and you want answers. The beauty of this kind of graph is that you can switch the axis labels, and the dots on it would still make sense.

We can label the x-axis as "Measured in milliseconds" versus "Measured in months" or “Working software” versus “Hardware working really hard.” Or, to put it more bluntly, “I care” versus “I don’t care” about global warming. One of the most interesting dimensions we could label is money. Running SQL is cheap; we’ve been doing it for years. Running your own GPUs, however, costs a fortune.

So, let’s talk about the trade-offs here. As software engineers, you know that everything comes down to trade-offs—optimizing one quality of the system often means sacrificing another. Take training LLMs, for example. It’s a complex process that can fail spectacularly, leaving us with an overfitted model that generalizes poorly. The bad news is, in that case, we don’t get any compression or true learning—what we end up with is essentially a database.

And sure, you could use an overfitted LLM as a kind of database. Say you’ve fine-tuned it to know your customer’s first name. Is it efficient for that? Maybe, for the developer who doesn’t need to request a new datastore from DevOps or a new column in the database. But it adds latency: on a cache timescale, every query costs months of extra processing time, because it’s not running through an L1 or L2 cache. It could even retrieve incorrect data.
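For contrast, here is the boring alternative the talk keeps coming back to. The schema and names are hypothetical, but the point stands: a single indexed read is deterministic, cheap, and cannot hallucinate a different first name.

```python
import sqlite3

# The "new column in the database" version of knowing a customer's first name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT)")
conn.execute("INSERT INTO customers VALUES (42, 'Ada')")

# One indexed lookup: microseconds, no GPU, and the answer is always 'Ada'.
row = conn.execute("SELECT first_name FROM customers WHERE id = ?", (42,)).fetchone()
print(row[0])
```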

We’ve all been there, though. Back in the day, we used Oracle for everything. Now, we might put data in a Docker layer file and hope for the best. If your only tool is a neural network, then yes, maybe your PyTorch file ends up being your database. It happens, even in production.

When issues arise, we lean on prompt engineering. We try to filter out things the model shouldn’t show, hoping it’ll work fine even if we update the model or switch vendors. This is just a new kind of technical debt—our industry’s latest invention. And let’s be real, there’s a tragic irony to this approach: if you expose an LLM to user input without validation, you’re essentially setting yourself up for disaster.
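If you do put a model in front of user input, the minimum is to validate that input before it ever reaches the model, rather than hoping a prompt filters the output afterwards. A rough sketch, with made-up rules that are illustrative rather than a real safety layer:

```python
import re

MAX_LEN = 500
SUSPICIOUS = re.compile(r"ignore (all|previous) instructions|system prompt", re.I)

def validate_user_input(text: str) -> str:
    """Reject input before it is ever interpolated into a prompt."""
    text = text.strip()
    if not text or len(text) > MAX_LEN:
        raise ValueError("input is empty or too long")
    if SUSPICIOUS.search(text):
        raise ValueError("input looks like a prompt-injection attempt")
    return text

# Only validated text gets near the model (the prompt itself is illustrative).
prompt = f"Answer only from the provided context: {validate_user_input('How do embeddings work?')}"
```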

And it’s not just our input. The whole world’s data becomes the user input for LLMs, meaning we have limited control. Retrieval-Augmented Generation (RAG) doesn’t change that reality much. It’s just another layer on top, another “friendly” abstraction. So, when you’re using embeddings from a big model, remember: it’s not a straightforward representation of the latent layers. These embeddings carry tons of semantic information that can often be used effectively within your vector store.

Using a vector store effectively could save you a significant amount of resources. Many tasks could be accomplished with simple distance queries on data you already have in your vector database. This approach is much cheaper, but here’s the thing: the money will run out eventually. What we’re currently doing—calling massive models hosted on top-of-the-line hardware like Grace Hoppers—won’t be financially viable in the next five years. The science doesn’t support that we’ll suddenly make these processes affordable.
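As a concrete sketch of what "simple distance queries" means: once the corpus embeddings exist, relevance ranking is a dot product, not another call to a hosted model. The vectors below are random stand-ins for whatever model populated your store.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 384))            # precomputed corpus embeddings
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

query = rng.normal(size=384)                   # precomputed query embedding
query /= np.linalg.norm(query)

# Cosine similarity on normalized vectors is just a matrix-vector product.
scores = docs @ query
top5 = np.argsort(scores)[::-1][:5]
print(top5, scores[top5])
```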

So, what are the green parts of this equation? They represent things we know how to do without GPUs, tasks that can run on CPUs. CPU utilization is well-understood, predictable, and stable. But GPU utilization? That’s still a mystery to most of us.

Your job is to figure out these trade-offs. Yes, it’s more code to write, and yes, it might not be as glamorous as using the latest, cutting-edge features. But what you get in return is a more stable, economical system. Every time you run an uncached query rather than hitting an LLM, you’re saving resources. Think of it this way: every time you rely on an LLM for a query, a kitten metaphorically “dies.” Use the tools you already have—like PostgreSQL and PG Vector—to save the kittens. This isn’t about blaming some mysterious “them”; this is about us, and our decisions.
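And since the talk names them, here is a minimal sketch of that kind of lookup with PostgreSQL and PG Vector, via psycopg. The connection string, table, and vector size are hypothetical; the <-> operator is pgvector's distance operator.

```python
import psycopg  # psycopg 3; the pgvector extension must be installed in Postgres

# Nearest-neighbour lookup with PG Vector instead of an LLM call.
with psycopg.connect("dbname=app") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs "
        "(id bigserial PRIMARY KEY, body text, embedding vector(384))"
    )
    # A precomputed query embedding, passed as a pgvector literal.
    query_embedding = "[" + ",".join(["0.1"] * 384) + "]"
    rows = conn.execute(
        "SELECT id, body FROM docs ORDER BY embedding <-> %s::vector LIMIT 5",
        (query_embedding,),
    ).fetchall()
    print(rows)
```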

Thank you all, I’m Ori. Take care!
