What’s the problem?
- Infrastructure designed for workloads that predate agents and LLMs is proving insufficient for them, and engineering teams are hitting failure modes more often.
- Right now, companies are hiring FTEs whose sole job is to scale up, scale down, and manually tune infrastructure to match the demands these new workloads impose.
- Large amounts of capital are being deployed to secure GPUs and to keep them “hot” to guarantee they can handle traffic.
- The user experience of these products is degrading; users describe high latency, flakiness, and random unavailability.
What’s causing the problem, why now?
- The shape of these workloads - their size, type, and length - now differs with the duration of the task, the tokens required, and individual customer requirements, all of which are unknown to those doing capacity planning.
- LLM-driven software engineering is rapidly changing the performance characteristics and resources needed to achieve the same task - often weekly, or at higher-velocity companies, daily.
- These new AI use cases chain together highly compartmentalized product components, which makes it hard to measure and capacity plan properly. In one case, a Series B company still struggles to approximate how many individual sessions a single container can handle.
- Infrastructure autoscalers, like those found in Kubernetes, are increasingly inadequate at matching infrastructure to demand: modern AI infrastructure spans multiple clusters, managed GPU providers, and third-party APIs.
What if there was something that took in your workload, listened to your infrastructure, and told you when it was suitable to do the work?
Arklow
- Arklow will solve this problem by building a new class of infrastructure: admission control.
- In doing so, Arklow will protect your products from poor performance, cut your “running hot” overspend, and reduce these hard-to-model failure modes.
- We learn from the metrics you already have in Datadog, Prometheus, and other aggregators how your system responds to work today - and when you need to scale, on a per-workload and per-customer basis. We can then instruct each of your infrastructure components (self-hosted or external) to scale, or to hold traffic until you’re ready.
- Arklow operates in two ways: sitting in front of your workloads, or being instrumented into your product by your engineering teams. In both cases, we ensure that resources are available when you need them, without the constant guessing.
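As a sketch of what admission control looks like in practice - the names below are hypothetical, not Arklow’s actual API - the core loop is: read a headroom signal derived from your existing metrics, admit work when there is capacity, and hold it otherwise:

```python
from collections import deque

class AdmissionController:
    """Minimal sketch of workload admission control.

    `capacity_signal` is any callable returning current headroom
    (0.0 = saturated, 1.0 = fully idle). In practice this would be
    derived from metrics such as GPU utilization or queue depth.
    """

    def __init__(self, capacity_signal, threshold=0.2):
        self.capacity_signal = capacity_signal
        self.threshold = threshold
        self.held = deque()  # work held back until capacity frees up

    def submit(self, work):
        """Run `work` now if there is headroom; otherwise hold it."""
        if self.capacity_signal() >= self.threshold:
            return work()
        self.held.append(work)
        return None

    def drain(self):
        """Run held work, oldest first, while capacity remains."""
        results = []
        while self.held and self.capacity_signal() >= self.threshold:
            results.append(self.held.popleft()())
        return results
```

The design choice being illustrated: the decision of *when* to do work is separated from the work itself, so the same controller can front any workload or be called from inside product code.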
Background
2024 - Replit
I joined Replit to work on anti-abuse engineering - in short, finding engineering solutions that reduce malicious actors’ abuse of the platform through excessive resource usage or fraud. Surprising things happen when you’re tasked with cleaning up or banning thousands of users all at once: it puts strain on your infrastructure.
To account for this, a large part of the engineering effort was building, from first principles, a queue system to decide when work should get done. It also needed to be easy for other developers on the team to add new types of work for the queue to route. We spent most of our time fighting against how good the system was at suffocating us.
At the time, our constraint was the database, where the load was causing outages.
2025 - Bland
At Bland, the same problem emerged, but in a new landscape. Our customers each wanted to make thousands of calls per day. We wanted to satisfy all of the load, at all times - which drove our GPU spending considerably higher. Most customers cared that their calls went out and were durably made - a few minutes here or there didn’t matter. Some customers cared about calls going out very fast, and when they did, they expected quality - not latency or unavailability.
The same demand was made of me here again: build something that performs the critical job of deciding when work can be done.
In this case, our constraint was GPU cost, spending, and availability. When we had an easy way to spread out work, our GPU needs were smoother.
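The “spread out work” idea can be sketched as a two-tier scheduler - illustrative names, not the system we actually built at Bland - where urgent calls run immediately and bulk calls are released on a fixed cadence, capping peak concurrency and therefore the number of GPUs that must be kept hot:

```python
import heapq
import itertools

class CallScheduler:
    """Sketch of smoothing GPU demand by spacing out bulk work.

    Urgent calls get a slot at their arrival time; bulk calls are
    placed on a fixed cadence, so no matter how many arrive at once,
    they drain at a steady, plannable rate.
    """

    def __init__(self, bulk_interval_s):
        self.bulk_interval_s = bulk_interval_s
        self.next_bulk_slot = 0.0
        self.queue = []               # (run_at, seq, call)
        self.seq = itertools.count()  # tie-breaker for equal slots

    def schedule(self, call, now, urgent=False):
        """Assign `call` a time slot and return it."""
        if urgent:
            run_at = now
        else:
            # next free slot on the fixed bulk cadence
            run_at = max(now, self.next_bulk_slot)
            self.next_bulk_slot = run_at + self.bulk_interval_s
        heapq.heappush(self.queue, (run_at, next(self.seq), call))
        return run_at

    def due(self, now):
        """Pop every call whose slot has arrived."""
        ready = []
        while self.queue and self.queue[0][0] <= now:
            ready.append(heapq.heappop(self.queue)[2])
        return ready
```

With a 60-second cadence, a customer submitting a thousand bulk calls at once sees them go out steadily over time, while urgent calls still jump straight to the front - which is exactly the trade the two customer groups above were asking for.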