April 9, 2024

A Rising Player Surfing the GPU Wave: How Inference.ai Plans to Become the “New AWS of the AI Era”

“The world we live in is built on NVIDIA GPUs,” NVIDIA CEO Jensen Huang said in a recent media interview.

Indeed, this is no exaggeration: generative AI, led by OpenAI’s ChatGPT, is sweeping the globe, and AI apps are now multiplying at an unprecedented pace, fueling a surge in demand for GPUs and turning them into an increasingly scarce, hot commodity. Fletcher Previn, Cisco’s Chief Information Officer, recently shared that NVIDIA chips are transported in armored trucks, underscoring just how valuable GPUs have become.

We sat down with John Yue, the founder and CEO of Inference.ai, to pick his brain for insights into how to navigate the fast-emerging GPU shortage:

“I must say, this is a savage game. Whoever has the GPUs wins. It doesn’t even matter what the price is or how good the user interface is. In 90% of cases, the person with the GPU resources is the one who wins the business.” John goes on to explain, “If you go to a public cloud provider for GPUs, you often have to wait one to two quarters, and you might not even be able to get one. Plus you’ll have to pay a lot of money upfront, which is a big problem for startups and companies that want to get into AI.”

What is more, these shifts are creating a new status quo for the industry. As John puts it, “In the AI era, Infrastructure as a Service (IaaS) is going to change dramatically because the days of unlimited resources are gone. This is something that the industry needs to address. As for us, we have a lot of computing power.” 

When you visit Inference.ai’s website, you will find a range of NVIDIA GPU chips. Scrolling further down, you’ll also see products from AMD and Intel. At a time when not even money can guarantee GPU access, Inference.ai’s à la carte menu of GPU chips (featuring NVIDIA, AMD, and Intel) can offer users the best seats at a very exclusive table.

Inference.ai boasts the largest and most diverse fleet of GPUs in the cloud

Inference.ai cannot predict the future, nor did it anticipate today’s GPU shortages years in advance; nonetheless, it is definitely a pioneer in the field. When John and his co-founders started the company, they chose the “distributed infrastructure” model, giving them access to a broad pool of CPU and GPU resources at an early stage. This is now allowing them to reap the rewards of the AI boom.

John is a serial entrepreneur from Canada whose past ventures have revolved around hardware and IaaS. “A few years ago, my co-founders and I decided that if we wanted to succeed in this business, we had to keep up with major trends,” he said. “The most important trend we identified at that time was that data was proliferating at an unprecedented rate. In the future, there will be more and more data, with the demand for storage and computing power only increasing and never decreasing.”

Current research confirms John’s viewpoint. Market intelligence firm IDC reports that, as the Internet of Things and cloud computing have become mainstream, the quantity and variety of data have grown exponentially. IDC now projects that by 2025, the global datasphere will balloon to an astonishing 163 zettabytes (163 trillion gigabytes), a tenfold increase from the 16.1 ZB of data generated in 2016.

John anticipated this trend early on. “Our initial thinking was very straightforward: the amount of data grows rapidly; therefore, so does the need for storage. However, data storage technology has not advanced significantly in several years. The user experience was subpar, and prices were extremely high.”

John resolved to tackle these pain points and, together with his partners, built relationships with more than 20,000 data centers, offering providers new ways to monetize their unused cloud storage resources. As a result of these efforts, users now have the option to store their data across 80 different servers provided by Inference.ai. This not only reduces costs by as much as 90% compared to traditional public cloud storage but also significantly decreases the risk of data loss and hacking.

Through this confluence of vision and enterprise, John was ahead of everyone else when it came to securing idle CPU and GPU resources. However, he couldn’t have foreseen how, in just a few short years, generative AI would revolutionize the industry. As competition for GPUs intensified, Inference.ai was propelled to the forefront of this emerging industry.

The rise of the “GPU as a Service” market

GPUs are the engines driving the current explosion of AI applications. Inevitably, this surge in demand has resulted in severe computing power shortages, in turn creating the perfect conditions for GPU as a Service (GPUaaS) to prosper. Inference.ai is in the best possible strategic position at this critical moment. Fortune smiles when opportunity meets preparation, and Inference.ai has secured a financial infusion amidst a fundraising environment that has become challenging for many other young firms. In 2023, Inference.ai successfully secured a US$4 million seed round led by Cherubic Ventures, Maple VC, and Fusion Fund.

So, what does Inference.ai actually do? Like so many powerful ideas, the formula is quite simple: provide a matchmaking service between data centers with GPU resources and users who need GPU computing power by offering to rent out these resources at competitive rates. Simply put, Inference.ai is the “Airbnb of GPUs.” It helps companies quickly find the most reasonably priced GPUs with specifications that are ideal for the task at hand. The service unlocks an otherwise inaccessible trove of scarce GPU resources to alleviate the urgent GPU computing power shortage.

Inference.ai unveiled their billboard on 101N in San Francisco

For example, users can rent NVIDIA H100 chips for as little as US$1.99 per hour, or instead access any of a wide variety of different GPU chips, based on the specific needs of their training model.

Such a service has obvious benefits for organizations in need of GPU power, helping them optimize compute resources, time, and cost. Inference.ai enables users to access a wide range of GPUs for model training without being tied to a limited menu of GPU specifications. It also eliminates the long wait times for in-demand GPUs to be made available by public cloud providers.

In this era of high demand for GPUs, renting from Inference.ai allows users not only to develop AI products faster but also to achieve their goals at a lower cost. Users can save up to 82% on their overall costs by renting GPUs from Inference.ai compared to engaging the services of big companies like Google, AWS, or Microsoft.
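To put that figure in perspective, here is a back-of-envelope comparison in Python. The US$1.99 per GPU-hour rate comes from this article; the public cloud rate below is an assumed placeholder chosen only to show how a saving of roughly 82% could arise, not a quoted price:

```python
# Back-of-envelope cost comparison (illustrative only).
# The US$1.99/GPU-hour figure comes from the article; the public cloud
# rate below is an assumed placeholder, not a quoted price.
RENTAL_RATE = 1.99          # US$/GPU-hour (Inference.ai H100, per the article)
PUBLIC_CLOUD_RATE = 11.00   # US$/GPU-hour (assumed on-demand H100 rate)

gpus, hours = 8, 24 * 30    # e.g. an 8-GPU job running for a month

rental_cost = RENTAL_RATE * gpus * hours
cloud_cost = PUBLIC_CLOUD_RATE * gpus * hours
savings = 1 - rental_cost / cloud_cost

print(f"Rental: ${rental_cost:,.0f}, public cloud: ${cloud_cost:,.0f}, savings: {savings:.0%}")
# Under these assumed rates, savings come out to roughly 82%
```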

Three pain points in the GPU market: uneven resource allocation, unrelenting shortages, and asymmetric information

John has identified three major pain points in the current market. First, amid a severe shortage of GPUs, existing resources are unevenly allocated. Public cloud providers such as Microsoft, Google, and AWS, along with other large enterprises with AI demands, have already acquired the majority of GPUs. As a result, smaller companies find it difficult to obtain the necessary resources, even if they have the financial means, while startups are left with almost nothing. It is estimated that less than 6% of NVIDIA’s latest GPUs are made available to startups.

In addition, even large enterprises looking to train their AI models and develop products often have to wait one or two quarters to access GPU resources from public cloud providers. They also now need to secure millions of dollars upfront to have any chance of acquiring computing power. This represents a major obstacle for established companies wanting to invest in AI, and renders such ambitions simply a pipe dream for cash-strapped startups.

The second major pain point is that the shortage of GPU computing power is unlikely to be resolved in the near future. 

Although the global GPU chip shortage has eased slightly in recent months, it is not an issue that can be resolved in the short term. Alleviating such a shortage not only means that GPU chip makers need to ramp up their production capacity to full utilization; it also requires an increase in the capacity of supply chain components, such as TSMC’s CoWoS advanced packaging and ASML’s extreme ultraviolet (EUV) lithography systems. Bottlenecks and backlogs for key elements of this complex supply chain mean that there will be no quick fixes.

As demand continues to surge, the supply side clearly cannot keep pace. Fearing obsolescence, entire industries are seeking to join this AI gold rush. But the process takes time. In simple terms, developing AI solutions can be divided into two stages: “model training” and “model inferencing.” Although most companies are still in the model training stage, John believes that the focus will shift to model inferencing within the next 12 months. During this second phase, more companies will be using trained AI models to make predictions and to generate content or new products. While existing AI models will only need to be trained when they are updated, AI inferencing is a continuous process. As inferencing becomes “business as usual” for huge swathes of the global economy, the demand for GPU power will only increase.
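A rough arithmetic sketch helps illustrate why inferencing demand compounds. It uses the common heuristics that training costs roughly 6 × parameters × tokens FLOPs, while inference costs roughly 2 × parameters FLOPs per generated token; all of the concrete numbers below are assumptions for illustration, not figures from Inference.ai:

```python
# Why inference demand compounds: training is a one-time cost, inference recurs.
# Heuristics: training ~ 6 * params * tokens FLOPs;
# inference ~ 2 * params FLOPs per generated token.
# All concrete numbers below are assumptions chosen for illustration.
PARAMS = 70e9            # a 70B-parameter model (assumed)
TRAIN_TOKENS = 2e12      # 2 trillion training tokens (assumed)

train_flops = 6 * PARAMS * TRAIN_TOKENS        # paid once per training run

tokens_per_day = 5e10    # assumed daily serving volume for a popular product
daily_inference_flops = 2 * PARAMS * tokens_per_day

days_to_match = train_flops / daily_inference_flops
print(f"Cumulative inference matches the training run after ~{days_to_match:,.0f} days")
# ~120 days under these assumptions, and serving volume only keeps growing
```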

“In this scenario, one approach is to increase production, and the other is to better allocate existing resources. The latter is what we’re doing,” says John. “We believe that shortages will persist as GPU specifications continue to evolve, and all we can do is make sure that all of our resources are efficiently allocated to the people who need them.”


Despite the market boom, there is still a considerable asymmetry between supply and demand in terms of information, which is the third major pain point.

GPUs and CPUs are very different beasts. If you were to compare them to athletes, the CPU would be the all-rounder who excels at every sport but isn’t necessarily the best at any single one.

GPUs, on the other hand, are the “specialists,” designed to efficiently handle specific operations. Employing accelerated computing is akin to hiring a team of experts, with each expert specializing in a different task. “A lot of people don’t understand this; in fact, most people still don’t understand much about GPUs. ChatGPT has been popular for less than two years, so it’s impossible for all the techies to become GPU experts overnight.” John explains that currently, if organizations want to know how many GPUs are needed to train a model, they typically have to go to the NVIDIA or AMD websites, read the specs, and then attempt to guess how much computing power is required.
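The guesswork John describes can be made somewhat more systematic. Below is a minimal sketch of such an estimate, using the widely cited heuristic that training a transformer costs roughly 6 × parameters × tokens FLOPs; the peak throughput and utilization figures are illustrative assumptions, not vendor guidance:

```python
# Back-of-envelope GPU sizing, in place of the guesswork described above.
# Heuristic: training a transformer costs ~ 6 * params * tokens FLOPs.
# Peak throughput and utilization below are illustrative assumptions.

def gpus_needed(params: float, tokens: float, days: float,
                peak_tflops: float = 989.0,        # assumed BF16 dense peak for an H100
                utilization: float = 0.35) -> float:  # assumed real-world efficiency
    """Estimate the GPU count for a training run of the given length."""
    total_flops = 6 * params * tokens
    flops_per_gpu = peak_tflops * 1e12 * utilization * days * 86_400
    return total_flops / flops_per_gpu

# Example: a 7B-parameter model trained on 1T tokens over a 30-day run
print(f"~{gpus_needed(7e9, 1e12, 30):.0f} GPUs")   # ~47 GPUs under these assumptions
```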

In addition, there’s the mysterious aspect of machine learning that makes a trained AI model in many ways a black box, where the trainer has no sure way of completely understanding the algorithm’s decision-making process. This uncertainty also extends to the training process itself, which further increases the difficulty and complexity for companies attempting to estimate and procure computing power equal to the task.

Besides the information asymmetry in products, there are also considerable discrepancies in information across distribution channels. John provides an example in which one of Inference.ai’s customers was looking for NVIDIA’s L40S chips: “At the time, it was NVIDIA’s newest product, and it was only available to university labs. Not even businesses had access to it, and our customer couldn’t find it anywhere. So they came to us.”

John and his team quickly found a supplier in Estonia that was able to meet the customer’s needs. “Without us, they might never have found these chips in time. Even amid a GPU shortage, the information gap in this industry hasn’t been solved,” John explains. “What we’re doing is helping to better allocate resources on the supply side to fulfill customer needs on the demand side.”

To empower their customers to make more informed choices in this complex landscape, Inference.ai has developed “ChatGPU,” an AI model specially designed to recommend GPU models. Potential users can consult ChatGPU for free via the official Inference.ai website and build a better understanding of their needs before moving forward with one of Inference.ai’s services.

Inference.ai has created “ChatGPU,” an AI model tailored to suggest suitable GPU models.

ChatGPU can answer various questions regarding model training and will even recommend suitable chips, given inputs such as the client’s budget, time, and model requirements. If the client then wishes to move forward, Inference.ai’s professional team will assist with benchmark testing to identify how much computing power is needed and then help them match with and rent the most suitable chips.
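ChatGPU’s internals are not public, but the matching step it performs can be imagined as a simple filter over a catalog of offers. The sketch below is purely hypothetical: the catalog entries, prices, and function names are all invented for illustration.

```python
# Hypothetical sketch of the budget/requirement matching a tool like ChatGPU
# might perform. Catalog entries, prices, and names are invented for
# illustration; ChatGPU's actual implementation is not public.
from dataclasses import dataclass

@dataclass
class GpuOffer:
    name: str
    memory_gb: int
    usd_per_hour: float

CATALOG = [                      # illustrative entries, not real quotes
    GpuOffer("H100", 80, 1.99),
    GpuOffer("A100", 80, 1.20),
    GpuOffer("L40S", 48, 0.90),
]

def recommend(min_memory_gb: int, budget_per_hour: float) -> list[GpuOffer]:
    """Return offers meeting the memory requirement within budget, cheapest first."""
    fits = [g for g in CATALOG
            if g.memory_gb >= min_memory_gb and g.usd_per_hour <= budget_per_hour]
    return sorted(fits, key=lambda g: g.usd_per_hour)

# A model needing 48 GB of VRAM on a $1.50/hour budget matches the L40S and A100
print(recommend(min_memory_gb=48, budget_per_hour=1.50))
```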

With the advent of accelerated computing, the “public cloud providers of the AI era” will rise

John believes that with the AI wave, power will be redistributed among the tech titans, with the greatest resources no longer belonging to the public cloud providers of the Internet era but instead gathering under new leaders such as NVIDIA and AMD. As this trend unfolds, these “public cloud providers of the AI era” will rise to the occasion. John states, “I’m very confident that we will become the AWS of the new era.”

As Jensen Huang has said, the era of “general-purpose computing” has ended, and the era of “accelerated computing” has officially arrived. The world and the techno-industrial ecosystem are about to take on a whole new look. In the AI era, GPU chips are likely to become increasingly specialized, while the challenge of sourcing computing power will also become greater. Whoever can allocate these resources most efficiently will become big winners in this new era.

This article has been contributed to AsiaTechDaily.
