From supercomputer to cloud

Written by ENCCS HPC Specialist, Daniel Medeiros.

You are a startup. A small or medium company. Maybe a sole entrepreneur with a big dream. You deployed your workload on the available EuroHPC systems. It could be an AI-related workload, or a more classical use case such as molecular dynamics or materials modelling. You got the successful results you were looking for. Have you ever thought about how to transfer your workflows from a supercomputer to cloud services?

EuroHPC JU machines, like most supercomputers, have some limitations and cannot execute certain workloads. Under the AI open calls, for example, you are not allowed to run user-facing services, such as serving inference to end users. You want to take the next step and bring your product to market.

LUMI supercomputer in Kajaani, Finland. Many industrial and research projects have used LUMI.

What now?

Some of the EuroHPC JU host institutions can provide a degree of commercial services under a separate paid agreement, for example CSC in Finland or LuxProvide in Luxembourg. However, in many cases you will need to consider migrating from HPC to cloud service providers (CSPs). There is a myriad of cloud service providers to choose from, with so many offerings that it is hard to keep track of what you really need. We at ENCCS compiled a quick list of several relevant European providers here.

However, this blog post will focus on the major considerations you should weigh when migrating from HPC to the cloud. We will discuss these considerations through different lenses:

  1. performance and offerings
  2. availability and reliability
  3. security and compliance, and 
  4. costs. 

While there are potentially other factors to discuss when migrating your workload, the ones above tend to be the most decisive.

We have written this small guide with questions that will aid you in the decision-making process, help you figure out your own needs, and choose a suitable provider. Keep in mind that the migration from HPC to cloud can be extremely complex depending on how you have designed your workload or your services.

1. Performance and Offerings

1.1 How is your workload running?

In traditional HPC systems, applications often run on bare metal with the help of a scheduler (e.g., SLURM or PBS). This is done to extract the maximum performance from the system by having the fewest possible layers between the application and the hardware.

If you are running your application on bare metal and you want to keep the effort to a minimum, you should likely stick with compute instances/virtual machines. Keep in mind that with this option you will likely have to build your own operating system image so the application can be deployed across several instances. Still, some degree of configuration might be required: in MPI jobs, for example, one is expected to configure the machines and keep track of all the IPs (or domains, if one is using nameservers) so the ranks can be distributed among the several machines.
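As an illustration, the bookkeeping of IPs for an MPI job can be scripted. The sketch below (IP addresses and slot counts are hypothetical placeholders for whatever your provider's API returns) generates a hostfile in the OpenMPI `host slots=N` format:

```python
# Sketch: build an MPI hostfile from the IPs of provisioned cloud instances.
# The IPs below are placeholders; in practice you would query your CSP's API.
instances = ["10.0.1.11", "10.0.1.12", "10.0.1.13"]
slots_per_node = 4  # e.g., one MPI rank per physical core

def make_hostfile(ips, slots):
    """Return hostfile content in the OpenMPI 'host slots=N' format."""
    return "\n".join(f"{ip} slots={slots}" for ip in ips) + "\n"

content = make_hostfile(instances, slots_per_node)
print(content)
# You would then write this to a file and launch, e.g.:
#   mpirun --hostfile hostfile -np 12 ./my_app
```

This is exactly the kind of glue code that managed services would otherwise handle for you.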

If you have been using containers to run your application (e.g., on supercomputers such as LUMI), you can also opt for services such as managed containers, Kubernetes, or OpenShift, in which you do not need to worry about the OS layer but only about the application itself. Additionally, some CSPs go one step further and also provide SLURM as if you were in an actual HPC environment. However, despite the decrease in complexity, the downside of using managed services is that there is often a management fee embedded on top of the price of the instances you would use.

1.2 Public, private or hybrid?

Public cloud means that the resources are shared among several users, and private cloud means that the resources are entirely yours. The latter is, of course, more expensive, as you have an exclusive infrastructure that must be provisioned in full regardless of usage.

A middle-ground approach is what has been called "hybrid cloud": one use case is storing your data on-premises with a connection to resources in the public cloud, so your sensitive data can stay on-premises to comply with regulatory requirements. This requires existing on-premises infrastructure, which small/medium companies are likely to already have.

1.3 What are your throughput requirements?

This can be the easiest requirement to figure out, or perhaps the hardest. Let's try to break it into several pieces.

  1. For starters, is your workload running on CPUs or GPUs? Both? Does the hardware vendor matter? There would be additional effort to migrate the application if the code was originally written in, say, CUDA and you provision AMD GPUs. The same goes for CPUs: despite sharing the same architecture, Intel CPUs might have specific SIMD intrinsics or performance counters that are not available on their AMD counterparts. If you are using vendor-agnostic libraries, such as TensorFlow or PyTorch (Python), or OpenCL or HIP (C/C++), you probably don't need to worry about vendors in most cases.
  2. Do you need the high-end GPUs on the market, or can you get by with mid-range ones? This is important because one might perform AI training on very high-end systems (like those available in EuroHPC JU systems) while inference might not need as many resources. In the same line of thought, there is no point in paying for a GPU with very high memory when your code/model does not make full use of it.
NVIDIA A100 GPUs at the National Supercomputer Centre (NSC) in Linköping.
  3. Are different computer architectures relevant to you? For example, some codes can potentially benefit from the ARM architecture available in some cloud providers (e.g., Graviton4), which can also be more cost-efficient: ARM chips are more energy-efficient, and this saving is passed on to the end user. A similar argument applies to item 1 if you write parts of your code in highly optimized machine code for x86 processors and end up running on RISC-V (or ARM, or PowerPC) ones.
  4. Any other types of accelerators? Some supercomputers make use of less conventional accelerators such as FPGAs and quantum modules. While not common in European supercomputers, some even use vector processing units. These types of accelerators are harder to replace in cloud providers, leaving you the choice of either picking among the largest providers or rewriting your code. This is also true if you want to use specific processors such as AWS's Trainium/Inferentia or Google's Tensor Processing Unit.
  5. How much RAM do you need? With newer GPU models and some frameworks (e.g., DALI), one does not need to stage data through host memory: it can be read directly from disk to the device, or handled through other arrangements such as unified memory. Therefore, you can optimise the amount of RAM you use in order to save costs and avoid provisioning unused resources.
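To make the sizing questions above concrete, here is a back-of-envelope sketch for estimating the GPU memory needed to serve a model for inference. The 20% activation overhead and the fp16 assumption are illustrative, not vendor figures:

```python
# Back-of-envelope estimate of GPU memory needed to serve a model for
# inference, to help pick an instance size. Numbers are illustrative.
def inference_memory_gb(n_params, bytes_per_param=2, overhead=1.2):
    """Weights in fp16 (2 bytes/param) plus ~20% for activations/KV cache.
    The overhead factor is a rough assumption, not a vendor figure."""
    return n_params * bytes_per_param * overhead / 1e9

# A 7-billion-parameter model in fp16:
print(f"{inference_memory_gb(7e9):.1f} GB")  # roughly 17 GB, so a 24 GB GPU suffices
```

The same arithmetic, with your own measured overheads, tells you whether you truly need an 80 GB data-center GPU or whether a mid-range card is enough.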

1.4 Which type of storage do you need?

Containers provide only ephemeral storage, but many providers also offer traditional POSIX filesystems.
Some providers extend this to parallel filesystems like Lustre or OrangeFS, mimicking supercomputers.
Additionally, they supply block storage and object storage, such as S3-compatible stores or Ceph.

Direct access to object storage is uncommon in HPC machines, but it handles large amounts of data cheaply and scalably. If you haven’t used it yet and deal with tons of data, consider this service.
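Whether object storage pays off is ultimately arithmetic. The sketch below compares monthly costs for cold data; the per-GB prices are illustrative assumptions, not quotes from any provider:

```python
# Rough monthly cost comparison: object storage vs. block storage for
# cold data. Prices are illustrative assumptions, not provider quotes.
object_price_gb = 0.02   # EUR per GB-month (assumed)
block_price_gb = 0.08    # EUR per GB-month (assumed)

def monthly_cost(tb, price_per_gb):
    """Monthly storage cost in EUR for a volume given in terabytes."""
    return tb * 1000 * price_per_gb

data_tb = 50
print(f"object: {monthly_cost(data_tb, object_price_gb):.0f} EUR/month")
print(f"block:  {monthly_cost(data_tb, block_price_gb):.0f} EUR/month")
```

Remember to also factor in egress and request fees, which object stores typically charge separately.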

1.5 How network-dependent is your workload?

Tightly coupled workloads, such as MPI-centric applications, suffer when ranks are federated across sites. Such workloads require ranks to remain near each other to ensure fast communication.

Some providers offer similar solutions as current supercomputing systems, including InfiniBand, high-speed Ethernet or custom solutions (e.g., Elastic Fabric Adapter). Of course, these sort of high-performance solutions often come with a large price tag.

Loosely coupled workloads communicate rarely, so they are less affected by slower networks.
Many of these applications embed load balancers that distribute batches across nodes, and these batches rarely talk to one another. Even so, when you move to a federated system, tail latency can spike, severely raising response times.
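The tail-latency effect is easy to see with a toy simulation. Below, a small fraction of requests crosses a slow inter-site link (all numbers are made up): the median barely moves, but the 99th percentile jumps to the slow path:

```python
# Toy illustration of why federation hurts tail latency: a small fraction
# of requests crosses a slow inter-site link. All numbers are synthetic.
import random

random.seed(42)
# 98% of requests stay in-site (~20 ms), 2% cross sites (~200 ms).
latencies = [random.gauss(20, 2) if random.random() < 0.98 else random.gauss(200, 20)
             for _ in range(10_000)]

def percentile(data, p):
    """Naive percentile: value at position p% of the sorted sample."""
    s = sorted(data)
    return s[int(len(s) * p / 100)]

print(f"p50: {percentile(latencies, 50):.1f} ms")
print(f"p99: {percentile(latencies, 99):.1f} ms")
```

If your users care about p99 response times, even a 2% cross-site fraction can dominate the experience.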

Another important aspect of networking is the number of (external) IP addresses you will potentially need. Several providers don't charge for IPv6 addresses, but they do for IPv4 ones.

1.6 Are you using any vendor-specific tool?

Certain vendors design and publish tools that can be used anywhere but are usually most effective in their own environments. A common example is Amazon S3 for object storage.
S3 works best when paired with AWS SDKs and AWS services, though other products, such as Ceph and MinIO, also implement S3-compatible APIs. However, you may lose certain features if you choose a different vendor, so consider this before deciding.

Vendor lock-in can mean that you will have trouble finding the best deals to host your service when moving to production.

Your workload may rely on proprietary software that once ran on the supercomputer, even if you owned the license. Examples include visualization tools, rendering, and simulation packages such as CFD or electromagnetic solvers. Before moving to the cloud, verify and renegotiate license terms to avoid unexpected fees.

2 Availability, Portability and Reliability

2.1 What is your desired service-level agreement?

If your HPC workload (or service) will be user-facing or needs to run for a very long time, you need to consider the availability of data center resources (e.g., Tier-4 data centers). An SLA of 99.99% means roughly 52 minutes of downtime per year, while an SLA of 99.9% translates to almost 9 hours per year. This should not be a major issue if you have several nodes provisioned or your workload embeds some fault-tolerance mechanism, checkpointing being the most common among them.
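The downtime figures follow directly from the SLA percentage. A quick calculator:

```python
# Converting an SLA percentage into allowed downtime per year.
def downtime_per_year(sla_percent):
    """Return (hours, minutes) of allowed downtime for a given SLA,
    assuming a 365-day year."""
    minutes = (1 - sla_percent / 100) * 365 * 24 * 60
    return minutes / 60, minutes

for sla in (99.9, 99.95, 99.99):
    hours, minutes = downtime_per_year(sla)
    print(f"{sla}% -> {hours:.1f} h ({minutes:.0f} min) per year")
```

Each extra "nine" costs real money, so match the SLA to what your users would actually notice.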

Furthermore, the desired SLA can also be analyzed through the lens of which service you are planning to run. If your service has usage bursts over very short periods (e.g., a simulation that lasts a few hours), a stricter SLA might be less valuable than for a service with medium usage over long periods.

2.2 Are you considering multicloud or can you stay within a single ecosystem?

While offerings from different providers, especially the major ones, are roughly the same, ensuring interoperability between them is not necessarily easy. A potential use case for multicloud, aside from keeping your service up in case of disasters or improving reliability, is having deals with different minor local providers that operate only within a single region/country, so that you have different points of access to your service across several countries.

Note that the upcoming Data Act, which ensures and eases the portability of data between different providers, enters into force in September 2025, so many changes might be seen in the multicloud landscape.

2.3 Do you need a provider that is in several countries (potentially continents)?

Some providers have data centers in only one country; this may simplify regulatory frameworks in certain cases, but it has the downside of major issues if that country has problems due to force majeure (e.g., energy outages, war).

2.4 Would your application benefit from replication among several data centers?

Applications designed as microservices or behind load balancers, and in many cases those that are not user-facing (i.e., tolerant of tail latency), can likely have different instances spread among several data centers without major impact.

2.5 Conservative vs Aggressive innovation in providers

Some providers (regardless of size) tend to push the latest hardware and solutions as soon as possible. This can be an issue if you are on the more conservative end of the spectrum and use older versions of certain stacks (libraries, compilers) in your workload. While a previous stack doesn't suddenly disappear, you might find yourself losing time reconfiguring certain options for compatibility or because certain features were deprecated.

3 Security and Compliance

3.1 How sensitive is your data?

Some providers capitalize on local secrecy regulations when offering cloud services; a notable example is Luxembourg, where separate laws coexist with the European GDPR, with high regulatory oversight requiring regular audits, background checks on personnel, and segregation of duties in their data centers. You can therefore favour a cloud provider by location if you are, for example, dealing with highly sensitive data (e.g., financial, medical) in your model and want to minimize the risks of data leakage.

An additional point is that several cloud providers (especially non-EU ones) talk about European data/cloud sovereignty, with different solutions. A common approach uses external key management: all data remains encrypted, so a foreign government cannot read it, but providers can still hand over the encrypted data. Other approaches include partnering with European companies to provide the hardware/software stack separately from the main cloud, or even on-premises deployment (with the software still maintained by, for example, Google).

All of this means you must weigh the (potentially large) additional cost embedded in these protections, and ask whether the GDPR framework offers enough protection for your data, or whether you need more than that.

3.2 Do you care about open source?

Open-source tools have the advantage that everyone can inspect and audit the code, which helps catch potential bugs and fosters innovation. Several providers rely on well-known default solutions like Kubernetes; others build their own Kubernetes flavor on closed source or deploy proprietary solutions. This should only be a concern if you use an existing tool (e.g., its APIs) for which the provider has its own version, or does not offer it at all, which can break compatibility with your existing workload.

3.3 Do you need certifications (e.g. ISO) in the data center?

In certain cases, due to the sensitivity of your application or the requirements of your customers, you might need certain certifications for the data center. This comes, of course, at an additional cost that might not be justified if you do not have a specific need for it.

3.4 Are trainings and certifications relevant to your application?

Some cloud providers, especially major ones, offer training and certifications for their own services, with additional benefits (including discounts) if the organisation employs certified people. This is essentially a form of vendor lock-in, but it can be relevant if you are looking for easier compatibility between an existing service on that provider (say, a customer business that already uses it) and your workload.

4 Costs

4.1 What margin do you have for unexpected expenses?

If you are hosting a user-facing service for your HPC application, consider that there may be peaks of demand at certain times of the day or week. In these cases, to stay within the desired deadlines, the system might need to scale. Without control over the amount of new resources, this can easily snowball into large unexpected expenses in a pay-as-you-go system. Several providers offer tools that monitor resources, control the costs, and shut down instances when you exceed the budget.
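The core of such a guard is a simple projection check. The sketch below is a hypothetical illustration, not any provider's API; real CSPs expose similar logic through budget alerts and scaling policies:

```python
# Sketch of a budget guard for pay-as-you-go autoscaling: refuse to add
# an instance once projected spend would exceed the monthly budget.
# Prices and the decision logic are illustrative assumptions.
def can_scale_up(current_instances, price_per_hour, hours_left_in_month,
                 spent_so_far, monthly_budget):
    """Project spend if one more instance runs for the rest of the month."""
    projected = spent_so_far + (current_instances + 1) * price_per_hour * hours_left_in_month
    return projected <= monthly_budget

# 4 instances at 2.50 EUR/h, 100 h left in the month,
# 1500 EUR already spent, 3000 EUR budget:
print(can_scale_up(4, 2.50, 100, 1500.0, 3000.0))
```

In practice you would drive this from the provider's billing API rather than hand-kept numbers.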

An alternative is to pay for the whole period up-front, ensuring that the budget for the service stays within the forecast and does not generate downtime (except under extreme demand, for user-facing services).

4.2 Can you afford to use spot instances?

Spot instances (also called preemptible VMs by some providers) make use of unreserved resources at cloud service providers for a cheaper price, often at half the cost or less. A drawback of this approach is that the resource offers no guaranteed availability: providers can turn it off at any time, although some may forecast how long the instance will stay online.

In this case, the workload should support checkpointing to a non-spot instance that hosts a storage service; for a user-facing service (e.g., inference), one might prefer to use spot instances as complementary to non-spot ones.
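The checkpointing pattern can be sketched as a resumable work loop. Here, durable storage is stood in for by a local JSON file; in production the checkpoint would go to object storage or a non-spot node. The file path and step logic are illustrative:

```python
# Sketch of a checkpoint-aware work loop for spot instances: progress is
# saved often enough that a preemption only loses the current step.
# A local JSON file stands in for durable (e.g., object) storage.
import json, os, tempfile

ckpt_path = os.path.join(tempfile.gettempdir(), "spot_job_checkpoint.json")

def load_checkpoint():
    """Return the last completed step, or 0 on a fresh start."""
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            return json.load(f)["step"]
    return 0

def save_checkpoint(step):
    # In production this would write to object storage on a non-spot node.
    with open(ckpt_path, "w") as f:
        json.dump({"step": step}, f)

start = load_checkpoint()
for step in range(start, start + 5):   # resume where we left off
    # ... do one unit of work here ...
    save_checkpoint(step + 1)

print(f"resumed at step {start}, now at step {load_checkpoint()}")
```

If the spot instance is reclaimed mid-loop, restarting the script resumes from the last saved step instead of from zero.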

4.3 Do you care about politics?

If you are not constrained by regulations, politics may guide your choice of a cloud provider. There is currently a major trend of favoring European providers over international (e.g., North American) ones; however, this choice can potentially affect your cost as well, since resources (people, equipment) in Europe tend to be more expensive than elsewhere, while the scale is smaller.

4.4 Have you considered the total cost of ownership of an HPC machine?

Several academic papers point out that running HPC workloads in the cloud might not be economically feasible depending on the machines and prices, but it can be feasible if you optimize the resources you intend to use. Buying an HPC machine to use on-premises is a major undertaking and requires not only technical expertise but also other costs: the machine itself (up-front) and maintenance costs including energy, faulty equipment, and the likely full depreciation over the following 5-10 years. CSPs embed all of this as hidden costs in the fees you pay, diluting those costs across many users.
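A rough TCO comparison can be done in a few lines. Every figure below (server price, power draw, energy price, maintenance, cloud hourly rate, utilisation) is an illustrative assumption you should replace with your own numbers:

```python
# Back-of-envelope TCO comparison: buying a small GPU server vs. renting
# a comparable cloud instance. All figures are illustrative assumptions.
purchase_price = 40_000        # EUR, small 4-GPU server (assumed)
power_kw, eur_per_kwh = 2.0, 0.20
annual_maintenance = 3_000     # EUR per year (assumed)
lifetime_years = 5             # full depreciation horizon

annual_energy = power_kw * 24 * 365 * eur_per_kwh
on_prem_per_year = purchase_price / lifetime_years + annual_energy + annual_maintenance

cloud_eur_per_hour = 6.0       # comparable instance (assumed)
utilisation = 0.5              # fraction of the year it actually runs
cloud_per_year = cloud_eur_per_hour * 24 * 365 * utilisation

print(f"on-prem: {on_prem_per_year:.0f} EUR/year")
print(f"cloud:   {cloud_per_year:.0f} EUR/year")
```

Note how sensitive the result is to utilisation: at low usage the cloud wins, while a machine that is busy year-round tips the balance toward on-premises.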

Hosting an HPC system, even a small one, is of course a challenge for most startups and SMEs. However, if your cloud bill is too large, calculating whether an on-premises system is worth it becomes a must.

5 Conclusions

Migrating from HPC to cloud involves many decisions. No single approach fits every organization. Some choices should not rest solely with you—whether you’re a researcher, CEO, or CTO—but with specialists who understand the technical and business implications.

By involving the right experts early, you can prevent unnecessary expenses that do not align with your goals. This collaborative approach ensures that the cloud strategy matches your unique workload requirements, budget constraints, and regulatory environment.

We hope that you have enjoyed this read!
