Laura Henn August 15, 2023
How many engineers does it take to complete a cloud migration?
It’s not a riddle. It’s a conundrum. Your actual number may vary, but if you’re like most organizations, it’s probably in the ballpark of “too many.”
Resourcing and complexity are two of the most common reasons why cloud migrations fall short of expectations, and why many organizations find themselves locked into a particular cloud vendor that no longer best serves their needs. But there’s good news – AI can automate most, if not all, of the mundane tasks that can slow down your migration and distract your engineering talent from innovation and other high-value work. You just need the right tools.
Here at Nuvalence, we’ve been experimenting with LLMs for exactly this use case. We’ve learned a lot, but one of our biggest takeaways? Prompts will make or break an AI-powered cloud migration.
In this post, we’ll share the approach we’ve taken to prompt engineering, and the four biggest lessons we have learned so far.
Setting the Stage
We chose to start our exploration in AI-powered cloud migrations with Infrastructure as Code (IaC) generation. We set out to understand how far LLMs could get us in translating IaC to deploy an application from one cloud to another.
More specifically, how well could LLMs:
- Identify equivalent services between clouds (e.g., identify Compute Engine as the Google Cloud equivalent of AWS EC2 instances)?
- Configure different services to work together (e.g., configure Compute Engine instances to access Cloud SQL databases)?
- Generate clear, functioning, maintainable Terraform code?
To zero in on these areas, we built a CLI tool for cloud engineers to leverage LLMs when translating IaC from one cloud to another, using chat-completion models from OpenAI throughout the experiment. Here’s a high-level diagram of the tool’s user flow.
In this post, we focus on how carefully crafted prompts vastly improved the ‘translate’ stage. Stay tuned for future posts in this series, where we’ll explore the rest of this flow and how we landed on a human-in-the-loop design pattern for the tool.
Our Test Case
For our initial test case, we migrated an off-the-shelf, highly available WordPress installation, deployed to AWS via a CloudFormation template, to Google Cloud, deployed via Terraform. We felt that this application had the right mix of infrastructure components – an EC2 Auto Scaling group, an RDS database instance, an application load balancer, and a security group – to give us an initial sense of how far LLMs could take us in a migration.
Good Prompts Were The Key To Success
Our initial results weren’t very promising. During our first attempts at the translation stage, when we used very basic prompts, the generated Terraform hardly reflected the original architecture. For example, it often consisted of a standalone Compute Engine instance with WordPress and a SQL database installed directly on it, rather than the Compute Engine managed instance group connecting to a Cloud SQL database that we expected. When we iterated on our prompt, however, we saw a vast improvement in the Terraform the LLMs returned.
By the end of our experiment, we were able to generate working Terraform code that accurately reflected the original AWS architecture with up to 90% accuracy.
Dramatic results like these didn’t come easily. There was a lot of trial-and-error and lessons learned along the way. Here are the four biggest ones.
1. Iterate, iterate, iterate
Generative AI feels like magic – both in its incredible capabilities and the mystery in how it actually works. While there’s an abundance of general guidance on model selection and prompt-engineering now cropping up all over the internet, the only way to truly discover what works best for your specific use case is to test out many different prompts and iterate. We found several different approaches to help in this area.
The ChatGPT playground is a great starting point, as it allows users to tweak settings quickly and intuitively. For example, using the playground, we observed that GPT-4 models were able to generate much higher-quality code than GPT-3.5 models for our initial prompt (“translate this CloudFormation template into Terraform for Google Cloud”), whereas in later prompts (“fix this Terraform error we’re experiencing”) it didn’t make a meaningful difference. The playground helped us to quickly land on models and settings that had the right balance of performance and cost for our use case.
While the playground was very helpful as we started our project, we eventually landed on a more systematic approach to drill down into subtle differences among prompts. This was necessary because even with a temperature of zero (corresponding to the least amount of randomness), the output of these models differed between identical runs. To see past the inherent non-determinism of these models, we wrote scripts to run prompts repeatedly and compare their outputs. For example, by contrasting trends in the Terraform resources created by running prompt A five times against those created by running prompt B five times, we were able to select the wording for our prompts with higher confidence. We’ll show an example of this in our third lesson.
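As a rough sketch of what such a comparison script can look like (the function names and the stubbed model outputs below are our own illustration, not the tool’s actual code), you can tally which Terraform resource types each prompt tends to produce across repeated runs:

```python
import re
from collections import Counter

def extract_resource_types(terraform: str) -> list[str]:
    """Pull the resource types (e.g. google_compute_instance) out of Terraform code."""
    return re.findall(r'resource\s+"([^"]+)"', terraform)

def tally_runs(outputs: list[str]) -> Counter:
    """Count how many runs of a prompt produced each Terraform resource type."""
    counts = Counter()
    for terraform in outputs:
        counts.update(set(extract_resource_types(terraform)))  # count once per run
    return counts

# In practice, outputs_a and outputs_b would each hold the model's responses
# from running prompt A and prompt B several times; here we stub them out.
outputs_a = ['resource "google_compute_instance" "wp" {}'] * 5
outputs_b = [
    'resource "google_compute_instance_group_manager" "wp" {}\n'
    'resource "google_sql_database_instance" "db" {}'
] * 5

print(tally_runs(outputs_a))  # prompt A: standalone instance in every run
print(tally_runs(outputs_b))  # prompt B: managed group + Cloud SQL in every run
```

Comparing the resulting tallies side by side makes trends visible that a single run of each prompt would hide.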
2. Leverage system messages
OpenAI offers the option to set a system message when using their chat completion models. The system message defines the role that the AI assistant plays in the conversation; it defaults to “You are a helpful assistant,” but can be customized to fit your use case. We changed ours to “You are a cloud engineer with expert knowledge of AWS, Google Cloud, CloudFormation, and Terraform.” To our surprise, this made a significant difference in the Terraform output.
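In the chat completion API, the system message is simply the first entry in the `messages` list. A minimal sketch (the helper name is ours, and the commented-out request line assumes the `openai` Python SDK as it existed at the time of this experiment):

```python
SYSTEM_MESSAGE = (
    "You are a cloud engineer with expert knowledge of AWS, Google Cloud, "
    "CloudFormation, and Terraform."
)

def build_messages(user_prompt: str) -> list[dict]:
    """Prepend our custom system message to every chat-completion request."""
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages(
    "Translate this CloudFormation template into Terraform for Google Cloud: ..."
)

# The actual request (requires the `openai` package and an API key):
# response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
```

Swapping `SYSTEM_MESSAGE` back to “You are a helpful assistant” is all it takes to reproduce the default behavior for comparison.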
The Terraform output was more specific to the application we were deploying when using the custom system message. Often, resources were named more descriptively (for example, the database instance would be named ‘wordpress’ or ‘wordpress-db’ when using the custom system message, instead of the generic ‘database-instance’ when using the default system message). We also saw other evidence of the Terraform being more specific to the application when using the customized system message. The metadata startup script for the virtual machines was more likely to install WordPress and relevant packages, and the load balancer’s health check was more often configured to use a WordPress-specific path.
Clearly, changing the system message had a large impact on the quality of Terraform code that the model generated. Setting aside the astounding fact that ‘telling’ the model to act like someone who knows more about your problem allows it to solve your problem better, some of these differences are puzzling. Nowhere in our customized system message did we tell our model that it knew a lot about WordPress, yet that was the specific aspect of the Terraform that it improved the most. This unpredictable behavior reinforced the importance of systematically testing out prompts.
3. Chunk your problems
Breaking up complex problems into smaller problems (or ‘chunks’) allows the model to perform better. This is not a novel finding – for example, the Google Brain research team has explored this deeply and published a paper on it – but we were excited to see how much of a difference it made for us. When we prompted the model to describe the architecture of the CloudFormation template before generating Terraform to deploy it in Google Cloud, it produced Terraform that was a more accurate translation of the original architecture. Here’s a brief summary of the differences we noticed based on this simple change in the prompt.
We can see from these differences that the chunked prompt more reliably produced a better translation of the original CloudFormation template, which also happened to be a better and more scalable solution compared to the output often produced by the non-chunked prompt. For example, a managed instance group of Google Cloud Compute Engine instances better matches the original Autoscaling group of EC2 instances, and is a much more scalable solution compared to a single Compute Engine instance. Also, while the original architecture in AWS didn’t create a VPC and subnets, it did reference them in quite a few of the resources. The chunked prompt was more likely to recognize this dependency and include the VPC and subnets in the Google Cloud deployment.
In contrast to our previous lesson, the approach of ‘chunking’ prompts makes more intuitive sense to us humans – we also find it easier to solve big problems by breaking them up into smaller steps. In the context of generative AI, chunking may also offer opportunities to add humans to the loop to guide the models. For example, we could have had the model first describe the architecture and list the equivalent services between AWS and Google Cloud (e.g., EC2 in AWS corresponds to Compute Engine in Google Cloud), then give the human user an opportunity to accept or correct these assumptions before the model generated the Terraform. (Using this 6-category taxonomy of AI, this would be a cooperative model.) In this case, the model was able to correctly identify equivalent services with a high degree of accuracy, so this was unnecessary, but it is something to consider if you find yourself chunking prompts. If you’re interested in learning more about humans-in-the-loop and how we incorporated them into our experiments, stay tuned for a future post in our series.
4. Be specific
If you, like us at Nuvalence, have embraced the rise of ChatGPT and have been using it in addition to (or instead of) a search engine, you’ve likely noticed a difference in how you leverage these tools. With searching, you generalize your question to find relevant resources, whereas with ChatGPT and other LLMs, you try to be as specific as possible so you can get a precise answer. Integrating LLMs into processes like cloud migrations is no different – you should strive to create detailed prompts.
Consider this excerpt from our prompt, where we included detailed requirements for well-written Terraform:
Adhere to the requirements below:
- Include the GCP provider in the generated code
- Leverage Terraform variables as necessary, including reasonable default values
- Avoid using object-type Terraform variables
- Compute Engine VM instances must have secure boot enabled
This collection of stylistic preferences and security requirements reliably generated Terraform code that was easy to understand and would not be blocked by organizational security policies when deploying to our environment.
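One simple way to keep requirements like these consistent across every run is to append them to the base instruction programmatically. A sketch (the helper name is ours; the wording mirrors the excerpt above):

```python
REQUIREMENTS = [
    "Include the GCP provider in the generated code",
    "Leverage Terraform variables as necessary, including reasonable default values",
    "Avoid using object-type Terraform variables",
    "Compute Engine VM instances must have secure boot enabled",
]

def build_translate_prompt(cloudformation_template: str) -> str:
    """Combine the translation instruction, our requirements, and the source template."""
    bullet_list = "\n".join(f"- {req}" for req in REQUIREMENTS)
    return (
        "Translate this CloudFormation template into Terraform for Google Cloud.\n"
        "Adhere to the requirements below:\n"
        f"{bullet_list}\n\n"
        f"{cloudformation_template}"
    )

prompt = build_translate_prompt("Resources: ...")
```

Keeping the requirements in a list also makes it easy to add organization-specific policies later without rewording the rest of the prompt.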
The power of large language models like GPT-4 is undeniable, but harnessing that power requires careful crafting of prompts. In the context of cloud migrations, this technology can drastically change the game.
Consider a relatively simple test case like ours, which would typically cost cloud engineers hours, if not days, to map out equivalent architectures and generate boilerplate Terraform code (and this is assuming they already have familiarity with both clouds). Yet, with the aid of LLMs guided by thoughtful prompts, this once time-consuming task is now accomplished in a matter of minutes, freeing up subject-matter experts to focus on higher-value work. Taking into account the scale and complexity of cloud migrations we’ve encountered, the impact of LLMs with well-crafted prompts is transformative.
Now, we want to challenge you: what prompts would you use? Prompt engineering is not just a one-size-fits-all approach: it demands creativity, experimentation, and continuous iteration. We hope you can take our learnings and apply them to your own unique use case!
This is the third post in our Supercharging Cloud Migrations with GenAI series. Interested in learning more? Check out our previous insights: Tame Your Cloud Migration Anxiety with Generative AI and Exclusive Preview: An AI-Powered Cloud Migration Platform.