Maximizing AI Performance with Intel Arc A770 GPU on Windows
Synopsis
This article introduces the Intel Arc A770 GPU as a competitive option for intensive AI tasks, especially for those working within the Windows ecosystem. Traditionally, NVIDIA GPUs and CUDA have dominated this space, but Intel’s latest offering provides a robust alternative. It also shows how to run AI workloads on the Arc A770 GPU natively on Windows, bypassing the need for the Windows Subsystem for Linux (WSL).
Through practical steps and detailed insights, we explore how to set up and optimize the Arc A770 GPU for various AI models, including Llama2, Llama3, and Phi3. The article also includes performance metrics and memory usage statistics, providing a comprehensive overview of the GPU’s capabilities. Whether you are a developer or researcher, this post will equip you with the knowledge to leverage Intel’s GPU for your AI projects efficiently and effectively.
Introduction
Intel recently provided me with the opportunity to test their Arc A770 GPU for AI tasks. While detailed specifications can be found here, one feature immediately stood out: 16GB of VRAM. This is 4GB more than its natural competitor, the NVIDIA RTX 3060, making it a compelling option for AI computations at a similar price point.
At Plain Concepts, where we predominantly work with Microsoft technologies, I decided to explore the GPU’s capabilities on a Windows platform. Given my usual work with PyTorch, I began by utilizing the Intel Extension for PyTorch to see if it could run models like Llama2, Llama3, and Phi3, and to evaluate its performance.
Initially, I considered using the Windows Subsystem for Linux (WSL) based on suggestions from various blog posts and videos that indicated native Windows support might not be fully ready. However, I chose to first experiment with a native Windows setup, and after a few tweaks and adjustments, I was pleased to discover that everything worked seamlessly!
In this article, I will share my experiences and the steps I took to run Llama2, Llama3, and Phi3 models on the Intel Arc A770 GPU natively in Windows. I will also present performance metrics, including execution time and memory usage for each model. The goal is to provide a comprehensive overview of how the Intel Arc A770 GPU can be effectively used for intensive AI tasks on Windows.
Setup on Windows
Intel provides a comprehensive guide for installing the Intel Extension for PyTorch for the Arc GPU.
However, setting up the Arc A770 GPU on Windows required some initial adjustments and troubleshooting. Here’s a brief summary of those adjustments. For detailed instructions, refer to the samples repository.
- Since oneAPI requires setting up several environment variables from the CMD, I recommend installing the Pscx module for PowerShell, which allows you to easily call CMD scripts.
- When working on Windows with mamba, the PATH environment variable can become excessively long, causing issues when setting oneAPI environment variables. To avoid this problem, I included a setup_vars.ps1 script that sets the necessary environment variables for oneAPI while circumventing this issue.
- The Phi3 sample requires installing the prerelease version of the ipex-llm library, which implements optimizations for all kernel operations of Phi3. After installing this library, you must reinstall the transformers library.
Using the Intel Extension for PyTorch
As stated in its GitHub repository, “Intel® Extension for PyTorch extends PyTorch with up-to-date features optimizations for an extra performance boost on Intel hardware”. Specifically, it “provides easy GPU acceleration for Intel discrete GPUs through the PyTorch xpu device”. This means that, by using this extension, you can leverage the Intel Arc A770 GPU for AI tasks without relying on CUDA/NVIDIA, and that you can get an even greater performance boost when using one of the optimized models.
Luckily, the extension follows the same API as PyTorch, so in general only a few code changes are needed to get a model running on the Intel GPU. Here is a brief summary of the changes; a consolidated sketch follows the list:
- Check for the GPU
Import the Intel Extension for PyTorch and check that the GPU is correctly detected.
This step is not strictly required, but it is good practice to verify that the device is available before running the model.
- Move the model to the GPU
Once the model is loaded, move it to the GPU.
- Move inputs to the GPU
Finally, when using the model, ensure the input data is also on the GPU.
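The snippet below is a minimal sketch of these three changes, using a Hugging Face transformers model as an example. The model id and prompt are placeholders, and it assumes the xpu namespace mirrors the familiar CUDA-style device helpers, which the extension exposes; the official samples structure the code differently.

```python
import torch
import intel_extension_for_pytorch as ipex  # imported for its side effect: registers the "xpu" device
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Check for the GPU (optional, but it helps to fail early if the driver or extension is misconfigured)
assert torch.xpu.is_available(), "No Intel XPU device detected"
print(f"Running on: {torch.xpu.get_device_name()}")

# 2. Load the model and move it to the GPU (model id and prompt are placeholders)
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("xpu")

# 3. Move the inputs to the GPU as well before generating
inputs = tokenizer("Once upon a time", return_tensors="pt").to("xpu")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```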
Other changes for performance measurement
To measure performance accurately, I also added some extra code to retrieve the total inference time and the maximum memory allocation. It mainly consists of a warm-up run of each model before the actual inference, plus some extra code that waits for the model to finish and prints the results in a human-readable way. Check the samples repository for more information and to replicate the results on your own machine.
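As a rough sketch of that measurement code, assuming the model and inputs are already on the xpu device as in the previous snippet, and that the xpu namespace mirrors the CUDA-style synchronization and memory helpers the extension exposes:

```python
import time
import torch

# Warm-up pass so one-time costs (kernel compilation, caching) don't skew the measurement
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=8)
torch.xpu.synchronize()               # wait for the warm-up to actually finish
torch.xpu.reset_peak_memory_stats()   # start the peak-memory counter from zero

# Timed inference
start = time.perf_counter()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
torch.xpu.synchronize()               # GPU work is asynchronous: wait before stopping the clock
elapsed = time.perf_counter() - start

max_mem_gb = torch.xpu.max_memory_allocated() / 1024**3
print(f"Execution time: {elapsed:.1f}s | Max memory used: {max_mem_gb:.1f}GB")
```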
Llama2
Llama2 is the second iteration of the popular open-source Llama LLM family by Meta. After preparing the environment and applying the changes described in the previous section to the official Llama2 samples, I was able to run the Llama2 model on the Intel Arc A770 GPU, both for plain inference and for chat/instruction tasks.
Running Llama2 7B on Intel Arc A770 GPU
The Llama2 7B model takes approximately 14GB of memory using float16 precision. As the GPU has 16GB available, we can run it without any issues. Below you can see the results of the inference sample, using a maximum of 128 tokens in the output.
Running Llama2 7B Chat on Intel Arc A770 GPU
Similarly, the Llama2 7B chat results were impressive, with the model generating human-like responses in a conversational tone. The chat sample ran smoothly on the Intel Arc A770 GPU, showcasing its capabilities for chat applications. In this case, the sample runs with 512 tokens in the output to further stress the hardware.
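As an illustration, a chat run could look like the sketch below. It assumes the meta-llama/Llama-2-7b-chat-hf checkpoint already loaded and moved to the xpu device as before, and a transformers version that ships the checkpoint's chat template; the conversation content is just an example and the official samples build the prompt differently.

```python
# Conversation as a list of chat messages (the content is only an example)
messages = [
    {"role": "system", "content": "You are a helpful, concise assistant."},
    {"role": "user", "content": "Explain in one paragraph why GPU memory matters for LLM inference."},
]

# Build the Llama2 chat prompt from the checkpoint's chat template and move it to the GPU
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("xpu")

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```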
Llama3
Llama3 is the latest iteration of the Llama LLM model by Meta, released a couple of months ago. Luckily, the Intel team was quick to include optimizations for it in the extension, so it was possible to leverage the full power of the Intel Arc A770 GPU. The process was quite similar to the one used for Llama2, using the same environment and official samples.
Running Llama3 8B on Intel Arc A770 GPU
The Llama3 8B model takes a little more than 15GB of memory using float16 precision. As the GPU has 16GB available, we can still run it without any issues. Below you can see the results of the inference sample, using a maximum of 64 tokens in the output.
Running Llama3 8B Instruct on Intel Arc A770 GPU
Following the Llama2 samples, I also tested the chat capabilities of the Llama3 8B Instruct model, increasing the output to 256 tokens.
Phi3
Phi3 is the latest model from Microsoft, released on April 24th and designed for instruction tasks. It is smaller than Llama2 and Llama3 (3.8B parameters in its smallest version), but it is still quite powerful, providing detailed and informative responses.
While Phi3 optimizations for Intel hardware are not yet included in the Intel Extension for PyTorch, we can use a third-party library, ipex-llm, to optimize the model. Since Phi3 is quite new, I had to install the prerelease version, which implements optimizations for all of Phi3's kernel operations. Note that ipex-llm is not a formal Intel library but a community-driven one, so it is not officially supported by Intel.
Once the model is optimized, the rest of the code modifications are the same as for Llama2 and Llama3, so I was able to run the Phi3 model on the Intel Arc A770 GPU without any issues.
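A minimal sketch of what that looks like with ipex-llm, assuming its transformers-style AutoModel wrapper with 4-bit loading; the prompt is a placeholder, and the linked sample in the references has the exact code.

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # drop-in wrapper that applies the optimizations

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Load with 4-bit quantization and move the optimized model to the Intel GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id, load_in_4bit=True, trust_remote_code=True
).to("xpu")

# From here on, the code is the same as for the Llama models
inputs = tokenizer("Write three tips for getting started with local LLMs.", return_tensors="pt").to("xpu")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```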
Running Phi3 4K Instruct on Intel Arc A770 GPU
The 4K model takes around 2.5GB of memory using 4-bit precision. As it has far fewer parameters than the Llama models, it is much faster to run. Below you can see the results of the inference sample, using a maximum of 512 tokens in the output.
Performance Comparison
To offer a thorough evaluation of the Intel Arc A770 GPU’s performance, I conducted a comparative analysis of execution time and memory usage for each model on both the Intel Arc A770 GPU and the NVIDIA RTX 3080 Ti.
The metrics were obtained using identical code samples and environment settings for both GPUs, ensuring a fair and accurate comparison. To better interpret the results, note that I didn’t use quantization in the Llama models (dtype float16). As they take more than 12GB of memory, the system had to use around 2-3GB of shared memory to compensate when running on the NVIDIA GPU, which has 12GB of VRAM. The Phi3 test, on the other hand, uses 4-bit quantization on both the NVIDIA and Intel runs.
Intel Arc A770
| Model | Output Tokens | Execution Time | Max Memory Used |
|-------|---------------|----------------|-----------------|
| meta-llama/Llama-2-7b-hf | 128 | ~7.7s | ~12.8GB |
| meta-llama/Llama-2-7b-chat-hf | 512 | ~22.1s | ~13.3GB |
| meta-llama/Meta-Llama-3-8B | 64 | ~11.5s | ~15.1GB |
| meta-llama/Meta-Llama-3-8B-Instruct | 256 | ~30.7s | ~15.2GB |
| microsoft/Phi-3-mini-4k-instruct | 512 | ~5.9s | ~2.6GB |
NVIDIA RTX 3080 Ti
| Model | Output Tokens | Execution Time | Max Memory Used |
|-------|---------------|----------------|-----------------|
| meta-llama/Llama-2-7b-hf | 128 | ~15.5s | ~12.8GB |
| meta-llama/Llama-2-7b-chat-hf | 512 | ~51.5s | ~13.3GB |
| meta-llama/Meta-Llama-3-8B | 64 | ~16.9s | ~15.1GB |
| meta-llama/Meta-Llama-3-8B-Instruct | 256 | ~66.7s | ~15.2GB |
| microsoft/Phi-3-mini-4k-instruct | 512 | ~16.7s | ~2.6GB |
Performance Comparison Chart
The graph below illustrates the normalized execution time per token for each model on both the Intel Arc A770 and NVIDIA RTX 3080 Ti GPUs.
As illustrated, the Intel Arc A770 GPU performed exceptionally well across all models, demonstrating competitive execution times. Notably, the Intel Arc A770 GPU outperformed the NVIDIA RTX 3080 Ti by a factor of two or more in most cases.
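For reference, the normalization is simply execution time divided by output tokens; the quick sketch below reproduces the per-token numbers from the tables above.

```python
# Per-token execution time, computed from the tables above
results = {
    # model: (output_tokens, arc_a770_seconds, rtx_3080_ti_seconds)
    "Llama-2-7b-hf":            (128,  7.7, 15.5),
    "Llama-2-7b-chat-hf":       (512, 22.1, 51.5),
    "Meta-Llama-3-8B":          ( 64, 11.5, 16.9),
    "Meta-Llama-3-8B-Instruct": (256, 30.7, 66.7),
    "Phi-3-mini-4k-instruct":   (512,  5.9, 16.7),
}

for name, (tokens, arc, rtx) in results.items():
    print(f"{name}: Arc A770 {arc / tokens * 1000:.0f} ms/token | "
          f"RTX 3080 Ti {rtx / tokens * 1000:.0f} ms/token | "
          f"speed-up x{rtx / arc:.1f}")
```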
Conclusion
The Intel Arc A770 GPU has proven to be a remarkable option for AI computation on a local Windows machine, offering an alternative to the CUDA/NVIDIA ecosystem. The GPU’s ability to efficiently run models like Llama2, Llama3, and Phi3 demonstrates its potential and robust performance capabilities. Despite initial setup challenges, the process was relatively straightforward, and the results were impressive.
In essence, the Intel Arc A770 GPU is a powerful tool for AI applications on Windows. With some initial setup and code adjustments, it handled inference, chat, and instruction tasks efficiently. This opens up new opportunities for developers and researchers who prefer or need to work within the Windows environment without relying on NVIDIA GPUs and CUDA. As Intel continues to enhance its GPU offerings and software support, the Arc A770 and future models are poised to become significant players in the AI community.
Useful links
The code samples used in this article can be found in the IntelArcA770 GitHub repository.
Below are also some resources that I find fundamental for diving deeper into the Intel hardware & libraries ecosystem for AI tasks.
- Intel YouTube Channel
- Tech.Decoded Library
- Christian Mills – Testing Intel’s Arc A770 GPU for Deep Learning
- SYCL Overview – The Khronos Group Inc
- Official documentation – PyTorch* Optimizations from Intel
- Intel AI Developer Tools and resources
References
- Sample code
- GitHub – intel/intel-extension-for-pytorch: A Python package for extending the official PyTorch that can easily obtain performance on Intel platform
- Intel® Extension for PyTorch* — Intel Extension for PyTorch* 2.1.30+xpu documentation
- Llama 2 Inference with PyTorch on Intel® Arc™ A-Series GPUs
- ipex-llm/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3 at main · intel-analytics/ipex-llm · GitHub