July 22, 2024

Maximizing AI Performance with Intel Arc A770 GPU on Windows

Synopsis

This article introduces the Intel Arc A770 GPU as a competitive option for intensive AI tasks, especially for those working within the Windows ecosystem. Traditionally, NVIDIA GPUs and CUDA have dominated this space, but Intel’s latest offering provides a robust alternative. This article provides practical guidance for working with the Arc A770 GPU natively on Windows, bypassing the need for the Windows Subsystem for Linux (WSL).

Through practical steps and detailed insights, we explore how to set up and optimize the Arc A770 GPU for various AI models, including Llama2, Llama3, and Phi3. The article also includes performance metrics and memory usage statistics, providing a comprehensive overview of the GPU’s capabilities. Whether you are a developer or researcher, this post will equip you with the knowledge to leverage Intel’s GPU for your AI projects efficiently and effectively.

Introduction

Intel recently provided me with the opportunity to test their Arc A770 GPU for AI tasks. While detailed specifications can be found here, one feature immediately stood out: 16GB of VRAM. This is 4GB more than its natural competitor, the NVIDIA RTX 3060, making it a compelling option for AI computations at a similar price point.

Intel Arc A770 GPU used for tests

At Plain Concepts, where we predominantly work with Microsoft technologies, I decided to explore the GPU’s capabilities on a Windows platform. Given my usual work with PyTorch, I began by utilizing the Intel Extension for PyTorch to see if it could run models like Llama2, Llama3, and Phi3, and to evaluate its performance.

Initially, I considered using the Windows Subsystem for Linux (WSL) based on suggestions from various blog posts and videos that indicated native Windows support might not be fully ready. However, I chose to first experiment with a native Windows setup, and after a few tweaks and adjustments, I was pleased to discover that everything worked seamlessly!


In this article, I will share my experiences and the steps I took to run Llama2, Llama3, and Phi3 models on the Intel Arc A770 GPU natively in Windows. I will also present performance metrics, including execution time and memory usage for each model. The goal is to provide a comprehensive overview of how the Intel Arc A770 GPU can be effectively used for intensive AI tasks on Windows.

Setup on Windows

Intel provides a comprehensive guide for installing the PyTorch extension for Arc GPUs.

Intel Extension for PyTorch install guide

However, setting up the Arc A770 GPU on Windows required some initial adjustments and troubleshooting. Here’s a brief summary of those adjustments. For detailed instructions, refer to the samples repository.

  • Since oneAPI requires setting up several environment variables from the CMD, I recommend installing the Pscx extension for PowerShell, which allows you to easily call CMD scripts.
  • When working on Windows with mamba, the PATH environment variable can become excessively long, causing issues when setting the oneAPI environment variables. To work around this, I included a setup_vars.ps1 script that sets the necessary oneAPI variables while avoiding the problem.
  • The Phi3 sample requires installing the prerelease version of the ipex-llm library, which implements optimizations for all kernel operations of Phi3. After installing this library, you must reinstall the transformers library.

Using the Intel Extension for PyTorch

As stated in its GitHub repository, “Intel® Extension for PyTorch extends PyTorch with up-to-date features and optimizations for an extra performance boost on Intel hardware”. Specifically, it “provides easy GPU acceleration for Intel discrete GPUs through the PyTorch xpu device”. This means that, by using this extension, you can leverage the Intel Arc A770 GPU for AI tasks without relying on CUDA/NVIDIA, and that you can get an even greater performance boost when using one of the optimized models.

Luckily, the extension follows the same API as PyTorch, so in general only a few changes are needed to get existing code running on the Intel GPU. Here is a brief summary of the changes:

  1. Check for the GPU

Import the Intel Extension for PyTorch and check whether the GPU is correctly detected.

This change is not strictly needed, but it is good practice to confirm that the device is available before running the model.
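
A minimal check might look like the following sketch. It relies on the xpu device namespace that the extension registers in PyTorch, and is not the exact code from the samples:

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device with PyTorch

# Confirm that the Arc GPU is visible before loading any model.
if torch.xpu.is_available():
    print(f"Found Intel GPU: {torch.xpu.get_device_name(0)}")
    print(f"torch {torch.__version__}, ipex {ipex.__version__}")
else:
    print("No Intel GPU detected; the model would run on the CPU.")
```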

  2. Move the model to the GPU

Once the model is loaded, move it to the GPU.
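
With a Hugging Face transformers model this is essentially a single .to("xpu") call. A sketch, using the Llama2 7B model discussed later in the article:

```python
import torch
import intel_extension_for_pytorch as ipex  # needed so the "xpu" device exists
from transformers import AutoModelForCausalLM

# Load the model in half precision and move it to the Intel GPU.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)
model = model.to("xpu")
model.eval()
```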

  3. Move inputs to the GPU

Finally, when using the model, ensure the input data is also on the GPU.
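
Continuing the previous sketch, the tokenized prompt has to live on the same device as the model:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Tokenize the prompt and move the input tensors to the "xpu" device as well.
inputs = tokenizer("Once upon a time", return_tensors="pt").to("xpu")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```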

Other changes for performance measurement

In order to measure performance accurately, I also added some extra code to retrieve the total inference time and the maximum memory allocation. It mainly consists of a warm-up run of each model before the actual inference, plus some code that waits for the model to finish and prints the results in a human-readable way. Check the samples repository for more information and to replicate the results on your own machine.
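
A rough sketch of that measurement logic, assuming the torch.xpu namespace mirrors the familiar torch.cuda timing and memory API (synchronize, reset_peak_memory_stats, max_memory_allocated):

```python
import time
import torch

def timed_generate(model, inputs, max_new_tokens):
    # Warm-up run so one-time initialization does not skew the measurement.
    model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.xpu.synchronize()
    torch.xpu.reset_peak_memory_stats()

    start = time.perf_counter()
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.xpu.synchronize()  # wait for the GPU to finish before stopping the clock
    elapsed = time.perf_counter() - start

    peak_gb = torch.xpu.max_memory_allocated() / 1024**3
    print(f"Inference time: {elapsed:.1f}s, max memory allocated: {peak_gb:.1f}GB")
    return output_ids
```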

Llama2

Llama2 is the second iteration of Meta's popular open-source Llama LLM. After preparing the environment and applying the changes described in the previous section to the official Llama2 samples, I was able to run the Llama2 model on the Intel Arc A770 GPU, both for plain inference and for instruction (chat) tasks.

Running Llama2 7B on Intel Arc A770 GPU

The Llama2 7B model takes approximately 14GB of memory using float16 precision. As the GPU has 16GB available, we can run it without any issues. Below you can see the results of the inference sample, using a maximum of 128 tokens in the output.
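
Putting the previous changes together, a minimal Llama2 7B inference script could look roughly like this. It is a sketch rather than the official sample, and it uses ipex.optimize, the extension's generic inference optimization entry point; the real samples may use a more model-specific optimization path:

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = model.eval().to("xpu")

# Apply the extension's inference optimizations to the model.
model = ipex.optimize(model, dtype=torch.float16)

prompt = "The Intel Arc A770 is"
inputs = tokenizer(prompt, return_tensors="pt").to("xpu")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```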

Running Llama2 7B Chat on Intel Arc A770 GPU

Similarly, the Llama2 7B chat results were impressive, with the model generating human-like responses in a conversational tone. The chat sample ran smoothly on the Intel Arc A770 GPU, showcasing its capabilities for chat applications. In this case, the sample runs with 512 tokens in the output to further stress the hardware.

Llama3

Llama3 is the latest iteration of the Llama LLM by Meta, released a couple of months ago. Luckily, the Intel team was quick to include optimizations for it in the extension, so it was possible to leverage the full power of the Intel Arc A770 GPU. The process was quite similar to the one used for Llama2, using the same environment and official samples.

Running Llama3 8B on Intel Arc A770 GPU

The Llama3 8B model takes a little more than 15GB of memory using float16 precision. As the GPU has 16GB available, we can still run it without any issues. Below you can see the results of the inference sample, using a maximum of 64 tokens in the output.

Running Llama3 8B Instruct on Intel Arc A770 GPU

Following the Llama2 samples, I also tested the chat capabilities of the Llama3 8B model, increasing the output tokens to 256.
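
For the instruct variant, the main extra step is formatting the conversation with the model's chat template. A sketch using the standard transformers apply_chat_template API, not necessarily the exact code from the official samples:

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = model.eval().to("xpu")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What makes the Intel Arc A770 interesting for AI workloads?"},
]
# Build the Llama3 chat prompt and move it to the GPU.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("xpu")

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```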

Phi3

Phi3 is the latest model from Microsoft, released on the 24th of April and designed for instruction tasks. It is smaller than Llama2 and Llama3 (the smallest version has 3.8B parameters), but it is still quite powerful, providing detailed and informative responses to instructions.

While Phi3 optimizations for Intel hardware are not yet included in the Intel Extension for PyTorch, we can use a third-party library, ipex-llm, to optimize the model. Since Phi3 is quite new, I had to install the prerelease version, which implements the optimizations for all of Phi3's kernel operations. Note that ipex-llm is not a formal Intel library but a community-driven one, so it is not officially supported by Intel.

Once the model is optimized, the rest of the code modifications are the same as for Llama2 and Llama3, so I was able to run the Phi3 model on the Intel Arc A770 GPU without any issues.
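
A sketch of the ipex-llm loading pattern, based on its documented drop-in transformers API, where load_in_4bit applies the 4-bit optimization used in these tests; the prompt and generation settings here are illustrative rather than copied from the sample:

```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM  # ipex-llm drop-in replacement
from transformers import AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# load_in_4bit=True converts the weights to 4-bit on load; the model then moves to the GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id, load_in_4bit=True, trust_remote_code=True
)
model = model.to("xpu")

messages = [{"role": "user", "content": "Explain in two sentences what 4-bit quantization does."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("xpu")

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```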

Running Phi3 4K Instruct on Intel Arc A770 GPU

The 4K model takes around 2.5GB of memory using 4-bit precision. As it has far fewer parameters than the Llama models, it is much faster to run. Below you can see the results of the inference sample, using a maximum of 512 tokens in the output.

Performance Comparison

To offer a thorough evaluation of the Intel Arc A770 GPU’s performance, I conducted a comparative analysis of execution time and memory usage for each model on both the Intel Arc A770 GPU and the NVIDIA RTX 3080 Ti.

The metrics were obtained using identical code samples and environment settings for both GPUs, ensuring a fair and accurate comparison. To interpret the results correctly, note that I did not use quantization for the Llama models (dtype float16). Since they take more than 12GB of memory, the NVIDIA GPU (which has 12GB of VRAM) had to use around 2-3GB of shared system memory to compensate. The Phi3 test, on the other hand, uses 4-bit quantization on both the NVIDIA and Intel runs.

Intel Arc A770

Model | Output Tokens | Execution Time | Max Memory Used
meta-llama/Llama-2-7b-hf | 128 | ~7.7s | ~12.8GB
meta-llama/Llama-2-7b-chat-hf | 512 | ~22.1s | ~13.3GB
meta-llama/Meta-Llama-3-8B | 64 | ~11.5s | ~15.1GB
meta-llama/Meta-Llama-3-8B-Instruct | 256 | ~30.7s | ~15.2GB
microsoft/Phi-3-mini-4k-instruct | 512 | ~5.9s | ~2.6GB

NVIDIA RTX 3080 Ti

Model | Output Tokens | Execution Time | Max Memory Used
meta-llama/Llama-2-7b-hf | 128 | ~15.5s | ~12.8GB
meta-llama/Llama-2-7b-chat-hf | 512 | ~51.5s | ~13.3GB
meta-llama/Meta-Llama-3-8B | 64 | ~16.9s | ~15.1GB
meta-llama/Meta-Llama-3-8B-Instruct | 256 | ~66.7s | ~15.2GB
microsoft/Phi-3-mini-4k-instruct | 512 | ~16.7s | ~2.6GB

Performance Comparison Chart

The graph below illustrates the normalized execution time per token for each model on both the Intel Arc A770 and NVIDIA RTX 3080 Ti GPUs.

*Margin of error: less than 0.1 seconds
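
Since the chart is an image, the same normalized figures can be derived directly from the tables above. The values are rough, as each total also includes prompt-processing time, but they give the same overall picture:

```python
# Execution times and output token counts copied from the tables above.
results = {
    # model: (output_tokens, seconds on Arc A770, seconds on RTX 3080 Ti)
    "Llama-2-7b-hf":            (128,  7.7, 15.5),
    "Llama-2-7b-chat-hf":       (512, 22.1, 51.5),
    "Meta-Llama-3-8B":          ( 64, 11.5, 16.9),
    "Meta-Llama-3-8B-Instruct": (256, 30.7, 66.7),
    "Phi-3-mini-4k-instruct":   (512,  5.9, 16.7),
}

for name, (tokens, arc_s, rtx_s) in results.items():
    print(
        f"{name:26s}  A770: {arc_s / tokens * 1000:6.1f} ms/token  "
        f"RTX 3080 Ti: {rtx_s / tokens * 1000:6.1f} ms/token  "
        f"speed-up: x{rtx_s / arc_s:.1f}"
    )
```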

As illustrated, the Intel Arc A770 GPU performed exceptionally well across all models, demonstrating competitive execution times. Notably, the Intel Arc A770 GPU outperformed the NVIDIA RTX 3080 Ti by a factor of two or more in most cases.

Conclusion

The Intel Arc A770 GPU has proven to be a remarkable option for AI computation on a local Windows machine, offering an alternative to the CUDA/NVIDIA ecosystem. The GPU’s ability to efficiently run models like Llama2, Llama3, and Phi3 demonstrates its potential and robust performance capabilities. Despite initial setup challenges, the process was relatively straightforward, and the results were impressive.

In essence, the Intel Arc A770 GPU is a powerful tool for AI applications on Windows. With some initial setup and code adjustments, it handled both plain inference and chat tasks efficiently. This opens up new opportunities for developers and researchers who prefer or need to work within the Windows environment without relying on NVIDIA GPUs and CUDA. As Intel continues to enhance its GPU offerings and software support, the Arc A770 and future models are poised to become significant players in the AI community.

Useful links

The code samples used in this article can be found in the IntelArcA770 GitHub repository.

Below are also some resources that I find essential for diving deeper into the Intel hardware and libraries ecosystem for AI tasks.

References

Author
Javier Carnero
Plain Concepts Research