
Maximizing LLM performance on Intel CPUs
With AI-ready processors such as the 5th Gen Intel Xeon (Emerald Rapids) and Intel's optimization libraries, Azure makes multimodal RAG pipelines more cost-effective.
Summary
Azure has announced the public preview of the D and E family virtual machines, powered by the new 5th Gen Intel® Xeon® processors with built-in Intel® AMX (Advanced Matrix Extensions). This accelerator enhances deep learning inference performance directly on the CPU, and Azure instances running on 5th Gen Intel Xeon can use it out of the box. These new virtual machines offer an excellent price-performance ratio for clients developing multimodal RAG solutions that integrate embeddings, large language models, and vision-language models.
This article explores how to set up and optimize a RAG pipeline that runs open-source models (without needing external APIs) while maintaining good performance. Finally, we compare the performance of machines with GPUs against different CPU models.
Introduction
Retrieval-augmented generation (RAG) combines information retrieval and text generation to deliver highly relevant, context-aware responses. It enhances traditional generative models by incorporating external knowledge from a database or corpus, ensuring more accurate and grounded outputs.
Key components of a RAG pipeline include:
- Embedding Model: Converts text or data into vector representations for efficient retrieval. Here, we use bge-large v1.5 for high-precision embeddings.
- Generative Model (LLM): This model produces natural language responses based on retrieved information. In this case, we use Llama 3.2 for its strong text generation capabilities.
- Vision-Language Model (VLM): Processes visual data and integrates it with textual inputs. We employ Phi-3.5 Vision to analyze images and generate detailed descriptions.

Figure 1. Sample Diagram of multimodal RAG pipeline components.
In this multimodal RAG pipeline, Phi-3.5 Vision enriches the system by enabling detailed image descriptions, which are integrated with retrieved text data. This setup enhances the pipeline’s ability to handle complex, multimodal queries, making it ideal for applications like multimedia analysis and decision support.
Optimization libraries
Intel provides various libraries designed to maximize performance in model execution, offering advanced optimizations for both CPUs and GPUs. Among these tools is IPEX-LLM, a PyTorch library designed to accelerate the inference and fine-tuning of large language models (LLMs) on Intel hardware, including CPUs, AI accelerators such as Gaudi, and GPUs such as the Arc series.
Built on top of the Intel Extension for PyTorch (IPEX), this library offers state-of-the-art optimizations and supports low-precision computations (FP8/FP4/INT4), enabling seamless integration with popular tools like llama.cpp, Ollama, HuggingFace Transformers, LangChain, LlamaIndex, and DeepSpeed.
The environment setup is straightforward if your pipeline runs only on an Intel CPU. No additional software installation is required (such as oneAPI for Intel GPUs); you only need to install ipex-llm from pip.
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
Embeddings
In Natural Language Processing (NLP), embeddings are numerical representations of text that capture semantic meaning, enabling models to process language more effectively. They facilitate tasks like text classification and information retrieval by representing words or sentences as vectors in a continuous space.
The bge-large-en-v1.5 model is an advanced English embedding model developed by the Beijing Academy of Artificial Intelligence (BAAI). Based on BERT architecture, it generates high-quality sentence embeddings, facilitating accurate similarity calculations and effective passage retrieval.
Intel has optimized the bge-large-en-v1.5 model with its IPEX-LLM library, enhancing performance on Intel CPUs and GPUs.
In this case, we can apply the optimization to our sentence transformer using the ipex-llm API, as shown in the sketch below.
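A minimal sketch of how this can look, assuming the model is loaded with the sentence-transformers package and optimized through ipex-llm's generic optimize_model entry point; the exact calls in the original pipeline may differ.

```python
# Hedged sketch: load bge-large-en-v1.5 and apply IPEX-LLM optimizations on CPU.
from sentence_transformers import SentenceTransformer
from ipex_llm import optimize_model  # generic optimization entry point

# Load the embedding model from Hugging Face.
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Apply IPEX-LLM optimizations to the underlying PyTorch modules.
embedder = optimize_model(embedder)

# Encode a few documents into dense vectors for retrieval.
docs = [
    "Intel AMX accelerates matrix math on 5th Gen Xeon CPUs.",
    "RAG combines retrieval with text generation.",
]
vectors = embedder.encode(docs, normalize_embeddings=True)
print(vectors.shape)  # (2, 1024) for bge-large
```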
These functions apply optimizations like merging KQV. In transformer models, this process refers to combining the separate linear transformations for queries (Q), keys (K), and values (V) into a single linear operation. This approach reduces computational overhead by performing one matrix multiplication instead of three, enhancing efficiency without compromising model performance.
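To illustrate the idea (this is not IPEX-LLM's internal code), here is a small, self-contained PyTorch sketch of fusing the three projection layers into one; the hidden size of 1024 matches bge-large but is used purely for illustration.

```python
import torch
import torch.nn as nn

hidden = 1024  # illustrative hidden size

# Separate projections: three matrix multiplications per token.
q_proj, k_proj, v_proj = (nn.Linear(hidden, hidden) for _ in range(3))

# Merged projection: one matrix multiplication producing Q, K and V together.
qkv_proj = nn.Linear(hidden, 3 * hidden)
with torch.no_grad():
    qkv_proj.weight.copy_(torch.cat([q_proj.weight, k_proj.weight, v_proj.weight], dim=0))
    qkv_proj.bias.copy_(torch.cat([q_proj.bias, k_proj.bias, v_proj.bias], dim=0))

x = torch.randn(1, 16, hidden)          # (batch, sequence, hidden)
q, k, v = qkv_proj(x).chunk(3, dim=-1)  # same result as calling the three layers separately
```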
You will find more details in IntelCPU: Intel CPU benchmarking on LLMs
Large Language Model
Large Language Models (LLMs) are pivotal in Retrieval-Augmented Generation (RAG) systems. In RAG, LLMs generate coherent and contextually relevant text by integrating user queries with retrieved external information. Llama 3.2-1B, officially released by Meta on September 25, 2024, is a 1-billion-parameter model optimized for mobile hardware, enabling developers to create AI-powered applications for smartphones and other devices.
Loading optimized models with IPEX-LLM is easy using the AutoModel API from Hugging Face Transformers, as in the sketch below.
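A minimal sketch, assuming IPEX-LLM's drop-in AutoModel classes and the sym_int4 low-bit option; the model ID and generation settings are illustrative.

```python
# Hedged sketch: load Llama 3.2-1B through IPEX-LLM's drop-in Transformers API.
import torch
from ipex_llm.transformers import AutoModelForCausalLM  # drop-in replacement for transformers
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # illustrative model ID

# load_in_low_bit applies symmetric INT4 quantization at load time.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_low_bit="sym_int4",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Summarize what Retrieval-Augmented Generation is."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```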
The ipex-llm extension for Transformers allows optimized models to be loaded directly. In this case, we are using symmetric INT4 quantization, but other low-bit options (INT5, INT8, etc.) are available in both symmetric and asymmetric variants.
Vision Language Model
In multimodal Retrieval-Augmented Generation (RAG) systems, vision language models play a crucial role by integrating visual inputs into the retrieval and generation processes. This enables the system to understand and generate content that combines both text and images, enhancing tasks such as image captioning, visual question answering, and multimodal content creation.
Phi-3.5 Vision is a state-of-the-art, open multimodal model developed by Microsoft. It integrates both text and visual data, enabling advanced reasoning across these modalities. The model is built upon high-quality datasets, including synthetic data and filtered publicly available websites, emphasizing reasoning-dense information in both text and vision domains.
A notable feature of Phi-3.5 Vision is its support for a context length of up to 128,000 tokens, allowing it to handle extensive multimodal information efficiently.
We are using the same opt_parameters as for Llama 3.2 to load our model, as shown in the sketch below.
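A minimal sketch of loading Phi-3.5 Vision with the same low-bit settings; the opt_parameters dictionary and model ID here are assumptions for illustration, not code from the original article.

```python
# Hedged sketch: load Phi-3.5 Vision with IPEX-LLM low-bit optimizations on CPU.
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"

# Same low-bit settings as for Llama 3.2 (illustrative dictionary).
opt_parameters = {"load_in_low_bit": "sym_int4", "trust_remote_code": True}

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="eager",  # fall back to the plain attention implementation
    **opt_parameters,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```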
Setting attn_implementation="eager" configures the model to utilize a manual, straightforward implementation of attention, often referred to as "eager" mode. This mode is generally more compatible across various hardware and software environments, ensuring broader support and stability. When deploying models like Phi-3.5 Vision, which may not support certain optimized attention mechanisms, specifying this parameter can help prevent compatibility issues.
Performance comparison
In this section, we compare the inference performance of each core component in the multimodal RAG pipeline—embedding, large language model (LLM), and vision-language model (VLM)—across different hardware configurations. This evaluation provides insights into choosing a cost-effective infrastructure to execute the entire pipeline.
The Azure virtual machines used for the benchmark are the D16s v6 (5th Gen Intel Xeon, Emerald Rapids), the D16s v4 (an earlier Intel Xeon generation), and the NC16as T4 v3 (NVIDIA Tesla T4 GPU). All the machines are in the East US 2 region.
Embeddings
The graph compares inference times for generating embeddings using the BGE Large EN model across different hardware configurations, measured for 80 documents.
While the Tesla T4 GPU excels at handling embeddings, particularly as batch sizes increase, Azure's D16s v6 with Intel Emerald Rapids CPUs offers a compelling balance of cost and performance. For the BGE Large EN benchmark, Intel AMX delivers somewhat slower performance but at almost 35% lower instance cost.
Large Language Model
This graph compares inference times for generating 750 tokens with a batch size of one.
The results show that Azure’s D16s v6 with Intel Emerald Rapids processors offers a balanced combination of cost and performance, coming very close to GPU-level efficiency for text generation.
Cost-performance comparison
We also evaluated the performance of this LLM model by measuring its tokens per second (t/s) rate. This metric quantifies the number of tokens the model can generate or process per second, clearly indicating its efficiency during inference.
To better understand the cost efficiency of the Llama 3.2 model across different hardware, we evaluated the cost per token. This metric calculates the cost in USD required to process one token per second, providing a clear picture of the economic implications of running the model on various hardware configurations. By comparing these values, we can identify which hardware offers the best balance between performance and cost.
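As an illustration of how these two metrics relate, here is a small sketch of the calculation, assuming an hourly VM price and a measured throughput; the numbers are placeholders, not the figures from the benchmark.

```python
# Hedged sketch: derive tokens/s and a cost-per-token-per-second figure.
def tokens_per_second(generated_tokens: int, elapsed_seconds: float) -> float:
    """Throughput: how many tokens the model produced per second."""
    return generated_tokens / elapsed_seconds

def cost_per_token_per_second(hourly_price_usd: float, throughput_tps: float) -> float:
    """USD spent per hour for each token/s of sustained throughput."""
    return hourly_price_usd / throughput_tps

# Placeholder values only; substitute the real Azure pricing and measured throughput.
tps = tokens_per_second(generated_tokens=750, elapsed_seconds=60.0)
print(cost_per_token_per_second(hourly_price_usd=1.0, throughput_tps=tps))
```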
Once both hardware efficiency and instance cost are taken into account, the D16s v6 shows a significantly lower cost per token than the other options, making it the most cost-effective choice; the NC16as T4 v3 and D16s v4 come in at similar, but higher, cost levels.
Vision Language Model
To showcase the capabilities of models like Phi-3.5 Vision in creating detailed image captions, we present a sample image below.

Figure 2. Image for detailed caption generation
The prompt for the following output was: “Generate a detailed caption for the image”.
“In the image, a young child is the central figure, standing in front of a rustic stone wall. The child is dressed in a pink and white striped dress, adding a touch of innocence and youth to the scene. In the child’s hands, they hold a white teddy bear, which is adorned with a pink tutu, symbolizing the child’s love for stuffed animals. The teddy bear is positioned in such a way that it appears to be dancing, adding a playful element to the image. The background is simple yet charming, featuring a stone wall.”
This graph compares inference times for generating a detailed caption of a single image. (Max tokens 130)
Traditionally, GPUs have been the go-to choice for vision-related AI tasks due to their superior performance in parallel workloads. However, as shown in this comparison, the limited memory capacity of GPUs can often prevent certain models from running efficiently or at all. In the previous graph, the inference time for the Nvidia T4 was not included due to memory constraints, which prevented the model from running properly on this GPU. In contrast, the memory typically available in standard servers is significantly higher than the 16 GB provided by GPUs like the Tesla T4. This makes CPUs and memory-rich servers a practical solution for larger models. Additionally, the D16s v6 machine achieves an impressive inference time for generating the image description, showcasing its efficiency and reliability for these tasks.
Production environments
The tests conducted in the previous section demonstrate how it is possible to run models that are part of a multimodal pipeline using only the CPU. If we need this pipeline to be usable by many users concurrently, we could use LLM servers such as vLLM to achieve higher performance from our model.
The following link shows how to configure vLLM on Intel Extension for PyTorch using only the CPU as the inference device.
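As a rough sketch of what CPU-only serving could look like (assuming a vLLM build with CPU support; the model ID and settings are illustrative, and the linked guide remains the authoritative reference):

```python
# Hedged sketch: offline batched inference with vLLM's Python API on a CPU-only build.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")  # illustrative model ID
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = ["Explain Retrieval-Augmented Generation in two sentences."]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```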
Conclusion
Intel’s 5th Gen Xeon Emerald Rapids CPUs, integrated with Azure’s latest virtual machines, are revolutionizing how Retrieval-Augmented Generation (RAG) pipelines are built and optimized. These processors, with advanced features like Intel Advanced Matrix Extensions (AMX), deliver robust performance for multimodal RAG tasks—such as embeddings, large language model (LLM) inference, and vision-language model (VLM) analysis—without the high costs typically associated with GPUs.
Emerald Rapids CPUs achieve inference speeds comparable to GPUs such as the Tesla T4, with minimal differences in performance across various RAG workloads. This makes Intel’s processors a cost-effective and high-performing option, enabling easier access to advanced AI technologies. Additionally, CPUs offer the advantage of simpler development and testing, as they do not require the installation of extra libraries or drivers, unlike GPUs. This is particularly useful in scenarios where local and production environments differ in hardware architecture.
The combination of Intel’s innovative CPU technology and Azure’s flexible infrastructure allows businesses to leverage open-source-friendly workflows, reduce dependency costs, and maintain control over their RAG implementations. With this approach, organizations can efficiently handle complex multimodal tasks and scale their pipelines to meet evolving demands.