
Maximizing LLM performance on Intel CPUs
With AI-ready processors such as the 5th Gen Intel Xeon (Emerald Rapids) and Intel's optimization libraries, Azure makes multimodal RAG pipelines more cost-effective.
Summary
Azure has announced the public preview of the D and E family virtual machines, powered by the new 5th Gen Intel® Xeon® processors with built-in Intel® AMX (Advanced Matrix Extensions). This accelerator enhances deep learning inference performance directly on the CPU, and Azure instances running on 5th Gen Intel Xeon can use it out of the box. These new virtual machines offer an excellent price-performance ratio for clients developing multimodal RAG solutions that integrate embeddings, large language models, and vision-language models.
This article explores how to set up and optimize a RAG pipeline that runs open-source models (without needing external APIs) while maintaining good performance. Finally, we compare the performance of machines with GPUs against different CPU models.
Introduction
Retrieval-augmented generation (RAG) combines information retrieval and text generation to deliver highly relevant, context-aware responses. It enhances traditional generative models by incorporating external knowledge from a database or corpus, ensuring more accurate and grounded outputs.
Key components of a RAG pipeline include:
- Embedding Model: Converts text or data into vector representations for efficient retrieval. Here, we use bge-large v1.5 for high-precision embeddings.
- Generative Model (LLM): This model produces natural language responses based on retrieved information. In this case, we use Llama 3.2 for its strong text generation capabilities.
- Vision-Language Model (VLM): Processes visual data and integrates it with textual inputs. We employ Phi-3.5 Vision to analyze images and generate detailed descriptions.

Figure 1. Sample Diagram of multimodal RAG pipeline components.
In this multimodal RAG pipeline, Phi-3.5 Vision enriches the system by enabling detailed image descriptions, which are integrated with retrieved text data. This setup enhances the pipeline’s ability to handle complex, multimodal queries, making it ideal for applications like multimedia analysis and decision support.
Optimization libraries
Intel provides various libraries designed to maximize performance in model execution, offering advanced optimizations for both CPUs and GPUs. Among these tools is IPEX-LLM, a PyTorch library designed to accelerate the inference and fine-tuning of large language models (LLMs) on Intel hardware, including CPUs, AI accelerators such as Gaudi, and GPUs such as the Arc series.
Built on top of the Intel Extension for PyTorch (IPEX), this library offers state-of-the-art optimizations and supports low-precision computations (FP8/FP4/INT4), enabling seamless integration with popular tools like llama.cpp, Ollama, HuggingFace Transformers, LangChain, LlamaIndex, and DeepSpeed.
The environment setup is straightforward if your pipeline runs only on an Intel CPU. No additional software installation is required (such as oneAPI for Intel GPUs); you only need to install ipex-llm from pip.
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
Embeddings
In Natural Language Processing (NLP), embeddings are numerical representations of text that capture semantic meaning, enabling models to process language more effectively. They facilitate tasks like text classification and information retrieval by representing words or sentences as vectors in a continuous space.
The bge-large-en-v1.5 model is an advanced English embedding model developed by the Beijing Academy of Artificial Intelligence (BAAI). Based on BERT architecture, it generates high-quality sentence embeddings, facilitating accurate similarity calculations and effective passage retrieval.
Intel has optimized the bge-large-en-v1.5 model with its IPEX-LLM library, enhancing performance on Intel CPUs and GPUs.
In this case, we can apply the optimization to our sentence transformer using the ipex-llm API, as shown in the sketch below.
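A minimal sketch of how this can look, assuming the model is loaded with the sentence-transformers package and optimized through ipex-llm's generic optimize_model entry point; the exact calls in the original pipeline may differ.

```python
# Hedged sketch: load bge-large-en-v1.5 and apply IPEX-LLM optimizations on CPU.
from sentence_transformers import SentenceTransformer
from ipex_llm import optimize_model  # generic optimization entry point

# Load the embedding model from Hugging Face.
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Apply IPEX-LLM optimizations to the underlying PyTorch modules.
embedder = optimize_model(embedder)

# Encode a few documents into dense vectors for retrieval.
docs = [
    "Intel AMX accelerates matrix math on 5th Gen Xeon CPUs.",
    "RAG combines retrieval with text generation.",
]
vectors = embedder.encode(docs, normalize_embeddings=True)
print(vectors.shape)  # (2, 1024) for bge-large
```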
These functions apply optimizations like merging KQV. In transformer models, this process refers to combining the separate linear transformations for queries (Q), keys (K), and values (V) into a single linear operation. This approach reduces computational overhead by performing one matrix multiplication instead of three, enhancing efficiency without compromising model performance.
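To illustrate the idea (this is not IPEX-LLM's internal code), here is a small, self-contained PyTorch sketch of fusing the three projection layers into one; the hidden size of 1024 matches bge-large but is used purely for illustration.

```python
import torch
import torch.nn as nn

hidden = 1024  # illustrative hidden size

# Separate projections: three matrix multiplications per token.
q_proj, k_proj, v_proj = (nn.Linear(hidden, hidden) for _ in range(3))

# Merged projection: one matrix multiplication producing Q, K and V together.
qkv_proj = nn.Linear(hidden, 3 * hidden)
with torch.no_grad():
    qkv_proj.weight.copy_(torch.cat([q_proj.weight, k_proj.weight, v_proj.weight], dim=0))
    qkv_proj.bias.copy_(torch.cat([q_proj.bias, k_proj.bias, v_proj.bias], dim=0))

x = torch.randn(1, 16, hidden)          # (batch, sequence, hidden)
q, k, v = qkv_proj(x).chunk(3, dim=-1)  # same result as calling the three layers separately
```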
You will find more details in IntelCPU: Intel CPU benchmarking on LLMs
Large Language Model
Large Language Models (LLMs) are pivotal in Retrieval-Augmented Generation (RAG) systems. In RAG, LLMs generate coherent and contextually relevant text by integrating user queries with retrieved external information. Llama 3.2-1B, officially released by Meta on September 25, 2024, is a 1-billion-parameter model optimized for mobile hardware, enabling developers to create AI-powered applications for smartphones and other devices.
Loading optimized models with IPEX-LLM is easy using the AutoModel API from Hugging Face Transformers, as in the sketch below.
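A minimal sketch, assuming IPEX-LLM's drop-in AutoModel classes and the sym_int4 low-bit option; the model ID and generation settings are illustrative.

```python
# Hedged sketch: load Llama 3.2-1B through IPEX-LLM's drop-in Transformers API.
import torch
from ipex_llm.transformers import AutoModelForCausalLM  # drop-in replacement for transformers
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # illustrative model ID

# load_in_low_bit applies symmetric INT4 quantization at load time.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_low_bit="sym_int4",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Summarize what Retrieval-Augmented Generation is."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```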
The ipex-llm extension for Transformers allows optimized models to be loaded directly. In this case, we are using symmetric INT4 quantization, but other low-bit options (INT5, INT8, etc.) are available in both symmetric and asymmetric variants.
Vision Language Model
In multimodal Retrieval-Augmented Generation (RAG) systems, vision language models play a crucial role by integrating visual inputs into the retrieval and generation processes. This enables the system to understand and generate content that combines both text and images, enhancing tasks such as image captioning, visual question answering, and multimodal content creation.
Phi-3.5 Vision is a state-of-the-art, open multimodal model developed by Microsoft. It integrates both text and visual data, enabling advanced reasoning across these modalities. The model is built upon high-quality datasets, including synthetic data and filtered publicly available websites, emphasizing reasoning-dense information in both text and vision domains.
A notable feature of Phi-3.5 Vision is its support for a context length of up to 128,000 tokens, allowing it to handle extensive multimodal information efficiently.
We are using the same opt_parameters as for Llama 3.2 to load our model, as shown in the sketch below.
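A minimal sketch of loading Phi-3.5 Vision with the same low-bit settings; the opt_parameters dictionary and model ID here are assumptions for illustration, not code from the original article.

```python
# Hedged sketch: load Phi-3.5 Vision with IPEX-LLM low-bit optimizations on CPU.
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"

# Same low-bit settings as for Llama 3.2 (illustrative dictionary).
opt_parameters = {"load_in_low_bit": "sym_int4", "trust_remote_code": True}

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="eager",  # fall back to the plain attention implementation
    **opt_parameters,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```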
Setting attn_implementation="eager" configures the model to utilize a manual, straightforward implementation of attention, often referred to as "eager" mode. This mode is generally more compatible across various hardware and software environments, ensuring broader support and stability. When deploying models like Phi-3.5 Vision, which may not support certain optimized attention mechanisms, specifying this parameter can help prevent compatibility issues.
Performance comparison
In this section, we compare the inference performance of each core component in the multimodal RAG pipeline—embedding, large language model (LLM), and vision-language model (VLM)—across different hardware configurations. This evaluation provides insights into choosing a cost-effective infrastructure to execute the entire pipeline.
The Azure virtual machines used for the benchmark are the D16s v6 (5th Gen Intel Xeon, Emerald Rapids), the D16s v4 (an earlier Intel Xeon generation), and the NC16as T4 v3 (NVIDIA Tesla T4 GPU). All the machines are in the East US 2 region.
Embeddings
The graph compares inference times for generating embeddings using the BGE Large EN model across different hardware configurations, measured for 80 documents.
While the Tesla T4 GPU excels at handling embeddings, particularly as batch sizes increase, Azure's D16s v6 with Intel Emerald Rapids CPUs offers a compelling balance of cost and performance. For the BGE Large EN benchmark, Intel AMX delivers somewhat slower performance but at almost 35% lower instance cost.
Large Language Model
This graph compares inference times for generating 750 tokens with a batch size of one.
The results show that Azure’s D16s v6 with Intel Emerald Rapids processors offers a balanced combination of cost and performance, coming very close to GPU-level efficiency for text generation.
Cost-performance comparison
We also evaluated the performance of this LLM model by measuring its tokens per second (t/s) rate. This metric quantifies the number of tokens the model can generate or process per second, clearly indicating its efficiency during inference.
To better understand the cost efficiency of the Llama 3.2 model across different hardware, we evaluated the cost per token. This metric calculates the cost in USD required to process one token per second, providing a clear picture of the economic implications of running the model on various hardware configurations. By comparing these values, we can identify which hardware offers the best balance between performance and cost.
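As an illustration of how these two metrics relate, here is a small sketch of the calculation, assuming an hourly VM price and a measured throughput; the numbers are placeholders, not the figures from the benchmark.

```python
# Hedged sketch: derive tokens/s and a cost-per-token-per-second figure.
def tokens_per_second(generated_tokens: int, elapsed_seconds: float) -> float:
    """Throughput: how many tokens the model produced per second."""
    return generated_tokens / elapsed_seconds

def cost_per_token_per_second(hourly_price_usd: float, throughput_tps: float) -> float:
    """USD spent per hour for each token/s of sustained throughput."""
    return hourly_price_usd / throughput_tps

# Placeholder values only; substitute the real Azure pricing and measured throughput.
tps = tokens_per_second(generated_tokens=750, elapsed_seconds=60.0)
print(cost_per_token_per_second(hourly_price_usd=1.0, throughput_tps=tps))
```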
Once both hardware efficiency and instance cost are taken into account, the D16s v6 shows a significantly lower cost per token than the other options, making it the most cost-effective choice; the NC16as T4 v3 and D16s v4 come in at similar, but higher, cost levels.
Vision Language Model
To showcase the capabilities of models like Phi-3.5 Vision in creating detailed image captions, we present a sample image below.

Figure 2. Image for detailed caption generation
The prompt for the following output was: “Generate a detailed caption for the image”.
“In the image, a young child is the central figure, standing in front of a rustic stone wall. The child is dressed in a pink and white striped dress, adding a touch of innocence and youth to the scene. In the child’s hands, they hold a white teddy bear, which is adorned with a pink tutu, symbolizing the child’s love for stuffed animals. The teddy bear is positioned in such a way that it appears to be dancing, adding a playful element to the image. The background is simple yet charming, featuring a stone wall.”
This graph compares inference times for generating a detailed caption of a single image. (Max tokens 130)
Traditionally, GPUs have been the go-to choice for vision-related AI tasks due to their superior performance in parallel workloads. However, as shown in this comparison, the limited memory capacity of GPUs can often prevent certain models from running efficiently or at all. In the previous graph, the inference time for the Nvidia T4 was not included due to memory constraints, which prevented the model from running properly on this GPU. In contrast, the memory typically available in standard servers is significantly higher than the 16 GB provided by GPUs like the Tesla T4. This makes CPUs and memory-rich servers a practical solution for larger models. Additionally, the D16s v6 machine achieves an impressive inference time for generating the image description, showcasing its efficiency and reliability for these tasks.
Production environments
The tests conducted in the previous section demonstrate how it is possible to run models that are part of a multimodal pipeline using only the CPU. If we need this pipeline to be usable by many users concurrently, we could use LLM servers such as vLLM to achieve higher performance from our model.
The following link shows how to configure vLLM on Intel Extension for PyTorch using only the CPU as the inference device.
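As a rough sketch of what CPU-only serving could look like (assuming a vLLM build with CPU support; the model ID and settings are illustrative, and the linked guide remains the authoritative reference):

```python
# Hedged sketch: offline batched inference with vLLM's Python API on a CPU-only build.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")  # illustrative model ID
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = ["Explain Retrieval-Augmented Generation in two sentences."]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```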
Conclusion
Intel’s 5th Gen Xeon Emerald Rapids CPUs, integrated with Azure’s latest virtual machines, are revolutionizing how Retrieval-Augmented Generation (RAG) pipelines are built and optimized. These processors, with advanced features like Intel Advanced Matrix Extensions (AMX), deliver robust performance for multimodal RAG tasks—such as embeddings, large language model (LLM) inference, and vision-language model (VLM) analysis—without the high costs typically associated with GPUs.
Emerald Rapids CPUs achieve inference speeds comparable to GPUs such as the Tesla T4, with minimal differences in performance across various RAG workloads. This makes Intel’s processors a cost-effective and high-performing option, enabling easier access to advanced AI technologies. Additionally, CPUs offer the advantage of simpler development and testing, as they do not require the installation of extra libraries or drivers, unlike GPUs. This is particularly useful in scenarios where local and production environments differ in hardware architecture.
The combination of Intel’s innovative CPU technology and Azure’s flexible infrastructure allows businesses to leverage open-source-friendly workflows, reduce dependency costs, and maintain control over their RAG implementations. With this approach, organizations can efficiently handle complex multimodal tasks and scale their pipelines to meet evolving demands.