February 20, 2025

GraphRAG: A new revolution for creating knowledge graphs with LLMs?

As Microsoft says, the biggest challenge and opportunity for LLMs is to extend their powerful problem-solving capabilities beyond the data they are trained on and achieve comparable results with data that the LLM has never seen.

This opens up new possibilities in data research, and one of the big breakthroughs is GraphRAG. Here’s what it is and how it works.  

What is GraphRAG?

Retrieval-Augmented Generation (RAG) is a technique that retrieves information relevant to a user query and supplies the results as context for generating an AI response.

This technique is an important part of most LLM-based tools, and most RAG approaches use vector similarity as the search technique.

A baseline RAG typically integrates a vector database and an LLM, where the vector database stores and retrieves contextual information for user queries, and the LLM generates answers based on the retrieved context. While this approach works well in many cases, it presents difficulties with complex tasks such as multi-hop reasoning or answering queries that require connecting different pieces of information.  
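The baseline loop described above can be sketched in a few lines. This is a toy illustration: `fake_embed` stands in for a real embedding model and the final prompt would be sent to an LLM, which is omitted here.

```python
import math
from collections import Counter

def fake_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # The "vector database" step: rank documents by similarity to the query.
    q = fake_embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, fake_embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "GraphRAG builds a knowledge graph from documents.",
    "Vector databases store embeddings for similarity search.",
]
context = retrieve("how are embeddings stored?", docs)
# The retrieved context would be placed in the LLM prompt:
prompt = f"Answer using this context:\n{context[0]}\n\nQuestion: how are embeddings stored?"
```

Note that this retriever can only surface chunks that are lexically or semantically close to the query, which is exactly the limitation discussed next.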

The main challenge faced by RAG is that it retrieves text based on semantic similarity, so it cannot directly answer complex queries whose specific details are not explicitly mentioned in the dataset. This limitation makes it difficult to find the exact information needed, often requiring costly and impractical workarounds such as manually curating sets of frequently asked questions and answers.

To address these challenges, Microsoft developed GraphRAG, which uses LLM-generated knowledge graphs to deliver substantial improvements in question-answering performance on complex document analysis tasks.

This research highlights the power of prompt augmentation when performing discovery on private datasets, defined as data the LLM has never been trained on and has never seen before, such as a company's internal documents or communications. The graph created by GraphRAG is used in conjunction with graph machine learning to perform prompt augmentation at query time. This yields a substantial improvement in answering both classes of queries, demonstrating a level of insight that outperforms approaches previously applied to private datasets.

Application of RAG to private datasets

Microsoft Research presented a study using the Violent Incident Information from News Articles (VINA) dataset, chosen for its complexity and the presence of differing opinions and biased information.

They used thousands of news articles from Russian and Ukrainian news sources from June 2023, translated into English, to create a private dataset on which they performed LLM-based retrieval. Since the dataset is far too large to fit in an LLM context window, a RAG approach is needed.

They start with an exploratory query posed to both a baseline RAG system and GraphRAG. Both systems perform well, so the conclusion is that, for a simple exploratory query, baseline RAG is sufficient.

With a query that requires connecting the dots, however, the baseline RAG fails to answer the question and returns an error. In comparison, GraphRAG discovers an entity in the query, which lets the LLM ground itself in the graph and generate a superior response containing provenance through links to the original supporting text. By using the LLM-generated knowledge graph, GraphRAG greatly improves the "retrieval" part of RAG, populating the context window with more relevant content, which results in better answers and captures the provenance of the evidence.

Microsoft GraphRAG: How does it work?

As noted above, Project GraphRAG is Microsoft Research's bet on deeply understanding text datasets, combining text extraction, network analysis, and LLM-based generation and summarization in a single end-to-end system.

Unlike basic RAG, which uses a vector database to retrieve semantically similar text, GraphRAG enhances this mechanism by incorporating knowledge graphs (KGs): data structures that store entities and link them according to their relationships.
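To make the idea concrete, a knowledge graph can be sketched as a set of (subject, relation, object) triples with a neighbor lookup. This is only an illustration; real GraphRAG stores far richer node and edge metadata, and the entity names below are made up for the example.

```python
from collections import defaultdict

class TinyKG:
    """A toy knowledge graph: triples plus an undirected neighbor index."""

    def __init__(self):
        self.triples = []
        self.neighbors = defaultdict(set)

    def add(self, subj: str, rel: str, obj: str) -> None:
        # Store the triple and index both endpoints for neighbor lookups.
        self.triples.append((subj, rel, obj))
        self.neighbors[subj].add(obj)
        self.neighbors[obj].add(subj)

    def related(self, entity: str) -> list[str]:
        # Entities directly connected to the given one.
        return sorted(self.neighbors[entity])

kg = TinyKG()
kg.add("GraphRAG", "developed_by", "Microsoft")
kg.add("GraphRAG", "uses", "knowledge graph")
```

Traversing these links is what lets a graph-based retriever connect facts that never co-occur in a single text chunk.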

A GraphRAG pipeline typically consists of two processes: indexing and querying. 

Indexing

This process includes four key steps:

  1. Segmentation into text units: the entire input corpus is divided into text units, which can be paragraphs, sentences, or other logical units. By segmenting large documents into smaller fragments, the pipeline can extract and retain more detailed information about the input data.
  2. Entity, relationship, and assertion extraction: an LLM is used to identify and extract all entities, their relationships, and the key assertions expressed in each text unit.
  3. Hierarchical clustering: the Leiden algorithm performs hierarchical clustering on the initial knowledge graph, assigning the entities in each cluster to different communities for further analysis.
  4. Community summary generation: summaries are generated for each community (a group of interconnected nodes within the graph) and its members using a bottom-up approach. These summaries cover the main entities within the community, their relationships, and key assertions.  
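The four steps can be sketched end to end with stand-ins for the expensive parts: a capitalized-word heuristic replaces LLM entity extraction, and connected components replace Leiden clustering. All names and the sample text are invented for the example.

```python
import re
from collections import defaultdict

def chunk(text: str) -> list[str]:
    # Step 1: segment the corpus into text units (here: sentences).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def extract_entities(unit: str) -> list[str]:
    # Step 2: stand-in for LLM extraction — capitalized words as entities.
    return re.findall(r"\b[A-Z][a-zA-Z]+\b", unit)

def build_graph(units: list[str]) -> dict[str, set]:
    # Link entities that co-occur in the same text unit.
    edges = defaultdict(set)
    for u in units:
        ents = extract_entities(u)
        for a in ents:
            for b in ents:
                if a != b:
                    edges[a].add(b)
    return edges

def communities(edges: dict[str, set]) -> list[list[str]]:
    # Step 3: connected components as a crude substitute for Leiden.
    seen, comms = set(), []
    for node in list(edges):
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(edges[n] - comp)
        seen |= comp
        comms.append(sorted(comp))
    return comms

text = "Microsoft built GraphRAG. GraphRAG uses Leiden clustering. Paris is in France."
comms = communities(build_graph(chunk(text)))
# Step 4 would then prompt an LLM to summarize each community bottom-up.
```

Even this crude version separates the GraphRAG-related entities from the unrelated Paris/France pair, which is the structure the community summaries are built on.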

Query

We encounter two different query workflows designed for different queries: 

  • Global search: reasons about holistic questions spanning the whole corpus by taking advantage of community summaries. It is recommended when users ask questions about the dataset as a whole, and includes the following phases:
    • User query and conversation history
    • Batches of community reports
    • Rated intermediate responses
    • Ranking and filtering
    • Final response
  • Local search: reasons about specific entities by fanning out to their neighbors and associated concepts:
    • User query
    • Search for similar entities
    • Text unit to entity mapping
    • Entity-relationship extraction
    • Entity-covariate mapping
    • Entity-community report mapping
    • Conversation history utilization
    • Response generation
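The global-search phases follow a map-reduce shape: rate each batch of community reports against the query (map), rank and filter the intermediate responses, then combine what survives into a final answer (reduce). The sketch below uses word overlap as a stand-in for the LLM rating and response-generation calls, and all summaries are invented.

```python
def rate(query: str, summary: str) -> int:
    # Map step: crude relevance rating via shared words
    # (a real system asks an LLM for a usefulness score).
    q = set(query.lower().split())
    s = set(summary.lower().split())
    return len(q & s)

def global_search(query: str, community_summaries: list[str], min_score: int = 2) -> str:
    # Rate every community summary against the query.
    rated = [(rate(query, s), s) for s in community_summaries]
    # Rank and filter: drop low-scoring intermediate responses.
    kept = [s for score, s in sorted(rated, reverse=True) if score >= min_score]
    # Reduce step: a real system would prompt an LLM with the kept summaries.
    return " ".join(kept)

summaries = [
    "community about conflict reporting in news sources",
    "community about weather patterns",
]
answer = global_search("what do the news sources report about conflict?", summaries)
```

Local search differs mainly in the map step: instead of scanning all community reports, it starts from entities matched to the query and pulls in their neighbors, related text units, and community reports.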

Automatic tuning of GraphRAG

GraphRAG uses an LLM to create a comprehensive knowledge graph detailing the entities and relationships in any collection of text documents. This graph lets you take advantage of the semantic structure of the data and generate answers to complex queries that require an understanding of the entire text. 

GraphRAG 1.0

Microsoft launched a preliminary version of GraphRAG in July 2024 and, thanks to the remarkable reception and collaboration of the community, has been improving it ever since, culminating in the official release of GraphRAG 1.0.

The main improvements have to do with ergonomic refactorings and availability: 

  • Easier configuration for new projects: reduced friction in configuration by adding a command that generates a simplified initial configuration file with all the basic configuration required already set up.
  • New and expanded command interface: they have achieved better online documentation and a more complete CLI experience, offering a more streamlined experience.
  • Consolidated API layer: still in a provisional phase, it is intended to be the main entry point for developers who wish to integrate GraphRAG functionality into their own applications without deep customization of the query classes or pipeline.
  • Simplified data models: GraphRAG creates several output artifacts to store the indexed knowledge model. Fixes have been incorporated to add clarity and consistency, remove redundant or unused fields, reduce storage space, and simplify the data models.
  • Optimized vector stores: embeddings and their vector stores are among the main drivers of storage needs. With the new pipeline update, a default vector store is created during indexing, so no post-processing is needed, and the query library shares this configuration for seamless use.
  • Flatter and clearer code structure: the code base has been simplified to make it easier to maintain and more accessible to external users. By reducing workflows, there are fewer unused output artifacts, reduced data duplication, and fewer disk I/O operations. It has also reduced the in-memory footprint of the script, allowing users to index and analyze larger data sets with GraphRAG.
  • Incremental ingest: A new update command has been included in the CLI that calculates deltas between an existing index and newly added content and intelligently merges updates to minimize reindexing.
  • Availability and migration: GraphRAG 1.0 is now available on GitHub and published on PyPI, so migrating to this version is recommended; it offers an optimized experience with multiple improvements for both users and developers.
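Putting the release notes together, a typical project setup might look like the following. Command names follow the GraphRAG 1.0 CLI as documented by Microsoft; check `graphrag --help` against your installed version, and note that the paths and the query text are placeholders.

```shell
# Install GraphRAG 1.0 from PyPI
pip install graphrag

# Easier configuration: generate a simplified initial config in ./ragtest
graphrag init --root ./ragtest

# Build the index (put input documents under ./ragtest/input first)
graphrag index --root ./ragtest

# Incremental ingest: merge newly added content into the existing index
graphrag update --root ./ragtest

# Query with the global-search workflow
graphrag query --root ./ragtest --method global --query "What are the main themes?"
```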

If you want to keep up to date with the latest news on AI and other technologies, subscribe to our newsletter!

Elena Canorea
Communications Lead