February 20, 2025

GraphRAG: A new revolution for creating knowledge graphs with LLMs?

As Microsoft says, the biggest challenge and opportunity for LLMs is to extend their powerful problem-solving capabilities beyond the data they are trained on and achieve comparable results with data that the LLM has never seen.

This opens up new possibilities in data research, and one of the big breakthroughs is GraphRAG. Here’s what it is and how it works.  

What is GraphRAG?

Retrieval-Augmented Generation (RAG) is a technique that retrieves information relevant to a user query and supplies the results as context for generating an AI response.

This technique is an important part of most LLM-based tools, and most RAG approaches use vector similarity as the search technique.

A baseline RAG typically integrates a vector database and an LLM, where the vector database stores and retrieves contextual information for user queries, and the LLM generates answers based on the retrieved context. While this approach works well in many cases, it presents difficulties with complex tasks such as multi-hop reasoning or answering queries that require connecting different pieces of information.  
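The baseline loop described above can be sketched in a few lines. This is a toy illustration: `fake_embed` stands in for a real embedding model and the final prompt would be sent to an LLM, which is omitted here.

```python
import math
from collections import Counter

def fake_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # The "vector database" step: rank documents by similarity to the query.
    q = fake_embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, fake_embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "GraphRAG builds a knowledge graph from documents.",
    "Vector databases store embeddings for similarity search.",
]
context = retrieve("how are embeddings stored?", docs)
# The retrieved context would be placed in the LLM prompt:
prompt = f"Answer using this context:\n{context[0]}\n\nQuestion: how are embeddings stored?"
```

Note that this retriever can only surface chunks that are lexically or semantically close to the query, which is exactly the limitation discussed next.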

The main challenge faced by RAG is that it retrieves text based on semantic similarity, so it cannot directly answer complex queries whose specific details are not explicitly mentioned in the dataset. This limitation makes it difficult to find the exact information needed, often requiring costly and impractical workarounds such as manually curating sets of frequently asked questions and answers.

To address these challenges, Microsoft developed GraphRAG, which uses LLM-generated knowledge graphs to deliver substantial improvements in question-answering performance on complex document analysis tasks.

This research highlights the power of prompt augmentation when performing discovery on private datasets, defined as data the LLM has never been trained on and has never seen before, such as a company's internal documents or communications. The graph created by GraphRAG is used in conjunction with graph machine learning to perform prompt augmentation at query time. This yields a substantial improvement in answering both classes of queries, demonstrating a level of insight that outperforms approaches previously applied to private datasets.

Application of RAG to private datasets

Microsoft Research presented a study using the Violent Incident Information from News Articles (VINA) dataset, chosen for its complexity and the presence of differing opinions and biased information.

They used thousands of news articles from Russian and Ukrainian news sources from June 2023, translated into English, to create a private dataset on which they performed LLM-based retrieval. Since the dataset is far too large to fit in an LLM context window, a RAG approach is needed.

They start with an exploratory query posed to both a baseline RAG system and GraphRAG. Both systems perform well, so the conclusion is that, for a simple exploratory query, baseline RAG is sufficient.

With a query that requires connecting the dots, however, the baseline RAG fails to answer the question and returns an error. In comparison, GraphRAG discovers an entity in the query, which lets the LLM ground itself in the graph and generate a superior response containing provenance through links to the original supporting text. By using the LLM-generated knowledge graph, GraphRAG greatly improves the "retrieval" part of RAG, populating the context window with more relevant content, which results in better answers and captures the provenance of the evidence.

Microsoft GraphRAG: How does it work?

As noted above, Project GraphRAG is Microsoft Research's bet on deeply understanding text datasets, combining text extraction, network analysis, and LLM-based generation and summarization in a single end-to-end system.

Unlike basic RAG, which uses a vector database to retrieve semantically similar text, GraphRAG enhances this mechanism by incorporating knowledge graphs (KGs): data structures that store entities and link them according to their relationships.
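To make the idea concrete, a knowledge graph can be sketched as a set of (subject, relation, object) triples with a neighbor lookup. This is only an illustration; real GraphRAG stores far richer node and edge metadata, and the entity names below are made up for the example.

```python
from collections import defaultdict

class TinyKG:
    """A toy knowledge graph: triples plus an undirected neighbor index."""

    def __init__(self):
        self.triples = []
        self.neighbors = defaultdict(set)

    def add(self, subj: str, rel: str, obj: str) -> None:
        # Store the triple and index both endpoints for neighbor lookups.
        self.triples.append((subj, rel, obj))
        self.neighbors[subj].add(obj)
        self.neighbors[obj].add(subj)

    def related(self, entity: str) -> list[str]:
        # Entities directly connected to the given one.
        return sorted(self.neighbors[entity])

kg = TinyKG()
kg.add("GraphRAG", "developed_by", "Microsoft")
kg.add("GraphRAG", "uses", "knowledge graph")
```

Traversing these links is what lets a graph-based retriever connect facts that never co-occur in a single text chunk.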

A GraphRAG pipeline typically consists of two processes: indexing and querying. 

Indexing

This process includes four key steps:

  1. Segmentation into text units: the entire input corpus is divided into text units, which can be paragraphs, sentences, or other logical units. By segmenting large documents into smaller fragments, the pipeline can extract and retain more detailed information about the input data.
  2. Entity, relationship, and assertion extraction: an LLM is used to identify and extract all entities, their relationships, and the key assertions expressed in each text unit.
  3. Hierarchical clustering: the Leiden algorithm performs hierarchical clustering on the initial knowledge graph, assigning the entities in each cluster to different communities for further analysis.
  4. Community summary generation: summaries are generated for each community (a group of interconnected nodes within the graph) and its members using a bottom-up approach. These summaries cover the main entities within the community, their relationships, and key assertions.  
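The four steps can be sketched end to end with stand-ins for the expensive parts: a capitalized-word heuristic replaces LLM entity extraction, and connected components replace Leiden clustering. All names and the sample text are invented for the example.

```python
import re
from collections import defaultdict

def chunk(text: str) -> list[str]:
    # Step 1: segment the corpus into text units (here: sentences).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def extract_entities(unit: str) -> list[str]:
    # Step 2: stand-in for LLM extraction — capitalized words as entities.
    return re.findall(r"\b[A-Z][a-zA-Z]+\b", unit)

def build_graph(units: list[str]) -> dict[str, set]:
    # Link entities that co-occur in the same text unit.
    edges = defaultdict(set)
    for u in units:
        ents = extract_entities(u)
        for a in ents:
            for b in ents:
                if a != b:
                    edges[a].add(b)
    return edges

def communities(edges: dict[str, set]) -> list[list[str]]:
    # Step 3: connected components as a crude substitute for Leiden.
    seen, comms = set(), []
    for node in list(edges):
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(edges[n] - comp)
        seen |= comp
        comms.append(sorted(comp))
    return comms

text = "Microsoft built GraphRAG. GraphRAG uses Leiden clustering. Paris is in France."
comms = communities(build_graph(chunk(text)))
# Step 4 would then prompt an LLM to summarize each community bottom-up.
```

Even this crude version separates the GraphRAG-related entities from the unrelated Paris/France pair, which is the structure the community summaries are built on.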

Query

We encounter two different query workflows designed for different queries: 

  • Global search: reasons about holistic questions spanning the whole corpus by taking advantage of community summaries. It is recommended when users ask questions about the dataset as a whole, and includes the following phases:
    • User query and conversation history
    • Batches of community reports
    • Rated intermediate responses
    • Ranking and filtering
    • Final response
  • Local search: reasons about specific entities by fanning out to their neighbors and associated concepts:
    • User query
    • Search for similar entities
    • Text unit to entity mapping
    • Entity-relationship extraction
    • Entity-covariate mapping
    • Entity-community report mapping
    • Conversation history utilization
    • Response generation
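The global-search phases follow a map-reduce shape: rate each batch of community reports against the query (map), rank and filter the intermediate responses, then combine what survives into a final answer (reduce). The sketch below uses word overlap as a stand-in for the LLM rating and response-generation calls, and all summaries are invented.

```python
def rate(query: str, summary: str) -> int:
    # Map step: crude relevance rating via shared words
    # (a real system asks an LLM for a usefulness score).
    q = set(query.lower().split())
    s = set(summary.lower().split())
    return len(q & s)

def global_search(query: str, community_summaries: list[str], min_score: int = 2) -> str:
    # Rate every community summary against the query.
    rated = [(rate(query, s), s) for s in community_summaries]
    # Rank and filter: drop low-scoring intermediate responses.
    kept = [s for score, s in sorted(rated, reverse=True) if score >= min_score]
    # Reduce step: a real system would prompt an LLM with the kept summaries.
    return " ".join(kept)

summaries = [
    "community about conflict reporting in news sources",
    "community about weather patterns",
]
answer = global_search("what do the news sources report about conflict?", summaries)
```

Local search differs mainly in the map step: instead of scanning all community reports, it starts from entities matched to the query and pulls in their neighbors, related text units, and community reports.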

Automatic tuning of GraphRAG

GraphRAG uses an LLM to create a comprehensive knowledge graph detailing the entities and relationships in any collection of text documents. This graph lets you take advantage of the semantic structure of the data and generate answers to complex queries that require an understanding of the entire text. 

GraphRAG 1.0

Microsoft launched a preliminary version of GraphRAG in July 2024 and, thanks to the remarkable reception and collaboration of the community, has been improving it ever since, culminating in the official release of GraphRAG 1.0.

The main improvements have to do with ergonomic refactorings and availability: 

  • Easier configuration for new projects: reduced friction in configuration by adding a command that generates a simplified initial configuration file with all the basic configuration required already set up.
  • New and expanded command interface: they have achieved better online documentation and a more complete CLI experience, offering a more streamlined experience.
  • Consolidated API layer: still in a provisional phase, it is intended to be the main entry point for developers who wish to integrate GraphRAG functionality into their own applications without deep customization of the query classes or pipeline.
  • Simplified data models: GraphRAG creates several output artifacts to store the indexed knowledge model. Fixes have been incorporated to add clarity and consistency, remove redundant or unused fields, reduce storage space, and simplify the data models.
  • Optimized vector stores: embeddings and their vector stores are among the main drivers of storage needs. With the new pipeline update, a default vector store is created during indexing, so no post-processing is needed, and the query library shares this configuration for seamless use.
  • Flatter and clearer code structure: the code base has been simplified to make it easier to maintain and more accessible to external users. By reducing workflows, there are fewer unused output artifacts, reduced data duplication, and fewer disk I/O operations. It has also reduced the in-memory footprint of the script, allowing users to index and analyze larger data sets with GraphRAG.
  • Incremental ingest: A new update command has been included in the CLI that calculates deltas between an existing index and newly added content and intelligently merges updates to minimize reindexing.
  • Availability and migration: GraphRAG 1.0 is now available on GitHub and published on PyPI, so migrating to this version is recommended; it offers an optimized experience with multiple improvements for both users and developers.
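Putting the release notes together, a typical project setup might look like the following. Command names follow the GraphRAG 1.0 CLI as documented by Microsoft; check `graphrag --help` against your installed version, and note that the paths and the query text are placeholders.

```shell
# Install GraphRAG 1.0 from PyPI
pip install graphrag

# Easier configuration: generate a simplified initial config in ./ragtest
graphrag init --root ./ragtest

# Build the index (put input documents under ./ragtest/input first)
graphrag index --root ./ragtest

# Incremental ingest: merge newly added content into the existing index
graphrag update --root ./ragtest

# Query with the global-search workflow
graphrag query --root ./ragtest --method global --query "What are the main themes?"
```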

If you want to keep up to date with the latest news on AI and other technologies, subscribe to our newsletter!

Elena Canorea
Communications Lead