Retrieval-Augmented Generation (RAG) extends Large Language Model capabilities by connecting them to external knowledge bases. Current implementations show query latency ranging from 100ms to several seconds, with retrieval accuracy varying significantly based on embedding quality and context window limitations. Storage requirements can reach multiple terabytes for comprehensive knowledge bases, while maintaining index freshness presents ongoing operational challenges.

The fundamental challenge lies in balancing retrieval accuracy and response latency while managing computational resources and maintaining data relevance.

This page brings together solutions from recent research—including hybrid vector-semantic search approaches, streaming retrieval architectures, efficient embedding techniques, and context optimization methods. These and other approaches focus on building practical, production-ready RAG systems that deliver reliable and timely responses.

1. Graph-Based Contextual Prompt Generation for Code Completion in Language Models

MICROSOFT TECHNOLOGY LICENSING LLC, 2025

Improving large language models like GPT-3 for software engineering tasks such as code completion by providing context-specific prompts that better reflect user intent. The technique generates code directives by traversing a graph representation of the user's source code, in which the graph captures usage and definition relationships between code elements. When a user queries the model, the graph is searched to find relevant nodes, and directives containing source code snippets and relationship descriptions are generated from those nodes. This customized prompt is then sent to the model for response generation.
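
For intuition, here is a minimal sketch of the idea (the toy graph, node names, and directive format are illustrative, not taken from the patent): a directed graph records usage/definition relationships between code elements, and directives are generated from the neighborhood of the node the user is working on.

```python
# Sketch of graph-based prompt directives; all names are hypothetical.
import networkx as nx

# Build a small code graph: nodes are code elements, edges are typed relations.
g = nx.DiGraph()
g.add_node("parse_config", snippet="def parse_config(path): ...")
g.add_node("load_yaml", snippet="def load_yaml(path): ...")
g.add_node("Config", snippet="class Config: ...")
g.add_edge("parse_config", "load_yaml", relation="calls")
g.add_edge("parse_config", "Config", relation="returns")

def build_directives(graph, focus):
    """Collect snippets and relationship descriptions around a focus node."""
    directives = [graph.nodes[focus]["snippet"]]
    for _, dst, data in graph.out_edges(focus, data=True):
        directives.append(
            f"# {focus} {data['relation']} {dst}:\n{graph.nodes[dst]['snippet']}"
        )
    return "\n".join(directives)

# When the user asks for a completion inside parse_config, prepend directives.
prompt = build_directives(g, "parse_config") + "\n# Complete the function body:"
print(prompt)
```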

2. Microservice Architecture for Language Models with Specialized Functional Segmentation

EDUWORKS CORP, 2025

Microservice architecture for language models that provides performance comparable to competing trillion parameter models on some tasks while using significantly less computational resources. The microservice architecture involves breaking down the language model into smaller, specialized microservices like expansion, retrieval, and data producers. These microservices leverage techniques like recursive retrieval augmentation and API-centric data access to expand and enhance client inputs. This allows leveraging smaller, more specialized models instead of massive monolithic ones, reducing computational burden.

Patent: US2025231934A1

3. Microservice Architecture for Language Model Copilots with Input Expansion, Retrieval, and Core Output Generation Services

EDUWORKS CORP, 2025

Microservice architecture for language model copilots that provides cognitive functionality comparable to large language models like GPT-4 but with lower computational burden and fewer artifacts. The copilot is composed of multiple microservices, including an expansion service to augment client inputs, a retrieval service to fetch relevant documents, and a core service to generate outputs. The expansion and retrieval services expand and enrich the client input before passing it to the core service. This allows the core service to work with improved data for better performance. The microservice architecture enables leveraging specialized services for specific tasks instead of relying on a single large model.

Patent: US2025231973A1

4. Natural Language Query Processing System Utilizing Task Decomposition, Knowledge Graph Entity Retrieval, and Vector-Based Text Chunk Search

INTERNATIONAL BUSINESS MACHINES CORP, 2025

Generating more accurate and efficient natural language responses to user queries using AI techniques like knowledge graphs, vector searches, and large language models (LLMs). The method involves breaking down user queries into tasks, searching a knowledge graph for relevant entities, then searching a vector database for text chunks associated with those entities. This refined search scope improves the accuracy and efficiency of the vector search. The LLM is then used to process the retrieved text chunks and generate a response.

Patent: US12346356B2
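
A minimal sketch of the claimed flow under toy assumptions (the knowledge graph, chunks, and embedding helpers below are all stand-ins): the query is decomposed into entities, and the vector search is then scoped to only the chunks linked to those entities.

```python
# Sketch: decompose query -> KG entity lookup -> entity-scoped vector search.
import numpy as np

np.random.seed(0)

# Toy knowledge graph: entity -> ids of text chunks that mention it.
kg = {"RAG": ["c1", "c3"], "vector search": ["c2", "c3"]}
chunk_text = {"c1": "RAG pipelines ...", "c2": "Vector search ...", "c3": "Scoping ..."}
chunk_vecs = {cid: np.random.rand(8) for cid in chunk_text}  # stand-in embeddings

def decompose(query):
    """Stand-in task decomposition; a real system would prompt an LLM."""
    return [e for e in kg if e.lower() in query.lower()]

def answer(query, embed=lambda s: np.random.rand(8)):
    entities = decompose(query)
    # Restrict the search space to chunks attached to the matched entities.
    candidates = {cid for e in entities for cid in kg[e]} or set(chunk_text)
    q = embed(query)
    ranked = sorted(candidates, key=lambda cid: -float(
        q @ chunk_vecs[cid] / (np.linalg.norm(q) * np.linalg.norm(chunk_vecs[cid]))))
    context = "\n".join(chunk_text[cid] for cid in ranked[:2])
    return f"[an LLM would answer from this context]\n{context}"

print(answer("How does RAG use vector search?"))
```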

5. A Systematic Review of Retrieval-Augmented Generation for Enhancing Domain-Specific Knowledge in Large Language Models

Murtiyoso Murtiyoso, Imam Tahyudin, Berlilana Berlilana - Politeknik Ganesha, 2025

This literature review examines the use of Retrieval-Augmented Generation (RAG) to enhance Large Language Models (LLMs) with domain-specific knowledge. RAG integrates retrieval techniques with generative models to access external knowledge sources, addressing the limitations of LLMs in handling specialized information. By leveraging external data, RAG improves the accuracy and relevance of generated content, making it particularly useful in fields that require detailed, up-to-date information. The review highlights RAG's effectiveness in overcoming challenges such as data sparsity and the dynamic nature of domain knowledge, and discusses its potential to enhance LLM performance, scalability, and the ability to generate contextually accurate responses in knowledge-intensive applications. Key future research directions and implementation considerations are also identified.

6. Enhancing Document-Level Question Answering via Multi-Hop Retrieval-Augmented Generation with LLaMA 3

Xinyue Huang, Ziqi Lin, Fang Sun, 2025

This paper presents a novel Retrieval-Augmented Generation (RAG) framework tailored for complex question answering tasks, addressing challenges in multi-hop reasoning and contextual understanding across lengthy documents. Built upon LLaMA 3, the framework integrates a dense retrieval module with advanced context fusion and reasoning mechanisms, enabling more accurate and coherent response generation. A joint optimization strategy combining retrieval likelihood and generation cross-entropy improves the model's robustness and adaptability. Experimental results show that the proposed system outperforms existing retrieval-augmented and generative baselines, confirming its effectiveness in delivering precise, contextually grounded answers.

7. Multimedia Graph Codes for Fast and Semantic Retrieval-Augmented Generation

Stefan Wagenpfeil - Multidisciplinary Digital Publishing Institute, 2025

Retrieval-Augmented Generation (RAG) has become a central approach to enhance the factual consistency and domain specificity of large language models (LLMs) by incorporating external context at inference time. However, most existing RAG systems rely on dense vector-based similarity, which fails to capture complex semantic structures, relational dependencies, and multimodal content. In this paper, we introduce Graph Codes, a matrix-based encoding of Multimedia Feature Graphs, as an alternative retrieval paradigm. Graph Codes explicitly preserve the topology of entities and their typed relationships from documents, enabling structure-aware and interpretable retrieval. We evaluate our system in two domains: scene understanding (200 annotated image-question pairs) and clinical question answering (150 real-world medical queries with 10,000 structured knowledge snippets). Results show that the method outperforms baselines in precision (+915%), reduces hallucination rates by over 30%, and yields higher expert-rated answer quality. Theoretically, this work demonstrates that symbolic similarity over graphs provides a more faithful alignment mechanism…

8. Conversational Agent for Medical Question-Answering Using RAG and LLM

La Ode Muhammad Yudhy Prayitno, Annisa Nurfadilah, Septiyani Bayu Saudi - Ioinformatic, 2025

This study analyzes the application of the RAG concept alongside an LLM, in the context of PubMed QA data, to augment question-answering capabilities in a medical setting. For answering questions relevant to private healthcare institutions, the Mistral 7B model was utilized. To limit hallucinations, embeddings were used for document indexing, ensuring that answers are based on the provided information. The analysis was conducted using five embedding models: two domain-specialized ones, PubMedBERT-base and BioLORD-2023, and three general-purpose ones, GIST-large-Embedding-v0, b1ade-embed-kd, and all-MiniLM-L6-v2. The results showed that the general-purpose models performed better than the domain-specific ones, especially GIST-large-Embedding-v0 and b1ade-embed-kd, which underscores the strength of general-purpose training datasets for fundamental semantic retrieval, even in specialized domains. The outcome of this research demonstrates that applying these models locally can safeguard privacy while still responding to queries with appropriate precision, thus establishing a foundation for a dependable system.

9. Large Language Model Processing with Iterative Attention Focusing and Context-Optimized Retrieval Techniques

VIJAY MADISETTI, 2025

Enhancing attention span of large language models (LLMs) when processing long documents and improving Retrieval-Augmented Generation (RAG) through context-optimized retrieval techniques. The methods involve iterative attention focusing, context-aware document processing, intelligent information retrieval, adaptive response generation, and dynamic adjustment of LLM's attention focus. It also uses functional abstraction through hierarchical tokens to improve context management and semantic coherence.

10. Combining Retrieval-Augmented Generation and Fine-tuning of Large Language Models to Enhance Port Industry Question-Answering Systems

Xiao Hu, Mideth Abisado - Whioce Publishing Pte Ltd., 2025

In this research, we develop a new hybrid architecture that combines Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) to address specific gaps in domain question answering systems for the maritime port industry. Our approach mitigates the limitations of generic LLMs on domain-specific queries through a combination of knowledge retrieval and industry-adaptive modelling. The evaluation protocol was both quantitative and qualitative, using expert judgement, and showed marked, justifiable gains over stand-alone approaches in factual correctness, accurate use of terminology, and compliance with relevant policies. The system achieved a 23% gain in nDCG@5 scores alongside over 90% contextually correct terminology use, while maintaining sub-second response times under typical operational loads. The experts consulted in the study were particularly impressed by the balance struck between precision and contextual understanding in complex scenarios, which enables decision-makers in critical environments to place greater trust in the system within their active contexts…

11. Automated Cybersecurity System with Large Language Models and Retrieval-Augmented Generation for Suspicious Activity Report Generation

CITIBANK NA, 2025

Automated system for generating suspicious activity reports (SARs) in cybersecurity using large language models (LLMs) and retrieval-augmented generation (RAG). The system improves SAR generation accuracy and consistency by leveraging LLMs and RAG to incorporate more relevant data into SARs. The system obtains event data, extracts features, queries historical data with similar features, composes the events, prompts the LLM, and generates the SAR. This provides the LLM with more contextualized data to generate accurate SARs without needing extensive training on SARs themselves.
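
A rough sketch of that pipeline under illustrative assumptions (the feature extraction, similarity logic, and prompt format below are stand-ins, not the patent's):

```python
# Sketch of the claimed SAR flow: extract features from an event, pull
# similar historical events, and compose a contextualized LLM prompt.
def extract_features(event):
    return {"account": event["account"], "amount_bucket": event["amount"] // 1000}

def find_similar(features, history):
    return [h for h in history
            if extract_features(h)["amount_bucket"] == features["amount_bucket"]]

def compose_sar_prompt(event, similar):
    lines = [f"Suspicious event: {event}"]
    lines += [f"Similar historical event: {h}" for h in similar]
    lines.append("Draft a suspicious activity report covering the pattern above.")
    return "\n".join(lines)

history = [{"account": "A1", "amount": 9500}, {"account": "B2", "amount": 120}]
event = {"account": "C3", "amount": 9900}
prompt = compose_sar_prompt(event, find_similar(extract_features(event), history))
print(prompt)  # this prompt would then be sent to an LLM to generate the SAR
```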

12. Language Model Output Enhancement via Structured Universal Language Integration

UNLIKELY ARTIFICIAL INTELLIGENCE LTD, 2025

Improving the output of large language models (LLMs) like GPT-3 by using a structured, machine-readable language like Universal Language (UL) to provide new context data for the LLM. This allows the LLM to generate more accurate and improved continuation text output in response to prompts. The UL representation allows more expressive and detailed meaning representation compared to natural language. The LLM uses UL output as prompts, and the UL processing system analyzes and improves the LLM's output. This provides more accurate and reliable LLM output.

Patent: US12321697B2

13. Implementing and Assessing Retrieval Augmented Generation (RAG) for LLM-Based Document Queries

Peter Kaczmarski, Fernand Vandamme - Routledge, 2025

In recent years, the AI-related technology referred to as RAG (Retrieval Augmented Generation) (Lewis, 2020) has gained a lot of attention. In the RAG approach, custom information sources are used to seed the knowledge obtained from an LLM (Large Language Model), thus forming an approach which solves the issue of adapting LLMs to cope with external information. Using a RAG scenario, various processing use cases can be implemented, such as AI-based document management, AI-enhanced web search, and online service support. This paper outlines the main components of the RAG workflow: chunking and embedding of input documents, as well as similarity-based user query processing. The workflow is illustrated via a Python implementation that validates the procedure on a simple multi-topic example document. Experimental results are discussed, showing the feasibility of this approach and illustrating the need for further research and enhancements, for instance via the RAPTOR concept (Sarthi, 2024).
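
A minimal sketch of such a chunk–embed–retrieve workflow (TF-IDF stands in for a neural embedding model so the example stays self-contained; the document and query are invented):

```python
# Minimal RAG workflow sketch: chunk, embed, retrieve by similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

document = ("RAG combines retrieval with generation. "
            "Chunking splits documents into passages. "
            "Embeddings map text to vectors for similarity search.")

# 1) Chunk: naive sentence-level splitting.
chunks = [s.strip() + "." for s in document.split(".") if s.strip()]

# 2) Embed: fit a vectorizer over the chunks.
vectorizer = TfidfVectorizer().fit(chunks)
chunk_matrix = vectorizer.transform(chunks)

# 3) Retrieve: rank chunks by cosine similarity to the user query.
query = "How are documents split for retrieval?"
scores = cosine_similarity(vectorizer.transform([query]), chunk_matrix)[0]
best = max(range(len(chunks)), key=scores.__getitem__)
print(chunks[best])  # context that would be passed to the LLM
```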

14. SG-RAG MOT: SubGraph Retrieval Augmented Generation with Merging and Ordering Triplets for Knowledge Graph Multi-hop Question Answering

Anwar Saleh, Gokhan Tur, Yucel Saygin, 2025

Large Language Models (LLMs) often tend to hallucinate, especially on domain-specific tasks and tasks that require reasoning. Previously, we introduced SubGraph Retrieval Augmented Generation (SG-RAG) as a novel GraphRAG method for multi-hop question answering. SG-RAG leverages Cypher queries to search the given knowledge graph and retrieve the subgraph necessary to answer the question. The results from our previous work showed the higher performance of SG-RAG compared to traditional RAG. In this work, we further enhance SG-RAG by proposing an additional step called Merging and Ordering Triplets (MOT). The MOT step seeks to decrease redundancy in the retrieved triplets by applying hierarchical merging to the subgraphs, and provides an ordering among the triplets using a Breadth-First Search (BFS) traversal algorithm. We conducted experiments on the MetaQA benchmark, a question-answering benchmark in the movies domain. Our results show more accurate answers than Chain-of-Thought and Graph Chain-of-Thought, and we find that (up to some point) merging highly overlapping subgraphs and defining an order among the triplets helps the LLM generate more precise answers.
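
A compact sketch of the MOT idea under simplifying assumptions (a single union pass replaces the paper's hierarchical merging, and the toy triplets are invented):

```python
# Sketch of Merging and Ordering Triplets: merge overlapping subgraphs of
# (head, relation, tail) triplets, then order triplets by BFS traversal.
import networkx as nx

# Toy subgraphs, as might be returned by separate Cypher queries.
subgraphs = [
    {("Inception", "directed_by", "Nolan"), ("Nolan", "born_in", "London")},
    {("Inception", "directed_by", "Nolan"), ("Inception", "released_in", "2010")},
]

# Merge: a set union removes duplicated triplets across overlapping subgraphs.
merged = set().union(*subgraphs)

# Order: BFS from the question entity so related facts appear close together.
g = nx.DiGraph()
g.add_edges_from((h, t, {"rel": r}) for h, r, t in merged)
ordered = [(h, g.edges[h, t]["rel"], t) for h, t in nx.bfs_edges(g, "Inception")]
for triplet in ordered:
    print(triplet)  # linearized, ordered context for the LLM prompt
```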

15. Retrieval-Augmented Generation System with Reconfigurable Ranker Sequence and Self-Rewarding Optimization Techniques

GOLDMAN SACHS & CO LLC, 2025

Optimizing retrieval-augmented generation (RAG) systems with a reconfigurable sequence of rankers in the retriever model to improve the quality of information chunks provided to the generative model. The rankers in the reconfigurable sequence are bi-encoders, cross-encoders, and an LLM-ranker. The rankers identify relevant chunks from documents for user queries. This allows more-relevant information chunks to be provided to the generative model, increasing output quality. The rankers can be reconfigured to optimize chunk selection. Additionally, self-rewarding optimization techniques are provided to train the entire RAG system using rewards based on generated responses.
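
A minimal sketch of a reconfigurable ranker cascade (the scoring functions below are cheap stand-ins; real stages would be a bi-encoder, a cross-encoder, and an LLM ranker, in whatever order the configuration specifies):

```python
# Sketch: each ranker stage reorders and prunes candidates for the next.
def bi_encoder(query, chunks):      # cheap: score chunks independently
    return sorted(chunks, key=lambda c: -len(set(query.split()) & set(c.split())))

def cross_encoder(query, chunks):   # pricier: jointly score query+chunk pairs
    return sorted(chunks, key=lambda c: -sum(c.count(w) for w in query.split()))

def llm_ranker(query, chunks):      # priciest: would prompt an LLM to pick
    return chunks                    # identity stand-in

def retrieve(query, chunks, rankers, keep=(10, 5, 3)):
    for ranker, k in zip(rankers, keep):
        chunks = ranker(query, chunks)[:k]   # narrow the candidate pool
    return chunks

pipeline = [bi_encoder, cross_encoder, llm_ranker]   # reconfigurable order
docs = ["rag systems rank chunks", "llm output quality", "chunk selection"]
print(retrieve("rank chunks", docs, pipeline))
```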

16. Retrieval Augmented Generation System with Multi-Query Expansion and Vector Search Fusion

ELSEVIER INC, 2025

Retrieval augmented generation (RAG) system that improves search results by generating multiple natural language queries from a user's input, performing vector searches on both the user queries and the generated queries, and fusing the results to compile a more comprehensive and accurate search result. The system involves inputting user queries into a large language model to generate distinct associated queries, performing vector searches on all queries, compiling the results into a fused search, and summarizing the fused results. This bridges the gap between explicit and implicit search intent, leveraging the language model to expand queries and the vector search to fuse results.
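
A sketch of the idea, assuming reciprocal rank fusion as the fusion step (the patent only says the per-query results are fused; RRF is one common choice) and stand-in expansion/search helpers:

```python
# Sketch: LLM-style query expansion + per-query vector search + RRF fusion.
from collections import defaultdict

def expand(query):                   # an LLM would generate these variants
    return [query, query + " tutorial", query + " examples"]

def vector_search(query, k=5):       # stand-in for a real vector index
    corpus = ["rag overview", "rag tutorial", "vector db examples", "llm basics"]
    return sorted(corpus,
                  key=lambda d: -len(set(query.split()) & set(d.split())))[:k]

def fused_search(user_query, rrf_k=60):
    scores = defaultdict(float)
    for q in expand(user_query):
        for rank, doc in enumerate(vector_search(q)):
            scores[doc] += 1.0 / (rrf_k + rank + 1)   # RRF accumulation
    return sorted(scores, key=scores.get, reverse=True)

print(fused_search("rag"))  # fused result list, ready for summarization
```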

17. Document GraphRAG: Knowledge Graph Enhanced Retrieval Augmented Generation for Document Question Answering Within the Manufacturing Domain

Simon Knollmeyer, Oguz Caymazer, Daniel Grossmann - Multidisciplinary Digital Publishing Institute, 2025

Retrieval-Augmented Generation (RAG) systems have shown significant potential for domain-specific Question Answering (QA) tasks, although persistent challenges in retrieval precision and context selection continue to hinder their effectiveness. This study introduces Document Graph RAG (GraphRAG), a novel framework that bolsters retrieval robustness and enhances answer generation by incorporating Knowledge Graphs (KGs), built upon documents' intrinsic structure, into the pipeline. Through the application of the Design Science Research methodology, we systematically design, implement, and evaluate GraphRAG, leveraging graph-based document structuring and a keyword-based semantic linking mechanism to improve retrieval quality. The evaluation, conducted on well-established datasets including SQuAD, HotpotQA, and a newly developed manufacturing dataset, demonstrates consistent performance gains over a naive RAG baseline across both metrics. The results indicate that GraphRAG improves Context Relevance metrics, with task-dependent optimizations of chunk size, keyword density, and top-k further enhancing performance. Notably, multi-hop questions benefit most…

18. Integrating pre-trained LLMs with RAG for efficient content retrieval

Tran Trong Kien, Khau Van Bich - Lac Hong University, 2025

Large Language Models (LLMs) are highly effective at replicating human tasks and boosting productivity but face challenges in accurate data extraction due to prioritizing fluency over factual precision. Researchers are addressing these limitations by combining LLMs with Retrieval-Augmented Generation (RAG) models. This approach utilizes chunking, searching, and ranking algorithms to streamline retrieval from unstructured text, improving the precision of LLM processing. The findings provide key insights into optimizing chunking strategies and set the stage for the advancement and broader application of RAG-enhanced systems.

19. A Comprehensive Evaluation of Embedding Models and LLMs for IR and QA Across English and Italian

Ermelinda Oro, Francesco Granata, Massimo Ruffolo - Multidisciplinary Digital Publishing Institute, 2025

This study presents a comprehensive evaluation of embedding techniques and large language models (LLMs) for Information Retrieval (IR) and question answering (QA) across languages, focusing on English and Italian. We address a significant research gap by providing empirical evidence of model performance across linguistic boundaries. We evaluate embedding models on 12 diverse IR datasets, including Italian SQuAD, DICE, SciFact, ArguAna, and NFCorpus, and assess four LLMs (GPT-4o, Llama-3.1 8B, Mistral-Nemo, Gemma-2b) on QA tasks within a retrieval-augmented generation (RAG) pipeline, testing them on SQuAD, CovidQA, and NarrativeQA in cross-lingual scenarios. The results show that multilingual models perform more competitively than language-specific ones: embed-multilingual-v3.0 achieves top nDCG@10 scores of 0.90 and 0.86. In the QA evaluation, Mistral-Nemo demonstrates superior answer relevance (0.91–1.0) while maintaining strong groundedness (0.64–0.78). Our analysis reveals three key findings: (1) multilingual models effectively bridge gaps between English and Italian, though consistency decreases in specialized domains, (2) model size does not consistently predict performance, and (3) all evaluated systems exhibit critical…

20. AI-Based Search System with Intelligent Agents and Dynamic Knowledge Base Construction Using Retrieval Augmented Generation

ARTI ANALYTICS INC, 2025

An AI-powered search system that uses intelligent agents to find curated, directly useful information from disparate sources like websites, documents, and databases. The system uses AI techniques like Retrieval Augmented Generation (RAG) to analyze search queries and dynamically build a knowledge base of indexed information. It enables features like automated form filling, restricted access, and personalization. The system also learns and grows its knowledge base based on user confirmations and AI processing.

Patent: US12306834B1

21. Next Sentence Prediction with BERT as a Dynamic Chunking Mechanism for Retrieval-Augmented Generation Systems

Alexandre Thurow Bender, Gabriel Gomes, Ulisses Brisolara Correa - George A. Smathers Libraries, 2025

Retrieval-Augmented Generation systems enhance the generative capabilities of large language models by grounding their responses in external knowledge bases, addressing some major limitations and improving reliability for tasks requiring factual accuracy or domain-specific information. Chunking is a critical step in these pipelines, where text is divided into smaller segments to facilitate efficient retrieval and optimize use of the model context. This paper introduces a method that uses BERT's Next Sentence Prediction to adaptively merge related sentences into context-aware chunks. We evaluate the approach on the SQuAD v2 dataset, comparing it to standard chunking methods using Recall@k, Precision@k, Contextual-Precision@k, and processing time as metrics. Results indicate the proposed method achieves competitive performance while reducing computational cost by roughly 60%, demonstrating its potential to improve RAG systems.
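
A minimal sketch of NSP-driven chunking using Hugging Face's pretrained BERT NSP head (the threshold and sentence splitting are illustrative choices, not the paper's exact settings):

```python
# Sketch: merge adjacent sentences into one chunk while BERT's next-sentence
# head judges them to be a continuation.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased").eval()

def is_continuation(sent_a, sent_b, threshold=0.5):
    enc = tok(sent_a, sent_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits
    # Label 0 means "sentence B follows sentence A" in the NSP head.
    return torch.softmax(logits, dim=1)[0, 0].item() > threshold

def nsp_chunks(sentences):
    chunks, current = [], sentences[0]
    for nxt in sentences[1:]:
        if is_continuation(current, nxt):
            current += " " + nxt      # merge related sentences
        else:
            chunks.append(current)    # start a new context-aware chunk
            current = nxt
    chunks.append(current)
    return chunks

sents = ["RAG grounds LLM answers.", "It retrieves external passages.",
         "Unrelated: the weather was sunny."]
print(nsp_chunks(sents))
```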

22. Context is Key: Aligning Large Language Models with Human Moral Judgments through Retrieval-Augmented Generation

Matthew Boraske, Richard Burns - George A. Smathers Libraries, 2025

In this paper, we investigate whether pre-trained large language models (LLMs) can align with human moral judgments on a dataset of approximately fifty thousand interpersonal conflicts from the AITA (Am I A******) subreddit, an online forum where users evaluate the morality of others. We introduce a retrieval-augmented generation (RAG) approach that uses LLMs as core components. After collecting conflict posts and embedding them in a vector database, the RAG agent retrieves the most relevant posts for each new query. These are then used sequentially as context to gradually refine the LLM's judgment, providing adaptability without costly fine-tuning. Using OpenAI's GPT-4o, our approach outperforms directly prompting the LLM, achieving 83% accuracy and a Matthews correlation coefficient of 0.469 while reducing the rate of toxic responses from 22.53% to virtually zero. These findings indicate that integrating RAG into agents is an effective method to improve their alignment with human moral judgments while mitigating toxic language.

23. Retrieval Augmented Generation: What Works and Lessons Learned

Peter L. Elkin, Gaurang Mehta, Frank Lehouillier - IOS Press, 2025

Retrieval Augmented Generation has been shown to improve the output of large language models (LLMs) by providing context for the question or scenario posed to the model. We ran a series of experiments to understand how best to improve performance over native models, and we present the results of each of several experiments. These can serve as lessons learned for scientists tackling medical question answering tasks.

24. Enhancing Large Language Models for Specialized Domains: A Two-Stage Framework with Parameter-Sensitive LoRA Fine-Tuning and Chain-of-Thought RAG

Yao He, Xuanbing Zhu, Donghan Li - Multidisciplinary Digital Publishing Institute, 2025

Large language models (LLMs) have shown impressive general-purpose capabilities, but their application in specialized domains such as healthcare and law remains limited due to two major challenges, namely, a lack of deep domain-specific knowledge and the inability to incorporate real-time information updates. This paper addresses these challenges by combining parameter-sensitive low-rank adaptation (LoRA) with retrieval-augmented generation (RAG) in SensiLoRA-RAG, a two-stage framework designed to enhance LLM performance on question-answering tasks. In the first stage, we propose a parameter-sensitive LoRA fine-tuning method that efficiently adapts LLMs using high-quality professional data, enabling rapid and resource-efficient specialization. In the second stage, we develop a chain-of-thought RAG mechanism that dynamically retrieves and integrates up-to-date external knowledge, improving the model's ability to reason with current and complex domain context. We evaluate our framework on tasks in the medical and legal fields, demonstrating that SensiLoRA-RAG significantly improves answer accuracy, relevance, and adaptability compared to baseline methods.

25. Building an LLM Agent for Life Sciences Literature QA and Summarization

Nishanth Joseph Paulraj - GSC Online Press, 2025

This article explores the development of a specialized artificial intelligence agent that combines Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) techniques to address challenges in biomedical literature search and synthesis. The unprecedented growth of published research in life sciences has created an information crisis that traditional methods cannot effectively manage. Researchers face significant challenges including overwhelming volume, domain-specific terminology barriers, difficulty making cross-study connections, and severe time constraints. The proposed LLM+RAG architecture offers a comprehensive solution featuring document processing for scientific papers, biomedical-specific vector embeddings, advanced retrieval strategies, and sophisticated reasoning capabilities. The system integrates PubMed and other databases while providing natural language interfaces that significantly reduce the cognitive burden on researchers. Domain-specific optimizations such as entity recognition, relationship extraction, and specialized embeddings further enhance performance across diverse scenarios. Evaluation through benchmark testing…

26. Comparison of Large Language Models’ Performance on 600 Nuclear Medicine Technology Board Examination–Style Questions

Michael A. Oumano, Shawn M. Pickett - Society of Nuclear Medicine and Molecular Imaging, 2025

This study investigated the application of large language models (LLMs) with and without retrieval-augmented generation (RAG) in nuclear medicine, particularly their performance across various topics relevant to the field, to evaluate their potential use as reliable tools for professional education and clinical decision-making. Methods: We evaluated LLMs, including the OpenAI GPT-4o series, Google Gemini, Cohere, Anthropic, and Meta Llama3, across 15 nuclear medicine topics. The models' accuracy was assessed using a set of 600 sample questions covering a range of technical domains in nuclear medicine. Overall accuracy was measured by averaging these topic scores, and additional comparisons were conducted between individual models. Results: OpenAI's models, openai_nvidia_gpt-4o_final and openai_mxbai_gpt-4o_final, demonstrated the highest overall accuracy, achieving scores of 0.787 and 0.783, respectively, when RAG was implemented. Anthropic Opus and Gemini 1.5 Pro followed closely, with competitive scores of 0.773 and 0.750 with RAG. Cohere and Llama3 showed more variability in performance, with the ollama_llama3 model (without RAG) having the lowest accuracy. Discrepancies were noted in question interpretation, complex guidelines, and imaging-based queries…

27. Enhancing Large Language Model Performance on ENEM Math Questions Using Retrieval-Augmented Generation

João Superbi, H. Sofia Pinto, Emanoel Santos - Sociedade Brasileira de Computação - SBC, 2024

In this study, we explore the use of Retrieval-Augmented Generation (RAG) to improve the performance of large language models (LLMs), such as GPT-3.5 Turbo and GPT-4o, in solving ENEM mathematics questions. Our experiments demonstrate that RAG potentially provides significant improvements in accuracy by introducing relevant contextual information. With RAG, GPT-4o consistently outperforms GPT-3.5 Turbo, underscoring the potential of this technique to enhance educational AI tools. This research illustrates the potential of RAG-enhanced LLMs to advance educational applications and encourages further exploration in this field.

28. Automating Systematic Literature Reviews with Retrieval-Augmented Generation: A Comprehensive Overview

Binglan Han, Teo Sušnjak, Anuradha Mathrani - MDPI AG, 2024

This study examines Retrieval-Augmented Generation (RAG) in large language models (LLMs) and its significant application for undertaking systematic literature reviews (SLRs). RAG-based LLMs can potentially automate tasks like data extraction, summarization, and trend identification. However, while LLMs are exceptionally proficient in generating human-like text and interpreting complex linguistic nuances, their dependence on static, pre-trained knowledge can result in inaccuracies and hallucinations. RAG mitigates these limitations by integrating LLMs' generative capabilities with the precision of real-time information retrieval. We review in detail the three key processes of the RAG framework: retrieval, augmentation, and generation. We then discuss applications of RAG-based LLMs to SLR automation and highlight future research topics, including the integration of domain-specific LLMs, multimodal data processing and generation, and the utilization of multiple retrieval sources. We propose a framework of RAG-based LLMs for automating SLRs, which covers four stages of the SLR process: literature search…

29. Domain-Driven LLM Development: Insights into RAG and Fine-Tuning Practices

J.C. Santos, Rachel Hu, Richard Song - ACM, 2024

To improve Large Language Model (LLM) performance on domain-specific applications, ML developers often leverage Retrieval Augmented Generation (RAG) and LLM Fine-Tuning. RAG extends the capabilities of LLMs to specific domains or an organization's internal knowledge base, without the need to retrain the model. The Fine-Tuning approach, on the other hand, updates LLM weights with domain-specific data to improve performance on specific tasks. A fine-tuned model is particularly effective at systematically learning new, comprehensive knowledge in a specific domain that is not covered by the LLM pre-training. This tutorial walks through the RAG and Fine-Tuning techniques, discusses the insights of their advantages and limitations, and provides best practices for adopting the methodologies for LLM tasks and use cases. The hands-on labs demonstrate advanced techniques to optimize the RAG and fine-tuned LLM architecture for domain-specific LLM tasks. The labs in the tutorial are designed using a set of open-source Python libraries to implement the RAG and fine-tuned LLM architecture…

30. A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models

Wenqi Fan, Yujuan Ding, Liangbo Ning - ACM, 2024

As one of the most advanced techniques in AI, Retrieval-Augmented Generation (RAG) can offer reliable and up-to-date external knowledge, providing huge convenience for numerous tasks. Particularly in the era of AI-Generated Content (AIGC), the powerful capacity of retrieval in providing additional knowledge enables RAG to assist existing generative AI in producing high-quality outputs. Recently, Large Language Models (LLMs) have demonstrated revolutionary abilities in language understanding and generation, while still facing inherent limitations such as hallucinations and out-of-date internal knowledge. Given the powerful abilities of RAG in providing the latest and helpful auxiliary information, Retrieval-Augmented Large Language Models (RA-LLMs) have emerged to harness external and authoritative knowledge bases, rather than relying solely on the model's internal knowledge, to augment the quality of the generated content of LLMs. In this survey, we comprehensively review existing research studies in RA-LLMs, covering three primary technical perspectives. Furthermore, to deliver deeper…

31. Pencils Down! Automatic Rubric-based Evaluation of Retrieve/Generate Systems

Naghmeh Farzi, Laura Dietz - ACM, 2024

Current IR evaluation paradigms are challenged by large language models (LLMs) and retrieval-augmented generation (RAG) methods. Furthermore, evaluation either resorts to expensive human judgments or leads to an over-reliance on LLMs.

32. In-Context Learning for Scalable and Online Hallucination Detection in RAGs

Nicolò Cosimo Albanese - Academy & Industry Research Collaboration Center, 2024

Ensuring fidelity to source documents is crucial for the responsible use of Large Language Models (LLMs) in Retrieval Augmented Generation (RAG) systems. We propose a lightweight method for real-time hallucination detection, with potential to be deployed as a model-agnostic microservice to bolster reliability. Using in-context learning, our approach evaluates response factuality at the sentence level without annotated data, promoting transparency and user trust. Compared to other prompt-based and semantic similarity baselines from recent literature, our method improves hallucination detection F1 scores by at least 11%, with consistent performance across different models. This research offers a practical solution for real-time validation of response accuracy in RAG systems, fostering responsible adoption, especially in critical domains where document fidelity is paramount.
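
A skeletal version of sentence-level, few-shot factuality checking in this spirit (`call_llm` is a placeholder for any chat-completion API, and the few-shot examples are invented):

```python
# Sketch: judge each response sentence against the retrieved context with a
# few-shot prompt; no annotated training data is required.
FEW_SHOT = """Context: The Eiffel Tower is in Paris.
Sentence: The Eiffel Tower is in Paris. Supported: yes
Sentence: The Eiffel Tower is in Rome. Supported: no
"""

def call_llm(prompt):
    # Placeholder; wire up a real model here. This toy version just checks
    # whether the sentence mentions the grounding fact.
    return "yes" if "Paris" in prompt.split("Sentence:")[-1] else "no"

def flag_hallucinations(context, response):
    flags = {}
    for sentence in filter(None, (s.strip() for s in response.split("."))):
        prompt = f"{FEW_SHOT}Context: {context}\nSentence: {sentence}. Supported:"
        flags[sentence] = call_llm(prompt).strip().lower().startswith("no")
    return flags  # True = likely hallucinated

ctx = "The Eiffel Tower is in Paris."
print(flag_hallucinations(ctx, "The Eiffel Tower is in Paris. It is in Rome."))
```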

33. Extending Context Window in Large Language Models with Segmented Base Adjustment for Rotary Position Embeddings

Rongsheng Li, Jin Xu, Zhixiong Cao - MDPI AG, 2024

In the realm of large language models (LLMs), extending the context window for long text processing is crucial for enhancing performance. This paper introduces SBA-RoPE (Segmented Base Adjustment for Rotary Position Embeddings), a novel approach designed to efficiently extend the context window by segmentally adjusting the base of rotary position embeddings (RoPE). Unlike existing methods, such as Position Interpolation (PI), NTK, and YaRN, SBA-RoPE modifies the base of RoPE across different dimensions, optimizing the encoding of positional information for extended sequences. Through experiments on the Pythia model, we demonstrate the effectiveness of SBA-RoPE in extending context windows, particularly for texts exceeding the original training lengths. We fine-tuned the Pythia-2.8B model on the PG-19 dataset and conducted passkey retrieval and perplexity (PPL) experiments on the Proof-pile dataset to evaluate model performance. Results show that SBA-RoPE maintains or improves model performance when extending the context window, especially on longer text sequences. Compared to other methods…
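
A toy rendering of the segmented-base idea (the segment boundaries and base values below are illustrative, not the paper's tuned settings): instead of one global RoPE base, different dimension segments get different bases, changing how fast positional phases rotate per segment.

```python
# Sketch of segmented base adjustment for rotary position embeddings.
import torch

def sba_rope_inv_freq(dim, segment_bases=((0, 10000.0), (32, 20000.0), (48, 40000.0))):
    """Inverse frequencies with a per-segment base.
    segment_bases: (start_pair_index, base) over the dim//2 rotation pairs."""
    idx = torch.arange(0, dim, 2, dtype=torch.float32)   # one entry per pair
    bases = torch.empty_like(idx)
    for start, base in segment_bases:
        bases[start:] = base                             # later segments override
    return 1.0 / bases ** (idx / dim)

def rotate(x, positions, inv_freq):
    """Apply rotary embedding to x of shape (seq, dim)."""
    angles = positions[:, None] * inv_freq[None, :]      # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = torch.randn(16, 128)                                 # toy activations
pos = torch.arange(16, dtype=torch.float32)
print(rotate(x, pos, sba_rope_inv_freq(128)).shape)
```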

34. Benchmarking Large Language Models in Retrieval-Augmented Generation

Jiawei Chen, Hongyu Lin, Xianpei Han - Association for the Advancement of Artificial Intelligence (AAAI), 2024

Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG, including noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish the Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG…

35. InspectorRAGet: An Introspection Platform for RAG Evaluation

Kshitij P. Fadnis, Siva Sankalp Patel, Odellia Boni, 2024

Large Language Models (LLM) have become a popular approach for implementing Retrieval Augmented Generation (RAG) systems, and a significant amount of effort has been spent on building good models and metrics. In spite of increased recognition of the need for rigorous evaluation of RAG systems, few tools exist that go beyond the creation of model output and automatic calculation. We present InspectorRAGet, an introspection platform for RAG evaluation. InspectorRAGet allows the user to analyze aggregate and instance-level performance of RAG systems, using both human and algorithmic metrics as well as annotator quality. InspectorRAGet is suitable for multiple use cases and is available publicly to the community. The demo video is available at https://youtu.be/MJhe8QIXcEc

36. BERGEN: A Benchmarking Library for Retrieval-Augmented Generation

David Rau, Hervé Déjean, Nadezhda Chirkova, 2024

Retrieval-Augmented Generation allows enhancing Large Language Models with external knowledge. In response to the recent popularity of generative LLMs, many RAG approaches have been proposed, which involve an intricate number of different configurations such as evaluation datasets, collections, metrics, retrievers, and LLMs. Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline. In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research standardizing RAG experiments. In an extensive study focusing on QA, we benchmark different state-of-the-art retrievers, rerankers, and LLMs. Additionally, we analyze existing RAG metrics and datasets. Our open-source library BERGEN is available at https://github.com/naver/bergen.

37. Retrieval-Augmented Generation for Natural Language Processing: A Survey

Shangyu Wu, Ying Xiong, Yufei Cui, 2024

Large language models (LLMs) have demonstrated great success in various fields, benefiting from the huge number of parameters that store knowledge. However, LLMs still suffer from several key issues, such as hallucination problems, knowledge update issues, and a lack of domain-specific expertise. The appearance of retrieval-augmented generation (RAG), which leverages an external knowledge database to augment LLMs, makes up for those drawbacks. This paper reviews all significant techniques of RAG, especially in the retriever and the retrieval fusions. Besides, tutorial codes are provided for implementing the representative techniques in RAG. This paper further discusses RAG training, including RAG with and without datastore update. Then, we introduce the application of RAG in representative natural language processing tasks and industrial scenarios. Finally, this paper discusses the future directions and challenges of RAG for promoting its development.

38. Retrieval-Augmented Generation in Large Language Models through Selective Augmentation

Joao Quintela, Marquinhos Sapateiro - Springer Science and Business Media LLC, 2024

The increasing complexity and demands of natural language processing tasks have driven the need for more advanced and contextually aware language models. The integration of selective augmentation within Retrieval-Augmented Generation (RAG) frameworks represents a significant advancement, enhancing the relevance and accuracy of generated responses by dynamically incorporating pertinent information during inference. This research carefully developed and implemented a selective augmentation algorithm tailored to GPT-Neo, demonstrating substantial improvements in performance metrics such as BLEU, ROUGE, and F1 scores. Data preprocessing and model fine-tuning were conducted rigorously, ensuring a robust foundation for the selective augmentation mechanism. Experimental results confirmed that the enhanced RAG model not only provided more accurate and contextually relevant responses but also exhibited superior coherence compared to the baseline model. The implications of these findings are profound, suggesting that selective augmentation can significantly elevate the…

39. Investigating the performance of Retrieval-Augmented Generation and fine-tuning for the development of AI-driven knowledge-based systems

Róbert Lakatos, Péter Pollner, András Hajdú, 2024

The development of generative large language models (G-LLMs) opened up new opportunities for the development of new types of knowledge-based systems similar to ChatGPT, Bing, or Gemini. Fine-tuning (FN) and Retrieval-Augmented Generation (RAG) are the techniques that can be used to implement domain adaptation for the development of G-LLM-based knowledge systems. In our study, using ROUGE, BLEU, and METEOR scores and cosine similarity, we compare and examine the performance of RAG and FN for the GPT-J-6B, OPT-6.7B, LlaMA, and LlaMA-2 language models. Based on measurements shown on different datasets, we demonstrate that RAG-based constructions are more efficient than models produced with FN. We point out that connecting RAG and FN is not trivial, because connecting FN models with RAG can cause a decrease in performance. Furthermore, we outline a simple RAG-based architecture which, on average, outperforms the FN models by 16% in terms of the ROUGE score, 15% in the case of the BLEU score, and 53% based on cosine similarity. This shows the significant advantage of RAG over FN in terms of…

40. FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research

Jiajie Jin, Yutao Zhu, Xinyu Yang, 2024

With the advent of Large Language Models (LLMs), the potential of Retrieval Augmented Generation (RAG) techniques has garnered considerable research attention. Numerous novel algorithms and models have been introduced to enhance various aspects of RAG systems. However, the absence of a standardized framework for implementation, coupled with the inherently intricate RAG process, makes it challenging and time-consuming for researchers to compare and evaluate these approaches in a consistent environment. Existing RAG toolkits like LangChain and LlamaIndex, while available, are often heavy and unwieldy, failing to meet the personalized needs of researchers. In response to this challenge, we propose FlashRAG, an efficient and modular open-source toolkit designed to assist researchers in reproducing existing RAG methods and in developing their own RAG algorithms within a unified framework. Our toolkit implements 12 advanced RAG methods and has gathered and organized 32 benchmark datasets. It has various features, including a customizable modular framework and a rich collection of pre-implemented…

41. Evaluating the Efficacy of Open-Source LLMs in Enterprise-Specific RAG Systems: A Comparative Study of Performance and Scalability

B Gautam, Anupam Purwar, 2024

This paper presents an analysis of open-source large language models (LLMs) and their application in Retrieval-Augmented Generation (RAG) tasks, specifically for enterprise-specific datasets scraped from company websites. With the increasing reliance on LLMs in natural language processing, it is crucial to evaluate their performance, accessibility, and integration within specific organizational contexts. This study examines various open-source LLMs, explores their integration into RAG frameworks using enterprise-specific data, and assesses the performance of different open-source embeddings in enhancing the retrieval and generation process. Our findings indicate that open-source LLMs, combined with effective embedding techniques, can significantly improve the accuracy and efficiency of RAG systems, offering a viable alternative to proprietary solutions for enterprises.

42. Lighter And Better: Towards Flexible Context Adaptation For Retrieval Augmented Generation

Zheng Liu, Changxu Wu, Ninglu Shao, 2024

The existing Retrieval-Augmented Generation (RAG) systems face significant challenges in terms of cost and effectiveness. On one hand, they need to encode the lengthy retrieved contexts before responding to the input tasks, which imposes substantial computational overhead. On the other hand, directly using generic Large Language Models (LLMs) often leads to sub-optimal answers, while task-specific fine-tuning may compromise the LLMs' general capabilities. To address these challenges, we introduce a novel approach called FlexRAG (Flexible Context Adaptation for RAG). In this approach, the retrieved contexts are compressed into compact embeddings before being encoded by the LLMs. Simultaneously, these compressed embeddings are optimized to enhance downstream RAG performance. A key feature of FlexRAG is its flexibility, which enables effective support for diverse compression ratios and selective preservation of important contexts. Thanks to these technical designs, FlexRAG achieves superior generation quality while significantly reducing running costs. Comprehensive experiments on various…

43. A Survey on RAG Meets LLMs: Towards Retrieval-Augmented Large Language Models

Yujuan Ding, Wenqi Fan, Liangbo Ning, 2024

As one of the most advanced techniques in AI, Retrieval-Augmented Generation (RAG) techniques can offer reliable and up-to-date external knowledge, providing huge convenience for numerous tasks. Particularly in the era of AI-generated content (AIGC), the powerful capacity of retrieval in RAG in providing additional knowledge enables retrieval-augmented generation to assist existing generative AI in producing high-quality outputs. Recently, Large Language Models (LLMs) have demonstrated revolutionary abilities in language understanding and generation, while still facing inherent limitations, such as hallucinations and out-of-date internal knowledge. Given the powerful abilities of RAG in providing the latest and helpful auxiliary information, retrieval-augmented large language models have emerged to harness external and authoritative knowledge bases, rather than relying solely on the model's internal knowledge, to augment the generation quality of LLMs. In this survey, we comprehensively review existing research studies in retrieval-augmented large language models (RA-LLMs), covering…

44. One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models

Yutao Zhu, Zhaoheng Huang, Zhicheng Dou, 2024

Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs) for generating more factual, accurate, and up-to-date content. Existing methods either optimize prompts to guide LLMs in leveraging retrieved information or directly fine-tune the LLMs to adapt to RAG scenarios. Although fine-tuning can yield better performance, it often compromises the LLMs' general generation capabilities by modifying their parameters. This limitation poses challenges in practical applications, especially when LLMs are already deployed, as parameter adjustments may affect their original functionality. To address this, we propose a novel method that involves learning scalable and pluggable virtual tokens for RAG. By maintaining the LLMs' original parameters and fine-tuning only the embeddings of these pluggable tokens, our approach not only enhances LLMs' performance but also preserves their general generation capacities. Furthermore, we design several training strategies to improve the scalability, flexibility, and generalizability of our method. Comprehensive experiments…
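
A minimal sketch of the freeze-the-model, train-only-the-virtual-tokens recipe (the tiny encoder and training objective below are stand-ins for a real LLM and a real RAG loss):

```python
# Sketch: the base model is frozen; only the embeddings of a few added
# virtual tokens are trainable, so original behavior is preserved when
# the tokens are unplugged.
import torch
import torch.nn as nn

class VirtualTokens(nn.Module):
    def __init__(self, n_tokens, hidden):
        super().__init__()
        self.emb = nn.Parameter(torch.randn(n_tokens, hidden) * 0.02)

    def prepend(self, input_embeds):                 # (batch, seq, hidden)
        batch = input_embeds.size(0)
        virt = self.emb.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([virt, input_embeds], dim=1)

hidden = 64
base_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True), 2)
for p in base_model.parameters():
    p.requires_grad_(False)                          # base model stays frozen

virtual = VirtualTokens(n_tokens=4, hidden=hidden)
optimizer = torch.optim.AdamW(virtual.parameters(), lr=1e-3)  # tokens only

x = torch.randn(2, 10, hidden)                       # stand-in token embeddings
out = base_model(virtual.prepend(x))                 # RAG-conditioned forward
loss = out.pow(2).mean()                             # placeholder objective
loss.backward()
optimizer.step()
print(sum(p.numel() for p in virtual.parameters()), "trainable parameters")
```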

45. RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

Daniel Fleischer, Moshe Berchansky, Moshe Wasserblat, 2024

Implementing Retrieval-Augmented Generation (RAG) systems is inherently complex, requiring deep understanding of data, use cases, and intricate design decisions. Additionally, evaluating these systems presents significant challenges, necessitating assessment of both retrieval accuracy and generative quality through a multi-faceted approach. We introduce RAG Foundry, an open-source framework for augmenting large language models for RAG use cases. RAG Foundry integrates data creation, training, inference, and evaluation into a single workflow, facilitating the creation of data-augmented datasets for training and evaluating large language models in RAG settings. This integration enables rapid prototyping and experimentation with various RAG techniques, allowing users to easily generate datasets and train RAG models using internal or specialized knowledge sources. We demonstrate the framework's effectiveness by augmenting and fine-tuning Llama-3 and Phi-3 models with diverse RAG configurations, showcasing consistent improvements across three knowledge-intensive datasets. Code is released as open source.

46. Wiping out the limitations of Large Language Models -- A Taxonomy for Retrieval Augmented Generation

Mahei Manhai Li, Irina Nikishina, Özge Sevgili, 2024

Current research on RAGs is distributed across various disciplines, and since the technology is evolving very quickly, its unit of analysis is mostly technological innovations rather than applications in business contexts. Thus, in this research, we aim to create a taxonomy to conceptualize a comprehensive overview of the constituting characteristics that define RAG applications, facilitating the adoption of this technology in the IS community. To the best of our knowledge, no RAG application taxonomies have been developed so far. We describe our methodology for developing the taxonomy, which includes the criteria for selecting papers, an explanation of our rationale for employing a Large Language Model (LLM)-supported approach to extract and identify initial characteristics, and a concise overview of our systematic process for conceptualizing the taxonomy. Our systematic taxonomy development process includes four iterative phases designed to refine and enhance our understanding and presentation of RAG's core dimensions. We have developed a total of five meta-dimensions and sixteen…

47. Navigating the Present: Exploring Practical Horizons of Retrieval-Augmented Generation (RAG)

Amir Aryani - Front Matter, 2024

Authors: Hui Yin, Amir Aryani. As we discussed in our previous article, A Brief Introduction to Retrieval Augmented Generation (RAG), RAG is an artificial intelligence framework that incorporates the latest reliable external knowledge and aims to improve the quality of responses generated by pre-trained language models (PLMs). Initially, it was designed to improve the performance of knowledge-intensive NLP tasks (Lewis et al., 2020). As more…

48. Introducing a new hyper-parameter for RAG: Context Window Utilization

Kush Juvekar, Anupam Purwar, 2024

This paper introduces a new hyper-parameter for Retrieval-Augmented Generation (RAG) systems called Context Window Utilization. RAG systems enhance generative models by incorporating relevant information retrieved from external knowledge bases, improving the factual accuracy and contextual relevance of generated responses. The size of the text chunks retrieved and processed is a critical factor influencing RAG performance. This study aims to identify the optimal chunk size that maximizes answer generation quality. Through systematic experimentation, we analyze the effects of varying chunk sizes on the efficiency and effectiveness of RAG frameworks. Our findings reveal that an optimal chunk size balances the trade-off between providing sufficient context and minimizing irrelevant information. These insights are crucial for enhancing the design and implementation of RAG systems, underscoring the importance of selecting an appropriate chunk size to achieve superior performance.
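
A sketch of such a chunk-size sweep under toy assumptions (the corpus, retrieval, and quality metric below are stand-ins for real RAG answer evaluation):

```python
# Sketch: chunk the corpus at several sizes, retrieve, and score each setting.
def make_chunks(text, size):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query, chunks):
    return max(chunks, key=lambda c: len(set(query.split()) & set(c.split())))

def score(chunk, gold_answer):      # stand-in for answer-quality evaluation
    return gold_answer in chunk

corpus = ("RAG systems retrieve chunks of text ... "
          "the optimal chunk size balances context and noise")
query, gold = "optimal chunk size", "optimal chunk size"

for size in (4, 8, 16, 32):
    hit = score(retrieve(query, make_chunks(corpus, size)), gold)
    print(f"chunk_size={size:>2}  answerable={hit}")
```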

49. A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems

Florin Cuconasu, Giovanni Trappolini, Nicola Tonellotto, 2024

Retrieval Augmented Generation (RAG) represents a significant advancement in artificial intelligence combining a retrieval phase with a generative phase, with the latter typically being powered by large language models (LLMs). The current common practices in RAG involve using "instructed" LLMs, which are fine-tuned with supervised training to enhance their ability to follow instructions and are aligned with human preferences using state-of-the-art techniques. Contrary to popular belief, our study demonstrates that base models outperform their instructed counterparts in RAG tasks by 20% on average under our experimental settings. This finding challenges the prevailing assumptions about the superiority of instructed LLMs in RAG applications. Further investigations reveal a more nuanced situation, questioning fundamental aspects of RAG and suggesting the need for broader discussions on the topic; or, as Fromm would have it, "Seldom is a glance at the statistics enough to understand the meaning of the figures".

50. Efficient Usage of RAG Systems in the World of LLMs

Priyank Rathod - Institute of Electrical and Electronics Engineers (IEEE), 2024

The integration of Retrieval-Augmented Generation (RAG) systems with Large Language Models (LLMs) has revolutionized the field of Natural Language Processing (NLP). By leveraging RAG techniques, LLMs can access a broader range of information, improve coherence, and enhance the relevance of generated text. This paper explores the efficient usage of RAG systems in LLMs, highlighting their benefits, applications, and future implications.

51. Modular RAG: Transforming RAG Systems into LEGO-like Reconfigurable Frameworks

52. Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models

53. FIT-RAG: Black-Box RAG with Factual Information and Token Reduction

54. Enhanced document retrieval with topic embeddings

55. Tabular Embedding Model (TEM): Finetuning Embedding Models For Tabular RAG Applications
