In this blog, Metrum AI presents a RAG solution on a Dell PowerEdge R760xa server with Nvidia H100 Data Center Tensor Core GPUs.

| July 2024

The exponential growth of machine data from modern IT infrastructure, including logs from applications, servers, and network devices, presents a significant challenge for IT teams. The volume and complexity of this data make it difficult to manually analyze logs and identify critical errors and anomalies in a timely manner. Traditional log analysis methods are labor-intensive and often fall short of providing timely, accurate, and correlated insights across different log streams, which can lead to system downtime, vulnerabilities, and inefficiencies.

Envision an enterprise workflow where the infrastructure automatically notifies IT teams of issues, removing the need for manual log analysis and streamlining error detection.

With retrieval-augmented generation (RAG), infrastructure can provide detailed insights into system and network issues, allowing IT teams to diagnose and resolve them efficiently. Dell, in partnership with Nvidia and Metrum AI, is thrilled to unveil a GenAI-based IT Log Analyzer solution that uses generative AI and RAG to streamline log analysis and incident management.

This solution enables the following key advancements in IT Log Analysis:

In this blog, Metrum AI walks through an enterprise-ready solution architecture and provides insights into the user interface along with example user queries.

| Solution Architecture

This solution runs on a Dell PowerEdge R760xa server equipped with Nvidia H100 Data Center Tensor Core GPUs and uses a suite of models, including bge-large-en-v1.5, a text embedding model, and the Llama 3 large language model. It is built on Haystack 2.0, an industry-leading RAG framework, along with Nvidia NIM, which provides optimized inference microservices at scale.

The image below illustrates the solution architecture in detail.

| Key Features

| Step-by-Step Demo Walkthrough

  1. The user ingests logs of different types, such as application, server, and network logs, through the user interface.
  2. The user selects the application and time frame for log analysis.
  3. The user converses with the application through the conversation console.
  4. The application retrieves and automatically provides identified errors, along with their details and source log references.
  5. The application automatically correlates errors across multiple sources and helps in the early diagnosis of errors.
  6. The application lets users generate an incident ticket (PDF) based on the analysis results.
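
Steps 2, 4, and 5 above can be sketched in plain Python. This is a minimal illustration, not the solution's implementation: the `LogEntry` fields, the `select_errors` and `correlate` helpers, the five-minute correlation window, and the sample log lines are all hypothetical and were invented for this example. In the actual solution, error identification and correlation are handled by the RAG pipeline rather than by fixed rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LogEntry:
    source: str          # hypothetical source label, e.g. "app" or "idrac"
    timestamp: datetime
    level: str
    message: str


def select_errors(entries, source, start, end):
    """Return ERROR-level entries from one source inside [start, end]."""
    return [
        e for e in entries
        if e.source == source and e.level == "ERROR" and start <= e.timestamp <= end
    ]


def correlate(primary, secondary, window=timedelta(minutes=5)):
    """Pair each primary error with the secondary errors logged within `window` of it."""
    pairs = []
    for p in primary:
        related = [s for s in secondary if abs(s.timestamp - p.timestamp) <= window]
        pairs.append((p, related))
    return pairs


# Made-up sample data, for illustration only.
logs = [
    LogEntry("app", datetime(2024, 7, 1, 9, 0), "ERROR", "database connection timed out"),
    LogEntry("app", datetime(2024, 7, 1, 9, 30), "INFO", "request served"),
    LogEntry("idrac", datetime(2024, 7, 1, 9, 2), "ERROR", "fan speed below threshold"),
    LogEntry("idrac", datetime(2024, 7, 1, 18, 0), "ERROR", "power supply redundancy lost"),
]

# Step 2: choose a source and time frame. Steps 4 and 5: list errors and correlate them.
start, end = datetime(2024, 7, 1, 7, 0), datetime(2024, 7, 1, 17, 0)
app_errors = select_errors(logs, "app", start, end)
idrac_errors = select_errors(logs, "idrac", start, end)
correlated = correlate(app_errors, idrac_errors)

for app_error, related in correlated:
    print(app_error.message, "<->", [r.message for r in related])
```

Time proximity is only one possible correlation signal; it is used here because it needs no assumptions about log contents beyond a timestamp.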

| Example Scenarios

The following example scenarios each show a user query about the uploaded IT log files, along with the corresponding text response, reference list, and incident ticket generated by the application.

Can you summarize application errors from 7:00 to 17:00 as a markdown table?

Can you summarize application errors and corresponding IDRAC errors from the same time period?

Can you summarize errors and corresponding IDRAC errors from the same time period? Additionally, could you provide a detailed analysis of the app logs and IDRAC logs?

The image below details an example GenAI incident report.

| Summary

The Dell PowerEdge R760xa server, equipped with Nvidia H100 Data Center Tensor Core GPUs, offers enterprises industry-leading infrastructure to create custom RAG solutions using their proprietary data. In this blog, we showcased how enterprises deploying applied AI can take advantage of RAG capabilities in the context of an IT Log Analyzer, reaching the following milestones:

| Additional Criteria for IT Decision Makers

| What is RAG, and why is it critical for enterprises?

RAG, which stands for Retrieval-Augmented Generation, is a method in natural language processing (NLP) that enhances the generation of responses or information by incorporating external knowledge retrieved from a large corpus or database. This approach combines the strengths of retrieval-based models and generative models to provide more accurate, informative, and contextually relevant outputs.
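
The retrieve-then-generate flow can be illustrated with a toy example. The sketch below makes several simplifying assumptions: a bag-of-words cosine similarity stands in for a real text embedding model such as bge-large-en-v1.5, the `retrieve` and `build_prompt` helpers and the sample corpus are hypothetical, and the final generation step (sending the prompt to a large language model) is omitted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, top_k=2):
    """Retrieval step: rank documents by similarity to the query, keep the best top_k."""
    q = Counter(tokenize(query))
    ranked = sorted(documents, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:top_k]


def build_prompt(query, passages):
    """Augmentation step: place the retrieved passages in the prompt as numbered context."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below and cite passage numbers.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
    )


# Made-up corpus, for illustration only.
corpus = [
    "09:00 app ERROR database connection timed out",
    "09:02 idrac ERROR fan speed below threshold",
    "09:30 app INFO request served",
]
question = "Which application errors mention the database?"
passages = retrieve(question, corpus, top_k=2)
prompt = build_prompt(question, passages)
print(prompt)  # a real system would now send this prompt to the language model
```

Numbering the passages in the prompt is what allows a generated answer to point back to its source log lines.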

The key advantage of RAG is that it leverages a large amount of external knowledge dynamically, enabling the model to generate responses that are not just based on its training data but also on up-to-date and detailed information from the retrieval phase. This makes RAG particularly useful in applications where factual accuracy and detail are crucial, such as in customer support, academic research, and other domains requiring precise information. Ultimately, RAG provides enterprises with a robust tool for improving the accuracy, relevance, and efficiency of their information systems, leading to better customer service, cost savings, and competitive advantages.

| Resources

Dell product images: Dell.com

Copyright © 2024 Metrum AI, Inc. All Rights Reserved. This project was commissioned by Dell Technologies. Dell and other trademarks are trademarks of Dell Inc. or its subsidiaries. Nvidia and combinations thereof are trademarks of Nvidia. All other product names are the trademarks of their respective owners.

***DISCLAIMER - Performance varies by hardware and software configurations, including testing conditions, system settings, application complexity, the quantity of data, batch sizes, software versions, libraries used, and other factors. The results of performance testing provided are intended for informational purposes only and should not be considered as a guarantee of actual performance.