In this blog, Metrum AI presents a RAG solution on Dell PowerEdge R760xa Server with Nvidia H100 Data Center Tensor Core GPUs.

The exponential growth of machine data from modern IT infrastructure, including logs from applications, servers, and network devices, presents a significant challenge for IT teams. The volume and complexity of this data make it difficult to manually analyze logs and identify critical errors and anomalies in a timely manner. Traditional log analysis methods are also labor-intensive and often fall short of providing timely, accurate, and correlated insights across different log streams, which can lead to system downtime, vulnerabilities, and inefficiencies.

Envision an enterprise workflow where the infrastructure automatically notifies IT teams of issues, removing the need for manual log analysis and streamlining error detection.

Using retrieval-augmented generation (RAG), infrastructure can provide detailed insights into system and network issues, allowing IT teams to diagnose and resolve them efficiently. Dell, in partnership with Nvidia and Metrum AI, is thrilled to unveil a cutting-edge GenAI-Based IT Log Analyzer Solution that leverages generative AI and RAG to revolutionize log analysis and incident management.

This solution enables the following key advancements in IT Log Analysis:

Efficiency and Accuracy: By enabling AI-assisted or fully automated log analysis, the solution significantly reduces the time and effort required to identify and diagnose issues, ensuring more accurate and timely insights.
Scalability: The solution can handle large volumes of log data from multiple sources, scaling seamlessly with the growing needs of IT infrastructures.
AI-based Correlation: By correlating errors and identifying patterns, the solution supports early diagnosis, helping teams avoid potential system downtime, security threats, and vulnerabilities.
Enhanced User Experience: A user-friendly chat interface enables AI-assisted error diagnosis with references to the source logs. The ability to generate detailed root cause analysis (RCA) reports and incident tickets further enhances the user experience.

In this blog, Metrum AI walks through an enterprise-ready solution architecture and provides insights into the user interface along with example user queries.

| Solution Architecture

This solution runs on a Dell PowerEdge R760xa server equipped with Nvidia H100 Data Center Tensor Core GPUs and uses a suite of models, including the bge-large-en-v1.5 text embedding model and the Llama 3 large language model. It is built on Haystack 2.0, an industry-leading RAG framework, along with Nvidia NIM, which offers optimized microservices at scale.
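The retrieval half of such a pipeline can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the solution's implementation: a toy bag-of-words vector stands in for the bge-large-en-v1.5 embedding model, an in-memory list stands in for the document store, and the function names (`embed`, `cosine`, `retrieve`) and sample log lines are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

# Hypothetical log chunks, invented for illustration only.
LOG_CHUNKS = [
    "app ERROR database connection timeout after 30s",
    "server INFO fan speed nominal",
    "network WARN link flap detected on port eth0",
]

top = retrieve("why did the database connection fail", LOG_CHUNKS, top_k=1)
print(top)
```

In a real deployment the word-count vectors would be replaced by dense embeddings from the embedding model, and the retrieved chunks would be passed to the language model as context.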

The image below illustrates the solution architecture in detail.

| Key Features

Log Ingestion Interface: This solution enables easy ingestion from multiple log sources including applications, servers, and networks.
Customizable Log Analysis: Users can seamlessly select target applications and specific time frames for analysis.
Chat Console: The solution enables users to converse with the application and query information relevant to uploaded files.
Error Identification and Retrieval: This solution automatically detects errors, provides detailed insights and references to the source logs, and enhances troubleshooting without manual intervention.
Intelligent Error Correlation: The application correlates errors across multiple sources, paving the way for early diagnosis and preventing potential downtime.
Automated Incident Ticket Generation: Following the conversation and analysis, the system auto-generates an incident ticket in a PDF format, capturing all relevant details for record-keeping and further action.
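The application and time-frame selection described above amounts to filtering parsed log records. The sketch below is one hedged way to picture it. The log line layout (`date time source LEVEL message`), the helper names `parse_line` and `select_errors`, and the sample lines are all assumptions made for illustration; the actual solution's log formats are not specified in this blog.

```python
from datetime import datetime, time

def parse_line(line):
    """Split 'YYYY-MM-DD HH:MM:SS source LEVEL message' into a record dict."""
    date_s, time_s, source, level, message = line.split(" ", 4)
    return {
        "ts": datetime.strptime(f"{date_s} {time_s}", "%Y-%m-%d %H:%M:%S"),
        "source": source,
        "level": level,
        "message": message,
    }

def select_errors(lines, source, start, end):
    """Keep ERROR records from one source whose time of day lies in [start, end]."""
    records = (parse_line(line) for line in lines)
    return [
        r for r in records
        if r["source"] == source
        and r["level"] == "ERROR"
        and start <= r["ts"].time() <= end
    ]

# Hypothetical sample lines, invented for illustration only.
SAMPLE = [
    "2024-05-20 06:58:11 app ERROR cache warmup failed",
    "2024-05-20 09:12:40 app ERROR database connection timeout",
    "2024-05-20 09:12:45 server ERROR PSU1 voltage out of range",
    "2024-05-20 11:30:00 app INFO request served",
]

errors = select_errors(SAMPLE, "app", time(7, 0), time(17, 0))
for r in errors:
    print(r["ts"], r["message"])
```

Narrowing the records before retrieval keeps the context passed to the language model small and relevant to the selected application and time frame.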

| Step-by-Step Demo Walkthrough

The user ingests logs of different types, such as application, server, and network logs, through the user interface.
The user selects the application and time frame for log analysis.
The user converses with the application through the conversation console.
The application automatically retrieves identified errors, along with their details and source log references.
The application automatically correlates errors across multiple sources and helps in the early diagnosis of errors.
The application lets users generate an incident ticket (PDF) based on the analysis results.
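To make the correlation step in the walkthrough concrete, here is a deliberately simple sketch that pairs application errors with infrastructure errors logged close together in time. This time-window heuristic is swapped in for illustration only; the solution itself performs AI-based correlation, and its method is not detailed in this blog. The `correlate` function, the 60-second window, and the sample events are hypothetical.

```python
from datetime import datetime, timedelta

def correlate(app_errors, infra_errors, window=timedelta(seconds=60)):
    """Pair each app error with infrastructure errors logged within `window` of it."""
    pairs = []
    for app_ts, app_msg in app_errors:
        for infra_ts, infra_msg in infra_errors:
            if abs(app_ts - infra_ts) <= window:
                pairs.append((app_msg, infra_msg))
    return pairs

# Hypothetical events as (timestamp, message) tuples, invented for illustration.
APP = [
    (datetime(2024, 5, 20, 9, 12, 40), "database connection timeout"),
    (datetime(2024, 5, 20, 14, 3, 5), "payment service 500"),
]
INFRA = [
    (datetime(2024, 5, 20, 9, 12, 10), "PSU1 voltage out of range"),
    (datetime(2024, 5, 20, 16, 45, 0), "fan 3 speed below threshold"),
]

pairs = correlate(APP, INFRA)
print(pairs)
```

A fixed window like this can surface candidate pairs cheaply; deciding whether a pair reflects a real causal link is the part left to the language model and the analyst.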

| Example Scenarios

The following example scenarios simulate a user query relevant to the uploaded IT log files, along with the corresponding text response, reference list, and incident ticket generated by the application.

Can you summarize application errors from 7:00 to 17:00 as a markdown table?

Can you summarize application errors and corresponding IDRAC errors from the same time period?

Can you summarize errors and corresponding IDRAC errors from the same time period? Additionally, could you provide a detailed analysis of the app logs and IDRAC logs?
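For the first query, the requested output format is a markdown table. In the solution the language model produces that table; the sketch below only shows the target shape by rendering a list of error records deterministically. The `to_markdown_table` helper, the column names, and the sample records are hypothetical.

```python
def to_markdown_table(rows, columns):
    """Render a list of dicts as a markdown table with the given column order."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = ["| " + " | ".join(str(row[c]) for c in columns) + " |" for row in rows]
    return "\n".join([header, divider] + body)

# Hypothetical error records, invented for illustration only.
ERRORS = [
    {"time": "09:12:40", "source": "app", "message": "database connection timeout"},
    {"time": "14:03:05", "source": "app", "message": "payment service 500"},
]

table = to_markdown_table(ERRORS, ["time", "source", "message"])
print(table)
```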

The image below details an example GenAI incident report.

| Summary

A Dell PowerEdge R760xa server equipped with Nvidia H100 Data Center Tensor Core GPUs offers enterprises industry-leading infrastructure for creating custom RAG solutions with their proprietary data. In this blog, we showcased how enterprises deploying applied AI can take advantage of RAG capabilities in the context of an IT Log Analyzer, achieving the following milestones:

Built end-to-end RAG on Dell PowerEdge R760xa server with Nvidia H100 Data Center Tensor Core GPUs, validated on Nvidia NIMs, and deployed with Haystack RAG Framework.
Enabled IT professionals to chat with network, application, and infrastructure logs, leading to faster root cause analysis and auto-generated reports.
Showcased live at Dell Technologies World ’24.

| Additional Criteria for IT Decision Makers

| What is RAG, and why is it critical for enterprises?

RAG, which stands for Retrieval-Augmented Generation, is a method in natural language processing (NLP) that enhances the generation of responses or information by incorporating external knowledge retrieved from a large corpus or database. This approach combines the strengths of retrieval-based models and generative models to provide more accurate, informative, and contextually relevant outputs.

The key advantage of RAG is that it leverages a large amount of external knowledge dynamically, enabling the model to generate responses that are not just based on its training data but also on up-to-date and detailed information from the retrieval phase. This makes RAG particularly useful in applications where factual accuracy and detail are crucial, such as in customer support, academic research, and other domains requiring precise information. Ultimately, RAG provides enterprises with a robust tool for improving the accuracy, relevance, and efficiency of their information systems, leading to better customer service, cost savings, and competitive advantages.
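The "augmented" part of RAG can be pictured as prompt assembly: retrieved passages are placed in front of the question so the model answers from them rather than from training data alone. The sketch below is a minimal, hedged illustration. The `build_prompt` function, the instruction wording, the file names, and the sample excerpts are hypothetical and are not taken from the solution.

```python
def build_prompt(question, retrieved):
    """Combine a question with retrieved log excerpts, numbered so the
    answer can cite its sources."""
    context = "\n".join(
        f"[{i}] ({source}) {text}"
        for i, (source, text) in enumerate(retrieved, start=1)
    )
    return (
        "Answer using only the log excerpts below and cite them by number.\n\n"
        f"Log excerpts:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical retrieved excerpts as (source file, text) pairs.
RETRIEVED = [
    ("app.log", "09:12:40 ERROR database connection timeout"),
    ("idrac.log", "09:12:10 ERROR PSU1 voltage out of range"),
]

prompt = build_prompt("What caused the 09:12 outage?", RETRIEVED)
print(prompt)
```

Numbering the excerpts gives the model a way to point back at specific log lines, which is what makes source references in the response possible.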

| Resources

Dell product images: Dell.com

Copyright © 2024 Metrum AI, Inc. All Rights Reserved. This project was commissioned by Dell Technologies. Dell and other trademarks are trademarks of Dell Inc. or its subsidiaries. Nvidia and combinations thereof are trademarks of Nvidia. All other product names are the trademarks of their respective owners.

***DISCLAIMER - Performance varies by hardware and software configurations, including testing conditions, system settings, application complexity, the quantity of data, batch sizes, software versions, libraries used, and other factors. The results of performance testing provided are intended for informational purposes only and should not be considered as a guarantee of actual performance.

Gen AI IT Log Analyzer | Enable IT Teams to Chat with Infrastructure on Dell PowerEdge™ R760xa Server with Nvidia H100 Data Center Tensor Core GPUs