Air-gapped observability at the edge: Monitoring distributed infrastructures with Grafana


About this session

Edge computing is revolutionizing industries by processing data closer to its source, but observability in these environments presents unique challenges — especially in air-gapped and radio-silent deployments where connectivity is limited or nonexistent. Traditional cloud-based monitoring solutions fall short in these scenarios, leaving organizations with operational blind spots, increased downtime risks, and higher maintenance costs.

In this session, Senior Site Reliability Engineer Ruslan Dautov explores how ZEDEDA integrates Grafana, Loki, and Prometheus to deliver real-time observability for large-scale distributed, air-gapped, and intermittently connected edge infrastructures. He’ll cover collecting and storing metrics locally with Prometheus, optimizing log aggregation with Loki in low-bandwidth environments, and visualizing distributed systems in Grafana — even when offline.

Using real-world examples, Ruslan will demonstrate how industries like manufacturing and energy leverage Grafana to monitor edge infrastructure under strict network constraints: factories preventing downtime with predictive analytics in air-gapped environments, energy providers optimizing remote assets with store-and-forward telemetry, and industrial facilities maintaining observability without constant cloud connectivity.

Speakers

Speaker 1 (00:03): Hello everyone. My name is Ruslan, and today we'll talk about why you shouldn't put everything in the cloud: air-gapped observability at the edge. There's a QR code for questions; we'll have a Q&A session at the end, and don't worry, we'll show the QR code again then. My name is Ruslan Dautov. I'm based in Berlin, Germany, and I've been working as a senior site reliability engineer at ZEDEDA for almost four years, participating in on-call rotation and operations engineering. So let's look at the agenda. We'll start with the problem statement: why cloud-based monitoring doesn't work, and the realities of edge observability. We'll consider several customer cases, for an oil services company and a solar tracker company, and then look at the solution and the takeaways. So, the problem statement: we depend on the assumption that everything lives in data centers with stable connectivity, but that's not always available.

(01:19): Connectivity is king. When a customer or big enterprise provisions thousands of devices, waves of workload data hit the production system and can create bandwidth hogs, latency issues, and, especially, security problems. When you transfer something valuable in the logs, something critical, it's important that it stays localized. An air gap is a complete blind spot: you cannot provide a cloud-based solution or deliver telemetry on the spot. So let's talk about the first case. It's one of the largest oil services companies, one of our first enterprise customers, with fantastic workloads running on edge devices mounted on trucks, and these trucks move between oil well facilities. One day they came to us with a technical requirement: "Guys, you need to implement radio silence."

(02:34): And we asked, why? They explained that these trucks visit oil facilities, and some facilities handle explosives, so radio silence is critical to avoid triggering what you should not trigger. The technical requirement was kind of mind-blowing, but look at the workloads they run: Windows virtual machines together with a Kubernetes cluster and AI models utilizing GPUs, all of it running on a truck, and the truck is moving. At the same time, because of radio silence, the trucks completely disappear from telemetry: you will not get any logs or any information from a truck until it is back online. So radio silence came with interesting challenges. First, you need to implement radio silence itself. Second, once you have sent the signal and the device has gone into silent mode, how do you tell the device it is now okay to come back online, when the device is sitting in radio silence?
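Since a device in radio silence cannot receive a wake-up signal over the air, the exit condition has to live on the device itself. A minimal sketch of one such pattern, a locally stored deadline agreed before silence begins (this is an illustration, not ZEDEDA's actual mechanism):

```python
from datetime import datetime, timedelta

class RadioSilenceController:
    """Local-only silence window: a silent device cannot be told to
    wake up remotely, so the exit condition is decided on-device,
    here via a deadline recorded when silence begins."""

    def __init__(self):
        self.silent_until = None

    def enter_silence(self, now, duration_minutes):
        # Record the deadline locally before shutting the radios down.
        self.silent_until = now + timedelta(minutes=duration_minutes)

    def may_transmit(self, now):
        # Radios stay off until the locally stored deadline passes
        # (an operator on site could also re-enable them manually).
        return self.silent_until is None or now >= self.silent_until
```

In practice the same idea shows up as a pre-agreed timer or a physical action by an on-site engineer, which matches the "local utilities" example mentioned next.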

(03:51): This is one of the critical challenges we face. We have seen the same with local utilities, where an engineer has to bring the system back online by hand. The second case is PV hardware: one of the largest solar panel tracker companies, with hundreds upon hundreds of deployments around the world. If you have seen the recent news from Spain and Portugal, downtime in this type of facility is important, I would say critical, for the energy production system. This case brought new challenges. Between device provisioning and field deployment, six months or a year can pass. That means you see the first sign of telemetry in your monitoring and observability systems immediately, and then the device disappears for half a year. That is one challenge. The second challenge: there is no dedicated IT specialist in the field.
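The provisioning gap can be modeled as a per-device expected-offline window, so a months-long planned silence does not page anyone while an unexpected one does. A toy sketch (the device IDs and windows are hypothetical):

```python
from datetime import datetime, timedelta

# Hypothetical inventory of expected-offline windows: a device provisioned
# months before field deployment should not alert while it is dark.
EXPECTED_OFFLINE = {"tracker-0042": timedelta(days=180)}

def should_alert(device_id, last_seen, now):
    """Alert only when a device has been silent longer than its
    expected gap (default: one hour for normally connected devices)."""
    allowed = EXPECTED_OFFLINE.get(device_id, timedelta(hours=1))
    return now - last_seen > allowed
```

The same effect can be achieved in Alertmanager with silences or inhibition rules keyed on device labels; the point is that "no telemetry" is only an incident relative to what is expected for that device.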

(05:01): That means you cannot afford to break the device, damage it, or make it unusable, because no one will come and power-cycle the edge device. This is hard; these are a lot of challenges. To give you a sense of how distributed these deployments can be, I am showing a map of edge devices connecting via satellite to our production systems. It is only a sample from one customer, but it gives you a view of the scale: how distributed our solution and our deployments are, and how big they are. So let's talk about the solution.

(05:52): ZEDEDA was founded in 2016 and provides a SaaS solution that helps you orchestrate, manage, and control edge deployments on your own hardware. It comes with a global marketplace where we store thousands of different workloads, and we have certified more than a hundred edge devices, making sure everything runs stably and without issues. The second product, which we open sourced, is EVE, an edge virtualization engine: an operating system under the Linux Foundation, under the Apache license. We have been contributing to open source since 2016. The engine gives you the ability to run virtual machines and containerized workloads, such as Docker containers, Kubernetes workloads, and others. And it is really secure, because we do not allow any console or SSH connectivity to the edge devices; everything goes through the EVE API to what we call the ZEDEDA cloud.

(07:17): Internally we call it ZEDcloud, so I may use that term. What is the difference between an air gap and radio silence? Air gap is a term from computer networking: a system, complex, or data store is completely isolated from other networks, and especially from the internet. Radio silence is a term from radio communications: all communication is temporarily shut down until a special order to come back online. How does this relate to us? We have cases where an edge device is in an air-gapped environment or in radio silence mode and completely loses connectivity with ZEDcloud. In that case the edge device does not stop or disrupt its workloads; it keeps working without any issue and comes back online when connectivity is restored. The second case is when the whole deployment, controller and edge devices, sits inside the air-gapped environment. This is one of the most complex cases for our deployments, and today we will share how we handle both cases with our observability stack. Before we start talking about observability, let me introduce our teammate, Umka. Umka is one of our user watchdog accounts: it tracks all the CPU spikes and barks out the Opsgenie alerts. A fantastic teammate.
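The behavior described here, keep collecting while disconnected and replay once the link returns, is classic store-and-forward. A minimal sketch, with `send` standing in for whatever remote-write call a real agent would make:

```python
from collections import deque

class StoreAndForwardBuffer:
    """Store-and-forward sketch: samples collected while the link is
    down are buffered locally (bounded, oldest dropped first) and
    flushed in order once connectivity returns."""

    def __init__(self, send, max_samples=10000):
        self.send = send                 # callable(sample) -> bool (True = delivered)
        self.buffer = deque(maxlen=max_samples)

    def record(self, sample):
        # Always record locally; connectivity state is irrelevant here.
        self.buffer.append(sample)

    def flush(self):
        # Drain oldest-first; stop (keeping the rest) if the link drops again.
        sent = 0
        while self.buffer:
            if not self.send(self.buffer[0]):
                break
            self.buffer.popleft()
            sent += 1
        return sent
```

A production agent would persist the buffer to disk so a power cycle does not lose it; the bounded deque mirrors the retention trade-off discussed later in the talk.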

(09:10): So let's talk about our observability stack. We actually migrated from Datadog two years ago, so this stack is relatively new to us: Prometheus, Grafana, and Thanos. We use Prometheus for our backend systems, tracking all the applications running in production. Grafana is a fantastic product with a lot of integrations for different tools and data sources, which gives us a unified experience across the different production deployments. And Thanos: because we receive tons of logs and tons of telemetry, it is important for us to run global queries across all our production deployments. These three tools are not the complete set, though.
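Thanos Querier speaks the Prometheus-compatible HTTP API, so a global query is an ordinary `/api/v1/query` request with deduplication enabled, fanned out across every connected store. A small sketch that just builds such a URL (the endpoint name is made up):

```python
import urllib.parse

def thanos_query_url(base_url: str, promql: str) -> str:
    """Build a query URL for a Thanos Querier, which fans the PromQL
    expression out to all connected stores and merges the results.
    dedup=true collapses replicated series from HA Prometheus pairs."""
    params = urllib.parse.urlencode({"query": promql, "dedup": "true"})
    return f"{base_url}/api/v1/query?{params}"
```

In real use you would fetch this URL (e.g. with `urllib.request.urlopen`) and read `data.result` from the JSON response, exactly as with a plain Prometheus server.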

(10:07): We track EKS. When we deploy in the cloud, we actually deploy to AWS, GCP, and Azure, so getting telemetry from the managed Kubernetes environments is important for us. For InfluxDB we use the second version, and we keep larger virtual machines for this type of workload, because we have high-cardinality data, and constructing an analytical request over many time series can spike CPU and memory. That is why we keep the bigger VMs, and I was happy to see in a presentation today that InfluxDB 3.0 solves a lot of those performance issues. OpenSearch is not just one instance; in our deployments OpenSearch runs as multi-node clusters. We have two multi-node clusters handling different workloads: one for the logs coming from the devices, and one for the internal apps running on our production systems. Altogether, this represents a simplified picture of our observability stack.

(11:28): This unified stack is one cluster. Of course we have tons of microservices running alongside, but overall it is EKS logs, Istio, InfluxDB, OpenSearch, Prometheus, Grafana, and Thanos. We use Prometheus Alertmanager to send alerts to Opsgenie and FireHydrant; since Atlassian decided to stop selling Opsgenie as of the 4th of June, we prepared for the migration and chose FireHydrant. This deployment represents one unified cluster, which we can run in air-gapped environments or inside our own systems, and together these make up our multi-cluster environment. This picture is a brief introduction to our multi-cluster deployments: we deploy in GCP, AWS, and Azure, in different regions and different countries. So our deployments are really distributed, edge nodes can connect to and disconnect from each deployment, and an air-gapped environment can be completely isolated.

(12:49): We use Loki for our IT ops and security operations, sending logs from our internal IT systems, for example Cloudflare and Okta, and we track all access to device log data. For example, when an engineer goes into a deployment to look at the logging for a particular customer, we track that connectivity and authorization in the logs. Thanos gives us the ability to avoid transferring tons of data to the centralized observability stack; we can run queries across the whole production system and deployments. Grafana gives us a unified experience across all the deployments, because we use ArgoCD and GitOps frameworks to deploy the production systems. That means when developers log into each cluster, they get the same experience across deployments. Now let's talk about one of the complicated parts.

(14:02): The air-gapped environment: we have no connectivity with the centralized observability stack, and in some cases the devices work only inside that enterprise environment. So we keep a virtual cluster inside our own infrastructure that holds all the details about that cluster: we know which version and which configuration it runs. And we know that somewhere out there the real cluster is running, and we need to deliver security updates, new versions, and configuration changes, and extract the logs back to track everything going on there. So first of all we pack our container registry, Helm charts, Helm chart museums, and Vault secrets, encrypt the bundle, ship it to the air-gapped environment, deploy it, and extract the logs and metrics on the way back. This exactly represents our multi-cluster deployment, and centralization gives us the ability to unify all alerting in one place and send the alerts to Slack and FireHydrant.
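The pack-and-ship step can be sketched as bundling the artifacts into one archive with a checksum the air-gapped side can verify before deploying. A simplified illustration (the artifact names are hypothetical, and secrets are assumed to be encrypted before packing):

```python
import hashlib
import tarfile
from pathlib import Path

# Hypothetical artifact layout; a real bundle would include the
# registry export, chart repository, and pre-encrypted secrets.
ARTIFACTS = ["registry-export", "helm-charts", "vault-secrets.enc"]

def pack_bundle(src_dir: str, out_path: str) -> str:
    """Pack release artifacts into one archive and return its SHA-256,
    so the air-gapped side can verify integrity before deploying."""
    with tarfile.open(out_path, "w:gz") as tar:
        for name in ARTIFACTS:
            p = Path(src_dir) / name
            if p.exists():
                tar.add(p, arcname=name)  # directories are added recursively
    return hashlib.sha256(Path(out_path).read_bytes()).hexdigest()
```

The return trip works the same way in reverse: logs and metrics are packed on the air-gapped side and verified against their checksum before being ingested into the central stack.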

(15:25): And of course Thanos gives us the ability to store that data for a longer time, because we have terabytes of logs coming from the edge devices, and we have different retention periods for the device information: we store it for 30 days, 90 days, or, in certain cases, up to one year. That is why it is critical for us to store data long-term and to track all the changes in these deployments. Now let's move on to the takeaways. In our cases, you need to store everything locally: the whole observability stack should be localized, and the deployment and the cluster should be self-sufficient, providing the full monitoring stack and the ability to troubleshoot devices in place without transferring data back. And, I would say, mind the gap. If you want to learn more, please visit zededa.com and check out our open source project, EVE. Today I can also announce one of our achievements: one of the big logistics companies, Maersk, which has hundreds of vessels and shipping containers around the world, chose us as a technology partner. So we will be deployed not only on trucks and trains, but also on ships. I want to say thank you; thank you for your attention.
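The tiered retention described above can be expressed as a simple category-to-days mapping; the tiers below only echo the numbers mentioned in the talk and are not ZEDEDA's actual configuration:

```python
# Hypothetical retention tiers matching the figures mentioned in the talk:
# 30 days, 90 days, and up to one year for certain categories.
RETENTION_DAYS = {"device-logs": 30, "device-metrics": 90, "audit": 365}

def retention_for(category: str) -> int:
    """Return the retention period (days) for a data category,
    defaulting unknown categories to the shortest tier."""
    return RETENTION_DAYS.get(category, 30)
```

In a real deployment these values would live in the object-store lifecycle rules or the Thanos compactor's retention flags rather than in application code.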