Capacity Planning

This section helps you estimate resource requirements, choose scaling strategies, and apply best practices to ensure optimal IEM Pro operation. Capacity planning means predicting future demand and ensuring that sufficient resources are available to meet it. This involves considering factors such as CPU, memory, storage, and network throughput, with the number of IEDs as the primary scaling factor.

Note
It is recommended to set resource quotas on the Kubernetes namespace used for the IEM deployment. This is beneficial for several reasons:

  • Isolation and Multi-tenancy: Kubernetes namespaces provide a way to partition cluster resources between multiple IEM installations and reserve the desired amount of resources for the namespace.
  • Predictability and Planning: Resource quotas provide predictability for resource usage within Kubernetes namespaces.
  • Security and Compliance: Enforcing resource quotas can also enhance security and compliance within Kubernetes clusters. By limiting the amount of resources that can be consumed within a namespace, administrators can mitigate the impact of potential resource exhaustion attacks or runaway workloads.

Default Setup

The default installation sets up a minimal configuration that can serve approximately 100 devices, store 20 IED applications (estimated app file size 1.5 GB), and manage 20 versions of device firmware.

Note
These values do not include custom configurations of the device monitoring and logging services (streaming service). In a basic setup, these services require about 50 MB of storage per device, and metrics and logs are persisted for 7 days by default.

Each device uploads its log file once per day, typically 120 MB, and 3 files are kept by default. This is the legacy mode of the logging solution.
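As a rough cross-check of these defaults, the per-device and per-artifact figures can be combined into a storage estimate. This is an illustrative sketch: it assumes the ~1 GB firmware blob size given later in this section and treats the 50 MB monitoring figure as a per-device total.

```python
# Rough storage cross-check for the default setup (illustrative sketch).
def default_storage_gb(devices=100, apps=20, firmware_versions=20):
    monitoring = devices * 0.05         # ~50 MB metrics/logs per device
    legacy_logs = devices * 0.120 * 3   # daily ~120 MB log file, 3 kept
    app_store = apps * 1.5              # ~1.5 GB per IED application
    firmware = firmware_versions * 1.0  # ~1 GB per firmware blob (assumed)
    return monitoring + legacy_logs + app_store + firmware

print(default_storage_gb())  # roughly 91 GB, within the 250 GB default
```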

Storage   CPU   RAM
250 GB    4     16 GB

Note
The default configuration can be found in IEM Pro's Helm Chart. The IEM typically uses less memory than these values suggest, because they account for peaks and match the resource limits set in the Helm Chart.

Estimate capacity based on the number of onboarded devices

Each device communicates with the IEM regularly, initiating nine distinct types of requests by default. These requests include heartbeat signals, updates for current settings, inquiries about pending jobs, and transmission of metrics and logs. The frequency of communication varies for each request; for instance, heartbeat signals are dispatched every 60 seconds by default. On average, it can be assumed that each device contacts the IEM approximately every 10 seconds, transmitting a payload of 3 KB.
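Under this average pattern (one contact roughly every 10 seconds with a ~3 KB payload), the aggregate request rate and payload bandwidth scale linearly with fleet size. A minimal sketch, covering payload traffic only:

```python
def fleet_load(devices, interval_s=10, payload_kb=3):
    """Aggregate device-to-IEM load under the average contact pattern."""
    requests_per_second = devices / interval_s
    payload_kb_s = requests_per_second * payload_kb
    return requests_per_second, payload_kb_s

rps, kbps = fleet_load(1000)
print(rps, kbps)  # 1000 devices -> 100 requests/s, ~300 KB/s of payload
```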

Considering this communication pattern, you can estimate the memory requirements for components such as the portal-service statefulset and the portal-wss deployment using the following formula:

Required Memory = 0.5 MB × Number of Devices + 8 GB

The base memory of 8 GB covers tasks that are only minimally influenced by device communication, or that are affected only when installation or firmware update jobs are initiated. CPU-intensive tasks, such as generating cryptographic material, are executed exclusively during the onboarding of new devices. For all other operations, the IEM is primarily IO-bound. The CPU requirement for the portal-service statefulset and portal-wss deployment can be approximated using the following formula:

Required CPUs = 0.005 CPU Shares × Number of Devices + 1 CPU

The network throughput can be approximated at 1 kilobyte per second per device.
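The two formulas above, plus the network rule of thumb, can be wrapped in a small helper. This is a sketch under the stated assumptions; the function and key names are illustrative.

```python
def portal_capacity(devices):
    """Estimate portal-service/portal-wss resources from the device count.

    Memory:  0.5 MB per device + 8 GB base
    CPU:     0.005 CPU shares per device + 1 CPU base
    Network: ~1 KB/s per device
    """
    memory_gb = (0.5 * devices) / 1024 + 8  # convert MB share to GB
    cpus = 0.005 * devices + 1
    network_kb_s = 1 * devices
    return {"memory_gb": round(memory_gb, 2),
            "cpus": round(cpus, 2),
            "network_kb_s": network_kb_s}

print(portal_capacity(2000))  # e.g. ~9 GB memory and 11 CPUs for 2000 devices
```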

Device applications, device firmware, and legacy-mode logs are stored in file storage. On average, each firmware blob has a file size of about 1 GB, while IED applications typically occupy about 1.5 GB each. A certain amount of storage is also needed for file operations such as unzipping or copying, so it is advisable to monitor capacity and keep approximately 10 GB of free space on the persistent volumes used for firmware management and app management.
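These file-storage figures translate into a simple estimate (a sketch; the function name and the fixed 10 GB headroom default are illustrative):

```python
def file_storage_gb(firmware_versions, apps, headroom_gb=10):
    """Estimate persistent-volume needs for firmware and app management."""
    return firmware_versions * 1.0 + apps * 1.5 + headroom_gb

print(file_storage_gb(firmware_versions=20, apps=20))  # 20 + 30 + 10 = 60.0
```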

Scaling Services and Components

Scaling strategies must be applied to effectively scale the IEM services deployed on Kubernetes. Scaling is crucial to address the expanding demand driven by the growing number of devices. There are two primary approaches: vertical scaling, which increases resources (CPU, memory) for existing pods (typically installed as a statefulset), and horizontal scaling, which adds more pods to a deployment on the Kubernetes cluster.

Note
This guide does not cover scaling Kubernetes nodes. Please consult the official Kubernetes guide for information on node scaling.

Portal-Service

The following services must be scaled.

Communication between the devices and the backend is handled by the 'portal-wss' service, a Kubernetes deployment. Increasing its replica count enables support for more devices, with each instance capable of handling approximately 500 devices. No additional memory or CPU tuning is required in this case.


The portal-service statefulset currently does not support horizontal scaling and instead requires adjustments to its memory and CPU settings. This involves configuring container memory and Java heap settings, calculated as:

0.1 MB × Number of Devices + 512 MB base memory
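The heap setting from this formula can be rendered as a JVM flag. This is a sketch; the function name and the truncation to whole megabytes are assumptions.

```python
def portal_heap_flag(devices):
    """Translate the heap formula (0.1 MB x devices + 512 MB) into -Xmx."""
    heap_mb = int(0.1 * devices + 512)  # truncate to whole MB (assumption)
    return f"-Xmx{heap_mb}m"

print(portal_heap_flag(10000))  # -Xmx1512m
```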


  • 1 CPU per 1000 Devices
  • 500 MB Memory + 100 MB for every additional 1000 Devices
  • Read/Write throughput on disk averages at 200 KiB/s for 1000 Devices

  • 0.2 CPU per 1000 Devices
  • 500 MB Memory base + 20 MB per 1000 Devices
  • Read/Write operations on the disk average at 10 KiB/s for 1000 Devices

Each gateway has the capacity to handle approximately 1000 devices, and scalability is achieved by increasing the replica count.
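The per-instance capacities stated above (roughly 500 devices per portal-wss instance, roughly 1000 per gateway) reduce to a simple ceiling division. A sketch; the helper name is illustrative:

```python
import math

def replicas_needed(devices, per_instance):
    """Replica count needed given a per-instance device capacity."""
    return max(1, math.ceil(devices / per_instance))

print(replicas_needed(3500, 1000))  # gateways: ~1000 devices each -> 4
print(replicas_needed(3500, 500))   # portal-wss: ~500 devices each -> 7
```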


Responsible for scheduling firmware management updates, this service scales horizontally by adding replicas.


The following monitoring services must be scaled.

  • Compute/Memory:
    • Adjust memory settings based on observed usage (e.g., 1000 devices require approximately 2 GB of memory).
    • 0.001 CPU per device.
  • Reduce the reporting intervals on the device to reduce the load on the monitoring service.
  • The TTL (time to live) is 7 days and can be adjusted; after this period, time-series data is dropped.
  • Network: 0.5 KiB/s per device.
  • Storage: 50 MB per device per day.
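The monitoring rules of thumb above can be combined into a quick sizing helper. This is a sketch assuming the default 7-day TTL; names are illustrative.

```python
def monitoring_sizing(devices, retention_days=7):
    """Estimate monitoring-service resources from the rules of thumb."""
    memory_gb = 2.0 * devices / 1000              # ~2 GB per 1000 devices
    cpus = 0.001 * devices                        # 0.001 CPU per device
    storage_gb = devices * 0.05 * retention_days  # 50 MB/device/day
    network_kib_s = 0.5 * devices
    return {"memory_gb": memory_gb, "cpus": cpus,
            "storage_gb": storage_gb, "network_kib_s": network_kib_s}

print(monitoring_sizing(1000))  # ~2 GB RAM, 1 CPU, ~350 GB over 7 days
```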

You can use the following values as an example for a Helm values configuration:
 global:
   hostname: myexample.com
   enableCpuEnforcement: true
   enableMemoryEnforcement: true
 kong:
   proxy:
     replicaCount: 3
     resources:
       requests:
         cpu: 100m
         memory: 128Mi
       limits:
         memory: 1Gi
         cpu: 1
   auth:
     resources:
       requests:
         cpu: 100m
         memory: 128Mi
       limits:
         memory: 512Mi
         cpu: "500m"

 ams:
   app-manager:
     resources:
       limits:
         cpu: "260m"
         memory: "512Mi"
   chartmuseum:
     resources:
       limits:
         cpu: "260m"
         memory: "1Gi"

 portal:
   storage:
     storageCapacityPortal: "300Gi"
     storageCapacityPortalHub: "100Gi"
   service:
     resources:
       requests:
         cpu: 100m
         memory: 128Mi
       limits:
         cpu: 4
         memory: 3Gi
     heap: -Xmx2048m
   portalUI:
     resources:
       requests:
         cpu: 100m
         memory: 128Mi
       limits:
         cpu: 500m
         memory: 512Mi

   wss:
     replicaCount: 3
     resources:
       requests:
         cpu: 100m
         memory: 128Mi
       limits:
         cpu: 2
         memory: 1Gi

   hub:
     resources:
       requests:
         cpu: 100m
         memory: 128Mi
       limits:
         cpu: 1
         memory: 2Gi


 central-auth:
   keycloak:
     resources:
       requests:
         cpu: 100m
         memory: 128Mi
       limits:
         cpu: 1
         memory: 1Gi
   cauth:
     resources:
       requests:
         cpu: 100m
         memory: 128Mi
       limits:
         cpu: 1
         memory: 1Gi

 device-catalog:
   workflowexecutor:
     enabled: true
     resources:
       requests:
         memory: "256Mi"
         cpu: "65m"
       limits:
         memory: "512Mi"
         cpu: 1
     wfx:
       replicaCount: 2
     wfxQx:
       replicaCount: 1
   firmwaremanagement:
     storage:
       storageCapacity: "100Gi"
     enabled: true
     fileservice:
       resources:
         limits:
           cpu: 2
           memory: "8Gi"
     onPremFirmware:
       resources:
         limits:
           cpu: 2
           memory: "2Gi"
     firmwareManager:
       resources:
         limits:
           cpu: "512m"
           memory: "256Mi"
   devicetypemanagement:
     enabled: true
     resources:
       requests:
         memory: "256Mi"
         cpu: "65m"
       limits:
         memory: "1G"
         cpu: "500m"


   postgres:
     storage:
       storageCapacityPostgres: "50Gi"
     resources:
       requests:
         memory: 128Mi
         cpu: 65m
       limits:
         memory: 1Gi
         cpu: 1

 launchpad:
   enabled: true
   resources:
     limits:
       memory: 128Mi
       cpu: "32m"
     requests:
       memory: "64Mi"
       cpu: "32m"

 licensing-service:
   resources:
     limits:
       cpu: 1
       memory: 512Mi
     requests:
       memory: 64Mi
       cpu: 64m

 job-manager:
   ui:
     resources:
       requests:
         memory: "64Mi"
         cpu: "32m"
       limits:
         memory: "128Mi"
         cpu: 100m
   resources:
     requests:
       memory: "256Mi"
       cpu: "100m"
     limits:
       memory: "1G"
       cpu: "512m"

 wfx:
   replicaCount: 1
   resources:
     requests:
       memory: "64Mi"
       cpu: "32m"
     limits:
       memory: "128Mi"
       cpu: "32m"

 iema-backend:
   resources:
     limits:
       memory: 2Gi
       cpu: 2
     requests:
       cpu: 10m
       memory: 16Mi

 state-service:
   service:
     resources:
       requests:
         cpu: 100m
         memory: 128Mi
       limits:
         cpu: 1
         memory: 1Gi
   master:
     resources:
       limits:
         memory: 1Gi
         cpu: 1
       requests:
         memory: 32Mi
         cpu: 10m
   volume:
     resources:
       limits:
         memory: 8Gi
         cpu: 2
       requests:
         memory: 1Gi
         cpu: 100m
   filer:
     resources:
       limits:
         memory: 1Gi
         cpu: 1
       requests:
         memory: 128Mi
         cpu: 64m
   s3:
     resources:
       limits:
         memory: 1Gi
         cpu: 1
       requests:
         memory: 32Mi
         cpu: 64m
   iam:
     resources:
       limits:
         memory: 1Gi
         cpu: 256m
       requests:
         memory: 16Mi
         cpu: 10m
 postgres:
   storage:
     storageCapacityPostgres: "50Gi"
   resources:
     limits:
       cpu: 2
       memory: 1Gi
 tunnel:
   resources:
     limits:
       cpu: "512m"
       memory: "1Gi"

Limitations

  • The storage layer can only scale vertically, by adding more CPU, memory, and IOPS. This affects
  • Postgres DB
  • File storage for firmware and backups
  • Some backend services, which are the interfaces for user interaction, creation, deletion, and management of applications and devices, are currently not horizontally scalable, resulting in a limit to the number of concurrent installation jobs and firmware updates. This limit is highly dependent on the underlying storage class.
    In a typical scenario, updates and installations can be performed in a batch size of 10 concurrent installations.
  • Tunnel connections
  • One port is reserved for each device in the range 30000-32767 (the default Kubernetes NodePort range). This gives a hard limit of 2768 devices.
  • With default timings for the monitoring/logging service, there is a soft limit of 1000 devices for a typical storage tier (disks).