Capacity Planning¶
This section describes the steps for estimating resource requirements, choosing scaling strategies, and applying best practices to ensure optimal IEM Pro operation. Capacity planning means predicting future demand and ensuring that sufficient resources (CPU, memory, storage, and network throughput) are available to meet it, using the number of connected IEDs as the primary scaling factor.
Note It is recommended to set resource quotas on the Kubernetes namespace used for the IEM deployment. This is valuable for several reasons:
- Isolation and Multi-tenancy: Kubernetes namespaces provide a way to partition cluster resources between multiple IEM installations and reserve the desired amount of resources for the namespace.
- Predictability and Planning: Resource quotas provide predictability for resource usage within Kubernetes namespaces.
- Security and Compliance: Enforcing resource quotas can also enhance security and compliance within Kubernetes clusters. By limiting the amount of resources that can be consumed within a namespace, administrators can mitigate the impact of potential resource exhaustion attacks or runaway workloads.
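As an illustration, a namespace quota aligned with the default sizing described below might look like the following sketch. The namespace and quota names are placeholders; the figures mirror the default setup in the next section and should be adjusted to your own sizing:

```yaml
# Illustrative ResourceQuota for an IEM namespace (names and values are
# placeholders; align them with your own capacity estimates).
apiVersion: v1
kind: ResourceQuota
metadata:
  name: iem-quota
  namespace: iem
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 16Gi
    limits.cpu: "8"
    limits.memory: 24Gi
    requests.storage: 250Gi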
Default Setup¶
The default installation sets up a minimal configuration that can serve approximately 100 devices, store 20 IED applications (estimated app file size 1.5 GB), and manage 20 versions of device firmware.
Note These values do not include custom configuration of the device monitoring and logging services (streaming service). In a basic setup, these services require 50 MB of storage per device per day, and metrics and logs are persisted for 7 days by default.
Devices upload their log file once a day; a log file is typically 120 MB, and 3 files are kept per device by default. This is the legacy mode of the logging solution.
| Storage | CPU | RAM |
|---|---|---|
| 250 GB | 4 | 16 GB |
Note The default configuration can be found in IEM Pro's Helm chart. The IEM typically uses less memory than these settings suggest, because they account for peak usage and the resource limits set in the Helm chart.
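As a sanity check, the default sizing can be reproduced with a few lines of arithmetic. This is a rough sketch: it reads the 50 MB monitoring figure as per device per day (as stated in the monitoring section below) and ignores databases and other overhead:

```python
# Rough storage estimate for the default setup (figures from this section).
devices = 100
apps_gb = 20 * 1.5                    # 20 IED applications, ~1.5 GB each
firmware_gb = 20 * 1.0                # 20 firmware versions, ~1 GB each
monitoring_gb = devices * 0.050 * 7   # 50 MB/device/day, kept for 7 days
logs_gb = devices * 0.120 * 3         # 120 MB log file per day, 3 files kept

total_gb = apps_gb + firmware_gb + monitoring_gb + logs_gb
print(f"~{total_gb:.0f} GB used of the 250 GB default")
```

This lands well below the 250 GB default, leaving headroom for databases, file operations, and growth.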
Estimate capacity based on number of devices onboarded¶
Each device communicates with the IEM regularly, initiating nine distinct types of requests by default. These requests include heartbeat signals, updates for current settings, inquiries about pending jobs, and transmission of metrics and logs. The frequency of communication varies for each request; for instance, heartbeat signals are dispatched every 60 seconds by default. On average, it can be assumed that each device contacts the IEM approximately every 10 seconds, transmitting a payload of 3 KB.
Considering this communication pattern, you can estimate the memory requirements for, e.g., the portal-service statefulset and the portal-wss deployment using the following formula:
Required Memory = 0.5 MB × Number of Devices + 8 GB
The base memory of 8 GB covers tasks that are only minimally influenced by device communication, or that are affected only when installation or firmware update jobs are initiated. CPU-intensive tasks, such as generating cryptographic material, are executed exclusively during the onboarding of new devices; for all other operations, the IEM is primarily IO-bound.
The CPU requirement for the portal-service statefulset and portal-wss deployment can be approximated using the following formula:
Required CPUs = 0.005 CPU Shares × Number of Devices + 1 CPU
The network throughput can be approximated at 1 kilobyte per second per device.
Device applications, device firmware, and legacy-mode logs are stored in file storage. On average, each firmware blob can be assumed to have a file size of 1 GB, while IED applications typically occupy 1.5 GB per application. A certain amount of additional storage is needed for file operations such as unzipping and copying, so it is advisable to monitor capacity and keep approximately 10 GB of free space available on the persistent volumes used for firmware management and app management.
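The formulas above can be combined into a small sizing helper. This is a sketch under the stated assumptions: the function and parameter names are illustrative, and the default app and firmware counts are taken from the default setup:

```python
def estimate_capacity(num_devices: int, num_apps: int = 20, num_firmware: int = 20):
    """Estimate IEM resource needs from the device count (illustrative)."""
    memory_gb = 0.5 * num_devices / 1024 + 8   # 0.5 MB/device + 8 GB base
    cpu_shares = 0.005 * num_devices + 1       # 0.005 CPU/device + 1 base CPU
    network_kib_s = 1.0 * num_devices          # ~1 KB/s per device
    # Firmware blobs ~1 GB, apps ~1.5 GB, plus ~10 GB of free headroom
    # for file operations such as unzipping or copying.
    storage_gb = num_firmware * 1.0 + num_apps * 1.5 + 10
    return memory_gb, cpu_shares, network_kib_s, storage_gb

mem, cpu, net, disk = estimate_capacity(1000)
print(f"{mem:.1f} GB RAM, {cpu:.0f} CPUs, {net:.0f} KiB/s, {disk:.0f} GB disk")
```

For 1000 devices this yields roughly 8.5 GB of memory and 6 CPUs for portal-service and portal-wss, plus file storage for apps and firmware.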
Scaling Services and Components¶
To keep pace with the expanding demand driven by a growing number of devices, the IEM services deployed on Kubernetes must be scaled.
There are two primary approaches: vertical scaling, which increases the resources (CPU, memory) of existing pods (typically installed as a statefulset), and horizontal scaling, which adds more pods to a deployment in the Kubernetes cluster.
Note This guide does not cover scaling Kubernetes nodes. Please consult the official Kubernetes guide for information on node scaling.
Portal-Service¶
The following services must be scaled.
Communication between the devices and the backend service, 'portal-wss', utilizes a Kubernetes deployment. Increasing the number of replicas of this service enables support for more devices, with each instance capable of handling approximately 500 devices. No additional memory or CPU tuning is required in this case.
The portal-service statefulset currently lacks support for horizontal scaling and instead requires adjusting the statefulset's memory and CPU settings. This involves configuring the memory settings and the Java heap, calculated as:
0.1 MB × Number of Devices + 512 MB base memory
- 1 CPU per 1000 Devices
- 500 MB Memory + 100 MB for every additional 1000 Devices
- Read/Write throughput on disk averages at 200 KiB/s for 1000 Devices
- 0.2 CPU per 1000 Devices
- 500 MB Memory base + 20 MB per 1000 Devices
- Read/Write operations on the disk average at 10 KiB/s for 1000 Devices
Each gateway has the capacity to handle approximately 1000 devices, and scalability is achieved by increasing the replica count.
Responsible for scheduling firmware management updates, this service can scale horizontally by adding additional replicas.
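For the horizontally scalable components, the replica count follows directly from the per-instance capacities given above (portal-wss at roughly 500 devices per replica, gateways at roughly 1000). A minimal sketch:

```python
import math

def replicas(num_devices: int, devices_per_instance: int) -> int:
    """Minimum replica count for a given per-instance device capacity."""
    return max(1, math.ceil(num_devices / devices_per_instance))

# e.g. for 2500 devices:
print(replicas(2500, 500))   # portal-wss replicas -> 5
print(replicas(2500, 1000))  # gateway replicas    -> 3
```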
The following monitoring services must be scaled.
- Compute/Memory
  - Adjust memory settings based on observed usage (e.g., 1000 devices require approximately 2 GB of memory).
  - Plan for 0.001 CPU per device.
  - Reducing the reporting frequency on the devices lowers the load on the monitoring service.
  - The TTL (time-to-live) is 7 days by default and can be adjusted; after this period, time-series data is dropped.
- Network: 0.5 KiB/s per device.
- Storage: 50 MB per device per day.
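The monitoring figures above translate into a simple per-device sizing rule. This is a sketch; the names are illustrative, and the storage term assumes the default 7-day TTL:

```python
def monitoring_sizing(num_devices: int, retention_days: int = 7) -> dict:
    """Estimate monitoring-service needs (illustrative helper)."""
    return {
        "memory_gb": 2.0 * num_devices / 1000,  # ~2 GB per 1000 devices
        "cpu": 0.001 * num_devices,             # 0.001 CPU per device
        "network_kib_s": 0.5 * num_devices,     # 0.5 KiB/s per device
        # 50 MB per device per day, dropped after the TTL expires.
        "storage_gb": 0.050 * num_devices * retention_days,
    }

print(monitoring_sizing(1000))
```

Note how storage dominates: at the default TTL, 1000 devices need roughly 350 GB for time-series data alone, which is why the soft limit in the Limitations section is tied to the storage tier.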
You can use the following values as an example for the Helm chart configuration:
```yaml
global:
  hostname: myexample.com
  enableCpuEnforcement: true
  enableMemoryEnforcement: true
kong:
  proxy:
    replicaCount: 3
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        memory: 1Gi
        cpu: 1
  auth:
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        memory: 512Mi
        cpu: "500m"
ams:
  app-manager:
    resources:
      limits:
        cpu: "260m"
        memory: "512Mi"
  chartmuseum:
    resources:
      limits:
        cpu: "260m"
        memory: "1Gi"
portal:
  storage:
    storageCapacityPortal: "300Gi"
    storageCapacityPortalHub: "100Gi"
  service:
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 4
        memory: 3Gi
    heap: -Xmx2048m
  portalUI:
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 512Mi
  wss:
    replicaCount: 3
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 2
        memory: 1Gi
  hub:
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 1
        memory: 2Gi
central-auth:
  keycloak:
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 1
        memory: 1Gi
  cauth:
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 1
        memory: 1Gi
device-catalog:
  workflowexecutor:
    enabled: true
    resources:
      requests:
        memory: "256Mi"
        cpu: "65m"
      limits:
        memory: "512Mi"
        cpu: 1
  wfx:
    replicaCount: 2
  wfxQx:
    replicaCount: 1
firmwaremanagement:
  storage:
    storageCapacity: "100Gi"
  enabled: true
  fileservice:
    resources:
      limits:
        cpu: 2
        memory: "8Gi"
  onPremFirmware:
    resources:
      limits:
        cpu: 2
        memory: "2Gi"
  firmwareManager:
    resources:
      limits:
        cpu: "512m"
        memory: "256Mi"
devicetypemanagement:
  enabled: true
  resources:
    requests:
      memory: "256Mi"
      cpu: "65m"
    limits:
      memory: "1G"
      cpu: "500m"
  postgres:
    storage:
      storageCapacityPostgres: "50Gi"
    resources:
      requests:
        memory: 128Mi
        cpu: 65m
      limits:
        memory: 1Gi
        cpu: 1
launchpad:
  enabled: true
  resources:
    limits:
      memory: 128Mi
      cpu: "32m"
    requests:
      memory: "64Mi"
      cpu: "32m"
licensing-service:
  resources:
    limits:
      cpu: 1
      memory: 512Mi
    requests:
      memory: 64Mi
      cpu: 64m
job-manager:
  ui:
    resources:
      requests:
        memory: "64Mi"
        cpu: "32m"
      limits:
        memory: "128Mi"
        cpu: 100m
  resources:
    requests:
      memory: "256Mi"
      cpu: "100m"
    limits:
      memory: "1G"
      cpu: "512m"
  wfx:
    replicaCount: 1
    resources:
      requests:
        memory: "64Mi"
        cpu: "32m"
      limits:
        memory: "128Mi"
        cpu: "32m"
iema-backend:
  resources:
    limits:
      memory: 2Gi
      cpu: 2
    requests:
      cpu: 10m
      memory: 16Mi
state-service:
  service:
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 1
        memory: 1Gi
  master:
    resources:
      limits:
        memory: 1Gi
        cpu: 1
      requests:
        memory: 32Mi
        cpu: 10m
  volume:
    resources:
      limits:
        memory: 8Gi
        cpu: 2
      requests:
        memory: 1Gi
        cpu: 100m
  filer:
    resources:
      limits:
        memory: 1Gi
        cpu: 1
      requests:
        memory: 128Mi
        cpu: 64m
  s3:
    resources:
      limits:
        memory: 1Gi
        cpu: 1
      requests:
        memory: 32Mi
        cpu: 64m
  iam:
    resources:
      limits:
        memory: 1Gi
        cpu: 256m
      requests:
        memory: 16Mi
        cpu: 10m
postgres:
  storage:
    storageCapacityPostgres: "50Gi"
  resources:
    limits:
      cpu: 2
      memory: 1Gi
tunnel:
  resources:
    limits:
      cpu: "512m"
      memory: "1Gi"
```
Limitations¶
- The storage layer can only scale vertically, by adding more CPU, memory, and IOPS. This affects:
  - Postgres DB
  - File storage for firmware and backups
- Some backend services, which serve as the interfaces for user interaction and for creating, deleting, and managing applications and devices, are currently not horizontally scalable. This limits the number of concurrent installation jobs and firmware updates. The limit depends heavily on the underlying storage class; in a typical scenario, updates and installations can be performed in batches of 10 concurrent installations.
- Tunnel connections
  - One port in the range of 30000-32767 is reserved for each device. This results in a hard limit of 2768 devices.
- With the default timing for the monitoring/logging service, there is a soft limit of 1000 devices on a typical storage tier (disks).