k8s

---

Interview

From: https://github.com/cloudpilot-ai/interview

Distributed-Scheduling

Background

All cloud providers, such as AWS and Google, offer spot instances. They are quite cheap (around 10% of the on-demand price), but after you buy them, they can be terminated with only two minutes’ notice (in most scenarios we don’t set a PDB, and we should perform a graceful drain).

So, I want you to design a strategy that maximizes the use of spot instances without causing service interruptions, instead of relying solely on on-demand instances, in order to cut costs. The strategy should use distributed scheduling in a single cluster (for example, mixing on-demand and spot nodes for one workload, or other methods). This matters because all spot instances being terminated at the same time could interrupt different kinds of workloads (single-replica workloads, multi-replica workloads).

Also, I don’t want to change the scheduler already used in the K8s cluster, and I want to keep the number of extra components in the cluster to a minimum.

Notes:

  1. On-demand nodes have the label node.kubernetes.io/capacity: on-demand.
  2. Spot nodes have the label node.kubernetes.io/capacity: spot.
  3. Workloads are represented as Deployments and StatefulSets.
  4. On-demand/spot instances are represented as K8s nodes in the cluster.
  5. Only focus on scheduling control; the graceful drain after receiving the termination notification is handled by other components.

Analyze

Use as few cloud-provider on-demand instances as possible and shift workloads to spot instances to reduce cost.

Spot instances are drained fairly frequently, so we need to design a scheduling system that solves the problem of completing a graceful drain within the two-minute notice.
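Before going into the components, here is a minimal sketch (in Go, using the k8s.io/api types) of what the per-workload placement policy could look like with the capacity labels from the notes above: replicas prefer cheap spot nodes but are spread across both capacity types, so a simultaneous spot termination never removes every replica. These are constraints the unchanged kube-scheduler already understands; a controller or mutating webhook would inject them into the pod template, and the helper below is only illustrative.

```go
package scheduling

import (
	corev1 "k8s.io/api/core/v1"
)

const capacityLabel = "node.kubernetes.io/capacity"

// SpotFriendlyConstraints builds scheduling constraints for a pod template:
// a soft preference for spot nodes plus a topology spread over the capacity
// label, so replicas end up on both spot and on-demand nodes.
func SpotFriendlyConstraints() (*corev1.Affinity, []corev1.TopologySpreadConstraint) {
	affinity := &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			// Soft preference: land on spot whenever capacity allows.
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.PreferredSchedulingTerm{{
				Weight: 100,
				Preference: corev1.NodeSelectorTerm{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      capacityLabel,
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{"spot"},
					}},
				},
			}},
		},
	}

	spread := []corev1.TopologySpreadConstraint{{
		// Treat the capacity label as a topology domain so replicas are
		// balanced between the spot and on-demand groups of nodes.
		MaxSkew:           1,
		TopologyKey:       capacityLabel,
		WhenUnsatisfiable: corev1.ScheduleAnyway,
		LabelSelector:     nil, // fill in the workload's own selector when injecting
	}}
	return affinity, spread
}
```

For single-replica workloads the spread constraint has no effect; those either stay pinned to on-demand nodes or rely purely on the termination-notice path described in the modules below.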

Arch-graph

Modules-Design-doc

  • kube-scheduler:

    It handles the initial placement of Pods on the available nodes (either On-demand or Spot). It makes decisions based on resource availability, but it does not account for the dynamic nature of Spot instances, which is why the Distributed Scheduling Controller complements it.

  • Distributed Scheduling Controller:

    This is a custom controller that monitors the state of Spot nodes. It reacts to termination notifications from the cloud provider, triggering re-scheduling of Pods from Spot nodes to On-demand nodes when necessary. It interacts with both the kube-scheduler and k8s API Server to coordinate these actions.

  • Event Listener:

    A sub-component of the Distributed Scheduling Controller, it listens for termination notifications from the cloud provider. When a notification is received, it initiates the re-scheduling process by triggering the necessary operations within the controller (see the combined sketch after this list).

  • Re-schedule:

    This module represents the actual process of migrating Pods from Spot nodes to On-demand nodes. It ensures that services continue running smoothly by moving affected workloads to stable nodes before the Spot nodes are terminated.

  • Cloud Provider’s Notification:

    This is the external signal from the cloud provider indicating that a Spot node is about to be terminated. The Distributed Scheduling Controller listens for this notification and triggers re-scheduling as needed.
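A minimal sketch of the Event Listener plus re-schedule path follows, assuming AWS-style interruption notices exposed through the instance metadata endpoint (other clouds need their own listener, and IMDSv2 would additionally require a session token) and assuming the component knows the name of the node it runs on:

```go
// Sketch: wait for a spot termination notice, then cordon the node and evict
// its Pods so the unchanged kube-scheduler re-places them on other nodes.
package rescheduler

import (
	"context"
	"fmt"
	"net/http"
	"time"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// AWS exposes spot interruption notices at this path; it returns 404 until a
// notice exists. This endpoint and polling approach are cloud-specific.
const spotNoticeURL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

// waitForTerminationNotice blocks until the metadata service reports that the
// local spot instance is scheduled for termination (about 2 minutes ahead).
func waitForTerminationNotice(ctx context.Context) {
	for ctx.Err() == nil {
		if resp, err := http.Get(spotNoticeURL); err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return
			}
		}
		time.Sleep(5 * time.Second)
	}
}

// drainNode cordons the node and evicts every Pod on it via the Eviction API,
// which still respects PDBs if any are configured.
func drainNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	patch := []byte(`{"spec":{"unschedulable":true}}`)
	if _, err := client.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return fmt.Errorf("cordon %s: %w", nodeName, err)
	}

	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: p.Name, Namespace: p.Namespace},
		}
		if err := client.PolicyV1().Evictions(p.Namespace).Evict(ctx, eviction); err != nil {
			fmt.Printf("evict %s/%s: %v\n", p.Namespace, p.Name, err)
		}
	}
	return nil
}
```

Because the node is only cordoned and its Pods evicted, the actual re-placement is done by the existing kube-scheduler, which keeps the number of extra components in the cluster minimal.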


Considerations

  • Handling Spot Instance Termination:

    The system must efficiently handle the rapid termination of Spot instances, ensuring that critical workloads are quickly migrated to On-demand nodes without causing service disruptions.

  • Monitoring and Responsiveness:

    The Event Listener and Distributed Scheduling Controller must be highly responsive to termination notifications. Delays in re-scheduling could lead to service outages.

  • Resource Management:

    It’s crucial to ensure that there are always enough resources available on On-demand nodes to accommodate Pods migrating from Spot nodes; resource constraints can lead to scheduling failures (a rough capacity-check sketch follows this list).

  • System Compatibility:

    The integration between the kube-scheduler, Distributed Scheduling Controller, and k8s API Server should be seamless to avoid conflicts in Pod scheduling and state management.

  • Testing and Reliability:

    The system should be thoroughly tested to handle different failure scenarios, such as mass Spot instance terminations or API Server unavailability, to ensure it can recover gracefully.
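As a rough illustration of the Resource Management point above, the controller could estimate the free on-demand capacity before it starts evicting Pods from a doomed spot node. The helper below only sums CPU requests and ignores memory, GPUs, and other scheduling constraints; it is a sketch, not a scheduler simulation.

```go
package capacity

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// FreeOnDemandCPU roughly estimates spare CPU on on-demand nodes: total
// allocatable CPU minus the CPU requests of Pods already running there.
func FreeOnDemandCPU(ctx context.Context, client kubernetes.Interface) (*resource.Quantity, error) {
	free := resource.NewQuantity(0, resource.DecimalSI)

	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{
		LabelSelector: "node.kubernetes.io/capacity=on-demand",
	})
	if err != nil {
		return nil, err
	}
	for _, n := range nodes.Items {
		alloc := n.Status.Allocatable[corev1.ResourceCPU]
		free.Add(alloc)

		pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
			FieldSelector: "spec.nodeName=" + n.Name,
		})
		if err != nil {
			return nil, err
		}
		for _, p := range pods.Items {
			for _, c := range p.Spec.Containers {
				if req, ok := c.Resources.Requests[corev1.ResourceCPU]; ok {
					free.Sub(req)
				}
			}
		}
	}
	return free, nil
}
```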

---

System-Architecture-Design

Background

Our rough system architecture is illustrated as follows:

Architecture Diagram

Customers install an agent component in their clusters, which pushes metrics to the API endpoint (https://api.xxx.com). On the server side, our API server processes and stores these metrics in the database. Another controller analyzes each cluster and provides optimization recommendations to customers.

I want you to outline a rough design document for the system that ensures high availability, security, and performance.

Hint: This can be implemented using Kubernetes (K8s) technologies, such as HPA, but different components may require different technologies.

Notes:

  1. You don’t need to think about the database.
  2. For each customer’s cluster, only one agent is installed; it scrapes metrics from that cluster’s kube-apiserver (a minimal push-loop sketch follows these notes).
  3. Focus on architecture design, not specific implementation.
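A minimal sketch of the agent’s push loop follows, assuming a hypothetical /v1/metrics path on the api.xxx.com endpoint, a placeholder payload schema, and a per-cluster bearer token; the real schema and auth flow belong to the API Gateway and Authorization Service described below.

```go
package agent

import (
	"bytes"
	"context"
	"encoding/json"
	"net/http"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// snapshot is an illustrative payload; the real schema is part of the API design.
type snapshot struct {
	ClusterID string `json:"clusterId"`
	Nodes     int    `json:"nodes"`
	Pods      int    `json:"pods"`
	Timestamp int64  `json:"timestamp"`
}

// pushLoop scrapes coarse metrics from the kube-apiserver and pushes them to
// the server-side API gateway on a fixed interval.
func pushLoop(ctx context.Context, client kubernetes.Interface, clusterID, token string) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}

		nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
		if err != nil {
			continue // transient error: retry on the next tick
		}
		pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
		if err != nil {
			continue
		}

		body, _ := json.Marshal(snapshot{
			ClusterID: clusterID,
			Nodes:     len(nodes.Items),
			Pods:      len(pods.Items),
			Timestamp: time.Now().Unix(),
		})

		// "/v1/metrics" is a placeholder path; the bearer token is issued per
		// cluster by the Authorization Service.
		req, _ := http.NewRequestWithContext(ctx, http.MethodPost,
			"https://api.xxx.com/v1/metrics", bytes.NewReader(body))
		req.Header.Set("Content-Type", "application/json")
		req.Header.Set("Authorization", "Bearer "+token)
		if resp, err := http.DefaultClient.Do(req); err == nil {
			resp.Body.Close()
		}
	}
}
```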

Arch-graph

Modules-Design-docs

  1. Monitoring Agent (Client Side)
    Function: The Monitoring Agent is responsible for collecting real-time data from the client-side Kubernetes cluster and receiving notifications from the cloud platform, such as spot instance termination messages. It then sends this data to the Server Side API Gateway.

    Necessity: This module is crucial for maintaining an up-to-date overview of the client-side environment, enabling the system to react dynamically to changes.

    Advantages: Centralizes data collection and processing of cloud notifications, making the client-side environment more responsive and easier to manage.

  2. Command Executor (Client Side)
    Function: Receives instructions from the Monitoring Agent and executes them on the Kubernetes cluster. Before executing, it assesses the real-time environment to ensure optimal decision-making.

    Necessity: Ensures that commands are executed with consideration of the current state of the client environment, improving operational efficiency.

    Advantages: Adds a layer of decision-making that enhances the flexibility and responsiveness of the system, reducing the risk of executing commands that could negatively impact the environment.

  3. Kubernetes Cluster (Client Side)
    Function: Represents the client-side Kubernetes infrastructure where workloads are deployed and managed.

    Necessity: The core environment where all client-side applications run, making it the focal point of monitoring and command execution.

    Advantages: Allows for scalable and automated management of containerized applications.

  4. Cloud Platform (Client Side)
    Function: Sends spot instance termination notices or other relevant notifications to the Monitoring Agent.

    Necessity: Critical for handling the dynamic and volatile nature of spot instances, which are commonly used to optimize cost.

    Advantages: Enhances the system’s ability to preemptively respond to changes in the cloud environment, improving uptime and reducing potential disruptions.

  5. API Gateway (Server Side)
    Function: Acts as the central entry point for all data sent from the client side. It handles the routing of data to the appropriate backend services, including sending instructions back to the client side.

    Necessity: Essential for managing communication between the client and server sides, ensuring secure, scalable, and organized data flow.

    Advantages: Provides a centralized control point for authentication, authorization, and traffic management, enhancing the system’s overall security and scalability.

  6. Authorization Service (Server Side)
    Function: Handles the authentication and authorization of incoming data and commands, ensuring that only verified and authorized actions are processed.

    Necessity: Critical for maintaining the security and integrity of the system by preventing unauthorized access and operations.

    Advantages: Adds a strong security layer to the system, protecting against potential breaches and unauthorized data manipulation.

  7. Message Queue (Server Side)
    Function: Decouples the ingestion of data from its processing, allowing for high-throughput and reliable handling of large volumes of data.

    Necessity: Ensures that data from multiple clients can be processed asynchronously, improving system resilience and scalability.

    Advantages: Enhances the system’s ability to handle bursts of data and maintain performance under heavy load, reducing the likelihood of bottlenecks.

  8. Worker Nodes (Server Side)
    Function: Consume messages from the queue and process the data or commands as required.

    Necessity: Essential for executing the bulk of the system’s processing tasks in a distributed and scalable manner.

    Advantages: Supports horizontal scaling, allowing the system to handle increasing workloads by simply adding more worker nodes (see the HPA sketch after this list).

  9. Processing Service (Server Side)
    Function: Core service responsible for processing the data received from the client side, making decisions, and interacting with other server-side components like the database and instruction generation module.

    Necessity: Central to the system’s ability to analyze client data and generate appropriate responses or commands.

    Advantages: Provides a flexible and extensible processing framework that can be tailored to specific application needs.

  10. Database Module (Server Side)
    Function: Stores processed data, including logs, metrics, and historical records, for analysis and reporting.

    Necessity: Ensures that all critical data is persisted for future reference, analysis, and compliance requirements.

    Advantages: Offers high availability and scalability, ensuring that the system can store and retrieve data efficiently as it grows.

  11. Instruction Generation Module (Server Side)
    Function: Generates commands and instructions based on the processed data, which are then sent back to the client side via the API Gateway.

    Necessity: Enables the system to actively manage and optimize client-side operations based on real-time data.

    Advantages: Allows for automated, data-driven decision-making, enhancing the system’s overall efficiency and responsiveness.
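Picking up the HPA hint from the background: the stateless server-side components (API Gateway, Processing Service, workers) can scale horizontally on load. A minimal sketch, assuming a Deployment named processing-service (a hypothetical name) and plain CPU-utilization scaling:

```go
package scaling

import (
	"context"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// EnsureProcessingHPA creates an HPA that scales the (hypothetical)
// processing-service Deployment between 2 and 20 replicas on CPU utilization.
func EnsureProcessingHPA(ctx context.Context, client kubernetes.Interface, namespace string) error {
	minReplicas := int32(2)
	targetCPU := int32(70)

	hpa := &autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "processing-service", Namespace: namespace},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       "processing-service",
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 20,
			Metrics: []autoscalingv2.MetricSpec{{
				Type: autoscalingv2.ResourceMetricSourceType,
				Resource: &autoscalingv2.ResourceMetricSource{
					Name: corev1.ResourceCPU,
					Target: autoscalingv2.MetricTarget{
						Type:               autoscalingv2.UtilizationMetricType,
						AverageUtilization: &targetCPU,
					},
				},
			}},
		},
	}

	_, err := client.AutoscalingV2().HorizontalPodAutoscalers(namespace).Create(ctx, hpa, metav1.CreateOptions{})
	return err
}
```

For the Message Queue consumers, scaling on queue depth (for example via an external metrics adapter or KEDA) would usually track load better than CPU utilization alone.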

---

GPU-optimization

Background

Please open your mind. In the AI world today, every company uses GPUs for training and inference. Is there anything we can do to reduce the cost of GPUs for training and inference?

Alternatively, are there other areas in the public cloud(K8s) that could be optimized?

Thought

Since the companies I’ve worked for and the startups I’ve been involved in have primarily focused on AI, I’m fairly familiar with this area. The pain points and considerations are as follows:

  1. AI Projects (Mainly Python) Have Poor Portability and Are Semi-Strongly Bound to the Operating System

    Unlike CPUs, GPUs are atomic entities and cannot be further split. Moreover, the performance of GPUs is linearly dependent on the CPU’s capabilities.

    While Torch is relatively mature and allows flexible version switching, many other libraries, and even CUDA itself, are invasive and tightly coupled to the operating system. To let client containers scale up and down, extra work is needed to make sure these dependencies are installed properly.

  2. Training/Inference Tasks Are Usually Poorly Optimized for Performance

    Since most of these tasks are designed by research teams for publishing papers, the code is often quite messy, with little adaptation for multiprocessing, let alone containerization.

    When it comes to using container technology for scaling, the outlook is not very optimistic: developers often need to adapt the code manually, otherwise it is entirely possible to scale up to 100 cores and only fully utilize 10 of them.

    Training tasks are highly sensitive to interruptions, and even developers themselves cannot accurately predict the required performance on different machines for training, so they generally opt for machines with a significant performance surplus.

    The pressure from dataset I/O is generally not excessive and can be handled with Kubernetes primitives for read-sharing, while writes need extra handling (see the sketch after this list).

  3. GPU-Sharing Platforms Are Starting to Take Shape, and Building One on Container Technology Looks Promising

    After all, there are indeed teams that don’t mind leaving their machines running all night without use. Additionally, the market for GPU rental platforms is quite large.

    If container technology can be used, such a platform could be cheaper than conventional platforms that rent out whole GPUs.

  4. If Plug-and-Play Capability Is Required, the Only Option Is Running Multiple Containers Within a Pod

    However, prediction remains an issue: once an AI application is running, its GPU memory usage fluctuates significantly and unpredictably.

    This could easily lead to sudden performance contention.
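On the dataset read-sharing point in item 2, a hedged sketch: if the dataset lives on a ReadOnlyMany-capable PersistentVolumeClaim (backed by NFS, a cloud file store, or similar, which is an assumption here), every training or inference Pod can mount it read-only, while checkpoints and logs go to separate writable volumes. The claim name is a placeholder.

```go
package datasets

import (
	corev1 "k8s.io/api/core/v1"
)

// DatasetVolume wires an existing ReadOnlyMany-backed PVC (the claim name is
// a placeholder) into a training/inference Pod: every replica mounts the same
// dataset read-only, while writes go to a separate writable volume.
func DatasetVolume(claimName string) (corev1.Volume, corev1.VolumeMount) {
	vol := corev1.Volume{
		Name: "dataset",
		VolumeSource: corev1.VolumeSource{
			PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
				ClaimName: claimName,
				ReadOnly:  true,
			},
		},
	}
	mount := corev1.VolumeMount{
		Name:      "dataset",
		MountPath: "/data", // placeholder mount path
		ReadOnly:  true,
	}
	return vol, mount
}
```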
