ClickHouse Operational Overview

The following ClickHouse background topics are available.

ClickHouse Related Processes

ClickHouse is a distributed database with replication capabilities. FortiSIEM Supervisor and Worker software images include ClickHouse binaries. The user does not need to install anything else. You can configure a ClickHouse cluster from the FortiSIEM GUI.

There are two main ClickHouse processes:

  • ClickHouseServer process: This is the ClickHouse Database Service.

  • ClickHouseKeeper process: This is the ClickHouse Keeper Service providing Replication Management.

In addition, two more FortiSIEM processes provide ClickHouse related services:

  • phClickHouseMonitor process: This runs on the Supervisor and Worker nodes and provides the following services:

    • On Supervisor/Worker nodes: CMDB Group Query helper, Lookup Table Query helper and DeviceToCMDBAttr Query helper.

    • On Supervisor node only: Provides Online data display and the list of available ClickHouse nodes.

  • phMonitor process: Provides ClickHouse configuration management on the Supervisor node.

Supervisor/Worker Nodes Running ClickHouse Functions

A FortiSIEM Supervisor/Worker node can be of 3 types (not mutually exclusive):

  • ClickHouse Keeper Node: This node runs ClickHouse Keeper service providing replication management.

  • ClickHouse Data Node: This node inserts events into ClickHouse database.

  • ClickHouse Query Node: This node provides ClickHouse query services.

A FortiSIEM Supervisor/Worker node can serve a single role or any combination of the 3 node types. For example:

Small EPS Environments: One Supervisor node that is a Keeper, Data and Query node

Medium EPS Environments:

High EPS Environments (Option 1):

  • Supervisor node does not run any ClickHouse service

  • 3 separate Worker nodes as ClickHouse Keeper Nodes - these form the ClickHouse Keeper cluster (see ClickHouse Keeper Cluster Considerations).

  • N Worker nodes, with each node acting as both ClickHouse Data Node and ClickHouse Query Node - these form the ClickHouse Database cluster.

High EPS environments (Option 2):

  • Supervisor node does not run any ClickHouse service

  • 3 separate Worker nodes as ClickHouse Keeper Nodes – these form the ClickHouse Keeper cluster (see ClickHouse Keeper Cluster Considerations).

  • A ClickHouse Database cluster consisting of

    • N/2 Worker nodes as ClickHouse Data Node only

    • N/2 Worker nodes as ClickHouse Query Node only

In Option 1, events are ingested at all N nodes and queries go to all N nodes. In Option 2, events are ingested on N/2 nodes and queried from the other N/2 nodes. Other combinations are possible, such as N/2 Data-only nodes and N Query nodes for better query performance. Option 1 is the most balanced option and has been seen to work well.
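
To make these layouts concrete, here is a minimal Python sketch that models Option 1 and Option 2 as role assignments and counts how many nodes ingest events versus serve queries. The Node class and node names are assumptions for illustration only; FortiSIEM assigns these roles from the GUI, not through code.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    keeper: bool = False
    data: bool = False
    query: bool = False

def option1(n):
    """Option 1: every database Worker is both a Data Node and a Query Node."""
    keepers = [Node(f"keeper-{i}", keeper=True) for i in range(1, 4)]
    workers = [Node(f"worker-{i}", data=True, query=True) for i in range(1, n + 1)]
    return keepers + workers

def option2(n):
    """Option 2: half of the Workers ingest only, the other half serve queries only."""
    keepers = [Node(f"keeper-{i}", keeper=True) for i in range(1, 4)]
    data_only = [Node(f"data-{i}", data=True) for i in range(1, n // 2 + 1)]
    query_only = [Node(f"query-{i}", query=True) for i in range(1, n // 2 + 1)]
    return keepers + data_only + query_only

for label, cluster in (("Option 1", option1(4)), ("Option 2", option2(4))):
    ingest = sum(node.data for node in cluster)
    serve = sum(node.query for node in cluster)
    print(f"{label}: {ingest} node(s) ingest events, {serve} node(s) serve queries")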

Shards and Replicas

A shard is a database partition designed to provide high insertion and query rates. Events are written to and read from multiple shards in parallel. Choose the number of shards based on your incoming EPS (see the example below and the latest ClickHouse Sizing Guide in the Fortinet Documents Library).

If you want replication, you can have replicas within each shard. ClickHouse replicates writes made to one node in a shard to all other replicas in the same shard. A typical replication factor is 2, meaning each shard has 2 nodes. Replicas provide (a) faster queries and (b) protection against data loss if a node goes down.
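
As a rough illustration of how shard and replica counts follow from EPS, here is a minimal Python sketch. The 30K EPS per-node figure comes from the worked example later in this section; treat both that figure and the helper function as assumptions to validate against the ClickHouse Sizing Guide for your environment.

import math

def plan_cluster(incoming_eps, per_node_eps=30_000, replicas=2):
    """Return (data_nodes, shards) for the given sustained insert rate."""
    data_nodes = math.ceil(incoming_eps / per_node_eps)
    # Nodes are grouped into shards of `replicas` nodes each; round up so every
    # shard is fully populated.
    shards = math.ceil(data_nodes / replicas)
    return shards * replicas, shards

nodes, shards = plan_cluster(100_000)
print(f"{nodes} Data nodes arranged as {shards} shards x 2 replicas")
# -> 4 Data nodes arranged as 2 shards x 2 replicas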

It is important to understand how ClickHouse insertion, replication, and query work in FortiSIEM.

ClickHouse Keeper Cluster Considerations

ClickHouse Keeper provides the coordination system for data replication and distributed DDL query execution. Use an odd number of nodes in the Keeper cluster to maintain quorum, although ClickHouse allows an even number of nodes. The quorum arithmetic is sketched after the list below.

  • If you use 3 nodes and lose 1 node, the Cluster keeps running without any intervention.

  • If you use 2 nodes and lose 1 node, then quorum is lost. Follow the steps in Recovering from Losing Quorum to recover quorum.

  • If you use 1 node and lose that node, then the ClickHouse event database becomes read-only and insertion stops. Follow the steps in Recovering from Complete Loss of ClickHouse Keeper Cluster to recover the ClickHouse Keeper database.
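
The sketch below shows the quorum arithmetic behind these bullets. ClickHouse Keeper, like other Raft/ZooKeeper-style coordination services, stays writable only while a strict majority of Keeper nodes is up.

def quorum(keeper_nodes):
    """Smallest number of live nodes that still forms a strict majority."""
    return keeper_nodes // 2 + 1

def tolerated_failures(keeper_nodes):
    """How many Keeper nodes can fail before quorum is lost."""
    return keeper_nodes - quorum(keeper_nodes)

for n in (1, 2, 3, 5):
    print(f"{n} Keeper node(s): quorum = {quorum(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
# 3 nodes tolerate 1 failure, while 2 nodes tolerate none, which is why an even
# node count adds hardware without adding fault tolerance.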

Note that for high EPS environments, ClickHouse recommends running the ClickHouse Keeper and Database services on separate nodes to avoid disk and CPU contention between the query and replication management engines. If you have powerful servers with ample CPU, memory, and high-throughput disks, and EPS is not high, it may be reasonable to co-locate ClickHouse Keeper and Data/Query nodes.

See ClickHouse Reference in the Appendix for related information.

Event Insertion Flow

  1. Collectors send events to the Worker list specified in ADMIN > Settings > System > Event Worker.

  2. The Data Manager process on each Worker node first selects a ClickHouse Data Node and inserts the events into that node. The selected node may be local or remote; an illustrative selection sketch follows this list.
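
The document does not specify how the Data Manager chooses among the available Data Nodes, so the sketch below uses a simple round-robin policy purely for illustration; the node names and the chooser class are assumptions, not FortiSIEM internals.

from itertools import cycle

class DataNodeChooser:
    """Illustrative only: spread insert batches across ClickHouse Data Nodes."""

    def __init__(self, data_nodes):
        self._nodes = cycle(data_nodes)

    def next_node(self):
        # The chosen node may be the local Worker or a remote one.
        return next(self._nodes)

chooser = DataNodeChooser(["worker-1", "worker-2", "worker-3", "worker-4"])
for batch_id in range(6):
    print(f"event batch {batch_id} -> insert into {chooser.next_node()}")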

Event Replication Flow

  1. After insertion, the ClickHouse Data Node informs a ClickHouse Keeper Node.

  2. The ClickHouse Keeper Node initiates replication to all other nodes in the same shard.

Query Flow

  1. The GUI sends the request to the App Server, which forwards it to the Query Master on the Supervisor.

  2. The Query Master provides query management. It sends the request to a (randomly chosen) ClickHouse Query node; each query may go to a different ClickHouse Query node.

  3. The ClickHouse Query node coordinates the query, similar to an Elasticsearch coordinating node (a scatter-gather sketch follows this list):

    1. It sends the query to the other ClickHouse Query nodes

    2. It generates the final result by combining partial results obtained from all ClickHouse Query nodes

    3. It sends the result back to Query Master

  4. The Query Master sends the results back to the App Server, which in turn sends them back to the GUI.
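
Step 3 above is a scatter-gather pattern: the coordinating Query node fans the query out to every Query node and merges the partial results before replying to the Query Master. The Python sketch below illustrates the pattern only; the node list, the fake per-node counts, and the sum() merge are assumptions, not the actual ClickHouse execution path.

from concurrent.futures import ThreadPoolExecutor

QUERY_NODES = ["worker-1", "worker-2", "worker-3", "worker-4"]

def run_on_node(node, query):
    """Stand-in for running the query against one node's local shard data."""
    return hash((node, query)) % 1000  # fake partial event count

def coordinate(query):
    """Fan the query out to all Query nodes, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(QUERY_NODES)) as pool:
        partials = pool.map(lambda node: run_on_node(node, query), QUERY_NODES)
    return sum(partials)  # final result handed back to the Query Master

print(coordinate("SELECT count() FROM events WHERE severity >= 9"))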

Sizing your ClickHouse Deployment

Before starting to configure Workers, it is important to plan your ClickHouse deployment. The ClickHouse Sizing Guide has details, but a summary follows:

  1. You need to know your event insertion rate (EPS). Sizing for the peak may lead to over-engineering, while sizing only for the average may leave you under-provisioned during bursts. Choose an EPS number between these two values; one way to blend them is sketched after this list.

  2. Determine the number of Worker nodes based on your EPS and the CPU, Memory and Disk throughput of your Workers.

  3. Determine the number of shards and replication

    • How many shards

    • How many Data nodes and Query nodes per shard

  4. Determine the required storage capacity and disk type

    • How many storage tiers

    • Disk type, Throughput and Disk space per tier
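
For step 1, here is one way to pick a planning EPS that sits between the observed average and peak. The 70/30 weighting is an arbitrary illustration, not a FortiSIEM recommendation; pick a weight that reflects how bursty your traffic is.

def planning_eps(average_eps, peak_eps, peak_weight=0.3):
    """Blend average and peak EPS so bursts are covered without sizing for the worst case."""
    return average_eps * (1 - peak_weight) + peak_eps * peak_weight

print(planning_eps(average_eps=60_000, peak_eps=150_000))  # 87000.0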

Here is an example to illustrate the design. The requirements are as follows.

  1. 100K EPS sustained

  2. 1,000 Bytes/event (raw)

  3. Replication = 2 (meaning 2 copies of each event)

  4. Online storage for 3 months

Assume the following infrastructure.

  1. Workers with 32vCPU and 64GB RAM

  2. SSD Disks with 500MBps read/write throughput

  3. Spinning Disks with 200 MBps read/write throughput

  4. Compression: 4:1

  5. 10 Gbps network between the Workers

Our experiments indicate that each Worker can insert 30K EPS. We need 4 insertion nodes and 3 Keeper nodes, which can be set up as follows:

  1. Shard 1

    • Worker 1 – both ClickHouse Data Node and Query Node

    • Worker 2 – both ClickHouse Data Node and Query Node

  2. Shard 2

    • Worker 3 – both ClickHouse Data Node and Query Node

    • Worker 4 – both ClickHouse Data Node and Query Node

  3. Worker 5, Worker 6, and Worker 7 acting as ClickHouse Keeper Nodes

Collectors send events to the 7 Worker nodes, which then write to the 4 Data nodes.

Notes:

  • Worker 5, Worker 6, and Worker 7 run all FortiSIEM Worker processes and the ClickHouseKeeper process, but not the ClickHouseServer process.

  • Worker 1, Worker 2, Worker 3, Worker 4 run all FortiSIEM Worker processes and ClickHouseServer process, but not ClickHouseKeeper process.

For storage, we need about 5 TB/day with replication, distributed among the 4 Data Worker nodes. Each Worker needs the following.

  • Hot Tier: 14 days in SSD – 18TB total

  • Warm Tier: 76 days in Spinning disks – 95TB total

Note that writes go to the Hot Tier first; data moves to the Warm Tier after the Hot Tier fills up.
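
The figures above can be checked with a few lines of Python. The calculation below follows the stated assumptions (100K EPS, 1,000 bytes/event raw, 4:1 compression, replication factor 2, 4 Data Workers); the rounded 5 TB/day budget matches the text and leaves a little headroom over the raw arithmetic.

EPS = 100_000
BYTES_PER_EVENT = 1_000      # raw, before compression
COMPRESSION = 4              # 4:1
REPLICATION = 2              # 2 copies of each event
DATA_WORKERS = 4

raw_tb_per_day = EPS * BYTES_PER_EVENT * 86_400 / 1e12            # ~8.6 TB/day raw
stored_tb_per_day = raw_tb_per_day / COMPRESSION * REPLICATION    # ~4.3 TB/day on disk
budget_tb_per_day = 5        # rounded-up daily budget used in the text above

for tier, days in (("Hot Tier (SSD)", 14), ("Warm Tier (HDD)", 76)):
    per_worker_tb = budget_tb_per_day * days / DATA_WORKERS
    print(f"{tier}: {days} days -> ~{per_worker_tb:.0f} TB per Worker")
# Hot Tier (SSD): 14 days -> ~18 TB per Worker
# Warm Tier (HDD): 76 days -> ~95 TB per Worker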