ClickHouse Configuration

The following topics are covered.

Background Information

The following ClickHouse background topics are available.

ClickHouse Related Processes

ClickHouse is a distributed database with replication capabilities. FortiSIEM Supervisor and Worker software images include the ClickHouse binaries, so you do not need to install anything else. You can configure a ClickHouse cluster from the FortiSIEM GUI.

There are two main ClickHouse processes:

  • ClickHouseServer process: This is the ClickHouse Database Service.

  • ClickHouseKeeper process: This is the ClickHouse Keeper Service providing Replication Management.

In addition, two more FortiSIEM processes provide ClickHouse related services:

  • phClickHouseMonitor process: This runs on the Supervisor and Worker nodes and provides the following services:

    • On Supervisor/Worker nodes: CMDB Group Query helper, Lookup Table Query helper and DeviceToCMDBAttr Query helper.

    • On Supervisor node only: Provides Online data display and the list of available ClickHouse nodes.

  • phMonitor process: Provides ClickHouse configuration management on Supervisor node.

Supervisor/Worker Nodes Running ClickHouse Functions

A FortiSIEM Supervisor/Worker node can be of 3 types (not mutually exclusive):

  • ClickHouse Keeper Node: This node runs ClickHouse Keeper service providing replication management.

  • ClickHouse Data Node: This node inserts events into ClickHouse database.

  • ClickHouse Query Node: This node provides ClickHouse query services.

A FortiSIEM Supervisor/Worker node can be exactly one of these types, or any mix of the 3 node types. For example:

Small EPS Environments: One Supervisor node that is a Keeper, Data and Query node

Medium EPS Environments:

High EPS Environments (Option 1):

  • Supervisor node does not run any ClickHouse service

  • 3 separate Worker nodes as ClickHouse Keeper Nodes - these form the ClickHouse Keeper cluster (see ClickHouse Keeper Cluster Considerations).

  • N Worker nodes, with each node acting as both ClickHouse Data Node and ClickHouse Query Node - these form the ClickHouse Database cluster.

High EPS environments (Option 2):

  • Supervisor node does not run any ClickHouse service

  • 3 separate Worker nodes as ClickHouse Keeper Nodes – these form the ClickHouse Keeper cluster (see ClickHouse Keeper Cluster Considerations).

  • A ClickHouse Database cluster consisting of

    • N/2 Worker nodes as ClickHouse Data Node only

    • N/2 Worker nodes as ClickHouse Query Node only

In Option 1, events are ingested at all N nodes and queries go to all N nodes. In Option 2, events are ingested on N/2 nodes and queried from the other N/2 nodes. There are other options, for example N/2 Data-only nodes and N Query nodes for better query performance. Option 1 is the most balanced option and has been seen to work well.

Shards and Replicas

A shard is a database partition designed to provide high insertion and query rates. Events are written to and read from multiple shards in parallel. You need to choose the number of shards based on your incoming EPS (see the example below and the latest FortiSIEM Sizing Guide in the Fortinet Document Library).

If you want replication, then you can have replicas within each shard. ClickHouse replicates database writes made on a node within a shard to all other replicas within the same shard. A typical choice is replication size = 2, meaning that you have 2 nodes in each shard. Replicas (a) provide faster queries and (b) prevent data loss in case a node goes down.
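If you want to verify the shard and replica layout that ClickHouse ends up using after deployment, you can query the standard ClickHouse system.clusters table from clickhouse-client on any Data or Query node. This is a minimal, read-only check; the cluster name shown depends on the FortiSIEM-generated configuration.

clickhouse-client -q "SELECT cluster, shard_num, replica_num, host_name, is_local FROM system.clusters"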

It is important to understand how ClickHouse insertion, Replication and Query works in FortiSIEM.

ClickHouse Keeper Cluster Considerations

ClickHouse Keeper provides the coordination system for data replication and distributed DDL query execution. You should use an odd number of nodes in the Keeper Cluster to maintain quorum, although ClickHouse allows an even number of nodes.

  • If you use 3 nodes and lose 1 node, the Cluster keeps running without any intervention.

  • If you use 2 nodes and lose 1 node, then quorum is lost. Follow the steps in Recovering from Losing Quorum to recover quorum.

  • If you use 1 node and lose that node, then the ClickHouse event database becomes read only and insertion stops. Follow the steps in Recovering from Complete Loss of ClickHouse Keeper Cluster to recover the ClickHouse Keeper database.

Note that for high EPS environments, ClickHouse recommends running ClickHouse Keeper and Database services on separate nodes, to avoid disk and CPU contention between Query and Replication Management engines. If you have powerful servers with good CPU, memory and high throughput disks, and EPS is not high, it may be reasonable to co-locate ClickHouse Keeper and Data/Query nodes.
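To check the health of an individual Keeper node, you can use ClickHouse Keeper's standard four-letter commands over the Keeper client port (port 2181 is the port used in the recovery examples later in this document). This is a minimal sketch, run locally on the Keeper node:

echo ruok | nc localhost 2181    # returns imok if the Keeper node is running
echo stat | nc localhost 2181    # shows mode (leader/follower) and connection statistics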

See ClickHouse reference in the References section for related information.

Event Insertion Flow

  1. Collectors send events to the Worker list specified in ADMIN > Settings > System > Event Worker.

  2. The Data Manager process on each Worker node selects a ClickHouse Data Node (local or remote) and inserts the events into that node.
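As a quick sanity check that events are landing in ClickHouse, you can count rows in the events table on a Data node. The fsiem.events_replicated table name is the one used elsewhere in this document; this is a read-only check, not part of the insertion flow itself.

clickhouse-client -q "SELECT count() FROM fsiem.events_replicated"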

Event Replication Flow

  1. After insertion, the ClickHouse Data Node informs a ClickHouse Keeper Node.

  2. The ClickHouse Keeper Node initiates replication to all other nodes in the same shard.
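Replication health can be observed on any Data node through the standard ClickHouse system.replicas table, for example (a read-only check; a non-zero is_readonly or a large queue_size indicates replication problems):

clickhouse-client -q "SELECT database, table, is_readonly, queue_size, absolute_delay FROM system.replicas"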

Query Flow

  1. The GUI sends the request to the App Server, which forwards it to the Query Master on the Supervisor.

  2. Query Master provides Query management. It sends the request to a (randomly chosen) ClickHouse Query node. Each query may go to a different ClickHouse Query node.

  3. The ClickHouse Query node coordinates the query (like an Elasticsearch Coordinating node)

    1. It sends the query to the other ClickHouse Query nodes

    2. It generates the final result by combining partial results obtained from all ClickHouse Query nodes

    3. It sends the result back to Query Master

  4. The Query Master sends the results back to the App Server, which in turn sends them back to the GUI.
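When troubleshooting slow searches, you can see what a ClickHouse Query node is executing at a given moment via the standard system.processes table on that node. This is a read-only check, and the columns shown are standard ClickHouse, not FortiSIEM-specific:

clickhouse-client -q "SELECT query_id, user, elapsed, read_rows, memory_usage FROM system.processes"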

Sizing your ClickHouse Deployment

Before starting to configure Workers, it is important to plan your ClickHouse deployment. The Sizing Guide has details, but a summary follows:

  1. You need to know your event insertion rate (EPS). Sizing for peak values may lead to over-engineering, while sizing only for the average may under-provision during bursts. You may choose an EPS number between these two values.

  2. Determine the number of Worker nodes based on your EPS and the CPU, Memory and Disk throughput of your Workers.

  3. Determine the number of shards and replication

    • How many shards

    • How many Data nodes and Query nodes per shard

  4. Determine the required storage capacity and disk type

    • How many storage tiers

    • Disk type, Throughput and Disk space per tier

Here is an example to illustrate the design. The requirements are as follows.

  1. 100K EPS sustained

  2. 1,000 Bytes/event (raw)

  3. Replication = 2 (meaning 2 copies of each event)

  4. Online storage for 3 months

Assume the following infrastructure.

  1. Workers with 32vCPU and 64GB RAM

  2. SSD Disks with 500MBps read/write throughput

  3. Spinning Disks with 200 MBps read/write throughput

  4. Compression: 4:1

  5. 10 Gbps network between the Workers

Our experiments indicate that each Worker can insert 30K EPS. We need 4 insertion nodes and 3 Keeper nodes, which can be set up as follows:

  1. Shard 1

    • Worker 1 – both ClickHouse Data Node and Query Node

    • Worker 2 – both ClickHouse Data Node and Query Node

  2. Shard 2

    • Worker 3 – both ClickHouse Data Node and Query Node

    • Worker 4 – both ClickHouse Data Node and Query Node

  3. Worker 5, Worker 6 and Worker 7 acting as ClickHouse Keeper

Collectors send events to the 7 Worker nodes, which then write to the 4 Data nodes.

Notes:

  • Worker 5, Worker 6 and Worker 7 run all FortiSIEM Worker processes and the ClickHouseKeeper process, but not the ClickHouseServer process.

  • Worker 1, Worker 2, Worker 3 and Worker 4 run all FortiSIEM Worker processes and the ClickHouseServer process, but not the ClickHouseKeeper process.

For storage, we need about 5 TB/day (after compression, with Replication = 2), distributed among the 4 Data Worker nodes. Each Worker needs the following (see the calculation sketch after the note below).

  • Hot Tier: 14 days in SSD – 18TB total

  • Warm Tier: 76 days in Spinning disks – 95TB total

Note that Writes go to Hot Tier first and then to Warm Tier after the Hot Tier fills up.
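The storage figures above can be reproduced with simple arithmetic (rounded; the exact numbers depend on your actual compression ratio):

Raw ingest: 100,000 EPS x 1,000 Bytes/event = 100 MB/s ≈ 8.6 TB/day
After 4:1 compression: ≈ 2.2 TB/day
With Replication = 2: ≈ 4.3 TB/day, rounded up to ~5 TB/day
Per Data node (4 nodes): ≈ 1.25 TB/day
Hot Tier per node: 14 days x 1.25 TB/day ≈ 18 TB
Warm Tier per node: 76 days x 1.25 TB/day ≈ 95 TB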

Initial Configuration

Step 1 - Create All Worker Nodes One by One

Before proceeding you need to know whether the Worker node will be a ClickHouse Keeper, Data or Query node or a combination.

If the Worker node will be a ClickHouse Keeper node, then add a small disk (200GB) that stores the replication database.

If the Worker node will be a ClickHouse Data or Query node, then add the disks that will store events. You can choose a Hot Tier (e.g. SSD) only, or also add a Warm Tier (e.g. spinning disks), and you can add multiple disks in each tier. If more than 1 tier or more than 1 disk per tier is configured, then ClickHouse writes events in the following order; events are written to a disk only when the earlier disks in the order are full:

  1. Hot Tier – Disk 1

  2. Hot Tier – Disk 2

  3. Hot Tier – Disk N

  4. Warm Tier – Disk 1

  5. Warm Tier – Disk 2

  6. Warm Tier – Disk N

The disk layout for all Data/Query nodes in the same shard should be identical, since incoming events are distributed round robin to all ClickHouse Data nodes and then replicated to ClickHouse Query-only nodes.
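After the disks are configured and deployed, you can confirm what ClickHouse actually sees by querying the standard system.disks and system.storage_policies tables on the node. The disk and policy names are generated by the FortiSIEM configuration, so they will vary; this is a read-only check:

clickhouse-client -q "SELECT name, path, formatReadableSize(free_space) AS free, formatReadableSize(total_space) AS total FROM system.disks"
clickhouse-client -q "SELECT policy_name, volume_name, disks FROM system.storage_policies"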

To add a worker node for use with ClickHouse, take the following steps:

  1. Navigate to ADMIN > License > Nodes.

  2. Add a Worker node by taking the following steps.

    1. Click Add.

    2. In the Type drop-down list, select Worker.

    3. In the IP Address field, enter the IPV4 address of the Worker.

    4. In the Host Name field, enter the host name.

    5. From the Running On drop-down list, select the appropriate device.

      Note: It is important that you select the correct device, as it determines the Disk Path. Additionally, some devices have certain limitations. For example, the 2000G allows two tiers, while the 2000F and 3500G only allow one tier.

    6. From the Storage Tiers drop-down list, select the number of storage tiers.

    7. In Hot Tier, under Disk Path, enter the mount point.

    8. In Warm Tier, if multiple storage tiers are selected, under Disk Path, enter the mount point.

    9. Click + to add another Hot Tier or Warm Tier field.

    10. Click Test to verify the mount point(s).

    11. Click Save.

      Note: If you made Storage changes in the Add/Edit Worker or Storage Setup pages, then after Save succeeds, phClickHouseMonitor and ClickHouseServer processes will restart.

Step 2 - Create ClickHouse Configuration

In this step, you specify which of the FortiSIEM nodes will be ClickHouse Keeper, Data or Query node. In other words, you will configure the following.

  1. ClickHouse Keeper Cluster consisting of ClickHouse Keeper nodes.

  2. ClickHouse Server Cluster consisting of ClickHouse Data and Query nodes.

  3. Deploy the configuration.

Take the following steps.

  1. Navigate to ADMIN > Settings > Database > ClickHouse Config.

  2. Configure ClickHouse Keeper Cluster.
    Note: You need 1-3 nodes in the cluster.

    1. Click on + and add a Worker.

    2. Click on - to remove a Worker.

      Be careful about removing nodes from the Keeper cluster: you may lose quorum, in which case the event database becomes read only and event insertion may stop. See the Notes below.

      Modifying the Keeper Cluster while Maintaining Quorum

      Suppose your Keeper Cluster has 1 Node

      • Adding a single new node is allowed but not recommended; adding 2 nodes to bring the cluster size up to 3 is recommended.

      • Deleting the node is NOT allowed as it will lead to complete loss of Keeper cluster. Recover by following the steps in Recovering from Complete Loss of ClickHouse Keeper Cluster.

      • Replace the node, if needed, by following this sequence.

        1. First, add the new node, then click Test and Save.

        2. Next, delete the old node, then click Test and Save. Quorum will be lost. Log on to the new node and recover by following the steps in Recovering from Losing Quorum.

      Suppose your Keeper Cluster has 2 Nodes (This is not recommended)

      • Adding a new node is allowed and recommended.

      • Deleting a node is permitted, but quorum will be lost. Log on to the remaining node and recover by following the steps in Recovering from Losing Quorum.

      • Replace a node, if needed, by following this sequence.

        1. First, add the new node, then click Test and Save.

        2. Next, delete the old node, then click Test and Save.

      Suppose You are Running ClickHouse Keeper Cluster with 3 Nodes (This is recommended)

      • Adding a new node is permitted.

      • Deleting a node is permitted, but not recommended as a 2-node cluster cannot sustain a node loss without losing quorum.

      • Replacing a node is permitted - follow this sequence as good practice.

        1. First, add the new node, then click Test and Save.

        2. Next, delete the old node, then click Test and Save.

      Suppose You are Running a ClickHouse Keeper Cluster with 4 or more Nodes (This has no problem from a quorum perspective, but may impact performance because of the overhead of maintaining quorum)

      • Adding a new node is permitted.

      • Deleting a node is permitted.

      • Replacing a node is permitted. Follow this sequence as good practice.

        1. First, add the new node, then click Test and Save.

        2. Next, delete the old node, then click Test and Save.

  3. Configure ClickHouse Server Cluster. You need to know the number of shards.

    1. Click on + and add a shard.

    2. Add Workers to the shard.

      1. Check Data if this Worker is a ClickHouse Data Node.

      2. Check Query if this Worker is a ClickHouse Query Node.

      3. Check both Data and Query if this Worker is both a ClickHouse Data and Query Node.

  4. Once the shards have been created and Workers have been added to the shards, click Test.

  5. If Test succeeds, then click Deploy to push the changes to ClickHouse.

    Notes:

    1. An operation on a ClickHouse Keeper Cluster node, such as adding or removing a node, MUST be done individually: after each operation, perform a Test and Deploy action before performing any other Keeper Cluster operation, such as adding another node or removing one. Do NOT perform more than one operation without testing and deploying it, as doing so may cause stability issues.

    2. If you made changes to the ClickHouse Keeper Cluster, then after Deploy succeeds, the phClickHouseMonitor, ClickHouseKeeper and ClickHouseServer processes will restart.

    3. If you made changes to the ClickHouse Cluster, then after Deploy succeeds, the phClickHouseMonitor and ClickHouseServer processes will restart.
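    To confirm the processes came back after a Deploy, you can check the FortiSIEM process table and the ClickHouse services on the affected node. This is a minimal sketch; the Keeper health check only applies on Keeper nodes:

    phstatus                             # FortiSIEM process status, includes phClickHouseMonitor
    systemctl status clickhouse-server   # ClickHouse Database Service (Data/Query nodes)
    echo ruok | nc localhost 2181        # ClickHouse Keeper health check (returns imok)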

Advanced Operations

Adding a Disk or Tier

To add a disk or tier, take the following steps.

  1. Navigate to ADMIN > License > Nodes.

  2. Select the node to which you wish to add a disk or tier.

  3. Click Edit.

  4. Update the Storage Tiers drop-down number to change the number of tiers.

  5. Click + in the Hot Tier or Warm Tier Row column, respectively, to add another Hot Tier or Warm Tier field for the new disk.

  6. Click Test.

  7. Click Save.

  8. Navigate to ADMIN > Settings > Database > ClickHouse Config.

  9. Click Test.

  10. Click Deploy.

Notes:

  • After Deploy succeeds, phClickHouseMonitor and ClickHouseServer processes will restart.

  • When additional disks are added, the data is written across all available disks. If a disk becomes full, then data will be written to the disks with free space until all disks are full, at which point data will be moved to other ClickHouse storage tiers, archived, or purged depending on your FortiSIEM configuration.

Deleting a Disk

To delete a disk, take the following steps.

  1. Unmount the disk.

  2. Format the disk if you wish to remove its data.

  3. Navigate to ADMIN > License > Nodes.

  4. Select the node with the disk you wish to delete.

  5. Click Edit.

  6. Click - in the Hot Tier or Warm Tier Row column, respectively, to remove the disk.

    Note: You may see some error logs getting generated in /opt/clickhouse/log/clickhouse-server.err.log.

    2022.06.20 15:55:31.761484 [ 98091 ] {} <Warning> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): Found parts with the same min block and with the same max block as the missing part 18250-20220620_371_375_1 on replica 1. Hoping that it will eventually appear as a result of a merge.
    
    2022.06.20 15:55:31.764560 [ 98141 ] {} <Warning> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): Checking part 18250-20220620_353_378_2
    
    2022.06.20 15:55:31.764841 [ 98141 ] {} <Warning> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): Checking if anyone has a part 18250-20220620_353_378_2 or covering part.
    
    2022.06.20 15:55:31.765138 [ 98141 ] {} <Error> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): No replica has part covering 18250-20220620_353_378_2 and a merge is impossible: we didn't find a smaller part with the same max block.
    
    2022.06.20 15:55:31.766222 [ 98141 ] {} <Warning> fsiem.events_replicated (a5a85f1a-6ebf-4cf1-b82b-686f928798cc): Cannot commit empty part 18250-20220620_353_378_2 with error DB::Exception: Part 18250-20220620_353_378_2 (state Outdated) already exists, but it will be deleted soon
    
    2022.06.20 15:55:31.766574 [ 98141 ] {} <Warning> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): Cannot create empty part 18250-20220620_353_378_2 instead of lost. Will retry later
    

    These errors indicate that ClickHouse detected some missing data by comparing the local parts and the part names stored in clickhouse-keeper. This is just a warning and does not affect operation. If you find it annoying, delete the entries in clickhouse-keeper by running the following commands on the Worker where the disk was deleted.

    clickhouse-client --query "SELECT replica_path || '/queue/' || node_name FROM system.replication_queue JOIN system.replicas USING (database, table) WHERE last_exception LIKE '%No active replica has part%'" | while read i; do /opt/zookeeper/bin/zkCli.sh deleteall $i; done
    clickhouse-client --query "SYSTEM RESTART REPLICAS"

    Reference: https://github.com/ClickHouse/ClickHouse/issues/10368

  7. Click Test.

  8. Click Save.

  9. Navigate to ADMIN > Settings > Database > ClickHouse Config.

  10. Click Test.

  11. Click Deploy.

Note: After Deploy succeeds, phClickHouseMonitor and ClickHouseServer processes will restart.

Deleting a Storage Tier

To delete a storage tier, take the following steps.

  1. Unmount the disk.

  2. Format the disk if you wish to remove its data.

  3. Navigate to ADMIN > License > Nodes.

  4. Select the node with the disk you wish to delete.

  5. Click Edit.

  6. Change Storage Tiers from "2" to "1".

  7. Click Test.

  8. Click Save.

  9. Navigate to ADMIN > Settings > Database > ClickHouse Config.

  10. Click Test.

  11. Click Deploy.

Note: After Deploy succeeds, phClickHouseMonitor and ClickHouseServer processes will restart.

Deleting a Node

Deleting from ClickHouse Keeper Cluster

A node cannot be deleted from the ClickHouse Keeper Cluster if the cluster has exactly 2 nodes, as the cluster will lose quorum. If this constraint does not apply, follow these steps.

  1. Navigate to ADMIN > Settings > Database > ClickHouse Config.

  2. Delete the node from the ClickHouse Keeper Cluster configuration by clicking - in the Row column.

  3. Click Test.

  4. Click Deploy.

Deleting from ClickHouse Cluster

A node should not be deleted if it is the only Data node in the Shard. Deleting a node in this situation will cause you to lose data. If you still want to proceed, follow these steps.

  1. Unmount the disk where events are stored.

  2. Navigate to ADMIN > Settings > Database > ClickHouse Config.

  3. Note the ShardID and ReplicaID of the node to be deleted. If the GUI shows Shard 3 and Replica 2, the ShardID is 3 and ReplicaID is 2. This information is needed later in Step 8.

  4. Delete the node from the ClickHouse Cluster table by clicking - in the Row column.

  5. Click Test.

  6. Click Deploy.

  7. Login to the deleted Worker and run the following SQL command in clickhouse-client shell to drop the events table.

    DROP TABLE fsiem.events_replicated

  8. Login to any ClickHouse Keeper node and run the following commands to delete the registry entry from the ClickHouse Keeper Cluster.

    /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.events/replicas/<ReplicaID>

    /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.summary/replicas/<ReplicaID>

    Note: Replace <ShardID> and <ReplicaID> with the information from step 3.
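    For example, with the ShardID 3 and ReplicaID 2 noted in step 3, the commands would be:

    /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/3/fsiem.events/replicas/2

    /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/3/fsiem.summary/replicas/2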

Moving a Worker from One Shard to Another Shard

To move a Worker from one shard to another shard, take the following steps.

  1. Remove the Worker from the ClickHouse Keeper and ClickHouse Cluster.

  2. Login to the Worker and run the following commands.

    clickhouse-client -q "DROP TABLE fsiem.events_replicated"

    clickhouse-client -q "DROP TABLE fsiem.summary"

    systemctl stop clickhouse-server

  3. Login to any ClickHouse Keeper node and run the following commands to delete the registry entry from the ClickHouse Keeper cluster.

    /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.events/replicas/<ReplicaID>

    /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.summary/replicas/<ReplicaID>

  4. Login to the Worker and navigate to /etc/clickhouse-server/config.d.

  5. Remove all config xml files under /etc/clickhouse-server/config.d except for logger.xml, max_partition_size_to_drop.xml and max_suspicious_broken_parts.xml. (A command sketch covering steps 5 through 8 is shown after this procedure.)

  6. Remove all config xml under /etc/clickhouse-server/users.d/.

  7. Unmount all disks that were used by ClickHouse.

  8. Run wipefs on all disk devices that were assigned to ClickHouse.

  9. Login to the Supervisor GUI, and navigate to ADMIN > Settings > Database > ClickHouse Config.

  10. Select the target Worker and delete it from the existing shard by clicking - in the Row column.

  11. Click Test.

  12. Click Deploy.

  13. Navigate to ADMIN > License > Nodes.

  14. Select the target Worker.

  15. Click Delete to remove the target Worker.

  16. Wait for the Supervisor phMonitor process to come up.

  17. Re-add the target Worker into the License Node with the desired disk configuration, following the instructions for adding a Worker.

  18. Wait for the Supervisor phMonitor process to come up again.

  19. Navigate to ADMIN > Settings > Database > ClickHouse Config.

  20. Add the target Worker to the destination shard.

  21. Click Test.

  22. Click Deploy.
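As referenced in step 5, here is a minimal command sketch for the cleanup in steps 5 through 8. The mount point and disk device below are examples only; substitute the paths and devices actually assigned to ClickHouse on your Worker.

    cd /etc/clickhouse-server/config.d
    ls | grep -v -E '^(logger|max_partition_size_to_drop|max_suspicious_broken_parts)\.xml$' | xargs -r rm -f
    rm -f /etc/clickhouse-server/users.d/*.xml
    umount /data-clickhouse-hot-1    # repeat for every ClickHouse mount point (example path)
    wipefs -a /dev/sdb               # repeat for every disk device assigned to ClickHouse (example device)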

Replacing a Worker with another Worker (within the same Shard)

Currently, the GUI allows you to replace one Worker (W1) with another Worker (W2) in the ClickHouse Configuration. However, clicking Test will fail, since the shard and replica IDs are still in use by the previous Worker (W1).

Follow these steps to replace W1 with W2.

  1. Navigate to ADMIN > Settings > Database > ClickHouse Config.

  2. Note the ShardID and ReplicaID of W1. If the GUI shows Shard 3 and Replica 2, the ShardID is 3 and ReplicaID is 2. This will be needed later in Step 7.

  3. Delete W1 from the ClickHouse Cluster table by clicking - in the Row column.

  4. Click Test.

  5. Click Deploy.

  6. Login to W1 and run the following SQL command in clickhouse-client shell to drop the events table.

    DROP TABLE fsiem.events_replicated

  7. Login to any ClickHouse Keeper node and run the following commands to delete the registry entry from the ClickHouse Keeper cluster.

    /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.events/replicas/<ReplicaID>

    /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.summary/replicas/<ReplicaID>

  8. Add W2 to the ClickHouse Cluster table in the same place where W1 was. It can use the same ShardID and ReplicaID.

  9. Click Test.

  10. Click Deploy.

Restarting a Node

Node belongs to ClickHouse Cluster

Normal restart procedures should be fine. If you see insertion failure (Table is read only) errors in phoenix.log, then take the following steps.

  1. Login to the node.

  2. Run the following command.

    sudo -u clickhouse touch /data-clickhouse-hot-1/clickhouse/flags/force_restore_data

Node belongs to ClickHouse Keeper Cluster

Normal restart procedures should be fine.

Recovering from Complete Loss of ClickHouse Keeper Cluster

Normal ClickHouse Keeper:

A healthy ClickHouse Keeper node responds to the stat command like this.

[root@FSM-660-CH-58-246 ~]# echo stat | nc <IP> 2181
ClickHouse Keeper version: v22.6.1.1985-testing-7000c4e0033bb9e69050ab8ef73e8e7465f78059
Clients:
[::ffff:172.30.58.246]:36518(recved=0,sent=0)
 
Latency min/avg/max: 0/0/0
Received: 0
Sent: 0
Connections: 0
Outstanding: 0
Zxid: 145730
Mode: follower
Node count: 305

To check whether the ClickHouse Keeper cluster needs to be restored:

You will see logs indicating that the ClickHouse event table is read only.

grep PH_DATAMANAGER_HTTP_UPLOAD_ERROR /opt/phoenix/log/phoenix.log| grep TABLE_IS_READ_ONLY

2022-07-22T13:00:10.945816-07:00 FSM-Host phDataManager[9617]: [PH_DATAMANAGER_HTTP_UPLOAD_ERROR]:[eventSeverity]=PHL_ERROR,[procName]=phDataManager,[fileName]=ClickHouseWriterService.cpp,[lineNumber]=459,[errReason]=Uploading events to ClickHouse failed. respCode:500 resp:Code: 242. DB::Exception: Table is in readonly mode (replica path: /clickhouse/tables/1/fsiem.events/replicas/1). (TABLE_IS_READ_ONLY) (version 22.6.1.1985 (official build))

Recovery Steps

  1. Login to Supervisor's redis:

    redis-cli -p 6666 -a $(grep pass /opt/phoenix/redis/conf/6666.conf | awk '{print $2}')
    127.0.0.1:6666> del cache:ClickHouse:clickhouseKeeperNodes
    (integer) 1
    
  2. Navigate to ADMIN > Settings > Database > ClickHouse Config, and replace the dead Worker with a new Worker in the ClickHouse Keeper cluster.

  3. Click Test.

  4. Click Deploy.

  5. Log in to clickhouse-client on each ClickHouse node and execute the following commands.

    SYSTEM RESTART REPLICA fsiem.events_replicated
    SYSTEM RESTORE REPLICA fsiem.events_replicated
    

Recovering from Losing Quorum

To identify if a ClickHouse Keeper needs recovery, you can check via log or command line.

From Log

/data-clickhouse-hot-1/clickhouse-keeper/app_logs/clickhouse-keeper.err.log
 
2022.07.22 12:27:10.415055 [ 52865 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2022.07.22 12:27:10.415169 [ 52865 ] {} <Warning> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 0, live 0, dead 1), at least 2 nodes should respond. failure count 163

From Command Line

[root@FSM-660-CH-58-246 ~]# echo stat | nc 172.30.58.216 2181
This instance is not currently serving requests

Recovery Steps

If the ClickHouse Keeper cluster has lost quorum, take the following step.

Login to the ClickHouse Keeper node that needs recovery and run the following command.

echo rcvr | nc localhost 2181

Migrating ClickHouse Events from FortiSIEM 6.5.x to 6.6.0 or Later

FortiSIEM 6.5.x ran ClickHouse on a single node and used the Merge Tree engine. FortiSIEM 6.6.0 onwards runs ClickHouse as a cluster using the Replicated Merge Tree engine. You need to follow these special steps to move the old events previously stored in Merge Tree to Replicated Merge Tree.

Caution: From FortiSIEM 6.5.x, you MUST first upgrade to FortiSIEM 6.6.x PRIOR to upgrading to FortiSIEM 7.x or later. If you upgrade directly from 6.5.x to 7.0.0 or later, the upgrade will fail.

To upgrade your FortiSIEM 6.5.x to 6.6.x, take the following steps.

  1. Navigate to ADMIN > Settings > Database > ClickHouse Config.

  2. Click Test, then click Deploy to enable the ClickHouse Keeper service which is new in 6.6.x.

  3. Migrate the event data in 6.5.x to 6.6.x by running the script /opt/phoenix/phscripts/clickhouse/clickhouse-migrate-650.sh.

  4. Verify that all events have been moved to the new table through GUI Search. (An optional count comparison is shown after these steps.)

  5. When the data migration is deemed successful, run the following command on every node where the clickhouse-migrate-650.sh script was run successfully to drop the old event table.

    clickhouse-client -q "DROP TABLE fsiem.events_non_replicated"

Now you can upgrade to FortiSIEM 7.x or later, if needed.

References

The following ClickHouse References are available.

General Introduction

https://clickhouse.com/docs/en/intro

Concepts and Architecture

https://clickhouse.com/docs/en/development/architecture

ClickHouse Keeper

https://clickhouse.com/docs/en/operations/clickhouse-keeper

Performance Tuning

https://clickhouse.com/docs/en/operations/tips

Troubleshooting

https://kb.altinity.com/altinity-kb-schema-design/how-much-is-too-much/

https://clickhouse.com/docs/en/operations/system-tables/

https://clickhouse.com/docs/en/sql-reference/statements/system/#query_language-system-restore-replica