ClickHouse Configuration

This section covers how to configure a ClickHouse topology of Keeper and Data/Query nodes.

Initial Configuration

Follow these steps when you are configuring ClickHouse for the first time.

Before beginning, make sure that:

  1. You have gone through the ClickHouse Operation Overview and the ClickHouse Sizing Guide, located in the FortiSIEM Documentation Library.

  2. You have identified the FortiSIEM nodes that are going to be ClickHouse Keeper nodes and ClickHouse Data nodes.

  3. You have configured disks on the FortiSIEM nodes that are appropriate for their role.

Take the following steps.

  1. Navigate to ADMIN > Settings > Database > ClickHouse Config.

  2. Configure ClickHouse Keeper Cluster.

    1. Click on + and add a Worker.

    2. Click on - to remove a Worker. Be careful when removing nodes from the Keeper cluster: if quorum is lost, the event database becomes read only and event insertion may stop. Read the notes in Modifying Keeper Cluster while Maintaining Quorum before performing a remove operation.

  3. Configure ClickHouse Server Cluster. You need to know the number of shards.

    1. Click on + and add a shard.

    2. Add Workers to the shard.

      1. Check Data if this Worker is a ClickHouse Data Node. A Data node receives events, processes them, and writes them to the ClickHouse database.

      2. Check Query if this Worker is a ClickHouse Query Node. A Query node stores events replicated from Data nodes and participates in queries. However, it does not process events or trigger incidents.

      3. Check both Data and Query if this Worker is both a ClickHouse Data and Query Node. This is the most common setup.

  4. Once the shards have been created and Workers have been added to them, click Test.

  5. If Test succeeds, then click Deploy to push the changes to ClickHouse.

Notes:

  1. Operations on ClickHouse Keeper Cluster nodes, such as adding or removing a node, MUST be performed one at a time. After each operation, click Test and then Deploy before performing any other Keeper Cluster operation, such as adding or removing another node. Do NOT perform more than one operation without testing and deploying in between, as doing so may cause stability issues.

  2. If you made changes to the ClickHouse Keeper Cluster, then after Deploy succeeds, the phClickHouseMonitor, ClickHouseKeeper, and ClickHouseServer processes will restart.

  3. If you made changes to the ClickHouse Cluster, then after Deploy succeeds, the phClickHouseMonitor and ClickHouseServer processes will restart. A quick way to confirm the processes came back up is sketched below.
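
To confirm that these processes came back up after a Deploy, you can check process status from the command line on the affected node. This is a minimal sketch assuming the standard FortiSIEM phstatus utility; exact process names may vary by release.

phstatus | grep -Ei 'clickhouse'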

Modifying Keeper Cluster while Maintaining Quorum

Note: You need 1 to 3 nodes in the cluster. Quorum is lost if more than half of the nodes go down; for example, a 3-node cluster keeps quorum with 2 live nodes and can therefore sustain the loss of 1 node. Excessive nodes (e.g., more than 5) can cause performance degradation.

Case 1: Keeper Cluster has 1 Node

Adding a single new node is allowed but not recommended; adding 2 nodes to bring the cluster size up to 3 is recommended. Deleting the node is NOT allowed, as it will lead to complete loss of the Keeper cluster. Recover by following the steps in Recovering from Complete Loss of ClickHouse Keeper Cluster.

Replace the node, if needed, by following this sequence.

  1. First, add the new node, then click Test, and then Save.

  2. Next, delete the old node, then click Test, and then Save. Quorum will be lost. Log on to the new node and recover by following the steps in Recovering from Losing Quorum in ClickHouse Keeper Cluster.

Case 2: Keeper Cluster has 2 Nodes (This is not recommended)

Adding a new node is allowed and recommended to bring the cluster size up to 3. Deleting a node is permitted, but quorum will be lost. Log on to the remaining node and recover by following the steps in Recovering from Losing Quorum in ClickHouse Keeper Cluster.

Replace a node, if needed, by following this sequence.

  1. First, add the new node, then click Test and Save.

  2. Next, delete the old node, then click Test and Save.

Case 3: Keeper Cluster with 3 Nodes (This is recommended)

Adding a new node is permitted. Deleting a node is permitted, but not recommended as a 2-node cluster cannot sustain a node loss without losing quorum.

Follow this sequence to replace a node (as a matter of good practice).

  1. First, add the new node, then click Test, and then Save.

  2. Next, delete the old node, then click Test, and then Save.

Case 4: Keeper Cluster with 4 or more Nodes

This has no problem from a quorum perspective, but it may impact performance because of the overhead of maintaining quorum. Adding a new node is permitted. Deleting a node is permitted.

Replacing a node is permitted. Follow this sequence as good practice.

  1. First, add the new node, then click Test, and then Save.

  2. Next, delete the old node, then click Test, and then Save.

Advanced Operations

Adding a Disk or Tier

To add a disk or tier, take the following steps.

  1. Navigate to ADMIN > License > Nodes.

  2. Select the node to which you wish to add a disk or tier.

  3. Click Edit.

  4. Update the Storage Tiers drop-down to change the number of tiers.

  5. Click + from the Hot Tier or Warm Tier Row column, respectively, to add another Hot Tier or Warm Tier field for the new disk.

  6. Click Test.

  7. Click Save.

  8. Navigate to ADMIN > Settings > Database > ClickHouse Config.

  9. Click Test.

  10. Click Deploy.

Notes:

  • After Deploy succeeds, phClickHouseMonitor and ClickHouseServer processes will restart.

  • When additional disks are added, data is written across all available disks. If a disk becomes full, data is written to the disks with free space until all disks are full, at which point data is moved to other ClickHouse storage tiers, archived, or purged, depending on your FortiSIEM configuration. A query for checking per-disk usage is sketched below.
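
To see how ClickHouse views the configured disks and their remaining space, you can query the standard system.disks table from clickhouse-client on a Data node. This is a sketch; adjust the columns as needed.

clickhouse-client --query "SELECT name, path, formatReadableSize(free_space) AS free, formatReadableSize(total_space) AS total FROM system.disks"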

Deleting a Disk

To delete a disk, take the following steps.

  1. Unmount the disk.

  2. Format the disk if you wish to remove its data.

  3. Navigate to ADMIN > License > Nodes.

  4. Select the node with the disk you wish to delete.

  5. Click Edit.

  6. Click - from the Hot Tier or Warm Tier Row column, respectively, to remove the disk.

    Note: You may see some error logs generated in /opt/clickhouse/log/clickhouse-server.err.log.

    2022.06.20 15:55:31.761484 [ 98091 ] {} <Warning> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): Found parts with the same min block and with the same max block as the missing part 18250-20220620_371_375_1 on replica 1. Hoping that it will eventually appear as a result of a merge.
    
    2022.06.20 15:55:31.764560 [ 98141 ] {} <Warning> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): Checking part 18250-20220620_353_378_2
    
    2022.06.20 15:55:31.764841 [ 98141 ] {} <Warning> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): Checking if anyone has a part 18250-20220620_353_378_2 or covering part.
    
    2022.06.20 15:55:31.765138 [ 98141 ] {} <Error> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): No replica has part covering 18250-20220620_353_378_2 and a merge is impossible: we didn't find a smaller part with the same max block.
    
    2022.06.20 15:55:31.766222 [ 98141 ] {} <Warning> fsiem.events_replicated (a5a85f1a-6ebf-4cf1-b82b-686f928798cc): Cannot commit empty part 18250-20220620_353_378_2 with error DB::Exception: Part 18250-20220620_353_378_2 (state Outdated) already exists, but it will be deleted soon
    
    2022.06.20 15:55:31.766574 [ 98141 ] {} <Warning> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): Cannot create empty part 18250-20220620_353_378_2 instead of lost. Will retry later
    

    These errors indicate that ClickHouse detected some missing data by comparing the local parts against the part names stored in clickhouse-keeper. This is just a warning and does not affect operation. If you find it annoying, delete the entries in clickhouse-keeper by running the following commands on the Worker where the disk was deleted. A verification sketch appears at the end of this procedure.

    clickhouse-client --query "SELECT replica_path || '/queue/' || node_name FROM system.replication_queue JOIN system.replicas USING (database, table) WHERE last_exception LIKE '%No active replica has part%'" | while read i; do /opt/zookeeper/bin/zkCli.sh deleteall $i; done
    clickhouse-client --query "SYSTEM RESTART REPLICAS"

    Reference: https://github.com/ClickHouse/ClickHouse/issues/10368

  7. Click Test.

  8. Click Save.

  9. Navigate to ADMIN > Settings > Database > ClickHouse Config.

  10. Click Test.

  11. Click Deploy.

Note: After Deploy succeeds, phClickHouseMonitor and ClickHouseServer processes will restart.
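
If you ran the cleanup commands from the note in step 6, you can optionally confirm that no stuck entries remain in the replication queue. This is a hedged check using the same system.replication_queue table referenced in that note.

clickhouse-client --query "SELECT count() FROM system.replication_queue WHERE last_exception LIKE '%No active replica has part%'"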

Deleting a Storage Tier

To delete a storage tier, take the following steps.

  1. Unmount the disk.

  2. Format the disk if you wish to remove its data.

  3. Navigate to ADMIN > License > Nodes.

  4. Select the node with the disk you wish to delete.

  5. Click Edit.

  6. Change Storage Tiers from "2" to "1".

  7. Click Test.

  8. Click Save.

  9. Navigate to ADMIN > Settings > Database > ClickHouse Config.

  10. Click Test.

  11. Click Deploy.

Note: After Deploy succeeds, phClickHouseMonitor and ClickHouseServer processes will restart.

Deleting a Node

Deleting from ClickHouse Keeper Cluster

A node cannot be deleted from the ClickHouse Keeper Cluster if the cluster has exactly 2 nodes, as doing so will lose quorum. If this constraint does not apply, follow these steps.

  1. Navigate to ADMIN > Settings > Database > ClickHouse Config.

  2. Delete the node from the ClickHouse Keeper Cluster Config by clicking - from the Row column.

  3. Click Test.

  4. Click Deploy.

Deleting from ClickHouse Cluster

A node should not be deleted if it is the only Data node in the Shard. Deleting a node in this situation will cause you to lose data. If you still want to proceed, follow these steps.

  1. Unmount the disk where events are stored.

  2. Navigate to ADMIN > Settings > Database > ClickHouse Config.

  3. Note the ShardID and ReplicaID of the node to be deleted. If the GUI shows Shard 3 and Replica 2, the ShardID is 3 and ReplicaID is 2. This information is needed later in Step 8.

  4. Delete the node from the ClickHouse Cluster table by clicking - from the Row column.

  5. Click Test.

  6. Click Deploy.

  7. Login to the deleted Worker and run the following SQL command in clickhouse-client shell to drop the events table.

    DROP TABLE fsiem.events_replicated

  8. Login to any ClickHouse Keeper node and run the following commands to delete the registry entry from the ClickHouse Keeper Cluster.

    /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.events/replicas/<ReplicaID>

    /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.summary/replicas/<ReplicaID>

    Note: Replace <ShardID> and <ReplicaID> with the information from step 3.
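
To confirm which replica entries exist under a path before or after removing them, you can list the children of the node with the same zkCli.sh utility used above. The <ShardID> placeholder is the value from step 3; this listing is a sketch and not a required part of the procedure.

/opt/zookeeper/bin/zkCli.sh ls /clickhouse/tables/<ShardID>/fsiem.events/replicas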

Moving a Worker from One Shard to Another Shard

To move a Worker from one shard to another shard, take the following steps.

  1. Remove the Worker from the ClickHouse Keeper and ClickHouse Cluster.

  2. Login to the Worker and run the following commands.

    clickhouse-client -q "DROP TABLE fsiem.events_replicated"

    clickhouse-client -q "DROP TABLE fsiem.summary"

    systemctl stop clickhouse-server

  3. Login to any ClickHouse Keeper node and run the following commands to delete the registry entry from the ClickHouse Keeper cluster.

    /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.events/replicas/<ReplicaID>

    /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.summary/replicas/<ReplicaID>

  4. Login to the Worker and navigate to /etc/clickhouse-server/config.d.

  5. Remove all config xml files under /etc/clickhouse-server/config.d except for logger.xml, max_partition_size_to_drop.xml, and max_suspicious_broken_parts.xml. A shell sketch covering this step and the next appears after this procedure.

  6. Remove all config xml under /etc/clickhouse-server/users.d/.

  7. Unmount all disks that were used by ClickHouse.

  8. Run wipefs on all disk devices that were assigned to ClickHouse.

  9. Login to the Supervisor GUI, and navigate to ADMIN > Settings > Database > ClickHouse Config.

  10. Select the target Worker and delete it from the existing shard by clicking - from the Row column.

  11. Click Test.

  12. Click Deploy.

  13. Navigate to ADMIN > License > Nodes.

  14. Select the target Worker.

  15. Click Delete to remove the target Worker.

  16. Wait for the Supervisor phMonitor process to come up.

  17. Re-add the target Worker into the License Node with the desired disk configuration, following the instructions for adding a Worker.

  18. Wait for the Supervisor phMonitor process to come up again.

  19. Navigate to ADMIN > Settings > Database > ClickHouse Config.

  20. Add the target Worker to the destination shard.

  21. Click Test.

  22. Click Deploy.
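
For steps 5 and 6 above, a minimal shell sketch is shown below. It assumes the XML files sit directly under the two directories and keeps only the three files called out in step 5; review the file list before deleting anything.

find /etc/clickhouse-server/config.d -maxdepth 1 -name '*.xml' ! -name logger.xml ! -name max_partition_size_to_drop.xml ! -name max_suspicious_broken_parts.xml -delete
rm -f /etc/clickhouse-server/users.d/*.xml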

Replacing a Worker with another Worker (within the same Shard)

Currently, the GUI allows you to replace one Worker (W1) with another Worker (W2) in the ClickHouse Configuration. However, clicking Test will fail because the shard and replica IDs are still in use by the previous Worker (W1).

Follow these steps to replace W1 with W2.

  1. Navigate to ADMIN > Settings > Database > ClickHouse Config.

  2. Note the ShardID and ReplicaID of W1. If the GUI shows Shard 3 and Replica 2, the ShardID is 3 and the ReplicaID is 2. This will be needed later in Step 7. You can cross-check these values from clickhouse-client; see the query sketch at the end of this procedure.

  3. Delete W1 from the ClickHouse Cluster Table by clicking - from the Row column.

  4. Click Test.

  5. Click Deploy.

  6. Login to W1 and run the following SQL command in clickhouse-client shell to drop the events table.

    DROP TABLE fsiem.events_replicated

  7. Login to any ClickHouse Keeper node and run the following commands to delete the registry entry from the ClickHouse Keeper cluster.

    /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.events/replicas/<ReplicaID>

    /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.summary/replicas/<ReplicaID>

  8. Add W2 to the ClickHouse Cluster table in the same place where W1 was. It can use the same Shard ID and Replica ID.

  9. Click Test.

  10. Click Deploy.
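
To cross-check the ShardID and ReplicaID noted in step 2, or to confirm that the Keeper entries removed in step 7 are the right ones, you can inspect the replica paths from clickhouse-client on W1. This is a sketch against the standard system.replicas table; the ShardID appears in zookeeper_path and the ReplicaID in replica_name.

clickhouse-client --query "SELECT database, table, replica_name, zookeeper_path FROM system.replicas WHERE database = 'fsiem'"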

Restarting a Node

Node belongs to ClickHouse Cluster

Normal restart procedures should be fine. If you see insertion failure (Table is read only) errors in phoenix.log, then take the following steps.

  1. Login to the node.

  2. Run the following command.

    sudo -u clickhouse touch /data-clickhouse-hot-1/clickhouse/flags/force_restore_data
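
To check whether any replicated tables are currently in read-only mode, and therefore whether the flag above is needed, you can run a hedged query against the standard system.replicas table.

clickhouse-client --query "SELECT database, table, is_readonly FROM system.replicas WHERE is_readonly = 1"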

Node belongs to ClickHouse Keeper Cluster

Normal restart procedures should be fine.

Recovering from Complete Loss of ClickHouse Keeper Cluster

Complete loss of Keeper cluster may happen if you have only 1 node and it goes down.

A normal ClickHouse Keeper cluster looks like this.

[root@FSM-660-CH-58-246 ~]# echo stat | nc <IP> 2181
ClickHouse Keeper version: v22.6.1.1985-testing-7000c4e0033bb9e69050ab8ef73e8e7465f78059
Clients:
[::ffff:172.30.58.246]:36518(recved=0,sent=0)
 
Latency min/avg/max: 0/0/0
Received: 0
Sent: 0
Connections: 0
Outstanding: 0
Zxid: 145730
Mode: follower
Node count: 305

If you see logs indicating that the ClickHouse events table is read only, then you know that the ClickHouse Keeper cluster needs to be restored:

grep PH_DATAMANAGER_HTTP_UPLOAD_ERROR /opt/phoenix/log/phoenix.log| grep TABLE_IS_READ_ONLY

2022-07-22T13:00:10.945816-07:00 FSM-Host phDataManager[9617]: [PH_DATAMANAGER_HTTP_UPLOAD_ERROR]:[eventSeverity]=PHL_ERROR,[procName]=phDataManager,[fileName]=ClickHouseWriterService.cpp,[lineNumber]=459,[errReason]=Uploading events to ClickHouse failed. respCode:500 resp:Code: 242. DB::Exception: Table is in readonly mode (replica path: /clickhouse/tables/1/fsiem.events/replicas/1). (TABLE_IS_READ_ONLY) (version 22.6.1.1985 (official build))

To recover, take the following steps:

  1. Login to Supervisor's redis:

    redis-cli -p 6666 -a $(grep pass /opt/phoenix/redis/conf/6666.conf | awk '{print $2}')
    127.0.0.1:6666> del cache:ClickHouse:clickhouseKeeperNodes
    (integer) 1
    
  2. Navigate to ADMIN > Settings > Database > ClickHouse Config, and replace the dead worker with a new worker to ClickHouse Keeper cluster.

  3. Click Test.

  4. Click Deploy.

  5. Login to clickhouse-client on each ClickHouse node and execute the following commands.

    SYSTEM RESTART REPLICA fsiem.events_replicated

    SYSTEM RESTORE REPLICA fsiem.events_replicated

Recovering from Losing Quorum in ClickHouse Keeper Cluster

Quorum is lost when more than half of the nodes in the Keeper cluster go down.

To identify if a ClickHouse Keeper needs recovery, you can check via log or command line.

From Log:

/data-clickhouse-hot-1/clickhouse-keeper/app_logs/clickhouse-keeper.err.log
 
2022.07.22 12:27:10.415055 [ 52865 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2022.07.22 12:27:10.415169 [ 52865 ] {} <Warning> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 0, live 0, dead 1), at least 2 nodes should respond. failure count 163

From Command Line:

[root@FSM-660-CH-58-246 ~]# echo stat | nc 172.30.58.216 2181
This instance is not currently serving requests

To recover, take the following steps:

Login to the ClickHouse Keeper node that needs recovery and run the following command.

echo rcvr | nc localhost 2181
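
After issuing the recovery command, you can confirm the node is serving requests again with the same stat command shown earlier. The output should include a Mode line (leader or follower) instead of the "This instance is not currently serving requests" message.

echo stat | nc localhost 2181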