ClickHouse Configuration
The following topics are covered.
Background Information
The following ClickHouse background topics are available.
ClickHouse Related Processes
ClickHouse is a distributed database with replication capabilities. FortiSIEM Supervisor and Worker software images include ClickHouse binaries. The user does not need to install anything else. You can configure a ClickHouse cluster from the FortiSIEM GUI.
There are two main ClickHouse processes:
- ClickHouseServer process: This is the ClickHouse Database Service.
- ClickHouseKeeper process: This is the ClickHouse Keeper Service providing Replication Management.
In addition, two more FortiSIEM processes provide ClickHouse related services:
- phClickHouseMonitor process: This runs on the Supervisor and Worker nodes and provides the following services:
  - On Supervisor/Worker nodes: CMDB Group Query helper, Lookup Table Query helper and DeviceToCMDBAttr Query helper.
  - On Supervisor node only: Provides Online data display and the list of available ClickHouse nodes.
- phMonitor process: Provides ClickHouse configuration management on the Supervisor node.
Supervisor/Worker Nodes Running ClickHouse Functions
A FortiSIEM Supervisor/Worker node can be of 3 types (not mutually exclusive):
- ClickHouse Keeper Node: This node runs the ClickHouse Keeper service providing replication management.
- ClickHouse Data Node: This node inserts events into the ClickHouse database.
- ClickHouse Query Node: This node provides ClickHouse query services.
A FortiSIEM Supervisor/Worker node can be exactly one of these types, or a mix of the 3 node types. For example:
Small EPS Environments: One Supervisor node that is a Keeper, Data and Query node
Medium EPS Environments:
- Supervisor node as ClickHouse Keeper Node. Note that 2 additional Keeper nodes are recommended (see ClickHouse Keeper Cluster Considerations).
High EPS Environments (Option 1):
- Supervisor node does not run any ClickHouse service.
- 3 separate Worker nodes as ClickHouse Keeper Nodes - these form the ClickHouse Keeper cluster (see ClickHouse Keeper Cluster Considerations).
- N Worker nodes, with each node acting as both ClickHouse Data Node and ClickHouse Query Node - these form the ClickHouse Database cluster.
High EPS environments (Option 2):
- Supervisor node does not run any ClickHouse service.
- 3 separate Worker nodes as ClickHouse Keeper Nodes - these form the ClickHouse Keeper cluster (see ClickHouse Keeper Cluster Considerations).
- A ClickHouse Database cluster consisting of
  - N/2 Worker nodes as ClickHouse Data Node only
  - N/2 Worker nodes as ClickHouse Query Node only
In Option 1, events are ingested at all N nodes and queries go to all N nodes. In Option 2, events are ingested at N/2 nodes and queried from N/2 nodes. There are other options, such as N/2 Data Only nodes and N Query Nodes, for better query performance. Option 1 is the most balanced option and has been seen to work well.
Shards and Replicas
A shard is a database partition designed to provide high insertion and query rates. Events are written to and read from multiple shards in parallel. You need to choose the number of shards based on your incoming EPS (see example below and the latest FortiSIEM Sizing Guide located in Fortinet Documents Library here).
If you want replication, then you can have replicas within each shard. ClickHouse replicates database writes on a node within a shard to all other replicas within the same shard. A typical choice is replication size = 2, meaning that you will have 2 nodes in each shard. A replica provides (a) faster queries and (b) protection against data loss in case a node goes down.
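Once the cluster is configured (see Step 2 - Create ClickHouse Configuration below), you can view the resulting shard and replica layout from any ClickHouse node. This is a read-only check against the standard system.clusters table; since the cluster name is generated by the FortiSIEM configuration, the query below simply lists all rows rather than assuming a name:

clickhouse-client --query "SELECT cluster, shard_num, replica_num, host_name, host_address FROM system.clusters"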
It is important to understand how ClickHouse insertion, Replication and Query works in FortiSIEM.
ClickHouse Keeper Cluster Considerations
ClickHouse Keeper provides the coordination system for data replication and distributed DDL query execution. You should use an odd number of nodes in the Keeper Cluster for maintaining quorum, although ClickHouse allows an even number of nodes.
- If you use 3 nodes and lose 1 node, the Cluster keeps running without any intervention.
- If you use 2 nodes and lose 1 node, then quorum is lost. You need to use the steps in Recovering from Losing Quorum to recover quorum.
- If you use 1 node and lose that node, then the ClickHouse event database becomes read only and insertion stops. You need to use the steps in Recovering from Complete Loss of ClickHouse Keeper Cluster to recover the ClickHouse Keeper database.
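To check whether quorum is healthy and which Keeper node is the leader, you can use the same four-letter-word commands shown in the Recovery sections later in this document (assuming the default Keeper client port 2181 used there; mntr prints detailed metrics and is part of the ClickHouse default four-letter-word whitelist):

echo stat | nc <Keeper-node-IP> 2181
echo mntr | nc <Keeper-node-IP> 2181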
Note that for high EPS environments, ClickHouse recommends running ClickHouse Keeper and Database services on separate nodes, to avoid disk and CPU contention between Query and Replication Management engines. If you have powerful servers with good CPU, memory and high throughput disks, and EPS is not high, it may be reasonable to co-locate ClickHouse Keeper and Data/Query nodes.
See ClickHouse reference in the References section for related information.
Event Insertion Flow
- Collectors send events to the Worker list specified in ADMIN > Settings > System > Event Worker.
- The Data Manager process on each Worker node first selects a ClickHouse Data Node and inserts into that node. It may be local or remote.
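If you suspect inserts are failing on a Worker, the Data Manager logs upload errors to phoenix.log; the same error code is used in the Recovery sections later in this document:

grep PH_DATAMANAGER_HTTP_UPLOAD_ERROR /opt/phoenix/log/phoenix.log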
Event Replication Flow
- After insertion, the ClickHouse Data Node will inform a ClickHouse Keeper Node.
- The ClickHouse Keeper Node initiates replication to all other nodes in the same shard.
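To see the replication state of the event table on any Data/Query node, you can query the standard system.replicas table (read-only; fsiem.events_replicated is the local replicated table name used elsewhere in this document):

clickhouse-client --query "SELECT database, table, is_leader, is_readonly, absolute_delay, active_replicas, total_replicas FROM system.replicas WHERE table = 'events_replicated'"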
Query Flow
- GUI sends the request to App Server, which sends it to the Query Master on the Supervisor.
- Query Master provides Query management. It sends the request to a (randomly chosen) ClickHouse Query node. Each query may go to a different ClickHouse Query node.
- The ClickHouse Query node co-ordinates the Query (like an Elasticsearch Coordinating node):
  - It sends the query to the other ClickHouse Query nodes.
  - It generates the final result by combining partial results obtained from all ClickHouse Query nodes.
  - It sends the result back to Query Master.
- Query Master sends the results back to App Server, which in turn sends them back to the GUI.
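To see which queries actually ran on a given ClickHouse Query node and how long they took, you can inspect ClickHouse's query log (a standard system table, populated when query logging is enabled, which is the ClickHouse default):

clickhouse-client --query "SELECT event_time, query_duration_ms, read_rows, substring(query, 1, 80) AS query_head FROM system.query_log WHERE type = 'QueryFinish' ORDER BY event_time DESC LIMIT 5"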
Sizing your ClickHouse Deployment
Before starting to configure Workers, it is important to plan your ClickHouse deployment. The Sizing Guide has details, but a summary follows:
- You need to know your event insertion rate (EPS). Using peak values may lead to an over-engineered deployment, while using only the average may lead to under-provisioning, especially during bursts. You may choose an EPS number in between these two values.
- Determine the number of Worker nodes based on your EPS and the CPU, Memory and Disk throughput of your Workers.
- Determine the number of shards and replication:
  - How many shards
  - How many Data nodes and Query nodes per shard
- Determine the required storage capacity and disk type:
  - How many storage tiers
  - Disk type, Throughput and Disk space per tier
Here is an example to illustrate the design. The requirements are listed here.
- 100K EPS sustained
- 1,000 Bytes/event (raw)
- Replication = 2 (meaning 2 copies of each event)
- Online storage for 3 months
Assume the following infrastructure.
- Workers with 32vCPU and 64GB RAM
- SSD Disks with 500MBps read/write throughput
- Spinning Disks with 200 MBps read/write throughput
- Compression: 4:1
- 10 Gbps network between the Workers
Our experiments indicate that each Worker can insert 30K EPS. We need 4 insertion nodes and 3 Keeper nodes, which can be set up as follows:
- Shard 1
  - Worker 1 – both ClickHouse Data Node and Query Node
  - Worker 2 – both ClickHouse Data Node and Query Node
- Shard 2
  - Worker 3 – both ClickHouse Data Node and Query Node
  - Worker 4 – both ClickHouse Data Node and Query Node
- Worker 5, Worker 6 and Worker 7 acting as ClickHouse Keeper
Collectors send events to 7 Worker nodes which then write to 4 Data nodes.
Notes:
- Worker 5, Worker 6 and Worker 7 run all FortiSIEM Worker processes and the ClickHouseKeeper process, but not the ClickHouseServer process.
- Worker 1, Worker 2, Worker 3 and Worker 4 run all FortiSIEM Worker processes and the ClickHouseServer process, but not the ClickHouseKeeper process.
For storage, we need 5 TB/day using replication distributed among 4 Worker nodes. Each Worker needs the following.
- Hot Tier: 14 days in SSD – 18TB total
- Warm Tier: 76 days in Spinning disks – 95TB total
Note that Writes go to Hot Tier first and then to Warm Tier after the Hot Tier fills up.
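These figures can be derived from the requirements above (rounding the daily volume up to 5 TB):

100,000 EPS x 1,000 Bytes/event = 100 MB/s = ~8.6 TB/day raw
8.6 TB/day x 2 (replication) / 4 (compression) = ~4.3 TB/day, rounded up to 5 TB/day
5 TB/day / 4 Data nodes = 1.25 TB/day per Worker
Hot Tier: 1.25 TB/day x 14 days = ~18 TB per Worker
Warm Tier: 1.25 TB/day x 76 days = ~95 TB per Worker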
Initial Configuration
Step 1 - Create All Worker Nodes One by One
Before proceeding you need to know whether the Worker node will be a ClickHouse Keeper, Data or Query node or a combination.
If the Worker node will be a ClickHouse Keeper node, then add a small disk (200GB) that stores the replication database.
If the Worker node will be a ClickHouse Data or Query node, then add the disks to store the events. You can choose Hot Tier (e.g. SSD) only, or add a Warm Tier (e.g. Spinning disks), and you can add multiple disks in each tier. If more than 1 tier or more than 1 disk per tier is chosen, then ClickHouse writes events in the following order. Events are written to a disk only when the earlier disks in the order are full:
- Hot Tier – Disk 1
- Hot Tier – Disk 2
- Hot Tier – Disk N
- Warm Tier – Disk 1
- Warm Tier – Disk 2
- Warm Tier – Disk N
The disk layout for all Data/Query nodes in the same shard should be identical, since incoming events are distributed round robin to all ClickHouse Data nodes and then replicated to ClickHouse Query only nodes.
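Once the disks have been configured and deployed, you can verify the tier and disk layout that ClickHouse is actually using on a Data/Query node with read-only queries against the standard ClickHouse system tables:

clickhouse-client --query "SELECT policy_name, volume_name, volume_priority, disks FROM system.storage_policies"
clickhouse-client --query "SELECT name, path, formatReadableSize(free_space) AS free, formatReadableSize(total_space) AS total FROM system.disks"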
To add a worker node for use with ClickHouse, take the following steps:
1. Navigate to ADMIN > License > Nodes.
2. Add a Worker node by taking the following steps.
   - Click Add.
   - In the Type drop-down list, select Worker.
   - In the IP Address field, enter the IPV4 address of the Worker.
   - In the Host Name field, enter the host name.
   - From the Running On drop-down list, select the appropriate device.
     Note: It is important that you select the correct device, as it determines the Disk Path. Additionally, some devices have certain limitations. For example, the 2000G allows two tiers, while the 2000F and 3500G only allow one tier.
   - From the Storage Tiers drop-down list, select the number of storage tiers.
   - In Hot Tier, under Disk Path, enter the mount point.
   - If multiple storage tiers are selected, in Warm Tier, under Disk Path, enter the mount point.
   - Click + to add another Hot Tier or Warm Tier field.
   - Click Test to verify the mount point(s).
   - Click Save.
     Note: If you made Storage changes in the Add/Edit Worker or Storage Setup pages, then after Save succeeds, the phClickHouseMonitor and ClickHouseServer processes will restart.
Step 2 - Create ClickHouse Configuration
In this step, you specify which of the FortiSIEM nodes will be ClickHouse Keeper, Data or Query node. In other words, you will configure the following.
- ClickHouse Keeper Cluster consisting of ClickHouse Keeper nodes.
- ClickHouse Server Cluster consisting of ClickHouse Data and Query nodes.
- Deploy the configuration.
Take the following steps.
1. Navigate to ADMIN > Settings > Database > ClickHouse Config.
2. Configure ClickHouse Keeper Cluster.
   Note: You need 1-3 nodes in the cluster.
   - Click on + and add a Worker.
   - Click on - to remove a Worker.
   Be careful about removing nodes from the Keeper cluster, as you may lose quorum, in which case the event database becomes read only and event insertion may stop. See the notes below on Modifying the Keeper Cluster while Maintaining Quorum.
Modifying the Keeper Cluster while Maintaining Quorum
Suppose your Keeper Cluster has 1 Node
- Adding a new node is allowed but not recommended. Adding 2 nodes is recommended to bring the cluster size up to 3.
- Deleting the node is NOT allowed as it will lead to complete loss of the Keeper cluster. Recover by following the steps in Recovering from Complete Loss of ClickHouse Keeper Cluster.
- Replace the node, if needed, by following this sequence.
  - First, add the new node, then click Test and Save.
  - Next, delete the old node, then click Test and Save. Quorum will be lost. Please log on to the new node and recover by following the steps in Recovering from Losing Quorum.
Suppose your Keeper Cluster has 2 Nodes (This is not recommended)
- Adding a new node is allowed and recommended.
- Deleting a node is permitted, but quorum will be lost. Please log on to the remaining node and recover by following the steps in Recovering from Losing Quorum.
- Replace a node, if needed, by following this sequence.
  - First, add the new node, then click Test and Save.
  - Next, delete the old node, then click Test and Save.
Suppose You are Running ClickHouse Keeper Cluster with 3 Nodes (This is recommended)
- Adding a new node is permitted.
- Deleting a node is permitted, but not recommended as a 2-node cluster cannot sustain a node loss without losing quorum.
- Replacing a node is permitted - follow this sequence as good practice.
  - First, add the new node, then click Test and Save.
  - Next, delete the old node, then click Test and Save.
Suppose You are Running ClickHouse Keeper Cluster with 4 or more Nodes (This has no problem from quorum perspective but may impact performance because of overhead in maintaining quorum)
- Adding a new node is permitted.
- Deleting a node is permitted.
- Replacing a node is permitted. Follow this sequence as good practice.
  - First, add the new node, then click Test and Save.
  - Next, delete the old node, then click Test and Save.
3. Configure ClickHouse Server Cluster. You need to know the number of shards.
   - Click on + and add a shard.
   - Add Workers to the shard.
   - Check Data if this Worker is a ClickHouse Data Node.
   - Check Query if this Worker is a ClickHouse Query Node.
   - Check both Data and Query if this Worker is both a ClickHouse Data and Query Node.
4. Once the shards have been created and workers have been added to the shards, click Test.
5. If Test succeeds, then click Deploy to push the changes to ClickHouse.
Note: An operation on a ClickHouse Keeper Cluster node, such as adding or removing a node, MUST be done individually: after each such operation, perform a Test and Deploy before making any other change to the Keeper Cluster, such as adding or removing another node. Do NOT perform more than one Keeper Cluster operation without testing and deploying it, as doing so may cause stability issues.
- If you made changes to the ClickHouse Keeper Cluster, then after Deploy succeeds, the phClickHouseMonitor, ClickHouseKeeper and ClickHouseServer processes will restart.
- If you made changes to the ClickHouse Cluster, then after Deploy succeeds, the phClickHouseMonitor and ClickHouseServer processes will restart.
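Once events start flowing after Deploy, a quick way to confirm that a Data node is receiving inserts is to count rows in its local replicated event table (fsiem.events_replicated, the table name used elsewhere in this document). Run it twice, a minute apart, and the count should grow:

clickhouse-client --query "SELECT count() FROM fsiem.events_replicated"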
Advanced Operations
Adding a Disk or Tier
To add a disk or tier, take the following steps.
1. Navigate to ADMIN > License > Nodes.
2. Select the node to which you wish to add a disk or tier.
3. Click Edit.
4. Update the Storage Tiers drop-down number to change the number of tiers.
5. Click + from the Hot Tier or Warm Tier Row column respectively to add another Hot Tier or Warm Tier field, to add a disk.
6. Click Test.
7. Click Save.
8. Navigate to ADMIN > Settings > Database > ClickHouse Config.
9. Click Test.
10. Click Deploy.
Notes:
- After Deploy succeeds, the phClickHouseMonitor and ClickHouseServer processes will restart.
- When additional disks are added, the data is written across all available disks. If a disk becomes full, then data will be written to the disks with free space until all disks are full, at which point data will be moved to other ClickHouse storage tiers, archived, or purged depending on your FortiSIEM configuration.
Deleting a Disk
To delete a disk, take the following steps.
1. Unmount the disk.
2. Format the disk if you wish to remove its data.
3. Navigate to ADMIN > License > Nodes.
4. Select the node with the disk you wish to delete.
5. Click Edit.
6. Click - from the Hot Tier or Warm Tier Row column respectively to remove the disk.
Note: You may see some error logs generated in /opt/clickhouse/log/clickhouse-server.err.log, for example:
2022.06.20 15:55:31.761484 [ 98091 ] {} <Warning> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): Found parts with the same min block and with the same max block as the missing part 18250-20220620_371_375_1 on replica 1. Hoping that it will eventually appear as a result of a merge.
2022.06.20 15:55:31.764560 [ 98141 ] {} <Warning> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): Checking part 18250-20220620_353_378_2
2022.06.20 15:55:31.764841 [ 98141 ] {} <Warning> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): Checking if anyone has a part 18250-20220620_353_378_2 or covering part.
2022.06.20 15:55:31.765138 [ 98141 ] {} <Error> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): No replica has part covering 18250-20220620_353_378_2 and a merge is impossible: we didn't find a smaller part with the same max block.
2022.06.20 15:55:31.766222 [ 98141 ] {} <Warning> fsiem.events_replicated (a5a85f1a-6ebf-4cf1-b82b-686f928798cc): Cannot commit empty part 18250-20220620_353_378_2 with error DB::Exception: Part 18250-20220620_353_378_2 (state Outdated) already exists, but it will be deleted soon
2022.06.20 15:55:31.766574 [ 98141 ] {} <Warning> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): Cannot create empty part 18250-20220620_353_378_2 instead of lost. Will retry later
These errors indicate that ClickHouse detected some missing data by comparing the local parts and the parts names stored in clickhouse-keeper. This is just a warning and does not affect operation. If you find this annoying, delete the entries in clickhouse-keeper by running the following commands on the Worker where the disk is deleted.
clickhouse-client --query "SELECT replica_path || '/queue/' || node_name FROM system.replication_queue JOIN system.replicas USING (database, table) WHERE last_exception LIKE '%No active replica has part%'" | while read i; do /opt/zookeeper/bin/zkCli.sh deleteall $i; done
clickhouse-client --query "SYSTEM RESTART REPLICAS"
Reference: https://github.com/ClickHouse/ClickHouse/issues/10368
7. Click Test.
8. Click Save.
9. Navigate to ADMIN > Settings > Database > ClickHouse Config.
10. Click Test.
11. Click Deploy.
Note: After Deploy succeeds, the phClickHouseMonitor and ClickHouseServer processes will restart.
Deleting a Storage Tier
To delete a storage tier, take the following steps.
1. Unmount the disk.
2. Format the disk if you wish to remove its data.
3. Navigate to ADMIN > License > Nodes.
4. Select the node with the disk you wish to delete.
5. Click Edit.
6. Change Storage Tiers from "2" to "1".
7. Click Test.
8. Click Save.
9. Navigate to ADMIN > Settings > Database > ClickHouse Config.
10. Click Test.
11. Click Deploy.
Note: After Deploy succeeds, the phClickHouseMonitor and ClickHouseServer processes will restart.
Deleting a Node
Deleting from ClickHouse Keeper Cluster
A node cannot be deleted from the ClickHouse Keeper Cluster if the cluster has exactly 2 nodes, as deleting one will cause loss of quorum. If this constraint is not applicable, follow these steps.
1. Navigate to ADMIN > Settings > Database > ClickHouse Config.
2. Delete the node from the ClickHouse Keeper Cluster Config by clicking - from the Row column.
3. Click Test.
4. Click Deploy.
Deleting from ClickHouse Cluster
A node should not be deleted if it is the only Data node in the Shard. Deleting a node in this situation will cause you to lose data. If you still want to proceed, follow these steps.
1. Unmount the disk where events are stored.
2. Navigate to ADMIN > Settings > Database > ClickHouse Config.
3. Note the ShardID and ReplicaID of the node to be deleted. If the GUI shows Shard 3 and Replica 2, the ShardID is 3 and the ReplicaID is 2. This information is needed later in Step 8.
4. Delete the node from the ClickHouse Cluster table by clicking - from the Row column.
5. Click Test.
6. Click Deploy.
7. Login to the deleted Worker and run the following SQL command in the clickhouse-client shell to drop the events table.
   DROP TABLE fsiem.events_replicated
8. Login to any ClickHouse Keeper node and run the following commands to delete the registry entry from the ClickHouse Keeper Cluster.
   /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.events/replicas/<ReplicaID>
   /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.summary/replicas/<ReplicaID>
   Note: Replace <ShardID> and <ReplicaID> with the information from Step 3.
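To confirm that the registry entries were removed (and that entries for the remaining replicas are still present), you can list the replicas path with the same zkCli.sh tool; this is a read-only check:

/opt/zookeeper/bin/zkCli.sh ls /clickhouse/tables/<ShardID>/fsiem.events/replicas
/opt/zookeeper/bin/zkCli.sh ls /clickhouse/tables/<ShardID>/fsiem.summary/replicas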
Moving a Worker from One Shard to Another Shard
To move a Worker from one shard to another shard, take the following steps.
1. Remove the Worker from the ClickHouse Keeper and ClickHouse Cluster.
2. Login to the Worker and run the following commands.
   clickhouse-client -q "DROP TABLE fsiem.events_replicated"
   clickhouse-client -q "DROP TABLE fsiem.summary"
   systemctl stop clickhouse-server
3. Login to any ClickHouse Keeper node and run the following commands to delete the registry entry from the ClickHouse Keeper cluster.
   /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.events/replicas/<ReplicaID>
   /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.summary/replicas/<ReplicaID>
4. Login to the Worker and navigate to /etc/clickhouse-server/config.d.
5. Remove all config xml files under /etc/clickhouse-server/config.d except for logger.xml, max_partition_size_to_drop.xml and max_suspicious_broken_parts.xml.
6. Remove all config xml files under /etc/clickhouse-server/users.d/.
7. Unmount all disks that were used by ClickHouse.
8. Wipefs all disk devices that were assigned to ClickHouse (see the example commands after this procedure).
9. Login to the Supervisor GUI, and navigate to ADMIN > Settings > Database > ClickHouse Config.
10. Select the target Worker and delete it from the existing shard by clicking - from the Row column.
11. Click Test.
12. Click Deploy.
13. Navigate to ADMIN > License > Nodes.
14. Select the target Worker.
15. Click Delete to remove the target Worker.
16. Wait for the Supervisor phMonitor process to come up.
17. Re-add the target Worker into the License Node with the desired disk configuration, following the instructions for adding a Worker.
18. Wait for the Supervisor phMonitor process to come up again.
19. Navigate to ADMIN > Settings > Database > ClickHouse Config.
20. Add the target Worker to the destination shard.
21. Click Test.
22. Click Deploy.
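For the unmount and wipefs steps above, the commands typically look like the following. The mount point matches the hot tier path used elsewhere in this document; the device name /dev/sdb is only an example, so confirm the correct device with lsblk before wiping anything:

lsblk -o NAME,MOUNTPOINT,SIZE
umount /data-clickhouse-hot-1
wipefs -a /dev/sdb    # example device only; wipefs destroys the filesystem signatures on the device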
Replacing a Worker with another Worker (within the same Shard)
Currently, the GUI allows you to choose to replace one Worker (W1) with another Worker (W2) in ClickHouse Configuration. However, clicking Test will fail since the shard and replica IDs are still in use by the previous Worker (W1).
Follow these steps to replace W1 with W2.
1. Navigate to ADMIN > Settings > Database > ClickHouse Config.
2. Note the ShardID and ReplicaID of W1. If the GUI shows Shard 3 and Replica 2, the ShardID is 3 and the ReplicaID is 2. This will be needed later in Step 7.
3. Delete W1 from the ClickHouse Cluster table by clicking - from the Row column.
4. Click Test.
5. Click Deploy.
6. Login to W1 and run the following SQL command in the clickhouse-client shell to drop the events table.
   DROP TABLE fsiem.events_replicated
7. Login to any ClickHouse Keeper node and run the following commands to delete the registry entry from the ClickHouse Keeper cluster.
   /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.events/replicas/<ReplicaID>
   /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.summary/replicas/<ReplicaID>
8. Add W2 to the ClickHouse Cluster table in the same place where W1 was. It can use the same Shard ID and Replica ID.
9. Click Test.
10. Click Deploy.
Restarting a Node
Node belongs to ClickHouse Cluster
Normal restart procedures should be fine. If you see insertion failure (Table is read only) errors in phoenix.log, then take the following steps.
1. Login to the node.
2. Run the following command.
   sudo -u clickhouse touch /data-clickhouse-hot-1/clickhouse/flags/force_restore_data
Node belongs to ClickHouse Keeper Cluster
Normal restart procedures should be fine.
Recovering from Complete Loss of ClickHouse Keeper Cluster
Normal ClickHouse Keeper:
A healthy ClickHouse Keeper node responds to the stat command like this.
[root@FSM-660-CH-58-246 ~]# echo stat | nc <IP> 2181
ClickHouse Keeper version: v22.6.1.1985-testing-7000c4e0033bb9e69050ab8ef73e8e7465f78059
Clients:
 [::ffff:172.30.58.246]:36518(recved=0,sent=0)
Latency min/avg/max: 0/0/0
Received: 0
Sent: 0
Connections: 0
Outstanding: 0
Zxid: 145730
Mode: follower
Node count: 305
How to check if a ClickHouse Keeper needs to be restored.
You will see logs indicating that the ClickHouse event table is read only.
grep PH_DATAMANAGER_HTTP_UPLOAD_ERROR /opt/phoenix/log/phoenix.log | grep TABLE_IS_READ_ONLY
2022-07-22T13:00:10.945816-07:00 FSM-Host phDataManager[9617]: [PH_DATAMANAGER_HTTP_UPLOAD_ERROR]:[eventSeverity]=PHL_ERROR,[procName]=phDataManager,[fileName]=ClickHouseWriterService.cpp,[lineNumber]=459,[errReason]=Uploading events to ClickHouse failed. respCode:500 resp:Code: 242. DB::Exception: Table is in readonly mode (replica path: /clickhouse/tables/1/fsiem.events/replicas/1). (TABLE_IS_READ_ONLY) (version 22.6.1.1985 (official build))
Recovery Steps
1. Login to the Supervisor's redis and delete the cached list of Keeper nodes:
   redis-cli -p 6666 -a $(grep pass /opt/phoenix/redis/conf/6666.conf | awk '{print $2}')
   127.0.0.1:6666> del cache:ClickHouse:clickhouseKeeperNodes
   (integer) 1
2. Navigate to ADMIN > Settings > Database > ClickHouse Config, and replace the dead worker with a new worker in the ClickHouse Keeper cluster.
3. Click Test.
4. Click Deploy.
5. Login to clickhouse-client on each ClickHouse node and execute the following commands.
   SYSTEM RESTART REPLICA fsiem.events_replicated
   SYSTEM RESTORE REPLICA fsiem.events_replicated
Recovering from Losing Quorum
To identify if a ClickHouse Keeper needs recovery, you can check via log or command line.
From Log
/data-clickhouse-hot-1/clickhouse-keeper/app_logs/clickhouse-keeper.err.log
2022.07.22 12:27:10.415055 [ 52865 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2022.07.22 12:27:10.415169 [ 52865 ] {} <Warning> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 0, live 0, dead 1), at least 2 nodes should respond. failure count 163
From Command Line
[root@FSM-660-CH-58-246 ~]# echo stat | nc 172.30.58.216 2181
This instance is not currently serving requests
Recovery Steps
If you have lost ClickHouse Keeper quorum, take the following step.
Login to the ClickHouse Keeper node that needs recovery and run the following command.
echo rcvr | nc localhost 2181
Migrating ClickHouse Events from FortiSIEM 6.5.x to 6.6.0 or Later
FortiSIEM 6.5.x ran ClickHouse on a single node and used the Merge Tree engine. FortiSIEM 6.6.0 onwards runs ClickHouse on a cluster using the Replicated Merge Tree engine. You need to follow these special steps to move the old events previously stored in Merge Tree to Replicated Merge Tree.
Note: From FortiSIEM 6.5.x, you MUST first upgrade to FortiSIEM 6.6.x PRIOR to upgrading to FortiSIEM 7.x or later. If you upgrade directly from 6.5.x to 7.0.0 or later, the upgrade will fail.
To upgrade your FortiSIEM 6.5.x to 6.6.x, take the following steps.
1. Navigate to ADMIN > Settings > Database > ClickHouse Config.
2. Click Test, then click Deploy to enable the ClickHouse Keeper service, which is new in 6.6.x.
3. Migrate the event data in 6.5.x to 6.6.x by running the script /opt/phoenix/phscripts/clickhouse/clickhouse-migrate-650.sh.
4. Verify that all events have been moved to the new table through GUI Search (see also the row-count check after these steps).
5. When the data migration is deemed successful, run the following command on every node where the clickhouse-migrate-650.sh script was run successfully to drop the old event table.
   clickhouse-client -q "DROP TABLE fsiem.events_non_replicated"
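As a supplement to the GUI Search check in step 4, and before dropping the old table in step 5, you can compare row counts of the old and new tables directly (table names as used in this section). Since new events keep arriving in the replicated table, its count should be at least as large as the old table's count:

clickhouse-client -q "SELECT count() FROM fsiem.events_non_replicated"
clickhouse-client -q "SELECT count() FROM fsiem.events_replicated"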
Now you can upgrade to FortiSIEM 7.x or later, if needed.
References
The following ClickHouse References are available.
General Introduction
https://clickhouse.com/docs/en/intro
Concepts and Architecture
https://clickhouse.com/docs/en/development/architecture
ClickHouse Keeper
https://clickhouse.com/docs/en/operations/clickhouse-keeper
Performance Tuning
https://clickhouse.com/docs/en/operations/tips
Troubleshooting
https://kb.altinity.com/altinity-kb-schema-design/how-much-is-too-much/