Advanced Operations

Adding a Disk or Tier

To add a disk or tier, take the following steps.

  1. Navigate to ADMIN > License > Nodes.
  2. Select the node to which you wish to add a disk or tier.
  3. Click Edit.
  4. Update the Storage Tiers drop-down to change the number of tiers.
  5. Click + in the Row column of the Hot Tier or Warm Tier section, respectively, to add another Hot Tier or Warm Tier field for the new disk.
  6. Click Test.
  7. Click Save.
  8. Navigate to ADMIN > Settings > Database > ClickHouse Config.
  9. Click Test.
  10. Click Deploy.

Notes:

  • After Deploy succeeds, phClickHouseMonitor and ClickHouseServer processes will restart.
  • When additional disks are added, the data is written across all available disks. If a disk becomes full, then data will be written to the disks with free space until all disks are full, at which point data will be moved to other ClickHouse storage tiers, archived, or purged depending on your FortiSIEM configuration.
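
To optionally confirm that ClickHouse sees the new disk or tier after Deploy succeeds, you can query the ClickHouse system tables on the affected node. This is a read-only check; the disk names, paths, and policy names in the output depend on your configuration.

clickhouse-client --query "SELECT name, path, formatReadableSize(free_space) AS free, formatReadableSize(total_space) AS total FROM system.disks"
clickhouse-client --query "SELECT policy_name, volume_name, disks FROM system.storage_policies"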

Deleting a Disk

To delete a disk, take the following steps.

  1. Edit /etc/fstab and remove the entry with the target mount point (a consolidated command-line sketch of steps 1-5 appears after this procedure). For example, for the target mount "/data-clickhouse-warm-1", you would delete a line similar to:
    UUID=db920fd9-5cbb-454b-81b8-d6e427564ecf      /data-clickhouse-warm-1      xfs      defaults,nodev,noatime,inode64 0 0
  2. Run the following command to sync fstab changes to the system.
    systemctl daemon-reload
  3. Unmount the disk.
  4. Delete the target mount point directory. For example, if the target mount point directory was "/data-clickhouse-warm-1", you would run the following command:
    rm -rf /data-clickhouse-warm-1
  5. (Optional) To remove the data or reuse the disk as a ClickHouse data disk, format the disk by running the following command: 
    wipefs -a <device path>
  6. Navigate to ADMIN > License > Nodes.
  7. Select the node with the disk you wish to delete.
  8. Click Edit.
  9. Click - in the Row column of the Hot Tier or Warm Tier section, respectively, to remove the disk.

    Note: You may see error logs generated in /opt/clickhouse/log/clickhouse-server.err.log, similar to the following:
    2022.06.20 15:55:31.761484 [ 98091 ] {} <Warning> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): Found parts with the same min block and with the same max block as the missing part 18250-20220620_371_375_1 on replica 1. Hoping that it will eventually appear as a result of a merge.
    
    2022.06.20 15:55:31.764560 [ 98141 ] {} <Warning> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): Checking part 18250-20220620_353_378_2
    
    2022.06.20 15:55:31.764841 [ 98141 ] {} <Warning> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): Checking if anyone has a part 18250-20220620_353_378_2 or covering part.
    
    2022.06.20 15:55:31.765138 [ 98141 ] {} <Error> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): No replica has part covering 18250-20220620_353_378_2 and a merge is impossible: we didn't find a smaller part with the same max block.
    
    2022.06.20 15:55:31.766222 [ 98141 ] {} <Warning> fsiem.events_replicated (a5a85f1a-6ebf-4cf1-b82b-686f928798cc): Cannot commit empty part 18250-20220620_353_378_2 with error DB::Exception: Part 18250-20220620_353_378_2 (state Outdated) already exists, but it will be deleted soon
    
    2022.06.20 15:55:31.766574 [ 98141 ] {} <Warning> fsiem.events_replicated (ReplicatedMergeTreePartCheckThread): Cannot create empty part 18250-20220620_353_378_2 instead of lost. Will retry later
    

    These errors indicate that ClickHouse detected some missing data by comparing the local parts with the part names stored in clickhouse-keeper. This is just a warning and does not affect operation. If you find this annoying, delete the entries in clickhouse-keeper by running the following commands on the Worker where the disk was deleted.

    clickhouse-client --query "SELECT replica_path || '/queue/' || node_name FROM system.replication_queue JOIN system.replicas USING (database, table) WHERE last_exception LIKE '%No active replica has part%'" | while read i; do /opt/zookeeper/bin/zkCli.sh deleteall $i; done
    clickhouse-client --query "SYSTEM RESTART REPLICAS"

    Reference: https://github.com/ClickHouse/ClickHouse/issues/10368

  10. Click Test.
  11. Click Save.
  12. Navigate to ADMIN > Settings > Database > ClickHouse Config.
  13. Click Test.
  14. Click Deploy.

Note: After Deploy succeeds, phClickHouseMonitor and ClickHouseServer processes will restart.
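
For reference, the command-line work in steps 1 through 5 can be carried out as in the following sketch. The mount point /data-clickhouse-warm-1 and the device path /dev/sdd are examples only; substitute your own values and verify them carefully before running destructive commands such as rm -rf and wipefs.

# Remove the fstab entry for the target mount point
sed -i '\|/data-clickhouse-warm-1|d' /etc/fstab
# Sync fstab changes to the system
systemctl daemon-reload
# Unmount the disk and delete the mount point directory
umount /data-clickhouse-warm-1
rm -rf /data-clickhouse-warm-1
# (Optional) Wipe the filesystem signature so the disk can be reused
wipefs -a /dev/sdd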

Deleting a Storage Tier

To delete a storage tier, take the following steps.

  1. For EACH DISK in the storage tier, take the following steps.
    1. Edit /etc/fstab and remove the entry with the target mount point. For example, for the target mount "/data-clickhouse-warm-1", you would delete a line similar to:
      UUID=db920fd9-5cbb-454b-81b8-d6e427564ecf      /data-clickhouse-warm-1      xfs      defaults,nodev,noatime,inode64 0 0
    2. Run the following command to sync fstab changes to the system.
      systemctl daemon-reload
    3. Unmount the disk.
    4. Delete the target mount point directory. For example, if the target mount point directory was "/data-clickhouse-warm-1", you would run the following command:
      rm -rf /data-clickhouse-warm-1
    5. (Optional) To remove the data or reuse the disk as a ClickHouse data disk, format the disk by running the following command: 
      wipefs -a <device path>
    6. After following the above steps for all disks in the storage tier you wish to delete, proceed to step 2.
  2. Navigate to ADMIN > License > Nodes.
  3. Select the node with the storage tier you wish to delete.
  4. Click Edit.
  5. Change Storage Tiers from "2" to "1".
  6. Click Test.
  7. Click Save.
  8. Navigate to ADMIN > Settings > Database > ClickHouse Config.
  9. Click Test.
  10. Click Deploy.

Note: After Deploy succeeds, phClickHouseMonitor and ClickHouseServer processes will restart.
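
To optionally confirm that the storage tier was removed after Deploy succeeds, you can check that the removed mount points are gone and that the storage policy no longer lists the removed volume. The output depends on your configuration.

mount | grep data-clickhouse
clickhouse-client --query "SELECT policy_name, volume_name, disks FROM system.storage_policies"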

Moving a Worker from One Shard to Another Shard

To move a Worker from one shard to another shard, take the following steps.

  1. Remove the Worker from the ClickHouse Keeper and ClickHouse Cluster.
  2. Login to the Worker and run the following commands.
    clickhouse-client -q "DROP TABLE fsiem.events_replicated"
    clickhouse-client -q "DROP TABLE fsiem.summary"
    systemctl stop clickhouse-server
  3. Login to any ClickHouse Keeper node and run the following commands to delete the registry entry from the ClickHouse Keeper cluster.

    /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.events/replicas/<ReplicaID>

    /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.summary/replicas/<ReplicaID>

  4. Login to the Worker and navigate to /etc/clickhouse-server/config.d.
  5. Remove all config XML files under /etc/clickhouse-server/config.d/ except for logger.xml, max_partition_size_to_drop.xml, and max_suspicious_broken_parts.xml.
  6. Remove all config XML files under /etc/clickhouse-server/users.d/.
  7. For every disk in ClickHouse, take the following steps.
    1. Edit /etc/fstab and remove the entry with the target mount point. For example, for the target mount "/data-clickhouse-warm-1", you would delete a line similar to:
      UUID=db920fd9-5cbb-454b-81b8-d6e427564ecf      /data-clickhouse-warm-1      xfs      defaults,nodev,noatime,inode64 0 0
    2. Run the following command to sync fstab changes to the system.
      systemctl daemon-reload
    3. Unmount the disk.
    4. Delete the target mount point directory. For example, if the target mount point directory was "/data-clickhouse-warm-1", you would run the following command:
      rm -rf /data-clickhouse-warm-1
    5. (Optional) To remove the data or reuse the disk as a ClickHouse data disk, format the disk by running the following command: 
      wipefs -a <device path>
    6. After following the above steps for all disks used by ClickHouse, proceed to step 8.
  8. Login to the Supervisor GUI, and navigate to ADMIN > Settings > Database > ClickHouse Config.
  9. Select the target Worker and delete it from the existing shard by clicking - in the Row column.
  10. Click Test.
  11. Click Deploy.
  12. Navigate to ADMIN > License > Nodes.
  13. Select the target Worker.
  14. Click Delete to remove the target Worker.
  15. Wait for the Supervisor phMonitor process to come up.
  16. Re-add the target Worker into the License Node with the desired disk configuration, following the instructions for adding a Worker.
  17. Wait for the Supervisor phMonitor process to come up again.
  18. Navigate to ADMIN > Settings > Database > ClickHouse Config.
  19. Add the target Worker to the destination shard.
  20. Click Test.
  21. Click Deploy.
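
After the final Deploy, you can optionally verify that the Worker is registered in the destination shard by running the following query from clickhouse-client on any ClickHouse node. The cluster, shard, and host values in the output depend on your deployment.

clickhouse-client --query "SELECT cluster, shard_num, replica_num, host_name FROM system.clusters"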

Replacing a Worker with another Worker (within the same Shard)

Currently, the GUI allows you to replace one Worker (W1) with another Worker (W2) in the ClickHouse Configuration. However, clicking Test will fail because the Shard and Replica IDs are still in use by the previous Worker (W1).

Follow these steps to replace W1 with W2.

  1. Navigate to ADMIN > Settings > Database > ClickHouse Config.
  2. Note the ShardID and ReplicaID of W1. For example, if the GUI shows Shard 3 and Replica 2, the ShardID is 3 and the ReplicaID is 2. These values are needed later in Step 7.
  3. Delete W1 from the ClickHouse Cluster Table by clicking - from the Row column.
  4. Click Test.
  5. Click Deploy.
  6. Login to W1 and run the following SQL command in clickhouse-client shell to drop the events table.
    DROP TABLE fsiem.events_replicated
  7. Login to any ClickHouse Keeper node and run the following commands to delete the registry entries from the ClickHouse Keeper cluster (a sketch for first listing the existing paths appears after this procedure).
    /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.events/replicas/<ReplicaID>
    /opt/zookeeper/bin/zkCli.sh deleteall /clickhouse/tables/<ShardID>/fsiem.summary/replicas/<ReplicaID>
  8. Add W2 to the ClickHouse Cluster Table in the same place where W1 was. It can use the same Shard ID and Replica ID.
  9. Click Test.
  10. Click Deploy.
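
If you are unsure of the exact registry paths in Step 7, you can first list them from any ClickHouse Keeper node before deleting. As in Step 7, <ShardID> is a placeholder for the Shard ID noted in Step 2.

/opt/zookeeper/bin/zkCli.sh ls /clickhouse/tables/<ShardID>/fsiem.events/replicas
/opt/zookeeper/bin/zkCli.sh ls /clickhouse/tables/<ShardID>/fsiem.summary/replicas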

Recovering from Complete Loss of ClickHouse Keeper Cluster

Complete loss of the Keeper cluster may happen if you have only 1 Keeper node and it goes down.

A healthy ClickHouse Keeper node responds to the stat command like this.

[root@FSM-660-CH-58-246 ~]# echo stat | nc <IP> 2181
ClickHouse Keeper version: v22.6.1.1985-testing-7000c4e0033bb9e69050ab8ef73e8e7465f78059
Clients:
[::ffff:172.30.58.246]:36518(recved=0,sent=0)
 
Latency min/avg/max: 0/0/0
Received: 0
Sent: 0
Connections: 0
Outstanding: 0
Zxid: 145730
Mode: follower
Node count: 305

If you see logs indicating that the ClickHouse events table is in read-only mode, then the ClickHouse Keeper cluster needs to be restored:

grep PH_DATAMANAGER_HTTP_UPLOAD_ERROR /opt/phoenix/log/phoenix.log | grep TABLE_IS_READ_ONLY

2022-07-22T13:00:10.945816-07:00 FSM-Host phDataManager[9617]: [PH_DATAMANAGER_HTTP_UPLOAD_ERROR]:[eventSeverity]=PHL_ERROR,[procName]=phDataManager,[fileName]=ClickHouseWriterService.cpp,[lineNumber]=459,[errReason]=Uploading events to ClickHouse failed. respCode:500 resp:Code: 242. DB::Exception: Table is in readonly mode (replica path: /clickhouse/tables/1/fsiem.events/replicas/1). (TABLE_IS_READ_ONLY) (version 22.6.1.1985 (official build))

To recover, take the following steps:

  1. Login to the Supervisor's redis and delete the cached ClickHouse Keeper node entry:
    redis-cli -p 6666 -a $(grep pass /opt/phoenix/redis/conf/6666.conf | awk '{print $2}')
    127.0.0.1:6666> del cache:ClickHouse:clickhouseKeeperNodes
    (integer) 1

  2. Navigate to ADMIN > Settings > Database > ClickHouse Config, and replace the dead Worker with a new Worker in the ClickHouse Keeper cluster.
  3. Click Test.
  4. Click Deploy.
  5. Login to clickhouse-client on each ClickHouse node and execute the following commands.
    SYSTEM RESTART REPLICA fsiem.events_replicated
    SYSTEM RESTORE REPLICA fsiem.events_replicated
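
To optionally confirm that recovery succeeded, you can check that the events table is no longer in read-only mode on each ClickHouse node.

clickhouse-client --query "SELECT database, table, is_readonly FROM system.replicas WHERE table = 'events_replicated'"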

Recovering from Losing Quorum in ClickHouse Keeper Cluster

Quorum is lost when more than half of the nodes in the Keeper cluster go down.

To identify whether a ClickHouse Keeper node needs recovery, check the log or the command line.

From Log:

/data-clickhouse-hot-1/clickhouse-keeper/app_logs/clickhouse-keeper.err.log
 
2022.07.22 12:27:10.415055 [ 52865 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2022.07.22 12:27:10.415169 [ 52865 ] {} <Warning> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 0, live 0, dead 1), at least 2 nodes should respond. failure count 163

From Command Line:

[root@FSM-660-CH-58-246 ~]# echo stat | nc 172.30.58.216 2181
This instance is not currently serving requests

To recover, login to the ClickHouse Keeper node that needs recovery and run the following command.

echo rcvr | nc localhost 2181
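
After running the recovery command, re-check the node status with stat. Once recovery completes, the output should again show a Mode of leader or follower instead of the "This instance is not currently serving requests" message.

echo stat | nc localhost 2181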

Mitigating a Non-Responsive ClickHouse Keeper Node

To resolve a non-responsive ClickHouse Keeper node, take the following steps.

  1. Check the status of the Keeper cluster by running the following command on EACH Keeper node (a loop for checking all nodes at once is sketched after this procedure).
    echo stat | nc localhost 2181
  2. Restart any non-responsive Keeper by running the following command.
    systemctl restart ClickHouseKeeper
  3. If the command from step 2 does not resolve the problem, take the following steps:
    1. On the non-responsive ClickHouse Keeper node, edit /data-clickhouse-hot-1/clickhouse-keeper/.systemd_argconf so that it contains ONLY a single line: ARG1=--force-recovery. (Normally, the file contains ARG1=.)
    2. Restart the Keeper by running the following command.
      systemctl restart ClickHouseKeeper
    3. Next, run the following command to check the recovery status.
      echo stat | nc localhost 2181
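
As a convenience, the status check in step 1 can be run against all Keeper nodes from a single shell loop. The host names keeper1, keeper2, and keeper3 below are placeholders for your actual Keeper node addresses.

for h in keeper1 keeper2 keeper3; do
    echo "--- $h ---"
    echo stat | nc $h 2181
done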