Instructions for migrating history to ClickHouse
Step 1. Install ClickHouse DBMS following the official instructions.
The platform requires ClickHouse version 21.4.6 or newer.
Step 2. Create a new user by adding the configuration file /etc/clickhouse-server/users.d/altcraft.xml (see the User settings section of the ClickHouse documentation for details).
<yandex>
    <users>
        <altcraft>
            <password>abcdefghiasdwqedwqjklmnopqrstuvwxyzABCDEF</password>
            <networks>
                <ip>::/0</ip>
            </networks>
            <profile>default</profile>
            <quota>default</quota>
            <allow_databases>
                <database>altcraft_system</database>
            </allow_databases>
        </altcraft>
    </users>
</yandex>
The string "abcdefghiasdwqedwqjklmnopqrstuvwxyzABCDEF" is used here as a placeholder password. Be sure to generate your own stronger password of at least 32 characters.
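A password meeting the 32-character requirement can be generated from /dev/urandom. As a side note, ClickHouse also accepts a SHA-256 hash of the password via the password_sha256_hex element instead of a plaintext password. A minimal sketch, assuming coreutils (tr, head, sha256sum) are available:

```shell
# Generate a random 48-character alphanumeric password (assumes /dev/urandom).
PASSWORD=$(head -c 512 /dev/urandom | tr -dc 'A-Za-z0-9' | head -c 48)
echo "$PASSWORD"

# Optional: compute a SHA-256 hash of the password to use with
# <password_sha256_hex> in altcraft.xml instead of plaintext <password>.
printf '%s' "$PASSWORD" | sha256sum | awk '{print $1}'
```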
Step 3. Enable the input_format_import_nested_json option so that JSON data with nested objects can be inserted (see the Settings profiles section of the documentation for details).
<?xml version="1.0"?>
<yandex>
    <!-- Profiles of settings. -->
    <profiles>
        <!-- Default settings. -->
        <default>
            <!-- Maximum memory usage for processing a single query, in bytes. -->
            <max_memory_usage>8589934592</max_memory_usage>
            <max_memory_usage_for_user>8589934592</max_memory_usage_for_user>
            <max_bytes_before_external_group_by>4294967296</max_bytes_before_external_group_by>
            <max_bytes_before_external_sort>4294967296</max_bytes_before_external_sort>
            <receive_timeout>3600</receive_timeout>
            <send_timeout>3600</send_timeout>
            <max_threads>2</max_threads>
            <max_partitions_per_insert_block>10000</max_partitions_per_insert_block>
            <!-- Use the cache of uncompressed blocks of data. Meaningful only for processing many very short queries. -->
            <use_uncompressed_cache>0</use_uncompressed_cache>
            <!-- How to choose between replicas during distributed query processing.
                 random - choose a random replica from the set of replicas with the minimum number of errors
                 nearest_hostname - from the set of replicas with the minimum number of errors, choose the replica
                     with the minimum number of different characters between the replica's hostname and the local hostname
                     (Hamming distance)
                 in_order - the first live replica is chosen, in the specified order
                 first_or_random - if the first replica has a higher number of errors, pick a random one from the replicas with the minimum number of errors
            -->
            <load_balancing>random</load_balancing>
            <input_format_import_nested_json>1</input_format_import_nested_json>
        </default>
    </profiles>
</yandex>
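ClickHouse refuses to start if a configuration file is not well-formed XML, so it can save a restart cycle to validate the file before restarting the server. A minimal sketch, assuming python3 is available; the temporary file path is illustrative:

```shell
# Write a minimal settings-profile fragment to a temporary file (illustrative path);
# in practice you would point the check at your real file under /etc/clickhouse-server/.
cat > /tmp/altcraft_profile_check.xml <<'EOF'
<yandex>
  <profiles>
    <default>
      <input_format_import_nested_json>1</input_format_import_nested_json>
    </default>
  </profiles>
</yandex>
EOF

# Fail loudly if the XML is not well-formed.
python3 -c "import xml.etree.ElementTree as ET; ET.parse('/tmp/altcraft_profile_check.xml'); print('XML OK')"
```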
Step 4. In the main configuration file main.json, configure the settings for connecting to and working with the ClickHouse DBMS.
"CLICKHOUSE_SYSTEM": {
"HOST": "localhost",
"PORT": 9000,
"USER": "altcraft",
"MAX_CONNECTION": 100,
"CONN_TIMEOUT": 15,
"PASSWORD": "abcdefghiasdwqedwqjklmnopqrstuvwxyzABCDEF",
"IS_DEBUG": false
},
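After editing main.json, a quick syntax check avoids platform startup failures caused by a stray comma or quote. A minimal sketch that validates a sample fragment with python3's json.tool; the file path is illustrative, and in practice you would check your real main.json:

```shell
# Build a sample config containing the fragment above (illustrative path).
cat > /tmp/main_check.json <<'EOF'
{
  "CLICKHOUSE_SYSTEM": {
    "HOST": "localhost",
    "PORT": 9000,
    "USER": "altcraft",
    "MAX_CONNECTION": 100,
    "CONN_TIMEOUT": 15,
    "PASSWORD": "abcdefghiasdwqedwqjklmnopqrstuvwxyzABCDEF",
    "IS_DEBUG": false
  }
}
EOF

# Exits non-zero and prints the error location if the JSON is invalid.
python3 -m json.tool /tmp/main_check.json > /dev/null && echo "JSON OK"
```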
Step 5. Run the migration using the actions_migrate utility provided for this purpose. The full list of available options:
| Parameter | Default value | Description |
|---|---|---|
| --accounts string | All active accounts | List of accounts to import, separated by commas. Example: "1, 2, 3" |
| --appname bool | false | Show additional information about the utility |
| --batch_timeout int | 5 | Time in seconds after which the insert will be executed |
| --ch_batch int | 100000 | Minimum batch of events to insert |
| --ch_debug bool | false | Enable an additional logging level |
| --ch_host string | Taken from the main configuration file main.json | Host to connect to |
| --ch_password string | Taken from the main configuration file main.json | Password for the connection |
| --ch_port int | Taken from the main configuration file main.json | Connection port |
| --ch_user string | Taken from the main configuration file main.json | User to connect as |
| --config string | /opt/MarketingPlatform/config/main.json | Path to the main configuration file main.json |
| --exclude string | "" | List of accounts to exclude from import, separated by commas. Example: "1, 2, 3" |
| --level_log string | INFO | Logging level: TRACE, DEBUG, INFO, WARN, ERROR, CRITICAL, FATAL |
| --log_path string | ./actions_migrate.log | Location of the log file |
| --range string | "" | Time range to sample the existing history. The format is "MM.YY-MM.YY", "-MM.YY", or "MM.YY-" |
| --version bool | false | Show the utility version |
| --worker_size int | -1 | Number of workers (defaults to the number of vCPUs) |
Migration can take a long time, three days or more. To keep the migration running after disconnecting from the server, you can use the screen utility to work in a separate session. An example of using the utility is available here: https://access.redhat.com/articles/5247.
Example of transferring all historical data up to September 2021:
/opt/akd/ak/bin/actions_migrate --accounts "1" --config "/opt/akd/config/main.json" --batch_timeout 60 --ch_batch 500000 --range "-09.21"
Example of transferring all historical data for November 2021:
/opt/akd/ak/bin/actions_migrate --accounts "1" --config "/opt/akd/config/main.json" --batch_timeout 60 --ch_batch 500000 --range "11.21-11.21"
Example of transferring all historical data from January 2021 to the current moment:
/opt/akd/ak/bin/actions_migrate --accounts "1" --config "/opt/akd/config/main.json" --batch_timeout 60 --ch_batch 500000 --range "01.21-"
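When many accounts must be migrated one at a time, the invocations above can be scripted. A hedged sketch that only echoes each command (a dry run); the account list "1 2 3" and the paths are taken from the examples above and are illustrative:

```shell
#!/bin/sh
# Dry-run sketch: print one actions_migrate invocation per account.
# Remove the leading "echo" inside the loop to actually run the migrations.
BIN=/opt/akd/ak/bin/actions_migrate
CONFIG=/opt/akd/config/main.json

for ACCOUNT in 1 2 3; do
  echo "$BIN" --accounts "$ACCOUNT" --config "$CONFIG" \
    --batch_timeout 60 --ch_batch 500000 --range "-09.21"
done
```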