From 2199be795a31af1ce3723b36bfa4f066fce9a1b7 Mon Sep 17 00:00:00 2001
From: zhangyi
Date: Wed, 11 May 2022 18:44:10 +0800
Subject: [PATCH] modify the files

---
 tutorials/experts/source_en/dataset/cache.md | 582 ++++++++++--------
 tutorials/experts/source_en/dataset/eager.md |  44 +-
 .../experts/source_zh_cn/dataset/cache.ipynb |   2 +-
 3 files changed, 330 insertions(+), 298 deletions(-)

diff --git a/tutorials/experts/source_en/dataset/cache.md b/tutorials/experts/source_en/dataset/cache.md
index 6ba687e4f7..5bd68ade0c 100644
--- a/tutorials/experts/source_en/dataset/cache.md
+++ b/tutorials/experts/source_en/dataset/cache.md
@@ -1,9 +1,7 @@
-# Single-Node Tensor Cache
+# Single-Node Data Cache
 
-## Overview
-
 If you need to repeatedly access remote datasets or load datasets from disks, you can use the single-node cache operator to cache datasets in the local memory to accelerate dataset loading.
 
 The cache operator depends on the cache server started on the current node. Functioning as a daemon process and independent of the training script, the cache server is mainly used to manage cached data, including storing, querying, and loading data, and writing cached data when the cache is not hit.
 
@@ -24,324 +22,217 @@ Currently, the cache service supports only single-node cache. That is, the clien
 
 ![cache on map pipeline](./images/cache_processed_data.png)
 
- > You are advised to cache image data in `decode` + `resize` + `cache` mode. The data processed by `decode` can be directly cached only in single-node single-device mode.
+## Data Cache Process
 
-## Basic Cache Usage
+Before using the cache service, you need to install MindSpore and set the relevant environment variables.
 
-1. Configure the environment.
+> At present, the data cache can only be used in a Linux environment. Users of Ubuntu, EulerOS, and CentOS can refer to the [relevant tutorials](https://help.ubuntu.com/community/SwapFaq#How_do_I_add_a_swap_file.3F) to learn how to increase the swap memory space. In addition, since using the cache may cause a memory shortage on the server, it is recommended to increase the server's swap memory space to more than 100 GB before using the cache.
 
-    Before using the cache service, you need to install MindSpore and set related environment variables. The Conda environment is used as an example. The setting method is as follows:
+### 1. Start the cache server
 
-    ```text
-    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:{path_to_conda}/envs/{your_env_name}/lib/python3.7/site-packages/mindspore:{path_to_conda}/envs/{your_env_name}/lib/python3.7/site-packages/mindspore/lib
-    export PATH=$PATH:{path_to_conda}/envs/{your_env_name}/bin
-    ```
+Before using the single-node cache service, run the following command at the command line to start the cache server:
 
-    You can also set the environment with the following code.
+```bash +cache_admin --start +``` - ```python - import os - import sys - import mindspore - - python_path = "/".join(sys.executable.split("/")[:-1]) - mindspore_path = "/".join(mindspore.__file__.split("/")[:-1]) - mindspore_lib_path = os.path.join(mindspore_path, "lib") - - if 'PATH' not in os.environ: - os.environ['PATH'] = python_path - elif python_path not in os.environ['PATH']: - os.environ['PATH'] += ":" + python_path - print(os.environ['PATH']) - - os.environ['LD_LIBRARY_PATH'] = "{}:{}:{}".format(mindspore_path, mindspore_lib_path, mindspore_lib_path.split("python3.7")[0]) - print(os.environ['LD_LIBRARY_PATH']) - ``` +If the above information is output, it means that the cache server starts successfully. - > When the cache is used, the server memory may be insufficient. Therefore, you are advised to increase the swap memory space of the server to more than 100 GB before using the cache. For details about how to increase the swap memory space on Ubuntu, EulerOS, or CentOS, see [related tutorials](https://help.ubuntu.com/community/SwapFaq#How_do_I_add_a_swap_file.3F). +The preceding commands can use the `-h` and `-p` parameters to specify the server, or the user can specify it by configuring environment variables `MS_CACHE_HOST` and `MS_CACHE_PORT`. If not specified, the default operation is performed on servers with IP 127.0.0.1 and port number 50052. -2. Start the cache server. +The `ps -ef|grep cache_server` command can be used to check if the server is started and query server parameters. - Before using the single-node cache service, run the following command to start the cache server: +The `cache_admin --server_info` command can also be used to check a detailed list of parameters for the server. - ```bash - cache_admin --start - ``` +To enable the data overflow feature, the user must set the overflow path with the `-s` parameter when starting the cache server, or the feature is turned off by default. - If the following information is displayed, the cache server is started successfully: +```bash +cache_admin --server_info +``` - ```text - Cache server startup completed successfully! - The cache server daemon has been created as process id 10394 and is listening on port 50052 +The Cache Server Configuration table lists the IP address, port number, number of worker threads, log level, overflow path and other detailed configuration information of the current server. The Active sessions module displays a list of session IDs that are enabled in the current server. - Recommendation: - Since the server is detached into its own daemon process, monitor the server logs (under /tmp/mindspore/cache/log) for any issues that may happen after startup - ``` +The cache server log file is named in the format "cache_server.\.\.log.\.\.\". - `cache_admin` supports the following commands and options: - - `--start`: starts the cache server. The following options are supported: - - `--workers` or `-w`: specifies the number of worker threads on the cache server. By default, the number of worker threads is half of the number of CPUs. This parameter relies on the NUMA architecture of the server. The value will be adjusted automatically by the server if it's not a multiple of number of NUMA nodes in the machine. - - `--spilldir` or `-s`: specifies the disk file path for storing remaining data when the cached data size exceeds the memory space. The default value is '' (which means disabling spilling). - - `--hostname` or `-h`: specifies the IP address of the cache server. The default value is 127.0.0.1. 
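+
+For example, the most recent server log can be inspected as follows (a minimal illustration; the log directory is the default `/tmp/mindspore/cache/log` mentioned in the startup hint):
+
+```bash
+# list the cache server logs and show the tail of the newest one
+ls /tmp/mindspore/cache/log
+tail -n 50 "/tmp/mindspore/cache/log/$(ls -t /tmp/mindspore/cache/log | head -n 1)"
+```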
- - `--port` or `-p`: specifies the port number of the cache server. The default value is 50052. - - `--loglevel` or `-l`: sets the log level. The default value is 1 (WARNING). If this option is set to 0 (INFO), excessive logs will be generated, resulting in performance deterioration. - - `--stop`: stops the cache server. - - `--generate_session` or `-g`: generates a cache session. - - `--destroy_session` or `-d`: deletes a cache session. - - `--list_sessions`: displays the list of currently cached sessions and their details. - - `--server_info`:displays the configuration parameters and active session list of current server. - - `--help`: displays the help information. +When `GLOG_v=0`, DEBUG log may be displayed on the screen. - In the preceding options, you can use `-h` and `-p` to specify a server. Users can also set environment variables `MS_CACHE_HOST` and `MS_CACHE_PORT` to specify it. If hostname and port are not set, operations are performed on the server with the IP address 127.0.0.1 and port number 50052 by default. +### 2. Create the cache session - You can run the `ps -ef|grep cache_server` command to check whether the server is started and query server parameters. +If there is no cache session in the cache server, you need to create a cache session and get the cache session id: - You can also run the `cache_admin --server_info` command to get the full list of configuration of cache server. +```bash +cache_admin -g +``` - ```text - $ cache_admin --server_info - Cache Server Configuration: - ---------------------------------------- - config name value - ---------------------------------------- - hostname 127.0.0.1 - port 50052 - number of workers 16 - log level 1 - spill dir None - ---------------------------------------- - Active sessions: - No active sessions. - ``` +where 780643335 is the cache session id assigned to the server on port 50052, and the cache session id is assigned by the server. - Where, the table of Cache Server Configuration lists five detailed configuration information. Active sessions shows the list of active session ID in current server if any. +The `cache_admin --list_sessions` command can be used to check all cache session information existing in the current server. - Cache server generates log files with filename "cache_server.\.\.log.\.\.\". Note that there might be masses of DEBUG logs printed to the screen when `GLOG_v=0` is set. +```bash +cache_admin --list_sessions +``` - > - To enable data spilling, you need to use `-s` to set spilling path when starting cache server. Otherwise, this feature is default to be disabled and it will bring up a memory-only cache server. +Output parameters description: -3. Create a cache session. +- `Session`: cache session id. +- `Cache Id`: cache instance id in the current cache session. `n/a` indicates that the cache instance has not been created at the moment. +- `Mem cached`: the amount of data cached in memory. +- `Disk cached`: the amount of data cached in disk. +- `Avg cache size`: the average size of each row of data currently cached. +- `Numa hit`: the number of **Numa** hits. The higher value will get the better time performance. - If no cache session exists on the cache server, a cache session needs to be created to obtain the cache session ID. +### 3. 
Create a cache instance - ```text - $ cache_admin -g - Session created for server on port 50052: 1456416665 - ``` +In the Python training script, use the `DatasetCache` API to define a cache instance named `test_cache`, and put a cache session ID created in the previous step to the `session_id` parameter. - In the preceding command, 1456416665 is the cache session ID allocated by the server with port number 50052. +```python +import mindspore.dataset as ds - You can run the `cache_admin --list_sessions` command to view all cache sessions on the current server. +test_cache = ds.DatasetCache(session_id=1456416665, size=0, spilling=False) +``` - ```text - $ cache_admin --list_sessions - Listing sessions for server on port 50052 +`DatasetCache` supports the following parameters: - Session Cache Id Mem cached Disk cached Avg cache size Numa hit - 1456416665 n/a n/a n/a n/a n/a - ``` +- `session_id`: specifies the cache session ID, which can be created and obtained by running the `cache_admin -g` command. +- `size`: specifies the maximum memory space occupied by the cache. The unit is MB. For example, if the cache space is 512 GB, set `size=524288`. The default value is 0. +- `spilling`: determines whether to spill the remaining data to disks when the memory space exceeds the upper limit. The default value is False. +- `hostname`: specifies the IP address for connecting to the cache server. The default value is 127.0.0.1. +- `port`: specifies the port number for connecting to the cache server. The default value is 50052. +- `num_connections`: specifies the number of established TCP/IP connections. The default value is 12. +- `prefetch_size`: specifies the number of prefetched rows. The default value is 20. - Output parameter description: - - `Session`: specifies the cache session ID. - - `Cache Id`: specifies the ID of the cache instance in the current cache session. `n/a` indicates that no cache instance is created. - - `Mem cached`: specifies the cached data volume in the memory. - - `Disk cached`: specifies the cached data volume in the disk. - - `Avg cache size`: specifies the average size of each line of data in the current cache. - - `Numa hit`: specifies the number of NUMA hits. A larger value indicates better time performance. +The following things that needs to be noted: -4. Create a cache instance. +In actual use, you are advised to run the `cache_admin -g` command to obtain a cache session id from the cache server and use it as the parameter of `session_id` to prevent errors caused by cache session nonexistence. - In the Python training script, use the `DatasetCache` API to define a cache instance named `test_cache`, and specify the `session_id` parameter to a cache session ID created in the previous step. +`size=0` indicates that the memory space used by the cache is not limited manually, but automically controlled by the cache server according to system's total memory resources, and cache server's memory usage would be limited to within 80% of the total system memory. - ```python - import mindspore.dataset as ds +Users can also manually set `size` to a proper value based on the idle memory of the machine. Note that before setting the `size` parameter, make sure to check the available memory of the system and the size of the dataset to be loaded. 
If the memory space occupied by the cache_server or the space of the dataset to be loaded exceeds the available memory of the system, it may cause problems such as machine downtime/restart, automatic shutdown of cache_server, and failure of training process execution. - test_cache = ds.DatasetCache(session_id=1456416665, size=0, spilling=False) - ``` +`spilling=True` indicates that the remaining data is written to disks when the memory space is insufficient. Therefore, ensure that you have the writing permission and the sufficient disk space on the configured disk path is to store the cache data that spills to the disk. Note that if no spilling path is set when cache server starts, setting `spilling=True` will raise an error when calling the API. - `DatasetCache` supports the following parameters: - - `session_id`: specifies the cache session ID, which can be created and obtained by running the `cache_admin -g` command. - - `size`: specifies the maximum memory space occupied by the cache. The unit is MB. For example, if the cache space is 512 GB, set `size` to `524288`. The default value is 0. - - `spilling`: determines whether to spill the remaining data to disks when the memory space exceeds the upper limit. The default value is False. - - `hostname`: specifies the IP address for connecting to the cache server. The default value is 127.0.0.1. - - `port`: specifies the port number for connecting to the cache server. The default value is 50052. - - `num_connections`: specifies the number of established TCP/IP connections. The default value is 12. - - `prefetch_size`: specifies the number of prefetched rows. The default value is 20. +`spilling=False` indicates that no data is written once the configured memory space is used up on the cache server. - > - In actual use, you are advised to run the `cache_admin -g` command to obtain a cache session ID from the cache server and use it as the parameter of `session_id` to prevent errors caused by cache session nonexistence. - > - `size=0` indicates that the memory space used by the cache is not limited manually, but automically controlled by the cache_server according to system's total memory resources, and cache server's memory usage would be limited to within 80% of the total system memory. - > - Users can also manually set `size` to a proper value based on the idle memory of the machine. Note that before setting the `size` parameter, make sure to check the available memory of the system and the size of the dataset to be loaded. If the memory of cache_server or the dataset size exceeds the available memory of the system, the server may break down or restart, it may also automatically shut down, or the training process may fail. - > - `spilling=True` indicates that the remaining data is written to disks when the memory space is insufficient. Therefore, ensure that you have the write permission on the configured disk path and the disk space is sufficient to store the remaining cache data. Note that if no spilling path is set when cache server starts, setting `spilling=True` will raise an error when calling the API. - > - `spilling=False` indicates that no data is written once the configured memory space is used up on the cache server. - > - If a dataset that does not support random access (such as `TFRecordDataset`) is used to load data and the cache service is enabled, ensure that the entire dataset is stored locally. In this scenario, if the local memory space is insufficient to store all data, spilling must be enabled to spill data to disks. 
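+
+For example, a size-limited cache that is allowed to spill to disk can be defined as follows (a minimal sketch: it assumes the cache server was started with a spilling path via `-s` and reuses the cache session id created above; the 4 GB limit is only an illustrative value):
+
+```python
+import mindspore.dataset as ds
+
+# size is in MB, so 4096 caps memory usage at about 4 GB before spilling to disk
+spill_cache = ds.DatasetCache(session_id=1456416665, size=4096, spilling=True)
+```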
- > - `num_connections` and `prefetch_size` are internal performance tuning parameters. Generally, you do not need to set these two parameters. +If a dataset that does not support random access (such as `TFRecordDataset`) is used to load data and the cache service is enabled, ensure that the entire dataset is stored locally. In this scenario, if the local memory space is insufficient to store all data, spilling must be enabled to spill data to disks. -5. Insert a cache instance. +### 4. Insert a cache instance - Currently, the cache service can be used to cache both original datasets and datasets processed by argumentation. The following example shows two usage methods. +Currently, the cache service can be used to cache both original datasets and datasets processed by argumentation. The following example shows two usage methods. - Note that you need to create a cache instance for each of the two examples according to step 4, and use the created `test_cache` as the `cache` parameter in the dataset loading operator or map operator. +Note that both examples need to create a cache instance according to the method in step 3, and pass in the created `test_cache` as `cache` parameters in the dataset load or map operator. - CIFAR-10 dataset is used in the following two examples. +CIFAR-10 dataset is used in the following two examples. - ```text - ./datasets/cifar-10-batches-bin - ├── readme.html - ├── test - │ └── test_batch.bin - └── train - ├── batches.meta.txt - ├── data_batch_1.bin - ├── data_batch_2.bin - ├── data_batch_3.bin - ├── data_batch_4.bin - └── data_batch_5.bin - ``` +```python +from mindvision import dataset - ```python - import os - import requests - import tarfile - import zipfile - import shutil - - requests.packages.urllib3.disable_warnings() - - def download_dataset(url, target_path): - """ download and unzip the dataset """ - if not os.path.exists(target_path): - os.makedirs(target_path) - download_file = url.split(\"/\")[-1] - if not os.path.exists(download_file): - res = requests.get(url, stream=True, verify=False) - if download_file.split(\".\")[-1] not in [\"tgz\", \"zip\", \"tar\", \"gz\"]: - download_file = os.path.join(target_path, download_file) - with open(download_file, \"wb\") as f: - for chunk in res.iter_content(chunk_size=512): - if chunk: - f.write(chunk) - if download_file.endswith(\"zip\"): - z = zipfile.ZipFile(download_file, \"r\") - z.extractall(path=target_path) - z.close() - if download_file.endswith(\".tar.gz\") or download_file.endswith(\".tar\") or download_file.endswith(\".tgz\"): - t = tarfile.open(download_file) - names = t.getnames() - for name in names: - t.extract(name, target_path) - t.close() - print(\"The {} file is downloaded and saved in the path {} after processing\".format(os.path.basename(url), target_path)) - - download_dataset(\"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-binary.tar.gz\", \"./datasets\") - test_path = \"./datasets/cifar-10-batches-bin/test\" - train_path = \"./datasets/cifar-10-batches-bin/train\" - os.makedirs(test_path, exist_ok=True) - os.makedirs(train_path, exist_ok=True) - if not os.path.exists(os.path.join(test_path, \"test_batch.bin\")): - shutil.move(\"./datasets/cifar-10-batches-bin/test_batch.bin\", test_path) - [shutil.move(\"./datasets/cifar-10-batches-bin/\"+i, train_path) for i in os.listdir(\"./datasets/cifar-10-batches-bin/\") if os.path.isfile(\"./datasets/cifar-10-batches-bin/\"+i) and not i.endswith(\".html\") and not os.path.exists(os.path.join(train_path, 
i))] - ``` +dl_path = "./datasets" +data_dir = "./datasets/cifar-10-batches-bin/" +dl_url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-binary.tar.gz" - - Cache the original loaded dataset. +dl = dataset.DownLoad() # download CIFAR-10 dataset +dl.download_and_extract_archive(url=dl_url, download_path=dl_path) +``` - ```python - dataset_dir = "./datasets/cifar-10-batches-bin/train" +The directory structure of the extracted dataset file is as follows: - # apply cache to dataset - data = ds.Cifar10Dataset(dataset_dir=dataset_dir, num_samples=4, shuffle=False, num_parallel_workers=1, cache=test_cache) +```text +./datasets/cifar-10-batches-bin +├── readme.html +├── test +│ └── test_batch.bin +└── train + ├── batches.meta.txt + ├── data_batch_1.bin + ├── data_batch_2.bin + ├── data_batch_3.bin + ├── data_batch_4.bin + └── data_batch_5.bin +``` - num_iter = 0 - for item in data.create_dict_iterator(num_epochs=1): # each data is a dictionary - # in this example, each dictionary has a key "image" - print("{} image shape: {}".format(num_iter, item["image"].shape)) - num_iter += 1 - ``` +#### Cache the original dataset data - The output is as follows: +Cache the original dataset, and the datat is loaded by the MindSpore system. - ```text - 0 image shape: (32, 32, 3) - 1 image shape: (32, 32, 3) - 2 image shape: (32, 32, 3) - 3 image shape: (32, 32, 3) - ``` +```python +dataset_dir = "./datasets/cifar-10-batches-bin/train" - You can run the `cache_admin --list_sessions` command to check whether there are four data records in the current session. If yes, the data is successfully cached. +# apply cache to dataset +data = ds.Cifar10Dataset(dataset_dir=dataset_dir, num_samples=4, shuffle=False, num_parallel_workers=1, cache=test_cache) - ```text - $ cache_admin --list_sessions - Listing sessions for server on port 50052 +num_iter = 0 +for item in data.create_dict_iterator(num_epochs=1): # each data is a dictionary + # in this example, each dictionary has a key "image" + print("{} image shape: {}".format(num_iter, item["image"].shape)) + num_iter += 1 +``` - Session Cache Id Mem cached Disk cached Avg cache size Numa hit - 1456416665 821590605 4 n/a 3226 4 - ``` +You can run the `cache_admin --list_sessions` command to check whether there are four data records in the current session. If yes, the data is successfully cached. - - Cache the data processed by argumentation. +```bash +cache_admin --list_sessions +``` - ```python - import mindspore.dataset.vision.c_transforms as c_vision +#### Cache the data processed by argumentation - dataset_dir = "cifar-10-batches-bin/" +Cache data after data enhancement processing `transforms`. 
- # apply cache to dataset - data = ds.Cifar10Dataset(dataset_dir=dataset_dir, num_samples=5, shuffle=False, num_parallel_workers=1) +```python +import mindspore.dataset.vision.c_transforms as c_vision - # apply cache to map - rescale_op = c_vision.Rescale(1.0 / 255.0, -1.0) - data = data.map(input_columns=["image"], operations=rescale_op, cache=test_cache) +dataset_dir = "cifar-10-batches-bin/" - num_iter = 0 - for item in data.create_dict_iterator(num_epochs=1): # each data is a dictionary - # in this example, each dictionary has a keys "image" - print("{} image shape: {}".format(num_iter, item["image"].shape)) - num_iter += 1 - ``` +# apply cache to dataset +data = ds.Cifar10Dataset(dataset_dir=dataset_dir, num_samples=5, shuffle=False, num_parallel_workers=1) - The output is as follows: +# apply cache to map +rescale_op = c_vision.Rescale(1.0 / 255.0, -1.0) +data = data.map(input_columns=["image"], operations=rescale_op, cache=test_cache) - ```text - 0 image shape: (32, 32, 3) - 1 image shape: (32, 32, 3) - 2 image shape: (32, 32, 3) - 3 image shape: (32, 32, 3) - 4 image shape: (32, 32, 3) - ``` +num_iter = 0 +for item in data.create_dict_iterator(num_epochs=1): # each data is a dictionary + # in this example, each dictionary has a keys "image" + print("{} image shape: {}".format(num_iter, item["image"].shape)) + num_iter += 1 +``` - You can run the `cache_admin --list_sessions` command to check whether there are five data records in the current session. If yes, the data is successfully cached. +```text +0 image shape: (32, 32, 3) +1 image shape: (32, 32, 3) +2 image shape: (32, 32, 3) +3 image shape: (32, 32, 3) +4 image shape: (32, 32, 3) +``` - ```text - $ cache_admin --list_sessions - Listing sessions for server on port 50052 +You can run the `cache_admin --list_sessions` command to check whether there are five data records in the current session. If yes, the data is successfully cached. - Session Cache Id Mem cached Disk cached Avg cache size Numa hit - 1456416665 3618046178 5 n/a 12442 5 - ``` +```bash +cache_admin --list_sessions +``` -6. Destroy the cache session. +### 5. Destroy the cache session - After the training is complete, you can destroy the current cache and release the memory. +After the training is complete, you can destroy the current cache and release the memory. - ```text - $ cache_admin --destroy_session 1456416665 - Drop session successfully for server on port 50052 - ``` +```bash +cache_admin --destroy_session 780643335 +``` - The preceding command is used to destroy the cache with the session ID 1456416665 on the server with the port number 50052. +The preceding command is used to destroy the cache with the session ID 1456416665 on the server with the port number 50052. - If you choose not to destroy the cache, the cached data still exists in the cache session. You can use the cache when starting the training script next time. +If you choose not to destroy the cache, the cached data still exists in the cache session. You can use the cache when starting the training script next time. -7. Stop the cache server. +### 6. Stop the cache server - After using the cache server, you can stop it. This operation will destroy all cache sessions on the current server and release the memory. +After using the cache server, you can stop it. This operation will destroy all cache sessions on the current server and release the memory. - ```text - $ cache_admin --stop - Cache server on port 50052 has been stopped successfully. 
- ``` +```bash +cache_admin --stop +``` - The preceding command is used to shut down the server with the port number 50052. +The preceding command is used to shut down the server with the port number 50052. - If you choose not to shut down the server, the cache sessions on the server will be retained for future use. During the next training, you can create a cache session or reuse the existing cache. +If you choose not to shut down the server, the cache sessions on the server will be retained for future use. During the next training, you can create a cache session or reuse the existing cache. ## Cache Sharing @@ -349,20 +240,20 @@ During the single-node multi-device distributed training, the cache operator all 1. Start the cache server. - ```text - $ cache_admin --start + ```bash + $cache_admin --start + ``` + Cache server startup completed successfully! The cache server daemon has been created as process id 39337 and listening on port 50052 - Recommendation: - Since the server is detached into its own daemon process, monitor the server logs (under /tmp/mindspore/cache/log) for any issues that may happen after startup - ``` + Since the server is detached into its own daemon process, monitor the server logs (under/tmp/mindspore/cache/log) for any issues that may happen after startup 2. Create a cache session. Create the shell script `cache.sh` for starting Python training and run the following command to generate a cache session ID: - ```bash + ```shell #!/bin/bash # This shell script will launch parallel pipelines @@ -386,7 +277,7 @@ During the single-node multi-device distributed training, the cache operator all session_id=$(echo $result | awk '{print $NF}') ``` -3. Pass the cache session ID to the training script. +3. Pass the cache session id to the training script. Continue to write the shell script and add the following command to pass `session_id` and other parameters when the Python training is started: @@ -399,11 +290,11 @@ During the single-node multi-device distributed training, the cache operator all done ``` - > Complete sample code: [cache.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/cache/cache.sh) + > Complete sample code: [cache.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/cache/cache.sh). 4. Create and apply a cache instance. - CIFAR-10 dataset is used in the following example. The directory structure is as follows: + CIFAR-10 dataset is used in the following example. ```text ├─cache.sh @@ -448,7 +339,7 @@ During the single-node multi-device distributed training, the cache operator all Execute the shell script `cache.sh` to enable distributed training. - ```text + ```bash $ sh cache.sh cifar-10-batches-bin/ Got 4 samples on device 0 Got 4 samples on device 1 @@ -458,38 +349,175 @@ During the single-node multi-device distributed training, the cache operator all You can run the `cache_admin --list_sessions` command to check whether only one group of data exists in the current session. If yes, cache sharing is successful. - ```text + ```bash $ cache_admin --list_sessions Listing sessions for server on port 50052 - Session Cache Id Mem cached Disk cached Avg cache size Numa hit - 3392558708 821590605 16 n/a 3227 16 + Session Cache Id Mem cached Disk cached Avg cache size Numa hit + 3392558708 821590605 16 n/a 3227 16 ``` 6. Destroy the cache session. After the training is complete, you can destroy the current cache and release the memory. 
-    ```text
+    ```bash
     $ cache_admin --destroy_session 3392558708
     Drop session successfully for server on port 50052
     ```
 
-7. Stop the cache server.
+7. Stop the cache server. After using the cache server, you can stop it.
 
-    After using the cache server, you can stop it.
-
-    ```text
+    ```bash
     $ cache_admin --stop
     Cache server on port 50052 has been stopped successfully.
     ```
 
-## Limitations
+## Cache Acceleration
 
-- Currently, dataset classes such as `GraphDataset`, `GeneratorDataset`, `PaddedDataset`, and `NumpySlicesDataset` do not support cache. `GeneratorDataset`, `PaddedDataset`, and `NumpySlicesDataset` belong to `GeneratorOp`, so their error message is displayed as "There is currently no support for GeneratorOp under cache."
-- Data processed by `batch`, `concat`, `filter`, `repeat`, `skip`, `split`, `take`, and `zip` does not support cache.
-- Data processed by random data argumentation operations (such as `RandomCrop`) does not support cache.
-- The same cache instance cannot be nested in different locations of the same pipeline.
+In order to share large datasets among multiple servers and relieve the disk space requirements of a single server, users can choose to use NFS (Network File System) to store datasets (for the construction and configuration of NFS storage servers, see [HUAWEI CLOUD-NFS Storage Server Setup](https://www.huaweicloud.com/articles/14fe58d0991fb2dfd2633a1772c175fc.html)).
+
+However, access to an NFS dataset is often expensive, which makes training with an NFS dataset take longer.
+
+To improve the training performance on NFS datasets, we can use the cache service to cache the dataset in memory as tensors.
+
+Once cached, subsequent epochs can read data directly from memory, avoiding the overhead of accessing the remote NFS storage.
+
+It should be noted that, during the data processing of training, the dataset usually needs to be augmented with random operations after loading, such as `RandomCropDecodeResize`. If the cache is added after an operation with randomness, the results of the first augmentation run will be cached, and later epochs will read those first cached results from the cache server, which loses the randomness of the data and affects the accuracy of the trained network.
+
+Therefore, we can choose to add the cache directly after the dataset loading operator. This section takes this approach, using the MobileNetV2 network as an example.
+
+For complete sample code, refer to ModelZoo's [MobileNetV2](https://gitee.com/mindspore/models/tree/master/official/cv/mobilenetv2).
+
+1. Create the shell script `cache_util.sh` for managing the cache:
+
+    ```bash
+    bootup_cache_server()
+    {
+      echo "Booting up cache server..."
+      result=$(cache_admin --start 2>&1)
+      echo "${result}"
+    }
+
+    generate_cache_session()
+    {
+      result=$(cache_admin -g | awk 'END {print $NF}')
+      echo "${result}"
+    }
+    ```
+
+    > Complete sample code: [cache_util.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/cache/cache_util.sh).
+
+2. In the shell script `run_train_nfs_cache.sh` that starts training with the dataset located on NFS, start the cache server and generate a cache session, saving its id in the shell variable `CACHE_SESSION_ID`:
+
+    ```bash
+    CURPATH="$(dirname "$0")"
+    source ${CURPATH}/cache_util.sh
+
+    bootup_cache_server
+    CACHE_SESSION_ID=$(generate_cache_session)
+    ```
+
+3. 
Pass in the `CACHE_SESSION_ID` and other parameters when starting Python training: + + ```text + python train.py \ + --platform=$1 \ + --dataset_path=$5 \ + --pretrain_ckpt=$PRETRAINED_CKPT \ + --freeze_layer=$FREEZE_LAYER \ + --filter_head=$FILTER_HEAD \ + --enable_cache=True \ + --cache_session_id=$CACHE_SESSION_ID \ + &> log$i.log & + ``` + +4. In the `train_parse_args()` function of Python's parameter parsing script `args.py`, the incoming `cache_session_id` is received by the following code: + + ```python + cache_session_id: + + import argparse + + def train_parse_args(): + ... + train_parser.add_argument('--enable_cache', + type=ast.literal_eval, + default=False, + help='Caching the dataset in memory to speedup dataset processing, default is False.') + train_parser.add_argument('--cache_session_id', + type=str, + default="", + help='The session id for cache service.') + train_args = train_parser.parse_args() + ``` + + And call the `train_parse_args()` function in Python's training script `train.py` to parse the parameters such as the `cache_session_id` passed in, and pass it in as an argument when defining the dataset `dataset`. + + ```python + from src.args import train_parse_args + args_opt = train_parse_args() + + dataset = create_dataset( + dataset_path=args_opt.dataset_path, + do_train=True, + config=config, + enable_cache=args_opt.enable_cache, + cache_session_id=args_opt.cache_session_id) + ``` + +5. In the Python script `dataset.py` that defines the data processing process, an instance of `DatasetCache` is created and inserted after `ImageFolderDataset` based on the parameters `enable_cache` and `cache_session_id` passed in: + + ```python + def create_dataset(dataset_path, do_train, config, repeat_num=1, enable_cache=False, cache_session_id=None): + ... + if enable_cache: + nfs_dataset_cache = ds.DatasetCache(session_id=int(cache_session_id), size=0) + else: + nfs_dataset_cache = None + + if config.platform == "Ascend": + rank_size = int(os.getenv("RANK_SIZE", '1')) + rank_id = int(os.getenv("RANK_ID", '0')) + if rank_size == 1: + data_set = ds.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=True, cache=nfs_dataset_cache) + else: + data_set = ds.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=True, num_shards=rank_size, shard_id=rank_id, cache=nfs_dataset_cache) + ``` + +6. Run `run_train_nfs_cache.sh`, and obtain the following results: + + ```text + epoch: [ 0/ 200], step:[ 2134/ 2135], loss:[4.682/4.682], time:[3364893.166], lr:[0.780] + epoch time: 3384387.999, per step time: 1585.193, avg loss: 4.682 + epoch: [ 1/ 200], step:[ 2134/ 2135], loss:[3.750/3.750], time:[430495.242], lr:[0.724] + epoch time: 431005.885, per step time: 201.876, avg loss: 4.286 + epoch: [ 2/ 200], step:[ 2134/ 2135], loss:[3.922/3.922], time:[420104.849], lr:[0.635] + epoch time: 420669.174, per step time: 197.035, avg loss: 3.534 + epoch: [ 3/ 200], step:[ 2134/ 2135], loss:[3.581/3.581], time:[420825.587], lr:[0.524] + epoch time: 421494.842, per step time: 197.421, avg loss: 3.417 + ... 
+ ``` + + The following table shows the average epoch time on gpu servers of using cache versus or not using cache: + + ```text + | 4p, MobileNetV2, imagenet2012 | without cache | with cache | + | ---------------------------------------- | ------------- | ---------- | + | first epoch time | 1649s | 3384s | + | average epoch time (exclude first epoch) | 458s | 421s | + ``` + + You can see that after using the cache, the completion time of the first epoch increases more than if the cache is not used, which is mainly due to the overhead of writing cache data to the cache server. However, each subsequent epoch after caching data writes can get a large performance gain. Therefore, the greater the total number of episodes trained, the more pronounced the benefits of using the cache. + + Taking running 200 epochs as an example, using caching can reduce the total end-to-end training time from 92791 seconds to 87163 seconds, saving a total of about 5628 seconds. + +7. When finish using, you can choose to shut down the cache server: + + ```text + $ cache_admin --stop + Cache server on port 50052 has been stopped successfully. + ``` ## Cache Performance Tuning @@ -504,3 +532,11 @@ However, we may not benefit from cache in the following scenarios: - Too much cache spilling will deteriorate the time performance. Therefore, try not to spill cache to disks when datasets that support random access (such as `ImageFolderDataset`) are used for data loading. - Using cache on NLP network such as Bert does not perform. In the NLP scenarios, there are usually no high complexity data augmentation operations like decode. - There is expectable startup overhead when using cache in non-mappable datasets like `TFRecordDataset`. According to the current design, it is required to cache all rows to the cache server before the first epoch of training. So the first epoch time can be longer than the non-cache case. + +## Limitations + +- Currently, dataset classes such as `GraphDataset`, `GeneratorDataset`, `PaddedDataset`, and `NumpySlicesDataset` do not support cache. `GeneratorDataset`, `PaddedDataset`, and `NumpySlicesDataset` belong to `GeneratorOp`, so their error message is displayed as "There is currently no support for GeneratorOp under cache." +- Data processed by `batch`, `concat`, `filter`, `repeat`, `skip`, `split`, `take`, and `zip` does not support cache. +- Data processed by random data argumentation operations (such as `RandomCrop`) does not support cache. +- The same cache instance cannot be nested in different locations of the same pipeline. + diff --git a/tutorials/experts/source_en/dataset/eager.md b/tutorials/experts/source_en/dataset/eager.md index 58ba063b63..32d8df4736 100644 --- a/tutorials/experts/source_en/dataset/eager.md +++ b/tutorials/experts/source_en/dataset/eager.md @@ -2,43 +2,41 @@ -When resource conditions permit, in order to pursue higher performance, data transformations are generally executed in the data pipeline mode. That is, users have to define the `map` operator which helps to execute augmentations in data pipeline. As shown in the figure below, the `map` operator contains 3 transformations: `Resize`, `Crop`, and `HWC2CHW`. When the pipeline starts, the `map` operator will apply these transformations to data in sequence. +When resource conditions permit, in order to pursue higher performance, data augmentation operators are generally executed in the data pipeline mode. + +The biggest character of execution based on data pipelinemode users have to define the `map` operator. 
As shown in the figure below, the `Resize`, `Crop`, `HWC2CHW` operators are scheduled by the `map` operator, which is responsible for starting and executing the given data augmentation operators, and mapping and transforming the data of the data pipeline. ![pipelinemode1](./images/pipeline_mode_en.jpeg) -Although the data pipeline can process input data quickly, the code of defining pipeline seems heavy while sometimes users just want to focus on the data transformations and perform them on small-scale data. In this case, data pipeline is not necessary. +Although constructing a data pipeline can process input data in batches, the API design of the data pipeline requires the user to start from constructing the input source, and gradually defines the individual processing operators in the data pipeline. Only when defining the `map` will it involve data augmentation operators that are highly related to the user input data. + +Undoubtedly, users only want to focus on the code that is most relevant to them, but other codes with less relevance add unnecessary burdens to the user throughout the code scene. -Therefore, MindSpore provides a lightweight data processing way to execute these data augmentations, called `Eager mode`. +Therefore, MindSpore provides a lightweight data processing way, called Eager mode. -In `Eager mode`, the execution of data augmentations will not rely on the `map` operator but can be called directly as callable functions. The code will be simpler since the results are obtained immediately. It is recommended to be used in lightweight scenarios such as small data enhancement experiments and model inference. +In the Eager mode, the execution of data augmentations will not rely on the `map` operator. Instead, the data augmentation operator is executed in the form of a functional call. The code will be simpler and the results are obtained immediately. It is recommended to be used in lightweight scenarios such as small data augmentation experiments and model inference. ![eagermode1](./images/eager_mode_en.jpeg) -MindSpore currently supports executing various data augmentations in `Eager mode`, as shown below. For more details, please refer to the API documentation. +MindSpore currently supports executing various data augmentation operators in the Eager mode, as shown below. For more details, please refer to the API documentation. - [vision module](https://www.mindspore.cn/docs/en/master/api_python/mindspore.dataset.vision.html) - - - Submodule c_transforms, an image enhancement operator based on OpenCV. - - Submodule py_transforms, an image enhancement operator based on Pillow. +- Submodule c_transforms, an image augmentation operator implemented based on OpenCV. +- Submodule py_transforms, an image augmentation operator implemented based on Pillow. - [text module](https://www.mindspore.cn/docs/en/master/api_python/mindspore.dataset.text.html#mindspore-dataset-text-transforms) - - - Submodule transforms, text processing operators. - +- Submodule transforms, text processing operators. - [transforms module](https://www.mindspore.cn/docs/en/master/api_python/mindspore.dataset.transforms.html) +- Submodule c_transforms, a general-purpose data augmentation operator implemented based on C++. +- Submodule py_transforms, a general-purpose data augmentation operator implemented based on Python. - - Submodule c_transforms, a general-purpose data enhancement operator based on C++. - - Submodule py_transforms, a general-purpose data augmentation operator based on Python. 
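+
+For instance, a `c_transforms` vision operator can be called directly as a function in the Eager mode (a minimal sketch; the random array below only stands in for decoded image data):
+
+```python
+import numpy as np
+import mindspore.dataset.vision.c_transforms as c_vision
+
+# a fake 64x64 RGB image used only for illustration
+img = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
+
+resize_op = c_vision.Resize(size=(32, 32))
+resized = resize_op(img)  # executed immediately, no pipeline or map operator required
+print(resized.shape)      # (32, 32, 3)
+```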
- -## example +## Eager Mode -The following example introduces how to execute data augmentations in `Eager mode`. +The following is a brief introduction to the use of the Eager mode for data augmentation operators of each module. With the Eager mode, you only need to treat the data augmentation operator itself as an executable function. -> To use `Eager mode`, just treat the data augmentations as an executable function and call them directly. +### Data Preparation -### data preparation - -Download the image and save it to the specified location. +The following sample code downloads the image data to the specified location. ```python import os @@ -121,11 +119,9 @@ The following shows the processed image. ![eager_mode](./images/eager_mode.png) -Augmentation operators that support to be run in Eager Mode are listed as follows: [mindspore.dataset.transforms](https://www.mindspore.cn/docs/en/master/api_python/mindspore.dataset.transforms.html), [mindspore.dataset.vision](https://www.mindspore.cn/docs/en/master/api_python/mindspore.dataset.vision.html), [mindspore.dataset.text.transforms](https://www.mindspore.cn/docs/en/master/api_python/mindspore.dataset.text.html#mindspore-dataset-text-transforms). - ### text -This example will transform the given text using the `tranforms` operator in the `text` module. +This example will transform the given text by using the `tranforms` operator in the `text` module. Eager mode of the text operator supports `numpy.array` type data as input parameters. @@ -152,7 +148,7 @@ ToNumber result: [123456], type: ### transforms -This example will transform the given data using the operators of `c_tranforms` in the `transforms` module. +This example will transform the given data by using the `c_tranforms` operator in the `transforms` module. Eager mode of transforms operator supports `numpy.array` type data as input parameters. diff --git a/tutorials/experts/source_zh_cn/dataset/cache.ipynb b/tutorials/experts/source_zh_cn/dataset/cache.ipynb index b6fade0376..176e3835d0 100644 --- a/tutorials/experts/source_zh_cn/dataset/cache.ipynb +++ b/tutorials/experts/source_zh_cn/dataset/cache.ipynb @@ -971,4 +971,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} +} \ No newline at end of file -- Gitee