diff --git a/tutorials/experts/source_en/dataset/optimize.ipynb b/tutorials/experts/source_en/dataset/optimize.ipynb
index 948289897e6c7d8637a0cac12155c67218e0b7a7..cf729da86c59e84116d9ed20a2abd8e9b6cd184b 100644
--- a/tutorials/experts/source_en/dataset/optimize.ipynb
+++ b/tutorials/experts/source_en/dataset/optimize.ipynb
@@ -12,9 +12,7 @@
 {
 "cell_type": "markdown",
 "source": [
-    "## Overview\n",
-    "\n",
-    "Data is the most important factor of deep learning. Data quality determines the upper limit of deep learning result, whereas model quality enables the result to approach the upper limit. Therefore, high-quality data input is beneficial to the entire deep neural network. During the entire data processing and data augmentation process, data continuously flows through a pipeline to the training system."
+    "Data is the most important element of deep learning: the quality of the data determines the upper limit of the final result, while the quality of the model only determines how closely that upper limit is approached. High-quality data input therefore benefits the entire deep neural network. Throughout data processing and data augmentation, the data flows continuously to the training system, like water through a pipeline, as shown in the figure:"
 ],
 "metadata": {}
 },
@@ -28,15 +26,17 @@
 {
 "cell_type": "markdown",
 "source": [
-    "MindSpore provides data processing and data augmentation functions for users. In the pipeline process, if each step can be properly used, the data performance will be greatly improved. This section describes how to optimize performance during data loading, data processing, and data augmentation based on the [CIFAR-10 dataset[1]](#references).\n",
+    "MindSpore provides data processing and data augmentation functions for users. In the pipeline process, if each step can be properly used, the data performance will be greatly improved.\n",
+    "\n",
+    "This section describes how to optimize performance during data loading, data processing, and data augmentation based on the CIFAR-10 dataset.\n",
 "\n",
 "In addition, the storage, architecture and computing resources of the operating system will influence the performance of data processing to a certain extent.\n",
 "\n",
-    "## Preparations\n",
+    "## Downloading the Dataset\n",
 "\n",
-    "### Importing Modules\n",
+    "Run the following code to obtain the dataset.\n",
 "\n",
-    "The `dataset` module provides APIs for loading and processing datasets."
+    "It downloads the CIFAR-10 dataset in binary format and extracts it to the `./datasets/` directory, where it is used when the data is loaded."
 ],
 "metadata": {}
 },
@@ -44,90 +44,23 @@
 "cell_type": "code",
 "execution_count": 1,
 "source": [
-    "import mindspore.dataset as ds"
-    ],
-    "outputs": [],
-    "metadata": {}
-  },
-  {
-    "cell_type": "markdown",
-    "source": [
-    "The `numpy` module is used to generate ndarrays."
-    ],
-    "metadata": {}
-  },
-  {
-    "cell_type": "code",
-    "execution_count": 2,
-    "source": [
-    "import numpy as np"
-    ],
-    "outputs": [],
-    "metadata": {}
-  },
-  {
-    "cell_type": "markdown",
-    "source": [
-    "### Downloading the Required Dataset\n",
+    "from mindvision import dataset\n",
 "\n",
-    "Run the following command to download the dataset:\n",
-    "Download the CIFAR-10 Binary format dataset, decompress them and store them in the `./datasets` path, use this dataset when loading data."
- ], - "metadata": {} - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "import os\n", - "import requests\n", - "import tarfile\n", - "import zipfile\n", - "import shutil\n", - "\n", - "requests.packages.urllib3.disable_warnings()\n", - "\n", - "def download_dataset(url, target_path):\n", - " \"\"\"download and decompress dataset\"\"\"\n", - " if not os.path.exists(target_path):\n", - " os.makedirs(target_path)\n", - " download_file = url.split(\"/\")[-1]\n", - " if not os.path.exists(download_file):\n", - " res = requests.get(url, stream=True, verify=False)\n", - " if download_file.split(\".\")[-1] not in [\"tgz\", \"zip\", \"tar\", \"gz\"]:\n", - " download_file = os.path.join(target_path, download_file)\n", - " with open(download_file, \"wb\") as f:\n", - " for chunk in res.iter_content(chunk_size=512):\n", - " if chunk:\n", - " f.write(chunk)\n", - " if download_file.endswith(\"zip\"):\n", - " z = zipfile.ZipFile(download_file, \"r\")\n", - " z.extractall(path=target_path)\n", - " z.close()\n", - " if download_file.endswith(\".tar.gz\") or download_file.endswith(\".tar\") or download_file.endswith(\".tgz\"):\n", - " t = tarfile.open(download_file)\n", - " names = t.getnames()\n", - " for name in names:\n", - " t.extract(name, target_path)\n", - " t.close()\n", - " print(\"The {} file is downloaded and saved in the path {} after processing\".format(os.path.basename(url), target_path))\n", - "\n", - "download_dataset(\"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-binary.tar.gz\", \"./datasets\")\n", - "test_path = \"./datasets/cifar-10-batches-bin/test\"\n", - "train_path = \"./datasets/cifar-10-batches-bin/train\"\n", - "os.makedirs(test_path, exist_ok=True)\n", - "os.makedirs(train_path, exist_ok=True)\n", - "if not os.path.exists(os.path.join(test_path, \"test_batch.bin\")):\n", - " shutil.move(\"./datasets/cifar-10-batches-bin/test_batch.bin\", test_path)\n", - "[shutil.move(\"./datasets/cifar-10-batches-bin/\"+i, train_path) for i in os.listdir(\"./datasets/cifar-10-batches-bin/\") if os.path.isfile(\"./datasets/cifar-10-batches-bin/\"+i) and not i.endswith(\".html\") and not os.path.exists(os.path.join(train_path, i))]" + "dl_path = \"./datasets\"\n", + "data_dir = \"./datasets/cifar-10-batches-bin/\"\n", + "dl_url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-binary.tar.gz\"\n", + "\n", + "dl = dataset.DownLoad() # Download CIFAR-10 dataset\n", + "dl.download_and_extract_archive(url=dl_url, download_path=dl_path)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", + "metadata": {}, "source": [ - "The directory structure of the downloaded dataset file is as follows:\n", + "The directory structure of the extracted dataset file is as follows:\n", "\n", "```text\n", "./datasets/cifar-10-batches-bin\n", @@ -142,57 +75,21 @@ " ├── data_batch_4.bin\n", " └── data_batch_5.bin\n", "```" - ], - "metadata": {} - }, - { - "cell_type": "markdown", - "source": [ - "Download cifar-10 Python file format dataset, decompress them in the `./datasets/cifar-10-batches-py` path, use this dataset when converting data." 
-    ],
-    "metadata": {}
-  },
-  {
-    "cell_type": "code",
-    "execution_count": null,
-    "source": [
-    "download_dataset(\"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-python.tar.gz\", \"./datasets\")"
-    ],
-    "outputs": [],
-    "metadata": {}
-  },
-  {
-    "cell_type": "markdown",
-    "source": [
-    "The directory structure of the extracted dataset file is as follows:\n",
-    "\n",
-    "```text\n",
-    "./datasets/cifar-10-batches-py\n",
-    "├── batches.meta\n",
-    "├── data_batch_1\n",
-    "├── data_batch_2\n",
-    "├── data_batch_3\n",
-    "├── data_batch_4\n",
-    "├── data_batch_5\n",
-    "├── readme.html\n",
-    "└── test_batch\n",
-    "```"
-    ],
-    "metadata": {}
+  ]
 },
 {
 "cell_type": "markdown",
 "source": [
 "## Optimizing the Data Loading Performance\n",
 "\n",
-    "MindSpore provides multiple data loading methods, including common dataset loading, user-defined dataset loading, and the MindSpore data format loading. The dataset loading performance varies depending on the underlying implementation method.\n",
+    "MindSpore supports loading common datasets in fields such as computer vision and natural language processing, datasets in specific formats, and user-defined datasets. The dataset loading interfaces differ in their underlying implementation, so their performance also differs, as shown below:\n",
 "\n",
 "| | Common Dataset | User-defined Dataset | MindRecord Dataset |\n",
 "| :----: | :----: | :----: | :----: |\n",
 "| Underlying implementation | C++ | Python | C++ |\n",
 "| Performance | High | Medium | High |\n",
 "\n",
 "### Performance Optimization Solution"
 ],
 "metadata": {}
 },
@@ -208,13 +105,11 @@
 "source": [
 "Suggestions on data loading performance optimization are as follows:\n",
 "\n",
-    "- Built-in loading operators are preferred for supported dataset formats. For details, see [Built-in Loading Operators](https://www.mindspore.cn/docs/en/master/api_python/mindspore.dataset.html), if the performance cannot meet the requirements, use the multi-thread concurrency solution. For details, see [Multi-thread Optimization Solution](https://www.mindspore.cn/tutorials/experts/en/master/dataset/optimize.html#multi-thread-optimization-solution).\n",
-    "- For a dataset format that is not supported, convert the format to the mindspore data format and then use the `MindDataset` class to load the dataset (Please refer to the [API](https://www.mindspore.cn/docs/en/master/api_python/dataset/mindspore.dataset.MindDataset.html) for detailed use). Please refer to [Converting Dataset to MindRecord](https://www.mindspore.cn/tutorials/en/master/advanced/dataset/record.html), if the performance cannot meet the requirements, use the multi-thread concurrency solution, for details, see [Multi-thread Optimization Solution](https://www.mindspore.cn/tutorials/experts/en/master/dataset/optimize.html#multi-thread-optimization-solution).\n",
-    "- For dataset formats that are not supported, the user-defined `GeneratorDataset` class is preferred for implementing fast algorithm verification (Please refer to the [API](https://www.mindspore.cn/docs/en/master/api_python/dataset/mindspore.dataset.GeneratorDataset.html) for detailed use), if the performance cannot meet the requirements, the multi-process concurrency solution can be used. For details, see [Multi-process Optimization Solution](https://www.mindspore.cn/tutorials/experts/en/master/dataset/optimize.html#multi-process-optimization-solution).\n",
+    "- For commonly used datasets for which loading interfaces are already provided, prefer the dataset loading interface provided by MindSpore, which delivers better loading performance. For details, see [Built-in Loading Operators](https://www.mindspore.cn/docs/en/master/api_python/mindspore.dataset.html). If the performance cannot meet the requirements, use the multi-thread concurrency solution. For details, see [Multi-thread Optimization Solution](https://www.mindspore.cn/tutorials/experts/en/master/dataset/optimize.html#multi-thread-optimization-solution).\n",
+    "- For a dataset format that is not supported, it is recommended to convert the dataset to the MindRecord data format before loading it using the `MindDataset` class (Please refer to the [API](https://www.mindspore.cn/docs/en/master/api_python/dataset/mindspore.dataset.MindDataset.html) for detailed use). For detailed contents, please refer to [Converting Dataset to MindRecord](https://www.mindspore.cn/tutorials/en/master/advanced/dataset/record.html). If the performance cannot meet the requirements, use the multi-thread concurrency solution. For details, see [Multi-thread Optimization Solution](https://www.mindspore.cn/tutorials/experts/en/master/dataset/optimize.html#multi-thread-optimization-solution).\n",
+    "- For dataset formats that are not supported, the user-defined `GeneratorDataset` class is preferred for implementing fast algorithm verification (Please refer to the [API](https://www.mindspore.cn/docs/en/master/api_python/dataset/mindspore.dataset.GeneratorDataset.html) for detailed use). If the performance cannot meet the requirements, the multi-process concurrency solution can be used. For details, see [Multi-process Optimization Solution](https://www.mindspore.cn/tutorials/experts/en/master/dataset/optimize.html#multi-process-optimization-solution).\n",
 "\n",
-    "### Code Example\n",
-    "\n",
-    "Based on the preceding suggestions of data loading performance optimization, the `Cifar10Dataset` class of built-in loading operators (Please refer to the [API](https://www.mindspore.cn/docs/en/master/api_python/dataset/mindspore.dataset.Cifar10Dataset.html) for detailed use), the `MindDataset` class after data conversion, and the `GeneratorDataset` class are used to load data. The sample code is displayed as follows:\n",
+    "Based on the preceding suggestions for data loading performance optimization, this section uses the built-in loading operator `Cifar10Dataset` class (Please refer to the [API](https://www.mindspore.cn/docs/en/master/api_python/dataset/mindspore.dataset.Cifar10Dataset.html) for detailed use), the `MindDataset` class after data conversion, and the `GeneratorDataset` class to load data. The sample code is displayed as follows:\n",
 "\n",
 "1. Use the `Cifar10Dataset` class of built-in operators to load the CIFAR-10 dataset in binary format. The multi-thread optimization solution is used for data loading. Four threads are enabled to concurrently complete the task. Finally, a dictionary iterator is created for the data and a data record is read through the iterator."
 ],
 "metadata": {}
 },
@@ -366,9 +261,9 @@
 "source": [
 "## Optimizing the Shuffle Performance\n",
 "\n",
-    "The shuffle operation is used to shuffle ordered datasets or repeated datasets. MindSpore provides the `shuffle` function for users. A larger value of `buffer_size` indicates a higher shuffling degree, consuming more time and computing resources. This API allows users to shuffle the data at any time during the entire pipeline process. However, because the underlying implementation methods are different, the performance of this method is not as good as that of setting the `shuffle` parameter to directly shuffle data by referring to the [Built-in Loading Operators](https://www.mindspore.cn/docs/en/master/api_python/mindspore.dataset.html).\n",
+    "The shuffle operation is used to shuffle ordered datasets or repeated datasets. MindSpore provides the `shuffle` function for users. A larger value of `buffer_size` indicates a higher shuffling degree, consuming more time and computing resources. This API allows users to shuffle the data at any time during the entire pipeline process. For the detailed contents, refer to [shuffle processing](https://www.mindspore.cn/tutorials/en/master/advanced/dataset/transform.html#shuffle). Because the underlying implementation methods are different, the performance of this method is not as good as that of setting the `shuffle` parameter to directly shuffle data by referring to the [Built-in Loading Operators](https://www.mindspore.cn/docs/en/master/api_python/mindspore.dataset.html).\n",
 "\n",
 "### Performance Optimization Solution"
 ],
 "metadata": {}
 },
@@ -387,8 +282,6 @@
 "- Use the `shuffle` parameter of built-in loading operators to shuffle data.\n",
 "- If the `shuffle` function is used and the performance still cannot meet the requirements, adjust the value of the `buffer_size` parameter to improve the performance.\n",
 "\n",
-    "### Code Example\n",
-    "\n",
 "Based on the preceding shuffle performance optimization suggestions, the `shuffle` parameter of the `Cifar10Dataset` class of built-in loading operators and the `Shuffle` function are used to shuffle data. The sample code is displayed as follows:\n",
 "\n",
 "1. Use the `Cifar10Dataset` class of built-in operators to load the CIFAR-10 dataset. In this example, the CIFAR-10 dataset in binary format is used, and the `shuffle` parameter is set to True to perform data shuffle. Finally, a dictionary iterator is created for the data and a data record is read through the iterator."
@@ -518,14 +411,14 @@
 "- Use the built-in Python operator (`py_transforms` module) to perform data augmentation.\n",
 "- Users can define Python functions as needed to perform data augmentation.\n",
 "\n",
-    "Please refer to [Data Augmentation](https://www.mindspore.cn/tutorials/experts/en/master/dataset/augment.html). The performance varies according to the underlying implementation methods.\n",
+    "Please refer to [Data Augmentation](https://www.mindspore.cn/tutorials/experts/en/master/dataset/augment.html). The performance varies according to the underlying implementation methods, as shown below:\n",
 "\n",
 "| Module | Underlying API | Description |\n",
 "| :----: | :----: | :----: |\n",
 "| c_transforms | C++ (based on OpenCV) | High performance |\n",
 "| py_transforms | Python (based on PIL) | This module provides multiple image augmentation functions and the method for converting PIL images into NumPy arrays |\n",
 "\n",
 "### Performance Optimization Solution"
 ],
 "metadata": {}
 },
@@ -546,9 +439,7 @@
 "- The `c_transforms` module maintains buffer management in C++, and the `py_transforms` module maintains buffer management in Python. Because of the performance cost of switching between Python and C++, it is advised not to use different operator types together.\n",
 "- If the user-defined Python functions are used to perform data augmentation and the performance still cannot meet the requirements, use the [Multi-thread Optimization Solution](https://www.mindspore.cn/tutorials/experts/en/master/dataset/optimize.html#multi-thread-optimization-solution) or [Multi-process Optimization Solution](https://www.mindspore.cn/tutorials/experts/en/master/dataset/optimize.html#multi-process-optimization-solution). If the performance still cannot be improved, in this case, optimize the user-defined Python code.\n",
 "\n",
-    "MindSpore also supports users to use the data enhancement methods in the `c_transforms` and `py_transforms` modules at the same time, but due to the different underlying implementations of the two, excessive mixing will increase resource overhead and reduce processing performance. It is recommended that users can use the operators in `c_transforms` or `py_transforms` alone; or use one of them first, and then use the other. Please do not switch frequently between the data enhancement interface of two different implementation modules.\n",
-    "\n",
-    "### Code Example\n",
+    "MindSpore also allows users to use the data augmentation methods of the `c_transforms` and `py_transforms` modules at the same time, but because the underlying implementations of the two differ, excessive mixing will increase resource overhead and reduce processing performance. It is recommended to use the operators in `c_transforms` or `py_transforms` alone, or to use one module first and then the other. Please do not switch frequently between the data augmentation interfaces of the two different implementation modules.\n",
 "\n",
 "Based on the preceding suggestions of data augmentation performance optimization, the `c_transforms` module and user-defined Python function are used to perform data augmentation. The code is displayed as follows:\n",
 "\n",
@@ -644,23 +535,25 @@
 "source": [
 "## Optimizing the Operating System Performance\n",
 "\n",
-    "Data processing is performed on the host. Therefore, configurations of the host or operating system may affect the performance of data processing. Major factors include storage, NUMA architecture, and CPU (computing resources).\n",
+    "Data processing is performed on the Host. Therefore, configurations of the running environment may affect the processing performance. Major factors include storage, NUMA architecture, and CPU (computing resources).\n",
 "\n",
 "1. Storage\n",
 "\n",
-    "    The data loading process involves frequent disk operations, and the performance of disk reading and writing directly affects the speed of data loading. Solid State Drive (SSD) is recommended for storing large datasets. SSD reduces the impact of I/O on data processing.\n",
+    "The data loading process involves frequent disk operations, and the performance of disk reading and writing directly affects the speed of data loading. A Solid State Drive (SSD) is recommended when the dataset is large: SSDs generally have higher read and write speeds than ordinary disks, reducing the impact of I/O operations on data processing performance.\n",
 "\n",
-    "    > In most cases, after a dataset is loaded, it is stored in page cache of the operating system. To some extent, this reduces I/O overheads and accelerates reading subsequent epochs.\n",
+    "In general, the loaded data will be cached into the operating system's page cache, which reduces the overhead of subsequent reads to a certain extent and accelerates the data loading speed of subsequent epochs. Users can also manually cache the augmented data through the single-node caching technology provided by MindSpore, avoiding repeated data loading and data augmentation.\n",
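+    "\n",
+    "For example, a minimal sketch of single-node caching (this assumes the cache server has been started with `cache_admin --start` and a cache session created with `cache_admin -g`; the session ID below is a placeholder):\n",
+    "\n",
+    "```python\n",
+    "import mindspore.dataset as ds\n",
+    "import mindspore.dataset.vision.c_transforms as c_vision\n",
+    "\n",
+    "# The session ID is hypothetical; use the ID printed by `cache_admin -g`\n",
+    "some_cache = ds.DatasetCache(session_id=1456416665, size=0)\n",
+    "\n",
+    "dataset = ds.Cifar10Dataset(\"./datasets/cifar-10-batches-bin/train\")\n",
+    "# Cache the augmented data so that later epochs skip repeated loading and augmentation\n",
+    "dataset = dataset.map(operations=c_vision.RandomHorizontalFlip(), input_columns=[\"image\"], cache=some_cache)\n",
+    "```\n",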
 "\n",
 "2. NUMA architecture\n",
 "\n",
-    "    NUMA (Non-uniform Memory Architecture) is developed to solve the scalability problem of traditional Symmetric Multi-processor systems. The NUMA system has multiple memory buses. Several processors are connected to one memory via memory bus to form a group. This way, the entire large system is divided into several groups, the concept of this group is called a node in the NUMA system. Memory belonging to this node is called local memory, memory belonging to other nodes (with respect to this node) is called foreign memory. Therefore, the latency for each node to access its local memory is different from accessing foreign memory. This needs to be avoided during data processing. Generally, the following command can be used to bind a process to a node:\n",
+    "NUMA (Non-Uniform Memory Access) is a memory architecture designed to solve the scalability problem of the traditional symmetric multiprocessor (SMP) architecture. In an SMP architecture, multiple processors share one memory bus, which is prone to problems such as insufficient bandwidth and memory conflicts.\n",
+    "\n",
+    "In the NUMA architecture, processors and memory are divided into groups called nodes. Each node has a separate integrated memory controller (IMC) bus for intra-node communication, and different nodes communicate with each other through the Quick Path Interconnect (QPI). For a given node, memory within the same node is called local memory, while memory in other nodes is called foreign memory. The latency of accessing local memory is lower than that of accessing foreign memory.\n",
 "\n",
-    "    ```bash\n",
-    "    numactl --cpubind=0 --membind=0 python train.py\n",
-    "    ```\n",
+    "During data processing, you can reduce the latency of memory access by binding the process to a node. In general, the following command can be used to bind a process to a node:\n",
 "\n",
-    "    The example above binds the `train.py` process to `numa node` 0."
+    "```bash\n",
+    "numactl --cpubind=0 --membind=0 python train.py\n",
+    "```"
 ],
 "metadata": {}
 },
@@ -669,31 +562,23 @@
 "source": [
 "3. CPU (computing resource)\n",
 "\n",
-    "    Although the data processing speed can be accelerated through multi-threaded parallel technology, there is actually no guarantee that CPU computing resources will be fully utilized. If you can artificially complete the configuration of computing resources in advance, it will be able to improve the utilization of CPU computing resources to a certain extent.\n",
-    "\n",
-    "    - Resource allocation\n",
-    "\n",
-    "        In distributed training, multiple training processes are run on one device. These training processes allocate and compete for computing resources based on the policy of the operating system. When there is a large number of processes, data processing performance may deteriorate due to resource contention. In some cases, users need to manually allocate resources to avoid resource contention.\n",
-    "\n",
-    "        ```bash\n",
-    "        numactl --cpubind=0 python train.py\n",
-    "        ```\n",
+    "Although data processing can be accelerated through multi-thread parallelism, there is actually no guarantee that CPU computing resources will be fully utilized. Configuring computing resources manually in advance can improve the utilization of CPU computing resources to a certain extent.\n",
 "\n",
-    "    or\n",
+    "- Resource allocation\n",
 "\n",
-    "    ```bash\n",
-    "    taskset -c 0-15 python train.py\n",
-    "    ```\n",
+    "In distributed training, multiple training processes are run on one device. These training processes allocate and compete for computing resources based on the policy of the operating system. When there is a large number of processes, data processing performance may deteriorate due to resource contention. In some cases, users need to manually allocate resources to avoid resource contention.\n",
 "\n",
-    "    > The `numactl` method directly specifies `numa node id`. The `taskset` method allows for finer control by specifying `cpu core` within a `numa node`. The `core id` range from 0 to 15.\n",
+    "```bash\n",
+    "numactl --cpubind=0 python train.py\n",
+    "```\n",
 "\n",
-    "    - CPU frequency\n",
+    "- CPU frequency\n",
 "\n",
-    "    The setting of CPU frequency is critical to maximizing the computing power of the host CPU. Generally, the Linux kernel supports the tuning of the CPU frequency to reduce power consumption. Power consumption can be reduced to varying degrees by selecting power management policies for different system idle states. However, lower power consumption means slower CPU wake-up which in turn impacts performance. Therefore, if the CPU's power setting is in the conservative or powersave mode, `cpupower` command can be used to switch performance modes, resulting in significant data processing performance improvement.\n",
+    "For energy efficiency reasons, the operating system adjusts the CPU operating frequency as needed, but lower power consumption means degraded computing performance and slower data processing. To get the most out of the CPU's computing power, you need to set the CPU's operating frequency manually: if the operating system's CPU mode is the balanced or energy-saving mode, you can improve data processing performance by switching it to the performance mode.\n",
 "\n",
-    "    ```bash\n",
-    "    cpupower frequency-set -g performance\n",
-    "    ```"
+    "```bash\n",
+    "cpupower frequency-set -g performance\n",
+    "```"
 ],
 "metadata": {}
 },
@@ -704,19 +589,17 @@
 "\n",
 "### Multi-thread Optimization Solution\n",
 "\n",
 "During the data pipeline process, the number of threads for related operators can be set to improve the concurrency and performance. If the user does not manually specify the `num_parallel_workers` parameter, each data processing operation will use 8 sub-threads for concurrent processing by default. For example:\n",
 "\n",
 "- During data loading, the `num_parallel_workers` parameter in the built-in data loading class is used to set the number of threads.\n",
 "- During data augmentation, the `num_parallel_workers` parameter in the `map` function is used to set the number of threads.\n",
 "- During batch processing, the `num_parallel_workers` parameter in the `batch` function is used to set the number of threads.\n",
 "\n",
-    "For details, see [Built-in Loading Operators](https://www.mindspore.cn/docs/en/master/api_python/mindspore.dataset.html).\n",
-    "\n",
-    "When using MindSpore for standalone or distributed training, the setting of the `num_parallel_workers` parameter should follow the following principles:\n",
+    "For details, see [Built-in Loading Operators](https://www.mindspore.cn/docs/en/master/api_python/mindspore.dataset.html). When using MindSpore for standalone or distributed training, the setting of the `num_parallel_workers` parameter should follow these principles (a configuration sketch follows the list):\n",
 "\n",
-    "- The summary of the `num_parallel_workers` parameter set for each data loading and processing operation should not be greater than the maximum number of CPU cores of the machine, otherwise it will cause resource competition between each operation.\n",
-    "- Before setting the `num_parallel_workers` parameter, it is recommended to use MindSpore's Profiler (performance analysis) tool to analyze the performance of each operation in the training, and allocate more resources to the operation with pool performance, that is, set a large `num_parallel_workers` to balance the throughput between various operations and avoid unnecessary waiting.\n",
-    "- In a standalone training scenario, increasing the `num_parallel_workers` parameter can often directly improve processing performance, but in a distributed scenario, due to increased CPU competition, blindly increasing `num_parallel_workers` may lead to performance degradation. You need to try to use a compromise value.\n",
+    "- The sum of the `num_parallel_workers` parameters set for each data loading and processing operation should not be greater than the maximum number of CPU cores of the machine; otherwise it will cause resource competition between the operations.\n",
+    "- Before setting the `num_parallel_workers` parameter, it is recommended to use MindSpore's Profiler (performance analysis) tool to analyze the performance of each operation in the training, and allocate more resources to the operations with poor performance, that is, set a larger `num_parallel_workers` to balance the throughput between various operations and avoid unnecessary waiting.\n",
+    "- In a standalone training scenario, increasing the `num_parallel_workers` parameter can often directly improve processing performance, but in a distributed scenario, due to increased CPU competition, blindly increasing `num_parallel_workers` may lead to performance degradation. You need to experiment to find a compromise value.\n",
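+    "\n",
+    "For instance, a minimal configuration sketch (the value 4 is illustrative and should be tuned according to the principles above):\n",
+    "\n",
+    "```python\n",
+    "import mindspore.dataset as ds\n",
+    "import mindspore.dataset.vision.c_transforms as c_vision\n",
+    "\n",
+    "# Set 4 parallel workers for loading, augmentation and batching respectively\n",
+    "dataset = ds.Cifar10Dataset(\"./datasets/cifar-10-batches-bin/train\", num_parallel_workers=4)\n",
+    "dataset = dataset.map(operations=c_vision.RandomHorizontalFlip(), input_columns=[\"image\"], num_parallel_workers=4)\n",
+    "dataset = dataset.batch(batch_size=32, num_parallel_workers=4)\n",
+    "```\n",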
 "\n",
 "### Multi-process Optimization Solution\n",
 "\n",
@@ -727,7 +610,7 @@
 "\n",
 "### Compose Optimization Solution\n",
 "\n",
-    "Map operators can receive the Tensor operator list and apply all these operators based on a specific sequence. Compared with the Map operator used by each Tensor operator, such Fat Map operators can achieve better performance, as shown in the following figure:"
+    "Map operators can receive the Tensor operator list and apply all these operators based on a specific sequence. Compared with using a separate Map operator for each Tensor operator, such \"Fat Map operators\" can achieve better performance, as shown in the following figure:"
 ],
 "metadata": {}
 },
@@ -752,11 +635,7 @@
 "- Use Solid State Drives to store the data.\n",
 "- Bind the process to a NUMA node.\n",
 "- Manually allocate more computing resources.\n",
-    "- Set a higher CPU frequency.\n",
-    "\n",
-    "## References\n",
-    "\n",
-    "[1] Alex Krizhevsky. [Learning Multiple Layers of Features from Tiny Images](http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf)."
+    "- Set a higher CPU frequency."
 ],
 "metadata": {}
 }
diff --git a/tutorials/experts/source_en/debug/auto_tune.md b/tutorials/experts/source_en/debug/auto_tune.md
index b95bd4a72875e3cfb67a36623a47d4be539f07eb..660bbb163663c6e933870bd9f683b71961689e4b 100644
--- a/tutorials/experts/source_en/debug/auto_tune.md
+++ b/tutorials/experts/source_en/debug/auto_tune.md
@@ -4,15 +4,15 @@
 
 ## Overview
 
-AutoTune is a tool that uses hardware resources and automatically tune the performance of TBE operators. Comparing with manually debugging the performance of operator, it takes less time and labor cost, and a model with better performance can be obtained. This document mainly introduces how to use the AutoTune tool to Online tune. The detail guidelines about the AutoTune framework, function description, and the fault handling can be got in [AutoTune Guides](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/31d1d888/about-this-document).
+AutoTune is a tool that uses hardware resources to automatically tune the performance of TBE operators. Compared with manually tuning operator performance, it takes less time and labor, and a model with better performance can be obtained. This document mainly introduces how to use AutoTune for online tuning. Detailed guidelines about the AutoTune framework, function description, and fault handling can be found in [AutoTune Guides](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/31d1d888/about-this-document).
 
-## TuneMode
+## Tuning Mode
 
-The AutoTune tool includes `RL` and `GA` tuning modes. The`RL`tuning mode mainly supports`broadcast`,`reduce`, and`elewise`operators. The`GA`tuning mode mainly supports`cube`operators. The more information about the GA, RL, and the operators supported by the two tune mode can be got in [Tune Mode](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/41bb2c07) and [Operators](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/74e08a9c/operator-list).
+The AutoTune tool includes `RL` and `GA` tuning modes. The `RL` tuning mode mainly supports broadcast, reduce, and elewise operators. The `GA` tuning mode mainly supports cube operators. More information about GA, RL, and the operators supported by the two tuning modes can be found in [Tune Mode](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/41bb2c07) and [Operators](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/74e08a9c/operator-list).
 
-## EnvironmentVariables
+## Environment Variables
 
-When using the AutoTune tool to tune the operators, some environment variables need to be configured (Required).
+When you enable the AutoTune tool, you need to configure the relevant required environment variables.
 
 ```shell
 # Run package installation directory
 export LOCAL_ASCEND=/usr/local/Ascend
 export LD_LIBRARY_PATH=${LOCAL_ASCEND}/fwkacllib/lib64:$LD_LIBRARY_PATH
 export PATH=${LOCAL_ASCEND}/fwkacllib/ccec_compiler/bin:${LOCAL_ASCEND}/fwkacllib/bin:$PATH
 export PYTHONPATH=${LOCAL_ASCEND}/fwkacllib/python/site-packages:$PYTHONPATH
 export ASCEND_OPP_PATH=${LOCAL_ASCEND}/opp
@@ -29,7 +29,7 @@
 export ENABLE_TUNE_DUMP=True
 ```
 
 Try to find the detailed description of environment variables, or other optional environment variables descriptions in [Environment Variable](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/3f0a50ba/environment-variable-configuration).
 
-## EnablingTune
+## Enabling Tune
 
 The AutoTune tool supports two tuning modes, `Online tune` and `Offline Tune`.
 
@@ -39,9 +39,9 @@
 
   NO_TUNE: turn off tune.
 
-  RL: turn on RL tune.
+  RL: turn on RL tune. It tunes operators that support RL tuning.
 
-  GA: turn on GA tune.
+  GA: turn on GA tune. It tunes operators that support GA tuning.
 
   RL,GA: turn on GA and RL at the same time, the tool will select RL or GA automatically according to different types of operators which are used in the network.
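+
+  For example, a minimal sketch of enabling online tuning in a training script (this assumes `auto_tune_mode` is the `set_context` option that carries the modes listed above):
+
+  ```python
+  from mindspore import set_context
+
+  # Let the tool pick RL or GA automatically according to the operator type
+  set_context(device_target="Ascend", auto_tune_mode="RL,GA")
+  ```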
 
@@ -59,13 +59,13 @@
 
 The Offline Tune is using the dump data (The output description file, and the binary file of operators) of network model (Generate when training network) to tune the operators. The method of Offline Tune and related environment variables can be found in [Offline Tune](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/2fa72dd0) in `CANN` development tool guide, which is not described here.
 
-## TuningResult
+## Tuning Result
 
 After the tuning starts, a file named `tune_result_{timestamp}_pidxxx.json` will be generated in the working directory to record the tuning process and tuning results. Please refer to [tuning result file analysis](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/b6ae7c6a) for specific analysis of this file.
 
 After the tuning is complete. The custom knowledge base will be generated if the conditions are met. If the `TUNE_BANK_PATH`(Environment variable of the knowledge base storage path) is specified, the knowledge base(generated after tuning) will be saved in the specified directory. Otherwise, the knowledge base will be in the following default path. Please refer to [Custom knowledge base](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/b6ae7c6a) for the storage path.
 
-## MergeKnowledgeBase
+## Merging Knowledge Base
 
 After operator tuning, the generated tuning knowledge base supports merging, which is convenient for re-executing, or the other models.(Only the same Ascend AI Processor can be merged). The more specific merging methods can be found in [merging knowledge base](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/c1a94cfc/repository-merging).
diff --git a/tutorials/experts/source_en/debug/custom_debug.md b/tutorials/experts/source_en/debug/custom_debug.md
index 31bbc26cb1360da41c1a84dde23412854fa88875..282055ccd2f48ff9d8ef2901d33f58aeeaa97f73 100644
--- a/tutorials/experts/source_en/debug/custom_debug.md
+++ b/tutorials/experts/source_en/debug/custom_debug.md
@@ -8,17 +8,17 @@ This section describes how to use the customized capabilities provided by MindSp
 
 ## Introduction to Callback
 
-Here, callback is not a function but a class. You can use callback to observe the internal status and related information of the network during training or perform specific actions in a specific period.
+`Callback` here refers to a callback mechanism, and the callback is not a function but a class. You can use the callback to observe the internal status and related information of the network during training, or perform specific actions in a specific period.
 
 For example, you can monitor the loss, save model parameters, dynamically adjust parameters, and terminate training tasks in advance.
 
 ### Callback Capabilities of MindSpore
 
-MindSpore provides the callback capabilities to allow users to insert customized operations in a specific phase of training or inference, including:
+MindSpore provides the `Callback` capabilities to allow users to insert customized operations in a specific phase of training or inference, including:
 
-- Callback classes such as `ModelCheckpoint`, `LossMonitor`, and `SummaryCollector` provided by the MindSpore framework.
-- Custom callback classes.
+- `Callback` classes such as `ModelCheckpoint`, `LossMonitor`, and `SummaryCollector` provided by the MindSpore framework.
+- User-customized `Callback` classes supported by MindSpore.
 
-Usage: Transfer the callback object in the `model.train` method. The callback object can be a list, for example:
+Usage: Transfer the `Callback` object in the `model.train` method. It can be a `Callback` list, for example:
 
 ```python
 from mindspore import ModelCheckpoint, LossMonitor, SummaryCollector
@@ -30,15 +30,15 @@ model.train(epoch, dataset, callbacks=[ckpt_cb, loss_cb, summary_cb])
 ```
 
 `ModelCheckpoint` can save model parameters for retraining or inference.
-`LossMonitor` can output loss information in logs for users to view. In addition, `LossMonitor` monitors the loss value change during training. When the loss value is `Nan` or `Inf`, the training terminates.
-`SummaryCollector` can save the training information to files for later use.
-During the training process, the callback list will execute the callback function in the defined order. Therefore, in the definition process, the dependency between callbacks needs to be considered.
+`LossMonitor` can output the loss in the log for easy viewing; it also monitors changes in the loss value during training and terminates the training when the loss value is `Nan` or `Inf`.
+`SummaryCollector` can save the training information to files for subsequent visualization.
+During the training process, the `Callback` list will execute the `Callback` functions in the defined order. Therefore, in the definition process, the dependencies between `Callback` objects need to be considered.
 
 ### Custom Callback
 
-You can customize callback based on the `callback` base class as required.
+You can customize a `Callback` based on the `Callback` base class as required.
 
-The callback base class is defined as follows:
+The `Callback` base class is defined as follows:
 
 ```python
 class Callback():
@@ -68,28 +68,27 @@ class Callback():
         pass
 ```
 
-The callback can record important information during training and transfer the information to the callback object through a dictionary variable `RunContext.original_args()`,
-You can obtain related attributes from each custom callback and perform customized operations. You can also customize other variables and transfer them to the `RunContext.original_args()` object.
+The `Callback` can record important information during training and transfer the information to the `Callback` object through the dictionary variable `RunContext.original_args()`.
+You can obtain related attributes from each custom `Callback` and perform customized operations. You can also customize other variables and transfer them to the `RunContext.original_args()` object.
 
 The main attributes of `RunContext.original_args()` are as follows:
 
-- loss_fn: Loss function
-- optimizer: Optimizer
-- train_dataset: Training dataset
-- cur_epoch_num: Number of current epochs
-- cur_step_num: Number of current steps
-- batch_num: Number of batches in an epoch
-- epoch_num: Number of training epochs
-- batch_num: Number of training batch
-- train_network: Training network
-- parallel_mode: Parallel mode
-- list_callback: All callback functions
-- net_outputs: Network output results
+- `loss_fn`: Loss function
+- `optimizer`: Optimizer
+- `train_dataset`: Training dataset
+- `epoch_num`: Number of training epochs
+- `batch_num`: Number of batches in an epoch
+- `train_network`: Training network
+- `cur_epoch_num`: Number of current epochs
+- `cur_step_num`: Number of current steps
+- `parallel_mode`: Parallel mode
+- `list_callback`: All callback functions
+- `net_outputs`: Network output results
 - ...
 
-You can inherit the callback base class to customize a callback object.
+You can inherit the `Callback` base class to customize a `Callback` object.
 
-Here are two examples to further explain the usage of custom Callback.
+Here are two examples to further explain the usage of custom `Callback`.
 
 > custom `Callback` sample code:
 >
@@ -120,15 +119,9 @@
         run_context.request_stop()
     ```
 
-    The output is as follows:
-
-    ```text
-    epoch: 20 step: 32 loss: 2.298344373703003
-    ```
-
     The implementation principle is: You can use the `run_context.original_args` method to obtain the `cb_params` dictionary, which contains the main attribute information described above.
-    In addition, you can modify and add values in the dictionary. In the preceding example, an `init_time` object is defined in `begin` and transferred to the `cb_params` dictionary.
-    A decision is made at each `step_end`. When the training time is longer than the configured time threshold, a training termination signal will be sent to the `run_context` to terminate the training in advance and the current values of epoch, step, and loss will be printed.
+    In addition, you can modify and add values in the dictionary. In the preceding example, an `init_time` object is defined in `begin` and transferred to the `cb_params` dictionary.
+    A decision is made at each `step_end`. When the training time is longer than the configured time threshold, a training termination signal will be sent to the `run_context` to terminate the training in advance and the current values of `epoch`, `step`, and `loss` will be printed.
 
 - Save the checkpoint file with the highest accuracy during training.
 
@@ -152,13 +145,13 @@
             print("Save the maximum accuracy checkpoint,the accuracy is", self.acc)
     ```
 
-    The specific implementation principle is: define a callback object, and initialize the object to receive the model object and the ds_eval (verification dataset). Verify the accuracy of the model in the step_end phase. When the accuracy is the current highest, automatically trigger the save checkpoint method to save the current parameters.
+    The specific implementation principle is: define a `Callback` object, and initialize the object to receive the `model` object and the `ds_eval` (verification dataset). Verify the accuracy of the model in the `step_end` phase. When the accuracy is the current highest, automatically trigger the save checkpoint method to save the current parameters.
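+
+    A sketch of how such a callback might be attached to training (this assumes the class above is named `SaveCallback`, and that `model`, `ds_train`, `ds_eval`, and `epoch` are defined as in the earlier snippets):
+
+    ```python
+    save_cb = SaveCallback(model, ds_eval)
+    model.train(epoch, ds_train, callbacks=[save_cb])
+    ```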
 
-## MindSpore Metrics
+## Introduction to MindSpore Metrics
 
 After the training is complete, you can use metrics to evaluate the training result.
 
-MindSpore provides multiple metrics, such as `accuracy`, `loss`, `tolerance`, `recall`, and `F1`.
+MindSpore provides multiple metrics, such as `accuracy`, `loss`, `precision`, `recall`, and `F1`.
 
 You can define a metrics dictionary object that contains multiple metrics and transfer them to the `model` object and use the `model.eval` function to verify the training result.
 
@@ -183,16 +176,16 @@ result = model.eval(ds_eval)
 
 The `model.eval` method returns a dictionary that contains the metrics and results transferred to the metrics.
 
-The callback function can also be used in the eval process, and the user can call the related API or customize the callback method to achieve the desired function.
+The `Callback` function can also be used in the eval process, and the user can call the related API or customize the `Callback` method to achieve the desired function.
 
-You can also define your own metrics class by inheriting the `Metric` base class and rewriting the `clear`, `update`, and `eval` methods.
+You can also define your own `metrics` class by inheriting the `Metric` base class and rewriting the `clear`, `update`, and `eval` methods.
 
 The `Accuracy` operator is used as an example to describe the internal implementation principle.
 
 The `Accuracy` inherits the `EvaluationBase` base class and rewrites the preceding three methods.
 
 - The `clear` method initializes related calculation parameters in the class.
-- The `update` method accepts the predicted value and tag value and updates the internal variables of Accuracy.
+- The `update` method accepts the predicted value and tag value and updates the internal variables of `Accuracy`.
 - The `eval` method calculates related indicators and returns the calculation result.
 
 By invoking the `eval` method of `Accuracy`, you will obtain the calculation result.
 
@@ -219,10 +212,10 @@
 Accuracy is 0.6667
 ```
 
-## MindSpore Print Operator
+## Introduction to the MindSpore Print Operator
 
 MindSpore-developed `Print` operator is used to print the tensors or character strings input by users. Multiple strings, multiple tensors, and a combination of tensors and strings are supported, which are separated by comma (,). The `Print` operator is only supported in Ascend environment.
-The method of using the MindSpore `Print` operator is the same as using other operators. You need to assert MindSpore `Print` operator in `__init__` and invoke it using `construct`. The following is an example.
+The method of using the MindSpore `Print` operator is the same as that of other operators. You need to declare the operator in `__init__` in the network and call it in `construct`. The specific usage example and output are as follows:
 
 ```python
 import numpy as np
@@ -282,7 +275,7 @@
 Running Data Recorder(RDR) is the feature MindSpore provides to record data while training program is running. If a failure occurs in MindSpore, the pre-recorded data in MindSpore is automatically exported to assist in locating the cause of the running exception. Different exceptions will export different data, for instance, the occurrence of `Run task error` exception, the computational graph, graph execution order, memory allocation and other information will be exported to assist in locating the cause of the exception.
 
 ### Usage
 
-#### Set RDR By Configuration File
+#### Set RDR by Configuration File
 
 1. Create the configuration file `mindspore_config.json`.
 
@@ -298,7 +291,7 @@
 
   > enable: Controls whether the RDR is enabled.
  >
-  > mode: Controls RDR data exporting mode. When mode is set to 1, RDR exports data only in exceptional scenario. When mode is set to 2, RDR exports data in exceptional or normal scenario.
+  > mode: Controls RDR data exporting mode. When mode is set to 1, RDR exports data only in the exceptional scenario. When mode is set to 2, RDR exports data in exceptional or normal scenarios.
  >
  > path: Set the path to which RDR stores data. Only absolute path is supported.
 
@@ -308,9 +301,9 @@
    set_context(env_config_path="./mindspore_config.json")
   ```
 
-#### Set RDR By Environment Variables
+#### Set RDR by Environment Variables
 
-Set `export MS_RDR_ENABLE=1` to enable RDR, and set `export MS_RDR_MODE=1` or `export MS_RDR_MODE=2` to control exporting mode for RDR data, and set the root directory by `export MS_RDR_PATH=/path/to/root/dir` for recording data. The final directory for recording data is `/path/to/root/dir/rank_{RANK_ID}/rdr/`. `{RANK_ID}` is the unique ID for multi-cards training, the single card scenario defaults to `RANK_ID=0`.
+Set `export MS_RDR_ENABLE=1` to enable RDR, set `export MS_RDR_MODE=1` or `export MS_RDR_MODE=2` to control the exporting mode for RDR data, and set the root directory by `export MS_RDR_PATH=/path/to/root/dir` for recording data. The final directory for recording data is `/path/to/root/dir/rank_{RANK_ID}/rdr/`. `RANK_ID` is the unique ID for multi-card training; the single-card scenario defaults to `RANK_ID=0`.
 
 > The configuration file set by the user takes precedence over the environment variables.
 
@@ -322,7 +315,33 @@
 
 When we go to the directory for recording data, we can see several files appear.
 
 #### Diagnosis Handling
 
-When enable RDR and set `export MS_RDR_MODE=2`, it is diagnostic mode. After Compiling graph, we also can see several files in above `MS_RDR_PATH` directory. the files are same with exception handling's.
+When RDR is enabled and the environment variable `export MS_RDR_MODE=2` is set, RDR runs in diagnostic mode. After graph compilation is complete, the same files as in the exception-handling scenario can be found in the RDR export directory.
+
+## Memory Reuse
+
+Memory reuse lets different Tensors share the same part of memory to reduce memory overhead and support larger networks. After this function is turned off, each Tensor has its own independent memory space, and no memory is shared between Tensors.
+
+The MindSpore memory reuse function is turned on by default, and it can be manually turned off and on in the following ways.
+
+### Usage
+
+1. Create the configuration file `mindspore_config.json`.
+
+   ```json
+   {
+     "sys": {
+       "mem_reuse": true
+     }
+   }
+   ```
+
+   > mem_reuse: controls whether the memory reuse function is turned on. When it is set to true, the memory reuse function is turned on; when it is set to false, the memory reuse function is turned off.
+
+2. Configure the memory reuse function through `set_context`. 
+ + ```python + set_context(env_config_path="./mindspore_config.json") + ``` ## Log-related Environment Variables and Configurations diff --git a/tutorials/experts/source_zh_cn/dataset/optimize.ipynb b/tutorials/experts/source_zh_cn/dataset/optimize.ipynb index 04bff132fcd5ab1782888e4b1d1bcaadf9c57f5d..031d2e00b167f3fd48c7f7054bf68cc87302c599 100644 --- a/tutorials/experts/source_zh_cn/dataset/optimize.ipynb +++ b/tutorials/experts/source_zh_cn/dataset/optimize.ipynb @@ -93,7 +93,7 @@ "source": [ "## 数据加载性能优化\n", "\n", - "MindSpore支持加载计算机视觉、自然语言处理等领域的常用数据集、特定格式的数据集以及用户自定义的数据集,。不同数据集加载接口的底层实现方式不同,性能也存在着差异,如下所示:" + "MindSpore支持加载计算机视觉、自然语言处理等领域的常用数据集、特定格式的数据集以及用户自定义的数据集。不同数据集加载接口的底层实现方式不同,性能也存在着差异,如下所示:" ] }, { @@ -773,4 +773,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} +} \ No newline at end of file diff --git a/tutorials/experts/source_zh_cn/debug/auto_tune.md b/tutorials/experts/source_zh_cn/debug/auto_tune.md index 082c6cdcf44416c679bb2094e2d499b804db998f..9da83a4d942bdc9f2214d7b2a589231fd2eeebdb 100644 --- a/tutorials/experts/source_zh_cn/debug/auto_tune.md +++ b/tutorials/experts/source_zh_cn/debug/auto_tune.md @@ -22,9 +22,9 @@ export LD_LIBRARY_PATH=${LOCAL_ASCEND}/fwkacllib/lib64:$LD_LIBRARY_PATH export PATH=${LOCAL_ASCEND}/fwkacllib/ccec_compiler/bin:${LOCAL_ASCEND}/fwkacllib/bin:$PATH export PYTHONPATH=${LOCAL_ASCEND}/fwkacllib/python/site-packages:$PYTHONPATH export ASCEND_OPP_PATH=${LOCAL_ASCEND}/opp + # 离线调优环境变量 export ENABLE_TUNE_DUMP=True - ``` 以上环境变量功能详细说明、其他可选环境变量以及相关功能介绍请参考[环境变量](https://support.huawei.com/enterprise/zh/doc/EDOC1100206690/58a01d46)。