One-Shot Neural Architecture Search: Maximising Diversity to Overcome Catastrophic Forgetting (Chuan Zhou)
Update time: 2021-09-06

Neural architecture search (NAS) has recently attracted massive interest from the deep learning community because it spares experts the inordinate amounts of time and labor needed to design neural networks by hand. Early NAS methods were based on a nested approach that trained numerous separate architectures from scratch and then used reinforcement learning (RL) or an evolutionary algorithm (EA) to find the most promising architectures based on validation accuracy. However, these methods are so computationally expensive as to be impractical for most machine learning practitioners. For example, one early RL-based method needed more than 1800 GPU days to find promising architectures, and Real et al. spent 7 days with 450 GPUs searching for promising architectures with an EA. Recent studies have shown that NAS can be made significantly more computationally efficient. Weight sharing, in particular, also called one-shot NAS, has attracted enormous attention for automating neural architecture design, because it not only finds state-of-the-art architectures but also greatly reduces the search time required. One-shot NAS encodes the search space as a supernet, in which all possible architectures directly inherit weights from the supernet for evaluation without needing to be trained from scratch. Since one-shot NAS only trains the supernet during the architecture search, this learning paradigm can reduce the search time from many days to several hours.
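As a concrete illustration of the weight-sharing idea, the following is a minimal toy sketch in PyTorch (not the implementation from the paper; all class and variable names are illustrative): every candidate operation's weights live in the supernet, and a sampled architecture is evaluated by routing the input through its chosen operations, inheriting their weights without any retraining.

```python
# A toy weight-sharing supernet (illustrative, not the paper's code):
# all candidate operations keep their weights inside the supernet, and a
# sampled architecture is evaluated by routing through its chosen ops.
import torch
import torch.nn as nn

CANDIDATE_OPS = {
    "conv3x3": lambda c: nn.Conv2d(c, c, kernel_size=3, padding=1),
    "conv5x5": lambda c: nn.Conv2d(c, c, kernel_size=5, padding=2),
    "identity": lambda c: nn.Identity(),
}

class MixedEdge(nn.Module):
    """One supernet edge holding every candidate operation."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleDict({name: make(channels)
                                  for name, make in CANDIDATE_OPS.items()})

    def forward(self, x, choice):
        # Architectures selecting the same op on this edge share its weights.
        return self.ops[choice](x)

class SuperNet(nn.Module):
    def __init__(self, channels=16, num_edges=4, num_classes=10):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.edges = nn.ModuleList([MixedEdge(channels) for _ in range(num_edges)])
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x, architecture):
        # `architecture` is one op name per edge, i.e. a single path.
        x = self.stem(x)
        for edge, choice in zip(self.edges, architecture):
            x = edge(x, choice)
        return self.head(x.mean(dim=(2, 3)))

# A candidate architecture is evaluated with inherited weights, no retraining:
supernet = SuperNet()
arch = ["conv3x3", "identity", "conv5x5", "conv3x3"]
logits = supernet(torch.randn(2, 3, 32, 32), arch)
```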

Pioneering studies on one-shot NAS follow two sequential steps. They first adopt an architecture-sampling controller to sample architectures for training the supernet. Then, a heuristic search method finds promising architectures over a discrete search space based on the trained supernet. Later studies employ continuous relaxation to make the architecture representation differentiable, so that gradient descent can be used to optimize the architecture with respect to validation accuracy. The architecture parameters and supernet weights are alternately optimized through bilevel optimization, and the most promising architecture is obtained once the supernet is trained.
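The gradient-based variant can be sketched as an alternating update, as below. This is a hedged, first-order illustration of a DARTS-style bilevel step, assuming a supernet whose forward pass accepts continuous architecture parameters (for example, softmax-weighted mixtures of candidate operations) rather than the discrete path of the toy example above; the function and optimizer names are illustrative.

```python
# A hedged sketch of the alternating (bilevel) update in differentiable
# one-shot NAS: architecture parameters are updated on validation data,
# shared supernet weights on training data. First-order approximation only.
import torch.nn.functional as F

def alternating_step(supernet, alpha, w_optimizer, alpha_optimizer,
                     train_batch, valid_batch):
    x_tr, y_tr = train_batch
    x_va, y_va = valid_batch

    # 1) Architecture step: minimize the validation loss w.r.t. alpha;
    #    only alpha_optimizer steps, so the shared weights stay fixed.
    alpha_optimizer.zero_grad()
    F.cross_entropy(supernet(x_va, alpha), y_va).backward()
    alpha_optimizer.step()

    # 2) Weight step: minimize the training loss w.r.t. the shared weights.
    #    zero_grad() clears any weight gradients left over from step 1.
    w_optimizer.zero_grad()
    F.cross_entropy(supernet(x_tr, alpha), y_tr).backward()
    w_optimizer.step()
```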

Since one-shot NAS evaluates candidate architectures based on the validation accuracy of the weights they inherit from the supernet, rather than training them from scratch, its success rests on a critical assumption: the validation accuracy obtained with inherited weights should approximate, or at least be highly predictive of, the test accuracy obtained after training from scratch. The authors of the first study on one-shot NAS observed a strong positive correlation between validation and test accuracy when the supernet was trained with random path dropout, and subsequent studies have implicitly taken this assumption to hold for all one-shot NAS methods. However, several recent studies have revealed that it may not hold in the most popular one-shot NAS approaches. For instance, Sciuto et al. show that there is no observable correlation between the validation and test accuracy of the weight-sharing paradigm in ENAS, and Adam et al. show that the RNN controller in ENAS does not depend on past sampled architectures, so its performance is no better than random search. Similarly, Singh et al. find no visible progress in the retrained performance of architectures found with the supernet during the architecture search phase, implying that supernet training does little to improve the predictive ability of one-shot NAS. Further, Yang et al. conducted extensive experiments demonstrating that current one-shot NAS techniques struggle to outperform naive baselines; rather, the success of one-shot NAS is mostly due to the design of the search space.
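This predictive-ability assumption is typically assessed with a rank correlation between the supernet-based validation accuracy and the stand-alone accuracy of the same architectures, along the lines of the hedged sketch below; the accuracy values are placeholders, not measurements from the paper.

```python
# Checking the weight-sharing assumption: rank-correlate supernet
# (inherited-weights) validation accuracy with stand-alone test accuracy.
# The numbers below are placeholders for illustration only.
from scipy.stats import kendalltau

supernet_val_acc = [0.71, 0.68, 0.74, 0.66, 0.70]      # inherited weights
standalone_test_acc = [0.93, 0.91, 0.94, 0.90, 0.92]   # trained from scratch

tau, p_value = kendalltau(supernet_val_acc, standalone_test_acc)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
# A tau near zero would indicate that supernet accuracy is not predictive,
# which is the failure mode reported by the studies cited above.
```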

Most one-shot NAS approaches adopt single-path supernet training, where only a single path (one architecture) in the supernet is trained in each step; this is the scenario considered here. However, Benyahia et al. observed that when multiple models (architectures) with partially shared weights are trained for a single task, training each model may lower the performance of the others. They called this phenomenon multi-model forgetting, a form of catastrophic forgetting, and also observed it in one-shot NAS. For example, consider a large supernet containing multiple models with weights shared across them: when the models are trained sequentially on a single task, the accuracy of each model tends to drop whenever another model with partially shared weights is trained. This deterioration accumulates over the course of supernet training and undermines the reliability of the inherited weights.
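The sketch below (reusing the toy supernet interface from the first example, with an illustrative random sampler) shows where multi-model forgetting enters single-path training: each step updates only the operations on the sampled path, and those updates also move weights shared with previously trained paths.

```python
# Single-path supernet training (illustrative sketch, reusing the toy
# SuperNet above): one random path is trained per step, and its updates also
# move weights shared with previously trained paths — the root of forgetting.
import random
import torch.nn.functional as F

OP_NAMES = ["conv3x3", "conv5x5", "identity"]

def sample_single_path(num_edges):
    return [random.choice(OP_NAMES) for _ in range(num_edges)]

def train_supernet_single_path(supernet, loader, optimizer, num_edges=4):
    for x, y in loader:
        arch = sample_single_path(num_edges)           # one architecture per step
        loss = F.cross_entropy(supernet(x, arch), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Weights on `arch` shared with earlier sampled paths were just pulled
        # toward this path's objective; those earlier paths' accuracy can drop
        # silently (multi-model forgetting).
```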

So, although weight sharing can greatly reduce computation time, it also introduces catastrophic forgetting into supernet training, which results in unreliable architecture rankings. Addressing multi-model forgetting during supernet training is therefore an urgent issue if one-shot NAS is to be better leveraged and the predictive ability of supernets improved. Hence, the authors formulate supernet training as a constrained optimization problem for continual learning, so that training a new architecture does not degrade the performance of previously visited ones.
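Schematically, and with illustrative notation rather than the paper's own, the constrained formulation can be written as follows: the shared weights are updated for the currently sampled architecture subject to not increasing the validation loss of previously visited architectures.

```latex
% Illustrative notation only: N(a, W_a) is the network obtained by equipping
% architecture a with its inherited weights W_a; a_t is the architecture
% sampled at step t, and W_{a_i}^{prev} are the weights before the update.
\begin{aligned}
W_{a_t}^{*} = {} & \arg\min_{W_{a_t}} \; \mathcal{L}_{\mathrm{train}}\bigl(\mathcal{N}(a_t, W_{a_t})\bigr) \\
\text{s.t.}\quad & \mathcal{L}_{\mathrm{val}}\bigl(\mathcal{N}(a_i, W_{a_i})\bigr) \le \mathcal{L}_{\mathrm{val}}\bigl(\mathcal{N}(a_i, W_{a_i}^{\mathrm{prev}})\bigr), \quad i = 1, \dots, t-1.
\end{aligned}
```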

That said, it is intractable to consider all previously visited architectures, so only the most representative subset of them is used to regularize the learning of the current architecture. To overcome catastrophic forgetting, the authors formulate supernet training for one-shot NAS as a constrained continual-learning optimization problem in which learning the current architecture must not degrade the validation accuracy of previous architectures. The key to solving this constrained problem is a novelty search-based architecture selection (NSAS) loss function, which regularizes supernet training by using a greedy novelty search to find the most representative subset of previous architectures. The authors applied the NSAS loss function to two one-shot NAS baselines and tested them extensively on both a common search space and a NAS benchmark dataset. They further derive three variants of the NSAS loss function: NSAS with a depth constraint (NSAS-C) to improve transferability, and NSAS-G and NSAS-LG to handle situations with a limited number of constraints. Experiments on the common NAS search space demonstrate that NSAS and its variants improve the predictive ability of supernet training in one-shot NAS, with strong and efficient performance on the CIFAR-10, CIFAR-100, and ImageNet datasets. The results on the NAS benchmark dataset also confirm the significant improvements these methods bring to the one-shot NAS baselines.
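A much-simplified reading of how such a regularized loss could be assembled is sketched below; the novelty measure, subset size, and weighting are illustrative assumptions, not the exact NSAS formulation from the paper. It reuses the toy supernet interface from the first example.

```python
# A much-simplified, illustrative NSAS-style regularized loss: the currently
# sampled path's loss is augmented with losses of a small, diverse subset of
# previously visited architectures chosen by a greedy novelty search.
# Novelty measure, subset size and weighting are assumptions, not the paper's.
import torch.nn.functional as F

def edit_distance(arch_a, arch_b):
    # Per-edge mismatch count as a simple novelty/diversity measure.
    return sum(a != b for a, b in zip(arch_a, arch_b))

def greedy_novelty_subset(history, subset_size):
    """Greedily pick previously visited architectures that are maximally
    different from those already selected (a novelty-search heuristic)."""
    subset = [history[-1]]
    while len(subset) < subset_size:
        candidates = [a for a in history if a not in subset]
        if not candidates:
            break
        subset.append(max(candidates,
                          key=lambda a: min(edit_distance(a, s) for s in subset)))
    return subset

def regularized_loss(supernet, x, y, current_arch, history,
                     subset_size=4, lam=1.0):
    loss = F.cross_entropy(supernet(x, current_arch), y)
    if history:
        for prev_arch in greedy_novelty_subset(history, subset_size):
            # Training the current path should not degrade representative
            # previous paths that share its weights.
            loss = loss + lam * F.cross_entropy(supernet(x, prev_arch), y)
    return loss
```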

 

Publication:

-    IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(9), 2921-2935 (2021).

Authors:

-    Miao Zhang (Beijing Institute of Technology)

-    Huiqi Li (Beijing Institute of Technology)

-    Shirui Pan (Monash University, Australia)

-    Xiaojun Chang (Monash University, Australia)

-    Chuan Zhou (Institute of Applied Mathematics, AMSS, Chinese Academy of Sciences)

-    Zongyuan Ge (Monash University, Australia)

-    Steven Su (University of Technology Sydney, Australia)
