%0 Conference Paper %T Moonshine: Distilling with Cheap Convolutions %W https://arxiv.org/abs/1711.02613 %U https://www.research.ed.ac.uk/portal/en/publications/moonshine-distilling-with-cheap-convolutions(7063ab49-7869-4166-b53f-127db043aa1c).html %G English %B Thirty-second Conference on Neural Information Processing Systems (NIPS 2018) %A Crowley, Elliot %A Gray, Gavin %A Storkey, Amos %D 2018 %0 Conference Paper %T Accelerating Deep Neural Networks on Low Power Heterogeneous Architectures %W https://www.research.ed.ac.uk/portal/files/57938097/MULTIPROG_2018_Loukadakis.pdf %X Deep learning applications are able to recognise images and speech with great accuracy, and their use is now everywhere in our daily lives. However, developing deep learning architectures such as deep neural networks in embedded systems is a challenging task because of the demanding computational resources and power consumption. Hence, sophisticated algorithms and methods that exploit the hardware of the embedded systems need to be investigated. This paper is our first step towards examining methods and optimisations for deep neural networks that can leverage the hardware architecture of low power embedded devices. In particular, in this work we accelerate the inference time of the VGG-16 neural network on the ODROID-XU4 board. More specifically, a serial version of VGG-16 is parallelised for both the CPU and GPU present on the board using OpenMP and OpenCL. We also investigate several optimisation techniques that exploit the specific hardware architecture of the ODROID board and can accelerate the inference further. One of these optimisations uses the CLBlast library specifically tuned for the ARM Mali-T628 GPU present on the board. Overall, we improve the inference time of the initial serial version of the code by 2.8X using OpenMP, and by 9.4X using the most optimised version of OpenCL. %B 11th International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG-2018) %A Loukadakis, Manolis %A Cano, Jose %A O'Boyle, Michael %D 1/2018 %K Deep Neural Networks Heterogeneous architectures Low power embedded systems performance %0 Conference Paper %T Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors %C Bristol, United Kingdom %I IEEE %P 292-301 %W https://arxiv.org/abs/1711.05487 %@ 978-1-5386-1042-8 %U http://ieeexplore.ieee.org/document/8025303/ %B 2017 46th International Conference on Parallel Processing (ICPP) %A Elafrou, Athena %A Goumas, Georgios %A Koziris, Nectarios %D 8/2017 %K performance %0 Thesis %T Towards Secure Collaborative AI Service Chains %C Karlskrona %I Blekinge Institute of Technology, Faculty of Computing, Department of Computer Science. %R Licentiate Thesis %W http://bth.diva-portal.org/smash/record.jsf?pid=diva2%3A1341533 %U http://urn.kb.se/resolve?urn=urn%3Anbn%3Ase%3Abth-18531 %X At present, Artificial Intelligence (AI) systems have been adopted in many different domains such as healthcare, robotics, automotive, telecommunication systems, security, and finance for integrating intelligence in their services and applications. Intelligent personal assistants such as Siri and Alexa are examples of AI systems making an impact on our daily lives. Since many AI systems are data-driven systems, they require large volumes of data for training and validation, advanced algorithms, computing power and storage in their development process.
Collaboration in the AI development process (AI engineering process) will reduce the cost and time to market of AI applications. However, collaboration raises concerns about privacy and the piracy of intellectual property by the actors who collaborate in the engineering process. This work investigates the non-functional requirements, such as privacy and security, for enabling collaboration in AI service chains. It proposes an architectural design approach for collaborative AI engineering and explores the concept of the pipeline (service chain) for chaining AI functions. In order to enable controlled collaboration between AI artefacts in a pipeline, this work makes use of virtualisation technology to define and implement Virtual Premises (VPs), which act as protection wrappers for AI pipelines. A VP is a virtual policy enforcement point for a pipeline and requires access permission and authenticity for each element in a pipeline before the pipeline can be used. Furthermore, the proposed architecture is evaluated in a use-case approach that enables quick detection of design flaws during the initial stage of implementation. To evaluate the security level and compliance with security requirements, threat modeling was used to identify potential threats and vulnerabilities of the system and analyse their possible effects. The output of threat modeling was used to define countermeasures to threats related to unauthorised access and execution of AI artefacts. %G eng %A Ahmadi Mehri, Vida %D 2019 %0 Journal Article %T Robustness to adversarial examples can be improved with overfitting %V 11 %N 4 %P 935-944 %U https://doi.org/10.1007/s13042-020-01097-4 %X Deep learning (henceforth DL) has become the most powerful machine learning methodology. Under specific circumstances, recognition rates even surpass those obtained by humans. Despite this, several works have shown that deep learning produces outputs that are very far from human responses when confronted with the same task. This is the case of the so-called “adversarial examples” (henceforth AE). The fact that such implausible misclassifications exist points to a fundamental difference between machine and human learning. This paper focuses on the possible causes of this intriguing phenomenon. We first argue that the error in adversarial examples is caused by high bias, i.e. by regularization that has local negative effects. This idea is supported by our experiments, in which the robustness to adversarial examples is measured with respect to the level of fitting to training samples. Higher fitting was associated with higher robustness to adversarial examples. This ties the phenomenon to the trade-off that exists in machine learning between fitting and generalization. %G en %J International Journal of Machine Learning and Cybernetics %A Deniz, Oscar %A Pedraza, Anibal %A Vallez, Noelia %A Salido, Jesus %A Bueno, Gloria %D 2020-04-01 %0 Journal Article %T RecNets: Channel-wise Recurrent Convolutional Neural Networks %W http://arxiv.org/abs/1905.11910 %U http://arxiv.org/abs/1905.11910 %X In this paper, we introduce Channel-wise recurrent convolutional neural networks (RecNets), a family of novel, compact neural network architectures for computer vision tasks inspired by recurrent neural networks (RNNs). RecNets build upon Channel-wise recurrent convolutional (CRC) layers, a novel type of convolutional layer that splits the input channels into disjoint segments and processes them in a recurrent fashion.
In this way, we simulate wide, yet compact models, since the number of parameters is vastly reduced via the parameter sharing of the RNN formulation. Experimental results on the CIFAR-10 and CIFAR-100 image classification tasks demonstrate the superior size-accuracy trade-off of RecNets compared to other compact state-of-the-art architectures. %J arXiv:1905.11910 [cs, stat] %A Retsinas, George %A Elafrou, Athena %A Goumas, Georgios %A Maragos, Petros %D 2019-05-28 %K Computer Science - Machine Learning Statistics - Machine Learning %0 Journal Article %T How to train your MAML %W http://arxiv.org/abs/1810.09502 %U http://arxiv.org/abs/1810.09502 %X The field of few-shot learning has recently seen substantial advancements. Most of these advancements came from casting few-shot learning as a meta-learning problem. Model Agnostic Meta Learning or MAML is currently one of the best approaches for few-shot learning via meta-learning. MAML is simple, elegant and very powerful; however, it has a variety of issues, such as being very sensitive to neural network architectures, often leading to instability during training, requiring arduous hyperparameter searches to stabilize training and achieve high generalization, and being very computationally expensive at both training and inference times. In this paper, we propose various modifications to MAML, which we call MAML++, that not only stabilize the system but also substantially improve its generalization performance and convergence speed while reducing its computational overhead. %J arXiv:1810.09502 [cs, stat] %A Antoniou, Antreas %A Edwards, Harrison %A Storkey, Amos %D 2018-10-22 %K Computer Science - Machine Learning Statistics - Machine Learning %0 Conference Paper %T IoT meets distributed AI - Deployment scenarios of Bonseyes AI applications on FIWARE %P 1-2 %W https://arodes.hes-so.ch/record/4983?ln=en %X Bonseyes is an Artificial Intelligence (AI) platform composed of a Data Marketplace, a Deep Learning Toolbox, and Developer Reference Platforms, with the aim of facilitating rapid adoption of AI by tech and non-tech companies as an enabler for their business. Bonseyes provides methods and tools to speed up the development and deployment of AI solutions on low power Internet of Things (IoT) devices, embedded computing systems, and data centre servers. In this work, we address the deployment and the integration of Bonseyes AI applications in a wider enterprise application landscape involving different applications and services. We leverage the well-established IoT platform FIWARE to integrate the Bonseyes AI applications into an enterprise ecosystem. This paper presents two AI application deployment and integration scenarios using FIWARE. The first scenario addresses use cases where edge devices have enough compute power to run the AI applications and only the results need to be transmitted to the enterprise ecosystem. The second scenario covers use cases where an edge device may delegate most of the computation to an external/cloud server. Further, we employ the FIWARE IoT Agent generic enabler to manage all edge devices related to Bonseyes AI applications. Both scenarios have been validated on concrete use cases and demonstrators.
%B 2019 IEEE 38th International Performance Computing and Communications Conference (IPCCC) %A Moor, Lucien %A Bitter, Lukas %A Prado, Miguel De %A Pazos, Nuria %A Ouerhani, Nabil %D October 2019 %K AI application deployment Artificial Intelligence Bonseyes AI applications Edge Computing FIWARE Internet of Things Internet of Things devices IoT Machine Learning artificial intelligence artificial intelligence platform business data processing cloud computing deep learning toolbox developer reference platform edge device embedded systems enterprise ecosystem well-established IoT platform FIWARE %0 Journal Article %T Assume, Augment and Learn: Unsupervised Few-Shot Meta-Learning via Random Labels and Data Augmentation %W http://arxiv.org/abs/1902.09884 %U http://arxiv.org/abs/1902.09884 %X The field of few-shot learning has been laboriously explored in the supervised setting, where per-class labels are available. On the other hand, the unsupervised few-shot learning setting, where no labels of any kind are required, has seen little investigation. We propose a method, named Assume, Augment and Learn or AAL, for generating few-shot tasks using unlabeled data. We randomly label a random subset of images from an unlabeled dataset to generate a support set. Then, by applying data augmentation to the support set's images and reusing the support set's labels, we obtain a target set. The resulting few-shot tasks can be used to train any standard meta-learning framework. Once trained, such a model can be directly applied on small real-labeled datasets without any changes or fine-tuning required. In our experiments, the learned models achieve good generalization performance in a variety of established few-shot learning tasks on Omniglot and Mini-Imagenet. %J arXiv:1902.09884 [cs, stat] %A Antoniou, Antreas %A Storkey, Amos %D 2019-03-05 %K Computer Science - Machine Learning Statistics - Machine Learning %0 Conference Paper %T Performance Aware Convolutional Neural Network Channel Pruning for Embedded GPUs %C Orlando, Florida %W http://arxiv.org/abs/2002.08697 %U http://arxiv.org/abs/2002.08697 %X Convolutional Neural Networks (CNNs) are becoming a common presence in many applications and services, due to their superior recognition accuracy. They are increasingly being used on mobile devices, often just by porting large models designed for server space, although several model compression techniques have been considered. One model compression technique intended to reduce computations is channel pruning. Mobile and embedded systems now have GPUs, which are ideal for the parallel computations of neural networks and for their lower energy cost per operation. Specialized libraries perform these neural network computations through highly optimized routines. As we find in our experiments, these libraries are optimized for the most common network shapes, making uninstructed channel pruning inefficient. We evaluate higher-level libraries, which analyze the input characteristics of a convolutional layer, based on which they produce optimized OpenCL (Arm Compute Library and TVM) and CUDA (cuDNN) code. However, in reality, these characteristics and subsequent choices intended for optimization can have the opposite effect. We show that a reduction in the number of convolutional channels, pruning 12% of the initial size, is in some cases detrimental to performance, leading to a 2x slowdown.
On the other hand, we also find examples where performance-aware pruning achieves the intended results, with performance speedups of 3x with cuDNN and above 10x with Arm Compute Library and TVM. Our findings expose the need for hardware-instructed neural network pruning. %B 2019 Annual IEEE International Symposium on Workload Characterization (IISWC'19) %A Radu, Valentin %A Kaszyk, Kuba %A Wen, Yuan %A Turner, Jack %A Cano, Jose %A Crowley, Elliot J. %A Franke, Bjorn %A Storkey, Amos %A O'Boyle, Michael %D 2020-02-20 %K Computer Science - Machine Learning Statistics - Machine Learning %0 Journal Article %T A Closer Look at Structured Pruning for Neural Network Compression %U http://arxiv.org/abs/1810.04622 %X Structured pruning is a popular method for compressing a neural network: given a large trained network, one alternates between removing channel connections and fine-tuning, reducing the overall width of the network. However, the efficacy of structured pruning has largely evaded scrutiny. In this paper, we examine ResNets and DenseNets obtained through structured pruning-and-tuning and make two interesting observations: (i) reduced networks---smaller versions of the original network trained from scratch---consistently outperform pruned networks; (ii) if one takes the architecture of a pruned network and then trains it from scratch, it is significantly more competitive. Furthermore, these architectures are easy to approximate: we can prune once and obtain a family of new, scalable network architectures that can simply be trained from scratch. Finally, we compare the inference speed of reduced and pruned networks on hardware, and show that reduced networks are significantly faster. Code is available at https://github.com/BayesWatch/pytorch-prunes. %J arXiv:1810.04622 [cs, stat] %A Crowley, Elliot J. %A Turner, Jack %A Storkey, Amos %A O'Boyle, Michael %D 2019-06-07 %K Computer Science - Computer Vision and Pattern Recognition Computer Science - Machine Learning Statistics - Machine Learning %0 Book Section %T Distributed Ledger for Provenance Tracking of Artificial Intelligence Assets %S IFIP AICT Tutorials %I Springer International Publishing %W https://arxiv.org/abs/2002.11000 %@ 978-3-030-42503-6 %U https://arxiv.org/abs/2002.11000 %X High availability of data is responsible for the current trends in Artificial Intelligence (AI) and Machine Learning (ML). However, high-grade datasets are reluctantly shared between actors because of a lack of trust and a fear of losing control. Provenance tracing systems are a possible measure to build trust by improving transparency. In particular, the tracing of AI assets along complete AI value chains bears various challenges such as trust, privacy, confidentiality, traceability, and fair remuneration. In this paper, we design a graph-based provenance model for AI assets and their relations within an AI value chain. Moreover, we propose a protocol to exchange AI assets securely with selected parties. The provenance model and exchange protocol are then combined and implemented as a smart contract on a permission-less blockchain. We show how the smart contract enables the tracing of AI assets in an existing industry use case while addressing all of these challenges. Consequently, our smart contract helps to increase traceability and transparency, encourages trust between actors and thus fosters collaboration between them. %G en %B Privacy and Identity Management.
Data for Better Living: AI and Privacy: 14th IFIP WG 9.2, 9.6/11.7, 11.6/SIG 9.2.2 International Summer School, Windisch, Switzerland, August 19–23, 2019, Revised Selected Papers %A Lüthi, Philipp %A Gagnaux, Thibault %A Gygli, Marcel %E Friedewald, Michael %E Önen, Melek %E Lievens, Eva %E Krenn, Stephan %E Fricker, Samuel %D 2020 %K Computer Science - Cryptography and Security %0 Journal Article %T Distilling with Performance Enhanced Students %W https://arxiv.org/abs/1810.10460 %U https://arxiv.org/abs/1810.10460 %X The task of accelerating large neural networks on general purpose hardware has, in recent years, prompted the use of channel pruning to reduce network size. However, the efficacy of pruning based approaches has since been called into question. In this paper, we turn to distillation for model compression---specifically, attention transfer---and develop a simple method for discovering performance enhanced student networks. We combine channel saliency metrics with empirical observations of runtime performance to design more accurate networks for a given latency budget. We apply our methodology to residual and densely-connected networks, and show that we are able to find resource-efficient student networks on different hardware platforms while maintaining very high accuracy. These performance-enhanced student networks achieve up to 10% boosts in top-1 ImageNet accuracy over their channel-pruned counterparts for the same inference time. %G en %A Turner, Jack %A Crowley, Elliot J. %A Radu, Valentin %A Cano, José %A Storkey, Amos %A O'Boyle, Michael %D 2019/03/07 %0 Conference Paper %T Learning to infer: RL-based search for DNN primitive selection on Heterogeneous Embedded Systems %W https://arxiv.org/abs/1811.07315 %U https://arxiv.org/abs/1811.07315v1 %X Deep Learning is increasingly being adopted by industry for computer vision applications running on embedded devices. While Convolutional Neural Networks' accuracy has achieved a mature and remarkable state, inference latency and throughput are a major concern, especially when targeting low-cost and low-power embedded platforms. CNNs' inference latency may become a bottleneck for Deep Learning adoption by industry, as it is a crucial specification for many real-time processes. Furthermore, deployment of CNNs across heterogeneous platforms presents major compatibility issues due to vendor-specific technology and acceleration libraries. In this work, we present QS-DNN, a fully automatic search based on Reinforcement Learning which, combined with an inference engine optimizer, efficiently explores the design space and empirically finds the optimal combinations of libraries and primitives to speed up the inference of CNNs on heterogeneous embedded devices. We show that an optimized combination can achieve a 45x speedup in inference latency on CPU compared to a dependency-free baseline and 2x on average on GPGPU compared to the best vendor library. Further, we demonstrate that the quality of results and time-to-solution is much better than with Random Search, achieving up to 15x better results for a short-time search. %G en %B Proceedings of Design, Automation and Test in Europe Conference (DATE 19), March 2019 %A de Prado, Miguel %A Pazos, Nuria %A Benini, Luca %D 2019 %0 Conference Paper %T AI Pipeline - bringing AI to you.
End-to-end integration of data, algorithms and deployment tools %C Valencia, Spain %W https://arxiv.org/abs/1901.05049 %U https://arxiv.org/abs/1901.05049v1 %X The next generation of embedded Information and Communication Technology (ICT) systems are interconnected collaborative intelligent systems able to perform autonomous tasks. Training and deployment of such systems on Edge devices, however, require a fine-grained integration of data and tools to achieve high accuracy and meet functional and non-functional requirements. In this work, we present a modular AI pipeline as an integrating framework to bring data, algorithms and deployment tools together. By these means, we are able to interconnect the different entities or stages of particular systems and provide an end-to-end development of AI products. We demonstrate the effectiveness of the AI pipeline by solving an Automatic Speech Recognition challenge, and we show all the steps leading to an end-to-end development for Key-word Spotting tasks: importing, partitioning and pre-processing of speech data, training of different neural network architectures and their deployment on heterogeneous embedded platforms. %G en %B Emerging Deep Learning Accelerators (EDLA) Workshop at HiPEAC 2019 %A de Prado, Miguel %A Su, Jing %A Dahyot, Rozenn %A Saeed, Rabia %A Keller, Lorenzo %A Vallez, Noelia %D 2019/01/15 %0 Conference Paper %T Framework for Analysis of Multi-party Collaboration %P 44-53 %W http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1375006 %X In recent years, platforms have become important for allowing ecosystems to emerge that allow users to collaborate and create unprecedented forms of innovation. For the platform provider, the ecosystem represents a massive business opportunity if the platform succeeds in making the collaborations among the users value-creating and in facilitating trust. While the requirements flow for evolving existing ecosystems is understood, it is unclear how to analyse an ecosystem that is to be. In this paper, we draw on recent work on collaboration modelling in requirements engineering and propose an integrated framework for the analysis of multi-party collaboration that is to be supported by a platform. Drawing on a real-world case, we describe how the framework is applied and the results that have been obtained with it. The results indicate that the framework was useful for understanding the ecosystem context for a planned platform in the domain of artificial intelligence, allowed identification of platform requirements and offered a basis to plan validation. %B 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW) %A Maksimov, Yuliyan V. %A Fricker, Samuel A. %D Sep. 2019 %K artificial intelligence business opportunity collaboration modelling collaboration-modelling commerce ecosystem context ecosystem-requirements groupware integrated framework multiparty collaboration planned platform platform requirements platform-requirements requirements engineering requirements flow %0 Journal Article %T BlockSwap: Fisher-guided Block Substitution for Network Compression on a Budget %W https://arxiv.org/abs/1906.04113 %U http://arxiv.org/abs/1906.04113 %X The desire to map neural networks to varying-capacity devices has led to the development of a wealth of compression techniques, many of which involve replacing standard convolutional blocks in a large network with cheap alternative blocks.
However, not all blocks are created equally; for a required compute budget there may exist a potent combination of many different cheap blocks, though exhaustively searching for such a combination is prohibitively expensive. In this work, we develop BlockSwap: a fast algorithm for choosing networks with interleaved block types by passing a single minibatch of training data through randomly initialised networks and gauging their Fisher potential. These networks can then be used as students and distilled with the original large network as a teacher. We demonstrate the effectiveness of the chosen networks across CIFAR-10 and ImageNet for classification, and COCO for detection, and provide a comprehensive ablation study of our approach. BlockSwap quickly explores possible block configurations using a simple architecture ranking system, yielding highly competitive networks in orders of magnitude less time than most architecture search techniques (e.g. under 5 minutes on a single GPU for CIFAR-10). Code is available at https://github.com/BayesWatch/pytorch-blockswap. %J arXiv:1906.04113 [cs, stat] %A Turner, Jack %A Crowley, Elliot J. %A O'Boyle, Michael %A Storkey, Amos %A Gray, Gavin %D 2020-01-23 %K Computer Science - Machine Learning Statistics - Machine Learning %0 Conference Paper %T Augmenting Image Classifiers Using Data Augmentation Generative Adversarial Networks %I Springer International Publishing %P 594-603 %W https://www.research.ed.ac.uk/portal/en/publications/augmenting-image-classifiers-using-data-augmentation-generative-adversarial-networks(1554a4b8-4cfd-48bd-a5dc-80a468cfbda2).html %U https://www.research.ed.ac.uk/portal/en/publications/augmenting-image-classifiers-using-data-augmentation-generative-adversarial-networks(1554a4b8-4cfd-48bd-a5dc-80a468cfbda2).html %G English %B Artificial Neural Networks and Machine Learning – ICANN 2018 %A Antoniou, Antreas %A Storkey, Amos %A Edwards, Harrison %D 2018/09/27 %0 Journal Article %T On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length %W http://arxiv.org/abs/1807.05031v6 %U http://arxiv.org/abs/1807.05031 %X Recent work has identified that using a high learning rate or a small batch size for Stochastic Gradient Descent (SGD) based training of deep neural networks encourages finding flatter minima of the training loss towards the end of training. Moreover, measures of the flatness of minima have been shown to correlate with good generalization performance. Extending this previous work, we investigate the loss curvature through the Hessian eigenvalue spectrum in the early phase of training and find an analogous bias: even at the beginning of training, a high learning rate or small batch size influences SGD to visit flatter loss regions. In addition, the evolution of the largest eigenvalues appears to always follow a similar pattern, with a fast increase in the early phase, and a decrease or stabilization thereafter, where the peak value is determined by the learning rate and batch size. Finally, we find that by altering the learning rate just in the direction of the eigenvectors associated with the largest eigenvalues, SGD can be steered towards regions which are an order of magnitude sharper but correspond to models with similar generalization, which suggests the curvature of the endpoint found by SGD is not predictive of its generalization properties.
%G en %J Seventh International Conference on Learning Representations %A Jastrzębski, Stanisław %A Kenton, Zachary %A Ballas, Nicolas %A Fischer, Asja %A Bengio, Yoshua %A Storkey, Amos %D 2019-04-17 %K Computer Science - Machine Learning Statistics - Machine Learning %0 Journal Article %T Separable Layers Enable Structured Efficient Linear Substitutions %W http://arxiv.org/abs/1906.00859 %U http://arxiv.org/abs/1906.00859 %X In response to the development of recent efficient dense layers, this paper shows that something as simple as replacing linear components in pointwise convolutions with structured linear decompositions also produces substantial gains in the efficiency/accuracy tradeoff. Pointwise convolutions are fully connected layers and are thus prepared for replacement by structured transforms. Networks using such layers are able to learn the same tasks as those using standard convolutions, and provide Pareto-optimal benefits in efficiency/accuracy, both in terms of computation (mult-adds) and parameter count (and hence memory). Code is available at https://github.com/BayesWatch/deficient-efficient. %J arXiv:1906.00859 [cs, stat] %A Gray, Gavin %A Crowley, Elliot J. %A Storkey, Amos %D 2019-06-03 %K Computer Science - Machine Learning Statistics - Machine Learning %0 Journal Article %T Performance-Oriented Neural Architecture Search %W http://arxiv.org/abs/2001.02976 %U http://arxiv.org/abs/2001.02976 %X Hardware-Software Co-Design is a highly successful strategy for improving the performance of domain-specific computing systems. We argue for the application of the same methodology to deep learning; specifically, we propose to extend neural architecture search with information about the hardware to ensure that the model designs produced are highly efficient in addition to the typical criteria around accuracy. Using the task of keyword spotting in audio on edge computing devices, we demonstrate that our approach results in a neural architecture that is not only highly accurate, but also efficiently mapped to the computing platform which will perform the inference. Using our modified neural architecture search, we demonstrate a $0.88\%$ increase in TOP-1 accuracy with a $1.85\times$ reduction in latency for keyword spotting in audio on an embedded SoC, and a $1.59\times$ reduction on a high-end GPU. %J arXiv:2001.02976 [cs] %A Anderson, Andrew %A Su, Jing %A Dahyot, Rozenn %A Gregg, David %D 2020-01-09 %K Computer Science - Machine Learning Computer Science - Neural and Evolutionary Computing %0 Journal Article %T Scalar Arithmetic Multiple Data: Customizable Precision for Deep Neural Networks %P 61-68 %W https://arxiv.org/abs/1809.10572 %U http://arxiv.org/abs/1809.10572 %X Quantization of weights and activations in Deep Neural Networks (DNNs) is a powerful technique for network compression, and has enjoyed significant attention and success. However, much of the inference-time benefit of quantization is accessible only through the use of customized hardware accelerators or by providing an FPGA implementation of quantized arithmetic. Building on prior work, we show how to construct arbitrary bit-precise signed and unsigned integer operations using a software technique which logically \emph{embeds} a vector architecture with custom bit-width lanes in universally available fixed-width scalar arithmetic. We evaluate our approach on a high-end Intel Haswell processor, and an embedded ARM processor.
Our approach yields very fast implementations of bit-precise custom DNN operations, which often match or exceed the performance of operations quantized to the sizes supported in native arithmetic. At the strongest level of quantization, our approach yields a maximum speedup of $\thicksim6\times$ on the Intel platform, and $\thicksim10\times$ on the ARM platform versus quantization to native 8-bit integers. %J 2019 IEEE 26th Symposium on Computer Arithmetic (ARITH) %A Anderson, Andrew %A Gregg, David %D 6/2019 %K Computer Science - Computer Vision and Pattern Recognition Computer Science - Mathematical Software Computer Science - Performance %0 Conference Paper %T Characterising Across-Stack Optimisations for Deep Convolutional Neural Networks %I IEEE %P 101-110 %W https://arxiv.org/abs/1809.07196 %U https://www.research.ed.ac.uk/portal/en/publications/characterising-acrossstack-optimisations-for-deep-convolutional-neural-networks(15473b61-a560-4dee-84c8-09ea68bf603c).html %G English %B Proceedings of the 2018 IEEE International Symposium on Workload Characterization (IISWC) %A Turner, Jack %A Reyes, Jose Cano %A Radu, Valentin %A Crowley, Elliot %A O'Boyle, Michael %A Storkey, Amos %D 2018/12/13 %0 Conference Paper %T Designing a Secure IoT System Architecture from a Virtual Premise for a Collaborative AI Lab %W http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1284028&dswid=-7556 %U http://urn.kb.se/resolve?urn=urn:nbn:se:bth-17550 %G eng %B Proceedings of the Workshop on Decentralized IoT Systems and Security (DISS) %A Mehri, Vida A. %A Ilie, Dragos %A Tutschku, Kurt %D 2019 %0 Conference Paper %T DNN's Sharpest Directions Along the SGD Trajectory %W https://arxiv.org/abs/1807.05031v1 %U https://arxiv.org/abs/1807.05031v1 %X Recent work has identified that using a high learning rate or a small batch size for Stochastic Gradient Descent (SGD) based training of deep neural networks encourages finding flatter minima of the training loss towards the end of training. Moreover, measures of the flatness of minima have been shown to correlate with good generalization performance. Extending this previous work, we investigate the loss curvature through the Hessian eigenvalue spectrum in the early phase of training and find an analogous bias: even at the beginning of training, a high learning rate or small batch size influences SGD to visit flatter loss regions. In addition, the evolution of the largest eigenvalues appears to always follow a similar pattern, with a fast increase in the early phase, and a decrease or stabilization thereafter, where the peak value is determined by the learning rate and batch size. Finally, we find that by altering the learning rate just in the direction of the eigenvectors associated with the largest eigenvalues, SGD can be steered towards regions which are an order of magnitude sharper but correspond to models with similar generalization, which suggests the curvature of the endpoint found by SGD is not predictive of its generalization properties.
%G en %B Modern Trends in Nonconvex Optimization for Machine Learning workshop at International Conference on Machine Learning 2018 %A Jastrzębski, Stanisław %A Kenton, Zachary %A Ballas, Nicolas %A Fischer, Asja %A Bengio, Yoshua %A Storkey, Amos %D 2018/07/13 %0 Conference Paper %T Privacy and DRM Requirements for Collaborative Development of AI Applications %C Hamburg, Germany %I ACM Press %P 1-8 %W http://bth.diva-portal.org/smash/record.jsf?pid=diva2%3A1238658 %@ 978-1-4503-6448-5 %U http://dl.acm.org/citation.cfm?doid=3230833.3233268 %G en %B Proceedings of the 13th International Conference on Availability, Reliability and Security - ARES 2018 %A Mehri, Vida Ahmadi %A Ilie, Dragos %A Tutschku, Kurt %D 2018 %0 Conference Paper %T Towards Privacy Requirements for Collaborative Development of AI Applications %C Karlskrona %W http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1217364&dswid=-1122 %U http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1217364&dswid=-1122 %X The use of data is essential for the capabilities of Data-driven Artificial Intelligence (AI), Deep Learning and Big Data analysis techniques. The use of data, however, raises intrinsically the co ... %G eng %B 14th Swedish National Computer Networking Workshop (SNCNW), 2018 %A Ahmadi Mehri, Vida %A Ilie, Dragos %A Tutschku, Kurt %D 2018 %0 Conference Paper %T Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio %V 11141 %C Cham %I Springer International Publishing %P 392-402 %W https://www.research.ed.ac.uk/portal/en/publications/width-of-minima-reached-by-stochastic-gradient-descent-is-influenced-by-learning-rate-to-batch-size-ratio(1b1d210a-efed-44b7-8907-c1506f70a64d).html %@ 978-3-030-01423-0 978-3-030-01424-7 %U http://link.springer.com/10.1007/978-3-030-01424-7_39 %B Artificial Neural Networks and Machine Learning – ICANN 2018 %E Kůrková, Věra %E Manolopoulos, Yannis %E Hammer, Barbara %E Iliadis, Lazaros %E Maglogiannis, Ilias %A Jastrzębski, Stanislaw %A Kenton, Zachary %A Arpit, Devansh %A Ballas, Nicolas %A Fischer, Asja %A Bengio, Yoshua %A Storkey, Amos %D 2018 %0 Conference Paper %T Three Factors Influencing Minima in SGD %W https://arxiv.org/abs/1711.04623 %U https://arxiv.org/abs/1711.04623v3 %X We investigate the dynamical and convergent properties of stochastic gradient descent (SGD) applied to Deep Neural Networks (DNNs). Characterizing the relation between learning rate, batch size and the properties of the final minima, such as width or generalization, remains an open question. In order to tackle this problem we investigate the previously proposed approximation of SGD by a stochastic differential equation (SDE). We theoretically argue that three factors - learning rate, batch size and gradient covariance - influence the minima found by SGD. In particular we find that the ratio of learning rate to batch size is a key determinant of SGD dynamics and of the width of the final minima, and that higher values of the ratio lead to wider minima and often better generalization. We confirm these findings experimentally. Further, we include experiments which show that learning rate schedules can be replaced with batch size schedules and that the ratio of learning rate to batch size is an important factor influencing the memorization process.
%G en %B International Conference on Artificial Neural Networks 2018 %A Jastrzębski, Stanisław %A Kenton, Zachary %A Arpit, Devansh %A Ballas, Nicolas %A Fischer, Asja %A Bengio, Yoshua %A Storkey, Amos %D 2018 %0 Conference Paper %T QUENN: QUantization Engine for low-power Neural Networks %W https://arxiv.org/abs/1811.05896 %U https://arxiv.org/abs/1811.05896v1 %X Deep Learning is moving to edge devices, ushering in a new age of distributed Artificial Intelligence (AI). The high demand for computational resources required by deep neural networks may be alleviated by approximate computing techniques, and most notably reduced-precision arithmetic with coarsely quantized numerical representations. In this context, Bonseyes comes in as an initiative to enable stakeholders to bring AI to low-power and autonomous environments such as Automotive, Medical Healthcare and Consumer Electronics. To achieve this, we introduce LPDNN, a framework for optimized deployment of Deep Neural Networks on heterogeneous embedded devices. In this work, we detail the quantization engine that is integrated into LPDNN. The engine depends on a fine-grained workflow which enables a Neural Network Design Exploration and a sensitivity analysis of each layer for quantization. We demonstrate the engine with a case study on Alexnet and VGG16 for three different techniques for direct quantization: standard fixed-point, dynamic fixed-point and k-means clustering, and demonstrate the potential of the latter. We argue that using a Gaussian quantizer with k-means clustering can achieve better performance than linear quantizers. Without retraining, we achieve over 55.64\% savings in weight storage and 69.17\% in run-time memory accesses, with less than a 1\% drop in top-5 accuracy on ImageNet. %G en %B CF '18 Proceedings of the 15th ACM International Conference on Computing Frontiers %A de Prado, Miguel %A Denna, Maurizio %A Benini, Luca %A Pazos, Nuria %D 2018/11/14 %0 Conference Paper %T Privacy and trust in cloud-based marketplaces for AI and data resources %I Springer New York LLC %P 223-225 %U http://urn.kb.se/resolve?urn=urn:nbn:se:bth-14841 %G eng %B IFIPTM: IFIP International Conference on Trust Management %A Mehri, Vida A. %A Tutschku, Kurt %D 2017 %0 Conference Paper %T Parallel Multi Channel convolution using General Matrix Multiplication %I IEEE %P 19-24 %W http://arxiv.org/abs/1704.04428 %@ 978-1-5090-4825-0 %U http://ieeexplore.ieee.org/document/7995254/ %B 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP) %A Vasudevan, Aravind %A Anderson, Andrew %A Gregg, David %D 7/2017 %0 Conference Paper %T Artifact Compatibility for Enabling Collaboration in the Artificial Intelligence Ecosystem %V 336 %C Cham %I Springer International Publishing %P 56-71 %W http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1217749 %@ 978-3-030-04839-6 978-3-030-04840-2 %U http://link.springer.com/10.1007/978-3-030-04840-2_5 %B Software Business %E Wnuk, Krzysztof %E Brinkkemper, Sjaak %A Maksimov, Yuliyan V. %A Fricker, Samuel A.
%A Tutschku, Kurt %D 2018 %0 Conference Paper %T Flexible Privacy and High Trust in the Next Generation Internet: The Use Case of a Cloud-based Marketplace for AI %I Halmstad University %U http://urn.kb.se/resolve?urn=urn:nbn:se:bth-14963 %X Cloudified architectures facilitate resource access and sharing which is independent of physical locations. They permit high availability of resources at low operational costs. These advantages, ... %G eng %B SNCNW - Swedish National Computer Networking Workshop, Halmstad %A Mehri, Vida A. %A Tutschku, Kurt %D 2017 %0 Conference Paper %T Pricing of Data Products in Data Marketplaces %S Lecture Notes in Business Information Processing %I Springer, Cham %P 49-66 %W http://www.diva-portal.se/smash/get/diva2:1163530/FULLTEXT01.pdf %@ 978-3-319-69190-9 978-3-319-69191-6 %U https://link.springer.com/chapter/10.1007/978-3-319-69191-6_4 %X Mobile computing and the Internet of Things promise massive amounts of data for big data analytics and machine learning. A data sharing economy is needed to make that data available for companies that wish to develop smart systems and services. While digital markets for trading data are emerging, there is no consolidated understanding of how to price data products and thus offer data vendors incentives for sharing data. This paper uses a combined keyword search and snowballing approach to systematically review the literature on the pricing of data products that are to be offered on marketplaces. The results give insights into the maturity and character of data pricing. They enable practitioners to select a pricing approach suitable for their situation and researchers to extend and mature data pricing as a topic. %G en %B Software Business %A Fricker, Samuel A. %A Maksimov, Yuliyan V. %D 2017/06/12 %0 Journal Article %T Low-memory GEMM-based convolution algorithms for deep neural networks %W http://arxiv.org/abs/1709.03395 %U http://arxiv.org/abs/1709.03395 %X Deep neural networks (DNNs) require very large amounts of computation both for training and for inference when deployed in the field. A common approach to implementing DNNs is to recast the most computationally expensive operations as general matrix multiplication (GEMM). However, as we demonstrate in this paper, there are a great many different ways to express DNN convolution operations using GEMM. Although different approaches all perform the same number of operations, the size of temporary data structures differs significantly. Convolution of an input matrix with dimensions $C \times H \times W$ requires $O(K^2CHW)$ additional space using the classical im2col approach. More recently, memory-efficient approaches requiring just $O(KCHW)$ auxiliary space have been proposed. We present two novel GEMM-based algorithms that require just $O(MHW)$ and $O(KW)$ additional space respectively, where $M$ is the number of channels in the result of the convolution. These algorithms dramatically reduce the space overhead of DNN convolution, making it much more suitable for memory-limited embedded systems. Experimental evaluation shows that our low-memory algorithms are just as fast as the best patch-building approaches despite requiring just a fraction of the amount of additional memory. Our low-memory algorithms have excellent data locality which gives them a further edge over patch-building algorithms when multiple cores are used. As a result, our low-memory algorithms often outperform the best patch-building algorithms using multiple threads.
%J arXiv:1709.03395 [cs] %A Anderson, Andrew %A Vasudevan, Aravind %A Keane, Cormac %A Gregg, David %D 2017-09-08 %K Computer Science - Computer Vision and Pattern Recognition %0 Conference Paper %T Optimal DNN primitive selection with partitioned boolean quadratic programming %C Vienna, Austria %I ACM Press %P 340-351 %W http://arxiv.org/abs/1710.01079 %@ 978-1-4503-5617-6 %U http://dl.acm.org/citation.cfm?doid=3168805 %X Deep Neural Networks (DNNs) require very large amounts of computation both for training and for inference when deployed in the field. Many different algorithms have been proposed to implement the most computationally expensive layers of DNNs. Further, each of these algorithms has a large number of variants, which offer different trade-offs of parallelism, data locality, memory footprint, and execution time. In addition, specific algorithms operate much more efficiently on specialized data layouts and formats. We state the problem of optimal primitive selection in the presence of data format transformations, and show that it is NP-hard by demonstrating an embedding in the Partitioned Boolean Quadratic Assignment problem (PBQP). We propose an analytic solution via a PBQP solver, and evaluate our approach experimentally by optimizing several popular DNNs using a library of more than 70 DNN primitives, on an embedded platform and a general purpose platform. We show experimentally that significant gains are possible versus the state-of-the-art vendor libraries by using a principled analytic solution to the problem of layout selection in the presence of data format transformations. %G en %B Proceedings of the 2018 International Symposium on Code Generation and Optimization - CGO 2018 %A Anderson, Andrew %A Gregg, David %D 2018 %0 Conference Paper %T BONSEYES: Platform for Open Development of Systems of Artificial Intelligence: Invited Paper %S CF'17 %C New York, NY, USA %I ACM %P 299–304 %@ 978-1-4503-4487-6 %U http://doi.acm.org/10.1145/3075564.3076259 %X The Bonseyes EU H2020 collaborative project aims to develop a platform consisting of a Data Marketplace, a Deep Learning Toolbox, and Developer Reference Platforms for organizations wanting to adopt Artificial Intelligence. The project will be focused on using artificial intelligence in low power Internet of Things (IoT) devices ("edge computing"), embedded computing systems, and data center servers ("cloud computing"). It will bring about orders of magnitude improvements in efficiency, performance, reliability, security, and productivity in the design and programming of systems of artificial intelligence that incorporate Smart Cyber-Physical Systems (CPS). In addition, it will solve a causality problem for organizations that lack access to Data and Models. Its open software architecture will facilitate adoption of the whole concept on a wider scale. To evaluate the effectiveness, technical feasibility, and to quantify the real-world improvements in efficiency, security, performance, effort and cost of adding AI to products and services using the Bonseyes platform, four complementary demonstrators will be built. Bonseyes platform capabilities are aimed at being aligned with the European FI-PPP activities and take advantage of its flagship project FIWARE. This paper provides a description of the project motivation, goals and preliminary work. %B Proceedings of the Computing Frontiers Conference %A Llewellynn, Tim %A Fernández-Carrobles, M.
Milagro %A Deniz, Oscar %A Fricker, Samuel %A Storkey, Amos %A Pazos, Nuria %A Velikic, Gordana %A Leufgen, Kirsten %A Dahyot, Rozenn %A Koller, Sebastian %A Goumas, Georgios %A Leitner, Peter %A Dasika, Ganesh %A Wang, Lei %A Tutschku, Kurt %D 2017 %K Data marketplace Deep Learning Internet of things Smart Cyber-Physical Systems