Intelligent Robot Learning Laboratory (IRL Lab) Journal Articles

### 2017

• Tim Brys, Anna Harutyunyan, Peter Vrancx, Ann Nowé, and Matthew E. Taylor. Multi-objectivization and ensembles of shapings in reinforcement learning. Neurocomputing, 263:48-59, 2017. Multiobjective Reinforcement Learning: Theory and Applications

Ensemble techniques are a powerful approach to creating better decision makers in machine learning. A number of decision makers is trained to solve a given task, grouped in an ensemble, and their decisions are aggregated. The ensemble derives its power from the diversity of its components, as the assumption is that they make mistakes on different inputs, and that the majority is more likely to be correct than any individual component. Diversity usually comes from the different algorithms employed by the decision makers, or the different inputs used to train the decision makers. We advocate a third way to achieve this diversity, based on multi-objectivization. This is the process of taking a single-objective problem and transforming it into a multi-objective problem in order to solve the original problem faster and/or better. This is either done through decomposition of the original objective, or the addition of extra objectives, typically based on some (heuristic) domain knowledge. This process basically creates a diverse set of feedback signals for what is underneath still a single-objective problem. In the context of ensemble techniques, these various ways to evaluate a (solution to a) problem allow for different components of the ensemble to look at the problem in different ways, generating the necessary diversity for the ensemble. In this paper, we argue for the combination of multi-objectivization and ensemble techniques as a powerful tool to boost solving performance in reinforcement learning. We inject various pieces of heuristic information through reward shaping, creating several distinct enriched reward signals, which can strategically be combined using ensemble techniques to reduce sample complexity. We demonstrate the potential of the approach with a range of experiments.

@article{2017Neurocomputing-Brys,
author={Brys, Tim and Harutyunyan, Anna and Vrancx, Peter and Nowé, Ann and Taylor, Matthew E.},
title={{Multi-objectivization and ensembles of shapings in reinforcement learning}},
journal={{Neurocomputing}},
volume={263},
number={},
pages={48--59},
year={2017},
note={Multiobjective Reinforcement Learning: Theory and Applications},
issn={0925-2312},
doi={10.1016/j.neucom.2017.02.096},
url={http://www.sciencedirect.com/science/article/pii/S0925231217310962},
keywords={Reinforcement learning, Multi-objectivization, Ensemble techniques, Reward shaping},
abstract={Ensemble techniques are a powerful approach to creating better decision makers in machine learning. A number of decision makers is trained to solve a given task, grouped in an ensemble, and their decisions are aggregated. The ensemble derives its power from the diversity of its components, as the assumption is that they make mistakes on different inputs, and that the majority is more likely to be correct than any individual component. Diversity usually comes from the different algorithms employed by the decision makers, or the different inputs used to train the decision makers.
We advocate a third way to achieve this diversity, based on multi-objectivization. This is the process of taking a single-objective problem and transforming it into a multi-objective problem in order to solve the original problem faster and/or better. This is either done through decomposition of the original objective, or the addition of extra objectives, typically based on some (heuristic) domain knowledge. This process basically creates a diverse set of feedback signals for what is underneath still a single-objective problem. In the context of ensemble techniques, these various ways to evaluate a (solution to a) problem allow for different components of the ensemble to look at the problem in different ways, generating the necessary diversity for the ensemble.
In this paper, we argue for the combination of multi-objectivization and ensemble techniques as a powerful tool to boost solving performance in reinforcement learning. We inject various pieces of heuristic information through reward shaping, creating several distinct enriched reward signals, which can strategically be combined using ensemble techniques to reduce sample complexity. We demonstrate the potential of the approach with a range of experiments.}
}
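
The abstract above describes building several shaped reward signals from heuristics and combining the resulting learners in an ensemble. A minimal sketch of that idea follows, assuming potential-based shaping and majority voting; the abstract names neither, and the potentials, the corridor task, and the function names are hypothetical illustrations rather than the authors' implementation.

```python
from collections import Counter

GAMMA = 0.9  # discount factor (assumed value)


def shaped_reward(reward, potential, state, next_state, gamma=GAMMA):
    """Base reward plus the potential-based shaping term gamma*Phi(s') - Phi(s)."""
    return reward + gamma * potential(next_state) - potential(state)


def majority_vote(q_tables, state, actions):
    """Each ensemble member votes for its greedy action.

    Ties are broken in favour of the action listed first in `actions`.
    """
    votes = Counter()
    for q in q_tables:
        best = max(actions, key=lambda a: q.get((state, a), 0.0))
        votes[best] += 1
    top = max(votes.values())
    return next(a for a in actions if votes[a] == top)


# Hypothetical heuristics for a 1-D corridor whose goal is at position 10.
potentials = [
    lambda s: -abs(10 - s),    # being closer to the goal is better
    lambda s: float(s >= 5),   # being past the midpoint is better
]

# One transition (4 -> 5) with zero base reward, seen through each shaping.
rewards = [shaped_reward(0.0, phi, 4, 5) for phi in potentials]

# Three hypothetical Q-tables, one per ensemble member, for state 0.
q_tables = [{(0, "L"): 1.0}, {(0, "R"): 2.0}, {(0, "R"): 0.5}]
choice = majority_vote(q_tables, 0, ["L", "R"])
```

Each potential turns the same transition into a different learning signal, which is where the diversity among ensemble members would come from in this sketch.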

• Yunxiang Ye, Zhaodong Wang, Dylan Jones, Long He, Matthew E. Taylor, Geoffrey A. Hollinger, and Qin Zhang. Bin-Dog: A Robotic Platform for Bin Management in Orchards. Robotics, 6(2), 2017.

Bin management during apple harvest season is an important activity for orchards. Typically, empty and full bins are handled by tractor-mounted forklifts or bin trailers in two separate trips. In order to simplify this work process and improve work efficiency of bin management, the concept of a robotic bin-dog system is proposed in this study. This system is designed with a “go-over-the-bin” feature, which allows it to drive over bins between tree rows and complete the above process in one trip. To validate this system concept, a prototype and its control and navigation system were designed and built. Field tests were conducted in a commercial orchard to validate its key functionalities in three tasks including headland turning, straight-line tracking between tree rows, and “go-over-the-bin.” Tests of the headland turning showed that bin-dog followed a predefined path to align with an alleyway with lateral and orientation errors of 0.02 m and 1.5°. Tests of straight-line tracking showed that bin-dog could successfully track the alleyway centerline at speeds up to 1.00 m·s⁻¹ with an RMSE offset of 0.07 m. The navigation system also successfully guided the bin-dog to complete the task of go-over-the-bin at a speed of 0.60 m·s⁻¹. The successful validation tests proved that the prototype can achieve all desired functionality.

@article{2017Robotics-Ye,
author={Ye, Yunxiang and Wang, Zhaodong and Jones, Dylan and He, Long and Taylor, Matthew E. and Hollinger, Geoffrey A. and Zhang, Qin},
title={{Bin-Dog: A Robotic Platform for Bin Management in Orchards}},
journal={{Robotics}},
volume={6},
year={2017},
number={2},
url={http://www.mdpi.com/2218-6581/6/2/12},
issn={2218-6581},
doi={10.3390/robotics6020012},
abstract={Bin management during apple harvest season is an important activity for orchards. Typically, empty and full bins are handled by tractor-mounted forklifts or bin trailers in two separate trips. In order to simplify this work process and improve work efficiency of bin management, the concept of a robotic bin-dog system is proposed in this study. This system is designed with a “go-over-the-bin” feature, which allows it to drive over bins between tree rows and complete the above process in one trip. To validate this system concept, a prototype and its control and navigation system were designed and built. Field tests were conducted in a commercial orchard to validate its key functionalities in three tasks including headland turning, straight-line tracking between tree rows, and “go-over-the-bin.” Tests of the headland turning showed that bin-dog followed a predefined path to align with an alleyway with lateral and orientation errors of 0.02 m and 1.5°. Tests of straight-line tracking showed that bin-dog could successfully track the alleyway centerline at speeds up to 1.00 m·s⁻¹ with an RMSE offset of 0.07 m. The navigation system also successfully guided the bin-dog to complete the task of go-over-the-bin at a speed of 0.60 m·s⁻¹. The successful validation tests proved that the prototype can achieve all desired functionality.}
}

• Yusen Zhan, Haitham Bou Ammar, and Matthew E. Taylor. Non-convex Policy Search Using Variational Inequalities. Neural Computation, 29(10):2800-2824, 2017.

Policy search is a class of reinforcement learning algorithms for finding optimal policies in control problems with limited feedback. These methods have been shown to be successful in high-dimensional problems, such as robotics control. Though successful, current methods can lead to unsafe policy parameters potentially damaging hardware units. Motivated by such constraints, projection-based methods are proposed for safe policies. These methods, however, can only handle convex policy constraints. In this paper, we propose the first safe policy search reinforcement learner capable of operating under non-convex policy constraints. This is achieved by observing, for the first time, a connection between non-convex variational inequalities and policy search problems. We provide two algorithms, i.e., Mann and two-step iteration, to solve the above problems and prove convergence in the non-convex stochastic setting. Finally, we demonstrate the performance of the above algorithms on six benchmark dynamical systems and show that our new method is capable of outperforming previous methods under a variety of settings.

@article{2017NeuralComputation-Zhan,
author={Zhan, Yusen and Bou Ammar, Haitham and Taylor, Matthew E.},
title={{Non-convex Policy Search Using Variational Inequalities}},
journal={{Neural Computation}},
volume={29},
number={10},
pages={2800--2824},
year={2017},
doi={10.1162/neco_a_01004},
abstract={Policy search is a class of reinforcement learning algorithms for finding optimal policies in control problems with limited feedback. These methods have been shown to be successful in high-dimensional problems, such as robotics control. Though successful, current methods can lead to unsafe policy parameters potentially damaging hardware units. Motivated by such constraints, projection-based methods are proposed for safe policies.
These methods, however, can only handle convex policy constraints. In this paper, we propose the first safe policy search reinforcement learner capable of operating under non-convex policy constraints. This is achieved by observing, for the first time, a connection between non-convex variational inequalities and policy search problems. We provide two algorithms, i.e., Mann and two-step iteration, to solve the above problems and prove convergence in the non-convex stochastic setting. Finally, we demonstrate the performance of the above algorithms on six benchmark dynamical systems and show that our new method is capable of outperforming previous methods under a variety of settings.}
}
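
For readers unfamiliar with the name, the Mann iteration mentioned in the abstract above is, in its classical form, a relaxed fixed-point scheme. The version below is the textbook one for a generic map $T$; the symbols $\theta_k$, $\alpha_k$, and $T$ are notation assumed here, and the abstract does not say how the paper instantiates them under non-convex constraints.

```latex
\theta_{k+1} = (1 - \alpha_k)\,\theta_k + \alpha_k\, T(\theta_k), \qquad \alpha_k \in (0, 1]
```

With $\alpha_k = 1$ this reduces to the plain fixed-point update $\theta_{k+1} = T(\theta_k)$; smaller step sizes average the new iterate with the previous one.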

• Yusen Zhan, Haitham Bou Ammar, and Matthew E. Taylor. Scalable Lifelong Reinforcement Learning. Pattern Recognition, 72:407-418, 2017.

Lifelong reinforcement learning provides a successful framework for agents to learn multiple consecutive tasks sequentially. Current methods, however, suffer from scalability issues when the agent has to solve a large number of tasks. In this paper, we remedy the above drawbacks and propose a novel scalable technique for lifelong reinforcement learning. We derive an algorithm which assumes the availability of multiple processing units and computes shared repositories and local policies using only local information exchange. We then show an improvement to reach a linear convergence rate compared to current lifelong policy search methods. Finally, we evaluate our technique on a set of benchmark dynamical systems and demonstrate learning speed-ups and reduced running times.

@article{2017PatternRecognition-Zhan,
author={Zhan, Yusen and Bou Ammar, Haitham and Taylor, Matthew E.},
title={{Scalable Lifelong Reinforcement Learning}},
journal={{Pattern Recognition}},
year={2017},
issn={0031-3203},
volume={72},
pages={407--418},
doi={10.1016/j.patcog.2017.07.031},
url={http://www.sciencedirect.com/science/article/pii/S0031320317303023},
keywords={Reinforcement learning, Lifelong learning, Distributed optimization, Transfer learning},
abstract={Lifelong reinforcement learning provides a successful framework for agents to learn multiple consecutive tasks sequentially. Current methods, however, suffer from scalability issues when the agent has to solve a large number of tasks.
In this paper, we remedy the above drawbacks and propose a novel scalable technique for lifelong reinforcement learning. We derive an algorithm which assumes the availability of multiple processing units and computes shared repositories and local policies using only local information exchange. We then show an improvement to reach a \emph{linear convergence rate} compared to current lifelong policy search methods. Finally, we evaluate our technique on a set of benchmark dynamical systems and demonstrate learning speed-ups and reduced running times.}
}

### 2016

• Chris Cain, Anne Anderson, and Matthew E. Taylor. Content-Independent Classroom Gamification. Computers in Education Journal, 7(4):93-106, October–December 2016.

This paper introduces Topic-INdependent Gamification Learning Environment (TINGLE), a framework designed to increase student motivation and engagement in the classroom through the use of a game played outside the classroom. A 131-person pilot study was implemented in a construction management course. Game statistics and survey responses were recorded to estimate the effect of the game and correlations with student traits. While the data analyzed so far is mostly inconclusive, this study served as an important first step toward content-independent gamification.

@article{2016CoED-Cain,
author={Cain, Chris and Anderson, Anne and Taylor, Matthew E.},
title={{Content-Independent Classroom Gamification}},
journal={{Computers in Education Journal}},
volume={7},
number={4},
pages={93--106},
month={October--December},
year={2016},
abstract={This paper introduces Topic-INdependent Gamification Learning Environment (TINGLE), a framework designed to increase student motivation and engagement in the classroom through the use of a game played outside the classroom. A 131-person pilot study was implemented in a construction management course. Game statistics and survey responses were recorded to estimate the effect of the game and correlations with student traits. While the data analyzed so far is mostly inconclusive, this study served as an important first step toward content-independent gamification.}
}

• Pablo Hernandez-Leal, Yusen Zhan, Matthew E. Taylor, L. Enrique Sucar, and Enrique Munoz de Cote. Efficiently detecting switches against non-stationary opponents. Autonomous Agents and Multi-Agent Systems, pages 1-23, November 2016.

Interactions in multiagent systems are generally more complicated than single agent ones. Game theory provides solutions on how to act in multiagent scenarios; however, it assumes that all agents will act rationally. Moreover, some works also assume the opponent will use a stationary strategy. These assumptions usually do not hold in real world scenarios where agents have limited capacities and may deviate from a perfect rational response. Our goal is still to act optimally in these cases by learning the appropriate response and without any prior policies on how to act. Thus, we focus on the problem when another agent in the environment uses different stationary strategies over time. This will turn the problem into learning in a non-stationary environment, posing a problem for most learning algorithms. This paper introduces DriftER, an algorithm that (1) learns a model of the opponent, (2) uses that to obtain an optimal policy and then (3) determines when it must re-learn due to an opponent strategy change. We provide theoretical results showing that DriftER is guaranteed to detect switches with high probability. Also, we provide empirical results showing that our approach outperforms state-of-the-art algorithms in normal-form games such as the prisoner's dilemma and then in a more realistic scenario, the Power TAC simulator.

@article{2016JAAMAS2-Hernandez-Leal,
author={Pablo Hernandez-Leal and Yusen Zhan and Matthew E. Taylor and L. Enrique {Sucar} and Enrique {Munoz de Cote}},
title={{Efficiently detecting switches against non-stationary opponents}},
journal={{Autonomous Agents and Multi-Agent Systems}},
pages={1--23},
month={November},
year={2016},
doi={10.1007/s10458-016-9352-6},
url={http://dx.doi.org/10.1007/s10458-016-9352-6},
issn={1387-2532},
abstract={Interactions in multiagent systems are generally more complicated than single agent ones. Game theory provides solutions on how to act in multiagent scenarios; however, it assumes that all agents will act rationally. Moreover, some works also assume the opponent will use a stationary strategy. These assumptions usually do not hold in real world scenarios where agents have limited capacities and may deviate from a perfect rational response. Our goal is still to act optimally in these cases by learning the appropriate response and without any prior policies on how to act. Thus, we focus on the problem when another agent in the environment uses different stationary strategies over time. This will turn the problem into learning in a non-stationary environment, posing a problem for most learning algorithms. This paper introduces DriftER, an algorithm that (1) learns a model of the opponent, (2) uses that to obtain an optimal policy and then (3) determines when it must re-learn due to an opponent strategy change. We provide theoretical results showing that DriftER is guaranteed to detect switches with high probability. Also, we provide empirical results showing that our approach outperforms state-of-the-art algorithms in normal-form games such as the prisoner's dilemma and then in a more realistic scenario, the Power TAC simulator.}
}
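
Step (3) of the abstract above, deciding when the opponent has changed strategy, can be illustrated with a simple error-rate monitor. This is a minimal sketch under assumptions: the sliding window, the fixed threshold, and the class name `SwitchDetector` are hypothetical and are not DriftER's actual statistical test, which the abstract does not describe.

```python
from collections import deque


class SwitchDetector:
    """Flags a likely strategy change when the learned opponent model's
    recent prediction error rate exceeds a threshold (illustrative rule)."""

    def __init__(self, window=20, threshold=0.5):
        self.errors = deque(maxlen=window)  # most recent prediction errors
        self.threshold = threshold

    def update(self, predicted_action, observed_action):
        """Record one round; return True if a switch is suspected."""
        self.errors.append(predicted_action != observed_action)
        window_full = len(self.errors) == self.errors.maxlen
        error_rate = sum(self.errors) / len(self.errors)
        return window_full and error_rate > self.threshold

    def reset(self):
        """Call after re-learning the opponent model."""
        self.errors.clear()


detector = SwitchDetector(window=5, threshold=0.5)

# Hypothetical opponent: cooperates ("C") for 10 rounds, then defects ("D").
# The learned model keeps predicting "C" throughout.
observed = ["C"] * 10 + ["D"] * 10
flags = [detector.update("C", action) for action in observed]
first_flag = flags.index(True)  # round index at which the switch is flagged
```

In this toy run the detector needs a few post-switch rounds before the windowed error rate crosses the threshold, which reflects the usual trade-off between detection delay and false alarms.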

• Pablo Hernandez-Leal, Yusen Zhan, Matthew E. Taylor, L. Enrique Sucar, and Enrique Munoz de Cote. An exploration strategy for non-stationary opponents. Autonomous Agents and Multi-Agent Systems, pages 1-32, October 2016.

The success or failure of any learning algorithm is partially due to the exploration strategy it exerts. However, most exploration strategies assume that the environment is stationary and non-strategic. In this work we shed light on how to design exploration strategies in non-stationary and adversarial environments. Our proposed adversarial drift exploration (DE) is able to efficiently explore the state space while keeping track of regions of the environment that have changed. This proposed exploration is general enough to be applied in single agent non-stationary environments as well as in multiagent settings where the opponent changes its strategy in time. We use a two agent strategic interaction setting to test this new type of exploration, where the opponent switches between different behavioral patterns to emulate a non-deterministic, stochastic and adversarial environment. The agent’s objective is to learn a model of the opponent’s strategy to act optimally. Our contribution is twofold. First, we present DE as a strategy for switch detection. Second, we propose a new algorithm called R-max# for learning and planning against non-stationary opponents. To handle such opponents, R-max# reasons and acts in terms of two objectives: (1) to maximize utilities in the short term while learning and (2) eventually explore opponent behavioral changes. We provide theoretical results showing that R-max# is guaranteed to detect the opponent’s switch and learn a new model in terms of finite sample complexity. R-max# makes efficient use of exploration experiences, which results in rapid adaptation and efficient DE, to deal with the non-stationary nature of the opponent. We show experimentally how using DE outperforms the state-of-the-art algorithms that were explicitly designed for modeling opponents (in terms of average rewards) in two complementary domains.

@article{2016JAAMAS-Hernandez-Leal,
author={Pablo Hernandez-Leal and Yusen Zhan and Matthew E. Taylor and L. Enrique {Sucar} and Enrique {Munoz de Cote}},
title={{An exploration strategy for non-stationary opponents}},
journal={{Autonomous Agents and Multi-Agent Systems}},
pages={1--32},
month={October},
year={2016},
issn={1573-7454},
doi={10.1007/s10458-016-9347-3},
url={http://dx.doi.org/10.1007/s10458-016-9347-3},
abstract={The success or failure of any learning algorithm is partially due to the exploration strategy it exerts. However, most exploration strategies assume that the environment is stationary and non-strategic. In this work we shed light on how to design exploration strategies in non-stationary and adversarial environments. Our proposed adversarial drift exploration (DE) is able to efficiently explore the state space while keeping track of regions of the environment that have changed. This proposed exploration is general enough to be applied in single agent non-stationary environments as well as in multiagent settings where the opponent changes its strategy in time. We use a two agent strategic interaction setting to test this new type of exploration, where the opponent switches between different behavioral patterns to emulate a non-deterministic, stochastic and adversarial environment. The agent's objective is to learn a model of the opponent's strategy to act optimally. Our contribution is twofold. First, we present DE as a strategy for switch detection. Second, we propose a new algorithm called R-max{\#} for learning and planning against non-stationary opponents. To handle such opponents, R-max{\#} reasons and acts in terms of two objectives: (1) to maximize utilities in the short term while learning and (2) eventually explore opponent behavioral changes. We provide theoretical results showing that R-max{\#} is guaranteed to detect the opponent's switch and learn a new model in terms of finite sample complexity. R-max{\#} makes efficient use of exploration experiences, which results in rapid adaptation and efficient DE, to deal with the non-stationary nature of the opponent. We show experimentally how using DE outperforms the state-of-the-art algorithms that were explicitly designed for modeling opponents (in terms of average rewards) in two complementary domains.}
}

• Yang Hu and Matthew E. Taylor. A Computer-Aided Design Intelligent Tutoring System Teaching Strategic Flexibility. Transactions on Techniques for STEM Education, October–December 2016.

@article{2016STEMTransactions-Yang,
author={Hu, Yang and Taylor, Matthew E.},
title={{A Computer-Aided Design Intelligent Tutoring System Teaching Strategic Flexibility}},
journal={{Transactions on Techniques for {STEM} Education}},
month={October--December},
year={2016},
abstract={Taking a Computer-Aided Design (CAD) class is a prerequisite for Mechanical Engineering freshmen at many universities, including at Washington State University. The traditional way to learn CAD software is to follow examples and exercises in a textbook. However, using written instruction is not always effective because textbooks usually support a single strategy to construct a model. Missing even one detail may cause the student to become stuck, potentially leading to frustration.
To make the learning process easier and more interesting, we designed and implemented an intelligent tutorial system for an open source CAD program, FreeCAD, for the sake of teaching students some basic CAD skills (such as Boolean operations) to construct complex objects from multiple simple shapes. Instead of teaching a single method to construct a model, the program first automatically learns all possible ways to construct a model and then can teach the student to draw the 3D model in multiple ways. Previous research efforts have shown that learning multiple potential solutions can encourage students to develop the tools they need to solve new problems.
This study compares textbook learning with learning from two variants of our intelligent tutoring system. The textbook approach is considered the baseline. In the first tutorial variant, subjects were given minimal guidance and were asked to construct a model in multiple ways. Subjects in the second tutorial group were given two guided solutions to constructing a model and then asked to demonstrate the third solution when constructing the same model. Rather than directly providing instructions, participants in the first tutorial group were expected to independently explore and were only provided feedback when the program determined he/she had deviated too far from a potential solution. The three groups are compared by measuring the time needed to 1) successfully construct the same model in a testing phase, 2) use multiple methods to construct the same model in a testing phase, and 3) construct a novel model.}
}

### 2015

• Anestis Fachantidis, Ioannis Partalas, Matthew E. Taylor, and Ioannis Vlahavas. Transfer learning with probabilistic mapping selection. Adaptive Behavior, 23(1):3-19, 2015.

When transferring knowledge between reinforcement learning agents with different state representations or actions, past knowledge must be efficiently mapped to novel tasks so that it aids learning. The majority of the existing approaches use pre-defined mappings provided by a domain expert. To overcome this limitation and enable autonomous transfer learning, this paper introduces a method for weighting and using multiple inter-task mappings based on a probabilistic framework. Experimental results show that the use of multiple inter-task mappings, accompanied with a probabilistic selection mechanism, can significantly boost the performance of transfer learning relative to 1) learning without transfer and 2) using a single hand-picked mapping. We especially introduce novel tasks for transfer learning in a realistic simulation of the iCub robot, demonstrating the ability of the method to select mappings in complex tasks where human intuition could not be applied to select them. The results verified the efficacy of the proposed approach in a real world and complex environment.

@article{2015AdaptiveBehavior-Fachantidis,
author={Anestis Fachantidis and Ioannis Partalas and Matthew E. Taylor and Ioannis Vlahavas},
title={{Transfer learning with probabilistic mapping selection}},
journal={{Adaptive Behavior}},
volume={23},
number={1},
pages={3-19},
year={2015},
doi={10.1177/1059712314559525},
abstract={When transferring knowledge between reinforcement learning agents with different state representations or actions, past knowledge must be efficiently mapped to novel tasks so that it aids learning. The majority of the existing approaches use pre-defined mappings provided by a domain expert. To overcome this limitation and enable autonomous transfer learning, this paper introduces a method for weighting and using multiple inter-task mappings based on a probabilistic framework. Experimental results show that the use of multiple inter-task mappings, accompanied with a probabilistic selection mechanism, can significantly boost the performance of transfer learning relative to 1) learning without transfer and 2) using a single hand-picked mapping. We especially introduce novel tasks for transfer learning in a realistic simulation of the iCub robot, demonstrating the ability of the method to select mappings in complex tasks where human intuition could not be applied to select them. The results verified the efficacy of the proposed approach in a real world and complex environment.},
bib2html_rescat={Reinforcement Learning, Transfer Learning},
}

• Robert Loftin, Bei Peng, James MacGlashan, Michael L. Littman, Matthew E. Taylor, Jeff Huang, and David L. Roberts. Learning behaviors via human-delivered discrete feedback: modeling implicit feedback strategies to speed up learning. Journal of Autonomous Agents and Multi-Agent Systems, pages 1-30, 2015.

For real-world applications, virtual agents must be able to learn new behaviors from non-technical users. Positive and negative feedback are an intuitive way to train new behaviors, and existing work has presented algorithms for learning from such feedback. That work, however, treats feedback as numeric reward to be maximized, and assumes that all trainers provide feedback in the same way. In this work, we show that users can provide feedback in many different ways, which we describe as “training strategies.” Specifically, users may not always give explicit feedback in response to an action, and may be more likely to provide explicit reward than explicit punishment, or vice versa, such that the lack of feedback itself conveys information about the behavior. We present a probabilistic model of trainer feedback that describes how a trainer chooses to provide explicit reward and/or explicit punishment and, based on this model, develop two novel learning algorithms (SABL and I-SABL) which take trainer strategy into account, and can therefore learn from cases where no feedback is provided. Through online user studies we demonstrate that these algorithms can learn with less feedback than algorithms based on a numerical interpretation of feedback. Furthermore, we conduct an empirical analysis of the training strategies employed by users, and of factors that can affect their choice of strategy.

@article{2015AAMAS-Loftin,
author={Robert Loftin and Bei Peng and James MacGlashan and Michael L. Littman and Matthew E. Taylor and Jeff Huang and David L. Roberts},
title={{Learning behaviors via human-delivered discrete feedback: modeling implicit feedback strategies to speed up learning}},
journal={{Journal of Autonomous Agents and Multi-Agent Systems}},
pages={1--30},
year={2015},
doi={10.1007/s10458-015-9283-7},
publisher={Springer},
abstract={ For real-world applications, virtual agents must be able to learn new behaviors from non-technical users. Positive and negative feedback are an intuitive way to train new behaviors, and existing work has presented algorithms for learning from such feedback. That work, however, treats feedback as numeric reward to be maximized, and assumes that all trainers provide feedback in the same way. In this work, we show that users can provide feedback in many different ways, which we describe as “training strategies.” Specifically, users may not always give explicit feedback in response to an action, and may be more likely to provide explicit reward than explicit punishment, or vice versa, such that the lack of feedback itself conveys information about the behavior. We present a probabilistic model of trainer feedback that describes how a trainer chooses to provide explicit reward and/or explicit punishment and, based on this model, develop two novel learning algorithms (SABL and I-SABL) which take trainer strategy into account, and can therefore learn from cases where no feedback is provided. Through online user studies we demonstrate that these algorithms can learn with less feedback than algorithms based on a numerical interpretation of feedback. Furthermore, we conduct an empirical analysis of the training strategies employed by users, and of factors that can affect their choice of strategy. },
}
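
The abstract above argues that a trainer's silence can itself carry information once the trainer's feedback strategy is modeled. A minimal sketch of that reasoning follows, assuming a three-parameter feedback model; the parameter names `mu_plus`, `mu_minus`, and `eps`, and the exact form of the likelihood, are assumptions made for illustration and are not taken from the abstract.

```python
def feedback_likelihood(feedback, correct, mu_plus, mu_minus, eps):
    """P(feedback | whether the action was correct) under an assumed model.

    eps      : probability the trainer misjudges the action
    mu_plus  : probability of withholding reward for an action judged correct
    mu_minus : probability of withholding punishment for one judged incorrect
    """
    p_judged_correct = (1 - eps) if correct else eps
    if feedback == "reward":
        return p_judged_correct * (1 - mu_plus)
    if feedback == "punish":
        return (1 - p_judged_correct) * (1 - mu_minus)
    if feedback == "none":
        return p_judged_correct * mu_plus + (1 - p_judged_correct) * mu_minus
    raise ValueError(f"unknown feedback: {feedback!r}")


def posterior_correct(feedback, prior, mu_plus, mu_minus, eps):
    """Bayes update of the belief that the action was correct."""
    like_correct = feedback_likelihood(feedback, True, mu_plus, mu_minus, eps)
    like_wrong = feedback_likelihood(feedback, False, mu_plus, mu_minus, eps)
    evidence = like_correct * prior + like_wrong * (1 - prior)
    return like_correct * prior / evidence


# Hypothetical trainer who almost always rewards good actions but
# usually stays silent instead of punishing bad ones.
p_after_silence = posterior_correct("none", prior=0.5,
                                    mu_plus=0.1, mu_minus=0.9, eps=0.1)
```

For this hypothetical trainer, silence lowers the belief that the action was correct from 0.5 to about 0.18; for a trainer with the opposite habits, the same silence would raise it.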

### 2014

• Tim Brys, Tong T. Pham, and Matthew E. Taylor. Distributed learning and multi-objectivity in traffic light control. Connection Science, 26(1):65-83, 2014.

Traffic jams and suboptimal traffic flows are ubiquitous in modern societies, and they create enormous economic losses each year. Delays at traffic lights alone account for roughly 10% of all delays in US traffic. As most traffic light scheduling systems currently in use are static, set up by human experts rather than being adaptive, the interest in machine learning approaches to this problem has increased in recent years. Reinforcement learning (RL) approaches are often used in these studies, as they require little pre-existing knowledge about traffic flows. Distributed constraint optimisation approaches (DCOP) have also been shown to be successful, but are limited to cases where the traffic flows are known. The distributed coordination of exploration and exploitation (DCEE) framework was recently proposed to introduce learning in the DCOP framework. In this paper, we present a study of DCEE and RL techniques in a complex simulator, illustrating the particular advantages of each, comparing them against standard isolated traffic actuated signals. We analyse how learning and coordination behave under different traffic conditions, and discuss the multi-objective nature of the problem. Finally, we evaluate several alternative reward signals in the best performing approach, some of these taking advantage of the correlation between the problem-inherent objectives to improve performance.

@article{2014ConnectionScience-Brys,
author={Tim Brys and Tong T. Pham and Matthew E. Taylor},
title={{Distributed learning and multi-objectivity in traffic light control}},
journal={{Connection Science}},
volume={26},
number={1},
pages={65-83},
year={2014},
doi={10.1080/09540091.2014.885282},
url={http://dx.doi.org/10.1080/09540091.2014.885282},
eprint={http://dx.doi.org/10.1080/09540091.2014.885282},
abstract={ Traffic jams and suboptimal traffic flows are ubiquitous in modern societies, and they create enormous economic losses each year. Delays at traffic lights alone account for roughly 10\% of all delays in US traffic. As most traffic light scheduling systems currently in use are static, set up by human experts rather than being adaptive, the interest in machine learning approaches to this problem has increased in recent years. Reinforcement learning (RL) approaches are often used in these studies, as they require little pre-existing knowledge about traffic flows. Distributed constraint optimisation approaches (DCOP) have also been shown to be successful, but are limited to cases where the traffic flows are known. The distributed coordination of exploration and exploitation (DCEE) framework was recently proposed to introduce learning in the DCOP framework. In this paper, we present a study of DCEE and RL techniques in a complex simulator, illustrating the particular advantages of each, comparing them against standard isolated traffic actuated signals. We analyse how learning and coordination behave under different traffic conditions, and discuss the multi-objective nature of the problem. Finally we evaluate several alternative reward signals in the best performing approach, some of these taking advantage of the correlation between the problem-inherent objectives to improve performance. },
bib2html_pubtype={Journal Article},
bib2html_rescat={Reinforcement Learning, DCOP}
}

• Matthew E. Taylor, Nicholas Carboni, Anestis Fachantidis, Ioannis Vlahavas, and Lisa Torrey. Reinforcement learning agents providing advice in complex video games. Connection Science, 26(1):45-63, 2014.

This article introduces a teacher-student framework for reinforcement learning, synthesising and extending material that appeared in conference proceedings [Torrey, L., & Taylor, M. E. (2013). Teaching on a budget: Agents advising agents in reinforcement learning. Proceedings of the international conference on autonomous agents and multiagent systems] and in a non-archival workshop paper [Carboni, N., & Taylor, M. E. (2013, May). Preliminary results for 1 vs. 1 tactics in StarCraft. Proceedings of the adaptive and learning agents workshop (at AAMAS-13)]. In this framework, a teacher agent instructs a student agent by suggesting actions the student should take as it learns. However, the teacher may only give such advice a limited number of times. We present several novel algorithms that teachers can use to budget their advice effectively, and we evaluate them in two complex video games: StarCraft and Pac-Man. Our results show that the same amount of advice, given at different moments, can have different effects on student learning, and that teachers can significantly affect student learning even when students use different learning methods and state representations.

@article{2014ConnectionScience-Taylor,
author={Matthew E. Taylor and Nicholas Carboni and Anestis Fachantidis and Ioannis Vlahavas and Lisa Torrey},
title={{Reinforcement learning agents providing advice in complex video games}},
journal={{Connection Science}},
volume={26},
number={1},
pages={45-63},
year={2014},
doi={10.1080/09540091.2014.885279},
url={http://dx.doi.org/10.1080/09540091.2014.885279},
eprint={http://dx.doi.org/10.1080/09540091.2014.885279},
abstract={ This article introduces a teacher-student framework for reinforcement learning, synthesising and extending material that appeared in conference proceedings [Torrey, L., & Taylor, M. E. (2013). Teaching on a budget: Agents advising agents in reinforcement learning. {Proceedings of the international conference on autonomous agents and multiagent systems}] and in a non-archival workshop paper [Carboni, N., & Taylor, M. E. (2013, May). Preliminary results for 1 vs. 1 tactics in StarCraft. {Proceedings of the adaptive and learning agents workshop (at AAMAS-13)}]. In this framework, a teacher agent instructs a student agent by suggesting actions the student should take as it learns. However, the teacher may only give such advice a limited number of times. We present several novel algorithms that teachers can use to budget their advice effectively, and we evaluate them in two complex video games: StarCraft and Pac-Man. Our results show that the same amount of advice, given at different moments, can have different effects on student learning, and that teachers can significantly affect student learning even when students use different learning methods and state representations. },
bib2html_pubtype={Journal Article},
bib2html_rescat={Reinforcement Learning, Transfer Learning},
}

### 2011

• Marcos A. M. Vieira, Matthew E. Taylor, Prateek Tandon, Manish Jain, Ramesh Govindan, Gaurav S. Sukhatme, and Milind Tambe. Mitigating Multi-path Fading in a Mobile Mesh Network. Ad Hoc Networks Journal, 2011.
@article{ADHOC11-Vieira,
author={Marcos A.~M.~Vieira and Matthew E. Taylor and Prateek Tandon and Manish Jain and Ramesh Govindan and Gaurav S.~Sukhatme and Milind Tambe},
title={{Mitigating Multi-path Fading in a Mobile Mesh Network}},
year={2011},
bib2html_pubtype={Journal Article},
bib2html_rescat={DCOP}
}

• Matthew E. Taylor and Peter Stone. An Introduction to Inter-task Transfer for Reinforcement Learning. AI Magazine, 32(1):15-34, 2011.
@article{AAAIMag11-Taylor,
author={Matthew E. Taylor and Peter Stone},
title={{An Introduction to Inter-task Transfer for Reinforcement Learning}},
journal={{{AI} Magazine}},
year={2011},
volume={32},
number={1},
pages={15--34},
bib2html_pubtype={Journal Article},
bib2html_rescat={Reinforcement Learning, Transfer Learning}
}

• Matthew E. Taylor, Manish Jain, Prateek Tandon, Makoto Yokoo, and Milind Tambe. Distributed On-line Multi-Agent Optimization Under Uncertainty: Balancing Exploration and Exploitation. Advances in Complex Systems, 2011.
@article{ACS11-Taylor,
author={Matthew E. Taylor and Manish Jain and Prateek Tandon and Makoto Yokoo and Milind Tambe},
title={{Distributed On-line Multi-Agent Optimization Under Uncertainty: Balancing Exploration and Exploitation}},
year={2011},
bib2html_pubtype={Journal Article},
bib2html_rescat={DCOP}
}

### 2010

• Matthew E. Taylor, Christopher Kiekintveld, Craig Western, and Milind Tambe. A Framework for Evaluating Deployed Security Systems: Is There a Chink in your ARMOR? Informatica, 34(2):129-139, 2010.

A growing number of security applications are being developed and deployed to explicitly reduce risk from adversaries’ actions. However, there are many challenges when attempting to *evaluate* such systems, both in the lab and in the real world. Traditional evaluations used by computer scientists, such as runtime analysis and optimality proofs, may be largely irrelevant. The primary contribution of this paper is to provide a preliminary framework which can guide the evaluation of such systems and to apply the framework to the evaluation of ARMOR (a system deployed at LAX since August 2007). This framework helps to determine what evaluations could, and should, be run in order to measure a system’s overall utility. A secondary contribution of this paper is to help familiarize our community with some of the difficulties inherent in evaluating deployed applications, focusing on those in security domains.

@article{Informatica10-Taylor,
author={Matthew E. Taylor and Christopher Kiekintveld and Craig Western and Milind Tambe},
title={{A Framework for Evaluating Deployed Security Systems: Is There a Chink in your {ARMOR}?}},
journal={{Informatica}},
year={2010},
volume={34},
number={2},
pages={129--139},
abstract={A growing number of security applications are being developed and deployed to explicitly reduce risk from adversaries' actions. However, there are many challenges when attempting to \emph{evaluate} such systems, both in the lab and in the real world. Traditional evaluations used by computer scientists, such as runtime analysis and optimality proofs, may be largely irrelevant. The primary contribution of this paper is to provide a preliminary framework which can guide the evaluation of such systems and to apply the framework to the evaluation of ARMOR (a system deployed at LAX since August 2007). This framework helps to determine what evaluations could, and should, be run in order to measure a system's overall utility. A secondary contribution of this paper is to help familiarize our community with some of the difficulties inherent in evaluating deployed applications, focusing on those in security domains.},
bib2html_pubtype={Journal Article},
bib2html_rescat={Security},
bib2html_funding={CREATE}
}

• Shimon Whiteson, Matthew E. Taylor, and Peter Stone. Critical Factors in the Empirical Performance of Temporal Difference and Evolutionary Methods for Reinforcement Learning. Journal of Autonomous Agents and Multi-Agent Systems, 21(1):1-27, 2010.

Temporal difference and evolutionary methods are two of the most common approaches to solving reinforcement learning problems. However, there is little consensus on their relative merits and there have been few empirical studies that directly compare their performance. This article aims to address this shortcoming by presenting results of empirical comparisons between Sarsa and NEAT, two representative methods, in mountain car and keepaway, two benchmark reinforcement learning tasks. In each task, the methods are evaluated in combination with both linear and nonlinear representations to determine their best configurations. In addition, this article tests two specific hypotheses about the critical factors contributing to these methods’ relative performance: 1) that sensor noise reduces the final performance of Sarsa more than that of NEAT, because Sarsa’s learning updates are not reliable in the absence of the Markov property and 2) that stochasticity, by introducing noise in fitness estimates, reduces the learning speed of NEAT more than that of Sarsa. Experiments in variations of mountain car and keepaway designed to isolate these factors confirm both these hypotheses.

@article{JAAMAS09-Whiteson,
author={Shimon Whiteson and Matthew E. Taylor and Peter Stone},
title={{Critical Factors in the Empirical Performance of Temporal Difference and Evolutionary Methods for Reinforcement Learning}},
journal={{Journal of Autonomous Agents and Multi-Agent Systems}},
year={2010},
volume={21},
number={1},
pages={1--27},
abstract={Temporal difference and evolutionary methods are two of the most common approaches to solving reinforcement learning problems. However, there is little consensus on their relative merits and there have been few empirical studies that directly compare their performance. This article aims to address this shortcoming by presenting results of empirical comparisons between Sarsa and NEAT, two representative methods, in mountain car and keepaway, two benchmark reinforcement learning tasks. In each task, the methods are evaluated in combination with both linear and nonlinear representations to determine their best configurations. In addition, this article tests two specific hypotheses about the critical factors contributing to these methods' relative performance: 1) that sensor noise reduces the final performance of Sarsa more than that of NEAT, because Sarsa's learning updates are not reliable in the absence of the Markov property and 2) that stochasticity, by introducing noise in fitness estimates, reduces the learning speed of NEAT more than that of Sarsa. Experiments in variations of mountain car and keepaway designed to isolate these factors confirm both these hypotheses.},
bib2html_pubtype={Journal Article},
bib2html_funding={},
bib2html_rescat={Reinforcement Learning, Machine Learning in Practice}
}

### 2009

• Matthew E. Taylor and Peter Stone. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research, 10(1):1633-1685, 2009.

The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work.

@article{JMLR09-taylor,
author={Matthew E. Taylor and Peter Stone},
title={{Transfer Learning for Reinforcement Learning Domains: A Survey}},
journal={{Journal of Machine Learning Research}},
volume={10},
number={1},
pages={1633--1685},
year={2009},
abstract={The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work.},
bib2html_pubtype={Journal Article},
bib2html_funding={NSF, DARPA},
bib2html_rescat={Reinforcement Learning, Transfer Learning}
}

### 2007

• Matthew E. Taylor, Peter Stone, and Yaxin Liu. Transfer Learning via Inter-Task Mappings for Temporal Difference Learning. Journal of Machine Learning Research, 8(1):2125-2167, 2007.

@article{JMLR07-taylor,
author={Matthew E. Taylor and Peter Stone and Yaxin Liu},
title={{Transfer Learning via Inter-Task Mappings for Temporal Difference Learning}},
journal={{Journal of Machine Learning Research}},
year={2007},
volume={8},
number={1},
pages={2125--2167},
bib2html_pubtype={Journal Article},
bib2html_funding={NSF, DARPA},
bib2html_rescat={Reinforcement Learning, Transfer Learning}
}

• Shimon Whiteson, Matthew E. Taylor, and Peter Stone. Empirical Studies in Action Selection for Reinforcement Learning. Adaptive Behavior, 15(1), 2007.

To excel in challenging tasks, intelligent agents need sophisticated mechanisms for action selection: they need policies that dictate what action to take in each situation. Reinforcement learning (RL) algorithms are designed to learn such policies given only positive and negative rewards. Two contrasting approaches to RL that are currently in popular use are temporal difference (TD) methods, which learn value functions, and evolutionary methods, which optimize populations of candidate policies. Both approaches have had practical successes but few studies have directly compared them. Hence, there are no general guidelines describing their relative strengths and weaknesses. In addition, there has been little cross-collaboration, with few attempts to make them work together or to apply ideas from one to the other. This article aims to address these shortcomings via three empirical studies that compare these methods and investigate new ways of making them work together. First, we compare the two approaches in a benchmark task and identify variations of the task that isolate factors critical to each method’s performance. Second, we investigate ways to make evolutionary algorithms excel at on-line tasks by borrowing exploratory mechanisms traditionally used by TD methods. We present empirical results demonstrating a dramatic performance improvement. Third, we explore a novel way of making evolutionary and TD methods work together by using evolution to automatically discover good representations for TD function approximators. We present results demonstrating that this novel approach can outperform both TD and evolutionary methods alone.

@article{AB07-whiteson,
author={Shimon Whiteson and Matthew E. Taylor and Peter Stone},
title={{Empirical Studies in Action Selection for Reinforcement Learning}},
}