Intelligent Robot Learning Laboratory (IRL Lab)
Bei Peng

CONTACT INFORMATION:
Bei Peng
PhD Student, Computer Science
Email: bei.peng@wsu.edu
Office: Dana Hall 3
Links: Personal Website


My Story

I’m a Ph.D. student in the School of Electrical Engineering and Computer Science at Washington State University, where I work with Matthew E. Taylor on some cool research in artificial intelligence. Before coming here, I worked as a software developer at Tencent for about a year, after receiving my bachelor’s degree in computer science from Huazhong University of Science and Technology in China in 2012.

Current Projects

By: Gabriel V. de la Cruz Jr., Bei Peng, and Matthew E. Taylor

Reinforcement learning suffers from poor initial performance. Our approach uses crowdsourcing to collect non-expert suggestions that speed up an RL agent's learning. We currently use Ms. Pac-Man as our application domain because of its popularity as a game. Our studies so far show that crowd workers, although non-experts, are good at identifying mistakes. We are now working on how to integrate the crowd's advice to speed up the RL agent's learning. In the future, we intend to apply this approach to a physical robot. [1, 2]
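
To make the idea concrete, here is a minimal, hypothetical Python sketch (not our actual implementation) of one way crowd advice could be used: a tabular Q-learner that follows a crowd-suggested action for the current state with some probability, and otherwise explores and exploits as usual. The crowd_advice mapping and all hyperparameters are illustrative assumptions.

import random
from collections import defaultdict

class AdvisedQLearner:
    """Tabular Q-learning with exploration biased toward crowd suggestions."""

    def __init__(self, actions, crowd_advice, alpha=0.1, gamma=0.95,
                 epsilon=0.1, advice_prob=0.5):
        self.q = defaultdict(float)        # Q-values keyed by (state, action)
        self.actions = actions
        self.crowd_advice = crowd_advice   # hypothetical dict: state -> suggested action
        self.alpha, self.gamma = alpha, gamma
        self.epsilon = epsilon             # random-exploration rate
        self.advice_prob = advice_prob     # chance of following crowd advice

    def choose_action(self, state):
        # Follow the crowd's suggestion some of the time, if one exists.
        if state in self.crowd_advice and random.random() < self.advice_prob:
            return self.crowd_advice[state]
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Standard Q-learning update; the crowd only shapes exploration.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])

In this sketch the crowd only shapes exploration, so the usual Q-learning update is unchanged; how best to weight or filter noisy crowd suggestions is exactly the open question described above.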

[1] [pdf] [doi] Gabriel V. de la Cruz Jr., Bei Peng, Walter S. Lasecki, and Matthew E. Taylor. Towards Integrating Real-Time Crowd Advice with Reinforcement Learning. In The 20th ACM Conference on Intelligent User Interfaces (IUI), March 2015. Poster: 41% acceptance rate for poster submissions
[Bibtex]
@inproceedings{2015IUI-Delacruz,
author={de la Cruz, Jr., Gabriel V. and Peng, Bei and Lasecki, Walter S. and Taylor, Matthew E.},
title={{Towards Integrating Real-Time Crowd Advice with Reinforcement Learning}},
booktitle={{The 20th {ACM} Conference on Intelligent User Interfaces ({IUI})}},
month={March},
year={2015},
doi={10.1145/2732158.2732180},
note={Poster: 41% acceptance rate for poster submissions},
wwwnote={<a href="http://iui.acm.org/2015/">ACM iUI-15</a>},
bib2html_rescat={Reinforcement Learning, Crowdsourcing},
bib2html_pubtype={Short Refereed Conference},
bib2html_funding={NSF},
abstract={Reinforcement learning is a powerful machine learning paradigm that allows agents to autonomously learn to maximize a scalar reward. However, it often suffers from poor initial performance and long learning times. This paper discusses how collecting on-line human feedback, both in real time and post hoc, can potentially improve the performance of such learning systems. We use the game Pac-Man to simulate a navigation setting and show that workers are able to accurately identify both when a sub-optimal action is executed, and what action should have been performed instead. Demonstrating that the crowd is capable of generating this input, and discussing the types of errors that occur, serves as a critical first step in designing systems that use this real-time feedback to improve systems' learning performance on-the-fly.},
}
[2] [pdf] Gabriel V. de la Cruz Jr., Bei Peng, Walter S. Lasecki, and Matthew E. Taylor. Generating Real-Time Crowd Advice to Improve Reinforcement Learning Agents. In Proceedings of the Learning for General Competency in Video Games workshop (AAAI), January 2015.
[Bibtex]
@inproceedings{2015AAAI-Delacruz,
title={{Generating Real-Time Crowd Advice to Improve Reinforcement Learning Agents}},
author={de la Cruz, Jr., Gabriel V. and Peng, Bei and Lasecki, Walter S. and Taylor, Matthew E.},
booktitle={{Proceedings of the Learning for General Competency in Video Games workshop ({AAAI})}},
month={January},
year={2015},
wwwnote={<a href="http://www.arcadelearningenvironment.org/aaai15-workshop/">The Arcade Learning Environment</a>},
bib2html_pubtype={Refereed Workshop or Symposium},
bib2html_rescat={Reinforcement Learning, Crowdsourcing},
bib2html_funding={NSF},
abstract={Reinforcement learning is a powerful machine learning paradigm that allows agents to autonomously learn to maximize a scalar reward. However, it often suffers from poor initial performance and long learning times. This paper discusses how collecting on-line human feedback, both in real time and post hoc, can potentially improve the performance of such learning systems. We use the game Pac-Man to simulate a navigation setting and show that workers are able to accurately identify both when a sub-optimal action is executed, and what action should have been performed instead. Our results demonstrate that the crowd is capable of generating helpful input. We conclude with a discussion the types of errors that occur most commonly when engaging human workers for this task, and a discussion of how such data could be used to improve learning. Our work serves as a critical first step in designing systems that use real-time human feedback to improve the learning performance of automated systems on-the-fly.},
}

By: Bei Peng and Matthew E. Taylor

In this project, we consider the problem of a human trainer teaching an agent by providing positive or negative feedback. Most existing work has treated human feedback as a numerical value that the agent seeks to maximize, and has assumed that all trainers give feedback in the same way when teaching the same behavior. In contrast, we treat feedback as a discrete communication from trainer to learner, and we recognize that different trainers choose different training strategies. We propose a probabilistic model to classify these training strategies, and we present the SABL and I-SABL algorithms, which consider multiple interpretations of trainer feedback in order to learn behaviors more efficiently. Our online user studies show that human trainers follow a variety of training strategies when teaching virtual agents, and that explicitly modeling trainer strategy allows a learner to make inferences even from cases where no feedback is given. [1, 2, 3]
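
As a rough illustration of the strategy-aware idea (a Python sketch under assumed trainer-strategy parameters, not the published SABL or I-SABL implementation), the learner can maintain a posterior over candidate target policies and update it on every feedback event, including silence. The mu_plus, mu_minus, and eps values and the example policies are made up for illustration.

def feedback_likelihood(feedback, action_correct, mu_plus=0.6, mu_minus=0.2, eps=0.05):
    """P(feedback | whether the action matched the candidate target policy).
    mu_plus / mu_minus are assumed probabilities that the trainer withholds
    reward / punishment (here, a trainer who often withholds reward but rarely
    withholds punishment); eps is an assumed trainer error rate."""
    if action_correct:
        return {"reward": (1 - mu_plus) * (1 - eps),
                "punish": (1 - mu_plus) * eps,
                "none": mu_plus}[feedback]
    return {"reward": (1 - mu_minus) * eps,
            "punish": (1 - mu_minus) * (1 - eps),
            "none": mu_minus}[feedback]

def update_posterior(policies, posterior, state, action, feedback):
    """policies: list of candidate policies (dicts mapping state -> action).
    posterior: list of probabilities, one per candidate policy."""
    weights = [p * feedback_likelihood(feedback, policy.get(state) == action)
               for policy, p in zip(policies, posterior)]
    total = sum(weights)
    return [w / total for w in weights]

# Example: two candidate policies for a single state.
policies = [{"door": "open"}, {"door": "knock"}]
posterior = [0.5, 0.5]
# The agent opened the door and the trainer stayed silent; under these
# strategy parameters, silence is more likely after a correct action, so
# the posterior shifts toward the policy that says "open".
posterior = update_posterior(policies, posterior, "door", "open", "none")

Because the likelihood of receiving no feedback differs depending on whether the action was correct under a candidate policy, silence itself shifts the posterior, which is the sense in which the learner can learn something from nothing.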

[1] [pdf] [doi] Robert Loftin, Bei Peng, James MacGlashan, Michael L. Littman, Matthew E. Taylor, Jeff Huang, and David L. Roberts. Learning behaviors via human-delivered discrete feedback: modeling implicit feedback strategies to speed up learning. Journal of Autonomous Agents and Multi-Agent Systems, pages 1-30, 2015.
[Bibtex]
@article{2015AAMAS-Loftin,
author={Robert Loftin and Bei Peng and James MacGlashan and Michael L. Littman and Matthew E. Taylor and Jeff Huang and David L. Roberts},
title={{Learning behaviors via human-delivered discrete feedback: modeling implicit feedback strategies to speed up learning}},
journal={{Journal of Autonomous Agents and Multi-Agent Systems}},
pages={1--30},
year={2015},
doi={10.1007/s10458-015-9283-7},
publisher={Springer},
url={http://link.springer.com/article/10.1007%2Fs10458-015-9283-7},
abstract={ For real-world applications, virtual agents must be able to learn new behaviors from non-technical users. Positive and negative feedback are an intuitive way to train new behaviors, and existing work has presented algorithms for learning from such feedback. That work, however, treats feedback as numeric reward to be maximized, and assumes that all trainers provide feedback in the same way. In this work, we show that users can provide feedback in many different ways, which we describe as “training strategies.” Specifically, users may not always give explicit feedback in response to an action, and may be more likely to provide explicit reward than explicit punishment, or vice versa, such that the lack of feedback itself conveys information about the behavior. We present a probabilistic model of trainer feedback that describes how a trainer chooses to provide explicit reward and/or explicit punishment and, based on this model, develop two novel learning algorithms (SABL and I-SABL) which take trainer strategy into account, and can therefore learn from cases where no feedback is provided. Through online user studies we demonstrate that these algorithms can learn with less feedback than algorithms based on a numerical interpretation of feedback. Furthermore, we conduct an empirical analysis of the training strategies employed by users, and of factors that can affect their choice of strategy. },
}
[2] [pdf] Robert Loftin, Bei Peng, James MacGlashan, Michael Littman, Matthew E. Taylor, David Roberts, and Jeff Huang. Learning Something from Nothing: Leveraging Implicit Human Feedback Strategies. In Proceedings of the 23rd IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), August 2014.
[Bibtex]
@inproceedings{2014ROMAN-Loftin,
author={Robert Loftin and Bei Peng and James MacGlashan and Michael Littman and Matthew E. Taylor and David Roberts and Jeff Huang},
title={{Learning Something from Nothing: Leveraging Implicit Human Feedback Strategies}},
booktitle={{Proceedings of the 23rd IEEE International Symposium on Robot and Human Interactive Communication ({RO-MAN})}},
month={August},
year={2014},
bib2html_pubtype={Refereed Conference},
bib2html_rescat={Reinforcement Learning},
}
[3] [pdf] Robert Loftin, Bei Peng, James MacGlashan, Michael L. Littman, Matthew E. Taylor, Jeff Huang, and David L. Roberts. A Strategy-Aware Technique for Learning Behaviors from Discrete Human Feedback. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI), July 2014. 28% acceptance rate
[Bibtex]
@inproceedings{2014AAAI-Loftin,
author={Robert Loftin and Bei Peng and James MacGlashan and Michael L. Littman and Matthew E. Taylor and Jeff Huang and David L. Roberts},
title={{A Strategy-Aware Technique for Learning Behaviors from Discrete Human Feedback}},
booktitle={{Proceedings of the 28th {AAAI} Conference on Artificial Intelligence ({AAAI})}},
month={July},
year={2014},
note={28% acceptance rate},
bib2html_pubtype={Refereed Conference},
bib2html_rescat={Reinforcement Learning},
}

By: Bei Peng and Matthew E. Taylor

As the need grows for humans to convey complex tasks to robots without any technical expertise, natural language provides an intuitive interface, but it requires the agent to learn a grounding of natural language commands. In this work, we developed a simple simulated home environment in which the agent learns to complete tasks from positive and negative human feedback. [1, 2, 3, 4]
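
The Python sketch below is purely illustrative (hypothetical class, task names, and update rule, not the model used in our studies): it grounds commands to candidate tasks by keeping per-word scores and adjusting them with the trainer's positive or negative feedback after each attempted task.

from collections import defaultdict

class CommandGrounder:
    """Toy grounding of natural-language commands to candidate tasks."""

    def __init__(self, tasks, lr=0.5):
        self.tasks = tasks                    # e.g. ["go_to_kitchen", "fetch_cup"]
        self.weights = defaultdict(float)     # (word, task) -> association score
        self.lr = lr

    def interpret(self, command):
        # Pick the task whose words score highest for this command.
        words = command.lower().split()
        return max(self.tasks,
                   key=lambda task: sum(self.weights[(w, task)] for w in words))

    def give_feedback(self, command, attempted_task, positive):
        # Strengthen or weaken word-task associations based on the feedback.
        sign = 1.0 if positive else -1.0
        for w in command.lower().split():
            self.weights[(w, attempted_task)] += sign * self.lr

# Example interaction:
grounder = CommandGrounder(["go_to_kitchen", "fetch_cup"])
task = grounder.interpret("please go to the kitchen")
grounder.give_feedback("please go to the kitchen", task, positive=True)

A real grounding model would handle synonyms, word order, and compositional commands; the point here is only the feedback-driven update loop.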

[1] [pdf] Bei Peng, James MacGlashan, Robert Loftin, Michael L. Littman, David L. Roberts, and Matthew E. Taylor. An Empirical Study of Non-Expert Curriculum Design for Machine Learners. In Proceedings of the Interactive Machine Learning workshop (at IJCAI), New York City, NY, USA, July 2016.
[Bibtex]
@inproceedings{2016IML-Peng,
author={Bei Peng and James MacGlashan and Robert Loftin and Michael L. Littman and David L. Roberts and Matthew E. Taylor},
title={{An Empirical Study of Non-Expert Curriculum Design for Machine Learners}},
booktitle={{Proceedings of the Interactive Machine Learning workshop (at {IJCAI})}},
month={July},
year={2016},
address={New York City, NY, USA},
bib2html_pubtype={Refereed Workshop or Symposium},
abstract={Existing machine-learning work has shown that algorithms can benefit from curriculum learning, a strategy where the target behavior of the learner is changed over time. However, most existing work focuses on developing automatic methods to iteratively select training examples with increasing difficulty tailored to the current ability of the learner, neglecting how non-expert humans may design curricula. In this work we introduce a curriculumdesign problem in the context of reinforcement learning and conduct a user study to explicitly explore how non-expert humans go about assembling curricula. We present results from 80 participants on Amazon Mechanical Turk that show 1) humans can successfully design curricula that gradually introduce more complex concepts to the agent within each curriculum, and even across different curricula, and 2) users choose to add task complexity in different ways and follow salient principles when selecting tasks into the curriculum. This work serves as an important first step towards better integration of non-expert humans into the reinforcement learning process and the development of new machine learning algorithms to accommodate human teaching strategies.}
}
[2] [pdf] Bei Peng, James MacGlashan, Robert Loftin, Michael L. Littman, David L. Roberts, and Matthew E. Taylor. A Need for Speed: Adapting Agent Action Speed to Improve Task Learning from Non-Expert Humans. In Proceedings of the 2016 International Conference on Autonomous Agents and Multiagent Systems (AAMAS), May 2016. 24.9% acceptance rate
[Bibtex]
@inproceedings{2016AAMAS-Peng,
author={Bei Peng and James MacGlashan and Robert Loftin and Michael L. Littman and David L. Roberts and Matthew E. Taylor},
title={{A Need for Speed: Adapting Agent Action Speed to Improve Task Learning from Non-Expert Humans}},
booktitle={{Proceedings of the 2016 International Conference on Autonomous Agents and Multiagent Systems ({AAMAS})}},
month={May},
year={2016},
note={24.9% acceptance rate},
video={https://www.youtube.com/watch?v=AJQSGD_XPrk},
bib2html_pubtype={Refereed Conference},
abstract={As robots become pervasive in human environments, it is important to enable users to effectively convey new skills without programming. Most existing work on Interactive Reinforcement Learning focuses on interpreting and incorporating non-expert human feedback to speed up learning; we aim to design a better representation of the learning agent that is able to elicit more natural and effective communication between the human trainer and the learner, while treating human feedback as discrete communication that depends probabilistically on the trainer’s target policy. This work presents a user study where participants train a virtual agent to accomplish tasks by giving reward and/or punishment in a variety of simulated environments. We present results from 60 participants to show how a learner can ground natural language commands and adapt its action execution speed to learn more efficiently from human trainers. The agent’s action execution speed can be successfully modulated to encourage more explicit feedback from a human trainer in areas of the state space where there is high uncertainty. Our results show that our novel adaptive speed agent dominates different fixed speed agents on several measures. Additionally, we investigate the impact of instructions on user performance and user preference in training conditions.}
}
[3] [pdf] Bei Peng, Robert Loftin, James MacGlashan, Michael L. Littman, Matthew E. Taylor, and David L. Roberts. Language and Policy Learning from Human-delivered Feedback. In Proceedings of the Machine Learning for Social Robotics workshop (at ICRA), May 2015.
[Bibtex]
@inproceedings{2015ICRA-Peng,
author={Bei Peng and Robert Loftin and James MacGlashan and Michael L. Littman and Matthew E. Taylor and David L. Roberts},
title={{Language and Policy Learning from Human-delivered Feedback}},
booktitle={{Proceedings of the Machine Learning for Social Robotics workshop (at {ICRA})}},
month={May},
year={2015},
bib2html_pubtype={Refereed Workshop or Symposium},
abstract={Using rewards and punishments is a common and familiar paradigm for humans to train intelligent agents. Most existing learning algorithms in this paradigm follow a framework in which human feedback is treated as a numerical signal to be maximized by the agent. However, treating feedback as a numeric signal fails to capitalize on implied information the human trainer conveys with a lack of explicit feedback. For example, a trainer may withhold reward to signal to the agent a failure, or they may withhold punishment to signal that the agent is behaving correctly. We review our progress to date with Strategy-aware Bayesian Learning, which is able to learn from experience the ways
trainers use feedback, and can exploit that knowledge to accelerate learning. Our work covers contextual bandits, goal-directed sequential decision-making tasks, and natural language command learning. We present a user study design to identify how users’ feedback strategies are affected by properties of the environment and agent competency for natural language command learning in sequential decision making tasks, which will inform the development of more adaptive models of human feedback in the future.}
}
[4] [pdf] James MacGlashan, Michael L. Littman, Robert Loftin, Bei Peng, David Roberts, and Matthew E. Taylor. Training an Agent to Ground Commands with Reward and Punishment. In Proceedings of the Machine Learning for Interactive Systems workshop (at AAAI), July 2014.
[Bibtex]
@inproceedings{2014MLIS-James,
title={{Training an Agent to Ground Commands with Reward and Punishment}},
author={James Macglashan and Michael L. Littman and Robert Loftin and Bei Peng and David Roberts and Matthew E. Taylor},
booktitle={{Proceedings of the Machine Learning for Interactive Systems workshop (at {AAAI})}},
month={July},
year={2014},
bib2html_pubtype={Refereed Workshop or Symposium},
bib2html_rescat={Reinforcement Learning}
}

Publications

2017

  • James MacGlashan, Mark K. Ho, Robert Loftin, Bei Peng, David Roberts, Matthew E. Taylor, and Michael L. Littman. Interactive Learning from Policy-Dependent Human Feedback. Technical Report, Jan 2017.
    [BibTeX] [Abstract] [Download PDF]

    For agents and robots to become more useful, they must be able to quickly learn from non-technical users. This paper investigates the problem of interactively learning behaviors communicated by a human teacher using positive and negative feedback. Much previous work on this problem has made the assumption that people provide feedback for decisions that is dependent on the behavior they are teaching and is independent from the learner’s current policy. We present empirical results that show this assumption to be false—whether human trainers give a positive or negative feedback for a decision is influenced by the learner’s current policy. We argue that policy-dependent feedback, in addition to being commonplace, enables useful training strategies from which agents should benefit. Based on this insight, we introduce Convergent Actor-Critic by Humans (COACH), an algorithm for learning from policy-dependent feedback that converges to a local optimum. Finally, we demonstrate that COACH can successfully learn multiple behaviors on a physical robot, even with noisy image features.

    @techreport{2017arXiv-MacGlashan,
    author={MacGlashan, James and Ho, Mark K. and Loftin, Robert and Peng, Bei and Roberts, David and Taylor, Matthew E. and Littman, Michael L.},
    title={{Interactive Learning from Policy-Dependent Human Feedback}},
    journal={ArXiv e-prints},
    archivePrefix="arXiv",
    eprint={1701.06049},
    primaryClass="cs.AI",
    keywords={Computer Science - Artificial Intelligence, I.2.6},
    year={2017},
    month={Jan},
    adsurl={http://adsabs.harvard.edu/abs/2017arXiv170106049M},
    adsnote={Provided by the SAO/NASA Astrophysics Data System},
    abstract={For agents and robots to become more useful, they must be able to quickly learn from non-technical users. This paper investigates the problem of interactively learning behaviors communicated by a human teacher using positive and negative feedback. Much previous work on this problem has made the assumption that people provide feedback for decisions that is dependent on the behavior they are teaching and is independent from the learner's current policy. We present empirical results that show this assumption to be false---whether human trainers give a positive or negative feedback for a decision is influenced by the learner's current policy. We argue that policy-dependent feedback, in addition to being commonplace, enables useful training strategies from which agents should benefit. Based on this insight, we introduce Convergent Actor-Critic by Humans (COACH), an algorithm for learning from policy-dependent feedback that converges to a local optimum. Finally, we demonstrate that COACH can successfully learn multiple behaviors on a physical robot, even with noisy image features.}
    }

  • Bei Peng, James MacGlashan, Robert Loftin, Michael L. Littman, David L. Roberts, and Matthew E. Taylor. Curriculum Design for Machine Learners in Sequential Decision Tasks (Extended Abstract). In Proceedings of the 2017 International Conference on Autonomous Agents and Multiagent Systems (AAMAS), May 2017. Extended abstract: 26% acceptance rate for papers, additional 22% for extended abstracts.
    [BibTeX] [Abstract] [Download PDF]

    Existing machine-learning work has shown that algorithms can benefit from curricula—learning first on simple examples before moving to more difficult examples. While most existing work on curriculum learning focuses on developing automatic methods to iteratively select training examples with increasing difficulty tailored to the current ability of the learner, relatively little attention has been paid to the ways in which humans design curricula. We argue that a better understanding of the human-designed curricula could give us insights into the development of new machine learning algorithms and interfaces that can better accommodate machine- or human-created curricula. Our work addresses this emerging and vital area empirically, taking an important step to characterize the nature of human-designed curricula relative to the space of possible curricula and the performance benefits that may (or may not) occur.

    @inproceedings{2017AAMAS-Peng,
    author={Peng, Bei and MacGlashan, James and Loftin, Robert and Littman, Michael L. and Roberts, David L. and Taylor, Matthew E.},
    title={{Curriculum Design for Machine Learners in Sequential Decision Tasks (Extended Abstract)}},
    booktitle={{Proceedings of the 2017 International Conference on Autonomous Agents and Multiagent Systems ({AAMAS})}},
    month={May},
    year={2017},
    note={Extended abstract: 26% acceptance rate for papers, additional 22% for extended abstracts.},
    bib2html_pubtype={Refereed Conference},
    abstract={Existing machine-learning work has shown that algorithms can benet from curricula|learning rst on simple examples before moving to more dicult examples. While most existing work on curriculum learning focuses on developing automatic methods to iteratively select training examples with increasing diculty tailored to the current ability of the learner, relatively little attention has been paid to the ways in which humans design curricula. We argue that a better understanding of the human-designed curricula could give us insights into the development of new machine learning algorithms and interfaces that can better accommodate machine- or human-created curricula. Our work addresses this emerging and vital area empirically, taking an important step to characterize the nature of human-designed curricula relative to the space of possible curricula and the performance benets that may (or may not) occur.}
    }

  • Bei Peng, James MacGlashan, Robert Loftin, Michael L. Littman, David L. Roberts, and Matthew E. Taylor. Curriculum Design for Machine Learners in Sequential Decision Tasks. In Proceedings of the Adaptive Learning Agents workshop (at AAMAS), Sao Paulo, Brazil, May 2017.
    [BibTeX] [Abstract] [Download PDF]

    Existing machine-learning work has shown that algorithms can benefit from curricula—learning first on simple examples before moving to more difficult examples. This work defines the curriculum-design problem in the context of sequential decision tasks, analyzes how different curricula affect agent learning in a Sokoban-like domain, and presents results of a user study that explores whether non-experts generate such curricula. Our results show that 1) different curricula can have substantial impact on training speeds while longer curricula do not always result in worse agent performance in learning all tasks within the curricula (including the target task), 2) more benefits of curricula can be found as the target task’s complexity increases, 3) the method for providing reward feedback to the agent as it learns within a curriculum does not change which curricula are best, 4) non-expert users can successfully design curricula that result in better overall agent performance than learning from scratch, even in the absence of feedback, and 5) non-expert users can discover and follow salient principles when selecting tasks in a curriculum. This work gives us insights into the development of new machine-learning algorithms and interfaces that can better accommodate machine- or human-created curricula.

    @inproceedings{2017ALA-Peng,
    author={Bei Peng and James MacGlashan and Robert Loftin and Michael L. Littman and David L. Roberts and Matthew E. Taylor},
    title={{Curriculum Design for Machine Learners in Sequential Decision Tasks}},
    booktitle={{Proceedings of the Adaptive Learning Agents workshop (at {AAMAS})}},
    month={May},
    year={2017},
    address={Sao Paulo, Brazil},
    bib2html_pubtype={Refereed Workshop or Symposium},
    abstract={Existing machine-learning work has shown that algorithms can benefit from curricula---learning first on simple examples before moving to more difficult examples. This work defines the curriculum-design problem in the context of sequential decision tasks, analyzes how different curricula affect agent learning in a Sokoban-like domain, and presents results of a user study that explores whether non-experts generate such curricula. Our results show that 1) different curricula can have substantial impact on training speeds while longer curricula do not always result in worse agent performance in learning all tasks within the curricula (including the target task), 2) more benefits of curricula can be found as the target task's complexity increases, 3) the method for providing reward feedback to the agent as it learns within a curriculum does not change which curricula are best, 4) non-expert users can successfully design curricula that result in better overall agent performance than learning from scratch, even in the absence of feedback, and 5) non-expert users can discover and follow salient principles when selecting tasks in a curriculum. This work gives us insights into the development of new machine-learning algorithms and interfaces that can better accommodate machine- or human-created curricula. }
    }

2016

  • Bei Peng, James MacGlashan, Robert Loftin, Michael L. Littman, David L. Roberts, and Matthew E. Taylor. A Need for Speed: Adapting Agent Action Speed to Improve Task Learning from Non-Expert Humans. In Proceedings of the 2016 International Conference on Autonomous Agents and Multiagent Systems (AAMAS), May 2016. 24.9% acceptance rate
    [BibTeX] [Abstract] [Download PDF] [Video]

    As robots become pervasive in human environments, it is important to enable users to effectively convey new skills without programming. Most existing work on Interactive Reinforcement Learning focuses on interpreting and incorporating non-expert human feedback to speed up learning; we aim to design a better representation of the learning agent that is able to elicit more natural and effective communication between the human trainer and the learner, while treating human feedback as discrete communication that depends probabilistically on the trainer’s target policy. This work presents a user study where participants train a virtual agent to accomplish tasks by giving reward and/or punishment in a variety of simulated environments. We present results from 60 participants to show how a learner can ground natural language commands and adapt its action execution speed to learn more efficiently from human trainers. The agent’s action execution speed can be successfully modulated to encourage more explicit feedback from a human trainer in areas of the state space where there is high uncertainty. Our results show that our novel adaptive speed agent dominates different fixed speed agents on several measures. Additionally, we investigate the impact of instructions on user performance and user preference in training conditions.

    @inproceedings{2016AAMAS-Peng,
    author={Bei Peng and James MacGlashan and Robert Loftin and Michael L. Littman and David L. Roberts and Matthew E. Taylor},
    title={{A Need for Speed: Adapting Agent Action Speed to Improve Task Learning from Non-Expert Humans}},
    booktitle={{Proceedings of the 2016 International Conference on Autonomous Agents and Multiagent Systems ({AAMAS})}},
    month={May},
    year={2016},
    note={24.9% acceptance rate},
    video={https://www.youtube.com/watch?v=AJQSGD_XPrk},
    bib2html_pubtype={Refereed Conference},
    abstract={As robots become pervasive in human environments, it is important to enable users to effectively convey new skills without programming. Most existing work on Interactive Reinforcement Learning focuses on interpreting and incorporating non-expert human feedback to speed up learning; we aim to design a better representation of the learning agent that is able to elicit more natural and effective communication between the human trainer and the learner, while treating human feedback as discrete communication that depends probabilistically on the trainer’s target policy. This work presents a user study where participants train a virtual agent to accomplish tasks by giving reward and/or punishment in a variety of simulated environments. We present results from 60 participants to show how a learner can ground natural language commands and adapt its action execution speed to learn more efficiently from human trainers. The agent’s action execution speed can be successfully modulated to encourage more explicit feedback from a human trainer in areas of the state space where there is high uncertainty. Our results show that our novel adaptive speed agent dominates different fixed speed agents on several measures. Additionally, we investigate the impact of instructions on user performance and user preference in training conditions.}
    }

  • Robert Loftin, Matthew E. Taylor, Michael L. Littman, James MacGlashan, Bei Peng, and David L. Roberts. Open Problems for Online Bayesian Inference in Neural Networks. In Proceedings of Bayesian Deep Learning workshop (at NIPS), December 2016.
    [BibTeX] [Download PDF]
    @inproceedings{2016NIPS-BayesDL-Loftin,
    author={Robert Loftin and Matthew E. Taylor and Michael L. Littman and James MacGlashan and Bei Peng and David L. Roberts},
    title={{Open Problems for Online Bayesian Inference in Neural Networks}},
    booktitle={{Proceedings of Bayesian Deep Learning workshop (at {NIPS})}},
    month={December},
    year={2016},
    url={http://bayesiandeeplearning.org/papers/BDL_42.pdf},
    bib2html_pubtype={Refereed Workshop or Symposium}
    }

  • Robert Loftin, James MacGlashan, Bei Peng, Matthew E. Taylor, Michael L. Littman, and David L. Roberts. Towards Behavior-Aware Model Learning from Human-Generated Trajectories. In AAAI Fall Symposium on Artificial Intelligence for Human-Robot Interaction, Arlington, VA, USA, November 2016.
    [BibTeX] [Abstract] [Download PDF]

    Inverse reinforcement learning algorithms recover an unknown reward function for a Markov decision process, based on observations of user behaviors that optimize this reward function. Here we consider the complementary problem of learning the unknown transition dynamics of an MDP based on such observations. We describe the behavior-aware modeling (BAM) algorithm, which learns models of transition dynamics from user generated state-action trajectories. BAM makes assumptions about how users select their actions that are similar to those used in inverse reinforcement learning, and searches for a model that maximizes the probability of the observed actions. The BAM algorithm is based on policy gradient algorithms, essentially reversing the roles of the policy and transition distribution in those algorithms. As a result, BAM is highly flexible, and can be applied to continuous state spaces using a wide variety of model representations. In this preliminary work, we discuss why the model learning problem is interesting, describe algorithms to solve this problem, and discuss directions for future work.

    @inproceedings{2016AAAI-AI-HRI-Loftin,
    author={Robert Loftin and James MacGlashan and Bei Peng and Matthew E. Taylor and Michael L. Littman and David L. Roberts},
    title={{Towards Behavior-Aware Model Learning from Human-Generated Trajectories}},
    booktitle={{{AAAI} Fall Symposium on Artificial Intelligence for Human-Robot Interaction}},
    month={November},
    year={2016},
    address={Arlington, VA, USA},
    bib2html_pubtype={Refereed Workshop or Symposium},
    abstract={Inverse reinforcement learning algorithms recover an unknown reward function for a Markov decision process, based on observations of user behaviors that optimize this reward function. Here we consider the complementary problem of learning the unknown transition dynamics of an MDP based on such observations. We describe the behavior-aware modeling (BAM) algorithm, which learns models of transition dynamics from user generated state-action trajectories. BAM
    makes assumptions about how users select their actions that are similar to those used in inverse reinforcement learning, and searches for a model that maximizes the probability of the observed actions. The BAM algorithm is based on policy gradient algorithms, essentially reversing the roles of the policy and transition distribution in those algorithms. As a result, BAMis highly flexible, and can be applied to continuous state spaces using a wide variety of model representations. In this preliminary work, we discuss why the model learning problem is interesting, describe algorithms to solve this problem, and discuss directions for future work.}
    }

  • James MacGlashan, Michael L. Littman, David L. Roberts, Robert Loftin, Bei Peng, and Matthew E. Taylor. Convergent Actor Critic by Humans. In Workshop on Human-Robot Collaboration: Towards Co-Adaptive Learning Through Semi-Autonomy and Shared Control (at IROS), October 2016.
    [BibTeX] [Abstract] [Download PDF]

    Programming robot behavior can be painstaking: for a layperson, this path is unavailable without investing significant effort in building up proficiency in coding. In contrast, nearly half of American households have a pet dog and at least some exposure to animal training, suggesting an alternative path for customizing robot behavior. Unfortunately, most existing reinforcement-learning (RL) algorithms are not well suited to learning from human-delivered reinforcement. This paper introduces a framework for incorporating human-delivered rewards into RL algorithms and preliminary results demonstrating feasibility.

    @inproceedings{2016IROS-HRC-MacGlashan,
    author={James MacGlashan and Michael L. Littman and David L. Roberts and Robert Loftin and Bei Peng and Matthew E. Taylor},
    title={{Convergent Actor Critic by Humans}},
    booktitle={{Workshop on Human-Robot Collaboration: Towards Co-Adaptive Learning Through Semi-Autonomy and Shared Control (at {IROS})}},
    month={October},
    year={2016},
    bib2html_pubtype={Refereed Workshop or Symposium},
    abstract={Programming robot behavior can be painstaking: for a layperson, this path is unavailable without investing significant effort in building up proficiency in coding. In contrast, nearly half of American households have a pet dog and at least some exposure to animal training, suggesting an alternative path for customizing robot behavior. Unfortunately, most existing reinforcement-learning (RL) algorithms are not well suited to learning from human-delivered reinforcement. This paper introduces a framework for incorporating human-delivered rewards into RL algorithms and preliminary results demonstrating feasibility.}
    }

  • Bei Peng, James MacGlashan, Robert Loftin, Michael L. Littman, David L. Roberts, and Matthew E. Taylor. An Empirical Study of Non-Expert Curriculum Design for Machine Learners. In Proceedings of the Interactive Machine Learning workshop (at IJCAI), New York City, NY, USA, July 2016.
    [BibTeX] [Abstract] [Download PDF]

    Existing machine-learning work has shown that algorithms can benefit from curriculum learning, a strategy where the target behavior of the learner is changed over time. However, most existing work focuses on developing automatic methods to iteratively select training examples with increasing difficulty tailored to the current ability of the learner, neglecting how non-expert humans may design curricula. In this work we introduce a curriculum-design problem in the context of reinforcement learning and conduct a user study to explicitly explore how non-expert humans go about assembling curricula. We present results from 80 participants on Amazon Mechanical Turk that show 1) humans can successfully design curricula that gradually introduce more complex concepts to the agent within each curriculum, and even across different curricula, and 2) users choose to add task complexity in different ways and follow salient principles when selecting tasks into the curriculum. This work serves as an important first step towards better integration of non-expert humans into the reinforcement learning process and the development of new machine learning algorithms to accommodate human teaching strategies.

    @inproceedings{2016IML-Peng,
    author={Bei Peng and James MacGlashan and Robert Loftin and Michael L. Littman and David L. Roberts and Matthew E. Taylor},
    title={{An Empirical Study of Non-Expert Curriculum Design for Machine Learners}},
    booktitle={{Proceedings of the Interactive Machine Learning workshop (at {IJCAI})}},
    month={July},
    year={2016},
    address={New York City, NY, USA},
    bib2html_pubtype={Refereed Workshop or Symposium},
    abstract={Existing machine-learning work has shown that algorithms can benefit from curriculum learning, a strategy where the target behavior of the learner is changed over time. However, most existing work focuses on developing automatic methods to iteratively select training examples with increasing difficulty tailored to the current ability of the learner, neglecting how non-expert humans may design curricula. In this work we introduce a curriculumdesign problem in the context of reinforcement learning and conduct a user study to explicitly explore how non-expert humans go about assembling curricula. We present results from 80 participants on Amazon Mechanical Turk that show 1) humans can successfully design curricula that gradually introduce more complex concepts to the agent within each curriculum, and even across different curricula, and 2) users choose to add task complexity in different ways and follow salient principles when selecting tasks into the curriculum. This work serves as an important first step towards better integration of non-expert humans into the reinforcement learning process and the development of new machine learning algorithms to accommodate human teaching strategies.}
    }

2015

  • Robert Loftin, Bei Peng, James MacGlashan, Michael L. Littman, Matthew E. Taylor, Jeff Huang, and David L. Roberts. Learning behaviors via human-delivered discrete feedback: modeling implicit feedback strategies to speed up learning. Journal of Autonomous Agents and Multi-Agent Systems, pages 1-30, 2015.
    [BibTeX] [Abstract] [Download PDF] [DOI]

    For real-world applications, virtual agents must be able to learn new behaviors from non-technical users. Positive and negative feedback are an intuitive way to train new behaviors, and existing work has presented algorithms for learning from such feedback. That work, however, treats feedback as numeric reward to be maximized, and assumes that all trainers provide feedback in the same way. In this work, we show that users can provide feedback in many different ways, which we describe as “training strategies.” Specifically, users may not always give explicit feedback in response to an action, and may be more likely to provide explicit reward than explicit punishment, or vice versa, such that the lack of feedback itself conveys information about the behavior. We present a probabilistic model of trainer feedback that describes how a trainer chooses to provide explicit reward and/or explicit punishment and, based on this model, develop two novel learning algorithms (SABL and I-SABL) which take trainer strategy into account, and can therefore learn from cases where no feedback is provided. Through online user studies we demonstrate that these algorithms can learn with less feedback than algorithms based on a numerical interpretation of feedback. Furthermore, we conduct an empirical analysis of the training strategies employed by users, and of factors that can affect their choice of strategy.

    @article{2015AAMAS-Loftin,
    author={Robert Loftin and Bei Peng and James MacGlashan and Michael L. Littman and Matthew E. Taylor and Jeff Huang and David L. Roberts},
    title={{Learning behaviors via human-delivered discrete feedback: modeling implicit feedback strategies to speed up learning}},
    journal={{Journal of Autonomous Agents and Multi-Agent Systems}},
    pages={1--30},
    year={2015},
    doi={10.1007/s10458-015-9283-7},
    publisher={Springer},
    url={http://link.springer.com/article/10.1007%2Fs10458-015-9283-7},
    abstract={ For real-world applications, virtual agents must be able to learn new behaviors from non-technical users. Positive and negative feedback are an intuitive way to train new behaviors, and existing work has presented algorithms for learning from such feedback. That work, however, treats feedback as numeric reward to be maximized, and assumes that all trainers provide feedback in the same way. In this work, we show that users can provide feedback in many different ways, which we describe as “training strategies.” Specifically, users may not always give explicit feedback in response to an action, and may be more likely to provide explicit reward than explicit punishment, or vice versa, such that the lack of feedback itself conveys information about the behavior. We present a probabilistic model of trainer feedback that describes how a trainer chooses to provide explicit reward and/or explicit punishment and, based on this model, develop two novel learning algorithms (SABL and I-SABL) which take trainer strategy into account, and can therefore learn from cases where no feedback is provided. Through online user studies we demonstrate that these algorithms can learn with less feedback than algorithms based on a numerical interpretation of feedback. Furthermore, we conduct an empirical analysis of the training strategies employed by users, and of factors that can affect their choice of strategy. },
    }

  • Gabriel V. de la Cruz Jr., Bei Peng, Walter S. Lasecki, and Matthew E. Taylor. Towards Integrating Real-Time Crowd Advice with Reinforcement Learning. In The 20th ACM Conference on Intelligent User Interfaces (IUI), March 2015. Poster: 41% acceptance rate for poster submissions
    [BibTeX] [Abstract] [Download PDF] [DOI]

    Reinforcement learning is a powerful machine learning paradigm that allows agents to autonomously learn to maximize a scalar reward. However, it often suffers from poor initial performance and long learning times. This paper discusses how collecting on-line human feedback, both in real time and post hoc, can potentially improve the performance of such learning systems. We use the game Pac-Man to simulate a navigation setting and show that workers are able to accurately identify both when a sub-optimal action is executed, and what action should have been performed instead. Demonstrating that the crowd is capable of generating this input, and discussing the types of errors that occur, serves as a critical first step in designing systems that use this real-time feedback to improve systems’ learning performance on-the-fly.

    @inproceedings{2015IUI-Delacruz,
    author={de la Cruz, Jr., Gabriel V. and Peng, Bei and Lasecki, Walter S. and Taylor, Matthew E.},
    title={{Towards Integrating Real-Time Crowd Advice with Reinforcement Learning}},
    booktitle={{The 20th {ACM} Conference on Intelligent User Interfaces ({IUI})}},
    month={March},
    year={2015},
    doi={10.1145/2732158.2732180},
    note={Poster: 41% acceptance rate for poster submissions},
    wwwnote={<a href="http://iui.acm.org/2015/">ACM iUI-15</a>},
    bib2html_rescat={Reinforcement Learning, Crowdsourcing},
    bib2html_pubtype={Short Refereed Conference},
    bib2html_funding={NSF},
    abstract={Reinforcement learning is a powerful machine learning paradigm that allows agents to autonomously learn to maximize a scalar reward. However, it often suffers from poor initial performance and long learning times. This paper discusses how collecting on-line human feedback, both in real time and post hoc, can potentially improve the performance of such learning systems. We use the game Pac-Man to simulate a navigation setting and show that workers are able to accurately identify both when a sub-optimal action is executed, and what action should have been performed instead. Demonstrating that the crowd is capable of generating this input, and discussing the types of errors that occur, serves as a critical first step in designing systems that use this real-time feedback to improve systems' learning performance on-the-fly.},
    }

  • Mitchell Scott, Bei Peng, Madeline Chili, Tanay Nigam, Francis Pascual, Cynthia Matuszek, and Matthew E. Taylor. On the Ability to Provide Demonstrations on a UAS: Observing 90 Untrained Participants Abusing a Flying Robot. In Proceedings of the AAAI Fall Symposium on Artificial Intelligence and Human-Robot Interaction (AI-HRI), November 2015.
    [BibTeX] [Abstract] [Download PDF]

    This paper presents an exploratory study where participants piloted a commercial UAS (unmanned aerial system) through an obstacle course. The goal was to determine how varying the instructions given to participants affected their performance. Preliminary data suggests future studies to perform, as well as guidelines for human-robot interaction, and some best practices for learning from demonstration studies.

    @inproceedings{2015AI_HRI-Scott,
    author={Mitchell Scott and Bei Peng and Madeline Chili and Tanay Nigam and Francis Pascual and Cynthia Matuszek and Matthew E. Taylor},
    title={{On the Ability to Provide Demonstrations on a UAS: Observing 90 Untrained Participants Abusing a Flying Robot}},
    booktitle={{Proceedings of the {AAAI} Fall Symposium on Artificial Intelligence and Human-Robot Interaction ({AI-HRI})}},
    month={November},
    year={2015},
    bib2html_pubtype={Refereed Workshop or Symposium},
    abstract={This paper presents an exploratory study where participants piloted a commercial UAS (unmanned aerial system) through an obstacle course. The goal was to determine how varying the instructions given to participants affected their performance. Preliminary data suggests future studies to perform, as well as guidelines for human-robot interaction, and some best practices for learning from demonstration studies.}
    }

  • Bei Peng, Robert Loftin, James MacGlashan, Michael L. Littman, Matthew E. Taylor, and David L. Roberts. Language and Policy Learning from Human-delivered Feedback. In Proceedings of the Machine Learning for Social Robotics workshop (at ICRA), May 2015.
    [BibTeX] [Abstract] [Download PDF]

    Using rewards and punishments is a common and familiar paradigm for humans to train intelligent agents. Most existing learning algorithms in this paradigm follow a framework in which human feedback is treated as a numerical signal to be maximized by the agent. However, treating feedback as a numeric signal fails to capitalize on implied information the human trainer conveys with a lack of explicit feedback. For example, a trainer may withhold reward to signal to the agent a failure, or they may withhold punishment to signal that the agent is behaving correctly. We review our progress to date with Strategy-aware Bayesian Learning, which is able to learn from experience the ways trainers use feedback, and can exploit that knowledge to accelerate learning. Our work covers contextual bandits, goal-directed sequential decision-making tasks, and natural language command learning. We present a user study design to identify how users’ feedback strategies are affected by properties of the environment and agent competency for natural language command learning in sequential decision making tasks, which will inform the development of more adaptive models of human feedback in the future.

    @inproceedings{2015ICRA-Peng,
    author={Bei Peng and Robert Loftin and James MacGlashan and Michael L. Littman and Matthew E. Taylor and David L. Roberts},
    title={{Language and Policy Learning from Human-delivered Feedback}},
    booktitle={{Proceedings of the Machine Learning for Social Robotics workshop (at {ICRA})}},
    month={May},
    year={2015},
    bib2html_pubtype={Refereed Workshop or Symposium},
    abstract={Using rewards and punishments is a common and familiar paradigm for humans to train intelligent agents. Most existing learning algorithms in this paradigm follow a framework in which human feedback is treated as a numerical signal to be maximized by the agent. However, treating feedback as a numeric signal fails to capitalize on implied information the human trainer conveys with a lack of explicit feedback. For example, a trainer may withhold reward to signal to the agent a failure, or they may withhold punishment to signal that the agent is behaving correctly. We review our progress to date with Strategy-aware Bayesian Learning, which is able to learn from experience the ways
    trainers use feedback, and can exploit that knowledge to accelerate learning. Our work covers contextual bandits, goal-directed sequential decision-making tasks, and natural language command learning. We present a user study design to identify how users’ feedback strategies are affected by properties of the environment and agent competency for natural language command learning in sequential decision making tasks, which will inform the development of more adaptive models of human feedback in the future.}
    }

  • Gabriel V. de la Cruz Jr., Bei Peng, Walter S. Lasecki, and Matthew E. Taylor. Generating Real-Time Crowd Advice to Improve Reinforcement Learning Agents. In Proceedings of the Learning for General Competency in Video Games workshop (AAAI), January 2015.
    [BibTeX] [Abstract] [Download PDF]

    Reinforcement learning is a powerful machine learning paradigm that allows agents to autonomously learn to maximize a scalar reward. However, it often suffers from poor initial performance and long learning times. This paper discusses how collecting on-line human feedback, both in real time and post hoc, can potentially improve the performance of such learning systems. We use the game Pac-Man to simulate a navigation setting and show that workers are able to accurately identify both when a sub-optimal action is executed, and what action should have been performed instead. Our results demonstrate that the crowd is capable of generating helpful input. We conclude with a discussion of the types of errors that occur most commonly when engaging human workers for this task, and a discussion of how such data could be used to improve learning. Our work serves as a critical first step in designing systems that use real-time human feedback to improve the learning performance of automated systems on-the-fly.

    @inproceedings{2015AAAI-Delacruz,
    title={{Generating Real-Time Crowd Advice to Improve Reinforcement Learning Agents}},
    author={de la Cruz, Jr., Gabriel V. and Peng, Bei and Lasecki, Walter S. and Taylor, Matthew E.},
    booktitle={{Proceedings of the Learning for General Competency in Video Games workshop ({AAAI})}},
    month={January},
    year={2015},
    wwwnote={<a href="http://www.arcadelearningenvironment.org/aaai15-workshop/">The Arcade Learning Environment</a>},
    bib2html_pubtype={Refereed Workshop or Symposium},
    bib2html_rescat={Reinforcement Learning, Crowdsourcing},
    bib2html_funding={NSF},
    abstract={Reinforcement learning is a powerful machine learning paradigm that allows agents to autonomously learn to maximize a scalar reward. However, it often suffers from poor initial performance and long learning times. This paper discusses how collecting on-line human feedback, both in real time and post hoc, can potentially improve the performance of such learning systems. We use the game Pac-Man to simulate a navigation setting and show that workers are able to accurately identify both when a sub-optimal action is executed, and what action should have been performed instead. Our results demonstrate that the crowd is capable of generating helpful input. We conclude with a discussion the types of errors that occur most commonly when engaging human workers for this task, and a discussion of how such data could be used to improve learning. Our work serves as a critical first step in designing systems that use real-time human feedback to improve the learning performance of automated systems on-the-fly.},
    }

2014

  • Robert Loftin, Bei Peng, James MacGlashan, Michael Littman, Matthew E. Taylor, David Roberts, and Jeff Huang. Learning Something from Nothing: Leveraging Implicit Human Feedback Strategies. In Proceedings of the 23rd IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), August 2014.
    [BibTeX] [Download PDF]
    @inproceedings{2014ROMAN-Loftin,
    author={Robert Loftin and Bei Peng and James MacGlashan and Michael Littman and Matthew E. Taylor and David Roberts and Jeff Huang},
    title={{Learning Something from Nothing: Leveraging Implicit Human Feedback Strategies}},
    booktitle={{Proceedings of the 23rd IEEE International Symposium on Robot and Human Interactive Communication ({RO-MAN})}},
    month={August},
    year={2014},
    bib2html_pubtype={Refereed Conference},
    bib2html_rescat={Reinforcement Learning},
    }

  • Robert Loftin, Bei Peng, James MacGlashan, Michael L. Littman, Matthew E. Taylor, Jeff Huang, and David L. Roberts. A Strategy-Aware Technique for Learning Behaviors from Discrete Human Feedback. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI), July 2014. 28% acceptance rate
    [BibTeX] [Download PDF]
    @inproceedings{2014AAAI-Loftin,
    author={Robert Loftin and Bei Peng and James MacGlashan and Michael L. Littman and Matthew E. Taylor and Jeff Huang and David L. Roberts},
    title={{A Strategy-Aware Technique for Learning Behaviors from Discrete Human Feedback}},
    booktitle={{Proceedings of the 28th {AAAI} Conference on Artificial Intelligence ({AAAI})}},
    month={July},
    year={2014},
    note={28% acceptance rate},
    bib2html_pubtype={Refereed Conference},
    bib2html_rescat={Reinforcement Learning},
    }

  • James MacGlashan, Michael L. Littman, Robert Loftin, Bei Peng, David Roberts, and Matthew E. Taylor. Training an Agent to Ground Commands with Reward and Punishment. In Proceedings of the Machine Learning for Interactive Systems workshop (at AAAI), July 2014.
    [BibTeX] [Download PDF]
    @inproceedings{2014MLIS-James,
    title={{Training an Agent to Ground Commands with Reward and Punishment}},
    author={James Macglashan and Michael L. Littman and Robert Loftin and Bei Peng and David Roberts and Matthew E. Taylor},
    booktitle={{Proceedings of the Machine Learning for Interactive Systems workshop (at {AAAI})}},
    month={July},
    year={2014},
    bib2html_pubtype={Refereed Workshop or Symposium},
    bib2html_rescat={Reinforcement Learning}
    }