1 Introduction
While autonomous learning of diverse and complex behaviours is challenging, significant progress has been made using deep reinforcement learning (DRL). This progress has been accelerated by the powerful representation learning of deep neural networks (lecun2015deep) and by the scalability and efficiency of RL algorithms (mnih2015human; schulman2017proximal; haarnoja2018soft; lillicrap_continuous_2019). However, DRL still relies on an externally designed reward function that guides learning and exploration. Manually engineering such a reward function is a complex task that requires significant domain knowledge, which hinders the autonomy and adoption of RL. Prior works have proposed unsupervised skill discovery to alleviate these challenges, using empowerment as an intrinsic motivation to explore and acquire abilities (salge2014empowerment; gregor2016variational; eysenbach2018diversity; sharma2019dynamics; campos2020explore).
Although skills discovered without a reward function can serve as primitives for downstream tasks, most of the emergent behaviours in the learned skills are useless or of little interest. This is a direct consequence of underconstrained skill discovery in complex, high-dimensional state spaces. One possible solution is to leverage prior knowledge to bias skill discovery towards a subset of the state space through a handcrafted transformation of the state space (eysenbach2018diversity; campos2020explore). However, utilizing prior knowledge to handcraft such a transformation contradicts the primary goal of unsupervised RL: reducing manual design effort and reliance on prior knowledge.
Instead, we explore how to learn a parameterized state projection that directs skill discovery towards the subset of expert-visited states. To that end, we employ examples of expert data to train a state encoder through an auxiliary classifier, which tries to distinguish expert-visited states from random states. We then use the encoder to project the state space into a latent embedding that preserves the information that makes expert-visited states recognizable. This method extends readily to other mechanisms of learned state projections and to different skill discovery algorithms. Crucially, our method requires only samples of expert-visited states, which can easily be obtained from any reference policy, for example, expert demonstrations.
The key contribution of this paper is a simple method for learning a parameterized state projection that guides skill discovery towards a substructure of the observation space. We demonstrate the flexibility of our state-projection method and how it can be used with the skill-discovery objective. We also present empirical results that show the performance of our method in various locomotion tasks.
2 Related Work
Unsupervised reinforcement learning aims at learning diverse behaviours in a task-agnostic fashion without guidance from an extrinsic reward function (jaderberg2016reinforcement). This can be accomplished through learning with an intrinsic reward such as curiosity (oudeyer2009intrinsic) or empowerment (salge2014empowerment). The notion of curiosity has been utilized for exploration by using predictive models of the observation space and providing a higher intrinsic reward for visiting unexplored trajectories (pathak2017curiosity). Empowerment, in contrast, aims to maximize an agent's control over the environment by seeking states that afford the greatest number of intrinsic options (skills).
Several approaches have been proposed in the literature to utilize empowerment for skill discovery in unsupervised RL. Gregor et al. (gregor2016variational) developed an algorithm that learns an intrinsic skill embedding and used generalization to discover new goals. They used the mutual information between skills and final states as the training objective and hence used a discriminator to distinguish between different skills. Eysenbach et al. (eysenbach2018diversity) used the mutual information between skills and states as an objective while using a fixed embedding distribution of skills. Additionally, they used a maximum-entropy policy (haarnoja2018soft) to produce stochastic skills. However, most of the previous approaches assume a state distribution induced by the policy itself, resulting in a premature commitment to already discovered skills. Campos et al. (campos2020explore) used a fixed uniform distribution over states to break the dependency between the state distribution and the policy.
Certain prior work has addressed the challenge of complex and high-dimensional state spaces by constraining skill discovery to a subset of the state space. Sharma et al. (sharma2019dynamics) learned predictable skills by training a skill-conditioned dynamics model instead of a discriminator to model specific behaviour in a subset of the state space. Eysenbach et al. (eysenbach2018diversity) proposed incorporating prior knowledge by conditioning the discriminator on a subset of the state space using a handcrafted and task-specific transformation. Our work addresses this challenge by guiding the skill discovery towards the subset of expert-visited states. In contrast to inverse reinforcement learning (fu2018learning), we do not explicitly infer the extrinsic reward. Crucially, we do not try to learn the expert policy directly, in contrast to behaviour cloning or imitation learning (ross2011reduction). Our proposed method resembles the algorithm proposed by Li et al. (li2020reinforcement), in which a Bayesian classifier estimates the probability of successful outcome states, resulting in a more task-directed exploration. However, their algorithm does not optimize the mutual information; hence it does not learn diverse skills via the discriminability objective.
3 Preliminaries
In this paper, we formalize the problem of skill discovery as a Markov decision process (MDP) without a reward function: M = (S, A, p), where S is the state space, A is the action space, and p(s' | s, a) is the transition probability density function. The RL agent learns a skill-conditioned policy π(a | s, z), where the skill z is sampled from some distribution p(z). A skill, or option (as first introduced in (sutton2018reinforcement)), is a temporal abstraction of a course of actions that extends over many time steps. We will also consider the information-theoretic notion of mutual information between states and skills, I(S; Z) = H(Z) - H(Z | S), where H denotes the Shannon entropy.
3.1 Skill Discovery Objective
The overall goal of skill discovery is to find a policy capable of carrying out different tasks that are learned without extrinsic supervision for each type of behavior. We consider policies of the form π(a | s, z) that specify different distributions over actions depending on which skill z they are conditioned on. Although this general framework does not constrain how z should be represented, we define it as a discrete variable since this has been empirically shown to perform better than continuous alternatives (eysenbach2018diversity).
We follow the framework proposed by the "diversity is all you need" (DIAYN) algorithm (eysenbach2018diversity), in which skills are learned by defining an intrinsic reward that promotes diversity. Intuitively, each skill should make the agent visit a unique section of the state space. This can be expressed as maximising the mutual information of the state visitation distributions for different skills (salge2014empowerment). To ensure that the visited areas of the state space are spaced sufficiently far apart, we use a soft policy that maximises the entropy of the action distribution. Formally, we maximize the following objective function:
F(θ) = H[A | S, Z] - H[Z | S] + H[Z]    (1)
The first term means that the policy should act as randomly as possible and can be optimized by maximizing the policy's entropy. The second term dictates that each visited state should (ideally) identify the current skill. The third term is the entropy of the skill distribution, which can be maximized by deliberately sampling skills from a uniform distribution during training. Unfortunately, -H[Z | S] requires knowledge about the posterior p(z | s), which is not readily available. Consequently, we approximate the true distribution by training a classifier q_φ(z | s), leading to a lower bound:
G(θ, φ) = H[A | S, Z] + E_{z∼p(z), s∼π(z)}[log q_φ(z | s) - log p(z)]    (2)
The lower bound follows from the non-negativity of the Kullback-Leibler divergence, D_KL(p(z | s) ‖ q_φ(z | s)) ≥ 0, which can be rearranged to E[log q_φ(z | s)] ≤ E[log p(z | s)] (agakov2004algorithm). The classifier q_φ is fitted throughout training with maximum likelihood estimation over the sampled states and active skills. This leads to a scenario where the policy is rolled out for a (uniformly) sampled skill, and the classifier is trained to detect the skill based on the states that were visited. The policy is given a reward proportional to how well the classifier could detect the skill in each state. In the end, this should make the policy favor visiting disjoint sets of states for each skill, leading to a cooperative game between π_θ and q_φ.
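As a concrete illustration, the per-state reward implied by this bound can be computed directly from the skill classifier's logits. The sketch below assumes a uniform prior p(z) = 1/n; the function and variable names are ours, not from the paper's code.

```python
import numpy as np

def diayn_reward(logits, z, n_skills):
    """Pseudo-reward log q_phi(z|s) - log p(z) for the active skill z,
    computed from the skill classifier's logits under a uniform prior."""
    shifted = logits - logits.max()                  # numerically stable log-softmax
    log_q = shifted - np.log(np.exp(shifted).sum())
    return log_q[z] + np.log(n_skills)               # -log p(z) = +log(n_skills)
```

A confident, correct classification yields a reward near log(n_skills), while a classifier that is no better than the uniform prior yields zero, so the policy is only rewarded for reaching states that identify the skill.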
3.2 Limitations of existing methods
A major challenge that arises when maximizing the objective in Equation 2, particularly in applications with high-dimensional spaces, is that it becomes trivial for each skill to find a sub-region of the state space where it is easily recognised by the classifier q_φ. In preliminary experiments, we observed that the existing methods discovered behaviours that covered small parts of the state space. For the HalfCheetah environment (brockman2016openai), this resulted in many skills generating different types of static poses (see Figure 1) and few skills exhibiting "interesting" behaviour such as locomotion.
Optimising for the policy entropy H[A | S, Z] should mitigate this issue to some extent. Increasing the policy's entropy incentivises the skills to progressively visit regions of the state space that are so far apart that not even highly stochastic actions will cause them to overlap accidentally. However, it has been shown that mutual information based algorithms have difficulties spreading out to novel states due to low values of q_φ(z | s) for out-of-sample states (campos2020explore).
4 Proposed Method
The main idea of our approach is to focus the skill discovery towards certain parts of the state space by using expert data as a prior. The DIAYN algorithm can be biased towards a user-specified part of the state space by changing the discriminator to maximize q_φ(z | f(s)), where f represents some transformation of the state space (eysenbach2018diversity). Instead of using a handcrafted f to improve the skills discovered for a navigation task, we aim to learn a parameterized f by using expert data.
4.1 State Space Projections
We consider linear projections of continuous factored state representations of the form e_W(s) = Ws, with s ∈ R^n, W ∈ R^{m×n}, and m < n. In principle, the idea should apply to more complex mappings, such as a multi-layer perceptron. However, we want to limit the scope of skill discovery to a hyperplane within the original state space. For the same reason, we also omit any nonlinearities in the encoder: squeezing the output through a sigmoidal function would limit discriminability at the (potentially interesting) extremes of the encoding, and a ReLU would effectively eliminate all exploration along the negative directions of the projection axes. In summary, the objective of the DIAYN skill classifier becomes:
G(θ, φ) = H[A | S, Z] + E_{z∼p(z), s∼π(z)}[log q_φ(z | e_W(s)) - log p(z)]    (3)
We learn the parameters W of the projection through an auxiliary discriminative objective. Specifically, a binary classifier d_ψ is trained to predict whether an (encoded) state was sampled from the marginal state visitation distribution of a random policy π_r or from that of a reference (expert) policy π_e. Let y ∈ {0, 1} denote whether a state in dataset D was visited by the reference policy; then the parameters of e_W are obtained through joint pretraining with d_ψ by maximizing the log-likelihood over D:
max_{W, ψ} E_{(s, y)∼D}[y log d_ψ(e_W(s)) + (1 - y) log(1 - d_ψ(e_W(s)))]    (4)
where the dataset D is collected prior to training the main RL algorithm. The first half (random samples) is collected by rolling out π_r, whereas the second half (reference samples) is collected by rolling out π_e. After the objective in Equation 4 is optimized, the discriminator d_ψ is discarded and the projection encoder e_W is extracted to be used for the objective in Equation 3.
Analogous to autoencoders (hinton_reducing_2006), the idea is that the embeddings produced by e_W should contain a more compact representation of the state space without collapsing the dimensions that make "interesting" behaviour stand out. While the use of reference data changes our approach from a strictly unsupervised skill discovery algorithm, the discriminative objective in Equation 4 resembles the objectives used in adversarial inverse reinforcement learning (e.g., (fu2018learning)). However, it differs in that it makes no attempt at matching the behaviour of a reference policy, which is used only as a prior for simplifying the state space. This approach could also be used with samples from several different reference policies with substantially different marginal state distributions: as long as their variation can be explained sufficiently without full use of the entire state space, a projection should simplify skill discovery.
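For a linear encoder with a logistic classifier head, the joint pretraining of Equation 4 reduces to gradient ascent on a binary cross-entropy likelihood. The NumPy sketch below illustrates this on synthetic stand-ins for expert and random states; the names (W, v, b) and hyperparameters are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrain_encoder(S, y, m, lr=0.5, steps=3000):
    """Jointly fit a linear encoder e_W(s) = W s and a logistic head
    d(e) = sigmoid(v.e + b) by gradient ascent on the Eq. 4 likelihood."""
    n = S.shape[1]
    W = rng.normal(scale=0.1, size=(m, n))
    v = rng.normal(scale=0.1, size=m)
    b = 0.0
    for _ in range(steps):
        E = S @ W.T                              # encoded states, shape (N, m)
        p = 1.0 / (1.0 + np.exp(-(E @ v + b)))   # predicted P(expert | state)
        g = (y - p) / len(y)                     # gradient of mean log-likelihood wrt logits
        W += lr * np.outer(v, S.T @ g)           # chain rule: d(logit)/dW = v s^T
        v += lr * (E.T @ g)
        b += lr * g.sum()
    acc = ((p > 0.5) == (y > 0.5)).mean()
    return W, acc

# Synthetic data: "expert" states differ from "random" ones only in feature 0.
expert = rng.normal(size=(200, 5)); expert[:, 0] += 3.0
random_ = rng.normal(size=(200, 5))
S = np.vstack([expert, random_])
y = np.concatenate([np.ones(200), np.zeros(200)])
W, acc = pretrain_encoder(S, y, m=2)
```

On this toy data the classifier reaches high training accuracy, and the learned projection concentrates weight on the discriminative feature, which is the property the skill classifier in Equation 3 then inherits.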
4.2 Implementation
For learning diverse skills, we use DIAYN as a base framework. DIAYN uses the Soft Actor-Critic (SAC) algorithm (haarnoja2018soft), optimized with policy-gradient-style updates rather than the reparameterized version (DDPG-style updates (lillicrap_continuous_2019)). It also uses a squashed Gaussian mixture model to represent the policy. The learning objective is to maximize the mutual information I(S; Z) between the state and the skill. This objective is optimized by replacing the task reward with a pseudo-reward
r_z(s, a) = log q_φ(z | s) - log p(z)    (5)
where q_φ is trained to discriminate between skills and p(z) is the fixed uniform prior over skills (eysenbach2018diversity). A skill z is sampled from p(z) and used throughout a full episode.
In contrast to DIAYN, we use two Q-functions, Q_1 and Q_2, which both attempt to predict the same quantity. This allows us to sample differentiable actions and climb the gradient of the minimum of the two Q-functions (DDPG-style update (lillicrap_continuous_2019)), giving us the objective:
max_θ E_{s∼B, a∼π_θ}[min_{i∈{1,2}} Q_i(s, a, z) - α log π_θ(a | s, z)]
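Per sample, this clipped double-Q actor objective can be sketched as follows (NumPy; the entropy weight alpha is illustrative):

```python
import numpy as np

def actor_objective(q1, q2, log_pi, alpha=0.1):
    """Per-sample actor objective with clipped double-Q: climb the minimum
    of the two Q-estimates minus an entropy penalty (sketch only)."""
    return np.minimum(q1, q2) - alpha * log_pi
```

Taking the minimum of the two estimates is the standard pessimistic choice that counteracts Q-value overestimation during the DDPG-style update.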
As in DIAYN, we also use a squashed Gaussian mixture model to promote diverse behaviour.
Figure 2 illustrates the training process of the proposed expert-guided skill discovery. First, we train the encoder e_W jointly with the auxiliary classifier d_ψ using the external dataset D. Second, we train the agent using an off-policy algorithm (SAC), in which the agent samples a skill z ∼ p(z) and then interacts with the environment by taking actions according to the skill-conditioned policy π(a | s, z). The environment then transitions to a new state according to the transition probability p(s' | s, a), and we add this transition to the replay buffer B. Simultaneously, the policy is updated by sampling a minibatch from B, encoding the next states, and passing them through the discriminator q_φ to obtain the intrinsic reward. This reward is used by the Q-functions to minimize the soft Bellman residual and update the policy. Pseudocode for the proposed approach can be found in the supplementary material.
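The data-collection and relabeling part of this loop can be sketched end to end. Everything below (the toy environment, the stand-in callables for e_W and q_φ, and all names) is illustrative rather than the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyEnv:
    """Hypothetical 1D chain environment standing in for the MDP."""
    def reset(self):
        self.s = np.zeros(2)
        return self.s.copy()
    def step(self, a):
        self.s[0] += float(np.clip(a, -1.0, 1.0))
        return self.s.copy(), False          # (next_state, done)

def collect_and_reward(env, encode, discriminate, n_skills=4, episodes=3, horizon=5):
    """Skeleton of the loop in Figure 2: sample a skill per episode, roll out,
    store transitions, then label them with the intrinsic reward
    log q(z | e(s')) - log p(z). Policy and Q updates are elided."""
    buffer = []
    for _ in range(episodes):
        z = int(rng.integers(n_skills))      # z ~ p(z), fixed for the episode
        s = env.reset()
        for _ in range(horizon):
            a = rng.uniform(-1.0, 1.0)       # placeholder for pi(a | s, z)
            s2, done = env.step(a)
            buffer.append((s, z, a, s2))
            s = s2
            if done:
                break
    rewards = [np.log(discriminate(encode(s2))[z]) + np.log(n_skills)
               for (_, z, _, s2) in buffer]
    return buffer, rewards

buffer, rewards = collect_and_reward(
    ToyEnv(),
    encode=lambda s: s[:1],                   # stand-in for e_W
    discriminate=lambda e: np.full(4, 0.25),  # uninformative q_phi over 4 skills
)
```

With an uninformative discriminator, every intrinsic reward is zero; rewards only become positive once the discriminator can recover the skill from the encoded next state.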
5 Experiments
In our experimental evaluation, we aim to demonstrate the impact of restricting skill discovery to a projection subspace. We verify our method on both point mazes and continuous control locomotion tasks. All the code for running the experiments is publicly available on GitHub: https://github.com/sherilan/cs285project/tree/master.
5.1 Point Maze
As an illustrative example, we begin by testing the algorithm on a simple 2D point-maze problem. The term maze is used very generously here, as the environment consists of an open 2D plane enclosed by walls that restrict the agent to a bounded region. At initialization, the agent is dropped at a fixed start position and incentivized to move towards the lower right by a reward proportional to a Gaussian kernel centered in the lower-right region. The agent is free to move by a bounded step in both the x and y directions.
We train a SAC agent against the extrinsic environment reward to convergence and set its final policy as the reference policy π_e. We then sample 10 trajectories of length 100 (green dots in Figure 3) with π_e, as well as 10 trajectories of length 100 (blue dots in Figure 3) from a uniform random policy π_r. The resulting dataset consists of 2000 samples and is used to train the encoder and classifier until they can distinguish states from π_e and π_r with around 98% accuracy. For this experiment, we project down from 2D to 1D, making W a 1 × 2 matrix. The resulting projection axis is visualized as a red line in Figure 3 and is the only thing exported to the next stage of the algorithm.
We then train two versions of the DIAYN algorithm: a baseline using the states as-is in the classifier, q_φ(z | s), and our proposed method using the state projections, q_φ(z | e_W(s)). All other hyperparameters are held equal in the two experiments, and the algorithms are trained for 400,000 environment interactions, each attempting to learn 10 distinct skills.
Figure 4 visualizes the results of the baseline and the state projection to the left and right, respectively. The top row shows the states that were visited for five rollouts of each skill. As expected, the baseline skills spread out in all directions (albeit slightly more so towards the left) and converge on locations that are easy to distinguish with a 2D state representation. In contrast, the skills generated with the state projection form lines along the projection axis. Their wide lateral spread follows from the (unbounded) entropy maximization objective; moreover, any movement perpendicular to the projection axis does not affect the 1D vector passed to the classifier.
5.2 Mujoco Environments
Next, we evaluate the algorithm on three continuous control problems from the OpenAI Gym suite (brockman2016openai): HalfCheetah, Hopper, and Ant. We choose these environments because they involve substantially different locomotion methods. Additionally, their observation spaces have different dimensionality, which enables us to better investigate the impact of the projection.
Table 1: Percentiles of skill displacement along the target locomotion axis; ± values indicate standard deviation across 5 seeded runs.

                                       min            25%          50%         75%          max
HalfCheetah-v2   DIAYN                -9.7 ± 11.1    -0.1 ± 0.0   0.0 ± 0.1   0.3 ± 0.1    76.6 ± 46.6
                 DIAYN + ENC(3)      -88.6 ± 42.1    -0.4 ± 0.4   0.2 ± 0.1   3.9 ± 4.7    99.0 ± 37.5
                 DIAYN + ENC(5)     -129.0 ± 48.5    -6.1 ± 9.5   0.7 ± 1.2   6.6 ± 5.1   121.2 ± 44.7
Hopper-v2        DIAYN                -1.0 ± 1.1      0.0 ± 0.1   0.1 ± 0.0   0.2 ± 0.1     3.7 ± 1.4
                 DIAYN + ENC(3)       -4.3 ± 1.8      0.0 ± 0.1   0.2 ± 0.2   0.9 ± 0.6    10.1 ± 6.1
                 DIAYN + ENC(5)       -3.1 ± 1.4      0.1 ± 0.0   0.1 ± 0.0   0.4 ± 0.1     7.0 ± 3.2
Ant-v2           DIAYN                -0.3 ± 0.0     -0.1 ± 0.0   0.0 ± 0.0   0.1 ± 0.0     0.3 ± 0.1
                 DIAYN + ENC(3)       -0.3 ± 0.1     -0.1 ± 0.0   0.0 ± 0.0   0.1 ± 0.0     0.3 ± 0.1
                 DIAYN + ENC(5)       -0.5 ± 0.5     -0.1 ± 0.0   0.0 ± 0.0   0.1 ± 0.0     0.3 ± 0.1
We do one baseline run without any state projection for all three problems, one with a projection down to 3 dimensions (ENC(3)), and one with a projection down to 5 dimensions (ENC(5)). We use our base SAC implementation to obtain reference policies and sample 10 trajectories of length 1000 with fairly high returns for each environment. The DIAYN algorithm is otherwise identical to (eysenbach2018diversity) in terms of hyperparameters: the policy, Q-functions, and skill classifier use MLP architectures with 2 hidden layers of width 300, the entropy bonus weight is set to 0.1, and the number of skills is set to 50. We limit each skill-discovery run to 2.5 million environment interactions but repeat each experiment 5 times with different random seeds (including training of the SAC agents for the reference policies).
For quantitative evaluation, we look at the displacement along the axis in which the extrinsic objective rewards locomotion. With our approach, we would expect to observe skills that cover this axis well, i.e., skills that run forward and backward at different speeds. To test this, we roll out each skill deterministically (deterministic sampling from our GMM-based policy means taking the mean of the component with the highest mixture probability), record its movement over 1000 time steps (or until it reaches a terminal state), and observe the inter-skill spread. A similar assessment would be possible by looking only at the environment's rewards. However, the environment reward also includes terms for energy expenditure, staying alive (Ant/Hopper), and collisions (Ant), which would obscure the results. Figure 6 shows the displacement distribution of the 50 skills across all runs. The same information is summarized numerically in Table 1.
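The deterministic rollout rule used for this evaluation amounts to the following (names illustrative; standard deviations are unused when acting deterministically):

```python
import numpy as np

def deterministic_action(mixture_logits, means):
    """Deterministic action from a squashed GMM policy: take the mean of the
    component with the highest mixture probability, then tanh-squash."""
    k = int(np.argmax(mixture_logits))   # argmax of logits == argmax of softmax
    return np.tanh(means[k])             # squashing as in SAC-style policies
```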
For a qualitative evaluation, we have also composed a video with every skill across all runs: https://www.youtube.com/watch?v=Xx7RVNmv1tY.
6 Discussion
For HalfCheetah and Hopper, the runs with state encoding (+ENC(3) and +ENC(5)) exhibit a substantially larger spread than the baseline. The best forward-moving cheetah skill moves 178 units forward, and the best backwards-moving cheetah skill moves 186 units backwards. For the Hopper environment, the best forward-moving skill manages to jump 20 units forward, which corresponds to an environment reward of 3268 and is on the same level as the reference data used to fit its encoder.
The results in the Ant environment are less impressive. There is hardly any difference in how the displacements are distributed for the three approaches, and the total movement is almost negligible. For reference, a good Ant agent trained against the extrinsic reward should obtain displacements in the 100s when evaluated over the same trajectory horizon.
Looking at the generated Ant behaviour, we found that the skills produced with encoders typically moved even less than those generated by the baseline. This is not because it is impossible to produce a linear projection that promotes locomotion at various speeds, as the state representation of all three problems contains a feature for linear velocity along the target direction. Moreover, the skill classifier does reach a high accuracy (some runs breaking 90%), so the algorithm does manage to find distinguishable skills. We therefore suspect that the procedure used to fit the encoder is insufficient for this environment: while it does pick up on linear velocity, it also picks up on several other features of the state space, which may have given the algorithm easier ways to make the skills distinguishable.
To better understand the results of the Ant experiment, we investigate the projection matrix learned at the start of the algorithm. Figure 7 gives a representative example of a projection learned for an ENC(3) run. In the diagram, each bar indicates the impact each feature of the state space has on the final embedding. The orange bar highlights the feature corresponding to linear torso velocity in the x-direction, i.e., the direction in which the extrinsic objective rewards the agent for running. All the bars to the left correspond to joint configurations and link orientations, and all the bars to the right correspond to other velocities.
The feature for velocity in the target direction is well represented. However, so are the features for the 8 joint velocities (8 rightmost bars in each group). Since it is a lot easier to move a single joint than to coordinate all of them for locomotion, the algorithm might more easily converge to this strategy than figure out a way to walk. Moreover, because the projection mixes features for movement of single joints with features for locomotion of the entire body, it becomes more difficult for the classifier to distinguish the two. For instance, an ant that figures out how to walk may (in the projected space) look similar to one that only twitches some of its joints.
7 Conclusion
In this work, we propose a datadriven approach for guiding skill discovery towards learning useful behaviors in complex and highdimensional spaces. Using examples of expert data, we fit a statespace projection that preserves information that makes expert behavior recognizable. The projection helps discover better behaviors by ensuring that skills similar to the expert are distinguishable from randomly initialized skills. We show the applicability of our approach in a variety of RL tasks, ranging from a simple 2D point maze problem to continuous control locomotion. For future work, we aim to improve the embedding scheme of the state projection to be suitable for a wider range of environments.
Acknowledgment.
We would like to thank Kerstin Bach and Rudolf Mester for their useful feedback.
References
Appendix A Pseudocode
Appendix B Additional Experimental Details
Appendix C Implementation Details
Conceptually, our skill-discovery algorithm is the same as DIAYN (eysenbach2018diversity). There are, however, a few implementation differences that we empirically found to work just as well. Below is a brief rundown of the key implementation details of the algorithm used in the documented experiments.

- Two Q-functions, Q_1 and Q_2, are used, both with target clones Q̄_1 and Q̄_2 that are continuously updated with Polyak averaging. Both Q-functions attempt to predict the same quantity:
  Q_i(s, a, z) ≈ r_z(s, a) + γ E_{s'∼p, a'∼π}[min_j Q̄_j(s', a', z) - α log π(a' | s', z)]
- The policy distribution is a mixture of Gaussians with four components. The policy network predicts the mixture logits, as well as the means and log standard deviations of the Gaussians. The output is squashed through a hyperbolic tangent function, similar to (haarnoja2018soft).
- The policy is updated by climbing the gradient of the minimum of the two Q-functions (DDPG-style (lillicrap_continuous_2019)). This requires that the actions sampled from the policy are differentiable. Each Gaussian component of the mixture is reparametrized in the standard way, and the mixture is reparametrized with Gumbel-Softmax (jang2017categorical).
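The reparameterized mixture draw described above can be sketched as follows (NumPy; the temperature and all names are illustrative, not the paper's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture_action(logits, means, log_stds, temp=1.0):
    """Reparameterized draw from a squashed Gaussian mixture: Gumbel-Softmax
    relaxes the discrete component choice, and each Gaussian component uses
    the standard epsilon-reparameterization."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1) noise
    w = np.exp((logits + g) / temp)
    w = w / w.sum()                                       # relaxed one-hot over components
    eps = rng.normal(size=means.shape)
    comps = means + np.exp(log_stds) * eps                # one sample per component
    return np.tanh(w @ comps)                             # convex combination, then squash
```

In a differentiable framework, gradients flow through both the relaxed weights w and the component samples, which is what makes the DDPG-style update possible.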

- Each Q-function is trained by descending on the squared temporal difference (TD) errors generated by the minimum of the target networks Q̄_1 and Q̄_2.
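For a single transition, this soft Bellman target can be computed as follows (sketch; the gamma and alpha values are illustrative):

```python
import numpy as np

def td_target(r, q1_next, q2_next, log_pi_next, gamma=0.99, alpha=0.1, done=False):
    """Shared soft Bellman target for both Q-functions:
    y = r + gamma * (min of target-Qs - alpha * log pi) on non-terminal steps."""
    soft_v = min(q1_next, q2_next) - alpha * log_pi_next
    return r + (0.0 if done else gamma * soft_v)
```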