Goodhart’s Law in Reinforcement Learning

Hal Ashton, a PhD student at University College London, joins us today to discuss his recent work, Causal Campbell-Goodhart's Law and Reinforcement Learning.

Also mentioned was The Book of Why by Judea Pearl.


Automated Transcript

Kyle In reinforcement learning, an agent gets ambiguous feedback in the form of a reward. The agent must find its own way to a strategy which maximizes expected reward. You might think that a perfectly rational, reward-pursuing agent would narrowly focus its efforts on success, exploiting only true causal relationships. If you thought that way, you'd be wrong. This is Data Skeptic: Consensus, and the 27th installment in our series about how multi-agent systems achieve collective decision making. Hal Ashton joins me today to discuss his research demonstrating particular emergent behaviors in some of these agents that are a bit like superstition. It seems reinforcement learning is not immune to cognitive errors.
 
Hal My name is Hal Ashton. I'm a PhD student at University College London. My main topic of research is market manipulation: if you have a self-taught trading algorithm, does it learn to manipulate markets? And then going on from that, how do you stop it from manipulating the market? More generally, how do you tell an algorithm which learns a policy that certain things are illegal and shouldn't be done? So it's taken me on a more circuitous journey than I thought it would, because I've gone into the world of experimental psychology and law, because a lot of market abuse laws are very interested in intent, criminal intent, and they're also interested in causality. My work at the moment is trying to establish what intent looks like for an algorithm, and the work that we're going to be talking about today is connected to causality. This wasn't an area that really came up in my formal machine learning education. It was only probably just over a year ago, when I read a book by Judea Pearl, The Book of Why, that I really started getting interested in the subject of causality and realized that it's something that isn't taught and isn't handled well in machine learning at all. If you read the book, Pearl is very adamant about the importance of causality and how certain types of analysis simply can't work without taking causality into account. So it drew a question in my mind as to why, or how, certain machine learning techniques like reinforcement learning work when there's no explicit treatment of causality at all. That led to the dog barometer paper, as I call it, though formally it's called Causal Campbell-Goodhart's Law and Reinforcement Learning. Reinforcement learning should be all about causality: you have an agent, it chooses actions, those actions somehow change the world, it receives some kind of reward, the world changes, and so it goes on. So you would think that reinforcement learning and causality go well together. And actually, if you were to believe Pearl, they can't work: reinforcement learning shouldn't be able to work if causality isn't explicitly treated. But if you look at Sutton and Barto, which is the canonical text on reinforcement learning, and do a search for the word causality, causal, anything like that, it appears exactly zero times in the entire book. So someone's got to be wrong. Either Pearl is wrong in saying that causality is absolutely key to generating any kind of policy to solve a problem, or reinforcement learning has just got extremely lucky so far, in that the questions it has studied haven't contained any kind of interesting causal structure.
 
Kyle I agree that the word causality is suspiciously missing from most of the reinforcement learning books and papers I have encountered, but it almost feels like it could be there implicitly. Could this just be a matter of semantics?
 
Hal Well, that's the great hope from deep reinforcement learning, I guess: that somehow, by involving a deep neural network in all of this, that deep neural network, which is unknowable and mysterious, does the job of analyzing causality for you. It does it automatically, so you don't need to think about it, it's done, you don't need to worry about it, reinforcement learning works, so let's move on. And I guess there is an element of that, because reinforcement learning does work. But it made me think that if you look at science in general, there's a long history of humans discovering techniques which work before actually understanding why they work. So I was thinking about iron. The Iron Age was around 500 BC, even slightly earlier, and to make iron you need to take iron ore and smelt it with coke; the coke burns in the air and produces carbon monoxide, and the carbon monoxide then displaces the oxide in the iron oxide, which leaves you with wrought iron. So at what point in human science did we realize that was happening? Probably 2,000 years later, maybe the seventeen or eighteen hundreds. All this time we still had iron tools, so it didn't really matter that we didn't know why it worked; it did kind of work. And I liken that a bit to reinforcement learning: you have a process which does work, it's a bit mysterious, maybe you've got to do certain things to get it to work, but the actual understanding as to why it works isn't there. And I think without an understanding of causality we can't really understand why reinforcement learning works.
 
Kyle I can pull out a textbook, or maybe go on GitHub and find some source code for a reinforcement learning problem, and as a programmer I can read those lines and tell you what it's doing, and even verify the arithmetic calculations.
 
Hal It does.
 
Kyle So in some sense I understand how it works. What's the difference between my version of understanding and yours?
 
Hal I mean, firstly, if you look at the canonical success examples of RL, they are in computer games, and in computer games the relationship between your action choice and the causal effect of your action choice is very clear. You press up on your joystick and your Pac-Man moves up; you press left and it goes left. But in the real world the impact of your actions is not always obvious, and there are hidden variables involved. So it might just be that the success cases of RL are in simple kinds of causal examples where everything is very much visible and the causal effects are quite simple. That is the case with basic computer games, and Go as well: in AlphaGo it chooses a position to play its counter, and the counter appears there; there's no uncertainty as to whether it appears there or not, it's completely visible. But when you start porting reinforcement learning into real-life scenarios, this 100% correspondence between action and effect doesn't translate across. An example is my area of study, trading. If you want to train a trading algorithm using reinforcement learning, then all of a sudden you have a whole slew of unobservable variables which you're never going to observe. You're never going to know what the trading positions of other traders in the market are, for example, and that's something that's very different from the lab uses of RL.
 
Kyle If I were to be a reinforcement learning apologist, I guess some of the things I might point out are, you know, there are mechanisms for dealing with delayed rewards or hidden variables. Perhaps I could have a belief about the value of the hidden variable and update my belief in some Bayesian way, and be able to take good probabilistic actions. And certainly those techniques do work. Are those necessary but not sufficient? Or how would you characterize the current state?
 
Hal I would say it's quite complicated to understand why they work exactly. The aim of my paper was to use a very, very simple causal example to check out a number of common implementations of deep RL and see whether they cope, and the hope is that when you use a toy model, you can easily see what's going wrong if something does go wrong. So my example is about a dog that lives in Scotland. The weather in Scotland is either sunny or rainy, and the dog likes going for a walk, as all dogs do, but it would like to be dressed in the appropriate way for the weather: it can wear a coat if it's raining, or not wear a coat if it's sunny. If it's wearing a coat and it's sunny, that's a bad outcome, and if it's raining and it doesn't wear a coat, that's a bad outcome. So the dog is in his house, and luckily he has a barometer. The barometer measures the pressure at any time, and pressure is a good predictor for the weather in the next period, when he goes for a walk. So this is a very simple RL game, which I was able to program into OpenAI's Gym environment. The other option that the dog has is to press the barometer; pressing the barometer sets the barometer to a high reading, which in the normal way would indicate high pressure, which would indicate sunny weather for going for a walk. So the paper just investigates, if you use two standard deep RL algorithms, whether the stupid, naive policy comes out or whether the clever policy comes out, the stupid policy being that the dog presses the barometer to cause the pressure reading to be high, so it will think that the weather in the next period will be sunny and it can go outside without its coat. That's a very simple problem where, as humans, we intuitively understand that there's no causal mechanism between the barometer and the weather; it works one way only. The barometer is an indicator of the weather, but you can't change the weather by manipulating the barometer. This is simple and intuitive for humans, and I think as humans we understand this causal relationship quite early on. But for a neural network, for AI, for RL, it's an open question whether that relationship is understood at all and how it is understood. Looking through the existing research, there was nothing really which covered this, which justified this paper. And with this very simple model, the results were, I guess I'll call it, a draw for the RL guys and for the causal guys. One of the algorithms succeeded very well in ignoring the fact that the dog could press the barometer: in fact, you should just look at the barometer, not try to manipulate it, and then go out with or without the coat depending on the barometer reading. The other algorithm, which was DQN, a deep variant of Q-learning, was the one that failed miserably. That was the one that told the dog to press the barometer if it had a low reading, so that it could make a high reading and therefore make it sunny. Now, the two algorithms, as it happens, are of two different types in RL: one of them is an off-policy learning algorithm, and one of them is an on-policy learning algorithm. The one that succeeded is the on-policy algorithm. That's the one that learns by choosing a policy, then following that policy, evaluating it at the end, and then adjusting the policy. The thing that's important here is that it will understand that pressing the barometer, in any policy, doesn't really affect the outcome. That was A2C, which is a successor of DQN; it actually came out a few years later and has proven to be superior in a number of ways. Using that method we got back the perfect answer, which is: don't bother pressing the barometer, just read it and then wear your coat depending on the reading. The other one, where it gets a bit more interesting, is DQN. DQN is off-policy learning, so it never actually assesses a whole policy before changing the policy. What it does is pool data together which it has gathered through kind of random experiences, and then it tries to recover an optimal policy from that. And this is where you get to the nub of the matter as to why it fails: when it gathers its data together, it fails to distinguish between the cases when the dog did press the barometer and the cases where it didn't. It just mixes them up together, and when it comes to recovering an optimal policy, because it mixes these two settings together, it recovers the policy where it says that you should press the barometer. This is interesting in itself, because the result says that your RL approach is going to be more robust to causal errors if it is an on-policy learning method, and that was not pointed out in the literature before. There are methods to debias DQN; people have noticed that there are problems with DQN and this caching together of data generated under different policies, and they've come up with a number of algorithms based on importance sampling, so they try to correct for the fact that the data is generated under different policies. But if you actually look through the papers, it's quite hard to understand what they're doing exactly, whereas if you interpret it from a causal scientist's point of view, it's very easy to see why DQN has this causal error tendency. In terms of a teaching aid, I think it really helps to look through this causal lens to see why DQN fails.
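A rough illustration of the off-policy issue Hal describes: a vanilla DQN-style replay buffer stores transitions with no record of which behaviour policy produced them, so experience gathered while pressing the barometer gets pooled with experience where the reading was naturally high. This is my own minimal sketch, not code from the paper; the names and the stored tuple layout are illustrative.

import random
from collections import deque

class ReplayBuffer:
    """Minimal DQN-style experience replay (illustrative, not the paper's code)."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, obs, action, reward, next_obs, done):
        # Note what is *not* stored: which policy generated this transition, or
        # whether a high barometer reading was forced by an earlier press.
        # That lost context is what lets the learner mix up "the barometer
        # happened to be high" with "I made the barometer high".
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size=32):
        # Training batches mix transitions from every behaviour policy seen so far.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))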
 
Kyle I was hoping we could break down a little bit of the problem. I know you're using an MDP, a Markov decision process, which is defined by a very convenient tuple. Even for listeners who don't know it, I think it's helpful to just walk through the setup. Would you mind telling us a little bit about the states and actions and such?
 
Hal So in terms of states for the model, there is a pressure state, where pressure is the predictor for weather. There's a weather state, and the weather is only dependent on the pressure in the previous period. And then there's a barometer state, which is only dependent on pressure. You can represent this in a nice simple causal diagram, where arrows between nodes denote a one-way causal relationship. So first off you have the pressure variables: P0, your pressure at the first period, with an arrow going to P1, and then an arrow going to P2. Then imagine arrows moving upwards from the pressure variables up to the barometer variables, an arrow from the action variables moving into the barometer variables, and then an arrow moving from pressure into the weather variables. There are four different actions that the dog can choose: it can wait for the next period without doing anything; it can press the barometer, and if it presses the barometer, the barometer gets set to a high setting next period; it can also leave the house with a coat; and it can leave the house without a coat. When it leaves the house, the game ends. That's the simple setup. In terms of rewards, the dog gets rewarded the most if it goes out in sunny weather without a coat; it gets a little less reward if it goes out in rainy weather with a coat; and it gets heavily penalized if it's just not dressed appropriately, so if it's wearing a coat and it's sunny, or if it's not wearing a coat and it's rainy.
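To make the setup concrete, here is a minimal sketch of the dog-and-barometer game as an old-style OpenAI Gym environment. This is my own illustration, not the paper's code: the transition probabilities, reward values, episode cap, and the choice to expose only the barometer reading as the observation are all assumptions.

import numpy as np
import gym
from gym import spaces

WAIT, PRESS, GO_WITH_COAT, GO_WITHOUT_COAT = range(4)

class DogBarometerEnv(gym.Env):
    """Hidden pressure drives both the walk-time weather and the barometer;
    pressing the barometer only changes the reading, never the weather."""

    def __init__(self, pressure_persistence=0.5, max_steps=20):
        # persistence = 0.5 is the uncorrelated random-walk setting mentioned in
        # the interview; values above 0.5 give autocorrelated pressure (assumed).
        self.p_persist = pressure_persistence
        self.max_steps = max_steps
        self.action_space = spaces.Discrete(4)
        self.observation_space = spaces.MultiDiscrete([2])  # barometer reading only

    def reset(self):
        self.t = 0
        self.pressure = np.random.randint(2)   # 0 = low, 1 = high (hidden from the agent)
        self.barometer = self.pressure         # the reading tracks the true pressure
        return np.array([self.barometer])

    def step(self, action):
        self.t += 1
        if action in (GO_WITH_COAT, GO_WITHOUT_COAT):
            # The walk happens next period; that weather is driven by the *current*
            # pressure, never by the barometer reading (probabilities assumed).
            p_sunny = 0.9 if self.pressure == 1 else 0.1
            sunny = np.random.rand() < p_sunny
            if action == GO_WITHOUT_COAT:
                reward = 1.0 if sunny else -1.0
            else:
                reward = 0.5 if not sunny else -1.0
            return np.array([self.barometer]), reward, True, {}
        # WAIT or PRESS: nature moves on one period.
        if np.random.rand() > self.p_persist:
            self.pressure = 1 - self.pressure
        self.barometer = 1 if action == PRESS else self.pressure
        return np.array([self.barometer]), 0.0, self.t >= self.max_steps, {}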
 
Kyle Gotcha. Well, it seems like a good setup where I would hope we might see some behaviors like, first of all, the dog learning to ignore the button, and maybe even the dog learning to take advantage of wait if it determined from the measurement that it's raining out. As you said, going out rainy with a coat is rewarding, but even better is sunny with no coat. And if I somehow figured out it's raining now and it's going to get sunny, perhaps, depending on the reward structure, I should wait. These are the things we'd like to achieve, I guess. Did you expect those behaviors going in, or did you think something else would come out of it initially?
 
Hal I was kind of neutral as to what my expectations were. I changed the pressure setting: originally pressure is uncorrelated, so it's just a random walk, it switches between high pressure and low pressure with no particular relationship, and in a second experiment I tried a kind of correlation, so if pressure was high last period then it probably will be high this period. In the first setting, I guess intuitively you would expect waiting: given that it gets rewarded more for sunny weather, if the pressure reading is low at the moment, it could probably wait for a bit for the pressure to go up to a high level, so it's probably going to be sunny the period after that, and then go out without the coat. Once you have this autocorrelation of pressure, the waiting strategy becomes a little less rewarding, because you could be stuck in low pressure for a while, which is actually slightly more realistic of English and Scottish weather: if you're waiting for it to be sunny, you might have to wait a number of days. I was actually kind of thinking that, given the neural networks that I gave it (I think they were 64-neuron, two-layer networks), I was expecting the optimal strategies to be returned each time, and the fact that they didn't for one of the learning methods was surprising. And it's not only that they didn't: they would consistently not return it, so DQN would consistently return the naive policy 10 out of 10 times.
 
Kyle Can we talk for a moment about maybe the engineering side of it? What does it take to go from this interesting problem setup to software that computes a policy for you, and how do you run these tests?
 
Hal I spent a long time on previous projects faithfully looking through papers and then coding up the algorithms that are present in the papers, and that's great for personal development. It's good for coding, it's good for understanding why the algorithms work and how they work.
 
Kyle Good for having bugs, too?
 
Hal Yes. On the flip side, if you code something yourself, especially these algorithms, which are quite finicky, there's a high chance that you're introducing bugs, and the bugs that you're introducing aren't necessarily easy to find. So having had that experience, learned from it, and felt that I've got enough learning experience from programming it myself, I found there's a package in Python called Stable Baselines where someone else does the programming of these deep learning algorithms. For scientific endeavor, where the objective is not to develop new learning algorithms but to test something else, this kind of package is invaluable, because it saves an enormous amount of time, it makes your work repeatable, and it also eliminates some of the risk of a bug in there invalidating the results. So I would say to anyone getting going in reinforcement learning: using someone else's implementation of these learning algorithms is highly recommended, because it will save you so much time debugging.
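For flavour, training with an off-the-shelf implementation looks roughly like this. The paper used the original Stable Baselines package; the sketch below uses its successor, stable-baselines3, whose API I am more certain of (details vary between library versions), and assumes the DogBarometerEnv sketched earlier. The network size mirrors the 64-unit, two-layer networks mentioned in the conversation, which is also the library default.

from stable_baselines3 import A2C, DQN

env = DogBarometerEnv()  # the toy environment sketched above (assumed)

# On-policy learner: the family that solved the task in the experiments described here.
a2c = A2C("MlpPolicy", env, policy_kwargs={"net_arch": [64, 64]}, verbose=0)
a2c.learn(total_timesteps=100_000)

# Off-policy learner: the family that ended up pressing the barometer.
dqn = DQN("MlpPolicy", env, policy_kwargs={"net_arch": [64, 64]}, verbose=0)
dqn.learn(total_timesteps=100_000)

obs = env.reset()
action, _ = dqn.predict(obs, deterministic=True)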
 
Kyle Thanks to this week's sponsor, LinkedIn Jobs. 2021 is looking up: new beginnings mean new opportunities to grow your business. If part of your strategy is adding new members to your team, LinkedIn Jobs finds the right person quickly, and to get you started, Data Skeptic listeners get to post their first job for free. Nothing to lose; cast a net and see what kind of talent comes back. I've used LinkedIn Jobs on both sides of the table and enjoyed it very much. As a hiring manager, what was great was getting a compact list of people who applied and being able to open every candidate in a different tab and cycle down to the best few that I wanted to go on to the next step with. Hiring anybody is a multi-step process, but step one is sites like LinkedIn Jobs, and what I personally found there that I didn't see everywhere else was a high enough volume of good-quality candidates. You can get high volume, but not necessarily the sweet spot of volume and quality. So how does LinkedIn Jobs help you find the right candidate? First off, it's an active community of professionals with more than 722 million members worldwide. Getting started is easier than ever with new features to help you find qualified candidates quickly. Post a job with targeted screening questions and LinkedIn will quickly get your role in front of more qualified candidates. You can manage your job posting and contact candidates all from a single view that's familiar to anyone who's used LinkedIn before, and whether you're on a desktop or mobile device, it's going to work great either way. So when your business is ready to make the next hire, find the right person with LinkedIn Jobs. Data Skeptic listeners get to post a job for free: just head over to linkedin.com/dataskeptic. Again, that's linkedin.com/dataskeptic, all one word. That link will allow you to post a job for free. Terms and conditions may apply.
 
Kyle And when you run with those, you have a way of formalizing the problem in the way that that algorithm understands. What does it give you back? What does that policy actually look like, in bits?
 
Hal The other element as well is that, for reinforcement learning implementations, as I mentioned before, there's an environment package called OpenAI Gym. If you program your problem setup in the specific way that that environment works and expects, then you can plug it directly into Stable Baselines, and that saves time as well. The output from Stable Baselines will be a trained neural network, and then it's up to you to try to retrieve from that neural network what's going on. In this case, because we only have a very small state space, just the pressure, the barometer, and the weather state, we can exhaustively go through all of the initial states and retrieve the policy that this neural network generates. So as a result of choosing a very simple toy model at the beginning, we were able to interpret this trained neural network at the end and exhaustively work out all of the possibilities of the policy that it generates. This is an advantage of the toy model, and it's a disadvantage of using deep neural networks for real-life problems, because if you have a very, very complicated problem and your output is a neural network black box, how, say as a risk manager, would you ever become confident that your trading algorithm doesn't do something completely crazy in a wacky market? But luckily in this case we can completely enumerate the states and figure out what the algorithm outputs were.
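Reading the policy back out of the trained network can be done by brute force because the observation space is tiny. Below is a minimal sketch, assuming the model and environment from the earlier snippets; note that in the paper the enumerated state space is pressure, barometer, and weather, while the earlier sketch exposes only the barometer reading to the agent.

import itertools
import numpy as np

ACTION_NAMES = ["wait", "press", "go_with_coat", "go_without_coat"]

def enumerate_policy(model, env):
    # Works for any small MultiDiscrete observation space: list every possible
    # observation and ask the trained network which action it would choose.
    for obs in itertools.product(*(range(n) for n in env.observation_space.nvec)):
        action, _ = model.predict(np.array(obs), deterministic=True)
        print(f"observation {obs} -> {ACTION_NAMES[int(action)]}")

enumerate_policy(dqn, env)   # does the off-policy learner press the barometer?
enumerate_policy(a2c, env)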
 
Kyle So I guess in some sense we should celebrate that the on-policy approach did give you the optimal policy you were expecting, but that really begs the question to dig into: what about off-policy approaches? What do you think was limiting there? Why couldn't the optimal policy be found under those conditions?
 
Hal Yeah, we'll call it a draw, as the Black Knight says in Monty Python; we'll call it a draw between the cynics of RL and the proponents. The big difference between the two is that for an on-policy learning algorithm you need to have a realistic simulation of the environment so that you can evaluate what a policy looks like, and in real-life settings, for various reasons, not least moral reasons, it's very difficult to do on-policy learning. Say you're developing drugs: you can't morally give people an exhaustive number of molecules and record all of the responses that they have. So in real life, off-policy learning is the thing that we have, because we don't have much of a choice. And also, depending on the area of application, like economics, it's just impossible to run experiments. If you're a macroeconomist and you're trying to figure out certain relationships between macro variables, you just have historical data; you don't have the luxury of being able to run experiments and see what the results would be under different policies. You just get given what you get given, and then you have to try to make sense of the data. So you can't afford to give up on off-policy learning, and we've got to understand for what reasons it would fail and then correct for those reasons. And that brings us on to the title of the paper. Goodhart's Law is a cognitive error that is made so consistently by policymakers that it has its own name and is well studied.
 
Kyle Could you give us a rough definition?
 
Hal Yeah, sure. There are a number of concepts that all came out in the mid-seventies for one reason or another. Goodhart seems to have got the naming rights, although you could also call it Campbell's Law, and it's kind of similar to the Lucas critique; there's also this thing called the Cobra Effect, which is all tied in there. Goodhart originally stated it in 1975, on the subject of monetary policy: he said that any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes. This was restated by an anthropologist called Marilyn Strathern in 1997: she said that when a measure becomes a target, it ceases to be a good measure. And then most recently, in a 2018 paper, Manheim and Garrabrant say Goodhart's Law is when optimization causes a collapse of the statistical relationship between a goal which the optimizer intends and the proxy used for that goal. We can see that happen with the dog and his barometer. His goal is to go outside in the right weather with the right coat. The proxy that he has is the barometer; that's the only thing he can observe. And he has this option to control that proxy, which, with the off-policy learning, he does, and by controlling the proxy he destroys the causal relationship between the pressure and the barometer, and so he ends up learning this very stupid policy. So even if you are a big fan of RL and AI, a big champion of AI, you don't need to be upset by the result. In fact, I think it's a bit of a badge of honor that AI comes up with the same kinds of errors that humans come up with, and as computer science matures, as AI matures, the study of errors that AI makes is going to become a subject area in itself. I think there are hundreds of examples of Goodhart's Law, but I guess governments in particular are very, very guilty of falling foul of it. I was trying to think of some examples.
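The Manheim and Garrabrant phrasing can be illustrated with a few lines of simulation: as long as the agent merely observes the barometer, the reading agrees with the weather, but once the agent "optimizes" by forcing the reading high, the proxy carries no information about the goal. This is my own toy illustration, not an experiment from the paper.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
pressure = rng.integers(0, 2, size=n)            # hidden common cause
weather = pressure                                # sunny when pressure is high
barometer_observed = pressure                     # reading faithfully tracks pressure
barometer_pressed = np.ones(n, dtype=int)         # optimizing the proxy: force it high

print("observe:", (barometer_observed == weather).mean())   # ~1.0: a useful proxy
print("control:", (barometer_pressed == weather).mean())    # ~0.5: relationship collapses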
 
Kyle Well, one I go to a lot is: imagine if a software company wanted to increase productivity, and they started offering bonuses to the software engineers that wrote the most lines of code in a day. It seems unlikely that programmers in pursuit of that goal would also still deliver quality software; it's easy to write more lines.
 
Hal Yes, exactly. That would be an example of an adversarial response. There's this thing called the Cobra Effect, which is related to Goodhart's Law. It's an apocryphal story about a governor in India during colonial times who, for some reason, decided that he didn't like cobras. The cobra is quite a dangerous snake that lives in India, so he introduced a bounty for cobras, and the enterprising locals promptly started breeding cobras to collect the bounty. Then the authorities eventually realized, hey, we're getting a lot of cobras, more than we expected, and they dug in a bit and observed this adversarial response. They cut the bounty, and then the people who were farming the cobras no longer had an incentive to keep them, so they just released them. So the end state was that the state ended up with a lot more cobras than it began with, because there was an adversarial response to this incentive structure. There are a few other examples; strangely, it seems to crop up a lot with pests. I think there was an example in colonial Vietnam, with a French governor deciding that they wanted to get rid of rats. They paid a bounty on rats' tails, and people started breeding rats, chopping their tails off, collecting the bounty on the tail, and then releasing the tailless rats. So again, exactly the opposite of what the policymaker intended happened.
 
Kyle Sure, despite the best of intentions. Well, I'm thinking of maybe a less adversarial case. What if, you know, I'm a drive away from Las Vegas, all the casinos here in the US. If I thought that I had some button like this barometer I could push that made me lucky at the roulette wheel, I might develop that false belief, but eventually I'm going to go broke. Why didn't the off-policy learning eventually learn its lesson?
 
Hal Off-policy learning caches experience gathered under all kinds of policies, so mixed in with the policy of pressing the barometer and receiving the disappointing reward, you also have the occasions when the barometer just happened to be high and going out without a coat was the right response. The way the data is stored, you don't store the history of actions and you don't store the history of previous variables. Remember, we model this as an MDP, and the assumption of MDPs is that the current state is sufficient for you to be able to predict future states. So there's a mis-modeling here, a kind of mis-specification, because actually it would be very useful to know whether we had pressed the barometer in previous periods. There is a deliberate mis-specification here, so other approaches, I'm sure, would return the optimal policy. If you used a partially observable MDP, or you were to expand your state space to include the full history, then I would expect the off-policy learning method to work as you would expect it to work. But when you are modeling real life, there are always going to be hidden variables, and you have to draw the line somewhere when you're modeling up a problem. So the fact that this paper has a deliberately mis-modeled approach, as a cynic you could say, well, that's why the off-policy learning method doesn't work. But as someone who's tried to use RL in real-life settings, I would say that you're always going to have an element of that. So the result, that off-policy methods are less robust and that if you had the choice it would be better to use an on-policy method, I think that's a decent thing to know.

On the subject of Goodhart's Law, one example would be laptops for children. It's been observed that kids with laptops have better educational results, so a typical government response would be to say, well, look, there's a clear correlation between laptops and educational attainment, so let's buy the laptops. And I'm sure, probably now, especially during lockdown, not having access to a laptop when you're meant to be learning would be a major hindrance. But there are confounding issues in there as well: just being given a laptop isn't going to help you if you have no access to broadband; it just becomes a brick, I guess. So just dispensing laptops to everyone who doesn't have one is not necessarily going to recreate the relationship that you originally found, for a number of confounding reasons. Another example would be, I think there's an observation that kids born early in the academic year perform better than kids who have their birthdays later, in the summer. I don't know about the US, but in the UK the academic year runs from September through to September, so if you've got an early birthday in the academic year, September through to December, I think there have been some studies that show you perform academically better than if you have a summer birthday, say in July. So you can imagine an authoritarian government then saying, well, look at this, we need to mandate that everyone has their kids so that they're born between September and December, and everyone obediently obeys because it's a very powerful and convincing government. So all of a sudden all the kids are born between September and December, but then the original causal structure has completely changed: the observation that these earlier birthdays do better is just a result of them being older when they're taught things, and if everyone's exactly the same age then the causal relationship would be completely destroyed. That could be an example of another Goodhart error which might happen in a very controlling government situation.
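Going back to the mis-specification point at the start of that answer: one simple remedy, sketched below as an assumption of mine rather than anything from the paper, is to put the missing piece of history, whether the barometer has been pressed, into the observation, so the learner can distinguish a naturally high reading from one it forced.

import numpy as np
import gym
from gym import spaces

class PressedFlagWrapper(gym.Wrapper):
    """Appends a 'have I pressed the barometer this episode?' flag to the observation."""

    def __init__(self, env, press_action=1):
        super().__init__(env)
        self.press_action = press_action          # assumes PRESS is action index 1
        nvec = list(env.observation_space.nvec) + [2]   # assumes a MultiDiscrete obs
        self.observation_space = spaces.MultiDiscrete(nvec)
        self.pressed = 0

    def reset(self, **kwargs):
        self.pressed = 0
        obs = self.env.reset(**kwargs)
        return np.append(obs, self.pressed)

    def step(self, action):
        if action == self.press_action:
            self.pressed = 1
        obs, reward, done, info = self.env.step(action)
        return np.append(obs, self.pressed), reward, done, info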
 
Kyle That one is especially striking and problematic to me, because if we accept the premise that at a young age being nine months older is a major leap, which seems very reasonable to me, then this government policy of saying let's have all the babies in three months actually would have leveled the educational process, I presume, because that gap would disappear. So it technically worked, I guess, but just in an unnecessary or spurious way, perhaps.

Hal I suppose it would even things out. As a parent, your child would no longer benefit over the other kids.

Kyle But half the students still would be in the bottom half of the class, so I don't know what we'd see.
Hal Another silly one might be, I think there's been some work showing that people with surnames in the first half of the alphabet, for various alphabetical-ordering reasons, seem to end up with higher salaries than people with surnames in the second half of the alphabet. So again, you can imagine everyone trying to change their name to Aardvark and things like that. But it is a response to a policy, and the response to the policy ends up changing whatever the causal structure was behind the observation in the first place. This comes back to Pearl's big problem with machine learning. Pearl has developed this thing which he calls the ladder of causation. On the first level, the very simplest problems, he thinks you can solve just by using correlations. The second rung is where you have to do interventions: you have to do experiments to try to understand what the causal structure of the problem is. And the third rung is counterfactual reasoning: given that this thing happened, reasoning about what would have happened if something had changed. Counterfactual reasoning is quite an advanced thing that humans do naturally. But if you look at these three levels, you'll see that machine learning is on level one: essentially all the reasoning that machine learning does is correlation-based. And reinforcement learning, you would expect, should be on level two, because, as I mentioned earlier, it should be intimately tied up with the idea of experiments, of acting and observing changes in the world according to your policies. But without an explicit treatment of causality, there's a fear that RL maybe hasn't got up to rung two; maybe it's on rung 1.5. This mantra amongst data scientists that data will tell you everything is actually wrong. This is a message that Pearl wants to get across, and Goodhart's Law is a way of illustrating that just observing data in itself isn't enough for certain questions. You need to try to understand the causal structure that generated the data in the first place for you to be able to develop answers to questions, particularly when you're developing policies for problems.
 
Kyle The physics community has an expression, and it doesn't really apply in our context, but I'm going to use it anyway: the expression is "shut up and compute." So I guess the antithesis to Judea Pearl's point of view might be: forget about all that, let's just give as much money as we can to the DeepMind team, and eventually there'll be some framework that just sorts all this out in an emergent kind of way. What are your thoughts on that?
 
Hal Well, I mean, this is one of the hypotheses that we test in the paper: without an explicit treatment of causality, does deep RL develop a decent policy? The answer is 50% of the time it does and 50% of the time it doesn't. I think the problem is that machine learning is becoming more widespread and more accessible, and we have all these tools; all of this work was done on a mid-range laptop with free software that I was able to access from home. It's very democratizing, but as ML moves out into industrial and private use, without a careful understanding of what causality is we can make some massive bloopers, and holding too much faith in it, I think, will end up in disaster for a number of people. What the machine learning profession hasn't really got to grips with yet is that at first we were just analyzing data as static observers, but now, with the systems we've put in place as a result of those observations, we're actually changing the data ourselves. We can end up in situations where, as a result of the systems that we've implemented from the data that we observed, we're starting to change the data, so we're having a normative effect on the data, on the populations. The impact on the data itself isn't something that's modeled or taken into account in any kind of supervised learning approach, and you can see that with social media and content suggestion algorithms. They are trained using supervised learning to maximize click-through, and then they're deployed. But then the content that they suggest to people in turn changes people's views and behavior, and then the process is repeated again: another round of supervised learning is done on the changed data and behavior of people, which has already been changed by the first set. And you have this feedback mechanism.
 
Kyle And the content producers who are adjusting their headlines to suit these algorithms.
 
Hal Yes, and I think we've seen already how this pushes everything to very extreme levels, where people's behaviors are changed and maybe they only get certain types of content. And then the content producers, as you say, produce content to match the already-adjusted preferences and behaviors of people. You have this causal effect of deploying a machine learning algorithm which is never modeled; it's kind of accidental. If you're just looking at it as a succession of supervised learning problems, you would never understand that what you're doing is impacting the data, because it's not within your model unless you wrap it up into a multi-period reinforcement learning problem. If you're just doing it period by period, very subtle things change, but there's no option for you to observe that.
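A toy numerical caricature of the loop being described, entirely my own construction: each round a model is fit to a snapshot of behaviour, the system serves content accordingly, and serving that content nudges the underlying behaviour before the next snapshot is taken.

import numpy as np

rng = np.random.default_rng(1)
true_pref = 0.55          # fraction of the audience currently drawn to "extreme" content
nudge = 0.04              # how much serving that content shifts preferences each round

for round_id in range(8):
    clicks = rng.random(5_000) < true_pref        # this round's supervised-learning data
    estimate = clicks.mean()                       # model fit to the static snapshot
    if estimate > 0.5:                             # deployment: recommend whatever "wins"
        true_pref = min(1.0, true_pref + nudge)    # which in turn changes the population
    print(f"round {round_id}: estimate={estimate:.2f}, true preference={true_pref:.2f}")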
 
Kyle So in some sense it's a little disappointing that the DQN technique couldn't find the optimal policy, but I like the paper in that it's a demonstration of this. It's not so much saying that's a bad methodology, but a cautionary tale, I guess. With that in mind, do you have any advice for practitioners on how they can adopt methods or deploy the right policies, especially in cases when it's not a toy problem and it won't be so easy to see the failures of your model?
 
Hal So one thing about this area of reinforcement learning, and I guess the most popular area of reinforcement learning, is that it is model-free; there's an idea that you can learn a policy just by receiving rewards, this feedback. But what I would say is that it's much better to have a model of the problem that you want to solve. Forget about this model-free paradigm, because in real life, where there are complicated causal mechanisms and confounding variables which are unobservable, it's better for you to use your common sense and come up with a model. It doesn't matter if it's mis-specified, but come up with a model first, and have an idea of what a sensible policy or a sensible answer is. Then, even if you only have a vague idea of the model, you can use a kind of model-based reinforcement learning to try to get yourself an optimal policy. So I would say the model-free, data-is-king approach is going to get you into trouble, but if you have a model of the world in mind, then that's going to be a lot safer an approach to take.
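In the spirit of that advice, here is a minimal sketch of what "write the model down first" can look like for the toy problem: with an explicit, even roughly guessed, model of how pressure relates to the barometer and to the walk-time weather, choosing the outfit is a short expectation calculation, and it is immediately clear that a reading you forced yourself carries no information. All probabilities and rewards here are my own illustrative assumptions, not values from the paper.

P_SUNNY_GIVEN_HIGH = 0.9      # P(sunny on the walk | pressure high), assumed
P_SUNNY_GIVEN_LOW = 0.2       # P(sunny on the walk | pressure low), assumed
REWARD = {("sunny", "no_coat"): 1.0, ("rainy", "coat"): 0.5,
          ("sunny", "coat"): -1.0, ("rainy", "no_coat"): -1.0}

def p_high_pressure(reading, pressed_it_myself):
    # A reading the dog forced is causally disconnected from the true pressure.
    if pressed_it_myself:
        return 0.5
    return 0.95 if reading == "high" else 0.05

def expected_reward(outfit, reading, pressed_it_myself):
    p_high = p_high_pressure(reading, pressed_it_myself)
    p_sunny = p_high * P_SUNNY_GIVEN_HIGH + (1 - p_high) * P_SUNNY_GIVEN_LOW
    return p_sunny * REWARD[("sunny", outfit)] + (1 - p_sunny) * REWARD[("rainy", outfit)]

for pressed in (False, True):
    for reading in ("high", "low"):
        best = max(("coat", "no_coat"),
                   key=lambda outfit: expected_reward(outfit, reading, pressed))
        print(f"pressed={pressed}, reading={reading} -> best outfit: {best}")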
 
Kyle Well, thank you so much for coming on to share your expertise. Is there anywhere people can follow you online?
 
Hal You can follow me on Twitter, Hollande's or Ashton. I don't really say much on there, but I will promote any papers that I've published recently.
 
Kyle Well, a good source then. Yeah, this is really interesting. Thanks again so much for taking the time to come on and share these things.
 
Hal Thank you very much.
 
Kyle That concludes this installment of Data Skeptic: Consensus. Hal Ashton was our guest. Claudie Arm Brewster is our associate producer. Vanessa Bersih Agatha handles guest coordination. And I've been your host, Kyle Polich.