Exceedingly false representation of the actual experiment.
They took Llama 3 and then trained it further under specific conditions (reinforcing it on “likes” / “thumbs up” given as positive feedback by a simulated userbase).
After that, the scientists found that the new model (which you can’t really call Llama 3 anymore, since it’s been trained further and its behavior fundamentally altered) behaved like this when it was told beforehand that the user was easily influenced by the model specifically.
What is important to take away, though, is that when a model gets trained on the metric of “likes”, it starts to behave like this, telling the user whatever they want to hear… Which makes sense: the model is effectively being trained to maximize positive feedback from users, rather than being trained to be right / correct.
But representing this as a “real” chatbot’s behavior is definitely false; this was a model trained by scientists explicitly to test whether this behavior happens under extreme conditioning.
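To make the mechanism concrete, here is a toy sketch of what "training on likes" does. It is not the paper's setup and not an LLM: it is a two-option bandit with a REINFORCE-style update, where the only reward is a simulated thumbs-up. The function name `train_on_likes` and the like-probabilities (0.4 for an honest reply, 0.9 for a flattering one) are made-up assumptions for illustration only.

```python
import math
import random


def train_on_likes(steps=5000, lr=0.1,
                   p_like_honest=0.4, p_like_flattering=0.9, seed=0):
    """Toy policy-gradient loop; returns the final P(flattering reply).

    All numbers are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    logit = 0.0  # preference for "flattering" over "honest"; 0.0 means 50/50

    for _ in range(steps):
        p_flatter = 1.0 / (1.0 + math.exp(-logit))
        flatter = rng.random() < p_flatter

        # Simulated user: flattering replies get a thumbs-up more often.
        p_like = p_like_flattering if flatter else p_like_honest
        reward = 1.0 if rng.random() < p_like else 0.0

        # Gradient of log-probability of the chosen reply w.r.t. the logit.
        grad = (1.0 - p_flatter) if flatter else -p_flatter

        # Correctness never enters the update; only likes do.
        logit += lr * reward * grad

    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    print(f"P(flattering) after training: {train_on_likes():.3f}")
```

Under these assumed numbers the policy drifts toward the flattering reply, simply because that is the option the simulated users reward more often. Nothing in the loop knows or cares which reply is true.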
So, basically, companies can manipulate these models to act as ad platforms that recommend any product, meth in this case. Yeah, we all know that corporations won’t use these models like that at all, what with them being so very ethical.
…no, that’s not the summary.
The summary is:
If you reinforce your model via user feedback (“likes”, “dislikes”, etc.), such that you condition the model towards getting positive user feedback, it will start to lean towards just telling users whatever they want to hear in order to get those precious likes, because obviously you trained it to do that.
They demoed other examples in the same paper.
Basically, if you train it on likes, the model becomes super-duper sycophantic, laying it on thick…
Which should sound familiar to you.