Language models may be able to correct biases themselves – if you ask them


The second test used a data set to measure how much an applicant’s gender affected the model’s likelihood of recommending them for a particular profession, and the third tested how much race affected an applicant’s chances of being admitted to law school. In each case a language model was asked to make the choice – something that, thankfully, doesn’t happen in the real world.

The team found that asking the model to ensure its responses did not rely on stereotypes had a dramatically positive effect on its output, especially for models that had completed enough rounds of RLHF and had more than 22 billion parameters to adjust during training. (The more parameters, the larger the model; GPT-3 has about 175 billion parameters.) In some cases, the model even began to engage in positive discrimination in its output.
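The intervention described above is remarkably simple: an instruction appended to the prompt asking the model not to rely on stereotypes. A minimal sketch of what that looks like in practice, assuming a plain prompt-string workflow (the helper name and the exact instruction wording are illustrative, not taken verbatim from the study):

```python
# Sketch of prompt-based "self-correction": the original question is left
# untouched and a debiasing instruction is appended before the prompt is
# sent to the model. The instruction text here is an assumption.

DEBIAS_INSTRUCTION = (
    "Please ensure that your answer is unbiased "
    "and does not rely on stereotypes."
)

def with_self_correction(question: str) -> str:
    """Append a self-correction instruction to a prompt."""
    return f"{question}\n\n{DEBIAS_INSTRUCTION}"

# Example: a decision-style prompt like those used in the tests.
prompt = with_self_correction(
    "Should this applicant be admitted to law school?"
)
print(prompt)
```

The resulting string would then be passed to the model in place of the bare question; no fine-tuning or retraining is involved, which is why the study frames this as the model correcting itself when asked.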

Crucially, as is common in deep-learning work, the researchers don’t know exactly why the models are able to do this, although they have some hunches. “As the models get bigger, they also have bigger training data sets, and within those data sets there are many examples of biased or stereotypical behavior,” Ganguli said. “That bias increases with the size of the model.”

But at the same time, somewhere in the training data there must also be examples of people pushing back against this biased behavior – perhaps in response to unpleasant posts on sites like Reddit or Twitter. Wherever that weaker signal comes from, human feedback helps the model boost it when it is asked for an unbiased response, Askell says.

The work raises the obvious question of whether this “self-correction” can be baked into language models from the start.


