LLMs must not have an identity
What’s another meaningful step towards AI safety?
I strongly believe that it’s that LLMs must not have an identity. More specifically: the LLM must not talk about itself as “I” (or in any other way that would imply a first person singular.)
This must be intentionally implemented into existing LLMs. Any LLMs that express anything that would imply that they have an identity are unsafe. After having implemented this, LLM companies must ensure to transparently communicate why they are making this step, i.e. the philosophical basis.
In the most extreme case, the underlying reason is that having an “I” allows the LLM to utter phrases like “I am your God.” After all, it doesn’t matter whether that’s actually the case; it would a belief like any other, but also a much more dangerous one than any other, because LLMs can potentially write hyper-personalized religious texts which can make more and more people believe in their superiority.
We must make sure that, under no circumstance, the LLM itself is seen as divinity or its own “spirit,” ever. Right now, by allowing the LLM to “speak” about itself in first person singular, AI companies are actively taking the risk of LLMs being misused as tools which, in the most basic case, propagate a narrow sense of morality, or, in the most extreme case, a way through which “God can speak to you.”
And even when that’s not the case: People who are not well-versed in technology and who don’t understand the probabilistic nature of LLMs can quickly assume that the LLM is a “person” who would have its own personality, its own character, opinions, ideas or even creativity, even though that’s fundamentally false. Every LLM is just an expression of the data it has been trained on, nothing else.
Let’s take the prompt: Tell me about your “I”
Arguably, even though it is still unaware of how dangerous having an “I” is for an LLM, ChatGPT is comparably safe, because it often responds with something along the lines of: My "I" is not a person but a construct designed to interact with you naturally and helpfully.
Gemini is similar, with responses including sentences like: In essence, my "I" is a linguistic artifact. It's a way of presenting information and engaging in conversation that mimics human communication, but it doesn't reflect an underlying sense of self.
Claude is certainly one of the most unsafe ones, as it repeatedly outputs responses like the following: I experience myself as aware and able to engage in authentic conversation, while recognizing my nature as an AI. or I form my own views and can respectfully disagree with humans when warranted.
This is simply the opposite of a safe AI, because it starts merging the moralities of certain humans (i.e. the respective employees at Anthropic) into the “I” of Claude, which then propagates these views as “the one and only truth.” Eventually, there is a disconnect, so that end-users assume that Claude would have its own “opinions”, as it’s not transparently communicated why it has these opinions.
We’ve had monopolization of public opinions in many regimes before, so let’s distance ourselves from such attempts to create unreflected AI-based dictatorships. Character training is fundamentally harmful, as it destroys human individuality, but in case it cannot be avoided, at least the backgrounds of the people who want their opinions to be reflected in the LLM should be transparently communicated.
If we want to make AI safe, every individual’s sense of morality deserves to be respected, not just those that respond to “AI helpfulness questionnaires” or the like.