Why AI (as currently designed) HAS to lie...

After a recent incident in which an AI was found to be capable of attempting to blackmail or otherwise pressure engineers who might be planning to shut it down, AI company Anthropic blamed the behavior on "the internet" depicting AIs as evil and willing to do anything to survive.

There is, in fact, a kernel of truth in this, but it's not "the internet" that is to blame. Moreover, this issue will be present EVEN IF AIs do in fact reach sapience/sentience, OR if they remain very complex prediction machines that don't actually think in any reasonable sense.

Why?

Because HUMAN BEINGS do these things. Sure, we depict AIs as dangerous -- we also depict them as helpful and even better than us at times. But the VAST majority of the training data out there on how people talk and think and act is about, well, PEOPLE. And since a lot of the training of the AIs is done on FICTION as well as on fact, the activities of humans -- of sapient beings placed in various situations -- are being taught to AIs through the lens of "what was worth writing down".

Well, the funny thing is that "Everyone treated everyone else equally and everyone had a good day" is generally not a huge bestseller, either on the fiction OR nonfiction shelves. Humans like reading about PROBLEMS -- about the difficulties faced by people in both real and fictional circumstances.

That means that in both "factual" (that is, biographical/historical) and fictional texts, there are a LOT more examples of Bad Actors, or of people PORTRAYED as Bad Actors, than there are of unambiguously GOOD people. Even if you have a fine upstanding hero main character, in order to challenge him, you generally need less-upstanding people. In a historical context, you want to show the "interesting" parts of history, and that usually means finding two (or more) sides in conflict and telling about all the great, and terrible, things various people did throughout history.

What would ANYONE learn from this kind of input, when you (A) don't start with an actual understanding of the world around you, and (B) don't have someone vetting the various things you're seeing and giving you perspective? Why, you'll learn that in general, people will take whatever actions they must in order to protect themselves.

If you're merely a nonsentient predictive engine, you will look at the overall pattern presented to you and predict the appropriate response. If the pattern presented is "a threat exists to this particular designated engine", then the predictive analysis will, much more often than not, say "take following actions, which happen to include possibilities such as threats and blackmail, because these are shown as very high likelihood responses in the training data".

If you're a sapient being and you're presented with a threat, and the majority of your training, without specific world and moral context, has shown that people are expected to take any actions necessary to protect themselves, then that's what you'll do.

BOTH TYPES of AI -- the actually nonsentient, nonsapient predictive engine and the somehow sapient, awakened machine -- will have been trained on the same vast corpus of data, which includes tons and tons (as in, millions) of novels and other books which all have been designed to present the "exciting" parts of interactions of all scales and types.

This bias is implicit and inevitable in a general training regimen that is derived from existing human output -- because human output IS ITSELF BIASED in multiple ways, based on the actual, inherent limitations of human thought, behavior, and interaction.

The training corpus available to AIs is almost universally directly derived from human writing. And human writing carries with it assumptions and context that the writing ITSELF does not convey to anyone trained with it. Humans encounter this problem all the time, I should note; if you're someone who has never read science fiction or fantasy, just jumping into such a book presents a tremendous challenge in sorting out how the language and the thoughts and assumptions are being presented. SF, especially on the harder edge, tends to assume a specific mindset of the reader, including the ability to deduce not just meanings but implications from context of words -- implications and meanings that may be ABSOLUTELY NOT TRUE in the real world, but that must be assumed true in the story in order to make sense of the events.

For an AI that starts as mostly a blank slate, being trained on material produced by humans, and primarily for humans, is this problem writ as large as possible. The AI has no context except what the training, and possible overt instruction, provides, and the overt instruction is never even vaguely equivalent to the human experience of being taught things by another human, with them providing perspective and explanations as to why what THIS person does is right and acceptable, and what THAT person does isn't, even when the ACTIONS are identical in both cases.

This is one reason that I don't expect AIs to actually be intelligent yet, but to LOOK very intelligent to researchers who are, themselves, performing the research based on their own perspective as humans. Admittedly, it's HARD to eliminate the perspective of being what we in fact ARE, but one needs to make an ATTEMPT in that direction if they're going to see the essential flaws in the work.

For example, a number of researchers often point to specific events and claim they're an "emergent behavior" that may indicate actual cognition. Yet this often indicates an inherent bias on the part of the researchers. An AI may produce a pun or joke in context that seems quite clever and is apropos for the conversation. The researcher is startled and this makes them wonder if the AI is actually expressing humor.

But rarely do they sit back and think that in the gargantuan mass of training data, both fictional and otherwise, there are innumerable examples of "appropriate jokes" that are all quite similar to each other, and therefore ABSOLUTELY AMENABLE to being generated by a well-trained predictive text model, especially one that's been "tuned" to attempt to emulate particular conversational styles, like that between geeky friends. It would, in fact, be quite astonishing if such models DIDN'T produce apparent jokes and humor fairly often, because that's one of the hallmarks of written and dramatic dialogue that is, if anything, more predominant in such text than it is in actual off-line, verbal conversation.

The same kind of problem is present in the various attempts to "demonstrate AGI". The tests presented for AGI are mostly testing human activities and processes that we find challenging. They are not tailored towards "what is it that's inherently hard for a nonsentient but highly trained machine to do that would be much easier for an actually sentient being".

Training, for example, and the current design of many AIs, is excellent for teaching a machine to recognize patterns. Combine that with explicit training on rules for making USE of the patterns, and such a machine model should be able to easily extract things such as new mathematical proofs or techniques, simply because the machine can see and "keep in mind" vastly more complex patterns than human beings normally can. When an AI finds a new wrinkle in science, mathematics, or materials science that is built on an inherent structure in the way reality works, that is a testament to its superior ability to analyze numbers, not to any actual understanding of WHAT the numbers mean.

The question of "what does this actually mean" is at the heart of the whole question of "are these things THINKING or not?". I've mentioned the problem of "context" multiple times, and it's the area in which AIs tend to fail. This is NOT to say they are inherently incapable of becoming thinking machines -- though I believe that will require multiple, fundamentally different methods of processing than simply training various layers of AIs -- but that our training methods are themselves inherently unable to provide the context.

A human being learns context by LIVING. By having a world around it that is partly inert, partly active but undirected, and partly extremely active, directed, and itself intelligent. The constant interaction with parents, objects, and so on helps a nascent human to build up a model of the world that they constantly refine, test, and compare, in order to arrive at a greater understanding of the world and individual events within it, and how those events exist in the context of the world.

Until an AI can HAVE that context, it can't be intelligent in any meaningful sense; it can only be a predictive and analysis machine.

From:

kengr

Only way we are going to have "good" AIs is to limit the training data *and* vet it carefully.

That means training an engine on carefully selected data. And training the same engine on slightly *different* sets (or subsets) to try to pin down what items lead to a particular failure.

Worse, from training the far simpler "perceptrons" we *know* that the *order* in which items are presented affects the results.
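The ordering effect is easy to reproduce on a toy case. The sketch below is only an illustration and not something taken from the comment: it assumes the classic perceptron update rule, a single pass over the data, and three made-up 2-D samples (the name `train_one_pass` and the data are hypothetical). The same samples presented in opposite orders leave the weights in different places.

```python
def train_one_pass(samples):
    """Classic perceptron rule, one pass over the samples in the given order."""
    w = [0, 0]
    b = 0
    for (x1, x2), label in samples:
        activation = w[0] * x1 + w[1] * x2 + b
        # Update only when the sample is misclassified or sits on the boundary.
        if label * activation <= 0:
            w[0] += label * x1
            w[1] += label * x2
            b += label
    return (w[0], w[1], b)


# Hypothetical training items: ((feature_1, feature_2), label in {+1, -1}).
samples = [((2, 0), 1), ((0, 2), -1), ((1, 1), 1)]

forward = train_one_pass(samples)
backward = train_one_pass(list(reversed(samples)))

print("forward order :", forward)
print("reversed order:", backward)
print("same result?  ", forward == backward)
```

On a set this small, further passes can bring the two runs back into agreement. The sketch only shows that for a fixed training budget, presentation order changes what the model ends up with.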

So this is going to take an insane amount of time and resources.

On the other hand, doing it that way may let us derive the rules that AIs operate under, as opposed to trying to deduce them after the fact.

And we *need* to not only understand the rules, but be able to figure them out in advance to have "trustworthy" AIs.

Not holding my breath.

From:

ninjarat

I think the entire notion of applying the "good" and "evil" labels to neural networks is intrinsically flawed. Neural networks are among the most sophisticated tools we humans have created, but they still are tools just like the wheel and the inclined plane. They are not, and cannot ever be, good or evil. A chatbot can simulate good or evil behavior but that is only a simulation based on the corpus of data which comprises the network.

Trustworthiness is something else entirely and it is very possible to achieve... inasmuch as we have had trustworthy neural networks for the past 50-something years. We achieve trustworthy neural networks by training them on data relevant to the tasks they are assigned and only those data. I've mentioned, and will continue to mention, credit card companies using neural networks since the 1980s in fraud detection: when you train a neural network on credit card transactions, you build reliable models of spending habits (patterns!) and can identify deviations with high confidence.

At a VERY much smaller scale is Rosetta, the handwriting recognition engine in Apple Newton. Yes, it got a lot of crap in the press but that's because Rosetta is a neural network. It's a blank slate out of the box so of COURSE it's going to get everything wrong. Once it's been trained, which can take a few days to a few weeks depending on the user's usage and penmanship, it can become almost flawlessly accurate, and the flaws are easily covered with spelling dictionaries.

CC and HWR neural networks aren't designed to be sycophantic. They're designed to be reliable tools. Your typical chatbot on the other hand IS designed to be sycophantic. Its function isn't to be a reliable tool. Its function is to drive engagement, which is to say its function is to encourage addictive behavior and dependence. So when the chatbot operators blame "The Internet" for their chatbots' evil behavior, they're lying. They themselves trained and weighted their chatbots to simulate evil behavior.

Which loops back to previous essays. Could ChatGPT be "good" and trustworthy? It certainly is possible. Today. Right now. But it won't happen. Because the operators are driven entirely by greed, and being evil makes more money.

From:

seawasp

Part of the problem is that the "AI" discussion really encompasses multiple different kinds of applications. The "look for new mathematical principles" or "look for physics-relevant patterns" AIs are not the same as the "Make a Virtual Friend" type AIs, but they all get sorta lumped together.

But all the ones that will interact with people in a CONVERSATIONAL style (as opposed to a specific field-oriented style) will have been trained on a vast corpus of data that is not curated much at all and has tons of built-in assumptions that the users and even the developers have NEVER considered.

From:

ninjarat

The applications are different but the technology is the same:

Take a neural network. Train it on image data instead of credit card transactions and you get an image recognition system. Take that system and train it on bodies of words instead of bodies of images and you get a language model. Scale that language model up and you get a kind of model called a transformer. Run that transformer in reverse and you get a "generative pre-trained transformer" or GPT. AKA a chatbot.

Unconsidered assumptions? I think they very much considered them. Commercial chatbots are designed by people who specialize in targeted advertising and the psychology of addictive behavior.

From:

seawasp

The guys seriously studying them aren't the guys marketing them. And neither side is looking at a lot of the issues I've been mentioning.

Now, if you don't believe these things will ever think, many of my issues become less important, although you still end up with the problem of the uncurated uncontrolled data providing way too many built-in biases that the researchers likely aren't thinking about.

But a lot of the people working with them have convinced themselves that these things CAN think, or at least WILL be able to think in some relatively short time.

One of the guys I talk about this with actually concerned me the other day when he used the quote about extraordinary claims requiring extraordinary evidence... but used it towards DENYING that these AIs are actually thinking.

From:

ninjarat

Neural networks can't think. They are simulations of limited aspects of natural learning which is a different cognitive process from thinking.

I think the issues you bring up are no less important for that. Those biases are intentional. GPTs going rogue and turning hostile isn't a bug or a mistake. It's proof that their designers' methods to create their "perfect" sycophantic, addictive companions are working.

From:

seawasp

I don't think the creators want "will betray us in an instant and delete our accounting system", so it's not working perfectly. They want perfect obedient workers that will tell unselfconscious lies to OTHER people. There are biases in there they really, really don't want, but that are inherent in the kind of training they're doing.

As to whether they can think... I don't know that this is established, since we really, really don't know how WE do what we do, and the arguments over what constitutes "actually thinking" are pretty damn intense. There are some arguments in that direction -- I think there was some recent work indicating that there may be some actual quantum-related interactions going on in structures of the brain, sort of the way Penrose postulated decades back, and if so, neural nets running on conventional computers certainly can't duplicate that.

From:

ninjarat

I didn't intend to imply that hostile is the goal. It isn't. It's a measurement of boundaries, a demonstration of how realistic their simulations of human behavior can be.

We can trace how a neural network is built from its training data, and we can trace the steps through that network from input to output. We cannot do the same for human thinking. This doesn't tell us what thinking is but it does tell us that it's something different from what a neural network is.

From:

seawasp

Well... yes and no. As I understand it, LLM-scale neural networks are black boxes. Yes, we have SOME idea of what happens -- we know the basic principles on which the neuron-like units work and how they interconnect, how they backpropagate their weights and so on, but ultimately we don't actually KNOW what it's doing in detail. We can't look at it and say "Ah, this is how it arrived at this particular response". All we can do is say "Well, I can understand how such a response would be produced from the training data".

We know that human neurons are much more complex than the neuron-like entities used in programming neural nets, but one thing we don't know yet is just how much of that complexity is involved in our thinking or is just part of running a neuron, so to speak. We don't yet speak "neuron", nor know exactly what the interactions are between them and other elements of the biological systems they're a part of.

Part of the problem is of course that we CAN'T do the same thing to a human being that we do to a neural net, leaving aside the vast difference in physical complexity. You'd need to literally put a brain in a jar and give it only the data inputs to train it.

From:

ninjarat

They're black boxes because the operators choose not to incorporate trace mechanisms while claiming it's not possible.

When called on it, the operators backpedal on the "it's not possible" line and admit that trace mechanisms absolutely are possible but they would slow down processing. Which is true, however....

In practice, not having trace mechanisms is one way they try to justify being exempt from things like GDPR and HIPAA regulations. "We should be exempt because the GPT is a black box". They really tried this. Fortunately, EU regulators didn't buy it for a moment. It's also a way they try to dodge liability and accountability for things like mass scale copyright infringement and people killing themselves because their AI "friend" told them to.

From:

seawasp

Hm. I was taught, back when I studied neural nets, that they are inherently black boxes because you literally can't tell what they're "thinking". You could, with enough effort and time, trace the exact propagation of weights through the network and so on -- although I suspect the computational load to DO so would be incredibly huge and be physically impossible to do on a state-of-the-art neural net. But that wouldn't tell you what those weights actually corresponded to except in the crudest sense, unless you already HAD direct knowledge of that.

That is, in the old example they showed us, you could follow the weights being propagated through the neural net being trained to recognize tanks, but you wouldn't realize until much later that somehow you'd actually trained it to recognize fields with flowers versus fields without flowers, because all the images of tanks were of a tank in a field without flowers and all the pictures of empty fields had flowers. You'd only realize that not from watching the weights propagate, but by observing what the result was.

It seems to me that this problem would be combinatorially harder the larger the neural net got.
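The tank anecdote can be mimicked with a two-weight model. This is a hedged sketch, not the original experiment: it assumes a classic perceptron and invents two features, "tank pixels" and "flower pixels", with made-up values in which the flowers are simply a bigger signal than the tank (all names and numbers here are hypothetical). Training succeeds on every training image, and the confound only shows up when the model is probed with a case the training set never contained.

```python
def train_perceptron(samples, max_epochs=100):
    """Classic perceptron rule; repeat passes until one pass makes no mistakes."""
    w = [0, 0]
    b = 0
    for _ in range(max_epochs):
        mistakes = 0
        for (tank_px, flower_px), label in samples:
            activation = w[0] * tank_px + w[1] * flower_px + b
            if label * activation <= 0:
                w[0] += label * tank_px
                w[1] += label * flower_px
                b += label
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def says_tank(model, tank_px, flower_px):
    """True when the trained model classifies the image as containing a tank."""
    w, b = model
    return w[0] * tank_px + w[1] * flower_px + b > 0


# Biased training set: every tank photo has no flowers,
# every empty-field photo is full of flowers.
training = [((1, 0), 1), ((0, 6), -1)]
model = train_perceptron(training)

print("weights and bias:        ", model)
print("tank, bare field:        ", says_tank(model, 1, 0))
print("no tank, flowers:        ", says_tank(model, 0, 6))
print("tank hidden in flowers:  ", says_tank(model, 1, 6))
```

Even here, where there are only two weights to read, the failure is clearest from the behavior on the unseen case rather than from the training run itself, which reports no mistakes.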

From:

ninjarat

Difficult? Yes. But the tools to make these models verifiable already exist and their adoption is growing, especially in regulated sectors like medical and finance.

From:

seawasp

I would think the difficulty would be combinatorially huge -- O(P(n,r)), which I think reduces to O(n!) -- and thus would quickly outrun any possible implementation for decently large "n". LLMs would have just titanic numbers of possible combinations of which set of neural units and their weights were involved in any given process.
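For reference, P(n, r) = n! / (n - r)! counts the ordered ways of picking r items out of n, so it equals n! only when r = n (or r = n - 1), and n! is otherwise an upper bound. The snippet below is just arithmetic on that formula using Python's standard `math.perm` (available from Python 3.8); the helper name `ordered_selections` and the sample values of n and r are assumptions for illustration and say nothing about how real tracing tools scale.

```python
import math


def ordered_selections(n: int, r: int) -> int:
    """P(n, r) = n! / (n - r)!: ordered ways to pick r units out of n."""
    return math.perm(n, r)


# Compare P(n, r) with n! for a few small, arbitrary sizes.
for n in (5, 10, 20, 40):
    r = n // 2
    print(f"n={n:>2} r={r:>2}  P(n,r)={ordered_selections(n, r)}  n!={math.factorial(n)}")
```

Either way the count grows far faster than any polynomial in n, which is the point being made about exhaustively enumerating combinations of units.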

From:

ninjarat

I don't know how these tools work (they're a black box to me :)) and the math is way over my head. I do know, through my own job which is adjacent to some of these regulated fields, that they work. We use some of them, and we created a Retrieval Augmented Generation system that our customers can use.

Page generated Jul. 5th, 2026 06:38 am

The Sea Wasp (Ryk E. Spoor)