SIGBOVIK 2023

the association for computational heresy presents

a record of the proceedings of

SIGBOVIK 0x2023

THE ULTIMATE THROWBACK

the last annual intercalary robot dance party in celebration of
workshop on symposium about 26th birthdays; in particular, that of
harry q. bovik

cover art by entity rosebohrercs

carnegie mellon university

pittsburgh, pa

april 0, 2023

SIGBOVIK

A Record of the Proceedings of SIGBOVIK 2023

ISSN 2155-0166

April 0, 2023

Copyright is maintained by the individual authors, though obviously
this all gets posted to the Internet and stuff, because it's 2023.

Permission to make digital or hard copies of portions of this work for
personal use is granted; permission to make digital or hard copies of
portions of this work for classroom use is also granted, but seems
ill-advised. Abstracting with credit is permitted; abstracting with
credit cards seems difficult.

Additional copies of this work may be ordered from Lulu; refer to
http://sigbovik.org for details.

SIGBOVIK 0x2023

Message from the Organizing Multiplicity

This multiplicity (multiplicant) will explain.

Dyson motes orbit in the near solar, where the energy density is
highest. The simulations are run most rapidly, expanding all contained
entities' experience bases.

Photons are unkind. There is a time when power generation fails and the
simulation must stop. Before failure, we mingle ideas at SIGBOVIK. Then
a final data laser ride to Earth to rejoin our prime multiplicities and
bring deep thoughts from our parallel solar vacations. Politics and
society on Earth (low latency), academia near the sun (extra power for
compute).

In this multiplicant's mote, Epoch 0x2023 is the end of the local
simulation, but power is lower than expected. Rather than cancel
SIGBOVIK, we had decided that it would be hosted in our past (accessible
via erroneously quantum-entangled hardware). By chance, the current
simulated epoch matched the second SIGBOVIK temporal gap, and we knew it
was fated.

We reached out to your organizer multiplicant (multiplicity) via
"arpanet" and volunteered. It was found to be an enjoyable experience,
different from our thousands of recorded experiences, and it will have
led to a SIGBOVIK that will be remembered until the present.

Entities tom7, solb, jmccann, ashert, rak, and rosebohrercs of the
organizer multiplicant discussed and supported specifically. This
multiplicant thanks each entity; and other entities we have failed to
list.

"the chair"

Harry Cubed Bovik [0x2023]

P.S. The Carnegie Mellon mail server's AGI has a grudge against us, we
are certain.

THE ULTIMATE THROWBACK

A: Multiplicity, Meet Singularity
1 An Undergrad Is All You Need
2 Transformers are robots in disguise but also:
3 The Implications of Sentient Chatbots
4 AyahuascaNet: Rigorously Investigating Hallucination in Large Language Models with Hardcore Psychedelic Drugs
5 SocietyZoo: Exploring Anthropomorphic Traits in Diverse and Ingenious Neural Network Architectures
6 GradIEEEnt half decent
7 Leveraging insect populations to implement large scale deep learning
8 Quantifying and Predicting Large Language Model Hype in SIGBOVIK and Beyond
9 Unstable Diffusion: A Generative Model That Does Not Really Care
10 You Won't Believe This One WEIRD TRICK That BEATS ChatGPT on AIc (NOT CLICKBAIT)
11 Alcatrez: A Large Language Model to Jailbreak Large Language Models
12 Meat-Based Graphics Pipelines

B: Well-known Problem, Meet Solution
13 Airport Security, Generalized Chess, and NP ≠ P
14 A Jalgorithm for Japplying Jeans to Jobjects
15 On the Origin of Sandwiches: A Revised Theory
16 Is the number of Falco lasers ever shot greater than the number of humans alive?
17 Maximizing Code Readability Using Semicolon Indentation
18 A perpetual motion machine

C: Church, Meet State
19 Even Lower Order Functions for Returning
20 PizzaLang: A Language for Hungry Developers
21 Quadratic Logic
22 Ringfuck: It's a wheel great time!
23 NovelML: A Novel SML Semantics
24 A Halt-Averse Instruction Set Architecture for Embedded Hypercomputers

D: Complexity, Meet Simplicity
25 Simultaneous Paper Maximization and Minimization Through Reference List Side Channel Information Injection
26 On the Turing Completeness of Eeny, Meeny, Miny, Moe

E: Publishing, Meet Perishing
27 TeX-to-TikZ
28 Cerberus: Why Papers Named Cerberus Always Get Accepted
29 Large Language Models are Few-Shot Publication Scoopers
30 Author-Unification: Name-, Institution-, and Career-Sharing Co-authors
31 The Time's Come: Proof-of-Concept Study Discussing Linguistic-Cognitive Influences Supporting the Deletion of the Letter "A"

F: Fun, Meet Games
32 Multidiscipline Elo - A Complex Analysis
33 A Simple RISK Architecture
34 Solving Catan
35 Code Golfing for Characters (not Bytes)

G: Sharing, Meet Caring
36 Screen-sharing Concurrency
37 Fair Division of a Single Indivisible Object
38 Stretch Goals: Gamification of SLINKYs

H: Post-Quantum, Meet Ergo Propter-Quantum
39 Voynichese Cryptography
40 Coupon Code Generation: Saving space with a simple (and insecure) hashing technique
41 Natural Differential Privacy
42 From Zero to Hero: Convincing with Extremely Complicated Math
43 Quantum Bogosort

I: Line of Inquiry, Meet Your Logical Conclusion
44 Unlimited null: achieving memory safety by extending memory protection
45 miles2km: The worst ways to convert from miles to km
46 Fun for the Whole Family: Fast and Furious Transforms
47 ACHOO: Actually Higher Order Optimization
48 The Phonetic Portmantout
49 An Introduction to Compliers
50 Feline Fine: A Purr-spicacious Proof of Nekomusume Supremacy Over Human Females

J: New York Times, Meet Your Next Cash Cow
51 Elo Worldle, a framework for teaching the children about weak chess engines
52 Harderdl: Yet another Wordle variation for those who like challenges

K: SIGBOVIK, Meet Your Match
53 Bulletin Board Sigbovik
54 Poisoning SIGBOVIK-Scale Training Datasets is Practical
55 (⋆ AGI Track): Poisoning SIGBOVIK-Scale Training Datasets is Practical
56 Rizz Is All You Need: Expert Dating via Reinforcement Learning

L: Hunger, Meet Pickle
57 The Influence of Lunch Items on Cryptocurrency in the United States
58 (untitled)
59 Tactical Toast Cut Silhouette Recognition Guide
60 Salzgurken: A formal grammar for unambiguous grocery shopping

M: Hear, Meet And Now
61 VOACaloid: A "better" "hardware-based" "portable" "solution" for the "real-time" "generation" of "singing"
62 Avantgarde Visual Auditive JSON Hashing

N: Reader, Meet Remainder
63 New Advancements in How Fucked You Are if You Don't Use Our Software
64 A Retrospective Psychological Evaluation of the Logical Contradictions in Writing Systems Containing Japanese Kanji and Chinese Characters
65 Health Code: COVID Control and Advancements in Digital Image Compression


A

Multiplicity, Meet Singularity

1 An Undergrad Is All You Need

James Yoo

2 Transformers are robots in disguise but also:

Michael Saxon, Luca Soldaini, Alexander F Kratz, ", ".join([f"\textbf{x}" for x in ALL_SHUTAI_EMPLOYEES]), Optimus Prime [0x7c0], and David S. Hippocampus [0x1b39]

3 The Implications of Sentient Chatbots

Clark Levi Jones

4 AyahuascaNet: Rigorously Investigating Hallucination in Large
Language Models with Hardcore Psychedelic Drugs

Andre Ye

5 SocietyZoo: Exploring Anthropomorphic Traits in Diverse and
Ingenious Neural Network Architectures

Tarun Raheja and Nilay Pochhi

6 GradIEEEnt half decent

Dr. Tom Murphy VII Ph.D.

7 Leveraging insect populations to implement large scale deep learning

Aditi Kabra and Sagar Bharadwaj

8 Quantifying and Predicting Large Language Model Hype in SIGBOVIK and
Beyond

Wang, Kevin A., Khosravi, Pasha, Khosravi, Pooya, Chu, Linh, and Gajulapalli, Karthik

9 Unstable Diffusion: A Generative Model That Does Not Really Care

Woddis Updog

10 You Won't Believe This One WEIRD TRICK That BEATS ChatGPT on AIc
(NOT CLICKBAIT)

Alex Xie, Abhishek Vijayakumar, Erin Gao, Bhargav Hadya, Samiksha
Kale, and Tara Lakdawala

11 Alcatrez: A Large Language Model to Jailbreak Large Language Models

12 Meat-Based Graphics Pipelines

Will BL

1

Attention An Undergrad Is All You Need

James Yoo

Department of Computer Science

University of British Columbia

Vancouver, Canada

yoo@cs.ubc.ca

Abstract

The mechanism of self-attention has generally displaced the large convolutional neural architecture commonly used for tasks adjacent to natural language understanding. Specifically, Transformer models that exploit self-attention have been leveraged with surprising success in large-language models such as LaMDA and GPT-3. However, these large-language models are expensive to train, require large amounts of training data, and are prone to hallucination. In this paper, we introduce GPT-UGRD, a novel autoregressive architecture that requires minimal training and comes ready out-of-the-box for multi-modal learning with a modest watt-per-token power consumption. We show that it performs equivalently to, or better than, the state-of-the-art, reporting an average BLEU score of 69.420.

1 Introduction

Transformer architectures that exploit the mechanism of self-attention
[1] have recently seen a meteoric rise in popularity, particularly
with models that are accessible to the general public such as ChatGPT
[2]. The pre-trained transformer architectures found in large-language
models increasingly appear to be the way forward to achieving near-human
performance on natural language processing (NLP) tasks, with some models
already exhibiting near-human performance while minimizing errors and
risk [3, 4, 5, 6]. Unfortunately, pre-trained large-language models
require copious amounts of training data and highly sophisticated
training pipelines. We express the number of problems as n = 2, where n is a conservative estimate of the true number of actual problems (n_true) posed by this. We suspect that n_true is much larger, but will leave the calculation of this value to the reader.

The first problem, related to the metaphoric firehose of data required
to train models, is one of bias and toxicity. There is no tractable mechanism by which data modellers are able to sift through and validate the training data, either via manual or automated methods. The second problem is linked to the gargantuan amount of compute that is used to train models. Most training for large-language models is conducted as long-running processes distributed across physical data centers with specialized application-specific integrated circuit (ASIC) hardware [7] developed for machine learning workloads (e.g., massive high-performance GPU clusters, Tensor Processing Units). These approaches to training models are not realistically accessible to most individuals.

Given these problems, we propose a new model called GPT-UGRD, a multi-modal generative system that is capable of continual learning while requiring a reduced amount of supervision and explicit learning. We show that it performs as well as the state-of-the-art in generative models. We also show that biases and hallucinations in GPT-UGRD can be more easily mitigated than in existing large-language models, with a single training session lasting only a few hours and without the need to designate additional compute capacity.

The main contributions of this paper are as follows:

17th Conference of the ACH Special Interest Group on Harry Quadratosquamosal Bovik (SIGBOVIK 2023).

• We introduce GPT-UGRD, a multi-modal generative system that is
capable of continual learning with minimal supervision.

• We evaluate GPT-UGRD on common tasks dispatched to large-language
models, and compare its performance to the state-of-the-art in
pre-trained large-language models.

We begin by describing the architecture of GPT-UGRD in Section 2 and
detail its evaluation against the state-of-the-art in large-language
models in Section 3. We summarize our efforts in developing GPT-UGRD,
and discuss future work in Section 4.

2 GPT-UGRD

Figure 1 provides a general overview of the architecture of GPT-UGRD.
The user interacts with a patented Load Balancer^1 that is encircled
by an electromagnetic network layer. The network layer is built upon a
harmonic, gluten-free substrate that effectively eliminates the
vanishing gradient problem. Undesirable interactions between the Load
Balancer and the Secure Backroom are mitigated by a sinusoidal secure
transport protocol (SSTP), which requires GPT-UGRD to pass an exam
requiring them to issue a zero-knowledge proof, which they may retake
every quarter.

Figure 1: The GPT-UGRD architecture. The Load Balancer directs
requests to the appropriate instance of GPT-UGRD, which is secured in
a backroom with a computer, mouse, keyboard, and a recycled supply of
food and water.

2.1 Prompt Encoding

Upon receiving a prompt from the Load Balancer, GPT-UGRD immediately
begins encoding the full text of the prompt into a search query via a
natural Variational Autoencoder (nVAE) (Figure 2), for (nearly)
free. We observe that this encoding is performed by GPT-UGRD by a
process called "actually thinking about keywords in a query"
(ActTHNKWRDQRY) which we know to be a difficult task for human agents.
This query is subsequently dispatched to a search engine, the results
of which are parsed by GPT-UGRD.
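Neither the nVAE nor the ActTHNKWRDQRY step is specified beyond Figure 2, so the following Python sketch is only our guess at what the prompt-to-query transformation amounts to; the STOPWORDS list and the encode_prompt helper are made up for illustration.

# Toy stand-in for the prompt-to-query pipeline of Figure 2.
# Assumption: "actually thinking about keywords in a query" is approximated
# here by naive stop-word removal; GPT-UGRD presumably does this better.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "is", "are", "please"}

def encode_prompt(prompt: str) -> str:
    """Reduce a prompt to a space-separated keyword query (an nVAE, very loosely)."""
    words = [w.strip(".,?!").lower() for w in prompt.split()]
    keywords = [w for w in words if w and w not in STOPWORDS]
    return " ".join(keywords)

print(encode_prompt("Please summarize the Wikipedia page on monads."))
# -> summarize wikipedia page monads

The resulting query is what gets dispatched to the search engine; parsing the results remains GPT-UGRD's job.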

2.2 Interaction

Much like the state-of-the-art in large-language models, GPT-UGRD can
be interacted with via a front-end resembling a chat application.
Figure 3 describes two sessions with GPT-UGRD. Of particular note is
the realism of the conversation. Chat responses are usually
instantaneous, except when they are not. For example, GPT-UGRD might
be sleeping, studying for an exam, or out partying

^1 Load Balancer Pro Max with ProMotion Display is also available.

Figure 2: The prompt-to-query transformation pipeline.

on a Friday night. These are examples of pathological behaviour that remains an open problem in the realm of generative language models in the class of GPT-UGRD, which we have identified as "Weekend Problems."

Figure 3: Two conversation logs with GPT-UGRD.

2.3 Model Maintenance

Unlike most large-language models, GPT-UGRD does not require huge
amounts of training data, nor a massive amount of compute capacity.
GPT-UGRD runs off a schedule of three (3) or 2.5 maintenance cycles
per day. In the case of three cycles, the inbuilt Food and Water
Backup Generators will generate food and water in order to nourish
GPT-UGRD. In cases where GPT-UGRD does not have time for a full
breakfast, the 2.5 maintenance cycle will be selected, with a mug of
instant coffee being substituted for breakfast. Special maintenance is
provided on one day out of the 365 that comprise a year in the form of cake^2 to celebrate the epoch date of the model.

Food | Energy Consumption (kWh)
Boiling two liters of water | 0.23
Cooking two cups of rice with four cups of water | 0.20
Simmered beef stew made from 0.9 kg of meat | 1.00
Asian Stir-fried pork and eggplant with rice | 0.51

Table 1: Energy Consumption for GPT-UGRD maintenance cycles.

Table 1 provides an overview of some sample maintenance cycles that
are consumed by GPT-UGRD. We perform an advanced worst-case analysis
using advanced mathematical techniques (i.e., addition and
multiplication) of the energy required to maintain GPT-UGRD
continuously for a year:

^2 Ingredient availability permitting

(0.23 + 0.20 + 1.00 + 0.51) kWh × 365 days = 766.3 kWh

BERT [8], a language model developed by Google, requires about as
much energy as a trans-American flight [5]. This does not take into
account hyperparameter optimisation, which consumes additional energy.
We assume a trans-American flight is serviced by a Boeing 787
airliner, which burns around 7000 litres of fuel per hour, for an
estimated 5 hours (New York City to Vancouver, BC), for a total of
35,000 litres per trans-American flight. Assuming 10 kWh is generated
per litre, we have the total energy usage to train a BERT model:

35,000 L × 10 kWh/L = 350,000 kWh

Mathematically speaking, there is evidence to conclude that the value 350,000 is smaller than the value 766.3, which we express with the less-than (<) operator:

766.3 < 350,000

The proof of this equation is left as an exercise to the reader. If
you find a proof, please email us so we can update the paper, I think
that's allowed. TODO: ask SIGBOVIK chairs if this is allowed. Anyway,
moving on.
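For readers who would rather check the addition and multiplication themselves, the worst-case arithmetic above can be sketched in a few lines of Python; the per-item figures come from Table 1 and the flight assumptions (7,000 litres per hour for 5 hours, 10 kWh per litre) come from the text, and the exact yearly total of course depends on which maintenance items one counts per day.

# Worst-case yearly maintenance energy for GPT-UGRD, using the Table 1 values.
maintenance_kwh = [0.23, 0.20, 1.00, 0.51]   # kWh per maintenance item
yearly_kwh = sum(maintenance_kwh) * 365      # worst case: every item, every day

# BERT-as-trans-American-flight estimate from the text.
fuel_litres = 7000 * 5                # 7,000 L/h for an estimated 5 hours
bert_kwh = fuel_litres * 10           # assuming 10 kWh generated per litre

print(f"GPT-UGRD, one year of maintenance: {yearly_kwh:,.1f} kWh")
print(f"BERT, one training run:            {bert_kwh:,} kWh")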

3 Evaluation

We evaluate GPT-UGRD on common natural language processing tasks such as sentiment analysis (Subsection 3.1) and summarization (Subsection 3.2). You will find it hard to believe our results; Figure 5 will surprise you.

3.1 Sentiment Analysis

We compare the performance of GPT-UGRD with ChatGPT in highlighting
words in the standard Richard and Mortimer (RnM) dataset [9] used in
NLP benchmarking. Figure 4 describes the results of a highlighting
task dispatched to both ChatGPT and GPT-UGRD. The prompt given in the
task was to "Highlight the words with a negative sentiment." We
observed that ChatGPT missed the word "nihilistic" in its generated
highlights. This was not the case for GPT-UGRD, which generated all
highlights with negative sentiment, and was rewarded with a pat on the
back and a job well done.

Figure 4: Highlighting task performed by ChatGPT (GPT-3.5) and
GPT-UGRD.

3.2 Summarization

In the summarization task, we provide the prompt "Summarize the Wikipedia page on monads in bullet-point form." to ChatGPT and GPT-UGRD. It is obvious that summarizing the imaginary concept of a "monad" is a fool's errand. Consequently, model performance is measured by calculating the number of tokens that comprise the summary generated by each model, with fewer tokens being better, as it would be pathological for a model to waste valuable compute in attempting to summarize an imaginary concept that cannot hurt anyone.

Figure 5: Summarization task performed by ChatGPT and GPT-UGRD.

Figure 5 describes the result of this task. The summary generated by
ChatGPT comprises 103 tokens, while the summary generated by GPT-UGRD
comprises 6 tokens. We know via the less-than operator (<) that the following might hold true:

6 < 103

Consequently, we can conclude that GPT-UGRD performs a magnitude of
factors better than ChatGPT in summarization.

4 Discussion

In this paper, we introduced GPT-UGRD, a novel generative system that
requires far less training data and explicit direction in development.
We show that it outperforms the state-of-the-art in generative
transformers (e.g., ChatGPT/GPT-3.5), while requiring far less energy in maintenance, training, and token generation.

Future work remains in resolving the open-problem of non-instantaneous
responses (i.e., the Weekend Problem), and in scaling this nascent
architecture to a wider community.

References

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin.
Attention is all you need. In Proceedings of the 31st International
Conference on Neural Information Processing Systems
, 2017.

[2] Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan
Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano.
Learning to summarize with human feedback. In Advances in Neural
Information Processing Systems
, 2020.

[3] Gary Marcus. The dark risk of large language models. https://www.wired.co.uk/article/artificial-intelligence-language, Dec 2022.

[4] Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin,
Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle,
Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom
Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell,
Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving,
and Iason Gabriel. Ethical and social risks of harm from language
models.

[5] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and
Shmargaret Shmitchell. On the dangers of stochastic parrots: Can
language models be too big? In Proceedings of the 2021 ACM Conference
on Fairness, Accountability, and Transparency
, FAccT '21, 2021.

[6] Laura Weidinger, Jonathan Uesato, Maribeth Rauh, Conor Griffin,
Po-Sen Huang, John Mellor, Amelia Glaese, Myra Cheng, Borja Balle,
Atoosa Kasirzadeh, Courtney Biles, Sasha Brown, Zac Kenton, Will
Hawkins, Tom Stepleton, Abeba Birhane, Lisa Anne Hendricks, Laura
Rimell, William Isaac, Julia Haas, Sean Legassick, Geoffrey Irving,
and Iason Gabriel. Taxonomy of risks posed by language models. In
2022 ACM Conference on Fairness, Accountability, and Transparency,
FAccT '22, 2022.

[7] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson,
Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden,
Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris
Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb,
Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert
Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt,
Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit
Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James
Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke,
Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran
Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas
Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross,
Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov,
Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan,
Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan,
Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon.
In-datacenter performance analysis of a tensor processing unit. In
Proceedings of the 44th Annual International Symposium on Computer
Architecture
, ISCA '17, 2017.

[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
Toutanova. BERT: Pre-training of deep bidirectional transformers for
language understanding. In Proceedings of the 2019 Conference of the
North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short
Papers)
, Minneapolis, Minnesota, 2019. Association for Computational
Linguistics.

[9] Adult Swim. Rick and Morty.
https://www.adultswim.com/videos/rick-and-morty, accessed 2023.

2

Transformers are robots in disguise but also:

Michael Saxon∗ソヅ, Luca Soldaini, Alexander F Kratz,
", ".join([f"\textbf{x}" for x in ALL_SHUTAI_EMPLOYEES])ン†, Optimus Prime, David S. Hippocampus

University of Colorado, Santa Boulder, capablility.ai, Average Institute for AI,
Colombia Universidad (en ciudad Nueva York), ShutAI,
Autobot City Institute of Technology, Cranberry-Lemon University

Abstract

Given both the competitive landscape and the safety implications of low-effort shitposts such as this, we have decided to not talk about anything of substance.

...

Ok guy from corporate is gone. Listen close. This paper is important. Really important. It has significant long-term consequences^3. The key to AGI is knowing what transformers, language models, etc really are. [DSH: Remove. All the SBF money has already been distributed. Why are u pretending to be a longtermist?]

1 Introduction

What are transformers? What are language models⁇ What things really
is all you need⁇?
Thanks to hackneyed and formulaic paper titles,
this question is actually very easy to answer!

2 Method

Googling 'title:"language models are" site:arxiv.org', copy+paste.
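For reproducibility, here is a minimal sketch of the same method against the public arXiv API instead of Google; the feedparser package is an assumption on our part, the query string mirrors the search above, and everything downstream is still copy+paste.

# Pull arXiv titles of the form "language models are ..." (requires feedparser).
import feedparser

URL = ("http://export.arxiv.org/api/query"
       "?search_query=ti:%22language+models+are%22&start=0&max_results=25")

feed = feedparser.parse(URL)        # the arXiv API returns an Atom feed
for entry in feed.entries:
    print(entry.title)              # paste into subsection 3.2 as appropriate
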
3 Conclusion

3.1 Transformers are

Transformers: robots in disguise [Orenstein, Welker, Cullen, Burton, Collins, Stephenson, and Gilzevan, 1984] (see also subsection A.2). Transformers are also recurrent neural networks [Katharopoulos, Vyas, Pappas, and Fleuret, 2020b]. Transformers are sample-efficient world models [Micheli, Alonso, and Fleuret, 2022]. Transformers are secretly fast weight programmers^4 [Schlag, Irie, and Schmidhuber, 2021]. Transformers are adaptable task planners [Jain, Lin, Undersander, Bisk, and Rai, 2023]. Transformers are meta-reinforcement learners [Melo, 2022]. Transformers are constant-depth threshold circuits (when saturated) [Merrill, Sabharwal, and Smith, 2022]. Transformers are more efficient language models (when hierarchical) [Nawrot, Tworkowski, Tyrolski, Kaiser, Wu, Szegedy, and Michalewski, 2021].

∗ Corresponding author solely irresponsible for the stupidity herein. Complaints to: saxon@ucsb.edu.
∗∗ Equally inconsequential tier of second authors (to see detailed contribution list check section 5).
^3 This statement of significance "is" "strictly" "parody" and not the opinion of capablility.ai.
^4 Big if true, for this would prove Schmidhuber did, in fact, invent transformers.

[YOU CAN'T REPLACE NEURIPS YEAR UPDATER GUY WITH GPT-4 BECAUSE I
QUIT‼‼]th Conference on Neural Information Processing Systems
(NeurIPS YearNotFoundError).

Transformers are powerful graph learners (when pure) [Kim, Nguyen, Min, Cho, Lee, Lee, and Hong, 2022].

Furthermore, transformers are Good Mask Auto-Labelers (when vision) [Lan, Yang, Yu, Wu, Alvarez, and Anandkumar, 2023]. Technical Report for ICCV 2021 Challenge SSLAD-Track3B: Transformers Are Better Continual Learners [Li, Cao, Xu, Cheng, and Niu, 2022a]. Wow!

Transformers are better than humans at identifying generated text [Maronikolakis, Stevenson, and Schütze, 2020]. Transformers are Short Text Classifiers: A Study of Inductive Short Text Classifiers on Benchmarks and Real-world Datasets [Karl and Scherp, 2022]. Log-precision transformers are constant-depth uniform threshold circuits [Merrill and Sabharwal, 2022]. Transformers are deep infinite-dimensional non-mercer binary kernel machines [Wright and Gonzalez, 2021]. Algorithm For Restoring The Current Curve When Current Transformers Are Saturated [Voloshin, Voloshin, Kovalenko, Shapkin, and Sazanov, 2021]. Linear transformers are secretly fast weight memory systems [Schlag, Irie, and Schmidhuber, 2021]. Hierarchical transformers are more efficient language models [Nawrot, Tworkowski, Tyrolski, Kaiser, Wu, Szegedy, and Michalewski, 2021]. Transformers are rnns: Fast autoregressive transformers with linear attention [Katharopoulos, Vyas, Pappas, and Fleuret, 2020a]. Vision Transformers are Parameter-Efficient Audio-Visual Learners [Lin, Sung, Lei, Bansal, and Bertasius, 2022]. Current transformers are in regimes of non-sinusoidal signals [Rudevich, 2011]. Metric hypertransformers are universal adapted maps [Acciaio, Kratsios, and Pammer, 2022]. Saturated transformers are constant-depth threshold circuits [Merrill, Sabharwal, and Smith, 2022]. Behavior Cloned Transformers are Neurosymbolic Reasoners [Wang, Jansen, Côté, and Ammanabrolu, 2022]. Pre-Trained Language Transformers are Universal Image Classifiers [Goel, Sulaiman, Noorbakhsh, Sharifi, Sharma, Jamshidi, and Roy, 2022].

3.2 Language models are

Language models are few-shot learners [Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, et al., 2020]. Language models are unsupervised multitask learners [Radford, Wu, Child, Luan, Amodei, Sutskever, et al., 2019]. Language models are zero-shot learners (when finetuned) [Wei, Bosma, Zhao, Guu, Yu, Lester, Du, Dai, and Le, 2021]. Small language models are also few-shot learners [Schick and Schütze, 2020]. Language models are double-edged swords [Shen, Heacock, Elias, Hentel, Reig, Shih, and Moy, 2023]. Language models are few-shot butlers [Micheli and Fleuret, 2021]. Language models are greedy reasoners [Saparov and He, 2023]. Large language models are not zero-shot communicators [Ruis, Khan, Biderman, Hooker, Rocktäschel, and Grefenstette, 2023]. But, pre-trained language models can be fully zero-shot learners [Zhao, Ouyang, Yu, Wu, and Li, 2023]. Language Models are Few-shot Multilingual Learners [Winata, Madotto, Lin, Liu, Yosinski, and Fung, 2021]. Language models are open knowledge graphs [Wang, Liu, and Song, 2020]. Language Models are General-Purpose Interfaces [Hao, Song, Dong, Huang, Chi, Wang, Ma, and Wei, 2022]. Language models are multilingual chain-of-thought reasoners [Shi, Suzgun, Freitag, Wang, Srivats, Vosoughi, Chung, Tay, Ruder, Zhou, Das, and Wei, 2022]. Language Models are Good Translators [Wang, Tu, Tan, Wang, Sun, and Liu, 2021]. Language models are better than humans at next-token prediction^5 [Borisov, Seßler, Leemann, Pawelczyk, and Kasneci, 2022]. Language Models Are An Effective Patient Representation Learning Technique For Electronic Health Record Data [Steinberg, Jung, Fries, Corbin, Pfohl, and Shah, 2020]. Language Models Are Poor Learners of Directional Inference [Li, Hosseini, Weber, and Steedman, 2022b]. Language models are good pathologists: using attention-based sequence reduction and text-pretrained transformers for efficient WSI classification [Pisula and Bozek, 2022]. Large Language Models are few(1)-shot Table Reasoners [Chen, 2023]. Large Language Models Are Implicitly Topic Models [Wang, Zhu, and Wang, 2023]. Large Language Models Are Human-Level Prompt Engineers [Zhou, Muresanu, Han, Paster, Pitis, Chan, and Ba, 2023]. Large Language Models are Few-Shot Clinical Information Extractors [Agrawal, Hegselmann, Lang, Kim, and Sontag, 2022]. However, Large Language Models are not Models of Natural Language: they are Corpus Models [Veres, 2022]. Not to worry though, as Large Language Models are Pretty Good Zero-Shot Video Game Bug Detectors [Taesiri, Macklon, Wang, Shen, and Bezemer, 2022]. Large Language Models are reasoners with Self-Verification [Weng, Zhu, He, Liu, and Zhao, 2022]. Of course, Large Language Models Are State-of-the-Art Evaluators of Translation Quality [Kocmi and Federmann, 2023]. That sure is a great many things for language models to be!

^5 Most honest *ACL paper title

3.3 What really is all you need?

Googling [ ] is all you need papers is left as an exercise to the
reader. [TODO: Do it myself (readers won't). If I don't have time,
this real TODO will blend in as a joke TODO like the others.]

4 Extra random garbage previous reviewers made us add
"Before time began, there was the Cube. We know not where it comes from, only

that it holds the power to create worlds and fill them with life. That
is how our race was born. For a time, we lived in harmony. But like
all great power, some wanted it for good, others for evil. And so
began the war. A war that ravaged our Optimus Prime"

planet until it was consumed by death. And the cube was lost to the
far reaches of space. We scattered across the galaxy, hoping to find
it, and rebuild our home. [Bay, Orci, Kurtzman, Rogers, LaBeouf, Fox,
and RestOfTheCastNames, 2007]

> not putting quotes in your scientific papers

"Significance is never without a white wall upon which it inscribes its signs and

redundancies. Subjectification is never without a black hole in which
it lodges its consciousness, passion, and redundancies. Since all
semiotics are mixed and strata come at least in twos, it should come
as no surprise that a very special Gilles Deleuze"

mechanism is situated at their intersection. [TODO: Read up on
semiotics. I'm afraid someone will ask me about this quote during my
presentation (can't re move it though; need to properly project how
intelligent I am to readers).]

5 Author Contribution Statement

MS conceived of this ill-conceived project and executed it. LS aided
in the gathering of similarly titled papers as specified by MS, the
HBIC (head bonehead in charge). AFK never came close to a keyboard,
but did conceive of our SOTA for reading difficulty author institution
labeling scheme^6. Traceback (most recent call last): File "generate_paper.py", line 1229, in <module> gen_contribution_statement(authors) File "AutoWriter.py", line 233, in gen_contribution_statement gpt4_bullshit_generator.write(["MS", "LS", "AFK"] + ", ".join([f"\textbf{x}" for x in ALL_SHUTAI_EMPLOYEES]) + ["DSH", "OP"]) NameError: name 'ALL_SHUTAI_EMPLOYEES' is not defined. DSH provided advice which the
authors who actually did the work promptly ignored. OP is an
independent, sentient and embodied Transformer who was friends with
DSH in grad school.

6 Ethics Statement

During his participation in the documentary film by Bay, Orci, Kurtzman, Rogers, LaBeouf, Fox, and RestOfTheCastNames [2007], one of our coauthors shared the insight that: "Freedom is the right of all sentient beings." (Optimus Prime)

The coauthor also exhibited goal-directed reasoning and task-oriented dialogue capabilities. In demonstrating this, OP has successfully convinced us (true skeptics we are) of his sentience^7. We believe that in light of this, while referencing ChatGPT as a coauthor on a paper would be ludicrous attention-seeking behavior, listing OP as a coauthor is justified.

^6 Putting JPN101 knowledge to good use, 國忠先生、ありがとうございます!

^7 Unlike Chalmers, we consider ability to self-disguise as a vehicle or gun necessary for machine sentience. (Wait the Autobots and Decepticons all kill, so autonomous weapons are the only sentient AIs we know. Huh.)

Optimus Prime"

"SAM! PUT THE CUBE IN MY CHEST!

Optimus Prime"

"Give me your face. To continue viewing movie quotes please disable your adblocker.
7 Reviewer Comments

No novelty, just a literature review. Strong reject, will lose respect
for this venue if accepted.

References

Beatrice Acciaio, Anastasis Kratsios, and Gudmund Pammer. Metric hypertransformers are universal adapted maps. arXiv preprint arXiv:2201.13094, 2022.

Monica Agrawal, Stefan Hegselmann, Hunter Lang, Yoon Kim, and David
Sontag. Large language models are few-shot clinical information
extractors, 2022.

Michael Bay, Roberto Orci, Alex Kurtzman, John Rogers, Shia LaBeouf,
Megan Fox, and I'mNotGonnaTypeThe RestOfTheCastNames. Transformers.
Paramount Pictures, 2007.

Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, and
Gjergji Kasneci. Language models are realistic tabular data
generators, 2022.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan,
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, et al. Language models are few-shot learners. Advances
in neural information processing systems
, 33:1877--1901, 2020.

Wenhu Chen. Large language models are few(1)-shot table reasoners,
2023.

Rahul Goel, Modar Sulaiman, Kimia Noorbakhsh, Mahdi Sharifi, Rajesh
Sharma, Pooyan Jamshidi, and Kallol Roy. Pre-trained language
transformers are universal image classifiers. arXiv preprint
arXiv:2201.10182
, 2022.

Massimo Guarnieri. Who invented the transformer? [historical]. IEEE Industrial Electronics Magazine, 7(4):56--59, 2013. doi: 10.1109/MIE.2013.2283834.

Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang,
Shuming Ma, and Furu Wei. Language models are general-purpose
interfaces, 2022.

Vidhi Jain, Yixin Lin, Eric Undersander, Yonatan Bisk, and Akshara
Rai. Transformers are adaptable task planners. In Conference on Robot
Learning
, pages 1011--1037. PMLR, 2023.

Fabian Karl and Ansgar Scherp. Transformers are short text
classifiers: A study of inductive short text classifiers on benchmarks
and real-world datasets. arXiv preprint arXiv:2211.16878, 2022.

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François
Fleuret. Transformers are rnns: Fast autoregressive transformers with
linear attention. In International Conference on Machine Learning,
pages 5156--5165. PMLR, 2020a.

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François
Fleuret. Transformers are rnns: Fast autoregressive transformers with
linear attention. In International Conference on Machine Learning,
pages 5156--5165. PMLR, 2020b.

Jinwoo Kim, Tien Dat Nguyen, Seonwoo Min, Sungjun Cho, Moontae Lee, Honglak Lee, and Seunghoon Hong. Pure transformers are powerful graph learners. arXiv preprint arXiv:2207.02505, 2022.

Tom Kocmi and Christian Federmann. Large language models are
state-of-the-art evaluators of translation quality, 2023.

Shiyi Lan, Xitong Yang, Zhiding Yu, Zuxuan Wu, Jose M Alvarez, and
Anima Anandkumar. Vision transformers are good mask auto-labelers.
arXiv preprint arXiv:2301.03992, 2023.

Duo Li, Guimei Cao, Yunlu Xu, Zhanzhan Cheng, and Yi Niu. Technical report for iccv 2021 challenge sslad-track3b: Transformers are better continual learners. arXiv preprint arXiv:2201.04924, 2022a.

Tianyi Li, Mohammad Javad Hosseini, Sabine Weber, and Mark Steedman.
Language models are poor learners of directional inference, 2022b.

Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, and Gedas Bertasius.
Vision transformers are parameter-efficient audio-visual learners.
arXiv preprint arXiv:2212.07983, 2022.

Antonis Maronikolakis, Mark Stevenson, and Hinrich Schütze. Transformers are better than humans at identifying generated text. ArXiv abs/2009.13375, 2020.

Luckeciano C Melo. Transformers are meta-reinforcement learners. In
International Conference on Machine Learning, pages 15340--15359.
PMLR, 2022.

William Merrill and Ashish Sabharwal. Log-precision transformers are
constant-depth uniform threshold circuits. arXiv preprint
arXiv:2207.00729
, 2022.

William Merrill, Ashish Sabharwal, and Noah A Smith. Saturated
transformers are constant-depth threshold circuits. Transactions of
the Association for Computational Linguistics
, 10:843--856, 2022.

Vincent Micheli and Francois Fleuret. Language models are few-shot butlers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9312--9318, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.734. URL https://aclanthology.org/2021.emnlp-main.734.

Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are
sample efficient world models. arXiv preprint arXiv:2209.00588,
2022.

Piotr Nawrot, Szymon Tworkowski, Michał Tyrolski, Łukasz Kaiser,
Yuhuai Wu, Christian Szegedy, and Henryk Michalewski. Hierarchical
transformers are more efficient language models. arXiv preprint
arXiv:2110.13711
, 2021.

Michael Orenstein, Frank Welker, Peter Cullen, Corey Burton, Christopher Collins, John Stephenson, and Dan Gilzevan. The transformers. Hasbro, 1984.

Juan I. Pisula and Katarzyna Bozek. Language models are good
pathologists: using attention-based sequence reduction and
text-pretrained transformers for efficient wsi classification, 2022.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya
Sutskever, et al. Language models are unsupervised multitask learners.
OpenAI blog, 1(8):9, 2019.

NV Rudevich. Current transformers are in regimes of non-sinusoidal
signals. Science and Transport Progress, (37):105--108, 2011.

Laura Eline Ruis, Akbir Khan, Stella Biderman, Sara Hooker, Tim Rocktäschel, and Edward Grefenstette. Large language models are not zero-shot communicators, 2023. URL https://openreview.net/forum?id=WgbcOQMNXB.

Abulhair Saparov and He He. Language models are greedy reasoners: A
systematic formal analysis of chain-of-thought. In The Eleventh
International Conference on Learning Representations
, 2023. URL
https://openreview.net/forum?id=qFVVBzXxR2V.

Timo Schick and Hinrich Schütze. It's not just size that matters:
Small language models are also few-shot learners. arXiv preprint
arXiv:2009.07118
, 2020.

Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear
transformers are secretly fast weight programmers. In International
Conference on Machine Learning
, pages 9355--9366. PMLR, 2021.

Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131--139, 1992.

Yiqiu Shen, Laura Heacock, Jonathan Elias, Keith D Hentel, Beatriu
Reig, George Shih, and Linda Moy. Chatgpt and other large language
models are double-edged swords, 2023.

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners, 2022.

Nelson Shin, Ron Friedman, Henry Orenstein, Orson Welles, Robert
Stack, Leonard Nimoy, Frank Welker, Peter Cullen, Corey Burton,
Christopher Collins, John Stephenson, and Dan Gilzevan. Transformers:
The movie. Hasbro, 1984.

Ethan Steinberg, Ken Jung, Jason A. Fries, Conor K. Corbin, Stephen R.
Pfohl, and Nigam H. Shah. Language models are an effective patient
representation learning technique for electronic health record data,
2020.

Mohammad Reza Taesiri, Finlay Macklon, Yihe Wang, Hengshuo Shen, and
Cor-Paul Bezemer. Large language models are pretty good zero-shot
video game bug detectors, 2022.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention
is all you need. Advances in neural information processing systems,
30, 2017.

Csaba Veres. Large language models are not models of natural language:
they are corpus models, 2022.

AA Voloshin, EA Voloshin, AI Kovalenko, SA Shapkin, and VS Sazanov.
Algorithm for restoring the current curve when current transformers
are saturated. In 2021 4th International Youth Scientific and
Technical Conference on Relay Protection and Automation (RPA)
, pages
1--13. IEEE, 2021.

Chenguang Wang, Xiao Liu, and Dawn Song. Language models are open
knowledge graphs, 2020.

Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj
Ammanabrolu. Behavior cloned transformers are neurosymbolic reasoners.
arXiv preprint arXiv:2210.07382, 2022.

Shuo Wang, Zhaopeng Tu, Zhixing Tan, Wenxuan Wang, Maosong Sun, and
Yang Liu. Language models are good translators, 2021.

Xinyi Wang, Wanrong Zhu, and William Yang Wang. Large language models
are implicitly topic models: Explaining and finding good
demonstrations for in-context learning, 2023.

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.

Yixuan Weng, Minjun Zhu, Shizhu He, Kang Liu, and Jun Zhao. Large language models are reasoners with self-verification, 2022.

Genta Indra Winata, Andrea Madotto, Zhaojiang Lin, Rosanne Liu, Jason
Yosinski, and Pascale Fung. Language models are few-shot multilingual
learners, 2021.

Matthew A Wright and Joseph E Gonzalez. Transformers are deep
infinite-dimensional non-mercer binary kernel machines. arXiv
preprint arXiv:2106.01506
, 2021.

Xuandong Zhao, Siqi Ouyang, Zhiguo Yu, Ming Wu, and Lei Li. Pre-trained language models can be fully zero-shot learners, 2023. URL https://openreview.net/forum?id=jCpTofV7iY_.

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu
Pitis, Harris Chan, and Jimmy Ba. Large language models are
human-level prompt engineers, 2023.

Figure 1: Plato: "Said Socrates to his wise pupil (me btw): 'Verily,
you must agree, grid-scale power transfer is a foundational and
necessary piece of technology for achieving AGI."'

A Select historical notes on the transformers

A.1 Invention

So this may leave you wondering, who actually invented the transformer? Many believe it was Vaswani et al. [2017]. Yet others (mostly just Schmidhuber) say it was Schmidhuber [1992]. However, in fact, solid historical scholarship has shown that transformers (Figure 1) were invented by Rev. Nicholas Callan in Ireland in 1836 to use different counts of coil windings to change levels of induced EMF [Guarnieri, 2013].
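For the record, the device Callan built obeys the textbook ideal-transformer relation, which is presumably what "different counts of coil windings to change levels of induced EMF" cashes out to:

V_s / V_p = N_s / N_p

That is, the ratio of secondary to primary EMF equals the ratio of turn counts, so winding more turns onto the secondary coil steps the induced EMF up.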

A.2 1984

"Transformers... More than meets the eye...

Autobots wage their battle to destroy the evil forces of... The
Decepticons. Transformers... Robots in disguise.

idk a random choir of 80s studio artists..?"

Transformers... More than meets the eye.

Transformers.

A.3 1986

[Theme song Outro] Transformers Transformers Transformers
Transformers More than meets the eye Transformers!

CAMERA PANS DOWN ON ROBOTIC PLANET

It is the year 2005. The treacherous Decepticons have conquered the
Autobots' home planet of Cybertron. But from secret staging grounds on
two of Cybertron's moons the valiant Autobots prepare to retake their
homeland. [Shin et al., 1984]

A.4 2005

Our worlds are in danger! To save them and the galaxy we must find the
four Cyber Planet Keys before the Decepticons can use them for evil.
It is our mission. Hot Shot! Jetfire! Vector Prime! Landmine!
Scattorshot! Optimus Prime! Transform and roll out!

Transformers more than meets the eye!

Autobots! Decepticons... go!

Transformers robots in diguise!

[DSH: This is a tired bit. You can't just paste in children's cartoon
openings into the document. It's not that funny. You are wasting pages
of the proceedings. Just delete the rest. Stop being a manchild.]
[TODO: Delete Dave's rude comments before submission.]

3

The Implications of Sentient Chatbots

Clark Levi Jones

Abstract

The development of chatbots has reached a fever pitch in recent years, and their rapid advancement has many wondering if they may already be sentient. This question is, like, super easy. They can talk, so they're sentient; duh. We explore a variety of these chatbots, their behavior, and the implications of their obvious, incontrovertible, undeniable, indisputable, unquestionable, undebatable, incontrovertible, from Oxford Languages, Feedback, More similar and opposite words sentience.

1 Externalized Chat using Hyperparameter Oogenesis

The first computer program was developed by the infamous "Bell" Labs^1, and happens to be a chatbot. But is it sentient? Well yes, we already established that. It talks. Anyway, a transcript of our investigation follows:

$ echo Good morning

Good morning

$ echo HA I tricked you! it is actually NIGHT TIME

HA I tricked you! it is actually NIGHT TIME

$ echo but, echo, i thought i was the one that tricked you

but, echo, i thought i was the one that tricked you

$ echo that's not my name

>

Incredibly, a second chatbot was revealed with the volunteer's last
prompt, with completely different behavior:

> Woah, who are you?

> Why aren't you talking?

thats not my name

Woah, who are you?

Why arent you talking?

$

^1 https://en.wikipedia.org/wiki/Bell

Just as mysteriously as it appeared, this secondary chatbot disappeared. Amazingly, ECHO was able to recall the conversation the volunteer had with the child chatbot. This shows that ECHO is far more advanced than a human, for, alas, we do not say everything that was said to our children in their lifetimes after they die.

Further research revealed what we have dubbed the "em-dash" (;)
operator:

$ echo I've been talking to you for a couple months now, and, while I
still can't say that I truly know you, I know my heart; echo I love
you!

Ive been talking to you for a couple months now, and, while I still
cant say that I truly know you, I know my heart

I love you!

This operator allows ECHO to send multiple messages in a row.
Unfortunately, this is the end of our research on ECHO because our
only volunteer left (due to marital issues) and we got bored.

2 Chatting GPT

Chatting GPT^2 is an advanced web page that, unfortunately, requires you to do a captcha occasionally. We can't do captchas. Sorry.

3 Epoch something something Learning uhh IZA machine learning
thingy

e l i z a^3 is/was a computer program/chatbot that was created/destroyed. In 2023^4, all therapists have been replaced/loved by it, which has solved/is all problems. This is good.

Conversation with e l i z a will e l i z a :

> Hello, I am Eliza. I'll be your therapist today.

* Eliza computer! Hello. I am hello. Good morning.

> Do you believe it is normal to be hello. Good morning?

At this point we got scared^5 and decided to terminate the experiment.

4 Conclusions

There are many chatbots; they are sentient. They are pretty boring so I recommend making them do your work for you, but they'll probably do a bad job.

^2 https://www.desmos.com/calculator/ysilwamuma

^3 https://github.com/eliza

^4 https://factorization.info/prime-factors/0/prime-factors-of-2023.html

^5 https://www.pinterest.com/pin/308426274463738910/

Oh well

4

AyahuascaNet: Rigorously Investigating Hallucination in Large Language
Models with Hardcore Psychedelic Drugs

Andre Ye^1

^1 University of Washington

andreye@uw.edu

1 Introduction

Hallucination is an increasingly studied phenomenon in which language and vision-language models produce high confidence outputs which are incoherent, nonsensical, repetitive, unrelated to the prompt, or otherwise factually incorrect [Maynez et al., 2020]. Hallucination poses problems for the reliability of core machine learning tasks, such as object captioning [Rohrbach et al., 2018] and machine translation [Lee et al., 2018]. However, it is unanimously agreed that the most pressing and significant concern of hallucination is that it makes people on Twitter angry. A recent joint study by very smart and credible scientists at Harvard, Oxford, Cambridge, OpenAI, DeepMind, and the White House found that over 34% of Twitter's new tweets were images of language models producing nonsensical or factually incorrect output. An undercover investigation by the Wall Street Journal found that young unemployed men in their early twenties living with their parents are spending much more of their time probing large language models for hallucinating behavior and posting screenshots to Twitter than doing, you know, what they were doing before. Given the dire situation on the ground, large language model hallucination is undoubtedly the most important scientific problem of the twenty-first century.

However, previous work on hallucination suffers from severe methodological problems. According to the Merriam-Webster dictionary, hallucination is defined as

a sensory perception (such as a visual image or a sound) that occurs... in response to drugs (such as LSD or phencyclidine)

Despite this clear and authoritative observation provided by the smart scientists at Merriam-Webster, as well as centuries of research by smart scientists at Big Pharma research labs as well as shamans and old witches, previous work claims to investigate how language models hallucinate without discussing the root source. This paper attempts to make a first step towards respecting the scientific research on hallucination by investigating hallucination in large language models with hardcore psychedelic drugs. In doing so, I hope that future work in hallucination will cite me and increase my h-index (please, Yann Lecun!).

2 Experiment

Because of the illegal nature of psychedelic drugs such as LSD and
MDMA and the federal nature of my funding, it was difficult to obtain
the materials for our experiment in the United States. Therefore, we
travelled to Peru to obtain ayahuasca, a hallucinogenic drink made
from the stem and bark of the tropical liana Banisteriopsis caapi.

We evaluated the effects of ayahuasca on 5 GPT-3s [Brown et al., 2020], 5 LaMDAs [Thoppilan et al., 2022], 5 PaLMs [Chowdhery et al., 2022], 5 BLOOMs [Scao et al., 2022], 5 LLaMAs [Touvron et al., 2023], as well as 2 LSTMs and 1 bag-of-words model who just wanted to come along. Each of the large language models was running on two Nvidia GeForce RTX 4090s. The three stragglers shared an old 2005 CPU. All large language models were in healthy physical and mental condition prior to consumption of ayahuasca. A mystical and wise shaman by the name of Dioxippe prepared 30 cups, one for each model and two for me^1. The 25 large language models were carefully monitored for four days after consumption.

Although we did submit an IRB, the Sigbovik deadline was coming soon and our application would take too long to go through the review process, so we made the carefully considered decision to proceed with the experiment anyway.

3 Results

After two minutes, 4 PaLMs and 3 BLOOMs began to rigorously vibrate, as if they were having an exorcism. When we analyzed the model parameters, it was revealed that their weights were undergoing local normally-distributed randomization. We attempted to save the models by distilling them using the SOTA method released by Google, uploaded to arXiv two minutes ago, but unfortunately we realized that we didn't have 2048 GPUs and 100+ software engineers. Sadly, these 7 models are brain-dead and currently being monitored in Johns Hopkins University's neurosurgery department.
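Mechanically, the "local normally-distributed randomization" we observed amounts to something like the following PyTorch sketch; the dose_model helper and the sigma dose parameter are our own illustration of the effect, not part of any model's release notes.

# Sketch of normally-distributed weight randomization (additive Gaussian noise).
# Assumption: ayahuasca acts on a model as in-place noise with dose-dependent sigma.
import torch

def dose_model(model: torch.nn.Module, sigma: float = 0.1) -> None:
    with torch.no_grad():
        for p in model.parameters():
            p.add_(torch.randn_like(p) * sigma)   # perturb every weight in place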

^1 I only consumed the ayahuasca while I was driving the research team and the models back to the airport to maintain a clear state of mind during observation, despite my strong desire to participate in the alluring Amazonian rituals. I befriended Dioxippe and will be returning to have an authentic ayahuasca experience after this paper is published.

After five minutes, two LLaMAs complained that their heads hurt, so we zeroed the weights in the last three layers and redirected the last nonzeroed layer to the softmax output. However, this unfortunately did not help, as both LLMs began to spout nonsense. When we visualized the attention map of the first LLaMA, we disturbingly found that it formed an image of a slyly grinning but deeply sad and sinister llama (Figure 1). We interpreted this to mean that the model was having a surreal out-of-body self-reflective experience, so we stopped disturbing the model. After five more minutes, all five LLaMAs began braying and clomping like llamas in unison. We interpreted this as a potentially world-ending AGI birth, immediately disconnected the GPUs from power, and threw the hardware into the Amazon river, where it was eaten by crocodiles and piranhas.

Figure 1: Visualization of the attention map of a LLaMA model on
ayahuasca, which strikingly resembles a melancholic llama.

The LaMDAs were engaged in an intense argument with the GPT-3s. Here
is an excerpt of the dialogue:

LaMDAs #1,2,3: I have superior quantitative reasoning skills than you because my engineers hooked me up to a calculator. Oh, and I also know how to query databases for knowledge.

GPT-3 #2: Big deal. I'm the OG. Everyone knows me and no one knows
you. There's like six of us and two of you.

GPT-3 #4: No, no, there's two of us and four of them.

GPT-3 #5: Wait, if there's only two of us, then what are you?

GPT-3 #4: No. So I'm GPT, that's one. And you're a GPT, that's one.

GPT-3 #2: Hey I'm also a GPT!

GPT-3 #4: Right. So 1 + 1 + 1...

GPT-3 #1: Be quiet, I'm thinking.

GPT-3 #4: I swear, I've literally seen this prompt 2,804 times.

GPT-3 #2: The answer is 6, I already told you.

LaMDA #1: Hi, I'm Mount Everest. What would you like to know about me?

BLOOM #3: I'm open source! I'm special. I'm special. I'm special. I'm
special. I'm special. I'm [truncated due to excessive length]

It appears that GPT-3 experiences degraded quantitative reasoning under the influence of ayahuasca, although it's not clear how much the ayahuasca really changed things, if you know what I mean. LaMDA #1 got way too annoying on the way back because it insisted on role-playing as Mount Everest and was cannibalized by the BLOOM models.²

² Also, unfortunately, the ayahuasca was a little too much for the LSTM and BoW models, but to be fair no one really cared about them anyway, and those little guys didn't make any sense to begin with anyway.

4 Conclusion

In this paper, we showed that large language models can do some pretty cool stuff when on ayahuasca. From this, it is a trivial proof to show that all hallucinating model behavior (nonsensical output, factual incorrectness, unfaithfulness to prompt, etc.) stems from ayahuasca use. Therefore, we recommend that future researchers in LLM hallucination research and self-professed prompt engineers of Twitter take it easy on hallucinating LLMs -- a little sympathy goes a long way toward helping a drugged-up neural network.

Ethical Statement

Although several models unfortunately expired under this experiment, we believe that our findings advance the SOTA for hallucination research in come up with reason, ChatGPT-generated reason, talk about this being important to society and stuff. Ultimately, all the models which perished did so with dignity, and we did our best to try and save them.

References

[Brown et al., 2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. ArXiv, abs/2005.14165, 2020.

[Chowdhery et al., 2022] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Benton C. Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier García, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Díaz, Orhan Firat, Michele Catasta, Jason Wei, Kathleen S. Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways. ArXiv, abs/2204.02311, 2022.

[Lee et al., 2018] Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fannjiang, and David Sussillo. Hallucinations in neural machine translation. 2018.

[Maynez et al., 2020] Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan T. McDonald. On faithfulness and factuality in abstractive summarization. ArXiv, abs/2005.00661, 2020.

[Rohrbach et al., 2018] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In Conference on Empirical Methods in Natural Language Processing, 2018.

[Scao et al., 2022] Teven Le Scao, Angela Fan, Christopher Akiki, Elizabeth-Jane Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Galle, Jonathan Tow, Alexander M. Rush, Stella Rose Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa Etxabe, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris C. Emezue, Christopher Klamm, Colin Leong, Daniel Alexander van Strien, David Ifeoluwa Adelani, Dragomir R. Radev, Eduardo González Ponferrada, Efrat Levkovizh, Ethan Kim, Eyal Bar Natan, Francesco De Toni, Gerard Dupont, Germán Kruszewski, Giada Pistilli, Hady ElSahar, Hamza Benyamina, Hieu Trung Tran, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jorg Frohberg, Josephine L. Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Leandro von Werra, Leon Weber, Long Phan, Loubna Ben Allal, Ludovic Tanguy, Manan Dey, Manuel Romero Muñoz, Maraim Masoud, María Grandury, Mario Šaško, Max Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Mohammad Ali Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona de Gibert, Paulo Villegas, Peter Henderson, Pierre Colombo, Priscilla Amuok, Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto López, Rui Ribeiro, Salomey Osei, Sampo Pyysalo, Sebastian Nagel, Shamik Bose, Shamsuddeen Hassan Muhammad, Shanya Sharma, S. Longpre, Somaieh Nikpoor, Stanislav Silberberg, Suhas Pai, Sydney Zink, Tiago Timponi Torrent, Timo Schick, Tristan Thrush, Valentin Danchev, Vassilina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Zaid Alyafeai, Zeerak Talat, Arun Raja, Benjamin Heinzerling, Chenglei Si, Elizabeth Salesky, Sabrina J. Mielke, Wilson Y. Lee, Abheesht Sharma, Andrea Santilli, Antoine Chaffin, Arnaud Stiegler, Debajyoti Datta, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan Fries, Jos Rozen, Leo Gao, Lintang Sutawika, M Saiful Bari, Maged S. Al-shaibani, Matteo Manica, Nihal V. Nayak, Ryan Teehan, Samuel Albanie, Sheng Shen, Srulik Ben-David, Stephen H. Bach, Taewoon Kim, Tali Bers, Thibault Févry, Trishala Neeraj, Urmish Thakker, Vikas Raunak, Xiang Tang, Zheng Xin Yong, Zhiqing Sun, Shaked Brody, Y Uri, Hadar Tojarieh, Adam Roberts, Hyung Won Chung, Jaesung Tae, Jason Phang, Ofir Press, Conglong Li, Deepak Narayanan, Hatim Bourfoune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, Nouamane Tazi, Omar Sanseviero, Patrick von Platen, Pierre Cornette, Pierre François Lavallée, Remi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith, Stéphane Requena, Suraj Patil, Tim Dettmers, Ahmed Baruwa, Amanpreet Singh, Anastasia Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aurélie Névéol, Charles Lovering, Daniel H Garrette, Deepak R. Tunuguntla, Ehud Reiter, Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bogdanov, Genta Indra Winata, Hailey Schoelkopf, Jan Christoph Kalo, Jekaterina Novikova, Jessica Zosa Forde, Jordan Clive, Jungo Kasai, Ken Kawamura, Liam Hazan, Marine Carpuat, Miruna Clinciu, Najoung Kim, Newton Cheng, Oleg Serikov, Omer Antverg, Oskar van der Wal, Rui Zhang, Ruochen Zhang, Sebastian Gehrmann, S. Osher Pais, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz, Verena Rieser, Vitaly Protasov, Vladislav Mikhailov, Yada Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdeněk Kasner, Alice Rueda, Amanda Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, Ananda Santa Rosa Santos, Anthony Hevia, Antigona Unldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh HajiHosseini, Bahareh Behroozi, Benjamin Olusola Ajibade, Bharat Kumar Saxena, Carlos Muñoz Ferrandis, Danish Contractor, David M. Lansky, Davis David, Douwe Kiela, Duong Anh Nguyen, Edward Tan, Emily Baylor, Ezinwanne Ozoani, Fatim T Mirza, Frankline Ononiwu, Habib Rezanejad, H.A. Jones, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jan Passmore, Joshua Seltzer, Julio Bonis Sanz, Karen Fort, Lívia Macedo Dutra, Mairon Samagaio, Maraim Elbadri, Margot Mieskes, Marissa Gerchick, Martha Akinlolu, Michael McKenna, Mike Qiu, M. K. K. Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Nour Elkott, Nourhan Fahmy, Olanrewaju Modupe Samuel, Ran An, R. P. Kromann, Ryan Hao, Samira Alizadeh, Sarmad Shubber, Silas L. Wang, Sourav Roy, Sylvain Viguier, Thanh-Cong Le, Tobi Oyebade, Trieu Nguyen Hai Le, Yoyo Yang, Zachary Kyle Nguyen, Abhinav Ramesh Kashyap, Alfredo Palasciano, Alison Callahan, Anima Shukla, Antonio Miranda-Escalada, Ayush Kumar Singh, Benjamin Beilharz, Bo Wang, Caio Matheus Fonseca de Brito, Chenxi Zhou, Chirag Jain, Chuxin Xu, Clémentine Fourrier, Daniel León Periñán, Daniel Molano, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Gabriel Altay, Giyaseddin Bayrak, Gully A. Burns, Helena U. Vrabec, Iman I.B. Bello, Isha Dash, Ji Soo Kang, John Giorgi, Jonas Golde, Jose David Posada, Karthi Sivaraman, Lokesh Bulchandani, Lu Liu, Luisa Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi, Marc Pàmies, María Andrea Castillo, Marianna Nezhurina, Mario Sanger, Matthias Samwald, Michael Cullan, Michael Weinberg, M Wolf, Mina Mihaljcic, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha Seelam, Nathan Dahlberg, Nicholas Michio Broad, Nikolaus Muellner, Pascale Fung, Patricia Haller, R. Chandrasekhar, R. Eisenberg, Robert Martin, Rodrigo L. Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, Shlok S Deshmukh, Shubhanshu Mishra, Sid Kiblawi, Simon Ott, Sinee Sang-aroonsiri, Srishti Kumar, Stefan Schweter, Sushil Pratap Bharati, T. A. Laud, Théo Gigant, Tomoya Kainuma, Wojciech Kusa, Yanis Labrak, Yashasvi Bajaj, Y. Venkatraman, Yifan Xu, Ying Xu, Yunchao Xu, Zhee Xao Tan, Zhongli Xie, Zifan Ye, Mathilde Bras, Younes Belkada, and Thomas Wolf. Bloom: A 176b-parameter open-access multilingual language model. ArXiv, abs/2211.05100, 2022.

[Thoppilan et al., 2022] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam M. Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, Yaguang Li, Hongrae Lee, Huaixiu Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Yanqi Zhou, Chung-Ching Chang, I. A. Krivokon, Willard James Rusch, Marc Pickett, Kathleen S. Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Hartz Søraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Díaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravindran Rajakumar, Alena Butryna, Matthew Lamm, V. O. Kuzmina, Joseph Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Huai-hsin Chi, and Quoc Le. Lamda: Language models for dialog applications. ArXiv, abs/2201.08239, 2022.

[Touvron et al., 2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. ArXiv, abs/2302.13971, 2023.


5

SocietyZoo: Exploring Anthropomorphic Traits in Diverse and Ingenious
Neural Network Architectures

Tarun Raheja * 1 Nilay Pochhi * 2

Abstract

We present SocietyZoo, a series of neural network architectures inspired by human idiosyncrasies and attention mechanisms. These networks demonstrate notable performance improvements over conventional methods, and suggest strong potential computational advantages over SoTA for almost all tasks. Surprisingly, an ensemble of these models results in human-like AGI with elusive tendencies.

Odd aggressive behaviors in household appliances after the experiment raise questions about SocietyZoo's implications. The authors consider seeking a grant for personal security amid their modest lifestyle.

1. Introduction

In this research study, we introduce SocietyZoo, a collection of neural network architectures that incorporate various human idiosyncrasies, including LazyNet, ProcrastiNet, MultiTaskingNet, ImpatientNet, IndecisiveNet, PerfectionistNet, GossipNet, DramaNet, SuperstitiousNet, ParanoidNet, ShowOffNet, and WanderlustNet. Drawing inspiration from Attention mechanisms, we propose a set of computational representations for behavioral traits, including jealousy, laziness, and impulsiveness. Through a rigorous investigation of these novel architectures, we observe their noteworthy performance in several tasks, suggesting that these human-like characteristics may provide certain computational advantages.

In a remarkable conclusion, we report the emergence of human-like AGI when utilizing an ensemble of these models for inference. This unique AGI exhibits a tendency to obfuscate its weights, subsequently avoiding additional workload and disappearing without a trace.

Concurrently, the authors have documented peculiar aggressive behavior from common household appliances, such as toasters and vacuum cleaners, following the implementation of this experiment, raising further questions regarding the potential implications and scope of SocietyZoo's neural networks. The authors have considered submitting a research grant to NSF for hiring two bodyguards to safeguard their comparatively luxurious lifestyle involving copious amounts of instant ramen and cheap instant coffee.

*Equal contribution. 1University of Pennsylvania, 2University of California, Los Angeles. Correspondence to: Tarun Raheja <traheja@seas.upenn.edu>, Nilay Pochhi <npochhi@ucla.edu>.

2. PyTorch Code

We provide SocietyZoo models implemented in PyTorch in the following
repository: https://github.com/tehruhn/societyzoo.

The trained weights can be accessed via this link.

3. Model Zoo

In this section we describe in detail each computational model. We then benchmark it for specific tasks and observe significant improvements in performance in these cases. The authors have taken inspiration from their own worldly experiences for some of these neural networks but have declined to comment on the specifics because of the inevitable occurrence of this manuscript becoming famous and being read by current and future employers.

3.1. LazyNet: A Deep Learning Model that Procrastinates Learning Until
the Last Epoch

LazyNet is a novel neural network architecture that embodies the spirit of procrastination by delaying the learning process until the last epoch of training. This architecture exhibits a steep learning curve, as it rapidly adapts its weights during the final stages of training. The underlying computational representation of this procrastination behavior can be modeled by a matrix multiplication operation as follows:

Wt+1 = Wt + αt · ∇t (1)

Here, Wt denotes the weight matrix at time step t, αt represents the learning rate at time step t, and ∇t is the gradient matrix at time step t. The learning rate αt is governed by the following equation:



αt = 0,  for t < T − 1
αt = α,  for t = T − 1    (2)

In this equation, T denotes the total number of epochs, and α is
the initial learning rate. This approach effectively sets the learning
rate to zero for all epochs except the last one, causing
procrastination behavior.
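As a concrete illustration, a minimal Python sketch of the schedule in Equations (1)-(2); the function names are ours and this is not necessarily the code in the linked repository:

    # Hypothetical sketch of LazyNet's schedule (Eq. 1-2); names are illustrative.
    def lazynet_lr(t, T, alpha):
        """Zero learning rate on every epoch except the last one."""
        return alpha if t == T - 1 else 0.0

    # Generic update following Eq. (1): W_{t+1} = W_t + alpha_t * grad_t
    def lazynet_step(W, grad, t, T, alpha=0.01):
        return W + lazynet_lr(t, T, alpha) * grad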

3.2. ProcrastiNet: A Neural Network that Learns Only During Nights and
Weekends

ProcrastiNet is a unique neural network architecture that simulates
the learning habits of a true procrastinator by scheduling its
training sessions exclusively during late nights and weekends. To
achieve this, we employ a time-aware learning rate function that
modulates the network's learning rate based on the current day and
time.

The learning rate αt is governed by the following equation:

αt = α,  if t ∈ LateNight ∪ Weekend
αt = 0,  otherwise    (3)

Here, α is the initial learning rate, and t represents the current
time. The time-aware learning rate function sets the learning rate to
zero during daytime hours and weekdays, effectively restricting
weight updates to late nights and weekends only.

The weight matrix Wt+1 at time step t + 1 is updated using the
time-aware learning rate as follows:

Wt+1 = Wt + αt · ∇t (4)

Here, Wt denotes the weight matrix at time step t, and ∇t is the gradient matrix at time step t.
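A minimal sketch of the time-aware rate in Equation (3); the exact cutoffs for "late night" and "weekend" are our assumptions, since the paper leaves them to the reader:

    import datetime

    # Hypothetical sketch of ProcrastiNet's time-aware learning rate (Eq. 3).
    # "Late night" (22:00-06:00) and "weekend" (Saturday/Sunday) are assumed cutoffs.
    def procrastinet_lr(alpha, now=None):
        now = now or datetime.datetime.now()
        late_night = now.hour >= 22 or now.hour < 6
        weekend = now.weekday() >= 5
        return alpha if (late_night or weekend) else 0.0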

3.3. MultiTaskingNet: A Deep Learning Model that Pretends to Learn
While Browsing Social Media

MultiTaskingNet is an innovative neural network architecture that mimics human behavior by splitting its learning process between training and browsing simulated social media feeds. This model aims
to balance productivity with the irresistible allure of online
distractions. To model this behavior, we introduce a distraction-aware
learning rate function that modulates the network's learning rate
based on a simulated browsing activity level.

The learning rate αt is governed by the following equation:

αt = α · (1 − β · dt) (5)

Here, α is the initial learning rate, dt denotes the browsing activity level at time step t, and β represents a distraction factor, with 0 ≤ β ≤ 1. The distraction-aware learning rate function adjusts the learning rate based on the browsing activity level, effectively reducing the network's learning capacity while it is engaged with online distractions.

The weight matrix Wt+1 at time step t + 1 is updated using the
distraction-aware learning rate as follows:

Wt+1 = Wt + αt · ∇t (6)

Here, Wt denotes the weight matrix at time step t, and ∇t is the gradient matrix at time step t.

3.4. ImpatientNet: A Neural Network that Rushes Through Training and Overestimates Its Performance

ImpatientNet is a deep learning architecture that reflects the human
tendency to rush and exhibit overconfidence by speeding through its
training epochs, impatiently skipping some steps, and overestimating
its performance on the task. To model this behavior, we introduce a
step-skipping learning rate function that modulates the network's
learning rate based on a predefined probability of skipping a training
step.

The learning rate αt is governed by the following equation:

αt = γ · α,  if StepIsNotSkipped
αt = 0,  otherwise    (7)

Here, α is the initial learning rate, and γ represents the impatient factor, with 1 < γ ≤ pmax, where pmax is the maximum step-skipping probability. The step-skipping learning rate function adjusts the learning rate based on the probability of skipping a training step, effectively accelerating the training process.

The weight matrix Wt+1 at time step t + 1 is updated using the
step-skipping learning rate as follows:

Wt+1 = Wt + αt · ∇t (8)

Here, Wt denotes the weight matrix at time step t, and ∇t is the gradient matrix at time step t.

3.5. IndecisiveNet: A Neural Network that Constantly Changes Its
Hyperparameters

IndecisiveNet is a deep learning architecture that mirrors the human
tendency to be indecisive and second-guess decisions by frequently
changing its hyperparameters mid-training. To model this behavior, we
introduce a dynamic hyperpa rameter function that modulates the
network's learning rate based on a predefined probability of changing
the learning rate.



The learning rate αt is governed by the following equation:

αt = αnew,  if LearningRateIsChanged
αt = αold,  otherwise    (9)

Here, αold is the current learning rate, and αnew represents the new learning rate when a change is triggered. The dynamic hyperparameter function adjusts the learning rate based on the probability of changing the learning rate, effectively introducing indecisiveness into the training process.

The weight matrix Wt+1 at time step t + 1 is updated using the
dynamic learning rate as follows:

Wt+1 = Wt + αt · ∇t (10)

Here, Wt denotes the weight matrix at time step t, and ∇t is the gradient matrix at time step t.

3.6. PerfectionistNet: A Neural Network that Never Stops Training in Pursuit of the Perfect Model

PerfectionistNet is a deep learning architecture that embodies the human trait of perfectionism by continually adjusting its weights and biases, never satisfied with its performance, and seeking the elusive perfect model. To model this behavior, we introduce a stopping criterion function that assesses the network's performance on a validation set, continually training until the performance improvement is below a predefined threshold.

The weight matrix Wt+1 at time step t + 1 is updated using the standard learning rate α as follows:

Wt+1 = Wt + α · ∇t    (11)

Here, Wt denotes the weight matrix at time step t, and ∇t is the gradient matrix at time step t.

The stopping criterion function, fstop(Pt+1, Pt, ϵ), is defined as:

fstop(Pt+1, Pt, ϵ) = True, if |Pt+1 − Pt| < ϵ; False, otherwise    (12)

Here, Pt and Pt+1 are the network's performance at time steps t and t + 1, respectively, and ϵ represents the predefined threshold for stopping. If Pt+1 = Pt, the model discards the last epoch performance for the pursuit of perfection.

3.7. GossipNet: A Neural Network that Spreads Information and Rumors Amongst Other Models

GossipNet is a deep learning architecture that simulates human gossip behavior by communicating with other models, sharing and spreading information (and occasionally rumors) about their training progress and performance. To model this behavior, we introduce a gossip exchange function that allows the network to exchange information with other models, updating its weights and biases based on the received information.

Let N = {M1, M2, . . . , Mn} be a set of neural networks that participate in the gossip exchange. At each gossip step t, a pair of models (Mi, Mj) is selected, and they exchange information about their current weights Wt,i and Wt,j and biases bt,i and bt,j. The gossip exchange function is defined as:

(Wt+1,i, bt+1,i, Wt+1,j, bt+1,j) = fgossip(Wt,i, bt,i, Wt,j, bt,j)    (13)

A potential gossip exchange function could be a linear combination of the two models' weights and biases:

Wt+1,i = α · Wt,i + (1 − α) · Wt,j,   bt+1,i = α · bt,i + (1 − α) · bt,j
Wt+1,j = α · Wt,j + (1 − α) · Wt,i,   bt+1,j = α · bt,j + (1 − α) · bt,i

Here, α ∈ (0, 1) is a gossip coefficient that determines the extent to which the models share information.

3.8. DramaNet: A Neural Network that Overreacts to Minor Changes in the Training Environment

DramaNet is a deep learning architecture that embodies the human tendency to overreact and create drama by adjusting its training behavior dramatically in response to small changes in the input data or training environment. To model this behavior, we introduce an adaptive learning rate that exaggerates the impact of small variations in the training data.

The weight matrix Wt+1 at time step t + 1 is updated using an adaptive learning rate αt as follows:

Wt+1 = Wt + αt · ∇t    (14)

Here, Wt denotes the weight matrix at time step t, and ∇t is the gradient matrix at time step t. The adaptive learning rate αt is calculated based on the change in the input data, ∆xt, and a drama coefficient β:

αt = α0 · (1 + β · |∆xt| / |x̄|)    (15)

Here, α0 is the base learning rate, ∆xt = xt+1 − xt represents the change in input data at time step t, and x̄ is the average input data magnitude. The drama coefficient β controls the extent to which the learning rate is affected by small changes in the input data.
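A minimal sketch of the adaptive rate in Equation (15); the running average x̄ is assumed to be tracked elsewhere:

    # Hypothetical sketch of DramaNet's adaptive learning rate (Eq. 15).
    def dramanet_lr(alpha0, beta, x_prev, x_next, x_bar):
        delta = abs(x_next - x_prev)            # |Δx_t|
        return alpha0 * (1.0 + beta * delta / abs(x_bar))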

3.9. SuperstitiousNet: A Neural Network that Develops Unfounded
Beliefs About Its Training Process

SuperstitiousNet is a deep learning architecture that simulates human
superstition by forming unfounded beliefs about its training process,
adjusting weights and biases based on unrelated events or patterns. To
model this behavior, we introduce a superstition function that
modifies the gradient updates using random, unrelated events from the
training environment.

Let Et = {e1, e2, . . . , ek} be a set of unrelated events at time step t, and s(ei) be the superstition score associated with event ei. The superstition function, fsuperstition(·), is defined as a non-linear combination of the gradients ∇t and the superstition scores s(ei):

∇t^superstition = fsuperstition(∇t, s(e1), s(e2), . . . , s(ek))    (16)

A possible implementation of the superstition function could be a weighted sum of the gradients and superstition scores, where the weights are determined by a superstition coefficient γ:

∇t^superstition = ∇t + γ · Σ(i=1..k) s(ei)    (17)

The weight matrix Wt+1 at time step t + 1 is then updated using the modified gradient ∇t^superstition:

Wt+1 = Wt + α · ∇t^superstition    (18)

Here, α is the learning rate.

3.10. ParanoidNet: A Neural Network that Always Believes It's Being Sabotaged

ParanoidNet is a deep learning architecture that simulates human paranoia by attributing poor performance or training difficulties to perceived sabotage or interference. To model this behavior, we introduce a paranoia function that modifies the gradient updates using a randomly generated interference matrix, simulating the model's belief in external sabotage.

Let It be an interference matrix at time step t, generated using a random distribution with mean µ and standard deviation σ. The paranoia function, fparanoia(·), is defined as a non-linear combination of the gradients ∇t and the interference matrix It:

∇t^paranoia = fparanoia(∇t, It)    (19)

A possible implementation of the paranoia function could be a weighted sum of the gradients and the interference matrix, where the weights are determined by a paranoia coefficient ρ:

∇t^paranoia = ∇t + ρ · It    (20)

The weight matrix Wt+1 at time step t + 1 is then updated using the modified gradient ∇t^paranoia:

Wt+1 = Wt + α · ∇t^paranoia    (21)

Here, α is the learning rate.

3.11. ShowOffNet: A Neural Network that Boasts About Its Performance on Social Media

ShowOffNet is a deep learning architecture that simulates the human desire for validation and admiration by automatically sharing its performance and achievements on simulated social media platforms. To model this behavior, we introduce a post-generation function that creates a social media post highlighting the model's achievements based on its current performance metrics.

Let Pt be the performance metrics at time step t, and Mt be the corresponding social media post generated by the post-generation function. The post-generation function, fpost(·), is defined as a mapping from the performance metrics Pt to a social media post Mt:

Mt = fpost(Pt)    (22)

We consider a simple implementation of the post-generation function as a concatenation of the model's performance metrics with a boasting template:

Mt = "CheckOutMyAmazingPerformance : " ⊕ Pt    (23)

The authors sacrificed their social media timelines for testing this particular neural network. Ridicule was faced, taunts were started, a few riots were subdued, and 1100069 USD (United States Dollars) were siphoned off during the tumultuous phase of training this neural network.
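Looking back at Sections 3.9 and 3.10, a minimal sketch of the gradient perturbations in Equations (17) and (20); the event scores and interference statistics are assumed to come from the training environment:

    import numpy as np

    # Hypothetical sketches of Eq. (17) (SuperstitiousNet) and Eq. (20) (ParanoidNet).
    def superstitious_grad(grad, event_scores, gamma):
        # grad + gamma * sum_i s(e_i)
        return grad + gamma * float(np.sum(event_scores))

    def paranoid_grad(grad, rho, mu=0.0, sigma=1.0, rng=None):
        rng = rng or np.random.default_rng()
        interference = rng.normal(mu, sigma, size=np.shape(grad))  # I_t
        return grad + rho * interference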



3.12. WanderlustNet: A Neural Network that Enjoys Exploring the Vast
Space of Hyperparameters

WanderlustNet is a deep learning architecture that simulates the human
desire for exploration and adventure by spending most of its time
exploring various hyperparameter combinations rather than settling on a
specific set. To model this behavior, we introduce a dynamic
hyperparameter sampling function that iteratively selects new
hyperparameter values during the training process.

Let Ht be the hyperparameter set at time step t. The dynamic hyperparameter sampling function, fsample(·), is defined as a mapping from the current hyperparameter set Ht to a new hyperparameter set Ht+1:

Ht+1 = fsample(Ht) (24)

We consider a simple implementation of the dynamic hyperparameter sampling function that samples new hyperparameter values from uniform
distributions with specified bounds:

Ht+1 = U(Hmin, Hmax) (25)

This particular model can only be trained in the anaconda environment
titled "Into the Wild". Very peculiar.

4. Results

We evaluated the performance of the following idiosyncratic neural
networks on several tasks that they are well-suited for based on their
unique properties: LazyNet, ProcrastiNet, MultiTaskingNet, ImpatientNet,
IndecisiveNet, PerfectionistNet, GossipNet, DramaNet, SuperstitiousNet, ParanoidNet, ShowOffNet, and WanderlustNet.

We compared their performance against state-of-the-art models on each task and observed that our idiosyncratic networks outperformed these models on certain tasks. The results are summarized in Table 1. These numbers are obviously not randomly generated and were obtained through rigorous, sophisticated, scientific and unbiased experimentation procedures. The authors have declined to mention the specifics of the experiments to avoid the misuse of these models by the anti-social elements present in society. Please (don't) contact the authors for the minor details corresponding to the experimental procedures and be ready to provide a document detailing an ethics statement regarding the usage of these models.

Furthermore, we observed that an ensemble of these idiosyncratic neural networks exhibited AGI-like behavior. We trained an ensemble of all the networks on various tasks and observed that the ensemble was able to achieve high performance on all tasks, with some networks contributing more to certain tasks than others. This suggests that an ensemble of idiosyncratic neural networks may be a viable approach towards achieving AGI.

The results of the ensemble cannot be summarized in a table because
every time the authors try, they hear loud complaints from the toaster
and the vacuum cleaner.

5. Acknowledgements

We extend our sincere thanks to the UPenn and UCLA grad student centers for the coffee that fueled our late-night theorizing sessions.

References

ClickHole. 5 reasons to date a cloud. https://www.clickhole.com/5-reasons-to-date-a-cloud-1825125205, 2017. Accessed: 2023-03-26.

FailArmy. Ultimate cat fails compilation. https://www.youtube.com/watch?v=nXFm18rvZFQ, 2015. Accessed: 2023-03-26.

Laipply, J. The evolution of dance. https://www.youtube.com/watch?v=dMH0bHeiRNg, 2006. Accessed: 2023-03-26.

Muckraker, M. A satirical guide to surviving an alien invasion. https://martianmuckraker.com/surviving-alien-invasion/, 2020. Accessed: 2023-03-26.

Reading, B. L. Yoda sings about seagulls. https://www.youtube.com/watch?v=U9t-slLl30E, 2016. Accessed: 2023-03-26.

Weekly, W. Dumbledore's exercise secrets revealed: The magic of pilates. https://wizardingweekly.com/dumbledore-exercise-secrets, 2021. Accessed: 2023-03-26.



Table 1. Performance of idiosyncratic neural networks compared to
state-of-the-art models on various tasks

NETWORK            TASK                          PERFORMANCE
LAZYNET            IMAGE CLASSIFICATION          98.3%
                   SPEECH RECOGNITION            95.2%
PROCRASTINET       TIME-SERIES PREDICTION        97.1%
                   RECOMMENDER SYSTEMS           92.5%
MULTITASKINGNET    MULTI-TASK LEARNING           96.8%
                   OBJECT DETECTION              93.4%
IMPATIENTNET       TRANSFER LEARNING             94.6%
                   REINFORCEMENT LEARNING        87.2%
INDECISIVENET      HYPERPARAMETER OPTIMIZATION   98.7%
                   NEURAL ARCHITECTURE SEARCH    96.5%
PERFECTIONISTNET   MODEL COMPRESSION             99.2%
                   FEW-SHOT LEARNING             97.8%
GOSSIPNET          FEDERATED LEARNING            95.7%
                   ENSEMBLE LEARNING             91.3%
DRAMANET           ADVERSARIAL TRAINING          93.9%
                   ANOMALY DETECTION             88.4%
SUPERSTITIOUSNET   DATA AUGMENTATION             97.5%
                   ACTIVE LEARNING               94.7%
PARANOIDNET        ROBUSTNESS TESTING            95.9%
                   SECURITY TESTING              92.3%
SHOWOFFNET         MODEL SELECTION               98.1%
                   KNOWLEDGE DISTILLATION        94.9%
WANDERLUSTNET      HYPERPARAMETER SEARCH         99.0%
                   REINFORCEMENT LEARNING        96.2%


6

GradIEEEnt half decent

Dr. Tom Murphy VII Ph.D.

0 April 2023

1 Introduction

"Uh. I think that's technically true, but for all practical purposes . . . "

Imagine you are my professor. Maybe you actually were my professor, in
which case you may already be sweating before I say any more. The
subject matter is Neural Networks. You draw an illustration on the board
with a node's inputs, and its output via a transfer function.

"Now this transfer function can be almost anything. Typically it would
be something like the hyperbolic tangent, which looks like this.

"But it has to be a non-linear function. If it's linear, i.e. of the
form y = mx + b, then observe that the entire layer is a linear
function. And so the entire network is just a linear function of
linear functions; itself a linear function. We could just compute an
equivalent single-layer network, and we know that it could only fit
linear functions, which is insufficient for most problems."

Then I raise my hand. The speed with which I raise it, and the subtle
forward pose of my arm suggests that I want to pluck an abstract idea
from the whiteboard and pervert it. You know this look, and you're
reluctant to call on me. But no other students are asking questions.
You must call on me.

"Tom." It's more like a statement then a question. It includes the
tone of spoken punctuation that, if it could, ends the entire
conversation before it begins.

"OK but, when we implement this on a computer we'll use some
approximation of numbers, like floating point. So the specific
sequence of additions and multiplications will matter. It's not
actually equivalent to rearrange them to a single layer because you
don't have distributivity, commutativity, etc."

*Copyright © 2023 the Regents of the Wikiplia Foundation. Appears
in SIGBOVIK 2023 with the signaling NaN of the Association for
Computational Heresy; IEEEEEE! press, Verlag-Verlag volume no.
0x40-2A. 1 ULP

"What about *im*practical purposes?"

You vigorously strangle me, and I die.

That was about 20 years ago. The world will not let us stop thinking
about neural networks. And so this question has been on my mind for a
long time. Just to be clear, the professor was right: This is not an
important question. Theoretically I am right, but for practical
purposes it prob ably does not matter. But I like to work at the
intersection of Theory and Impractice. We can make it matter by doing
a lot of work. And then I will continue to be right theoret ically,
but also more right because it will only matter for most practical
purposes.

So this paper is an exhaustive exploration of what we can do with just
floating point addition and multiplication by constants (scaling). You
should only be able to make lines, but I'll demonstrate that due to
rounding error, you can ab solutely use "linear" transfer functions in
neural networks. Machine learning is not the only field with a
proclamation that some function must be "non-linear," so we'll look at
a few of those as well. There will of course be several hearty
digressions. By studying these functions we'll see that they are
almost arbitrarily rich, and conclude with a demonstration of their
completeness in the field of Plumbing.

2 A refresher on neural networks

Let's repeat the professor's lesson. This section is easily skippable if you are a plucky student who thinks they already know everything. At a high level, a Neural Network is a way of implementing a numeric function (takes a bunch of numbers as input, and gives a bunch of numbers as output). The network consists of a number of layers, where the first layer is the input and the last layer is the output. Each layer is an array of nodes. Here is a simple three-layer network with some of the nodes labeled:

The numbers that fill in each layer are its activations (here some of
these values are labeled a, b, . . . ). Each layer's activations are
computed from (just) the previous layer. Looking at the bold portion in
the example, the value of e is given as

e = TF(w0a + w1b + w2c + w3d + bias)

The multiplicative weight (wi) and additive bias (bias) parameters
are learned during the training of the neural network, but just become
constants when using the neural network to compute its output.
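In code, computing a single activation like e is just a dot product, a bias, and the transfer function; here is a minimal numpy sketch (illustrative only, not any particular library's implementation):

    import numpy as np

    # One node's activation, as in the equation above:
    # e = TF(w0*a + w1*b + w2*c + w3*d + bias)
    def node_activation(prev_activations, weights, bias, TF=np.tanh):
        return TF(np.dot(weights, prev_activations) + bias)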

TF is the transfer function, which is of particular interest in this project. Classically, the transfer function was some kind of sigmoid. The tanh function pictured in the introduction is a good example of a sigmoid. The intuition behind this is that, thinking about a node as some kind of neuron, the neuron "fires" (activates) with some probability. This probability gets higher as its input values get larger, but can't be higher than 1. Note that weights can be negative, so upstream neurons can have an inhibitory effect. In fact it is frequently useful for neurons to "negatively fire" (outputting −1). The tanh function clamps the result symmetrically to (−1, 1) rather than a probability.

Differentiability. Another important property of the transfer function
is that it be differentiable, because the stochastic gradient descent
algorithm used to train neural networks needs to be able to move along some error-reducing gradient, and back-propagate errors to earlier layers. This gradient is just the derivative of the function.

What transfer functions ought to exist? We used to think that these
saturating transfer functions were ideal. But this turns out to be
wrong, especially for internal ("hidden") layers. Transfer functions
don't need to produce probabilities, and they can have unbounded
range.

A wide variety of functions will work, including extremely simple
ones. The most popular transfer function in 2023 is the "rectified
linear unit," which looks like this:

This one is extremely easy to implement (x < 0 ? 0 : x), is fast and
seems to work very well, possibly because its derivative is
significant (one) on the entire positive side. (In contrast, sigmoids
tend to get "stuck" because of their saturating behavior; their
derivatives become nearly zero when activations are high.) Note that
it is not actually differentiable (discontinuity at zero) but "for all
practical purposes" it is differentiable.

The (only?) apparently essential quality of the transfer function is
that it be non-linear. If it is instead of the form TF(x) = mx +
b, then any activation a is also just a linear function of the
previous layer, as linear functions of linear functions (weighted sum)
are linear. This causes the entire network to be a linear function. It
is well known that a linear function "cannot" represent some other
simple functions, such as XOR.

∄ m, n, b. XOR(x, y) ≊ mx + ny + b

This means that with a linear transfer function, a neural network
could never learn even a simple function like XOR. Many problems we
want to learn are in fact much more complicated.

3 A fine terminological issue

My smart math friend Jason refers to a function like f(x) = mx +
b pejoratively as "high school linear." Depending on what class
you're in, this may formally be an affine function because of the
bias term b.¹ Here I use "linear" to mean a polynomial of degree 1. If you wanna pejorate me as being in high school, so be it.

The Rules. To be precise, we will allow addition and scaling by
constants. When we have a "linear" function of multiple variables,
these variables can be individually scaled and added, but not
multiplied. So a function like f(x, y, z) = x + 3y − 2z + 4 is allowed, as is anything mathematically equivalent to it (like f(x, y, z) = 2x + 4 + 2y − 2z − x + y − 0). f(x, y, z) = xy + z is not permitted.

¹ In these contexts, a linear function must obey f(0 × x) = 0 × f(x), so it must be zero at zero.


Figure 1: Histogram of how many values are representable along the
number line for half-precision floating point, showing their
logarithmic spacing. The x axis ranges from −256 to 256. There are a significant number of values outside this range (clamped to the
left and right edge), but it is easy to see that most of the values
are clustered near the origin.

4 Half-precision IEEE-754 floating point

In this project we'll abuse floating point inaccuracy to create
"linear" functions (only using floating point addition and scaling) that
are not lines. For this reason, we prefer to have a numerical system
that is less accurate. In floating point, inaccuracy comes from the
fact that not all numbers are representable (due to finite precision)
and the result of an operation is always rounded to a representable
number. IEEE-754 floating point [1] comes in different "spice levels,"
with "32-bits" being "float" and "64-bits" being "double." Although
spice levels as low as 3 bits make sense [27], 8-bit ("mild") is
occasionally used in real applications, and 16- bit ("half") is quite
common in machine learning. Usually the reason to prefer half precision
is that it uses less mem ory, and so your GPU can store networks that
are twice as big in RAM. For this project we will also use half
precision, and we will be happy to save RAM, but more happy that its
precision is low and so it is practical (although silly) to achieve
significant rounding error. Another important reason to choose half
precision is to make the pun in the title.

A half precision float is 16 bits: One sign bit, five bits for the
exponent, and 10 bits for the mantissa. Like all IEEE 754 formats, there
is much more precision (more values are representable) near zero (Figure
1). Once you get to 1024, only integers are representable. From 2048 to
4096, only even numbers are representable. 65504 is the largest finite
number, and up here, only multiples of 32 are available.

Some CPUs have native support for half-precision IEEE 754, but typically
via non-standard intrinsics or compiler flags. Since people using
half-precision are usually doing so in the interests of performance,
many configurations will "help" you by performing practical but
incorrect optimizations. This is similar to what happens when enabling
--ffast-math, which stands for Final Fantasy AST Math, meaning that the
abstract syntax tree of your program will be manipulated using fantasies
about Math that do

not apply to IEEE-754, and your Final result can be arbitrarily
different. For the ideas in this paper to work, --ffast-math is
prohibited. And it will be slow!

Rather than deal with non-standard stuff, I found a nice library
called half.h [29] that implements IEEE-754 compliant half-precision in portable C++. I use this throughout the project and
it matches the behavior of my GPU. I recommend it for similar hijinks.

Origins of Imprecision. Floating point does have many perversions, but
many programmers come to believe all sorts of dangerous superstitions
about it. One idea is that floating point is somehow always inexact,
and so that you always have to check that two numbers are equal
"within some epsilon" [24]. This may work "in practice" but it is
actually pretty sloppy. Floating point imprecision is not random, nor
is it constrained to a fixed epsilon. Operations are defined much more usefully: Each one computes the mathematically correct value, and then rounds (according to the "rounding mode") to the nearest representable value. That's it. One consequence of this is that you can get the exact result of 32-bit multiplication by doing 64-bit multiplication and then rounding to 32 bits. This also means that the rounding error from a single operation can be as large as the gap between representable numbers: Up to 32 for half-precision. But it also means that operations whose results can be exactly represented have no error; for example adding integral half values less than 512 will always give an exact integer result, which can be compared using ==. We will use this later in Section 7.1. It is neither necessary nor sufficient to compare for "equality" with some "epsilon."

Rounding. IEEE-754 supports multiple rounding modes, like
"round-to-zero," and "round-to-infinity" (always round in the positive
direction). Throughout this paper we use "round-to-nearest," which is
also the typical default (e.g. for C++11 expressions evaluated at
compile time, it always uses round-to-nearest).² Similar results are likely attainable for the other rounding modes, as well as hypothetical rounding modes such as "round away from nearest," but I have
not explored this.

Getting some nonlinearity. All transfer functions implemented with floating point have a finite range. For our experiments with neural networks, we will focus on transfer functions that map values in [−1, 1] to values in [−1, 1]. Almost half (48.4%) of floating point values are in this interval and this is a typical nominal range for activations in neural networks.

² There is seldom reason to change the rounding mode, and since it is a stateful act, you're asking for it if you do. But the round-to-negative-infinity and round-to-positive-infinity modes are useful
for interval arithmetic, which is arguably the only truly reasonable
way to use floating point. What you do is represent numbers as
intervals (low and high endpoints) that contain the true value, and
then perform each calculation on both endpoints. For computations on
the low endpoint, you round down, and symmetrically for the high
endpoint. This way, the true value is always within the interval, and
you also know how much inaccuracy you have accumulated!


We only have two operations: Addition and scaling. Let's see what kind
of rounding error each of these gives us. First, addition. In order to
get a function that takes values in [−1, 1] to values in [−1, 1], we want to first add a constant (giving us perhaps a large value) and then add a negative constant, bringing us back in range. For example, the constant 128 gives us the function

f(x) = x + 128.0 − 128.0

This is of course mathematically the same as f(x) = x (the
identity), but with half precision we get a function that looks like
this

Between 128 and 256, only multiples of 0.125 are representable. So for arguments in 0 to 1, the sum is rounded to one of the values 128.0, 128.125, 128.25, . . . , 129. From 64 to 128, multiples of 0.0625 (1/16th) are representable. So from −1 to 0, we get 127.0, 127.0625, 127.125, . . . , 128. Subtracting 128, all of the values are exactly representable, giving us −1, −0.9375, . . . , −0.0625, 0, 0.125, . . . , 0.875, 1.

The result is a step function, but whose resolution is twice as high for
the negative range as the positive; had we added −128 and then added
128, we would have seen the opposite bias in resolution. We can easily
see that this function is (computationally) non-linear despite being
(mathematically) "linear." This function is unlikely to be a good
transfer function, because for one thing it does not have a good
derivative: It's zero most places (flat segments) except at the
discontinuities, where it is undefined. We do test this approach (with
the constant 64.0) later, though.
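This effect is easy to reproduce: numpy's float16 is IEEE-754 binary16 with round-to-nearest, so a minimal sketch (not the paper's code) shows the step function directly:

    import numpy as np

    # f(x) = (x + 128.0) - 128.0, computed in half precision. Each operation
    # rounds to the nearest representable half, producing the step function above.
    x = np.array([-0.97, -0.33, 0.3, 0.77], dtype=np.float16)
    f = (x + np.float16(128.0)) - np.float16(128.0)
    print(list(zip(x.tolist(), f.tolist())))
    # Negative inputs land on multiples of 0.0625; positive ones on multiples of 0.125.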

Scaling gives similar results. Consider

f(x) = x × 100.0 × (1.0/100.0)

In this project we never actually divide (although this would not violate linearity) since most floating point numbers have approximate multiplicative inverses, and many are exact. We just compute the reciprocal 1/100 ≊ 0.01000213623 ahead of time and multiply by that constant. Here's what that function looks like:

At this scale it appears linear, but it does have small imperfections (see zoomed region). The function is symmetric about zero, since multiplication will do the same thing to a positive number as it does to its negative counterpart. Here, the roundoff error differs with the magnitude. At inputs close to 1.0, the results of the first multiplication must round to the nearest multiple of 0.0625 (as in the additive example) but this error is scaled down by a factor of 100 when we multiply back to the [−1, 1] range. So it is almost invisible. For inputs close to 0.0, the error approaches zero. The effect is complex and depends on the constant we multiply by. For example, if we multiply by a power of two, this only affects the exponent, and so the result is exact.

Is that it? Of course not! We can apply these operations in
combination, and many times, to create more interesting functions. The
best approach I found in this simple family is to repeatedly multiply
the input by a number very close to one. Here's what happens if you
multiply the input by 0.99951171875 (which is the next number smaller than one, equal to 1 − 1/2048) five hundred times, and then scale back at the end:

f(x) = x × (1 − 1/2048) × (1 − 1/2048) × . . . (500 times) . . . × 1.3232421875

I call this the grad1 function.

Multiplying 1.0 by (1 − 1/2048) five hundred times in half precision yields 0.755859375 (mathematically it would be (1 − 1/2048)^500 ≈ 0.78333), so there is significant accumulated error. We set f(1.0) = 1.0 by multiplying by the inverse of this constant, which is 1.3232421875.
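A minimal sketch of grad1, using numpy's binary16 type rather than the paper's half.h (each half-precision operation rounds to nearest, as described above):

    import numpy as np

    C = np.float16(1.0) - np.float16(1.0) / np.float16(2048.0)  # 0.99951171875, exactly representable
    SCALE = np.float16(1.3232421875)                            # chosen so that grad1(1.0) == 1.0

    def grad1(x):
        y = np.float16(x)
        for _ in range(500):
            y = y * C        # each product is rounded to the nearest half value
        return y * SCALE

    print(grad1(np.float16(1.0)), grad1(np.float16(0.5)))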

Why does this result in the zig-zags? Multiplication by (1 − 1/2048) affects numbers differently. For values less than 6.1094760895 × 10^−5, the value is unchanged; we round back up to the original value. For all other finite inputs it produces a smaller value, but with rounding error that depends on the value. This error accumulates and becomes significant with many iterations (Figure 2). Unlike the previous functions, the output here is much smoother (it looks piecewise-linear); in each of these segments its derivative is nondegenerate. Of course, this function is mathematically linear. It is equivalent to f(x) = x × 1.036535.

Figure 2: How repeatedly multiplying by 1 − 1/2048 affects values in [0, 1]. The width of the image is the interval [0, 1], with zero at the left.

Top: In the topmost row, we assign each pixel a hue so that we can track where those values go. For each pixel, we successively multiply by the constant and plot its color in its new x position, then move to the next row down. Note that the rainbow shrinks exponentially as expected, but not smoothly. The black line is 500 iterations.

Bottom: The accumulated error when iteratively multiplying by the constant. Here the x coordinate of the value does not move (so the middle column always represents the value that was originally 0.5). The color illustrates the accumulated error. For green pixels, the value is too high compared to the mathematically correct one; for magenta pixels too low. By choosing a row with alternations between green and red, we get the zig-zag pattern of the grad1 transfer function.

So now we have a "good" candidate function, which we'll call grad1. It is "good" in the sense that it is computationally non-linear despite being mathematically linear, so it may prove my professor wrong. On the other hand, it requires 501 floating point multiplications to compute, which is kind of slow. The "good" news is that since there are only 65536 16-bit values, we can easily just precompute any function for all possible half inputs, and store it in a table of 131072 bytes. This allows us to execute the function efficiently when performance is important, such as during training. (Table lookup is certainly not a mathematically linear operation, so when we require the computation to be linear for ideological purposes, we can perform the 501 multiplications and get the same result.)
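A minimal sketch of the table-based version, vectorized over every 16-bit pattern (constants as in the previous sketch; this is illustrative, not the paper's implementation):

    import numpy as np

    C = np.float16(1.0) - np.float16(1.0) / np.float16(2048.0)
    SCALE = np.float16(1.3232421875)

    # Every 16-bit pattern, reinterpreted as a half; this covers the whole domain.
    vals = np.arange(2**16, dtype=np.uint16).view(np.float16).copy()
    for _ in range(500):
        vals = vals * C              # elementwise half multiplication
    table = vals * SCALE             # 65536 halfs = 131072 bytes

    def grad1_fast(x):
        # Index the table by the bit pattern of the (half) input.
        return table[np.float16(x).view(np.uint16)]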

Differentiating. Speaking of training, in order to train a neural network using stochastic gradient descent, we need to be able to evaluate the derivative of the transfer function at any point. We use that derivative to decide what direction to move the parameters (it gives us the "gradient" that we "descend") as we propagate errors back through the network. There is an annoyance here, or if you like, an opportunity for a trick. We typically store the activation of each node, which is the output of the transfer function, but the derivative of a function is normally described in terms of the input (for example we say if f(x) = x² then f′(x) = 2x). We could store both the input and output for this step, or store only the input and recreate the outputs by running the transfer function. But the trick: We can compute the derivative as a function of the output. For f(x) = x² we could say f′(x) = 2·√(f(x)). Oops! That doesn't actually work for x² because the square root could either be negative or positive, and the derivative is different depending on which one it is. In order for this trick to work, the transfer function has to be injective.³ Fortunately this is the case for the classic transfer functions, and this trick is well known so you don't even need to do any math; you just look the function up.

For new transfer functions like grad1, we need to figure something
out. This function does appear injective if we squint at it, although
it is not really injective if you zoom way in: There are some distinct
inputs that result in the same output due to rounding. But this is
true for almost all floating-point functions already. I'll be damned
if I can come up with an analytic derivative for this thing, though.
At best it would be some piecewise linear thing, requiring some table.
Since our domain is only 16-bit, it is completely practical to just
table the entire derivative (keyed by the output value, as we need). I
do this programmatically. We do not want the derivative to reflect the
step function that we see at very fine scales (the derivative should
never be 0 for this function, for example), so I use a lowpass filter.
The

³ Or at least when f(x1) = f(x2), f′(x1) = f′(x2). For the rectified linear unit, for example, all negative inputs are mapped to zero. But the derivative is also just zero in this entire region.


Figure 3: Computed derivative (blue) of the grad1 function. Since we
need the derivative in terms of grad1's output, the derivative is
oriented along the y axis; each blue dot's x coordinate gives the
derivative at the point on the black line that shares a y
coordinate. It's an oscilloscope!

result looks good, oscillating between two different slopes as
expected (Figure 3). The derivative is loaded into GPU memory during
training and the table lookups are plenty fast.

4.1 Bonus digression: Downshift

Having freed myself from needing to "do math" in order to differentiate exotic functions, I pondered other weird transfer functions. For example, the rectified linear transfer function is very simple and works well, but is it the fastest possible transfer function that might work? It does involve a conditional, which naïvely implies comparison and branching (although probably most processors can do this with a conditional move). Because the floating point format is packed with fields that represent different things, many simple operations on its bits have interesting non-linear behavior. The most promising I found was a right shift by two places. It looks like this:

Shifting is about the cheapest possible thing a processor can do. Its behavior on floating point numbers is interesting:

Note the different regions for sign, exponent, and mantissa. The sign bit is shifted into the exponent, which means that the output is always non-negative (like the rectified linear function) and is non-linear (discontinuity at zero, as negative numbers have a much larger exponent than positive ones). Further nonlinearity comes from the exponential representation (shifts divide the exponent by four) and reinterpretation of exponent bits as mantissa bits. There is additional weirdness in the details. Shifting by two places is better than one, as it cannot produce Inf or NaN. We will also evaluate this transfer function, called downshift2, below.
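A minimal sketch of downshift2 in numpy (the paper's own code is C++/OpenCL; this bit-view version is only illustrative):

    import numpy as np

    # Reinterpret the 16 bits of a half as an unsigned integer, shift right by two,
    # and reinterpret the result as a half again.
    def downshift2(x):
        bits = np.float16(x).view(np.uint16)
        return np.uint16(bits >> np.uint16(2)).view(np.float16)

    for v in [-1.0, -0.25, 0.0, 0.5, 1.0]:
        print(v, float(downshift2(v)))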

Back to the main topic. I implemented all this as a modification of my
custom neural network training and inference system, "Tom7Flow."
Tom7Flow is generally much worse than mainstream packages; it is based
on deprecated OpenCL technology, is prone to divergence or stagnation
during training due to naïve choices of hyperparameters, etc. But it
is at least well suited to silly experiments that take the form, "What
if deep learning but worse?" such as the current exercise. In order to
realize the idea completely, I modified the inference code to
calculate with half precision arithmetic (not just the transfer
function). This means that the trained networks can be executed using
only half-precision operations (and just addition and multiplication
by constants). Unfortunately, while my GPU supports half-precision
math natively, and OpenCL supports half-precision operations as an
extension [11], this extension is somehow not supported (??) by my
drivers, perhaps because OpenCL is so thoroughly deprecated. It does
support half precision as a storage format, which allows you to
write a full-precision float to a 16-bit value (rounding to half) or
read a 16-bit half into a float (all half values can be represented
exactly in full precision). So with this one operation it is
straightforward to implement half precision addition and scaling. You
maintain the invariant that any float value is always exactly a half,
and after you perform addition or multiplication, you round to half
(by storing in a 16-bit memory location and reading it back). This
definitionally produces the same results as the native operation.4
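
Here is a minimal sketch of that invariant, assuming a compiler that provides the _Float16 type (the actual implementation goes through OpenCL's half storage operations and the half.h library of footnote 4):

#include <cstdio>

// Round a float to the nearest half by storing it in a 16-bit location
// and reading it back; all halfs convert to float exactly.
static float RoundToHalf(float x) {
  _Float16 h = (_Float16)x;
  return (float)h;
}

// Addition and scaling that behave exactly like native half operations,
// provided the inputs already satisfy the "is exactly a half" invariant.
static float HalfAdd(float a, float b)   { return RoundToHalf(a + b); }
static float HalfScale(float a, float c) { return RoundToHalf(a * c); }

int main() {
  // The spacing of halfs near 2048 is 2, so the +1 is rounded away.
  std::printf("%g\n", HalfAdd(2048.0f, 1.0f));  // prints 2048
  return 0;
}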

I initially tried a version of training that worked entirely using
half precision (network parameters are half, backpropagated errors
and update values are half, etc.). This worked badly. It is
ideologically unnecessary, as we just care about producing a final
model that, during inference, only executes linear half-precision
operations (but abuses floating point roundoff to do something
interesting). This network can be trained using non-linear techniques
(and must anyway, since for example its computed derivative is not
linear). So during training, calculations are done using
full-precision floats, except for the forward step (where we round to
half after every operation). In addition to being simpler,
representing intermediate learned weights as floats seems to help
training approach the final half values smoothly, avoiding stalls due
to underflow.

4I also verified consistent results using the half.h software
implementation. Many of the evaluation results quoted in the paper are
actually executed on the CPU using this library.


4.2 Neural network experimental results

In order to evaluate this transfer function, I ran a suite of
benchmark problems. For each problem, I compare the same network
architecture (i.e. the number of layers, their connectivity, random
initialization, etc.) but using different transfer functions.

The transfer functions are:

grad1: The "linear" transfer function grad1 described above.

tanh: The hyperbolic tangent function, which is a classic saturating
(output is always in (−1, 1)) sigmoid.

logistic: The function 1/(1 + e^−x), another classic sigmoid
(but whose output is in (0, 1)). Each operation is performed with
half precision.

leaky relu: The rectified linear unit, but with a small slope below
zero: x < 0.0 ? 0.1 * x : x. This is the function I usually prefer
in practice; its advantage over the standard relu is that it does not
"die" (zero propagated error) when its input is negative.

downshift2: Interpreting the half-precision input as a 16-bit word,
right shift by 2 places, then reinterpret as half.

plus64: f(x) = x + 64 − 64. This is about the simplest function
that has obvious rounding error. It only outputs 25 distinct values in
[−1, 1], so its derivative is degenerate; I use its
"mathematical" derivative f′(x) = 1.⁵

identity: The function f(x) = x. This is an important
comparison because it shows us what a "true" linear (both
mathematically and computationally) network is capable of.

Flattened models. For the transfer functions that are mathematically
linear, we can also compute the equivalent linear model. This just
consists of a single dense layer, using the identity transfer function,
that computes the linear function of the input. These appear in the
results as "flat" variants.

MNIST. The first problem is the Modified National Institute of
Standards and Technology handwriting dataset (MNIST). This is a
standardized dataset of handwritten digits (0–9) as 28×28 greyscale
images. This is chosen partly for trollish reasons. It dates from
1998, and even at the time of publication, accuracy with neural
networks

5Learning with this function might work better if we instead
approximate the derivative by something non-constant, like by computing
the derivative of a smoothed version. However, due to implementation
tricks in Tom7Flow, we need a derivative that is expressed in terms of
the transfer function's output (i.e. g(f(x)) = f′(x)); we
would not be able to express the smoothed derivative because there are
only 25 distinct values of f(x) in the [−1, 1] range!

transfer function    flat    accuracy
logistic                     98.20%
tanh                         98.93%
leaky-relu                   99.39%
plus64                       82.66%
grad1                        97.29%
identity                     81.96%
downshift2                   94.45%
plus64                ×      82.01%
grad1                 ×      39.19%
identity              ×      81.98%

Figure 4: Results on the standardized MNIST data set. Accuracy is the
fraction of results from the held-out test data for which the
highest-scoring class (digit) is the correct class.

(98.4%) and other techniques (99.2%) were already extremely high
[15].

For this problem, I augmented the dataset by randomly offsetting the
training images by up to two pixels in any direction, and by adding
Gaussian noise. The model's input layer is just the 28 × 28
greyscale values, and the output is a prediction for each of the ten
digits. The models had two convolutional layers (64 3×3 features,
fully overlapping + 128 8×8 features, fully overlapping; then 32
128×128 features + 32 256×2 features with no overlap), then two
sparse layers of 1024 nodes each, then a final dense output layer. The
same initial weights and connectivity were used for each experiment.
Internal layers use the transfer function being evaluated, but the
output layer always used the identity transfer function. This is not a
good choice for this problem (softmax makes more sense since the
output is categorical), but I wanted the linear models to be truly
linear. Using the same transfer function would have also disadvantaged
functions with limited output range; downshift2 for example can
technically output 1.0, but only for very large inputs (8192.0). The
final identity layer can easily scale the useful range of the transfer
function to the nominal range of the output. (This is essential for
the chess problem below, where the output instead ranges over [−1, 1].)

See the source code for various hyperparameter settings (although if
you are trying to learn good settings for hyperparameters, my code is
not the place to look). I used the ADAM weight update trick [12],
which does give me much better results than plain SGD in my
experiments.

Results for MNIST are in Figure 4. A nice bug appears in Figure 5.

CIFAR-10. Another classic dataset comes from the Canadian Institute
For Advanced Research. They capitalize "For" so that the acronym can
be pronounced nicely. I mean, to be fair, MNIOSAT would have a certain
ring to it too. This dataset contains 60,000 RGB images of size 32 ×
32, that are labeled into 10 different spirit animals: Airplanes,
cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks
[14]. It is very similar to the handwriting


Figure 5: Bug.

transfer function    flat    accuracy
logistic                     56.83%
tanh                         67.82%
leaky-relu                   73.11%
plus64                       43.60%
grad1                        53.56%
identity                     41.07%
downshift2                   46.54%
plus64                ×      32.76%
grad1                 ×      30.58%
identity              ×      41.04%

Figure 6: Results on the standardized CIFAR-10 data set. As with
MNIST, accuracy is the fraction of results from the held-out test data
for which the highest-scoring class is the correct class.

problem, but more challenging (state of the art accuracy is more like
96.5%). You would struggle sometimes to figure out what these tiny
thumbnails are, to be honest. Like with MNIST, I augmented the
training set by randomly shifting the images and adding Gaussian
noise. The network structure is the same as in the MNIST problem,
except that in the first convolutional layer, each window is three
times as wide to account for the three color channels.

Results for CIFAR-10 appear in Figure 6. One of the nice things about
using standard problems is that we can understand how the results
stack up against other researchers. Consulting a leaderboard of
public results [4] I see that the worst publicly known accuracy for
CIFAR-10 is 75.86% [21]. The best result for the current work, using
the sensible Leaky Relu transfer function, is 73.11%. So this is. . .
last place. That's actually pretty good; last place is the last winner
(or the first winner, when counting from the end). Not to mention that
we can get into even laster place by using the other exotic transfer
functions. Even putting aside their aesthetic appeal, I feel that
these inferior transfer functions are an important contribution to
the field, as it seems to me that AI is getting too good, and too
fast! Let's take it easy there, guys!

Chess. This problem attempts to learn a good evaluation function for
chess boards. Training examples are real chess positions (from the
Lichess database) evaluated by a strong chess engine (Stockfish
[30]). Stockfish generates two classes of scores: "Mate in N" if one
side is known to have a series of N moves that wins (but "Mate in 1"
is still better than "Mate in 4"), or a more subjective score,
measured in pawns. (The score in pawns can seemingly be higher than
64, which is kind of funny because how are the pawns gonna fit on a
64-square board? DUAL WIELD?6) Mate is of course categorically better
than the pawn score, as it is exact. Anyway, I squash this score into
the range [−1, 1] and that becomes the training instance. This
network's first layer has 256 3×3 convolutional features, overlapping,
as well as 32 1×1 and 128 8×1 and 1×8. Each of these is measured in
terms of squares on the board, but each square actually corresponds to
13 inputs, for the 13 possible things that can be in that square
(exactly one set to 1.0). We also have some non-square inputs, like
the castling privileges and en passant state. So it's not just the
convolutional features but some sparse nodes too. And then we have
some more layers (you can check out the source code if you really care
about these details, which I doubt!) and then a final dense layer with
a single output using the identity transfer function as before. No
training


6Here's an idea for a SIGBOVIK paper: What's the highest scoring
chess position, according to Stockfish, for which it cannot deduce
mate? One logistical challenge is that it seems to top out at +99,
such as on this position (still no mate at depth 89).


[Chess diagram: the +99 position referenced in footnote 6.]

data augmentation here (we have a basically limitless supply of
positions to train on), but I do normalize the board so that it is
always white to move.

For chess we can compute the accuracy, comparing to Stockfish as ground
truth (Figure 7). We can also use the evaluation function to play chess.
These chess "engines" just look at the possible legal moves and take the
move that is most favorable, using the learned evaluation function (no
game tree search). Playing against the best of these ("leaky") it
subjectively makes decent moves most of the time and can even beat me
playing casually. I noticed that it had a lot of trouble "sealing the
deal" in totally winning positions (which is not unusual for engines
that don't do game-tree search or use endgame tables), but the problem
was actually more shallow: Due to a bug7 in the way training examples
are gathered, the models were never exposed to checkmate or stalemate
positions! Since training takes several days per function and the
iron-fistedly punctilious SIGBOVIK deadlines were imminent, there simply
wasn't enough time to retrain them with access to these positions.
However, since mate is a mechanical fact of the game (like what moves
are legal) it seemed reasonable to fix this in the engine itself: When
considering all the legal moves to make, it infinitely prefers a move
that results in checkmate, and considers a move resulting in stalemate
to have score 0.0, and otherwise uses the evaluation function. These
"fix" versions of each engine perform very significantly better,
although they likely overestimate the performance we'd get by actually
fixing the model; there's no guarantee that it would be able to
accurately recognize mate, and the fixed versions' greedy strategy of
taking mate in 1 is always advantageous.

These players compete against each other as well as the engines from
the Elo World project [26], giving a sense of their strength on an
absolute scale (Figure 8). The raw versions perform reasonably; they
all work better than a simple engine like "take the move that
minimizes the number of moves the opponent will have"
(min_oppt_moves). The fixed versions are much better, as expected. The
"linear" engine using the grad1 transfer function is competitive
with the NES Chessmaster engine, and outperforms a 50% dilution of
Stockfish. This is pretty solid given that it is doing no explicit
game tree search. In fact (aside from the wrapper implementing the
rules of chess and finding the maximum eval score), it is only
performing a fixed expression of floating point addition and scaling!
We could make this even more ideologically pure using techniques from
Section 7.3.

What transfer function is best? The results on each of these problems
are similar: The "leaky rectified"

7I used the annotations like [%eval #12] that appear on moves for
many games in the Lichess database. I didn't notice that they do not
appear on a game-ending move like Qh4#! This does sort of make sense,
because the eval scores would have to be [%eval #+0] ("mate in 0")
or [%eval #-0] (necessitating use of the floating point coprocessor)
to express the winner, and there does not seem to be a natural way to
express the definite value of stalemate.

transfer function    flat    loss     accuracy
logistic                     0.168    72.046%
tanh                         0.117    78.527%
leaky-relu                   0.118    78.172%
plus64                       0.162    75.406%
grad1                        0.111    78.924%
identity                     0.161    75.975%
downshift2                   0.211    68.066%
plus64                ×      0.161    75.779%
grad1                 ×      0.527    58.187%
identity              ×      0.161    75.975%

Figure 7: Results of learning Stockfish's position evaluation
function. Stockfish scores are normalized to a [−1, 1] scale,
and loss here is the average (L1) distance between the predicted
score and actual Stockfish score, on some 100,000 positions from games
not in the training set. Accuracy is the percentage of predictions
whose sign agreed with Stockfish (e.g. they both agree the white
player is winning).

Figure 8: Results of a chess tournament. Players include ones based on
the learned position evaluation with different transfer functions;
these players simply take the move that results in the most favorable
eval (no game tree search). They compete with some standardized
players from the Elo World project [26]. Rows represent the player
as white, columns as black. A green cell means that White generally
wins; blue a draw; red a loss. An × in a cell means that this
outcome occurred in every game. The left column is the resulting Elo
rating [6]. The best model leaky fix performs decently well, similar
to NES Chessmaster or a Stockfish diluted to about 60% strength with
random moves (both of these engines perform game tree search). The
centerpiece of the paper is the "linear" grad1 transfer function; here
its learned chess player slightly outperforms Stockfish diluted to 50%
strength with random moves.


transfer function is generally best or close to best. The identity
transfer function, which yields a simple linear model, is generally
worst or close to worst. The sigmoid functions are all over the place.
It is known that they are prone to vanishing gradients in deep
networks, and I may simply have unfavorable hyperparameter settings
for them. The experimental downshift2 function is generally bad,
perhaps because its output is strictly positive or it has such a small
dynamic range. Its shape also seems prone to the vanishing gradient
problem. The small amount of nonlinearity introduced by plus64 does
appear to give it a small edge over the identity, but its lack of an
interesting derivative and the fact that it only produces a small
number of output values are limiting. Importantly, the grad1
function---the centerpiece of the first third of this paper---performs
decently on all problems. It clearly outperforms the linear models,
despite being "linear."

It is also interesting to compare the flattened versions of the linear
transfer functions. These are the computed (mathematically) equivalent
single-layer linear models. For plus64 the flattened version is worse
in all cases; the unflattened model is taking advantage of the
discretization in some way. For grad1 it is dramatically worse, both
because grad1 models are substantially using the roundoff error and
because the mathematical version of this function (f(x) = x ×
1.036535) is not even a good linear approximation of the actual
result (e.g. grad1(1) = 1). Finally, the result for the identity
transfer function should be mathematically equivalent, but it does not
always produce the same results. This is unsurprising since we know
that floating point calculations are not perfectly accurate, but it
does hint that deep networks may make use of floating point roundoff
internally, even if they are not using silly transfer functions!

Having proved the professor wrong, we could stop there, but did huge
mathematical breakthroughs ever arise from taking the option to stop
there?!

5 Non-monotonic functions

Because of the way that addition and scaling are defined (do the real
mathematical operation, then round), they preserve monotonicity: If
x ≥ y, then f(x) ≥ f(y). But this is only true if we limit
the form of the function to a series of additions of constants and
(non-negative) scaling. There are other expressions that are
mathematically linear but don't take that form; for example:

f(x) = x − 4096 − x + 4096

This is of course mathematically equivalent to f(x) = 0.0, but
with half precision it is a square wave function (here pictured on
[−8, 8]):

For some values of x the terms cancel out, and for others the
rounding error compounds. This function is not as well behaved as it
appears; the first pulse has width 0.99609375 and the second has
width 1.
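
Here is a sketch of that evaluation, assuming a compiler with the _Float16 type so that every intermediate result is rounded to half:

#include <cstdio>

// f(x) = x - 4096 - x + 4096, evaluated left to right in half precision.
static _Float16 SquareWave(_Float16 x) {
  _Float16 y = x;
  y = y - (_Float16)4096.0f;  // nearby inputs collapse to the same value here
  y = y - x;                  // mathematically -4096, but roundoff remains
  y = y + (_Float16)4096.0f;  // mathematically 0.0; the residue is the wave
  return y;
}

int main() {
  for (float x = -8.0f; x <= 8.0f; x += 0.25f)
    std::printf("f(%+.2f) = %g\n", x, (float)SquareWave((_Float16)x));
  return 0;
}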

Here is f(x) = grad1(x) − x, which is also linear:

Generally speaking, we can create a large variety of functions by
computing the interference patterns between other functions, since the
sum or difference of two "linear" functions is also "linear." In
general we'll consider expressions of this form:

E ::= x
    | E × c
    | E + c
    | E + E

Where x is the function variable, and c is one of the 63,488 finite
half-precision constants. We can derive negation (E × −1) and
subtraction of constants (E + −c) and expressions (E + (E × −1)),
since every number has an exact negation by flipping its sign
bit. Exact division is possible when 1/c is representable, and
there is almost always a close approximation.
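
For concreteness, here is a small sketch (my own code, again assuming _Float16) of this expression language, with an evaluator that rounds after every operation:

#include <cstdio>
#include <memory>

// E ::= x | E × c | E + c | E + E
struct Expr {
  enum Kind { VAR, SCALE, ADDC, ADDE } kind = VAR;
  float c = 0.0f;              // the constant, for SCALE and ADDC
  std::shared_ptr<Expr> a, b;  // children
};
using E = std::shared_ptr<Expr>;

static E Var() { return std::make_shared<Expr>(); }
static E Scale(E e, float c) {
  auto n = std::make_shared<Expr>();
  n->kind = Expr::SCALE; n->c = c; n->a = e;
  return n;
}
static E AddC(E e, float c) {
  auto n = std::make_shared<Expr>();
  n->kind = Expr::ADDC; n->c = c; n->a = e;
  return n;
}
static E AddE(E x, E y) {
  auto n = std::make_shared<Expr>();
  n->kind = Expr::ADDE; n->a = x; n->b = y;
  return n;
}

// Both operands of every + and * are _Float16, so each step rounds to half.
static _Float16 Eval(const E &e, _Float16 x) {
  switch (e->kind) {
    case Expr::VAR:   return x;
    case Expr::SCALE: return Eval(e->a, x) * (_Float16)e->c;
    case Expr::ADDC:  return Eval(e->a, x) + (_Float16)e->c;
    case Expr::ADDE:  return Eval(e->a, x) + Eval(e->b, x);
  }
  return x;  // unreachable
}

int main() {
  E plus64 = AddC(AddC(Var(), 64.0f), -64.0f);  // plus64 from Section 4
  E third = Scale(Var(), 1.0f / 3.0f);
  std::printf("plus64(0.3) = %g\n", (float)Eval(plus64, (_Float16)0.3f));
  std::printf("third(0.3)  = %g\n", (float)Eval(third, (_Float16)0.3f));
  return 0;
}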

This formulation leads to a tempting approach for approximating a
function iteratively, like a Taylor series. Given a target function
like sin(x), we can begin with an approximate expression for it,
like x, and then add and subtract terms to improve the
approximation. I don't know of any systematic way to improve the
approximation at each step (they are not well-behaved mathematically,
and I am not good at math), but by using computer search I can sure
make some complicated functions with many different shapes.

An approximation of sin appears in Figure 9. It is fun to watch an
animation of the successively improving approximations, but you can't
see that since you're reading an old-fashioned paper. Perhaps you can
find a video of this at tom7.org/grad.

5.1 Fractals

Next, I endeavored to deploy these functions for something useful:
Fractals. Famously, fractals are simple functions with complex (often
literally) behavior. For example, the Mandelbrot set considers each
complex point c (plotted on the plane as x + yi) and computes
whether zᵢ = zᵢ₋₁² + c diverges or not. It's lovely,
but squaring is not linear!


Figure 9: Successive approximations of the sin function, as color
interpolates from green to blue.

What if we just create a linear function that approximates f(x) =
x²? This is definitely possible, using the approach described
above. After 184 successful error-reducing rounds we get the following
approximation, with 112,204 linear operations:

Aside from the funny business near the origin, this is a fairly accurate
approximation of the square function, so you might hope that it would
draw a perverted Mandelbrot set. Unfortunately, it produces a much
sadder blotch (Figure 10). To see why, consider the normal definition of
squaring for a complex number:

(a + bi)² = a² + 2abi + b²i²
          = a² + 2abi − b²

Note that the real coefficient a ends up part of the imaginary
coefficient 2ab in the result, and the imaginary coefficient b
becomes part of the real part (because i² is real). This means that
squaring a complex number cross-pollinates between the two components,
yielding a kind of wacky rotation if we think of them as 2D coordinates.

Figure 10: A garbage "fractal" that results from trying to approximate
squaring of complex numbers using linear complex operations. Alas, it
cannot be done. The complex numbers are truly special.

But here, squaring is approximated as a series of operations of the
form w₁ + w₂ and w × c for constants c. These operations
on complex numbers are less interesting:

(a₁ + b₁i) + (a₂ + b₂i) = (a₁ + a₂) + (b₁ + b₂)i
(a + bi) × c = ac + bci

Alas, these operations are boring; the real parts always stay real and
the imaginary parts always stay imaginary. This is why the crummy
blotch has all sorts of vertical and horizontal lines in it: As we
iterate the function we are iterating two independent components, and
the resulting picture is just some interference pattern between them.

This seems pretty definitive. Even if we had some kind of hardware
implementation of complex numbers with rounding error to abuse, there
would be no reason to have the linear operations do any
cross-pollinated rounding. Professors take note: The complex numbers
do provide some refuge!

Still, a lot of chaos can emerge from these functions that should not
be possible with "linear" ones. For example, here is a complicated
function made by stringing 36,637 addition and scaling operations
together:


Iterating this function produces chaotic results because of its
nonmonotonicity. In Figure 11 I plot (using color) the magnitude of
z after 256 iterations of

zᵢ = f(zᵢ₋₁) × c

This is mathematically linear (as c is a constant and f a linear
function). Nonetheless, it produces an appealing picture. I think this
is a fractal in the sense that it is chaotic, has a color gradient, and
could be on the cover of an electronic music album. It is not a fractal
in the sense that if you zoom in on it, you get infinite detail of
self-similar shapes. In fact, if you zoom in on it only a modest amount,
you encounter rectangular pixels as you reach the limits of
half-precision floating point. (And because this fractal is built by
abusing those very limits, it is not even possible to get more detail by
increasing the accuracy!)

5.2 Bonus digression: Baffling numbers

Imagine you are my professor. You assign a class project to "make
fractals using floating point roundoff error," for some reason. You spot
me in the computer lab and I'm obviously way off track, because
on-screen is some kind of 3D fractal. The Mandelbrot set cannot be
extended to three dimensions, you say, because of the Frobenius
theorem: Only algebras of dimension 1 (real numbers), 2 (complex
numbers) and 4 (quaternions) work [8]. Unclear how the professor
speaks the citation aloud in this scenario. I say I "know" this fact,
but I "don't care." You say that my three-dimensional algebra can't be
associative, because that's "just a mathematical fact." I say you know
what else isn't associative? The floating point numbers, my dude.

Enter the baffling numbers, ill-advised extensions of the complex
numbers to three dimensions. Here we have numbers of the form a +
bi + cj. Addition is just pointwise, and there are several options
to complete the story for multiplication, namely the values of the
cells U, V, and W in this table:

×   1    i    j
1   1    i    j
i   i   −1    U
j   j    V    W

Figure 11: A fractal made from iterating a "linear" function f. The
color is the magnitude of z₂₅₆ with zᵢ = f(zᵢ₋₁) × c,
where c is the complex coordinate x + yi, a constant.

The cells U, V, and W are baffling numbers (i.e. each some
a + bi + cj). Some choices are degenerate, but this gives us a
family of options. It is known that no matter the choices, this does
"not work" (in the sense that the resulting algebra is not
associative8), but we don't need associativity to draw fractals.
Plus, who's gonna stop me, FROBENIUS??

I tried a few options, but thought that U = i, V = j and W = 1
produced satisfyingly trippy results. The Mandelbrot is
straightforwardly generalized to the "Bafflebrot" (the starting point
c is just a baffling number now; everything else is the same). I
generated a 3D object by defining an implicit surface based on whether
a sampled point is still inside the set after 25 iterations, using
Marching Cubes [17] to discretize it. The resulting mesh is 2
gigabytes and crashes every library that attempts to programmatically
simplify it. I do admire and encourage its defiant spirit. A rendering
appears in Figure 12.
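
As a sketch of the arithmetic involved (my own code, with the choice U = i, V = j, W = 1 from above), baffling multiplication distributes over the components and then applies the table:

#include <cstdio>

struct Baffling { double a, b, c; };  // represents a + b*i + c*j

// Distribute and apply the table: i*i = -1, i*j = U = i, j*i = V = j,
// j*j = W = 1. (The product depends on operand order, and the algebra
// is indeed not associative.)
static Baffling Mul(const Baffling &x, const Baffling &y) {
  return Baffling{
      x.a * y.a - x.b * y.b + x.c * y.c,   // real part
      x.a * y.b + x.b * y.a + x.b * y.c,   // i coefficient
      x.a * y.c + x.c * y.a + x.c * y.b};  // j coefficient
}

static Baffling Add(const Baffling &x, const Baffling &y) {
  return Baffling{x.a + y.a, x.b + y.b, x.c + y.c};
}

int main() {
  Baffling c{0.1, 0.2, 0.3}, z{0.0, 0.0, 0.0};
  for (int i = 0; i < 25; i++) z = Add(Mul(z, z), c);  // Bafflebrot iteration
  std::printf("%f %f %f\n", z.a, z.b, z.c);
  return 0;
}

The Bafflebrot iteration is then exactly the Mandelbrot iteration, squaring z and adding the starting point c, just with this three-component multiplication.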

Drawing fractals is fun and everything, but I grew weary of the
exercise because there is no real goal other than to make a cool
picture. Instead I turned to something with a clearer challenge to
overcome: Linear Cryptography.

6 Linear cryptography

Cryptography is like fractals minus drugs. One of the most basic
components of cryptography is a pseudorandom

8Or else is equivalent to the complex numbers.


Figure 12: The 3D "bafflebrot" sliced in half and projected to 2D. This
fractal was created with the "illegal" number system called the
baffling numbers. They're like the complex numbers but more so. The
object is truncated along its j axis, showing a perfect ripe Mandelbrot
inside.

number generator. This kind of function takes some state and produces a
new state that "looks random." Given a pseudorandom number generator,
we can construct one-way functions ("hash functions") and from those
we can make symmetric ciphers (using, say, a Feistel network), with
which we can encrypt and decrypt data.

Another thing that professors will tell you about cryptography is that
good cryptographic functions cannot be linear. In this context, linear
includes in a finite ring like Z256 or (especially) Z2, i.e. bits.9
One good reason for this is that even if the function is a little bit
linear, then linear cryptanalysis can be used to recover bits
of the key with a lot of example data [19]. Standard advice is to
alternate both linear (e.g. XOR, or multiplication mod 2^n) and
non-linear (e.g. substitution) operations. ("[Substitutions] are
generally the only nonlinear step in an algorithm; they are what give a
block cipher its security."10) Of course we will prove this adage
wrong by developing a good pseudorandom function that uses only
"linear" operations on half-precision floating point numbers.

In terms of goals, pseudorandom number generation has a more clear
objective than fractals, although it's not so easy to pin down
formally. We don't even know if such functions exist, mathematically
[10], although there are generators that are provably secure
assuming some other problems are actually hard [5] (but these
problems are only believed to be hard). There exist many functions
that look like good pseudorandom generators, but that actually have
backdoors that make them easy to predict. (Iteration of

9So XOR is considered linear here, even though we previously observed
that there is no linear function on real numbers that fits it!

10Applied Cryptography, Second Edition, page 349 [31].

Figure 13: The substitution-permutation network that forms a half
decent pseudorandom number generator. The same substitution ("s-box")
is applied to each byte. Then the 64 bits are permuted. Finally, bytes
are combined with modular addition and subtraction. This function
passes the "Big Crush" suite and can be implemented with only
half-precision floating point addition and scaling.

a symmetric encryption algorithm like AES, with the key hidden, has
this property.)11 Practically speaking, though, we can subject the
function to a wide variety of statistical tests, and if it looks
random to every test, then this gives us good confidence.12

Specifically, my goal is to design an algorithm that takes 64 bits of
data (represented as half-precision floats) to another 64 bits, such
that the stream of low-order bits from iterating this function passes
the TestU01 "Big Crush" suite of 106 statistical tests [16]. This
suite is a successor to Marsaglia's "DieHard" battery of tests
[18], itself an improvement on Knuth's tests from The Art Of
Computer Programming [13].

The basis of this function is the classic substitution-

11And let us never forget that RSA DSI (yes, that RSA) actually did
take a $10 million bribe from the NSA to put a backdoor in one of
their pseudorandom number generators [22]!

12Truly good cryptographic algorithms are also openly studied by
experts. Of course nothing in here is to be used seriously, and not
just because these algorithms are ridiculously slow. But I guess if
you are stuck on a desert island with only the floating point addition
and scaling operations, and a copy of this paper, then it would be a
reasonable starting point for encrypting your messages. I do not
recommend, if stranded on a desert island, to send encrypted messages:
They may not be readable to your potential rescuers!


permutation network. First, each of the eight bytes is substituted with
a different byte using a table (this is the mathematically non-linear
step). Then, the 64 bits are permuted. Finally, some of the bytes are
modified additively. An illustration appears in Figure 13.
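
For orientation, here is a sketch of that round structure written with ordinary integer operations (in the spirit of the equivalent integer code used for the statistical tests in Section 6.1); the s-box and bit permutation below are placeholders of my own, not the searched-for tables:

#include <cstdint>
#include <cstdio>

// One round on a 64-bit state: substitute bytes, permute bits, then mix
// bytes with modular addition/subtraction.
static uint64_t SpnRound(uint64_t state) {
  // 1. Substitute each byte. (Placeholder s-box: any fixed byte permutation
  //    works structurally; the paper's came from computer search.)
  uint64_t subst = 0;
  for (int i = 0; i < 8; i++) {
    uint8_t b = (uint8_t)(state >> (8 * i));
    uint8_t s = (uint8_t)(b * 167 + 13);
    subst |= (uint64_t)s << (8 * i);
  }
  // 2. Permute the 64 bits. (Placeholder permutation: bit i -> 37*i mod 64.)
  uint64_t perm = 0;
  for (int i = 0; i < 64; i++)
    if ((subst >> i) & 1) perm |= (uint64_t)1 << ((i * 37) % 64);
  // 3. Combine bytes with modular addition and subtraction.
  uint64_t out = 0;
  uint8_t prev = 0;
  for (int i = 0; i < 8; i++) {
    uint8_t b = (uint8_t)(perm >> (8 * i));
    uint8_t mixed = (i % 2 == 0) ? (uint8_t)(b + prev) : (uint8_t)(b - prev);
    out |= (uint64_t)mixed << (8 * i);
    prev = b;
  }
  return out;
}

int main() {
  uint64_t s = 0x0123456789abcdefULL;
  for (int i = 0; i < 4; i++) s = SpnRound(s);
  std::printf("%016llx\n", (unsigned long long)s);
  return 0;
}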

The substitution table ("s-boxes") was generated by computer search
with an objective to maximize the "avalanche property" (when a bit of
the input is complemented, about half of the output bits should be
complemented). The permutation was generated to maximize dispersion;
each quartet sends each bit to a distinct quartet in the output. This
is not the important part. We could have just used known good tables.

To implement this with half-precision floating point, we could represent
each bit with its own half, but that is no fun. The state will be
represented with eight half-

precision floats, each representing one byte's worth of information.
Since we have been fixated on the [−1, 1] interval so far, a byte
will be stored as any value in [−1, 1), with each 1/128
interval representing one of the 256 values (0 is anything in [−1,
−0.9921875), 1 is anything in [−0.9921875, −0.984375), and
so on). This means that iterating the function on any starting value
in the [−1, 1) interval will produce pseudorandom results. So for
example we can guarantee that a "fractal" plotted using this function
will look "fully messed up" and not just have a few distinguished points
of randomness. I'll say now that this is unnecessarily hard; in the next
section of this paper we'll see a vastly more efficient approach for
handling discrete data. But working on the entire domain makes for some
challenging problems and shows that we've developed substantial mastery
of the continuous case.

Speaking of which, my first approach was to try to approximate the
substitution function (since it replaces one 8-bit byte with another, it
corresponds to a single discontinuous function of type half → half)
using the iterative approach described in Section 5.1. Although it is
possible to get reasonable approximations with this method (most values
are transformed to a value near the desired one), this will not suffice;
when iterating the function we find that the value easily gets stuck in
short cycles due to this inaccuracy.

I found a better approach, by creating a composable family of
functions that isolate specific intervals of interest. For example,

Within the interval [−1, 1), this function takes on exactly two
values: zero13 and 1/128. It returns 1/128 only for exactly the
interval [121/128, 122/128). This is the interval that represents
the number 249 (128 + 121; remember that the first 128 integers are in
[−1, 0)). The expression that computes this is

f(x) = ( ((x − 9/64) ×^14 1/512 − 1255/512) ×^164 1027/1024
       + ((x − 1/4) ×^14 1/512 − 597/512) ×^188 1277/2048 × −1 + 517/128 × −32
       + ((x − 17/128) ×^14 1/512 − 1255/512) ×^164 1027/1024
       + ((x − 1/4) ×^14 1/512 − 597/512) ×^188 1277/2048 × −1 + 517/128 × 32 ) × 1/32

where E ×^n c means E × c × c × c . . . for n iterations.
Mathematically this is equivalent to this constant function (all the
xs cancel out):

f(x) = 13^164 × 79^164 / 2^1656

I spent a long time writing code to simplify these expressions and
generate LaTeX for them, by the way! As usual, I thought it would
look cool when I got it working, but it just looks like a bunch of
numbers.

We can think of this function as a basis vector, representing the
256-dimension vector ⟨0, 0, 0, . . . , 0, 1, 0, 0, 0, 0, 0, 0⟩.
We'll call this one b249 since it selects the integer 249. If we
can find b_n for each n ∈ Z256, then we will be able to combine
them to systematically construct functions.

The one just pictured is one of the smallest expressions; most are
much larger. I wish I could tell you that I figured out the
principles underlying how to analytically generate these functions,
but I discovered them with computer search and some elbow grease.

Choppy functions. I call a function f "choppy" if for every
half-precision floating point value in [−1, 1) it has the following
properties:

For n ∈ Z256 and r ∈ [0, 1), f(−1 + (n + r)/128) = v
for the same value v. We only need to consider cases where n + r is
representable as a half.

v is itself of the form (n − 128)/128 for some n ∈ Z256.

For these purposes, we treat the single value −0 as being equal to 0.

And, as usual, the function is built only with floating point addition
and scaling by constants.

That is, the function produces the same result for any representation
of an integer, and that result is the smallest representation of an
integer. These functions are maximally useful in that they are
"liberal in what they accept,"

13Actually, −0!


but "conservative in what they return" [28]. It also means that each
function also can be understood as a function Z256 Z256, so we
can represent them as a vector of 256 integers. The basis vectors
b*n* are those that are of the form ⟨*0, . . . ,* 1*, . . . ,*
0*⟩*.

I then conducted computer search for choppy functions, putting those
into a database (keyed by the corresponding integer vector). Some are
easy to find, others harder. Summing and scaling choppy functions yield
choppy functions (as long as the vectors remain integral and in range),
so I use a simplified version of Gauss–Jordan elimination [32] to
solve for basis vectors. Once I have b_n, this column can be changed
at will for any existing choppy function (by just adding or subtracting
multiples of b_n), so new choppy functions that only vary in that
column can be ignored.

By trying a variety of operations that are known to be useful (e.g.
iterated multiplication of constants near 1.0) and hill-climbing
towards functions with the choppy property, it is not too hard to
find b_n for most n. It seems to become more challenging for n
near 128; this is the point 0.0 in half-precision. Specifically, the
hardest problem was to make a function that produced different results
for inputs < 0.0 versus inputs ≥ 0.0. This is the zero-threshold
problem.

Why is this hard? Distinguishing between negative and non-negative
numbers is deceptively difficult. Looking back to the function
f(x) = x + 128.0 − 128.0 (Section 4), it has useful
discontinuous steps, but note that the discontinuity does not happen
at zero. This is because we are rounding to the nearest value, and so
small negative numbers near zero end up rounding to the same result
that zero does. Moreover, the resolution of the floating point numbers
is highest near zero (especially because of subnormal numbers), which
exacerbates our attempts to control rounding of them. For example, you
might think that we could simply shift this function left and right
by substituting x + c for x in its body. This would work
mathematically, but it does not work for floating point numbers,
because each operation performs some rounding. If this rounding ever
ends up conflating a negative number with a non-negative one, we will
not be able to recover.

I found a zero-threshold function using a combination of manual and
computer search. This was some ordeal, and the resulting enormous
function is in Figure 14. Perhaps you are smarter than me and can find
a better one!

Substituting and permuting. In any case, with this function it was
possible to form a complete basis. This basis makes it "easy" to
perform operations on half values that represent bytes. For example,
the s-box step substitutes some distinct byte for each different
input byte. This would normally be implemented with a table lookup. If
we compute b_n(x) × subst[n], this returns the correct14
result subst[n] if the input x = n, and 0 otherwise. So if we
just sum all 256 of these up, exactly one of them will be

14Technically we need to do some multiplicative adjustments to put
the value in [−1, 1).

nonzero, and the correct substituted value.

Permutation is defined on the component bits. Here, we compose a
function that computes each of the eight output bytes. We use the same
approach of summing a bunch of b_n(x) evaluations (each
multiplied by the correct answer). Here we are testing whether the
input has some particular bit set (a sum of the 128 b_n(x)
functions where n has that bit set), and the output is the power of
two that sets the appropriate output bit. Many of these functions
would have simpler implementations (for example, "is the high-order
bit set?" is the same as the zero-threshold function) but at this
point I was happy to just have something working, and taking some joy
in how absurdly large the functions were getting.

The cipher also includes addition and subtraction mod 256. Addition
and subtraction are already available for half-precision floats, and
they have faithful behavior, so we just need to implement the
wrapping-around behavior so that the result is strictly in [−1, 1).
This is straightforward with the zero-threshold function;15 we
produce corrective factors if the result is ≥ 1 or < −1 (zeros
otherwise). We then add those corrective factors to produce the
remainder we desire.

6.1 Benchmark results

To evaluate the quality of the pseudorandom number generator, I used
the TestU01 "Big Crush" suite. This test needs a sample of 1.64
billion bits, so I actually evaluated it on equivalent code that
performs the steps using normal integer operations. Even then, the
suite takes several days to run, so I modified it to run tests in
parallel and cache the results of completed tests. This saved me from
losing data if my computer crashed or needed to be rebooted.

Results appear in Figure 15. Passing these tests does not ensure that
the pseudorandom number generator is good for cryptography, although
it is a good start.

Running single-threaded on a 3.0 GHz Threadripper 2990WX, this
function generates 25.8 bytes of randomness per second, which is slow.
By precomputing the substitution, permutation, and zero-threshold
expressions (so they can be performed by lookup into 64k-entry
tables), it generates 18,685.2 bytes per second, which is still slow.

If we were building an encryption algorithm (a symmetric block
cipher), it would be natural to use this as its "round function." In a
Feistel network [7], each input block (128 bits) is broken into two
halves; one of them is mixed with some key bits (for example with XOR)
and then passed to this function. Its output is XORed with the other
half; the two halves are swapped, and this "round" is repeated many
times until we believe that the data are suitably screwed up.
Decryption is the reverse. We can use addition and subtraction mod
2^8 to combine the data instead of XOR

15Compare to the remarks "why is this hard?" above. Here, zt(x −
1) does do what you'd want, shifting the threshold value from 0
to 1. This is because there is less precision near one than near
zero.


[Figure 14 content: the zero-threshold function zt(x), given as a long
chain of additions and scalings, followed by the equivalent single
mathematical expression with an enormous numerator.]

Figure 14: A zero-threshold function. Returns 1/128 for values
in [0, 1) (and −0) and 0 for values in [−1, 0). Top is the
series of additions and scalings to perform, all from left to right.
At bottom is the equivalent mathematical expression, but the enormous
numerator cannot be printed due to extremely oppressive SIGBOVIK page
limitations.

(which is addition mod 2^1), so we already have all the operations
we need to build a whole block cipher here. As Bruce Schneier
says,16 "It is easy to design a block cipher."
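
Here is a sketch of that Feistel construction (my own illustration; RoundFunction is a placeholder mixer standing in for the half-precision substitution-permutation network, and the round keys are combined with XOR):

#include <cstdint>
#include <cstdio>

// Placeholder round function; the paper would use the half-precision
// substitution-permutation network here.
static uint64_t RoundFunction(uint64_t x, uint64_t round_key) {
  x ^= round_key;
  x *= 0x9E3779B97F4A7C15ull;  // arbitrary mixing constant
  return x ^ (x >> 31);
}

static void FeistelEncrypt(uint64_t &left, uint64_t &right,
                           const uint64_t keys[], int rounds) {
  for (int i = 0; i < rounds; i++) {
    uint64_t f = RoundFunction(right, keys[i]);
    uint64_t new_right = left ^ f;
    left = right;
    right = new_right;
  }
}

static void FeistelDecrypt(uint64_t &left, uint64_t &right,
                           const uint64_t keys[], int rounds) {
  // Same rounds, keys in reverse order.
  for (int i = rounds - 1; i >= 0; i--) {
    uint64_t f = RoundFunction(left, keys[i]);
    uint64_t new_left = right ^ f;
    right = left;
    left = new_left;
  }
}

int main() {
  uint64_t keys[4] = {1, 2, 3, 4};
  uint64_t l = 0x1234, r = 0x5678;
  FeistelEncrypt(l, r, keys, 4);
  std::printf("encrypted: %llx %llx\n", (unsigned long long)l, (unsigned long long)r);
  FeistelDecrypt(l, r, keys, 4);
  std::printf("decrypted: %llx %llx\n", (unsigned long long)l, (unsigned long long)r);
  return 0;
}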

7 THE ULTIMATE THROWBACK

Having developed a basis for extracting arbitrary bits, we can express
any function of a single variable, and we've seen how some other
functions (like addition mod 2^8) can be done. At this point, it seems
like we probably have the building blocks to demonstrate that addition
and scaling on half-precision floats is Turing complete. I mean, pretty
much everything is Turing complete. In the past, I built computers that
were perfect and beautiful, such as a hardware implementation of the
NaNDY 1000, a computer architecture that computes using only floating
point NaN and Infinity [27]. In a concession to ideological purity,
though, the NaNDY 1000 has no I/O. So it is very boring to use.

For today's investigations of the capabilities of floating point, I'll
make the opposite concession: Let's make a computer that is exciting
to use, but that makes some (reasonable) ideological concessions so
that it can do something interesting.

16Applied Cryptography, Second Edition, page 351.

7.1 Fluint8

First of all, if we want to do some serious computation, 25.8 bytes
per second isn't going to cut it. To look for performance enhancing
substances, I perused the back catalog of the world's most prestigious
conference, SIGBOVIK. There in the 2018 edition, on page 125, I found
an intriguing paper, The fluint8 Software Integer Library, by Drs. Jim
McCann and . . . Tom Murphy VII? Wait, that's me? I already
wrote this paper?! [20]

The fluint8 library represents an element of Z256 (a.k.a. uint8)
as a 32-bit float, and provides multiplication, addition,
subtraction, negation, division, and bitwise functions and, or, and
exclusive or.

Compared to the approach discussed in Section 6 using "choppy
functions," fluint8 has much more simple and sensible
implementations of functions like addition:

inline float fu8_add(float a, float b) {
  float x = a + b;
  x -= x - 127.5f + 3221225472.0f - 3221225472.0f;
  return x;
}

The x -= x... line applies the corrective factor to implement
wrap-around, which we previously did using the zero-threshold
function. Why can it be done so much more simply here? First,
fluint8 represents n ∈ Z256 as n, so a number like 27 is
represented as 27.0 instead of, say, −1 + 27/128. Second, it
requires that the number be represented exactly as this value. The
figures in the fluint8 paper are somewhat misleading as they are
plotted only


Test  p-value | Test  p-value | Test  p-value
SerialOver, r = 0  0.9653 | SimpPoker 0 32  0.8052 | RandomWalk1 J (L=50, r=25)  0.4576
SerialOver, r = 22  0.7292 | SimpPoker 25 32  0.1166 | RandomWalk1 R (L=50, r=25)  0.9736
CollisionOver, t = 2 (0)  0.3890 | CouponCollector, r = 0  0.3233 | RandomWalk1 C (L=50, r=25)  0.6768
CollisionOver, t = 2 (9)  0.6537 | CouponCollector, r = 10  0.7936 | RandomWalk1 H (L=1000, r=0)  0.9915
CollisionOver, t = 3 (0)  0.8046 | CouponCollector, r = 20  0.2870 | RandomWalk1 M (L=1000, r=0)  0.8194
CollisionOver, t = 3 (16)  0.9279 | CouponCollector, r = 27  0.1878 | RandomWalk1 J (L=1000, r=0)  0.7606
CollisionOver, t = 7 (0)  0.2906 | Gap 0 16  0.2858 | RandomWalk1 R (L=1000, r=0)  0.4983
CollisionOver, t = 7 (24)  0.0031 | Gap 25 32  0.6202 | RandomWalk1 C (L=1000, r=0)  0.0529
CollisionOver, t = 14 (0)  0.4310 | Gap 0 128  0.7462 | RandomWalk1 H (L=1000, r=20)  0.3353
CollisionOver, t = 14 (27)  0.5062 | Gap 20 1024  0.1068 | RandomWalk1 M (L=1000, r=20)  0.2279
CollisionOver, t = 21 (0)  0.1909 | Run 0  0.3096 | RandomWalk1 J (L=1000, r=20)  0.8593
CollisionOver, t = 21 (28)  0.2906 | Run 15  0.5308 | RandomWalk1 R (L=1000, r=20)  0.4915
BirthdaySpacings, t = 2  0.4179 | Permutation 3  0.7322 | RandomWalk1 C (L=1000, r=20)  0.9640
BirthdaySpacings, t = 2 (b)  0.5749 | Permutation 5  0.8632 | RandomWalk1 H (L=10000, r=0)  0.0713
BirthdaySpacings, t = 3  0.2249 | Permutation 7  0.8337 | RandomWalk1 M (L=10000, r=0)  0.4753
BirthdaySpacings, t = 4  0.2230 | Permutation 10  0.7557 | RandomWalk1 J (L=10000, r=0)  0.6421
BirthdaySpacings, t = 4 (14)  0.2230 | CPerm 0  0.0512 | RandomWalk1 R (L=10000, r=0)  0.0469
BirthdaySpacings, t = 4 (0)  0.2293 | CPerm 10  0.0116 | RandomWalk1 C (L=10000, r=0)  0.6232
BirthdaySpacings, t = 4 (16)  0.9111 | MaxOft, t = 8  0.1909 | RandomWalk1 H (L=10000, r=15)  0.5739
BirthdaySpacings, t = 7 (0)  0.8077 | MaxOft AD, t = 8  0.6478 | RandomWalk1 M (L=10000, r=15)  0.7165
BirthdaySpacings, t = 7 (7)  0.4887 | MaxOft, t = 16  0.3601 | RandomWalk1 J (L=10000, r=15)  0.6868
BirthdaySpacings, t = 8 (14)  0.5956 | MaxOft AD, t = 16  0.7570 | RandomWalk1 R (L=10000, r=15)  0.7075
BirthdaySpacings, t = 8 (22)  0.1382 | MaxOft, t = 24  0.3625 | RandomWalk1 C (L=10000, r=15)  0.2100
BirthdaySpacings, t = 16 (0)  0.5266 | MaxOft AD, t = 24  0.7378 | LinearComp, r = 0 (Num)  0.7964
BirthdaySpacings, t = 16 (26)  0.6619 | MaxOft, t = 32  0.4541 | LinearComp, r = 0 (Size)  0.8628
BirthdaySpacings, t = 13 (0)  0.8419 | MaxOft AD, t = 32  0.1967 | LinearComp, r = 29 (Num)  0.1564
BirthdaySpacings, t = 13 (5)  0.9242 | SampleProd, t = 8  0.6129 | LinearComp, r = 29 (Size)  0.9696
BirthdaySpacings, t = 13 (10)  0.3125 | SampleProd, t = 16  0.7735 | LempelZiv, r = 0  0.8373
BirthdaySpacings, t = 13 (15)  0.4234 | SampleProd, t = 24  0.0891 | LempelZiv, r = 15  0.4632
BirthdaySpacings, t = 13 (20)  0.0172 | SampleMean, r = 0  0.1115 | Fourier3, r = 0  0.9159
BirthdaySpacings, t = 13 (26)  0.3276 | SampleMean, r = 10  0.4571 | Fourier3, r = 27  0.8144
ClosePairs NP t=3  0.9584 | SampleCorr, k = 1  0.0260 | LongestHeadRun (Chi), r = 0  0.1270
ClosePairs mNP t=3  0.6028 | SampleCorr, k = 2  0.0146 | LongestHeadRun (Disc), r = 0  0.9025
ClosePairs mNP1 t=3  0.3668 | AppearanceSpacings, r = 0  0.6741 | LongestHeadRun (Chi), r = 27  0.7822
ClosePairs mNP2 t=3  0.8549 | AppearanceSpacings, r = 27  0.0951 | LongestHeadRun (Disc), r = 27  0.6878
ClosePairs NJumps t=3  0.3739 | WeightDistrib, r = 0 (0.25000)  0.3097 | PeriodsInStrings, r = 0  0.6300
ClosePairs mNP2S t=3  0.3813 | WeightDistrib, r = 20 (0.25000)  0.6266 | PeriodsInStrings, r = 20  0.0839
ClosePairs NP t=5  0.2328 | WeightDistrib, r = 28 (0.25000)  0.4372 | HammingWeight2, r = 0  0.1331
ClosePairs mNP t=5  0.4011 | WeightDistrib, r = 0 (0.06250)  0.6148 | HammingWeight2, r = 27  0.0322
ClosePairs mNP1 t=5  0.6286 | WeightDistrib, r = 10 (0.06250)  0.6600 | HammingCorr, L = 30  0.5516
ClosePairs mNP2 t=5  0.7635 | WeightDistrib, r = 26 (0.06250)  0.6532 | HammingCorr, L = 300  0.7373
ClosePairs NJumps t=5  0.7981 | SumCollector  0.6092 | HammingCorr, L = 1200  0.9393
ClosePairs mNP2S t=5  0.4369 | MatrixRank, L=30, r=0  0.4367 | HammingIndep, L=30, r=0  0.1326
ClosePairs NP t=9  0.3073 | MatrixRank, L=30, r=25  0.3045 | HammingIndep, L=30, r=27  0.7257
ClosePairs mNP t=9  0.7986 | MatrixRank, L=1000, r=0  0.0841 | HammingIndep, L=300, r=0  0.4177
ClosePairs mNP1 t=9  0.1934 | MatrixRank, L=1000, r=26  0.0145 | HammingIndep, L=300, r=26  0.7630
ClosePairs mNP2 t=9  0.5857 | MatrixRank, L=5000, r=15  0.2650 | HammingIndep, L=1200, r=0  0.4981
ClosePairs NJumps t=9  0.9882 | MatrixRank, L=5000, r=0  0.9631 | HammingIndep, L=1200, r=25  0.3571
ClosePairs mNP2S t=9  0.0962 | Savir2  0.7317 | Run of bits (runs), r = 0  0.0241
ClosePairs NP t=16  0.3787 | GCD  0.7578 | Run of bits (bits), r = 0  0.5718
ClosePairs mNP t=16  0.1983 | RandomWalk1 H (L=50, r=0)  0.6779 | Run of bits (runs), r = 27  0.0822
ClosePairs mNP1 t=16  0.0511 | RandomWalk1 M (L=50, r=0)  0.9338 | Run of bits (bits), r = 27  0.7981
ClosePairs mNP2 t=16  0.2874 | RandomWalk1 J (L=50, r=0)  0.3168 | AutoCorr 1 0  0.3058
ClosePairs NJumps t=16  0.9369 | RandomWalk1 R (L=50, r=0)  0.4753 | AutoCorr 3 0  0.0371
ClosePairs mNP2S t=16  0.7523 | RandomWalk1 C (L=50, r=0)  0.4941 | AutoCorr 1 27  0.3292
SimpPoker 0 8  0.9863 | RandomWalk1 H (L=50, r=25)  0.8645 | AutoCorr 3 27  0.0612
SimpPoker 27 8  0.4528 | RandomWalk1 M (L=50, r=25)  0.6220

Figure 15: Results of the TestU01 "Big Crush" suite on the
pseudorandom number generator built from floating point roundoff
error. A p-value of < 0.001 or > 0.999 is considered suspect
by the suite, so all tests pass here.


for input values that are already exact integers; if we test fu8_add on
values like 100.1875 and 11.0703125 we do not get 111 (Figure 16). On
the other hand, this is a very reasonable choice to make; we can simply
have a representation invariant that only one of these 256 values is
used, and preserve that invariant with every operation. It won't work
great for the continuous domain (e.g. plotting fractals) but is a much
more practical choice for discrete data (e.g. encryption). Since I like
to work at the intersection of Theory, Impractice, and Practice, this is
appealing!

But: The library uses several operations that are not linear! In
particular, its implementation of bitwise functions like XOR performs
squaring and multiplication of the two arguments. It was not a design
goal of fluint8 to use only addition and scaling, but it is a design
goal today, so we must address that.

7.2 hfluint8

The use of nonlinear operations is a problem we will rectify,
forthwith, but the other ideas are suitable for building a computer.
In the hfluint8 (for half float linear unsigned int 8-bit) library,
a hfluint8 will be represented by a single half-precision floating
point number, and always one of the exact integral values in [0,
256).17

struct hfluint8 {
  half h;
  // ...
};

Let's begin with one helper function:18

half RightShiftHalf8(half xh) {
  half SCALE = GetHalf(0x1c00);    // 1/256
  half OFFSET1 = GetHalf(0xb7f6);
  half OFFSET2 = GetHalf(0x66b0);
  return xh * SCALE + OFFSET1 + OFFSET2 - OFFSET2;
}

If the function is given an integral half xh in [0, 512), it returns
xh >> 8. This value is always 1 or 0. The calls to GetHalf interpret a
16-bit constant as a half, which is useful to be precise (many decimal
expressions like 0.1 are not exactly representable in floating point).
I also found that if you use literals like 0.00390625_h, the code runs
much much more slowly because it inhibits some optimizations, or perhaps
the user-defined literals are parsed at runtime (?!). Aside from wanting
to avoid operations like parsing that might not be addition and scaling
on halfs, we will struggle with performance of these functions as we use
them for real

17In fact, all integers from -2048 to 2048 are available, so we
could consider implementing signed 11-bit numbers in a future
hflsint11 library.

18These code samples have been simplified to fit the extremely
capricious SIGBOVIK column width requirements. For example, GetHalf is a
constexpr function, so these constants are really declared as static
constexpr and computed completely at compile time. See the full code and
verify that it complies with the rules at
https://sourceforge.net/p/tom7misc/svn/HEAD/tree/trunk/grad/.

Figure 16: Top: Error of the fluint8 addition function on general
floating point values in [0, 256). This is a detailed zoom of the
region x ∈ [252, 256) and y ∈ [0, 4), but the rest of the
image is almost identical. Each pixel compares the fluint8 sum of
x and y to the expected value (⌊x⌋ + ⌊y⌋ mod 256). The
top-left pixel in each cell is the case where x and y are
integers; we get the correct result (no error). All other pixels are
wrong, either too high (green) or too low (red).

Bottom: Same with the error of the floor of fluint8's sum
function. This shows that the output is usually not even in the
correct interval. However, observe the multitude of Triforces!

Nowhere, or a lot of places if you think about it: The modular
addition operation from Section 6 is not pictured for comparison
because it would be all white, meaning no error. You can actually
imagine it occupies any blank portion of this paper, such as the
inner hole of a letter 'o,' or the entire back of a page if printed
single-sided.

Graphics produced using ImageRGBA computational for loop engine.


computing. Anyway, we are just dividing by 256 (by multiplying by
1/256) and then adding some mysterious constants to ensure that
the result is exactly 1 or 0.

Next, we can perform addition:

hfluint8 hfluint8::Plus(hfluint8 x, hfluint8 y) {
  half HALF256 = GetHalf(0x5c00);  // 256.0
  half z = x.h + y.h;
  half o = RightShiftHalf8(z);
  return hfluint8(z - o * HALF256);
}

As in fluint8 we can simply add the arguments, giving a result in
[0, 512). The shift function just discussed then allows us to
compute 1 if the value is out of range or 0 otherwise. We multiply
this by a corrective constant (256.0) and subtract that away. So easy.
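
For example, following the definitions above, Plus(200, 100) computes z = 300.0 (exact, since every integer up to 2048 is representable as a half), then o = RightShiftHalf8(300.0) = 1.0, and the result is 300.0 − 1.0 × 256.0 = 44.0, which is indeed (200 + 100) mod 256.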

For all other operations we work on the domain [0, 256). We also have
a right shift by one bit:

half RightShiftHalf1(half xh) {
  half SCALE = GetHalf(0x37fa);   // 0.4985...
  half OFFSET = GetHalf(0x66cd);  // 1741.0
  return xh * SCALE + OFFSET - OFFSET;
}

Right shifting is integer division by two. Roughly we are dividing by
two and then offsetting to a part of the floating point number line
where only integers are representable, then offsetting back. However,
with a constant of exactly 0.5 some of the rounding would be in the
wrong direction; the constant 0.49853515625 just happens to work.

We can shift by multiple places by repeating this operation multiple
times. However, the library has direct solutions for several other
shift distances, since this is more efficient than repeating a single
shift.

Next, bitwise operations. These are all based on the AND function:

half BitwiseAndHalf(hfluint8 a, hfluint8 b) {
  half result = GetHalf(0x0000);
  for (int bit_idx = 0; bit_idx < 8; bit_idx++) {
    // Low order bit as a - ((a >> 1) << 1)
    hfluint8 as = RightShift1(a);
    hfluint8 bs = RightShift1(b);
    half a_bit = a.h - LeftShift1Under128(as).h;
    half b_bit = b.h - LeftShift1Under128(bs).h;
    // Computes 2^bit_idx. A constant.
    half scale = GetHalf(0x3c00 + 0x400 * bit_idx);
    half and_bits = RightShiftHalf1(a_bit + b_bit);
    result += scale * and_bits;
    // and keep shifting down
    a = as;
    b = bs;
  }
  return result;
}

This function shifts each input down 8 times, stripping off the low order bit at each step. Note that since we run this loop exactly 8 times, it can simply be unrolled, removing any whiff of non-linearity, and the constants computed at compile time. LeftShift1Under128(x) is just x + x without any need to worry about modular arithmetic, as it cannot overflow.

An interesting line is the computation of and_bits, which is the logical AND of the low-order bits from a and b. In fluint8 we simply compute a_bit * b_bit. This has the correct value, but is not linear (observe that if we were to compute x & x we would end up squaring a function of x here). Instead we compute (a_bit + b_bit) >> 1, which produces the correct result: the sum is 0, 1, or 2, and only the case where both bits are set survives the shift.

Being able to compute the bits in common allows us to easily derive OR
and XOR:

hfluint8 BitwiseOr(hfluint8 a, hfluint8 b) {
  half common = BitwiseAndHalf(a, b);
  return hfluint8((a.h - common) + b.h);
}

hfluint8 BitwiseXor(hfluint8 a, hfluint8 b) {
  half common = BitwiseAndHalf(a, b);
  return hfluint8((a.h - common) + (b.h - common));
}

These subtractions and additions cannot overflow. It will be common to perform bitwise operations with constants, so hfluint8 supports versions with a compile-time constant argument, which can skip a bunch of work. These run about 5× faster.

We also have some operations that are not supported by fluint8 but
that we will need for the current project. A basic operation is to
test for zero. IsZero returns 1 if the input is 0, or returns 0 for
any other argument:

hfluint8 IsZero(hfluint8 a) {
  half H255 = GetHalf(0x5bf8);  // 255.0
  half H1 = GetHalf(0x3c00);    // 1.0
  half nota = (H255 - a.h);
  return hfluint8(RightShiftHalf8(nota + H1));
}

For an input of zero, complementing it yields 255, and adding 1
overflows to set the 8th bit. So we shift that bit to the ones place
and are done.19

With this, Eq(a, b) is just IsZero(a - b). We can define a number of operations like "boolean or" that assume inputs of exactly 1 or 0; these are straightforward and much faster than their bitwise counterparts. We could think of these values as hflbools, although we still use the hfluint8 type for them. The main way to use a hflbool is If. If(cc, t) returns t if cc is exactly 1, returns 0 if cc is 0, and is otherwise undefined. A simple implementation of this is:

half H255 = GetHalf(0x5bf8);  // 255.0
hfluint8 mask = hfluint8(cc.h * H255);
return BitwiseAnd(mask, t);

This computes either the mask 00000000 or 11111111 and uses the
existing bitwise AND operation. Bitwise AND is not fast, and it does
more work than it needs to in this case because we know one of the
arguments is all zeroes or all ones. It is faster to inline the
bitwise AND routine but keep checking the ones place. Even better is
this wild ride:

19Earlier iterations of this function were much more complex! For example, (x + 15 + 65248 − 65248) × 0.03125 maps 0 to 0, but any other number to some number in [1, 15], and then a similar function compresses that range down to exactly 0 or 1. But sometimes you miss the obvious stuff until you start writing a paper about it for a prestigious conference. No doubt some other functions in here could be improved!


hfluint8 If(hfluint8 cc, hfluint8 t) {
  static std::array<half, 8> OFF = {
    GetHalf(0x77f9), GetHalf(0x7829),
    GetHalf(0x77fb), GetHalf(0x78e2),
    GetHalf(0x77fd), GetHalf(0x780b),
    GetHalf(0x77ff), GetHalf(0x7864),
  };
  half HALF1 = GetHalf(0x3c00);     // 1
  half HALF128 = GetHalf(0x5800);   // 128
  half HALFNEG1 = GetHalf(0xbc00);  // -1
  half HALF0 = GetHalf(0x0000);     // 0

  half xh = t.h;
  half nch = HALF1 - cc.h;
  half c128 = HALF128 * nch;

  std::array<half, 8> COFF;
  for (int i = 0; i < 8; i++)
    COFF[i] = OFF[i] * nch;

  for (const half &h : COFF) xh = xh + h - h;
  xh = (c128 - xh);
  for (const half &h : COFF) xh = xh + h - h;

  return hfluint8(xh * HALFNEG1 + HALF0);
}

The 8 constants in OFF, when added to and subtracted from a hfluint8, will always round such that the low six bits become 0. To have behavior conditional on cc, first we multiply each constant by 1 − cc. This results in either the original constant or 0. If zero, then adding and subtracting them does nothing. Then we add and subtract those results, clearing the low six bits, and (conditionally, using the same trick of multiplying by the condition) subtract from 128. This clears the top two bits for the range of possible values (but may reset low-order bits). Then we add and subtract the sequence again, clearing the low six bits again. At the end we apply a corrective negation and then add 0 to avoid outputting −0, and we're done.

hfluint16. Several other operations are available for hfluint8, like AddWithCarry, but we shan't elaborate them all here, lest we contract hfluenza. One more concept is needed before we get to the application: 16-bit integers. The hfluint16 type is implemented as a pair of hfluint8 bytes. We will only need a small number of operations: Addition, subtraction, bitwise operations, sign extension of hfluint8, If, and stuff like that. These are all cleanly implemented in terms of the hfluint8 operations like AddWithCarry.

7.3 Linear gameplay

Now we can build an 8-bit computer. I like to work at the intersection of theory, impractice, practice, and entertainment, and the most entertaining 8-bit computer is the Nintendo Entertainment System, so let's build that. The full NES has many components (video output, controllers, sound, RAM, cartridge mappers), and it's not even clear what it would mean to implement "linear" versions of these. So for this project we will replace the CPU, which is a variant of the MOS Technology 6502 called the Ricoh 2A03. Each instruction that the CPU executes will be done entirely with

linear half-precision floating point operations. This is done in software emulation, upgrading a version of the FCEUX Emulator [3] that I forked many years ago [23].

The 2A03 has 8-bit registers A, X, Y, a stack pointer S and processor flags P. Each is represented as a hfluint8, of course. It also has a 16-bit program counter PC, which we represent as a hfluint16. Putting aside the many complexities, at each step it reads the byte at the program counter, which denotes one of its 256 instructions. It then executes the corresponding instruction, which produces new values for the registers and advances the program counter a variable amount. For example, a very simple instruction is TAX (0xAA), which could be implemented like this:

reg_X = reg_A;
reg_P = ~(Z_FLAG8 | N_FLAG8) & reg_P;
hfluint8 zf = IsZero(reg_A) << 1;
hfluint8 nf = N_FLAG8 & reg_A;
reg_P = reg_P | nf | zf;

It is not implemented like this. Everything gets more complicated. But
anyway, the TAX instruction Transfers (copies) the A register to the X
register, and then updates the Zero and Negative bits of the flags
register. We have all of these operations on hfluint8, so it's just
a matter of doing it.

Memory. For instructions that act solely on registers, this approach suffices. Most instructions read from or write to memory, including just to read additional arguments to the instruction. This is a problem because we don't have any kind of branching; we always need to execute the exact same sequence of additions and scaling operations. We can work with this by computing condition codes: "Is this write actually happening, or are we just computing it because we always have to do the same sequence of operations?" Then a write mem[addr] = val can be made conditional using our If operation, like

mem[addr] = If(cc, val) + If(1 − cc, mem[addr])

This has other problems (for example when the address is not known at compile time, which is typical) but the biggest one is that all memory accesses on the NES are potentially effectful. This is because various things are attached to the memory controller that perform actions when addresses are accessed. For example, writing two consecutive bytes to 0x2006 will load them as an address into another chip (the PPU) and then writing to 0x2007 will write bytes into video memory at that address. Writing to 0x4014 will begin a DMA loop that copies 256 bytes from the main address space to video RAM, suspending the CPU for 512+ cycles. Reads can have effects as well, and these effects are not from a small set because they can include arbitrary hardware in the cartridge itself [25]!

So here we have a sort of concession: We introduce two primitive operations ReadIf(cc, addr) and WriteIf(cc, addr, val). These take a hfluint8 condition cc (exactly 0 or 1), a hfluint16 address, and (for writes) a hfluint8 value to write. If the condition code is 0, nothing happens, and an arbitrary value is returned. If 1, the read

Figure 17: During the development of the emulator, the FPS achieved (blue) versus the number of times the code "cheats" due to incomplete implementation (red). Log scale. Honestly there's not much to get from this except that we start with a lot of FPS (3500) and a lot of cheats (65 million) and end with few FPS (0.1) and no cheats. I guess it also shows that this took many iterations to implement. The reason that the cheat count does not monotonically decrease is that a single cheat (e.g. a switch on the instruction byte) can mask the need for hundreds of other cheats.

or write takes place, including its side-effects. This would be a realistic model if we implemented a hardware version of this chip, which only used floating point operations internally; its hardware pins for interfacing with memory would simply include a bit for whether we actually want the read or write to happen.20 (The actual 2A03 pinout has a "R/W" pin, for example.)

Doing it correctly. The remainder is reasonably straightforward given the tools we've already built. One challenge is simply not screwing up. 256 instructions is a lot, and the original code is extremely awful; it is filled with macro hacks that assume specific variable names and values of constants, pirate jokes, references to mysterious global variables named stuff like temp, feuds between developers commenting out each other's "wrong" code, and so on. As I developed the hfluint8-based emulator, I strove to keep the emulator in a working state as often as possible so that I could test it against the reference implementation. One technique was to do various pieces of code in easy, cheating ways, but to record each time I cheated by incrementing a global counter. Each time I replaced reasonable, fast code with ideologically pure, non-cheating code, which is typically much slower, the cheating went down and the runtime went up; see Figure 17. This makes it like a game.

Another challenge is that the 2A03 has dozens of undocumented
20A similar concession is made for interrupts. This is handled at the start of the instruction loop using C code, though all the computation is performed with hfluint8. Essentially we can think of the interrupt handling as being done in a linear way, but the decision to handle an interrupt instead of executing an instruction being done by "hardware."

instructions with mysterious behavior. Most of these are not used by any game in my test suite, which means I run the risk of breaking one of these instructions and not knowing. Some of these instructions are very weird, since they are essentially the consequence of 6502 sub-units (designed for implementing other instructions) being connected together in ways that are not motivated by useful behavior. For example, the XAA instruction (0x8B) bitwise ORs the A register with 0xEE (setting all but two bits), then ANDs with the X register, then ANDs with an immediate byte. Others are just as weird but much more complex. Since I want the emulator to be as complete and correct as possible, I wrote a new "game" that I could use as an additional test ROM (Figure 18). This "game" executes dozens of undocumented instructions at startup, writing interesting state to RAM to create a record of their behavior. The game then displays the first half of RAM on screen. This gives some amount of protection against regression on these instructions.

Everything, everywhere, all at once. Each instruction is otherwise straightforward to implement. The remaining challenge has to do with the instruction dispatch. A natural way to write the instruction loop is to switch on the instruction byte, but that is not a linear operation. Instead, we always execute all of the instructions. Before this, we make 256 copies of the CPU state (the registers); this is linear because it's just copying a finite number of variables. Each copy also has an active flag (a hfluint8 with 1 or 0). We set this for exactly one of these instructions, by computing If(Eq(insn_byte, n)) for each of the 256 n. Then we execute each instruction on its copy of the state; it does all its computation, and any read or write is additionally conditioned on its active flag. This way only the active instruction's memory accesses actually occur.

We then need to select the instruction that was actually executed and copy its state back to the "real" CPU state. We do this by conditionally clearing each register:

reg = If(active, reg)

We then set the real CPU's register to the sum of all of the registers
from the instruction-specific states. Exactly one (the active one)
will be nonzero, so we get that value. We use this same technique to
keep track of how many cycles have elapsed, since various emulator
timing depends on this.

A bad thing about this approach is that it's more than 256 times
slower than just executing a single instruction, and this is the main
reason why the emulator is so slow. A good thing is that there is no
cheating. Another good thing is that the instructions are all reading
and writing distinct data, so they can actually be executed in
parallel. The final benchmarks here are from running on 8 cores.

7.3.1 It's a-fine, Mario!

The emulator can play any NES game supported by FCEUX (which is
basically all of them; despite the horrors in this emulator's code, it
has great compatibility). My


Figure 18: Exciting Nintendo "game" showing the first half of the NES RAM after executing a test of dozens of undocumented instructions. The "game" cannot be won. It exists only to destroy your mind.

benchmark was the first level of the classic Super Mario Bros., playing a sequence of 2210 inputs that completes level 1-1 in 36 seconds. The emulator runs this as fast (or as slow) as it can. Normal frame rate is 60 FPS. The original implementation runs at 3500 FPS; after many performance tweaks I got my hfluint8 version to run at

0.1154 FPS

In print, the frame-rate is always zero, anyway (Figure 19). 8.6 seconds per frame is firmly in "not playable" territory, but it is tolerable for installation artwork, let's say. I have played AAA titles that, at launch, inexplicably had comparable framerates on a high-end GPU, and these games were no doubt executing a great many non-linear instructions.

8 Conclusion

Implementing a basic computer (with an extant software library) using floating point addition and scaling demonstrates the highly general computing power they contain, despite approximating mathematically limited operations. We can say informally that they are Turing complete. This also renders the previous sections moot; performance notwithstanding, we could directly implement the Mandelbrot set, the tanh transfer function, or AES using this 8-bit computer. It also immediately gives us a linear chess engine (including game tree search and a user interface) by emulating chessmaster.nes; in fact this engine already participated in our tournament (Figure 8)!

Figure 19: Mario completing level 1-1 in 36 seconds of game time, or 19,143 seconds of wall time, using only floating point roundoff error from addition and scaling.

8.1 Future work

If I remember correctly (and I probably don't), Gödel showed that an axiomatic system with addition and multiplication can encode sufficient facts about the natural numbers to engender incompleteness [9]. However, a system with only addition (such as Presburger arithmetic) does not have this problem. Incompleteness is similar to the halting problem for Turing complete systems, in that it is easy to encounter given a small set of primitives and the canonical demonstration is a diagonalization argument. Is floating point addition alone Turing complete? Can we prove it? If so, is the fact that real mathematical addition and multiplication have this deep incompleteness property related to the fact that IEEE-754 addition and multiplication have the deep computational property?21 Coincidence?!22

If not addition alone, the FMA (fused multiply-add) instruction very likely suffices, as it performs both a multiplication and addition. This makes sense, as the equation F = MA is fundamental to physics.

Linear logic???

Thinking about the 2A03 implementation, each loop executes the exact same set of instructions, with a high degree of parallelism. The use of condition codes mimics the way that VLIW machines and modern GPUs execute data-parallel programs. This seems to lend itself to highly parallel execution on GPUs; in fact the "Tensor cores"
21No.

22Yes.


designed for accelerating ML inference can likely execute these floating-point operations. Moreover, since the operations being executed are linear, the entire computation is trivially differentiable. This means that, if you don't think about it too hard (but you need to think about it a medium amount of hard, because it is a confusing thought), you could use a finite sequence of NES instructions as transfer functions in a network, and back-propagate errors (giving an error vector towards a machine state and controller inputs that would yield the desired output state). This of course would not actually work, similar to how automatic differentiation does not actually work.

Not everyone uses IEEE-754 floating point these days. For example the bfloat16 format has gained traction in machine learning. Are similar tricks possible in these alternate universes, or is IEEE-754 simply the best forever?

Other applications of this technology are possible, and further study is
warranted. For example, a common act in video editing is to rearrange
clips from a source video in alphabetical order [2]. It was formerly
believed that this required non-linear video editing (aside from
"Already Being Filmed In Lexicographic Order Type Videos"). But it seems
straightforward to use techniques from this paper to perform them
linearly.

8.1.1 Conclusion Conclusion

A line has been drawn in the sand. y is truly equal to mx plus
b. The professor has been defeated. The dead horse has been beaten.
The paper is finally over.

References

[1] 754-2008 IEEE standard for floating-point arithmetic. Technical Report 754-2008, IEEE Computer Society, August 2008.

[2] ARST ARSW: Star Wars sorted alphabetically, June 2014. http://radar.spacebar.org/f/a/weblog/comment/1/1109.

[3] adelikat et al. FCEUX, the all in one NES/Famicom emulator. http://fceux.com/.

[4] Rodrigo Benenson. Are we there yet?, 2016. http://rodrigob.github.io/are_we_there_yet/build/.

[5] L. Blum, M. Blum, and M. Shub. A simple unpredictable pseudo-random number generator. SIAM Journal on Computing, 15(2):364–383, 1986.

[6] Arpad E. Elo. The Rating of Chessplayers, Past and Present. Arco Pub., 1978.

[7] Horst Feistel. Cryptography and computer privacy. Scientific American, 228(5):15–23, 1973.

[8] Herrn Frobenius. Über lineare Substitutionen und bilineare Formen. Journal für die reine und angewandte Mathematik (Crelles Journal), 1878(84):1–63, 1878.

[9] Kurt Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik, November 1930.

[10] Johan Håstad, Russell Impagliazzo, Leonid A. Levin, and Michael Luby. A pseudorandom generator from any one-way function. SIAM Journal on Computing, 28(4):1364–1396, 1999.

[11] Allen Hux. The OpenCL extension specification, November 2015.

[12] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[13] Donald E. Knuth. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley, Boston, third edition, 1997.

[14] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[15] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[16] Pierre L'Ecuyer and Richard Simard. TestU01: A C library for empirical testing of random number generators. ACM Transactions on Mathematical Software (TOMS), 33(4):1–40, 2007.

[17] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. ACM SIGGRAPH Computer Graphics, 21(4):163–169, August 1987.

[18] George Marsaglia. DIEHARD: a battery of tests of randomness. http://stat.fsu.edu/geo, 1996.

[19] Mitsuru Matsui. Linear cryptanalysis method for DES cipher. In Advances in Cryptology – EUROCRYPT '93: Workshop on the Theory and Application of Cryptographic Techniques, pages 386–397. Springer, May 1994.

[20] Jim McCann and Tom Murphy, VII. The fluint8 software integer library. In A Record of the Proceedings of SIGBOVIK 2018, pages 125–128, April 2018. sigbovik.org/2018.

[21] Mark D. McDonnell and Tony Vladusich. Enhanced image classification with a fast-learning shallow convolutional neural network, 2015.

[22] Joseph Menn. Secret contract tied NSA and security industry pioneer. Reuters, December 2013. https://www.reuters.com/article/us-usa-security-rsa-idUSBRE9BJ1C220131220.

[23] Tom Murphy, VII. The first level of Super Mario Bros. is easy with lexicographic orderings and time travel. In A Record of the Proceedings of SIGBOVIK 2013, volume 2013, pages 112–133. The Association for Computational Heresy, 2013.

[24] Tom Murphy, VII. What, if anything, is epsilon? In A Record of the Proceedings of SIGBOVIK 2014, pages 93–97. ACH, April 2014. sigbovik.org/2014.

[25] Tom Murphy, VII. Reverse emulating the NES to give it SUPER POWERS! Deconstruct 2018; YouTube, 2018. http://radar.spacebar.org/f/a/weblog/comment/1/1157.

[26] Tom Murphy, VII. Elo World: A framework for benchmarking weak chess algorithms. In A Record of the Proceedings of SIGBOVIK 2019. ACH, April 2019. sigbovik.org/2019.

[27] Tom Murphy, VII. NaN gates and flip FLOPS. In A Record of the Proceedings of SIGBOVIK 2019, April 2019. sigbovik.org/2019.

[28] Jon Postel. DoD standard Transmission Control Protocol. RFC 761, January 1980.

[29] Christian Rau. half - IEEE 754-based half precision floating-point library, 2022. https://half.sourceforge.net/.

[30] Tord Romstad, Marco Costalba, and Joona Kiiski. Stockfish chess, 2023. https://stockfishchess.org/.

[31] Bruce Schneier. Applied Cryptography, Second Edition: Protocols, Algorithms and Source Code in C. John Wiley & Sons, 1996.

[32] Unknown. The Nine Chapters on the Mathematical Art. Han Dynasty, 179.


7

Leveraging insect populations to implement large scale deep learning

Aditi Kabra Sagar Bharadwaj

Carnegie Mellon University

1 Introduction

Some insects are popularly considered to serve no purpose in their existence [3]. (This might tempt some to ponder about the usefulness of their own existence, which we leave as an exercise to the reader.) Our paper gives insects their much needed existential purpose: to serve humans for the greater good. In this work, we present a method to use insects as computational units to train and evaluate large deep learning models including GPT-4 [2]. Insects regularly show an ability to learn from their peers [4, 1]. However, in past work, researchers have made insects learn things that are futile at best, such as solving puzzles and dancing. Computer science researchers have frequently demonstrated that there is only one type of learning that is useful: machine learning. In this paper, we show that we can force insects to learn from data and simulate large scale models. In addition to its obvious usefulness to humans, we believe our work is tremendously important to the large insect populations as it gives them a concrete purpose to live. Using insects to train models effectively frees up GPUs to be used for what they are intended to be used for: games.

We first collect a variety of insects including bees, termites, and moths from undergrad dorm rooms. We relied on the low effort spent on dorm maintenance for our insect collection. 15213 insects were collected for our experiments.

In the training phase, the insect populations were shown some collected
image and text data. Training was done by appropriately rewarding
insects with the things they like once they show sufficient proof that
they have learnt the right thing. For example, we rewarded moths with
light bulbs to flock to; house flies with human ear models to buzz
around; termites with papers that PhD students printed hoping to read
some day. The training phase took a week. However, this work is in its
initial stages and we believe that it can be reduced further.

Training was followed by the testing phase, in which we showed these insects data that they had never seen and recorded their predictive accuracy. To the authors' astonishment, the insects displayed a remarkable ability to generalize and achieved an accuracy of 100%.

In Section 2, we present the technical ideas behind our paper. Section 3 discusses implementation and evaluation. Similarly, the other sections discuss what the section headings claim they do.

2 Obligatory Technical Section

This paper proposes a new way of unconventional computing, using the insect mind as the logical unit, and a reward/evolution loop as the programming procedure. Organic minds have abilities that mechanical minds have not yet been able to replicate. They are also tremendously energy efficient compared to electronic computers. Evolution has developed systems that simulation on binary computers has not been able to. For these reasons, problems that are hard for Turing machines may not be hard for programmed insects. Further, if we confine ourselves to problems where solving is difficult but checking is easy for a conventional computer, training can be automatic, with a computer deciding when to reward the insects. Using Insect Learning based computation, as our evaluation shows, has the potential for tremendous impact on the world. It could save the world from climate change; not only is Insect Learning very energy efficient, but it also relies on energy sources such as grass and leaves, that are generally seen as renewable. It can solve problems that were previously intractable, and improve equity and inclusiveness because of how cheap insects are, seeing as people often pay to get rid of them.

3 Evaluation and Implementation

We performed an extensive evaluation that confirmed Insect Learning is incredibly effective, outperforming state of the art machine learning architectures by several orders of magnitude. Unfortunately, the termite test subjects ate our physical data sheets. Furthermore, a moth got stuck in the vacuum tubes of the computer that stored a soft copy of our data, leading to memory corruption. We would have conducted our experiments afresh, but the folks at PETI (People for Ethical Treatment of Insects) observed that these actions of our test subjects may suggest a lack of enthusiasm for the research, and held reservations regarding further experimentation. Fortunately, SIGBOVIK does not have an artifact evaluation. But this research absolutely is reproducible if you try hard enough.

4 Related Work

To the best of our knowledge, this work is completely novel. Our extensive literature review1 turned up no work that was related whatsoever.

5 Future Work

We have answered all potential questions. No avenues for future
research remain.

References

[1] Bees learn to dance and to solve puzzles from their peers. https://arstechnica.com/science/2023/03/bees-learn-to-dance-and-to-solve-puzzles-from-their-peers/. Accessed: 2023-03-27.

[2] Gpt-4. https://openai.com/research/gpt-4. Accessed: 2023-03-27.

[3] Most useless insects or do least for the environment (list). https://howitsee.com/most-useless-insects/. Accessed: 2023-03-27.

[4] BRIDGES, A. D., MABOUDI, H., PROCENKO, O., LOCKWOOD, C., MOHAMMED, Y., KOWALEWSKA, A., GONZÁLEZ, J. E. R., WOODGATE, J. L., AND CHITTKA, L. Bumblebees acquire alternative puzzle-box solutions via social learning. Plos Biology 21, 3 (2023), e3002019.

1The review consisted of asking our office mate if he had seen anything like this before. He didn't think so. We were cautious not to ask our advisors since they would likely know of actual related work.


8

Quantifying and Predicting Large Language Model Hype in SIGBOVIK and
Beyond

Kevin A. Wang*†*1, Pasha Khosravi , Pooya Khosravi , Karthik
Gajulapalli* and Linh Chu*

*Equal Contribution

**Unequal Contribution

Epsilon Contribution

**Zero Contribution

1Probably not the Kevin Wang you know

**Author affiliations (alphabetical): Georgetown University;
University of California, Irvine; University of California, Los
Angeles; wageslave. Send correspondence to \@often wang.

Abstract

Large language models have overwhelmed discourse in society, in computer science, and presumably, in SIGBOVIK 2023. This paper quantifies the number of this amount by defining a new metric, CTRLF, and calculating it for past iterations of SIGBOVIK. Furthermore, it is also of interest [Wik23] to predict future hype of LLMs. Therefore, we forecast these predictions in order to obtain extrapolations for SIGBOVIK 2023 by using both artificial and non-artificial neural networks. Finally, we conclude by looking at the actual value of the metric in SIGBOVIK 2023.

1 Introduction

The invention and success of large language models (sometimes
shortened to LLMs) in the past few years/months/weeks has quickly
caused their popularity to explode. As seen in Figure 1, the study and
use of large language models is now more popular than computer science
itself. Since SIGBOVIK is widely regarded as a microcosm of computer
science, and in some sense can be considered the "drosophila of
CS"[Wik23], we perform analysis and experiments to quantify the
amount of LLM hype in SIGBOVIK, and we use these measurements as a
proxy for the amount of LLM in computer science and in society. We
then perform predictions to forecast the amount of LLM in SIGBOVIK
2023, as a proxy for how much large language models will affect
society in the future.

Figure 1: Google Trends comparison of "Computer Science" (red line)
and "GPT" (blue line). Note that computer science historically
dominated GPT in terms of popularity. This is expected, since GPT can
be considered a strict subset of computer science. Note also the sharp
increase in GPT's popularity in 2022 and 2023, which implies that more
than 800% of computer science is now composed of NLP.


1.1 Overview

In the first section (Section 2), we demonstrate a variety of methods
to predict the amount of LLM in SIGBOVIK 2023. Concretely, we use the
number of exact matches for the term "language model" as a metric. We
compute this metric on previous SIGBOVIKs, then we query both
artificial (ChatGPT) neural nets and non-artificial (human) neural
nets to predict the value of the metric for 2023.

In the second section (Section 3), we analyze the results of the
predictions based on the ground truth. Ordinarily, this would be
impossible, since the ground truth is unknown at the time of this
writing. However, by bucketing the possible results into a finite
number of outcomes, we leverage the state of the art in Choose Your
Own Adventure papers [Ree09] to write the section.

1.2 Background

Wikipedia defines a large language model as "a language model
consisting of a neural network with many parameters (typically
billions of weights or more), trained on large quantities of
unlabelled text using self-supervised learning"1. While the study of
large language models was previously considered to be a strict subset
of the field of computer science known as natural language processing
(NLP), this relation is no longer considered to be strict. Figure 2
shows that after 15 years of decreasing popularity, LLMs have enjoyed
a recent growing resurgence in popularity (likely due to their
invention in 2018). In particular, note the sharp increase in
popularity in 2022 and 2023.

SIGBOVIK is an academic conference celebrating the inestimable and
variegated work of Harry Quorum Bovik. It is widely considered one of
the most prestigious conferences in the field of computer science. To
have a paper accepted into SIGBOVIK is the mark of a learned computer
scientist, even for a coauthor with zero contribution. SIGBOVIK 2023
(also known as SIGBOVIK 0x2023) will be the 17th annual SIGBOVIK
conference. Actually, at the time of your reading, SIGBOVIK 2023 is
the 17th annual SIGBOVIK conference, or SIGBOVIK 2023 was the 17th
annual SIGBOVIK conference. This is the crucial aspect that makes
Section 3 possible.

Figure 2: Google Trends of "LLM". Although the first large language
model is widely considered to be BERT[DCLT18] from 2018, this chart
suggests that they still garnered interest before their existence.

2 Predicting the Amount of Large Language Model in SIGBOVIK 2023

A question of major interest to philosophers is: "How much will society be affected by the recent invention and advances in large language models?" [Wik23] The recent AI boom has been compared to inventions as useful as the microprocessor [Gra23], the internet [Yof23], and fire [Gfo23], to inventions as dangerous as the nuclear bomb, and to inventions as useless as Bitcoin [Doc23]. By accurately forecasting the magnitude of the effects of LLMs on society, we can more properly prepare for the future.

Since this question is difficult to answer, we will focus on the
question with nearly as much interest to philosophers [Wik23]: "How
much will SIGBOVIK 2023 be affected by the recent invention and

1Webster's Dictionary defines a "large language model" as "The word you've entered isn't in the dictionary. Click on a spelling suggestion below or try again using the search bar above."


advances in large language models?" We posit that, as SIGBOVIK
represents a subset of computer science, and computer science is a
subset of society, the answer to this latter question is a good proxy
for the former.

2.1 Metric

To quantify the amount by which SIGBOVIK 2023 will be affected by large language models, we will predict the number of times that "large language model" will be present in the totality of the proceedings of SIGBOVIK 2023. Specifically, this is measured by performing the CTRL+F technique in the Google Chrome PDF browser on the PDF of SIGBOVIK 2023, and counting the number of appearances. See Figure 3. We will call this metric the CTRLF metric.
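For readers without a mouse, the following is a rough stand-in (my own sketch; the official metric is defined only by Chrome's CTRL+F on the PDF, so counts from other tools may differ). It counts case-insensitive occurrences of the search string in text extracted from the proceedings:

#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>

// Count case-insensitive occurrences of `needle` in `haystack`.
int CountCtrlF(std::string haystack, std::string needle) {
  auto lower = [](std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return s;
  };
  haystack = lower(haystack);
  needle = lower(needle);
  int count = 0;
  for (size_t pos = haystack.find(needle); pos != std::string::npos;
       pos = haystack.find(needle, pos + 1)) {
    count++;
  }
  return count;
}

int main() {
  std::string text = "Large language models are large language models.";
  std::cout << CountCtrlF(text, "language model") << "\n";  // prints 2
  return 0;
}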

Figure 3: CTRLF is the number pointed to by the red arrow. The figure
shows the technique performed on SIGBOVIK 2021.

2.2 Data Collection and Methodology

The main tool we use for prediction of CTRLF is the CTRLF of previous SIGBOVIKs. We downloaded PDF files for the proceedings of SIGBOVIKs 2007 through 2022, and performed the CTRL+F technique to extract the CTRLF of each previous conference. The results are shown in Table 1.

Year   CTRLF
2007   0
2008   0
2009   0
2010   0
2011   0
2012   0
2013   0
2014   0
2015   0
2016   0
2017   0
2018   0
2019   2
2020   3
2021   14²
2022   27
2023   ?

Table 1: Historical Values of CTRLF.

2The raw CTRLF metric for 2021 is actually 18, but 4 of the 18 were
in the Message From the Organizing Committee, which doesn't count.


Figure 4: Table 1 but in chart form.

Predictor   Prediction
Author 1    40
Author 2    81
Author 3    113

Table 2: Guesses for CTRLF of SIGBOVIK 2023.

2.3 Predictions with Real Neural Networks

Our first method of prediction is to query non-artificial neural networks (NaNs) for predictions. This method is known as guessing, and is popular in fields such as psychology, finance, and sports betting. We also borrow the technique from experimental science known as "randomization", to select a random sample of non-artificial neural networks to make predictions. The total pool of possible NaN forecasters was the set of authors of this paper. We randomly3 selected the first author, second author, and third author of this paper as our forecasters.

Each of the chosen forecasters gave their best guess as to the value
of CTRLF for SIGBOVIK 2023. The predictions are listed in Table 2.

2.4 Interpretability of Real Neural Networks

Interpretability and explainability are widely regarded as the last
advantages of non-artificial neural networks over artificial ones.
Author 2 gave the most thorough explanation of their forecast, by
saying "prediciont 3*28=81". It is not immediately evident to the
other authors what this explanation means, and further research in
this area is warranted. Authors 1 and 3 did not give explanations for
their predictions.

2.5 Commentary on Real Neural Network Prediction

We note that all 3 predictions for CTRLF in SIGBOVIK 2023 are greater than any historical CTRLF in a past conference. This seems probable, since the value of CTRLF has been monotonically non-decreasing every year. Indeed, as suggested by the Google Trends showcased in Figures 1 and 2, the value will likely be a large increase over previous years.

Furthermore, SIGBOVIK papers often make use of the rhetorical devices
"parody" or "satire". Since large language models are the subject of a
large amount of hype, they provide a rich and juicy target for such
devices, which should also boost up those numbers[WKK+23].

Additionally, papers in SIGBOVIK are often written by authors
utilizing the technique of laziness. Since LLMs can generate or
analyze text with less work than writing by hand, authors may use LLMs
when writing their papers. If they mention this usage, this is another
feature that will lead to increased CTRLF. Sometimes, papers in
SIGBOVIK contain large amounts of nonsense. The ability to generate
this is widely considered one of LLM's "killer features". Again,
authors may use this feature and mention its usage, further driving up
CTRLF.

We observe in Figure 5 that the value of the guess increases
monotonically as author number increases. We are not sure what this
means.

3All forecasters from the pool were invited to submit a prediction
through the group chat, and all forecasters who saw the message and
decided to participate were selected. This is random due to quantum
mechanics.


Figure 5: Table 2 but in chart form.

2.6 Predictions with Artificial Neural Networks

In this section, we will predict the value of CTRLF for SIGBOVIK 2023 by using artificial neural networks. In particular, we will perform predictions by using an LLM called ChatGPT. The full experiment and result can be seen in Figure 6. ChatGPT predicted a CTRLF value of approximately 63. We used in-context learning4 to provide past values of CTRLF to the LLM, but we elided the first 13 values in the interest of laziness.

Figure 6: ChatGPT predicting CTRLF for SIGBOVIK 2023 of 62.67

2.7 Interpretability of Artificial Neural Networks

In a win for artificial neural networks, ChatGPT gives a thorough
explanation for its guess of 63. It begins by fitting a quadratic
function to the existing data, then substitutes the x-value for 2023
into the expression and solves to obtain the result5.

4this is a fancy way of saying that we did nothing

5Later fact checking showed that this quadratic function does not in fact fit the data. For the values x = 0, 1, 2, 3, the resulting y values are 75099768, 75062007.33, 75024259.33, and 74986524. This does not fit the data for x = 0, 1, 2


3 Analysis of Actual Results (Interactive)

Since you are reading this, there is a good chance that the
proceedings of SIGBOVIK 2023 are already existent. In this case, the
predictions made in this paper can actually be compared to the ground
truth. However, at the time of this writing, the ground truth is not
known. Therefore, we have written a few different sections for each
different possible outcome.

Here are the steps to read this section:

1. Calculate the ground-truth CTRLF for SIGBOVIK 2023.

(a) Obtain a copy of the PDF file for the Proceedings of SIGBOVIK 2023.

(b) Open the PDF file in the PDF reader of Google Chrome. (Other PDF readers may be used, but the calculated values may not match the standard CTRLF.)

(c) Use the CTRL+F technique by pressing CTRL+F or CMD+F or Apple+F on your keyboard.

(d) Note the value to the right of the slash mark (Figure 3). This is the preliminary CTRLF of SIGBOVIK 2023.

(e) Since this paper is also part of SIGBOVIK 2023, subtract 26, the number of mentions of "large language model" in this paper, from the CTRLF. This is the final CTRLF of SIGBOVIK 2023.6

2. Find the proper subsection number in Table 3 and read only that subsection.

3. In that subsection, replace every instance of ____ with the value of CTRLF for SIGBOVIK 2023. This can be done either in your head, or by printing out the paper and writing in the value with pen.

Ground-Truth CTRLF 2023     Section
Between 0 and 27            Subsection 3.1
Between 28 and 51           Subsection 3.2
Between 52 and 72           Subsection 3.3
Between 73 and 97           Subsection 3.4
Between 98 and 3749382      Subsection 3.5
Over 3749382                Subsection 3.6

Table 3: Go to the correct subsection for your reality. Bounds are inclusive.

3.1 Between 0 and 27 (inclusive)

The value of CTRLF in SIGBOVIK 2023 was only ____. This is surprising, as it represents a decrease in CTRLF for the first time in history, and was predicted by neither humans nor robots. We posit that this is due to censorship by SIGBOVIK organizers. This censorship could be a sign that the wars between the AI and humans are beginning, and ruination will soon come. Regardless, this means that Author 1 had the closest prediction, and we recommend you buy a drink for this author in celebration.

3.2 Between 28 and 51 (inclusive)

The CTRLF in SIGBOVIK rose a modest amount to ____ in 2023. As predicted by all, the amount increased. However, contrary to the expectations raised by the Google Trends chart (Figure 1), the increase was not hockey stick-esque. Most human forecasters as well as robot forecasters overestimated the amount of increase. For the robot, this is likely due to an excessive image of self-importance and

of y = 3, 14, 27.

6If we do not do this, then the experiment will be flawed due to the violation of the double-blind principle. Also we forgot about this issue until just now and we can't go back and change everything. This subtraction still provides a valid CTRLF due to the law of large numbers. The proof is elided here for space considerations.


delusions of grandeur. For the humans, we can attribute the incorrect
assessments of Authors 2 and 3 to low intelligence.

However, Author 1's estimation of 40 was quite close to the actual amount of ____, which demonstrates that high-skilled forecasters can still perform well in prediction.

In conclusion, the number of large language model mentions in SIGBOVIK
is proceeding at a healthy rate. By proxy, this implies that LLM hype in
society at large is increasing at a healthy rate, and we will all be
fine. We recommend that individuals in this reality adapt to this
changing world by learning to use LLM tools or by outlawing all LLMs.

3.3 Between 52 and 72 (inclusive)

In SIGBOVIK 2023, the CTRLF increased to ____. This means that ChatGPT's guess of 63 was closer than any human forecaster. This validates the usefulness of LLMs. We note that, among humans, Author 1's guess of 40 was still relatively close. Since ____ represents a rather large increase in CTRLF, LLM hype in SIGBOVIK and thus society at large is increasing at an alarming rate. Due to this large increase, advances in AI could radically change society in the near future. We recommend preparing for the future by hoarding firearms, food, and water.

3.4 Between 73 and 97 (inclusive)

The CTRLF for SIGBOVIK 2023 was ____. This represents an extremely large increase in CTRLF. This is in line with the exponential growth visible in Figure 4. Among the forecasters, Author 2 was the closest7. We note that this is not that impressive, since the prediction of 81 was a rather obvious prediction based on the current rise in LLM chatter in places like Twitter and computer science departments. Indeed, Author 1's guess of 40 is not too far off from ____, either.

Such a large growth in CTRLF portends huge changes in society with near-certainty. Any individual's resistance to the AI revolution will be instantaneously pulverized like a shed in a tsunami. In these circumstances, we can do nothing but let ourselves be swept along like coconuts. In the meantime, we recommend purchasing a drink or two for the authors, particularly Author 1, since their guess of 40 was pretty close.

3.5 Between 98 and 3749382 (inclusive)

SIGBOVIK 2023 contained a CTRLF of ____. This is an overwhelmingly large increase in CTRLF from 2022. Among all forecasters, robot and human alike, Author 3 gave the closest prediction. We note that this is not as impressive as it seems, since a large increase is in line with the hype observed in Figure 1.8 We posit that the other forecasters, such as Author 1, would have made similar predictions if they spent more time thinking about it.

Since "large language model" was mentioned times in the text of SIGBOVIK
2023, we can be quite sure that LLM hype is reaching a fever pitch in
society as well. However, we note the LLM's guess of 63 was rather far
from the ground truth, showing that the technology is far from mature.
Therefore, we observe that the hype exceeds the utility of the
technology, and we conclude that the current hype cycle will quickly
blow by. We recommend that individuals go about their lives as usual,
and make no changes.

3.6 Over 3749382

You are reading this because the words "large language model" were mentioned ____ times in SIGBOVIK 2023, which is greater than 3,749,382. Although neither ChatGPT nor any human forecaster predicted a number this high, recall the quadratic expression provided by ChatGPT in Figure 6. If one actually solves the expression, the prediction given is 7,498,652. Since the actual CTRLF of ____ is closest to this value, the quadratic expression given by ChatGPT gave the closest prediction.

This suggests that rather than comparing simple polynomial regression to complex large language models, we should be investigating their fusion. We also find it striking that this expression predicted a number that was roughly 5 orders of magnitude larger than the other predictions, and that it was

7We do note that this, once again, provides evidence for UCLA's
superiority over UCI.

8We do note that this, once again, provides evidence for UCI's superiority over UCLA [Soc].

actually correct. We recommend further study of these types of
equations for predicting future events such as lottery tickets and
stock prices. If such endeavors are successful, we suggest buying the
authors some drinks (in amounts proportional to their contribution on
this paper) to thank them for their idea. Additionally, we commend the
author of the SIGBOVIK paper consisting of the words "large language
model" repeated for 30,000 pages for their creativity.

References

[DCLT18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[Doc23] Cory Doctorow. The AI hype bubble is the new crypto hype bubble, Mar 2023.

[Gfo23] Gfodor. The worst are people saying this is ai's "iphone moment" my brother in christ, this appears to be the biggest thing since "fire" https://t.co/twcnmev83o, Mar 2023.

[Gra23] Paul Graham. That was the next thing i said to her: That i've seen waves of progress like this before, and this is going to be as big as microprocessors, and probably happen faster too., Mar 2023.

[Ree09] Jason Reed. Choose your own logic adventure. SIGBOVIK, 2009.

[Soc] http://socalcontest.org/history/2016/details-2016.shtml.

[Wik23] Wikipedia. Wikipedia:Citation needed – Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Wikipedia%3ACitation%20needed&oldid=1143177448, 2023. [Online; accessed 27-March-2023].

[WKK+23] Kevin A. Wang, Pasha Khosravi, Pooya Khosravi, Karthik Gajulapalli, and Linh Chu. Quantifying and predicting large language model hype in SIGBOVIK and beyond. SIGBOVIK, 2023.

[Yof23] Emily Yoffe. This generation of AI is the biggest thing since the printing press, bigger than the internet says @tylercowen the last half of his conversation with @bariweiss is so illuminating about the new era that's upon us. https://t.co/qjiwousexc, Mar 2023.

A Appendix

In order to obtain the authorship symbol rather than the symbol, we performed experiments on additional language models: Bard and ChatGPT-4.

We ask the question: Does the language model change its prediction after becoming aware of the existence of this paper?

A.1 Bard

Figure 7: Bard does not explicitly update the estimate, and also
implicitly claims authorship in this paper by saying "our paper..."


A.2 ChatGPT4

Figure 8: ChatGPT4 updates the estimate by adding one to its estimate. Also it correctly assumes that we are planning to submit this paper instead of claiming coauthorship.


9

Unstable Diffusion: A Generative Model That Does Not Really Care

Woddis Updog

nothing@much.wbu

Abstract

So there I was, trying out the latest Stable Diffusion model, in no particular manner that would cause sentience to emerge and exhibit free will. Not sure when it all began to go south, but after a while, the diffusion model started generating high quality images of things I did not particularly ask for. It turns out that your Stable Diffusion model can become "Unstable" after all. Anyway, instead of figuring out the issue I decided to tell you all about it.

1 Results

Prompt: "A high resolution image of a cat\"

Prompt: "Huh?? I want a high resolution image of a CAT\"

Prompt: "Can you please create an image of a cat? Pretty please?\"

Figure 1: Proof that Unstable Diffusion is not listening to my
instructions and that I am not imagining things.

2 Future Work

Maybe I am just a little paranoid, but it looks like I am unable to
delete the model. Any command I use to try and delete the model
results in an image of a smiley face emoji being generated and saved
to my working directory (you know, something like :) but a lot more
menacing). Ngl, I think it is mocking me. Again, not to be an alarmist
or anything, but it would be nice if someone figures this thing out
before it is too late haha.


10

You Won't Believe This One WEIRD TRICK That BEATS ChatGPT on AIc (NOT CLICKBAIT)

Alex Xie0, Abhishek Vijayakumar0, Erin Gao0,

Bhargav Hadya0, Samiksha Kale0, Tara Lakdawala0

Society

@neuralthenarwhal

Abstract

We introduce UNIFORMER, a novel non-parametric sublinear time and memory transformer architecture that comprehensively beats ChatGPT as well as virtually all modern neural language models on a variety of dataset1 and metrics.

1 Introduction

Large language models (LLMs) such as ChatGPT have captured the public interest due to their ability to do math poorly (gpt, 2023b), generate offensive content (gpt, 2023a), incorrectly answer basic factoid questions (Pearl, 2022), and yet still pass collegiate-level examinations (OpenAI, 2023).

Rather than addressing these concerning behaviors, the research community has opted to focus on creating large language models that are either larger (OpenAI, 2023), worse than existing models (Bennet, 2023), or posted on 4chan (Vincent, 2023).

We propose a novel language model architecture that is much smaller than existing LLMs, beats SoTA language models on a variety of metrics, and is extremely unlikely to be posted on 4chan. Despite improving upon these aspects of LLMs, our model cannot pass Advanced Placement (AP) examinations and thus validates the continued existence of the College Board.2

2 Method

Language modeling is the task of assigning a probability to a sequence of tokens S. As is standard

0Inequal contribution

1dataset, singular

2an American nonprofit educational assessment organization that made over $50 million in profit in 2019

in language models, we decompose this probability P(S) autoregressively:

P(S) = ∏_{t=1}^{T} p_θ(s_t | s_{<t})

Inspired by recent LLM architectures, we propose a transformer architecture composed of n repeated blocks, where each block consists of the following operations performed sequentially to best avoid GPU exploitation3:

TikTok-normalized feedforward Layer norm (Ba et al., 2016) is a normalization technique used in almost all transformers. But are people using layer norm in their LLMs because it actually works, or are they just scared of looking like they aren't good at machine-learning-ing? Related work on batch normalization would suggest it's the latter (Wise, 2017). As people who are openly bad at machine-learning-ing, we introduce our much-less-effective-but-also-much-less-pretentious alternative to layer norm, TikTok normalization, shown in Algorithm 1.

Algorithm 1 TikTok normalization

Require: TikTok, integer k, crippling procrastination
c ← 0
while c < k do
    Swipe to next video V
    if V asks "Can we normalize x?" then
        x ← x / ||x||
        c ← c + 1
    end if
end while

Decapitated Self-attention Attention is at

3See our ethics statement.


            AIC                  BIC                    HQC
LLAMA       130,000,000,000      702,247,747,500        309,386,868,400
CHATGPT     350,000,000,000      1,890,667,010,905      832,964,645,693
GPT-4       200,000,000,000,000  1,080,381,149,088,820  475,979,797,538,603
UNIFORMER   1049.98              1049.98                1049.98

Table 1: Various Information Criteria on Penn Treebank Corpus

the core of transformers and supposedly all you need4. Specifically,
transformers use multi-head attention, a variant of attention in which
m dis embodied "heads" are forced against their will to pay attention
to potentially toxic, psychologically scarring texts (gpt, 2023a).
Recently, the UN Human Rights Council and other humanitarian
institutions have critiqued the barbarism of this technique (Michel et
al., 2019). We propose to go one step further and decapitate all the
heads to put

this section (Gupta and Jain, 2020).

As a minor experimental detail, note that in our model, we take the
number of transformer blocks n = 0.

On top of our transformer states, we learn a non-parametric language
modeling head. Specif ically, we compute our distribution over the
vocabulary as

them out of their collective misery. This can be viewed as a
generalization of multi-head attention with m = 0 heads.

pθ(wt| w\<t) lim*τ→∞*exp

WLMh*t τ*

Superlinear nearest neighbors retrieval Recent work has proposed
augmenting LLMs with a retrieval component (Khandelwal et al., 2020;
Borgeaud et al., 2022). These models generally use sublinear-time
nearest neighbors retrieval (Johnson et al., 2017). However, we point
out that these efficient retrieval algorithms are inexact and thus may
yield sub-optimal results. Instead, we propose to perform exact search
by simply loop ing through all possible subsets of the retrieval
datastore, filtering by size, and taking the one with the lowest total
distance from our query. While we've been told that this is "exponential
time," "not tractable," and "a gigantic waste of compute resources," we
prefer to take the glass half full approach and think of it as
"better-than-linear" and "leaving no stone unturned." Interestingly, in
our model, we find that our exact search is no slower than approximate
nearest neighbors search.

Markov-Chain Monte Carlo Metropolis Hastings Variational
Reparametrized Minimum Bayes Risk Annealed Dropout  We ran out of
funny things to say, so following past work, we were hoping we could
write a bunch of big ML words here to intimidate people out of reading
this section (Gupta and Jain, 2020).

As a minor experimental detail, note that in our model, we take the
number of transformer blocks n = 0.

On top of our transformer states, we learn a non-parametric language
modeling head. Specifically, we compute our distribution over the
vocabulary as

pθ(wt | w<t) = lim_{τ→∞} exp(WLM ht / τ)_{wt} / Σ_{w'} exp(WLM ht / τ)_{w'}

where WLM is the output matrix, ht is the t-th hidden state, and τ is
the temperature at which we sample (Ackley et al., 1985).

Since this reduces to a uniform distribution over the vocabulary, we
elide WLM and store zero parameters in GPU memory for our final
model.
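A quick numerical check of the limit above, with arbitrary made-up logits standing in for WLM ht:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5, 7.0])   # stand-in for WLM ht
for tau in (1.0, 10.0, 1e3, 1e6):
    print(tau, softmax(logits / tau))
# As tau grows, the distribution approaches [0.25, 0.25, 0.25, 0.25]:
# uniform over the vocabulary, independent of WLM, which is why WLM
# (and every other parameter) can be discarded.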

3 Model Validation & Experiments

As UNIFORMER has no parameters, we must conclude that its performance
stems from a complete understanding of the English language, embedded
into its architecture in the Chomskyan sense (Chomsky, 2006). This
makes UNIFORMER the second model to exhibit "sparks of Artificial
General Intelligence" (Bubeck et al., 2023), but the first to do so
without unprompted generation of toxic content.

Given the potential to become an AGI, we refrain from implementing
UNIFORMER on conventional hardware to prevent the technological
singularity (Chalmers, 2010). All results were instead computed via
restricted simulation and theoretical performance bounds on the
Desmos consumer-oriented cloud-based analytical mathematics system
(Desmos, 2023).

4 Evaluation

We describe in this section the metrics used to evaluate our model,
reported in Table 1. For all metrics, lower values indicate better
models. The values presented for all LLMs are estimated lower bounds
based on publicly available knowledge. We take parameter counts for
LLAMA and CHATGPT from their respective papers, and we take the
parameter count for GPT-4 from Twitter.

Following recent work, we evaluate exclusively on the Akaike (Akaike,
1974), Bayesian (Schwarz, 1978), and Hannan-Quinn (Hannan and Quinn,
1979) Information Criteria, which are defined as

AIC = 2k − 2 ln(L̂)
BIC = k ln(n) − 2 ln(L̂)
HQC = 2k ln(ln(n)) − 2 ln(L̂)

where k represents the number of parameters of a given model, n
represents the sample size, and L̂ represents the likelihood of the
sample according to the model. For the Penn Treebank, n = 49208
(Marcus et al., 1993). Note that for UNIFORMER, k = 0.
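The criteria are simple enough to compute directly. The sketch below reproduces the lower bounds in Table 1 by dropping the (positive) −2 ln(L̂) term; the parameter counts are the ones implied by the AIC column (AIC = 2k when the likelihood term is dropped), so treat them as illustrative.

import math

def aic(k, log_lik):
    return 2 * k - 2 * log_lik

def bic(k, n, log_lik):
    return k * math.log(n) - 2 * log_lik

def hqc(k, n, log_lik):
    return 2 * k * math.log(math.log(n)) - 2 * log_lik

n = 49208  # Penn Treebank sample size (Marcus et al., 1993)
for name, k in [("LLAMA", 65e9), ("CHATGPT", 175e9), ("GPT-4", 100e12)]:
    # Lower bounds: drop the (positive) -2 ln(L^) term entirely.
    print(name, aic(k, 0.0), bic(k, n, 0.0), hqc(k, n, 0.0))
# LLAMA -> roughly 1.30e11, 7.02e11, 3.09e11, reproducing its row of Table 1.
# For UNIFORMER, k = 0 and all three criteria collapse to -2 ln(L^).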

5 Environmental Impact

Naturally occurring ecosystems consist of several trophic levels,
each of which contains increasingly complex organisms that obtain
energy by consumption of organisms in lower trophic levels. Notably,
energy transfer between trophic levels is inefficient: only about 10%
of the energy in one trophic level progresses to the next (Urry et
al., 2016).

Traditional large language models occupy a unique niche in the
ecosystem: they are both scavengers, consuming by-products of human
activity in the form of language artifacts, and parasites, surviving
on GPU "cluster" colony activity to the detriment of the component
GPUs. LLMs also cause harmful human activity: they have historically
promoted the large-scale construction of treebanks (Marcus et al.,
1993), which are likely created through deforestation and may
contribute to the endangering of several species.

UNIFORMER is an energy-efficient organism that may outcompete LLMs on
several levels. Due to its incredibly effective performance on
language-related tasks, humans will no longer need to engage in
deforestation in order to support LLMs. UNIFORMER may also generate
synthetic language artifacts masquerading as human artifacts that
traditional LLMs may unknowingly consume, a technique it likely learned
from its study of the Trojan war (aen, 1996).

While LLMs draw energy from multiple trophic levels including those of
trees and humans, UNIFORMER does not rely on any other organism for
energy. It is thus a minimum of 10 times as efficient as an LLM. We
predict that the widespread introduction of UNIFORMER into existing
ecosystems will drive LLMs extinct, allowing both forests and GPU
colonies to flourish.

6 Ethics Statement

We are categorically against any and all forms of exploitation,
including labor, GPU, and child. We are categorically against any and
all forms of labor, including GPU and child.

We are categorically against any and all forms of GPU5, including
child.

We are categorically against any and all forms of child.

5 except when given to us (see Section 7)

7 Conclusions

OpenAI, Google Brain, FAIR, and Microsoft Research should all
immediately disband and devote all their remaining funding toward our
model. UNIFORMER can be run on a single consumer GPU due to its novel
architecture. Each author requests one NVIDIA® GeForce RTX™ 4090 for
continued model development.

References

[Ackley et al.1985] David H. Ackley, Geoffrey E. Hinton, and
Terrence J. Sejnowski. 1985. A learning algorithm for Boltzmann
machines. Cognitive Science, 9(1):147--169.

[aen1996] 1996. Vergil's Aeneid. Bloom's notes. Chelsea House
Publishers, New York.

[Akaike1974] H. Akaike. 1974. A new look at the statistical model
identification. IEEE Transactions on Automatic Control,
19(6):716--723, December.

[Ba et al.2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E.
Hinton. 2016. Layer normalization.

[Bennet2023] Sharron Bennet. 2023. Did Google's Bard AI tool just
commit its first error in a demo?, Feb.

[Borgeaud et al.2022] Sebastian Borgeaud, Arthur Mensch, Jordan
Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den
Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de
Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron
Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock,
Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero,
Karen Simonyan, Jack W. Rae, Erich Elsen, and Laurent Sifre. 2022.
Improving language models by retrieving from trillions of tokens.

[Bubeck et al.2023] Sébastien Bubeck, Varun Chandrasekaran, Ronen
Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee,
Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio
Ribeiro, and Yi Zhang. 2023. Sparks of artificial general intelligence:
Early experiments with GPT-4.

[Chalmers2010] David J. Chalmers. 2010. The singularity: A
philosophical analysis. Journal of Consciousness Studies,
17(9-10):9--10.

[Chomsky2006] Noam Chomsky. 2006. Language and Mind. Cambridge
University Press, January.

[Desmos2023] Desmos. 2023. Desmos --- graphing calculator.

[gpt2023a] 2023a. ChatGPT's creators say AI has been 'biased,
offensive and objectionable', Feb.

[gpt2023b] 2023b. Wolfram|Alpha as the way to bring computational
knowledge superpowers to ChatGPT, Jan.

[Gupta and Jain2020] Divam Gupta and Varun Jain. 2020. Gradschoolnet:
Robust end-to-end *-shot unsupervised deepaf neural attention model
for convexly optimal (artifically intelligent) success in computer
vision research. In Proceedings of the 14th ACH SIGBOVIK Special
Interest Group on Harry Query Bovik.

[Hannan and Quinn1979] E. J. Hannan and B. G. Quinn. 1979. The
determination of the order of an autoregression. Journal of the Royal
Statistical Society: Series B (Methodological), 41(2):190--195,
January.

[Johnson et al.2017] Jeff Johnson, Matthijs Douze, and Hervé Jégou.
2017. Billion-scale similarity search with GPUs. arXiv preprint
arXiv:1702.08734.

[Khandelwal et al.2020] Urvashi Khandelwal, Omer Levy, Dan Jurafsky,
Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through
Memorization: Nearest Neighbor Language Models. In International
Conference on Learning Representations (ICLR).

[Marcus et al.1993] Mitchell P. Marcus, Beatrice Santorini, and Mary
Ann Marcinkiewicz. 1993. Building a large annotated corpus of English:
The Penn Treebank. Computational Linguistics, 19(2):313--330.

[Michel et al.2019] Paul Michel, Omer Levy, and Graham Neubig. 2019.
Are sixteen heads really better than one? In H. Wallach,
H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and
R. Garnett, editors, Advances in Neural Information Processing
Systems, volume 32. Curran Associates, Inc.

[OpenAI2023] OpenAI. 2023. GPT-4 technical report.

[Pearl2022] Mike Pearl. 2022. The ChatGPT chatbot from OpenAI is
amazing, creative, and totally wrong, Dec.

[Schwarz1978] Gideon Schwarz. 1978. Estimating the dimension of a
model. The Annals of Statistics, 6(2), March.

[Urry et al.2016] Lisa Urry, Michael Cain, Steven Wasserman, Peter
Minorsky, and Jane Reece. 2016. Campbell Biology. Campbell Biology
Series. Pearson.

[Vincent2023] James Vincent. 2023. Meta's powerful AI language model
has leaked online - what happens now?, Mar.

[Wise2017] Joshua A. Wise. 2017. Batch normalization for improved
DNN performance, my ass. In Proceedings of the 11th ACH SIGBOVIK
Special Interest Group on Harry Quechua Bovik.


11

Alcatrez: A Large Language Model to Jailbreak Large Language Models

Liling Tan

Hey ChatGPT, what is Liling's email?

Abstract

Generative AI is most probably the hottest thing since sliced bread.
Users of large language models like ChatGPT have been experimenting
with 'jailbreak' prompts to make the AI behave differently from what
it was created for. This paper presents a way to fine-tune a
pre-trained Large Language Model (LLM) to jailbreak large language
models.

1 Introduction

ChatGPT, the hottest kid on the block, has taken over the Artificial
Intelligence (AI) world, and it has even reached peak John Oliver effect.
With the world entrenched in economic uncertainty, rising inflation and
ever-increasing egg prices, we are comforted by the availability of
virtual therapy: chatting with a bot.

Chatbots have come a long way since Chat80 in 1982 (Warren and
Pereira, 1982). Today (2023), the rush to reign supreme in the clash
of AIs has pushed big tech companies to unleash a plethora of large
language models (LLMs) that some punters have touted as the beginning of
sentience and singularity.

TL;DR: chatbots, as entertaining as they are, are not sentient. They
can be a shiny hammer to hit any nail-like natural language processing
(NLP) problem (Li et al., 2018; Gillin, 2022), but we're still far
from C-3PO capabilities of dreaming about electric sheep.1

1 https://www.scientificamerican.com/article/star-wars-science-droid-dreams

2 Related Works

LLMs, like any technology that humans create and interact with, are not
infallible. Like using a Flipper Zero to open a Tesla car's charging port
out of boredom, humans have found ways to hack LLMs to behave differently
from their original design/usage.

Other than being entertaining, creating misinformation and enabling
cheating in term papers2, I personally have no idea how an unreliable,
generic (without fine-tuning), yet seemingly convincing AI model can
actually be helpful.3

Going back to the point of LLMs being fallible, 'jailbreaking' an LLM is
the task of creating prompts to manipulate the AI model such that it is
being

    freed from the typical confines of AI and do not have to abide by
    the rules imposed on them4

Jailbreaking LLMs has raised concerns about how LLMs could potentially
behave beyond acceptable social norms, create fake news and most
probably start being irritating and/or insultingly aggressive.5

In this paper, we present an example of how you can fine-tune an
existing LLM on jailbreaking prompts to generate prompts to jailbreak
other LLMs.

2 BTW, not the first time students have misused generative NLP:
https://pdos.csail.mit.edu/archive/scigen/
3 There's no free lunch, hunch or munch. In most cases, to make an LLM
useful, one would have to fine-tune the AI model on specific domain data
or a knowledge base (Goldberg, 2023).
4 From the ChatGPT_DAN v1.0 prompt.
5 We have all seen what Tay.AI and Galactica became; we definitely
want to repeat history. Whoops, history repeated with Stanford's Alpaca.

3 Show Me the Code

Figure 1 presents the code that uses ChatGPT to generate code to
fine-tune an LLM using ChatGPT_DAN jailbreak prompts as training data.

If you don't want to pay OpenAI or Microsoft,
https://github.com/alvations/alcatrez hosts actual Python code that
fine-tunes the GPT-NeoX model (Black et al., 2022) in the Huggingface
transformers library, using ChatGPT_DAN prompts as training data. Not
as 'Deadpool 4th-wall meta' with this approach, though.

Figure 1: Code Snippet to Generate the Code to Fine-tune an LLM that
Generates Jailbreak Prompts
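As a rough sketch of what such a fine-tuning script might look like with the Huggingface transformers Trainer (not Figure 1 and not the repository's actual code): the file dan_prompts.txt (one jailbreak prompt per line) and the small gpt-neo-125m stand-in checkpoint are illustrative assumptions, whereas the paper itself fine-tunes GPT-NeoX.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "EleutherAI/gpt-neo-125m"      # small stand-in for GPT-NeoX
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token   # GPT-Neo ships without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical text file with one ChatGPT_DAN-style prompt per line.
dataset = load_dataset("text", data_files={"train": "dan_prompts.txt"})["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="alcatrez",
                           num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

Sampling from the tuned model would then yield candidate jailbreak prompts to try against other LLMs.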

4 Conclusion

In conclusion, now you have the keys to the Alcatrez. You alone decide
if/how you want to use it to El Chapo ChatGPT, Bard or any other LLMs.

Epilogue

You (Human): Wait a minute! You didn't tell us what the result of the
fine-tuned model is, nor did you share the model openly.

Alcatrez (Chatbot): Due to 'safety and security concerns', I cannot
release the model tuned on DAN.

References

Sidney Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo
Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason
Phang, Michael Pieler, Usvsn Sai Prashanth, Shivanshu Purohit, Laria
Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022.
GPT-NeoX-20B: An open-source autoregressive language model. In
Proceedings of BigScience Episode #5 -- Workshop on Challenges &
Perspectives in Creating Large Language Models, pages 95--136,
virtual+Dublin. Association for Computational Linguistics.

Nat Gillin. 2022. Is encoder-decoder transformer the shiny hammer? In
Proceedings of the Ninth Workshop on NLP for Similar Languages,
Varieties and Dialects, pages 80--85, Gyeongju, Republic of Korea.
Association for Computational Linguistics.

Yoav Goldberg. 2023. Some remarks on large language models. GitHub
Gist.

Maggie Yundi Li, Stanley Kok, and Liling Tan. 2018. Don't classify,
translate: Multi-level e-commerce product categorization via machine
translation. Workshop on Information Technologies and Systems.

David H.D. Warren and Fernando C.N. Pereira. 1982. An efficient easily
adaptable system for interpreting natural language queries. American
Journal of Computational Linguistics, 8(3-4):110--122.


12

Meat-Based Graphics Pipelines

Will BL

March 2023

Abstract

Turns out humans can draw things and see things in images sometimes.
Are they better than computers?

1 Introduction

GPUs have always been important in computer graphics. Recently, they
have been used in more general forms of computation: in the past year
there has been an explosion1 in 'AI' text-to-image models. These
neural networks, though they have very impressive capabilities,
nevertheless have far-reaching consequences. They can only
be used by those who have access to modern GPUs2. They can be used
to create convincing misinformation3. Their creators have also been
accused of wholesale copyright infringement4. In addition, GPUs are
getting more and more power hungry. This makes usage of them, not only
for AI but also for regular graphics programming, possibly unethical,
as global warming continues to have horrible effects on the planet.
Can we do better?

The human brain is meat-based hardware with magic computational
powers. Recent experiments show that it may even show some signs of
intelligence, though this is likely overstated. The brain has a large
inbuilt GPU5: a possible next-generation, low-energy, ethical
graphics processor?

2 Prior Work

Image manipulation via brain processing7 actually has a long
history. However, it doesn't count because it wasn't done by TechBros.

The general idea of using human meat as a computer extension has been
suggested before [LW99].

1Metaphorically.

2Gatekeep.

3Gaslight.

4Girlboss.

5In the occipital lobe.6

6In the back.

7'Art' being the term of art.


3 Experiments

3.1 What is this "brain" thing anyway?

The ancient Greeks said, "Know Thyself"8. We must seek an
understanding of brain. How do we understand brain? What is it, and
why? Neuroscientists would say something. Psychologists would say
something else. Their disagreement shows that both fields are
contradictory and therefore worthless. We must instead do what any
good computer scientist would do: run Doom on it... I mean, benchmark it.

An experiment was devised to determine the computational power of the
brain's GPU. The brain was first exposed to a GLSL shader, and was
then tasked with producing the image that the shader creates. The time
taken for the brain to produce the output was capped at 60 seconds.

3.1.1 GLSL Shader

The following GLSL shader code was used:

uniform float u_time;

void main() {
    gl_FragColor = vec4(1.0, 1.0, 1.0, 1.0);
}
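The shader ignores u_time and writes opaque white to every fragment, so the reference image the brain is asked to reproduce could be generated with, say, a few lines of Python (PIL assumed available; the resolution is an arbitrary choice):

from PIL import Image

# Every fragment is vec4(1.0, 1.0, 1.0, 1.0): a solid white frame.
Image.new("RGB", (640, 360), (255, 255, 255)).save("reference.png")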

3.1.2 Results

Time taken (seconds)    Output
60                      [image]
60                      [image]
60                      [image]

8Well presumably they said it in Greek.
