## Additional Resources

## Groping and Experiment

Alexander Bain's concept of learning by “groping and experiment” is a pivotal aspect of his work in psychology and philosophy, particularly in his approach to understanding how individuals acquire knowledge and develop skills.

## Perceptron – the earliest learning neural network model

Inspired by the McCulloch-Pitts neuron, the perceptron was developed by Frank Rosenblatt in 1958. The perceptron is a type of artificial neuron that can learn to classify inputs into different categories by adjusting its weights based on error feedback. This learning capability marked a significant advancement from the fixed McCulloch-Pitts model, enabling the perceptron to adapt to new data.
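The error-feedback idea can be illustrated with a minimal sketch. It assumes the classic error-correction update (weights change only when a prediction is wrong); the function names and the toy AND dataset are hypothetical and are not taken from Rosenblatt's work.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Train a single perceptron with an error-correction rule (illustrative sketch)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            prediction = 1 if activation > 0 else 0
            error = target - prediction  # -1, 0, or +1
            # The weights move only when the prediction was wrong (error != 0).
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


def predict(weights, bias, x):
    """Classify one input with the learned weights."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Hypothetical toy data: the logical AND function, which is linearly separable.
AND_INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND_LABELS = [0, 0, 0, 1]

weights, bias = train_perceptron(AND_INPUTS, AND_LABELS)
predictions = [predict(weights, bias, x) for x in AND_INPUTS]
```

Because AND is linearly separable, the rule settles on a separating set of weights after a few passes over the data.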

## Individuality

The essence of trial-and-error learning, encapsulated in Edward Thorndike's “Law of Effect,” is a foundational concept in behavioral psychology.

## Conditioned Reflexes

The term “reinforcement” first appeared in an English translation of Pavlov's work on conditioned reflexes. This introduction of the term was significant because it helped bridge Pavlov's pioneering research in Russia with the growing field of behaviorism in the English-speaking world.

## First Ideas of Trial-and-Error Learning

Alexander Bain’s discussion of learning by “groping and experiment”

## First Modern Backpropagation

The milestone work that introduced a method for analyzing and minimizing rounding errors in numerical computations using Taylor series expansion. It forms the basis of modern backpropagation.

## Gemini

The Gemini paper by Google DeepMind introduces advanced multimodal AI models capable of processing text, images, audio, and video within a single framework.

## Human or Not

An online game inspired by the Turing test, measuring AI chatbots' ability to mimic humans and humans' ability to identify bots, attracted over 1.5 million users in a month.

## Llama

The LLaMA (Large Language Model Meta AI) paper outlines Meta AI's development of large language models that excel in various NLP tasks.

## ChatGPT

Launched in November 2022, ChatGPT quickly became the fastest-growing application in history, reaching 100 million unique users in record time. It also marks the beginning of the LLM race.

## GPT-3.5

GPT-3.5, an improvement of OpenAI's GPT-3, features enhanced capabilities in natural language understanding and generation.

## GPT-3

An upgrade to GPT-2. The GPT-3 model, with 175 billion parameters, achieved state-of-the-art performance on many NLP tasks by leveraging few-shot, one-shot, and even zero-shot learning.

## GPT-2

An upgrade to GPT-1. It demonstrated that large-scale language models, such as GPT-2, can perform a variety of tasks without explicit supervision by leveraging vast amounts of text data for training.

## Generative Adversarial Networks

Two neural networks that compete with each other. The architecture consists of a generator network that tries to fool a discriminator network, whose task is to distinguish generated data (for example, images) from real data.

## First CNN on GPU

The first work that parallelized the computations required for training a CNN using a GPU (Graphics Processing Unit).

## ImageNet challenge

The creation of a large image dataset that allowed the machine learning community to shift its focus from data acquisition and labelling to the development of algorithms.

## Feature extraction by neural networks

The first work to show that (deep) neural networks can be used as feature extractors. The values of the hidden neurons were used as an embedding of the input data, and similar data clustered closely in the learned representation.

## LeNet

An early instance of a successful gradient-based learning technique. It was also deployed commercially for reading bank checks (several million checks per day). The paper provides a detailed description of the neural architecture used, with step-by-step derivations.

## Long Short Term Memory Networks

A milestone work that tackled the vanishing gradient problem in training deeper neural architectures. It made it possible to train deeper architectures than ever before.

## A survey of algorithmic methods for partially observed Markov decision processes

In his 1991 paper “A Survey of Algorithmic Methods for Partially Observed Markov Decision Processes” in Annals of Operations Research, W. Lovejoy reviews algorithmic approaches for solving POMDPs, enhancing their understanding and application in decision-making under uncertainty.

## First version of dropout

A generalization of dropout, a popular regularization technique. It can be shown that dropout is a particular case of this method. However, it was not applied to the optimization of neural networks.

## Learning from delayed rewards

Full integration of dynamic programming with learning methods.

## Conv Nets by Yann LeCun

The first application of the backpropagation algorithm to a real-world problem with images fed directly as input (previous approaches relied on supplying feature vectors).

## New training regime of Adaline algorithms

Proposed a new training algorithm for Adaline.

## Boltzmann Machines

Inspired by their construction in statistical physics, they were popularized in the cognitive sciences by Geoffrey Hinton, Terry Sejnowski and Yann LeCun. Although their practical usability remained limited, under some conditions they remain useful.

## First Conv Neural Networks

The first neural architecture able to recognize stimulus patterns based on the geometrical similarity of their shapes without being affected by their positions.

## GPT-4

The GPT-4 Technical Report by OpenAI describes GPT-4 as a multimodal model that processes text and image inputs to generate text outputs.

## BERT

The authors proposed an architecture designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.

## GPT-1

The authors demonstrate that large gains on tasks such as textual entailment, question answering, semantic similarity assessment etc. can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text.

## Fast Learning Algorithm for Deep Belief Nets

The authors present a layer-by-layer training method for Deep Belief Networks, using Restricted Boltzmann Machines to significantly enhance training efficiency and performance through unsupervised pre-training and fine-tuning.

## Transformer

The Transformer model employs an attention mechanism to process entire sequences simultaneously, enabling better parallelization of computation.
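As a rough illustration of the attention mechanism, here is a minimal pure-Python sketch of scaled dot-product attention for a single head. It assumes unbatched lists of vectors and omits the learned projections, masking, and multi-head structure of the full model; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query: a weighted average of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        attn = softmax(scores)
        # Mix the value vectors according to the attention weights.
        outputs.append(
            [sum(a * v[j] for a, v in zip(attn, values)) for j in range(len(values[0]))]
        )
    return outputs


# Hypothetical example: two identical keys give equal attention weights,
# so the output is the plain average of the two value vectors.
attended = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 1.0], [1.0, 1.0]],
    values=[[0.0, 2.0], [2.0, 0.0]],
)
```

Each query is handled independently of the others, which is what makes the computation easy to parallelize across sequence positions.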

## ResNet

The authors extended the results presented in the Highway Networks paper to networks with 1000 layers. To date, this is the most cited paper in machine learning.

## Highway networks

First successful training of very deep neural networks, a precursor for ResNet. By introducing gated shortcuts, highway networks enabled the flow of information across layers without degradation, allowing for the effective training of networks with unprecedented depth.
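The gated shortcut can be sketched in a few lines. This is an illustration only, assuming the common formulation y = T(x)·H(x) + (1 − T(x))·x, where H is the layer's transformation and T is a gate with values between 0 and 1; `highway_layer`, `transform`, and `gate` are hypothetical names, and a real layer would learn both functions.

```python
def highway_layer(x, transform, gate):
    """Blend the transformed input with the untouched input, element by element."""
    h = transform(x)  # H(x): the usual layer output
    t = gate(x)       # T(x): how much of H(x) to let through, each value in [0, 1]
    return [ti * hi + (1.0 - ti) * xi for xi, hi, ti in zip(x, h, t)]


# Hypothetical stand-ins for learned functions.
def scale_by_ten(v):
    return [10.0 * a for a in v]


x = [1.0, -2.0]
carried = highway_layer(x, scale_by_ten, lambda v: [0.0] * len(v))      # gate closed
transformed = highway_layer(x, scale_by_ten, lambda v: [1.0] * len(v))  # gate open
blended = highway_layer(x, scale_by_ten, lambda v: [0.5] * len(v))      # half and half
```

With the gate closed the layer passes its input through unchanged, which is what lets information cross many layers without degradation.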

## Sequence-to-Sequence

The Seq2Seq paper by Sutskever, Vinyals, and Le introduced a model that maps sentences to vector representations and back using LSTM networks, setting a new standard for machine translation.

## Chatbot Arena

Introduces an open platform for evaluating LLMs based on human preferences.

## ImageNet

Alex Krizhevsky et al. managed to significantly improve on the previous results on the ImageNet challenge through parallelization of the training of their network on multiple GPUs.

## Ising’s model

The first non-learning recurrent NN architecture (the Ising model or Lenz-Ising model) was introduced and analyzed by physicists Ernst Ising and Wilhelm Lenz in the 1920s. It settles into an equilibrium state in response to input conditions, and is the foundation of the first well-known learning RNNs.

## McCulloch-Pitts Neuron

The McCulloch-Pitts neuron, introduced in 1943, is one of the earliest formal models of a neuron. It uses binary outputs and weighted inputs to mimic neural processing and decision-making. It laid the foundation for artificial neural networks, logical function computation, and modern neural network architectures used in applications such as pattern recognition and control systems.
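A simplified textbook-style sketch of such a threshold unit shows how fixed weights and a threshold compute logical functions. The weights, thresholds, and function names below are illustrative assumptions, not the notation of the 1943 paper.

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Fire (return 1) when the weighted sum of binary inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0


def logical_and(a, b):
    return mcculloch_pitts((a, b), (1, 1), threshold=2)


def logical_or(a, b):
    return mcculloch_pitts((a, b), (1, 1), threshold=1)


def logical_not(a):
    # A single inhibitory (negative) weight flips the input.
    return mcculloch_pitts((a,), (-1,), threshold=0)


# Truth table for AND over all binary input pairs.
and_table = [logical_and(a, b) for a in (0, 1) for b in (0, 1)]
```

Nothing here is learned: the weights and thresholds are chosen by hand, which is the limitation the perceptron later removed.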

## Theory of neural-analog reinforcement systems and its application to the brain-model problem

Marvin Minsky explores the foundational principles of neural networks and reinforcement learning, and their application to modeling brain function.

## Dartmouth Workshop on AI

The Dartmouth Workshop on AI is considered the founding event of Artificial Intelligence as a field (that's also when the term AI was proposed). A group of around a dozen scientists made important contributions to the then-nascent field throughout this summer workshop.

## A Markovian decision process

In his 1957 paper “A Markovian Decision Process” in the Journal of Mathematics and Mechanics, R. Bellman introduces the framework of Markov decision processes, significantly advancing decision-making under uncertainty.
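To make the framework concrete, here is a minimal sketch of value iteration on a tiny Markov decision process. The two-state problem, its rewards, and the discount factor are made up for illustration, and the code is not drawn from Bellman's paper.

```python
# Hypothetical MDP: TRANSITIONS[state][action] = [(probability, next_state, reward), ...]
TRANSITIONS = {
    "low": {
        "wait": [(1.0, "low", 0.0)],
        "work": [(0.8, "high", 1.0), (0.2, "low", 0.0)],
    },
    "high": {
        "wait": [(1.0, "high", 2.0)],
        "work": [(1.0, "low", 3.0)],
    },
}


def action_value(values, outcomes, gamma):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)


def value_iteration(transitions, gamma=0.9, tol=1e-10):
    """Apply the Bellman optimality update until the values stop changing."""
    values = {s: 0.0 for s in transitions}
    while True:
        new_values = {
            s: max(action_value(values, outs, gamma) for outs in actions.values())
            for s, actions in transitions.items()
        }
        if max(abs(new_values[s] - values[s]) for s in values) < tol:
            return new_values
        values = new_values


optimal_values = value_iteration(TRANSITIONS)
```

Each state's value is the best achievable expected discounted reward from that state, and it depends only on the current state, not on the history that led there.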

## Steps toward Artificial Intelligence

Marvin Minsky discusses several issues around RL, such as the “credit assignment problem” (1961): “How do you distribute credit for success among the many decisions that may have been involved in producing it?”

## First Deep Learning (8 layer networks)

The use of neural networks with more than 2 layers was unthinkable due to the lack of training methods. Networks in the 60s and 70s were not trained with the current backpropagation method, and while it was possible to tune weights in 1-layer networks, it was prohibitively difficult to do so in deeper ones.

## Adaptive Ising Model

Introduced a method for pattern recognition and sequence learning using self-organizing networks composed of threshold elements. This work laid the foundation for the development of self-organizing neural networks and unsupervised learning algorithms.

## Deep Blue Defeats Garry Kasparov

Chess was considered an activity that requires intelligence and at which only humans can excel. The defeat of Garry Kasparov by Deep Blue showed that it is possible to build a highly specialised machine that can defeat the best human players, even if only by performing a greedy tree search of possible moves.

## Backpropagation applied to Deep Learning

The authors demonstrate that using deep neural networks with many layers and numerous neurons per layer can significantly enhance the accuracy of handwritten digit recognition on the MNIST dataset.

## Word2Vec

A shallow two-layer neural network trained to reconstruct the linguistic contexts of words. It was the first word embedding capable of preserving context: it turned out that, for example, France – Paris = Italy – Rome.

## The Turing Machine

The Turing machine, introduced by Alan Turing in 1936, is a theoretical model that formalizes the concept of computation, using an infinitely long tape and a set of rules to simulate any computer algorithm, thereby defining the limits and capabilities of what can be computed.
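A tape-and-rules machine is small enough to simulate directly. The sketch below is illustrative: the rule format, the state names, and the bit-flipping example machine are assumptions made here, and the step limit exists only to keep the simulation finite.

```python
def run_turing_machine(rules, tape, state="start", halt_state="halt",
                       blank="_", max_steps=10_000):
    """Simulate a one-tape machine.

    rules maps (state, symbol) -> (symbol_to_write, move, next_state),
    where move is "L" or "R". The tape is unbounded in both directions.
    """
    cells = dict(enumerate(tape))  # sparse tape: position -> symbol
    head = 0
    steps = 0
    while state != halt_state:
        if steps >= max_steps:
            raise RuntimeError("step limit reached without halting")
        symbol = cells.get(head, blank)
        write, move, state = rules[(state, symbol)]
        cells[head] = write
        head += 1 if move == "R" else -1
        steps += 1
    if not cells:
        return ""
    lo, hi = min(cells), max(cells)
    return "".join(cells.get(i, blank) for i in range(lo, hi + 1)).strip(blank)


# Hypothetical example machine: invert every bit, then halt at the first blank.
FLIP_RULES = {
    ("start", "0"): ("1", "R", "start"),
    ("start", "1"): ("0", "R", "start"),
    ("start", "_"): ("_", "R", "halt"),
}

flipped = run_turing_machine(FLIP_RULES, "1011")
```

The whole machine is the rule table: changing the table changes the algorithm, while the simulator stays the same.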

## Experimental Psychology

“Experimental Psychology” by Robert Sessions Woodworth and Harold Schlosberg, published in 1954, describes animal behavior using the term “trial-and-error”, which was first introduced by C. Lloyd Morgan in 1894.

## Linear regression

Linear regression is still a frequently used approximation for explaining linear dependencies between phenomena. It is the first off-the-shelf method in, for example, econometrics. Linear regression is a statistical model which estimates the linear relationship between a scalar response and one or more explanatory variables.
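For a single explanatory variable, the least-squares estimate has a closed form. The following is a minimal sketch with made-up data points; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one explanatory variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

With several explanatory variables the same idea applies, but the estimate is obtained by solving a linear system rather than a single division.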

## First successful learning of deeper architectures

Up until this point it was widely assumed that it was close to impossible to train neural networks (or perceptrons) with more than 2 layers. It was also the first work that employed an iterative weight update rule, although not the currently used backpropagation.

## First analysis of neuronal activity (non-learning)

Provided the first mathematical analysis of the neural activity which was proposed in 1943. S.C. Kleene explored how nerve nets and finite automata can represent events. This work significantly contributes to the fields of computer science and neuroscience by establishing foundational concepts in automata theory and neural networks.

## Dynamic Programming and a new formalism in the calculus of variations

In his 1954 paper “Dynamic Programming and a New Formalism in the Calculus of Variations” in the Proceedings of the National Academy of Sciences, R. Bellman introduces dynamic programming as a method for solving complex optimization problems. This approach significantly advances the calculus of variations by providing a systematic way to break down problems into simpler subproblems.

## Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) updates model parameters using single data points or small batches, which makes it computationally cheaper and faster on large datasets than traditional gradient descent. Its inherent randomness can help it escape local minima and improve generalization. Enhancements such as mini-batches, momentum, and adaptive methods further improve its performance in applications such as deep learning, reinforcement learning, and online learning.
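The single-data-point update can be sketched on a toy problem: fitting a line with squared error. The dataset, learning rate, and function name are assumptions chosen for illustration; practical implementations add mini-batches, momentum, or adaptive step sizes.

```python
import random


def sgd_linear_fit(data, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x + b by updating on one (x, y) sample at a time."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a fresh random order each epoch
        for x, y in data:
            error = (w * x + b) - y
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= lr * error * x
            b -= lr * error
    return w, b


# Hypothetical noiseless data generated from y = 2x + 1.
samples = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0, 4.0)]
w, b = sgd_linear_fit(samples)
```

Each update touches one sample, so the cost per step does not grow with the size of the dataset.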

## Turing Proposes his Test for thinking

In this work, Alan Turing pondered the question of whether machines can think. The novelty of his approach was to avoid the quicksand of defining thinking, and instead to propose an equivalent 'test'. The idea was to have a human interrogator communicate with the interrogatees through machine-written text and, based on that interaction, judge which of the interrogatees is a man and which is a woman (in the original proposition). The same idea translates immediately to distinguishing between a human and a machine. Our approach in the Turing Game expands Turing's idea.

## Intelligent Machinery

Alan Turing, in his exploration of artificial intelligence and machine learning, introduced the concept of a system that operates based on principles similar to the “Law of Effect,” which he referred to as the “pleasure-pain system”.

## Gradient descent technique

Gradient descent is a method for unconstrained mathematical optimization. It is a first-order iterative algorithm for finding a local minimum of a differentiable multivariate function.
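A minimal sketch of the iteration follows: repeatedly step against the gradient with a fixed step size. The quadratic test function, the step size, and the names are illustrative assumptions.

```python
def gradient_descent(grad, start, lr=0.1, steps=200):
    """Take repeated steps in the direction of the negative gradient."""
    point = list(start)
    for _ in range(steps):
        g = grad(point)
        point = [p - lr * gi for p, gi in zip(point, g)]
    return point


def grad_f(p):
    """Gradient of the hypothetical test function f(x, y) = (x - 3)**2 + (y + 1)**2."""
    x, y = p
    return [2.0 * (x - 3.0), 2.0 * (y + 1.0)]


minimum = gradient_descent(grad_f, [0.0, 0.0])
```

Only first derivatives are used, which is why the method is called first-order; on a non-convex function the same loop would stop at whichever local minimum the starting point leads to.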

## Chain Rule of Differential Calculus

Gottfried Wilhelm Leibniz developed the chain rule, which is a crucial part of today's optimization algorithms. It is a key ingredient of backpropagation, present in all supervised learning of neural networks.
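How the chain rule enters backpropagation can be sketched for a single sigmoid neuron with squared error. The derivative of the loss with respect to the weight is a product of three local derivatives, checked here against a finite-difference estimate; the one-neuron setup and all names are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, x, y):
    """Squared error of a one-weight sigmoid neuron."""
    return (sigmoid(w * x) - y) ** 2


def loss_gradient(w, x, y):
    """dL/dw = dL/da * da/dz * dz/dw, with z = w * x and a = sigmoid(z)."""
    z = w * x
    a = sigmoid(z)
    dloss_da = 2.0 * (a - y)
    da_dz = a * (1.0 - a)  # derivative of the sigmoid
    dz_dw = x
    return dloss_da * da_dz * dz_dw


def numerical_gradient(w, x, y, eps=1e-6):
    """Central finite difference, used only to check the analytic result."""
    return (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2.0 * eps)


analytic = loss_gradient(0.5, 2.0, 1.0)
numeric = numerical_gradient(0.5, 2.0, 1.0)
```

In a multi-layer network the same product simply gets longer, with one local derivative per layer between the weight and the loss.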