CS50's Introduction to Artificial Intelligence with Python
Search
general pattern:
- begin in an initial state
- search for a correct sequence of actions
- that leads to a goal
agent - entity that perceives its environment and acts upon that environment
state - configuration of the environment
initial state - the state in which the agent begins
actions - choices that can be made in any given state. As a function: ACTIONS(s) -> set of actions that can be executed in state s.
transition model. RESULT(s, a) -> the state resulting from performing action a in state s.
state space - the set of all states reachable from the initial state by any sequence of actions; can be represented as a graph.
goal test - determines whether a given state is a goal state.
path cost. numerical cost associated with a given path.
Search problem:
- initial state
- actions
- transition model
- goal test
- path cost function
optimal solution - a solution that has the lowest path cost among all solutions.
node:
- state
- parent
- action
- path cost
frontier - the set of all nodes yet to be explored. General approach:
- Start with initial state.
- Start with empty explored set.
- repeat:
- If the frontier is empty, there is no solution.
- remove a single node from the frontier
- if it is a goal - we’ve found a solution
- add node to the explored set
- expand the node (look at all of its neighbours); add the resulting nodes to the frontier if they are in neither the frontier nor the explored set.
search algorithms
- stack - last in, first out. Used as the frontier, it gives a depth-first search (DFS).
- breadth-first search (BFS) always expands the shallowest node in the frontier. Uses a queue (first in, first out).
BFS always finds the optimal path (when every step costs the same); DFS might save memory.
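A minimal Python sketch of this frontier-based search (my own illustration; the toy graph and adjacency-dict representation are assumptions):

```python
from collections import deque

def search(start, goal, neighbors, bfs=True):
    """Frontier-based search: popleft() gives a queue (BFS), pop() a stack (DFS)."""
    frontier = deque([(start, [start])])   # node = (state, path to state)
    explored = set()
    while frontier:                        # empty frontier -> no solution
        state, path = frontier.popleft() if bfs else frontier.pop()
        if state == goal:
            return path
        explored.add(state)
        for nxt in neighbors(state):       # expand the node
            if nxt not in explored and all(nxt != s for s, _ in frontier):
                frontier.append((nxt, path + [nxt]))
    return None

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(search("A", "D", graph.get))  # ['A', 'B', 'D']
```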
uninformed search - doesn’t use any problem-specific knowledge. informed search does:
- Greedy best-first search (GBFS) - expand the node that is closest to the goal, as estimated by a heuristic function h(n)
- Manhattan distance - distance between two points along axes at right angles.
A* search: expand the node with the lowest value of g(n) + h(n), where g(n) is the cost to reach the node. Optimal if:
- h(n) is admissible (never overestimates the true cost) - it should never think we are further away from the goal than we actually are.
- h(n) is consistent: for every node n and successor n' with step cost c, h(n) <= h(n') + c
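A hedged A* sketch (names are mine, not from the lecture); `neighbors(state)` is assumed to yield `(next_state, step_cost)` pairs, and `h` could be the Manhattan distance on a grid:

```python
import heapq, itertools

def a_star(start, goal, neighbors, h):
    """Expand the node with the lowest g(n) + h(n)."""
    tie = itertools.count()   # tie-breaker so the heap never compares states
    frontier = [(h(start), next(tie), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        _, _, g, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        for nxt, cost in neighbors(state):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(frontier, (new_g + h(nxt), next(tie), new_g, nxt, path + [nxt]))
    return None

# e.g. on a grid: h = lambda s: abs(s[0] - gx) + abs(s[1] - gy)  (Manhattan distance)
```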
adversarial search
minimax - X is the MAX player and wants to maximize the score; O is the MIN player and wants to minimize the score.
TicTacToe functions:
- PLAYER(s) - whose turn is it?
- ACTIONS(s) - possible actions for a state
- RESULT(s, a) - return the state after performing action a in state s
- TERMINAL(s) - did we finish the game?
- UTILITY(s) - score for a state
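A minimal minimax sketch over these functions (my own; it assumes a `game` object exposing lowercase counterparts of the functions above, with "X" as the MAX player):

```python
def minimax(s, game):
    """Value of state s under optimal play by both sides."""
    if game.terminal(s):
        return game.utility(s)
    values = [minimax(game.result(s, a), game) for a in game.actions(s)]
    return max(values) if game.player(s) == "X" else min(values)

def best_move(s, game):
    """MAX picks the action leading to the highest minimax value (flip max/min for MIN)."""
    return max(game.actions(s), key=lambda a: minimax(game.result(s, a), game))
```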
optimizations:
- Alpha-beta pruning - skip (don’t consider) some of the nodes using already-computed information.
Depth-Limited Minimax:
- evaluation function - evaluates expected utility of the game from a given state
Knowledge
knowledge-based agents
sentence - assertion about the world in a knowledge representation language
propositional logic - based on statements about the world.
logical connectives - NOT, AND, OR, -> (implication), <-> (biconditional)
implication (if then)
| P (premise) | Q (conclusion) | P -> Q |
|---|---|---|
| false | false | true |
| false | true | true |
| true | false | false |
| true | true | true |
biconditional (if and only if)
| P | Q | P <-> Q |
|---|---|---|
| false | false | true |
| false | true | false |
| true | false | false |
| true | true | true |
model - assignment of a truth value to every propositional symbol (a “possible world”). Total number of models: number_of_values**variables, i.e. 2**n for n propositional symbols.
knowledge base (KB) - set of sentences known by a knowledge-based agent to be true
entailment: A |= B (A entails B) - in every model (possible world) in which sentence A is true, sentence B is also true.
We want our AI to figure out what the possible entailments are. To infer, to draw conclusions. inference - deriving new conclusions from old ones.
Model Checking
To determine if KB |= alpha (where alpha is the query):
- Enumerate all possible models.
- If in every model where KB is true, alpha is also true, then KB entails alpha.
- Otherwise, KB does not entail alpha.
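A small model-checking sketch (my own illustration, not the course’s logic library): sentences are encoded as Python functions over a model, i.e. a dict mapping each symbol to a truth value:

```python
from itertools import product

def model_check(symbols, kb, query):
    """KB entails query iff query is true in every model where KB is true."""
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if kb(model) and not query(model):
            return False   # counterexample: KB true, query false
    return True

# Modus ponens as a check: (P -> Q) AND P entails Q
kb = lambda m: (not m["P"] or m["Q"]) and m["P"]
print(model_check(["P", "Q"], kb, lambda m: m["Q"]))  # True
```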
Knowledge Engineering
Example: Clue (the board game):
- Establish propositional symbols
- add knowledge
- model-check queries against the knowledge base
Inference Rules
Modus Ponens: application of implication
alpha -> beta
alpha
-------------
beta
And Elimination
alpha AND beta
--------------
alpha
and the same for beta.
Double Negation Elimination
NOT NOT alpha
-------------
alpha
Implication Elimination
alpha -> beta
-----------------
NOT alpha OR beta
Biconditional Elimination
alpha <-> beta
-----------------------------------
(alpha -> beta) AND (beta -> alpha)
De Morgan’s Law
NOT(alpha AND beta)
----------------------
NOT alpha OR NOT beta
Reverse De Morgan’s Law
NOT(alpha OR beta)
-----------------------
NOT alpha AND NOT beta
Distributive Property
alpha AND (beta OR gamma)
-------------------------------------
(alpha AND beta) OR (alpha AND gamma)
works also when AND is exchanged with OR and vice versa.
Theorem Proving
- initial state: starting knowledge base
- actions: inference rules
- transition model: new knowledge after inference
- goal test: check statement we’re trying to prove
- path cost function: number of steps in the proof
Unit Resolution Rule
P OR Q
NOT P
--------
Q
In an OR clause the order of arguments doesn’t matter.
P OR Q
NOT P OR R
-----------
Q OR R
Q and R can each stand for multiple propositional symbols, not just single ones.
clause: a disjunction (literals connected with OR) of literals
Conjunctive Normal Form (CNF)
logical sentence that is a conjunction of clauses, e.g. (A OR B OR C) AND (D OR NOT E) AND (F OR G). Any sentence in propositional logic can be converted into this form.
Conversion to CNF:
- Eliminate biconditionals
- Eliminate implications
- Move NOT inwards using De Morgan’s Laws
- Use the distributive property to distribute OR over AND
Factoring: removing duplicate literals from a clause.
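To see the conversion in action, sympy can do it for us (an assumption - sympy is not part of the course):

```python
from sympy.abc import A, B, C
from sympy.logic.boolalg import Equivalent, to_cnf

print(to_cnf(Equivalent(A, B)))  # biconditional and implications eliminated
print(to_cnf(~(A | B)))          # De Morgan: ~A & ~B
print(to_cnf((A & B) | C))       # distribution: (A | C) & (B | C)
```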
The empty clause () is always false.
Proof by contradiction: assume the opposite of what we want to prove; if we can derive a contradiction (the empty clause), then the original statement is true.
First-Order Logic
- Constant Symbols (Minerva, Gryffindor …)
- Predicate Symbols (Person, House, BelongsTo, …)
Example:
Person(Minerva) # Minerva is a person
House(Gryffindor) # Gryffindor is a house
NOT House(Minerva) # Minerva is not a house
BelongsTo(Minerva, Gryffindor) # Minerva belongs to Gryffindor
First-order logic minimizes the number of symbols needed (no separate propositional symbol per person-house combination).
Universal Quantification (∀) - the sentence is true for all values of a variable
Existential Quantification (∃) - the sentence is true for at least one value of a variable
Uncertainty
Probability
Possible Worlds: omega (an element of the set Ω of all possible worlds).
Probability axioms: 0 <= P(omega) <= 1 and sum(P(omega)) == 1 over all possible worlds.
unconditional probability: degree of belief in a proposition in the absence of any other evidence
conditional probability: degree of belief in a proposition given some evidence that has already been revealed. P(a | b) - the probability of a given that we already know b is true. Formula:
P(a|b) = P(a AND b) / P(b) # or
P(a AND b) = P(b) P(a|b) # or
P(a AND b) = P(a) P(b|a)
random variable: a variable with a domain of possible values it can take on, e.g. Traffic with domain {none, light, heavy} and probability distribution:
P(traffic = none) = .6
P(traffic = light) = .3
P(traffic = heavy) = .1
P(traffic) = <0.6, 0.3, 0.1> # equal but less verbose
independence: knowledge of one event doesn’t influence the probability of the other event. P(a AND b) = P(a) P(b). Example: the rolls of two dice.
Bayes’ Rule:
P(b|a) = P(b) P(a|b) / P(a)
Example: given clouds in the morning, what is the probability of rain in the afternoon?
- 80% of rainy afternoons start with cloudy mornings.
- 40% of days have cloudy mornings
- 10% of days have rainy afternoons
Solution:
P(rain | clouds) = P(clouds | rain) P(rain) / P(clouds)
= 0.8 * 0.1 / 0.4
= 0.2
Knowing P(visible effect | unknown cause), we can calculate P(unknown cause | visible effect).
Joint Probability
Example:
P(C) = 0.4, P(not C) = 0.6
P(R) = 0.1, P(not R) = 0.9
|       | R   | not R |
|-------|-----|-------|
| C     | .08 | .32   |
| not C | .02 | .58   |
P(cloudy | rain) = P(cloudy, rain) / P(rain) # "," means AND
                 = alpha P(C, rain) # alpha = normalization constant
                 = alpha <.08, .02>
                 = <.8, .2>
Probability Rules
Negation: P(not a) = 1 - P(a)
Inclusion-Exclusion: P(a OR b) = P(a) + P(b) - P(a AND b)
Marginalization: P(a) = P(a, b) + P(a, not b). Example:
P(C = cloud) = P(C = cloud, R = rain) + P(C = cloud, R = not rain)
= 0.08 + 0.32
= 0.4
Conditioning: P(a) = P(a|b)P(b) + P(a|not b)P(not b)
Bayesian Networks
- data structures that represent the dependencies among random variables
- directed graph
- each node represents a random variable
- arrow from X to Y means X is a parent of Y
- each node X has a probability distribution P(X | Parents(X))
Inference problem:
- Query X: the variable for which to compute the distribution
- Evidence variables E: the observed variables for event e
- Hidden variables Y: non-evidence, non-query variables
- Goal: calculate P(X | e)
Inference by Enumeration: P(X | e) = alpha P(X, e) = alpha sum_y(P(X, e, y))
Approximate Inference, e.g. by sampling.
Likelihood weighting:
- fix the values for evidence variables,
- sample the non-evidence variables using conditional probabilities of the Bayesian Network,
- weight each sample by its likelihood: the probability of all of the evidence.
Markov Models
Uncertainty over time -> without simplifying assumptions, a huge amount of data / complexity.
Markov assumption: assumption that the current state depends on only a finite fixed number of previous states.
Markov chain: a sequence of random variables where the distribution of each variable follows the Markov assumption.
hidden state vs. observation, e.g. user engagement vs. website analytics, or weather vs. people bringing umbrellas to the office. The AI doesn’t know the true state of the world (the hidden state); all it has access to is some observation related to the hidden state.
Hidden Markov Model - a Markov model for a system with hidden states that generate some observed event; the mapping from hidden state to observation is often called the sensor model.
sensor Markov Assumption: the assumption that the evidence variable depends only on the corresponding state.
Tasks:
- filtering: given observations from start until now calculate the distribution for current state
- prediction: … for a future state
- smoothing: … for a past state
- most likely explanation: … most likely sequence of states
Optimization
- choosing the best option from a set of options
local search: search algorithms that maintain a single node and search by moving to a neighboring node. Most useful when we don’t care about the path, only about the solution.
state-space landscape. When minimizing a cost function, we look for the global minimum; when maximizing an objective function, for the global maximum. We move to neighbours of the current state.
hill climbing algorithm
function HILL-CLIMB(problem):
    current = initial state of problem
    repeat:
        neighbor = highest valued neighbor of current
        if neighbor not better than current:
            return current
        current = neighbor
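A runnable version of the same idea (the toy landscape is my own; integer states with neighbours x ± 1):

```python
def hill_climb(state, neighbors, value):
    """Steepest-ascent hill climbing; stops at the first local maximum."""
    while True:
        best = max(neighbors(state), key=value)
        if value(best) <= value(state):
            return state
        state = best

f = lambda x: -x**4 / 50 + x**2 / 2 + x   # over the integers: local max at x = -3, global max at x = 4
print(hill_climb(-5, lambda x: [x - 1, x + 1], f))  # -> -3: stuck on the local maximum
```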
Hill climbing might get stuck at a local maximum, as the run above shows. Therefore there are hill climbing variants:
- steepest-ascent: choose the highest-valued neighbor
- stochastic: choose randomly from higher-valued neighbors
- first-choice: choose the first-higher valued neighbor
- random-restart: conduct hill climbing multiple times
- local beam search: choose the k highest-valued neighbors
The real problem with hill climbing is that it never makes a move that makes our position worse, and that’s sometimes needed to reach the global maximum.
Simulated Annealing
- Early on, higher “temperature”: more likely to accept neighbors that are worse than current state
- Later on, lower “temperature”.
function SIMULATED-ANNEALING(problem, max):
    current = initial state of problem
    for t = 1 to max:
        T = temperature(t)
        neighbor = random neighbor of current
        delta_e = how much better neighbor is than current
        if delta_e > 0:
            current = neighbor
        else: with probability e**(delta_e/T) set current = neighbor
    return current
Linear Programming
- Minimize a cost function
- With constraints
- With bounds for each variable
Example:
Cost function: 50x + 80y
Constraint: 5x + 2y <= 20
Constraint: 10x + 12y >= 90 -> (-10x) + (-12y) <= -90
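This example can be handed to an off-the-shelf solver; a sketch with scipy (an assumption, not necessarily the course’s tool), where the >= constraint is negated exactly as above because linprog only accepts <= constraints:

```python
from scipy.optimize import linprog

result = linprog(
    c=[50, 80],                     # minimize 50x + 80y
    A_ub=[[5, 2], [-10, -12]],      # 5x + 2y <= 20 and -10x - 12y <= -90
    b_ub=[20, -90],                 # variables default to bounds x, y >= 0
)
print(result.x)  # optimal (x, y), here about [1.5, 6.25]
```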
Constraint Satisfaction
- Set of variables {X1, …, Xn}
- Set of domains for each variable {D1, D2, …, Dn} - possible values each of the variables can take on.
- Set of constraints C
hard constraints - must be satisfied in a correct solution
soft constraints - preferences
unary constraint - involves just a single variable, e.g.: A != monday
binary constraint - involves two variables, e.g.: A != B
node consistency: when all the values in a variable’s domain satisfy the variable’s unary constraints. For each variable, remove values from its domain until all unary constraints are satisfied.
arc consistency: when all the values in a variable’s domain satisfy the variable’s binary constraints.
function REVISE(csp, X, Y): # csp = constraint satisfaction problem
    revised = false
    for x in X.domain: # loop over all values of X's domain
        if no y in Y.domain satisfies constraint for (X, Y):
            delete x from X.domain
            revised = true
    return revised
function AC-3(csp):
    queue = all arcs in csp
    while queue non-empty:
        (X, Y) = DEQUEUE(queue) # (X, Y) is an arc (or edge)
        if REVISE(csp, X, Y):
            if size of X.domain == 0:
                return false
            for each Z in X.neighbors - {Y}:
                ENQUEUE(queue, (Z, X)) # shrinking X's domain might make Z inconsistent
    return true
AC-3 will not always solve the problem. What remains is a Search Problem:
- initial state: empty assignment (no variables assigned)
- actions: add a {variable = value} to the assignment
- transition model: shows how adding an assignment changes the assignment
- goal test: check if all variables are assigned and all constraints are satisfied
- path cost function: irrelevant (all paths have the same cost)
Backtracking Search
function BACKTRACK(assignment, csp):
    if assignment complete:
        return assignment
    var = SELECT-UNASSIGNED-VAR(assignment, csp)
    for value in DOMAIN-VALUES(var, assignment, csp):
        if value consistent with assignment: # doesn't violate any constraints
            add {var = value} to assignment
            result = BACKTRACK(assignment, csp)
            if result != failure: return result
            remove {var = value} from assignment
    return failure
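For comparison, an off-the-shelf CSP solver runs the same kind of search; a sketch with the python-constraint package (`pip install python-constraint`; the variables and toy scheduling domain are my own):

```python
from constraint import Problem

problem = Problem()
problem.addVariables(["A", "B", "C"], ["Mon", "Tue", "Wed"])   # variables + shared domain
for x, y in [("A", "B"), ("B", "C")]:                          # binary constraints
    problem.addConstraint(lambda a, b: a != b, (x, y))
print(problem.getSolutions())   # every assignment satisfying all constraints
```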
Inference: use Arc Consistency interleaved with search, so we backtrack less.
maintaining arc-consistency: algorithm for enforcing arc-consistency every time we make a new assignment. When we make a new assignment to X, call AC-3, starting with a queue of all arcs (Y, X) where Y is a neighbour of X.
SELECT-UNASSIGNED-VAR heuristics:
- minimum remaining values (MRV) heuristic: select the variable that has the smallest domain
- degree heuristic: select the variable that has the highest degree (connected to the most other nodes)
DOMAIN-VALUES heuristics:
- least constraining value: order values by the number of choices they rule out for neighboring variables:
  - try the least-constraining values first
Learning, ML (learning from data and experience)
No explicit instructions how to perform a task, but give access to data and let the computer figure it out.
Supervised Learning
given a data set of input-output pairs, learn a function to map inputs to outputs.
classification: supervised learning task of learning a function mapping of an input to a discrete category. E.g. banknote -> counterfeit/real
nearest-neighbor classification: algorithm that, given an input, chooses the class of the nearest data point to that input:
- k-nearest-neighbor classification: given an input, chooses the most common class out of the k nearest data points to that input
- could be slow
Perceptron Learning
hypothesis function - determines on which side of the boundary the input falls. Example:
x1 = Humidity
x2 = Pressure
Weight Vector w: (w0, w1, w2)
Input Vector x: (1, x1, x2)
w * x: w0 + w1x1 + w2x2
h(x1, x2) = Rain if w0 + w1x1 + w2x2 >= 0 else no Rain
The goal of the ML algorithm is to learn what the Weight Vector will be.
General hypothesis form: h(x) = 1 if w * x >= 0 else 0
perceptron learning rule: given data point (x, y), update each weight according to: w_i = w_i + alpha * (y - h_w(x)) * x_i, i.e. w_i = w_i + alpha * (actual value - estimate) * x_i. alpha is the learning rate (how quickly we update the weight values).
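A tiny sketch of one update step (the weather numbers are invented):

```python
def perceptron_update(w, x, y, alpha=0.1):
    """w_i <- w_i + alpha * (y - h_w(x)) * x_i; x starts with 1 for the bias w0."""
    h = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
    return [wi + alpha * (y - h) * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0, 0.0]
for x, y in [((1, 0.9, 0.3), 1), ((1, 0.2, 0.8), 0)]:  # (1, humidity, pressure) -> rain?
    w = perceptron_update(w, x, y)
print(w)
```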
hard threshold vs. soft threshold (logistic regression)
Support Vector Machines
Can work in higher dimensions.
maximum margin separator: boundary that maximizes the distance to the nearest data points.
Regression
Supervised learning task of learning a function mapping an input point to a continuous value.
Evaluating Hypotheses
Think of it as an optimization problem.
loss function: function that expresses how poorly our hypothesis performs.
0-1 loss function:
L(actual, predicted) =
0 if actual = predicted,
1 otherwise
L1 loss function:
L(actual, predicted) = |actual - predicted|
L2 loss function - penalizes predictions that are far off much more harshly:
L2(actual, predicted) = (actual - predicted)**2
overfitting - a model that fits too closely (little or no loss) to a particular data set and therefore may fail to generalize to future data.
regularization - penalizing hypotheses that are more complex, to favor simpler, more general hypotheses. We want to give preference to a simpler solution that might generalize better: cost(h) = loss(h) + lambda * complexity(h)
holdout cross-validation - splitting data into a training set and a test set, such that learning happens on the training set and is evaluated on the test set. Downside: a fair amount of data is not used to train.
k-fold cross-validation - splitting data into k sets, and experimenting k times, using each set as the test set once and the remaining data as the training set.
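With scikit-learn (an assumption; the course’s own demos may differ), k-fold cross-validation of the earlier k-nearest-neighbor classifier is one call:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)  # 5 folds
print(scores.mean())   # average accuracy over the held-out folds
```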
Reinforcement learning
Given a set of rewards or punishments, learn what actions to take in the future:
Environment
/ \
action^ state_v, reward_v
\ /
Agent
Markov Decision Process - model for decision-making, representing states, actions, and their rewards:
- Set of states S
- Set of actions ACTIONS(s)
- Transition model P(s' | s, a)
- Reward function R(s, a, s')
Q-Learning - method for learning a function Q(s, a), an estimate of the value of performing action a in state s. Overview:
- Start with Q(s, a) = 0 for all s, a
- When we take an action and receive a reward:
  - Estimate the value of Q(s, a) based on the current reward and expected future rewards.
  - Update Q(s, a) to take into account the old estimate as well as our new estimate.
Formally:
Q(s, a) <- Q(s, a) + alpha * (new value estimate - old value estimate)
Q(s, a) <- Q(s, a) + alpha * ((r + future reward estimate) - Q(s, a))
Q(s, a) <- Q(s, a) + alpha * ((r + max_a'(Q(s', a'))) - Q(s, a))
Q(s, a) <- Q(s, a) + alpha * ((r + gamma * max_a'(Q(s', a'))) - Q(s, a))
The discount factor gamma gives preference to current rewards over future rewards.
alpha (the learning rate) represents how much we value new information compared to old information; an alpha of 1 means we only consider the new estimate.
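One update step, sketched in Python (Q as a dict with default value 0, matching the “start with Q(s, a) = 0” step above):

```python
def q_update(Q, s, a, r, s_prime, actions, alpha=0.5, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * ((r + gamma * max_a' Q(s',a')) - Q(s,a))"""
    old = Q.get((s, a), 0)
    future = max((Q.get((s_prime, a2), 0) for a2 in actions), default=0)
    Q[(s, a)] = old + alpha * (r + gamma * future - old)
```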
Greedy Decision-Making: when in state s, choose the action a with the highest Q(s, a).
Tension between Explore vs. Exploit (exploiting only the knowledge we already have might not lead to the optimal solution).
epsilon-greedy:
- Set epsilon equal to how often we want to move randomly.
- With probability 1 - epsilon, choose the estimated best move.
- With probability epsilon, choose a random move (especially useful early in training).
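A sketch reusing the dict-based Q from the block above (parameter names are mine):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0))
```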
function approximation - approximating Q(s, a), often by a function combining various features, rather than storing one value for every state-action pair.
Unsupervised Learning
given input data without any additional feedback, learn patterns.
clustering - organizing a set of objects into groups in such a way that similar objects tend to be in the same group.
k-means clustering - algorithm for clustering data based on repeatedly assigning points to clusters and updating those clusters’ centers.
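Sketch with scikit-learn (the points are invented and form two obvious clusters):

```python
from sklearn.cluster import KMeans

points = [[1, 1], [1.5, 2], [0.5, 1.5], [8, 8], [8, 8.5], [9, 7.5]]
kmeans = KMeans(n_clusters=2, n_init=10).fit(points)
print(kmeans.labels_)           # cluster index per point
print(kmeans.cluster_centers_)  # the two learned centers
```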
Neural Networks
- Neurons are connected to and receive electrical signals from other neurons.
- Neurons process input signals and can be activated.
pip install tensorflow-macos
Artificial Neural Network - mathematical model for learning inspired by biological neural networks:
- Model mathematical function from input to outputs based on the structure and parameters of the network.
- Allow for learning the network’s parameters based on data.
activation function - function that gets applied to the result and decides whether a threshold is passed (e.g. “it rains or not”).
Schematics for a very simple neural network:
w0
|
x1 \ |
g(w0 + w1*x1 + w2*x2)
x2 /
The neural network will learn what the values of w0, w1 and w2 should be in order to get the results we expect. General formula: g(sum(x_i * w_i) + w0).
gradient descent - algorithm for minimizing loss when training a neural network. The gradient tells us in which direction we should move the weights in order to minimize the loss.
General idea:
- Start with a random choice of weights
- Repeat:
- Calculate the gradient based on all data points: direction that will lead to decreasing loss (all data points - expensive operation).
- Update weights according to the gradient.
Stochastic Gradient Descent - same, but calculate the gradient based on one randomly chosen data point instead of all data points.
Mini-Batch Gradient Descent - same, but calculate the gradient based on one small batch of data points.
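A bare-bones numeric illustration (the toy function is my own, not from the lecture):

```python
# minimize f(w) = (w - 3)**2; its gradient is 2 * (w - 3)
w, learning_rate = 0.0, 0.1
for _ in range(100):
    w -= learning_rate * 2 * (w - 3)   # step against the gradient
print(w)   # converges toward the minimum at w = 3
```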
Perceptron: a single one is only capable of learning a linearly separable decision boundary, because we’re taking a linear combination of the inputs.
multilayer neural network - artificial neural network with an input layer, an output layer, and at least one hidden layer. Gives us the ability to model more complex functions.
backpropagation - algorithm for training neural networks with hidden layers; the key algorithm for making neural networks possible. Idea:
- start with a random choice of weights
- Repeat:
- Calculate error for output layer
- For each layer, starting with the output layer and moving inwards towards the earliest hidden layer:
- Propagate error back one layer.
- Update weights.
deep neural network - neural network with multiple hidden layers.
overfitting
dropout - temporarily removing units - selected at random - from a neural network to prevent over-reliance on certain units.
computer vision - computational methods for analyzing and understanding digital images.
image convolution - applying a filter that adds each pixel value of an image to its neighbors, weighted according to a kernel matrix.
pooling - reducing the size of an input by sampling from regions in the input. max-pooling chooses the maximum value in each region.
convolutional neural network (CNN) - uses convolution, usually for analysing images.
image -> convolution -> pooling -> flattening -> neural network
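That pipeline maps directly onto a Keras model; a sketch assuming 28x28 grayscale inputs and 10 classes (e.g. handwritten digits):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),   # pooling
    tf.keras.layers.Flatten(),                        # flattening
    tf.keras.layers.Dense(128, activation="relu"),    # hidden layer
    tf.keras.layers.Dropout(0.5),                     # fight overfitting (see above)
    tf.keras.layers.Dense(10, activation="softmax"),  # output layer
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```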
General structure of a NN: input -> network -> output. This is a feed-forward neural network - a NN that has connections in only one direction. Limitation: fixed shape (number) of input and output neurons.
Recurrent Neural Network - generates output that gets fed back into the network as input for future runs of the network:
/--\
v |
input -> network -> output
This allows the network to store state, which is particularly helpful when dealing with sequences of data.
Example: Microsoft’s Captionbot or analyzing videos.
1 -> N relationships, as opposed to the 1:1 from before.
N -> 1: e.g. classification of a voice from an audio sample
N -> N: e.g. Google Translate
Human Language
Common tasks:
- automatic summarization
- information extraction
- language identification
- machine translation
- named entity recognition
- speech recognition (Alexa)
- text classification
- word sense disambiguation
- …
syntax - structure
semantics - meaning
formal grammar a system of rules for generating sentences in a language.
Context-Free Grammar
Replacing one symbol with another symbol.
Non-terminal symbols (noun - N, verb - V) are used to generate terminal symbols (“she”, “saw”).
On the left of the arrow there is always a single non-terminal symbol; on the right there can be either terminal or non-terminal symbols. E.g. NP -> N | D N.
pip install nltk
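A small nltk sketch (the toy grammar is mine):

```python
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP V
    NP -> N | D N
    D -> "the"
    N -> "she" | "city"
    V -> "saw" | "walked"
""")
parser = nltk.ChartParser(grammar)
for tree in parser.parse(["she", "saw"]):
    tree.pretty_print()
```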
n-gram - a contiguous sequence of n items from a sample of text, e.g. character n-grams or word n-grams.
unigram - a contiguous sequence of 1 item from a sample of text.
bigram - … 2 items …
trigram - … 3 items …
Allows for segmentation of the text for analysis; while the AI might not have seen the whole sentence before, it is likely that it has encountered an n-gram from it before.
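nltk can extract them directly (the sentence is invented):

```python
from nltk.util import ngrams

words = "how are you doing today".split()   # or nltk.word_tokenize(...)
print(list(ngrams(words, 2)))   # bigrams: ('how', 'are'), ('are', 'you'), ...
```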
tokenization - the task of splitting a sequence of characters into pieces (tokens), e.g. in order to extract n-grams.
word tokenization - splitting into words.
Markov Model
Allows us to predict what the next unit (e.g. the next word) will be.
pip install markovify
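Minimal usage (corpus.txt is a placeholder for any training text):

```python
import markovify

with open("corpus.txt") as f:
    model = markovify.Text(f.read())
print(model.make_sentence())   # may be None if no valid sentence was found
```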
Text Categorization
a classification problem. E.g. spam/not spam. Positive / negative sentiment.
Bag-of-Words - model that represents text as an unordered collection of words. Works pretty well for the positive/negative sentiment approach.
Naive Bayes
How to not worry about multiplying by zero:
- additive smoothing - adding a value alpha to each value in our distribution to smooth the data.
- Laplace smoothing - adding 1 to each value in our distribution, pretending that we’ve seen each value one more time than we actually have.
Naive Bayes basically looks for differentiating words (which have a big weight in one category and not so in others).
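A compact sketch with scikit-learn (the tiny corpus is invented; alpha is the additive-smoothing value from above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great fun movie", "wonderful film", "terrible movie", "awful boring film"]
labels = ["pos", "pos", "neg", "neg"]
vec = CountVectorizer()                  # bag-of-words counts
clf = MultinomialNB(alpha=1.0)           # alpha=1.0 -> Laplace smoothing
clf.fit(vec.fit_transform(texts), labels)
print(clf.predict(vec.transform(["fun film"])))
```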
Information Retrieval
the task of finding relevant documents in response to a query.
topic modeling - models for discovering the topics for a set of documents. Which words are important in a document?
tf-idf
term frequency - number of times a term appears in a document.
function words - words that have little meaning on their own, but are used to grammatically connect other words. E.g.: am, by, do, is, which, with, yet, ...
content words - words that carry meaning independently. E.g.: algorithm, category, computer, ...
inverse document frequency - measure of how common or rare a word is across documents. Formula: log(totalNumberOfDocuments / numberOfDocumentsContaining(word)).
tf-idf - ranking of which words are important in a document, obtained by multiplying term frequency (TF) by inverse document frequency (IDF).
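scikit-learn computes a smoothed variant of this directly (the documents are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the algorithm sorts the list", "the cat sat on the mat"]
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)   # one tf-idf vector per document
print(dict(zip(tfidf.get_feature_names_out(), matrix.toarray()[0].round(2))))
```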
Semantics
information extraction - the task of extracting knowledge from documents.
WordNet
Word Representation
one-hot representation - representation of meaning as a vector with a single 1 and all other values 0.
distributed representation - representation of meaning distributed across multiple values.
You shall know a word by the company it keeps – J.R. Firth, 1957
word2vec - model for generating word vectors
skip-gram architecture - neural network architecture for predicting context words given a target word.
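gensim’s Word2Vec implements it; a toy sketch (the sentences are invented; sg=1 selects the skip-gram architecture):

```python
from gensim.models import Word2Vec

sentences = [["she", "saw", "the", "city"], ["he", "walked", "to", "the", "city"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["city"][:5])          # first dimensions of a distributed word vector
print(model.wv.most_similar("city"))
```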