CS50's Introduction to Artificial Intelligence with Python

Search

general pattern:

  • begin
  • correct sequence of actions
  • goal

agent - perceives the environment and acts upon the environment

state - configuration of the environment

initial state

actions - choices that can be made in any given state. As a function: ACTIONS(s) -> set of actions that can be executed in state s.

transition model - RESULT(s, a) -> state after performing action a in state s

state space - the set of all states reachable from the initial state by any sequence of actions; often represented as a graph.

goal test - determines whether a given state is a goal state.

path cost. numerical cost associated with a given path.

Search problem:

  • initial state
  • actions
  • transition model
  • goal test
  • path cost function

optimal solution - a solution with the lowest path cost among all solutions

node:

  • state
  • parent
  • action
  • path cost

frontier - holds everything still to be explored. The general search procedure:

  • Start with initial state.
  • Start with empty explored set.
  • repeat:
    • If the frontier is empty, there is no solution.
    • remove a single node from the frontier
    • if it is a goal - we’ve found a solution
    • add node to the explored set
    • expand (look at all neighbours of) node, add resulting nodes to the frontier if not in the frontier and not in explored set.
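
A minimal Python sketch of this procedure (goal_test and expand are hypothetical callables supplied by the problem; states are assumed hashable):

def search(initial_state, goal_test, expand, use_bfs=True):
    # each node is a (state, path-to-state) pair
    frontier = [(initial_state, [initial_state])]
    frontier_states = {initial_state}
    explored = set()
    while frontier:
        # FIFO pop (queue) -> breadth-first; LIFO pop (stack) -> depth-first
        state, path = frontier.pop(0) if use_bfs else frontier.pop()
        frontier_states.discard(state)
        if goal_test(state):
            return path
        explored.add(state)
        for neighbor in expand(state):
            if neighbor not in explored and neighbor not in frontier_states:
                frontier.append((neighbor, path + [neighbor]))
                frontier_states.add(neighbor)
    return None  # frontier empty: no solution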

search algorithms

  • stack - last in, first out. Using a stack for the frontier gives a depth-first search (DFS).
  • breadth-first search (BFS) always expands the shallowest node in the frontier. Uses a queue (first in, first out).

BFS always finds the optimal path (when all step costs are equal). DFS might save you memory.

uninformed search - doesn’t use any problem-specific knowledge. informed search - uses problem-specific knowledge to find solutions more efficiently:

  • Greedy best-first search (GBFS) - expand the node that is closest to the goal, as estimated by a heuristic function h(n)
    • Manhattan distance - distance between two points measured along axes at right angles.

A* search: expand the node with the lowest value of g(n) + h(n), where g(n) is the cost to reach the node and h(n) is the estimated cost to the goal. Optimal if:

  • h(n) is admissible (never overestimates the true cost) - it should never think we’re further away from the goal than we actually are.
  • h(n) is consistent: for every node n and successor n' with step cost c, h(n) <= h(n') + c.
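
A sketch of A* with a priority-queue frontier and Manhattan distance as h(n); states are assumed to be (row, col) tuples, and neighbors is a hypothetical callable returning (state, step_cost) pairs:

import heapq

def manhattan(a, b):
    # distance between two points along axes at right angles
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def a_star(start, goal, neighbors):
    h = lambda state: manhattan(state, goal)
    # frontier entries are (g + h, g, state, path), popped lowest g + h first
    frontier = [(h(start), 0, start, [start])]
    explored = set()
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        if state in explored:
            continue
        explored.add(state)
        for next_state, cost in neighbors(state):
            if next_state not in explored:
                new_g = g + cost
                heapq.heappush(frontier, (new_g + h(next_state), new_g, next_state, path + [next_state]))
    return None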

minimax - adversarial search where the maximizing player (e.g. X) wants to maximize the score and the minimizing player (e.g. O) wants to minimize it.

TicTacToe functions:

  • PLAYER(s) - whose turn is it?
  • ACTIONS(s) - possible actions for a state
  • RESULT(s, a) - performs action a in state s and returns the resulting state
  • TERMINAL(s) - did we finish the game?
  • UTILITY(s) - score for a state
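
A minimax sketch built on the five functions above, which are assumed to be implemented elsewhere, with “X” as the maximizing player:

import math

def minimax(state):
    # returns the optimal action for the player whose turn it is
    if TERMINAL(state):
        return None
    if PLAYER(state) == "X":
        return max(ACTIONS(state), key=lambda a: min_value(RESULT(state, a)))
    return min(ACTIONS(state), key=lambda a: max_value(RESULT(state, a)))

def max_value(state):
    if TERMINAL(state):
        return UTILITY(state)
    v = -math.inf
    for action in ACTIONS(state):
        v = max(v, min_value(RESULT(state, action)))
    return v

def min_value(state):
    if TERMINAL(state):
        return UTILITY(state)
    v = math.inf
    for action in ACTIONS(state):
        v = min(v, max_value(RESULT(state, action)))
    return v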

optimizations:

  • Alpha-beta pruning - skip (don’t consider) some of the nodes using already-computed information.

Depth-Limited Minimax:

  • evaluation function - estimates the expected utility of the game from a given state

Knowledge

knowledge-based agents

sentence - assertion about the world in a knowledge representation language

propositional logic - based on statements about the world.

logical connectives - NOT, AND, OR, -> (implication), <-> (biconditional)

implication (if then)

P (premise)   Q (conclusion)   P -> Q
false         false            true
false         true             true
true          false            false
true          true             true

biconditional (if and only if)

P       Q       P <-> Q
false   false   true
false   true    false
true    false   false
true    true    true

model - assignment of a truth value to every propositional symbol (a “possible world”). Total number of possible models: number_of_values ** number_of_variables (e.g. 2**n for n Boolean symbols).

knowledge base (KB) - set of sentences known by a knowledge-based agent to be true

entailment A |= B (A entails B) - in every model (possible world) in which sentence A is true, sentence B is also true.

We want our AI to figure out what the possible entailments are. To infer, to draw conclusions. inference - deriving new conclusions from old ones.

Model Checking

To determine if KB |= alpha:

  • Enumerate all possible models.
  • If in every model where KB is true, alpha is also true, then KB entails alpha.
  • Otherwise KB does not entail alpha.

alpha is the query.
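
A simplified model-checking sketch with sentences as plain Python predicates (the course's logic library instead represents sentences as classes):

from itertools import product

def model_check(symbols, kb, query):
    # enumerate every assignment of truth values, i.e. every model
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        # one model where KB holds but the query doesn't disproves entailment
        if kb(model) and not query(model):
            return False
    return True

# toy example: KB = P AND (P -> Q), query = Q
print(model_check(
    ["P", "Q"],
    kb=lambda m: m["P"] and ((not m["P"]) or m["Q"]),
    query=lambda m: m["Q"],
))  # True: KB entails Q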

Knowledge Engineering

Example: clue (game):

  • Establish propositional symbols
  • add knowledge
  • model_check the knowledge base against the query

Inference Rules

Modus Ponens: application of implication

alpha -> beta
alpha
------------
beta

And Elimination

alpha AND beta
--------------
alpha

and the same for beta.

Double Negation Elimination

NOT NOT alpha
-------------
alpha

Implication Elimination

alpha -> beta
------------------
NOT alpha OR beta

Biconditional Elimination

alpha <-> beta
------------------------------------
(alpha -> beta) AND (beta -> alpha)

De Morgan’s Law

NOT(alpha AND beta)
------------------------
NOT alpha OR NOT beta

Reverse De Morgan’s Law

NOT(alpha OR beta)
-------------------------
NOT alpha AND NOT beta

Distributive Property

alpha AND (beta OR gamma)
--------------------------------------
(alpha AND beta) OR (alpha AND gamma)

works also when AND is exchanged with OR and vice versa.

Theorem Proving

  • initial state: starting knowledge base
  • actions: inference rules
  • transition model: new knowledge base after inference
  • goal test: check the statement we’re trying to prove
  • path cost function: number of steps in the proof

Unit Resolution Rule

P OR Q
NOT P
--------
Q

In an OR clause the order of arguments doesn’t matter.

P OR Q
NOT P OR R
-----------
Q OR R

Q and R can each stand for a disjunction of multiple propositional symbols, not just a single one.

clause: a disjunction of literals (connected with OR)

Conjunctive Normal Form (CNF)

logical sentence that is a conjunction of clauses, e.g. (A OR B OR C) AND (D OR NOT E) AND (F OR G). Any sentence in logic can be converted into this.

Conversion to CNF:

  • Eliminate biconditionals
  • Eliminate implications
  • Move NOT inwards using De Morgan’s Laws
  • Use the distributive law to distribute OR over AND wherever possible

Factoring - removing duplicate literals from a clause.

The empty clause, (), is always false.

Proof by contradiction: assume the opposite of what we want to prove; if we can demonstrate a contradiction, the thing we want to prove is true. To determine if KB |= alpha: convert (KB AND NOT alpha) to CNF and keep applying resolution; if the empty clause is ever produced, we have a contradiction, so KB entails alpha.

First-Order Logic

  • Constant Symbols (Minerva, Gryffindor …)
  • Predicate Symbol (Person, House, BelongsTo, …)

Example:

Person(Minerva)                 # Minerva is a person
House(Gryffindor)               # Gryffindor is a house
NOT House(Minerva)              # Minerva is not a house
BelongsTo(Minerva, Gryffindor)  # Minerva belongs to Gryffindor

Minimizes the number of symbols needed.

Universal Quantification - true for all

Existential Quantification - true for some, at least one value

Uncertainty

Probability

Possible worlds (ω - omega), each with a probability P(omega).

Probability axioms: 0 <= P(omega) <= 1; summed over all possible worlds, sum(P(omega)) == 1.

unconditional probability: degree of belief in a proposition in the absence of any other evidence

conditional probability: we have some initial information about the world and how it works (evidence has been revealed to us). P(a | b) Probability of a given we already know that b is true. Formula:

P(a|b) = P(a AND b) / P(b)  # or

P(a AND b) = P(b) P(a|b)    # or

P(a AND b) = P(a) P(b|a)

random variable: a variable with a domain of possible values it can take on. E.g.: traffic {none,light,heavy}, probability distribution:

P(traffic = none) = .6
P(traffic = light) = .3
P(traffic = heavy) = .1

P(traffic) = <0.6, 0.3, 0.1>     # equal but less verbose

independence: knowledge of one event doesn’t influence the probability of the other event. P(a AND b) = P(a) P(b). Example: rolls of two dice.

Bayes’ Rule:

P(b|a) = P(b) P(a|b) / P(a)

Example: given clouds in the morning, what is the probability of the rain in the afternoon?

  • 80% of rainy afternoons start with cloudy mornings.
  • 40% of days have cloudy mornings
  • 10% of days have rainy afternoons

Solution:

P(rain | clouds) = P(clouds | rain) P(rain) / P(clouds) 
= 0.8 * 0.1 / 0.4 
= 0.2

Knowing P(visible effect | unknown cause) we can calculate P(unknown cause | visible effect).

Joint Probability

Example:

P(C) = 0.4, P(not C) = 0.6
P(R) = 0.1, P(not R) = 0.9

        R       not R
C       .08     .32
not C   .02     .58

P(cloudy | rain) = P(cloudy, rain) / P(rain)        # , = AND
= alpha P(C, rain)                                  # alpha = normalization constant
= alpha <.08, .02>
= <.8, .2>

Probability Rules

Negation: P(not a) = 1 - P(a)

Inclusion-Exclusion: P(a OR b) = P(a) + P(b) - P(a AND b)

Marginalization: P(a) = P(a, b) + P(a, not b). Example:

P(C = cloud) = P(C = cloud, R = rain) + P(C = cloud, R = not rain)
= 0.08 + 0.32
= 0.4

Conditioning: P(a) = P(a|b)P(b) + P(a|not b)P(not b)

Bayesian Networks

  • data structures that represent the dependencies among random variables
  • directed graph
  • each node represents a random variable
  • arrow from X to Y means X is a parent of Y
  • each node X has a probability distribution P(X | Parents(X))

Inference problem:

  • Query X: variable for which to compute the distribution
  • Evidence variables E: observed variables for event e
  • Hidden variables Y: non-evidence, non-query variable
  • Goal: Calculate P(X | e)

Inference by Enumeration: P(X | e) = alpha P(X, e) = alpha * sum_y P(X, e, y)

Approximate Inference, e.g. by sampling.

Likelihood weighting:

  • fix the values for evidence variables,
  • sample the non-evidence variables using conditional probabilities of the Bayesian Network,
  • weight each sample by its likelihood: the probability of all of the evidence.
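
A minimal likelihood-weighting sketch for a hypothetical two-node network Rain -> Umbrella (the probabilities are made up), with Umbrella observed as evidence:

import random

P_RAIN = 0.2
P_UMBRELLA = {True: 0.9, False: 0.1}  # P(umbrella | rain)

def likelihood_weighting(n=10000):
    weights = {True: 0.0, False: 0.0}
    for _ in range(n):
        rain = random.random() < P_RAIN    # sample the non-evidence variable
        weights[rain] += P_UMBRELLA[rain]  # weight by likelihood of the evidence
    total = sum(weights.values())
    return {rain: w / total for rain, w in weights.items()}

print(likelihood_weighting())  # approximates P(Rain | umbrella observed)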

Markov Models

Uncertainty over time -> a huge amount of data / complexity.

Markov assumption: assumption that the current state depends on only a finite fixed number of previous states.

Markov chain: a sequence of random variables where the distribution of each variable follows the Markov assumption.

hidden state vs. observation. E.g. user engagement vs. website analytics or weather vs. people bringing umbrellas to the office. The AI doesn’t know the truth of the world (hidden state) all it has access to is some observation that is related to the hidden state.

Hidden Markov Model - a Markov model for a system with hidden states that generate some observed event. The distribution of the observation given the hidden state is often called the sensor model.

sensor Markov Assumption: the assumption that the evidence variable depends only on the corresponding state.

Tasks:

  • filtering: given observations from start until now calculate the distribution for current state
  • prediction: … for a future state
  • smoothing: … for a past state
  • most likely explanation: … most likely sequence of states

Optimization

  • choosing the best option from a set of options

local search: search algorithms that maintain a single node and search by moving to a neighboring node. Most useful when we only care about the solution, not the path to it.

state-space landscape. When minimizing a cost function we look for the global minimum; when maximizing an objective function we look for the global maximum. In either case, we move to neighbors of the current state.

hill climbing algorithm

function HILL-CLIMB(problem):
    current = initial state of problem
    repeat:
        neighbor = highest valued neighbor of current
        if neighbor not better than current:
            return current
        current = neighbor

Hill climbing might get stuck at a local maximum. Therefore there are hill climbing variants:

  • steepest-ascent: choose the highest-valued neighbor
  • stochastic: choose randomly from higher-valued neighbors
  • first-choice: choose the first-higher valued neighbor
  • random-restart: conduct hill climbing multiple times
  • local beam search: choose the k highest-valued neighbors

The real problem with hill climbing is that it never makes a move that makes our position worse, yet that is sometimes exactly what’s needed to reach the global maximum.

Simulated Annealing

  • Early on, higher “temperature”: more likely to accept neighbors that are worse than current state
  • Later on, lower “temperature”: less likely to accept worse neighbors.

function SIMULATED-ANNEALING(problem, max):
    current = initial state of problem
    for t = 1 to max:
        T = TEMPERATURE(t)
        neighbor = random neighbor of current
        delta_e = how much better neighbor is than current
        if delta_e > 0:
            current = neighbor
        else:
            with probability e**(delta_e/T) set current = neighbor
    return current

Linear Programming

  • Minimize a cost function
  • With constraints
  • With bounds for each variable

Example:

Cost function: 50x + 80y
Constraint: 5x + 2y <= 20
Constraint: 10x + 12y >= 90 -> (-10x) + (-12y) <= -90
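
A sketch of this example with scipy.optimize.linprog, which the course also uses for its linear programming examples:

import scipy.optimize

result = scipy.optimize.linprog(
    [50, 80],                   # cost function: 50x + 80y
    A_ub=[[5, 2], [-10, -12]],  # constraint coefficients (<= form)
    b_ub=[20, -90],             # constraint bounds
)
if result.success:
    print(f"x = {result.x[0]:.2f}, y = {result.x[1]:.2f}")
else:
    print("No solution")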

Constraint Satisfaction

  • Set of variables {X1, …, Xn}
  • Set of domains for each variable {D1, D2, …, Dn} - possible values each of the variables can take on.
  • Set of constraints C

hard constraints - must be satisfied in a correct solution

soft constraints - preferences

unary constraint - involves just a single variable, e.g.: A != monday

binary constraint - involves two variables, e.g.: A != B

node consistency: when all the values in a variable’s domain satisfy the variable’s unary constraints. To enforce it, remove values from each variable’s domain until all of its unary constraints are satisfied.

arc consistency: when all the values in a variable’s domain satisfy the variable’s binary constraints. To make X arc-consistent with respect to Y, remove elements from X’s domain until every choice for X leaves a possible choice for Y:

function REVISE(csp, X, Y):     # csp = constraint satisfaction problem
    revised = false
    for x in X.domain:          # loop over all values of X's domain
        if no y in Y.domain satisfies constraint for (X, Y):
            delete x from X.domain
            revised = true
    return revised

function AC-3(csp):
    queue = all arcs in csp
    while queue non-empty:
        (X, Y) = DEQUEUE(queue)         # (X, Y) is an arc (or edge)
        if REVISE(csp, X, Y):
            if size of X.domain == 0:
                return false
            for each Z in X.neighbors - {Y}:
                ENQUEUE(queue, (Z, X))  # shrinking X's domain may break Z's arc consistency
    return true

AC-3 will not always solve the problem. What remains is a Search Problem:

  • initial state: empty assignment (no variables)
  • actions: add a {variable = value} to assignment
  • transition model: shows how adding an assignment changes the assignment
  • goal check: check if all variables are assigned and all constraints satisfied
  • path cost function: irrelevant (all paths have the same cost)

function BACKTRACK(assignment, csp):
    if assignment complete:
        return assignment
    var = SELECT-UNASSIGNED-VAR(assignment, csp)
    for value in DOMAIN-VALUES(var, assignment, csp):
        if value consistent with assignment:    # doesn't violate any constraints
            add {var = value} to assignment
            result = BACKTRACK(assignment, csp)
            if result != failure: return result
            remove {var = value} from assignment
    return failure

Inference: use Arc Consistency interleaved with search, so we backtrack less.

maintaining arc-consistency: algorithm for enforcing arc-consistency every time we make a new assignment. When we make a new assignment to X, call AC-3, starting with a queue of all arcs (Y, X) where Y is a neighbour of X.

SELECT-UNASSIGNED-VAR heuristics:

  • minimum remaining values (MRV) heuristic: select the variable that has the smallest domain
  • degree heuristic: select the variable that has the highest degree (connected to the most other nodes)

DOMAIN-VALUES heuristics:

  • least constraining value: return values in order by the number of choices that are ruled out for neighboring variables:
    • try least-constraining values first

Learning / Machine Learning (learning from data and experience)

No explicit instructions on how to perform a task; instead, give the computer access to data and let it figure the task out.

Supervised Learning

given a data set of input-output pairs, learn a function to map inputs to outputs.

classification: supervised learning task of learning a function mapping of an input to a discrete category. E.g. banknote -> counterfeit/real

nearest-neighbor classification: algorithm that, given an input, chooses the class of the nearest data point to that input:

  • k-nearest-neighbour classification: … k nearest data points …
    • could be slow

Perceptron Learning

hypothesis function - determines on which side of the decision boundary an input falls. Example:

x1 = Humidity
x2 = Pressure

Weight Vector w: (w0, w1, w2)
Input Vector x: (1, x1, x2)
w * x: w0 + w1x1 + w2x2

h(x1, x2) = Rain if w0 + w1x1 + w2x2 >= 0 else no Rain 

The goal of the ML algorithm is to learn what the Weight Vector will be.

General hypothesis form: h(x) = 1 if w * x >= 0, else 0

perceptron learning rule: given data point (x, y), update each weight according to: w_i = w_i + alpha * (y - h_w(x)) * x_i, i.e. w_i = w_i + alpha * (actual value - estimate) * x_i. alpha = learning rate (how quickly we update the weight values).
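
A minimal sketch of the perceptron learning rule; the toy dataset, alpha, and epoch count are illustrative:

def h(w, x):
    # hard-threshold hypothesis: 1 if w . x >= 0, else 0
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def train(data, alpha=0.1, epochs=100):
    w = [0.0, 0.0, 0.0]           # (w0, w1, w2)
    for _ in range(epochs):
        for x, y in data:         # x = (1, x1, x2): bias input fixed at 1
            error = y - h(w, x)   # (actual value - estimate)
            w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
    return w

# toy, linearly separable data: label is 1 iff x1 + x2 >= 1
data = [((1, 0, 0), 0), ((1, 1, 0), 1), ((1, 0, 1), 1), ((1, 1, 1), 1)]
print(train(data))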

hard threshold vs. soft threshold (logistic regression)

Support Vector Machines

Can work in higher dimensions.

maximum margin separator: boundary that maximizes the distance between the boundary and any of the data points.

Regression

Supervised learning task of learning a function mapping an input point to a continuous value.

Evaluating Hypotheses

Think of it as an optimization problem.

loss function: function that expresses how poorly our hypothesis performs.

0-1 loss function:

L(actual, predicted) =
    0 if actual = predicted,
    1 otherwise

L1 loss function:

L(actual, predicted) = |actual - predicted|

L2 loss function - penalizes predictions that are far off much more harshly:

L2(actual, predicted) = (actual - predicted)**2

overfitting - a model that fits too closely (little or no loss) to a particular data set and therefore may fail to generalize to future data.

regularization - penalizing hypotheses that are more complex, to favor simpler, more general hypotheses. We want to give preference to a simpler solution, which might generalize better: cost(h) = loss(h) + lambda * complexity(h)

holdout cross-validation - splitting data into a training set and a test set, such that learning happens on the training set and is evaluated on the test set. Downside: a fair amount of data is not used to train.

k-fold cross-validation - splitting data into k sets, and experimenting k times, using each set as a test set once and using the remaining data as the training set
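
A holdout cross-validation sketch with scikit-learn; the toy data is illustrative (in the course the features would come from a real dataset such as the banknotes one):

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1], [4, 4], [4, 5], [5, 4], [5, 5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# learn on the training set, evaluate on the held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on unseen data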

Reinforcement learning

Given a set of rewards or punishments, learn what actions to take in the future:

        Environment
         ^       |
  action |       | state, reward
         |       v
           Agent

Markov Decision Process - model for decision-making, representing states, actions, and their rewards:

  • Set of states S
  • Set of actions ACTIONS(s)
  • Transition model P(s'|s, a)
  • Reward function R(s, a, s')

Q-Learning - method for learning a function Q(s, a): an estimate of the value of performing action a in state s. Overview:

  • Start with Q(s, a) = 0 for all s, a
  • When we take an action and receive a reward:
    • Estimate the value of Q(s, a) based on current reward and expected future rewards.
    • Update Q(s, a) to take into account old estimate as well as our new estimate.
    • Formally:
      • Q(s, a) <- Q(s, a) + alpha * (new value estimate - old value estimate)
      • Q(s, a) <- Q(s, a) + alpha * ((r + future reward estimate) - Q(s, a))
      • Q(s, a) <- Q(s, a) + alpha * ((r + max_a'(Q(s', a'))) - Q(s, a))
      • Q(s, a) <- Q(s, a) + alpha * ((r + gamma * max_a'(Q(s', a'))) - Q(s, a)) - the discount factor gamma gives preference to current rewards over future ones

alpha represents how much we value new information compared to how much we value old information. An alpha of 1 means we only consider the new estimate.

Greedy Decision-Making: when in state s, choose action a with highest Q(s, a)

Tension between explore vs. exploit (using knowledge we already have, which might not lead to the optimal solution).

epsilon-greedy

  • Set epsilon equal to how often we want to move randomly.
  • With probability 1 - epsilon, choose estimated best move.
  • With probability epsilon, choose a random move - especially useful in the beginning.
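
A sketch of the Q-learning update with epsilon-greedy action selection; the state/action types and the values of alpha, gamma, and epsilon are illustrative:

import random

Q = {}  # maps (state, action) -> value estimate, 0 by default

def update(state, action, reward, next_state, next_actions, alpha=0.5, gamma=0.9):
    # new value estimate = r + gamma * max over a' of Q(s', a')
    best_future = max((Q.get((next_state, a), 0) for a in next_actions), default=0)
    old = Q.get((state, action), 0)
    Q[(state, action)] = old + alpha * ((reward + gamma * best_future) - old)

def choose_action(state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                         # explore
    return max(actions, key=lambda a: Q.get((state, a), 0))  # exploit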

function approximation - approximating Q(s, a), often by a function combining various features, rather than storing one value for every state-action pair.

Unsupervised Learning

given input data without any additional feedback, learn patterns.

clustering - organizing a set of objects into groups in such a way that similar objects tend to be in the same group.

k-means clustering - algorithm for clustering data based on repeatedly assigning points to clusters and updating those clusters’ centers.

Neural Networks

  • Neurons are connected to and receive electrical signals from other neurons.
  • Neurons process input signals and can be activated.

pip install tensorflow-macos

Artificial Neural Network - mathematical model for learning inspired by biological neural networks:

  • Model mathematical function from input to outputs based on the structure and parameters of the network.
  • Allow for learning the network’s parameters based on data.

activation function - function applied to the weighted sum that decides whether a threshold is passed (e.g. “it rains or not”).

Schematics for a very simple neural network:

            w0
            |
x1 --w1--\  v
          g(w0 + w1*x1 + w2*x2)
x2 --w2--/

The neural network will learn what the values of w0, w1 and w2 should be in order to produce the results we expect. General formula: g(sum(x_i * w_i) + w0).

gradient descent - algorithm for minimizing loss when training a neural network. The gradient tells us in which direction we should move the weights in order to minimize the loss.

General idea:

  • Start with a random choice of weights
  • Repeat:
    • Calculate the gradient based on all data points: direction that will lead to decreasing loss (all data points - expensive operation).
    • Update weights according to the gradient.

Stochastic Gradient Descent - s/all data points/one data point randomly/

Mini-Batch Gradient Descent - s/all data points/one small batch/

Perceptron: a single one is only capable of learning a linearly separable decision boundary, because we’re taking a linear combination of the inputs.

multilayer neural network - artificial neural network with an input layer, an output layer, and at least one hidden layer. Gives us the ability to model more complex functions.

backpropagation - algorithm for training neural networks with hidden layers; the key algorithm that makes such networks possible. Idea:

  • start with a random choice of weights
  • Repeat:
    • Calculate error for output layer
    • For each layer, starting with output layer and moving inwards towards earliest hidden layer:
      • Propagate error back one layer.
      • Update weights.

deep neural network - neural network with multiple hidden layers.

overfitting

dropout - temporarily removing units, selected at random, from a neural network to prevent over-reliance on certain units.

computer vision - computational methods for analyzing and understanding digital images.

image convolution - applying a filter that adds each pixel value of an image to its neighbors, weighted according to a kernel matrix.

pooling - reducing the size of an input by sampling from regions in the input. max-pooling chooses the maximum value in each region.

convolutional neural network (CNN) - uses convolution, usually for analysing images.

image -> convolution -> pooling -> flattening -> neural network
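
A sketch of this pipeline in tf.keras, similar in spirit to the course's handwriting-recognition example; the layer sizes and the 28x28 input shape are illustrative:

import tensorflow as tf

model = tf.keras.models.Sequential([
    # convolution: learn 32 filters with a 3x3 kernel
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    # pooling: reduce size using 2x2 max-pooling
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    # flattening: turn the feature maps into a vector
    tf.keras.layers.Flatten(),
    # neural network: hidden layer with dropout, then output layer
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),  # e.g. 10 digit classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])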

General structure of a NN: input -> network -> output. This is a feed-forward neural network - a NN that has connections in only one direction. Limitation: a fixed number of input and output neurons.

Recurrent Neural Network - generates output that gets fed back into the network as input for future runs of the network:

              /--\
             v    |
    input -> network -> output

This allows the network to store state, which is particularly helpful when dealing with sequences of data.

Example: Microsoft’s CaptionBot, or analyzing videos.

1 -> N relationships, as opposed to the 1:1 mapping from before.

N -> 1: e.g. classification of a voice from an audio sample

N -> N: e.g. Google Translate

Human Language

Common tasks:

  • automatic summarization
  • information extraction
  • language identification
  • machine translation
  • named entity recognition
  • speech recognition (Alexa)
  • text classification
  • word sense disambiguation

syntax - structure

semantics - meaning

formal grammar - a system of rules for generating sentences in a language.

Context-Free Grammar

Replacing one symbol with another symbol.

Non-terminal symbols (noun - N, verb - V) are used to generate terminal symbols (“she”, “saw”).

On the left of the arrow there is always a non-terminal symbol; on the right there can be either terminal or non-terminal symbols. E.g. NP -> N | D N.

pip install nltk
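
A small context-free grammar with nltk; the grammar rules and the sentence are illustrative:

import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> N | D N
    VP -> V | V NP
    D -> "the" | "a"
    N -> "she" | "city" | "car"
    V -> "saw" | "walked"
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("she saw the city".split()):
    tree.pretty_print()  # draws the parse tree as ASCII art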

n-gram

a contiguous sequence of n items from a sample of text, e.g. character n-grams or word n-grams.

unigram - a contiguous sequence of 1 item from a sample of text.

bigram - … 2 items …

trigrams - … 3 items …

Allows for segmentation of the text for analysis; while the AI might not have seen the whole sentence before, it has likely encountered an n-gram from it before.

tokenization

the task of splitting a sequence of characters into pieces (tokens), e.g. to extract n-grams.

word tokenization - splitting into words.

Markov Model

It allows us to predict what the next unit will be.

pip install markovify
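
A minimal markovify sketch; corpus.txt is a placeholder for any training text:

import markovify

with open("corpus.txt") as f:
    model = markovify.Text(f.read())  # train a Markov model on the text

for _ in range(5):
    print(model.make_sentence())      # may return None on very small corpora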

Text Categorization

a classification problem, e.g. spam/not spam, or positive/negative sentiment.

Bag-of-Words - model that represents text as an unordered collection of words. Works pretty well for the positive/negative sentiment approach.

Naive Bayes

How to avoid multiplying by zero:

  • additive smoothing - adding a value alpha to each value in our distribution to smooth the data.
  • Laplace smoothing - adding 1 to each value in our distribution, pretending that we’ve seen each value one more time than we actually have.

Naive Bayes basically looks for differentiating words (which have a big weight in one category and not so in others).

Information Retrieval

the task of finding relevant documents in response to a query.

topic modeling - models for discovering the topics for a set of documents. Which words are important in a document?

tf-idf

term frequency - number of times a term appears in a document.

function words - words that have little meaning on their own, but are used to grammatically connect other words. E.g.: am, by, do, is, which, with, yet, ...

content words - words that carry meaning independently. E.g.: algorithm, category, computer, ...

inverse document frequency - measure of how common or rare a word is across documents. Formula: log(totalNumberOfDocuments / numberOfDocumentsContaining(word)).

tf-idf - ranking of which words are important in a document, computed by multiplying term frequency (TF) by inverse document frequency (IDF).
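
A minimal tf-idf sketch over a toy two-document corpus, in plain Python:

import math

corpus = {
    "doc1": ["the", "algorithm", "sorts", "the", "list"],
    "doc2": ["the", "list", "is", "empty"],
}

def tf_idf(word, doc):
    tf = corpus[doc].count(word)                                      # term frequency
    containing = sum(1 for words in corpus.values() if word in words)
    idf = math.log(len(corpus) / containing)                          # inverse document frequency
    return tf * idf

print(tf_idf("algorithm", "doc1"))  # high: word is rare across documents
print(tf_idf("the", "doc1"))        # 0: word appears in every document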

Semantics

information extraction - the task of extracting knowledge from documents.

WordNet

Word Representation

one-hot representation - representation of meaning as a vector with a single 1, and all other values 0.

distributed representation - representation of meaning distributed across multiple values.

You shall know a word by the company it keeps – J.R. Firth, 1957

word2vec - model for generating word vectors

skip-gram architecture - neural network architecture for predicting context words given a target word.