Program

0: Download Anaconda for Python 3.5 if you don't have it yet!

Part I:

  • Preliminaries:
    • What's python?
    • Why python?
    • Installing python
    • jupyter notebook basics
  • Python language (basic_syntax.ipynb):
    • Basic syntax and core structures (loops, functions, lists, ifs, strings...)
    • file operations (list directory, r/w files, get data from web,...)
    • Example dogs vs cats.
  • coffee break!!
  • numpy (numpy.ipynb):
    • basic matrices
    • read image as matrix
    • basics on matrix transformations.
    • Construct data set for machine learning task.
  • matplotlib:
    • plot image, histogram

Part II:

  • scikit-learn (machine_learning.ipynb):
    • predict label on image
    • show prediction metrics (accuracy, ROC, confusion matrix).
  • Deep learning (deep_learning.ipynb):
    • Use pretrained CNN with keras.
  • Conclusions:
    • my way of working.
    • Where to go to learn more.

What's python?

Why python?

  • It's free
  • Specifically designed to be:
    • Easy to read
    • Easy to learn (we'll see...)
    • Easy to maintain
  • It can be easily integrated with C, C++ and Java.
  • High level language.
  • It has a great ecosystem of packages for scientific computing, for scripting and for web development.

Scientific (core) packages:

  • numpy: for efficient and easy to read/write matrix manipulation (inspired by Matlab).
  • matplotlib: for graphics (also inspired by Matlab)
  • pandas: for tabular (DataFrame) structures (like R's Dataframe).
  • scikit-learn: for machine learning algorithms (really good documentation).
  • SciPy. Optimization, interpolation, fourier transform, sparse matrices...

Scientific (good to know) packages:

  • bokeh library for interactive graphics
  • seaborn high-level graphics based on matplotlib.
  • Deep learning (using GPU):
  • GPy. For gaussian processes.
  • numba for highly efficient python code.
  • sympy symbolic maths (derivatives, integrals, limits...)
  • cvx convex optimization.

GIS and remote sensing:

Images, computer vision:

  • PIL (pillow) open, resize images.
  • scikit-image
  • cv2 OpenCV highly optimized computer vision library

Web development packages:

  • Flask. Lightweight web framework.
  • Jinja2. templates (generating documents, webs)

Other:

  • luigi pipelines of batch jobs
In [1]:
# %load server.py
from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello IPL!"

if __name__ == "__main__" and False:
    app.run(port=6969, host="0.0.0.0")

Installing python

Getting started with conda, anaconda...

conda is a package manager. Anaconda is a distribution of about 150 Python packages for scientific computing, bundled with Jupyter and Spyder (Spyder is a MATLAB/RStudio-like environment).

Python Zen Principles

In [2]:
import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

jupyter basic usage

  • Insert > Cell Above / Cell Below
  • Edit > Split Cell / Move Cell Up/Down
  • Help > Markdown syntax help. In markdown cells you can embed HTML
  • Double click on the cell to edit the content and see the source
  • Ctrl + Enter execute current cell
  • In markdown cells you can write $\LaTeX$ equations: $$ \mathcal{L}(\omega) = \sum_{i=0}^N \Big(f(x_i;\omega)-y_i\Big)^2 $$
  • Cell > Cell type to distinguish between code cells and text (markdown) cells.
  • File > Download as > HTML to download the content of the notebook as plain HTML.
  • This presentation is done in jupyter:

jupyter nbconvert --to slides PythonCourse.ipynb --reveal-prefix=reveal.js --post serve

Basic syntax notebook

basic_python.ipynb

Preparing for the exercises

Code for downloading the data of the Dogs vs. Cats Kaggle competition.

In [3]:
import os
import urllib.request
import zipfile

# urllib.request.urlretrieve('http://example.com/big.zip', 'file/on/disk.zip')
train_file = "train.zip"
train_folder = "train"
# Download data

if not os.path.exists(train_file) and not os.path.exists(train_folder):
    print("Proceeding to download the data. This process may take some time..")
    urllib.request.urlretrieve("https://www.dropbox.com/s/8lbkqktfofzjraj/train.zip?raw=1", train_file)
    print("Done")
else:
    print("data file {} has already been downloaded".format(train_file))

# unzip files

# If folder train does not exist extract all elements
if not os.path.exists(train_folder):
    with zipfile.ZipFile(train_file, 'r') as myzip:
        myzip.extractall()
    print("Extracted")
else:
    print("Data has already been extracted")
    
all_files = os.listdir("train")
data file train.zip has already been downloaded
Data has already been extracted

Exercise 1

Retrieve a list with only dog files.

In [4]:
# "standard" solution:
dog_files = []
for fil in all_files:
    if 'dog' in fil:
        dog_files.append(fil)
print(len(dog_files))
12500
In [5]:
# "list comprehension" solution:
dog_files = [fil for fil in all_files if 'dog' in fil]
print(len(dog_files))
12500
In [6]:
# Cheating solution:
import glob

dog_files = glob.glob("train/dog*.jpg")
print(len(dog_files))
12500
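
Note that the three solutions are not fully interchangeable: os.listdir yields bare file names, while glob.glob yields paths that include the folder. A minimal sketch of the difference, assuming a throwaway temporary folder with made-up file names instead of the real train folder:

```python
import glob
import os
import tempfile

# Hypothetical miniature "train" folder, created only for this illustration.
with tempfile.TemporaryDirectory() as folder:
    for name in ["dog.0.jpg", "dog.1.jpg", "cat.0.jpg"]:
        open(os.path.join(folder, name), "w").close()

    # os.listdir returns bare file names ...
    from_listdir = sorted(f for f in os.listdir(folder) if "dog" in f)
    # ... while glob.glob returns paths that include the folder.
    from_glob = sorted(glob.glob(os.path.join(folder, "dog*.jpg")))

    # Stripping the folder makes both results comparable.
    same_names = [os.path.basename(p) for p in from_glob] == from_listdir

print(from_listdir)
print(same_names)
```

This matters later on: a path returned by glob can be opened directly, whereas a name returned by os.listdir has to be joined with the folder first.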

Exercise 2

Display an image using Image from IPython.display

In [7]:
from IPython.display import Image

Image(dog_files[30])
Out[7]:

Exercise 3

Create a function that downloads and unzips the data. Put that function in a file under the current directory. Import the module and run the function.

In [8]:
# %load dogs_vs_cats.py
import os
import urllib.request
import zipfile

def image_files(train_file = "train.zip",train_folder = "train"):
    if not os.path.exists(train_file):
        print("Proceeding to download the data. This process may take some time..")
        urllib.request.urlretrieve("https://www.dropbox.com/s/8lbkqktfofzjraj/train.zip?raw=1", train_file)
        print("Done")
    else:
        print("data file {} has already been downloaded".format(train_file))

    # unzip files
    # If folder train does not exist extract all elements
    if not os.path.exists(train_folder):
        with zipfile.ZipFile(train_file, 'r') as myzip:
            myzip.extractall()
        print("Extracted")
    else:
        print("Data has already been extracted")

    # Build the paths from train_folder instead of a hard-coded "train"
    return [os.path.join(train_folder, img) for img in os.listdir(train_folder)]

numpy

NumPy, short for Numerical Python, is the fundamental package required for high performance scientific computing and data analysis. It is the foundation on which nearly all of the higher-level scientific computing tools are built. Here are some of the things it provides:

  • ndarray , a fast and space-efficient multidimensional array providing vectorized arithmetic operations and sophisticated broadcasting capabilities.
  • Standard mathematical functions for fast operations on entire arrays of data without having to write loops
  • Linear algebra, random number generation, and Fourier transform capabilities
  • Tools for integrating code written in C, C++, and Fortran

Taken from chapter 4.
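
To make "without having to write loops" concrete, here is a small sketch comparing an explicit Python loop with the equivalent vectorized expression. The array and the formula are made up for illustration:

```python
import numpy as np

x = np.arange(5, dtype=np.float64)   # values 0, 1, 2, 3, 4

# Plain Python: an explicit loop over the elements.
squares_loop = [value ** 2 + 1 for value in x]

# NumPy: the same arithmetic applied to the whole array at once.
squares_vec = x ** 2 + 1

print(squares_vec)
print(np.allclose(squares_loop, squares_vec))
```

Both versions compute the same numbers; the vectorized one is shorter and, on large arrays, usually much faster.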

(Annoying/Awesome) Differences with MATLAB

  • Indexes start at 0 and the last element of a slice is not included! 0:2 means indexes 0,1
  • Multiplication * is element-wise multiplication
  • Column vectors are NOT 1D arrays. If you want a column vector from a 1D array you need to apply expand_dims to it.
  • It is possible to operate on arrays of different shapes according to the broadcasting rules.
  • MATLAB is designed for matrices while numpy is designed for multidimensional arrays.
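
The points above can be checked on a couple of tiny arrays. This is only a sketch with made-up values; the variable names are illustrative:

```python
import numpy as np

v = np.array([10, 20, 30])
A = np.array([[1, 2], [3, 4]])

first_two = v[0:2]                   # indexes 0 and 1 -> [10, 20]
elementwise = A * A                  # element-wise product -> [[1, 4], [9, 16]]
matrix_product = A.dot(A)            # matrix product -> [[7, 10], [15, 22]]

column = np.expand_dims(v, axis=1)   # shape (3,) becomes shape (3, 1)

# Broadcasting: a (3, 1) array combined with a (3,) array gives a (3, 3) array
# where table[i, j] == v[i] + v[j].
table = column + v

print(v.shape, column.shape, table.shape)
```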

Exercise 4

Generate a training set of 4000 randomly selected images (features and labels).

  • Feature matrix must have in each row a flattened image resized to (50,50,3) (i.e. shape must be (4000,50*50*3)).
  • Label vector must have 1 if the row contains the image of a dog and 0 if it has a cat.

Show image number 124 of such array using matplotlib.pyplot.

In [9]:
import dogs_vs_cats
all_files = dogs_vs_cats.image_files()
data file train.zip has already been downloaded
Data has already been extracted
In [10]:
import skimage.io as skio
import skimage.transform as sktrans
import numpy as np

n_images = 4000
# Sample n_images distinct file indexes
index_files_selected = np.random.choice(len(all_files), size=n_images, replace=False)
feature_array = np.empty((n_images, 50*50*3), dtype=np.float16)
label = np.empty(n_images, dtype=np.uint8)
for i, indice in enumerate(index_files_selected):
    print("Reading image (%d/%d)" % (i + 1, n_images))
    # scipy.ndimage.imread is no longer available in recent SciPy releases,
    # so skimage.io.imread is used here instead
    image = skio.imread(all_files[indice])
    image_down = sktrans.resize(image, (50, 50, 3))  # float values in [0, 1]
    feature_array[i, :] = image_down.ravel()  # equivalent to feature_array[i]
    label[i] = "dog" in all_files[indice]  # 1 for dogs, 0 for cats
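
Flattening an image with ravel and restoring it with reshape are inverse operations, which is why a row of the feature matrix can be displayed again as a picture. A minimal sketch, using a random array as a stand-in for a resized image (no real picture is read here):

```python
import numpy as np

# Random stand-in for one resized image of shape (height, width, channels).
image = np.random.rand(50, 50, 3)

row = image.ravel()                  # shape (7500,), i.e. 50*50*3 values
restored = row.reshape(50, 50, 3)    # back to (height, width, channels)

print(row.shape, restored.shape)
print(np.array_equal(image, restored))
```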
    
In [11]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(feature_array[124].reshape(50,50,3))
Out[11]:
<matplotlib.image.AxesImage at 0x7fa16c03e128>

Exercise 5

Compute the proportion of dogs and cats in your sampled labels.

In [12]:
prop_cats = np.sum(label == 0)/len(label)
prop_dogs =  1-prop_cats
print("There are {0:.2f}% of cats and {1:.2f}% of dogs".format(prop_cats*100,prop_dogs*100))
There are 50.05% of cats and 49.95% of dogs
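
Since the labels are 0/1, the same proportions can also be read off the mean of the array or from np.bincount. A small sketch with made-up labels (1 = dog, 0 = cat), not the labels sampled above:

```python
import numpy as np

# Made-up labels for illustration: 1 = dog, 0 = cat.
label = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)

prop_dogs = label.mean()                     # mean of a 0/1 array = fraction of ones
counts = np.bincount(label, minlength=2)     # counts[0] = cats, counts[1] = dogs

print(prop_dogs, counts)
```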

Exercise 6

Save the training set as a .mat file. Use scipy.io.savemat. Reload the training set with scipy.io.loadmat.

In [13]:
import scipy.io as sio
?sio.savemat
In [14]:
sio.savemat("training_set.mat",
            {"feature_array": feature_array,
             "labels": label})

cargado = sio.loadmat("training_set.mat")
cargado

# feature_array_loaded = cargado["feature_array"]
Out[14]:
{'__globals__': [],
 '__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Mon Oct 10 12:06:11 2016',
 '__version__': '1.0',
 'feature_array': array([[ 0.22436523,  0.30664062,  0.4831543 , ...,  0.93798828,
          0.93798828,  0.93798828],
        [ 0.47119141,  0.3996582 ,  0.34399414, ...,  0.57519531,
          0.56738281,  0.59423828],
        [ 0.25463867,  0.13439941,  0.0791626 , ...,  0.81640625,
          0.78515625,  0.73876953],
        ..., 
        [ 0.53320312,  0.35302734,  0.10980225, ...,  0.66113281,
          0.54345703,  0.43359375],
        [ 0.4296875 ,  0.68115234,  0.83740234, ...,  1.        ,
          0.98046875,  0.98046875],
        [ 0.59228516,  0.65869141,  0.76074219, ...,  0.47802734,
          0.43017578,  0.37988281]]),
 'labels': array([[1, 0, 1, ..., 0, 0, 0]], dtype=uint8)}
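
Notice in the output above that labels comes back as a 2-D array of shape (1, N): MAT-files store arrays with at least two dimensions. A minimal sketch of how to get a 1-D vector back, assuming a tiny made-up label vector and an in-memory buffer instead of a file on disk:

```python
import io

import numpy as np
import scipy.io as sio

# Made-up labels for illustration.
label = np.array([1, 0, 1, 0], dtype=np.uint8)

# An in-memory buffer stands in for training_set.mat.
buffer = io.BytesIO()
sio.savemat(buffer, {"labels": label})

buffer.seek(0)
as_saved = sio.loadmat(buffer)["labels"]                    # 2-D row vector
buffer.seek(0)
squeezed = sio.loadmat(buffer, squeeze_me=True)["labels"]   # 1-D again

print(as_saved.shape, squeezed.shape)
```

Calling ravel() on the loaded array is an alternative to squeeze_me=True.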

Machine Learning with scikit-learn

  • fit a model to a set of images
  • test the model.
  • hyper-parameter estimation with cross-validation
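
The three steps above can be sketched on synthetic data. The data set, the parameter grid and the choice of RandomForestClassifier are assumptions made for this illustration, not the course's image pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data: the label only depends on the first feature.
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] > 0.5).astype(np.uint8)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Hyper-parameter estimation: 3-fold cross-validation over max_depth.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 5]},
    cv=3)
search.fit(X_train, y_train)                 # fit (refits the best model)

test_score = search.score(X_test, y_test)    # test on held-out data
print(search.best_params_, test_score)
```

Keeping the test set out of the cross-validation gives an estimate of the accuracy on data the model has never seen.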

Exercise 7

Train a different classifier. http://scikit-learn.org/stable/supervised_learning.html

In [15]:
import dogs_vs_cats
all_files = dogs_vs_cats.image_files()
train_features, train_labels, test_features, test_labels = dogs_vs_cats.training_test_datasets(all_files,15000,400,
                                                                                              (33,33,1))
data file train.zip has already been downloaded
Data has already been extracted
Loading train set
Loading test set
In [16]:
from  sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100,verbose=1,n_jobs=-1)
clf.fit(train_features, train_labels)
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    8.1s finished
Out[16]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=None, verbose=1,
            warm_start=False)
In [17]:
print("Train score: {}".format(clf.score(train_features,train_labels)))
print("Test score: {}".format(clf.score(test_features, test_labels)))
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.1s finished
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.0s finished
Train score: 1.0
Test score: 0.855
In [18]:
%matplotlib inline
y_proba = clf.predict_proba(test_features)
dogs_vs_cats.plotROC(test_labels,y_proba[:,1])
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.0s finished
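
The helper dogs_vs_cats.plotROC is not listed here. As a rough sketch of the metrics mentioned in the program (ROC and confusion matrix), the following uses sklearn.metrics on four made-up scores; the 0.5 threshold is an arbitrary choice for the illustration:

```python
import numpy as np
from sklearn.metrics import auc, confusion_matrix, roc_curve

# Made-up true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# ROC curve: false/true positive rates for every threshold on the scores.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area = auc(fpr, tpr)                       # area under the ROC curve

# Confusion matrix after thresholding the scores at 0.5:
# rows are true classes, columns are predicted classes.
y_pred = (y_score >= 0.5).astype(int)
cm = confusion_matrix(y_true, y_pred)

print(area)
print(cm)
```

The fpr and tpr arrays are what one would pass to plt.plot to draw the curve.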

Conclusions

  • python is easy to use
  • python is fast to program.
  • python has many modules/packages that are really powerful.

My way of working

  • Try things out and document them in notebooks.
  • When I have useful code snippets I put them in classes/functions in modules (.py).
  • Edit/manage these .py files with spyder/pycharm/eclipse (version control, conda integration, autocomplete, module inspection...)
  • Show the usage of these functions/classes in example notebooks.

Where to go to learn more?

  • Pick a problem/project/goal (image recognition, building a web site, a Kaggle competition, migrating some code from another language) and...

  • Learn about the modules you need to solve that problem.
  • Everything is evolving rapidly...
