Program¶

0: Download Anaconda for Python 3.5 if you don't have it yet!

Part I:

Preliminaries:
- What's python?
- Why python?
- Installing python
- jupyter notebook basics
Python language (basic_syntax.ipynb):
- Basic syntax and core structures (loops, functions, lists, ifs, strings...)
- file operations (list directory, r/w files, get data from web,...)
- Example dogs vs cats.
coffee break!!

numpy (numpy.ipynb):
- basic matrices
- read image as matrix
- basics on matrix transformations.
- Construct data set for machine learning task.
matplotlib:
- plot image, histogram

Part II:

scikit-learn (machine_learning.ipynb):
- predict label on image
- show prediction metrics (accuracy, ROC, confusion matrix).
Deep learning (deep_learning.ipynb):
- Use pretrained CNN with keras.
Conclusions:
- my way of working.
- Where to go to learn more.

Dogs vs Cats kaggle¶

https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition

What's python?¶

It's a programming language.
It's an interpreted programming language.

Why python?¶

It's free
Specifically designed to be:
- Easy to read
- Easy to learn (we'll see...)
- Easy to maintain
It can be easily integrated with C, C++ and Java.
High level language.
It has a great ecosystem of packages for scientific computing, for scripting and for web development.

Scientific (core) packages:

numpy: for efficient and easy-to-read/write matrix manipulation (inspired by Matlab).
matplotlib: for graphics (also inspired by Matlab)
pandas: for tabular (DataFrame) structures (like R's Dataframe).
scikit-learn: for machine learning algorithms (really good documentation).
SciPy. Optimization, interpolation, fourier transform, sparse matrices...

Scientific (good to know) packages:

bokeh library for interactive graphics
seaborn high-level graphics based on matplotlib.
Deep learning (using GPU):
- Theano.
- Tensorflow the one released by google.
- keras built on top of Theano or TensorFlow. It comes with Pretrained CNN.
GPy. For gaussian processes.
numba for highly efficient python code.
sympy symbolic maths (derivatives, integrals, limits...)
cvx convex optimization.

GIS and teledetection:

gdal, rasterio io with rasters.
Folium. leaflet maps.

Images, computer vision:

PIL (pillow) open, resize images.
scikit-image
cv2 OpenCV highly optimized computer vision library

Web development packages:

Flask. Lightweight web server.
Jinja2. templates (generating documents, webs)

Other:

luigi pipelines of batch jobs

In [1]:

# %load server.py
from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello IPL!"

if __name__ == "__main__" and False:
    app.run(port=6969, host="0.0.0.0")

Click: http://asuranceturix.uv.es:6969/

Installing python¶

Getting started with conda, anaconda...¶

conda is a package manager. Anaconda is a set of (~150) python packages for scientific python together with jupyter and spyder (spyder is a MATLAB/rstudio like environment).

Download & install python 3.5 anaconda: https://www.continuum.io/downloads (The 150 packages together with conda package manager).
Download repository of the talk: (https://github.com/gonzmg88/python_course_ipl). (clone repo or download zip from here)
Open terminal in the folder of the repository of the talk.
Copy the train.zip in the folder of the repository.
Execute: jupyter notebook and select the notebook basic_syntax.ipynb.

Resources:

Python Zen Principles¶

In [2]:

import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

jupyter basic usage¶

Insert > Cell Above / Cell Below
Edit > Split Cell / Move Cell Up/Down
Help > Markdown syntax help. In markdown cells you can embed HTML
Double click on the cell to edit the content and see the source
Ctrl + Enter execute current cell
In markdown cells you can write $\LaTeX$ equations: $$ \mathcal{L}(\omega) = \sum_{i=0}^N \Big(f(x_i;\omega)-y_i\Big)^2 $$
Cell > Cell type to distinguish between code cells and text (markdown) cells.
File > Download as > HTML to download the content of the notebook as plain HTML.
This presentation is done in jupyter:

jupyter nbconvert --to slides PythonCourse.ipynb --reveal-prefix=reveal.js --post serve

Basic syntax notebook¶

basic_python.ipynb

Preparing for the exercises¶

Code for downloading the data of dogs vs cats kaggle competition.

In [3]:

import os
import urllib.request
import zipfile

# urllib.request.urlretrieve('http://example.com/big.zip', 'file/on/disk.zip')
train_file = "train.zip"
train_folder = "train"
# Download data

if not os.path.exists(train_file) and not os.path.exists(train_folder):
    print("Proceeding to download the data. This process may take some time..")
    urllib.request.urlretrieve("https://www.dropbox.com/s/8lbkqktfofzjraj/train.zip?raw=1", train_file)
    print("Done")
else:
    print("data file {} has already been downloaded".format(train_file))

# unzip files

# If the train folder does not exist, extract all elements
if not os.path.exists(train_folder):
    with zipfile.ZipFile(train_file, 'r') as myzip:
        myzip.extractall()
    print("Extracted")
else:
    print("Data has already been extracted")
    
all_files = os.listdir("train")

data file train.zip has already been downloaded
Data has already been extracted

Exercise 1¶

Retrieve a list with only dog files.

In [4]:

# "standard" solution:
dog_files = []
for fil in all_files:
    if 'dog' in fil:
        dog_files.append(fil)
print(len(dog_files))

In [5]:

# "list comprehension" solution:
dog_files = [fil for fil in all_files if 'dog' in fil]
print(len(dog_files))

In [6]:

# Cheating solution:
import glob

dog_files = glob.glob("train/dog*.jpg")
print(len(dog_files))

Exercise 2¶

Display an image using Image from IPython.display

In [7]:

from IPython.display import Image

Image(dog_files[30])

Out[7]:

Exercise 3¶

Create a function that downloads and unzips the data. Put that function in a file under the current directory. Import the module and run the function.

In [8]:

# %load dogs_vs_cats.py
import os
import urllib.request
import zipfile

def image_files(train_file = "train.zip",train_folder = "train"):
    if not os.path.exists(train_file):
        print("Proceeding to download the data. This process may take some time..")
        urllib.request.urlretrieve("https://www.dropbox.com/s/8lbkqktfofzjraj/train.zip?raw=1", train_file)
        print("Done")
    else:
        print("data file {} has already been downloaded".format(train_file))

    # unzip files
    # If the train folder does not exist, extract all elements
    if not os.path.exists(train_folder):
        with zipfile.ZipFile(train_file, 'r') as myzip:
            myzip.extractall()
        print("Extracted")
    else:
        print("Data has already been extracted")

    # Use the train_folder argument instead of a hard-coded "train" path
    return [os.path.join(train_folder, img) for img in os.listdir(train_folder)]

`numpy`¶

NumPy, short for Numerical Python, is the fundamental package required for high performance scientific computing and data analysis. It is the foundation on which nearly all of the higher-level scientific computing tools are built. Here are some of the things it provides:

ndarray , a fast and space-efficient multidimensional array providing vectorized arithmetic operations and sophisticated broadcasting capabilities.
Standard mathematical functions for fast operations on entire arrays of data without having to write loops
Linear algebra, random number generation, and Fourier transform capabilities
Tools for integrating code written in C, C++, and Fortran

Taken from chapter 4.
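
The features listed above can be illustrated with a minimal sketch. The arrays and variable names below are made up for illustration and do not come from the course notebooks.

```python
import numpy as np

# Vectorized arithmetic: operate on the whole array, no explicit loop.
x = np.arange(5, dtype=np.float64)        # [0., 1., 2., 3., 4.]
squared_plus_one = x ** 2 + 1             # applied element by element

# The same computation with a plain Python loop, for comparison.
looped = np.array([value ** 2 + 1 for value in x])
same_result = np.allclose(squared_plus_one, looped)

# Linear algebra: solve the linear system A w = b.
A = np.array([[2.0, 0.0],
              [0.0, 4.0]])
b = np.array([2.0, 8.0])
w = np.linalg.solve(A, b)

# Random number generation with a fixed seed so runs are repeatable.
rng = np.random.RandomState(0)
samples = rng.normal(size=(3, 2))         # 3 x 2 array of Gaussian samples

print(squared_plus_one, same_result, w, samples.shape)
```

The vectorized expression and the loop give the same values; the vectorized form is shorter and avoids the per-element Python overhead.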

(Annoying/Awesome) Differences with `MATLAB`¶

Indexes start at 0 and the last element of a slice is not included! 0:2 means indexes 0,1
Multiplication * is element-wise multiplication
Column vectors are NOT 1D arrays. If you want a column vector from a 1D vector you need to expand_dims your 1D vector.
It is possible to operate on arrays of different shapes according to NumPy's broadcasting rules.
MATLAB is designed for matrices while numpy is designed for multidimensional arrays.
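
A short sketch of these differences, using small made-up arrays (the variable names are illustrative only):

```python
import numpy as np

# Zero-based indexing; the end of a slice is excluded.
v = np.arange(5)                       # [0, 1, 2, 3, 4]
first_two = v[0:2]                     # indexes 0 and 1 only

# * is element-wise; use @ (or .dot) for the matrix product.
A = np.array([[1, 2],
              [3, 4]])
elementwise = A * A
matrix_product = A @ A

# A 1D array is not a column vector; expand_dims adds the missing axis.
row = np.array([1, 2, 3])              # shape (3,)
column = np.expand_dims(row, axis=1)   # shape (3, 1)

# Broadcasting: shapes (3, 1) and (3,) combine into a (3, 3) result.
table = column + row

print(first_two, elementwise, matrix_product, column.shape, table, sep="\n")
```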

Exercise 4¶

Generate a training set of 4000 randomly selected images (features and labels).

The feature matrix must have in each row a flattened image resized to (50,50,3) (i.e. its shape must be (4000,50*50*3)).
The label vector must contain 1 if the row holds the image of a dog and 0 if it holds a cat.

Show image number 124 of such array using matplotlib.pyplot.

In [9]:

import dogs_vs_cats
all_files = dogs_vs_cats.image_files()

data file train.zip has already been downloaded
Data has already been extracted

In [10]:

import skimage.transform as sktrans
from scipy import ndimage
import numpy as np

n_images = 4000
index_files_selected = np.random.choice(len(all_files),size=n_images,replace=False)
feature_array = np.ndarray((n_images,50*50*3),dtype=np.float16)
label = np.ndarray((n_images),dtype=np.uint8)
for i,indice in enumerate(index_files_selected):
    print("Reading image (%d/%d)"%(i,n_images))
    image = ndimage.imread(all_files[indice])
    image_down = sktrans.resize(image, (50, 50, 3))
    feature_array[i,:] = image_down.ravel() # equivalent to feature_array[i]
    label[i] = "dog" in all_files[indice]

In [11]:

import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(feature_array[124].reshape(50,50,3))

Out[11]:

<matplotlib.image.AxesImage at 0x7fa16c03e128>

Exercise 5¶

Count proportion of dogs/cats on your sampled labels.

In [12]:

prop_cats = np.sum(label == 0)/len(label)
prop_dogs =  1-prop_cats
print("There are {0:.2f}% of cats and {1:.2f}% of dogs".format(prop_cats*100,prop_dogs*100))

There are 50.05% of cats and 49.95% of dogs

Exercise 6¶

Save the training set as a .mat file. Use scipy.io.savemat. Reload the training set with scipy.io.loadmat.

In [13]:

import scipy.io as sio
?sio.savemat

In [14]:

sio.savemat("training_set.mat",
            {"feature_array": feature_array,
             "labels": label})

cargado = sio.loadmat("training_set.mat")
cargado

# feature_array_loaded = cargado["feature_array"]

Out[14]:

{'__globals__': [],
 '__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Mon Oct 10 12:06:11 2016',
 '__version__': '1.0',
 'feature_array': array([[ 0.22436523,  0.30664062,  0.4831543 , ...,  0.93798828,
          0.93798828,  0.93798828],
        [ 0.47119141,  0.3996582 ,  0.34399414, ...,  0.57519531,
          0.56738281,  0.59423828],
        [ 0.25463867,  0.13439941,  0.0791626 , ...,  0.81640625,
          0.78515625,  0.73876953],
        ..., 
        [ 0.53320312,  0.35302734,  0.10980225, ...,  0.66113281,
          0.54345703,  0.43359375],
        [ 0.4296875 ,  0.68115234,  0.83740234, ...,  1.        ,
          0.98046875,  0.98046875],
        [ 0.59228516,  0.65869141,  0.76074219, ...,  0.47802734,
          0.43017578,  0.37988281]]),
 'labels': array([[1, 0, 1, ..., 0, 0, 0]], dtype=uint8)}
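
Note in the output above that the 1D label vector comes back from loadmat as a 2D array with a single row. The sketch below reproduces this on a small made-up data set (the file name toy_set.mat and the arrays are assumptions, not the course data) and flattens the labels back to 1D.

```python
import os
import tempfile

import numpy as np
import scipy.io as sio

# Toy stand-ins for feature_array and label (hypothetical values).
features = np.arange(10, dtype=np.float64).reshape(5, 2)
labels = np.array([1, 0, 1, 0, 0], dtype=np.uint8)

# Save and reload inside a temporary folder so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_set.mat")
    sio.savemat(path, {"feature_array": features, "labels": labels})
    loaded = sio.loadmat(path)

labels_2d = loaded["labels"]      # 1D arrays are stored as a single row
labels_1d = labels_2d.ravel()     # back to a flat vector of length 5

print(labels_2d.shape, labels_1d.shape)
```

Flattening matters later on, because scikit-learn estimators expect a 1D label vector.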

Machine Learning with `scikit-learn`¶

fit a model to a set of images
test the model.
hyper-parameter estimation with cross-validation
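
The three steps above can be sketched as follows. This is only an illustration on synthetic data: make_classification stands in for the image features, and the parameter grid is an arbitrary choice, not the one used in the course notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the image features (assumption: not the real data).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Hyper-parameter estimation: 3-fold cross-validation over a small grid.
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)      # fits every combination, then refits the best

# Test the refitted best model on data it has never seen.
test_score = search.score(X_test, y_test)

print("Best parameters:", search.best_params_)
print("Test accuracy: {:.3f}".format(test_score))
```

Keeping the test split out of the grid search means the reported accuracy is not biased by the hyper-parameter selection.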

Exercise 7¶

Train a different classifier. http://scikit-learn.org/stable/supervised_learning.html

In [15]:

import dogs_vs_cats
all_files = dogs_vs_cats.image_files()
train_features, train_labels, test_features, test_labels = dogs_vs_cats.training_test_datasets(all_files,15000,400,
                                                                                              (33,33,1))

data file train.zip has already been downloaded
Data has already been extracted
Loading train set
Loading test set

In [16]:

from  sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100,verbose=1,n_jobs=-1)
clf.fit(train_features, train_labels)

[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    8.1s finished

Out[16]:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=None, verbose=1,
            warm_start=False)

In [17]:

print("Train score: {}".format(clf.score(train_features,train_labels)))
print("Test score: {}".format(clf.score(test_features, test_labels)))

[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.1s finished
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.0s finished

Train score: 1.0
Test score: 0.855

In [18]:

%matplotlib inline
y_proba = clf.predict_proba(test_features)
dogs_vs_cats.plotROC(test_labels,y_proba[:,1])

[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.0s finished
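
The helper dogs_vs_cats.plotROC is not shown in these slides. As a rough sketch of how the metrics named in the program (accuracy, ROC, confusion matrix) might be computed with sklearn.metrics, here is an example on made-up labels and scores; it is not the course's helper, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_auc_score, roc_curve)

# Made-up ground truth (1 = dog, 0 = cat) and predicted dog probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# ROC curve: false/true positive rates over all score thresholds.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

# Hard predictions at a 0.5 threshold for accuracy and the confusion matrix.
y_pred = (y_score >= 0.5).astype(int)
acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted

print("AUC: {:.3f}  accuracy: {:.3f}".format(auc, acc))
print(cm)
```

The pairs (fpr, tpr) can be passed to matplotlib's plt.plot to draw the curve.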

Conclusions¶

python is easy to use
python is fast to program.
python has many modules/packages that are really powerful.

My way of working¶

Try things out and document them in notebooks.
When I have useful code snippets I put them in classes/functions in modules (.py).
Edit/manage these .py files with spyder/pycharm/eclipse (version control, conda integration, autocomplete, module inspection...)
Show the usage of these functions/classes in example notebooks.

Where to go to learn more?¶

Take a problem/project/goal (image recognition, making a web site, a kaggle competition, migrating some code from another language) and...

Learn about the modules you have to use to solve this problem.
Everything is rapidly evolving...

Where to go to learn more?¶

KDnuggets blog about ml/data science.
pyimagesearch blog about image classification with python (good reference about installing tensorflow/theano/keras/cv2 with gpu support).
Deep Learning with tensorflow udacity course by google https://www.udacity.com/course/deep-learning--ud730
Using pandas and scikit-learn.
kaggle kernels. Many code snippets; the competitions and the blog have tons of code examples.
coursera, edx courses.
