
Step 0: download Anaconda for Python 3.5 if you don't have it already!
Part I:
- Jupyter notebook basics (basic_syntax.ipynb)
- numpy (numpy.ipynb)
- matplotlib
Part II:
- scikit-learn (machine_learning.ipynb)
- deep learning (deep_learning.ipynb): keras
Scientific (core) packages:
Scientific (good to know) packages:
GIS and teledetection:
Images, computer vision:
Web development packages:
Other:
# %load server.py
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello IPL!"

if __name__ == "__main__":
    app.run(port=6969, host="0.0.0.0")
conda is a package manager. Anaconda is a set of ~150 Python packages for scientific computing, together with Jupyter and Spyder (Spyder is a MATLAB/RStudio-like environment).
Place train.zip in the folder of the repository. Run jupyter notebook and select the notebook basic_syntax.ipynb. Resources:
import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
- Insert > Cell Above / Cell Below
- Edit > Split Cell / Move Cell Up/Down
- Help > Markdown syntax help. In markdown cells you can embed HTML.
- Ctrl + Enter executes the current cell.
- Cell > Cell Type to distinguish between code cells and text (markdown) cells.
- File > Download as > HTML to download the content of the notebook as plain HTML.
- jupyter nbconvert --to slides PythonCourse.ipynb --reveal-prefix=reveal.js --post serve
Code for downloading the data of the Dogs vs. Cats Kaggle competition.
import os
import urllib.request
import zipfile

# urllib.request.urlretrieve('http://example.com/big.zip', 'file/on/disk.zip')
train_file = "train.zip"
train_folder = "train"

# Download data
if not os.path.exists(train_file) and not os.path.exists(train_folder):
    print("Proceeding to download the data. This process may take some time...")
    urllib.request.urlretrieve("https://www.dropbox.com/s/8lbkqktfofzjraj/train.zip?raw=1", train_file)
    print("Done")
else:
    print("data file {} has already been downloaded".format(train_file))

# Unzip files: if the train folder does not exist, extract all elements
if not os.path.exists(train_folder):
    with zipfile.ZipFile(train_file, 'r') as myzip:
        myzip.extractall()
    print("Extracted")
else:
    print("Data has already been extracted")

all_files = os.listdir("train")
data file train.zip has already been downloaded
Data has already been extracted
Retrieve a list containing only the dog files.
# "standard" solution:
dog_files = []
for fil in all_files:
    if 'dog' in fil:
        dog_files.append(fil)
print(len(dog_files))
12500
# "list comprehension" solution:
dog_files = [fil for fil in all_files if 'dog' in fil]
print(len(dog_files))
12500
# Cheating solution:
import glob
dog_files = glob.glob("train/dog*.jpg")
print(len(dog_files))
12500
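An equivalent, more modern alternative is the standard-library pathlib. A small self-contained sketch (using a hypothetical temporary folder standing in for the course's train/ directory):

```python
from pathlib import Path
import tempfile

# Hypothetical demo folder standing in for the train/ directory
tmp = Path(tempfile.mkdtemp())
for name in ["dog.0.jpg", "dog.1.jpg", "cat.0.jpg"]:
    (tmp / name).touch()

# Path.glob matches the same pattern as glob.glob("train/dog*.jpg")
dog_files = sorted(str(p) for p in tmp.glob("dog*.jpg"))
print(len(dog_files))  # 2 dog files in the demo folder
```

Path.glob returns Path objects lazily; sorted() gives a stable, reproducible order.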
Display an image using Image from IPython.display.
from IPython.display import Image
Image(dog_files[30])
Create a function that downloads and unzips the data. Put that function in a file under the current directory. Import the module and run the function.
# %load dogs_vs_cats.py
import os
import urllib.request
import zipfile

def image_files(train_file="train.zip", train_folder="train"):
    if not os.path.exists(train_file):
        print("Proceeding to download the data. This process may take some time...")
        urllib.request.urlretrieve("https://www.dropbox.com/s/8lbkqktfofzjraj/train.zip?raw=1", train_file)
        print("Done")
    else:
        print("data file {} has already been downloaded".format(train_file))
    # Unzip files: if the train folder does not exist, extract all elements
    if not os.path.exists(train_folder):
        with zipfile.ZipFile(train_file, 'r') as myzip:
            myzip.extractall()
        print("Extracted")
    else:
        print("Data has already been extracted")
    return [os.path.join("train", img) for img in os.listdir("train")]
numpy

NumPy, short for Numerical Python, is the fundamental package required for high-performance scientific computing and data analysis. It is the foundation on which nearly all of the higher-level scientific computing tools are built. Here are some of the things it provides:

- ndarray, a fast and space-efficient multidimensional array providing vectorized arithmetic operations and sophisticated broadcasting capabilities. (Taken from chapter 4.)
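A small illustration of the vectorized arithmetic and broadcasting just mentioned:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # 2x3 array: [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])     # shape (3,)

# Broadcasting: row is stretched across both rows of a
b = a + row                      # [[10, 21, 32], [13, 24, 35]]

# Vectorized arithmetic: element-wise, no Python loop needed
print((a * 2).sum())             # 30
```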

Notes for MATLAB users:
- * is element-wise multiplication.
- expand_dims your 1D vector to give it an explicit second axis.
- numpy is designed for multidimensional arrays.

Exercise: generate a training set of 4000 randomly selected images (features and labels). Downsample each image and store it as a flattened row (of length 50*50*3). Show image number 124 of such an array using matplotlib.pyplot.
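The MATLAB-related points above can be checked directly in a couple of lines:

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])

# In numpy, * is element-wise; use @ (or np.dot) for the matrix product
print(a * a)   # [[1, 4], [9, 16]]
print(a @ a)   # [[7, 10], [15, 22]]

# A 1D vector has shape (n,), not (n, 1); expand_dims adds the missing axis
v = np.array([1.0, 2.0])
col = np.expand_dims(v, axis=1)
print(v.shape, col.shape)   # (2,) (2, 1)
```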
import dogs_vs_cats
all_files = dogs_vs_cats.image_files()
data file train.zip has already been downloaded
Data has already been extracted
import skimage.transform as sktrans
from scipy import ndimage
import numpy as np
n_images = 4000
index_files_selected = np.random.choice(len(all_files), size=n_images, replace=False)
feature_array = np.ndarray((n_images, 50*50*3), dtype=np.float16)
label = np.ndarray((n_images), dtype=np.uint8)
for i, indice in enumerate(index_files_selected):
    print("Reading image (%d/%d)" % (i, n_images))
    image = ndimage.imread(all_files[indice])
    image_down = sktrans.resize(image, (50, 50, 3))
    feature_array[i, :] = image_down.ravel()  # equivalent to feature_array[i]
    label[i] = "dog" in all_files[indice]
import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(feature_array[124].reshape(50,50,3))
<matplotlib.image.AxesImage at 0x7fa16c03e128>
Count the proportion of dogs/cats in your sampled labels.
prop_cats = np.sum(label == 0)/len(label)
prop_dogs = 1-prop_cats
print("There are {0:.2f}% of cats and {1:.2f}% of dogs".format(prop_cats*100,prop_dogs*100))
There are 50.05% of cats and 49.95% of dogs
Save the training set as a .mat file. Use scipy.io.savemat. Reload the training set with scipy.io.loadmat.
import scipy.io as sio
?sio.savemat
sio.savemat("training_set.mat",
            {"feature_array": feature_array,
             "labels": label})
cargado = sio.loadmat("training_set.mat")
cargado
# feature_array_loaded = cargado["feature_array"]
{'__globals__': [],
'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Mon Oct 10 12:06:11 2016',
'__version__': '1.0',
'feature_array': array([[ 0.22436523, 0.30664062, 0.4831543 , ..., 0.93798828,
0.93798828, 0.93798828],
[ 0.47119141, 0.3996582 , 0.34399414, ..., 0.57519531,
0.56738281, 0.59423828],
[ 0.25463867, 0.13439941, 0.0791626 , ..., 0.81640625,
0.78515625, 0.73876953],
...,
[ 0.53320312, 0.35302734, 0.10980225, ..., 0.66113281,
0.54345703, 0.43359375],
[ 0.4296875 , 0.68115234, 0.83740234, ..., 1. ,
0.98046875, 0.98046875],
[ 0.59228516, 0.65869141, 0.76074219, ..., 0.47802734,
0.43017578, 0.37988281]]),
'labels': array([[1, 0, 1, ..., 0, 0, 0]], dtype=uint8)}
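Note in the output above that labels comes back with shape (1, n) even though it was saved as a 1D array: savemat stores everything as MATLAB matrices, which are at least 2D. A minimal sketch of the round trip and the fix (using a hypothetical temporary demo.mat file):

```python
import os
import tempfile
import numpy as np
import scipy.io as sio

labels = np.array([1, 0, 1, 0], dtype=np.uint8)   # 1D on the way in
path = os.path.join(tempfile.mkdtemp(), "demo.mat")
sio.savemat(path, {"labels": labels})

loaded = sio.loadmat(path)
print(loaded["labels"].shape)           # (1, 4): came back 2D
print(loaded["labels"].ravel().shape)   # (4,): back to 1D
```

Alternatively, sio.loadmat(path, squeeze_me=True) drops the extra axis on load.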
scikit-learn

Fit a model to a set of images. Exercise: train a different classifier: http://scikit-learn.org/stable/supervised_learning.html
import dogs_vs_cats
all_files = dogs_vs_cats.image_files()
train_features, train_labels, test_features, test_labels = dogs_vs_cats.training_test_datasets(
    all_files, 15000, 400, (33, 33, 1))
data file train.zip has already been downloaded
Data has already been extracted
Loading train set
Loading test set
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100,verbose=1,n_jobs=-1)
clf.fit(train_features, train_labels)
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 3.2s [Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 8.1s finished
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
oob_score=False, random_state=None, verbose=1,
warm_start=False)
print("Train score: {}".format(clf.score(train_features,train_labels)))
print("Test score: {}".format(clf.score(test_features, test_labels)))
[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s [Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed: 0.1s finished [Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s [Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed: 0.0s finished
Train score: 1.0 Test score: 0.855
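For the exercise of training a different classifier, here is a minimal sketch with LogisticRegression. It uses scikit-learn's built-in digits dataset so it runs standalone; with the cat/dog arrays you would fit train_features/train_labels in exactly the same way:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Built-in toy dataset standing in for the cat/dog feature arrays
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Test score: {:.3f}".format(clf.score(X_test, y_test)))
```

All scikit-learn classifiers share the same fit/predict/score interface, which is what makes swapping models a one-line change.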
%matplotlib inline
y_proba = clf.predict_proba(test_features)
dogs_vs_cats.plotROC(test_labels,y_proba[:,1])
[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s [Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed: 0.0s finished
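plotROC is a helper defined in dogs_vs_cats.py. A comparable curve can be drawn directly with sklearn.metrics.roc_curve; this is a sketch using tiny hypothetical labels/scores standing in for test_labels and y_proba[:, 1], not the course's exact helper:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Hypothetical ground truth and predicted probabilities
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, _ = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label="AUC = {:.2f}".format(roc_auc))
plt.plot([0, 1], [0, 1], "k--")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
print("AUC: {:.2f}".format(roc_auc))
```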

Edit .py files with Spyder/PyCharm/Eclipse (version control, conda integration, autocomplete, module inspection...).
