{ "cells": [ { "cell_type": "markdown", "id": "758b463d", "metadata": {}, "source": [ "# Machine learning in bioinformatics\n", "\n", "In this chapter we'll begin talking about machine learning algorithms. Machine learning algorithms are used in bioinformatics for tasks where the user would like an algorithm to assist in the identification of patterns in a complex dataset. As is typically the case in this book, we'll work through implementing a few algorithms but these are not the implementations that you should use in practice. The code is written to be accessible for learning. [scikit-learn](http://scikit-learn.org/) is a popular and well-documented Python library for machine learning which many bioinformatics researchers and software developers use in their work. If you'd like to start trying some of these tools out, scikit-learn is a great place to start. \n", "\n", "```{warning}\n", "Machine learning algorithms can easily be misused, either intentionally or unintentionally, to provide misleading results. This chapter will cover some guidelines for how to use these techniques, but it is only intended as a primer to introduce machine learning. It's not a detailed discussion of how machine learning algorithms should and shouldn't be used. If you want to start applying machine learning tools in your own research, I recommend moving from this chapter to the scikit-learn documentation, and their content on [Common pitfalls and recommended practices](https://scikit-learn.org/stable/common_pitfalls.html).\n", "```\n", "\n", "## The feature table\n", "\n", "Machine learning algorithms generally are provided with a table of **samples** and user-defined **features** of those samples. These data are typically represented in a matrix, where samples are the rows and features are the columns. This matrix is referred to as a **feature table**, and it is central to machine learning and many subfields of bioinformatics. The terms used here are purposefully general. 
Samples are intended to be any unit of study, and features are attributes of those samples. Sometimes **labels** or **response variables** will also be associated with the samples, in which case a different class of methods can be applied.\n", "\n", "scikit-learn provides a few example datasets that can be used for learning. Let's start by taking a look at one of them to get an idea of what input might look like in a machine learning task.\n", "\n", "### The Iris dataset\n", "\n", "The [Iris dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-plants-dataset) is a classic example used in machine learning, originally published by R. A. Fisher {cite}`Fisher1936-tk`. This feature table describes four features of 150 specimens of Iris, a genus of flowering plant, representing three species. The feature table follows:" ] }, { "cell_type": "code", "execution_count": 1, "id": "2dc9e362", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [], "source": [ "# This cell loads data from scikit-learn and organizes it into some structures that\n", "# we'll use to conveniently view the data.\n", "\n", "import sklearn.datasets\n", "import pandas as pd\n", "\n", "iris_dataset = sklearn.datasets.load_iris(as_frame=True)\n", "iris_feature_table = iris_dataset.frame.drop('target', axis=1)\n", "iris_feature_table.index.name = 'sample-id'\n", "# map target integers onto species names\n", "iris_labels = pd.Series(iris_dataset.target_names[iris_dataset.target], \n", " index=iris_dataset.target.index, name='species').to_frame()\n", "iris_labels.index.name = 'sample-id'" ] }, { "cell_type": "code", "execution_count": 2, "id": "203ee7d4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "

| sample-id | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 |

150 rows × 4 columns

| sample-id | species |
|---|---|
| 0 | setosa |
| 1 | setosa |
| 2 | setosa |
| 3 | setosa |
| 4 | setosa |
| ... | ... |
| 145 | virginica |
| 146 | virginica |
| 147 | virginica |
| 148 | virginica |
| 149 | virginica |

150 rows × 1 columns
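
The feature table and the label table are connected only through their shared `sample-id` index. As a minimal sketch of that relationship, the following uses plain Python and only the ten rows displayed above (not the full 150-sample dataset) to pair each sample's features with its label and summarize one feature per species. The variable names are illustrative and are not part of the notebook.

```python
from collections import defaultdict
from statistics import mean

# Rows copied from the displayed excerpt of the feature table:
# sample-id -> (sepal length, sepal width, petal length, petal width), in cm.
features = {
    0: (5.1, 3.5, 1.4, 0.2),
    1: (4.9, 3.0, 1.4, 0.2),
    2: (4.7, 3.2, 1.3, 0.2),
    3: (4.6, 3.1, 1.5, 0.2),
    4: (5.0, 3.6, 1.4, 0.2),
    145: (6.7, 3.0, 5.2, 2.3),
    146: (6.3, 2.5, 5.0, 1.9),
    147: (6.5, 3.0, 5.2, 2.0),
    148: (6.2, 3.4, 5.4, 2.3),
    149: (5.9, 3.0, 5.1, 1.8),
}

# Labels copied from the displayed excerpt of the label table.
labels = {
    0: "setosa", 1: "setosa", 2: "setosa", 3: "setosa", 4: "setosa",
    145: "virginica", 146: "virginica", 147: "virginica",
    148: "virginica", 149: "virginica",
}

# Both tables must describe exactly the same samples.
assert features.keys() == labels.keys()

# Group one feature (petal length, column index 2) by species label.
PETAL_LENGTH = 2
petal_lengths = defaultdict(list)
for sample_id, row in features.items():
    petal_lengths[labels[sample_id]].append(row[PETAL_LENGTH])

mean_petal_length = {
    species: mean(values) for species, values in petal_lengths.items()
}
print(mean_petal_length)
```

Even in this small excerpt, petal length separates the two displayed species clearly, which is the kind of pattern a method that uses labels can exploit.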

| id | GATG | ATGA | TGAA | GAAC | AACG | ACGC | CGCT | GCTA | CTAG | TAGC | ... | CCCT | GCTC | GCGA | TTGC | CGCC | TGTG | CTCT | TTCC | GTTC | ATTC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1020921 | 4 | 3 | 2 | 2 | 4 | 2 | 1 | 3 | 1 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1111241 | 5 | 2 | 2 | 1 | 3 | 1 | 0 | 3 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 241971 | 6 | 5 | 4 | 3 | 4 | 2 | 1 | 4 | 1 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 970921 | 3 | 4 | 2 | 2 | 3 | 2 | 4 | 5 | 3 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 867450 | 5 | 3 | 2 | 1 | 4 | 2 | 1 | 4 | 2 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4226754 | 3 | 1 | 1 | 3 | 4 | 2 | 2 | 2 | 0 | 1 | ... | 2 | 2 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 |
| 403853 | 3 | 1 | 0 | 1 | 2 | 1 | 1 | 2 | 0 | 1 | ... | 2 | 3 | 2 | 1 | 2 | 2 | 0 | 0 | 0 | 0 |
| 862869 | 2 | 1 | 1 | 4 | 4 | 2 | 4 | 2 | 0 | 1 | ... | 2 | 2 | 2 | 3 | 4 | 5 | 0 | 0 | 0 | 0 |
| 4121939 | 3 | 1 | 0 | 3 | 4 | 2 | 2 | 2 | 0 | 1 | ... | 2 | 2 | 2 | 1 | 2 | 2 | 0 | 0 | 0 | 0 |
| 1028571 | 3 | 1 | 0 | 4 | 3 | 3 | 2 | 1 | 0 | 1 | ... | 3 | 1 | 2 | 1 | 4 | 2 | 0 | 0 | 0 | 0 |
| 535359 | 4 | 2 | 2 | 2 | 3 | 1 | 0 | 1 | 0 | 2 | ... | 1 | 0 | 2 | 2 | 0 | 2 | 1 | 2 | 1 | 1 |
| 981912 | 4 | 1 | 3 | 3 | 5 | 2 | 1 | 3 | 1 | 3 | ... | 1 | 0 | 0 | 3 | 0 | 1 | 0 | 2 | 1 | 1 |

12 rows × 256 columns
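
The column labels in the table above look like overlapping 4-mers: the first ten (GATG, ATGA, TGAA, ...) chain together into GATGAACGCTAGC, and 256 columns matches the 4^4 possible DNA words of length four. Assuming each cell holds the number of times that 4-mer occurs in the sequence with the given id, a table of this shape could be built roughly as sketched below. The function names, sequence ids, and sequences are made up for illustration and are not the notebook's code or data. Columns here are in alphabetical order, which may differ from the ordering shown above.

```python
from collections import Counter
from itertools import product


def count_kmers(sequence, k=4):
    """Count overlapping k-mers (words of length k) in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_feature_table(sequences, k=4, alphabet="ACGT"):
    """Build a k-mer count table from a dict of {sequence id: sequence}.

    Returns (columns, rows), where columns lists every possible k-mer over
    the alphabet and rows maps each id to a list of counts in column order.
    k-mers containing characters outside the alphabet (for example N) have
    no column and are therefore ignored.
    """
    columns = ["".join(letters) for letters in product(alphabet, repeat=k)]
    rows = {}
    for seq_id, sequence in sequences.items():
        counts = count_kmers(sequence, k)
        rows[seq_id] = [counts[kmer] for kmer in columns]
    return columns, rows


# Hypothetical toy input: samples are sequences, features are 4-mers.
toy_sequences = {
    "seq-1": "GATGAACGCTAGC",
    "seq-2": "GATGATGATG",
}
columns, table = kmer_feature_table(toy_sequences)

print(len(columns))                           # number of feature columns
print(table["seq-2"][columns.index("GATG")])  # count of GATG in seq-2
```

Rows are samples and columns are features, exactly as in the Iris table; only the meaning of a sample (a sequence) and a feature (a k-mer count) has changed.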

| id | domain | phylum | class | order | family | genus | species | legend entry |
|---|---|---|---|---|---|---|---|---|
| 1020921 | k__Bacteria | p__Bacteroidetes | c__Flavobacteriia | o__Flavobacteriales | f__Flavobacteriaceae | g__Flavobacterium | s__succinicans | Flavobacterium succinicans (Bacteroidetes) |
| 1111241 | k__Bacteria | p__Bacteroidetes | c__Flavobacteriia | o__Flavobacteriales | f__Flavobacteriaceae | g__Flavobacterium | s__succinicans | Flavobacterium succinicans (Bacteroidetes) |
| 241971 | k__Bacteria | p__Bacteroidetes | c__Flavobacteriia | o__Flavobacteriales | f__Flavobacteriaceae | g__Flavobacterium | s__succinicans | Flavobacterium succinicans (Bacteroidetes) |
| 970921 | k__Bacteria | p__Bacteroidetes | c__Flavobacteriia | o__Flavobacteriales | f__Flavobacteriaceae | g__Flavobacterium | s__succinicans | Flavobacterium succinicans (Bacteroidetes) |
| 867450 | k__Bacteria | p__Bacteroidetes | c__Flavobacteriia | o__Flavobacteriales | f__Flavobacteriaceae | g__Flavobacterium | s__succinicans | Flavobacterium succinicans (Bacteroidetes) |
| 4226754 | k__Bacteria | p__Actinobacteria | c__Actinobacteria | o__Actinomycetales | f__Propionibacteriaceae | g__Propionibacterium | s__acnes | Propionibacterium acnes (Actinobacteria) |
| 403853 | k__Bacteria | p__Actinobacteria | c__Actinobacteria | o__Actinomycetales | f__Propionibacteriaceae | g__Propionibacterium | s__acnes | Propionibacterium acnes (Actinobacteria) |
| 862869 | k__Bacteria | p__Actinobacteria | c__Actinobacteria | o__Actinomycetales | f__Propionibacteriaceae | g__Propionibacterium | s__acnes | Propionibacterium acnes (Actinobacteria) |
| 4121939 | k__Bacteria | p__Actinobacteria | c__Actinobacteria | o__Actinomycetales | f__Propionibacteriaceae | g__Propionibacterium | s__acnes | Propionibacterium acnes (Actinobacteria) |
| 1028571 | k__Bacteria | p__Actinobacteria | c__Actinobacteria | o__Actinomycetales | f__Propionibacteriaceae | g__Propionibacterium | s__acnes | Propionibacterium acnes (Actinobacteria) |
| 535359 | k__Bacteria | p__Bacteroidetes | c__Bacteroidia | o__Bacteroidales | f__Prevotellaceae | g__Prevotella | s__melaninogenica | Prevotella melaninogenica (Bacteroidetes) |
| 981912 | k__Bacteria | p__Bacteroidetes | c__Bacteroidia | o__Bacteroidales | f__Prevotellaceae | g__Prevotella | s__melaninogenica | Prevotella melaninogenica (Bacteroidetes) |
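
In the table above, the `legend entry` column appears to be derived from the rank-prefixed `genus`, `species`, and `phylum` columns, in the form "Genus species (Phylum)". As a hedged sketch of that derivation (the helper names are hypothetical, and the notebook may construct the column differently), the rank prefixes such as `g__` can be stripped and the pieces combined:

```python
def strip_rank_prefix(name):
    """Remove a leading rank prefix such as 'g__' or 'p__', if present."""
    _prefix, separator, rest = name.partition("__")
    return rest if separator else name


def legend_entry(record):
    """Format one taxonomy record as 'Genus species (Phylum)'."""
    genus = strip_rank_prefix(record["genus"])
    species = strip_rank_prefix(record["species"])
    phylum = strip_rank_prefix(record["phylum"])
    return f"{genus} {species} ({phylum})"


# One row of the table above (id 4226754), written as a dict.
record = {
    "domain": "k__Bacteria",
    "phylum": "p__Actinobacteria",
    "class": "c__Actinobacteria",
    "order": "o__Actinomycetales",
    "family": "f__Propionibacteriaceae",
    "genus": "g__Propionibacterium",
    "species": "s__acnes",
}
print(legend_entry(record))
```

These per-sequence annotations play the same role as the species column in the Iris example: they are labels attached to samples through a shared id, separate from the feature table itself.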