{ "cells": [ { "cell_type": "markdown", "id": "3880155a", "metadata": {}, "source": [ "# 2 - Experimental quality assessment\n", "\n", "**Description:** This notebook contains an example of how to check for outliers in numerosity. \n", "\n", "Removing these outliers is recommended (it is the default for the other notebooks).\n", "\n", "In this notebook we first check whether the distribution of numerosities across an experiment (normally tied to the same chromosome / genomic region) is homogeneous or whether some of the traces stand out as outliers. \n", "\n", "We do this in two steps.\n", "* __Kruskal-Wallis test__: the aim of this test is to determine whether there are possible outliers at all. It is an all-vs-all test and gives only this information; to find out which traces are the outliers, we perform a second test.\n", "* __Kolmogorov-Smirnov test__: the aim of this test is to identify whether a particular trace is an outlier with respect to the background distribution of numerosities. This background distribution is built by pooling the numerosities from all the experiments for that region. This test is one-vs-all.\n", "\n", "The following index has links to the different sections of the notebook. Some sections may have additional indexes to access their subparts. \n", "If you are using VS Code to view the notebook, the links may not work. This is due to how VS Code renders the notebook. To navigate the notebook, use the outline panel. 
To know more about it, check this [link](https://code.visualstudio.com/docs/getstarted/userinterface#_outline-view).\n", "\n", "Content:\n", "\n", "- [Library imports](#library-imports-and-functions)\n", "- [Setting some variables](#setting-some-variables)\n", "- [Exporting the data and running the tests](#exporting-the-data-and-running-the-tests)\n", "- [Getting a file per chromosome with the traces that pass the assessment](#getting-a-file-per-chromosome-with-the-traces-that-pass-the-assessment)\n", "- [Violin plots of the different distributions of the traces against the background](#violin-plots-of-the-different-distributions-of-the-traces-against-the-background)\n", "- [Plotting efficiency matrices](#plotting-efficiency-matrices)\n", "- [Anisotropy factor](#anisotropy-factor)\n", "\n", "Next notebook to run is: \n", "[3D Reconstruction Assessment](3_3D_Reconstruction_Assessment.ipynb)" ] }, { "cell_type": "markdown", "id": "2de88e5e", "metadata": {}, "source": [ "## Library imports and functions" ] }, { "cell_type": "markdown", "id": "0e489a66", "metadata": {}, "source": [ "[Back to the general index](#experimental-quality-assessment)" ] }, { "cell_type": "code", "execution_count": 1, "id": "e2585fb6", "metadata": {}, "outputs": [], "source": [ "import re\n", "from pathlib import Path\n", "from collections import defaultdict\n", "\n", "import pandas as pd\n", "import numpy as np\n", "from cima.parsers.parser_csv import CSVParser\n", "from scipy.stats import ks_2samp, kruskal\n", "\n", "import matplotlib.pyplot as plt\n", "import matplotlib.patheffects as pe\n", "import seaborn as sns\n", "import matplotlib as mpl\n", "from matplotlib.collections import PolyCollection\n", "\n", "## Regular expressions to get the metadata from the filename\n", "regex_patterns = {\n", " \"nucleusID\": re.compile(r\"(?i)nuc(\\d+)\"), # This searches for \"Nuc\" or \"nuc\" followed by digits\n", " \"cellID\": re.compile(r\"(?i)cell(\\d+)\"), # This searches for \"Cell\" or \"cell\" 
followed by digits\n", " \"locationID\" : re.compile(r\"(?i)loc-?(\\d+)\"), # This searches for \"Loc\" or \"loc\" followed by optional hyphen and digits\n", " \"date\" : re.compile(r\"(\\d{4}[-\\.]\\d{2}[-\\.]\\d{2})\"), # This searches for dates in the format YYYY-MM-DD or YYYY.MM.DD\n", " \"homolog\" : re.compile(r\"_([ABPM01pm])\"), ## Added just in case the homolog is in the filename.\n", " \"chromosome\" : re.compile(r\"(?i)_(chr(?:\\d+|[MXY]))\") ## Added just in case the chromosome is in the filename. Matches \"chr\" followed by a number or M, X, Y.\n", " }\n", "\n", "# Set the options for saving the plots\n", "plot_opts = {\"dpi\": 300, \"bbox_inches\": 'tight', \"transparent\": True}" ] }, { "cell_type": "code", "execution_count": 2, "id": "c7aaa2ec", "metadata": {}, "outputs": [], "source": [ "def key_explicit(s):\n", " # Sort key for names of the form \"<date>_cell<N>_<suffix>\": order by date, then numeric cell ID, then suffix\n", " date, rest = s.split(\"_cell\")\n", " cell_num, suffix = rest.split(\"_\")\n", " return (date, int(cell_num), suffix)" ] }, { "cell_type": "markdown", "id": "c6641e11", "metadata": {}, "source": [ "## Setting some variables" ] }, { "cell_type": "markdown", "id": "e1b7fbb2", "metadata": {}, "source": [ "In the following cell, set the variables that the notebook needs to run properly. Adjust them as needed, and then press \"Run all\" in the Jupyter Notebook. \n", "\n", "This notebook assumes that your CSV files are in SRX format.\n", "\n", "Brief explanation of the different variables:\n", "