{ "cells": [ { "cell_type": "markdown", "id": "3880155a", "metadata": {}, "source": [ "# 2 - Experimental quality assessment\n", "\n", "**Description:** This notebook contains an example on how to check for outliers regarding numerosity. \n", "\n", "Removing this outliers is recommended (is the default for the other notebooks).\n", "\n", "In this notebooks we first check if the distribution of numerosities across an experiment (normally tied to the same chromosome / genomic region) is homogeneous or some of the traces stand out as outliers. \n", "\n", "We do this in two steps.\n", "* __Kruskal-Wallis test__: the aim of this test is to identify if we have possible outliers. This tests only gives this information. To know which are the outliers we perform a second test. This test is an all-vs-all.\n", "* __Kolmogorov-Smirnov test__: the aim of this test is to identify if a particular trace is an outlier in respect to the background distribution of numerosities. This background distribution is build by adding up all the numerosities from all the experiments for that region. This test is a one-vs-all.\n", "\n", "The following index has links to the different sections of the notebook. Some sections may have additional indexes to access the subparts. \n", "If you are using VS Code to visualize the notebook, the links may not work. This is due to how VC Code render the notebook. To navigate the notebook use the outline panel. To know more about it, check this [link](https://code.visualstudio.com/docs/getstarted/userinterface#_outline-view).\n", "\n", "Content:\n", "\n", "- [Library imports](#library-imports-and-functions)\n", "- [Setting some variables](#setting-some-variables)\n", "- [Exporting the data and running the tests](#exporting-the-data-and-running-the-tests)\n", "- [Getting a file per chromosome with the traces that pass the assessment](#getting-a-file-per-chromosome-with-the-traces-that-pass-the-assessment)\n", "- [Violin plots of the different distributions of the traces against the background](#violin-plots-of-the-different-distributions-of-the-traces-against-the-background)\n", "- [Plotting efficiency matrices](#plotting-efficiency-matrices)\n", "- [Anisotropy factor](#anisotropy-factor)\n", "\n", "Next notebook to run is: \n", "[3D Reconstruction Assessment](3_3D_Reconstruction_Assessment.ipynb)" ] }, { "cell_type": "markdown", "id": "2de88e5e", "metadata": {}, "source": [ "## Library imports and functions" ] }, { "cell_type": "markdown", "id": "0e489a66", "metadata": {}, "source": [ "[Back to the general index](#experimental-quality-assessment)" ] }, { "cell_type": "code", "execution_count": 1, "id": "e2585fb6", "metadata": {}, "outputs": [], "source": [ "import re\n", "from pathlib import Path\n", "from collections import defaultdict\n", "\n", "import pandas as pd\n", "import numpy as np\n", "from cima.parsers.parser_csv import CSVParser\n", "from scipy.stats import ks_2samp, kruskal\n", "\n", "import matplotlib.pyplot as plt\n", "import matplotlib.patheffects as pe\n", "import seaborn as sns\n", "import matplotlib as mpl\n", "from matplotlib.collections import PolyCollection\n", "\n", "## Regular expressions to get the metadata from the filename\n", "regex_patterns = {\n", " \"nucleusID\": re.compile(r\"(?i)nuc(\\d+)\"), # This searches for \"Nuc\" or \"nuc\" followed by digits\n", " \"cellID\": re.compile(r\"(?i)cell(\\d+)\"), # This searches for \"Cell\" or \"cell\" followed by digits\n", " \"locationID\" : re.compile(r\"(?i)loc-?(\\d+)\"), # This searches for \"Loc\" or \"loc\" followed by optional hyphen and digits\n", " \"date\" : re.compile(r\"(\\d{4}[-\\.]\\d{2}[-\\.]\\d{2})\"), # This searches for dates in the format YYYY-MM-DD or YYYY.MM.DD\n", " \"homolog\" : re.compile(r\"_([ABPM01pm])\"), ## Added just in case the homolog is in the filename.\n", " \"chromosome\" : re.compile(r\"(?i)_(chr[\\d+MXY])\") ## Added just in case the chromosome is in the filename.\n", " }\n", "\n", "# Set the options for saving the plots\n", "plot_opts = {\"dpi\": 300, \"bbox_inches\": 'tight', \"transparent\": True}" ] }, { "cell_type": "code", "execution_count": 2, "id": "c7aaa2ec", "metadata": {}, "outputs": [], "source": [ "def key_explicit(s):\n", " date, rest = s.split(\"_cell\")\n", " cell_num, suffix = rest.split(\"_\")\n", " return (date, int(cell_num), suffix)" ] }, { "cell_type": "markdown", "id": "c6641e11", "metadata": {}, "source": [ "## Setting some variables" ] }, { "cell_type": "markdown", "id": "e1b7fbb2", "metadata": {}, "source": [ "In the following cell you should set up some variables for the script to run properly. Change the variables at your convenience, and then you can press \"Run all\" in the Jupyter Notebook. \n", "\n", "This notebook assumes that your csv files are in SRX format.\n", "\n", "Brief explanation of the different variables:\n", "\n", "\n", "The notebook will create a folder in the work_folder path called `2_experimental_quality_assessment` where the plots/files created will be stored.\n", "\n", "For the purposes of this notebook, save_plots is set to False to show the plots. By doing so only plots for chr3 will be shown. If you want to obtain the plots for the other chromosomes, run this notebook with save_plots set to True and check the folder.\n", "\n", "[Back to the general index](#experimental-quality-assessment)" ] }, { "cell_type": "code", "execution_count": 3, "id": "81202e1e", "metadata": {}, "outputs": [], "source": [ "data_folder = \"/scratch/CIMA_tutorial/data/\"\n", "work_folder = \"/scratch/CIMA_tutorial/\"\n", "\n", "number_of_steps_library = 16\n", "starting_step = {\"chr3\" : 6, \"chr5\" : 5, \"chr8\" : 4,\"chr16\" : 2, \"chrX\" : 3}\n", "\n", "kw_test_thresholds = 0.01\n", "\n", "save_plots = True\n", "\n", "experimental_quality_assessment_folder = Path(work_folder, \"2_experimental_quality_assessment\")\n", "experimental_quality_assessment_folder.mkdir(parents=True, exist_ok=True)\n", "\n", "nice_feature_names = {\n", " \"numerosity\": \"Numerosity\",\n", " \"precX\": \"X-axis precision\",\n", " \"precY\": \"Y-axis precision\",\n", " \"precZ\": \"Z-axis precision\",}" ] }, { "cell_type": "markdown", "id": "dffbbd88", "metadata": {}, "source": [ "## Exporting the data and running the tests\n", "\n", "We tests four different things:\n", "\n", "\n", "\n", "Then we will do violin plots to show the data of each trace against the \"background distribution\". We create this \"background distribution\" by aggregating the data of the different traces. \n", "\n", "[Back to the general index](#experimental-quality-assessment)" ] }, { "cell_type": "code", "execution_count": 4, "id": "5628c3ff", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Processing CSV files in folder: /scratch/CIMA_tutorial/data/chr3\n", "Total number of files found: 33\n", "Excluding timepoint 19 in experimentID: 2018-09-04, nucleusID: 7, homolog: A because it has less than 10 points.\n", "Processing file 33: 2018-09-04_nuc06_A_chr3\n", "Total number of objects created: 33\n", "\n", "------------------------------------------------------\n", "\n", "Processing CSV files in folder: /scratch/CIMA_tutorial/data/chr5\n", "Total number of files found: 35\n", "Processing file 35: 2018-09-04_nuc04_A_chr5\n", "Total number of objects created: 35\n", "\n", "------------------------------------------------------\n", "\n", "Processing CSV files in folder: /scratch/CIMA_tutorial/data/chr8\n", "Total number of files found: 22\n", "Processing file 22: 2018-09-04_nuc01_B_chr8\n", "Total number of objects created: 22\n", "\n", "------------------------------------------------------\n", "\n", "Processing CSV files in folder: /scratch/CIMA_tutorial/data/chr16\n", "Total number of files found: 10\n", "Processing file 10: 2018-09-04_nuc04_A_chr16\n", "Total number of objects created: 10\n", "\n", "------------------------------------------------------\n", "\n", "Processing CSV files in folder: /scratch/CIMA_tutorial/data/chrX\n", "Total number of files found: 5\n", "Processing file 5: 2018-09-04_nuc01_A_chrX\n", "Total number of objects created: 5\n", "\n", "------------------------------------------------------\n", "\n" ] } ], "source": [ "kw_test_data = defaultdict(list)\n", "\n", "violin_plot_data = {}\n", "ks_test_data = {}\n", "\n", "for chrom in starting_step.keys():\n", "\n", " violin_plot_chr = defaultdict(list)\n", "\n", " data_folder_chr = Path(data_folder, chrom)\n", "\n", " print(f\"Processing CSV files in folder: {data_folder_chr}\")\n", "\n", " # Get a list of all CSV files in the specified data folder\n", " csv_files = list(data_folder_chr.glob('*.csv'))\n", " if len(csv_files) == 0:\n", " print(\"No files found, please input the correct work_folder\")\n", " else:\n", " print(f\"Total number of files found: {len(csv_files)}\")\n", "\n", " # Get a list of metadata keys to extract from filenames\n", " metadata_keys = [\"nucleusID\", \"cellID\", \"locationID\", \"date\", \"homolog\", \"chromosome\"]\n", "\n", " # Initialize an empty list to store the StructuralObject instances\n", " obj_list = []\n", "\n", " # Initialize lists to store numerosity data for KS test and violin plot\n", " features_total = defaultdict(list)\n", "\n", " # Initialize a dictionary to store numerosities by file for the KS test\n", " features_by_file = defaultdict(lambda: defaultdict(list))\n", "\n", " if len(csv_files) > 0:\n", " # Iterate over each file in the list of file names\n", " for count, filein in enumerate(csv_files, start=1):\n", "\n", " stem = Path(filein).stem\n", "\n", " print(f\"Processing file {count}: {stem}\", end=\"\\r\")\n", "\n", " # We mine the metadata from the filename using regex patterns. If not possible, assign default values.\n", " metadata_CIMA = {\n", " key: (match.group(1) if (match := regex_patterns[key].search(stem)) else default)\n", " for key, default in zip(metadata_keys, [count, count, count, stem, \"A\", \"chrN\"])\n", " }\n", "\n", " # Adjust the metadata for nucleus and cell (if nucleus is count, but cell is something, use the cell match in nucleus)\n", " if \"cell\" in metadata_CIMA and metadata_CIMA[\"nucleusID\"] == count:\n", " metadata_CIMA[\"nucleusID\"] = metadata_CIMA[\"cell\"]\n", "\n", " metadata_CIMA[\"nucleusID\"] = int(metadata_CIMA[\"nucleusID\"])\n", " metadata_CIMA[\"cellID\"] = int(metadata_CIMA[\"cellID\"])\n", " metadata_CIMA[\"locationID\"] = int(metadata_CIMA[\"locationID\"])\n", "\n", " # Set experimentID as date for consistency\n", " metadata_CIMA[\"experimentID\"] = metadata_CIMA[\"date\"]\n", "\n", " # Read the CSV file and create a StructuralObject instance\n", " objin = CSVParser.read_CSV_file(filein.as_posix(), metadata = metadata_CIMA, content_type = \"srx\")\n", "\n", " objin.atomList = objin.atomList[(objin.atomList['timepoint'] >= starting_step[chrom] - 1) &\n", " (objin.atomList['timepoint'] < (starting_step[chrom] + number_of_steps_library - 1))].copy()\n", " \n", " for tp in objin.atomList[\"timepoint\"].unique():\n", " # Take out the timepoint from the atomList if it has less tha 10 points\n", " tp_mask = objin.atomList[\"timepoint\"] == tp\n", " if tp_mask.sum() < 10:\n", " print(f\"Excluding timepoint {tp} in experimentID: {metadata_CIMA['experimentID']}, nucleusID: {metadata_CIMA['nucleusID']}, homolog: {metadata_CIMA['homolog']} because it has less than 10 points.\")\n", " objin.atomList = objin.atomList.loc[~tp_mask].copy()\n", " \n", " # We store the numerosities for each timepoint in the file for the KW/KS tests.\n", " for tp in objin.atomList.timepoint.unique():\n", " temp = objin.get_Time_segment(timepoint=tp)\n", " features_total[\"numerosity\"].append(temp.atomList.shape[0])\n", " features_total[\"precX\"].append(np.nanmedian(temp.atomList.xprec))\n", " features_total[\"precY\"].append(np.nanmedian(temp.atomList.yprec))\n", " features_total[\"precZ\"].append(np.nanmedian(temp.atomList.zprec))\n", " features_by_file[stem][\"numerosity\"].append(temp.atomList.shape[0])\n", " features_by_file[stem][\"precX\"].append(np.nanmedian(temp.atomList.xprec))\n", " features_by_file[stem][\"precY\"].append(np.nanmedian(temp.atomList.yprec))\n", " features_by_file[stem][\"precZ\"].append(np.nanmedian(temp.atomList.zprec))\n", " \n", " # Append the object to the list\n", " obj_list.append(objin)\n", "\n", " # Nice print to tell the user the number of objects created.\n", " print(f\"\\nTotal number of objects created: {len(obj_list)}\")\n", "\n", " # Clean up memory by deleting some of the variables no longer needed\n", " del count, filein, stem, metadata_CIMA, match\n", "\n", " if 'default' in locals():\n", " del default\n", " if 'key' in locals():\n", " del key\n", "\n", " del csv_files, metadata_keys\n", "\n", " ks_test_summary = defaultdict(list)\n", "\n", " for objin in obj_list:\n", "\n", " repID = f\"{objin.metadata['experimentID']}_cell{objin.metadata['nucleusID']}_{objin.metadata['homolog']}\"\n", "\n", " file_features = features_by_file[Path(objin.metadata[\"filename\"]).stem]\n", "\n", " ks_test_summary[\"filename\"].append(Path(objin.metadata[\"filename\"]).stem)\n", "\n", " for feature in [\"numerosity\", \"precX\", \"precY\", \"precZ\"]:\n", " ks_test = ks_2samp(file_features[feature], features_total[feature])\n", " ks_test_summary[f\"statistic_{feature}\"].append(ks_test.statistic)\n", " ks_test_summary[f\"pvalue_{feature}\"].append(ks_test.pvalue)\n", " \n", " ks_test_summary[\"n_total\"].append(len(features_total[\"numerosity\"]))\n", "\n", " violin_plot_chr[\"filename\"].extend([Path(objin.metadata[\"filename\"]).stem] * len(file_features[\"precX\"]))\n", " violin_plot_chr[\"expID\"].extend([objin.metadata[\"experimentID\"]] * len(file_features[\"precX\"]))\n", " violin_plot_chr[\"repID\"].extend([repID] * len(file_features[\"precX\"]))\n", " violin_plot_chr[\"numerosity\"].extend(file_features[\"numerosity\"])\n", " violin_plot_chr[\"precX\"].extend(file_features[\"precX\"])\n", " violin_plot_chr[\"precY\"].extend(file_features[\"precY\"])\n", " violin_plot_chr[\"precZ\"].extend(file_features[\"precZ\"])\n", "\n", " violin_plot_data[chrom] = pd.DataFrame(violin_plot_chr)\n", " ks_test_data[chrom] = pd.DataFrame(ks_test_summary)\n", "\n", " kw_test_data[\"chromosome\"].append(chrom)\n", "\n", " for feature in [\"numerosity\", \"precX\", \"precY\", \"precZ\"]:\n", " kw_test = kruskal(*[features_by_file[fn][feature] for fn in features_by_file])\n", " kw_test_data[f\"statistic_{feature}\"].append(kw_test.statistic)\n", " kw_test_data[f\"pvalue_{feature}\"].append(kw_test.pvalue)\n", "\n", " del feature, features_by_file, file_features, ks_test, ks_test_summary, kw_test, repID, temp, tp, violin_plot_chr\n", " del objin, features_total\n", "\n", " print(f\"\\n{'-'*54}\\n\")\n", "\n", "kw_test_data = pd.DataFrame(kw_test_data)\n", "\n", "del chrom, data_folder_chr, obj_list" ] }, { "cell_type": "markdown", "id": "39f7f8a3", "metadata": {}, "source": [ "## Violin plots of the different distributions of the traces against the background" ] }, { "cell_type": "markdown", "id": "3e3cb07b", "metadata": {}, "source": [ "Here we create the mentioned violin plots.\n", "\n", "The trace is on the right, colored by date. The background is on the right, colored in grey." ] }, { "cell_type": "code", "execution_count": 5, "id": "04a2215f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Creating violin plot for chromosome chr3 and feature: Numerosity\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr3/violin_plot_numerosity_chr3.pdf\n", "Creating violin plot for chromosome chr3 and feature: X-axis precision\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr3/violin_plot_precX_chr3.pdf\n", "Creating violin plot for chromosome chr3 and feature: Y-axis precision\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr3/violin_plot_precY_chr3.pdf\n", "Creating violin plot for chromosome chr3 and feature: Z-axis precision\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr3/violin_plot_precZ_chr3.pdf\n", "\n", "------------------------------------------------------\n", "\n", "Creating violin plot for chromosome chr5 and feature: Numerosity\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr5/violin_plot_numerosity_chr5.pdf\n", "Creating violin plot for chromosome chr5 and feature: X-axis precision\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr5/violin_plot_precX_chr5.pdf\n", "Creating violin plot for chromosome chr5 and feature: Y-axis precision\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr5/violin_plot_precY_chr5.pdf\n", "Creating violin plot for chromosome chr5 and feature: Z-axis precision\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr5/violin_plot_precZ_chr5.pdf\n", "\n", "------------------------------------------------------\n", "\n", "Creating violin plot for chromosome chr8 and feature: Numerosity\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr8/violin_plot_numerosity_chr8.pdf\n", "Creating violin plot for chromosome chr8 and feature: X-axis precision\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr8/violin_plot_precX_chr8.pdf\n", "Creating violin plot for chromosome chr8 and feature: Y-axis precision\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr8/violin_plot_precY_chr8.pdf\n", "Creating violin plot for chromosome chr8 and feature: Z-axis precision\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr8/violin_plot_precZ_chr8.pdf\n", "\n", "------------------------------------------------------\n", "\n", "Creating violin plot for chromosome chr16 and feature: Numerosity\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr16/violin_plot_numerosity_chr16.pdf\n", "Creating violin plot for chromosome chr16 and feature: X-axis precision\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr16/violin_plot_precX_chr16.pdf\n", "Creating violin plot for chromosome chr16 and feature: Y-axis precision\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr16/violin_plot_precY_chr16.pdf\n", "Creating violin plot for chromosome chr16 and feature: Z-axis precision\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr16/violin_plot_precZ_chr16.pdf\n", "\n", "------------------------------------------------------\n", "\n", "Creating violin plot for chromosome chrX and feature: Numerosity\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chrX/violin_plot_numerosity_chrX.pdf\n", "Creating violin plot for chromosome chrX and feature: X-axis precision\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chrX/violin_plot_precX_chrX.pdf\n", "Creating violin plot for chromosome chrX and feature: Y-axis precision\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chrX/violin_plot_precY_chrX.pdf\n", "Creating violin plot for chromosome chrX and feature: Z-axis precision\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chrX/violin_plot_precZ_chrX.pdf\n", "\n", "------------------------------------------------------\n", "\n" ] } ], "source": [ "color_palette = {'2018-01-11':\"#a6cee3\" ,'2018-06-28':'#1f78b4', '2018-07-10':'#b2df8a', '2018-09-04':'#33a02c'}\n", "default_trace_color = \"skyblue\"\n", "bg_color = \"lightgray\"\n", "\n", "for chrom in violin_plot_data.keys():\n", "\n", " if not save_plots and chrom != \"chr3\":\n", " continue\n", "\n", " for feature in [\"numerosity\", \"precX\", \"precY\", \"precZ\"]:\n", "\n", " print(f\"Creating violin plot for chromosome {chrom} and feature: {nice_feature_names[feature]}\")\n", "\n", " violin_plot_chr = violin_plot_data[chrom]\n", " ks_test_chr = ks_test_data[chrom]\n", " numerosity_total_chr = violin_plot_chr[feature].to_numpy()\n", "\n", " groups = sorted(violin_plot_chr[\"repID\"].unique(), key=key_explicit)\n", "\n", " rows = []\n", "\n", " for gr in groups:\n", " cond_vals = violin_plot_chr.loc[violin_plot_chr[\"repID\"] == gr, feature].to_numpy()\n", " rows.append(pd.DataFrame({\n", " \"Group\": f\"{gr}\",\n", " \"Side\": \"Trace\",\n", " \"Value\": cond_vals\n", " }))\n", " rows.append(pd.DataFrame({\n", " \"Group\": f\"{gr}\",\n", " \"Side\": \"Background\",\n", " \"Value\": numerosity_total_chr\n", " }))\n", "\n", " plot_df = pd.concat(rows, ignore_index=True)\n", " plot_df[\"Pair\"] = \"Pair\" \n", "\n", " # For vertical set x to Side and y to Value\n", " # For horizontal set x to Value and y to Side\n", " g = sns.catplot(\n", " data = plot_df,\n", " x = \"Side\", # Side will put a space in the middle, Pair will take it out\n", " y = \"Value\",\n", " hue = \"Side\", # Side will determine the color (Trace vs Background)\n", " col = \"Group\", # Group will determine the facetting (one facet per file)\n", " col_wrap = 7, # N violins per row\n", " kind = \"violin\", # Type of estimation to use\n", " split = True, # Split violins to show Trace vs Background in the same violin\n", " inner = \"quart\", # Show the quartiles inside the violins\n", " linewidth = 1, height = 2.4, aspect = 0.9,\n", " sharey = True, # Share the y-axis across facets to make them comparable\n", " hue_order = [\"Trace\", \"Background\"],\n", " palette = {\"Trace\": default_trace_color, \"Background\": bg_color},\n", " )\n", "\n", " g.set_titles(\"\")\n", " g.set_ylabels(nice_feature_names[feature])\n", " g.set_xlabels(\"\")\n", "\n", " col2check = \"Group\" if \"Group\" in violin_plot_chr.columns else \"repID\"\n", "\n", " # Draw a rectangle around the ones that failed the KS test at p-value < 0.05\n", " for ax, gr in zip(g.axes.flat, groups):\n", "\n", " temp = gr.split(\"_\")\n", " nice_title = f\"Cell {temp[1].replace('cell', '')} allele {temp[2]}\"\n", " ax.set_title(nice_title, fontsize = 10)\n", "\n", " mask = violin_plot_chr[col2check] == gr\n", " filename_temp = violin_plot_chr.loc[mask, \"filename\"].iat[0]\n", "\n", " exp_val = violin_plot_chr.loc[mask, \"expID\"].iat[0]\n", " trace_color = color_palette.get(exp_val, default_trace_color)\n", "\n", " # Find PolyCollection objects (violin halves). seaborn draws one poly per half.\n", " polys = [c for c in ax.collections if isinstance(c, PolyCollection)]\n", "\n", " # Apply colors to first two polys if present (Trace, Background)\n", " for poly, color in zip(polys[:2], (trace_color, bg_color)):\n", " try:\n", " poly.set_facecolor(mpl.colors.to_rgba(color))\n", " except Exception:\n", " pass\n", "\n", " # Get per-file n_total (if present) and derive a sensible p-threshold\n", " n_total_series = ks_test_chr.loc[ks_test_chr[\"filename\"] == filename_temp, \"n_total\"]\n", " if not n_total_series.empty and int(n_total_series.iat[0]) > 0:\n", " p_threshold = 0.05 / int(n_total_series.iat[0])\n", " else:\n", " p_threshold = 0.05\n", "\n", " failed = ks_test_chr.loc[\n", " (ks_test_chr[\"filename\"] == filename_temp) &\n", " (ks_test_chr[f\"pvalue_{feature}\"] < p_threshold)\n", " ]\n", "\n", " if not failed.empty:\n", " ax.text(\n", " 0.97, 0.97,\n", " \"*\",\n", " transform = ax.transAxes,\n", " ha = \"right\", va = \"top\",\n", " fontsize = 18, fontweight = \"bold\", color = \"red\",\n", " path_effects = [pe.withStroke(linewidth = 3, foreground = \"white\")],\n", " )\n", "\n", " # Remove duplicate legends from facets; keep one\n", " if g._legend is not None:\n", " g._legend.set_title(\"\")\n", " g._legend.set_bbox_to_anchor((1.02, 0.5))\n", " g._legend._loc = 6\n", "\n", " plt.tight_layout()\n", "\n", " if save_plots:\n", " plot_folder = Path(experimental_quality_assessment_folder, \"plots\", chrom)\n", " plot_folder.mkdir(parents=True, exist_ok=True)\n", " filename = Path(plot_folder, f\"violin_plot_{feature}_{chrom}.pdf\")\n", " print(f\"Saving plots to: {filename}\")\n", " plt.savefig(filename, **plot_opts)\n", " del plot_folder, filename\n", " else:\n", " plt.show()\n", "\n", " plt.close()\n", "\n", " print(f\"\\n{'-'*54}\\n\")\n", "\n", "del chrom, violin_plot_chr, ks_test_chr, groups, rows, gr, cond_vals, plot_df, g, col2check, feature\n", "del ax, temp, nice_title, mask, filename_temp, exp_val, trace_color, polys, poly, color, n_total_series, p_threshold, failed\n", "del bg_color, default_trace_color, color_palette, numerosity_total_chr" ] }, { "cell_type": "markdown", "id": "a3d49e67", "metadata": {}, "source": [ "## Plotting X/Y/Z axis precision against numerosity\n", "\n", "We can also plot the X/Y/Z axis mean precision of a given chromatin segment against its numerosity.\n", "\n", "[Back to the general index](#experimental-quality-assessment)" ] }, { "cell_type": "code", "execution_count": 6, "id": "707eb14d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr3/precX_vs_numerosity_chr3.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr3/precY_vs_numerosity_chr3.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr3/precZ_vs_numerosity_chr3.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr5/precX_vs_numerosity_chr5.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr5/precY_vs_numerosity_chr5.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr5/precZ_vs_numerosity_chr5.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr8/precX_vs_numerosity_chr8.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr8/precY_vs_numerosity_chr8.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr8/precZ_vs_numerosity_chr8.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr16/precX_vs_numerosity_chr16.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr16/precY_vs_numerosity_chr16.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr16/precZ_vs_numerosity_chr16.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chrX/precX_vs_numerosity_chrX.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chrX/precY_vs_numerosity_chrX.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chrX/precZ_vs_numerosity_chrX.pdf\n" ] } ], "source": [ "color_palette = {'2018-01-11':\"#a6cee3\" ,'2018-06-28':'#1f78b4', '2018-07-10':'#b2df8a', '2018-09-04':'#33a02c'}\n", "\n", "for chrom in violin_plot_data.keys():\n", "\n", " if not save_plots and chrom != \"chr3\":\n", " continue\n", "\n", " violin_plot_chr = violin_plot_data[chrom]\n", "\n", " sns.scatterplot(data=violin_plot_chr, x=\"precX\", y=\"numerosity\", hue=\"expID\", palette=color_palette, legend=False)\n", " plt.xlabel(\"X-axis precision\")\n", " plt.ylabel(\"Numerosity\")\n", " plt.tight_layout()\n", " if save_plots:\n", " plot_folder = Path(experimental_quality_assessment_folder, \"plots\", chrom)\n", " plot_folder.mkdir(parents=True, exist_ok=True)\n", " filename = Path(plot_folder, f\"precX_vs_numerosity_{chrom}.pdf\")\n", " print(f\"Saving plots to: {filename}\")\n", " plt.savefig(filename, **plot_opts)\n", " del plot_folder, filename\n", " else:\n", " plt.show()\n", " plt.close()\n", "\n", " sns.scatterplot(data=violin_plot_chr, x=\"precY\", y=\"numerosity\", hue=\"expID\", palette=color_palette, legend=False)\n", " plt.xlabel(\"Y-axis precision\")\n", " plt.ylabel(\"Numerosity\")\n", " plt.tight_layout()\n", " if save_plots:\n", " plot_folder = Path(experimental_quality_assessment_folder, \"plots\", chrom)\n", " plot_folder.mkdir(parents=True, exist_ok=True)\n", " filename = Path(plot_folder, f\"precY_vs_numerosity_{chrom}.pdf\")\n", " print(f\"Saving plots to: {filename}\")\n", " plt.savefig(filename, **plot_opts)\n", " del plot_folder, filename\n", " else:\n", " plt.show()\n", " plt.close()\n", "\n", " sns.scatterplot(data=violin_plot_chr, x=\"precZ\", y=\"numerosity\", hue=\"expID\", palette=color_palette, legend=False)\n", " plt.xlabel(\"Z-axis precision\")\n", " plt.ylabel(\"Numerosity\")\n", " plt.tight_layout()\n", " if save_plots:\n", " plot_folder = Path(experimental_quality_assessment_folder, \"plots\", chrom)\n", " plot_folder.mkdir(parents=True, exist_ok=True)\n", " filename = Path(plot_folder, f\"precZ_vs_numerosity_{chrom}.pdf\")\n", " print(f\"Saving plots to: {filename}\")\n", " plt.savefig(filename, **plot_opts)\n", " del plot_folder, filename\n", " else:\n", " plt.show()\n", " plt.close()\n" ] }, { "cell_type": "markdown", "id": "71041444", "metadata": {}, "source": [ "## Getting a file per chromosome with the traces that pass the assessment\n", "\n", "We create a file per each chromosome with the traces that passed the experimental assessment. \n", "\n", "This way the user has a reference file to filter the traces in the next steps, without needing to move the files around.\n", "\n", "[Back to the general index](#experimental-quality-assessment)" ] }, { "cell_type": "code", "execution_count": 7, "id": "1b7b49e8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Kruskal-Wallis test for chromosome chr3 and feature Numerosity: p-value=2.3501e-14\n", "Kruskal-Wallis test for chromosome chr3 and feature X-axis precision: p-value=5.0175e-08\n", "Kruskal-Wallis test for chromosome chr3 and feature Y-axis precision: p-value=6.3053e-15\n", "Kruskal-Wallis test for chromosome chr3 and feature Z-axis precision: p-value=8.1420e-09\n", "Traces that \u001b[1;4mpassed\u001b[0m the KS test for chromosome chr3: 30 out of 33 total traces.\n", "\n", "------------------------------------------------------\n", "\n", "Kruskal-Wallis test for chromosome chr5 and feature Numerosity: p-value=2.7458e-15\n", "Kruskal-Wallis test for chromosome chr5 and feature X-axis precision: p-value=4.7086e-09\n", "Kruskal-Wallis test for chromosome chr5 and feature Y-axis precision: p-value=3.7215e-13\n", "Kruskal-Wallis test for chromosome chr5 and feature Z-axis precision: p-value=8.6281e-10\n", "Traces that \u001b[1;4mpassed\u001b[0m the KS test for chromosome chr5: 33 out of 35 total traces.\n", "\n", "------------------------------------------------------\n", "\n", "Kruskal-Wallis test for chromosome chr8 and feature Numerosity: p-value=2.4712e-05\n", "Kruskal-Wallis test for chromosome chr8 and feature X-axis precision: p-value=3.0020e-04\n", "Kruskal-Wallis test for chromosome chr8 and feature Y-axis precision: p-value=3.5394e-06\n", "Kruskal-Wallis test for chromosome chr8 and feature Z-axis precision: p-value=2.4363e-04\n", "Traces that \u001b[1;4mpassed\u001b[0m the KS test for chromosome chr8: 22 out of 22 total traces.\n", "\n", "------------------------------------------------------\n", "\n", "Kruskal-Wallis test for chromosome chr16 and feature Numerosity: p-value=3.9739e-01\n", "Kruskal-Wallis test for chromosome chr16 and feature X-axis precision: p-value=7.2336e-01\n", "Kruskal-Wallis test for chromosome chr16 and feature Y-axis precision: p-value=5.9635e-01\n", "Kruskal-Wallis test for chromosome chr16 and feature Z-axis precision: p-value=3.1732e-01\n", "Traces that \u001b[1;4mpassed\u001b[0m the KS test for chromosome chr16: 10 out of 10 total traces.\n", "\n", "------------------------------------------------------\n", "\n", "Kruskal-Wallis test for chromosome chrX and feature Numerosity: p-value=1.9648e-07\n", "Kruskal-Wallis test for chromosome chrX and feature X-axis precision: p-value=4.9749e-01\n", "Kruskal-Wallis test for chromosome chrX and feature Y-axis precision: p-value=2.3816e-01\n", "Kruskal-Wallis test for chromosome chrX and feature Z-axis precision: p-value=3.1606e-01\n", "Traces that \u001b[1;4mpassed\u001b[0m the KS test for chromosome chrX: 4 out of 5 total traces.\n", "\n", "------------------------------------------------------\n", "\n" ] } ], "source": [ "for chrom in ks_test_data.keys():\n", "\n", " ks_test_chr = ks_test_data[chrom]\n", " kw_test_chr = kw_test_data[kw_test_data[\"chromosome\"] == chrom]\n", "\n", " Path(experimental_quality_assessment_folder, \"tsv\").mkdir(parents=True, exist_ok=True)\n", "\n", " ks_test_chr.to_csv(Path(experimental_quality_assessment_folder, \"tsv\", f\"ks_test_summary_{chrom}.tsv\"), index=False, sep=\"\\t\")\n", " kw_test_chr.to_csv(Path(experimental_quality_assessment_folder, \"tsv\", f\"kw_test_summary_{chrom}.tsv\"), index=False, sep=\"\\t\")\n", "\n", " for feature in [\"numerosity\", \"precX\", \"precY\", \"precZ\"]:\n", " print(f\"Kruskal-Wallis test for chromosome {chrom} and feature {nice_feature_names[feature]}: p-value={kw_test_chr[f'pvalue_{feature}'].iat[0]:.4e}\")\n", "\n", " trace_pass = []\n", "\n", " for _, row in ks_test_chr.iterrows():\n", "\n", " # display(row)\n", "\n", " p_value_threshold = 0.05 / row[\"n_total\"] if row[\"n_total\"] > 0 else 0.05\n", "\n", " if (row[\"pvalue_numerosity\"] > p_value_threshold) & \\\n", " (row[\"pvalue_precX\"] > p_value_threshold) & \\\n", " (row[\"pvalue_precY\"] > p_value_threshold) & \\\n", " (row[\"pvalue_precZ\"] > p_value_threshold):\n", " \n", " trace_pass.append(row[\"filename\"])\n", "\n", " # Write the contents of taces_pass to a text file\n", " with open(Path(experimental_quality_assessment_folder, f\"experimental_assessment_pass_{chrom}.txt\"), \"w\") as f:\n", " for item in trace_pass:\n", " f.write(f\"{item}\\n\")\n", "\n", " \n", " print(f\"Traces that \\033[1;4mpassed\\033[0m the KS test for chromosome {chrom}: {len(trace_pass)} out of {ks_test_chr.shape[0]} total traces.\")\n", " print(f\"\\n{'-'*54}\\n\")\n", "\n", "del chrom, ks_test_chr, kw_test_chr, trace_pass, row, p_value_threshold, f, item, feature\n" ] }, { "cell_type": "markdown", "id": "8a37bbde", "metadata": {}, "source": [ "## Plotting efficiency matrices" ] }, { "cell_type": "markdown", "id": "19f68df7", "metadata": {}, "source": [ "Here we plot what we call efficiency matrices. \n", "\n", "They show the different traces and (one per row) and the different chromatin segments (one per column).\n", "\n", "If a chromatin segment is present in a trace, it is shown as a blue square; if absent is shown as a white square.\n", "\n", "[Back to the general index](#experimental-quality-assessment)" ] }, { "cell_type": "code", "execution_count": 8, "id": "0bf7245c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Creating efficiency matrices and plots for each chromosome...\n", "\n", "Processing CSV files in folder: /scratch/CIMA_tutorial/data/chr3\n", "Total number of files found: 33\n", "Excluding timepoint 19 in experimentID: 2018-09-04, nucleusID: 7, homolog: A because it has less than 10 points.\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr3/efficiency_matrix_pre_experimental_assessment_chr3.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr3/efficiency_matrix_post_experimental_assessment_chr3.pdf\n", "\n", "------------------------------------------------------\n", "\n", "Processing CSV files in folder: /scratch/CIMA_tutorial/data/chr5\n", "Total number of files found: 35\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr5/efficiency_matrix_pre_experimental_assessment_chr5.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr5/efficiency_matrix_post_experimental_assessment_chr5.pdf\n", "\n", "------------------------------------------------------\n", "\n", "Processing CSV files in folder: /scratch/CIMA_tutorial/data/chr8\n", "Total number of files found: 22\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr8/efficiency_matrix_pre_experimental_assessment_chr8.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr8/efficiency_matrix_post_experimental_assessment_chr8.pdf\n", "\n", "------------------------------------------------------\n", "\n", "Processing CSV files in folder: /scratch/CIMA_tutorial/data/chr16\n", "Total number of files found: 10\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr16/efficiency_matrix_pre_experimental_assessment_chr16.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr16/efficiency_matrix_post_experimental_assessment_chr16.pdf\n", "\n", "------------------------------------------------------\n", "\n", "Processing CSV files in folder: /scratch/CIMA_tutorial/data/chrX\n", "Total number of files found: 5\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chrX/efficiency_matrix_pre_experimental_assessment_chrX.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chrX/efficiency_matrix_post_experimental_assessment_chrX.pdf\n", "\n", "------------------------------------------------------\n", "\n" ] } ], "source": [ "color_palette = {'2018-01-11':\"#a6cee3\" ,'2018-06-28':'#1f78b4', '2018-07-10':'#b2df8a', '2018-09-04':'#33a02c'}\n", "\n", "print(\"Creating efficiency matrices and plots for each chromosome...\\n\")\n", "\n", "for chrom in starting_step.keys():\n", "\n", " if save_plots == False and chrom != \"chr3\":\n", " continue\n", "\n", " data_folder_chr = Path(data_folder, chrom)\n", "\n", " violin_plot_chr = violin_plot_data[chrom]\n", "\n", " print(f\"Processing CSV files in folder: {data_folder_chr}\")\n", "\n", " # Get a list of all CSV files in the specified data folder\n", " csv_files = list(data_folder_chr.glob('*.csv'))\n", " if len(csv_files) == 0:\n", " print(\"No files found, please input the correct work_folder\")\n", " else:\n", " print(f\"Total number of files found: {len(csv_files)}\")\n", "\n", " traces_pass = pd.read_csv(Path(experimental_quality_assessment_folder, f\"experimental_assessment_pass_{chrom}.txt\"), sep=\"\\t\", names=[\"trace\"])\n", "\n", " # Get a list of metadata keys to extract from filenames\n", " metadata_keys = [\"nucleusID\", \"cellID\", \"locationID\", \"date\", \"homolog\", \"chromosome\"]\n", "\n", " file_names = []\n", "\n", " if len(csv_files) > 0:\n", " efficiency_matrix_1 = np.zeros((len(csv_files), number_of_steps_library))\n", " efficiency_matrix_2 = np.zeros((len(csv_files), number_of_steps_library))\n", " # Iterate over each file in the list of file names\n", " for count, filein in enumerate(csv_files, start=1):\n", "\n", " stem = Path(filein).stem\n", "\n", " file_names.append(violin_plot_chr[violin_plot_chr[\"filename\"] == stem][\"repID\"].unique()[0])\n", "\n", " print(f\"Processing file {count}: {stem}\", end=\"\\r\")\n", "\n", " # We mine the metadata from the filename using regex patterns. If not possible, assign default values.\n", " metadata_CIMA = {\n", " key: (match.group(1) if (match := regex_patterns[key].search(stem)) else default)\n", " for key, default in zip(metadata_keys, [count, count, count, stem, \"A\", \"chrN\"])\n", " }\n", "\n", " # Adjust the metadata for nucleus and cell (if nucleus is count, but cell is something, use the cell match in nucleus)\n", " if \"cell\" in metadata_CIMA and metadata_CIMA[\"nucleusID\"] == count:\n", " metadata_CIMA[\"nucleusID\"] = metadata_CIMA[\"cell\"]\n", "\n", " metadata_CIMA[\"nucleusID\"] = int(metadata_CIMA[\"nucleusID\"])\n", " metadata_CIMA[\"cellID\"] = int(metadata_CIMA[\"cellID\"])\n", " metadata_CIMA[\"locationID\"] = int(metadata_CIMA[\"locationID\"])\n", "\n", " # Set experimentID as date for consistency\n", " metadata_CIMA[\"experimentID\"] = metadata_CIMA[\"date\"]\n", "\n", " # Read the CSV file and create a StructuralObject instance\n", " objin = CSVParser.read_CSV_file(filein.as_posix(), metadata = metadata_CIMA, content_type = \"srx\")\n", "\n", " objin.atomList = objin.atomList[(objin.atomList['timepoint'] >= starting_step[chrom] - 1) &\n", " (objin.atomList['timepoint'] < (starting_step[chrom] + number_of_steps_library - 1))].copy()\n", " \n", " for tp in objin.atomList[\"timepoint\"].unique():\n", " # Take out the timepoint from the atomList if it has less tha 10 points\n", " tp_mask = objin.atomList[\"timepoint\"] == tp\n", " if tp_mask.sum() < 10:\n", " print(f\"Excluding timepoint {tp} in experimentID: {metadata_CIMA['experimentID']}, nucleusID: {metadata_CIMA['nucleusID']}, homolog: {metadata_CIMA['homolog']} because it has less than 10 points.\")\n", " objin.atomList = objin.atomList.loc[~tp_mask].copy()\n", "\n", "\n", " # We store the numerosities for each timepoint in the file for the KW/KS tests.\n", " for tp in objin.atomList.timepoint.unique():\n", " efficiency_matrix_1[count-1, tp-starting_step[chrom]] = 1\n", " if stem in traces_pass[\"trace\"].values:\n", " efficiency_matrix_2[count-1, tp-starting_step[chrom]] = 1\n", "\n", " efficiency_matrix = pd.DataFrame(efficiency_matrix_1, columns = [f\"{i+1}\" for i in range(number_of_steps_library)])\n", " efficiency_matrix.index = file_names\n", " rownames = [temp.split(\"_\")[1] for temp in efficiency_matrix.index]\n", " rownames = [f\"Cell {temp.split('_')[1].replace('cell', '')} allelle {temp.split('_')[2]}\" for temp in efficiency_matrix.index]\n", "\n", " fig_height = max(2, 0.3 * len(efficiency_matrix.index))\n", "\n", " g = sns.clustermap(data = efficiency_matrix, figsize=(5, fig_height),\n", " vmin=0, vmax=1, cmap=['#f7f7f7', '#00008B'], \n", " row_colors=[color_palette.get(d.split(\"_\")[0], \"#a6cee3\") for d in efficiency_matrix.index],\n", " col_cluster=False, row_cluster=False,\n", " yticklabels=rownames,\n", " xticklabels=[f\"{i+1}\" for i in range(efficiency_matrix.shape[1])],\n", " )\n", "\n", " g.ax_heatmap.set_xlabel('Step ID')\n", " g.ax_heatmap.set_xticklabels(g.ax_heatmap.get_xticklabels(), rotation=0, fontsize=8)\n", " g.ax_heatmap.set_yticklabels(g.ax_heatmap.get_yticklabels(), rotation=0, fontsize=8)\n", " g.ax_row_dendrogram.set_visible(False)\n", " g.ax_col_dendrogram.set_visible(False)\n", " g.ax_cbar.set_visible(False)\n", "\n", " if save_plots:\n", " plot_folder = Path(experimental_quality_assessment_folder, \"plots\", chrom)\n", " plot_folder.mkdir(parents=True, exist_ok=True)\n", " filename = Path(plot_folder, f\"efficiency_matrix_pre_experimental_assessment_{chrom}.pdf\")\n", " print(f\"Saving plots to: {filename}\")\n", " plt.savefig(filename, **plot_opts)\n", " del plot_folder, filename\n", " else:\n", " plt.show()\n", " plt.close()\n", "\n", " efficiency_matrix = pd.DataFrame(efficiency_matrix_2, columns = [f\"Step{i+1}\" for i in range(number_of_steps_library)])\n", " efficiency_matrix.index = file_names\n", "\n", " g = sns.clustermap(data = efficiency_matrix, figsize=(5, fig_height),\n", " vmin=0, vmax=1, cmap=['#f7f7f7', '#00008B'], \n", " row_colors=[color_palette.get(d.split(\"_\")[0], \"#a6cee3\") for d in efficiency_matrix.index],\n", " col_cluster=False, row_cluster=False,\n", " yticklabels=rownames,\n", " xticklabels=[f\"{i+1}\" for i in range(efficiency_matrix.shape[1])],\n", " )\n", "\n", " g.ax_heatmap.set_xlabel('Step ID')\n", " g.ax_heatmap.set_xticklabels(g.ax_heatmap.get_xticklabels(), rotation=0, fontsize=8)\n", " g.ax_heatmap.set_yticklabels(g.ax_heatmap.get_yticklabels(), rotation=0, fontsize=8)\n", " g.ax_row_dendrogram.set_visible(False)\n", " g.ax_col_dendrogram.set_visible(False)\n", " g.ax_cbar.set_visible(False)\n", "\n", " if save_plots:\n", " plot_folder = Path(experimental_quality_assessment_folder, \"plots\", chrom)\n", " plot_folder.mkdir(parents=True, exist_ok=True)\n", " filename = Path(plot_folder, f\"efficiency_matrix_post_experimental_assessment_{chrom}.pdf\")\n", " print(f\"Saving plots to: {filename}\")\n", " plt.savefig(filename, **plot_opts)\n", " del plot_folder, filename\n", " else:\n", " plt.show()\n", " plt.close()\n", "\n", " print(f\"\\n{'-'*54}\\n\")\n", "\n", "del chrom, data_folder_chr, csv_files, traces_pass, metadata_keys, efficiency_matrix_1, efficiency_matrix_2\n", "del count, filein, stem, metadata_CIMA, match, objin, tp, efficiency_matrix, g, fig_height, file_names, rownames\n", "del violin_plot_chr\n" ] }, { "cell_type": "markdown", "id": "732f7c64", "metadata": {}, "source": [ "## Anisotropy factor" ] }, { "cell_type": "markdown", "id": "254e1fd2", "metadata": {}, "source": [ "Here we calculate and plot the anisotropy factor. \n", "\n", "The anisotrpy factor is a measure on how different is the error of the localization in the different axis.\n", "\n", "It is calculated following this formula: \n", "\n", "$$\n", "\\text{Anisotropy factor} = \\frac{\\operatorname{mean}(x_{\\text{spread}},\\ y_{\\text{spread}})}{z_{\\text{spread}}}\n", "$$\n", "\n", "In the plots, lines are shown for 0.2, 0.3 and 0.5 to give the user a visual reference.\n", "\n", "[Back to the general index](#experimental-quality-assessment)" ] }, { "cell_type": "code", "execution_count": 9, "id": "af16fca1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Processing CSV files in folder: /scratch/CIMA_tutorial/data/chr3\n", "Total number of files found: 33\n", "Skipping file 2: 2018-07-10_nuc04_A_chr3 because it did not pass the experimental quality assessment.\n", "Skipping file 5: 2018-07-10_nuc03_A_chr3 because it did not pass the experimental quality assessment.\n", "Excluding timepoint 19 in experimentID: 2018-09-04, nucleusID: 7, homolog: A because it has less than 10 points.\n", "Skipping file 26: 2018-09-04_nuc01_A_chr3 because it did not pass the experimental quality assessment.\n", "Processing file 33: 2018-09-04_nuc06_A_chr3\n", "------------------------------------------------------\n", "\n", "Processing CSV files in folder: /scratch/CIMA_tutorial/data/chr5\n", "Total number of files found: 35\n", "Skipping file 2: 2018-07-10_nuc04_A_chr5 because it did not pass the experimental quality assessment.\n", "Skipping file 34: 2018-09-04_nuc01_A_chr5 because it did not pass the experimental quality assessment.\n", "Processing file 35: 2018-09-04_nuc04_A_chr5\n", "------------------------------------------------------\n", "\n", "Processing CSV files in folder: /scratch/CIMA_tutorial/data/chr8\n", "Total number of files found: 22\n", "Processing file 22: 2018-09-04_nuc01_B_chr8\n", "------------------------------------------------------\n", "\n", "Processing CSV files in folder: /scratch/CIMA_tutorial/data/chr16\n", "Total number of files found: 10\n", "Processing file 10: 2018-09-04_nuc04_A_chr16\n", "------------------------------------------------------\n", "\n", "Processing CSV files in folder: /scratch/CIMA_tutorial/data/chrX\n", "Total number of files found: 5\n", "Skipping file 2: 2018-09-04_nuc04_A_chrX because it did not pass the experimental quality assessment.\n", "Processing file 5: 2018-09-04_nuc01_A_chrX\n", "------------------------------------------------------\n", "\n" ] } ], "source": [ "collection_prec = defaultdict(list)\n", "for chrom in starting_step.keys():\n", "\n", " data_folder_chr = Path(data_folder, chrom)\n", "\n", " print(f\"Processing CSV files in folder: {data_folder_chr}\")\n", "\n", " # Get a list of all CSV files in the specified data folder\n", " csv_files = list(data_folder_chr.glob('*.csv'))\n", " if len(csv_files) == 0:\n", " print(\"No files found, please input the correct work_folder\")\n", " else:\n", " print(f\"Total number of files found: {len(csv_files)}\")\n", "\n", " traces_pass = pd.read_csv(Path(experimental_quality_assessment_folder, f\"experimental_assessment_pass_{chrom}.txt\"), sep=\"\\t\", names=[\"trace\"])\n", "\n", " # Get a list of metadata keys to extract from filenames\n", " metadata_keys = [\"nucleusID\", \"cellID\", \"locationID\", \"date\", \"homolog\", \"chromosome\"]\n", "\n", " obj_list = []\n", "\n", " if len(csv_files) > 0:\n", " # Iterate over each file in the list of file names\n", " for count, filein in enumerate(csv_files, start=1):\n", "\n", " stem = Path(filein).stem\n", "\n", " if stem not in traces_pass[\"trace\"].values:\n", " print(f\"Skipping file {count}: {stem} because it did not pass the experimental quality assessment.\")\n", " continue\n", "\n", " print(f\"Processing file {count}: {stem}\", end=\"\\r\")\n", "\n", " # We mine the metadata from the filename using regex patterns. If not possible, assign default values.\n", " metadata_CIMA = {\n", " key: (match.group(1) if (match := regex_patterns[key].search(stem)) else default)\n", " for key, default in zip(metadata_keys, [count, count, count, stem, \"A\", \"chrN\"])\n", " }\n", "\n", " # Adjust the metadata for nucleus and cell (if nucleus is count, but cell is something, use the cell match in nucleus)\n", " if \"cell\" in metadata_CIMA and metadata_CIMA[\"nucleusID\"] == count:\n", " metadata_CIMA[\"nucleusID\"] = metadata_CIMA[\"cell\"]\n", "\n", " metadata_CIMA[\"nucleusID\"] = int(metadata_CIMA[\"nucleusID\"])\n", " metadata_CIMA[\"cellID\"] = int(metadata_CIMA[\"cellID\"])\n", " metadata_CIMA[\"locationID\"] = int(metadata_CIMA[\"locationID\"])\n", "\n", " # Set experimentID as date for consistency\n", " metadata_CIMA[\"experimentID\"] = metadata_CIMA[\"date\"]\n", "\n", " # Read the CSV file and create a StructuralObject instance\n", " objin = CSVParser.read_CSV_file(filein.as_posix(), metadata = metadata_CIMA, content_type = \"srx\")\n", "\n", " objin.atomList = objin.atomList[(objin.atomList['timepoint'] >= starting_step[chrom] - 1) &\n", " (objin.atomList['timepoint'] < (starting_step[chrom] + number_of_steps_library - 1))].copy()\n", " \n", " for tp in objin.atomList[\"timepoint\"].unique():\n", " # Take out the timepoint from the atomList if it has less tha 10 points\n", " tp_mask = objin.atomList[\"timepoint\"] == tp\n", " if tp_mask.sum() < 10:\n", " print(f\"Excluding timepoint {tp} in experimentID: {metadata_CIMA['experimentID']}, nucleusID: {metadata_CIMA['nucleusID']}, homolog: {metadata_CIMA['homolog']} because it has less than 10 points.\")\n", " objin.atomList = objin.atomList.loc[~tp_mask].copy()\n", "\n", " obj_list.append(objin)\n", "\n", " for obj in obj_list:\n", " for tp in obj.atomList[\"timepoint\"].unique():\n", " # Take out the timepoint from the atomList if it has less tha 10 points\n", " tp_mask = obj.atomList[\"timepoint\"] == tp\n", " temp = obj.atomList.loc[tp_mask].copy()\n", "\n", " x_spread = max(temp[\"xprec\"].values) - min(temp[\"xprec\"].values)\n", " y_spread = max(temp[\"yprec\"].values) - min(temp[\"yprec\"].values)\n", " z_spread = max(temp[\"zprec\"].values) - min(temp[\"zprec\"].values)\n", " anisotropy_factor = np.nanmean([x_spread, y_spread]) / z_spread\n", "\n", " collection_prec[\"chromosome\"].append(chrom)\n", " collection_prec[\"file\"].append(obj.metadata[\"experimentID\"] + \"_\" + str(obj.metadata[\"nucleusID\"]) + \"_\" + obj.metadata[\"homolog\"])\n", " collection_prec[\"tp\"].append(tp)\n", " collection_prec[\"spread_x\"].append(x_spread)\n", " collection_prec[\"spread_y\"].append(y_spread)\n", " collection_prec[\"spread_z\"].append(z_spread)\n", " collection_prec[\"anisotropy_factor\"].append(anisotropy_factor)\n", "\n", " print(f\"\\n{'-'*54}\\n\")\n", "\n", "collection_prec = pd.DataFrame(collection_prec)\n", "\n", "del chrom, data_folder_chr, csv_files, count, filein, traces_pass, metadata_keys, stem, metadata_CIMA, match, objin\n", "del tp, tp_mask, obj_list, obj, temp, x_spread, y_spread, z_spread, anisotropy_factor" ] }, { "cell_type": "code", "execution_count": 10, "id": "8d1068dc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "For chr3 the min anisotropy factor is 0.22194411423835986\n", "For chr3 the max anisotropy factor is 0.8102558862289374\n", "\n", "For chr5 the min anisotropy factor is 0.18651602294796202\n", "For chr5 the max anisotropy factor is 0.7496463612103765\n", "\n", "For chr8 the min anisotropy factor is 0.17096927338239262\n", "For chr8 the max anisotropy factor is 0.5108808891327506\n", "\n", "For chr16 the min anisotropy factor is 0.1550711582050499\n", "For chr16 the max anisotropy factor is 0.5208693780499624\n", "\n", "For chrX the min anisotropy factor is 0.17967870020756443\n", "For chrX the max anisotropy factor is 0.4708960725633753\n", "\n" ] } ], "source": [ "for chrom in starting_step.keys():\n", "\n", " temp = collection_prec[collection_prec[\"chromosome\"] == chrom]\n", " print(f\"For {chrom} the min anisotropy factor is {temp.anisotropy_factor.min()}\")\n", " print(f\"For {chrom} the max anisotropy factor is {temp.anisotropy_factor.max()}\\n\")\n", "\n", "del chrom, temp" ] }, { "cell_type": "code", "execution_count": 11, "id": "096828e8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Plotting anisotropy factor per trace for chromosome chr3...\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr3/anisotropy_factor_per_tracechr3.pdf\n", "\n", "------------------------------------------------------\n", "\n", "Plotting anisotropy factor per trace for chromosome chr5...\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr5/anisotropy_factor_per_tracechr5.pdf\n", "\n", "------------------------------------------------------\n", "\n", "Plotting anisotropy factor per trace for chromosome chr8...\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr8/anisotropy_factor_per_tracechr8.pdf\n", "\n", "------------------------------------------------------\n", "\n", "Plotting anisotropy factor per trace for chromosome chr16...\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr16/anisotropy_factor_per_tracechr16.pdf\n", "\n", "------------------------------------------------------\n", "\n", "Plotting anisotropy factor per trace for chromosome chrX...\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chrX/anisotropy_factor_per_tracechrX.pdf\n", "\n", "------------------------------------------------------\n", "\n" ] } ], "source": [ "for chrom in starting_step.keys():\n", "\n", " if not save_plots and chrom != \"chr3\":\n", " continue\n", " \n", " print(f\"Plotting anisotropy factor per trace for chromosome {chrom}...\")\n", "\n", " collection_prec_chr = (\n", " collection_prec.loc[collection_prec[\"chromosome\"] == chrom, [\"file\", \"tp\", \"anisotropy_factor\"]]\n", " .sort_values([\"file\", \"tp\"])\n", " .copy()\n", " )\n", "\n", " g = sns.relplot(\n", " data=collection_prec_chr,\n", " x=\"tp\",\n", " y=\"anisotropy_factor\",\n", " col=\"file\",\n", " col_wrap=6, # Number of plots per row, adjust as needed\n", " kind=\"line\",\n", " marker=\"o\",\n", " linewidth=1.2,\n", " height=2.5,\n", " aspect=1.2,\n", " facet_kws={\"sharey\": True, \"sharex\": True},\n", " # ylim = (0, 1)\n", " )\n", "\n", " g.set_axis_labels(\"Timepoint (tp)\", \"Anisotropy factor\")\n", " g.set_titles(\"{col_name}\")\n", " g.set(ylim=(0,1))\n", "\n", " for ax in g.axes.flat:\n", " x_min = int(collection_prec_chr[\"tp\"].min())\n", " x_max = int(collection_prec_chr[\"tp\"].max())\n", " ax.set_xticks(range(x_min, x_max + 1))\n", " ax.set_xticklabels(range(1, number_of_steps_library + 1), rotation=0)\n", " ax.axhline(0.25, color = \"red\", alpha = 0.25, zorder = 0)\n", " ax.axhline(0.3, color = \"green\", alpha = 0.25, zorder = 0)\n", " ax.axhline(0.5, color = \"black\", alpha = 0.25, zorder = 0)\n", "\n", " plt.tight_layout()\n", "\n", " if save_plots:\n", " plot_folder = Path(experimental_quality_assessment_folder, \"plots\", chrom)\n", " plot_folder.mkdir(parents=True, exist_ok=True)\n", " filename = Path(plot_folder, f\"anisotropy_factor_per_trace{chrom}.pdf\")\n", " print(f\"Saving plots to: {filename}\")\n", " plt.savefig(filename, **plot_opts)\n", " del plot_folder, filename\n", " else:\n", " plt.show()\n", " plt.close()\n", "\n", " print(f\"\\n{'-'*54}\\n\")\n", "\n", "del chrom, g, ax, collection_prec_chr, x_max, x_min" ] }, { "cell_type": "code", "execution_count": 12, "id": "195137a6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Plotting anisotropy factor distribution for chromosome chr3...\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr3/anisotropy_factor_boxenplot_chr3.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr3/anisotropy_factor_lineplot_chr3.pdf\n", "\n", "------------------------------------------------------\n", "\n", "Plotting anisotropy factor distribution for chromosome chr5...\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr5/anisotropy_factor_boxenplot_chr5.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr5/anisotropy_factor_lineplot_chr5.pdf\n", "\n", "------------------------------------------------------\n", "\n", "Plotting anisotropy factor distribution for chromosome chr8...\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr8/anisotropy_factor_boxenplot_chr8.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr8/anisotropy_factor_lineplot_chr8.pdf\n", "\n", "------------------------------------------------------\n", "\n", "Plotting anisotropy factor distribution for chromosome chr16...\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr16/anisotropy_factor_boxenplot_chr16.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chr16/anisotropy_factor_lineplot_chr16.pdf\n", "\n", "------------------------------------------------------\n", "\n", "Plotting anisotropy factor distribution for chromosome chrX...\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chrX/anisotropy_factor_boxenplot_chrX.pdf\n", "Saving plots to: /scratch/CIMA_tutorial/2_experimental_quality_assessment/plots/chrX/anisotropy_factor_lineplot_chrX.pdf\n", "\n", "------------------------------------------------------\n", "\n" ] } ], "source": [ "for chrom in collection_prec[\"chromosome\"].unique():\n", "\n", " if not save_plots and chrom != \"chr3\":\n", " continue\n", "\n", " print(f\"Plotting anisotropy factor distribution for chromosome {chrom}...\")\n", "\n", " collection_prec_chr = collection_prec[collection_prec[\"chromosome\"] == chrom]\n", "\n", " sns.boxenplot(data = collection_prec_chr, x = \"tp\", y = \"anisotropy_factor\")\n", " plt.ylim(min(collection_prec_chr[\"anisotropy_factor\"].min() * 0.9, 0), collection_prec_chr[\"anisotropy_factor\"].max() * 1.1)\n", " plt.xlabel(\"Timepoint\")\n", " plt.ylabel(\"Anisotropy factor (mean(X,Y) spread / Z spread )\")\n", " plt.axhline(0.25, color = \"red\")\n", " plt.axhline(0.20, color = \"blue\")\n", " plt.title(f\"Chromosome {chrom} - Boxenplot\")\n", " if save_plots:\n", " plot_folder = Path(experimental_quality_assessment_folder, \"plots\", chrom)\n", " plot_folder.mkdir(parents=True, exist_ok=True)\n", " filename = Path(plot_folder, f\"anisotropy_factor_boxenplot_{chrom}.pdf\")\n", " print(f\"Saving plots to: {filename}\")\n", " plt.savefig(filename, **plot_opts)\n", " del plot_folder, filename\n", " else:\n", " plt.show()\n", " plt.close()\n", "\n", " sns.lineplot(data = collection_prec_chr, x = \"tp\", y = \"anisotropy_factor\")\n", " plt.ylim(collection_prec_chr[\"anisotropy_factor\"].min() * 0.9, collection_prec_chr[\"anisotropy_factor\"].max() * 1.1)\n", " plt.xlabel(\"Timepoint\")\n", " plt.ylabel(\"Anisotropy factor (mean(X,Y) spread / Z spread )\")\n", " plt.title(f\"Chromosome {chrom} - Lineplot\")\n", " if save_plots:\n", " plot_folder = Path(experimental_quality_assessment_folder, \"plots\", chrom)\n", " plot_folder.mkdir(parents=True, exist_ok=True)\n", " filename = Path(plot_folder, f\"anisotropy_factor_lineplot_{chrom}.pdf\")\n", " print(f\"Saving plots to: {filename}\")\n", " plt.savefig(filename, **plot_opts)\n", " del plot_folder, filename\n", " else:\n", " plt.show()\n", " plt.close()\n", "\n", " print(f\"\\n{'-'*54}\\n\")\n", "\n", "del chrom, collection_prec_chr" ] } ], "metadata": { "kernelspec": { "display_name": "CIMA_new_install", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.13" } }, "nbformat": 4, "nbformat_minor": 5 }