{ "cells": [ { "cell_type": "markdown", "id": "5d40d974", "metadata": {}, "source": [ "# 99 - Writing to and reading from different formats\n", "\n", "**Description:** This notebook contains an example of how to write a list of CIMA objects into the FOF_CT volumetric format. \n", "\n", "The following index has links to the different sections of the notebook. Some sections may have additional indexes to access their subparts. \n", "If you are using VS Code to view the notebook, the links will not work because of how VS Code renders notebooks. To navigate the notebook, use the outline panel instead. To learn more about it, see this [link](https://code.visualstudio.com/docs/getstarted/userinterface#_outline-view)." ] }, { "cell_type": "markdown", "id": "5820879c", "metadata": {}, "source": [ "Content:\n", "\n", "- [Library imports and functions](#library-imports-and-functions)\n", "- [Setting some variables](#setting-some-variables)\n", "- [Exporting the data with CIMA](#exporting-the-data-with-cima)\n", "- [Writing to FOF-CT volumetric format](#writting-to-fof-ct-volumetic-format)\n", "- [Reading the FOF-CT volumetric format](#reading-the-fof-ct-volumetic-format)\n", "- [Reading from XYZ coordinates format](#reading-from-xyz-coordinate-format)" ] }, { "cell_type": "markdown", "id": "8cde2c0d", "metadata": {}, "source": [ "## Library imports and functions" ] }, { "cell_type": "markdown", "id": "1723e176", "metadata": {}, "source": [ "Content:\n", "\n", "- [Library imports and functions](#library-imports-and-functions)\n", "- [Setting some variables](#setting-some-variables)\n", "- [Exporting the data with CIMA](#exporting-the-data-with-cima)\n", "- [Extracting morphological features](#extracting-the-morphological-features)" ] }, { "cell_type": "markdown", "id": "07f8c8b7", "metadata": {}, "source": [ "[Back to the general index](#morphological-features-extraction)" ] }, { "cell_type": "code", "execution_count": 1, "id": "4608fdf1", "metadata": {}, "outputs": [], "source": 
[ "import re\n", "from pathlib import Path\n", "from itertools import product\n", "\n", "import pandas as pd\n", "import seaborn as sns\n", "\n", "from cima.parsers.parser_csv import CSVParser\n", "from cima.utils.misc import fof_volumetric_ct_writer, fof_volumetric_ct_reader, fof_bs_ct_reader\n", "\n", "## Regular expressions to get the metadata from the filename\n", "regex_patterns = {\n", "    \"nucleusID\": re.compile(r\"(?i)nuc(\\d+)\"), # This searches for \"Nuc\" or \"nuc\" followed by digits\n", "    \"cellID\": re.compile(r\"(?i)cell(\\d+)\"), # This searches for \"Cell\" or \"cell\" followed by digits\n", "    \"locationID\" : re.compile(r\"(?i)loc-?(\\d+)\"), # This searches for \"Loc\" or \"loc\" followed by optional hyphen and digits\n", "    \"date\" : re.compile(r\"(\\d{4}[-\\.]\\d{2}[-\\.]\\d{2})\"), # This searches for dates in the format YYYY-MM-DD or YYYY.MM.DD\n", "    \"homolog\" : re.compile(r\"_([ABPM01pm])\"), ## Added just in case the homolog is in the filename.\n", "    \"chromosome\" : re.compile(r\"(?i)(chr(?:\\d+|[MXY]))\") ## Added just in case the chromosome is in the filename (\"chr\" followed by digits, M, X or Y).\n", "}" ] }, { "cell_type": "markdown", "id": "d970bd6d", "metadata": {}, "source": [ "## Setting some variables" ] }, { "cell_type": "markdown", "id": "65c741e1", "metadata": {}, "source": [ "In the following cell, set the variables the notebook needs to run properly. Adjust them to your setup, and then you can press \"Run all\" in the Jupyter Notebook. \n", "\n", "Brief explanation of the different variables:\n", "\n", "\n", "The notebook will create a folder called `FOF_CT_files` inside the work_folder path. Inside this folder, a subfolder named after the file suffix will hold the FOF_CT tables written for that experiment.\n", "\n", "
\n",
    "FOF_CT_files/\n",
    "|-- SUFFIX/\n",
    "|   |-- core_SUFFIX_volumetric.csv\n",
    "|   |-- trace_SUFFIX_volumetric.csv\n",
    "|   |-- spot_SUFFIX_volumetric.csv\n",
    "
\n", "\n", "[Back to the general index](#morphological-features-extraction)" ] }, { "cell_type": "code", "execution_count": 2, "id": "4b6667e1", "metadata": {}, "outputs": [], "source": [ "work_folder = \"/scratch/CIMA_tutorial/\"\n", "data_folder = \"/scratch/CIMA_tutorial/data/chr3\"\n", "\n", "info_file = \"/scratch/CIMA_tutorial/data/chr3/Info_chr3.txt\"\n", "column_round = \"Round\"\n", "column_name = \"Name\"\n", "column_size = \"Size(kb)\"\n", "column_chr = \"Chr\"\n", "column_start_pos = \"Start(hg19)\"\n", "column_end_pos = \"End(hg19)\"\n", "\n", "number_of_steps_library = 16\n", "starting_step = 6\n", "\n", "prec_mean_dataset = 30\n", "\n", "save_plots = False\n", "\n", "file_suffix = \"chr3\"" ] }, { "cell_type": "code", "execution_count": 3, "id": "92e8536d", "metadata": {}, "outputs": [], "source": [ "work_folder = Path(work_folder)\n", "data_folder = Path(data_folder)\n", "pre_assessment_folder = Path(work_folder, \"2_experimental_quality_assessment\")\n", "reconstruction_tsv_folder = Path(work_folder, \"3_3D_reconstruction_assessment\", \"tsvs\", file_suffix)\n", "\n", "info_df = pd.read_table(Path(info_file), sep=\"\\t\")\n", "info_df.columns = info_df.columns.str.strip() # In case there are leading or trailing spaces in the column names\n", "\n", "info_file_dict = {\n", " \"column_round\" : column_round,\n", " \"column_name\" : column_name,\n", " \"column_size\" : column_size,\n", " \"column_chr\" : column_chr,\n", " \"column_start_pos\" : column_start_pos,\n", " \"column_end_pos\" : column_end_pos,\n", "}\n", "\n", "del column_round, column_name, column_size, column_chr, column_start_pos, column_end_pos, info_file" ] }, { "cell_type": "markdown", "id": "6d73273a", "metadata": {}, "source": [ "## Exporting the data with CIMA" ] }, { "cell_type": "markdown", "id": "85c2d610", "metadata": {}, "source": [ "Content:\n", "- [Creating the list of CIMA StructuralObject](#creating-the-list-of-cima-structuralobject)\n", "- [Checking the information 
file](#checking-the-information-file)\n", "- [Creating a color palette and the results files](#creating-a-color-palette-and-the-results-folders-for-the-tsv-and-plot-files)\n", "\n", "[Back to the general index](#morphological-features-extraction)" ] }, { "cell_type": "markdown", "id": "e9c8ffb6", "metadata": {}, "source": [ "### Creating the list of CIMA StructuralObject" ] }, { "cell_type": "markdown", "id": "07b3463a", "metadata": {}, "source": [ "In this cell we create a list of StructuralObjects (the main CIMA class type). \n", "This list is the basis to do the assessment through this Jupyter Notebook. \n", "\n", "When loading the data, we will also take out those traces or CS that did not pass the assessment in the previous notebooks. \n", " \n", "When reading the file, it will also add some metadata gathered from the filename of the csv file:\n", "\n", "\n", "[Back to the section index](#exporting-the-data-with-cima)" ] }, { "cell_type": "code", "execution_count": 4, "id": "99dc6c02", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total number of files found: 33\n", "Loading the experimental quality data from the CSV files\n", "Loading the bad flags from the CSV files\n", "\tCreating the 'flag' column for reconstruction-based removal\n", "Skipping file 2: 2018-07-10_nuc04_A_chr3 because it did not pass the experimental quality assessment.\n", "Skipping file 5: 2018-07-10_nuc03_A_chr3 because it did not pass the experimental quality assessment.\n", "\tExcluding timepoint 19 in experimentID: 2018-09-04, nucleusID: 7, homolog: A because it has less than 10 points.\n", "Skipping file 26: 2018-09-04_nuc01_A_chr3 because it did not pass the experimental quality assessment.\n", "Processing file 33: 2018-09-04_nuc06_A_chr3\n", "Total number of objects created: 30\n" ] } ], "source": [ "# Get a list of all CSV files in the specified data folder\n", "csv_files = list(data_folder.glob('*.csv'))\n", "\n", "if len(csv_files) == 0:\n", 
" print(\"No files found, please input the correct work_folder\")\n", "else:\n", " print(f\"Total number of files found: {len(csv_files)}\")\n", "\n", "print(\"Loading the experimental quality data from the CSV files\")\n", "traces_pass = pd.read_csv(Path(pre_assessment_folder, f\"experimental_assessment_pass_{file_suffix}.txt\"), sep=\"\\t\", names=[\"trace\"])\n", "\n", "print(\"Loading the bad flags from the CSV files\")\n", "segments_to_remove_CCC = pd.read_csv(Path(reconstruction_tsv_folder, f'segments2remove_bycccvariation_t2{(\"_\" + file_suffix) if file_suffix != \"\" else \"\"}.tsv'), sep=\"\\t\")\n", "if \"flag\" not in segments_to_remove_CCC.columns and not segments_to_remove_CCC.empty:\n", " print(\"\\tCreating the 'flag' column for reconstruction-based removal\")\n", " segments_to_remove_CCC[\"flag\"] = segments_to_remove_CCC[[\"experimentID\", \"nucleusID\", \"homolog\", \"timepoint\"]].astype(str).apply(\"_\".join, axis=1)\n", " bad_flags = set(segments_to_remove_CCC[\"flag\"].astype(str).values)\n", "elif not segments_to_remove_CCC.empty:\n", " bad_flags = set(segments_to_remove_CCC[\"flag\"].astype(str).values)\n", "else:\n", " bad_flags = set()\n", "\n", "# Get a list of metadata keys to extract from filenames\n", "metadata_keys = [\"nucleusID\", \"cellID\", \"locationID\", \"date\", \"homolog\", \"chromosome\"]\n", "\n", "# Initialize an empty list to store the StructuralObject instances\n", "obj_list = []\n", "\n", "if len(csv_files) > 0:\n", " # Iterate over each file in the list of file names\n", " for count, filein in enumerate(csv_files, start=1):\n", "\n", " stem = Path(filein).stem\n", "\n", " if not stem in traces_pass[\"trace\"].values:\n", " print(f\"Skipping file {count}: {stem} because it did not pass the experimental quality assessment.\")\n", " continue\n", "\n", " print(f\"Processing file {count}: {stem}\", end=\"\\r\")\n", "\n", " # We mine the metadata from the filename using regex patterns. 
If not possible, assign default values.\n", "        # In the case of nucleus, cell and location, the default is the filecount (first file read will be number 1, ...)\n", "        # In the case of date, if no date can be found, the filename gets used.\n", "        # If no valid homolog denomination is found, the default value is A.\n", "        # If no valid chromosome denomination is found, the default value is chrN.\n", "        metadata_CIMA = {\n", "            key: (match.group(1) if (match := regex_patterns[key].search(stem)) else default)\n", "            for key, default in zip(metadata_keys, [count, count, count, stem, \"A\", \"chrN\"])\n", "        }\n", "\n", "        # Adjust the metadata for nucleus and cell (if nucleus is count, but cell is something, use the cell match in nucleus)\n", "        if \"cellID\" in metadata_CIMA and metadata_CIMA[\"nucleusID\"] == count:\n", "            metadata_CIMA[\"nucleusID\"] = metadata_CIMA[\"cellID\"]\n", "\n", "        # Set experimentID as date for consistency\n", "        metadata_CIMA[\"experimentID\"] = metadata_CIMA[\"date\"]\n", "\n", "        # Read the CSV file and create a StructuralObject instance\n", "        objin = CSVParser.read_CSV_file(filein.as_posix(), metadata = metadata_CIMA, content_type = \"srx\")\n", "\n", "        # Filter the atomList to include only the specified timepoints in the info file\n", "        objin.atomList = objin.atomList[(objin.atomList['timepoint'] >= starting_step - 1) &\n", "                                        (objin.atomList['timepoint'] < (starting_step + number_of_steps_library - 1))].copy()\n", "\n", "        experiment_id = objin.metadata[\"experimentID\"]\n", "        nucleus_id = int(objin.metadata[\"nucleusID\"])\n", "        homolog = objin.metadata[\"homolog\"]\n", "\n", "        full_flags = (experiment_id + \"_\" + str(nucleus_id) + \"_\" + homolog + \"_\" + objin.atomList[\"timepoint\"].astype(str))\n", "        remove_mask = full_flags.isin(bad_flags)\n", "\n", "        if remove_mask.any():\n", "            removed_tps = objin.atomList.loc[remove_mask, \"timepoint\"].unique()\n", "            # removed_tps = \",\".join([tp_names[tp] for tp in removed_tps])\n", "            # print(f\"Excluding timepoints {removed_tps} in experimentID: {experiment_id}, nucleusID: {nucleus_id}, homolog: {homolog} due to poor assessment.\")\n", "            objin.atomList = objin.atomList.loc[~remove_mask].copy()\n", "\n", "        for tp in objin.atomList[\"timepoint\"].unique():\n", "            # Take out the timepoint from the atomList if it has fewer than 10 points\n", "            tp_mask = objin.atomList[\"timepoint\"] == tp\n", "            if tp_mask.sum() < 10:\n", "                print(f\"\\tExcluding timepoint {tp} in experimentID: {experiment_id}, nucleusID: {nucleus_id}, homolog: {homolog} because it has less than 10 points.\")\n", "                objin.atomList = objin.atomList.loc[~tp_mask].copy()\n", "\n", "        # Append the object to the list\n", "        obj_list.append(objin)\n", "\n", "    # Nice print to tell the user the number of objects created.\n", "    print(f\"\\nTotal number of objects created: {len(obj_list)}\")\n", "\n", "    # Clean up memory by deleting some of the variables no longer needed\n", "    del count, filein, stem, metadata_CIMA, match, objin\n", "\n", "del csv_files, metadata_keys\n", "del bad_flags\n", "del experiment_id, nucleus_id, homolog, full_flags, remove_mask\n", "if 'segments_to_remove_CCC' in locals():\n", "    del segments_to_remove_CCC" ] }, { "cell_type": "markdown", "id": "caa19ca6", "metadata": {}, "source": [ "### Checking the information file" ] }, { "cell_type": "markdown", "id": "f7a2a8f2", "metadata": {}, "source": [ "This cell adjusts the info_df. \n", "\n", "It checks whether the column of info_df that links this data frame to our walks uses the same timepoints as the walks. A mismatch can occur because most programs start at timepoint 0, while most information files start at timepoint 1. \n", "\n", "This creates a new column in the dataframe called 'timepoint'. 
This column will be modified so it has the same range as the timepoint column from the csv files.\n", "\n", "[Back to the section index](#exporting-the-data-with-cima)" ] }, { "cell_type": "code", "execution_count": 5, "id": "73ebe4db", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "List of timepoints found across all objects: [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]\n", "List of timepoints in the information file: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]\n", "\n", "Note: The column for the information file that has the timepoint is offset by 1 compared to the objects, adjusting accordingly.\n" ] } ], "source": [ "list_of_timepoints = sorted({int(tp) for obj in obj_list for tp in obj.atomList.timepoint.unique()})\n", "info_of_timepoints = list(map(int, info_df[info_file_dict[\"column_round\"]].tolist()))\n", "\n", "print(f'List of timepoints found across all objects: {list_of_timepoints}')\n", "print(f'List of timepoints in the information file: {info_of_timepoints}\\n')\n", "\n", "if list_of_timepoints == info_of_timepoints:\n", "    print('The timepoints found in the objects match those in the information file.')\n", "    info_df[\"timepoint\"] = info_df[info_file_dict[\"column_round\"]]\n", "elif set(list_of_timepoints).issubset(info_of_timepoints):\n", "    print('Note: The timepoints found in the objects are a subset of those in the information file. Not all timepoints are present.')\n", "    info_df[\"timepoint\"] = info_df[info_file_dict[\"column_round\"]]\n", "elif list_of_timepoints == [x - 1 for x in info_of_timepoints]:\n", "    print('Note: The column for the information file that has the timepoint is offset by 1 compared to the objects, adjusting accordingly.')\n", "    info_df[\"timepoint\"] = info_df[info_file_dict[\"column_round\"]] - 1\n", "elif set(list_of_timepoints).issubset({x - 1 for x in info_of_timepoints}):\n", "    print('Note: The timepoints found in the objects are a subset of those in the information file, which are offset by 1 compared to the objects. Adjusting accordingly.')\n", "    info_df[\"timepoint\"] = info_df[info_file_dict[\"column_round\"]] - 1\n", "else:\n", "    print('Warning: The timepoints found in the objects do not match those in the information file.')\n", "\n", "if info_file_dict[\"column_size\"] == \"\":\n", "    print(\"Calculating size column from start and end positions...\")\n", "    column_size = \"size\"\n", "    info_file_dict[\"column_size\"] = column_size\n", "    info_df[column_size] = info_df[info_file_dict[\"column_end_pos\"]] - info_df[info_file_dict[\"column_start_pos\"]]\n", "\n", "if info_file_dict[\"column_name\"] == \"\":\n", "    print(\"Setting name column from round numbers...\")\n", "    info_file_dict[\"column_name\"] = \"name\"\n", "    info_df[\"name\"] = [f\"Step{i}\" for i in range(1, info_df.shape[0]+1)]\n", "\n", "tp_names = info_df.set_index(\"timepoint\")[info_file_dict[\"column_name\"]].to_dict()\n", "\n", "del list_of_timepoints, info_of_timepoints" ] }, { "cell_type": "markdown", "id": "36fc7977", "metadata": {}, "source": [ "### Creating a color palette and the results folders for the tsv and plot files\n", "\n", "Here we create a color palette automatically based on the different experiment IDs found across the files. Experiment IDs are usually the date of the experiment. 
This makes it easy to check for a possible batch effect.\n", "\n", "We also create the folder where the results are stored.\n", "\n", "\n", "[Back to the section index](#exporting-the-data-with-cima)" ] }, { "cell_type": "code", "execution_count": 6, "id": "1215c5d5", "metadata": {}, "outputs": [], "source": [ "# Define the color palette for experiments. Use the date or the experimentID (date) as the key.\n", "color_palette = {exp_date: color for exp_date, color in zip(\n", "    # We get a set of unique dates from the metadata of the objects\n", "    set([file.metadata[\"experimentID\"] for file in obj_list]),\n", "    # We assign a color from the tab10 palette for each unique date\n", "    sns.color_palette(\"tab10\", n_colors=len(set([file.metadata[\"experimentID\"] for file in obj_list])))\n", ")}\n", "\n", "# This palette is fixed for the current experiments used in the tutorial, which aim to reproduce the figures in the CIMA paper.\n", "color_palette = {'2018-01-11': '#a6cee3', '2018-06-28': '#1f78b4', '2018-07-10': '#b2df8a', '2018-09-04': '#33a02c'}\n", "\n", "# Create the folder to save the results. If the folder already exists, do not raise an error.\n", "fof_ct_folder = Path(work_folder, \"FOF_CT_files\", file_suffix)\n", "fof_ct_folder.mkdir(exist_ok=True, parents=True)" ] }, { "cell_type": "markdown", "id": "d04a54de", "metadata": {}, "source": [ "## Writting to FOF-CT volumetic format" ] }, { "cell_type": "markdown", "id": "f5a1bdf3", "metadata": {}, "source": [ "CIMA can write the contents of an object into the FOF_CT volumetric format. 
\n", "This format is composed of 3 different tables (core.csv, trace.csv and spot.csv)." ] }, { "cell_type": "markdown", "id": "59fe8edb", "metadata": {}, "source": [ "#### core.csv\n", "This table has information about the Spot_ID. In our case, the Spot_ID corresponds to a chromatin segment. Internally it is called 4dn_FOF_CT_core. The columns of this table are as follows:\n", "" ] }, { "cell_type": "markdown", "id": "3ab2c6eb", "metadata": {}, "source": [ "#### trace.csv\n", "This table has information about the Trace_ID. In our case, the Trace_ID corresponds to the trace. Internally it appears to be called 4dn_FOF_CT_trace. The columns of this table are as follows:\n", "" ] }, { "cell_type": "markdown", "id": "ea19fe74", "metadata": {}, "source": [ "#### spot.csv\n", "This table has the data that each Spot_ID contains. In our case, this table contains the blinks themselves. Internally it is called 4dn_FOF_CT_quality. The columns of this table are as follows:\n", "" ] }, { "cell_type": "markdown", "id": "435a94ae", "metadata": {}, "source": [ "For the fof_volumetric_ct_writer function to work, the user must provide some metadata about the experiment. 
It is described below:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 7, "id": "54257d4a", "metadata": {}, "outputs": [], "source": [ "GENOME_ASSEMBLY = \"hg19\"\n", "LAB_NAME = \"Wu Lab\"\n", "EXPERIMENTER_NAME = \"Ting Wu, Guy Nir\"\n", "EXPERIMENTER_CONTACT = \"ting.wu@example.com, guy.nir@example.com\"\n", "DESCRIPTION = \"chr3 files\"\n", "\n", "# This is just an example; fill this dictionary with the actual fluorophores used in the experiment.\n", "# The structure of the dictionary should be {timepoint: fluorophore}, where timepoint is the integer timepoint and fluorophore is a string with the name of the fluorophore used in that timepoint.\n", "# The numbering of timepoints should be consistent with the timepoint column in the CSV file.\n", "# If you are using the structure of the tutorial files, you have a DataFrame with a column named \"timepoint\" that has the timepoint information.\n", "# Example of the dictionary structure:\n", "FLUOROPHORE_DICT = {\n", "    5: 'Alexa647',\n", "    6: 'Alexa647',\n", "    7: 'Alexa647',\n", "    8: 'Alexa647',\n", "    9: 'Alexa647',\n", "    10: 'Alexa647',\n", "    11: 'Alexa647',\n", "    12: 'Alexa647',\n", "    13: 'Alexa647',\n", "    14: 'Alexa647',\n", "    15: 'Alexa647',\n", "    16: 'Alexa647',\n", "    17: 'Alexa647',\n", "    18: 'Alexa647',\n", "    19: 'Alexa647',\n", "    20: 'Alexa647'}\n", "\n", "# This is a more compact way to create the same dictionary using data available through the tutorial notebooks.\n", "# This only works if you have only ONE fluorophore across all timepoints.\n", "# FLUOROPHORE_DICT = {tp: color for tp, color in product(info_df[\"timepoint\"], [\"Alexa647\"])}" ] }, { "cell_type": "markdown", "id": "417e02e5", "metadata": {}, "source": [ "Called as shown below, the function writes a volumetric FOF_CT-compliant file.
\n", "If you do not pass the dictionary that contains the fluorophore for every timepoint, the function will raise an error.
\n", "Similarly, if the dictionary does not contain the same number of keys as there are timepoints in the dataset, the function will raise an error.
\n", "\n", "The function assumes that the order of the regions matches the order of timepoints in the file." ] }, { "cell_type": "code", "execution_count": 8, "id": "9c2db27d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Processing file 2018-09-04_nuc03_A_chr3 [1/30]\n", "\tTime segment not found for region chr3:150000000-150500000, skipping...\n", "\tTime segment not found for region chr3:152500000-153000000, skipping...\n", "\tTime segment not found for region chr3:153500000-154000000, skipping...\n", "Processing file 2018-07-10_nuc02_A_chr3 [2/30]\n", "Processing file 2018-07-10_nuc05_A_chr3 [3/30]\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-06-28_nuc07_A_chr3 [4/30]\n", "\tTime segment not found for region chr3:150000000-150500000, skipping...\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-06-28_nuc09_A_chr3 [5/30]\n", "Processing file 2018-07-10_nuc11_A_chr3 [6/30]\n", "Processing file 2018-09-04_nuc01_B_chr3 [7/30]\n", "Processing file 2018-07-10_nuc06_B_chr3 [8/30]\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-06-28_nuc05_B_chr3 [9/30]\n", "Processing file 2018-07-10_nuc06_A_chr3 [10/30]\n", "Processing file 2018-06-28_nuc08_A_chr3 [11/30]\n", "Processing file 2018-06-28_nuc05_A_chr3 [12/30]\n", "\tTime segment not found for region chr3:152000000-152500000, skipping...\n", "Processing file 2018-07-10_nuc07_A_chr3 [13/30]\n", "\tTime segment not found for region chr3:156500000-157000000, skipping...\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-07-10_nuc01_A_chr3 [14/30]\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-07-10_nuc02_B_chr3 [15/30]\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", 
"Processing file 2018-06-28_nuc01_A_chr3 [16/30]\n", "\tTime segment not found for region chr3:150000000-150500000, skipping...\n", "Processing file 2018-09-04_nuc04_A_chr3 [17/30]\n", "Processing file 2018-07-10_nuc04_B_chr3 [18/30]\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-09-04_nuc03_B_chr3 [19/30]\n", "Processing file 2018-07-10_nuc03_B_chr3 [20/30]\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-07-10_nuc01_B_chr3 [21/30]\n", "Processing file 2018-09-04_nuc07_A_chr3 [22/30]\n", "\tTime segment not found for region chr3:151500000-152000000, skipping...\n", "\tTime segment not found for region chr3:157000000-157500000, skipping...\n", "Processing file 2018-06-28_nuc02_A_chr3 [23/30]\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-07-10_nuc10_A_chr3 [24/30]\n", "Processing file 2018-07-10_nuc08_A_chr3 [25/30]\n", "Processing file 2018-09-04_nuc05_A_chr3 [26/30]\n", "Processing file 2018-06-28_nuc01_B_chr3 [27/30]\n", "Processing file 2018-09-04_nuc10_A_chr3 [28/30]\n", "\tTime segment not found for region chr3:157000000-157500000, skipping...\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-06-28_nuc07_B_chr3 [29/30]\n", "\tTime segment not found for region chr3:150000000-150500000, skipping...\n", "\tTime segment not found for region chr3:152000000-152500000, skipping...\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-09-04_nuc06_A_chr3 [30/30]\n" ] } ], "source": [ "fof_ct_metadata = {\"genome_assembly\": GENOME_ASSEMBLY,\n", " \"lab_name\": LAB_NAME,\n", " \"experimenter_name\": EXPERIMENTER_NAME,\n", " \"experimenter_contact\": EXPERIMENTER_CONTACT,\n", " \"description\": DESCRIPTION,\n", " }\n", "\n", "fof_volumetric_ct_writer(table_folder=fof_ct_folder,\n", " 
data2write=obj_list,\n", " fof_metadata=fof_ct_metadata,\n", " reg_dict={row[\"timepoint\"]: (row[\"Chr\"], row[\"Start(hg19)\"], row[\"End(hg19)\"]) for _, row in info_df.iterrows()},\n", " fluor=FLUOROPHORE_DICT,\n", " file_suffix = file_suffix\n", " )" ] }, { "cell_type": "markdown", "id": "96a3f9d5", "metadata": {}, "source": [ "Using the same function we can turn volumetric format off to write as a standard ball and stick FOF_CT format" ] }, { "cell_type": "code", "execution_count": 9, "id": "d2485792", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Processing file 2018-09-04_nuc03_A_chr3 [1/30]\n", "\tTime segment not found for region chr3:150000000-150500000, skipping...\n", "\tTime segment not found for region chr3:152500000-153000000, skipping...\n", "\tTime segment not found for region chr3:153500000-154000000, skipping...\n", "Processing file 2018-07-10_nuc02_A_chr3 [2/30]\n", "Processing file 2018-07-10_nuc05_A_chr3 [3/30]\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-06-28_nuc07_A_chr3 [4/30]\n", "\tTime segment not found for region chr3:150000000-150500000, skipping...\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-06-28_nuc09_A_chr3 [5/30]\n", "Processing file 2018-07-10_nuc11_A_chr3 [6/30]\n", "Processing file 2018-09-04_nuc01_B_chr3 [7/30]\n", "Processing file 2018-07-10_nuc06_B_chr3 [8/30]\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-06-28_nuc05_B_chr3 [9/30]\n", "Processing file 2018-07-10_nuc06_A_chr3 [10/30]\n", "Processing file 2018-06-28_nuc08_A_chr3 [11/30]\n", "Processing file 2018-06-28_nuc05_A_chr3 [12/30]\n", "\tTime segment not found for region chr3:152000000-152500000, skipping...\n", "Processing file 2018-07-10_nuc07_A_chr3 [13/30]\n", "\tTime segment not found for region chr3:156500000-157000000, skipping...\n", "\tTime 
segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-07-10_nuc01_A_chr3 [14/30]\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-07-10_nuc02_B_chr3 [15/30]\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-06-28_nuc01_A_chr3 [16/30]\n", "\tTime segment not found for region chr3:150000000-150500000, skipping...\n", "Processing file 2018-09-04_nuc04_A_chr3 [17/30]\n", "Processing file 2018-07-10_nuc04_B_chr3 [18/30]\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-09-04_nuc03_B_chr3 [19/30]\n", "Processing file 2018-07-10_nuc03_B_chr3 [20/30]\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-07-10_nuc01_B_chr3 [21/30]\n", "Processing file 2018-09-04_nuc07_A_chr3 [22/30]\n", "\tTime segment not found for region chr3:151500000-152000000, skipping...\n", "\tTime segment not found for region chr3:157000000-157500000, skipping...\n", "Processing file 2018-06-28_nuc02_A_chr3 [23/30]\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-07-10_nuc10_A_chr3 [24/30]\n", "Processing file 2018-07-10_nuc08_A_chr3 [25/30]\n", "Processing file 2018-09-04_nuc05_A_chr3 [26/30]\n", "Processing file 2018-06-28_nuc01_B_chr3 [27/30]\n", "Processing file 2018-09-04_nuc10_A_chr3 [28/30]\n", "\tTime segment not found for region chr3:157000000-157500000, skipping...\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-06-28_nuc07_B_chr3 [29/30]\n", "\tTime segment not found for region chr3:150000000-150500000, skipping...\n", "\tTime segment not found for region chr3:152000000-152500000, skipping...\n", "\tTime segment not found for region chr3:157500000-158000000, skipping...\n", "Processing file 2018-09-04_nuc06_A_chr3 [30/30]\n" ] 
} ], "source": [ "fof_ct_metadata = {\"genome_assembly\": GENOME_ASSEMBLY,\n", " \"lab_name\": LAB_NAME,\n", " \"experimenter_name\": EXPERIMENTER_NAME,\n", " \"experimenter_contact\": EXPERIMENTER_CONTACT,\n", " \"description\": DESCRIPTION,\n", " }\n", "\n", "fof_volumetric_ct_writer(table_folder=fof_ct_folder,\n", " data2write=obj_list,\n", " fof_metadata=fof_ct_metadata,\n", " reg_dict={row[\"timepoint\"]: (row[\"Chr\"], row[\"Start(hg19)\"], row[\"End(hg19)\"]) for _, row in info_df.iterrows()},\n", " # fluor=FLUOROPHORE_DICT,\n", " file_suffix = file_suffix,\n", " volumetric_format = False\n", " )" ] }, { "cell_type": "markdown", "id": "107aed50", "metadata": {}, "source": [ "## Reading the FOF-CT volumetic format" ] }, { "cell_type": "markdown", "id": "6f5c29b0", "metadata": {}, "source": [ "In a similar fashion, CIMA can read the core.csv, trace.csv and spot.csv created by the fof_ct_writer function. With this we can recreate a list of CIMA objects similar to the original one." ] }, { "cell_type": "code", "execution_count": 8, "id": "79eb7274", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reconstructing file: loc1_nuc3_A\n", "Reconstructing file: loc3_nuc2_A\n", "Reconstructing file: loc4_nuc5_A\n", "Reconstructing file: loc6_nuc7_A\n", "Reconstructing file: loc7_nuc9_A\n", "Reconstructing file: loc8_nuc11_A\n", "Reconstructing file: loc9_nuc1_B\n", "Reconstructing file: loc10_nuc6_B\n", "Reconstructing file: loc11_nuc5_B\n", "Reconstructing file: loc12_nuc6_A\n", "Reconstructing file: loc13_nuc8_A\n", "Reconstructing file: loc14_nuc5_A\n", "Reconstructing file: loc15_nuc7_A\n", "Reconstructing file: loc16_nuc1_A\n", "Reconstructing file: loc17_nuc2_B\n", "Reconstructing file: loc18_nuc1_A\n", "Reconstructing file: loc19_nuc4_A\n", "Reconstructing file: loc20_nuc4_B\n", "Reconstructing file: loc21_nuc3_B\n", "Reconstructing file: loc22_nuc3_B\n", "Reconstructing file: loc23_nuc1_B\n", "Reconstructing file: 
loc24_nuc7_A\n", "Reconstructing file: loc25_nuc2_A\n", "Reconstructing file: loc27_nuc10_A\n", "Reconstructing file: loc28_nuc8_A\n", "Reconstructing file: loc29_nuc5_A\n", "Reconstructing file: loc30_nuc1_B\n", "Reconstructing file: loc31_nuc10_A\n", "Reconstructing file: loc32_nuc7_B\n", "Reconstructing file: loc33_nuc6_A\n" ] } ], "source": [ "core_file = Path(fof_ct_folder, f\"core_{file_suffix}_volumetric.csv\")\n", "trace_file = Path(fof_ct_folder, f\"trace_{file_suffix}_volumetric.csv\")\n", "spot_file = Path(fof_ct_folder, f\"spot_{file_suffix}_volumetric.csv\")\n", "\n", "obj_CIMA_list = fof_volumetric_ct_reader(fof_core_file=core_file, fof_spot_file=spot_file, fof_trace_file=trace_file)" ] }, { "cell_type": "code", "execution_count": 9, "id": "f47ef0cb", "metadata": {}, "outputs": [ { "data": { "application/vnd.microsoft.datawrangler.viewer.v0+json": { "columns": [ { "name": "index", "rawType": "int64", "type": "integer" }, { "name": "imageID", "rawType": "int64", "type": "integer" }, { "name": "cycle", "rawType": "int64", "type": "integer" }, { "name": "zstep", "rawType": "int64", "type": "integer" }, { "name": "frame", "rawType": "int64", "type": "integer" }, { "name": "accum", "rawType": "int64", "type": "integer" }, { "name": "photoncount", "rawType": "int64", "type": "integer" }, { "name": "photoncount11", "rawType": "int64", "type": "integer" }, { "name": "photoncount12", "rawType": "int64", "type": "integer" }, { "name": "photoncount21", "rawType": "int64", "type": "integer" }, { "name": "photoncount22", "rawType": "int64", "type": "integer" }, { "name": "psfx", "rawType": "int64", "type": "integer" }, { "name": "psfy", "rawType": "int64", "type": "integer" }, { "name": "psfz", "rawType": "int64", "type": "integer" }, { "name": "psfphotoncount", "rawType": "int64", "type": "integer" }, { "name": "x", "rawType": "float64", "type": "float" }, { "name": "y", "rawType": "float64", "type": "float" }, { "name": "z", "rawType": "float64", "type": "float" 
}, { "name": "stdev", "rawType": "int64", "type": "integer" }, { "name": "amp", "rawType": "int64", "type": "integer" }, { "name": "background11", "rawType": "int64", "type": "integer" }, { "name": "background12", "rawType": "int64", "type": "integer" }, { "name": "background21", "rawType": "int64", "type": "integer" }, { "name": "background22", "rawType": "int64", "type": "integer" }, { "name": "maxResidualSlope", "rawType": "int64", "type": "integer" }, { "name": "chi", "rawType": "int64", "type": "integer" }, { "name": "loglike", "rawType": "int64", "type": "integer" }, { "name": "accuracy", "rawType": "int64", "type": "integer" }, { "name": "llr", "rawType": "int64", "type": "integer" }, { "name": "clusterID", "rawType": "int64", "type": "integer" }, { "name": "xprec", "rawType": "float64", "type": "float" }, { "name": "yprec", "rawType": "float64", "type": "float" }, { "name": "zprec", "rawType": "float64", "type": "float" }, { "name": "timepoint", "rawType": "int64", "type": "integer" }, { "name": "record_time", "rawType": "int64", "type": "integer" }, { "name": "record_name", "rawType": "str", "type": "string" }, { "name": "chromosomes", "rawType": "int64", "type": "integer" }, { "name": "s11", "rawType": "float64", "type": "float" }, { "name": "s12", "rawType": "float64", "type": "float" }, { "name": "shiftz", "rawType": "float64", "type": "float" }, { "name": "mass", "rawType": "object", "type": "unknown" } ], "ref": "b250a615-4f54-4629-979b-666c8e9d14e6", "rows": [ [ "0", "335775", "0", "6", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "11777.0", "1338.5799560546875", "1321.719970703125", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "10.845", "9.96838", "85.8496", "6", "0", "sLOC", "0", null, null, "1321.719970703125", "10.0" ], [ "1", "335783", "0", "6", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "11786.400390625", "1340.9599609375", "1337.5400390625", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", 
"11.8319", "10.767", "90.884", "6", "0", "sLOC", "0", null, null, "1337.5400390625", "10.0" ], [ "2", "336116", "0", "8", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "11702.7998046875", "1566.4200439453125", "1431.239990234375", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "5.01669", "4.67027", "37.7125", "6", "0", "sLOC", "0", null, null, "1431.239990234375", "10.0" ], [ "3", "336118", "0", "8", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "11719.599609375", "1555.4599609375", "1376.93994140625", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "7.10123", "6.80031", "60.6664", "6", "0", "sLOC", "0", null, null, "1376.93994140625", "10.0" ], [ "4", "336119", "0", "8", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "11706.7998046875", "1550.989990234375", "1417.5799560546875", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "6.98016", "6.51492", "55.7733", "6", "0", "sLOC", "0", null, null, "1417.5799560546875", "10.0" ] ], "shape": { "columns": 40, "rows": 5 } }, "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>imageID</th><th>cycle</th><th>zstep</th><th>frame</th><th>accum</th><th>photoncount</th><th>photoncount11</th><th>photoncount12</th><th>photoncount21</th><th>photoncount22</th><th>...</th><th>yprec</th><th>zprec</th><th>timepoint</th><th>record_time</th><th>record_name</th><th>chromosomes</th><th>s11</th><th>s12</th><th>shiftz</th><th>mass</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>335775</td><td>0</td><td>6</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>9.96838</td><td>85.8496</td><td>6</td><td>0</td><td>sLOC</td><td>0</td><td>NaN</td><td>NaN</td><td>1321.719971</td><td>10.0</td></tr>\n",
"<tr><th>1</th><td>335783</td><td>0</td><td>6</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>10.76700</td><td>90.8840</td><td>6</td><td>0</td><td>sLOC</td><td>0</td><td>NaN</td><td>NaN</td><td>1337.540039</td><td>10.0</td></tr>\n",
"<tr><th>2</th><td>336116</td><td>0</td><td>8</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>4.67027</td><td>37.7125</td><td>6</td><td>0</td><td>sLOC</td><td>0</td><td>NaN</td><td>NaN</td><td>1431.239990</td><td>10.0</td></tr>\n",
"<tr><th>3</th><td>336118</td><td>0</td><td>8</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>6.80031</td><td>60.6664</td><td>6</td><td>0</td><td>sLOC</td><td>0</td><td>NaN</td><td>NaN</td><td>1376.939941</td><td>10.0</td></tr>\n",
"<tr><th>4</th><td>336119</td><td>0</td><td>8</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>6.51492</td><td>55.7733</td><td>6</td><td>0</td><td>sLOC</td><td>0</td><td>NaN</td><td>NaN</td><td>1417.579956</td><td>10.0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows × 40 columns</p>\n",
"</div>" ], "text/plain": [ " imageID cycle zstep frame accum photoncount photoncount11 \\\n", "0 335775 0 6 0 0 0 0 \n", "1 335783 0 6 0 0 0 0 \n", "2 336116 0 8 0 0 0 0 \n", "3 336118 0 8 0 0 0 0 \n", "4 336119 0 8 0 0 0 0 \n", "\n", " photoncount12 photoncount21 photoncount22 ... yprec zprec \\\n", "0 0 0 0 ... 9.96838 85.8496 \n", "1 0 0 0 ... 10.76700 90.8840 \n", "2 0 0 0 ... 4.67027 37.7125 \n", "3 0 0 0 ... 6.80031 60.6664 \n", "4 0 0 0 ... 6.51492 55.7733 \n", "\n", " timepoint record_time record_name chromosomes s11 s12 shiftz \\\n", "0 6 0 sLOC 0 NaN NaN 1321.719971 \n", "1 6 0 sLOC 0 NaN NaN 1337.540039 \n", "2 6 0 sLOC 0 NaN NaN 1431.239990 \n", "3 6 0 sLOC 0 NaN NaN 1376.939941 \n", "4 6 0 sLOC 0 NaN NaN 1417.579956 \n", "\n", " mass \n", "0 10.0 \n", "1 10.0 \n", "2 10.0 \n", "3 10.0 \n", "4 10.0 \n", "\n", "[5 rows x 40 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj_CIMA_list[0].atomList.head()" ] }, { "cell_type": "markdown", "id": "3f2b1578", "metadata": {}, "source": [ "## Reading the FOF-CT bs format" ] }, { "cell_type": "markdown", "id": "e2078f10", "metadata": {}, "source": [ "In a similar fashion, CIMA can read the core.csv and trace.csv files. With these we can recreate a list of CIMA objects similar to the original one.
\n", "\n", "**IMPORTANT:** The atomList of each object will be mostly filled with 0s, because most of the information is not available in this format.\n", "
" ] }, { "cell_type": "code", "execution_count": 10, "id": "b1d0b142", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reconstructing file: loc1_nuc3_A\n", "Reconstructing file: loc3_nuc2_A\n", "Reconstructing file: loc4_nuc5_A\n", "Reconstructing file: loc6_nuc7_A\n", "Reconstructing file: loc7_nuc9_A\n", "Reconstructing file: loc8_nuc11_A\n", "Reconstructing file: loc9_nuc1_B\n", "Reconstructing file: loc10_nuc6_B\n", "Reconstructing file: loc11_nuc5_B\n", "Reconstructing file: loc12_nuc6_A\n", "Reconstructing file: loc13_nuc8_A\n", "Reconstructing file: loc14_nuc5_A\n", "Reconstructing file: loc15_nuc7_A\n", "Reconstructing file: loc16_nuc1_A\n", "Reconstructing file: loc17_nuc2_B\n", "Reconstructing file: loc18_nuc1_A\n", "Reconstructing file: loc19_nuc4_A\n", "Reconstructing file: loc20_nuc4_B\n", "Reconstructing file: loc21_nuc3_B\n", "Reconstructing file: loc22_nuc3_B\n", "Reconstructing file: loc23_nuc1_B\n", "Reconstructing file: loc24_nuc7_A\n", "Reconstructing file: loc25_nuc2_A\n", "Reconstructing file: loc27_nuc10_A\n", "Reconstructing file: loc28_nuc8_A\n", "Reconstructing file: loc29_nuc5_A\n", "Reconstructing file: loc30_nuc1_B\n", "Reconstructing file: loc31_nuc10_A\n", "Reconstructing file: loc32_nuc7_B\n", "Reconstructing file: loc33_nuc6_A\n" ] } ], "source": [ "core_file = Path(fof_ct_folder, f\"core_{file_suffix}_bs.csv\")\n", "trace_file = Path(fof_ct_folder, f\"trace_{file_suffix}_bs.csv\")\n", "\n", "obj_CIMA_list = fof_bs_ct_reader(fof_core_file=core_file, fof_trace_file=trace_file)" ] }, { "cell_type": "code", "execution_count": 11, "id": "2e7a1592", "metadata": {}, "outputs": [ { "data": { "application/vnd.microsoft.datawrangler.viewer.v0+json": { "columns": [ { "name": "index", "rawType": "int64", "type": "integer" }, { "name": "imageID", "rawType": "int64", "type": "integer" }, { "name": "cycle", "rawType": "int64", "type": "integer" }, { "name": "zstep", "rawType": "int64", "type": "integer" }, { 
"name": "frame", "rawType": "int64", "type": "integer" }, { "name": "accum", "rawType": "int64", "type": "integer" }, { "name": "photoncount", "rawType": "int64", "type": "integer" }, { "name": "photoncount11", "rawType": "int64", "type": "integer" }, { "name": "photoncount12", "rawType": "int64", "type": "integer" }, { "name": "photoncount21", "rawType": "int64", "type": "integer" }, { "name": "photoncount22", "rawType": "int64", "type": "integer" }, { "name": "psfx", "rawType": "int64", "type": "integer" }, { "name": "psfy", "rawType": "int64", "type": "integer" }, { "name": "psfz", "rawType": "int64", "type": "integer" }, { "name": "psfphotoncount", "rawType": "int64", "type": "integer" }, { "name": "x", "rawType": "float64", "type": "float" }, { "name": "y", "rawType": "float64", "type": "float" }, { "name": "z", "rawType": "float64", "type": "float" }, { "name": "stdev", "rawType": "int64", "type": "integer" }, { "name": "amp", "rawType": "int64", "type": "integer" }, { "name": "background11", "rawType": "int64", "type": "integer" }, { "name": "background12", "rawType": "int64", "type": "integer" }, { "name": "background21", "rawType": "int64", "type": "integer" }, { "name": "background22", "rawType": "int64", "type": "integer" }, { "name": "maxResidualSlope", "rawType": "int64", "type": "integer" }, { "name": "chi", "rawType": "int64", "type": "integer" }, { "name": "loglike", "rawType": "int64", "type": "integer" }, { "name": "accuracy", "rawType": "int64", "type": "integer" }, { "name": "llr", "rawType": "int64", "type": "integer" }, { "name": "clusterID", "rawType": "int64", "type": "integer" }, { "name": "xprec", "rawType": "int64", "type": "integer" }, { "name": "yprec", "rawType": "int64", "type": "integer" }, { "name": "zprec", "rawType": "int64", "type": "integer" }, { "name": "timepoint", "rawType": "int64", "type": "integer" }, { "name": "record_time", "rawType": "int64", "type": "integer" }, { "name": "record_name", "rawType": "str", "type": 
"string" }, { "name": "chromosomes", "rawType": "int64", "type": "integer" }, { "name": "s11", "rawType": "float64", "type": "float" }, { "name": "s12", "rawType": "float64", "type": "float" }, { "name": "shiftz", "rawType": "float64", "type": "float" }, { "name": "mass", "rawType": "object", "type": "unknown" } ], "ref": "09367220-399a-4c12-9ec0-3f4b88a348db", "rows": [ [ "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "11715.0888671875", "1557.898193359375", "1481.0479736328125", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "1", "0", "sLOC", "0", null, null, "1481.0479736328125", "10.0" ], [ "1", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "12328.1142578125", "1338.091064453125", "1398.165283203125", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "2", "0", "sLOC", "0", null, null, "1398.165283203125", "10.0" ], [ "2", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "12419.5302734375", "1516.2933349609375", "963.5399780273438", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "3", "0", "sLOC", "0", null, null, "963.5399780273438", "10.0" ], [ "3", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "12283.1416015625", "1473.8743896484375", "951.85205078125", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "4", "0", "sLOC", "0", null, null, "951.85205078125", "10.0" ], [ "4", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "12763.0087890625", "1573.85546875", "919.4849243164062", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "6", "0", "sLOC", "0", null, null, "919.4849243164062", "10.0" ] ], "shape": { "columns": 40, "rows": 5 } }, "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>imageID</th><th>cycle</th><th>zstep</th><th>frame</th><th>accum</th><th>photoncount</th><th>photoncount11</th><th>photoncount12</th><th>photoncount21</th><th>photoncount22</th><th>...</th><th>yprec</th><th>zprec</th><th>timepoint</th><th>record_time</th><th>record_name</th><th>chromosomes</th><th>s11</th><th>s12</th><th>shiftz</th><th>mass</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>1</td><td>0</td><td>sLOC</td><td>0</td><td>NaN</td><td>NaN</td><td>1481.047974</td><td>10.0</td></tr>\n",
"<tr><th>1</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>2</td><td>0</td><td>sLOC</td><td>0</td><td>NaN</td><td>NaN</td><td>1398.165283</td><td>10.0</td></tr>\n",
"<tr><th>2</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>3</td><td>0</td><td>sLOC</td><td>0</td><td>NaN</td><td>NaN</td><td>963.539978</td><td>10.0</td></tr>\n",
"<tr><th>3</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>4</td><td>0</td><td>sLOC</td><td>0</td><td>NaN</td><td>NaN</td><td>951.852051</td><td>10.0</td></tr>\n",
"<tr><th>4</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>6</td><td>0</td><td>sLOC</td><td>0</td><td>NaN</td><td>NaN</td><td>919.484924</td><td>10.0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows × 40 columns</p>\n",
"</div>
" ], "text/plain": [ " imageID cycle zstep frame accum photoncount photoncount11 \\\n", "0 0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 0 \n", "\n", " photoncount12 photoncount21 photoncount22 ... yprec zprec timepoint \\\n", "0 0 0 0 ... 0 0 1 \n", "1 0 0 0 ... 0 0 2 \n", "2 0 0 0 ... 0 0 3 \n", "3 0 0 0 ... 0 0 4 \n", "4 0 0 0 ... 0 0 6 \n", "\n", " record_time record_name chromosomes s11 s12 shiftz mass \n", "0 0 sLOC 0 NaN NaN 1481.047974 10.0 \n", "1 0 sLOC 0 NaN NaN 1398.165283 10.0 \n", "2 0 sLOC 0 NaN NaN 963.539978 10.0 \n", "3 0 sLOC 0 NaN NaN 951.852051 10.0 \n", "4 0 sLOC 0 NaN NaN 919.484924 10.0 \n", "\n", "[5 rows x 40 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj_CIMA_list[0].atomList.head()" ] }, { "cell_type": "markdown", "id": "d32cc4e2", "metadata": {}, "source": [ "## Reading from XYZ coordinate format" ] }, { "cell_type": "markdown", "id": "93a139c7", "metadata": {}, "source": [ "CIMA also can read files that only contain X, Y, Z coordinates.
\n", "In this case we use the SegmentXYZ class, which only requires that the data contain the columns x, y and z. All other columns are preserved.
\n", "Here is an example using chr21 data." ] }, { "cell_type": "code", "execution_count": 12, "id": "f5d15789", "metadata": {}, "outputs": [ { "data": { "application/vnd.microsoft.datawrangler.viewer.v0+json": { "columns": [ { "name": "index", "rawType": "int64", "type": "integer" }, { "name": "Z(nm)", "rawType": "float64", "type": "float" }, { "name": "X(nm)", "rawType": "float64", "type": "float" }, { "name": "Y(nm)", "rawType": "float64", "type": "float" }, { "name": "Genomic coordinate", "rawType": "str", "type": "string" }, { "name": "Chromosome copy number", "rawType": "int64", "type": "integer" }, { "name": "Gene names", "rawType": "str", "type": "string" }, { "name": "Transcription", "rawType": "str", "type": "string" }, { "name": "TSS ZXY(nm)", "rawType": "str", "type": "string" } ], "ref": "7cde1dca-05e2-43dd-81a7-bf4f7d4a5c63", "rows": [ [ "0", "2449.0", "4700.0", "7234.0", "chr21:10400001-10450001", "1", null, null, null ], [ "1", "3731.0", "4629.0", "7409.0", "chr21:10500001-10550001", "1", null, null, null ], [ "2", "2248.0", "4690.0", "7148.0", "chr21:10600001-10650001", "1", null, null, null ], [ "3", "2211.0", "4065.0", "7567.0", "chr21:13250001-13300001", "1", null, null, null ], [ "4", "2499.0", "3904.0", "7255.0", "chr21:14000001-14050001", "1", null, null, null ] ], "shape": { "columns": 8, "rows": 5 } }, "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>Z(nm)</th><th>X(nm)</th><th>Y(nm)</th><th>Genomic coordinate</th><th>Chromosome copy number</th><th>Gene names</th><th>Transcription</th><th>TSS ZXY(nm)</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>2449.0</td><td>4700.0</td><td>7234.0</td><td>chr21:10400001-10450001</td><td>1</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>1</th><td>3731.0</td><td>4629.0</td><td>7409.0</td><td>chr21:10500001-10550001</td><td>1</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>2</th><td>2248.0</td><td>4690.0</td><td>7148.0</td><td>chr21:10600001-10650001</td><td>1</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>3</th><td>2211.0</td><td>4065.0</td><td>7567.0</td><td>chr21:13250001-13300001</td><td>1</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>4</th><td>2499.0</td><td>3904.0</td><td>7255.0</td><td>chr21:14000001-14050001</td><td>1</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Z(nm) X(nm) Y(nm) Genomic coordinate Chromosome copy number \\\n", "0 2449.0 4700.0 7234.0 chr21:10400001-10450001 1 \n", "1 3731.0 4629.0 7409.0 chr21:10500001-10550001 1 \n", "2 2248.0 4690.0 7148.0 chr21:10600001-10650001 1 \n", "3 2211.0 4065.0 7567.0 chr21:13250001-13300001 1 \n", "4 2499.0 3904.0 7255.0 chr21:14000001-14050001 1 \n", "\n", " Gene names Transcription TSS ZXY(nm) \n", "0 NaN NaN NaN \n", "1 NaN NaN NaN \n", "2 NaN NaN NaN \n", "3 NaN NaN NaN \n", "4 NaN NaN NaN " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from cima.segments.segment_info_xyz import SegmentXYZ\n", "\n", "chr21_bs_data = pd.read_csv('https://zenodo.org/records/3928890/files/chromosome21.tsv?download=1', sep='\\t')\n", "\n", "display(chr21_bs_data.head())" ] }, { "cell_type": "code", "execution_count": 13, "id": "adf4f016", "metadata": {}, "outputs": [ { "data": { "application/vnd.microsoft.datawrangler.viewer.v0+json": { "columns": [ { "name": "index", "rawType": "int64", "type": "integer" }, { "name": "x", "rawType": "float64", "type": "float" }, { "name": "y", "rawType": "float64", "type": "float" }, { "name": "z", "rawType": "float64", "type": "float" }, { "name": "gen_coord", "rawType": "str", "type": "string" }, { "name": "chr_copy_num", "rawType": "int64", "type": "integer" }, { "name": "timepoint", "rawType": "int64", "type": "integer" }, { "name": "clusterID", "rawType": "int64", "type": "integer" }, { "name": "record_name", "rawType": "str", "type": "string" }, { "name": "mass", "rawType": "float64", "type": "float" } ], "ref": "04c98968-ee8e-480c-9c5a-4dc1e2bf8d9b", "rows": [ [ "0", "4700.0", "7234.0", "2449.0", "chr21:10400001-10450001", "1", "0", "0", "sLOC", "10.0" ], [ "1", "4629.0", "7409.0", "3731.0", "chr21:10500001-10550001", "1", "0", "0", "sLOC", "10.0" ], [ "2", "4690.0", "7148.0", "2248.0", "chr21:10600001-10650001", "1", "0", "0", "sLOC", "10.0" ], [ "3", "4065.0", "7567.0", "2211.0", 
"chr21:13250001-13300001", "1", "0", "0", "sLOC", "10.0" ], [ "4", "3904.0", "7255.0", "2499.0", "chr21:14000001-14050001", "1", "0", "0", "sLOC", "10.0" ] ], "shape": { "columns": 9, "rows": 5 } }, "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>x</th><th>y</th><th>z</th><th>gen_coord</th><th>chr_copy_num</th><th>timepoint</th><th>clusterID</th><th>record_name</th><th>mass</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>4700.0</td><td>7234.0</td><td>2449.0</td><td>chr21:10400001-10450001</td><td>1</td><td>0</td><td>0</td><td>sLOC</td><td>10.0</td></tr>\n",
"<tr><th>1</th><td>4629.0</td><td>7409.0</td><td>3731.0</td><td>chr21:10500001-10550001</td><td>1</td><td>0</td><td>0</td><td>sLOC</td><td>10.0</td></tr>\n",
"<tr><th>2</th><td>4690.0</td><td>7148.0</td><td>2248.0</td><td>chr21:10600001-10650001</td><td>1</td><td>0</td><td>0</td><td>sLOC</td><td>10.0</td></tr>\n",
"<tr><th>3</th><td>4065.0</td><td>7567.0</td><td>2211.0</td><td>chr21:13250001-13300001</td><td>1</td><td>0</td><td>0</td><td>sLOC</td><td>10.0</td></tr>\n",
"<tr><th>4</th><td>3904.0</td><td>7255.0</td><td>2499.0</td><td>chr21:14000001-14050001</td><td>1</td><td>0</td><td>0</td><td>sLOC</td><td>10.0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " x y z gen_coord chr_copy_num timepoint \\\n", "0 4700.0 7234.0 2449.0 chr21:10400001-10450001 1 0 \n", "1 4629.0 7409.0 3731.0 chr21:10500001-10550001 1 0 \n", "2 4690.0 7148.0 2248.0 chr21:10600001-10650001 1 0 \n", "3 4065.0 7567.0 2211.0 chr21:13250001-13300001 1 0 \n", "4 3904.0 7255.0 2499.0 chr21:14000001-14050001 1 0 \n", "\n", " clusterID record_name mass \n", "0 0 sLOC 10.0 \n", "1 0 sLOC 10.0 \n", "2 0 sLOC 10.0 \n", "3 0 sLOC 10.0 \n", "4 0 sLOC 10.0 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "chr21_bs_data = chr21_bs_data[['X(nm)', 'Y(nm)', 'Z(nm)','Genomic coordinate','Chromosome copy number']]\n", "chr21_bs_data.rename({\n", " 'X(nm)': 'x',\n", " 'Y(nm)': 'y',\n", " 'Z(nm)': 'z',\n", " 'Genomic coordinate':'gen_coord',\n", " 'Chromosome copy number': 'chr_copy_num'},\n", " axis=1, inplace=True)\n", "\n", "chr21_bs_data_CIMA = SegmentXYZ(chr21_bs_data)\n", "\n", "display(chr21_bs_data_CIMA.atomList.head())" ] }, { "cell_type": "code", "execution_count": 15, "id": "29dc2c8e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0]\n" ] } ], "source": [ "print(chr21_bs_data_CIMA.atomList.timepoint.unique())" ] }, { "cell_type": "code", "execution_count": 17, "id": "8236d440", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "['chr21:10400001-10450001', 'chr21:10500001-10550001',\n", " 'chr21:10600001-10650001', 'chr21:13250001-13300001',\n", " 'chr21:14000001-14050001', 'chr21:14050001-14100001',\n", " 'chr21:14100001-14150001', 'chr21:14150001-14200001',\n", " 'chr21:14200001-14250001', 'chr21:14250001-14300001',\n", " ...\n", " 'chr21:46200001-46250001', 'chr21:46250001-46300001',\n", " 'chr21:46300001-46350001', 'chr21:46350001-46400001',\n", " 'chr21:46400001-46450001', 'chr21:46450001-46500001',\n", " 'chr21:46500001-46550001', 'chr21:46550001-46600001',\n", " 'chr21:46600001-46650001', 'chr21:46650001-46700001']\n", "Length: 
651, dtype: str\n" ] } ], "source": [ "print(chr21_bs_data_CIMA.atomList.gen_coord.unique())" ] } ], "metadata": { "kernelspec": { "display_name": "CIMA_testing", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.13" } }, "nbformat": 4, "nbformat_minor": 5 }