99 - Writting to and reading from different formats

Description: This notebook contains an example on how to write a list of CIMA objects into the FOF_CT volumetric

The following index has links to the different sections of the notebook. Some sections may have additional indexes to access the subparts.
If you are using VS Code to visualize the notebook, the links will not work. This is due to how VC Code render the notebook. To navigate the notebook use the outline panel. To know more about it, check this link.

Content:

Library imports and functions
Setting some variables
Exporting the data with CIMA
Writting to FOF-CT volumetric format
Reading the FOF-CT volumetric format
Reading from XYZ coordinates format

Library imports and functions

Content:

Library imports and functions
Setting some variables
Exporting the data with CIMA
Extracting morphological features

Back to the general index

import re
from pathlib import Path
from itertools import product

import pandas as pd
import seaborn as sns

from cima.parsers.parser_csv import CSVParser
from cima.utils.misc import fof_volumetric_ct_writer, fof_volumetric_ct_reader, fof_bs_ct_reader

## Regular expressions to get the metadata from the filename
regex_patterns = {
    "nucleusID": re.compile(r"(?i)nuc(\d+)"), # This searches for "Nuc" or "nuc" followed by digits
    "cellID": re.compile(r"(?i)cell(\d+)"), # This searches for "Cell" or "cell" followed by digits
    "locationID" : re.compile(r"(?i)loc-?(\d+)"), # This searches for "Loc" or "loc" followed by optional hyphen and digits
    "date" : re.compile(r"(\d{4}[-\.]\d{2}[-\.]\d{2})"), # This searches for dates in the format YYYY-MM-DD or YYYY.MM.DD
    "homolog" : re.compile(r"_([ABPM01pm])"), ## Added just in case the homolog is in the filename.
    "chromosome" : re.compile(r"(?i)(chr[\d+MXY])") ## Added just in case the chromosome is in the filename.
    }

Setting some variables

In the following cell you should set up some variables for the script to run properly. Change the variables at your convenience, and then you can press “Run all” in the Jupyter Notebook.

Brief explanation of the different variables:

work_folder: this variable is where the plots and the tsv files that the script creates will be stored.

data_file: this variable is where the csv files to be assessed are stored.

info_file: file that contains information about your walks, such as the chromosome, start and end of the steps. You need a columns that matches the 'timepoint' column in the csv files, to be able to show proper information.

column_round: the column that matches the 'timepoint' column in the csv files.

column_name: the column with the "nice" name for the steps. This will be used instead of the timepoint number. Use "" in case the information file does not have this column (in this case the name of the steps will be in the form of 'StepN' where N is an incremental number form 1 to the max number of steps).

column_size: the column for the size of the segment. Use "" in case the information file does not have this column (in this case the notebook will calculate the size automatically of the segments according to the start and end positions in the file).

column_start_pos: the column of the start position of the timepoint.

column_end_pos: the column of the start position of the timepoint.

number_of_steps: the number of steps in your library.

starting_step: the step at which analysis should start.

prec_mean_dataset: the worst precision of the three axes (usually the Z). This will be used as a blurring factor when creating the 3D maps to calculate the volumetric features.

save_plots: save the plots into a folder or show them in the notebook.

file_suffix: add a suffix to the different files created, in case you wish to save the results from different assessments in the same folder. It will be added as "_suffix" before the .tsv/.pdf.

The notebook will create a folder in the work_folder path called 6_morphological_features_extraction. Inside this folder 2 subfolders will be created, one for tsvs and one for plots. Inside each folder, a sub-subfolder with the suffix name will have the plots/tsvs for that experiment.

6_morphological_features_extraction/
|-- tsvs/
|   |-- example_tsv_SUFFIX.tsv
|-- plots/
|   |-- example_plot_SUFFIX.pdf

Back to the general index

work_folder = "/scratch/CIMA_tutorial/"
data_folder = "/scratch/CIMA_tutorial/data/chr3"

info_file = "/scratch/CIMA_tutorial/data/chr3/Info_chr3.txt"
column_round = "Round"
column_name = "Name"
column_size = "Size(kb)"
column_chr = "Chr"
column_start_pos = "Start(hg19)"
column_end_pos = "End(hg19)"

number_of_steps_library = 16
starting_step = 6

prec_mean_dataset = 30

save_plots = False

file_suffix = "chr3"

work_folder = Path(work_folder)
data_folder = Path(data_folder)
pre_assessment_folder = Path(work_folder, "2_experimental_quality_assessment")
reconstruction_tsv_folder = Path(work_folder, "3_3D_reconstruction_assessment", "tsvs", file_suffix)

info_df = pd.read_table(Path(info_file), sep="\t")
info_df.columns = info_df.columns.str.strip() # In case there are leading or trailing spaces in the column names

info_file_dict = {
    "column_round" : column_round,
    "column_name" : column_name,
    "column_size" : column_size,
    "column_chr" : column_chr,
    "column_start_pos" : column_start_pos,
    "column_end_pos" : column_end_pos,
}

del column_round, column_name, column_size, column_chr, column_start_pos, column_end_pos, info_file

Exporting the data with CIMA

Content:

Creating the list of CIMA StructuralObject
Checking the information file
Creating a color palette and the results files

Back to the general index

Creating the list of CIMA StructuralObject

In this cell we create a list of StructuralObjects (the main CIMA class type).
This list is the basis to do the assessment through this Jupyter Notebook.

When loading the data, we will also take out those traces or CS that did not pass the assessment in the previous notebooks.

When reading the file, it will also add some metadata gathered from the filename of the csv file:

nucleusID: searches for "nuc" (with any combination of lower and upper case) followed by digits. If there is not a match, the default is filecount (first file read will be number 1, ...).

cellID: searches for "cell" (with any combination of lower and upper case) followed by digits. If there is not a match, the default is filecount (first file read will be number 1, ...).

locationID: searches for "loc" (with any combination of lower and upper case) followed by optional hyphen and digits.If there is not a match, the default is filecount (first file read will be number 1, ...).

date: searches for dates in the format YYYY-MM-DD or YYYY.MM.DD. If there is not a match, the default is the file name.

homolog: searches for "_" followed by A, B, M, P, m, p, 0 or 1 (only one instance of this). If there is not a match, the default is "A".

chromosome: searches for "chr" (with any combination of lower and upper case) followed by a number, M, X or Y. If there is not a match, the default is "chrN".

Back to the section index

# Get a list of all CSV files in the specified data folder
csv_files = list(data_folder.glob('*.csv'))

if len(csv_files) == 0:
    print("No files found, please input the correct work_folder")
else:
    print(f"Total number of files found: {len(csv_files)}")

print("Loading the experimental quality data from the CSV files")
traces_pass = pd.read_csv(Path(pre_assessment_folder, f"experimental_assessment_pass_{file_suffix}.txt"), sep="\t", names=["trace"])

print("Loading the bad flags from the CSV files")
segments_to_remove_CCC = pd.read_csv(Path(reconstruction_tsv_folder, f'segments2remove_bycccvariation_t2{("_" + file_suffix) if file_suffix != "" else ""}.tsv'), sep="\t")
if "flag" not in segments_to_remove_CCC.columns and not segments_to_remove_CCC.empty:
    print("\tCreating the 'flag' column for reconstruction-based removal")
    segments_to_remove_CCC["flag"] = segments_to_remove_CCC[["experimentID", "nucleusID", "homolog", "timepoint"]].astype(str).apply("_".join, axis=1)
    bad_flags = set(segments_to_remove_CCC["flag"].astype(str).values)
elif not segments_to_remove_CCC.empty:
    bad_flags = set(segments_to_remove_CCC["flag"].astype(str).values)
else:
    bad_flags = set()

# Get a list of metadata keys to extract from filenames
metadata_keys = ["nucleusID", "cellID", "locationID", "date", "homolog", "chromosome"]

# Initialize an empty list to store the StructuralObject instances
obj_list = []

if len(csv_files) > 0:
    # Iterate over each file in the list of file names
    for count, filein in enumerate(csv_files, start=1):

        stem = Path(filein).stem

        if not stem in traces_pass["trace"].values:
            print(f"Skipping file {count}: {stem} because it did not pass the experimental quality assessment.")
            continue

        print(f"Processing file {count}: {stem}", end="\r")

        # We mine the metadata from the filename using regex patterns. If not possible, assign default values.
        # In the case of nucleus, cell and location, the default is the filecount (first file read will be number 1, ...)
        # In the case of date, if no date can be found, the filename gets used.
        # If no valid homolog denomination is found, the default value is A.
        # If no valid chromosome denomination is found, the default value is chrN.
        metadata_CIMA = {
            key: (match.group(1) if (match := regex_patterns[key].search(stem)) else default)
            for key, default in zip(metadata_keys, [count, count, count, stem, "A", "chrN"])
        }

        # Adjust the metadata for nucleus and cell (if nucleus is count, but cell is something, use the cell match in nucleus)
        if "cellID" in metadata_CIMA and metadata_CIMA["nucleusID"] == count:
            metadata_CIMA["nucleusID"] = metadata_CIMA["cellID"]

        # Set experimentID as date for consistency
        metadata_CIMA["experimentID"] = metadata_CIMA["date"]

        # Read the CSV file and create a StructuralObject instance
        objin = CSVParser.read_CSV_file(filein.as_posix(), metadata = metadata_CIMA, content_type = "srx")

        # Filter the atomList to include only the specified timepoints in the info file
        objin.atomList = objin.atomList[(objin.atomList['timepoint'] >= starting_step - 1) &
                                        (objin.atomList['timepoint'] < (starting_step + number_of_steps_library - 1))].copy()
        
        experiment_id = objin.metadata["experimentID"]
        nucleus_id = int(objin.metadata["nucleusID"])
        homolog = objin.metadata["homolog"]
        
        full_flags = (experiment_id + "_" + str(nucleus_id) + "_" +homolog + "_" + objin.atomList["timepoint"].astype(str))
        remove_mask = full_flags.isin(bad_flags)

        if remove_mask.any():
            removed_tps = objin.atomList.loc[remove_mask, "timepoint"].unique()
            # removed_tps = ",".join([tp_names[tp] for tp in removed_tps])
            # print(f"Excluding timepoints {removed_tps} in experimentID: {experiment_id}, nucleusID: {nucleus_id}, homolog: {homolog} due to poor assessment.")
            objin.atomList = objin.atomList.loc[~remove_mask].copy()

        for tp in objin.atomList["timepoint"].unique():
            # Take out the timepoint from the atomList if it has less tha 10 points
            tp_mask = objin.atomList["timepoint"] == tp
            if tp_mask.sum() < 10:
                print(f"\tExcluding timepoint {tp} in experimentID: {experiment_id}, nucleusID: {nucleus_id}, homolog: {homolog} because it has less than 10 points.")
                objin.atomList = objin.atomList.loc[~tp_mask].copy()
        
        # Append the object to the list
        obj_list.append(objin)

    # Nice print to tell the user the number of objects created.
    print(f"\nTotal number of objects created: {len(obj_list)}")

    # Clean up memory by deleting some of the variables no longer needed
    del count, filein, stem, metadata_CIMA, match, objin

del csv_files, metadata_keys
del bad_flags
del experiment_id, nucleus_id, homolog, full_flags, remove_mask
if 'segments_to_remove_precision' in locals():
    del segments_to_remove_CCC

Total number of files found: 33
Loading the experimental quality data from the CSV files
Loading the bad flags from the CSV files
	Creating the 'flag' column for reconstruction-based removal
Skipping file 2: 2018-07-10_nuc04_A_chr3 because it did not pass the experimental quality assessment.
Skipping file 5: 2018-07-10_nuc03_A_chr3 because it did not pass the experimental quality assessment.
	Excluding timepoint 19 in experimentID: 2018-09-04, nucleusID: 7, homolog: A because it has less than 10 points.
Skipping file 26: 2018-09-04_nuc01_A_chr3 because it did not pass the experimental quality assessment.
Processing file 33: 2018-09-04_nuc06_A_chr3
Total number of objects created: 30

Checking the information file

This cells makes an adjustment in the info_df.

It checks if the column of the info_df that we will use as the link between this data frame and our walks has the same timepoints. This happens because most programs start with timepoint 0, while most information files start at timepoint 1.

This will create a new column in the dataframe that is called ‘timepoint’. This column will be modified so it has the same range as the timepoint column from the csv files.

Back to the section index

list_of_timepoints = sorted({int(tp) for obj in obj_list for tp in obj.atomList.timepoint.unique()})
info_of_timepoints = list(map(int, info_df[info_file_dict["column_round"]].tolist()))

print(f'List of timepoints found across all objects: {list_of_timepoints}')
print(f'List of timepoints in the information file: {info_of_timepoints}\n')

if list_of_timepoints == info_of_timepoints:
    print('The timepoints found in the objects match those in the information file.')
    info_df["timepoint"] = info_df[info_file_dict["column_round"]]
elif set(list_of_timepoints).issubset(info_of_timepoints):
    print('Note: The timepoints found in the objects are a subset of those in the information file. Not all timepoints are present.')
    info_df["timepoint"] = info_df[info_file_dict["column_round"]]
elif list_of_timepoints == [x - 1 for x in info_of_timepoints]:
    print('Note: The column for the information file that has the timepoint is offset by 1 compared to the objects, adjusting accordingly.')
    info_df["timepoint"] = info_df[info_file_dict["column_round"]] - 1
elif set(list_of_timepoints).issubset({x - 1 for x in info_of_timepoints}):
    print('Note: The timepoints found in the objects are a subset of those in the information file, which are offset by 1 compared to the objects. Adjusting accordingly.')
    info_df["timepoint"] = info_df[info_file_dict["column_round"]] - 1
else:
    print('Warning: The timepoints found in the objects do not match those in the information file.')

if info_file_dict["column_size"] == "":
    print("Calculating size column from start and end positions...")
    column_size = "size"
    info_file_dict["column_size"] = column_size
    info_df[column_size] = info_df[info_file_dict["column_end_pos"]] - info_df[info_file_dict["column_start_pos"]]

if info_file_dict["column_name"] == "":
    print("Setting name column from round numbers...")
    info_file_dict["column_name"] = "name"
    info_df["name"] = [f"Step{i}" for i in range(1, info_df.shape[0]+1)]

tp_names = info_df.set_index("timepoint")[info_file_dict["column_name"]].to_dict()
    
del list_of_timepoints, info_of_timepoints

List of timepoints found across all objects: [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
List of timepoints in the information file: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]

Note: The column for the information file that has the timepoint is offset by 1 compared to the objects, adjusting accordingly.

Creating a color palette and the results folders for the tsv and plot files

Here we create a color palette automatically based on the different experiment ID found across the files. Experiment IDs are usually the date of the experiment. This allows to easily check for a possible batch effect.

We also create the folders were to store the results.

FOF_CT_files: a folder that will have all the files created. It will have a subfolder with the file_suffix chosen.

Back to the section index

# Define the color palette for experiments. Use the date or the experimentID (date) as the key.
color_palette = {exp_date: color for exp_date, color in zip(
    # We get a set of unique dates from the metadata of the objects
    set([file.metadata["experimentID"] for file in obj_list]),
    # We assign a color from the tab10 palette for each unique date
    sns.color_palette("tab10", n_colors=len(set([file.metadata["experimentID"] for file in obj_list]))) 
)}

# This palette is fixed for the current experiments used in the tutorial, which aim to reproduce the figures in the CIMA paper.
color_palette = {'2018-01-11':"#a6cee3" ,'2018-06-28':'#1f78b4', '2018-07-10':'#b2df8a', '2018-09-04':'#33a02c'}

# Create the folders to save the results. If the folders already exist, do not raise an error.
fof_ct_folder = Path(work_folder, "FOF_CT_files", file_suffix)
fof_ct_folder.mkdir(exist_ok=True, parents=True)

Writting to FOF-CT volumetic format

CIMA can write the contents of an object into the FOF_CT volumetric format.

This format is composed of 3 different tables (core.csv, trace.csv and spot.csv)

core.csv

This has information about the Spot_ID. In our case the Spot_ID corresponds to a chromatin segment. Internally it’s called 4dn_FOF_CT_core. The columns of this table are as follow:

Spot_ID: incremental number to have a unique identifier for every chromatin segment.

X, Y, Z: coordinates of the centroid of the chromatin segment.

Chrom: chromosome of the chromatin segment.

Chrom_Start: genomic start coordinate of the chromatin segment.

Chrom_End: genomic end coordinate of the chromatin segment.

trace.csv

this has information about the Trace_ID. In our case, the Trace_ID corresponds to the trace. Internally I think is called 4dn_FOF_CT_trace. The columns of this table are as follow:

Trace_ID: incremental number to have a unique identifier for every trace.

LocID: custom column for this table. Indentifier of the location/field of view where the trace is located. We gather this from the name of the file, as it is tipically included.

NucleusID: custom column for this table. Corresponds to the cellID/nucleusID where the trace is located. We gather this from the name of the file, as it is tipically included.

Homolog: custom column for this table. Corresponds to the homolog of the trace. We gather this from the name of the file, as it is tipically included.

spot.csv

This has the data that each Spot_ID contains. In our case, this table contains the blinks themselves. Internally it is called 4dn_FOF_CT_quality. The columns of this table are as follow:

Loc_ID: this column correspond to the imageID from the SRX csv file.

Spot_ID: this corresponds to the Spot_ID from the core.csv table.

X: this corresponds to the X coordinate of the blink.

Y: this corresponds to the Y coordinate of the blink.

Z: this corresponds to the Z coordinate of the blink.

Fluor: this corresponds to the fluorophore used. Right now it has a dummy value as it only says Fluor.

Prec_X: custom column for this table. This corresponds to the xprec columns of the atomList in the CIMA object.

Prec_Y: custom column for this table. This corresponds to the yprec columns of the atomList in the CIMA object.

Prec_Z: custom column for this table. This corresponds to the zprec columns of the atomList in the CIMA object.

timepoint: custom column for this table. This corresponds to the timepoint column of the atomList in the CIMA object.

z_step: custom column for this table. This corresponds to the z-step column of the atomList in the CIMA object.

cycle: custom column for this table. This corresponds to the cycle column of the atomList in the CIMA object.

In order for the fof_ct_wirter function to work, the user must provide a bit of metadata of the experiment. It is described below:

genome_assembly: the genome assembly used on the experiment.

lab_name: the name(s) of the lab(s) that did the experiment. Can be multiple ones.

experimenter_name: the name(s) of the experimenter(s) that did the experiment. Can be multiple ones.

experimenter_contact: the contact information of the experimenter(s) that did the experiment. Can be multiple ones.

description: small description of the experiment.

GENOME_ASSEMBLY = "hg19"
LAB_NAME = "Wu Lab"
EXPERIMENTER_NAME = "Ting Wu, Guy Nir"
EXPERIMENTER_CONTACT = "ting.wu@example.com, guy.nir@example.com"
DESCRIPTION = "chr3 files"

# This is just an example, the user should fill this dictionary with the actual fluorophores used in the experiment.
# The structure of the dictionary should be {timepoint: fluorophore}, where timepoint is the integer timepoint and fluorophore is a string with the name of the fluorophore used in that timepoint.
# Numbering of timepoints sould be consistent with the timepoint column in the CSV file.
# If you are using the structure of the tutorial files, you have a DataFrame with a column named "timepoint" that has the timepoint information.
# Example of the dictionary structure:
FLUOROPHORE_DICT = {
    5: 'Alexa647',
    6: 'Alexa647',
    7: 'Alexa647',
    8: 'Alexa647',
    9: 'Alexa647',
    10: 'Alexa647',
    11: 'Alexa647',
    12: 'Alexa647',
    13: 'Alexa647',
    14: 'Alexa647',
    15: 'Alexa647',
    16: 'Alexa647',
    17: 'Alexa647',
    18: 'Alexa647',
    19: 'Alexa647',
    20: 'Alexa647'}

# This is a pythonic way to create the same dictionary using data avialable through the tutorial notebooks.
# This only works if you have only ONE fluorophore across all timepoints.
# FLUOROPHORE_DICT = {tp: color for tp, color in product(info_df["timepoint"], ["Alexa647"])}

The function as is done below will write a vlumetric FOF_CT compliant file.
If you do not add the dictionary that contains the information about the fluorophore for every timepoint, it will throw an error.
In a similar fashon, if the dictionary does not contain the same number of keys as timepoints in the dataset, it will throw an error.

The function assumes that the order of the regions matches the order of timepoints in the file.

fof_ct_metadata = {"genome_assembly": GENOME_ASSEMBLY,
                   "lab_name": LAB_NAME,
                   "experimenter_name": EXPERIMENTER_NAME,
                   "experimenter_contact": EXPERIMENTER_CONTACT,
                   "description": DESCRIPTION,
                   }

fof_volumetric_ct_writer(table_folder=fof_ct_folder,
                         data2write=obj_list,
                         fof_metadata=fof_ct_metadata,
                         reg_dict={row["timepoint"]: (row["Chr"], row["Start(hg19)"], row["End(hg19)"]) for _, row in info_df.iterrows()},
                         fluor=FLUOROPHORE_DICT,
                         file_suffix = file_suffix
                         )

Processing file 2018-09-04_nuc03_A_chr3 [1/30]
	Time segment not found for region chr3:150000000-150500000, skipping...
	Time segment not found for region chr3:152500000-153000000, skipping...
	Time segment not found for region chr3:153500000-154000000, skipping...
Processing file 2018-07-10_nuc02_A_chr3 [2/30]
Processing file 2018-07-10_nuc05_A_chr3 [3/30]
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-06-28_nuc07_A_chr3 [4/30]
	Time segment not found for region chr3:150000000-150500000, skipping...
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-06-28_nuc09_A_chr3 [5/30]
Processing file 2018-07-10_nuc11_A_chr3 [6/30]
Processing file 2018-09-04_nuc01_B_chr3 [7/30]
Processing file 2018-07-10_nuc06_B_chr3 [8/30]
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-06-28_nuc05_B_chr3 [9/30]
Processing file 2018-07-10_nuc06_A_chr3 [10/30]
Processing file 2018-06-28_nuc08_A_chr3 [11/30]
Processing file 2018-06-28_nuc05_A_chr3 [12/30]
	Time segment not found for region chr3:152000000-152500000, skipping...
Processing file 2018-07-10_nuc07_A_chr3 [13/30]
	Time segment not found for region chr3:156500000-157000000, skipping...
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-07-10_nuc01_A_chr3 [14/30]
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-07-10_nuc02_B_chr3 [15/30]
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-06-28_nuc01_A_chr3 [16/30]
	Time segment not found for region chr3:150000000-150500000, skipping...
Processing file 2018-09-04_nuc04_A_chr3 [17/30]
Processing file 2018-07-10_nuc04_B_chr3 [18/30]
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-09-04_nuc03_B_chr3 [19/30]
Processing file 2018-07-10_nuc03_B_chr3 [20/30]
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-07-10_nuc01_B_chr3 [21/30]
Processing file 2018-09-04_nuc07_A_chr3 [22/30]
	Time segment not found for region chr3:151500000-152000000, skipping...
	Time segment not found for region chr3:157000000-157500000, skipping...
Processing file 2018-06-28_nuc02_A_chr3 [23/30]
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-07-10_nuc10_A_chr3 [24/30]
Processing file 2018-07-10_nuc08_A_chr3 [25/30]
Processing file 2018-09-04_nuc05_A_chr3 [26/30]
Processing file 2018-06-28_nuc01_B_chr3 [27/30]
Processing file 2018-09-04_nuc10_A_chr3 [28/30]
	Time segment not found for region chr3:157000000-157500000, skipping...
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-06-28_nuc07_B_chr3 [29/30]
	Time segment not found for region chr3:150000000-150500000, skipping...
	Time segment not found for region chr3:152000000-152500000, skipping...
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-09-04_nuc06_A_chr3 [30/30]

Using the same function we can turn volumetric format off to write as a standard ball and stick FOF_CT format

fof_ct_metadata = {"genome_assembly": GENOME_ASSEMBLY,
                   "lab_name": LAB_NAME,
                   "experimenter_name": EXPERIMENTER_NAME,
                   "experimenter_contact": EXPERIMENTER_CONTACT,
                   "description": DESCRIPTION,
                   }

fof_volumetric_ct_writer(table_folder=fof_ct_folder,
                         data2write=obj_list,
                         fof_metadata=fof_ct_metadata,
                         reg_dict={row["timepoint"]: (row["Chr"], row["Start(hg19)"], row["End(hg19)"]) for _, row in info_df.iterrows()},
                        #  fluor=FLUOROPHORE_DICT,
                         file_suffix = file_suffix,
                         volumetric_format = False
                         )

Processing file 2018-09-04_nuc03_A_chr3 [1/30]
	Time segment not found for region chr3:150000000-150500000, skipping...
	Time segment not found for region chr3:152500000-153000000, skipping...
	Time segment not found for region chr3:153500000-154000000, skipping...
Processing file 2018-07-10_nuc02_A_chr3 [2/30]
Processing file 2018-07-10_nuc05_A_chr3 [3/30]
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-06-28_nuc07_A_chr3 [4/30]
	Time segment not found for region chr3:150000000-150500000, skipping...
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-06-28_nuc09_A_chr3 [5/30]
Processing file 2018-07-10_nuc11_A_chr3 [6/30]
Processing file 2018-09-04_nuc01_B_chr3 [7/30]
Processing file 2018-07-10_nuc06_B_chr3 [8/30]
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-06-28_nuc05_B_chr3 [9/30]
Processing file 2018-07-10_nuc06_A_chr3 [10/30]
Processing file 2018-06-28_nuc08_A_chr3 [11/30]
Processing file 2018-06-28_nuc05_A_chr3 [12/30]
	Time segment not found for region chr3:152000000-152500000, skipping...
Processing file 2018-07-10_nuc07_A_chr3 [13/30]
	Time segment not found for region chr3:156500000-157000000, skipping...
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-07-10_nuc01_A_chr3 [14/30]
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-07-10_nuc02_B_chr3 [15/30]
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-06-28_nuc01_A_chr3 [16/30]
	Time segment not found for region chr3:150000000-150500000, skipping...
Processing file 2018-09-04_nuc04_A_chr3 [17/30]
Processing file 2018-07-10_nuc04_B_chr3 [18/30]
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-09-04_nuc03_B_chr3 [19/30]
Processing file 2018-07-10_nuc03_B_chr3 [20/30]
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-07-10_nuc01_B_chr3 [21/30]
Processing file 2018-09-04_nuc07_A_chr3 [22/30]
	Time segment not found for region chr3:151500000-152000000, skipping...
	Time segment not found for region chr3:157000000-157500000, skipping...
Processing file 2018-06-28_nuc02_A_chr3 [23/30]
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-07-10_nuc10_A_chr3 [24/30]
Processing file 2018-07-10_nuc08_A_chr3 [25/30]
Processing file 2018-09-04_nuc05_A_chr3 [26/30]
Processing file 2018-06-28_nuc01_B_chr3 [27/30]
Processing file 2018-09-04_nuc10_A_chr3 [28/30]
	Time segment not found for region chr3:157000000-157500000, skipping...
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-06-28_nuc07_B_chr3 [29/30]
	Time segment not found for region chr3:150000000-150500000, skipping...
	Time segment not found for region chr3:152000000-152500000, skipping...
	Time segment not found for region chr3:157500000-158000000, skipping...
Processing file 2018-09-04_nuc06_A_chr3 [30/30]

Reading the FOF-CT volumetic format

In a similar fashion, CIMA can read the core.csv, trace.csv and spot.csv created by the fof_ct_writer function. With this we can recreate a list of CIMA objects similar to the original one.

core_file = Path(fof_ct_folder, f"core_{file_suffix}_volumetric.csv")
trace_file = Path(fof_ct_folder, f"trace_{file_suffix}_volumetric.csv")
spot_file = Path(fof_ct_folder, f"spot_{file_suffix}_volumetric.csv")

obj_CIMA_list = fof_volumetric_ct_reader(fof_core_file=core_file, fof_spot_file=spot_file, fof_trace_file=trace_file)

Reconstructing file: loc1_nuc3_A
Reconstructing file: loc3_nuc2_A
Reconstructing file: loc4_nuc5_A
Reconstructing file: loc6_nuc7_A
Reconstructing file: loc7_nuc9_A
Reconstructing file: loc8_nuc11_A
Reconstructing file: loc9_nuc1_B
Reconstructing file: loc10_nuc6_B
Reconstructing file: loc11_nuc5_B
Reconstructing file: loc12_nuc6_A
Reconstructing file: loc13_nuc8_A
Reconstructing file: loc14_nuc5_A
Reconstructing file: loc15_nuc7_A
Reconstructing file: loc16_nuc1_A
Reconstructing file: loc17_nuc2_B
Reconstructing file: loc18_nuc1_A
Reconstructing file: loc19_nuc4_A
Reconstructing file: loc20_nuc4_B
Reconstructing file: loc21_nuc3_B
Reconstructing file: loc22_nuc3_B
Reconstructing file: loc23_nuc1_B
Reconstructing file: loc24_nuc7_A
Reconstructing file: loc25_nuc2_A
Reconstructing file: loc27_nuc10_A
Reconstructing file: loc28_nuc8_A
Reconstructing file: loc29_nuc5_A
Reconstructing file: loc30_nuc1_B
Reconstructing file: loc31_nuc10_A
Reconstructing file: loc32_nuc7_B
Reconstructing file: loc33_nuc6_A

obj_CIMA_list[0].atomList.head()

	imageID	zstep	...	yprec	zprec	timepoint	record_name	s11	s12	shiftz	mass
0	335775	6	...	9.96838	85.8496	6	sLOC	NaN	NaN	1321.719971	10.0
1	335783	6	...	10.76700	90.8840	6	sLOC	NaN	NaN	1337.540039	10.0
2	336116	8	...	4.67027	37.7125	6	sLOC	NaN	NaN	1431.239990	10.0
3	336118	8	...	6.80031	60.6664	6	sLOC	NaN	NaN	1376.939941	10.0
4	336119	8	...	6.51492	55.7733	6	sLOC	NaN	NaN	1417.579956	10.0

5 rows × 40 columns

Reading the FOF-CT bs format

In a similar fashion, CIMA can read the core.csv nad trace.csv. With this we can recreate a list of CIMA objects similar to the original one.

IMPORTANT

The atomList of the object will be mostly filled with 0s, as most of the information is missing.

core_file = Path(fof_ct_folder, f"core_{file_suffix}_bs.csv")
trace_file = Path(fof_ct_folder, f"trace_{file_suffix}_bs.csv")

obj_CIMA_list = fof_bs_ct_reader(fof_core_file=core_file, fof_trace_file=trace_file)

Reconstructing file: loc1_nuc3_A
Reconstructing file: loc3_nuc2_A
Reconstructing file: loc4_nuc5_A
Reconstructing file: loc6_nuc7_A
Reconstructing file: loc7_nuc9_A
Reconstructing file: loc8_nuc11_A
Reconstructing file: loc9_nuc1_B
Reconstructing file: loc10_nuc6_B
Reconstructing file: loc11_nuc5_B
Reconstructing file: loc12_nuc6_A
Reconstructing file: loc13_nuc8_A
Reconstructing file: loc14_nuc5_A
Reconstructing file: loc15_nuc7_A
Reconstructing file: loc16_nuc1_A
Reconstructing file: loc17_nuc2_B
Reconstructing file: loc18_nuc1_A
Reconstructing file: loc19_nuc4_A
Reconstructing file: loc20_nuc4_B
Reconstructing file: loc21_nuc3_B
Reconstructing file: loc22_nuc3_B
Reconstructing file: loc23_nuc1_B
Reconstructing file: loc24_nuc7_A
Reconstructing file: loc25_nuc2_A
Reconstructing file: loc27_nuc10_A
Reconstructing file: loc28_nuc8_A
Reconstructing file: loc29_nuc5_A
Reconstructing file: loc30_nuc1_B
Reconstructing file: loc31_nuc10_A
Reconstructing file: loc32_nuc7_B
Reconstructing file: loc33_nuc6_A

obj_CIMA_list[0].atomList.head()

	...	timepoint	record_name	s11	s12	shiftz	mass
0	...	1	sLOC	NaN	NaN	1481.047974	10.0
1	...	2	sLOC	NaN	NaN	1398.165283	10.0
2	...	3	sLOC	NaN	NaN	963.539978	10.0
3	...	4	sLOC	NaN	NaN	951.852051	10.0
4	...	6	sLOC	NaN	NaN	919.484924	10.0

5 rows × 40 columns

Reading from XYZ coordinate format

CIMA also can read files that only contain X, Y, Z coordinates.
In this case we have to use the SegmentXYZ. This only requires that the data have the columns x, y and z. All other columns will be preserved.
Here there is an example using chr21 data as an example.

from cima.segments.segment_info_xyz import SegmentXYZ

chr21_bs_data = pd.read_csv('https://zenodo.org/records/3928890/files/chromosome21.tsv?download=1', sep='\t')

display(chr21_bs_data.head())

	Z(nm)	X(nm)	Y(nm)	Genomic coordinate	Chromosome copy number	Gene names	Transcription	TSS ZXY(nm)
0	2449.0	4700.0	7234.0	chr21:10400001-10450001	1	NaN	NaN	NaN
1	3731.0	4629.0	7409.0	chr21:10500001-10550001	1	NaN	NaN	NaN
2	2248.0	4690.0	7148.0	chr21:10600001-10650001	1	NaN	NaN	NaN
3	2211.0	4065.0	7567.0	chr21:13250001-13300001	1	NaN	NaN	NaN
4	2499.0	3904.0	7255.0	chr21:14000001-14050001	1	NaN	NaN	NaN

chr21_bs_data = chr21_bs_data[['X(nm)', 'Y(nm)', 'Z(nm)','Genomic coordinate','Chromosome copy number']]
chr21_bs_data.rename({
            'X(nm)': 'x',
            'Y(nm)': 'y',
            'Z(nm)': 'z',
            'Genomic coordinate':'gen_coord',
            'Chromosome copy number': 'chr_copy_num'},
        axis=1, inplace=True)

chr21_bs_data_CIMA = SegmentXYZ(chr21_bs_data)

display(chr21_bs_data_CIMA.atomList.head())

	x	y	z	gen_coord	chr_copy_num	record_name	mass
0	4700.0	7234.0	2449.0	chr21:10400001-10450001	1	sLOC	10.0
1	4629.0	7409.0	3731.0	chr21:10500001-10550001	1	sLOC	10.0
2	4690.0	7148.0	2248.0	chr21:10600001-10650001	1	sLOC	10.0
3	4065.0	7567.0	2211.0	chr21:13250001-13300001	1	sLOC	10.0
4	3904.0	7255.0	2499.0	chr21:14000001-14050001	1	sLOC	10.0

print(chr21_bs_data_CIMA.atomList.timepoint.unique())

[0]

print(chr21_bs_data_CIMA.atomList.gen_coord.unique())

<StringArray>
['chr21:10400001-10450001', 'chr21:10500001-10550001',
 'chr21:10600001-10650001', 'chr21:13250001-13300001',
 'chr21:14000001-14050001', 'chr21:14050001-14100001',
 'chr21:14100001-14150001', 'chr21:14150001-14200001',
 'chr21:14200001-14250001', 'chr21:14250001-14300001',
 ...
 'chr21:46200001-46250001', 'chr21:46250001-46300001',
 'chr21:46300001-46350001', 'chr21:46350001-46400001',
 'chr21:46400001-46450001', 'chr21:46450001-46500001',
 'chr21:46500001-46550001', 'chr21:46550001-46600001',
 'chr21:46600001-46650001', 'chr21:46650001-46700001']
Length: 651, dtype: str