Preprocessing¶

10X Genomics Data Files¶

Example Using a 10X Genomics Dataset¶

Output from the 10X Genomics Cell Ranger pipeline includes multiple files. The two necessary files for importing the results into tcrdist2 are:

The Filtered contig annotations (CSV)
The Clonotype consensus annotations (CSV)

The required fields for these file are shown at the bottom of this page. below. All data downloaded from 10X Genomics Example Datasets .

Tip

We use the following files in the coding example:

vdj_v1_hs_pbmc_t_filtered_contig_annotations.csv
vdj_v1_hs_pbmc_t_consensus_annotations.csv

Pre-Processing the 10X Data Files¶

Some pre-processing of the data is required to create a tcrdist2 compatible clone file from the 10X Cell Ranger files. This is done using tcrdist.preprocess_10X.get_10X_clones().

import pandas as pd
import numpy as np
import multiprocessing

import tcrdist as td
from tcrdist import preprocess_10X
from tcrdist.repertoire import TCRrep
from tcrdist.vis_tools import bostock_cat_colors, cluster_viz

# filtered_contig_annotations_csvfile
fca = 'vdj_v1_hs_pbmc_t_filtered_contig_annotations.csv'
ca =  'vdj_v1_hs_pbmc_t_consensus_annotations.csv'
clones_df = preprocess_10X.get_10X_clones(organism = 'human',
                                       filtered_contig_annotations_csvfile = fca,
                                       consensus_annotations_csvfile = ca)
clones_df.head()

Tip

Multiple productive alpha or beta chains associate with a single 10X unique cell identifier are possible. tcrdist2 picks the productive variant of the chain associated with the highest number of unique molecular identifies.

Initialize TCRrep to Calculate tcrdistances¶

tr = TCRrep(cell_df = clones_df, organism = "human", chains = ["alpha","beta"])
tr.infer_cdrs_from_v_gene(chain = 'alpha', imgt_aligned=True)
tr.infer_cdrs_from_v_gene(chain = 'beta', imgt_aligned=True)
tr.index_cols = ['clone_id', 'subject',
               'v_a_gene',  'j_a_gene', 'v_b_gene', 'j_b_gene',
               'cdr3_a_aa', 'cdr3_b_aa',
               'cdr1_a_aa', 'cdr2_a_aa', 'pmhc_a_aa',
               'cdr1_b_aa', 'cdr2_b_aa', 'pmhc_b_aa',
               'cdr3_b_nucseq', 'cdr3_a_nucseq']#,
               #'va_countreps', 'ja_countreps',#
               #'vb_countreps', 'jb_countreps',
               #'va_gene', 'vb_gene',
               #'ja_gene', 'jb_gene']
tr.deduplicate()
cpus = multiprocessing.cpu_count()
tr._tcrdist_legacy_method_alpha_beta(processes = cpus)

tr.clone_df['epitope'] = "Unk" # Assign an unknown epitope to all clones
cluster_viz(pd.DataFrame(tr.paired_tcrdist),
            tr.clone_df,
            tr.clone_df['epitope'].unique(),
            bostock_cat_colors(['set3']),
            "Clutered with Paired tcrdistances (Legacy Metric)")

https://user-images.githubusercontent.com/46639063/68563087-55d93c80-0401-11ea-988e-cbf954d57730.png

cluster_viz(pd.DataFrame(tr.dist_b.values),
            tr.clone_df,
            tr.clone_df['epitope'].unique(),
            bostock_cat_colors(['set3']),
            "Clutered with tcrdistance Beta Chain (Legacy Metric)")

https://user-images.githubusercontent.com/46639063/68563088-55d93c80-0401-11ea-8c6a-51d78b7b46a9.png

Bulk Data Processed with MiXCR¶

If you have bulk sequencing data, MiXCR is an excellent option for rapidly annotating the CDRs and identifying gene V and J gene usage. Documentation for MiXCR can be found at mixcr.readthedocs.io/.

Adaptive Data File¶

Describe preprocessing of Adaptive Files

VDJDB Data Files¶

Describe Preprocessing VDJDB files

Clones File¶

Describe how to use a clones file from previous TCRdist run.

Mappers¶

The typical workflow for getting data into tcrdist2 is to load a data file with Pandas and then apply a mapper to select and rename columns to match those required by tcrdist2. see Mappers

We have begun writing wrappers for various data sources. Mappers are just python dictionaries and can be easily emulated if your data headers are unique.

import pandas as pd
from tcrdist import mappers
mapping = mappers.tcrdist_clone_df_to_tcrdist2_mapping
tcrdist1_df = mappers.generic_pandas_mapper(df = tcrdist_clone_df_PA,
                                            mapping = mapping)

Input Files¶

10X Genomics Data Fields¶

Required Fields in the 10X Genomics contig_annotations_csvfile

FIELD	EXAMPLE VALUE
barcode	AAAGATGGTCTTCTCG-1
is_cell	TRUE
contig_id	AAAGATGGTCTTCTCG-1_contig_1
high_confidence	TRUE
length	695
chain	TRB
v_gene	TRBV5-1*01
d_gene	TRBD2*02
j_gene	TRBJ2-3*01
c_gene	TRBC2*01
full_length	TRUE
productive	TRUE
cdr3	CASSPLAGYAADTQYF
cdr3_nt	TGCGCCAGCAGCCCCCTAGCGGGATACGCAGCAGATACGCAGTATTTT
reads	9427
umis	9
raw_clonotype_id	clonotype14
raw_consensus_id	clonotype14_consensus_1

Required Fields in in the 10X Genomics consensus_annotations_csvfile

FIELD	EXAMPLE VALUE
clonotype_id	clonotype100
consensus_id	clonotype100_consensus_1
length	550
chain	TRB
v_gene	TRBV24-1*01
d_gene	TRBD1*01
j_gene	TRBJ2-7*01
c_gene	TRBC2*01
full_length	TRUE
productive	TRUE
cdr3	CATSDPGQGGYEQYF
cdr3_nt	TGTGCCACCAGTGACCCCGGACAGGGAGGATACGAGCAGTACTTC
reads	8957
umis	9

Adaptive Biotech Files¶

See the 3.4.13 SAMPLE EXPORT section of immunoSEQ Analyzer Manual

The standard output is expected to have column headers as shown in the table below.

immunoSEQ

Reference Manual	Definition	Example1	Example2
nucleotide	Full length nucleotide sequence	GAGTCGCCCAGA	ATCCAGCCCTCAGA
aminoAcid	CDR3 amino acid sequence	CASSLSWYNTGELFF	CASSLGQTNTEAFF
count	The count of sequence reads for the given nucleotide sequence	13	2
frequencyCount	The percentage of a particular nucleotide sequence in the total sample Total length of the CDR3 region	0.003421953	5.26E-04
cdr3Length	The highest degree of specificity obtained when identifying the V gene Name of the V family (if identified)	45	42
vMaxResolved	Name of the V gene if identified, otherwise ‘unresolved’	TCRBV27-01*01	TCRBV12
vFamilyName	Number of the V gene allele if identified, otherwise ‘1’	TCRBV27	TCRBV12
vGeneName	Name of the potential V families for a sequence if V family identity is unresolved	TCRBV27-01
vGeneAllele	Name of the potential V genes for a sequence if V gene identity is unresolved	1
vFamilyTies	Name of the potential V gene alleles for a sequence if V allele identity is unresolved
vGeneNameTies	The highest degree of specificity obtained when identifying the D gene Name of the D family (if identified)		TCRBV12-03,TCRBV12-04
vGeneAlleleTies	Name of the D gene if identified, otherwise ‘unresolved’
dMaxResolved	Number of the D gene allele if identified, otherwise ‘1’ or ‘unresolved’ if D gene is unresolved		TCRBD01-01*01
dFamilyName	Name of the potential D families for a sequence if D family identity is unresolved		TCRBD01
dGeneName	Name of the potential D genes for a sequence if D gene identity is unresolved		TCRBD01-01
dGeneAllele	Name of the potential D gene alleles for a sequence if D allele identity is unresolved		1
dFamilyTies	The highest degree of specificity obtained when identifying the J gene Name of the J family (if identified)	TCRBD01,TCRBD02
dGeneNameTies	Name of the J gene if identified, otherwise ‘unresolved’	TCRBD01-01,TCRBD02-01
dGeneAlleleTies	Number of the J gene allele if identified, otherwise ‘1’
jMaxResolved	Name of the potential J families for a sequence if J family identity is unresolved	TCRBJ02-02*01	TCRBJ01-01*01
jFamilyName	Name of the potential J genes for a sequence if J gene identity is unresolved	TCRBJ02	TCRBJ01
jGeneName	Name of the J gene if identified, otherwise ‘unresolved’	TCRBJ02-02	TCRBJ01-01
jGeneAllele	Number of the J gene allele if identified, otherwise ‘1’	1	1
jFamilyTies	Name of the potential J families for a sequence if J family identity is unresolved
jGeneNameTies	Name of the potential J genes for a sequence if J gene identity is unresolved
jGeneAlleleTies	Name of the potential J gene alleles for a sequence if J allele identity is unresolved
vDeletion	The number of deletions from the 3’ end of the V segment	0	6
n1Insertion	The number of insertions in the N1 region	2	4
d5Deletion	The number of deletions from the 5’ end of the D segment	0	1
d3Deletion	The number of deletions from the 3’ end of the D segment	10	5
n2Insertion	The number of insertions in the N2 region	3	2
jDeletion	The number of deletions from the 5’ end of the J segment	2	1
vIndex	Distance from the start of the sequence (designated 0) to the conserved cysteine residue in the V motif	36	39
n1Index	Distance from 0 to the start of the N1	53	50
dIndex	Distance from 0 to the start of the D region	55	54
n2Index	Distance from 0 to the start of the N2 region	57	60
jIndex	Distance from 0 to the start of the J region	60	62
estimatedNumberGenomes	Estimated number of T/B cell genomes present in the sample	13	2
sequenceStatus	Whether the nucleotide sequence generates a functional amino acid sequence	In	In
cloneResolved	The gene segments used to construct the CDR3 (VDJ, VJ, DJ)	VDJ	VDJ
vOrphon	Not implemented at this time
dOrphon	Not implemented at this time
jOrphon	Not implemented at this time
vFunction	Not implemented at this time
dFunction	Not implemented at this time
jFunction	Not implemented at this time