tcrdist2 provides flexible distance measures for comparing T cell receptors across complementarity determining regions (CDRs)

The following short examples use the TCRrep class and tcrdist2() method, which is appropriate for small and medium-sized datasets of 10-10,000 TCR clones. The data used in these examples can be downloaded here: dash.csv (375KB). The tcrdist2() method automates some steps that can be seen in the longer examples.

TCR Distances

Hamming Distance

Hamming distance between receptors, using the CDR1, CDR2, CDR2.5 and CDR3. By default, the CDRs are aligned with the Needleman–Wunsch algorithm using BLOSUM62 scoring matrix.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
df = pd.read_csv('dash.csv')
df = df[df.epitope.isin(['PA'])]
tr = TCRrep(cell_df=df, chains=['alpha','beta'], organism='mouse')
tr.tcrdist2(processes = 1,
			metric = 'hamming',
			reduce = True,
			dump = False,
			save = False)

tr.pw_tcrdist
In [2]: tr.pw_tcrdist
Out[2]:
array([[ 0, 45, 41, ..., 42, 37, 26],
       [45,  0, 49, ..., 46, 47, 45],
       [41, 49,  0, ..., 29, 28, 45],
       ...,
       [42, 46, 29, ...,  0, 28, 27],
       [37, 47, 28, ..., 28,  0, 39],
       [26, 45, 45, ..., 27, 39,  0]], dtype=int16)

Reciprocal Alignment Distance

Reciprocal allignment distance between receptors, using the CDR1, CDR2, CDR2.5 and CDR3. By default, the CDRs are aligned with the Needleman–Wunsch algorithm using BLOSUM62 scoring matrix.

Tip

The reciprocal distance is calucated using the Needlman-Wunsch algorith and a scoring matrix, such that the distance is score(x,x) + score(y,y) - 2 * score(x,y)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
df = pd.read_csv('dash.csv')
df = df[df.epitope.isin(['PA'])]
tr = TCRrep(cell_df=df, chains=['alpha','beta'], organism='mouse')
tr.tcrdist2(processes = 1,
			metric = 'nw',
			reduce = True,
			dump = False,
			save = False)

tr.pw_tcrdist
In [2]: tr.pw_tcrdist
Out[2]:
array([[  0, 504, 449, ..., 440, 407, 280],
       [504,   0, 573, ..., 506, 533, 518],
       [449, 573,   0, ..., 313, 336, 479],
       ...,
       [440, 506, 313, ...,   0, 319, 268],
       [407, 533, 336, ..., 319,   0, 455],
       [280, 518, 479, ..., 268, 455,   0]], dtype=int16)

Chain-Specific Distance

Chain specific distance attributes are provided as TCRrep.pw_alpha and TCRrep.pw_beta.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
df = pd.read_csv('dash.csv')
df = df[df.epitope.isin(['PA'])]
tr = TCRrep(cell_df=df, chains=['alpha','beta'], organism='mouse')
tr.tcrdist2(processes = 1,
			metric = 'hamming',
			reduce = True,
			dump = False,
			save = False)

assert np.all( tr.pw_tcrdist == tr.pw_alpha + tr.pw_beta)
In [2]: tr.pw_tcrdist
Out[2]:
array([[ 0, 45, 41, ..., 42, 37, 26],
       [45,  0, 49, ..., 46, 47, 45],
       [41, 49,  0, ..., 29, 28, 45],
       ...,
       [42, 46, 29, ...,  0, 28, 27],
       [37, 47, 28, ..., 28,  0, 39],
       [26, 45, 45, ..., 27, 39,  0]], dtype=int16)

In [3]: tr.pw_alpha
Out[3]:
array([[ 0, 25, 23, ..., 27, 22, 23],
       [25,  0, 28, ..., 27, 27, 27],
       [23, 28,  0, ..., 24, 23, 28],
       ...,
       [27, 27, 24, ...,  0, 26, 14],
       [22, 27, 23, ..., 26,  0, 25],
       [23, 27, 28, ..., 14, 25,  0]], dtype=int16)

In [4]: tr.pw_beta
Out[4]:
array([[ 0, 20, 18, ..., 15, 15,  3],
       [20,  0, 21, ..., 19, 20, 18],
       [18, 21,  0, ...,  5,  5, 17],
       ...,
       [15, 19,  5, ...,  0,  2, 13],
       [15, 20,  5, ...,  2,  0, 14],
       [ 3, 18, 17, ..., 13, 14,  0]], dtype=int16)

CDR-Specific Distance

CDR-specific tcr-distances.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
df = pd.read_csv('dash.csv')
df = df[df.epitope.isin(['PA'])]
tr = TCRrep(cell_df=df, chains=['alpha','beta'], organism='mouse')
tr.tcrdist2(processes = 1,
			metric = 'hamming',
			reduce = True,
			dump = False,
			save = False)

assert np.all( tr.pw_tcrdist == tr.pw_alpha + tr.pw_beta)
assert np.all( tr.pw_beta    == tr.cdr3_b_aa_pw + tr.pmhc_b_aa_pw + tr.cdr2_b_aa_pw + tr.cdr1_b_aa_pw)
assert np.all( tr.pw_alpha   == tr.cdr3_a_aa_pw + tr.pmhc_a_aa_pw + tr.cdr2_a_aa_pw + tr.cdr1_a_aa_pw)

Tip

CDR2.5 alpha and CDR2.5 beta, the pMHC-facing loop between CDR2 and CDR3, are referred to in tcrdist2 as pmhc_a and phmc_b, respectively.

In [2]: tr.cdr3_b_aa_pw
Out[2]:
array([[0, 7, 5, ..., 2, 2, 2],
       ...,
       [2, 6, 5, ..., 1, 2, 0]], dtype=int16)

In [3]: tr.pmhc_b_aa_pw
Out[3]:
array([[0, 4, 4, ..., 4, 4, 0],
       ...,
       [0, 4, 4, ..., 4, 4, 0]], dtype=int16)

In [4]: tr.cdr2_b_aa_pw
Out[4]:
array([[0, 6, 5, ..., 5, 5, 0],
       ...,
       [0, 6, 5, ..., 5, 5, 0]], dtype=int16)

In [5]: tr.cdr1_b_aa_pw
Out[5]:
array([[0, 3, 4, ..., 4, 4, 1],
       ...,
       [1, 2, 3, ..., 3, 3, 0]], dtype=int16)

Custom CDR Weights

Customize weight of CDR-specific contributions to the overall pw_tcrdist matrix. For instance, here a weight of 3 is applied to the CDR3 compared to other CDRs.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
df = pd.read_csv('dash.csv')
df = df[df.epitope.isin(['PA'])]
tr = TCRrep(cell_df=df, chains=['alpha','beta'], organism='mouse')
tr.tcrdist2(processes = 1,
			metric = 'hamming',
			reduce = True,
			dump = False,
			save = False)

assert np.all( tr.pw_tcrdist == tr.pw_alpha + tr.pw_beta)
assert np.all( tr.pw_beta    == tr.cdr3_b_aa_pw + tr.pmhc_b_aa_pw + tr.cdr2_b_aa_pw + tr.cdr1_b_aa_pw)
assert np.all( tr.pw_alpha   == tr.cdr3_a_aa_pw + tr.pmhc_a_aa_pw + tr.cdr2_a_aa_pw + tr.cdr1_a_aa_pw)

Tip

Check weightings used attribute TCRrep.tr.paired_tcrdist_weights. New weightings can be applied wiht replacment weights or by matrix addition.

Custom Distances

pwseqdist

tcrdist2 uses the pip installable package pwseqdist, to compute distances using parrallel processing.

In addition to the methods above, almost any metric comparing strings can be used. For instance, the package python-Levenshtein calculates the edit distance between strings which has become a popular way of comparing large sets of CDR3s.

1
2
3
4
5
6
7
8
import pwseqdist as pw 
from scipy.spatial.distance import squareform
import Levenshtein
dvec = pw.apply_pairwise_sq(seqs = ['homer', 'home', 'rome'], 
                               metric = Levenshtein.distance, 
                               ncpus = 1)
dmat = squareform(dvec)

In [2]: dmat
Out[2]:
array([[0, 1, 2],
       [1, 0, 1],
       [2, 1, 0]])

Levenshtein Distance

pwseqdist functionality can be applied to any string column in the TCRrep clone DataFrame with custom_dmat().

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
df = pd.read_csv('dash.csv')
df = df[df.epitope.isin(['PA'])]
tr = TCRrep(cell_df=df, chains=['alpha','beta'], organism='mouse')
tr.tcrdist2(processes = 1,
			metric = 'hamming',
			reduce = True,
			dump = False,
			save = False)

import Levenshtein
tr.custom_dmat(cdr = 'cdr3_b_aa', metric =Levenshtein.distance, processes = 1)
tr.add_custom_dmat(cdr = 'cdr3_b_aa', metric =Levenshtein.distance, processes = 1)
In [2]: tr.custom_dmat(cdr = 'cdr3_b_aa', metric =Levenshtein.distance, processes = 1)
Out[2]:
array([[0, 7, 5, ..., 2, 2, 2],
       [7, 0, 8, ..., 6, 7, 6],
       [5, 8, 0, ..., 5, 5, 5],
       ...,
       [2, 6, 5, ..., 0, 2, 1],
       [2, 7, 5, ..., 2, 0, 2],
       [2, 6, 5, ..., 1, 2, 0]])

Tip

Using add_custom_dmat() will automatically create a new attribute with the name custom_[selected_cdr]_pw. For instance,

In [2]: tr.add_custom_dmat(cdr = 'cdr3_b_aa', metric =Levenshtein.distance, processes = 1)

In [3]: tr.custom_cdr3_b_aa_pw
Out[3]:
array([[0, 7, 5, ..., 2, 2, 2],
       [7, 0, 8, ..., 6, 7, 6],
       [5, 8, 0, ..., 5, 5, 5],
       ...,
       [2, 6, 5, ..., 0, 2, 1],
       [2, 7, 5, ..., 2, 0, 2],
       [2, 6, 5, ..., 1, 2, 0]])

Levenshtein Distance on CDRs

With, add_custom_dmat() custom metrics can be combined to compute distances across multipe CDRs. For instance,

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
df = pd.read_csv('dash.csv')
df = df[df.epitope.isin(['PA'])]
tr = TCRrep(cell_df=df, chains=['alpha','beta'], organism='mouse')
tr.tcrdist2(processes = 1,
			metric = 'hamming',
			reduce = True,
			dump = False,
			save = False)

import Levenshtein
for cdr,w in {'cdr3_b_aa':3,'pmhc_b_aa':1,'cdr2_b_aa':1,'cdr1_b_aa':1}.items():
	tr.add_custom_dmat(cdr = cdr, metric =Levenshtein.distance, processes = 1)

tr.pw_levenshtein_tcrdist_beta = tr.custom_cdr3_b_aa_pw + \
                                 tr.custom_pmhc_b_aa_pw + \
                                 tr.custom_cdr2_b_aa_pw + \
                                 tr.custom_cdr1_b_aa_pw 
In [2]: tr.pw_levenshtein_tcrdist_beta
Out[2]:
array([[ 0, 20, 18, ..., 15, 15,  3],
       [20,  0, 21, ..., 19, 20, 18],
       [18, 21,  0, ...,  5,  5, 17],
       ...,
       [15, 19,  5, ...,  0,  2, 13],
       [15, 20,  5, ...,  2,  0, 14],
       [ 3, 18, 17, ..., 13, 14,  0]])

Saving

tcrdist2 uses the pip installable package zipdist, to compute to archive and reload Numpy and Pandas TCRrep attributes.

An archive can be created at the time tcrdist2() is called by setting tcrdist2.save to True.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
df = pd.read_csv('dash.csv')
df = df[df.epitope.isin(['PA'])]
tr = TCRrep(cell_df=df, chains=['alpha','beta'], organism='mouse')
tr.tcrdist2(processes = 1,
			metric = 'hamming',
			reduce = True,
			dump = False,
			save = True,
			dest = "some_archive",
			dest_tar_name = "some_archive.tar.gz")

An archive can be created at any stage using TCRrep.archive(). All JSON serializable, Numpy array, and Pandas DataFrame attributes are saved.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
df = pd.read_csv('dash.csv')
df = df[df.epitope.isin(['PA'])]
tr = TCRrep(cell_df=df, chains=['alpha','beta'], organism='mouse')
tr.tcrdist2(processes = 1,
			metric = 'hamming',
			reduce = True,
			dump = False,
			save = False)

tr.archive( dest = "some_archive", 
			dest_tar_name = "some_archive.tar.gz", 
			verbose = True, 
			use_csv = False)

Reloading

A TCRrep instance can be rebuilt from the .tar.gz archive file. For instance:

1
2
3
4
5
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
tr = TCRrep(cell_df=pd.DataFrame(), chains=['alpha','beta'], organism='mouse')
tr.rebuild(dest_tar_name = "some_archive.tar.gz")

All attributes can be accessed as before.

In [2]: tr.pw_tcrdist
Out[2]:
array([[ 0, 45, 41, ..., 42, 37, 26],
       [45,  0, 49, ..., 46, 47, 45],
       [41, 49,  0, ..., 29, 28, 45],
       ...,
       [42, 46, 29, ...,  0, 28, 27],
       [37, 47, 28, ..., 28,  0, 39],
       [26, 45, 45, ..., 27, 39,  0]], dtype=int16)