Clustering

tcrdist2 recognizes two general cluster attributes:

  1. TCRrep.cluster_index
  2. TCRrep.cluster_df

Built-in hierarchical clustering is provided by the TCRrep.simple_cluster_index() and TCRrep.cluster_index_to_df() methods. However, cluster attributes can also be generated by user-selected methods. Cluster membership need not be unique; that is, a clone may belong to multiple clusters.
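To illustrate non-unique membership, here is a minimal sketch of one possible user-selected method. It is not part of tcrdist2: each clone seeds a cluster containing every clone within a fixed radius, so a single clone can appear in several neighbor lists. The toy distance matrix, the radius value, and the helper name radius_clusters are all hypothetical; the output keys only mirror the cluster_id, neighbors, and K_neighbors columns of the example output in this section.

```python
def radius_clusters(pw, radius):
    """Return one record per clone listing every clone within `radius` of it.

    `pw` is a square pairwise distance matrix given as a list of lists.
    This is a hypothetical helper, not a tcrdist2 function.
    """
    records = []
    for i, row in enumerate(pw):
        # A clone is always its own neighbor because its self-distance is 0.
        neighbors = [j for j, d in enumerate(row) if d <= radius]
        records.append(
            {"cluster_id": i, "neighbors": neighbors, "K_neighbors": len(neighbors)}
        )
    return records


# Toy pairwise distances between four clones (made-up values).
pw = [
    [0, 40, 90, 300],
    [40, 0, 45, 300],
    [90, 45, 0, 300],
    [300, 300, 300, 0],
]

clusters = radius_clusters(pw, radius=50)
for record in clusters:
    print(record)
```

With these toy values, clone 1 lies within the radius of clones 0 and 2 as well as itself, so it is listed in three clusters, which a single flat cluster index could not express.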

import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep

# Load the example data and keep only rows for the PA epitope
df = pd.read_csv('dash.csv')
df = df[df.epitope.isin(['PA'])]
tr = TCRrep(cell_df=df, chains=['alpha', 'beta'], organism='mouse')

# Compute pairwise distances between clones
tr.tcrdist2(processes=1,
            metric='nw',
            reduce=True,
            dump=False,
            save=False)

# Assign one cluster id per clone by hierarchical clustering
tr.cluster_index = tr.simple_cluster_index(pw_distances=None,
                                           method='complete',
                                           criterion='distance',
                                           t=100)
assert len(tr.cluster_index) == tr.clone_df.shape[0]

# Summarize the clusters as a DataFrame
tr.cluster_df = tr.cluster_index_to_df(tr.cluster_index)
In [2]: tr.cluster_index
Out[2]:
array([104, 135,  76,  64,  72,  57,  64,  71,  62,  81,  84, 104, 122,
        .......
       141,  56, 113,  92,  94, 120,  77,  13,  56,  52,  58,   4],
      dtype=int32)
In [3]: tr.cluster_df
Out[3]:
     cluster_id                                          neighbors  K_neighbors
3             4  [16, 25, 26, 29, 32, 50, 61, 68, 69, 94, 103, ...           24
91           92  [35, 38, 41, 105, 131, 146, 181, 186, 189, 206...           18
80           81  [9, 13, 70, 74, 81, 85, 104, 106, 133, 148, 21...           17
55           56  [18, 22, 42, 91, 98, 187, 191, 195, 217, 231, ...           15
93           94  [15, 77, 123, 144, 173, 175, 205, 212, 259, 29...           11
57           58  [83, 159, 203, 220, 237, 243, 249, 261, 300, 3...           11
103         104        [0, 11, 27, 46, 65, 67, 110, 162, 170, 275]           10
76           77                [124, 160, 185, 227, 230, 306, 318]            7
78           79                  [78, 90, 194, 199, 251, 270, 280]            7
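
The method, criterion, and t arguments in the example above share their names with parameters of SciPy's linkage and fcluster functions. The following sketch assumes, without confirmation from this page, that they carry the same meaning, and shows how a flat cluster index and a summary table of the same shape could be produced directly with SciPy. The toy distance matrix is made up, and this should not be read as tcrdist2's implementation.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy symmetric pairwise distance matrix for five clones (made-up values).
pw = np.array(
    [
        [0, 10, 12, 200, 210],
        [10, 0, 8, 190, 205],
        [12, 8, 0, 195, 220],
        [200, 190, 195, 0, 15],
        [210, 205, 220, 15, 0],
    ],
    dtype=float,
)

# linkage() expects a condensed (upper-triangle) distance vector.
condensed = squareform(pw)
Z = linkage(condensed, method="complete")

# Cut the tree so that no cluster has a complete-linkage distance above t.
cluster_index = fcluster(Z, t=100, criterion="distance")
assert len(cluster_index) == pw.shape[0]

# Summarize: one row per cluster, listing its member clones and their number.
cluster_df = (
    pd.DataFrame({"cluster_id": cluster_index, "clone": np.arange(len(cluster_index))})
    .groupby("cluster_id")["clone"]
    .apply(list)
    .reset_index(name="neighbors")
)
cluster_df["K_neighbors"] = cluster_df["neighbors"].apply(len)
cluster_df = cluster_df.sort_values("K_neighbors", ascending=False)
print(cluster_df)
```

Under complete linkage with a distance cutoff, every pair of clones within a cluster is at most t apart, so lowering t yields more, tighter clusters.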

Sampling

tcrdist2 uses the pip-installable package tcrsampler to sample CDR3s from a user-specified background.

An example background dataset can be downloaded here: britanova_chord_blood.csv.

The tcrsampler can be used to get V-gene, J-gene, or joint V-J gene frequency estimates.

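import os
import pandas as pd
from tcrsampler.sampler import TCRsampler

# fn = 'britanova_chord_blood.csv'  # real file
fn = os.path.join('tcrdist', 'test_files', 'britanova_chord_blood_sample_5000.csv')  # test-only file
t = TCRsampler()
t.ref_df = pd.read_csv(fn)
t.build_background()

# Frequency estimates: dictionaries keyed by gene name or by (V, J) pair
t.v_freq   # V-gene frequencies
t.j_freq   # J-gene frequencies
t.vj_freq  # joint V-J gene frequencies

# Sample three CDR3s from the background for the given V and J genes
t.sample_background(v='TRBV10-1*01', j='TRBJ1-1*01', n=3, depth=1, seed=1, use_frequency=True)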
In [2]: t.v_freq
Out[2]:
{'TRBV10-1*01': 0.001986268228829416,
        .....
 'TRBV7-8*01': 0.01020904355302685,
 'TRBV7-9*01': 0.03903704172001506,
 'TRBV9*01': 0.020669023478171192}
In [3]: t.j_freq
Out[3]:
{'TRBJ1-1*01': 0.056913580882579605,
        .....
 'TRBJ2-6*01': 0.01458019406497144,
 'TRBJ2-7*01': 0.22612204954887244}
In [4]: t.vj_freq
Out[4]: {....
 ('TRBV28*01', 'TRBJ1-5*01'): 0.004184454053794057,
 ('TRBV28*01', 'TRBJ1-6*01'): 0.002038884189985508,
 ('TRBV28*01', 'TRBJ2-1*01'): 0.011147572827334612,
 ('TRBV28*01', 'TRBJ2-2*01'): 0.001618765348999049}
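
As a rough illustration of what such frequency estimates represent, the sketch below computes V, J, and joint V-J frequencies from a small table using only the standard library. It is not tcrsampler's implementation: the rows and counts are made up, the helper name gene_frequencies is hypothetical, and weighting by clone count (rather than by unique clonotype) is an assumption this page does not confirm.

```python
from collections import Counter

# Hypothetical background rows: (v_gene, j_gene, clone_count).
rows = [
    ("TRBV10-1*01", "TRBJ1-1*01", 2),
    ("TRBV28*01", "TRBJ1-5*01", 5),
    ("TRBV28*01", "TRBJ2-1*01", 3),
]


def gene_frequencies(rows):
    """Return (v_freq, j_freq, vj_freq) dictionaries that each sum to 1."""
    v_counts, j_counts, vj_counts = Counter(), Counter(), Counter()
    for v, j, n in rows:
        v_counts[v] += n
        j_counts[j] += n
        vj_counts[(v, j)] += n
    total = sum(n for _, _, n in rows)

    def normalize(counts):
        # Convert raw counts into fractions of the total count.
        return {key: count / total for key, count in counts.items()}

    return normalize(v_counts), normalize(j_counts), normalize(vj_counts)


v_freq, j_freq, vj_freq = gene_frequencies(rows)
print(v_freq)
print(vj_freq)
```

Note that the joint dictionary is keyed by (V, J) tuples, matching the shape of the t.vj_freq output shown above.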

Moreover, tcrsampler can return CDR3s based on specified V-J gene usage.

In [7]: t.sample_background(v ='TRBV10-1*01', j ='TRBJ1-1*01',n=3, depth = 1, seed =1, use_frequency= True )
   ...:
Out[7]: ['CASSPRGDTEAFF', 'CASSEGATEAFF', 'CASSPRGDTEAFF']
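
Conceptually, sampling by V-J usage amounts to filtering the background to the requested gene pair and drawing CDR3s from what remains. The sketch below shows that idea with the standard library only. It is an assumption-laden illustration rather than tcrsampler's code: the background rows, the counts, and the helper name sample_cdr3s are hypothetical, use_frequency is assumed to mean "weight draws by clone count", and the depth parameter is left out because its meaning is not described here.

```python
import random

# Hypothetical background rows: (v_gene, j_gene, cdr3, clone_count).
background = [
    ("TRBV10-1*01", "TRBJ1-1*01", "CASSPRGDTEAFF", 3),
    ("TRBV10-1*01", "TRBJ1-1*01", "CASSEGATEAFF", 1),
    ("TRBV28*01", "TRBJ2-1*01", "CASSLGQGNEQFF", 4),
]


def sample_cdr3s(background, v, j, n, seed=None, use_frequency=True):
    """Draw `n` CDR3s (with replacement) from rows matching the V and J genes."""
    pool = [(cdr3, count) for bv, bj, cdr3, count in background if bv == v and bj == j]
    if not pool:
        return []  # no background sequences for this V-J pair
    cdr3s = [cdr3 for cdr3, _ in pool]
    # Weight by clone count when use_frequency is True, else draw uniformly.
    weights = [count for _, count in pool] if use_frequency else None
    rng = random.Random(seed)  # a local generator keeps draws reproducible
    return rng.choices(cdr3s, weights=weights, k=n)


draws = sample_cdr3s(background, "TRBV10-1*01", "TRBJ1-1*01", n=3, seed=1)
print(draws)
```

Because draws are made with replacement, the same CDR3 can appear more than once in the result, as it does in the output above.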