Clustering
tcrdist2 recognizes two general cluster attributes:
TCRrep.cluster_index
TCRrep.cluster_df
Two built-in methods are provided: TCRrep.simple_cluster_index(), which performs hierarchical clustering, and TCRrep.cluster_index_to_df(), which converts a cluster index into a DataFrame.
However, cluster attributes can also be generated by user-selected methods. Cluster membership need not be unique; that is, a clone may be in multiple clusters.
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
df = pd.read_csv('dash.csv')
df = df[df.epitope.isin(['PA'])]
tr = TCRrep(cell_df=df, chains=['alpha','beta'], organism='mouse')
tr.tcrdist2(processes = 1,
            metric = 'nw',
            reduce = True,
            dump = False,
            save = False)
tr.cluster_index = tr.simple_cluster_index(pw_distances = None,
                                           method = 'complete',
                                           criterion = "distance",
                                           t = 100)
assert len(tr.cluster_index) == tr.clone_df.shape[0]
tr.cluster_df = tr.cluster_index_to_df(tr.cluster_index)
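The method, criterion and t arguments in the call above share their names with SciPy's linkage and fcluster functions. Assuming they carry the same meaning here (an assumption, not something this page states), the sketch below shows what complete linkage with a distance cutoff does on a toy distance matrix. The function name complete_linkage_index and the toy values are hypothetical, and this is a plain-Python illustration rather than the tcrdist2 implementation.

```python
def complete_linkage_index(distances, t):
    """Flat cluster labels from complete linkage, cut at distance t."""
    clusters = [[i] for i in range(len(distances))]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(distances[i][j] for i in a for j in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters under complete linkage.
        d, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > t:
            break  # no remaining pair can be merged under the threshold
        clusters[x] += clusters[y]
        del clusters[y]

    # One label per clone, numbered from 1 in order of each cluster's first member.
    index = [0] * len(distances)
    for label, members in enumerate(clusters, start=1):
        for clone in members:
            index[clone] = label
    return index


# Hypothetical pairwise distances for four clones.
toy_distances = [
    [0, 20, 150, 160],
    [20, 0, 140, 170],
    [150, 140, 0, 30],
    [160, 170, 30, 0],
]
toy_index = complete_linkage_index(toy_distances, t=100)
```

With t = 100 the two close pairs merge, but the merged groups stay apart because their farthest members are more than 100 apart. Label numbering in this sketch need not match the numbering a library assigns.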
In [2]: tr.cluster_index
Out[2]:
array([104, 135, 76, 64, 72, 57, 64, 71, 62, 81, 84, 104, 122,
.......
141, 56, 113, 92, 94, 120, 77, 13, 56, 52, 58, 4],
dtype=int32)
In [3]: tr.cluster_df
Out[3]:
cluster_id neighbors K_neighbors
3 4 [16, 25, 26, 29, 32, 50, 61, 68, 69, 94, 103, ... 24
91 92 [35, 38, 41, 105, 131, 146, 181, 186, 189, 206... 18
80 81 [9, 13, 70, 74, 81, 85, 104, 106, 133, 148, 21... 17
55 56 [18, 22, 42, 91, 98, 187, 191, 195, 217, 231, ... 15
93 94 [15, 77, 123, 144, 173, 175, 205, 212, 259, 29... 11
57 58 [83, 159, 203, 220, 237, 243, 249, 261, 300, 3... 11
103 104 [0, 11, 27, 46, 65, 67, 110, 162, 170, 275] 10
76 77 [124, 160, 185, 227, 230, 306, 318] 7
78 79 [78, 90, 194, 199, 251, 270, 280] 7
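The relationship between the two attributes can be sketched without tcrdist2. The following is a minimal illustration, not the library's implementation: the helper names index_to_rows and radius_neighbors and the toy inputs are hypothetical. The first helper groups clone positions by cluster id into rows shaped like the table above (cluster_id, neighbors, K_neighbors). The second builds radius-based neighborhoods, one example of a user-selected method in which a clone can belong to more than one cluster.

```python
from collections import defaultdict


def index_to_rows(cluster_index):
    """Group clone positions by cluster id; largest clusters first."""
    members = defaultdict(list)
    for position, cluster_id in enumerate(cluster_index):
        members[cluster_id].append(position)
    rows = [
        {"cluster_id": cid, "neighbors": pos, "K_neighbors": len(pos)}
        for cid, pos in members.items()
    ]
    # Sort by descending size, then by cluster id for a stable order.
    rows.sort(key=lambda r: (-r["K_neighbors"], r["cluster_id"]))
    return rows


def radius_neighbors(distances, radius):
    """One neighborhood per clone: every clone within `radius`, self included."""
    return [
        [j for j, d in enumerate(row) if d <= radius]
        for row in distances
    ]


# Hypothetical flat cluster index: one label per clone.
toy_index = [2, 1, 2, 3, 1, 2]
rows = index_to_rows(toy_index)

# Hypothetical pairwise distances for three clones.
toy_distances = [
    [0, 10, 90],
    [10, 0, 40],
    [90, 40, 0],
]
overlapping = radius_neighbors(toy_distances, radius=50)
```

In the second result, clone 1 appears in all three neighborhoods, which is the non-unique membership described earlier. Rows of this shape could be passed to pd.DataFrame to obtain a table with the same columns.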
Sampling
tcrdist2 uses the pip-installable package tcrsampler to sample CDR3s from a user-specified background.
An example background dataset can be downloaded here: britanova_chord_blood.csv.
The tcrsampler can be used to get V-gene, J-gene, or joint V-J gene frequency estimates.
import pandas as pd
import os
from tcrsampler.sampler import TCRsampler
# fn = 'britanova_chord_blood.csv' # real file
fn = os.path.join('tcrdist','test_files', 'britanova_chord_blood_sample_5000.csv') # test_only file
t = TCRsampler()
t.ref_df = pd.read_csv(fn)
t.build_background()
t.v_freq
t.j_freq
t.vj_freq
t.sample_background(v ='TRBV10-1*01', j ='TRBJ1-1*01',n=3, depth = 1, seed =1, use_frequency= True )
In [2]: t.v_freq
Out[2]:
{'TRBV10-1*01': 0.001986268228829416,
.....
'TRBV7-8*01': 0.01020904355302685,
'TRBV7-9*01': 0.03903704172001506,
'TRBV9*01': 0.020669023478171192}
In [3]: t.j_freq
Out[3]:
{'TRBJ1-1*01': 0.056913580882579605,
.....
'TRBJ2-6*01': 0.01458019406497144,
'TRBJ2-7*01': 0.22612204954887244}
In [4]: t.vj_freq
Out[4]: {....
('TRBV28*01', 'TRBJ1-5*01'): 0.004184454053794057,
('TRBV28*01', 'TRBJ1-6*01'): 0.002038884189985508,
('TRBV28*01', 'TRBJ2-1*01'): 0.011147572827334612,
('TRBV28*01', 'TRBJ2-2*01'): 0.001618765348999049}
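To make the shape of these dictionaries concrete, here is a minimal sketch of how marginal and joint gene frequencies can be estimated from a background table. It is not tcrsampler's implementation: the frequencies helper and the four toy rows are hypothetical, and each row is counted once, whereas a real background may be weighted by clone counts.

```python
from collections import Counter


def frequencies(keys):
    """Relative frequency of each key; the values sum to 1."""
    counts = Counter(keys)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


# Hypothetical background: one (V gene, J gene) pair per row.
background = [
    ("TRBV10-1*01", "TRBJ1-1*01"),
    ("TRBV28*01", "TRBJ1-5*01"),
    ("TRBV28*01", "TRBJ2-1*01"),
    ("TRBV28*01", "TRBJ2-1*01"),
]

v_freq = frequencies(v for v, _ in background)   # keyed by V gene
j_freq = frequencies(j for _, j in background)   # keyed by J gene
vj_freq = frequencies(background)                # keyed by (V, J) tuple
```

As in the output above, the marginal estimates are keyed by a gene name and the joint estimate is keyed by a (V, J) tuple.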
Moreover, tcrsampler can return CDR3s based on specified V-J gene usage.
In [7]: t.sample_background(v ='TRBV10-1*01', j ='TRBJ1-1*01',n=3, depth = 1, seed =1, use_frequency= True )
...:
Out[7]: ['CASSPRGDTEAFF', 'CASSEGATEAFF', 'CASSPRGDTEAFF']
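The repeated CDR3 in the output above is consistent with sampling with replacement. The sketch below illustrates that idea under stated assumptions and is not tcrsampler's implementation: the sample_cdr3 function, the count column, and its values are hypothetical, the third row is made up, and the depth argument of sample_background is not reproduced because its behavior is not described here.

```python
import random


def sample_cdr3(background, v, j, n, seed=None, use_frequency=True):
    """Draw n CDR3s, with replacement, from rows matching the V and J genes."""
    matches = [row for row in background if row["v"] == v and row["j"] == j]
    if not matches:
        return []
    # Weight each row by its count, or sample uniformly when disabled.
    weights = [row["count"] for row in matches] if use_frequency else None
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    picks = rng.choices(matches, weights=weights, k=n)
    return [row["cdr3"] for row in picks]


# Hypothetical background rows; counts are invented for illustration.
toy_background = [
    {"v": "TRBV10-1*01", "j": "TRBJ1-1*01", "cdr3": "CASSPRGDTEAFF", "count": 5},
    {"v": "TRBV10-1*01", "j": "TRBJ1-1*01", "cdr3": "CASSEGATEAFF", "count": 2},
    {"v": "TRBV28*01", "j": "TRBJ2-1*01", "cdr3": "CASSLGQNEQFF", "count": 9},
]

picked = sample_cdr3(toy_background, v="TRBV10-1*01", j="TRBJ1-1*01", n=3, seed=1)
```

Every returned CDR3 comes from a row with the requested V and J genes, and repeating the call with the same seed returns the same list.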