I have WES data of 235 samples, aligned using bwa aln. I followed GATK's Best Pratice for marking duplicates, realignment around INDEL, and BQSR until I got the recalibrated bamfiles. All these are done with GATK 2.7-2.
After that I generated gvcf file from the recalibrated bamfiles, using HaplotypeCaller from GATK 3.1-1, followed by GenotypeGVCFs and VQSR.
I came across one INDEL, which passes VQSR,
The variant is
chr14 92537378 .
G GC 11330.02 PASS
AC=11;AF=0.023;AN=470;BaseQRankSum=0.991;ClippingRankSum=-6.820e-01;DP=
13932;FS=0.000;GQ_MEAN=137.20;GQ_STDDEV=214.47;InbreedingCoeff=-0.0311;MLEAC=11;MLEAF=0.023;MQ=43.67;MQ0=0;MQRankSum=-1.073e+00;NCC=0;QD=18.16;
ReadPosRankSum=-1.760e-01;VQSLOD=3.70;culprit=FS
Indivudual genotypes seem very good. To name a few:
1800 0/0:65,0:65:99:0,108,1620 0/0:86,0:86:99:0,120,1800 0/1:23,25:48:99:1073,0,1971 0/1:51,39:90:99:1505,0,4106
But when I checked the variants in IGV, something strange popped out.
----------------------------------------------------------------------------------------------------------------------
These are samples heterozygous for this site
----------------------------------------------------------------------------------------------------------------------
Sample 1
Info in gvcf file:
chr14 92537350 .
T <NON_REF> .
. END=92537350 GT:DP:GQ:MIN_DP:PL 0/0:85:99:85:0,120,1800
chr14 92537351 .
C <NON_REF> .
. END=92537351 GT:DP:GQ:MIN_DP:PL 0/0:86:0:86:0,0,2081
chr14
92537352 . C
<NON_REF> . .
END=92537352
GT:DP:GQ:MIN_DP:PL
0/0:85:99:85:0,120,1800
chr14
92537353 . C
<NON_REF> . .
END=92537353 GT:DP:GQ:MIN_DP:PL
0/0:85:0:85:0,0,2035
chr14
92537354 . C
<NON_REF> . .
END=92537377
GT:DP:GQ:MIN_DP:PL
0/0:81:99:77:0,96,1440
chr14 92537378 .
G GC,<NON_REF> 1553.73 . BaseQRankSum=0.214;ClippingRankSum=2.058;DP=81;MLEAC=1,0;MLEAF=0.500,0.00;MQ=48.68;MQ0=0;MQRankSum=-5.451;ReadPosRankSum=-0.470
GT:AD:DP:GQ:PL:SB
0/1:36,45,0:81:99:1591,0,4674,1715,4811,6527:15,21,18,27
chr14
92537379 . T
TGCTGCTGCTGCTGC,<NON_REF>
1553.73 .
BaseQRankSum=1.705;ClippingRankSum=2.004;DP=82;MLEAC=1,0;MLEAF=0.500,0.00;MQ=48.84;MQ0=0;MQRankSum=-5.041;ReadPosRankSum=-0.612
GT:AD:DP:GQ:PL:SB
0/1:36,46,0:82:99:1591,0,4674,1715,4811,6527:15,21,18,28
chr14
92537380 . T
<NON_REF> . .
END=92537386
GT:DP:GQ:MIN_DP:PL
0/0:80:99:77:0,120,1800
chr14
92537387 . T
<NON_REF> . .
END=92537388
GT:DP:GQ:MIN_DP:PL
0/0:70:0:68:0,0,1641
chr14
92537389 . T
<NON_REF> . .
END=92537396
GT:DP:GQ:MIN_DP:PL
0/0:72:99:71:0,120,1800
chr14
92537397 . T
<NON_REF> . .
END=92537398
GT:DP:GQ:MIN_DP:PL
0/0:65:0:63:0,0,1790
chr14
92537399 . T
<NON_REF> . .
END=92537399
GT:DP:GQ:MIN_DP:PL
0/0:64:7:64:0,8,2005
chr14
92537400 . G
<NON_REF> . .
END=92537403
GT:DP:GQ:MIN_DP:PL
0/0:63:0:62:0,0,1794
chr14
92537404 . C
<NON_REF> . .
END=92537405
GT:DP:GQ:MIN_DP:PL
0/0:60:36:60:0,24,1916
chr14
92537406 . T
<NON_REF> . .
END=92537472
GT:DP:GQ:MIN_DP:PL 0/0:37:93:22:0,60,900
Sample2
Info in gvcf file:
chr14 92537350 .
T <NON_REF> .
. END=92537350 GT:DP:GQ:MIN_DP:PL 0/0:68:99:68:0,120,1800
chr14 92537351 .
C <NON_REF> .
. END=92537351 GT:DP:GQ:MIN_DP:PL 0/0:69:0:69:0,0,1826
chr14
92537352 . C
<NON_REF> . .
END=92537352
GT:DP:GQ:MIN_DP:PL
0/0:68:99:68:0,99,1485
chr14
92537353 . C
CG,<NON_REF> 1143.73
.
BaseQRankSum=-1.389;ClippingRankSum=1.389;DP=65;MLEAC=1,0;MLEAF=0.500,0
.00;MQ=50.53;MQ0=0;MQRankSum=-0.459;ReadPosRankSum=0.189
GT:AD:DP:GQ:PL:SB
0/1:29,35,0:64:99:1181,0,2224,1299,2338,3637:15,14,16,19
chr14
92537354 . C
CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,<NON_REF> 1143.73 . BaseQRankSum=-2.657;ClippingRankSum=1.5
51;DP=64;MLEAC=1,0;MLEAF=0.500,0.00;MQ=50.37;MQ0=0;MQRankSum=-0.081;ReadPosRankSum=0.162
GT:AD:DP:GQ:PL:SB
0/1:29,35,0:64:99:1181,0,2224,
1299,2338,3637:15,14,16,19
chr14
92537355 . C
<NON_REF> . .
END=92537355
GT:DP:GQ:MIN_DP:PL
0/0:48:0:48:0,0,191
chr14
92537356 . T
<NON_REF> . .
END=92537377
GT:DP:GQ:MIN_DP:PL
0/0:44:45:41:0,36,540
chr14 92537378 .
G GC,<NON_REF> 695.73
.
BaseQRankSum=-0.290;ClippingRankSum=-0.961;DP=46;MLEAC=1,0;MLEAF=0.500,0.00;MQ=46.98;MQ0=0;MQRankSum=-0.243;ReadPosRankSum=-0.707 GT:AD:DP:GQ:PL:SB
0/1:27,18,0:45:99:733,0,2931,874,3003,3877:5,22,6,12
chr14
92537379 . T
TGCTGCTGCTGCTGC,<NON_REF>
695.73 . BaseQRankSum=1.796;ClippingRankSum=-1.077;DP=45;MLEAC=1,0;MLEAF=0.500,0.00;MQ=46.65;MQ0=0;MQRankSum=-0.313;ReadPosRankSum=-0.846 GT:AD:DP:GQ:PL:SB
0/1:27,18,0:45:99:733,0,2931,874,3003,3877:5,22,6,12
chr14
92537380 . T
<NON_REF> . .
END=92537386
GT:DP:GQ:MIN_DP:PL
0/0:42:78:40:0,78,1170
chr14
92537387 . T
<NON_REF> . .
END=92537388
GT:DP:GQ:MIN_DP:PL
0/0:38:0:37:0,0,353
chr14
92537389 . T
<NON_REF> . .
END=92537389
GT:DP:GQ:MIN_DP:PL
0/0:43:84:43:0,84,1260
chr14
92537390 . G
<NON_REF> . .
END=92537391
GT:DP:GQ:MIN_DP:PL
0/0:44:0:43:0,0,1288
chr14
92537392 . T
<NON_REF> .
. END=92537396 GT:DP:GQ:MIN_DP:PL 0/0:43:96:42:0,90,1350
chr14
92537397 . T
<NON_REF> . .
END=92537406
GT:DP:GQ:MIN_DP:PL
0/0:37:0:36:0,0,791
chr14 92537407 .
T <NON_REF> .
. END=92537407 GT:DP:GQ:MIN_DP:PL 0/0:38:87:38:0,87,1305
----------------------------------------------------------------------------------------------------------------------
These are samples homozygous (reference) for this site
----------------------------------------------------------------------------------------------------------------------
Sample 3
Info in gvcf file:
chr14
92537340 . T
<NON_REF> . .
END=92537347
GT:DP:GQ:MIN_DP:PL
0/0:58:0:56:0,0,1291
chr14
92537348 . G
<NON_REF> .
. END=92537348 GT:DP:GQ:MIN_DP:PL 0/0:58:99:58:0,120,1800
chr14
92537349 . G
<NON_REF> . .
END=92537349
GT:DP:GQ:MIN_DP:PL
0/0:57:0:57:0,0,1362
chr14
92537350 . T
<NON_REF> .
. END=92537350 GT:DP:GQ:MIN_DP:PL 0/0:59:99:59:0,120,1800
chr14
92537351 . C
<NON_REF> . .
END=92537351
GT:DP:GQ:MIN_DP:PL
0/0:58:0:58:0,0,1395
chr14
92537352 . C
<NON_REF> . .
END=92537352
GT:DP:GQ:MIN_DP:PL
0/0:62:99:62:0,117,1755
chr14
92537353 . C
CG,<NON_REF> 968.73 .
BaseQRankSum=-0.691;ClippingRankSum=1.176;DP=63;MLEAC=1,0;MLEAF=0.500,0
.00;MQ=51.17;MQ0=0;MQRankSum=-3.666;ReadPosRankSum=0.008
GT:AD:DP:GQ:PL:SB
0/1:28,31,0:59:99:1006,0,4320,1114,4412,5526:12,16,13,18
chr14
92537354 . C
CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,<NON_REF> 968.73
. BaseQRankSum=-1.405;ClippingRan
kSum=2.115;DP=60;MLEAC=1,0;MLEAF=0.500,0.00;MQ=51.05;MQ0=0;MQRankSum=-4.112;ReadPosRankSum=0.192
GT:AD:DP:GQ:PL:SB
0/1:29,31,0:60:99:1006
,0,4320,1114,4412,5526:12,17,13,18
chr14
92537355 . C
<NON_REF> .
. END=92537378 GT:DP:GQ:MIN_DP:PL 0/0:55:99:50:0,105,1575
chr14
92537379 . T
<NON_REF> . .
END=92537379
GT:DP:GQ:MIN_DP:PL
0/0:43:0:43:0,0,1041
chr14
92537380 . T
<NON_REF> . .
END=92537386 GT:DP:GQ:MIN_DP:PL 0/0:45:99:45:0,99,1485
chr14
92537387 . T
<NON_REF> . .
END=92537388
GT:DP:GQ:MIN_DP:PL
0/0:41:0:41:0,0,1042
chr14
92537389 . T
<NON_REF> . .
END=92537396
GT:DP:GQ:MIN_DP:PL
0/0:42:99:39:0,96,1440
chr14
92537397 . T
<NON_REF> . .
END=92537404
GT:DP:GQ:MIN_DP:PL
0/0:37:0:35:0,0,1026
chr14
92537405 . A
<NON_REF> . .
END=92537406
GT:DP:GQ:MIN_DP:PL
0/0:34:25:34:0,24,1070
chr14
92537407 . T
<NON_REF> . .
END=92537407
GT:DP:GQ:MIN_DP:PL
0/0:32:78:32:0,78,1170
chr14
92537408 . C
<NON_REF> . .
END=92537412
GT:DP:GQ:MIN_DP:PL
0/0:31:55:30:0,52,1046
chr14
92537413 . G
<NON_REF> . .
END=92537415
GT:DP:GQ:MIN_DP:PL
0/0:26:60:25:0,60,900
chr14
92537416 . A
<NON_REF> . .
END=92537417
GT:DP:GQ:MIN_DP:PL
0/0:26:57:26:0,57,855
Sample 4
Info in gvcf file:
chr14
92537337 . C
<NON_REF> . .
END=92537339
GT:DP:GQ:MIN_DP:PL
0/0:91:99:90:0,120,1800
chr14
92537340 . T
<NON_REF> . .
END=92537347
GT:DP:GQ:MIN_DP:PL
0/0:87:0:84:0,0,1310
chr14
92537348 . G
<NON_REF> . .
END=92537348
GT:DP:GQ:MIN_DP:PL
0/0:87:99:87:0,120,1800
chr14
92537349 . G
<NON_REF> . .
END=92537349
GT:DP:GQ:MIN_DP:PL
0/0:82:0:82:0,0,1198
chr14
92537350 . T
<NON_REF> . .
END=92537350
GT:DP:GQ:MIN_DP:PL 0/0:88:99:88:0,120,1800
chr14
92537351 . C
<NON_REF> . .
END=92537351
GT:DP:GQ:MIN_DP:PL
0/0:82:0:82:0,0,1172
chr14
92537352 . C
<NON_REF> . .
END=92537352 GT:DP:GQ:MIN_DP:PL 0/0:90:99:90:0,120,1800
chr14
92537353 . C
<NON_REF> . .
END=92537353
GT:DP:GQ:MIN_DP:PL
0/0:83:0:83:0,0,1118
chr14
92537354 . C
G,CCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,<NON_REF> 2183.19 . DP=81;MLEAC=1,1,0;MLEAF=0.500,0.500,0.0
0;MQ=45.91;MQ0=0 GT:AD:DP:GQ:PL:SB
1/2:0,25,45,0:70:99:2479,1775,7710,649,0,522,2338,1778,644,2315:0,0,0,0
chr14
92537355 . C
<NON_REF> . .
END=92537355 GT:DP:GQ:MIN_DP:PL 0/0:76:0:76:0,0,0
chr14
92537356 . T
<NON_REF> . .
END=92537378 GT:DP:GQ:MIN_DP:PL 0/0:58:99:53:0,111,1665
chr14
92537379 . T
C,<NON_REF> 471.77 .
BaseQRankSum=-2.228;ClippingRankSum=2.141;DP=70;MLEAC=1,0;MLEAF=0.500,0.00;MQ=45.42;MQ0=0;MQRankSum=-0.123;ReadPosRankSum=-0.877 GT:AD:DP:GQ:PL:SB 0/1:61,9,0:70:99:500,0,7741,730,7813,8543:9,52,7,2
chr14
92537380 . T
<NON_REF> . .
END=92537386
GT:DP:GQ:MIN_DP:PL
0/0:54:99:52:0,120,1800
chr14
92537387 . T
G,<NON_REF> 471.77 .
BaseQRankSum=-3.047;ClippingRankSum=2.269;DP=62;MLEAC=1,0;MLEAF=0.500,0.00;MQ=44.20;MQ0=0;MQRankSum=-1.624;ReadPosRankSum=-2.002 GT:AD:DP:GQ:PL:SB 0/1:55,7,0:62:99:500,0,7741,730,7813,8543:4,51,5,2
chr14
92537388 . T
C,<NON_REF> 471.77 . BaseQRankSum=-2.740;ClippingRankSum=2.395;DP=60;MLEAC=1,0;MLEAF=0.500,0.00;MQ=43.57;MQ0=0;MQRankSum=-0.875;ReadPosRankSum=-2.418 GT:AD:DP:GQ:PL:SB
0/1:53,7,0:60:99:500,0,7741,730,7813,8543:3,50,5,2
chr14
92537389 . T
<NON_REF> .
. END=92537395 GT:DP:GQ:MIN_DP:PL 0/0:49:99:48:0,114,1710
chr14
92537396 . G
GC,<NON_REF> 462.73 .
BaseQRankSum=-2.359;ClippingRankSum=1.234;DP=57;MLEAC=1,0;MLEAF=0.500,0.00;MQ=43.66;MQ0=0;MQRankSum=-1.078;ReadPosRankSum=-2.171 GT:AD:DP:GQ:PL:SB
0/1:53,4,0:57:99:500,0,7741,730,7813,8543:2,51,3,1
chr14
92537397 . T
TGCTGCTGCTG,<NON_REF>
462.73 .
BaseQRankSum=-1.640;ClippingRankSum=0.984;DP=57;MLEAC=1,0;MLEAF=0.500,0.00;MQ=43.66;MQ0=0;MQRankSum=-1.421;ReadPosRankSum=-2.171 GT:AD:DP:GQ:PL:SB
0/1:53,4,0:57:99:500,0,7741,730,7813,8543:2,51,3,1
chr14
92537398 . C
<NON_REF> . .
END=92537406 GT:DP:GQ:MIN_DP:PL 0/0:54:0:52:0,0,906
chr14
92537407 . T
<NON_REF> . .
END=92537407
GT:DP:GQ:MIN_DP:PL
0/0:61:99:61:0,120,1800
chr14
92537408 . C
<NON_REF> . .
END=92537418 GT:DP:GQ:MIN_DP:PL 0/0:51:0:46:0,0,1128
chr14
92537419 . T
<NON_REF> . .
END=92537419
GT:DP:GQ:MIN_DP:PL
0/0:51:99:51:0,99,1485
chr14
92537420 . A
<NON_REF> . .
END=92537420
GT:DP:GQ:MIN_DP:PL
0/0:43:0:43:0,0,1262
chr14
92537421 . T
<NON_REF> . .
END=92537421
GT:DP:GQ:MIN_DP:PL
0/0:45:14:45:0,15,1344
chr14
92537422 . A
<NON_REF> . .
END=92537424
GT:DP:GQ:MIN_DP:PL
0/0:41:0:40:0,0,1133
----------------------------------------------------------------------------------------------------------------------
Problem
----------------------------------------------------------------------------------------------------------------------
Judging from the alignment visualized by IGV, there are no insertions at that site, but an SNP in the next position, in all samples. But
HaplotypeCaller calls G->GC insertion in some samples while not in other samples.
My questions are:
1) In variant calling HaplotypeCaller would do a local de-novo assembly in some region. Is the inconsistency between bamfiles and gvcf because of the re-assembly?
2) Why is the insertion called in some samples but not others, although the position in all samples looks similar?
3) There are other variants called adjacent or near this site, but later filtered by VQSR. Does that indicate the variants from this region cannot be trusted?
chr14 92537353 . C CG 303342.41 VQSRTrancheINDEL0.00to90.00+
chr14 92537354 . C CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG 142809.26 VQSRTrancheINDEL0.00to90.00+
chr14 92537378 . G GC 11330.02 PASS
chr14 92537379 rs12896583 T TGCTGCTGCTGCTGC,C 13606.25 VQSRTrancheINDEL0.00to90.00+
chr14 92537385 rs141993435 CTTT C 314.64 VQSRTrancheINDEL0.00to90.00+
chr14 92537387 rs12896588 T G 4301.60 VQSRTrancheSNP99.90to100.00
chr14 92537388 rs12896589 T C 4300.82 VQSRTrancheSNP99.90to100.00
chr14 92537396 . G GC,GCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGC 4213.61 VQSRTrancheINDEL0.00to90.00+
chr14 92537397 . T TGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,TGCTGCTGCTGCTG,TGCTGCTGCTG,TGCTGCTG 2690.79 VQSRTrancheINDEL0.00to90.00+
3) I used GATK 2.7 for processing before calling gvcf files. Any possibility that if I change to 3.2, the bamfiles would look more similar to gvcf result? I have tried GATK3.2-2 for generating gvcf from GATK 2.7 bamfiles, the results are the same.
----------------------------------------------------------------------------------------------------------------------
Update
----------------------------------------------------------------------------------------------------------------------
--bamOutput was used to output assembled haplotypes by HaplotypeCaller.
(https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--bamOutput)
Sample 1
Sample 3
Now the IGV alignments are consistent with gvcf output.
However, since there is a long adjacent insertion downstream the variant, and even for the samples that do not have the variant, there are long insertions upstream the position. I just feel that the region is not mapped very well. So the variant is excluded in further analysis.
----------------------------------------------------------------------------------------------------------------------
Summary
----------------------------------------------------------------------------------------------------------------------
GATK is quite sensitive in variant calling, thus there is a great possibility of false positive, especially for INDELs. To eliminate the false positives, INDELs in particular, you can either manually check the alignment if the variant list is short, or if the list is very long, you can use another less sensitive caller (Platyplus, Isaac) to call INDELs, and then manually check those discordant calls.