Repeated Sequences
Repeated sequence: a sequence where, compared to a reference sequence, a segment of one or more nucleotides (the repeat unit) is present several times, one after the other.
Syntax
NOTE: a Community Consultation proposal is being prepared that will suggest allowing only the format in which the entire range of the repeated sequence is indicated; so g.123_191CAG[23], not g.123CAG[23].
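The arithmetic behind the range format can be sketched as follows. This is only an illustration, not part of the nomenclature: the function names `repeat_range` and `describe_repeat` are hypothetical, and the sketch assumes a unique repeat, plain integer positions that do not cross the c.-1/c.1 boundary, and a known number of copies in the reference sequence (23 in the call below is an assumption made for the example).

```python
def repeat_range(start: int, unit: str, reference_copies: int) -> tuple[int, int]:
    """Return the first and last reference position covered by a unique repeat."""
    if not unit or reference_copies < 1:
        raise ValueError("need a non-empty repeat unit and at least one reference copy")
    # The range spans all copies present in the reference sequence,
    # so its length is (unit length) x (reference copy number).
    end = start + len(unit) * reference_copies - 1
    return start, end


def describe_repeat(prefix: str, start: int, unit: str,
                    reference_copies: int, observed_copies: int) -> str:
    """Build a range-style repeat description such as g.123_191CAG[23]."""
    first, last = repeat_range(start, unit, reference_copies)
    return f"{prefix}{first}_{last}{unit}[{observed_copies}]"


print(describe_repeat("g.", 123, "CAG", 23, 23))  # g.123_191CAG[23]
```

Note that the range always reflects the reference sequence; only the number in square brackets changes with the observed allele.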
Unique Repeat | |
---|---|
Syntax | sequence_identifier ":" coordinate_type "." position sequence "[" total_copy_number "]" |
Examples | |
Mixed Repeat | |
Syntax | sequence_identifier ":" coordinate_type "." position sequence "[" total_copy_number "]" sequence "[" total_copy_number "]" … sequence "[" total_copy_number "]" |
Examples | |
Explanation of Symbols | |
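As a rough illustration of the syntax above, the following sketch splits a sequenced repeat description into its parts. It is not an official parser: the names `RepeatDescription` and `parse_repeat` are hypothetical, and the regular expression is an assumption that covers only simple positions or ranges and exact copy numbers (it does not handle uncertain ranges, insN/delN forms, or allele brackets).

```python
import re
from typing import NamedTuple


class RepeatDescription(NamedTuple):
    sequence_identifier: str
    coordinate_type: str
    position: str
    units: list  # list of (sequence, total_copy_number) tuples


# sequence_identifier ":" coordinate_type "." position, then one or more
# sequence "[" total_copy_number "]" groups.
_DESCRIPTION = re.compile(
    r"(?P<ref>[^:\s]+):"
    r"(?P<type>[a-z])\."
    r"(?P<position>[0-9_+\-*]+)"
    r"(?P<units>(?:[A-Z]+\[[0-9]+\])+)"
)
_UNIT = re.compile(r"([A-Z]+)\[([0-9]+)\]")


def parse_repeat(description: str) -> RepeatDescription:
    """Parse a unique or mixed repeat description with exact copy numbers."""
    match = _DESCRIPTION.fullmatch(description)
    if match is None:
        raise ValueError(f"not a simple repeat description: {description!r}")
    units = [(seq, int(copies)) for seq, copies in _UNIT.findall(match["units"])]
    return RepeatDescription(match["ref"], match["type"], match["position"], units)


parsed = parse_repeat("NC_000012.11:g.112036755_112036823CTG[9]TTG[1]CTG[13]")
print(parsed.units)  # [('CTG', 9), ('TTG', 1), ('CTG', 13)]
```

A unique repeat is simply the case where the returned list holds a single unit.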
Notes
- repeated sequences include both small (mono-, di-, tri-, etc., nucleotide) and larger (kilobase-sized) repeats.
- for mixed repeats, the range of the repeat sequence is given, followed by a listing of each repeat unit and the number of repeats in each unit; e.g. NC_000012.11:g.112036755_112036823CTG[9]TTG[1]CTG[13].
- NM_000044.3:c.171_239GCA[34] describes a repeated sequence containing 34 GCA units (sequenced, the reference sequence contains 23 GCA units).
- NM_000044.3:c.(92_331)insN[33] describes an insertion of 33 nucleotides in the amplified region from position c.92 to c.331 (not sequenced), containing a repeated sequence of 24 GCA units in the reference sequence.
- exception: using a coding DNA reference sequence ("c." description), a repeated sequence variant description can be used only for repeat units with a length which is a multiple of 3, i.e. which can not affect the reading frame. Consequently, use NM_024312.4:c.2692_2693dup and not NM_024312.4:c.2686A[10]; use NM_024312.4:c.1741_1742insTATATATA and not NM_024312.4:c.1738TA[6].
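Two of the notes above come down to simple arithmetic, sketched here under stated assumptions. The function names `size_change` and `keeps_reading_frame` are hypothetical, and the frame check assumes the repeat lies in the protein-coding part of the reference sequence; it only tests the unit length and says nothing about where the repeat is located.

```python
def size_change(unit: str, reference_copies: int, observed_copies: int) -> int:
    """Length difference in nucleotides between observed and reference allele.

    Positive values correspond to an insertion (insN[...]),
    negative values to a deletion (delN[...]).
    """
    return len(unit) * (observed_copies - reference_copies)


def keeps_reading_frame(unit: str) -> bool:
    """True when the repeat unit length is a multiple of 3."""
    return len(unit) % 3 == 0


# 34 GCA units against a reference with 23 GCA units: 11 extra units.
print(size_change("GCA", 23, 34))   # 33

# Unit lengths from the exception above.
print(keeps_reading_frame("GCA"))   # True
print(keeps_reading_frame("A"))     # False
print(keeps_reading_frame("TA"))    # False
```

The first call reproduces the 33 nucleotides of the insN[33] description from the copy numbers given for the sequenced allele.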
Examples
- unique repeat
    - sequenced
        - NC_000014.8:g.101179660_101179695TG[14]
          a repeated TG di-nucleotide sequence, starting at position g.101179660 on human chromosome 14, with 14 TG copies.
        - NC_000014.8:g.[101179660_101179695TG[14]];[101179660_101179695TG[18]]
          a repeated TG di-nucleotide sequence, starting at position g.101179660 on human chromosome 14, is present with 14 TG copies on one allele and 18 TG copies on the other allele.
    - repeat expansion disorders
        - sequenced
            - NM_023035.2:c.6955_6993CAG[26] (or c.6955_6993dup)
              a repeated CAG tri-nucleotide sequence, starting at position c.6955 in the CACNA1A gene, with 26 CAG copies (p.(Gln2319[26]) or p.(Gln2319_Gln2331dup)).
            - NC_000003.12:g.63912687_63912716AGC[13] / c.89_118AGC[13]
              a repeated AGC tri-nucleotide sequence in the ATXN7 gene on chromosome 3, starting at position g.63912687 / c.89, with 13 AGC copies (the reference sequence has 10 copies).
              NOTE: in literature, the tri-nucleotide repeat, encoding a poly-Gln repeat on protein level, is known as the CAG repeat. However, based on the ATXN7 coding DNA reference sequence (GenBank LRG_866t1 or NM_000333.3) and applying the 3' rule, the repeat has to be described as an AGC repeat.
        - not sequenced
            - NC_000003.12:g.(63912602_63912844)insN[9] / NM_000333.3:c.(4_246)insN[9]
              a fragment containing the AGC repeat in the ATXN7 gene was amplified (from nucleotide g.63912602 / c.4 to g.63912844 / c.246) and its size determined to be 9 nucleotides larger (insN[9]) than that of the reference sequence.
              NOTE: since the fragment was not sequenced, the variant can not be described as g.63912687_63912716AGC[13] / c.89_118AGC[13].
            - NC_000003.12:g.(63912602_63912844)delN[15] / NM_000333.3:c.(4_246)delN[15]
              a fragment containing the AGC repeat in the ATXN7 gene was amplified (from nucleotide g.63912602 / c.4 to g.63912844 / c.246) and its size determined to be 15 nucleotides smaller (delN[15]) than that of the reference sequence.
- mixed repeat reference sequence
    - repeat expansion disorders
        - FMR1 repeat (reference sequence GGC[9]GGA[1]GGC[10])
          in literature, the Fragile-X tri-nucleotide repeat is described as a CGG-repeat. However, based on a coding DNA reference sequence (GenBank NM_002024.5) and applying the 3' rule, the repeat has to be described as a mixed GGC-GGA-GGC repeat.
            - NM_002024.5:c.-128_-69GGC[10]GGA[1]GGC[9]GGA[1]GGC[10]
              a sequenced GGC tri-nucleotide repeat from position c.-128 to c.-69 contains 10 GGC, 1 GGA, 9 GGC, 1 GGA, and 10 GGC units (31 repeat units).
            - NM_002024.5:c.-128_-69GGC[68]GGA[1]GGC[10]
              a repeated GGC tri-nucleotide sequence, starting at position c.-128, with 79 repeat units.
              NOTE: since the reference sequence contains a mixed repeat (CGG and AGG units in literature notation), the variant can not be described as NM_002024.5:c.-129CGG[79]. NM_002024.5:c.-129CGG[79] would cover only the sequence up to the first AGG interruption (position c.-99).
            - NM_002024.5:c.-128_-69GGM[108]
              a repeated mixed tri-nucleotide sequence, starting at position c.-128, with 108 GGC/GGA copies.
            - NM_002024.5:c.(-144_-16)insN[(1800_2400)]
              the amplified region containing the FMR1 repeat region (between nucleotides c.-144 and c.-16) contains an insertion of 1800 to 2400 nucleotides (600 to 800 GGC/GGA units).
        - HTT repeat (reference sequence LRG_763t1:c.52_153CAG[21]CAA[1]CAG[1]CCG[1]CCA[1]CCG[7]CCT[2])
          in literature, the Huntington's Disease tri-nucleotide repeat, encoding a variable poly-Gln followed by a variable poly-Pro repeat on protein level, is known as the CAG repeat. Based on the HTT (huntingtin) coding DNA reference sequence (GenBank LRG_763t1 or NM_002111.8) and applying the 3' rule, the poly-Gln encoding repeat has to be described as an AGC-AAC-AGC repeat.
          LRG_763t1:c.54_110GCA[23]
          a sequenced GCA tri-nucleotide repeat starting at position c.54 contains 23 units, on protein level described as NP_002102.4:p.(Gln18)[25].
          NOTE: the GCA repeat is followed by ACAGCA, extending the encoded Gln-repeat by 2.
    - CFTR intron 9
      NM_000492.3:c.1210-33_1210-6GT[11]T[6]
      the mixed repeat sequence from position c.1210-33 to c.1210-6 contains 11 GT and 6 T copies.
      NOTE: when only the variable T-stretch is described, the format is NM_000492.3:c.1210-12_1210-6T[7] (see Discussion below).
    - NC_000012.11:g.112036755_112036823CTG[9]TTG[1]CTG[13]
      a complex repeated sequence from position g.112036755 to g.112036823 on chromosome 12, with first a CTG unit present in 9 copies, then a TTG unit present in 1 copy, and then a CTG unit present in 13 copies.
- differing genomic (g.) and coding DNA (c.) descriptions
  NC_000001.11:g.57367047_57367121ATAAA[15] and NM_021080.3:c.-136-75952_-136-75878ATTTT[15] describe the same repeat allele in intron 3 of the DAB1 gene.
  NOTE: based on the 3' rule and the transcriptional orientation of the gene (minus strand), the description of the repeat units differs.
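The relation between the ATAAA and ATTTT units in the DAB1 example can be checked with a short sketch. The helper names below are hypothetical; the code only tests whether one unit is a rotation of the reverse complement of the other. It does not apply the 3' rule itself, which requires the flanking reference sequence to decide which rotation is used.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Reverse complement of an unambiguous DNA sequence (A, C, G, T only)."""
    return sequence.upper().translate(_COMPLEMENT)[::-1]


def rotations(unit: str) -> set:
    """All cyclic rotations of a repeat unit, e.g. TTTAT -> ATTTT, TTTTA, ..."""
    return {unit[i:] + unit[:i] for i in range(len(unit))}


def same_repeat_on_opposite_strand(unit_a: str, unit_b: str) -> bool:
    """True when unit_b is a rotation of the reverse complement of unit_a."""
    return unit_b.upper() in rotations(reverse_complement(unit_a))


print(reverse_complement("ATAAA"))                       # TTTAT
print(same_repeat_on_opposite_strand("ATAAA", "ATTTT"))  # True
```

On the minus strand the genomic ATAAA unit reads as TTTAT, and ATTTT is one of its rotations, which is consistent with the two descriptions referring to the same repeat.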
Discussion
Intron 9 of the CFTR gene ends with the sequence ...tgtgtgtgtgtttttttaacag. Both the TG and T stretches are variable in length (from 9 to 13 and 5 to 9, respectively). The reference sequence has 11 TG copies and 7 Ts. Is it correct to describe an allele as c.1210-14TG[13]T[5] or, for the T stretch, as c.1210-6T[5]?
A complex case.
First, note that by applying the 3' rule it is a variable GT and not a TG stretch.
When the coding DNA reference sequence has 11 TG copies followed by 7 T copies, the reference allele is described as c.1210-33_1210-6GT[11]T[6].
When only variability of the T-stretch is reported, the reference allele is described as c.1210-12_1210-6T[7].
To indicate the overall variability found in the population, the description is c.1210-33_1210-6GT[(9_13)]T[(4_8)] for the combined repeat and c.1210-12_1210-6T[(5_9)] for the T-stretch.
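Why 7 T copies become T[6] in the combined description can be verified by expanding both readings into plain sequence. This is a minimal sketch with hypothetical helper names (`expand`, `shift_tg_t_to_gt_t`); it assumes a TG stretch directly followed by a T stretch, as in the question above.

```python
def expand(units):
    """Expand [(unit, copies), ...] into the plain nucleotide sequence."""
    return "".join(unit * copies for unit, copies in units)


def shift_tg_t_to_gt_t(tg_copies: int, t_copies: int):
    """Re-read TG[n]T[m] one nucleotide further 3' as GT[n]T[m-1].

    The first T of the stretch falls outside the repeat, and the first T of
    the T-stretch becomes part of the last GT unit.
    """
    if tg_copies < 1 or t_copies < 1:
        raise ValueError("need at least one TG copy and one T copy")
    return [("GT", tg_copies), ("T", t_copies - 1)]


literature = [("TG", 11), ("T", 7)]
shifted = shift_tg_t_to_gt_t(11, 7)

print(shifted)                                     # [('GT', 11), ('T', 6)]
print(expand(literature) == "T" + expand(shifted))  # True
```

The same shift explains the population ranges: 5 to 9 T copies in the TG reading correspond to T[(4_8)] in the combined GT description, while the GT copy number range stays 9 to 13.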