Repeated Sequences
Repeated sequence: a sequence where, compared to a reference sequence, a segment of one or more nucleotides (the repeat unit) is present several times, one after the other.
Syntax
NOTE: a Community Consultation proposal is being prepared that will suggest allowing only the format in which the entire range of the repeated sequence is indicated; so r.123_191cag[23], not r.123cag[23].
Positions only | |
---|---|
Syntax | sequence_identifier ":r." positions "[" copy_number "]" |
Examples |
|
Sequence given | |
Syntax | sequence_identifier ":r." start_position sequence "[" copy_number "]" |
Examples |
|
Explanation of Symbols | |
|
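The two syntax forms above can be told apart mechanically. The following is a minimal sketch, not an official HGVS parser: the names `REPEAT_RE`, `Repeat` and `parse_repeat` are hypothetical, and the pattern assumes plain integer positions (no intronic offsets or `*` positions) and the lowercase RNA alphabet `a`, `c`, `g`, `u`. The copy number is accepted either as an integer or as an estimated range such as `(600_800)`.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern covering both forms from the Syntax table:
#   positions only : r.123_125[23]
#   sequence given : r.123cug[23]
# The sequence identifier prefix ("NM_...:") is treated as optional here.
REPEAT_RE = re.compile(
    r"^(?:(?P<ref>[^:]+):)?r\.(?P<start>-?\d+)"
    r"(?:_(?P<end>-?\d+)|(?P<unit>[acgu]+))"
    r"\[(?:(?P<count>\d+)|\((?P<lo>\d+)_(?P<hi>\d+)\))\]$"
)


@dataclass
class Repeat:
    reference: Optional[str]  # sequence identifier, if given
    start: int                # first position of the first unit
    end: Optional[int]        # last position of the first unit (positions-only form)
    unit: Optional[str]       # repeat unit sequence (sequence-given form)
    min_copies: int
    max_copies: int


def parse_repeat(description: str) -> Repeat:
    """Parse a single-allele, single-unit repeat description."""
    m = REPEAT_RE.match(description)
    if m is None:
        raise ValueError(f"not a recognised repeat description: {description!r}")
    if m["count"] is not None:
        lo = hi = int(m["count"])          # exact copy number
    else:
        lo, hi = int(m["lo"]), int(m["hi"])  # estimated range
    return Repeat(
        reference=m["ref"],
        start=int(m["start"]),
        end=int(m["end"]) if m["end"] is not None else None,
        unit=m["unit"],
        min_copies=lo,
        max_copies=hi,
    )
```

For instance, `parse_repeat("r.123cug[23]")` yields a unit of `cug` with 23 copies, while `parse_repeat("r.123_125[23]")` yields the same start and copy number but no unit sequence. Composite repeats and two-allele descriptions such as `[14];[18]` are deliberately out of scope for this sketch.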
Notes
- all variants should be described on the DNA level; descriptions on the RNA and/or protein level may be given in addition.
- repeated sequences include both small (mono-, di-, tri-, etc., nucleotide) and larger (kilobase-sized) repeats.
- the format based on repeat position is preferred; descriptions of the repeat sequence quickly become too lengthy.
  NOTE: while r.123cug[23] describes a repeat of 23 cug units, r.123_125[23] describes a tri-nucleotide repeat of 23 units which could be interrupted with other units (e.g., a rare cua). The description r.123cug[23] can thus only be used when the repeat was sequenced.
- the format r.-125_-123cug[4] should not be used; it contains redundant information (-125_-123 and cug).
- for composite repeats, the basic format can be used, successively listing each different repeat unit; r.456_465[4]466_489[9]490_499[3].
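The relation between the two formats, and the arithmetic behind a composite description, can be illustrated with a short sketch. The helper names (`unit_end`, `to_positions_only`, `composite_length`) are hypothetical, and the code assumes integer positions in a numbering scheme without a position 0 (position -1 is followed directly by position 1). The conversion only goes one way: a positions-only description says nothing about whether every unit has the same sequence, so the unit sequence cannot be recovered from it.

```python
import re


def unit_end(start: int, unit_length: int) -> int:
    """Last position of a unit starting at `start`.

    Assumes there is no position 0, so a unit that crosses from
    negative to positive numbering ends one position further on.
    """
    end = start + unit_length - 1
    if start < 0 <= end:
        end += 1
    return end


def to_positions_only(start: int, unit: str, copies: int) -> str:
    """Rewrite a sequence-given repeat (e.g. r.123cug[23]) by position."""
    return f"r.{start}_{unit_end(start, len(unit))}[{copies}]"


# One "start_end[copies]" segment of a positions-only description.
SEGMENT_RE = re.compile(r"(-?\d+)_(-?\d+)\[(\d+)\]")


def composite_length(description: str) -> int:
    """Total number of nucleotides described by a (composite) repeat."""
    total = 0
    for start, end, copies in SEGMENT_RE.findall(description):
        start, end = int(start), int(end)
        span = end - start + 1
        if start < 0 < end:  # the range skips the non-existent position 0
            span -= 1
        total += span * int(copies)
    return total
```

With these helpers, `to_positions_only(123, "cug", 23)` gives `r.123_125[23]`, and `composite_length("r.456_465[4]466_489[9]490_499[3]")` adds up 4 units of 10, 9 units of 24 and 3 units of 10 nucleotides, i.e. 286 nucleotides.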
Examples
- r.-124_-123[14] (alternatively r.-124ug[14])
  a repeated di-nucleotide sequence, with the first unit located from position r.-124 to r.-123, is present in 14 copies.
  NOTE: when the repeat is variable in the population and the reference sequence has 15 units, the description r.-123ug[14] is preferred over r.-97_-96del.
  NOTE: when the repeat is variable in the population and the reference sequence has 15 units, the description r.-123ug[17] is preferred over r.-99_-96dup.
- r.-124_-123[14];[18] (alternatively r.-124ug[14];[18])
  a repeated di-nucleotide sequence, with the first unit located from position r.-124 to r.-123, is present in 14 copies on one allele and 18 copies on the other allele.
- FMR1 GGC-repeat: in the literature, the Fragile-X tri-nucleotide repeat is known as the CGG-repeat. However, based on a coding RNA reference sequence (GenBank NM_002024.5) and applying the 3' rule, on the RNA level, the repeat has to be described as a ggc-repeat (see Recommendations).
  - r.-128_-126[79]
    an extended repeat of exactly 79 units.
    NOTE: r.-128ggc[79] can only be used when the repeat has been sequenced, excluding the possibility that it is interrupted by one or more gga-triplets.
  - r.-128_-126[(600_800)]
    the repeated tri-nucleotide sequence, starting at position r.-128, has an estimated size of between 600 and 800 copies.
    NOTE: the repeat can be pure or a mix of ggc and gga triplets.
- HD AGC-repeat: based on the HTT (huntingtin) coding DNA reference sequence (GenBank NM_002111.6), applying the 3' rule, on the RNA level, the Huntington's Disease tri-nucleotide repeat is described as an agc (not cag) repeat.
  - r.53agc[19]
    NOTE: the coding RNA reference sequence (NM_002111.6) contains an allele of 21 agc repeats.
    NOTE: on protein level, the reference allele contains 21 Glns, described as p.Gln[21] (alternatively p.Q[21]). The difference derives from the fact that the agc repeat is interrupted by an aac-triplet (caa coding) at position 20.
  - r.53_55[31]
    the coding RNA reference sequence (NM_002111.6) contains a tri-nucleotide allele of 32 repeats (agc-19, aac, agc, cgc, cac, cgc-7, cuc-2) encoding 21 Gln and 11 Pro-residues.
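The counts quoted for the HTT reference allele (32 units encoding 21 Gln and 11 Pro) can be checked with a few lines of code. This is only an illustrative sketch: the names are hypothetical, the unit list is copied from the text above rather than read from NM_002111.6, and it assumes that the repeat frame sits one nucleotide 3' of the codon frame, so that each codon is the last base of the preceding unit followed by the first two bases of the unit. Because every unit listed ends in `c`, rotating a unit by one position gives the same codon.

```python
from collections import Counter

# Unit composition quoted in the text, 5' to 3': (unit, copies).
HTT_UNITS = [
    ("agc", 19), ("aac", 1), ("agc", 1), ("cgc", 1),
    ("cac", 1), ("cgc", 7), ("cuc", 2),
]

# Only the codons needed here, taken from the standard genetic code.
CODON_TO_AA = {
    "cag": "Gln", "caa": "Gln",
    "ccg": "Pro", "cca": "Pro", "ccu": "Pro",
}


def unit_to_codon(unit: str) -> str:
    """Shift the frame one base 5': agc -> cag, aac -> caa, cgc -> ccg.

    Assumption: the base preceding each unit equals the unit's own last
    base, which holds for the units listed above (all end in 'c').
    """
    return unit[-1] + unit[:-1]


def summarize(units):
    """Return the total unit count and the residue counts they encode."""
    total = sum(copies for _, copies in units)
    residues = Counter()
    for unit, copies in units:
        residues[CODON_TO_AA[unit_to_codon(unit)]] += copies
    return total, dict(residues)


total_units, residue_counts = summarize(HTT_UNITS)
```

Under these assumptions, `total_units` is 32 and `residue_counts` is 21 Gln and 11 Pro, matching the figures in the text.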