Repeated Sequences
Repeated sequence: a sequence where, compared to a reference sequence, a segment of one or more nucleotides (the repeat unit) is present several times, one after the other.
Syntax
NOTE: a Community Consultation proposal is being prepared which will propose allowing only the format in which the entire range of the repeated sequence is indicated, i.e. `r.123_191cag[23]`, not `r.123cag[23]`.
| Positions only | |
|---|---|
| Syntax | `sequence_identifier ":r." positions "[" copy_number "]"` |
| Examples | |

| Sequence given | |
|---|---|
| Syntax | `sequence_identifier ":r." start_position sequence "[" copy_number "]"` |
| Examples | |

| Explanation of Symbols | |
|---|---|
| | |
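
The two syntax forms above can be illustrated with a small parser. This is a minimal sketch, not part of any official tool: the function name `parse_repeat`, the group names, and the regular expression are assumptions made for illustration. It covers only a single repeat unit with an exact copy number, so it does not handle alleles (`[14];[18]`), uncertain ranges (`[(600_800)]`), composite repeats, or the full-range-plus-sequence form mentioned in the NOTE above.

```python
import re

# Hypothetical helper, not part of any HGVS library.
# Covers only the two basic forms from the syntax table:
#   positions only:  r.<start>_<end>[<copy_number>]
#   sequence given:  r.<start><unit>[<copy_number>]
POSITION = r"[-*]?\d+"  # optional "-" (5' UTR) or "*" (3' UTR) prefix
REPEAT_RE = re.compile(
    rf"^(?:(?P<ref>[^:]+):)?r\."
    rf"(?P<start>{POSITION})"
    rf"(?:_(?P<end>{POSITION})|(?P<unit>[acgu]+))"
    rf"\[(?P<copies>\d+)\]$"
)


def parse_repeat(description):
    """Split a simple RNA repeat description into its parts, or return None."""
    match = REPEAT_RE.match(description)
    if match is None:
        return None
    parts = match.groupdict()
    parts["copies"] = int(parts["copies"])
    return parts


print(parse_repeat("r.-124_-123[14]"))
print(parse_repeat("NM_002111.6:r.53agc[19]"))
print(parse_repeat("r.-125_-123cug[4]"))  # mixes both forms, so no match
```

Because the pattern accepts either an end position or a repeat unit after the start position, but never both, a description that mixes the two forms is not matched.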
Notes
- all variants should be described on the DNA level; descriptions on the RNA and/or protein level may be given in addition.
- repeated sequences include both small (mono-, di-, tri-, etc., nucleotide) and larger (kilobase-sized) repeats.
- the format based on repeat position is preferred, because descriptions of the repeat sequence quickly become too lengthy.
  NOTE: while `r.123cug[23]` describes a repeat of 23 `cug` units, `r.123_125[23]` describes a tri-nucleotide repeat of 23 units which could be interrupted by other units (e.g., a rare `cua`). The description `r.123cug[23]` can thus only be used when the repeat was sequenced.
- the format `r.-125_-123cug[4]` should not be used; it contains redundant information (`-125_-123` and `cug`).
- for composite repeats, the basic format can be used, successively listing each different repeat unit: `r.456_465[4]466_489[9]490_499[3]`.
- exception: using a coding RNA reference sequence, a repeated sequence variant description can be used only for repeat units with a length which is a multiple of 3, i.e. which cannot affect the reading frame.
  Consequently, use `NM_024312.4:r.2692_2693dup` and not `NM_024312.4:r.2686a[10]`; use `NM_024312.4:r.1741_1742insuauauaua` and not `NM_024312.4:r.1738ua[6]`. This restriction only applies to the coding sequence, which does not include the UTR sequence. As such, `NM_024312.4:r.-6_-3g[6]` is valid as the reading frame is not affected.
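
The reading-frame exception in the last note can be expressed as a simple check. The sketch below is an illustration only: the function name `repeat_format_allowed` is hypothetical, and it makes the simplifying assumption that the position of the first repeat unit alone decides whether the repeat lies in the coding sequence (positions prefixed with `-` or `*` are treated as UTR). A repeat that straddles the boundary between UTR and coding sequence is not handled.

```python
def repeat_format_allowed(start, unit_length):
    """Return True if the repeat format may be used on a coding RNA reference.

    `start` is the position string of the first repeat unit, e.g. "1738" or "-6".
    Positions starting with "-" (5' UTR) or "*" (3' UTR) lie outside the
    coding sequence, so the multiple-of-3 restriction is not applied to them.
    """
    in_coding_sequence = not start.startswith(("-", "*"))
    if not in_coding_sequence:
        return True
    # Inside the coding sequence only units that keep the reading frame qualify.
    return unit_length % 3 == 0


# The cases named in the note above:
print(repeat_format_allowed("2686", 1))  # a[10] in the coding sequence
print(repeat_format_allowed("1738", 2))  # ua[6] in the coding sequence
print(repeat_format_allowed("-6", 1))    # g[6] in the 5' UTR
```

When the check fails, the note above says to fall back to a `dup` or `ins` description instead of the repeat format.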
Examples
- `r.-124_-123[14]` (alternatively `r.-124ug[14]`)
  a repeated di-nucleotide sequence, with the first unit located from position `r.-124` to `r.-123`, is present in 14 copies.
  NOTE: when the repeat is variable in the population and the reference sequence has 15 units, the description `r.-123ug[14]` is preferred over `r.-97_-96del`.
  NOTE: when the repeat is variable in the population and the reference sequence has 15 units, the description `r.-123ug[17]` is preferred over `r.-99_-96dup`.
- `r.-124_-123[14];[18]` (alternatively `r.-124ug[14];[18]`)
  a repeated di-nucleotide sequence, with the first unit located from position `r.-124` to `r.-123`, is present in 14 copies on one allele and 18 copies on the other allele.
- FMR1 GGC-repeat: in literature, the Fragile-X tri-nucleotide repeat is known as the CGG-repeat. However, based on a coding RNA reference sequence (GenBank NM_002024.5) and applying the 3' rule, on the RNA level, the repeat has to be described as a `ggc`-repeat (see Recommendations).
    - `r.-128_-126[79]`
      an extended repeat of exactly 79 units.
      NOTE: `r.-128ggc[79]` can only be used when the repeat has been sequenced, excluding the possibility that it is interrupted by one or more `gga`-triplets.
    - `r.-128_-126[(600_800)]`
      the repeated tri-nucleotide sequence, starting at position `r.-128`, has an estimated size of between 600 and 800 copies.
      NOTE: the repeat can be pure or a mix of `ggc` and `gga` triplets.
- HD AGC-repeat: based on the HTT (huntingtin) coding DNA reference sequence (GenBank NM_002111.6), applying the 3' rule, on the RNA level, the Huntington's Disease tri-nucleotide repeat is described as an `agc` (not `cag`) repeat.
    - `r.53agc[19]`
      NOTE: the coding RNA reference sequence (NM_002111.6) contains an allele of 21 `agc` repeats.
      NOTE: on protein level, the reference allele contains 21 Glns, described as `p.Gln[21]` (alternatively `p.Q[21]`). The difference derives from the fact that the `agc` repeat is interrupted by an `aac`-triplet (`caa` coding) at position 20.
    - `r.53_55[31]`
      the coding RNA reference sequence (NM_002111.6) contains a tri-nucleotide allele of 32 repeats (`agc`-19, `aac`, `agc`, `cgc`, `cac`, `cgc`-7, `cuc`-2) encoding 21 Gln and 11 Pro-residues.
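
The unit and residue counts quoted for the HTT reference allele can be checked with a short calculation. This is a sketch under stated assumptions: because the repeat is described as `agc` rather than `cag`, the units are taken to sit one nucleotide 3' of the codon boundary, and the nucleotide directly 5' of the first unit is assumed to be `c`. The names `UNITS`, `CODON_TABLE`, and `translate_allele` are hypothetical, and the codon table holds only the five codons that occur here.

```python
# Composite allele as listed above: (repeat unit, copy number).
UNITS = [("agc", 19), ("aac", 1), ("agc", 1), ("cgc", 1),
         ("cac", 1), ("cgc", 7), ("cuc", 2)]

# Only the codons needed here, taken from the standard genetic code.
CODON_TABLE = {"cag": "Gln", "caa": "Gln",
               "ccg": "Pro", "cca": "Pro", "ccu": "Pro"}

# Assumption: the nucleotide directly 5' of the first repeat unit is "c".
PRECEDING_BASE = "c"


def translate_allele(units, preceding_base=PRECEDING_BASE):
    """Count the residues encoded when the units are read one base upstream."""
    sequence = preceding_base + "".join(unit * copies for unit, copies in units)
    counts = {}
    # Step through complete codons only; the final base starts the next codon.
    for i in range(0, len(sequence) - 2, 3):
        residue = CODON_TABLE[sequence[i:i + 3]]
        counts[residue] = counts.get(residue, 0) + 1
    return counts


total_units = sum(copies for _, copies in UNITS)
residue_counts = translate_allele(UNITS)
print(total_units, residue_counts)
```

Read this way, each unit contributes its first two nucleotides to a codon that starts with the last nucleotide of the preceding unit, which is why an `aac` unit corresponds to a `caa` codon.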