Substitution

Substitution: a sequence change where, compared to a reference sequence, one nucleotide is replaced by one other nucleotide.

Syntax

Simple sequence substitution
Syntax	`sequence_identifier ":" coordinate_type "." position reference_sequence ">" alternate_sequence`
Examples	`NC_000023.10:g.33038255C>A`
Genome reference with coordinates from aligned transcript
Syntax	`sequence_identifier "(" transcript_identifier "):c." transcript_position reference_sequence ">" alternate_sequence`
Examples	`NG_012232.1(NM_004006.2):c.93+1G>T`
Explanation of Symbols
- `alternate_sequence`: the new sequence (nucleic or amino acid) that will occur at the specified position in the given `sequence_identifier`.
- `coordinate_type`: the type of molecule and coordinate system; see the general recommendations.
- `position`: a position specified as a simple ordinal sequence position.
- `reference_sequence`: the sequence at the specified position in the given `sequence_identifier`.
- `sequence_identifier`: an identifier for a sequence from a recognized database.
- `transcript_identifier`: an identifier for a transcript in a recognized database; a transcript implies a specific sequence, exon structure, and, for coding transcripts, CDS start and end positions.
- `transcript_position`: a position specified as a simple ordinal sequence position OR as an offset from an exon boundary (e.g., 22+7 or 56-3).

See also explanation of grammar used in HGVS Nomenclature.
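As an illustration of the two syntax forms, the following is a minimal Python sketch that splits a substitution description into the symbols explained here. It is not an official HGVS parser: the function name `parse_substitution`, the `Substitution` tuple, and the regular expression are hypothetical, and the pattern assumes DNA-level coordinate types (`g`, `c`, `n`, `m`, `o`) and a position made of an optional `-` or `*` prefix, digits, and an optional offset such as `+1`.

```python
import re
from typing import NamedTuple, Optional


class Substitution(NamedTuple):
    """Fields named after the symbols used in the syntax tables (illustrative only)."""
    sequence_identifier: str
    transcript_identifier: Optional[str]
    coordinate_type: str
    position: str
    reference_sequence: str
    alternate_sequence: str


# Assumed pattern, not the official grammar: DNA-level descriptions only.
_SUBSTITUTION = re.compile(
    r"^(?P<seq>[^():\s]+)"                   # sequence_identifier
    r"(?:\((?P<tx>[^()\s]+)\))?"             # optional (transcript_identifier)
    r":(?P<ctype>[gcnmo])\."                 # coordinate_type
    r"(?P<pos>[-*]?\d+(?:[+-]\d+)?)"         # position, e.g. 33038255 or 93+1
    r"(?P<ref>[ACGT])>"                      # exactly one reference nucleotide
    r"(?P<alt>[ACGTRYSWKMBDHVN])$"           # exactly one alternate nucleotide (IUPAC)
)


def parse_substitution(description: str) -> Optional[Substitution]:
    """Return the parsed fields, or None if the text is not a single-nucleotide substitution."""
    match = _SUBSTITUTION.match(description)
    if match is None:
        return None
    return Substitution(
        match["seq"], match["tx"], match["ctype"],
        match["pos"], match["ref"], match["alt"],
    )


print(parse_substitution("NC_000023.10:g.33038255C>A"))
print(parse_substitution("NG_012232.1(NM_004006.2):c.93+1G>T"))
print(parse_substitution("LRG_199t1:c.79_80GC>TT"))  # None: not one nucleotide for one
```

Because the pattern accepts exactly one reference and one alternate nucleotide, multi-nucleotide forms are rejected rather than parsed.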

Notes

substitutions involving two or more consecutive nucleotides are described as deletion/insertion (delins) variants (see Deletion/insertion (delins)).
two variants separated by one or more nucleotides should be described individually and not as a "delins" of the sequence affected.
- Exception: two variants separated by one nucleotide, together affecting one amino acid, should be described as a "delins".
  NOTE: this rule prevents tools that predict the consequences of a variant from making conflicting and incorrect predictions (e.g., c.235_237delinsTAT (p.Lys79Tyr) versus c.[235A>T;237G>T] (p.[Lys79*;Lys79Asn])).
  NOTE: the SVD-WG has prepared a proposal to modify this recommendation (see SVD-WG010).
nucleotides that have been tested and found not changed are described as c.123=, g.4567_4569= (see SVD-WG001 (no change)).
it is not correct to describe "polymorphisms" as c.76A/G (see Discussions).
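The delins rules in these notes can be sketched as a small decision function. This is a hedged illustration only: the name `should_merge_into_delins` is hypothetical, and the sketch assumes both substitutions lie on the same allele and are given as positive coding DNA positions (c.1 being the first nucleotide of the start codon), so that codon membership can be derived from the position alone. It does not reflect any change proposed in SVD-WG010.

```python
def should_merge_into_delins(pos_a: int, pos_b: int) -> bool:
    """Decide whether two substitutions on one allele are described as one delins.

    Assumes simple positive coding DNA positions (no UTR or intronic offsets).
    """
    if pos_a < 1 or pos_b < 1 or pos_a == pos_b:
        raise ValueError("expected two distinct positive coding positions")
    first, second = sorted((pos_a, pos_b))
    gap = second - first
    if gap == 1:
        # Consecutive nucleotides: always a delins.
        return True
    if gap == 2:
        # Separated by one nucleotide: delins only when both hit the same codon.
        return (first - 1) // 3 == (second - 1) // 3
    # Further apart: describe the variants individually.
    return False


print(should_merge_into_delins(79, 80))    # True  -> c.79_80delinsTT
print(should_merge_into_delins(145, 147))  # True  -> c.145_147delinsTGG
print(should_merge_into_delins(146, 148))  # False -> different codons
```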

Examples

NC_000023.10:g.33038255C>A
a substitution of the C nucleotide at g.33038255 by an A.
NG_012232.1(NM_004006.2):c.93+1G>T
a substitution of the G nucleotide at c.93+1 (coding DNA reference sequence) by a T.
LRG_199t1:c.79_80delinsTT
nucleotides c.79 and c.80 are replaced by TT.
NOTE: changes involving two or more consecutive nucleotides are described as deletion-insertion (delins) so the description c.[79G>T;80C>T] is not correct.
NOTE: based on the definition of a substitution, i.e. one nucleotide replaced by one other nucleotide, this change cannot be described as a substitution like c.79_80GC>TT or c.79GC>TT.
NM_004006.2:c.145_147delinsTGG
two substitutions replacing codon CGC (positions c.145 to c.147) by TGG.
NOTE: two variants separated by one nucleotide, together affecting one amino acid, should be described as a "delins" so the description c.[145C>T;147C>G] is not correct (see deletion/insertion).
LRG_199t1:c.54G>H
a substitution of the G nucleotide at c.54 (coding DNA reference sequence) by A, C, or T (IUPAC code "H", see Standards).
NM_004006.2:c.123=
a screen was performed showing that nucleotide c.123 was a C, as in the coding DNA reference sequence (the nucleotide was not changed).
NOTE: the description NM_004006.2:c.= cannot be used; c.= indicates that the entire NM_004006.2 coding DNA reference sequence was analysed and no change was identified.
NOTE: the description LRG_199t1:c.94-23_188+33= indicates no variants were found in the region indicated (exon 3 of the DMD gene).
LRG_199t1:c.85=/T>C
a mosaic case where at position c.85, besides the normal sequence (a T, described as "="), chromosomes containing a C (c.85T>C) are also found.
NOTE: irrespective of the frequency in which each nucleotide was found, the reference is always described first.
NM_004006.2:c.85=//T>C
a chimeric case, i.e. the sample is a mix of cells containing c.85= and c.85T>C.
NOTE: irrespective of the frequency in which each nucleotide was found, the reference is always described first.
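For the c.54G>H example above, the IUPAC ambiguity code can be expanded into the concrete substitutions it stands for. The sketch below is an illustration under stated assumptions: the function name `expand_ambiguous_substitution` is hypothetical, the input is assumed to be a well-formed single-nucleotide substitution, and the choice to drop an alternative equal to the reference nucleotide (since that would not be a change) is a design decision of this sketch, not a rule taken from the recommendations.

```python
from typing import List

# Standard IUPAC nucleotide ambiguity codes (DNA alphabet).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_ambiguous_substitution(description: str) -> List[str]:
    """List the unambiguous substitutions covered by an IUPAC alternate code."""
    head, separator, alt = description.rpartition(">")
    if not separator or not head or alt not in IUPAC_DNA or head[-1] not in "ACGT":
        raise ValueError(f"not a single-nucleotide substitution: {description!r}")
    reference = head[-1]
    # Skip the reference nucleotide itself: replacing G by G is not a substitution.
    return [f"{head}>{base}" for base in IUPAC_DNA[alt] if base != reference]


for item in expand_ambiguous_substitution("LRG_199t1:c.54G>H"):
    print(item)
```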

Discussion

When I only sequenced RNA (cDNA) and not genomic DNA, should I then give the description of a variant on DNA level in parentheses?

Yes. While the variant can be described on the RNA level as r.76a>g, on the DNA level, based on e.g. a coding DNA reference sequence, it should be described as c.(76A>G).
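The conversion in this answer is mechanical enough to sketch in a few lines of Python. The helper name `rna_to_predicted_dna` is hypothetical, and the sketch assumes a simple RNA substitution at a plain numeric position whose number is the same in the coding DNA reference sequence; the parentheses mark the DNA-level description as inferred rather than observed.

```python
import re

# Assumed input shape: r.<position><ref>><alt> with lowercase RNA nucleotides.
_RNA_SUBSTITUTION = re.compile(r"^r\.(?P<pos>\d+)(?P<ref>[acgu])>(?P<alt>[acgu])$")

# RNA nucleotides map to uppercase DNA nucleotides, with u becoming T.
_RNA_TO_DNA = str.maketrans("acgu", "ACGT")


def rna_to_predicted_dna(rna_description: str) -> str:
    """Turn e.g. 'r.76a>g' into the inferred coding DNA description 'c.(76A>G)'."""
    match = _RNA_SUBSTITUTION.match(rna_description)
    if match is None:
        raise ValueError(f"not a simple RNA substitution: {rna_description!r}")
    reference = match["ref"].translate(_RNA_TO_DNA)
    alternate = match["alt"].translate(_RNA_TO_DNA)
    return f"c.({match['pos']}{reference}>{alternate})"


print(rna_to_predicted_dna("r.76a>g"))  # c.(76A>G)
```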

How can I shorten the descriptions of SNPs in a manuscript?

Publications reporting linkage or association studies often use a range of different markers/SNPs. Such publications should contain, at least once, an unequivocal description of all markers used, linking them to a reference sequence, preferably a genomic reference sequence. When this has been done, simplified descriptions can be used, such as:

NM_004006.1 3G>T, using a GenBank coding DNA reference sequence
GJB2 76A>C, using an HGNC-approved gene symbol as reference
rs2306220 T>C, using a dbSNP-identifier as a reference
DXS1219 CA[18];[21] (or AFM297yd1 CA[18];[21]), using a marker DXS1219 (AFM297yd1) as reference

How should I describe a variant in the promoter region of a gene?

It is recommended to describe variants in the promoter region of a gene based on a genomic reference sequence, e.g., NC_000023.10:g.33357783G>A (chrX, hg19). Describing the variant in relation to a coding DNA reference sequence (for this variant NM_004006.1:c.-128354C>T or NM_000109.3:c.-401C>T) is possible but not very informative; you do not know how long the 5'UTR is. The variant can also be described using a genomic reference sequence containing the promoter region (for this variant e.g., L01538.1:g.1407C>T), but again this is not very informative. Although NC_000023.10:g.33357783G>A seems complex, it can be used in a genome browser, helping you to quickly zoom in on the region of interest.

Are polymorphisms described like NM_004006.1:c.76A/G?

No, all substitutions are described as NM_004006.1:c.76A>G. In the past, the format c.76A/G has been used to describe "polymorphic" sequence variants. Note that a description should be neutral, simply describe the change, and not include any other information like predicted or known functional consequences.

Can I describe a GC to TG variant as a di-nucleotide substitution (NG_012232.1:g.12GC>TG)?

No, this is not allowed. By definition, a substitution changes one nucleotide into one other nucleotide. The change ..GAAGCCAG.. to ..GAATGCAG.. should be described as NG_012232.1:g.12_13delinsTG, i.e. a deletion/insertion (delins) (see Deletion-Insertion and Description - Note). When phase information is not available, the variant should be described as NG_012232.1:g.12G>T(;)13C>G (see Alleles).

The BRCA1 coding DNA reference sequence NM_007294.3 from position c.2074 to c.2080 is ..CATGACA... A variant frequently found in the population is ..CATAACA.. (NM_007294.3:c.2077G>A). In a patient I found the sequence ..CATATAACA... Can I describe this variant as NM_007294.3:c.[2077G>A;2077_2078insTA]?

The correct description of this variant is NM_007294.3:c.2077delinsATA.
NOTE: the answer was modified, i.e. the addition "However, since the variant is likely a combination of two other variants, it is acceptable to describe it as NM_007294.3:c.[2077G>A;2077_2078insTA]." was removed.
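The two delins answers in this discussion (g.12_13delinsTG and c.2077delinsATA) can both be reproduced by trimming the nucleotides that the reference and the observed sequence share at either end. The sketch below is a simplified illustration, not a complete variant describer: the function name `describe_change` and its parameters are hypothetical, positions are assumed to be simple ordinal numbers, and cases that would be pure insertions, deletions, or duplications are rejected rather than described.

```python
def describe_change(reference: str, observed: str, first_position: int,
                    prefix: str = "c.") -> str:
    """Describe the difference between two aligned sequence fragments.

    `first_position` is the position of reference[0] in the chosen coordinate system.
    """
    limit = min(len(reference), len(observed))
    # Trim the shared nucleotides on the 5' side.
    start = 0
    while start < limit and reference[start] == observed[start]:
        start += 1
    # Trim the shared nucleotides on the 3' side without crossing the 5' trim.
    end = 0
    while end < limit - start and reference[-1 - end] == observed[-1 - end]:
        end += 1
    ref_part = reference[start:len(reference) - end]
    obs_part = observed[start:len(observed) - end]
    if not ref_part or not obs_part:
        raise ValueError("identical sequences or a pure insertion/deletion: out of scope")
    position = first_position + start
    if len(ref_part) == 1 and len(obs_part) == 1:
        # One nucleotide replaced by one other nucleotide: a substitution.
        return f"{prefix}{position}{ref_part}>{obs_part}"
    last = position + len(ref_part) - 1
    span = str(position) if last == position else f"{position}_{last}"
    return f"{prefix}{span}delins{obs_part}"


print(describe_change("GAAGCCAG", "GAATGCAG", 9, prefix="g."))  # g.12_13delinsTG
print(describe_change("CATGACA", "CATAACA", 2074))              # c.2077G>A
print(describe_change("CATGACA", "CATATAACA", 2074))            # c.2077delinsATA
```

The first call uses position 9 for the leading G because the description in the question places the changed GC at g.12 and g.13.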