General#
Since references to websites are not yet acknowledged as citations, please mention Den Dunnen et al. (2016) HGVS recommendations for the description of sequence variants: 2016 update. Hum.Mutat. 25: 37: 564-569 when referring to these pages. Note that although the examples on these pages mainly give examples for human (Homo sapiens), the recommendations can be applied to all species.
Make sure you have also seen the Basics, explaining the history of these recommendations, the process of making changes, the versioning of the recommendations and important remarks on terminology.
General recommendations#
-
all variants should be described at the most basic level, the DNA level. Descriptions on the RNA and/or protein level may be given in addition.
- descriptions should make clear whether the change was experimentally determined or theoretically deduced by giving predicted consequences in parentheses.
- descriptions on RNA/protein level should describe the changes observed on that level (RNA/protein) and not try to incorporate any knowledge regarding the change on DNA-level (see Discussion below).
-
all variants should be described in relation to an accepted reference sequence (see Reference Sequences).
- the reference sequence file used should be public and clearly described, e.g.,
NC_000023.10
,LRG_199
,NG_012232.1
,NM_004006.2
,LRG_199t1
,NR_002196.1
,NP_003997.1
, etc. (see Reference Sequences). - the reference sequence used must contain the residue(s) described to be changed.
- the recommended reference is a genomic reference sequence based on a recent genome build.
- for human, the recommended reference is based on genome build GRCh38/hg38, e.g.,
NC_000023.11
for chromosome X. - when variants are reported in relation to a transcript, the preferred reference sequence is the reference suggested by the MANE project (see Ensembl or NCBI).
NOTE: while Locus Reference Genomic (LRG) reference sequences are still acceptable, new LRGs are no longer generated and RefSeq or Ensembl transcripts specified by the MANE project are preferred for all genes where available to help standardize clinical reporting.
- for human, the recommended reference is based on genome build GRCh38/hg38, e.g.,
- the reference sequence used must contain the residue(s) described to be changed.
- a letter prefix is mandatory to indicate the type of reference sequence used.
Accepted prefixes are;
c
for a coding DNA reference sequence,g
for a linear genomic reference sequence,m
for a mitochondrial DNA reference sequence,n
for a non-coding DNA reference sequence,o
for a circular genomic reference sequence,p
for a protein reference sequence,r
for an RNA reference sequence (transcript).
- the numbering of the residues (nucleotide or amino acid) in relation to the reference sequence used should follow the approved scheme (see Numbering).
- the reference sequence file used should be public and clearly described, e.g.,
-
two variants separated by one or more nucleotides should be described individually and not as a "delins".
- exception: two variants separated by one nucleotide, together affecting one amino acid, should be described as a "delins".
NOTE: the SVD-WG is preparing a proposal to modify this recommendation. To apply the current rule one needs to know whether the two variants are in a coding sequence and affecting one amino acid. Recommendations should be general. The new recommendation will be: two variants separated by less than two nucleotides should be described as a "delins".
- exception: two variants separated by one nucleotide, together affecting one amino acid, should be described as a "delins".
-
3'rule: for all descriptions, the most 3' position possible of the reference sequence is arbitrarily assigned to have been changed.
- the 3'rule also applies for changes in single residue stretches and tandem repeats (nucleotide or amino acid).
- the 3'rule applies to ALL descriptions (genome, gene, transcript, and protein) of a given variant.
- exception: deletions/duplications around exon/exon junctions using c., r. or n. reference sequences (see Numbering).
-
descriptions on DNA, RNA, and protein level are clearly different:
- DNA-level:
123456A>T
number(s) referring to the nucleotide(s) affected, nucleotides in CAPITALS using IUPAC-IUBMB assigned nucleotide symbols. - RNA-level:
76a>u
number(s) referring to the nucleotide(s) affected, nucleotides in lower case using IUPAC-IUBMB assigned nucleotide symbols. - protein level:
Lys76Asn
the amino acid(s) affected in three- or one-letter code followed by a number IUPAC-IUBMB assigned amino acid symbols.- three-letter amino acid code is preferred (see Standards).
- the
*
can be used to indicate the translation stop codon in both one- and three-letter amino acid code descriptions.
- DNA-level:
-
prioritisation: when a description is possible according to several types, the preferred description is: (1) substitution, (2) deletion, (3) inversion, (4) duplication, (5) insertion.
- when a variant can be described as a duplication or an insertion, prioritisation determines it should be described as a duplication.
- descriptions removing part of a reference sequence and replacing it with part of the same sequence are not allowed (e.g.,
NM_004006.2:c.[762_768del;767_774dup]
).
-
only approved HGNC gene symbols should be used to describe genes.
NOTE: to avoid confusion, HGVS recommends to follow the HGNC guidelines to use italics to denote genes.
NOTE: for protein nomenclature, see the International Protein Nomenclature Guidelines, written with the involvement of the HGNC. -
HGVS recommends following the HGNC guidelines and the VICC Gene Fusion Specification nomenclature to describe products of gene fusions.
- The HGNC recommendations include using a
GENESYMBOL1::GENESYMBOL2
syntax for gene-level fusion descriptions, and aGENESYMBOL1-GENESYMBOL2
syntax for read-through transcripts. - The VICC nomenclature extends the HGNC recommendations to include a terminology, information model, and nomenclature for gene-level and exon-level representation, with components for disambiguating regulatory fusions from chimeric transcript fusions.
- HGVS also recommends the use of adjoined transcripts for precise and unambiguous characterization of chimeric transcripts at the sequence level.
- The HGNC recommendations include using a
Characters used#
In HGVS nomenclature, some characters have a specific meaning:
+
(plus) is used in nucleotide numbering;c.123+45A>G
.-
(hyphen-minus) is used in nucleotide numbering;c.124-56C>T
.*
(asterisk) is used in nucleotide numbering and to indicate a translation termination (stop) codon (see Standards);c.*32G>A
andp.Trp41*
._
(underscore) is used to indicate a range;g.12345_12678del
.[ ]
(square brackets) are used for alleles (see DNA, RNA, protein), which includes multiple inserted sequences at one position and insertions from a second reference sequence.;
(semicolon) is used to separate variants and alleles;g.[123456A>G;345678G>C]
org.[123456A>G];[345678G>C]
.,
(comma) is used to separate different transcripts/proteins derived from one allele;r.[123a>u,122_154del]
.NC_000002.11:g.48031621_48031622ins[TAT;48026961_48027223;GGC]
.NC_000002.11:g.47643464_47643465ins[NC_000022.10:35788169_35788352]
.
:
(colon) is used to separate the reference sequence file identifier (accession.version_number) from the actual description of a variant;NC_000011.9:g.12345611G>A
.:
: (double colon) is used to describe adjoined transcripts from gene fusions (RNA Deletion-insertion) and to designate break point junctions creating a ring chromosome (DNA Complex (HGVS/ISCN)).( )
(parentheses) are used to indicate uncertainties and predicted consequences;NC_000023.9:g.(123456_234567)_(345678_456789)del
,p.(Ser123Arg)
.
NOTE: the range of the uncertainty should be described as precisely as possible (see below).?
(question mark) is used to indicate unknown positions (nucleotide or amino acid);g.(?_234567)_(345678_?)del
.^
(caret) is used as "or";c.(370A>C^372C>R)
as back translation ofp.Ser124Arg
(i.e. changing theAGC
codon toCGC
,AGG
, orAGA
).>
(greater than) is used to describe substitution variants (DNA and RNA level);g.12345A>T
,r.123a>u
(see DNA, RNA).=
(equals) is used to indicate a sequence was tested but found unchanged;p.(Arg234=)
./
(forward slash) is used to indicate mosaicism (see DNA substitution).//
(double forward slash) is used to indicate chimerism (see DNA substitution).-
|
(pipe) is used to indicate that not a direct change of the sequence is described but a modification (a change of state, e.g., methylation). (see methylation). -
NOTE: spaces are not permitted in any HGVS description
Abbreviations in variant descriptions#
Specific abbreviations are used to describe different variant types.
>
(greater than) indicates a substitution (DNA and RNA level);g.123456G>A
,r.123c>u
(see DNA, RNA).- a substitution on the protein level is described as
p.Ser321Arg
(see protein).
- a substitution on the protein level is described as
del
indicates a deletion;c.76del
(see DNA, RNA, protein).dup
indicates a duplication;c.76dup
(see DNA, RNA, protein).ins
indicates an insertion;c.76_77insG
(see DNA, RNA, protein).- duplicating insertions are described as duplications, not as insertions.
inv
indicates an inversion;c.76_83inv
(see DNA, RNA). Not used on protein level, usually described as deletion-insertion.fs
indicates a frameshift;p.Arg456GlyfsTer17
(orp.Arg456Glyfs*17
, see Frameshifts).-
ext
indicates an extension;p.Met1ext-5
(see Extension). -
HGVS/ISCN (see Community Consultation):
cen
indicates the centromere of a chromosome.chr
indicates a chromosome;chr11:g.12345611G>A
(NC_000011.9
).pter
indicates the first nucleotide of a chromosome.qter
indicates the last nucleotide of a chromosome.sup
indicates a supernumerary chromosome (marker chromosome).
-
changes of state (modifications):
gom
indicates a gain of methylation;g.12345678_12345901
|gom
.lom
indicates a loss of methylation;g.12345678_12345901
|lom
.met
indicates a methylation;g.12345678_12345901
|met=
.
Discussion#
Are spaces allowed within HGVS variant descriptions?
Spaces have never been permitted within HGVS variant descriptions. Unfortunately, some databases, software, and publications display variant descriptions with internal spaces. These variants do not conform to HGVS Nomenclature and likely create problems for automated parsers.
Some papers and websites use a "-" (hyphen-minus) to indicate a range, is this correct?
The sign used to indicate a range is "_" (underscore) and not a "-" (hyphen-minus).
The minus sign should only be used as a minus in the description of variants based on a coding DNA reference sequence.
c.12-14del
describes a deletion of nucleotide -14 in the intron directly preceding coding DNA nucleotide 12, not a deletion of nucleotides c.12
to c.14
.
Why is it recommended to use three-letter amino acid code to describe protein variants?
Several amino acids start with the same initial letter (Ala
, Arg
, Asn
, and Asp
start with A; Gln
, Glu
, and Gly
with G; Leu
and Lys
with L; Phe
and Pro
with P; Thr
and Tyr
with T) but in one-letter amino acid code, this letter is used as abbreviation for only one.
In practice, this leads to many mistakes.
It is therefore recommended to use three-letter amino acid code abbreviations.
When I want to report a variant on DNA, RNA, and protein level, do I need to use a specific separator?
No, best is to report the variant using the format NC_000023.11:g.32849790T>A
NM_004006.3:c.124A>T
r.(124a>u)
p.(Ser42Cys)
.
NOTE: when several NP_'s are annotated in the NM_ reference sequence, it is mandatory to add the NP_ reference sequence used to describe the variant on protein level.
I found a substitution variant (DNA) which alters splicing, inserting a short sequence in the transcript (RNA), giving a frameshift on protein level. How should I list this variant, as substitution, as splice variant, as insertion or as frameshift?
When listing variant types, HGVS recommends listing them separately for each level, i.e. DNA, RNA, and protein. On DNA level you identified a substitution, on RNA level an insertion, and on protein level a frameshift.
What do you mean with "variants should be described on the protein level and not incorporate knowledge regarding the change on the DNA-level"?
It means that protein variant descriptions should be derived from comparing the variant protein sequence with the reference protein sequence.
Knowledge on the underlying change on the DNA level should not be used.
e.g., when MetTrpSerSer
Ser
HisAsp..
changes to MetTrpSerSer
.
HisAsp..
, this is described as p.Ser5del
.
The information that on the DNA level the change is ..ATGTGGTCC
AGT
TCCCACGAT..
to ..ATGTGGTCC
.
TCCCACGAT..
, so the codon for Ser4 is deleted, is not used; the description p.Ser4del
is not correct.
Is it correct that when I apply the 3'rule for genes that are on the minus strand of a chromosome, the "g." and "c." variant descriptions differ regarding the nucleotide that I describe as deleted?
Yes, when a gene is on the minus strand of a chromosome (opposite transcriptional orientation) and the change is located in a repeated sequence (mono-, di-, tri-, etc. nucleotide stretches), the 3'rule has this as a consequence.
When the chromosome sequence is TGGG
G
CAT..
and one of the G's is deleted (change to TGGG
.
CAT..
), the description based on chromosome coordinates is g.5del
.
When the annotated coding DNA reference sequence is on the minus strand (ATGCCC
C
A..
), the description is c.7del
.
Not only is the deleted nucleotide different (G vs. C), in fact the descriptions also point to another nucleotide, g.5
vs. g.2
(equivalent to c.7
).
Can I describe a deletion when I have not yet sequenced the break point?
Yes, using the characters to indicate uncertainties, i.e. the question mark (?
) and parentheses (()
), such cases can be described.
Describe the range of uncertainty as precise as possible.
For details, see Uncertain.