Protein Composition and Structure

Composition

The database contains 37,866 proteins representing 25,770 named loci. For each locus, the largest isoform was selected for compilation of the statistics that follow. These 25,770 proteins have a mean size of 483 amino acids (aa) and a median of 343 aa. A considerable fraction of the proteins in the data set derives from computational predictions. When these are excluded, the mean increases to 575 aa and the median to 431 aa. This smaller set of 18,886 proteins ranges in size from 25 to 33,423 residues. A single protein in this set, the 58-aa LUZP6, starts with an isoleucine rather than the usual methionine.
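
The selection and summary statistics described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used to build the data set: the record format (locus name, sequence), the function name `largest_isoforms`, and the toy sequences are all assumptions made for the example.

```python
from statistics import mean, median

def largest_isoforms(records):
    """Keep one protein per locus: the longest sequence seen for that locus."""
    best = {}
    for locus, sequence in records:
        # Strictly-greater comparison keeps the first isoform when sizes tie.
        if locus not in best or len(sequence) > len(best[locus]):
            best[locus] = sequence
    return best

# Toy records (hypothetical loci and sequences, not real data).
records = [
    ("GENE1", "MKT"),
    ("GENE1", "MKTAYIAK"),
    ("GENE2", "MSDE"),
    ("GENE3", "M" + "A" * 29),
]

selected = largest_isoforms(records)
sizes = [len(seq) for seq in selected.values()]
mean_size = mean(sizes)      # pulled upward by the one long protein
median_size = median(sizes)  # less sensitive to very large proteins
print(mean_size, median_size)
```

As in the real data, a few very long sequences raise the mean well above the median.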

In the following table, two methods were used to calculate amino acid usage in the 18,886 selected proteins. In the "By protein" column, the composition was calculated for each protein and the values were then averaged. In the "By sequence" column, usage was calculated by treating the 18,886 sequences as one long sequence. The latter method is weighted toward the usage in larger proteins. Neither set of numbers is weighted for expression level.

                          Amino acid            Usage (%)
                                         By protein   By sequence
                        A alanine           7.214       7.010
                        C cysteine          2.491       2.284
                        D aspartate         4.591       4.767
                        E glutamate         6.839       7.124
                        F phenylalanine     3.830       3.664
                        G glycine           6.716       6.577
                        H histidine         2.592       2.623
                        I isoleucine        4.378       4.352
                        K lysine            5.749       5.745
                        L leucine          10.091       9.964
                        M methionine        2.284       2.138
                        N asparagine        3.484       3.603
                        P proline           6.174       6.285
                        Q glutamine         4.578       4.751
                        R arginine          5.804       5.636
                        S serine            7.944       8.302
                        T threonine         5.149       5.315
                        U selenocysteine    0.001       0.000
                        V valine            6.023       5.980
                        W tryptophan        1.277       1.207
                        Y tyrosine          2.793       2.670
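
The difference between the two columns can be made concrete with a short Python sketch. The function names and the two toy sequences are assumptions for illustration only; the numbers in the table come from the full protein set.

```python
from collections import Counter

def usage_by_protein(sequences):
    """Average of per-protein percentages: every protein counts equally."""
    totals = Counter()
    for seq in sequences:
        for aa, n in Counter(seq).items():
            totals[aa] += 100.0 * n / len(seq)
    return {aa: total / len(sequences) for aa, total in totals.items()}

def usage_by_sequence(sequences):
    """Percentages over the concatenated sequences: long proteins dominate."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in counts.items()}

# Two hypothetical proteins: a short alanine-rich one and a longer serine-rich one.
sequences = ["MAAA", "M" + "S" * 9]
by_protein = usage_by_protein(sequences)
by_sequence = usage_by_sequence(sequences)

for aa in sorted(by_protein):
    print(aa, round(by_protein[aa], 3), round(by_sequence[aa], 3))
```

In this toy case serine scores higher "by sequence" than "by protein" because it occurs in the longer protein, which is the weighting effect noted above.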

There are significant variations from the values above in the usage of many amino acids at the amino termini and carboxyl termini of proteins. These differences may be related to frequent modifications or to other processing and degradation pathways. One example of note is the elevated level of cysteine four positions from the carboxyl terminus, likely reflecting prenylation.
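
Position-specific usage of this kind can be tallied as in the sketch below. The function name, the toy sequences, and the convention that offset 1 means the final residue are assumptions for illustration; the text does not state how positions were counted.

```python
from collections import Counter

def usage_from_c_terminus(sequences, offset):
    """Percent usage at `offset` residues from the carboxyl terminus (1 = last residue)."""
    # Sequences shorter than the offset have no residue at that position and are skipped.
    counts = Counter(seq[-offset] for seq in sequences if len(seq) >= offset)
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in counts.items()}

# Hypothetical sequences; the first two carry a cysteine four residues from the end.
sequences = ["MKKCVLS", "MAACAIM", "MDEFGHK", "MQ"]
result = usage_from_c_terminus(sequences, 4)
print(result)
```

Comparing such per-position values with the overall usage table is one way to spot the kind of enrichment described for cysteine.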

The genome encodes several families of proteins with very unusual amino acid compositions. Many of these are smaller proteins such as the protamines, late cornified envelope proteins, and metallothioneins.

The following table provides some additional examples of individual proteins and gene families where larger proteins have unusual compositions. The numbers given are the number of residues of that amino acid and the total size of the protein. Some predicted proteins have been excluded. The relative fractions vary among the amino acids, with those for the tryptophan-rich proteins being considerably lower than the others. For additional information about these proteins, see the sections listed in the right column of the table.

Proteins with High Fractions of Individual Amino Acids
Amino acid	Protein (aa fraction)	Section
alanine	MARCKS (102/332)
	histone H1 family	Histones, Related Proteins, and Modifying Enzymes
	BASP1 (57/227)	Additional Brain Proteins
	HOXA13 (93/388)	HOX Genes
arginine	arginine- / serine-rich splicing factors	Capping and Splicing
asparagine	PYGO1 (50/419)	B cells
aspartate	DSPP (259/1301)	Bone and Related Tissues
	ACRC (122/691)	Nucleus and Nucleolus
	SPP1 (48/314)	Bone and Related Tissues
	ANP32B (38/251)	Nucleus and Nucleolus
cysteine	keratin-associated proteins	Keratins
glutamate	TCHH (526/1943)	Skin and Related Tissues
	RPGR (307/1152)	Crystallins and Other Eye Proteins
	ANP32E (71/268)	Nucleus and Nucleolus
	NSBP1 (73/282)	Nonhistone Chromosomal Proteins
glutamine	ZNF853 (264/659)	Krüppel-related Zinc Finger Proteins
	IVL (150/585)	Skin and Related Tissues
glycine	LOR (145/312)	Skin and Related Tissues
	GAR1 (73/217)	Nucleus and Nucleolus
	keratin-associated proteins	Keratins
	collagens	Collagen
histidine	HRC (89/699)	Calmodulin and Calcium
	HRG (66/525)	Liver
	SLC39A7 (57/469)	Solute Carrier Families
isoleucine	olfactory receptor families	Olfactory Receptors
	type 2 taste receptors	Taste Receptors
leucine	MFSD3 (104/412)	Solute Carrier Families
	GP1BB (47/206)	Platelets and Megakaryocytes
	SLC39A5 (123/540)	Solute Carrier Families
	TMEM82 (78/343)
	PLUNC (58/256)	Lung
lysine	histone H1 family	Histones, Related Proteins, and Modifying Enzymes
	CYLC2 (92/348)	Testes and Sperm
methionine	RGAG1 (145/1388)	DNA Transposons and Retrovirus-related Sequences
phenylalanine	DERL2 (31/239)	ER, Golgi, and the Secretory Pathway
	ALG10 (58/473)	Protein Glycosylation
	DERL3 (29/239)
	ALG10B (57/473)	Protein Glycosylation
proline	proline-rich salivary proteins	Lacrimal and Salivary Glands
serine	DSPP (542/1301)	Bone and Related Tissues
	HRNR (957/2850)	Skin and Related Tissues
threonine	mucins	Mucins
tryptophan	CCDC70 (16/233)	Coiled-Coil Proteins
	CDR1 (17/262)	Cerebellum
tyrosine	DAZ2 (66/558)	Testes and Sperm
	DAZ3 (46/438)	Testes and Sperm
valine	PRLHR (54/370)	Growth Hormone and Related Hormones
	DCXR (32/244)	Kidney
	GPR141 (40/305)	G-Protein-coupled Receptors
	FAHD2A (41/314)	Additional Enzymes and Related Sequences
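
As a worked example of how the fractions in the table compare, the sketch below converts a few of the listed residue counts to percentages. The counts and sizes are copied from the table; the choice of these four entries and the variable names are only for illustration.

```python
# (residue count, protein length in aa) pairs copied from the table above.
examples = {
    ("glycine", "LOR"): (145, 312),
    ("alanine", "MARCKS"): (102, 332),
    ("tryptophan", "CCDC70"): (16, 233),
    ("tryptophan", "CDR1"): (17, 262),
}

percent = {key: 100.0 * count / length for key, (count, length) in examples.items()}

# Print from the highest fraction to the lowest.
for (amino_acid, protein), value in sorted(percent.items(), key=lambda item: -item[1]):
    print(f"{amino_acid:<12}{protein:<8}{value:5.1f}%")
```

Glycine makes up nearly half of LOR, whereas the tryptophan-rich examples stay below 7%, which is still several times the average tryptophan usage of about 1.3% given earlier.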

Many proteins contain short proline-rich regions. Some proteins, such as certain members of the formin family, have very large proline-rich regions that affect the overall composition of the proteins. A similar situation is seen with the leucine-rich repeat proteins.

The small number of proteins containing selenocysteine are described separately (see Selenium Proteins).

Homopolymer segments

Many protein sequences contain long runs of a single amino acid. Notable examples from the largest isoforms in the reference set are presented in the following table (some predicted proteins have been excluded). Proteins often have much larger regions where runs of a single amino acid are broken by one or a few other amino acids. A homopolymer tract is not necessarily encoded by repeats of a single codon; a mix of synonymous codons would increase the stability of the DNA sequences that encode the tract. The proteins are described in the sections listed in the right column.

Proteins with Large Homopolymer Tracts
Amino acid	Protein	Tract length (aa)	Section
alanine	PHOX2B	20	Homeobox and Related Proteins
	FBRS	19	Fibroblast Growth Factors
	HOXA13	18	HOX Genes
aspartate	HRC	16	Calmodulin and Calcium
	ATAD2	14	Bromodomain Family
	ASPN	14	Leucine-rich Repeat Family
glutamate	MYT1	32	Oligodendrocytes and Myelin
	EHMT2	24	Histones, Related Proteins, and Modifying Enzymes
	TTBK1	23	Tubulin and Microtubules
glycine	AR	23	Nuclear Receptors
	POU3F2	21	POU Domain
	CAPNS1	20	Cysteine Proteases
histidine	NR4A3	14	Nuclear Receptors
	DYRK1A	13	Dual-Specificity Protein Kinases
	MEOX2	13	Homeobox and Related Proteins
proline	PCLO	22	Synapses
	FMNL2	21	Cytoskeleton
	ZFHX4	20	Homeobox and Related Proteins
	RAPH1	20	Ras
	WHAMM	20
glutamine	FOXP2	40	FOX Family
	TBP	38	RNA Polymerase and General Transcription Factors
	MAML2	34	Notch Pathway
	EP400	29	Nonhistone Chromosomal Proteins
	NCOA3	29	Nuclear Receptors
	THAP11	29	Zinc Finger Proteins
	MN1	28	Ets Family
arginine	FLJ37078	11
	SLC24A3	10	Solute Carrier Families
serine	TNRC18	58
	SRRM2	42	Capping and Splicing
	MLLT3	42	PHD Finger Proteins
	ARL6IP4	25	ADP-Ribosylation Factors
	SETD1A	24	Histones, Related Proteins, and Modifying Enzymes
	DACH1	24	Additional Genes in Development
threonine	CADM1	13	Additional Genes in Development
	ANK3	12	Ankyrin Family
	KDM6B	11	Histones, Related Proteins, and Modifying Enzymes
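
Tract lengths like those in the table can be found by scanning each sequence for its longest uninterrupted run. The following is a minimal sketch; the function name and the toy sequence are assumptions, and it counts strict runs only, so a tract broken by even one other residue is treated as two separate runs.

```python
from itertools import groupby

def longest_homopolymer(sequence):
    """Return (residue, length) of the longest uninterrupted single-residue run."""
    best_residue, best_length = "", 0
    for residue, run in groupby(sequence):
        length = sum(1 for _ in run)
        if length > best_length:
            best_residue, best_length = residue, length
    return best_residue, best_length

# Hypothetical sequence: a 12-residue glutamine run, a two-residue break, then 5 more.
toy = "MS" + "Q" * 12 + "PA" + "Q" * 5 + "LK"
tract = longest_homopolymer(toy)
print(tract)
```

Detecting the larger, interrupted regions mentioned above would need a tolerance for mismatches, which this sketch deliberately leaves out.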

Very large proteins

The following table provides a list of the largest proteins in the reference set. Only one isoform is listed for each. Predicted proteins are not listed. Note also the very large predicted LOC643677 (7081 aa) and HMCN2 (5065 aa).

Largest Proteins
Gene	Size (aa)	Protein	Section
TTN	33423	titin	Muscle
MUC16	14507	mucin 16 (CA-125 antigen)	Mucins
SYNE1	8797	nesprin 1	Spectrin and Plectin Families
OBSCN	7968	obscurin	Muscle
SYNE2	6907	nesprin 2	Spectrin and Plectin Families
NEB	6669	nebulin	Muscle
GPR98	6306		G-Protein-coupled Receptors
MUC5AC	6207	mucin 5AC	Mucins
MACF1	5938	filament crosslinking protein	Spectrin and Plectin Families
AHNAK	5890		Cytoskeleton
AHNAK2	5795		Cytoskeleton
MUC5B	5765	mucin 5B	Mucins
DST	5675	dystonin	Spectrin and Plectin Families
HMCN1	5635	hemicentin	Additional Genes in Development
MDN1	5596	midasin	Nucleus and Nucleolus
MLL2	5537		PHD Finger Proteins
FCGBP	5405	Fc-binding protein	Fc Receptors
MUC4	5284	mucin 4	Mucins
USH2A	5202	usherin	Auditory and Vestibular Functions
UBR4	5183	retinoblastoma-associated protein	RB1 and Related Functions
MUC2	5179	mucin 2	Mucins
SSPO	5147	subcommissural organ spondin	Additional Genes in Development
PCLO	5142	piccolo	Synapses
HYDIN	5120		Additional Brain Proteins
EPPK1	5090	epiplakin 1	Spectrin and Plectin Families
ABCA13	5058		ATP-binding Cassette Proteins
RYR1	5038	ryanodine receptor	Muscle
KIAA1109	5005

Many of the proteins listed above contain spectrin-type repeats. Additional large proteins are listed with that family. Larger proteins often contain repeating domains such as those first identified in epidermal growth factor and fibronectin.

Protein modifications

Peptide processing and posttranslational modifications of proteins are presented in detail in the chapters on Proteases and Translation and Protein Modification. The presence of large gene families for proteins that are the substrates for such modifications can be helpful in identifying sequences important for these functions.

Proteins with the γ-carboxyglutamate modification are described in the section on coagulation. The following figure shows the amino acid usage (darker being more conserved) in a partial alignment of 11 of these proteins (see Notes and References). Note the completely conserved glutamate residues near the center of the alignments. Interpretation of such alignments can be complex. In this case, a number of these proteins are also processed by cleavage amino-terminal to the relatively conserved alanine at position 18 in the figure.
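
The scoring used to shade the figure is not described here. One simple measure of per-column conservation, shown purely as an assumption-laden sketch, is the fraction of aligned sequences that share the most common residue at each position. The function name and the toy alignment below are hypothetical and are not the 11 proteins in the figure.

```python
from collections import Counter

def column_conservation(aligned):
    """Fraction of sequences sharing the most common residue in each column."""
    scores = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]  # ignore gap characters
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        # Dividing by the full column height penalizes columns that contain gaps.
        scores.append(most_common_count / len(column))
    return scores

# Toy alignment of equal-length rows ("-" marks a gap).
aligned = ["AEECL", "AEEC-", "SEECV", "AEDCL"]
scores = column_conservation(aligned)
print(scores)
```

Columns scoring 1.0 correspond to completely conserved positions, such as the glutamate residues noted above.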

Another example of shared sequences around the location of a modified amino acid is seen at the active site of sulfatases. In these enzymes, a cysteine is converted to formylglycine.

Notes and references

Many references and other information for individual genes can be found in the RefSeq entries linked via the pages for the proteins mentioned in this section. A table of these entries (with the corresponding gene identifiers) and a collection of their sequences also are available.

The tables in this section were constructed using the human RefSeq protein set available at the time release 37.1 of the human reference genome sequence became available. There are some differences between this protein set and the genes annotated onto the reference genome.

The RefSeq proteins are associated with specific transcripts and there are often multiple transcripts for a given gene that may produce distinct or identical protein products. As explained in this section, this protein set was reduced by eliminating gene predictions and then choosing a single largest isoform for each gene. Also, only protein sequences derived from the reference mitochondrial genome were retained.

To produce the figure on carboxyglutamate-containing proteins, amino acids 24-85 from PROZ were used in searches to produce the alignments. The proteins used are those listed in the example in the section on coagulation except for PRRG2. MGP and BGLAP were also omitted.

See also the additional reading for this chapter.
