The basics of protein identification
are simple: proteins are digested
and the resulting peptides are analyzed by mass spectrometry
to obtain their mass data. Three types of
MS data can be used for a database search: (1) molecular
weights of peptides, used for peptide
mass mapping; (2) a combination of mass data and partial amino
acid sequence, used for a sequence
tag search; and (3) uninterpreted tandem mass spectrometry data,
used for an MS/MS fragmentation
ion search.
Most current search algorithms incorporate
a statistical treatment to judge whether the search results
are significant. The three searches above allow,
at least in theory, the identification of any known protein (one
present in the database), but they cannot identify an unknown
protein. For unknown proteins (absent from the database), it is best
to obtain the actual amino acid sequence by interpreting the MS/MS
data (de novo sequencing by MS/MS).
Interpreted amino acid sequences can also be used for a sequence
homology search (FASTA or BLAST search).
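
As a rough illustration of the first search type, peptide mass mapping compares measured peptide masses with masses calculated from an in-silico digest of each database sequence. The Python sketch below shows only the calculation side, under stated assumptions: the cleavage rule (C-terminal to Lys or Arg, but not before Pro) is the commonly used simplification for trypsin, the residue masses are standard monoisotopic values, and the function names and the example sequence are hypothetical, not taken from any search engine.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide (free N- and C-termini)
PROTON = 1.00728   # added for the singly charged [M+H]+ ion


def tryptic_digest(sequence):
    """Split a sequence after K or R, except when the next residue is P.

    Assumption: complete digestion, no missed cleavages.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def mh_plus(peptide):
    """m/z of the singly protonated peptide, as typically seen in MALDI-TOF."""
    return peptide_mass(peptide) + PROTON


if __name__ == "__main__":
    example = "AKRPGKMGDEFR"  # arbitrary sequence, for illustration only
    for pep in tryptic_digest(example):
        print(f"{pep:10s} [M+H]+ = {mh_plus(pep):.4f}")
```

A real search would also consider missed cleavages and modifications such as cysteine alkylation or methionine oxidation, which this sketch leaves out.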
General Strategy for the Identification of Gel (SDS-PAGE or 2-D Gel) Purified Proteins Adopted by the MSF
Most proteins submitted to the MSF are gel purified, so
the process begins with in-gel digestion with trypsin. Peptides
are extracted at the completion of digestion, and a small aliquot
(less than 10%) of the digest is analyzed to obtain the molecular
weights of the tryptic peptides using one of the MALDI-TOF instruments
(Bruker Biflex III or ABI DE-PRO). The resulting MW information is
then used for peptide mass fingerprinting to identify the protein.
When proteins are identified with high confidence (statistically
significant identification), no further MS analysis is usually
pursued. The MALDI-TOF data are also used to monitor the efficiency
of the digestion and to estimate the amount of protein digested
in the gel.
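
The comparison step in peptide mass fingerprinting can be sketched as follows. This is an illustration only: the 50 ppm tolerance, the function names and the example masses are assumptions made up for the sketch, and real search engines add statistical scoring on top of simple match counting.

```python
def ppm_error(observed, theoretical):
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def match_peaks(observed_masses, theoretical_masses, tol_ppm=50.0):
    """Return (observed, theoretical, ppm) for every pair within tolerance.

    tol_ppm is an assumed instrument tolerance, not a value from the MSF.
    """
    matches = []
    for obs in observed_masses:
        for theo in theoretical_masses:
            err = ppm_error(obs, theo)
            if abs(err) <= tol_ppm:
                matches.append((obs, theo, err))
    return matches


def fraction_matched(observed_masses, theoretical_masses, tol_ppm=50.0):
    """Fraction of observed peaks explained by at least one theoretical peptide."""
    if not observed_masses:
        return 0.0
    matched = {obs for obs, _, _ in
               match_peaks(observed_masses, theoretical_masses, tol_ppm)}
    return len(matched) / len(observed_masses)


if __name__ == "__main__":
    # Made-up numbers for illustration; not real spectra.
    observed = [1000.00, 1500.10]
    theoretical = [1000.03, 1500.75, 2000.00]
    for obs, theo, err in match_peaks(observed, theoretical):
        print(f"observed {obs:.2f} ~ theoretical {theo:.2f} ({err:+.1f} ppm)")
    print("fraction of peaks matched:", fraction_matched(observed, theoretical))
```

A low fraction of matched peaks is one hint that the digest contains more than one protein or that the protein is absent from the database.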
However, when proteins cannot
be identified unambiguously by peptide mass mapping, the digest
is further analyzed by a hybrid nanospray/ESI-Quadrupole-TOF-MS
and MS/MS in a QSTAR mass spectrometer (Applied Biosystems Inc.,
Foster City, CA) for de novo peptide sequencing, sequence tag
search, and/or MS/MS ion search. Static nanospray MS/MS
is especially useful when the target protein is not known
(database absent). Interpreted MS/MS data can be used for a
sequence homology search. For proteins from known genome
databases that cannot be identified with a statistically significant
score by peptide mass mapping because of impurities (presence of
more than two proteins), a cLC/MS/MS analysis is performed using
a Finnigan LCQ Deca XP coupled to a Michrom MS4 LC, and the resulting data
are used for a Sequest search.
For the highest confidence in
protein identification, we often perform a "two- to
three-layered search": a combination of peptide mass mapping,
sequence query (sequence tag) and de novo sequencing.
Digestion with other endopeptidases
Not all proteins are digested
equally well by trypsin. For example, highly hydrophobic and
aggregated proteins, and extensively glycosylated proteins are
not digested well. Trypsin may not be the ideal enzyme for very
small proteins (<10 kDa) or for some very acidic proteins that
lack, or contain few, basic amino acids (Lys
and Arg) in the sequence. When trypsin fails to digest the protein,
we use alternative digestion protocols as an additional effort:
CNBr/trypsin double digestion or Lys-C digestion in the presence
of SDS for aggregated proteins (SDS is removed afterward by
organic solvent partitioning); PNGaseF/trypsin digestion for
highly glycosylated proteins; chymotryptic digestion for proteins
with few or no basic amino acids. Edman sequencing
is an option for non-tryptic peptides after fractionation on
a capillary C18 column (0.3 mm ID), if the starting amount of
protein is sufficient (>5 pmol). Our Procise Edman sequencer
can analyze 1 to 2 pmol of loaded peptide, with a UV
detection limit of 300 fmol.