A database of local structures of protein segments

Glossary


3D mesh gridding (3D_MESH)
One of the classification methods adopted in ProSeg. First, a principal component analysis (PCA) was carried out on the 4179 segments corresponding to the cluster centroids obtained from a one-pass clustering analysis (L=9, Dth=40). Second, a virtual 3-dimensional space was created using the three most significant principal components as axes. Third, all segments in the database were mapped onto this 3D space. Since the backbone structure of each segment is defined by 27 (=3L) angles, a 27-dimensional space is needed for a complete description of segment structure, so this step involves some loss of information. However, the 3D mapping enables visualization of the intrinsically multi-dimensional space of the protein universe with minimal loss of information. Finally, the 3D space is divided into 27000 voxels using a 30x30x30 mesh. Each voxel containing at least one segment is considered a class in this classification. See the paper [1] for details.

Picture for 3D mesh gridding concept
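
The steps of the 3D mesh gridding procedure can be sketched roughly as follows. This is only an illustration under stated assumptions, not the ProSeg implementation: the function name mesh_classes is hypothetical, angles are treated as plain linear values (periodicity is ignored), the mesh bounds are taken from the minimum and maximum of the projected data, and random numbers stand in for real dihedral angles.

```python
import numpy as np

def mesh_classes(center_angles, all_angles, bins=30):
    """Fit PCA on cluster-centroid angle vectors, project every segment
    onto the top three components, and bin the result into a bins^3 mesh.
    Hypothetical helper; angle periodicity is ignored (assumption)."""
    mean = center_angles.mean(axis=0)
    # Principal axes from the SVD of the mean-centred centroid matrix.
    _, _, vt = np.linalg.svd(center_angles - mean, full_matrices=False)
    axes = vt[:3]                              # three most significant components
    coords = (all_angles - mean) @ axes.T      # shape (n_segments, 3)
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)     # avoid division by zero
    idx = np.floor((coords - lo) / span * bins).astype(int)
    idx = np.clip(idx, 0, bins - 1)            # maximum value goes in the last bin
    # Flatten the (i, j, k) voxel index into a single class label.
    return idx[:, 0] * bins * bins + idx[:, 1] * bins + idx[:, 2]

# Placeholder data: 50 centroids and 200 segments with 27 angles each (L=9).
rng = np.random.default_rng(0)
centers = rng.uniform(-180, 180, size=(50, 27))
segments = rng.uniform(-180, 180, size=(200, 27))
labels = mesh_classes(centers, segments)
n_classes = len(np.unique(labels))             # occupied voxels = classes
```

Only occupied voxels appear among the labels, which matches the rule that a voxel counts as a class only if it contains at least one segment.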

Cartesian RMS deviation (RMSD) [Å]
A parameter indicating the degree of structural congruence among segments. RMSD is the root mean square deviation of the Cartesian coordinates of the assigned segments from their averaged coordinates, calculated after the segments have been superimposed with minimum deviation. Unless explicitly stated otherwise, only backbone atoms (i.e. NH, Cα, and C atoms) are considered in the RMSD calculation.
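
A minimal sketch of the final step of this calculation, assuming the segments have already been superimposed (the superposition itself is not shown). The function name rmsd_from_mean and the array layout are hypothetical.

```python
import numpy as np

def rmsd_from_mean(coords):
    """coords: array of shape (n_segments, n_atoms, 3) holding backbone
    coordinates that are ALREADY superimposed (assumption).
    Returns the RMS deviation from the averaged coordinates."""
    coords = np.asarray(coords, dtype=float)
    mean = coords.mean(axis=0)                     # averaged coordinates
    sq_dist = ((coords - mean) ** 2).sum(axis=2)   # squared distance per atom
    return float(np.sqrt(sq_dist.mean()))
```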
centroid of cluster
Center of mass of a cluster in the multi-dimensional space of the protein universe. This corresponds to a fictitious segment whose dihedral angles are the averages of the dihedral angles of a set of segments that were assigned to the cluster.
chain ID
A symbol (e.g. A, B, or C) that identifies a particular chain in a PDB entry when a protein consists of multiple chains. When a protein consists of a single chain, use the symbol “_” (underscore) as the chain ID.
cluster
A set of segments having a similar backbone structure. Clusters are obtained as the result of classification (clustering analysis).
cluster center
See “centroid of cluster”
cluster ID (CID)
An identification number of clusters.
cluster rank (r)
The position of a cluster when all clusters are sorted in descending order of frequency.
clustering condition
A method, parameters, and a dataset that were used in clustering analysis.
clustering method
Computational methods for classification. Currently, two alternative methods, one-pass clustering and 3D mesh gridding, are adopted in ProSeg.
Culled PDB
One of the datasets for representative proteins. The Culled PDB (version: Dec. 13, 2001), which is now available in ProSeg, contains 370 protein chains that were selected under the following conditions: resolution < 1.6 Å; R-factor < 0.2; sequence identity < 25% (ref. Wang, G. and Dunbrack, R. L. Jr. (2003) Bioinformatics 19, 1589-1591).

Chain lists:153L_ 16PK_ 1A2PA 1A4IA 1A6M_ 1A8D_ 1A8E_ 1ABA_ 1ADS_ 1AHO_ 1AIE_ 1AJSA 1AMM_ 1AMTA 1AOP_ 1ARB_ 1ARU_ 1ATG_ 1B0B_ 1B0UA 1B0YA 1B16A 1B3AA 1B4VA 1B5EA 1B6A_ 1B6G_ 1B8OA 1BFD_ 1BGF_ 1BKF_ 1BKRA 1BTEA 1BX4A 1BX7_ 1BXAA 1BXOA 1BYI_ 1BYQA 1C0PA 1C1DA 1C1KA 1C3WA 1C4QA 1C52_ 1C5EA 1C75A 1C7KA 1C8CA 1C9OA 1CC8A 1CCWA 1CCWB 1CEX_ 1CG5B 1CRUA 1CS1A 1CSEI 1CSH_ 1CTJ_ 1CTQA 1CUOA 1CXQA 1CY5A 1CYO_ 1CZPA 1D3GA 1D4OA 1D5TA 1D8WA 1DBFA 1DCIA 1DCS_ 1DEOA 1DF4A 1DFMA 1DG6A 1DI6A 1DJ0A 1DK8A 1DLFH 1DLFL 1DLWA 1DP7P 1DPSA 1DQZA 1DS1A 1DVJA 1DY5A 1DYPA 1DYSA 1E19A 1E29A 1E2UA 1E30A 1E4MM 1E58A 1E5KA 1E5MA 1E6UA 1E7LA 1E85A 1E9EA 1EAJA 1EB6A 1EDG_ 1EDMB 1EG9A 1EG9B 1EGUA 1EJ0A 1EJ8A 1EJGA 1ELKA 1ELWA 1EN2A 1EP0A 1EQOA 1ES9A 1ET1A 1EU1A 1EUVA 1EUVB 1EUWA 1EW4A 1EYVA 1EZGA 1EZM_ 1F0IA 1F0LA 1F1EA 1F24A 1F2TA 1F46A 1F4PA 1F74A 1F7LA 1F86A 1F8EA 1F94A 1FAZA 1FCQA 1FCYA 1FI2A 1FIUA 1FJ2A 1FK5A 1FLMA 1FM0D 1FM0E 1FN8A 1FO8A 1FQTA 1FR3A 1FS7A 1FSGA 1FT5A 1FVGA 1FW9A 1FX2A 1FXMA 1FYEA 1G2BA 1G2QA 1G2RA 1G2YA 1G3P_ 1G4IA 1G57A 1G5AA 1G61A 1G66A 1G6SA 1G6UA 1G6XA 1G7AA 1G7AB 1G8QA 1G9OA 1GA6A 1GCI_ 1GD0A 1GK8A 1GK8I 1GM7A 1GM7B 1GMXA 1GOIA 1H4XA 1H61A 1H6RA 1H6TA 1H8DL 1H96A 1H97A 1HBNA 1HBNB 1HBNC 1HBZA 1HD2A 1HDHA 1HDOA 1HETA 1HFES 1HG7A 1HNJA 1HOZA 1HQ1A 1HVBA 1HW1A 1HX0A 1HXIA 1HYOA 1HZ4A 1HZTA 1HZYA 1I0HA 1I0VA 1I27A 1I2TA 1I40A 1I4FA 1I4UA 1I5GA 1I6WA 1I71A 1I8OA 1ID0A 1IFC_ 1IHRA 1IJVA 1IKHA 1INLA 1IQQA 1ISUA 1IXH_ 1J6ZA 1J97A 1J98A 1J9BA 1JB3A 1JBEA 1JCLA 1JD0A 1JER_ 1JG1A 1JHGA 1JK3A 1JRRA 1JY3N 1JY3O 1JY3P 1JZ8A 1K0MA 1K20A 1K3IA 1K55A 1K92A 1KA1A 1KOE_ 1LAM_ 1LKKA 1LUCA 1MFMA 1MLA_ 1MOQ_ 1MRJ_ 1MRP_ 1MUN_ 1NKD_ 1NLS_ 1NOX_ 1OAA_ 1OPD_ 1ORC_ 1PA2A 1PPN_ 1PSRA 1QCZA 1QDDA 1QFMA 1QFTA 1QG8A 1QGIA 1QH4A 1QH8A 1QH8B 1QHVA 1QJ4A 1QKSA 1QL0A 1QLWA 1QMGA 1QNRA 1QOPA 1QOPB 1QOWA 1QPCA 1QQ9A 1QQFA 1QREA 1QRRA 1QTNA 1QTNB 1QTOA 1QTSA 1QTWA 1QU9A 1RA9_ 1RB9_ 1RGEA 1RHS_ 1RIE_ 1SGPI 1SVFA 1SVFB 1SWUA 1TCA_ 1THFD 1VFYA 1WFBA 1WHI_ 1XNB_ 1YGE_ 1ZIN_ 256BA 2A0B_ 2ARCA 2BTCI 2CTC_ 2DRI_ 
2END_ 2ENG_ 2ERL_ 2FDN_ 2IGD_ 2ILK_ 2LISA 2MCM_ 2NLRA 2OLBA 2PTH_ 2PVBA 2RN2_ 2SGA_ 2SNS_ 2TPSA 3CAOA 3CHBD 3CYR_ 3EBX_ 3EZMA 3GRS_ 3LZT_ 3NUL_ 3PYP_ 3SEB_ 3SIL_ 3VUB_ 3XIS_ 4EUGA 4FGF_ 4UBPA 4UBPB 4UBPC 6RLXA 6RLXB 7A3HA 7ODCA 8ABP_
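
Each entry in the chain list is a four-character PDB ID followed by a one-character chain ID, with “_” marking a single-chain protein. A small sketch of how such an entry could be split; the function name split_chain_entry and the use of None for single-chain proteins are assumptions made for illustration.

```python
def split_chain_entry(entry):
    """Split a five-character chain-list entry such as '1A2PA' into
    (PDB ID, chain ID). A chain ID of '_' is returned as None
    (hypothetical convention for single-chain proteins)."""
    if len(entry) != 5:
        raise ValueError(f"expected 5 characters, got {entry!r}")
    pdb_id, chain_id = entry[:4], entry[4]
    return pdb_id, (None if chain_id == "_" else chain_id)
```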

DSSP
A method proposed by Kabsch and Sander to standardize the secondary structure assignment of proteins. DSSP stands for “definition of secondary structure of proteins” or “database of secondary structure in proteins”. DSSP is also the name of the program that calculates the DSSP code from PDB entries (ref. Kabsch, W. and Sander, C. (1983) Biopolymers 22, 2577-2637). In ProSeg, the following five symbols are used according to the DSSP code: H = alpha-helix; E = extended strand, participates in beta-ladder; T = hydrogen bonded turn; S = bend; _ = others or cannot be assigned.
dihedral distance
See “structural dissimilarity”
distance
See “structural dissimilarity”
frequency (fcls)
The ratio of the number of segments assigned to a certain cluster to the total number of segments analyzed.
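
As an illustration, the number of segments (Mr), frequency (fcls), and cluster rank (r) could be derived from a list of cluster assignments as sketched here. The function name, starting the rank at 1, and breaking ties by cluster ID are assumptions.

```python
from collections import Counter

def cluster_frequencies(assignments):
    """assignments: one cluster ID per segment.
    Returns a list of (rank, cluster ID, number of segments, frequency),
    sorted by descending frequency. Rank starts at 1 and ties are broken
    by cluster ID (both are assumptions)."""
    counts = Counter(assignments)
    total = len(assignments)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, cid, m, m / total)
            for rank, (cid, m) in enumerate(ordered, start=1)]
```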
frequency counts
The number of times that a certain type of amino acid appears at a particular position within the segments assigned to a particular cluster. See “position specific scoring matrix” for details.
KL entropy
See “Kullback-Leibler entropy”
Kullback-Leibler entropy (KL) [bit]
A scalar value indicating the degree of amino acid preference of the segments assigned to a certain cluster. The KL value is zero when the frequencies of amino acids at all positions equal the averaged frequencies over all segments. In contrast, the value becomes large when particular amino acids appear frequently at certain positions. See the paper [2] for details.
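
A rough sketch of a Kullback-Leibler calculation in bits that behaves as described. The function name kl_bits is hypothetical, and summing over positions (rather than averaging) is an assumption; the exact definition used in ProSeg is given in the paper [2].

```python
import math

def kl_bits(position_counts, background):
    """position_counts: one dict per segment position, mapping amino acid
    to its frequency count in the cluster.
    background: dict mapping amino acid to its averaged frequency over
    all segments.
    Returns the KL value in bits, summed over positions (assumption)."""
    total = 0.0
    for counts in position_counts:
        n = sum(counts.values())
        for aa, c in counts.items():
            if c == 0:
                continue                  # 0 * log(0) is taken as 0
            p = c / n
            total += p * math.log2(p / background[aa])
    return total
```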
non-redundant proteins
See “representative proteins”
normalized frequency
See “frequency”
number of hydrogen bonds (HB)
The average number of hydrogen bonds formed between the backbones of segments in a cluster. The hydrogen-donor atom and the hydrogen-acceptor atom are each counted as 0.5. Therefore, HB = 1.0 when a segment involves one pair of hydrogen-donor and hydrogen-acceptor atoms. Hydrogen bonds were assigned using the DSSP program. Only the bonds whose stabilization energy exceeded 1 kcal/mol were taken into account.
number of segments (Mr)
The number of segments assigned to a certain cluster.
one-pass clustering (ONE_PASS)
One of the classification methods adopted in ProSeg. This is an unsupervised, non-hierarchical clustering algorithm. One-pass clustering does not require the total number of clusters to be specified before clustering, and it needs no time-consuming iterative calculations. Therefore, the method is applicable to problems in which the number of clusters is unknown, and it can process large-scale calculations faster than other non-hierarchical clustering methods, such as k-means and self-organizing maps. The one-pass method also has an intrinsic advantage in classifying samples with a highly unbalanced distribution. See the paper [2] for details.
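
A generic one-pass (leader-style) clustering loop is sketched here to show why neither the number of clusters nor an iteration is needed. It is not the ProSeg code: the function names are hypothetical, the first member of each cluster is kept as its fixed center, and the dissimilarity is assumed to be the root mean square of angle differences wrapped into 0° to 180°.

```python
import math

def angle_diff(a, b):
    """Difference between two angles in degrees, mapped into 0..180
    using cosine and arccosine."""
    return math.degrees(math.acos(math.cos(math.radians(a - b))))

def dissimilarity(s, t):
    """RMS of wrapped angle differences (assumed form of the distance)."""
    return math.sqrt(sum(angle_diff(a, b) ** 2 for a, b in zip(s, t)) / len(s))

def one_pass(segments, d_th):
    """Assign each segment to its nearest existing center, or open a new
    cluster when no center lies within d_th. Each segment is visited once."""
    centers, labels = [], []
    for seg in segments:
        best, best_d = None, float("inf")
        for cid, center in enumerate(centers):
            d = dissimilarity(seg, center)
            if d < best_d:
                best, best_d = cid, d
        if best is None or best_d > d_th:
            centers.append(seg)           # this segment becomes a new center
            best = len(centers) - 1
        labels.append(best)
    return labels, centers

# Four toy segments described by three angles each (placeholder values).
toy = [(-60, -45, 180), (-62, -43, 179), (-120, 130, 180), (-118, 128, -179)]
toy_labels, toy_centers = one_pass(toy, d_th=40)
```

With these toy values the first two segments share one cluster and the last two share another, so the number of clusters emerges from the threshold rather than being fixed in advance.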
phi, psi, omega (φ, ψ, ω) [°, degree]
Torsion angles of a peptide bond, as defined by IUPAC. A value within the range between -180° and +180° should be used in ProSeg.
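
Angles reported on a different scale (for example 0° to 360°) can be brought into this range with a simple wrap. This sketch maps values into the half-open interval from -180° up to but not including +180°, so +180° becomes -180°; that boundary choice and the function name are assumptions.

```python
def wrap_angle(deg):
    """Map an angle in degrees into [-180, 180)."""
    return (deg + 180.0) % 360.0 - 180.0
```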
position specific scoring matrix (PSSM)
A mathematical matrix indicating the degree of amino acid preference of segments in a certain cluster. PSSM is prepared by converting a multiple alignment of sequences. Although there are several ways to convert a multiple alignment into a score, a simple amino acid propensity is calculated without any weights and pseudo-counts and is used as a score in ProSeg. This propensity corresponds to the ratio of the frequency count of a certain type of amino acid appearing at a particular position to the global frequency count of the amino acid.
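
A minimal sketch of such an unweighted propensity calculation, without pseudo-counts. The function name is hypothetical, and the ratio is interpreted here as relative frequency at a position divided by relative global frequency, which is an assumption about the normalization.

```python
from collections import Counter

def propensity_matrix(cluster_seqs, all_seqs):
    """cluster_seqs: equal-length sequences of the segments in one cluster.
    all_seqs: sequences of all segments analyzed.
    Returns one dict per position mapping amino acid to its propensity
    (position frequency / global frequency; normalization is assumed)."""
    global_counts = Counter("".join(all_seqs))
    global_total = sum(global_counts.values())
    n = len(cluster_seqs)
    matrix = []
    for pos in range(len(cluster_seqs[0])):
        column = Counter(seq[pos] for seq in cluster_seqs)
        matrix.append({aa: (c / n) / (global_counts[aa] / global_total)
                       for aa, c in column.items()})
    return matrix
```

A propensity above 1 means the amino acid is more common at that position in the cluster than in the dataset as a whole.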
Protein Data Bank (PDB)
The database of 3D structures of proteins and nucleic acids. The PDB is the single worldwide repository for the processing and distribution of 3D structure data of proteins and nucleic acids. All entries are available via the Internet (http://www.rcsb.org/pdb/). PDB ID consists of four alphanumeric characters.
radius of gyration (RG) [Å]
The average radius of gyration of the segments in a cluster. RG is a parameter that characterizes the size of a particle regardless of its shape. In ProSeg, RG is calculated simply as the root mean square distance of the Cα atoms in the backbone from the center of mass of these Cα atoms.
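
For a single segment, the calculation described above can be sketched as follows, assuming equal masses so that the center of mass is the plain mean of the Cα positions. The function name is hypothetical; the cluster-level RG would then be the average of this value over the segments in the cluster.

```python
import numpy as np

def radius_of_gyration(ca_coords):
    """ca_coords: array of shape (n_residues, 3) with C-alpha positions
    in angstroms. Equal atomic masses are assumed."""
    ca = np.asarray(ca_coords, dtype=float)
    center = ca.mean(axis=0)                       # center of mass
    return float(np.sqrt(((ca - center) ** 2).sum(axis=1).mean()))
```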
rank, ranking
See “cluster rank”
representative protein dataset (RPD)
A set of representative proteins. Currently, only one dataset, the Culled PDB (version: Dec. 13, 2001), is available in ProSeg.
representative proteins
Proteins that have no evolutionary relationship and no structural similarity to each other. Since the PDB contains many redundant entries derived from homologous, structurally similar, or mutant proteins, these redundant entries should be eliminated before proper statistical analyses.
RMS deviation
See “Cartesian RMS deviation”
segment ID (SID)
An identification number of segments.
segment length (L)
The length of a segment, i.e. the number of consecutive amino acid residues in each divided segment. Currently, only L = 5, 9, 11, and 15 are available in ProSeg.
set of representative proteins
See “representative protein dataset”
single-pass clustering
See “one-pass clustering”
start residue, center residue, and end residue
Start, center, and end residues are the N-terminal, central, and C-terminal amino acid residues of a segment, respectively. In ProSeg, the numbers of the start, center, and end residues are assigned as integers by simply counting from the N-terminal end of a protein. Therefore, a number may not agree with the original residue number in the PDB entry, because the original numbering sometimes includes gaps or alphanumeric symbols.
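
For the odd segment lengths available in ProSeg, the three residue numbers follow directly from the start residue and L. A small sketch, assuming counting starts at 1 for the N-terminal residue (the source does not state the starting index); the function name is hypothetical.

```python
def segment_residues(start, length):
    """Return (start, center, end) residue numbers for a segment of odd
    length that begins at residue number `start` (1-based, assumed)."""
    if length % 2 == 0:
        raise ValueError("a central residue requires an odd segment length")
    return start, start + length // 2, start + length - 1
```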
structural dissimilarity (Dssim, D) [°, degree]
A parameter indicating the degree of dissimilarity in structure between two segments. This parameter is defined as an averaged Euclidean distance between the two sets of backbone dihedral angles of the segments, as shown in the following equation. Cosine and arccosine functions are used to convert the difference between two angles into a value between 0° and 180°. This parameter is also used to evaluate the dissimilarities between a query segment and cluster centers, and between the assigned segments and their cluster center.

[Equation: structural dissimilarity]
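
One form consistent with the verbal description is written out here as a sketch for two segments A and B of length L. The root-mean-square form and the normalization by the 3L angles are assumptions, not a transcription of the original equation; the paper [2] gives the authoritative definition.

```latex
D(A,B) = \sqrt{\frac{1}{3L}\sum_{i=1}^{L}\;\sum_{\theta \in \{\phi,\psi,\omega\}}
  \Bigl[\arccos\bigl(\cos(\theta_i^{A} - \theta_i^{B})\bigr)\Bigr]^{2}}
```

Because arccos(cos x) lies between 0° and 180°, two angles such as +179° and -179° differ by 2° rather than 358°.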

threshold parameter (Dth) [°, degree]
A parameter required by the one-pass clustering method. It controls when a new cluster is created and must be set arbitrarily before clustering. In ProSeg, a value of 30° or 40° is used for Dth. See the paper [2] for details.