Data mining of the PDB
The following data were mined from REMARK 280 of PDB files, which gives details of crystallization conditions. This involved downloading gigabytes of data and analysing them with Perl and Python scripts. Further details are given in the Methods section, below.
1. The most popular organic and salt precipitants
Ammonium sulfate was the most popular precipitant with 900 entries, followed by PEG 4K and PEG 8K. However, if you combine the medium and high molecular-weight PEGs (1968 entries) they easily outnumber ammonium sulfate. Salts are generally less popular than organic materials. For more information see http://www.douglas.co.uk/top14.htm
2. The temperature used in crystallization experiments
Room temperature is the most popular temperature, followed by 4-8°C. Prior to 1998, 4°C was the most popular temperature. Since then, the range 24-28°C has increased in popularity, which may reflect the higher proportion of crystallized proteins that come from thermophilic organisms.
3. The protein concentration used in crystallization experiments
Proteins have been crystallized at concentrations as low as 0.75 mg/ml and as high as 300 mg/ml. Concentrations of 5, 10 and 20 mg/ml are over-represented because these values are often selected at the start of the crystallization procedure and not adjusted during optimization. Without this bias, the most successful concentration would probably be around 15 mg/ml.
4. The protein concentration used for crystallization, plotted against the number of amino acids in all chains; also the ammonium sulfate concentration, plotted against the number of amino acids
The number of amino acids was extracted from the PDB file. For oligomeric complexes, the total number of amino acids in all chains was counted (for a homo-oligomer, this is the number of amino acids in each chain multiplied by the number of monomers in the complex). Small correlations were seen in both cases, with smaller proteins being crystallized with higher protein and ammonium sulfate concentrations on average, compared to larger proteins. The average ammonium sulfate concentration used for proteins with fewer than 250 amino acids was 1.80 M, while for those with over 1000 it was 2.01 M. Similarly, the average protein concentration used for proteins with fewer than 250 amino acids was 16.14 mg/ml, while for those with over 1000 it was 13.18 mg/ml. Peat et al. performed a similar analysis using a standard z-test (Acta Cryst. (2005). D61, 1662-1669). They found that the relationship between the molecular weight of a protein and the concentration of ammonium sulfate used for crystallization was "highly statistically significant" (p. 1666).
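The residue count described above can be sketched in Python. This is a minimal illustration, not the original script: it assumes the count is taken from the fixed-column SEQRES records of a PDB file (chain identifier in column 12, residues per chain in columns 14-17), and the function name `count_residues` is hypothetical.

```python
def count_residues(pdb_lines):
    """Return (residues per chain, total over all chains) from SEQRES records.

    Assumption: the chain length is read from the numRes field, which is
    repeated on every SEQRES line belonging to the same chain.
    """
    per_chain = {}
    for line in pdb_lines:
        if line.startswith("SEQRES"):
            chain_id = line[11]            # column 12: chain identifier
            num_res = int(line[13:17])     # columns 14-17: residues in chain
            per_chain[chain_id] = num_res  # same value on each line of a chain
    return per_chain, sum(per_chain.values())


def make_seqres(serial, chain, num_res, residues):
    """Build a SEQRES line with the standard column layout (for the demo only)."""
    return f"SEQRES {serial:>3} {chain} {num_res:>4}  {residues}"


# Invented two-chain example, 15 residues per chain.
demo = [
    make_seqres(1, "A", 15, "LYS VAL PHE GLY ARG CYS GLU LEU ALA ALA ALA MET LYS"),
    make_seqres(2, "A", 15, "ARG HIS"),
    make_seqres(1, "B", 15, "LYS VAL PHE GLY ARG CYS GLU LEU ALA ALA ALA MET LYS"),
    make_seqres(2, "B", 15, "ARG HIS"),
]
per_chain, total = count_residues(demo)
print(per_chain, total)
```

Note that SEQRES also lists nucleic acid chains, so a real analysis would need to filter those out before treating the total as a number of amino acids.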
5. The acid-base character of proteins crystallized and the pH used for crystallization
It is noticeable that basic proteins are under-represented in the PDB, possibly because of the tendency for lysine side-chains to be disordered. The four charged residues Asp, Glu, Arg and Lys are present at approximately the same frequencies in humans (in E. coli the basic residues are slightly more abundant). This may suggest the use of surface entropy reduction to crystallize basic proteins (Protein Science 16:1569-1576, Aug 2007). However, there is no evidence that manipulating the pH can provide extra help in crystallizing basic proteins.
Methods
A complete set of PDB files was downloaded and analysed using Perl and Python scripts. As well as the crystallization conditions from REMARK 280, the date of the structure, the description, the sequence and the atomic co-ordinates were extracted. The crystallization conditions were then normalized by hundreds of substitutions using regular expressions in a Perl script. For example, ammonium sulfate was listed as AS, A.S., AMM.S., A-S, AMM. SULF., AMONIUM SULPH. etc. All of these listings were normalized and replaced by AM_SULF. The concentrations were also converted to molarities, and each concentration was placed in front of the name of the chemical (“AM_SULF, 0.1 M” became “0.1 M AM_SULF”). The result was converted into a CSV file that could be loaded into Excel. The ingredients were further sorted into columns, with the most common precipitants (ammonium sulfate, sodium chloride, PEG 4K and PEG 8K) assigned to separate columns. Temperature, date, the number of amino acids in the sequence and pH were also included. Temperatures in Fahrenheit and Kelvin were converted to degrees Celsius.
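The normalization steps described above can be illustrated with a short Python sketch. The original work used a Perl script with hundreds of rules; the single ammonium sulfate rule, the function names and the handling of units below are assumptions made for illustration only.

```python
import re

# Illustrative rule only. Longer spellings come first so that they win over
# the short abbreviations. Caveat: the bare "AS" alternative would also match
# the English word "AS", so a real rule set would need more context.
AM_SULF = re.compile(
    r"(?<![A-Z0-9_])"
    r"(?:AMM?ONIUM\s+SUL(?:F|PH)(?:ATE|\.)?|AMM\.\s*SULF\.|AMM\.S\.|A\.S\.|A-S|AS)"
    r"(?![A-Z0-9_])"
)

# "NAME, 0.1 M" or "NAME, 100 MM": chemical name followed by its concentration.
NAME_THEN_CONC = re.compile(r"\b([A-Z][A-Z0-9_]*),\s*(\d+(?:\.\d+)?)\s*(MM|M)(?![A-Z])")


def remark_280_text(pdb_lines):
    """Join the free text of all REMARK 280 lines (text starts at column 12)."""
    parts = [line[11:].strip() for line in pdb_lines if line.startswith("REMARK 280")]
    return " ".join(p for p in parts if p)


def _conc_first(match):
    """Convert mM to M and put the concentration in front of the name."""
    name, value, unit = match.group(1), float(match.group(2)), match.group(3)
    if unit == "MM":
        value, unit = value / 1000.0, "M"
    return f"{value:g} {unit} {name}"


def normalize(text):
    """Apply the synonym rule, then reorder 'NAME, conc' to 'conc NAME'."""
    text = AM_SULF.sub("AM_SULF", text.upper())
    return NAME_THEN_CONC.sub(_conc_first, text)


def to_celsius(value, unit):
    """Convert a temperature given in K or F to degrees Celsius."""
    unit = unit.upper()
    if unit == "K":
        return value - 273.15
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    return value  # assumed to be Celsius already


print(normalize("AMONIUM SULPH., 0.1 M"))
print(normalize("A.S., 100 MM"))
print(to_celsius(68, "F"))
```

Ordering the alternatives from longest to shortest keeps a full spelling such as AMMONIUM SULFATE from being partially consumed by a shorter rule, which is one reason a large rule set of this kind tends to need careful ordering.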
In many PDB entries the crystallization details were either missing entirely or could not be parsed. A total of 3939 entries could be parsed. Where two or more different pHs were mentioned in REMARK 280, the pHs were ignored. The data cover the period up to October 2004. If anyone is interested in using the scripts to acquire an up-to-date data set, please contact Patrick Shaw Stewart.
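The rule above for entries that mention more than one pH might be implemented along the following lines. This is a hedged sketch: the regular expression and the function name `single_ph` are assumptions, and the original scripts may have recognized pH values differently.

```python
import re

# Assumed pattern: the token "PH", an optional ":" or "=", then a number.
PH_PATTERN = re.compile(r"\bPH\s*[:=]?\s*(\d{1,2}(?:\.\d+)?)", re.IGNORECASE)


def single_ph(remark_text):
    """Return the pH if exactly one distinct value is mentioned, else None."""
    values = {float(v) for v in PH_PATTERN.findall(remark_text)}
    return values.pop() if len(values) == 1 else None


print(single_ph("PEG 4K, PH 7.5, TEMPERATURE 293K"))
print(single_ph("TRIS PH 8.0, RESERVOIR PH 6.5"))
```

Collecting the values in a set means that a pH repeated with the same value still counts as a single pH, while two different values cause the entry's pH to be discarded.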
The last plot was generated by a more complex analysis in which the exposed areas of residues on the surfaces of proteins were calculated using the CCP4 program AREAIMOL.
Data were extracted by Peter J. Leicester and Patrick D. Shaw Stewart. Analysis was carried out by Patrick Shaw Stewart.