| Research Summary |
|
|
Chemical Sensors, Biosensors, Retention Mechanisms in High Performance Liquid Chromatography, Chemometrics, Olfaction, Computational Biology

Lavine’s research program is divided into two broad areas: sensors and computational biology. Biosensors in Lavine’s lab use self-assembled monolayers to immobilize antibodies and surface plasmon resonance (SPR) spectroscopy to detect the changes in refractive index that occur when the target is captured by the modified surface. Chemical sensing includes both mechanistic studies and environmental applications of swellable molecularly imprinted polymers. Computational biology focuses on DNA microarrays and on understanding the mechanics of DNA bending. Lavine has also been interested in developing broadly based profiling techniques for fingerprinting complex biological samples. Fingerprinting experiments usually generate large amounts of data, and chemometric methods are critically important for extracting the information embedded in those data. Hence, data mining and knowledge discovery represent a third major thrust of Lavine’s research. Lavine’s research group has pioneered the development and application of a wide variety of multivariate data analysis techniques, including genetic algorithms, factor analysis, curve resolution, and pattern recognition. Ongoing research projects in Lavine’s group are summarized below.

Compound Specific Imprinted Nanospheres for Optical Sensing

The objective of the proposed research is to investigate the use of molecularly imprinted polymers as the basis of a sensitive and selective sensing method for detecting pharmaceuticals and other emerging organic contaminants that are present at parts-per-billion (ppb) levels in aquatic environments. The research will involve the preparation of moderately crosslinked, molecularly imprinted polymeric nanospheres (ca. 200 nm in diameter) that are designed to swell and shrink as a function of analyte concentration in aqueous media. These nanospheres will be incorporated into hydrogel membranes. Chemical sensing is based on changes in the optical properties of the membrane that accompany swelling of the molecularly imprinted nanospheres. Two effects contribute to this change. One is an increase in the size of the nanospheres, which will lead to an increase in the amount of light scattered. The other is a change in the refractive index. Because swelling increases the percentage of water in the polymer, the refractive index of the nanospheres will decrease as they swell. This brings it closer to the refractive index of the hydrogel membrane, leading to a decrease in the amount of light scattered or reflected by the nanospheres. For the systems that we will be studying, the change in refractive index is the dominant effect. This change will be measured by surface plasmon resonance (SPR) spectroscopy or, for nanospheres prepared from fluorescent monomers, by fluorescence spectroscopy. The prototype sensor will be capable of selectively detecting pollutants and hazardous materials at ppb levels.

Anthrax Detection

An anthrax sensor will be developed using surface plasmon resonance (SPR) spectroscopy. SPR is a member of a family of spectroscopic techniques based on evanescent wave optics. SPR has been used to determine refractive indices, dielectric constants, and layer thicknesses. The experimental set-up for SPR that will be used to detect target binding is the so-called Kretschmann configuration, which consists of a thin metal film (typically 50 nm thick gold or silver) at the interface between a high and a low refractive index material.
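As a rough illustration of how the Kretschmann geometry turns a refractive index change into a measurable angle shift, the sketch below solves the standard momentum-matching condition, n_prism · sin(θ) = sqrt(ε_m ε_d / (ε_m + ε_d)), for the resonance angle. All numerical values are assumptions chosen for illustration and do not come from this summary: a BK7-like prism index of 1.515, a real gold permittivity of about −11.6 (red laser light), and a water-like sample. The finite film thickness and the imaginary part of the metal permittivity are ignored.

```python
import math

def spr_angle_deg(n_prism, eps_metal_real, n_sample):
    """Resonance angle (degrees) for a semi-infinite metal/sample interface.

    Solves n_prism * sin(theta) = sqrt(eps_m * eps_d / (eps_m + eps_d)),
    i.e. the in-plane photon momentum equals the surface plasmon momentum.
    """
    eps_d = n_sample ** 2
    if eps_metal_real + eps_d >= 0:
        raise ValueError("no bound surface plasmon for these permittivities")
    # Effective index of the surface plasmon wave
    n_sp = math.sqrt(eps_metal_real * eps_d / (eps_metal_real + eps_d))
    s = n_sp / n_prism
    if s >= 1.0:
        raise ValueError("prism index too low to reach the resonance")
    return math.degrees(math.asin(s))

# Assumed illustrative values (not taken from the source)
N_PRISM = 1.515    # BK7-like glass prism
EPS_GOLD = -11.6   # real part of gold permittivity, red light

theta_before = spr_angle_deg(N_PRISM, EPS_GOLD, 1.333)  # water-like sample
theta_after = spr_angle_deg(N_PRISM, EPS_GOLD, 1.334)   # after a small index increase
shift = theta_after - theta_before

print(f"resonance angle: {theta_before:.2f} deg, shift: {shift:.3f} deg")
```

Under these assumptions, a refractive index increase at the surface moves the resonance to a larger angle, which is the quantity an SPR instrument tracks.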
Excitation by laser light will produce surface plasmons in the metal film at a given internal angle of incidence, namely the angle at which the energy and momentum of the photons match those of the surface plasmon waves. (A plasmon, or charge density wave, is a collective oscillation of the charge in a metal.) The surface plasmon resonance is extremely sensitive to changes in the optical architecture of the interface, which will occur when anthrax binds to the antibody immobilized on the gold surface. A suitable antibody will be mixed with an appropriate long-chain self-assembled monolayer to yield a formulation that will be deposited directly onto a gold substrate. Ideally, the antibody should contain an –SH moiety, but it should be possible to develop a suitable self-assembled monolayer formulation using an antibody that does not contain one. The SPR response of the reference Au substrate, which contains the antibody and the self-assembled monolayer, will be compared to that of a control, which will consist of a gold substrate containing only the self-assembled monolayer. If the anthrax spores elicit an SPR response only on the reference Au substrate, the experiment will be judged a success, provided that a test solution without anthrax spores produces a negligible response for both the reference and the control. The specificity of this system will be determined by two factors: the selectivity of the antibody towards the anthrax spores and the magnitude of the refractive index change caused by the binding of the spores to the Au substrate, which should be very large. For this reason, we do not anticipate that interference from nonspecific binding will be a problem.

Supervised Learning From Microarray Data

Microarrays have allowed the expression levels of thousands of genes or proteins to be measured simultaneously.
Data sets generated by these arrays consist of a small number of observations (e.g., 20-100 samples) on a very large number of measurement variables (e.g., 10,000 genes or proteins). Each variable indicates whether a particular gene or protein is under- or overexpressed. The observations in these data sets have other attributes associated with them, such as a class label denoting the pathology of the subject from which the sample was taken. We would like to analyze the large arrays of data from a microarray experiment at an intermediate level, using pattern recognition techniques for interpretation. However, there are problems when applying pattern recognition methods to larger data sets. Classification success rates vary with the pattern recognition method employed. Low classification success rates are often obtained for the prediction set despite a linearly separable training set. Automation of these techniques for larger data sets is difficult.

The underlying premise of the approach to data analysis described here is that all classification methods work well when a problem is simple. By identifying the appropriate features, a "hard" problem can be reduced to a "simple" one. Also, by selecting the most salient features of the data, a classifier can be developed that obviates the need for a more detailed understanding of the system being investigated. At the very least, such an analysis could identify those genes or proteins worthy of further study. Our goal is therefore feature selection: increasing the signal-to-noise ratio of the data by discarding measurements that are not characteristic of the profile of the various classes in the data set. For gene expression data, it is important that a multivariate approach to feature selection be employed, since genes usually work in groups to regulate biological processes.
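The remark above, that a linearly separable training set can still give poor prediction, is easy to reproduce when variables greatly outnumber samples. The sketch below is an illustration only: the data are pure random noise, the sizes are assumed for the demonstration, and a minimum-norm least-squares discriminant stands in for whatever classifier would actually be used.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_vars = 20, 200, 1000   # far more variables than training samples

# Pure noise: no variable carries any class information
X_train = rng.standard_normal((n_train, n_vars))
y_train = rng.choice([-1.0, 1.0], size=n_train)
X_test = rng.standard_normal((n_test, n_vars))
y_test = rng.choice([-1.0, 1.0], size=n_test)

# Minimum-norm least-squares discriminant. With n_vars > n_train the
# system X_train @ w = y_train can be solved exactly, so the training
# set is linearly separable by construction.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_acc = float(np.mean(np.sign(X_train @ w) == y_train))
test_acc = float(np.mean(np.sign(X_test @ w) == y_test))

print(f"training accuracy: {train_acc:.2f}, prediction accuracy: {test_acc:.2f}")
```

The training samples are classified perfectly even though the labels are random, while accuracy on new samples stays near chance. Discarding uninformative variables before building the discriminant is one way to avoid this kind of chance separation.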
Any approach to feature selection must also take into account the existence of redundancies in the data, because the features of interest are most likely small sets of highly interdependent genes.

We report on the development of a genetic algorithm (GA) that employs supervised learning to mine gene expression and proteomic data. Our pattern recognition GA selects features that optimize the separation of the classes in a plot of the two or three largest principal components of the data. Because the largest principal components capture the bulk of the variance in the data, the features chosen by the pattern recognition GA contain information primarily about the differences between classes in a data set. The principal component analysis routine embedded in the fitness function of the pattern recognition GA acts as an information filter, significantly reducing the size of the search space, since it restricts the search to feature sets whose principal component plots show clustering on the basis of class. In addition, as it trains, the algorithm uses a form of boosting to focus on those classes and/or samples that are difficult to classify. Samples that consistently classify correctly are weighted less heavily than samples that are difficult to classify. Over time, the algorithm learns its optimal parameters in a manner similar to a neural network. The algorithm integrates aspects of artificial intelligence and evolutionary computation to yield a smart one-pass procedure for feature selection and pattern recognition.

Recently, the fitness function of the pattern recognition GA has been enhanced. Transverse learning has been introduced by coupling a robustified version of the Hopkins statistic to the original fitness function of our pattern recognition GA.
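The idea of scoring a feature subset by how well the classes separate in a principal component plot can be sketched as follows. This is a simplified stand-in, not the group's actual fitness function: the function names, the nearest-neighbour scoring rule, the choice of k, and the synthetic data are all assumptions made for illustration. The optional `weights` argument only hints at the boosting step, in which hard-to-classify samples would be given larger weights.

```python
import numpy as np

def pc_scores(X, n_components=2):
    """Sample scores on the largest principal components (via SVD)."""
    Xc = X - X.mean(axis=0)                       # mean-center each feature
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

def separation_fitness(X, labels, feature_idx, weights=None, k=3):
    """Weighted average, over samples, of the fraction of each sample's
    k nearest neighbours in the PC plot that share its class label."""
    scores = pc_scores(X[:, feature_idx])
    n = len(labels)
    if weights is None:
        weights = np.full(n, 1.0 / n)
    else:
        weights = np.asarray(weights, dtype=float) / np.sum(weights)
    dist = np.linalg.norm(scores[:, None, :] - scores[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    same_class = (labels[neighbours] == labels[:, None]).mean(axis=1)
    return float(np.sum(weights * same_class))

# Synthetic two-class data (assumed): only the first 5 of 200 features differ by class
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 20)
X = rng.standard_normal((40, 200))
X[labels == 1, :5] += 4.0

fit_informative = separation_fitness(X, labels, np.arange(0, 5))
fit_noise = separation_fitness(X, labels, np.arange(5, 10))
print(f"informative subset: {fit_informative:.2f}, noise subset: {fit_noise:.2f}")
```

A genetic algorithm would evaluate a score of this kind for each candidate feature subset in its population and favor subsets with higher values, so the subset of informative features outscores the subset of noise features here.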
For training sets with small amounts of labeled data (i.e., data points tagged with a class label) and large amounts of unlabeled data (i.e., data points not tagged with a class label), this approach is preferred, as our results will show, because the fitness function uses information in the unlabeled data to guide feature selection. With this approach, feature subsets are selected to increase clustering in the principal component plot (using both the labeled and unlabeled data points), while simultaneously optimizing the separation between classes (using only the labeled data points). Transverse learning ensures that features identified by the pattern recognition GA will produce a discriminant that performs better than one developed from a set of features whose selection is based solely on the dichotomization power of the features for the labeled data points.

Olfaction

The chemical sense of olfaction is a complex and poorly understood phenomenon. While it is an integral part of everyday life, information about the relationship between chemical structure and odor quality is scarce. For a compound to have an odor, it is generally agreed that it must be volatile as well as both lipid- and water-soluble. Beyond this general description of characteristics, there is no agreement among researchers as to which molecular properties and structural features are responsible for the olfactory impressions evoked by odorants. Analysis of odor-structure relationships (OSR) using computer-assisted methods and pattern recognition techniques can provide a practical approach to the analysis of odorants. The heart of the approach is finding a set of molecular descriptors from which a discriminating relationship can be found. According to current theories of olfaction, the perception of odor is initiated by the interaction of the odorant with the olfactory receptor sites in the nose.
Olfactory excitation occurs only if the size and shape of the stimulant complement those of the receptor, or if the stimulant possesses sufficient conformational flexibility to attain the correct shape. The spatial arrangement of the stimulant’s functional and steric groups must also conform to the overall three-dimensional geometry of the receptor. It is logical to apply this information to a structure-olfaction study during the key step: the development of molecular descriptors. However, only topological and bulk geometric descriptors (e.g., molecular connectivity indices, substructures, substructural molecular connectivity environment descriptors, molecular volume, and principal moments of inertia) have been used to describe molecular shape in previously published OSR studies of musks and other odorants. Descriptors that contain information about the olfactory process need to be developed and tested in order to formulate more effective OSRs.

A methodology to facilitate the intelligent design of new odorants (e.g., musks) with specialized properties is being developed as part of our ongoing research effort in machine learning. In a traditional framework, the introduction of a new odorant is a lengthy, costly, and laborious process of discovery, development, and testing. We propose to streamline this process by using large existing olfactory databases, available through the open scientific literature, as input for a new structure/activity correlation methodology. The first step in this process is to characterize each molecule in the database by an appropriate set of descriptors. To accomplish this task, an enhanced version of the Transferable Atom Equivalent (TAE) descriptor methodology will be used to create a large set of electron-density-derived shape/property hybrid (PEST), wavelet coefficient (WCD), and TAE histogram descriptors.
We have chosen these molecular property descriptors to represent the problem because they have been shown to capture pertinent shape and electronic properties of the molecule and to correlate with key modes of intermolecular interaction. Traditional QSAR methodologies, which employ fragment-based descriptors, have been shown to be effective for QSAR development within homologous sets of molecules but are less effective when applied to data sets containing a great deal of structural variation. In contrast to previous attempts at SAR, our use of shape-aware, electron-density-based molecular property descriptors has removed many of the limitations brought about by the use of descriptors based on substructure fragments, molecular surface properties, or other whole-molecule descriptors. Another reason for the mixed success of past QSAR efforts can be traced to the nature of the underlying modeling problem, which is often quite complex. To meet these challenges, a genetic algorithm for pattern recognition analysis has been developed that selects descriptors which create class separation in a plot of the two largest principal components of the data while simultaneously searching for features that increase clustering of the data.

Development of Computational Methods for DNA Bending |

