Pattern recognition of acorns from different Quercus species based on oil content and fatty acid profile

The aim of this study was (i) to characterize different species of the genus Quercus and (ii) to classify them on the basis of the content and fatty acid composition of the oil in their fruits and/or their morphological characters, via pattern recognition techniques (Principal Component Analysis, PCA; Cluster Analysis, CA; and Discriminant Analysis, DA). Quercus rotundifolia Lam., Quercus suber L. and Quercus pyrenaica Willd., from the same area of central Portugal, were studied. When the oil content and the respective fatty acid compositions were used to characterize the samples, PCA revealed well-separated groups corresponding to each species, which were in turn confirmed by CA and DA. The "width" and "length" of the acorns exhibited low discriminant power. Q. rotundifolia acorns showed the highest oil content, followed by those of Q. suber and Q. pyrenaica (9.1, 5.2 and 3.8%, respectively). The fatty acid profiles of the Q. rotundifolia and Q. suber oils are similar to that of olive oil, whereas the oil from Q. pyrenaica acorns is more unsaturated.


INTRODUCTION
Quercus is one of the most common genera in the Portuguese flora, where it is represented by 8 species. All species produce acorns, and each is well adapted to diverse soil and climate conditions (Goes, 1991). These fruits were used for both human food and animal feed. Until the early seventies, oil was extracted from Q. rotundifolia acorns in some oil extraction plants in Portugal, because its fatty acid composition is very similar to that of olive oil (Ferrão and Ferrão, 1988). However, rising wages made harvesting too costly for the extraction of this oil to remain viable. Nowadays, the use of olive oil, due to its nutraceutical properties, has drawn attention to edible oils with a similar composition.
The aim of this study was (i) to characterize 3 different species of the Quercus genus and (ii) to discriminate among them on the basis of the content and fatty acid composition of the oil in their fruits and/or on morphological aspects of the fruits via pattern recognition techniques (Principal Component Analysis, Cluster Analysis and Discriminant Analysis). The following species, grown in the same mixed stand in the center of Portugal (Portalegre), were investigated: Quercus rotundifolia Lam., Quercus suber L. and Quercus pyrenaica Willd.
Pattern recognition methods were applied to determine the set of measurements that characterizes the samples, i.e., to identify the pattern (Miller and Miller, 1993). Therefore, the available data were used simultaneously rather than sequentially, and Principal Component Analysis (PCA), Cluster Analysis (CA) and Discriminant Analysis (DA) were performed on the multivariate data.
PCA is an attempt to best describe the shape of a multivariate distribution by considering selected linear combinations of the original variables rather than the variables themselves (Bolfinger, 1975). In addition, with this technique, the initial m-dimensional space (m variables) may be reduced to n dimensions (n<m) without considerable loss of information (Harman, 1976; Hoffman and Young, 1983). The initial system of m axes is replaced by another system in which the new axes are the principal components (Morrison, 1967; Piggott and Sherman, 1986). The first component shows the maximum correlation with all the variables and explains the highest proportion of the global variance (Powers, 1988). This method allows the geometric representation of the original objects (Quercus acorns) in a space of reduced dimensions defined by a new set of axes and, consequently, the identification of groups of similar objects. It may also provide a particular interpretation of the components and subsequently of the original variables.
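The dimension reduction described above can be sketched in a few lines. The following is a minimal illustration, assuming that PCA is carried out on the correlation matrix (standardized variables); the function name `pca_correlation` and the synthetic data are hypothetical and are not the acorn measurements or the software used in this study.

```python
import numpy as np

def pca_correlation(X):
    """PCA on the correlation matrix of X (rows = samples, columns = variables)."""
    X = np.asarray(X, dtype=float)
    # Standardize each variable so that all of them contribute on the same scale.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = (Z.T @ Z) / (X.shape[0] - 1)       # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)   # eigh is suited to symmetric matrices
    order = np.argsort(eigvals)[::-1]      # sort components by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                   # coordinates of the samples on the new axes
    explained = eigvals / eigvals.sum()    # fraction of total variance per component
    return eigvals, eigvecs, scores, explained

# Synthetic illustration only: 30 samples, 3 correlated variables plus 1 independent one.
rng = np.random.default_rng(0)
base = rng.normal(size=(30, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(30, 1)) for _ in range(3)]
              + [rng.normal(size=(30, 1))])
eigvals, eigvecs, scores, explained = pca_correlation(X)
```

Plotting the first two columns of `scores` would give the kind of reduced-space sample plot discussed in the text, and the columns of `eigvecs` indicate how each original variable weighs on each component.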
The second stage of multivariate data analysis consisted of a Cluster Analysis of the multivariate data in order to confirm the existence of the groups suggested by the plot of the samples in the reduced space defined by the significant principal components. The term Cluster Analysis (CA) covers a number of different clustering concepts and classification algorithms (Mirkin, 1996). In this study, hierarchical clustering was used, i.e., the clusters at higher levels result from the aggregation of clusters at lower levels. The Euclidean distance was used as the measure of distance between samples. As the rule for linking clusters, the single linkage (or nearest neighbor) method was used. In this method, the distance between two clusters is determined by the distance between the two closest objects belonging to the different clusters.
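The single linkage rule with Euclidean distance can be illustrated with a deliberately naive sketch. The function names and the five two-dimensional points below are invented for illustration; they are not the acorn data, and this is not the routine implemented in the software used for the study.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two samples given as coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points):
    """Return the merge history as a list of (cluster_a, cluster_b, distance)."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(euclidean(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]   # aggregate the two clusters
        del clusters[j]
    return merges

# Invented points: two tight pairs and one isolated sample.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
merges = single_linkage(points)
for cluster_a, cluster_b, distance in merges:
    print(cluster_a, cluster_b, round(distance, 3))
```

The successive merge distances are the heights at which branches join in a dendrogram. The quadratic search at every step keeps the sketch short; it would be too slow for large data sets.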
Finally, after the identification of isolated groups of samples by PCA and CA, a Discriminant Analysis (DA) was used to determine which variables discriminate among these groups defined a priori (Morrison, 1967; Burgard and Kuznicki, 1990). The underlying idea is to determine whether groups differ with regard to the mean of a variable and then to use that variable to predict group membership. One can ask whether or not two or more groups are significantly different from each other with respect to the mean of a particular variable. If the means of a variable are significantly different in different groups, then this variable discriminates between the groups. In fact, the procedure is identical to a one-way analysis of variance or, if several variables are used, to a multivariate analysis of variance.
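The comparison of group means for a single variable corresponds to the F statistic of a one-way analysis of variance. A minimal sketch follows; the function name `one_way_f` and the three small groups of values are hypothetical and do not come from the acorn data.

```python
def one_way_f(groups):
    """F statistic of a one-way ANOVA for one variable measured in several groups."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Invented values for one variable in three groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value = one_way_f(groups)
print(f_value)
```

A large F value means that the group means differ by much more than the spread within the groups, which is what makes a variable useful for discrimination.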

Materials
Ripened fruits were obtained, on the same date, from adult trees of a mixed stand (Portalegre, Portugal) with the following Quercus species: Q. rotundifolia, Q. suber and Q. pyrenaica. Each sample was collected from one individual tree, i.e., 10 samples of Q. rotundifolia, 11 samples of Q. suber and 9 samples of Q. pyrenaica were obtained. n-Hexane p.a. was used for oil extraction.

Methods
Morphological characterization of the fruits: The average length and width (cm) were measured for 100 fruits picked at random from each tree.
Oil extraction: Before oil extraction, the fruits were dehulled, ground in a household coffee mill (knife cutter type) and heated at 75 °C for 90 minutes. The oil was extracted from the prepared material with n-hexane p.a. in a Soxhlet apparatus for 8 hours (Ferreira-Dias et al., 2003). Experiments were carried out in triplicate and the results were expressed on a dry basis.
Chemical characterization of the oil: The fatty acid profile of every oil sample was evaluated, as methyl esters, in a gas chromatograph (Carlo Erba, Vega 2000 GC) equipped with a SUPELCO capillary column (SP™-2380, 0.2 µm, 60 m × 0.25 mm; fused silica). Both the injector and the detector (FID) were heated at 250 °C. The temperature was programmed as follows: 175 °C for 25 minutes, a ramp of 5 °C/min from 175 °C to 220 °C, and 220 °C for 10 minutes. Hydrogen was the carrier gas at a column head pressure of 60 kPa.
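As a quick check of the oven program stated above, the duration of one chromatographic run can be computed from its three segments. The variable names are illustrative; the calculation assumes that no additional equilibration or cool-down time is counted.

```python
# Segments of the oven temperature program given in the text.
hold_initial_min = 25            # 175 °C held for 25 min
ramp_min = (220 - 175) / 5       # 5 °C/min from 175 °C to 220 °C
hold_final_min = 10              # 220 °C held for 10 min

total_run_min = hold_initial_min + ramp_min + hold_final_min
print(total_run_min)  # 44.0
```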

Statistical analysis
The experimental results, concerning the average "length" and "width" of the fruits from each of the 30 trees studied, the average oil yield and its fatty acid composition, were arranged in matrix form (Matrix A). Samples were presented in rows and variables in columns (30 x 8 matrix). A 30 x 6 matrix (Matrix B) was also considered for statistical analysis, obtained by removing the columns corresponding to the "length" and "width" of the fruits.
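The relation between the two matrices can be sketched as follows. The column names and their order are assumptions inferred from the variables mentioned in the text (two morphological variables, oil content and five fatty acids), and the zero entries are placeholders rather than measured values.

```python
# Assumed column layout of Matrix A (30 samples x 8 variables).
columns_a = ["length", "width", "oil_content",
             "C16:0", "C18:0", "C18:1", "C18:2", "C18:3"]
matrix_a = [[0.0] * len(columns_a) for _ in range(30)]   # placeholder values

# Matrix B: same samples, without the morphological variables.
dropped = {"length", "width"}
keep = [i for i, name in enumerate(columns_a) if name not in dropped]
columns_b = [columns_a[i] for i in keep]
matrix_b = [[row[i] for i in keep] for row in matrix_a]

print(len(matrix_a), len(matrix_a[0]))  # 30 8
print(len(matrix_b), len(matrix_b[0]))  # 30 6
```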
Concerning pattern recognition, a Principal Component Analysis (PCA) was first carried out on the experimental data (matrices A and B). The second stage of multivariate data analysis consisted of a Cluster Analysis (CA) of the data matrices A and B, in order to confirm the existence of the groups suggested by the plot of the samples in the reduced space defined by the significant principal components.
Finally, a Discriminant Analysis (DA) was applied to the data from matrix B, to determine which variables discriminate between these groups defined a priori (Morrison, 1967; Burgard and Kuznicki, 1990). The discrimination model was built step by step, following a forward stepwise analysis. At each step, the variable that would contribute most to the discrimination between groups was identified and then included in the model before the next step began. The maximum number of discriminant functions is equal to the number of groups minus one or to the number of variables in the analysis, whichever is smaller. The best combination of variables for discriminant analysis includes variables that represent independent measures of product similarities and differences.
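The rule for the maximum number of discriminant functions can be written as a one-line check. The function name is hypothetical; the values used are the 3 species and the 6 variables of Matrix B given in the text.

```python
def max_discriminant_functions(n_groups, n_variables):
    """Upper bound on the number of discriminant functions (canonical roots)."""
    return min(n_groups - 1, n_variables)

# 3 species (groups) and 6 variables (Matrix B).
n_functions = max_discriminant_functions(n_groups=3, n_variables=6)
print(n_functions)  # 2
```

With three groups, the samples can therefore be displayed on a single plane defined by at most two discriminant functions.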
In addition, the Classification Functions can be used to determine to which group each case most likely belongs. The classification matrix shows the number of cases that were correctly classified and the number that were misclassified.
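A classification matrix simply cross-tabulates the actual group of each case against the group assigned by the classification functions. The sketch below uses invented labels and invented predictions (including one deliberate misclassification) purely to show the bookkeeping; it does not reproduce any result of this study.

```python
def classification_matrix(actual, predicted, labels):
    """Rows = actual group, columns = predicted group, entries = number of cases."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

labels = ["i", "s", "p"]                      # species codes used in the figures
actual = ["i", "i", "s", "s", "p", "p"]       # invented cases
predicted = ["i", "i", "s", "i", "p", "p"]    # invented predictions, one error

matrix = classification_matrix(actual, predicted, labels)
correct = sum(matrix[k][k] for k in range(len(labels)))   # diagonal = correct cases
for label, row in zip(labels, matrix):
    print(label, row)
print(correct, "of", len(actual), "cases correctly classified")
```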
PCA, CA and DA were performed using the software "Statistica™", version 5, from Statsoft, USA.

Sample Characterization by Principal Components Analysis
The average values of the oil content, respective fatty acid profile, average length and width of the fruit samples of Q. rotundifolia, Q. suber and Q. pyrenaica are shown in Table I, displayed in matrix form (Matrix A). These data were first submitted to a PCA, and the eigenvalues and respective variances of the extracted new axes (principal components) are shown in Table II. According to the criterion proposed by Kaiser, only principal components with eigenvalues greater than 1 should be retained for the analysis, since they explain more than the average variability accounted for by one original variable (Dagneli, 1977; Burgard and Kuznicki, 1990). Therefore, the initial 8-dimensional space (defined by 8 variables) can be reduced to a plane, F1F2, defined by the first two principal components, since only these have eigenvalues greater than one. This plane accounts for about 76% of the total variance in the original data matrix A.
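Kaiser's criterion can be sketched as a simple filter on the eigenvalues. The eigenvalues below are invented for illustration and are not those of Table II; they only respect the property that, for standardized variables, the eigenvalues sum to the number of variables (here 8), so that the average eigenvalue is 1.

```python
def kaiser_retained(eigenvalues):
    """Eigenvalues kept by Kaiser's criterion (strictly greater than 1)."""
    return [ev for ev in eigenvalues if ev > 1.0]

def explained_fraction(eigenvalues, retained):
    """Fraction of the total variance carried by the retained components."""
    return sum(retained) / sum(eigenvalues)

# Invented eigenvalues for an 8-variable problem (they sum to 8).
eigenvalues = [3.2, 2.4, 0.9, 0.6, 0.4, 0.3, 0.15, 0.05]
retained = kaiser_retained(eigenvalues)
fraction = explained_fraction(eigenvalues, retained)
print(len(retained), round(fraction, 2))
```

In this invented case two components are retained, so the samples could be displayed on a single plane, as is done in the text for the plane F1F2.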
The correlations between the original variables and the first two principal components, i.e. the loadings of the variables, are shown in Fig. 1-A. The first axis is positively correlated with the concentrations of linoleic and linolenic acids (C18:2 and C18:3, respectively) and negatively with oleic acid (C18:1). The amounts of saturated fatty acids (palmitic, C16:0, and stearic, C18:0) are better correlated with the second principal component, increasing along it. They are also well correlated with the negative part of the first principal component, which places both acids very close to the diagonal of the 2nd quadrant. The position of the variables "Width" and "Length" of the fruits near the diagonal of the first quadrant indicates that they are equally correlated with both axes. In order to redistribute the weightings of the variables so as to bring them nearer to or farther away from each of the axes, rotation of the principal components by the Varimax technique was also performed (Burgard and Kuznicki, 1990). However, no improvement in interpreting the principal components and/or clustering the samples was achieved.

Figure 1 [...] and respective fatty acid composition on the first and second principal components (Fig. 1-A); plot of the Quercus fruit samples (from Q. suber (s), Q. rotundifolia (i) and Q. pyrenaica (p)) on the plane defined by the first and second principal components (Fig. 1-B).
When the samples were plotted on the plane formed by the first and second principal components, F1F2 (Fig. 1-B), the different samples appeared to be grouped according to species. The Q. rotundifolia fruit samples show the highest oil content and their oil is richer in saturated fatty acids (palmitic and stearic acids). Larger and longer fruits are observed for Q. pyrenaica. The lowest oil content is obtained for Q. pyrenaica acorns. This oil is richer in unsaturated fatty acids (mainly linoleic acid). The Q. suber fruits are in an intermediate position in the plane F1F2. This suggests that, for Q. suber, both oil content and composition lie between the values observed for Q. rotundifolia and Q. pyrenaica fruits.
In an attempt to better separate the samples by species, the variables concerning the morphological characterization of the fruits were removed from Matrix A (Table I) and a second PCA was performed on this smaller data set (Matrix B).
As for Matrix A, the original information can be displayed on the plane F1F2, corresponding to the eigenvalues greater than one (Table III). This plane explains about 82% of the initial information contained in matrix B.
The loadings of the initial variables on the first and second components (Fig. 2-A) are similar to those observed for Matrix A. With respect to the projection of the samples on the plane F1F2 (Fig. 2-B), a better separation of the clusters according to species is obtained. Therefore, the length and width of the fruits proved not to be adequate parameters to discriminate between samples from different species of the Quercus genus, as previously suggested by the results of the analysis of variance.

Sample Characterization by Cluster Analysis
In a second stage, a CA was carried out to investigate the feasibility of the clusters suggested by PCA. The hierarchical tree diagram (dendrogram) for the 30 Quercus samples described by the "oil content" and fatty acid composition (6 variables; Matrix B) is shown in Fig. 3. A similar dendrogram is obtained when the data from Matrix A are used (not shown). This confirms the low power of "fruit length" and "fruit width" for the discrimination and characterization of Quercus samples. For a linkage distance higher than 5, only two clusters can be defined: a cluster corresponding to Q. pyrenaica fruits, and another cluster in which the sub-cluster of Q. rotundifolia fruits and the sub-cluster of Q. suber fruits are joined together. For a linkage distance of 4, three clusters, corresponding to the different species, can be identified. Only one outlier, from Q. pyrenaica (P9), was observed at this linkage level (Fig. 3).
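Reading clusters off a dendrogram at a chosen linkage distance can be sketched as follows. For single linkage, cutting at a given distance is equivalent to grouping together all samples connected by a chain of pairwise distances not exceeding the cut. The function name, the points and the cut values below are invented for illustration; they are not the acorn data or the linkage distances of Fig. 3.

```python
import math

def clusters_at_distance(points, cut):
    """Single-linkage clusters obtained by cutting the dendrogram at `cut`."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative of the cluster containing i.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= cut:
                parent[find(i)] = find(j)      # join the two clusters

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Invented points: two tight pairs and one isolated sample.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
low_cut = clusters_at_distance(points, 2.0)
high_cut = clusters_at_distance(points, 6.0)
print(low_cut)
print(high_cut)
```

In this invented example, the lower cut keeps the two pairs apart and the higher cut joins them, while the isolated sample stays on its own in both cases, much as an outlier would.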

Sample Characterization by Discriminant Analysis
After the existence of isolated groups of samples had been confirmed by Cluster Analysis, a Discriminant Analysis was used to determine which variables discriminate between the groups defined a priori, corresponding to the 3 species. Because of the low discriminant power exhibited by the variables "Length" and "Width" of the Quercus fruits in the previous data analysis (PCA and CA), only the oil content of the fruits and their fatty acid composition (Matrix B) were used in the Discriminant Analysis. Table IV summarizes the successive steps of the forward stepwise analysis. The higher the F-value of a variable, the higher its discriminating power. Therefore, the most important variable for discriminating between Quercus species was the oleic acid content, followed by stearic acid, oil content and palmitic acid level.
The existence of the 3 entirely distinct Quercus species can be confirmed by the plot of the Quercus samples on the plane formed by the two Discriminant Functions (canonical roots) found by canonical analysis (Fig. 4).
In addition, the following Classification Functions were established to define every group (species) and can be used to determine to which group each case most likely belongs:

Figure 2 Principal Component Analysis of data from Matrix B: loadings of oil content of Quercus fruit samples and respective fatty acid composition on the first and second principal components (Fig. 2-A); plot of the Quercus fruit samples (from Q. suber (s), Q. rotundifolia (i) and Q. pyrenaica (p)) on the plane defined by the first and second principal components (Fig. 2-B).

Figure 3 Dendrogram of the Quercus fruit samples (from Q. suber (s), Q. rotundifolia (i) and Q. pyrenaica (p)) based on their oil content and fatty acid composition.

Table I Matrix A containing the average values of oil content (%, w/w, dry basis), respective fatty acid profile (%), and average length and width of Quercus fruit samples