Multivariate models to classify Tuscan virgin olive oils by zone

To study and classify Tuscan virgin olive oils, 179 samples were used, obtained from fruits harvested during the first half of November in three different zones of the Region. The sampling was repeated for 5 years. Fatty acids, phytol, aliphatic and triterpenic alcohols, triterpenic dialcohols, sterols, squalene and tocopherols were analyzed. A subset of variables was considered, which had been selected in a previous work as the most effective and reliable from the univariate point of view. The analytical data were transformed (except for cycloartenol) to compensate for annual variations, by subtracting the East-zone mean from the other values within each year. Univariate three-class models were calculated and further variables were discarded. Subsequently, three-zone models were evaluated that included phytol (which was always selected) and all the combinations of palmitic, palmitoleic and oleic acids, tetracosanol, cycloartenol and squalene. Models including from two to seven variables were studied. The best model showed by-zone classification errors lower than 40%, within-year classification errors lower than 45% and a total classification error equal to 30%. This model includes phytol, palmitic acid, tetracosanol and cycloartenol.


INTRODUCTION
Tuscany is crossed by a virtual boundary: the northern limit of the cultivation area of Olea europaea.
Moreover, the Region shows a great eco-climatic variability, depending on its geographical position and on its complex orography (Maselli et al., 1996; Maracchi et al., 1994).
These two facts make the local production of virgin olive oil very diversified (Alessandri et al., 1997a). However, it is extremely difficult to quantify these variations in absolute and reliable terms, because the related influences of the yearly variations of the climatic parameters appear very complex (Alessandri et al., 1997a; Alessandri, 1993; Alessandri et al., 1997b). The annual variation in the quality of Tuscan olive oils can be greater than the variation due to the zone of origin (Alessandri et al., 1997a; Alessandri et al., 1997b). The same can be said for variations due to the harvesting time (Alessandri et al., 1995). Moreover, the interactions among all these factors (Alessandri et al., 1997a) make the problem difficult to resolve.
Things become more difficult if olive oil characterization and classification is not only a scientific need but also an operational one, especially if scientific findings have to be the basis for local, regional, national or European regulations.
This work is part of a group of studies that aim to investigate and, eventually, to resolve these problems. Multivariate classification models are discussed, with the aim of classifying Tuscan virgin olive oils by zone. The models are compensated for annual variations. The reliability and the discriminatory power of the variables considered here were recently analyzed from the univariate point of view (Alessandri et al., 1997a). A multivariate model can show a substantial increase in resolution when compared with the performance of its univariate components.

Sampling and experimental design
From the harvest season of 1989-90 to 1993-94 (labeled from «89» to «93» in figures and tables), 179 samples of virgin olive oil were collected. They were obtained exclusively from olives harvested in Tuscany during the first half of November.
In this paper, three zones of Tuscany are considered. The Northern zone (labeled N in figures and tables) corresponds to Pistoia, Lucca and Massa-Carrara. The Western zone (labeled W) covers the Tyrrhenian coast of Tuscany, without its extreme northern part, and extends over Livorno and a portion of Grosseto. The inner part of Grosseto is included in the Eastern zone (labeled E) with Florence, Arezzo and Siena.
Data related to the N zone and to the 1991-92 harvest season are missing, due to lack of samples.
The descriptive statistics of the samples are reported elsewhere (Alessandri et al., 1997a).
Fatty acids were determined using the method stated by the Technical Commission of the Italian Agriculture Ministry (Commissione Técnica 1976).

Statistical Analysis
All the calculations were made by means of the SAS package (Statistical Analysis System).

Starting Variables
Among the 34 variables derived from the chemical analyses (Alessandri et al., 1997a), only 10 were included in the classification models described in this paper: Palmitic, Palmitoleic and Oleic acid, Phytol, Tetracosanol, Hexacosanol, Cycloartenol, Squalene, Beta-Sitosterol and Delta-5-Avenasterol. This first selection follows the assessment of their effectiveness and reliability, which was carried out by means of univariate (Alessandri et al., 1997a) and multivariate (Alessandri et al., 1997b) models. These models were calculated by zone-couples, within each year, on non-transformed data (see below). Data related to the harvest of the 1991-92 growing season were excluded from the calculation of the models, but included in their testing.

Compensation of yearly variations
To compensate yearly variation the following steps were taken.
-Data were grouped by year.
-The East zone of Tuscany is the best known in terms of oils (and wines) of high quality and ancient tradition, and it is the most studied (Alessandri et al., 1997a). Therefore the East zone was considered the reference zone. Within each year, all the data were transformed to make the mean of each variable related to the East zone equal to zero.
-The Cycloartenol values were not transformed, because the «Year» is not a significant source of variation for this variable in Tuscany (Alessandri et al., 1997a).
-Each transformed variable was included in an ANOVA model (Analysis of variance) to confirm that, after the transformation, the «Year» was not a significant source of variation.
-If a variable did not match the condition above, it was discarded.
-The data related to the remaining variables were no longer considered as grouped by year and were included in the following calculations as a whole. This approach made a substantial increase of the degrees of freedom of the classification models possible.
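The transformation described in the steps above can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the function name `compensate_yearly` and the phytol values are hypothetical and are not taken from the original SAS calculations.

```python
from collections import defaultdict
from statistics import mean

def compensate_yearly(records, variable, reference_zone="E"):
    """Subtract, within each year, the reference-zone mean of `variable`."""
    # Collect the reference-zone values year by year.
    ref_values = defaultdict(list)
    for r in records:
        if r["zone"] == reference_zone:
            ref_values[r["year"]].append(r[variable])
    ref_mean = {year: mean(vals) for year, vals in ref_values.items()}

    # Shift every observation of a year by that year's reference mean,
    # so that the reference zone has mean zero within each year.
    return [{**r, variable: r[variable] - ref_mean[r["year"]]} for r in records]

# Made-up phytol values, for illustration only.
samples = [
    {"year": 89, "zone": "E", "phytol": 10.0},
    {"year": 89, "zone": "E", "phytol": 14.0},
    {"year": 89, "zone": "W", "phytol": 20.0},
    {"year": 90, "zone": "E", "phytol": 30.0},
    {"year": 90, "zone": "N", "phytol": 25.0},
]
shifted = compensate_yearly(samples, "phytol")
```

In this sketch a year without any reference-zone sample raises a `KeyError`, which mirrors the fact that the compensation cannot be estimated without East-zone observations for that year.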

Calculation of the classification models
All the classification models were based on Linear Discriminant Analysis (Lachenbruch, 1975; Hand, 1981). They were cross-validated by the leaving-one-out method (Lachenbruch and Mickey, 1968), calculated on four-year data (see 3.3.1 and 3.3.2) and then tested on one-year data.
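A minimal sketch of a univariate linear discriminant model with leaving-one-out cross-validation is given here for orientation. It is not the SAS procedure used in this work: the function names are hypothetical, the prior probabilities are assumed to be proportional to class sizes, and the values are made up.

```python
from math import log
from statistics import mean

def fit_lda_1d(xs, labels):
    """Univariate LDA: class means, pooled variance, empirical priors."""
    classes = sorted(set(labels))
    groups = {c: [x for x, l in zip(xs, labels) if l == c] for c in classes}
    means = {c: mean(g) for c, g in groups.items()}
    # Pooled within-class variance, shared by all the classes.
    ss = sum((x - means[c]) ** 2 for c, g in groups.items() for x in g)
    pooled_var = ss / (len(xs) - len(classes))
    priors = {c: len(g) / len(xs) for c, g in groups.items()}
    return means, pooled_var, priors

def predict_lda_1d(model, x):
    means, var, priors = model

    def score(c):
        # Linear discriminant score of class c; the largest score wins.
        return x * means[c] / var - means[c] ** 2 / (2 * var) + log(priors[c])

    return max(means, key=score)

def leave_one_out_error(xs, labels):
    """Fraction of observations misclassified when each one is left out in turn."""
    errors = 0
    for i in range(len(xs)):
        model = fit_lda_1d(xs[:i] + xs[i + 1:], labels[:i] + labels[i + 1:])
        errors += predict_lda_1d(model, xs[i]) != labels[i]
    return errors / len(xs)

# Made-up, well separated values for three zones.
xs = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1, 9.0, 9.2, 8.8, 9.1]
labels = ["E"] * 4 + ["N"] * 4 + ["W"] * 4
loo_error = leave_one_out_error(xs, labels)
```

The multivariate case replaces the pooled variance with the pooled covariance matrix, but the leave-one-out loop keeps the same shape.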
The first calculations were made considering one couple of levels of the Zone variable (W vs. E, W vs. N and E vs. N) at a time. Univariate models were calculated to verify their consistency with the models calculated by year, on non-transformed data (Alessandri et al., 1997a).
Then univariate and multivariate models were calculated, considering the three zones (E, N and W) together. In these models four years of data were included, but their classification errors were analyzed not only as a whole but also within each year (Table I). The variable selection was carried out by means of the evaluation of all these errors. 50% and 45% of misclassified observations were adopted as selecting thresholds, to evaluate the following:
-The total discriminatory power (total of misclassified observations);
-The capability to recognize the three zones at the same level (observations misclassified by zone, considering all the years together, except for 1991);
-The yearly consistency (within-year misclassified observations, i.e. total of misclassified observations from the three zones of each single year).
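The three kinds of error listed above can be tallied as in the following sketch. The function names and the records are hypothetical, and treating an error exactly equal to the threshold as acceptable is an assumption of this example.

```python
from collections import Counter

def error_rates(records):
    """records: list of (year, true_zone, predicted_zone) tuples."""
    def rate(key):
        total, wrong = Counter(), Counter()
        for year, true_zone, predicted in records:
            k = key(year, true_zone)
            total[k] += 1
            wrong[k] += predicted != true_zone
        # Percent of misclassified observations in each group.
        return {k: 100.0 * wrong[k] / total[k] for k in total}

    return {
        "total": rate(lambda y, z: "all")["all"],
        "by_zone": rate(lambda y, z: z),
        "within_year": rate(lambda y, z: y),
    }

def passes(rates, threshold):
    """True when no total, by-zone or within-year error exceeds the threshold."""
    values = [rates["total"], *rates["by_zone"].values(),
              *rates["within_year"].values()]
    return all(v <= threshold for v in values)

# Made-up classification results, for illustration only.
records = [
    (89, "E", "E"), (89, "E", "W"), (89, "W", "W"), (89, "N", "N"),
    (90, "E", "E"), (90, "W", "E"), (90, "W", "W"), (90, "N", "N"),
]
rates = error_rates(records)
```

With these made-up records the model would be kept at the 45% threshold, but it would be rejected at a stricter 30% threshold because of its by-zone errors.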

ANOVA
After the transformation described above, the «Year» is not a significant source of variation for any of the 10 variables that were investigated at first (ANOVA calculations not reported). The «Zone», on the contrary, is always highly significant.
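Since the ANOVA calculations are not reported, the following sketch only illustrates the kind of check involved, under the assumption of a simple one-way layout with «Year» as the single factor. The function name and the values are hypothetical; the significance would then be read from the F distribution with (k - 1, n - k) degrees of freedom.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA; `groups` is a list of lists of values."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up values of one transformed variable, grouped by year.
f_shifted = one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]])
f_flat = one_way_anova_f([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]])
```

An F value close to zero, as for the second call, is what a variable whose yearly variation has been fully compensated would be expected to show.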

Classification models
The univariate models, calculated by zone-couples (Table I), confirm the results derived from models calculated by year on original values (Alessandri et al., 1997a). The three-zone univariate models (Table I) show total classification errors between 41% and 59%. Oleic Acid and Phytol score the lowest errors (42% and 41%), while Delta-5-Avenasterol (59%) and Beta-Sitosterol (56%) score the highest.
If we consider the within-year classification errors (excluding 1991), it can be noted that Palmitoleic acid, Tetracosanol and Phytol show only one value greater than 50%. Hexacosanol, Beta-Sitosterol and Delta-5-Avenasterol, on the other hand, show three or four values greater than that threshold.
For these reasons Phytol was selected to be included in all the multivariate models, and Beta-Sitosterol, Delta-5-Avenasterol and Hexacosanol were discarded. Therefore the subsequent multivariate models were calculated including all the 63 simple combinations of Phytol and six variables: Palmitic, Palmitoleic and Oleic acid, Tetracosanol, Cycloartenol and Squalene (Tables IIa and IIb).
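The count of 63 models follows from taking Phytol together with every non-empty subset of the six remaining variables (2^6 - 1 = 63). A short enumeration, with hypothetical names, makes the number of models of each dimension explicit:

```python
from itertools import combinations

CANDIDATES = ["Palmitic acid", "Palmitoleic acid", "Oleic acid",
              "Tetracosanol", "Cycloartenol", "Squalene"]

def candidate_models(fixed="Phytol", candidates=CANDIDATES):
    """Every model made of `fixed` plus a non-empty subset of `candidates`."""
    return [(fixed, *subset)
            for size in range(1, len(candidates) + 1)
            for subset in combinations(candidates, size)]

models = candidate_models()

# Number of models for each number of variables (Phytol included).
by_dimension = {}
for model in models:
    by_dimension[len(model)] = by_dimension.get(len(model), 0) + 1
```

The enumeration gives 6 bivariate models, 15 of three dimensions, 20 of four, 15 of five, 6 of six and a single seven-variable model.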

Bivariate models (Table IIa) show that the inclusion of Cycloartenol, Oleic Acid or Squalene decreases the total classification error from 41% (Phytol-only model) to 36%, 37% and 38% respectively. The by-zone classification errors of the Phytol+Cycloartenol model are all less than 50%. The same can be said about the within-year errors related to the Phytol+Squalene model.
Among the 15 models of three dimensions (Table IIa), four lead to a further decrease of the total misclassified observations. Three of them include Cycloartenol and Palmitic, or Palmitoleic, or Oleic Acid. They respectively show errors of 33%, 33% and 29%. The fourth model includes Oleic Acid and Squalene and scores 34% of total classification error. The by-zone classification errors and the within-year errors related to the Phytol+Palmitic Acid+Cycloartenol model are both less than 45%. This particular feature and its effectiveness make this the most interesting three-dimension model. It correctly recognizes all three zones and shows a remarkable stability through the years. This is not the case for the Phytol+Oleic Acid+Cycloartenol model. For this reason it was considered less valuable, though it scores a lower total classification error (29%).
Among the 20 models of four dimensions (Tables IIa and IIb), only the Phytol+Palmitic Acid+Tetracosanol+Cycloartenol model shows the features emphasized above (Figures 1-5), and a further reduction of the total classification error from 33% (Phytol+Palmitic Acid+Cycloartenol model) to 30%.
Two of the 15 models of five dimensions show within-year errors that match the 45% threshold. Both include Phytol, Palmitic Acid, Cycloartenol and Squalene. The fifth variable is Palmitoleic Acid or Tetracosanol. Both perform worse than the formerly selected models and misclassify 34% of the observations. This trend is shared by the only six-variable model whose by-zone and within-year errors match the 45% threshold. It includes Phytol, Palmitic, Palmitoleic and Oleic Acid, Tetracosanol and Squalene, and misclassifies 36% of observations. The same trend is confirmed by the inclusion of the last variable: the seven-dimension model matches the 45% thresholds as above, but misclassifies a further 1% of observations (total classification error 37%).
The items discussed so far can be summarized by looking at the performances of the models selected within each number of variables (Table III).

Table note: Errors greater than 45% are underlined. Errors greater than 50% are double-underlined. For the sake of comparison, the errors related to the Phytol univariate model are also reported. All the variables, except for Cycloartenol, have been transformed (see text).
Among these nine models only one shows the lowest total classification error of 30%. It is the four-dimension model including Phytol, Palmitic Acid, Tetracosanol and Cycloartenol (Table IV, Figures 1-5). It shows a remarkable feature: all its by-zone classification errors are less than 40% (Table III, Figure 1). This characteristic is shared by the selected three-dimension model (Phytol+Palmitic Acid+Cycloartenol), but this model shows a higher total classification error (33%, Table III).
The bi-dimensional representation of the selected four-dimension model (Figures 1-5) was obtained by means of the two related canonical variates. The discriminant linear function and its classifying thresholds were then re-calculated from the canonical variates.
The selected four-dimension model was also re-calculated including the non-transformed variables. It is worth underlining that the related total classification error rises to 39% (9 percentage points higher). This confirms the importance of taking yearly variations into account.
Some results (Table III) reported in this paper can be compared to those regarding univariate models calculated by zone-couples, a year at a time and on original values (3). The two groups of the best performing variables overlap remarkably:

-Phytol had been formerly selected to classify Eastern vs Western and Northern Tuscan olive oils;
-Palmitoleic and Oleic Acid had been formerly selected to classify Western vs Eastern and Northern observations;
-Cycloartenol had been formerly selected to classify Northern vs Eastern and Western oils, both in univariate (Alessandri et al., 1997a) and multivariate (Alessandri et al., 1997b) models.

CONCLUSIONS
Effective and reliable multivariate models can be calculated to classify virgin olive oils from Tuscany on a three-zone basis.
It is important to compensate analytical values for the yearly variations. The method presented here requires either the harvesting year of a blind sample to be known or, for the same year, a reference sampling (the East zone), to estimate and compensate the variation related to that year.
According to our data, Phytol, Palmitic, Palmitoleic and Oleic acid, Tetracosanol, Cycloartenol and Squalene are effective and reliable variables.
Models including phytol, palmitic acid and cycloartenol, with or without tetracosanol, misclassify 30% or 33% of the observations respectively. They show by-zone classification errors lower than 40% and within-year classification errors lower than 45%.
Oleic Acid and Phytol score low classification errors for three zone-couples. Palmitic acid, Palmitoleic acid and Tetracosanol effectively classify the Western observations vs the others. Hexacosanol and Squalene effectively classify the Eastern oils vs the Western and the Northern.
Table note (two-class and three-class models): Percent of misclassified observations related to the zones or the years listed below. The data from the 1991-92 harvesting season (Year = 91) are not included in all the calculations. In two-class classification models, the errors greater than 35% are underlined. In three-class classification models, the errors greater than 45% are underlined and the errors greater than 50% are double-underlined. All the variables, except for Cycloartenol, have been transformed (see text).
Table note (three-class models, test on a one-year subset of data): Percent of misclassified observations related to the zones or the years listed below. The data from the 1991-92 harvesting season (Year = 91) are not included in all the calculations. Errors greater than 45% are underlined. Errors greater than 50% are double-underlined. For the sake of comparison, the errors related to the Phytol univariate model are also reported.
(c) Consejo Superior de Investigaciones Científicas Licencia Creative Commons 3.0 España (by-nc) http://grasasyaceites.revistas.csic.es
Figure 1. Distribution of the observations in the space of the canonical variates (CAN1 and CAN2). Observations from the Eastern, Northern and Western zones of Tuscany are labeled «E», «N» and «W» respectively. The dotted lines correspond to the between-zone thresholds. These are derived from the linear discriminant function including the two canonical variates. CAN1 and CAN2 are linear combinations of the transformed (see text) values of Phytol, Palmitic Acid, Tetracosanol and Cycloartenol. The classification model including these variables is the most effective and reliable among the models considered in this work. Olive oils extracted from drupes harvested during the seasons of 1989-90, 1990-91, 1992-93 and 1993-94 are represented.

Figure 5