Introduction
This appendix accompanies the paper "Keeping the Data Lake in Form: Proximity Mining for Pre-filtering Schema Matching". We examine the performance of our proposed schema matching pre-filtering approach using the sample datasets from the OpenML Data Lake (DL). The annotated sample data used in the experiments can be retrieved from our project's GitHub page.
This webpage was created by the Database Technologies and Information Management Group, UPC, BarcelonaTech and the Department of Computer and Decision Engineering, ULB, Brussels.
Appendix: Pre-filtering Experiment Analysis
In this appendix, we examine the cases in our experiments which led to correctly classified true positives, or to incorrectly classified false negatives or false positives. The incorrectly classified cases are scrutinised in detail to detect the reasons our approach fails in such cases, or whether those cases were falsely annotated by the contributors to the ground-truth of our experiment. False annotation can occur because the manual annotators depended on the textual descriptions provided by OpenML for each dataset. Those descriptions might describe a specific domain or subject-area, yet pairs of datasets falling under this subject-area might have disjoint information describing it (i.e., they do not store the same information in their attributes). As the annotators did not investigate the attribute metadata and only depended on the descriptions of the datasets, this can lead to false annotations. The annotators were not allowed to investigate the attribute metadata, to eliminate any possibility of bias towards our approach, which uses the same metadata. This keeps the ground-truth independent from our approach, leading to more valid results.
We investigate the All-Prox model because it was the best performing proximity model in our experiments. This model utilises all the different types of metadata collected about datasets and their attributes. We select the results at the proximity cut-off threshold of 0.4 because it achieves the best trade-off between recall and efficiency gain in our results (see Fig. 15 in the paper).
We analyse the false negatives and the false positives. The goal of our schema matching pre-filtering approach is to minimise both false negatives and false positives, in order to improve recall and precision respectively. We analyse the cases arising as false negatives and whether they are actually true negatives (i.e., have disjoint information), and the cases of false positives which ideally could still be filtered out. Finally, we analyse the cases of true positives and identify the meta-features which helped our proximity model correctly identify those pairs.
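To make this setup concrete, the following is a minimal sketch (not the paper's implementation) of how a proximity threshold splits candidate pairs and how recall and precision are then computed. The pair identifiers and scores below are hypothetical:

```python
# Hypothetical proximity scores for dataset pairs (in the experiments the
# scores come from the All-Prox proximity model).
scores = {("ds8", "ds1097"): 0.21, ("ds23", "ds329"): 0.38,
          ("ds183", "ds232"): 0.35, ("ds1499", "ds1500"): 0.93}
# Ground-truth: pairs annotated as related.
truth = {("ds23", "ds329"), ("ds183", "ds232"), ("ds1499", "ds1500")}

def prefilter(pair_scores, threshold=0.4):
    """Keep only pairs whose proximity score reaches the threshold."""
    return {pair for pair, score in pair_scores.items() if score >= threshold}

def recall_precision(predicted, actual):
    tp = len(predicted & actual)        # related pairs correctly kept
    fn = len(actual - predicted)        # related pairs wrongly filtered out
    fp = len(predicted - actual)        # unrelated pairs wrongly kept
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

kept = prefilter(scores, threshold=0.4)
print(kept)                             # only the 0.93 pair survives
print(recall_precision(kept, truth))    # borderline pairs become false negatives
```

Note how the two borderline related pairs scoring 0.35 and 0.38 fall just below the 0.4 cut-off and become false negatives, mirroring the borderline cases discussed below.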
1. False Negatives
We took a random sample of 30 false negative cases from the total of 81 false negatives. In our experiments we had about 550 actual positive cases. At the threshold of `T=0.4' for the All-Prox model, 81 cases were not correctly classified as positive cases for schema matching. We analyse the details of those cases in Table 1. Each row in the table lists a dataset pair which did not have a proximity score above the threshold of 0.4, although the pair was manually annotated as related datasets under the same topic.
In Table 1, the columns are as follows:
- DS 1 ID: The OpenML dataset unique identifier for accessing the first dataset's description page. The page can be accessed by replacing the `X' symbols at the end of the URL: https://www.openml.org/d/XXXX.
- DS 2 ID: The OpenML unique identifier for the second dataset.
- DS1 and DS2: The names of the first and second dataset in the pair respectively.
- Topic: the common subject-area classification for both datasets in the pair.
- att1 and att2: the number of attributes in each dataset respectively.
- nom att 1 and nom att 2: the number of nominal attributes in each dataset.
- num att 1 and num att 2: the number of numeric attributes in each dataset.
- Nom Att M: A list of nominal attributes from both datasets which have a matching data profile. Each attribute pair starts with a `*' symbol, followed by the attribute from the first dataset, then the `+' symbol, and finally the matching attribute from the second dataset. When a dataset pair has several matching attribute pairs, the pairs are numbered and listed in sequence.
- Num Att M: A list of numeric attributes from both datasets which have a matching data profile. It follows the same structure as the Nom Att M column.
- Rel?: the final verdict on whether the pair of datasets really falls under the same topical category or not. This is based on manually analysing the attributes in each dataset, and checking whether each attribute in the first dataset matches an attribute from the other dataset in terms of its `semantics'. Here we consider the actual meaning and description of each attribute and not just the data profiles. A value of `yes' means that both datasets have several semantically related attributes (a couple or more), and a value of `no' means they do not. Note: although a dataset pair might have attributes with matching data profiles, the semantic meaning of the attributes might still differ based on their descriptions. For the result in this column, we also consider the semantic meanings of the attributes in our decision.
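As a rough illustration of what "a matching data profile" means for the numeric attributes above, the sketch below compares two attributes through simple summary statistics with a relative tolerance. The statistics chosen, the tolerance, and the sample values are assumptions for illustration only; the paper's proximity models use a richer set of attribute meta-features.

```python
from statistics import mean, stdev

def profile(values):
    """A toy numeric data profile: a few summary statistics."""
    return {"min": min(values), "max": max(values),
            "mean": mean(values), "std": stdev(values)}

def profiles_match(p1, p2, rel_tol=0.25):
    """True when every statistic of p2 lies within rel_tol of p1's."""
    for key in p1:
        scale = max(abs(p1[key]), abs(p2[key]), 1e-9)
        if abs(p1[key] - p2[key]) / scale > rel_tol:
            return False
    return True

# Illustrative values: two attributes on very different scales do not match.
height_a = [0.09, 0.14, 0.15, 0.23, 0.17]   # e.g. a ratio-scaled height
height_b = [14.0, 22.5, 31.0, 38.2, 44.0]   # e.g. a height in centimetres
print(profiles_match(profile(height_a), profile(height_b)))  # False
```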
Table 1: A list of randomly selected false-negative dataset pair cases from the All-Prox model in the OpenML experiment.
| DS 1 ID | DS 2 ID | DS1 | DS2 | Topic | att 1 | att 2 | nom att 1 | nom att 2 | num att 1 | num att 2 | Nom Att M | Num Att M | Rel? |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8 | 1097 | liver-disorders | ICU | Health Measurements | 7 | 21 | 1 | 17 | 6 | 4 | 0 | 0 | No |
| 10 | 1524 | lymph | vertebra-column | Disease | 19 | 7 | 19 | 1 | 0 | 6 | 0 | 0 | No |
| 18 | 1038 | mfeat-morphological | gina_agnostic | Digit Handwriting Recognition | 7 | 971 | 4 | 1 | 3 | 970 | 0 | 0 | No |
| 23 | 329 | cmc | hayes-roth | Citizen Census Data | 10 | 5 | 9 | 5 | 1 | 0 | *Wifes_education + educational_level | 0 | Yes |
| 25 | 1510 | colic | wdbc | Disease | 28 | 31 | 20 | 1 | 8 | 30 | 0 | 1) *Pulse + V3; 2) *abdomcentesis_total_protein + V13 | No |
| 34 | 43 | postoperative-patient-data | haberman | Disease | 9 | 4 | 9 | 2 | 0 | 2 | 0 | 0 | No |
| 34 | 222 | postoperative-patient-data | echoMonths | Disease | 9 | 10 | 9 | 3 | 0 | 7 | 0 | 0 | No |
| 55 | 1510 | hepatitis | wdbc | Disease | 20 | 31 | 14 | 1 | 6 | 30 | 0 | *ALK_PHOSPHATE + V23 | No |
| 150 | 473 | covertype | cjs | Plantation Measurements | 55 | 35 | 45 | 11 | 10 | 24 | 0 | 0 | No |
| 171 | 1510 | primary-tumor | wdbc | Disease | 18 | 31 | 18 | 1 | 0 | 30 | 0 | 0 | No |
| 301 | 563 | ozone_level | kdd_el_nino-small | Geographical Measurements | 73 | 9 | 73 | 2 | 0 | 7 | 0 | 0 | No |
| 307 | 4552 | vowel | BachChoralHarmony | Sound Measurements | 13 | 17 | 3 | 15 | 10 | 2 | 0 | 0 | No |
| 343 | 1094 | white-clover | Acorns | Plantation Measurements | 32 | 5 | 5 | 2 | 27 | 3 | 0 | 1) *White-Clover-93 + Tree_Height; 2) *Other Legumes-91 + Acorn_size | No |
| 375 | 4552 | JapaneseVowels | BachChoralHarmony | Sound Measurements | 15 | 17 | 1 | 15 | 14 | 2 | 0 | 0 | No |
| 492 | 1464 | newton_hema | blood-transfusion-service-center | Health Measurements | 4 | 5 | 1 | 1 | 3 | 4 | 0 | *cells_percentage + V4 | No |
| 563 | 1487 | kdd_el_nino-small | ozone-level-8hr | Geographical Measurements | 9 | 73 | 2 | 1 | 7 | 72 | 0 | 0 | No |
| 1471 | 1490 | eeg-eye-state | planning-relax | Cardiac EEG Measurements | 15 | 13 | 1 | 1 | 14 | 12 | *Class + Class | 0 | Yes |
| 1478 | 1509 | har | walking-activity | Motion patterns | 562 | 5 | 1 | 1 | 561 | 4 | 0 | 0 | No |
| 1489 | 4552 | phoneme | BachChoralHarmony | Sound Measurements | 6 | 17 | 1 | 15 | 5 | 2 | 0 | 0 | No |
| 183 | 205 | abalone | sleep | Animal Profile | 9 | 8 | 2 | 3 | 7 | 5 | 0 | 0 | No |
| 183 | 232 | abalone | fishcatch | Animal Profile | 9 | 8 | 2 | 2 | 7 | 6 | *Sex + Sex | 1) *Height + Height; 2) *Length + Length1; 3) *Diameter + Length3 | Yes |
| 329 | 541 | hayes-roth | socmob | Citizen Census Data | 5 | 6 | 5 | 4 | 0 | 2 | 0 | 0 | No |
| 451 | 541 | irish | socmob | Citizen Census Data | 6 | 6 | 4 | 4 | 2 | 2 | 0 | 0 | No |
| 483 | 492 | iq_brain_size | newton_hema | Health Measurements | 9 | 4 | 3 | 1 | 6 | 3 | 0 | 0 | No |
| 483 | 1464 | iq_brain_size | blood-transfusion-service-center | Health Measurements | 9 | 5 | 3 | 1 | 6 | 4 | 0 | 0 | No |
| 492 | 679 | newton_hema | rmftsa_sleepdata | Health Measurements | 4 | 3 | 1 | 1 | 3 | 2 | 0 | 0 | No |
| 1483 | 1509 | ldpa | walking-activity | Motion patterns | 8 | 5 | 3 | 1 | 5 | 4 | *Class + Class | 0 | No |
| 8 | 212 | liver-disorders | diabetes_numeric | Health Measurements | 7 | 3 | 1 | 0 | 6 | 3 | 0 | 0 | No |
| 212 | 492 | diabetes_numeric | newton_hema | Health Measurements | 3 | 4 | 0 | 1 | 0 | 1 | 0 | 0 | No |
| 212 | 679 | diabetes_numeric | rmftsa_sleepdata | Health Measurements | 3 | 3 | 0 | 1 | 3 | 2 | 0 | 0 | No |
Manual Analysis
The final decision on whether two datasets are actually related or not, reported in the `Rel?' column in Table 1, was based on detailed scrutiny of the datasets and their attributes by one of the authors. We summarise the commentary which led to our final verdict about each pair in the sample as follows:
- DS 8 and 1097: liver-disorders is about liver disorders diagnosed from blood measurements, while ICU contains non-blood measurements related to intensive care patients, with more attributes about patient demographics. Here the two datasets were too broadly classified together in the same subject-area, as they do not contain related information in their attributes.
- DS 10 and 1524: lymph is about lymphography measurements while vertebra-column has unrelated measurements about orthopaedic patients. Here the two datasets were too broadly classified together in the same subject-area.
- DS 18 and 1038: mfeat-morphological is about handwritten numerals, with 7 attributes mainly describing different aggregate dimensions of handwritten characters. gina_agnostic is about two-digit handwritten numbers instead of single digits, with more than 900 attributes representing digit pixel maps. Both datasets model the same subject-area but using different measurements for different entities.
- DS 23 and 329: cmc is about contraceptive usage by citizens in Indonesia, with demographic and socio-economic characteristics of married couples. hayes-roth is about individual persons' demographics. Only one attribute was in common between the datasets, although many other attributes had similar content but unrelated names. The predicted similarity was in the range of 0.3-0.4, so it was a borderline pair that was filtered out.
- DS 25 and 1510: colic is about surgeries on horses and measurements of horse conditions, while wdbc is about breast cancer and shape and texture measurements. As the second dataset does not have meaningful attribute names, we cannot confirm the semantic relationships between attributes; a few numeric attributes had intersecting data, but their names did not coincide, which reduced the similarity coefficient.
- DS 34 and 43: postoperative-patient-data is about postoperative actions taken on patients after surgery and measures body temperatures, while haberman is about patients who underwent breast cancer surgery and their survival rates. No related attributes with intersecting values exist.
- DS 34 and 222: Postoperative-patient-data is about postoperative actions taken on patients after surgery and measures body temperatures, while echoMonths is about patients who had a heart attack and their survival rates based on heart measurements. No related attributes with intersecting values exist.
- DS 55 and 1510: hepatitis is about hepatitis patients and liver function tests, while wdbc is about breast cancer and shape and texture measurements. One attribute coincidentally has a numeric distribution similar to an attribute in the other dataset.
- DS 150 and 473: covertype is about US forest cover type based on cartographic measurements, while cjs is about tree growth measurements in Canada. Both datasets model the same subject-area but using different measurements for different entities.
- DS 171 and 1510: primary-tumor is about patients with different cancer types and the different symptoms they experience, while wdbc is about breast cancer and shape and texture measurements. Both datasets model the same subject-area but using different measurements for different entities.
- DS 301 and 563: ozone_level is about forecasting ozone days in the environment with climate measurements modelled as categorical data, while kdd_el_nino-small is about meteorological measurements about oceans. Here the two datasets were too broadly classified together in the same subject-area.
- DS 307 and 4552: vowel is about vowel speech recognition which is based on features not described by the dataset creator, while BachChoralHarmony is about music information retrieval and description of musical notes. Here the two datasets were too broadly classified together in the same subject-area.
- DS 343 and 1094: white-clover describes land areas having a specific plant type, and acorns is about a type of fruit and the location of its cultivation. A few attributes coincidentally had numeric distributions similar to attributes in the other dataset, but they do not have the same semantic meaning. Both datasets model different aspects of the same subject-area.
- DS 375 and 4552: JapaneseVowels is about vowel speech recognition using sound samples descriptions, and BachChoralHarmony is about music information retrieval and description of musical notes. Here the two datasets were too broadly classified together in the same subject-area.
- DS 492 and 1464: newton_hema is about cell markup data from cats, and blood-transfusion-service-center is about blood donation measurements. A few attributes coincidentally had numeric distributions similar to attributes in the other dataset, but they do not have the same semantic meaning. Here the two datasets were too broadly classified together in the same subject-area.
- DS 563 and 1487: kdd_el_nino-small is about meteorological measurements about oceans, while ozone-level-8hr is about forecasting ozone days in the environment with climate numerical measurements. Here the two datasets were too broadly classified together in the same subject-area.
- DS 1471 and 1490: eeg-eye-state is about EEG measurements over hundreds of seconds from a headset worn by a single person with eyes open or shut, while planning-relax is about EEG measurements over 5 days which have been transformed using a wavelet packet analysis mechanism, resulting in different attribute ranges. The only common attribute is the nominal class attribute, which models a two-state condition of the subject performing the experiments that produced the recorded measurements. The predicted similarity was in the range of 0.3-0.4, so it was a borderline pair that was filtered out.
- DS 1478 and 1509: har is about measurements of movements using inertial sensors, while walking-activity is about movements collected using accelerometer measurements. The two datasets have different measurements of motion which do not intersect.
- DS 1489 and 4552: phoneme is about recognising whether a sound is oral or nasal, using energy-normalised measurements of harmonics that yield attributes describing the phonemes of speech parts, while BachChoralHarmony is about music information retrieval and the description of musical notes. Here the two datasets were too broadly classified together in the same subject-area.
- DS 183 and 205: abalone is about prediction of the age of abalone shellfish based on its physical characteristics, while sleep is about assessment of mammal animals features and the effect of their sleep time on danger from other predators. No related attributes with intersecting values exist. Here the two datasets were too broadly classified together in the same subject-area.
- DS 183 and 232: abalone is about predicting the age of abalone shellfish based on physical characteristics, while fishcatch is about the physical characteristics of fish from a lake in Finland. Both datasets are about the same subject-area but describe different animals. The problem is that the attributes have different data. For example, the `sex' attribute has three distinct values in one dataset and two in the other. The `height' attribute also has a different range of values: between 0 and 1 for the abalone shellfish, while it ranges between 14 and 44 in the lake fish dataset. Although some attributes had similar names, all attributes had different values. The predicted similarity was in the range of 0.3-0.4, so it was a borderline pair that was filtered out.
- DS 329 and 541: Hayes-roth is about individual persons demographics, and socmob is about families and their work occupations. No related attributes with intersecting values exist. Here the two datasets were too broadly classified together in the same subject-area.
- DS 451 and 541: irish is about children at school and their educational characteristics, and socmob is about families and their work occupations. No related attributes with intersecting values exist. Here the two datasets were too broadly classified together in the same subject-area. Although both datasets might have similar dataset-level meta-features (e.g., number of attributes per type), going to the deeper level of attributes easily reveals that the datasets are unrelated.
- DS 483 and 492: iq_brain_size is about brain measurements and IQ between twins, and newton_hema is about cell markup data from cats. A few attributes coincidentally had numeric distributions similar to attributes in the other dataset, but they do not have the same semantic meaning. Here the two datasets were too broadly classified together in the same subject-area.
- DS 483 and 1464: iq_brain_size is about brain measurements and IQ between twins, and blood-transfusion-service-center is about blood donation measurements. A few attributes coincidentally had numeric distributions similar to attributes in the other dataset, but they do not have the same semantic meaning. Here the two datasets were too broadly classified together in the same subject-area.
- DS 492 and 679: newton_hema is about cell markup data from cats, and rmftsa_sleepdata is about body measurements in babies during different sleeping states. A few attributes coincidentally had numeric distributions similar to attributes in the other dataset, but they do not have the same semantic meaning. Here the two datasets were too broadly classified together in the same subject-area.
- DS 1483 and 1509: ldpa is about measurements of movements using position sensors on parts of the body, while walking-activity is about movements collected using accelerometer measurements. The two datasets have different measurements of motion which do not intersect, except for the `class' attribute, which is about activity type and might be related. We note that although the datasets might have identical attribute names (as attribute names are arbitrary), they do not store the same information in their attributes; that is why it is important to combine name-based proximity metrics with value-based metrics. Here the two datasets were too broadly classified together in the same subject-area.
- DS 8 and 212: liver-disorders is about liver disorders based on blood measurements, and diabetes_numeric is about measurements of diabetic c-peptide serum in children. Here the two datasets were too broadly classified together in the same subject-area.
- DS 212 and 492: diabetes_numeric is about measurements of diabetic c-peptide serum in children, and newton_hema is about cell markup data from cats. Here the two datasets were too broadly classified together in the same subject-area.
- DS 212 and 679: diabetes_numeric is about measurements of diabetic c-peptide serum in children, and rmftsa_sleepdata is about body measurements in babies during different sleeping states. Here the two datasets were too broadly classified together in the same subject-area.
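The ldpa / walking-activity case above illustrates why name-based proximity alone can mislead: two attributes can share the name `Class` yet store disjoint values. The sketch below combines a name-based metric with a value-based one; the weights, the Jaccard choice, and the sample values are illustrative assumptions, not the paper's exact metrics.

```python
from difflib import SequenceMatcher

def name_similarity(name_a, name_b):
    """Name-based proximity: string similarity of the attribute names."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

def value_similarity(vals_a, vals_b):
    """Value-based proximity: Jaccard overlap of distinct nominal values."""
    sa, sb = set(vals_a), set(vals_b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def combined_similarity(name_a, vals_a, name_b, vals_b,
                        w_name=0.5, w_value=0.5):
    """Weighted combination of name-based and value-based proximity."""
    return (w_name * name_similarity(name_a, name_b)
            + w_value * value_similarity(vals_a, vals_b))

# Identical names but disjoint activity codes: the combined score halves.
ldpa_vals = ["walking", "lying", "sitting"]      # hypothetical values
walk_vals = ["0", "1", "2", "3"]                 # hypothetical values
print(name_similarity("Class", "Class"))                            # 1.0
print(combined_similarity("Class", ldpa_vals, "Class", walk_vals))  # 0.5
```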
Conclusion
From our analysis, the majority of the cases (90%) in the sample of false negatives were actually not related and therefore should not have been annotated as related datasets by the human annotators in the experiment. Those were falsely annotated cases, due to the fact that the human annotators only had the overall dataset descriptions to judge relatedness. Many datasets were broadly classified under the same topic although they should have been distinguished. Our automated techniques, based on our proposed proximity models, can identify such cases more accurately and correctly retain the related pairs, thanks to the detailed meta-features collected about the datasets and their attributes.
2. False Positives
We took a random sample of 30 false positive cases from the total of 172 false positives. In our experiments we had about 550 actual positive cases. At the threshold of `T=0.4' for the All-Prox model, 172 cases were incorrectly classified as positive cases for schema matching, although they did not belong to the same topical subject-area. We analyse the details of those cases in Table 2. Each row in the table lists a dataset pair which had a proximity score above the threshold of 0.4, although the pair was manually annotated as not related (i.e., under different topics).
In Table 2, the columns are as follows:
- DS1 ID: The OpenML dataset unique identifier for accessing the first dataset's description page. This could be accessed by replacing the `X' symbols at the end of the URL: https://www.openml.org/d/XXXX.
- DS2 ID: The OpenML unique identifier for the second dataset.
- DS1 and DS2: The names of the first and second dataset in the pair respectively.
- DS1 Topic: the subject-area classification for the first dataset in the pair.
- DS2 Topic: the subject-area classification for the second dataset in the pair.
- DS 1 no. att and DS 2 no. att: the number of attributes in each dataset respectively.
- DS 1 no. nom att and DS 2 no. nom att: the number of nominal attributes in each dataset.
- DS 1 no. num att and DS 2 no. num att: the number of numeric attributes in each dataset.
- Related Nom Att: A list of nominal attributes from both datasets which have a matching data profile. Each attribute pair starts with a `*' symbol, followed by the attribute from the first dataset, then the `+' symbol, and finally the matching attribute from the second dataset. When a dataset pair has several matching attribute pairs, the pairs are numbered and listed in sequence.
- Related Num Att: A list of numeric attributes from both datasets which have a matching data profile. It follows the same structure as the Nom Att M column.
- Rel?: the final verdict on whether the pair of datasets really falls under the same topical category or not. This is based on manually analysing the attributes in each dataset, and checking whether each attribute in the first dataset matches an attribute from the other dataset in terms of its `semantics'. Here we consider the actual meaning and description of each attribute and not just the data profiles. A value of `yes' means that both datasets have several semantically related attributes (a couple or more), and a value of `no' means they do not. Note: although a dataset pair might have attributes with matching data profiles, the semantic meaning of the attributes might still differ based on their descriptions.
Table 2: A list of randomly selected false-positive dataset pair cases from the All-Prox model in the OpenML experiment.
| DS1 ID | DS2 ID | DS1 | DS2 | DS1 Topic | DS2 Topic | DS 1 no. att | DS 2 no. att | DS 1 no. nom att | DS 2 no. nom att | DS 1 no. num att | DS 2 no. num att | Related Nom Att | Related Num Att | Rel? |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 14 | 1497 | mfeat-fourier | wall-robot-navigation | Digit Handwriting Recognition | Robot Motion Measurements | 77 | 25 | 1 | 1 | 76 | 24 | 0 | 1) *Att74 + V18; 2) *Att42 + V1 | No |
| 22 | 1464 | mfeat-zernike | blood-transfusion-service-center | Digit Handwriting Recognition | Health Measurements | 48 | 5 | 1 | 1 | 47 | 4 | 0 | 1) *Att35 + V4; 2) *Att41 + V1; 3) *Att15 + V2 | No |
| 24 | 492 | mushroom | newton_hema | Plant profiles | Health Measurements | 23 | 4 | 23 | 1 | 0 | 3 | 0 | 0 | No |
| 29 | 1466 | credit-approval | cardiotocography | Credit Card Applications | Cardiac EEG Measurements | 16 | 36 | 10 | 16 | 6 | 20 | *A10 + V27 | 1) *A8 + V6; 2) *A2 + V9 | No |
| 36 | 571 | segment | delta_elevators | Image Pixel Classification | Aircraft Performance | 20 | 7 | 4 | 1 | 16 | 6 | *Short-line-density-5 + diffDiffClb | 1) *Exgreen-mean + climbRate; 2) *Exred-mean + Altitude | No |
| 43 | 1049 | haberman | pc4 | Disease | Computer Software Defects | 4 | 38 | 2 | 4 | 2 | 34 | *Survival status + c | *Number of positive axillary nodes detected + LOC_COMMENTS | No |
| 43 | 1511 | haberman | wholesale-customers | Disease | Retail Sales Data | 4 | 9 | 2 | 2 | 2 | 7 | *Survival status + Channel | *Number of positive axillary nodes detected + V1 | No |
| 43 | 4552 | haberman | BachChoralHarmony | Disease | Sound Measurements | 4 | 17 | 2 | 15 | 2 | 2 | 1) *Survival status + V9; 2) *Patients year of operation + V15 | *Age of patient at time of operation + V2 | No |
| 222 | 1519 | echoMonths | robot-failures-lp4 | Disease | Robot Motion Measurements | 10 | 91 | 3 | 1 | 7 | 90 | 0 | *Epss + V34 | No |
| 470 | 1095 | profb | Cereals | Football Matches | Food Nutritional Facts | 10 | 16 | 6 | 11 | 4 | 5 | 0 | 1) *Point spread + sugars; 2) *Underdog Points + Carbo | No |
| 568 | 1500 | kdd_coil_2 | seismic-bumps | Chemical Contamination | Seismic Measurements | 12 | 8 | 3 | 1 | 9 | 7 | *river size + Class | 1) *Concentration 1 + V7; 2) *Concentration 4 + V6 | No |
| 1030 | 1054 | ERA | mc2 | Employee Test Data | Computer Software Defects | 5 | 40 | 5 | 4 | 0 | 36 | *In2 + CALL_PAIRS | 0 | No |
| 1051 | 1097 | cocomonumeric | ICU | Software Development Effort | Health Measurements | 17 | 21 | 15 | 17 | 2 | 4 | *VEXP + LOC | 1) *ACT_EFFORT + ID; 2) *LOC + AGE | No |
| 1465 | 1478 | breast-tissue | har | Human Tissue Impedance Spectrum | Motion patterns | 10 | 562 | 1 | 1 | 9 | 561 | *Class + Class | 0 | No |
| 1471 | 1476 | eeg-eye-state | gas-drift | Cardiac EEG Measurements | Gas Sensing Statistics | 15 | 129 | 1 | 1 | 14 | 128 | 0 | *V6 + V105 | No |
| 1478 | 1597 | har | creditcard | Motion patterns | Credit Card Transactions | 562 | 31 | 1 | 1 | 561 | 30 | 0 | *V145 + V26 | No |
| 1511 | 1516 | wholesale-customers | robot-failures-lp1 | Retail Sales Data | Robot Motion Measurements | 9 | 91 | 2 | 1 | 7 | 90 | *Region + Class | 0 | No |
| 1524 | 4552 | vertebra-column | BachChoralHarmony | Disease | Sound Measurements | 7 | 17 | 1 | 15 | 6 | 2 | *Class + V14 | *V1 + V2 | No |
| 18 | 1495 | mfeat-morphological | qualitative-bankruptcy | Digit Handwriting Recognition | Bankruptcy Evaluation | 7 | 7 | 4 | 7 | 3 | 0 | *Att1 + V3 | 0 | No |
| 43 | 1462 | haberman | banknote-authentication | Disease | Banknote Recognition | 4 | 5 | 2 | 1 | 2 | 4 | *Survival status + Class | *Number of positive axillary nodes detected + V3 | No |
| 151 | 1526 | electricity | wall-robot-navigation | Product Pricing | Robot Motion Measurements | 9 | 5 | 2 | 1 | 7 | 4 | 0 | *Period + V2 | No |
| 196 | 230 | autoMpg | machine_cpu | Car | CPU Performance Data | 8 | 7 | 3 | 1 | 5 | 6 | 0 | 1) *Displacement + MYCT; 2) *Horsepower + CACH | No |
| 329 | 1511 | hayes-roth | wholesale-customers | Citizen Census Data | Retail Sales Data | 5 | 9 | 5 | 2 | 0 | 7 | *Hobby + Region | 0 | No |
| 1489 | 1526 | phoneme | wall-robot-navigation | Sound Measurements | Robot Motion Measurements | 6 | 5 | 1 | 1 | 5 | 4 | 0 | *V5 + V4 | No |
| 1499 | 1500 | seeds | seismic-bumps | Plant Measurements | Seismic Measurements | 8 | 8 | 1 | 1 | 7 | 7 | All attributes | All attributes | Yes |
| 1500 | 1523 | seismic-bumps | vertebra-column | Seismic Measurements | Disease | 8 | 7 | 1 | 1 | 7 | 6 | *Class + Class | *V1 + V2 | No |
| 1509 | 1525 | walking-activity | wall-robot-navigation | Motion patterns | Robot Motion Measurements | 5 | 3 | 1 | 1 | 4 | 2 | 0 | *V4 + V2 | No |
| 183 | 225 | abalone | puma8NH | Animal Profile | Robot Motion Measurements | 9 | 9 | 2 | 0 | 7 | 9 | 0 | 1) *Height + tau1; 2) *Viscera weight + tau2; 3) *Whole weight + thetad2 | No |
| 205 | 212 | sleep | diabetes_numeric | Animal Profile | Health Measurements | 8 | 3 | 3 | 0 | 5 | 3 | 0 | *total sleep + age | No |
| 225 | 1462 | puma8NH | banknote-authentication | Robot Motion Measurements | Banknote Recognition | 9 | 5 | 0 | 1 | 9 | 4 | 0 | 1) *Thetadd3 + V3; 2) *Thetad1 + V4; 3) *Thetadd3 + V2 | No |
Manual Analysis
The final decision on whether two datasets are actually related or not, reported in the `Rel?' column in Table 2, was based on detailed scrutiny of the datasets and their attributes by one of the authors. We summarise the commentary which led to our final verdict about each pair in the sample as follows:
- DS 14 and 1497: The datasets have multiple attributes in common with quite similar numeric distributions; we report only the two strongest.
- DS 22 and 1464: The second, smaller dataset coincidentally has most of its attributes matching numeric attributes from the much larger first dataset.
- DS 24 and 492: Many partially matching attributes pushed this pair above the 0.4 similarity threshold. This is a consequence of choosing a lower threshold level.
- DS 29 and 1466: The attributes coincidentally contain similar profiles of information. For example, two nominal attributes had similar value distributions, each with only two distinct values; however, they have different semantic meanings in their contexts. Many numeric attributes also coincidentally had very similar numeric distributions.
- DS 36 and 571: The datasets have multiple attributes which coincidentally share quite similar nominal profiles and numeric distributions; we report only the three strongest.
- DS 43 and 1049: Some nominal and numeric attributes with similar profiles. Haberman has a few attributes with very common data profiles (e.g., binary values) which match attributes in larger datasets with more attributes.
- DS 43 and 1511: Some nominal and numeric attributes with similar profiles. Haberman has a few attributes with very common data profiles (e.g., binary values) which match attributes in larger datasets with more attributes.
- DS 43 and 4552: Some nominal and numeric attributes with similar profiles. Haberman has a few attributes with very common data profiles (e.g., binary values) which match attributes in larger datasets with more attributes.
- DS 222 and 1519: Many partially matching attributes pushed this pair above the 0.4 similarity threshold. This is a consequence of choosing a lower threshold level.
- DS 470 and 1095: A few numeric attributes had similar profiles across the datasets.
- DS 568 and 1500: The datasets have multiple attributes which coincidentally share quite similar nominal profiles and numeric distributions; we report only the three strongest.
- DS 1030 and 1054: Some nominal attributes with similar profiles.
- DS 1051 and 1097: The datasets have some attributes which coincidentally have quite similar nominal profiles and numeric distributions.
- DS 1465 and 1478: Many partially matching attributes pushed this pair above the 0.4 similarity threshold. This is a consequence of choosing a lower threshold level.
- DS 1471 and 1476: A few numeric attributes had similar profiles between a small dataset and a large dataset.
- DS 1478 and 1597: A few numeric attributes had similar profiles between a small dataset and a large dataset.
- DS 1511 and 1516: Many partially matching attributes pushed this pair past the 0.4 similarity threshold; this is a consequence of lower threshold levels.
- DS 1524 and 4552: Some nominal and numeric attributes have similar profiles. The matching nominal attributes have the same number of distinct values (binary) and nearly identical distributions of values (50% for each value). Numeric attributes had matching numeric distributions.
- DS 18 and 1495: A few nominal attributes had similar profiles.
- DS 43 and 1462: The datasets have some attributes which coincidentally have quite similar nominal profiles and numeric distributions.
- DS 151 and 1526: A few numeric attributes had similar profiles.
- DS 196 and 230: A few numeric attributes had similar profiles.
- DS 329 and 1511: A few nominal attributes had similar profiles.
- DS 1489 and 1526: A few numeric attributes had similar profiles.
- DS 1499 and 1500: Both datasets store the same data, but they were falsely annotated by the OpenML contributors with different metadata and descriptions. This is a case where our automated techniques can detect such human errors.
- DS 1500 and 1523: The datasets have some attributes which coincidentally have quite similar nominal profiles and numeric distributions.
- DS 1509 and 1525: A few numeric attributes had similar profiles.
- DS 183 and 225: A few numeric attributes had similar profiles.
- DS 205 and 212: The datasets have some attributes which coincidentally have quite similar nominal profiles.
- DS 225 and 1462: The datasets have some attributes which coincidentally have quite similar nominal profiles.
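Many of these false positives stem from one concrete mechanism: two semantically unrelated nominal attributes can present identical data profiles, e.g., the same number of distinct values and near-identical value distributions. The following is a minimal sketch of such a profile comparison; the function names and the 0.05 tolerance are our own illustrative assumptions, not the paper's actual profiling implementation.

```python
from collections import Counter

def nominal_profile(values):
    """Build a simple data profile for a nominal attribute:
    the number of distinct values and their sorted relative frequencies."""
    counts = Counter(values)
    n = len(values)
    return {
        "distinct": len(counts),
        "frequencies": sorted(c / n for c in counts.values()),
    }

def profiles_match(p1, p2, tol=0.05):
    """Two profiles 'match' when they have the same number of distinct
    values and near-identical (sorted) frequency distributions."""
    if p1["distinct"] != p2["distinct"]:
        return False
    return all(abs(a - b) <= tol
               for a, b in zip(p1["frequencies"], p2["frequencies"]))

# Two semantically unrelated binary attributes with 50/50 value splits
# look identical at the profile level -- the source of many of the
# false positives listed above (e.g., DS 1524 and 4552).
survival = nominal_profile(["yes", "no", "yes", "no"])
gender = nominal_profile(["M", "F", "F", "M"])
print(profiles_match(survival, gender))  # True
```

Under value-based profiling alone, such attribute pairs are indistinguishable from genuine matches, which is why they can only be minimised, not eliminated, at a given threshold.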
Conclusion
From our analysis, the majority of the cases in the sample of false positives were true false positives, caused by coincidentally matching data profiles of some attributes across different datasets. As we selected a threshold of 0.4, we expect such false positives, but we try to minimise them as much as possible to improve the efficiency-gain performance of our pre-filtering approach. Our approach was also able to detect human errors in annotating the descriptions of datasets, in the case of datasets 1499 and 1500. Those datasets were incorrectly annotated with different descriptions about different topics, although they contain duplicate data about the same subject and the same instances. Therefore, our approach can support human annotation by detecting such errors and highlighting them to the DL's data wrangler.
3. True Positives
In this section, we present a sample of the true positive cases our proximity-mining-based approach was successful in detecting. Those were pairs of datasets which were positively proposed by our approach and which actually belong to the same topical subject-area. We took a random sample of 30 true positive cases from the total of 469 cases. The actual positives were 550 cases, but not all of them were detected using our techniques at a threshold of T=0.4 for the All-Prox model (a discussion of the missed cases is presented in the false negatives section). Using the sample cases, we demonstrate the underlying workings of our approach by presenting a summary of those pairs and their matching attributes.
We analysed the details of those cases in Table 3. Each row in the table lists a dataset pair which had a proximity score above the threshold of 0.4 and which was manually annotated as belonging to the same topical subject-area. Our goal is to highlight how our approach detected that each pair is related. We present the related attributes which were found.
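To make the filtering mechanics explicit, the sketch below shows how a proximity threshold turns scored dataset pairs into a shortlist, and how recall and efficiency gain follow from that shortlist. The function names and the toy scores are our own illustrative assumptions; the exact metric definitions are given in the paper.

```python
def prefilter(scored_pairs, threshold=0.4):
    """Keep only the dataset pairs whose proximity score passes the
    threshold; only these are proposed for full schema matching."""
    return [pair for pair, score in scored_pairs if score >= threshold]

def recall(proposed, actual_positives):
    """Fraction of truly related pairs that survive pre-filtering."""
    hits = sum(1 for pair in actual_positives if pair in proposed)
    return hits / len(actual_positives)

def efficiency_gain(total_pairs, proposed):
    """Fraction of candidate pairs pruned before expensive matching."""
    return 1 - len(proposed) / total_pairs

# Toy example: three scored pairs, two of which pass the threshold.
scored = [(("ds_a", "ds_b"), 0.7),
          (("ds_a", "ds_c"), 0.2),
          (("ds_b", "ds_c"), 0.5)]
shortlist = prefilter(scored, threshold=0.4)
print(shortlist)  # [('ds_a', 'ds_b'), ('ds_b', 'ds_c')]
```

In our experiment at T=0.4, 469 of the 550 actual positives survived the filter, i.e., a recall of roughly 0.85 under this formulation.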
In Table 3, the columns are as follows:
- DS1 ID: The OpenML dataset unique identifier for accessing the first dataset's description page. The page can be accessed by replacing the 'XXXX' symbols at the end of the URL: https://www.openml.org/d/XXXX.
- DS2 ID: The OpenML unique identifier for the second dataset.
- DS1 and DS2: The names of the first and second dataset in the pair respectively.
- Topic: the common subject-area classification for both datasets in the pair.
- DS1 and DS2 no. att: the number of attributes in each dataset respectively.
- DS1 and DS2 no. nom att: the number of nominal attributes in each dataset.
- DS1 and DS2 no. num att: the number of numeric attributes in each dataset.
- Related Nom Att: A list of nominal attributes from both datasets which have a matching data profile. Each attribute pair starts with a '*' symbol, followed by the attribute from the first dataset, then the '+' symbol, and finally the matching attribute from the second dataset. Each matching pair of attributes follows the same structure and starts on a new line.
- Related Num Att: A list of numeric attributes from both datasets which have a matching data profile. It follows the same structure as the Related Nom Att column.
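For readers who wish to post-process Table 3 programmatically, the pair notation in the two Related Att columns can be parsed mechanically. The sketch below is our own illustration; it assumes the '*'/'+' convention described above, with an optional number before each pair, and a cell of '0' meaning no related attributes were found.

```python
import re

def parse_related_atts(cell):
    """Parse a 'Related Nom/Num Att' cell: each pair starts with '*',
    the two attribute names are joined by '+', and pairs may carry a
    leading number. Returns a list of (ds1_att, ds2_att) tuples."""
    if cell.strip() == "0":
        return []
    pairs = []
    for chunk in cell.split("*")[1:]:
        chunk = re.sub(r"\s\d+\s*$", "", chunk)  # drop a trailing pair number
        if "+" in chunk:
            left, right = chunk.split("+", 1)
            pairs.append((left.strip(), right.strip()))
    return pairs

print(parse_related_atts("1 * ALBUMIN + lvdd 2 * AGE + age"))
# [('ALBUMIN', 'lvdd'), ('AGE', 'age')]
```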
Table 3: A list of randomly selected true-positive dataset pair cases from the All-Prox model in the OpenML experiment.
| DS1 ID | DS2 ID | DS1 | DS2 | Topic | DS1 no. att | DS2 no. att | DS1 no. nom att | DS2 no. nom att | DS1 no. num att | DS2 no. num att | Related Nom Att | Related Num Att |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8 | 483 | liver-disorders | iq brain size | Health Measurements | 7 | 9 | 1 | 3 | 6 | 6 | * Selector + ORDER | 1 * alkphos + WEIGHT2 2 * mcv + FIQ |
| 12 | 20 | mfeat-factors | mfeat-pixel | Digit Handwriting Recognition | 217 | 241 | 3 | 241 | 214 | 0 | * Class + class | 0 |
| 16 | 32 | mfeat-karhunen | pendigits | Digit Handwriting Recognition | 65 | 17 | 1 | 1 | 64 | 16 | * Class + class | 0 |
| 16 | 1042 | mfeat-karhunen | gina prior | Digit Handwriting Recognition | 65 | 785 | 1 | 263 | 64 | 522 | 0 | 0 |
| 22 | 1501 | mfeat-zernike | semeion | Digit Handwriting Recognition | 48 | 257 | 1 | 257 | 47 | 0 | * Class + class | 0 |
| 25 | 1523 | colic | vertebra-column | Disease | 28 | 7 | 20 | 1 | 8 | 6 | * surgical lesion + Class | 1 * pulse + V12 2 * packed cell volume + V4 |
| 55 | 222 | hepatitis | echoMonths | Disease | 20 | 10 | 14 | 3 | 6 | 7 | 1 * ANTIVIRALS + pericardial 2 * SPIDERS + still alive 3 * FATIGUE + alive at 1 | 1 * ALBUMIN + lvdd 2 * AGE + age |
| 150 | 1596 | covertype | covertype | Plantation Measurements | 55 | 55 | 45 | 45 | 10 | 10 | Many with the same name | 0 |
| 183 | 1557 | abalone | abalone | Animal Profile | 9 | 9 | 2 | 2 | 7 | 7 | * Sex + V1 | 1 * Length + V32 2 * Diameter + V23 3 * Shell weight + V84 Length + V2 |
| 187 | 287 | wine | wine quality | Food Nutritional Facts | 14 | 12 | 1 | 1 | 13 | 11 | 0 | 1 * Alcohol + alcohol2 2 * Nonflavanoid phenols + citric.acid 3 * Ash + pH 4 * Color intensity + fixed acidity |
| 224 | 1523 | breastTumor | vertebra-column | Disease | 10 | 7 | 8 | 1 | 2 | 6 | * deg-malig + Class | * class + V2 |
| 307 | 1489 | vowel | phoneme | Sound Measurements | 13 | 6 | 3 | 1 | 10 | 5 | 0 | 1 * Feature 4 + V42 2 * Feature 2 + V5 |
| 329 | 382 | hayes-roth | ipums la 97-small | Citizen Census Data | 5 | 61 | 5 | 61 | 0 | 0 | 0 | 0 |
| 354 | 1567 | poker | poker-hand | Poker Hands | 11 | 11 | 11 | 11 | 0 | 0 | 1 * X9 + V72 2 * X7 + V53 3 * X5 + V9 | 0 |
| 378 | 451 | ipums la 99-small | irish | Citizen Census Data | 61 | 6 | 60 | 4 | 1 | 2 | 1 * Sex + Sex 2 * schltype + Type school | 0 |
| 382 | 534 | ipums la 97-small | cps 85 wages | Citizen Census Data | 61 | 11 | 61 | 7 | 0 | 4 | 1 * workedyr + MARR 2 * SEX + sex 3 * yrlastwk + OCCUPATION | 0 |
| 451 | 1590 | irish | adult | Citizen Census Data | 6 | 15 | 4 | 9 | 2 | 6 | 1 * Sex + sex 2 * Leaving Certificate + class | * Prestige score + age |
| 483 | 1093 | iq brain size | Brainsize | Health Measurements | 9 | 7 | 3 | 1 | 6 | 6 | * SEX + Gender | 1 * FIQ + PIQ 2 * WEIGHT + Weight |
| 483 | 1097 | iq brain size | ICU | Health Measurements | 9 | 21 | 3 | 17 | 6 | 4 | 1 * ORDER + INF 2 * SEX + SEX | * WEIGHT + HRA |
| 570 | 575 | kdd coil 3 | kdd coil 4 | Chemical Contamination | 12 | 12 | 3 | 3 | 9 | 9 | All attributes | All attributes |
| 1030 | 1035 | ERA | ESL | Employee Test Data | 5 | 5 | 5 | 5 | 0 | 0 | * out1 + out1 | 0 |
| 1038 | 1501 | gina agnostic | semeion | Digit Handwriting Recognition | 971 | 257 | 1 | 257 | 970 | 0 | * label + V14 | 0 |
| 1049 | 1071 | pc4 | mw1 | Computer Software Defects | 38 | 38 | 4 | 4 | 34 | 34 | 1 * ESSENTIAL DENSITY + ESSENTIAL DENSITY 2 * c + c 3 * PARAMETER COUNT + PARAMETER COUNT | 1 * DESIGN DENSITY + DESIGN DENSITY 2 * PERCENT COMMENTS + PERCENT COMMENTS 3 * MAINTENANCE SEVERITY + MAINTENANCE SEVERITY |
| 1050 | 1070 | pc3 | kc1-numeric | Computer Software Defects | 38 | 95 | 2 | 27 | 36 | 68 | * c + minHALSTEAD ERROR EST | 1 * PERCENT COMMENTS + avgNUM OPERANDS 2 * LOC COMMENTS + NUMDEFECTS 3 * LOC BLANK + maxLOC BLANK |
| 1053 | 1056 | jm1 | mc1 | Computer Software Defects | 22 | 39 | 1 | 6 | 21 | 33 | * defects + DESIGN DENSITY | 1 * iv(g) + BRANCH COUNT 2 * lOCode + NUM OPERATORS 3 * branchCount + NUMBER OF LINES |
| 1054 | 1056 | mc2 | mc1 | Computer Software Defects | 40 | 39 | 4 | 6 | 36 | 33 | * c + MAINTENANCE SEVERITY | 1 * GLOBAL DATA DENSITY + CYCLOMATIC DENSITY 2 * GLOBAL DATA COMPLEXITY + NUM UNIQUE OPERATORS 3 * NUM UNIQUE OPERANDS + NUM UNIQUE OPERANDS |
| 1054 | 1065 | mc2 | kc3 | Computer Software Defects | 40 | 40 | 4 | 4 | 36 | 36 | Many with the same name | Many with the same name |
| 1497 | 1525 | wall-robot-navigation | wall-robot-navigation | Robot Motion Measurements | 25 | 3 | 1 | 1 | 24 | 2 | * Class + class | 1 * V21 + V1 2 * V18 + V2 |
| 1517 | 1518 | robot-failures-lp2 | robot-failures-lp3 | Robot Motion Measurements | 91 | 91 | 1 | 1 | 90 | 90 | All attributes | All attributes |
| 1518 | 1526 | robot-failures-lp3 | wall-robot-navigation | Robot Motion Measurements | 91 | 5 | 1 | 1 | 90 | 4 | * Class + Class | 0 |
Manual Analysis
We analysed each dataset pair in Table 3 and its underlying proximity metrics and attribute similarities in order to deduce the mechanism that led to the correct detection of the pair for schema matching. The observations from our analysis of each pair are as follows:
- DS 8 and 483: Nominal attributes partially similar; numeric attributes coincidentally matching, with highly similar data profiles.
- DS 12 and 20: Nominal attributes very strongly similar. Dataset-level similarity (not attribute-level similarity) and name-based proximity were the main deciding factors here, as the datasets have similar attribute counts, similar attribute types, and very similar attribute names. This demonstrates the importance of combining both levels of comparison and different types of similarity measures in a single proximity metric.
- DS 16 and 32: Nominal attributes very strongly similar. Maximum aggregation proximity metrics were probably the main deciding factors here.
- DS 16 and 1042: Attributes were weakly similar, but dataset level similarity was the main deciding factor here.
- DS 22 and 1501: Nominal attributes very strongly similar. Dataset-level and name-based metrics were the main deciding factors here, as the attributes follow quite similar naming schemes.
- DS 25 and 1523: Nominal attributes partially similar; numeric attributes coincidentally matching.
- DS 55 and 222: Many nominal attributes which are partially similar; common numeric attributes found, some coincidentally matching. Dataset-level similarity and left-skewed attribute-level aggregations were the main deciding factors here (as those aggregations give more importance to a few very strongly similar pairs of attributes).
- DS 150 and 1596: They are duplicate datasets about the same instances, but with different preprocessing of the numerical attributes. As most attributes store different information depending on the preprocessing, the name-based and dataset-level proximity metrics were the main deciding factors here (attribute-level value-based proximity metrics probably contributed little in the All-Prox proximity model).
- DS 183 and 1557: Those are duplicate datasets, but the second has been cleansed, has undergone preprocessing of its nominal attributes, and has had its attribute names changed to a generic format. It is clear in this case that the positive proximity prediction was mainly due to attribute-level analysis of numeric attributes, with less importance given to name-based distances at the attribute level (dataset-level name proximity of course also contributes highly in this case, as the names are very similar).
- DS 187 and 287: The datasets are not duplicates and represent data from different sources. They were easily detected as related datasets due to the dataset names, the highly similar numeric attributes (using left-skewed and maximum aggregations), and the dataset-level profiles.
- DS 224 and 1523: Partially matching nominal and numeric attributes. Similar dataset-level profiles were the main deciding factor, due to similar numbers of instances and attributes.
- DS 307 and 1489: A combination of attribute level aggregations considering maximum attribute name similarity (where 'class' is a common attribute) and left-skewed aggregations of attribute level data profiles considering the partially matching numeric attributes led to the detection of relatedness between those datasets.
- DS 329 and 382: Although the attributes share no common data profiles, there were still attributes with common names (though with different data), like 'marital_status' and 'marst', which led to high similarity in the maximum attribute-name aggregations. The dataset-level profile also had a higher-than-average similarity.
- DS 354 and 1567: Both are duplicated datasets, but the first is a heavily preprocessed version with changed nominal categorisations and different normalisations of values. They were easily detected as related datasets using their dataset-level proximity metrics and left-skewed attribute-level similarities.
- DS 378 and 451: Very similar nominal attributes with similar data were the main deciding factors.
- DS 382 and 534: Nominal attributes which are similar led to the overall similarity of the datasets.
- DS 451 and 1590: Similar nominal attributes; numeric attributes coincidentally matching.
- DS 483 and 1093: Both datasets contain both nominal and numeric attributes with similar data and names.
- DS 483 and 1097: Nominal attributes have partially similar data profiles; numeric attributes coincidentally matching. The dataset profiles are similar too.
- DS 570 and 575: They are datasets storing the same information about different instances. Our proximity mining algorithm was able to detect that both datasets store related information although they had slight differences in data profiles. Our proposed approximate proximity metrics are able to detect such kind of similarities using a combination of different types of metadata using names or overall dataset level profiles.
- DS 1030 and 1035: Similar dataset level and attribute level data profiles and names.
- DS 1038 and 1501: Although the datasets are quite different in terms of their overall dataset profiles, they still had relative similarity in their overall numbers of attributes and instances. This, alongside the fact that the attribute names are generically numbered using a common naming scheme in both datasets, helped detect their relatedness.
- DS 1049 and 1071: They are datasets storing the same information about different instances. Our proximity mining algorithm was able to detect that both datasets store related information although they had slight differences in data profiles. The approximate proximity metrics are able to detect such kind of similarities.
- DS 1050 and 1070: Multiple similar nominal and numeric attributes.
- DS 1053 and 1056: Multiple similar numeric attributes. Dataset level profile similarity also contributed to the correct detection of relatedness.
- DS 1054 and 1056: Multiple similar nominal and numeric attributes.
- DS 1054 and 1065: They are datasets storing the same information about different instances. Although the datasets had very different data profiles for each attribute, the name-based attribute-level proximity metrics were still able to effectively detect that those datasets hold related data. Left-skewed attribute-level aggregations also effectively detected very high similarity in a few attribute pairs.
- DS 1497 and 1525: Most attributes in the dataset with fewer attributes had highly similar counterparts in the bigger dataset.
- DS 1517 and 1518: Both datasets contain both nominal and numeric attributes with similar data and names.
- DS 1518 and 1526: The attribute-level names were identical between the smaller dataset and the larger dataset. The maximum aggregation for attribute-level similarity scored highly due to the very similar nominal attribute. Many numeric attributes also had slight, but not strong, similarity in their data profiles.
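Several of the observations above hinge on how attribute-level similarities are aggregated: a maximum aggregation lets one very strong pair (e.g., a shared 'class' attribute) dominate, while a left-skewed aggregation emphasises the few strongest pairs. The sketch below illustrates the two behaviours; the top-k mean is our own stand-in for the paper's left-skewed aggregation, not its exact formulation.

```python
def max_aggregation(sims):
    """Maximum aggregation: the single best-matching attribute pair
    determines the score (e.g., a shared 'class' attribute)."""
    return max(sims) if sims else 0.0

def left_skewed_aggregation(sims, top_k=3):
    """Left-skewed aggregation (illustrative top-k mean): emphasise
    the few strongest attribute-pair similarities rather than
    averaging over all pairs."""
    strongest = sorted(sims, reverse=True)[:top_k]
    return sum(strongest) / len(strongest) if strongest else 0.0

# A pair with a few very strong attribute matches scores high under
# both aggregations, even though most attribute pairs are dissimilar.
sims = [0.95, 0.9, 0.85, 0.1, 0.05, 0.0]
print(max_aggregation(sims))                       # 0.95
print(round(left_skewed_aggregation(sims), 2))     # 0.9

# An unweighted mean over all pairs would dilute the signal instead.
print(round(sum(sims) / len(sims), 3))             # 0.475
```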
Conclusion
From our observations, we can conclude that the different types of proximity metrics are complementary and are important for improving the performance of schema matching pre-filtering. Both attribute-level and dataset-level proximity metrics were important deciding factors for detecting related datasets. This was seen in cases where we cannot rely only on attribute-level value-based proximity metrics, including duplicate datasets whose attribute values have been heavily transformed and preprocessed. In such cases, the dataset-level proximity metrics and name-based proximity metrics are the main deciding factors for proposing such pairs of datasets for further schema matching. Name-based proximity metrics in combination with value-based metrics were also both important in deciding whether a pair of datasets is related and should be analysed with schema matching. In some cases, a skewed aggregation proximity metric or a maximum aggregation proximity metric for attribute-level similarities helped detect related datasets. This demonstrates the importance of using different aggregation techniques for attribute-level proximity metrics.
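As a final illustration of this complementarity, the sketch below combines a name-based proximity (here, a normalised string similarity) with value-based and dataset-level components. The equal weights and function names are our own assumptions, not the All-Prox model's actual combination, which is defined in the paper.

```python
from difflib import SequenceMatcher

def name_proximity(name_a, name_b):
    """Name-based proximity: normalised string similarity of names."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

def combined_proximity(name_sim, value_sim, dataset_sim,
                       weights=(1 / 3, 1 / 3, 1 / 3)):
    """Illustrative linear combination of complementary metrics."""
    return (weights[0] * name_sim
            + weights[1] * value_sim
            + weights[2] * dataset_sim)

# Duplicate datasets with heavily transformed values (e.g., DS 150 and
# 1596): value-based similarity is low, but name-based and dataset-level
# similarity keep the pair above the 0.4 pre-filtering threshold.
score = combined_proximity(name_proximity("covertype", "covertype"),
                           value_sim=0.1, dataset_sim=0.8)
print(score >= 0.4)  # True
```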