Introduction

This appendix accompanies the paper "Keeping the Data Lake in Form: Proximity Mining for Pre-filtering Schema Matching". We examine the performance of our proposed schema matching pre-filtering approach using the sample datasets from the OpenML Data Lake (DL). The sample annotated data used in the experiments can be retrieved from: our project's github page.

This webpage is created by the Database Technologies and Information Management Group, UPC, BarcelonaTech and the Department of Computer and Decision Engineering, ULB, Brussels.


Table of Contents:


Appendix: Pre-filtering Experiment Analysis

In this appendix, we examine the cases in our experiments which led to correctly classified true positives or incorrectly classified false negatives or false positives. The incorrectly classified cases will be scrutinised in details to detect the reasons for our approach in failing in such cases, or whether those cases were falsely manually annotated by the contributors to the ground-truth for our experiment. False annotation can occur because the manual annotators depended on the textual descriptions provided by OpenML for each dataset. Those descriptions might describe a specific domain or subject-area, yet pairs of datasets falling under this subject-area might have disjoint information describing it (i.e., they do not store the same information in their attributes). As the annotators did not investigate the attribute metadata and only depended on the descriptions of the datasets, this can lead to the false annotations. The annotators were not allowed to investigate the attribute metadata to eliminate any possibility of bias towards our approach using the same metadata. This should keep the ground-truth independent from our approach leading to better valid results.

We investigate the All-Prox model because it was the best performing proximity model in our experiments. This model utilises all the different types of metadata collected about datasets and their attributes. We select the results from the proximity cut-off threshold of 0.4 because it achieves top performing results in terms of the recall and efficiency-gain trade-off graph in our results (see Fig. 15 in the paper).

We analyse the false negatives and the false positives. The goal of our schema matching pre-filtering approach is to minimise both false negatives and false positives in order to improve recall and precision respectively. We analyse the cases arising as false negatives and whether they are actually true negatives (i.e., have disjoint information) and the cases of false positives which ideally could be still filtered out. Finally, we analyse the cases of true positives and we identify the meta-features which helped our proximity model in correctly identifying those pairs.


1. False Negatives

We took a random sample of 30 false negative cases from the total 81 false negatives. In our experiments we had about 550 actual positive cases. At the threshold of 'T=0.4' for the All-Prox model, 81 cases were not correctly classified as positive cases for schema matching. We analysed the details of those cases in Table 1. Each row in the table lists a dataset pair which did not have a proximity score above the threshold of '0.4' although they were manually annotated to be related datasets under the same topic.

In Table 1, the columns are as follows:

Table 1: A list of randomly selected false-negative dataset pair cases from the All-Prox model in the OpenML experiment.
DS 1 ID DS 2 ID DS1 DS2 Topic att 1 att 2 nom att 1 nom att 2 num att 1 num att 2 Nom Att M Num Att M Rel?
8 1097 liver- disorders ICU Health Measurements 7 21 1 17 6 4 0 0 No
10 1524 lymph vertebra column Disease 19 7 19 1 0 6 0 0 No
18 1038 mfeat- morphological gina_ agnostic Digit Handwriting Recognition 7 971 4 1 3 970 0 0 No
23 329 cmc hayes- roth Citizen Census Data 10 5 9 5 1 0 *Wifes_ education + educational_ level 0 Yes
25 1510 colic wdbc Disease 28 31 20 1 8 30 0 1 *Pulse + V3 abdomcentesis _ tot
2 *al_ protein  + V13
No
34 43 post- operative patient- data haber- man Disease 9 4 9 2 0 2 0 0 No
34 222 post- operative patient- data echo Months Disease 9 10 9 3 0 7 0 0 No
55 1510 hepatitis wdbc Disease 20 31 14 1 6 30 0 *ALK_ PHOSPHATE  + V23 No
150 473 cover type cjs Plantation Measurements 55 35 45 11 10 24 0 0 No
171 1510 primary-tumor wdbc Disease 18 31 18 1 0 30 0 0 No
301 563 ozone_ level kdd_ el_ nino-small Geo-graphical Measurements 73 9 73 2 0 7 0 0 No
307 4552 vowel Bach Choral Harmony Sound Measurements 13 17 3 15 10 2 0 0 No
343 1094 white- clover Acorns Plantation Measurements 32 5 5 2 27 3 0 1 *White- Clover-93  + Tree_ Height
2 *Other Legumes-91  + Acorn_ size
No
375 4552 japanese vowels Bach Choral Harmony Sound Measurements 15 17 1 15 14 2 0 0 No
492 1464 newton_ hema blood- transfusion service- center Health Measurements 4 5 1 1 3 4 0 * cells_ percentage + V4 No
563 1487 kdd_ el_ nino-small ozone- level- 8hr Geo- graphical Measurements 9 73 2 1 7 72 0 0 No
1471 1490 eeg-eye-state planning relax Cardiac EEG Measurements 15 13 1 1 14 12 * Class + Class 0 Yes
1478 1509 har walking-activity Motion patterns 562 5 1 1 561 4 0 0 No
1489 4552 phoneme Bach Choral Harmony Sound Measurements 6 17 1 15 5 2 0 0 No
183 205 abalone sleep Animal Profile 9 8 2 3 7 5 0 0 No
183 232 abalone fishcatch Animal Profile 9 8 2 2 7 6 * Sex  + Sex 1 * Height + Height
2 * Length + Length1
3 * Diameter + Length3
Yes
329 541 hayes- roth socmob Citizen Census Data 5 6 5 4 0 2 0 0 No
451 541 irish socmob Citizen Census Data 6 6 4 4 2 2 0 0 No
483 492 iq_ brain_ size newton_ hema Health Measurements 9 4 3 1 6 3 0 0 No
483 1464 iq_ brain_ size blood- trans- fusion- service-center Health Measurements 9 5 3 1 6 4 0 0 No
492 679 newton_ hema rmftsa_ sleepdata Health Measurements 4 3 1 1 3 2 0 0 No
1483 1509 ldpa walking- activity Motion patterns 8 5 3 1 5 4 * Class + Class 0 No
8 212 liver- disorders diabetes_ numeric Health Measurements 7 3 1 0 6 3 0 0 No
212 492 diabetes_ numeric newton_ hema Health Measurements 3 4 0 1 0 1 0 0 No
212 679 diabetes_ numeric rmftsa_ sleepdata Health Measurements 3 3 0 1 3 2 0 0 No

Manual Analysis

The final decision to decide whether two datasets are actually related or no, which is the information in the 'rel?' column in Table 1, was based on the detailed scrutinisation of the datasets and their attributes by one of the authors. We summarise our commentary which led to our final verdict about each pair in the sample as follows:

Conclusion

From our analysis, the majority of the cases (90% of the cases) in the sample of false negatives were actually not related and therefore are not supposed to be annotated as related datasets by the human annotators in the experiment. Those were falsely annotated cases due to the fact that the human annotators only had the overall dataset descriptions to judge whether they are related. Many datasets were broadly classified under the same topic although they should have been distinguished. Our automated techniques depending on our proposed proximity models are able to identify more accurately such cases and correctly the related pairs due to the different detailed meta-features collected about the datasets and their attributes.


2. False Positives

We took a random sample of 30 false positive cases from the total 172 false positives. In our experiments we had about 550 actual positive cases. At the threshold of 'T=0.4' for the All-Prox model, 172 cases were incorrectly classified as positive cases for schema matching, although they did not belong to the same topical subject-area. We analysed the details of those cases in Table 2. Each row in the table lists a dataset pair which had a proximity score above the threshold of '0.4' although they were manually annotated to be not related datasets (i.e., under different topics).

In Table 2, the columns are as follows:

Table 2: A list of randomly selected false-positive dataset pair cases from the All-Prox model in the OpenML experiment.
DS1 ID DS2 ID DS1 DS2 DS1 Topic DS2 Topic DS 1 no. att DS 2 no. att DS 1 no. nom att DS 2 no. nom att DS 1 no. num att DS 2 no. num att Related Nom Att Related Num Att Rel?
14 1497 mfeat-fourier wall robot navigation Digit Handwriting Recognition Robot Motion Measurements 77 25 1 1 76 24 0 1 * Att74 + V18
2 * Att42 + V1
No
22 1464 mfeat-zernike blood transfusion service center Digit Handwriting Recognition Health Measurements 48 5 1 1 47 4 0 1 * Att35 + V4
2 * Att41 + V1
3 * Att15 + V2
No
24 492 mush-room newton hema Plant profiles Health Measurements 23 4 23 1 0 3 0 0 No
29 1466 credit approval cardio-toco-graphy Credit Card Applications Cardiac EEG Measurements 16 36 10 16 6 20 * A10  + V27 1 * A8 + V6
2 * A2 + V9
No
36 571 segment delta elevators Image Pixel Classification Aircraft Performance 20 7 4 1 16 6 * Short-line-density-5 + diffDiffClb 1 * Exgreen-mean + climb
2 * Rate Exred-mean + Altitude
No
43 1049 haber-man pc4 Disease Comput-er Software Defects 4 38 2 4 2 34 * Survival status + c * Number of positive axillary nodes detected + LOC COMMENTS No
43 1511 haber-man wholesale customers Disease Retail Sales Data 4 9 2 2 2 7 * Survival status + Channel Number of positive axillary nodes detected + V1 No
43 4552 habe-rman Bach Choral Harmony Disease Sound Measurements 4 17 2 15 2 2 1 * Survival status + V9 2 * Patients year of operation + V15 * Age of patient at time of operation + V2 No
222 1519 echo Months robot failures lp4 Disease Robot Motion Measurements 10 91 3 1 7 90 0 * Epss + V34 No
470 1095 profb Cereals Football Matches Food Nutritional Facts 10 16 6 11 4 5 0 1 * Point spread + sugars
2 * Underdog Points + Carbo
No
568 1500 kdd coil 2 seismic-bumps Chemic-al Contamination Seismic Measurements 12 8 3 1 9 7 * river size + Class 1 * Concentration 1 + V7
2 * Concentration 4 + V6
No
1030 1054 ERA mc2 Employ-ee Test Data Comput-er Software Defects 5 40 5 4 0 36 * In2 + CALL PAIRS 0 No
1051 1097 cocomo numeric ICU Software Development Effort Health Measurements 17 21 15 17 2 4 * VEXP + LOC 1 * ACT EFFORT + ID
2 * LOC + AGE
No
1465 1478 breast-tissue har Human Tissue impeda-nce spectrum Motion patterns 10 562 1 1 9 561 * Class + Class 0 No
1471 1476 eeg-eye-state gas-drift Cardiac EEG Measurements Gas sensing statistics 15 129 1 1 14 128 0 * V6 + V105 No
1478 1597 har credit card Motion patterns Credit Card Transactions 562 31 1 1 561 30 0 * V145 + V26 No
1511 1516 whole-sale customers robot failures lp1 Retail Sales Data Robot Motion Measurements 9 91 2 1 7 90 * Region + Class 0 No
1524 4552 vertebra-column Bach- Choral- Harmony Disease Sound Measurements 7 17 1 15 6 2 * Class + V14 * V1 + V2 No
18 1495 mfeat morphol-ogical qualitat-ive bankruptcy Digit Handwriting Recognition Bankru-ptcy Evaluation 7 7 4 7 3 0 * Att1 + V3 0 No
43 1462 haber-man banknote authent-ication Disease Bank-note Recog-nition 4 5 2 1 2 4 * Survival status + Class * Number of positive axillary nodes detected + V3 No
151 1526 electric-ity wall-robot-naviga-tion Product Pricing Robot Motion Measurements 9 5 2 1 7 4 0 * Period + V2 No
196 230 autoMpg machine cpu Car CPU Performance Data 8 7 3 1 5 6 0 1 * Displacement + MYCT
2 * Horsepower – CACH
No
329 1511 hayes-roth whole-sale custom-ers Citizen Census Data Retail Sales Data 5 9 5 2 0 7 * Hobby + Region 0 No
1489 1526 phoneme wall-robot-naviga-tion Sound Measurements Robot Motion Measurements 6 5 1 1 5 4 0 * V5 + V4 No
1499 1500 seeds seismic-bumps Plant Measurements Seismic Measurements 8 8 1 1 7 7 All attributes All attributes Yes
1500 1523 seismic-bumps vertebra-column Seismic Measurements Disease 8 7 1 1 7 6 * Class + Class * V1 + V2 No
1509 1525 walking-activity wall-robot-naviga-tion Motion patterns Robot Motion Measurements 5 3 1 1 4 2 0 * V4 + V2 No
183 225 abalone puma-8NH Animal Profile Robot Motion Measurements 9 9 2 0 7 9 0 1* Height + tau1
* Viscera weight + tau2
3 * Whole weight + thetad2
No
205 212 sleep diabetes numeric Animal Profile Health Measurements 8 3 3 0 5 3 0 * total sleep + age No
225 1462 puma-8NH banknote-authent-ication Robot Motion Measurements Bank-note Recognition 9 5 0 1 9 4 0 1 * Thetadd3 + V3
2 * Thetad1 + V4
3 * Thetadd3 + V2
No

Manual Analysis

The final decision to decide whether two datasets are actually related or not, which is the information in the `rel?' column in Table 2, was based on the detailed scrutinisation of the datasets and their attributes by one of the authors. We summarise our commentary which led to our final verdict about each pair in the sample as follows:

Conclusion

From our analysis, the majority of the cases in the sample of false positives were true false positives which was caused by coincidently matching data profiles of some of the attributes across different datasets. As we have selected a threshold of `0.4' we expect having such cases of false positives but we try to minimise them as much as possible to improve the efficiency gain performance of our pre-filtering approach. Our approach was also able to detect human errors in annotating the descriptions of datasets in the cases of datasets 1499 and 1500. Those datasets were incorrectly annotated different descriptions about different topics although they contain duplicate data about the same subject and having the same instances. Therefore, our approach can actually support human annotations by detecting such errors and highlighting them to the DL's data wrangler.


3. True Positives

In this section, we present a sample of the true positive cases our proximity-mining-based approach was successful in detecting. Those were pairs of datasets which were positively proposed by our approach and were actually belonging to the same topical subject-area. We took a random sample of 30 true positive cases from the total 469 cases. The actual positives were 550 cases but not all of them were detected using our techniques at a threshold of `T=0.4' for the All-Prox model (a discussion of the cases that were missed are presented in the false negatives section. We demonstrate using the sample cases the underlying workings of our approach by presenting the summary of those pairs and their matching attributes.

We analysed the details of those cases in Table 3. Each row in the table lists a dataset pair which had a proximity score above the threshold of $0.4$ and which were manually annotated to be datasets belonging to the same topical subject-area. Our goal is to highlight how our approach detected that each pair were related. We present the related attributes which were found.

In Table 3, the columns are as follows:

Table 3: A list of randomly selected true-positive dataset pair cases from the All-Prox model in the OpenML experiment.
DS1 ID DS2 ID DS1 DS2 Topic DS 1 no. att DS 2 no. att DS1 no. nom att DS2 no. nom att DS1 no. num att DS2 no. num att Related Nom Att Related Num Att
8 483 liver-disorders iq brain size Health Measurements 7 9 1 3 6 6 * Selector + ORDER 1 * alkphos + WEIGHT2
2 * mcv + FIQ
12 20 mfeat-factors mfeat-pixel Digit Handwriting Recognition 217 241 3 241 214 0 * Class + class 0
16 32 mfeat-karhunen pendigits Digit Handwriting Recognition 65 17 1 1 64 16 * Class + class 0
16 1042 mfeat-karhunen gina prior Digit Handwriting Recognition 65 785 1 263 64 522 0 0
22 1501 mfeat-zernike semeion Digit Handwriting Recognition 48 257 1 257 47 0 * Class + class 0
25 1523 colic vertebra-column Disease 28 7 20 1 8 6 * surgical lesion + Class 1 * pulse – V12
2 *packed cell volume + V4
55 222 hepatitis echo-Months Disease 20 10 14 3 6 7 1 * ANTIVIRALS + pericardial
2 * SPIDERS + still alive
3 * FATIGUE + alive at 1
1 * ALBUMIN + lvdd
2 * AGE + age
150 1596 covertype covertype Plantation Measurements 55 55 45 45 10 10 Many with the same name 0
183 1557 abalone abalone Animal Profile 9 9 2 2 7 7 * Sex + V1 1 * Length + V32
2 * Diameter + V23
3 * Shell weight + V84
Length + V2
187 287 wine wine quality Food Nutritional Facts 14 12 1 1 13 11 0 1 * Alcohol + alcohol2
2* Nonflavanoid phenols + citric.acid
3 * Ash + pH
4 * Color intensity + fixed acidity
224 1523 breast-Tumor vertebra-column Disease 10 7 8 1 2 6 * deg-malig + Class * class + V2
307 1489 vowel phoneme Sound Measurements 13 6 3 1 10 5 0 1 * Feature 4 + V42
2 * Feature 2 + V5
329 382 hayes-roth ipums la 97-small Citizen Census Data 5 61 5 61 0 0 0 0
354 1567 poker poker-hand Poker Hands 11 11 11 11 0 0 1 * X9 + V72
2 * X7 + V53
3 * X5 + V9
0
378 451 ipums la 99-small irish Citizen Census Data 61 6 60 4 1 2 1 * Sex + Sex
2 * schltype + Type school
0
382 534 ipums la 97-small cps 85 wages Citizen Census Data 61 11 61 7 0 4 1 * workedyr + MARR
2 * SEX + sex
3 * yrlastwk + OCCUPATION
0
451 1590 irish adult Citizen Census Data 6 15 4 9 2 6 1 * Sex + sex
2 * Leaving Certificate + class
* Prestige score + age
483 1093 iq brain size Brainsize Health Measurements 9 7 3 1 6 6 * SEX + Gender 1 * FIQ + PIQ
2 * WEIGHT + Weight
483 1097 iq brain size ICU Health Measurements 9 21 3 17 6 4 1 * ORDER + INF
2 * SEX + SEX
* WEIGHT + HRA
570 575 kdd coil 3 kdd coil 4 Chemical Contamination 12 12 3 3 9 9 All attributes All attributes
1030 1035 ERA ESL Employee Test Data 5 5 5 5 0 0 * out1 + out1 0
1038 1501 gina agnostic semeion Digit Handwriting Recognition 971 257 1 257 970 0 * label + V14 0
1049 1071 pc4 mw1 Computer Software Defects 38 38 4 4 34 34 1 * ESSENTIAL DENSITY + ESSENTIAL DENSITY
2 * c + c
3 * PARAMETER COUNT + PARAMETER COUNT
1 * DESIGN DENSITY + DESIGN DENSITY
2 * PERCENT COMMENTS + PERCENT COMMENTS
3 * MAINTENANCE SEVERITY + MAINTENANCE SEVERITY
1050 1070 pc3 kc1-numeric Computer Software Defects 38 95 2 27 36 68 * c + minHALSTEAD ERROR EST 1 PERCENT COMMENTS + avgNUM OPER-ANDS
2 * LOC COMMENTS + NUMDEFECTS
3 LOC BLANK + maxLOC BLANK
1053 1056 jm1 mc1 Computer Software Defects 22 39 1 6 21 33 * defects + DESIGN DENSITY 1 * iv(g) + BRANCH COUNT
2 lOCode + NUM OPERATORS
3 * branchCount + NUMBER OF LINES
1054 1056 mc2 mc1 Computer Software Defects 40 39 4 6 36 33 * c + MAINTENANCE SEVERITY 1 * GLOBAL DATA DENSITY + CYCLOMATIC DENSITY
2 * GLOBAL DATA COMPLEXITY + NUM UNIQUE OPERATORS
3 * NUM UNIQUE OPER-ANDS + NUM UNIQUE OPER-ANDS
1054 1065 mc2 kc3 Computer Software Defects 40 40 4 4 36 36 Many with the same name Many with the same name
1497 1525 wall-robot-navi-gation wall-robot-navi-gation Robot Motion Measurements 25 3 1 1 24 2 * Class + class 1 * V21 + V1
2 * V18 + V2
1517 1518 robot-failures-lp2 robot-failures-lp3 Robot Motion Measurements 91 91 1 1 90 90 All attributes All attributes
1518 1526 robot-failures-lp3 wall-robot-navi-gation Robot Motion Measurements 91 5 1 1 90 4 * Class + Class 0

Manual Analysis

We analysed each dataset pairs in Table 3 and their underlying proximity metrics and attribute similarities in order to deduce the mechanism that led to the correct detection of the pair for schema matching. The observations from our analysis of each pair is as follows:

Conclusion

From our observations, we can conclude that different types of proximity metrics are complementary and are important for improving the performance of schema matching pre-filtering. Both attribute level and dataset level proximity metrics were important deciding factors for detecting related datasets. This was seen in cases where we can not only rely on attribute level value-based proximity metrics, including duplicate datasets which have heavily transformed and preprocessed attribute values. In such cases, the dataset level proximity metrics and name-based proximity metrics are the main deciding factors to propose such pairs of datasets for further schema matching. Also, name-based proximity metrics in combination with values-based metrics were both important in deciding whether a pair of datasets are related and should be analysed with schema matching. In some cases, having a skewed aggregation proximity metric or a maximum aggregation proximity metric for attribute level similarities supported in detecting related datasets. This demonstrates the importance of using different aggregation techniques for attribute level proximity metrics.


Return to top