Introduction
This appendix accompanies the paper "Keeping the Data Lake in Form: Proximity Mining for Pre-filtering Schema Matching". We examine the performance of our proposed schema matching pre-filtering approach using the sample datasets from the OpenML Data Lake (DL). The annotated sample data used in the experiments can be retrieved from our project's GitHub page.
This webpage was created by the Database Technologies and Information Management Group, UPC, BarcelonaTech and the Department of Computer and Decision Engineering, ULB, Brussels.
Appendix: Pre-filtering Experiment Analysis
In this appendix, we examine the cases in our experiments which led to correctly classified true positives, or to incorrectly classified false negatives or false positives. The incorrectly classified cases are scrutinised in detail to detect the reasons our approach fails in such cases, or whether those cases were falsely annotated by the contributors to the ground-truth of our experiment. False annotation can occur because the manual annotators depended on the textual descriptions provided by OpenML for each dataset. Those descriptions might describe a specific domain or subject-area, yet pairs of datasets falling under this subject-area might have disjoint information describing it (i.e., they do not store the same information in their attributes). As the annotators did not investigate the attribute metadata and only depended on the descriptions of the datasets, this can lead to false annotations. The annotators were not allowed to investigate the attribute metadata, to eliminate any possibility of bias towards our approach, which uses the same metadata. This keeps the ground-truth independent from our approach, leading to more valid results.
We investigate the All-Prox model because it was the best performing proximity model in our experiments. This model utilises all the different types of metadata collected about datasets and their attributes. We select the results at the proximity cut-off threshold of 0.4 because it achieves the best trade-off between recall and efficiency gain in our results (see Fig. 15 in the paper).
We analyse the false negatives and the false positives. The goal of our schema matching pre-filtering approach is to minimise both false negatives and false positives, in order to improve recall and precision respectively. We analyse the cases arising as false negatives and whether they are actually true negatives (i.e., have disjoint information), and the cases of false positives which ideally could still be filtered out. Finally, we analyse the cases of true positives and identify the meta-features which helped our proximity model correctly identify those pairs.
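To make this setup concrete, the following is a minimal sketch (not the paper's implementation) of how a proximity threshold splits candidate pairs and how recall and precision are then computed. The pair identifiers and scores below are hypothetical:

```python
# Hypothetical proximity scores for dataset pairs (in the experiments the
# scores come from the All-Prox proximity model).
scores = {("ds8", "ds1097"): 0.21, ("ds23", "ds329"): 0.38,
          ("ds183", "ds232"): 0.35, ("ds1499", "ds1500"): 0.93}
# Ground-truth: pairs annotated as related.
truth = {("ds23", "ds329"), ("ds183", "ds232"), ("ds1499", "ds1500")}

def prefilter(pair_scores, threshold=0.4):
    """Keep only pairs whose proximity score reaches the threshold."""
    return {pair for pair, score in pair_scores.items() if score >= threshold}

def recall_precision(predicted, actual):
    tp = len(predicted & actual)        # related pairs correctly kept
    fn = len(actual - predicted)        # related pairs wrongly filtered out
    fp = len(predicted - actual)        # unrelated pairs wrongly kept
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

kept = prefilter(scores, threshold=0.4)
print(kept)                             # only the 0.93 pair survives
print(recall_precision(kept, truth))    # borderline pairs become false negatives
```

Note how the two borderline related pairs scoring 0.35 and 0.38 fall just below the 0.4 cut-off and become false negatives, mirroring the borderline cases discussed below.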
1. False Negatives
We took a random sample of 30 false negative cases from the total of 81 false negatives. In our experiments we had about 550 actual positive cases. At the threshold of `T=0.4' for the All-Prox model, 81 cases were not correctly classified as positive cases for schema matching. We analyse the details of those cases in Table 1. Each row in the table lists a dataset pair which did not have a proximity score above the threshold of 0.4, although the pair was manually annotated as related datasets under the same topic.
In Table 1, the columns are as follows:
- DS 1 ID: The OpenML dataset unique identifier for accessing the first dataset's description page. The page can be accessed by replacing the `X' symbols at the end of the URL: https://www.openml.org/d/XXXX.
- DS 2 ID: The OpenML unique identifier for the second dataset.
- DS1 and DS2: The names of the first and second dataset in the pair respectively.
- Topic: the common subject-area classification for both datasets in the pair.
- att1 and att2: the number of attributes in each dataset respectively.
- nom att 1 and nom att 2: the number of nominal attributes in each dataset.
- num att 1 and num att 2: the number of numeric attributes in each dataset.
- Nom Att M: A list of nominal attributes from both datasets which have a matching data profile. Each attribute pair starts with a `*' symbol, followed by the attribute from the first dataset, then the `+' symbol, and finally the matching attribute from the second dataset. When a dataset pair has several matching attribute pairs, the pairs are numbered and listed in sequence.
- Num Att M: A list of numeric attributes from both datasets which have a matching data profile. It follows the same structure as the Nom Att M column.
- Rel?: the final verdict on whether the pair of datasets really falls under the same topical category or not. This is based on manually analysing the attributes in each dataset, and checking whether each attribute in the first dataset matches an attribute from the other dataset in terms of its `semantics'. Here we consider the actual meaning and description of each attribute and not just the data profiles. A value of `yes' means that both datasets have several semantically related attributes (a couple or more), and a value of `no' means they do not. Note: although a dataset pair might have attributes with matching data profiles, the semantic meaning of the attributes might still differ based on their descriptions. For the result in this column, we also consider the semantic meanings of the attributes in our decision.
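As a rough illustration of what "a matching data profile" means for the numeric attributes above, the sketch below compares two attributes through simple summary statistics with a relative tolerance. The statistics chosen, the tolerance, and the sample values are assumptions for illustration only; the paper's proximity models use a richer set of attribute meta-features.

```python
from statistics import mean, stdev

def profile(values):
    """A toy numeric data profile: a few summary statistics."""
    return {"min": min(values), "max": max(values),
            "mean": mean(values), "std": stdev(values)}

def profiles_match(p1, p2, rel_tol=0.25):
    """True when every statistic of p2 lies within rel_tol of p1's."""
    for key in p1:
        scale = max(abs(p1[key]), abs(p2[key]), 1e-9)
        if abs(p1[key] - p2[key]) / scale > rel_tol:
            return False
    return True

# Illustrative values: two attributes on very different scales do not match.
height_a = [0.09, 0.14, 0.15, 0.23, 0.17]   # e.g. a ratio-scaled height
height_b = [14.0, 22.5, 31.0, 38.2, 44.0]   # e.g. a height in centimetres
print(profiles_match(profile(height_a), profile(height_b)))  # False
```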
Table 1: A list of randomly selected false-negative dataset pair cases from the All-Prox model in the OpenML experiment.
| DS 1 ID | DS 2 ID | DS1 | DS2 | Topic | att 1 | att 2 | nom att 1 | nom att 2 | num att 1 | num att 2 | Nom Att M | Num Att M | Rel? |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8 | 1097 | liver-disorders | ICU | Health Measurements | 7 | 21 | 1 | 17 | 6 | 4 | 0 | 0 | No |
| 10 | 1524 | lymph | vertebra-column | Disease | 19 | 7 | 19 | 1 | 0 | 6 | 0 | 0 | No |
| 18 | 1038 | mfeat-morphological | gina_agnostic | Digit Handwriting Recognition | 7 | 971 | 4 | 1 | 3 | 970 | 0 | 0 | No |
| 23 | 329 | cmc | hayes-roth | Citizen Census Data | 10 | 5 | 9 | 5 | 1 | 0 | *Wifes_education + educational_level | 0 | Yes |
| 25 | 1510 | colic | wdbc | Disease | 28 | 31 | 20 | 1 | 8 | 30 | 0 | 1) *Pulse + V3; 2) *abdomcentesis_total_protein + V13 | No |
| 34 | 43 | postoperative-patient-data | haberman | Disease | 9 | 4 | 9 | 2 | 0 | 2 | 0 | 0 | No |
| 34 | 222 | postoperative-patient-data | echoMonths | Disease | 9 | 10 | 9 | 3 | 0 | 7 | 0 | 0 | No |
| 55 | 1510 | hepatitis | wdbc | Disease | 20 | 31 | 14 | 1 | 6 | 30 | 0 | *ALK_PHOSPHATE + V23 | No |
| 150 | 473 | covertype | cjs | Plantation Measurements | 55 | 35 | 45 | 11 | 10 | 24 | 0 | 0 | No |
| 171 | 1510 | primary-tumor | wdbc | Disease | 18 | 31 | 18 | 1 | 0 | 30 | 0 | 0 | No |
| 301 | 563 | ozone_level | kdd_el_nino-small | Geographical Measurements | 73 | 9 | 73 | 2 | 0 | 7 | 0 | 0 | No |
| 307 | 4552 | vowel | BachChoralHarmony | Sound Measurements | 13 | 17 | 3 | 15 | 10 | 2 | 0 | 0 | No |
| 343 | 1094 | white-clover | Acorns | Plantation Measurements | 32 | 5 | 5 | 2 | 27 | 3 | 0 | 1) *White-Clover-93 + Tree_Height; 2) *Other Legumes-91 + Acorn_size | No |
| 375 | 4552 | JapaneseVowels | BachChoralHarmony | Sound Measurements | 15 | 17 | 1 | 15 | 14 | 2 | 0 | 0 | No |
| 492 | 1464 | newton_hema | blood-transfusion-service-center | Health Measurements | 4 | 5 | 1 | 1 | 3 | 4 | 0 | *cells_percentage + V4 | No |
| 563 | 1487 | kdd_el_nino-small | ozone-level-8hr | Geographical Measurements | 9 | 73 | 2 | 1 | 7 | 72 | 0 | 0 | No |
| 1471 | 1490 | eeg-eye-state | planning-relax | Cardiac EEG Measurements | 15 | 13 | 1 | 1 | 14 | 12 | *Class + Class | 0 | Yes |
| 1478 | 1509 | har | walking-activity | Motion patterns | 562 | 5 | 1 | 1 | 561 | 4 | 0 | 0 | No |
| 1489 | 4552 | phoneme | BachChoralHarmony | Sound Measurements | 6 | 17 | 1 | 15 | 5 | 2 | 0 | 0 | No |
| 183 | 205 | abalone | sleep | Animal Profile | 9 | 8 | 2 | 3 | 7 | 5 | 0 | 0 | No |
| 183 | 232 | abalone | fishcatch | Animal Profile | 9 | 8 | 2 | 2 | 7 | 6 | *Sex + Sex | 1) *Height + Height; 2) *Length + Length1; 3) *Diameter + Length3 | Yes |
| 329 | 541 | hayes-roth | socmob | Citizen Census Data | 5 | 6 | 5 | 4 | 0 | 2 | 0 | 0 | No |
| 451 | 541 | irish | socmob | Citizen Census Data | 6 | 6 | 4 | 4 | 2 | 2 | 0 | 0 | No |
| 483 | 492 | iq_brain_size | newton_hema | Health Measurements | 9 | 4 | 3 | 1 | 6 | 3 | 0 | 0 | No |
| 483 | 1464 | iq_brain_size | blood-transfusion-service-center | Health Measurements | 9 | 5 | 3 | 1 | 6 | 4 | 0 | 0 | No |
| 492 | 679 | newton_hema | rmftsa_sleepdata | Health Measurements | 4 | 3 | 1 | 1 | 3 | 2 | 0 | 0 | No |
| 1483 | 1509 | ldpa | walking-activity | Motion patterns | 8 | 5 | 3 | 1 | 5 | 4 | *Class + Class | 0 | No |
| 8 | 212 | liver-disorders | diabetes_numeric | Health Measurements | 7 | 3 | 1 | 0 | 6 | 3 | 0 | 0 | No |
| 212 | 492 | diabetes_numeric | newton_hema | Health Measurements | 3 | 4 | 0 | 1 | 0 | 1 | 0 | 0 | No |
| 212 | 679 | diabetes_numeric | rmftsa_sleepdata | Health Measurements | 3 | 3 | 0 | 1 | 3 | 2 | 0 | 0 | No |
Manual Analysis
The final decision on whether two datasets are actually related or not, reported in the `Rel?' column in Table 1, was based on detailed scrutiny of the datasets and their attributes by one of the authors. We summarise the commentary which led to our final verdict about each pair in the sample as follows:
- DS 8 and 1097: liver-disorders is about liver disorders diagnosed from blood measurements, while ICU contains non-blood measurements related to intensive care patients, with more attributes about patient demographics. Here the two datasets were too broadly classified together in the same subject-area, as they do not contain related information in their attributes.
- DS 10 and 1524: lymph is about lymphography measurements while vertebra-column has unrelated measurements about orthopaedic patients. Here the two datasets were too broadly classified together in the same subject-area.
- DS 18 and 1038: mfeat-morphological is about handwritten numerals, with 7 attributes mainly describing different aggregate dimensions of handwritten characters. gina_agnostic is about two-digit handwritten numbers instead of single digits, with more than 900 attributes representing digit pixel maps. Both datasets model the same subject-area but using different measurements for different entities.
- DS 23 and 329: cmc is about contraceptive usage by citizens in Indonesia, with demographic and socio-economic characteristics of married couples. hayes-roth is about individual persons' demographics. Only one attribute was in common between the datasets, although many other attributes had similar content but unrelated names. The predicted similarity was in the range of 0.3-0.4, so it was a borderline pair that was filtered out.
- DS 25 and 1510: colic is about surgeries on horses and measurements of horse conditions, while wdbc is about breast cancer and shape and texture measurements. As the second dataset does not have meaningful attribute names, we cannot confirm the semantic relationships between attributes; a few numeric attributes had intersecting data, but their names did not coincide, which reduced the similarity coefficient.
- DS 34 and 43: postoperative-patient-data is about postoperative actions taken on patients after surgery and measures body temperatures, while haberman is about patients who underwent breast cancer surgery and their survival rates. No related attributes with intersecting values exist.
- DS 34 and 222: Postoperative-patient-data is about postoperative actions taken on patients after surgery and measures body temperatures, while echoMonths is about patients who had a heart attack and their survival rates based on heart measurements. No related attributes with intersecting values exist.
- DS 55 and 1510: hepatitis is about hepatitis patients and liver function tests, while wdbc is about breast cancer and shape and texture measurements. One attribute coincidentally has a numeric distribution similar to an attribute in the other dataset.
- DS 150 and 473: covertype is about US forest cover type based on cartographic measurements, while cjs is about tree growth measurements in Canada. Both datasets model the same subject-area but using different measurements for different entities.
- DS 171 and 1510: primary-tumor is about patients with different cancer types and the different symptoms they experience, while wdbc is about breast cancer and shape and texture measurements. Both datasets model the same subject-area but using different measurements for different entities.
- DS 301 and 563: ozone_level is about forecasting ozone days in the environment with climate measurements modelled as categorical data, while kdd_el_nino-small is about meteorological measurements about oceans. Here the two datasets were too broadly classified together in the same subject-area.
- DS 307 and 4552: vowel is about vowel speech recognition which is based on features not described by the dataset creator, while BachChoralHarmony is about music information retrieval and description of musical notes. Here the two datasets were too broadly classified together in the same subject-area.
- DS 343 and 1094: white-clover describes land areas having a specific plant type, and acorns is about a type of fruit and the location of its cultivation. A few attributes coincidentally had numeric distributions similar to attributes in the other dataset, but they do not have the same semantic meaning. Both datasets model different aspects of the same subject-area.
- DS 375 and 4552: JapaneseVowels is about vowel speech recognition using sound samples descriptions, and BachChoralHarmony is about music information retrieval and description of musical notes. Here the two datasets were too broadly classified together in the same subject-area.
- DS 492 and 1464: newton_hema is about cell markup data from cats, and blood-transfusion-service-center is about blood donation measurements. A few attributes coincidentally had numeric distributions similar to attributes in the other dataset, but they do not have the same semantic meaning. Here the two datasets were too broadly classified together in the same subject-area.
- DS 563 and 1487: kdd_el_nino-small is about meteorological measurements about oceans, while ozone-level-8hr is about forecasting ozone days in the environment with climate numerical measurements. Here the two datasets were too broadly classified together in the same subject-area.
- DS 1471 and 1490: eeg-eye-state is about EEG measurements over hundreds of seconds from a headset worn by a single person with eyes open or shut, while planning-relax is about EEG measurements over 5 days which have been transformed using a wavelet packet analysis mechanism, resulting in different attribute ranges. The only common attribute is the nominal class attribute, which models a two-state condition of the subject performing the experiments that produced the recorded measurements. The predicted similarity was in the range of 0.3-0.4, so it was a borderline pair that was filtered out.
- DS 1478 and 1509: har is about measurements of movements using inertial sensors, while walking-activity is about movements collected using accelerometer measurements. The two datasets have different measurements of motion which do not intersect.
- DS 1489 and 4552: phoneme is about recognising whether a sound is oral or nasal, using energy-normalised measurements of harmonics that yield attributes describing the phonemes of speech parts, while BachChoralHarmony is about music information retrieval and the description of musical notes. Here the two datasets were too broadly classified together in the same subject-area.
- DS 183 and 205: abalone is about prediction of the age of abalone shellfish based on its physical characteristics, while sleep is about assessment of mammal animals features and the effect of their sleep time on danger from other predators. No related attributes with intersecting values exist. Here the two datasets were too broadly classified together in the same subject-area.
- DS 183 and 232: abalone is about predicting the age of abalone shellfish based on physical characteristics, while fishcatch is about the physical characteristics of fish from a lake in Finland. Both datasets are about the same subject-area but describe different animals. The problem is that the attributes have different data. For example, the `sex' attribute has three distinct values in one dataset and two in the other. The `height' attribute also has a different range of values: between 0 and 1 for the abalone shellfish, while it ranges between 14 and 44 in the lake fish dataset. Although some attributes had similar names, all attributes had different values. The predicted similarity was in the range of 0.3-0.4, so it was a borderline pair that was filtered out.
- DS 329 and 541: Hayes-roth is about individual persons demographics, and socmob is about families and their work occupations. No related attributes with intersecting values exist. Here the two datasets were too broadly classified together in the same subject-area.
- DS 451 and 541: irish is about children at school and their educational characteristics, and socmob is about families and their work occupations. No related attributes with intersecting values exist. Here the two datasets were too broadly classified together in the same subject-area. Although both datasets might have similar dataset-level meta-features (e.g., number of attributes per type), going to the deeper level of attributes easily reveals that the datasets are unrelated.
- DS 483 and 492: iq_brain_size is about brain measurements and IQ between twins, and newton_hema is about cell markup data from cats. A few attributes coincidentally had numeric distributions similar to attributes in the other dataset, but they do not have the same semantic meaning. Here the two datasets were too broadly classified together in the same subject-area.
- DS 483 and 1464: iq_brain_size is about brain measurements and IQ between twins, and blood-transfusion-service-center is about blood donation measurements. A few attributes coincidentally had numeric distributions similar to attributes in the other dataset, but they do not have the same semantic meaning. Here the two datasets were too broadly classified together in the same subject-area.
- DS 492 and 679: newton_hema is about cell markup data from cats, and rmftsa_sleepdata is about body measurements in babies during different sleeping states. A few attributes coincidentally had numeric distributions similar to attributes in the other dataset, but they do not have the same semantic meaning. Here the two datasets were too broadly classified together in the same subject-area.
- DS 1483 and 1509: ldpa is about measurements of movements using position sensors on parts of the body, while walking-activity is about movements collected using accelerometer measurements. The two datasets have different measurements of motion which do not intersect, except for the `class' attribute, which is about activity type and might be related. We note that although the datasets might have identical attribute names (as attribute names are arbitrary), they do not store the same information in their attributes; that is why it is important to combine name-based proximity metrics with value-based metrics. Here the two datasets were too broadly classified together in the same subject-area.
- DS 8 and 212: liver-disorders is about liver disorders based on blood measurements, and diabetes_numeric is about measurements of diabetic c-peptide serum in children. Here the two datasets were too broadly classified together in the same subject-area.
- DS 212 and 492: diabetes_numeric is about measurements of diabetic c-peptide serum in children, and newton_hema is about cell markup data from cats. Here the two datasets were too broadly classified together in the same subject-area.
- DS 212 and 679: diabetes_numeric is about measurements of diabetic c-peptide serum in children, and rmftsa_sleepdata is about body measurements in babies during different sleeping states. Here the two datasets were too broadly classified together in the same subject-area.
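The ldpa / walking-activity case above illustrates why name-based proximity alone can mislead: two attributes can share the name `Class` yet store disjoint values. The sketch below combines a name-based metric with a value-based one; the weights, the Jaccard choice, and the sample values are illustrative assumptions, not the paper's exact metrics.

```python
from difflib import SequenceMatcher

def name_similarity(name_a, name_b):
    """Name-based proximity: string similarity of the attribute names."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

def value_similarity(vals_a, vals_b):
    """Value-based proximity: Jaccard overlap of distinct nominal values."""
    sa, sb = set(vals_a), set(vals_b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def combined_similarity(name_a, vals_a, name_b, vals_b,
                        w_name=0.5, w_value=0.5):
    """Weighted combination of name-based and value-based proximity."""
    return (w_name * name_similarity(name_a, name_b)
            + w_value * value_similarity(vals_a, vals_b))

# Identical names but disjoint activity codes: the combined score halves.
ldpa_vals = ["walking", "lying", "sitting"]      # hypothetical values
walk_vals = ["0", "1", "2", "3"]                 # hypothetical values
print(name_similarity("Class", "Class"))                            # 1.0
print(combined_similarity("Class", ldpa_vals, "Class", walk_vals))  # 0.5
```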
Conclusion
From our analysis, the majority of the cases (90%) in the sample of false negatives were actually not related and therefore should not have been annotated as related datasets by the human annotators in the experiment. Those were falsely annotated cases, due to the fact that the human annotators only had the overall dataset descriptions to judge relatedness. Many datasets were broadly classified under the same topic although they should have been distinguished. Our automated techniques, based on our proposed proximity models, can identify such cases more accurately and correctly retain the related pairs, thanks to the detailed meta-features collected about the datasets and their attributes.
2. False Positives
We took a random sample of 30 false positive cases from the total of 172 false positives. In our experiments we had about 550 actual positive cases. At the threshold of `T=0.4' for the All-Prox model, 172 cases were incorrectly classified as positive cases for schema matching, although they did not belong to the same topical subject-area. We analyse the details of those cases in Table 2. Each row in the table lists a dataset pair which had a proximity score above the threshold of 0.4, although the pair was manually annotated as not related (i.e., under different topics).
In Table 2, the columns are as follows:
- DS1 ID: The OpenML dataset unique identifier for accessing the first dataset's description page. This could be accessed by replacing the `X' symbols at the end of the URL: https://www.openml.org/d/XXXX.
- DS2 ID: The OpenML unique identifier for the second dataset.
- DS1 and DS2: The names of the first and second dataset in the pair respectively.
- DS1 Topic: the subject-area classification for the first dataset in the pair.
- DS2 Topic: the subject-area classification for the second dataset in the pair.
- DS 1 no. att and DS 2 no. att: the number of attributes in each dataset respectively.
- DS 1 no. nom att and DS 2 no. nom att: the number of nominal attributes in each dataset.
- DS 1 no. num att and DS 2 no. num att: the number of numeric attributes in each dataset.
- Related Nom Att: A list of nominal attributes from both datasets which have a matching data profile. Each attribute pair starts with a `*' symbol, followed by the attribute from the first dataset, then the `+' symbol, and finally the matching attribute from the second dataset. When a dataset pair has several matching attribute pairs, the pairs are numbered and listed in sequence.
- Related Num Att: A list of numeric attributes from both datasets which have a matching data profile. It follows the same structure as the Nom Att M column.
- Rel?: the final verdict on whether the pair of datasets really falls under the same topical category or not. This is based on manually analysing the attributes in each dataset, and checking whether each attribute in the first dataset matches an attribute from the other dataset in terms of its `semantics'. Here we consider the actual meaning and description of each attribute and not just the data profiles. A value of `yes' means that both datasets have several semantically related attributes (a couple or more), and a value of `no' means they do not. Note: although a dataset pair might have attributes with matching data profiles, the semantic meaning of the attributes might still differ based on their descriptions.
Table 2: A list of randomly selected false-positive dataset pair cases from the All-Prox model in the OpenML experiment.
| DS1 ID | DS2 ID | DS1 | DS2 | DS1 Topic | DS2 Topic | DS 1 no. att | DS 2 no. att | DS 1 no. nom att | DS 2 no. nom att | DS 1 no. num att | DS 2 no. num att | Related Nom Att | Related Num Att | Rel? |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 14 | 1497 | mfeat-fourier | wall-robot-navigation | Digit Handwriting Recognition | Robot Motion Measurements | 77 | 25 | 1 | 1 | 76 | 24 | 0 | 1) *Att74 + V18; 2) *Att42 + V1 | No |
| 22 | 1464 | mfeat-zernike | blood-transfusion-service-center | Digit Handwriting Recognition | Health Measurements | 48 | 5 | 1 | 1 | 47 | 4 | 0 | 1) *Att35 + V4; 2) *Att41 + V1; 3) *Att15 + V2 | No |
| 24 | 492 | mushroom | newton_hema | Plant profiles | Health Measurements | 23 | 4 | 23 | 1 | 0 | 3 | 0 | 0 | No |
| 29 | 1466 | credit-approval | cardiotocography | Credit Card Applications | Cardiac EEG Measurements | 16 | 36 | 10 | 16 | 6 | 20 | *A10 + V27 | 1) *A8 + V6; 2) *A2 + V9 | No |
| 36 | 571 | segment | delta_elevators | Image Pixel Classification | Aircraft Performance | 20 | 7 | 4 | 1 | 16 | 6 | *Short-line-density-5 + diffDiffClb | 1) *Exgreen-mean + climbRate; 2) *Exred-mean + Altitude | No |
| 43 | 1049 | haberman | pc4 | Disease | Computer Software Defects | 4 | 38 | 2 | 4 | 2 | 34 | *Survival status + c | *Number of positive axillary nodes detected + LOC_COMMENTS | No |
| 43 | 1511 | haberman | wholesale-customers | Disease | Retail Sales Data | 4 | 9 | 2 | 2 | 2 | 7 | *Survival status + Channel | *Number of positive axillary nodes detected + V1 | No |
| 43 | 4552 | haberman | BachChoralHarmony | Disease | Sound Measurements | 4 | 17 | 2 | 15 | 2 | 2 | 1) *Survival status + V9; 2) *Patients year of operation + V15 | *Age of patient at time of operation + V2 | No |
| 222 | 1519 | echoMonths | robot-failures-lp4 | Disease | Robot Motion Measurements | 10 | 91 | 3 | 1 | 7 | 90 | 0 | *Epss + V34 | No |
| 470 | 1095 | profb | Cereals | Football Matches | Food Nutritional Facts | 10 | 16 | 6 | 11 | 4 | 5 | 0 | 1) *Point spread + sugars; 2) *Underdog Points + Carbo | No |
| 568 | 1500 | kdd_coil_2 | seismic-bumps | Chemical Contamination | Seismic Measurements | 12 | 8 | 3 | 1 | 9 | 7 | *river size + Class | 1) *Concentration 1 + V7; 2) *Concentration 4 + V6 | No |
| 1030 | 1054 | ERA | mc2 | Employee Test Data | Computer Software Defects | 5 | 40 | 5 | 4 | 0 | 36 | *In2 + CALL_PAIRS | 0 | No |
| 1051 | 1097 | cocomonumeric | ICU | Software Development Effort | Health Measurements | 17 | 21 | 15 | 17 | 2 | 4 | *VEXP + LOC | 1) *ACT_EFFORT + ID; 2) *LOC + AGE | No |
| 1465 | 1478 | breast-tissue | har | Human Tissue Impedance Spectrum | Motion patterns | 10 | 562 | 1 | 1 | 9 | 561 | *Class + Class | 0 | No |
| 1471 | 1476 | eeg-eye-state | gas-drift | Cardiac EEG Measurements | Gas Sensing Statistics | 15 | 129 | 1 | 1 | 14 | 128 | 0 | *V6 + V105 | No |
| 1478 | 1597 | har | creditcard | Motion patterns | Credit Card Transactions | 562 | 31 | 1 | 1 | 561 | 30 | 0 | *V145 + V26 | No |
| 1511 | 1516 | wholesale-customers | robot-failures-lp1 | Retail Sales Data | Robot Motion Measurements | 9 | 91 | 2 | 1 | 7 | 90 | *Region + Class | 0 | No |
| 1524 | 4552 | vertebra-column | BachChoralHarmony | Disease | Sound Measurements | 7 | 17 | 1 | 15 | 6 | 2 | *Class + V14 | *V1 + V2 | No |
| 18 | 1495 | mfeat-morphological | qualitative-bankruptcy | Digit Handwriting Recognition | Bankruptcy Evaluation | 7 | 7 | 4 | 7 | 3 | 0 | *Att1 + V3 | 0 | No |
| 43 | 1462 | haberman | banknote-authentication | Disease | Banknote Recognition | 4 | 5 | 2 | 1 | 2 | 4 | *Survival status + Class | *Number of positive axillary nodes detected + V3 | No |
| 151 | 1526 | electricity | wall-robot-navigation | Product Pricing | Robot Motion Measurements | 9 | 5 | 2 | 1 | 7 | 4 | 0 | *Period + V2 | No |
| 196 | 230 | autoMpg | machine_cpu | Car | CPU Performance Data | 8 | 7 | 3 | 1 | 5 | 6 | 0 | 1) *Displacement + MYCT; 2) *Horsepower + CACH | No |
| 329 | 1511 | hayes-roth | wholesale-customers | Citizen Census Data | Retail Sales Data | 5 | 9 | 5 | 2 | 0 | 7 | *Hobby + Region | 0 | No |
| 1489 | 1526 | phoneme | wall-robot-navigation | Sound Measurements | Robot Motion Measurements | 6 | 5 | 1 | 1 | 5 | 4 | 0 | *V5 + V4 | No |
| 1499 | 1500 | seeds | seismic-bumps | Plant Measurements | Seismic Measurements | 8 | 8 | 1 | 1 | 7 | 7 | All attributes | All attributes | Yes |
| 1500 | 1523 | seismic-bumps | vertebra-column | Seismic Measurements | Disease | 8 | 7 | 1 | 1 | 7 | 6 | *Class + Class | *V1 + V2 | No |
| 1509 | 1525 | walking-activity | wall-robot-navigation | Motion patterns | Robot Motion Measurements | 5 | 3 | 1 | 1 | 4 | 2 | 0 | *V4 + V2 | No |
| 183 | 225 | abalone | puma8NH | Animal Profile | Robot Motion Measurements | 9 | 9 | 2 | 0 | 7 | 9 | 0 | 1) *Height + tau1; 2) *Viscera weight + tau2; 3) *Whole weight + thetad2 | No |
| 205 | 212 | sleep | diabetes_numeric | Animal Profile | Health Measurements | 8 | 3 | 3 | 0 | 5 | 3 | 0 | *total sleep + age | No |
| 225 | 1462 | puma8NH | banknote-authentication | Robot Motion Measurements | Banknote Recognition | 9 | 5 | 0 | 1 | 9 | 4 | 0 | 1) *Thetadd3 + V3; 2) *Thetad1 + V4; 3) *Thetadd3 + V2 | No |
Manual Analysis
The final decision on whether two datasets are actually related or not, reported in the `Rel?' column in Table 2, was based on detailed scrutiny of the datasets and their attributes by one of the authors. We summarise the commentary which led to our final verdict about each pair in the sample as follows:
- DS 14 and 1497: The datasets have multiple attributes in common with quite similar numeric distributions; we report only the two strongest.
- DS 22 and 1464: The second, smaller dataset coincidentally has most of its attributes matching numeric attributes from the much larger first dataset.
- DS 24 and 492: Many partially matching attributes pushed this pair above the 0.4 similarity threshold. This is a consequence of choosing a lower threshold level.
- DS 29 and 1466: The attributes coincidentally contain similar profiles of information. For example, two nominal attributes had similar value distributions, each with only two distinct values; however, they have different semantic meanings in their contexts. Many numeric attributes also coincidentally had very similar numeric distributions.
- DS 36 and 571: The datasets have multiple attributes which coincidentally share quite similar nominal profiles and numeric distributions; we report only the three strongest.
- DS 43 and 1049: Some nominal and numeric attributes with similar profiles. Haberman has a few attributes with very common data profiles (e.g., binary values) which match attributes in larger datasets with more attributes.
- DS 43 and 1511: Some nominal and numeric attributes with similar profiles. Haberman has a few attributes with very common data profiles (e.g., binary values) which match attributes in larger datasets with more attributes.
- DS 43 and 4552: Some nominal and numeric attributes with similar profiles. Haberman has a few attributes with very common data profiles (e.g., binary values) which match attributes in larger datasets with more attributes.
- DS 222 and 1519: Many partially matching attributes pushed this pair above the 0.4 similarity threshold. This is a consequence of choosing a lower threshold level.
- DS 470 and 1095: A few numeric attributes had similar profiles across the datasets.
- DS 568 and 1500: The datasets have multiple attributes which coincidentally share quite similar nominal profiles and numeric distributions; we report only the three strongest.
- DS 1030 and 1054: Some nominal attributes with similar profiles.
- DS 1051 and 1097: The datasets have some attributes which coincidentally have quite similar nominal profiles and numeric distributions.
- DS 1465 and 1478: Many partially matching attributes pushed this pair above the 0.4 similarity threshold. This is a consequence of choosing a lower threshold level.
- DS 1471 and 1476: A few numeric attributes had similar profiles between a small dataset and a large dataset.
- DS 1478 and 1597: A few numeric attributes had similar profiles between a small dataset and a large dataset.
- DS 1511 and 1516: Many partially matching attributes pushed this pair past the 0.4 similarity threshold; this is a consequence of lower threshold levels.
- DS 1524 and 4552: Some nominal and numeric attributes have similar profiles. The matching nominal attributes have the same number of distinct values (binary) and nearly identical distributions of values (50% for each value). Numeric attributes had matching numeric distributions.
- DS 18 and 1495: A few nominal attributes had similar profiles.
- DS 43 and 1462: The datasets have some attributes which coincidentally have quite similar nominal profiles and numeric distributions.
- DS 151 and 1526: A few numeric attributes had similar profiles.
- DS 196 and 230: A few numeric attributes had similar profiles.
- DS 329 and 1511: A few nominal attributes had similar profiles.
- DS 1489 and 1526: A few numeric attributes had similar profiles.
- DS 1499 and 1500: Both datasets store the same data, but they were falsely annotated by the OpenML contributors with different metadata and descriptions. This is a case where our automated techniques can detect such human errors.
- DS 1500 and 1523: The datasets have some attributes which coincidentally have quite similar nominal profiles and numeric distributions.
- DS 1509 and 1525: A few numeric attributes had similar profiles.
- DS 183 and 225: A few numeric attributes had similar profiles.
- DS 205 and 212: The datasets have some attributes which coincidentally have quite similar nominal profiles.
- DS 225 and 1462: The datasets have some attributes which coincidentally have quite similar nominal profiles.
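Many of these false positives stem from one concrete mechanism: two semantically unrelated nominal attributes can present identical data profiles, e.g., the same number of distinct values and near-identical value distributions. The following is a minimal sketch of such a profile comparison; the function names and the 0.05 tolerance are our own illustrative assumptions, not the paper's actual profiling implementation.

```python
from collections import Counter

def nominal_profile(values):
    """Build a simple data profile for a nominal attribute:
    the number of distinct values and their sorted relative frequencies."""
    counts = Counter(values)
    n = len(values)
    return {
        "distinct": len(counts),
        "frequencies": sorted(c / n for c in counts.values()),
    }

def profiles_match(p1, p2, tol=0.05):
    """Two profiles 'match' when they have the same number of distinct
    values and near-identical (sorted) frequency distributions."""
    if p1["distinct"] != p2["distinct"]:
        return False
    return all(abs(a - b) <= tol
               for a, b in zip(p1["frequencies"], p2["frequencies"]))

# Two semantically unrelated binary attributes with 50/50 value splits
# look identical at the profile level -- the source of many of the
# false positives listed above (e.g., DS 1524 and 4552).
survival = nominal_profile(["yes", "no", "yes", "no"])
gender = nominal_profile(["M", "F", "F", "M"])
print(profiles_match(survival, gender))  # True
```

Under value-based profiling alone, such attribute pairs are indistinguishable from genuine matches, which is why they can only be minimised, not eliminated, at a given threshold.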
Conclusion
From our analysis, the majority of the cases in the sample of false positives were true false positives, caused by coincidentally matching data profiles of some attributes across different datasets. As we selected a threshold of 0.4, we expect such false positives, but we try to minimise them as much as possible to improve the efficiency-gain performance of our pre-filtering approach. Our approach was also able to detect human errors in annotating the descriptions of datasets, in the case of datasets 1499 and 1500. Those datasets were incorrectly annotated with different descriptions about different topics, although they contain duplicate data about the same subject and the same instances. Therefore, our approach can support human annotation by detecting such errors and highlighting them to the DL's data wrangler.
3. True Positives
In this section, we present a sample of the true positive cases our proximity-mining-based approach was successful in detecting. Those were pairs of datasets which were positively proposed by our approach and which actually belong to the same topical subject-area. We took a random sample of 30 true positive cases from the total of 469 cases. The actual positives were 550 cases, but not all of them were detected using our techniques at a threshold of T=0.4 for the All-Prox model (a discussion of the missed cases is presented in the false negatives section). Using the sample cases, we demonstrate the underlying workings of our approach by presenting a summary of those pairs and their matching attributes.
We analysed the details of those cases in Table 3. Each row in the table lists a dataset pair which had a proximity score above the threshold of 0.4 and which was manually annotated as belonging to the same topical subject-area. Our goal is to highlight how our approach detected that each pair is related. We present the related attributes which were found.
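To make the filtering mechanics explicit, the sketch below shows how a proximity threshold turns scored dataset pairs into a shortlist, and how recall and efficiency gain follow from that shortlist. The function names and the toy scores are our own illustrative assumptions; the exact metric definitions are given in the paper.

```python
def prefilter(scored_pairs, threshold=0.4):
    """Keep only the dataset pairs whose proximity score passes the
    threshold; only these are proposed for full schema matching."""
    return [pair for pair, score in scored_pairs if score >= threshold]

def recall(proposed, actual_positives):
    """Fraction of truly related pairs that survive pre-filtering."""
    hits = sum(1 for pair in actual_positives if pair in proposed)
    return hits / len(actual_positives)

def efficiency_gain(total_pairs, proposed):
    """Fraction of candidate pairs pruned before expensive matching."""
    return 1 - len(proposed) / total_pairs

# Toy example: three scored pairs, two of which pass the threshold.
scored = [(("ds_a", "ds_b"), 0.7),
          (("ds_a", "ds_c"), 0.2),
          (("ds_b", "ds_c"), 0.5)]
shortlist = prefilter(scored, threshold=0.4)
print(shortlist)  # [('ds_a', 'ds_b'), ('ds_b', 'ds_c')]
```

In our experiment at T=0.4, 469 of the 550 actual positives survived the filter, i.e., a recall of roughly 0.85 under this formulation.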
In Table 3, the columns are as follows:
- DS1 ID: The OpenML dataset unique identifier for accessing the first dataset's description page. The page can be accessed by replacing the 'XXXX' symbols at the end of the URL: https://www.openml.org/d/XXXX.
- DS2 ID: The OpenML unique identifier for the second dataset.
- DS1 and DS2: The names of the first and second dataset in the pair respectively.
- Topic: the common subject-area classification for both datasets in the pair.
- DS1 and DS2 no. att: the number of attributes in each dataset respectively.
- DS1 and DS2 no. nom att: the number of nominal attributes in each dataset.
- DS1 and DS2 no. num att: the number of numeric attributes in each dataset.
- Related Nom Att: A list of nominal attributes from both datasets which have a matching data profile. Each attribute pair starts with a '*' symbol, followed by the attribute from the first dataset, then the '+' symbol, and finally the matching attribute from the second dataset. Each matching pair of attributes follows the same structure and starts on a new line.
- Related Num Att: A list of numeric attributes from both datasets which have a matching data profile. It follows the same structure as the Related Nom Att column.
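For readers who wish to post-process Table 3 programmatically, the pair notation in the two Related Att columns can be parsed mechanically. The sketch below is our own illustration; it assumes the '*'/'+' convention described above, with an optional number before each pair, and a cell of '0' meaning no related attributes were found.

```python
import re

def parse_related_atts(cell):
    """Parse a 'Related Nom/Num Att' cell: each pair starts with '*',
    the two attribute names are joined by '+', and pairs may carry a
    leading number. Returns a list of (ds1_att, ds2_att) tuples."""
    if cell.strip() == "0":
        return []
    pairs = []
    for chunk in cell.split("*")[1:]:
        chunk = re.sub(r"\s\d+\s*$", "", chunk)  # drop a trailing pair number
        if "+" in chunk:
            left, right = chunk.split("+", 1)
            pairs.append((left.strip(), right.strip()))
    return pairs

print(parse_related_atts("1 * ALBUMIN + lvdd 2 * AGE + age"))
# [('ALBUMIN', 'lvdd'), ('AGE', 'age')]
```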
Table 3: A list of randomly selected true-positive dataset pair cases from the All-Prox model in the OpenML experiment.
| DS1 ID | DS2 ID | DS1 | DS2 | Topic | DS1 no. att | DS2 no. att | DS1 no. nom att | DS2 no. nom att | DS1 no. num att | DS2 no. num att | Related Nom Att | Related Num Att |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8 | 483 | liver-disorders | iq brain size | Health Measurements | 7 | 9 | 1 | 3 | 6 | 6 | * Selector + ORDER | 1 * alkphos + WEIGHT2 2 * mcv + FIQ |
| 12 | 20 | mfeat-factors | mfeat-pixel | Digit Handwriting Recognition | 217 | 241 | 3 | 241 | 214 | 0 | * Class + class | 0 |
| 16 | 32 | mfeat-karhunen | pendigits | Digit Handwriting Recognition | 65 | 17 | 1 | 1 | 64 | 16 | * Class + class | 0 |
| 16 | 1042 | mfeat-karhunen | gina prior | Digit Handwriting Recognition | 65 | 785 | 1 | 263 | 64 | 522 | 0 | 0 |
| 22 | 1501 | mfeat-zernike | semeion | Digit Handwriting Recognition | 48 | 257 | 1 | 257 | 47 | 0 | * Class + class | 0 |
| 25 | 1523 | colic | vertebra-column | Disease | 28 | 7 | 20 | 1 | 8 | 6 | * surgical lesion + Class | 1 * pulse + V12 2 * packed cell volume + V4 |
| 55 | 222 | hepatitis | echoMonths | Disease | 20 | 10 | 14 | 3 | 6 | 7 | 1 * ANTIVIRALS + pericardial 2 * SPIDERS + still alive 3 * FATIGUE + alive at 1 | 1 * ALBUMIN + lvdd 2 * AGE + age |
| 150 | 1596 | covertype | covertype | Plantation Measurements | 55 | 55 | 45 | 45 | 10 | 10 | Many with the same name | 0 |
| 183 | 1557 | abalone | abalone | Animal Profile | 9 | 9 | 2 | 2 | 7 | 7 | * Sex + V1 | 1 * Length + V32 2 * Diameter + V23 3 * Shell weight + V84 Length + V2 |
| 187 | 287 | wine | wine quality | Food Nutritional Facts | 14 | 12 | 1 | 1 | 13 | 11 | 0 | 1 * Alcohol + alcohol2 2 * Nonflavanoid phenols + citric.acid 3 * Ash + pH 4 * Color intensity + fixed acidity |
| 224 | 1523 | breastTumor | vertebra-column | Disease | 10 | 7 | 8 | 1 | 2 | 6 | * deg-malig + Class | * class + V2 |
| 307 | 1489 | vowel | phoneme | Sound Measurements | 13 | 6 | 3 | 1 | 10 | 5 | 0 | 1 * Feature 4 + V42 2 * Feature 2 + V5 |
| 329 | 382 | hayes-roth | ipums la 97-small | Citizen Census Data | 5 | 61 | 5 | 61 | 0 | 0 | 0 | 0 |
| 354 | 1567 | poker | poker-hand | Poker Hands | 11 | 11 | 11 | 11 | 0 | 0 | 1 * X9 + V72 2 * X7 + V53 3 * X5 + V9 | 0 |
| 378 | 451 | ipums la 99-small | irish | Citizen Census Data | 61 | 6 | 60 | 4 | 1 | 2 | 1 * Sex + Sex 2 * schltype + Type school | 0 |
| 382 | 534 | ipums la 97-small | cps 85 wages | Citizen Census Data | 61 | 11 | 61 | 7 | 0 | 4 | 1 * workedyr + MARR 2 * SEX + sex 3 * yrlastwk + OCCUPATION | 0 |
| 451 | 1590 | irish | adult | Citizen Census Data | 6 | 15 | 4 | 9 | 2 | 6 | 1 * Sex + sex 2 * Leaving Certificate + class | * Prestige score + age |
| 483 | 1093 | iq brain size | Brainsize | Health Measurements | 9 | 7 | 3 | 1 | 6 | 6 | * SEX + Gender | 1 * FIQ + PIQ 2 * WEIGHT + Weight |
| 483 | 1097 | iq brain size | ICU | Health Measurements | 9 | 21 | 3 | 17 | 6 | 4 | 1 * ORDER + INF 2 * SEX + SEX | * WEIGHT + HRA |
| 570 | 575 | kdd coil 3 | kdd coil 4 | Chemical Contamination | 12 | 12 | 3 | 3 | 9 | 9 | All attributes | All attributes |
| 1030 | 1035 | ERA | ESL | Employee Test Data | 5 | 5 | 5 | 5 | 0 | 0 | * out1 + out1 | 0 |
| 1038 | 1501 | gina agnostic | semeion | Digit Handwriting Recognition | 971 | 257 | 1 | 257 | 970 | 0 | * label + V14 | 0 |
| 1049 | 1071 | pc4 | mw1 | Computer Software Defects | 38 | 38 | 4 | 4 | 34 | 34 | 1 * ESSENTIAL DENSITY + ESSENTIAL DENSITY 2 * c + c 3 * PARAMETER COUNT + PARAMETER COUNT | 1 * DESIGN DENSITY + DESIGN DENSITY 2 * PERCENT COMMENTS + PERCENT COMMENTS 3 * MAINTENANCE SEVERITY + MAINTENANCE SEVERITY |
| 1050 | 1070 | pc3 | kc1-numeric | Computer Software Defects | 38 | 95 | 2 | 27 | 36 | 68 | * c + minHALSTEAD ERROR EST | 1 * PERCENT COMMENTS + avgNUM OPERANDS 2 * LOC COMMENTS + NUMDEFECTS 3 * LOC BLANK + maxLOC BLANK |
| 1053 | 1056 | jm1 | mc1 | Computer Software Defects | 22 | 39 | 1 | 6 | 21 | 33 | * defects + DESIGN DENSITY | 1 * iv(g) + BRANCH COUNT 2 * lOCode + NUM OPERATORS 3 * branchCount + NUMBER OF LINES |
| 1054 | 1056 | mc2 | mc1 | Computer Software Defects | 40 | 39 | 4 | 6 | 36 | 33 | * c + MAINTENANCE SEVERITY | 1 * GLOBAL DATA DENSITY + CYCLOMATIC DENSITY 2 * GLOBAL DATA COMPLEXITY + NUM UNIQUE OPERATORS 3 * NUM UNIQUE OPERANDS + NUM UNIQUE OPERANDS |
| 1054 | 1065 | mc2 | kc3 | Computer Software Defects | 40 | 40 | 4 | 4 | 36 | 36 | Many with the same name | Many with the same name |
| 1497 | 1525 | wall-robot-navigation | wall-robot-navigation | Robot Motion Measurements | 25 | 3 | 1 | 1 | 24 | 2 | * Class + class | 1 * V21 + V1 2 * V18 + V2 |
| 1517 | 1518 | robot-failures-lp2 | robot-failures-lp3 | Robot Motion Measurements | 91 | 91 | 1 | 1 | 90 | 90 | All attributes | All attributes |
| 1518 | 1526 | robot-failures-lp3 | wall-robot-navigation | Robot Motion Measurements | 91 | 5 | 1 | 1 | 90 | 4 | * Class + Class | 0 |
Manual Analysis
We analysed each dataset pair in Table 3 and its underlying proximity metrics and attribute similarities in order to deduce the mechanism that led to the correct detection of the pair for schema matching. The observations from our analysis of each pair are as follows:
- DS 8 and 483: Nominal attributes partially similar; numeric attributes coincidentally matching, with highly similar data profiles.
- DS 12 and 20: Nominal attributes very strongly similar. Dataset-level similarity (not attribute-level similarity) and name-based proximity were the main deciding factors here, as the datasets have similar attribute counts, similar attribute types, and very similar attribute names. This demonstrates the importance of combining both levels of comparison and different types of similarity measures in a single proximity metric.
- DS 16 and 32: Nominal attributes very strongly similar. Maximum aggregation proximity metrics were probably the main deciding factors here.
- DS 16 and 1042: Attributes were weakly similar, but dataset level similarity was the main deciding factor here.
- DS 22 and 1501: Nominal attributes very strongly similar. Dataset-level and name-based metrics were the main deciding factors here, as the attributes follow quite similar naming schemes.
- DS 25 and 1523: Nominal attributes partially similar; numeric attributes coincidentally matching.
- DS 55 and 222: Many nominal attributes which are partially similar; common numeric attributes found, some coincidentally matching. Dataset-level similarity and left-skewed attribute-level aggregations were the main deciding factors here (as those aggregations give more importance to a few very strongly similar pairs of attributes).
- DS 150 and 1596: They are duplicate datasets about the same instances, but with different preprocessing of the numerical attributes. As most attributes store different information depending on the preprocessing, the name-based and dataset-level proximity metrics were the main deciding factors here (attribute-level value-based proximity metrics probably contributed little in the All-Prox proximity model).
- DS 183 and 1557: Those are duplicate datasets, but the second has been cleansed, has undergone preprocessing of its nominal attributes, and has had its attribute names changed to a generic format. It is clear in this case that the positive proximity prediction was mainly due to attribute-level analysis of numeric attributes, with less importance given to name-based distances at the attribute level (dataset-level name proximity of course also contributes highly in this case, as the names are very similar).
- DS 187 and 287: The datasets are not duplicates and represent data from different sources. They were easily detected as related datasets due to the dataset names, the highly similar numeric attributes (using left-skewed and maximum aggregations), and the dataset-level profiles.
- DS 224 and 1523: Partially matching nominal and numeric attributes. Similar dataset-level profiles were the main deciding factor, due to similar numbers of instances and attributes.
- DS 307 and 1489: A combination of attribute level aggregations considering maximum attribute name similarity (where 'class' is a common attribute) and left-skewed aggregations of attribute level data profiles considering the partially matching numeric attributes led to the detection of relatedness between those datasets.
- DS 329 and 382: Although the attributes share no common data profiles, there were still attributes with common names (though with different data), like 'marital_status' and 'marst', which led to high similarity in the maximum attribute-name aggregations. The dataset-level profile also had a higher-than-average similarity.
- DS 354 and 1567: Both are duplicated datasets, but the first is a heavily preprocessed version with changed nominal categorisations and different normalisations of values. They were easily detected as related datasets using their dataset-level proximity metrics and left-skewed attribute-level similarities.
- DS 378 and 451: Very similar nominal attributes with similar data were the main deciding factors.
- DS 382 and 534: Nominal attributes which are similar led to the overall similarity of the datasets.
- DS 451 and 1590: Similar nominal attributes; numeric attributes coincidentally matching.
- DS 483 and 1093: Both datasets contain both nominal and numeric attributes with similar data and names.
- DS 483 and 1097: Nominal attributes have partially similar data profiles; numeric attributes coincidentally matching. The dataset profiles are similar too.
- DS 570 and 575: They are datasets storing the same information about different instances. Our proximity mining algorithm was able to detect that both datasets store related information although they had slight differences in data profiles. Our proposed approximate proximity metrics are able to detect such kind of similarities using a combination of different types of metadata using names or overall dataset level profiles.
- DS 1030 and 1035: Similar dataset level and attribute level data profiles and names.
- DS 1038 and 1501: Although the datasets are quite different in terms of their overall dataset profiles, they still had relative similarity in their overall numbers of attributes and instances. This, alongside the fact that the attribute names are generically numbered using a common naming scheme in both datasets, helped detect their relatedness.
- DS 1049 and 1071: They are datasets storing the same information about different instances. Our proximity mining algorithm was able to detect that both datasets store related information although they had slight differences in data profiles. The approximate proximity metrics are able to detect such kind of similarities.
- DS 1050 and 1070: Multiple similar nominal and numeric attributes.
- DS 1053 and 1056: Multiple similar numeric attributes. Dataset level profile similarity also contributed to the correct detection of relatedness.
- DS 1054 and 1056: Multiple similar nominal and numeric attributes.
- DS 1054 and 1065: They are datasets storing the same information about different instances. Although the datasets had very different data profiles for each attribute, the name-based attribute-level proximity metrics were still able to effectively detect that those datasets hold related data. Left-skewed attribute-level aggregations also effectively detected very high similarity in a few attribute pairs.
- DS 1497 and 1525: Most attributes in the dataset with fewer attributes had highly similar counterparts in the bigger dataset.
- DS 1517 and 1518: Both datasets contain both nominal and numeric attributes with similar data and names.
- DS 1518 and 1526: The attribute-level names were identical between the smaller dataset and the larger dataset. The maximum aggregation for attribute-level similarity scored highly due to the very similar nominal attribute. Many numeric attributes also had slight, but not strong, similarity in their data profiles.
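Several of the observations above hinge on how attribute-level similarities are aggregated: a maximum aggregation lets one very strong pair (e.g., a shared 'class' attribute) dominate, while a left-skewed aggregation emphasises the few strongest pairs. The sketch below illustrates the two behaviours; the top-k mean is our own stand-in for the paper's left-skewed aggregation, not its exact formulation.

```python
def max_aggregation(sims):
    """Maximum aggregation: the single best-matching attribute pair
    determines the score (e.g., a shared 'class' attribute)."""
    return max(sims) if sims else 0.0

def left_skewed_aggregation(sims, top_k=3):
    """Left-skewed aggregation (illustrative top-k mean): emphasise
    the few strongest attribute-pair similarities rather than
    averaging over all pairs."""
    strongest = sorted(sims, reverse=True)[:top_k]
    return sum(strongest) / len(strongest) if strongest else 0.0

# A pair with a few very strong attribute matches scores high under
# both aggregations, even though most attribute pairs are dissimilar.
sims = [0.95, 0.9, 0.85, 0.1, 0.05, 0.0]
print(max_aggregation(sims))                       # 0.95
print(round(left_skewed_aggregation(sims), 2))     # 0.9

# An unweighted mean over all pairs would dilute the signal instead.
print(round(sum(sims) / len(sims), 3))             # 0.475
```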
Conclusion
From our observations, we can conclude that the different types of proximity metrics are complementary and are important for improving the performance of schema matching pre-filtering. Both attribute-level and dataset-level proximity metrics were important deciding factors for detecting related datasets. This was seen in cases where we cannot rely only on attribute-level value-based proximity metrics, including duplicate datasets whose attribute values have been heavily transformed and preprocessed. In such cases, the dataset-level proximity metrics and name-based proximity metrics are the main deciding factors for proposing such pairs of datasets for further schema matching. Name-based proximity metrics in combination with value-based metrics were also both important in deciding whether a pair of datasets is related and should be analysed with schema matching. In some cases, a skewed aggregation proximity metric or a maximum aggregation proximity metric for attribute-level similarities helped detect related datasets. This demonstrates the importance of using different aggregation techniques for attribute-level proximity metrics.
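As a final illustration of this complementarity, the sketch below combines a name-based proximity (here, a normalised string similarity) with value-based and dataset-level components. The equal weights and function names are our own assumptions, not the All-Prox model's actual combination, which is defined in the paper.

```python
from difflib import SequenceMatcher

def name_proximity(name_a, name_b):
    """Name-based proximity: normalised string similarity of names."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

def combined_proximity(name_sim, value_sim, dataset_sim,
                       weights=(1 / 3, 1 / 3, 1 / 3)):
    """Illustrative linear combination of complementary metrics."""
    return (weights[0] * name_sim
            + weights[1] * value_sim
            + weights[2] * dataset_sim)

# Duplicate datasets with heavily transformed values (e.g., DS 150 and
# 1596): value-based similarity is low, but name-based and dataset-level
# similarity keep the pair above the 0.4 pre-filtering threshold.
score = combined_proximity(name_proximity("covertype", "covertype"),
                           value_sim=0.1, dataset_sim=0.8)
print(score >= 0.4)  # True
```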