3. Handling Missing Data

In this section, we describe the missing data patterns observed in the dataset and how we resolved them.

3.1 Missing Data Patterns

We observed two scenarios in which data was missing. The first was due to the absence of mailing list archives. The second was a consequence of inactivity.

3.1.1 Absence of Mailing List Archive

While git log activity existed during the 2000-2001 time period, the mailing list archive was unavailable. This must be accounted for during the social smells computation, as otherwise we would consider social smells to exist due to absence of data, instead of absence of communication.

3.1.2 Absence of Activity

Periods of inactivity are expected to occur during development. In the scope of this work, inactivity translates to the absence of communication or development. Each time window can therefore be characterized as having one of the following four missing data patterns:

Development without communication (Social Smell)
Development with communication (Considered ideal under our hypothesis)
No Development with Communication
No Development and No Communication
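
As a minimal sketch of how a time window maps onto these four categories, the function below labels a window from two counts. The parameter names commit_count and thread_count are hypothetical, and treating a count greater than zero as evidence of activity is an assumption made for illustration, not necessarily the rule used in our pipeline.

```python
def classify_window(commit_count, thread_count):
    """Label a time window by whether development and communication occurred.

    Assumption: a window counts as active when its count is greater than zero.
    """
    has_development = commit_count > 0
    has_communication = thread_count > 0
    if has_development and not has_communication:
        return "Development without communication (Social Smell)"
    if has_development and has_communication:
        return "Development with communication"
    if has_communication:
        return "No Development with Communication"
    return "No Development and No Communication"


# Illustrative windows with made-up counts.
for commits, threads in [(3, 0), (3, 2), (0, 2), (0, 0)]:
    print(commits, threads, "->", classify_window(commits, threads))
```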

Consequently, we would expect the variables, and therefore the features derived from them, to also be affected by these missing data patterns. To verify our assumptions, we characterized the missing data using Mplus. The result of this analysis is summarized in the following table:

SUMMARY OF DATA

     Number of missing data patterns             4


SUMMARY OF MISSING DATA PATTERNS


     MISSING DATA PATTERNS (x = not missing)

           		1  	2 	3 	4
 cve_id   		x  	x  	x  	x
 activity_0		x  	x  	x  	x
 activity_2		x  	x  	x  	x
 Start     		x  	x  	x  	x
 End       		x  	x  	x  	x
 org_silo		x
 mis_link		x
 silence   		x
 congruence		x
 communicate		x
 code_dev  		x  	x
 file      		x  	x
 mail_dev 		x     		x
 thread    		x     		x
 commit    		x  	x
 churn     		x  	x

     MISSING DATA PATTERN FREQUENCIES

    Pattern   Frequency     Pattern   Frequency     Pattern   Frequency
    1        3590           	3        1973
    2         907           	4         227

In summary, the table generated by Mplus tells us the following about the dataset, where MD stands for missing data:

Number of cases with no MD at all (corresponds to Pattern 1 in the Mplus table): 3590
Number of cases with MD due to absence of mailing list data only (corresponds to Pattern 2 in the Mplus table): 907
Number of cases with MD due to absence of communication in mailing lists only (corresponds to Pattern 3 in the Mplus table): 1973
Number of cases with MD due to absence of both mailing list data and communication in mailing lists (corresponds to Pattern 4 in the Mplus table): 227
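
The frequencies above can be cross-checked with a little arithmetic. The snippet below only restates the four counts reported by Mplus and derives the total number of cases and the share of each pattern; the variable names are illustrative.

```python
# Pattern frequencies as reported in the Mplus table.
pattern_counts = {1: 3590, 2: 907, 3: 1973, 4: 227}

total_cases = sum(pattern_counts.values())
for pattern, count in pattern_counts.items():
    print(f"Pattern {pattern}: {count} cases ({count / total_cases:.1%})")

# Cases affected by the missing mailing list archive (Patterns 2 and 4).
archive_missing = pattern_counts[2] + pattern_counts[4]
remaining_cases = total_cases - archive_missing
print(f"Total cases: {total_cases}")
print(f"Archive missing: {archive_missing} ({archive_missing / total_cases:.1%})")
print(f"Remaining cases: {remaining_cases}")
```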

The identifying features are always present, because the analysis is based on 3-month windows between file changes. This is expected, as the file changes considered here come from the CVE timelines of file changes. Their ID, the features derived from the commit hash, and the dates associated with the commit hash should therefore always be known.

The Social Smell features consistently share the same pattern. This is expected, as they are all derived from the same data source.

Finally, the statistics variables (code_dev, file, commit, churn) share one pattern and (mail_dev, thread) share another, as expected: they are derived from the git log and the mailing list, respectively.

We decided to omit the code for the missing data analysis, as it would require further explanation of the variable renaming imposed by software limitations, and since the missing data patterns were already known.
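
Independently of the omitted Mplus input, the idea of tabulating missing data patterns can be illustrated in a few lines. The sketch below is not the analysis we ran: the rows are invented toy data, only three of the dataset's column names are used, and None is assumed to mark a missing value.

```python
from collections import Counter


def missing_data_patterns(rows, columns):
    """Count how often each observed/missing combination occurs.

    A pattern is a tuple of booleans, one per column, True meaning observed.
    """
    patterns = Counter()
    for row in rows:
        pattern = tuple(row.get(column) is not None for column in columns)
        patterns[pattern] += 1
    return patterns


# Invented toy rows; None marks a missing value (assumption).
columns = ["commit", "thread", "silence"]
rows = [
    {"commit": 4, "thread": 2, "silence": 1},
    {"commit": 4, "thread": None, "silence": None},
    {"commit": None, "thread": 3, "silence": None},
    {"commit": None, "thread": None, "silence": None},
    {"commit": 1, "thread": 5, "silence": 0},
]

patterns = missing_data_patterns(rows, columns)
print(" ".join(columns), "frequency")
for pattern, frequency in patterns.most_common():
    marks = " ".join("x" if observed else "." for observed in pattern)
    print(marks, frequency)
```

As in the Mplus output, an x marks a variable that is not missing in the given pattern.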

3.2 Missing Data Transformations

With respect to the absence of data during 2000-2001, about 17% of the rows (Patterns 2 and 4 in the Mplus table above, 907 + 227 cases) had missing values due to the loss of the mailing log. Developers were presumably communicating during this time period, but there is no data on such communication. We decided to remove from the dataset the rows for which the mailing list data source is missing (i.e. 2000-2001).

Specifically, we believe that the conditions leading to the loss of the mailing log should have no noticeable causal effect on any of the other variables in the dataset (other than time-related variables such as “Start” and “End”). In other words, the authors deemed that data missing due to the absence of mailing list data can be considered Missing Completely at Random (MCAR) [7]. Therefore, Listwise Deletion [7], that is, deleting all rows whose missing data is MCAR, induces no bias; its only cost is an increase in the estimated standard errors. While deleting 17% of the rows is a substantial deletion, the authors felt that the retained 83% of the rows was still sufficient to identify major direct causal relationships. Also, from an explainability and understandability perspective, deleting 17% of the rows seemed preferable to employing pseudo-random number generation to impute values. Finally, choosing deletion rather than imputation also helps ensure that the results can be more easily replicated. (The authors did initially pursue Full Information Maximum Likelihood-based imputation [7], but after these considerations chose not to use the results and instead pursued Causal Discovery on the dataset resulting from Listwise Deletion.)

With respect to data missing due to inactivity during a given time period, all measures of features (counts) related to commits should be 0.

Thus, after deleting about 17% (907 + 227) of the rows, namely those affected by the loss of an old mailing log, and imputing zeros for all remaining missing values, we have a dataset (5563 cases) with no MD and can continue to the next steps in preparing the dataset for applying causal discovery.
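
The two transformations described above, listwise deletion followed by zero imputation, can be sketched as follows. This is an illustration under stated assumptions rather than the script we used: the rows are invented, None is assumed to mark a missing value, "Start" is assumed to be an ISO-formatted date string, and the helper names resolve_missing_data and start_year are hypothetical.

```python
ARCHIVE_GAP_YEARS = {2000, 2001}  # period without a mailing list archive


def start_year(row):
    """Extract the year from an ISO date string such as '2000-04-01' (assumed format)."""
    return int(row["Start"][:4])


def resolve_missing_data(rows):
    """Drop rows from the archive gap, then replace remaining missing values with 0."""
    # Listwise deletion: remove every row whose window starts in the archive gap.
    kept = [row for row in rows if start_year(row) not in ARCHIVE_GAP_YEARS]
    # Zero imputation: inactivity means the remaining missing counts are 0.
    return [
        {key: (0 if value is None else value) for key, value in row.items()}
        for row in kept
    ]


# Invented toy rows; None marks a missing value (assumption).
rows = [
    {"Start": "2000-04-01", "commit": 5, "thread": None},
    {"Start": "2003-01-01", "commit": 2, "thread": None},
    {"Start": "2004-07-01", "commit": None, "thread": 6},
    {"Start": "2001-10-01", "commit": None, "thread": None},
]

resolved = resolve_missing_data(rows)
for row in resolved:
    print(row)
```

Applying the deletion before the imputation matters: it keeps the lost 2000-2001 communication data from being silently replaced by zeros.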

Here’s the file we obtained by resolving the MD as described above:

+ openssl_social_smells_timeline..renameVariables..resolveMD.csv