Skip to main content Link Search Menu Expand Document (external link)

4. Appending Variables Representing the Next Time Period

Having addressed missing values, we now expand our set of features by including their respective next time period.

4.1 Next Time Period Features

Consider Section 2 features, now with missing data addressed. We now use them to derive the next time step features. More specifically, these features repeat the 11 features assigned columns E-O of Section 2 but instead pertain to the next time period (90 days) in a CVE’s timeline. Their names match the 11 features of Sextion 2 but have a suffix “2” appended to their names. Their derivation is simple: Each of these next-time period-related features are derived as follows: Make a copy of the corresponding feature and shift all of its values up by one row. Of course, the top value of each original-variable column is simply unused; and the bottom value of the new column is blank.

Finally, for each CVE id, the record representing the last time period for a CVE’s remediation is then deleted. This is because it had gotten either the values of the first time period of the next CVE, or blanks, if it was the last row of the entire dataset.

The features appear in the same order as the corresponding original 11 features above: org_silo2, mis_link2, silence2, congruence2, communicate2, code_dev2, file2, mail_dev2, thread2, commit2, churn2.

Variable Derived Variable (i.e. Feature)
org_silo2 (P) Same count as “org_silo” but applied to the next time period
mis_link2 (Q) Same count as “mis_link” but applied to the next time period
silence2 (R) Same count as “silence” but applied to the next time period
congruence2 (S) Same 0..1 measure as “congruence” but applied to the next time period
communicate2 (T) Same 0..1 measure as “communicate” but applied to the next time period
code_dev2 (U) Same count as “code_dev” but applied to the next time period
file2 (V) Same count as “file” but applied to the next time period
mail_dev2 (W) Same count as “mail_dev” but applied to the next time period
thread2 (X) Same count as “thread” but applied to the next time period
commit2 (Y) Same count as “commit” but applied to the next time period
churn2 (Z) Same count as “churn” but applied to the next time period

As there were 5563 rows and there are 121 CVEs represented in the dataset, this should leave 5442 rows (after the header row). And instead of 15 columns, there should now be 26.

The result of this appending operation should look like this:

+ openssl_social_smells_timeline..renameVariables..resolveMD..deleteLastRecordEachCVE.csv