Paris School of Economics / EHESS
Campus Jourdan – 48 boulevard Jourdan 75014 Paris
5th floor, office 39
Thesis Supervisor: CLARK Andrew
Academic year of registration: 2013/2014
"COMBINATORICS" Stata command for batch OLS estimation and cross-validation:
This data-mining program performs batch OLS estimation, out-of-sample validation, and leave-one-out cross-validation (LOOCV) on all 2^n models that can be formed from a set of n candidate explanatory variables.
It evaluates the in-sample explanatory performance and the out-of-sample (OOS) predictive performance of all 2^n models of a dependent variable that can be generated from a given set of n candidate explanatory variables. This can be used in data mining to examine how predictors' coefficients and performance are distributed across the space of potential models. It can also serve as a benchmark for evaluating model selection and machine learning algorithms on natural datasets. Because the risk of overfitting is high in high-dimensional feature selection problems, a model's performance is better assessed by a cross-validated measure of fit, which is computed here by leave-one-out cross-validation (LOOCV). Note that the number of possible models grows exponentially with the number of candidate explanatory variables, so it is advisable to use at most 20 indepvars (i.e., 2^20 = 1,048,576 models). This command is a wrapper for "tuples" and installs it automatically if necessary.
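The idea of fitting every subset of the candidate variables and scoring each model both in-sample and by LOOCV can be sketched outside Stata. The following is a minimal illustration in plain Python, not the code of the "combinatorics" command: the function names, the toy data, and the choice of R-squared and mean squared error as measures are all assumptions made for this example, and it assumes the regressors are not perfectly collinear.

```python
from itertools import combinations

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.
    Assumes A is non-singular (no perfectly collinear regressors)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

def predict(row, beta):
    return sum(v * b for v, b in zip(row, beta))

def loocv_mse(X, y):
    """Refit the model once per observation and predict the held-out row."""
    errors = []
    for i in range(len(y)):
        beta = ols(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        errors.append((y[i] - predict(X[i], beta)) ** 2)
    return sum(errors) / len(errors)

def all_models(data, y, names):
    """Evaluate all 2^n subsets of `names`; an intercept is always included."""
    ybar = sum(y) / len(y)
    sst = sum((yi - ybar) ** 2 for yi in y)
    results = []
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            X = [[1.0] + [data[v][i] for v in subset] for i in range(len(y))]
            beta = ols(X, y)
            sse = sum((yi - predict(row, beta)) ** 2 for row, yi in zip(X, y))
            results.append({"vars": subset,
                            "r2": 1 - sse / sst,
                            "loocv_mse": loocv_mse(X, y)})
    return results

# Toy data (hypothetical): y depends on x1 only, plus small noise.
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [5.0, 3.0, 8.0, 1.0, 9.0, 2.0, 7.0, 4.0],
    "x3": [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0],
}
y = [3.1, 4.8, 7.05, 9.0, 10.9, 13.2, 14.95, 17.1]

results = all_models(data, y, ["x1", "x2", "x3"])
for m in sorted(results, key=lambda m: m["loocv_mse"]):
    print(m["vars"], round(m["r2"], 4), round(m["loocv_mse"], 4))
```

With three candidates the loop visits 2^3 = 8 models. In-sample R-squared can only rise as variables are added to a nested model, whereas the LOOCV error can worsen, which is why the cross-validated measure is the more informative ranking criterion.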
You can download the program directly from Stata by typing:
ssc install combinatorics
The program has been written for Stata 11 and later versions, and is not compatible with earlier versions.
Keep in mind that this program is a beta version. Read the help file carefully by typing "help combinatorics" after installing the program. Please send me an e-mail if you run into any problems or bugs.
"COVERAGE" Stata command to describe coverage of variables in datasets, and visually analyze non-response patterns:
This program generates an Excel file that serves three main purposes when working with large and complex datasets.
- First, it sorts the variables of the dataset in memory by increasing number of missing observations. It also gives the sample size that remains after listwise deletion of missing values as increasingly incomplete variables are added. This tool is mostly helpful for analysts in early exploratory stages who face the trade-off between adding controls to a regression and losing sample size as a result (consistency vs. efficiency). The results are computed at the observation level and for any aggregate level specified as factor variables in the options.
- Second, it displays binary values indicating whether or not a variable is present at any aggregate level. This option is mostly useful for grasping the general structure of a dataset at a glance. For instance, it can show whether a variable is available in a given country in a given year. Results are also displayed in two-way tables for finer visual inspection. Blank cells indicate that a subgroup has no data associated with it, which for a survey dataset means, for instance, that no survey was conducted in a given country in a given year. Alternatively, the user can choose to display missing rates or sample means of the variables instead of binary values.
- Third, all results at the individual and aggregate levels can also be computed for each value taken by a specified variable, which helps to visually analyze the relationship between non-response rates and that variable. For instance, non-response on an individual income variable may be related to the average income in the individual's country. Likewise, if your dataset combines several data providers, non-response on any variable may be related to the data provider.
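To make the first two functions concrete, here is a small sketch in plain Python rather than Stata. It is not the code of the "coverage" command: the function names, the toy survey rows, and the use of None for missing values are assumptions made for this illustration.

```python
def coverage(rows, variables):
    """Sort variables by number of missing values (ascending) and report the
    sample size left after listwise deletion as each variable is added."""
    missing = {v: sum(r.get(v) is None for r in rows) for v in variables}
    ordered = sorted(variables, key=lambda v: missing[v])
    complete = list(rows)
    table = []
    for v in ordered:
        # Keep only rows that are still complete once variable v is required.
        complete = [r for r in complete if r.get(v) is not None]
        table.append((v, missing[v], len(complete)))
    return table

def presence(rows, variable, by):
    """Return 1 for each group in which `variable` has at least one
    non-missing value, and 0 otherwise."""
    groups = {}
    for r in rows:
        key = tuple(r[b] for b in by)
        groups[key] = groups.get(key, 0) or int(r.get(variable) is not None)
    return groups

# Hypothetical survey extract; None marks a missing value.
rows = [
    {"country": "FR", "year": 2010, "age": 34, "income": 2100, "health": 3},
    {"country": "FR", "year": 2010, "age": 51, "income": None, "health": 4},
    {"country": "FR", "year": 2011, "age": None, "income": None, "health": 2},
    {"country": "UK", "year": 2010, "age": 29, "income": 1800, "health": None},
    {"country": "UK", "year": 2011, "age": 45, "income": None, "health": 5},
    {"country": "UK", "year": 2011, "age": 62, "income": 2500, "health": 1},
]

table = coverage(rows, ["age", "income", "health"])
for name, n_missing, n_complete in table:
    print(name, n_missing, n_complete)

income_presence = presence(rows, "income", by=["country", "year"])
print(income_presence)
```

Each row of the first table shows how much of the sample survives once that variable, and all less incomplete ones before it, are required to be non-missing. The second result flags the country-year cells in which the income variable is entirely missing.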
You can download the program here:
To install it, simply copy the contents of the zip file into your personal ado directory. Type "personal" in Stata to find your personal ado directory. The program has been written for Stata 13 and later versions, and has not been tested on earlier versions.
Keep in mind that this program is a beta version. Read the help file carefully by typing "help coverage" after installing the program. Please send me an e-mail if you run into any problems or bugs.
"BHPS_UKHLS" Stata command to merge BHPS and Understanding Society (UKHLS):
This program helps to merge the British Household Panel Survey (BHPS) panel database with its successor, the UKHLS (United Kingdom Household Longitudinal Study, also known as Understanding Society). You can specify the variables, files, and waves you want to extract. It also generates a codebook of the variable labels and value labels of your variables of interest, which makes it easy to see how the labelling system varies across years and files, and which variables are present or absent in each wave. The program is written for Stata 12 and later versions. It is written for waves A to R of BHPS and A to C of Understanding Society, but should be compatible with forthcoming waves of Understanding Society. You can download it here:
To install it, simply copy the contents of the zip file into your personal ado directory. Type "personal" in Stata to find your personal ado directory.
Keep in mind that this program is a beta version. Read the help file carefully by typing "help bhps_ukhls" after installing the program. Please send me an e-mail if you run into any problems or bugs.