By John Alberg
Boosted C5.0 classifiers are known to perform well when stacked up against other classifiers (see, for example, this paper).
The caret library for the R programming language is an exceptional environment for automatic parameter tuning and training of classifiers. However, caret does not allow for out-of-the-box tuning of C5.0 tree complexity. This post shows how you can customize caret to do just that.
Caret has built-in capabilities for tuning the C5.0 meta parameters trials, model, and winnow. The C5.0 documentation describes these parameters in detail. The following code illustrates the ease of tuning and training a C5.0 classifier with a custom tuning grid:
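A minimal sketch of such a tuning run might look like the following. It assumes the caret and C50 packages are installed, uses the built-in iris data as a stand-in data set, and relies on caret's "C5.0" method with the grid columns named trials, model, and winnow (recent caret versions; older releases expected a leading dot on these names). The object names grid, ctrl, and c50_fit are purely illustrative.

```r
library(caret)  # the "C5.0" method also needs the C50 package installed

set.seed(1)
data(iris)

# Candidate values for the three built-in C5.0 tuning parameters.
grid <- expand.grid(
  trials = c(1, 5, 10, 20),      # number of boosting iterations
  model  = c("tree", "rules"),   # tree-based or rule-based model
  winnow = c(TRUE, FALSE),       # feature winnowing on or off
  stringsAsFactors = FALSE
)

# 5-fold cross-validation is an arbitrary choice for this sketch.
ctrl <- trainControl(method = "cv", number = 5)

c50_fit <- train(
  x = iris[, 1:4],
  y = iris$Species,
  method = "C5.0",
  tuneGrid = grid,
  trControl = ctrl
)

# Parameter combination with the best resampled accuracy.
print(c50_fit$bestTune)
```

caret also provides a plot method for objects returned by train(), so calling plot(c50_fit) on this hypothetical object draws the resampled accuracy across the grid.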
Then, typing the following
from the R console will give you this nice chart showing the classifier's performance across the tuning parameters.
Alas, one important C5.0 tuning parameter is not baked into caret: minCases. The minCases parameter specifies the minimum number of cases (training examples) that must be put in at least two of the splits. In effect, it controls the depth of the trees created by C5.0 (depth cannot be controlled directly), so it is closely tied to the complexity of the resulting trees. The purpose of tuning meta parameters is to find the best trade-off between model complexity and training set size, which makes minCases an important parameter to tune.
That said, tuning minCases is problematic under cross-validation. Each training fold contains fewer cases than the entire data set, so the optimal value of minCases found in cross-validation will not, in general, equal the optimum for the entire data set (which the final model will be trained on). To overcome this obstacle, we can define minCases as a proportion of the data set size and tune the proportional parameter instead. If we define minCases as
minCases <- length(y)/splits
then as "splits" increases, so will the depth and complexity of the resulting trees. The code below customizes the standard caret functions to allow for the tuning of "splits" along with the other C5.0 meta parameters.
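One way such a customization could be sketched is shown below. It copies caret's built-in "C5.0" model definition with getModelInfo(), registers an extra tuning parameter named splits, and replaces the fit function so that minCases is derived from the size of whatever training set caret passes in. The name c50_splits, the floor() rounding, the lower bound of 2 (C5.0's default minCases), and the iris stand-in data are all assumptions of this sketch, not details confirmed by the post.

```r
library(caret)
library(C50)

# Start from caret's built-in C5.0 definition and extend it.
c50_splits <- getModelInfo("C5.0", regex = FALSE)[[1]]

# Register the extra tuning parameter alongside trials, model, winnow.
c50_splits$parameters <- rbind(
  c50_splits$parameters,
  data.frame(parameter = "splits", class = "numeric", label = "Splits")
)

# Disable caret's sub-model shortcut over trials so that every row of
# the tuning grid is fitted separately (simpler, but slower).
c50_splits$loop <- NULL

# Fit C5.0 with minCases computed from the current training set size.
c50_splits$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
  # Assumption: round down, and never go below C5.0's default of 2.
  min_cases <- max(2L, as.integer(floor(length(y) / param$splits)))
  C50::C5.0(
    x = x, y = y, weights = wts,
    trials = param$trials,
    rules = param$model == "rules",
    control = C50::C5.0Control(winnow = param$winnow, minCases = min_cases)
  )
}

set.seed(1)
# The default grid function does not know about splits, so an explicit
# tuneGrid must always be supplied in this sketch.
grid <- expand.grid(
  trials = c(1, 10), model = "tree", winnow = FALSE,
  splits = c(5, 20, 50), stringsAsFactors = FALSE
)

c50_fit <- train(
  x = iris[, 1:4], y = iris$Species,
  method = c50_splits,
  tuneGrid = grid,
  trControl = trainControl(method = "cv", number = 5)
)
print(c50_fit$bestTune)
```

Because minCases is recomputed inside the fit function, the constraint scales with the data: with 1,000 cases and splits = 50, the full data set gets minCases = 20, while each 10-fold training set of 900 cases gets 18, which is 2% of the cases in both.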
With this code we can generate cross-validated results like those in the following chart: