In this tutorial, I am going to show you how to perform supervised learning in R using the sparklyr package. The models I am going to use are:

  • Linear Regression
  • Naive Bayes
  • Decision Tree
  • Random Forest
  • Logistic Regression
  • Multilayer Perceptron
  • Gradient Boosted Tree
  • Support Vector Machine

Loading data

First of all, we need to make a connection between Spark and R. After that, we add data to the connection, turning an R data frame into a Spark data frame.

library(sparklyr) # provides spark_connect(), sdf_copy_to() and the ml_*() functions
library(dplyr) # we'll need this to do some manipulations on data

sc <- spark_connect(master = "local")

# using iris for binary classification examples
iris_bin_tbl <- sdf_copy_to(sc, iris, name = "iris_bin_tbl", overwrite = TRUE) %>% 
  filter(Species != "setosa")

# using iris for multiclass classification examples
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

# using mtcars for regression examples
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

Partitioning into training and testing

You can partition your data as you like: 70/30, 75/25, 80/20…

# for iris_bin_tbl
partitions <- iris_bin_tbl %>%
  sdf_partition(training = 0.7, test = 0.3, seed = 1111)

iris_bin_training <- partitions$training
iris_bin_test <- partitions$test

# for iris_tbl
partitions <- iris_tbl %>%
  sdf_partition(training = 0.7, test = 0.3, seed = 1111)

iris_training <- partitions$training
iris_test <- partitions$test

# for mtcars_tbl
partitions <- mtcars_tbl %>%
  sdf_partition(training = 0.7, test = 0.3, seed = 1111)

mtcars_training <- partitions$training
mtcars_test <- partitions$test
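One detail worth knowing: as far as I understand Spark's random split, the weights act as per-row probabilities, so the resulting sizes are only approximately 70/30 (treat this as an assumption, not a documented guarantee). The idea can be sketched locally in base R, with no Spark involved; `local_partition` is a hypothetical helper, not part of sparklyr:

```r
# Local, Spark-free illustration of a weighted random split.
# `local_partition` is a hypothetical helper, not a sparklyr function.
local_partition <- function(df, training = 0.7, seed = 1111) {
  set.seed(seed)
  # each row independently lands in training with probability `training`
  in_training <- runif(nrow(df)) < training
  list(training = df[in_training, , drop = FALSE],
       test     = df[!in_training, , drop = FALSE])
}

parts <- local_partition(mtcars)

# every row ends up in exactly one of the two partitions
nrow(parts$training) + nrow(parts$test) == nrow(mtcars)
```

Because the assignment is random per row, a small data set like mtcars can end up noticeably away from the requested proportions.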

Training, predicting and evaluating your models

Now that you have split your data into training and test sets, let’s move to the fun part: train your model, make predictions and, of course, evaluate the results. Let’s see a summary of how each of these steps works in sparklyr.

  • Model training: Basically, all the models share the same syntax: your_model <- ml_function(training_data, formula). Some models, like Random Forest, can perform more than one type of prediction; in these cases you have to add the type parameter. Multilayer perceptron is a bit different because you have to supply the layer configuration, but it isn’t difficult at all.

  • Making predictions: After training, you need to add the prediction column to your Spark data frame. All you have to do is use pred <- sdf_predict(test_data, your_model).

  • Evaluating your model: Here, you have to use the evaluator that matches your output. You can use one of these on the pred data frame: ml_regression_evaluator(), ml_binary_classification_evaluator() or ml_multiclass_classification_evaluator(). Sparklyr doesn’t provide a function for the confusion matrix yet; if you want to analyse it, you can use dplyr::collect(pred) and then table(groundtruth, prediction). WARNING: Don’t use collect() on a data frame larger than your memory, or you’ll lose the connection with Spark. I suggest estimating your confusion matrix from a sample taken with sdf_sample(). I know…I know it is not the best solution, but it is what we have for now :(
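The table(groundtruth, prediction) step can be sketched in plain R. In the sketch below, `pred_local` is a made-up stand-in for the result of dplyr::collect(pred), with illustrative values rather than real model output. The column names Species and predicted_label are the ones that appear in the prediction data frame for the iris models.

```r
# `pred_local` is a made-up stand-in for dplyr::collect(pred);
# the values are illustrative, not real model output.
pred_local <- data.frame(
  Species         = c("versicolor", "versicolor", "virginica", "virginica", "virginica"),
  predicted_label = c("versicolor", "virginica",  "virginica", "virginica", "versicolor"),
  stringsAsFactors = FALSE
)

# rows = ground truth, columns = prediction
conf_mat <- table(groundtruth = pred_local$Species,
                  prediction  = pred_local$predicted_label)
conf_mat

# overall accuracy = correct predictions / all predictions
# (diag() is only meaningful when both columns contain the same set of classes)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

On a real Spark data frame you would first reduce pred to these two columns, and sample it if needed, before collecting.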

Training your model

# linear regression
lm_model <- mtcars_training %>%
  ml_linear_regression(mpg ~ .)

# naive bayes
nb_model <- iris_training %>%
  ml_naive_bayes(Species ~ .)

# decision tree
dt_model <- iris_training %>%
  ml_decision_tree(Species ~ .)

# random forest (regression)
rf_model_reg <- mtcars_training %>%
  ml_random_forest(cyl ~ ., type = "regression")

# random forest (classification)
rf_model_class <- iris_training %>%
  ml_random_forest(Species ~ ., type = "classification")

# gradient boosted tree (regression)
gbt_model_reg <- mtcars_training %>%
  ml_gradient_boosted_trees(cyl ~ ., type = "regression")

# gradient boosted tree (binary). Multiclass classification is not implemented yet. When it works I'll update this tutorial
gbt_model_bin <- iris_bin_training %>% 
  ml_gradient_boosted_trees(Species ~ ., type = "classification")

# logistic regression
lr_model <- iris_bin_training %>%
  ml_logistic_regression(Species ~ .)

# multilayer perceptron
mlp_model <- iris_training %>%
  ml_multilayer_perceptron(Species ~ ., layers = c(4,3,3))

# support vector machine (binary). Multiclass classification is not implemented yet. When it works I'll update this tutorial
svm_model <- iris_bin_training %>%
  ml_linear_svc(Species ~ .)
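A word on layers = c(4,3,3) in the multilayer perceptron call above. My reading is that the first entry matches the four predictor columns of iris, the last entry matches the three species, and the middle entry is a hidden layer size chosen freely. Here is a minimal base R sketch of how that vector could be derived from the data; the variable names are only for illustration:

```r
# Derive the layer sizes from the local iris data frame (no Spark needed).
n_inputs  <- ncol(iris) - 1          # four numeric predictors, Species excluded
n_classes <- nlevels(iris$Species)   # three species
n_hidden  <- 3                       # hidden layer size: a free choice, 3 as in the call above

layers <- c(n_inputs, n_hidden, n_classes)
layers
```

If you change the formula to use fewer predictors, the first entry has to change with it.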


Remember, the only thing you have to do is use pred <- sdf_predict(test_data, your_model). Let’s check three examples.

pred_lm <- sdf_predict(mtcars_test, lm_model) # linear regression - regression
pred_gbt_bin <- sdf_predict(iris_bin_test, gbt_model_bin) # gradient tree boosting - binary
pred_nb <- sdf_predict(iris_test, nb_model)   # naive bayes - multiclass

In this part, I have to give you some details. For example, let’s take a look at ml_binary_classification_eval().

ml_binary_classification_eval(x, label_col = "label",
  prediction_col = "prediction", metric_name = "areaUnderROC")

Note that the function takes the parameters label_col and prediction_col. You don’t need to “worry” about them because sparklyr creates these columns with the same default names when you use sdf_predict(). Look:

glimpse(pred_nb)
## Observations: ??
## Variables: 14
## $ Sepal_Length           <dbl> 4.3, 4.4, 4.4, 4.7, 4.8, 4.9, 4.9, 5.0,...
## $ Sepal_Width            <dbl> 3.0, 2.9, 3.2, 3.2, 3.1, 2.5, 3.1, 3.0,...
## $ Petal_Length           <dbl> 1.1, 1.4, 1.3, 1.3, 1.6, 4.5, 1.5, 1.6,...
## $ Petal_Width            <dbl> 0.1, 0.2, 0.2, 0.2, 0.2, 1.7, 0.2, 0.2,...
## $ Species                <chr> "setosa", "setosa", "setosa", "setosa",...
## $ features               <list> [<4.3, 3.0, 1.1, 0.1>, <4.4, 2.9, 1.4,...
## $ label                  <dbl> 2, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2, 2, 2, ...
## $ rawPrediction          <list> [<-11.881783, -11.384971, -9.929243>, ...
## $ probability            <list> [<0.1031988, 0.1696044, 0.7271968>, <0...
## $ prediction             <dbl> 2, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2, 2, 2, ...
## $ predicted_label        <chr> "setosa", "setosa", "setosa", "setosa",...
## $ probability_virginica  <dbl> 0.10319876, 0.13955037, 0.11483333, 0.1...
## $ probability_versicolor <dbl> 0.1696044, 0.2175782, 0.1869394, 0.1807...
## $ probability_setosa     <dbl> 0.727196821, 0.642871385, 0.698227263, ...

Did you see? pred_nb contains the columns named label and prediction. That is why you just need to pass pred_nb to your evaluator. Easy, isn’t it? Take a look at these examples:

# gradient tree boosting
ml_binary_classification_eval(pred_gbt_bin)
## [1] 0.825

# naive bayes
ml_multiclass_classification_evaluator(pred_nb)
## [1] 0.9184959

But why did I put “worry” in quotation marks? Because some models don’t create the label column, for example ml_linear_regression(). Let’s check pred_lm out!

glimpse(pred_lm)
## Observations: ??
## Variables: 12
## $ mpg        <dbl> 10.4, 10.4, 14.3, 15.8, 18.1, 19.7, 21.4, 22.8
## $ cyl        <dbl> 8, 8, 8, 8, 6, 6, 4, 4
## $ disp       <dbl> 460.0, 472.0, 360.0, 351.0, 225.0, 145.0, 121.0, 140.8
## $ hp         <dbl> 215, 205, 245, 264, 105, 175, 109, 95
## $ drat       <dbl> 3.00, 2.93, 3.21, 4.22, 2.76, 3.62, 4.11, 3.92
## $ wt         <dbl> 5.424, 5.250, 3.570, 3.170, 3.460, 2.770, 2.780, 3.150
## $ qsec       <dbl> 17.82, 17.98, 15.84, 14.50, 20.22, 15.50, 18.60, 22.90
## $ vs         <dbl> 0, 0, 0, 0, 1, 0, 1, 1
## $ am         <dbl> 0, 0, 0, 1, 0, 1, 1, 0
## $ gear       <dbl> 3, 3, 3, 5, 3, 5, 4, 4
## $ carb       <dbl> 4, 4, 4, 4, 1, 6, 2, 2
## $ prediction <dbl> 14.63299, 15.72502, 13.12466, 24.98213, 21.58216, 1...

OMG!!! What am I supposed to do? Easy: just pass label_col = "your_output" as a parameter, in this case mpg.

ml_regression_evaluator(pred_lm, label_col = "mpg")
## [1] 5.362564
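As far as I know, the number returned here is the root mean squared error (RMSE), which is the default metric of ml_regression_evaluator(). To make it concrete, here is a minimal sketch that computes RMSE by hand in base R. The `obs` data frame holds made-up values standing in for a collected pred_lm, not the real predictions:

```r
# Made-up labels and predictions, standing in for a collected pred_lm.
obs <- data.frame(
  mpg        = c(20, 15, 30, 25),
  prediction = c(22, 14, 27, 25)
)

# RMSE: square root of the mean squared difference between label and prediction
rmse <- sqrt(mean((obs$mpg - obs$prediction)^2))
rmse
```

RMSE is in the same unit as the label, so the value above reads as a typical error in miles per gallon.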

