Welcome, friend :)

In this tutorial, I am going to show you how to perform supervised learning in R using the sparklyr package. The models I am going to use are:

  • Linear Regression
  • Naive Bayes
  • Decision Tree
  • Random Forest
  • Logistic Regression
  • Multilayer Perceptron
  • Gradient Boosted Tree
  • Support Vector Machine

If you don’t know how to connect Spark to R, don’t worry…check this out. If you have any questions or suggestions, don’t hesitate to contact me at samuelmacedo@recife.ifpe.edu.br.

Let’s get to action…

Loading data

First of all, we need to make a connection between Spark and R. After that, we are going to copy data into the connection, transforming a local data frame into a Spark data frame.

library(sparklyr)
library(dplyr) # we'll need this to do some manipulations on data

sc <- spark_connect(master = "local")

# using iris for binary classification examples
iris_bin_tbl <- sdf_copy_to(sc, iris, name = "iris_bin_tbl", overwrite = TRUE) %>% 
  filter(Species != "setosa")

# using iris for multiclass classification examples
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

# using mtcars for regression examples
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

Partitioning into training and testing

You can partition your data as you like: 70/30, 75/25, 80/20…

# for iris_bin_tbl
partitions <- iris_bin_tbl %>%
  sdf_partition(training = 0.7, test = 0.3, seed = 1111)

iris_bin_training <- partitions$training
iris_bin_test <- partitions$test

# for iris_tbl
partitions <- iris_tbl %>%
  sdf_partition(training = 0.7, test = 0.3, seed = 1111)

iris_training <- partitions$training
iris_test <- partitions$test

# for mtcars_tbl
partitions <- mtcars_tbl %>%
  sdf_partition(training = 0.7, test = 0.3, seed = 1111)

mtcars_training <- partitions$training
mtcars_test <- partitions$test

Training, predicting and evaluating your models

Now that you have split your data into training and test sets, let’s move to the fun part: training your model, making predictions and, of course, evaluating the results. Here is a summary of how each of these steps works in sparklyr.

  • Model training: Basically, all the models share the same syntax: your_model <- ml_function(training_data, formula). Some models, like Random Forest, can perform more than one type of prediction; in these cases you have to add the type parameter. The multilayer perceptron is a bit different because you also have to specify the layer configuration, but it isn’t difficult at all.

  • Make predictions: After training, you need to include the prediction column in your Spark data frame. All you have to do is use pred <- sdf_predict(test_data, your_model).

  • Evaluating your model: Here, you have to use the evaluator that matches your output. You can use one of these on the pred data frame: ml_regression_evaluator(), ml_binary_classification_evaluator() or ml_multiclass_classification_evaluator(). Sparklyr doesn’t provide a confusion matrix function yet; if you want to analyse one, you can use dplyr::collect(pred) and then table(groundtruth, prediction). WARNING: Don’t use collect() when your data frame is larger than your memory, or you’ll lose the connection with Spark. I suggest you estimate your confusion matrix from a sample using sdf_sample(). I know…I know that is not the best solution, but it is what we have for now :(
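The sampled confusion matrix idea from the warning above could be sketched like this. This is just a sketch: it assumes a pred data frame created with sdf_predict() on one of the iris classifiers below (so it has a Species ground-truth column and a predicted_label column), and the 10% fraction is an arbitrary choice you should tune to your memory budget.

```r
library(sparklyr)
library(dplyr)

# Estimate a confusion matrix from a 10% sample, so collect() never pulls
# more rows than fit in local memory. Assumes `pred` exists in Spark and
# has the columns Species (ground truth) and predicted_label.
pred_sample <- pred %>%
  sdf_sample(fraction = 0.1, replacement = FALSE, seed = 1111) %>%
  collect()

# cross-tabulate ground truth against predictions on the local sample
table(pred_sample$Species, pred_sample$predicted_label)
```

Because it is only a sample, the counts are an estimate of the full confusion matrix, but the class-by-class error pattern is usually clear enough to diagnose your model.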

Does it seem complicated? Don’t worry… let’s check out how easy it is to train your models with sparklyr :)

Training your model

# linear regression
lm_model <- mtcars_training %>%
  ml_linear_regression(mpg ~ .)

# naive bayes
nb_model <- iris_training %>%
  ml_naive_bayes(Species ~ .)

# decision tree
dt_model <- iris_training %>%
  ml_decision_tree(Species ~ .)

# random forest (regression)
rf_model_reg <- mtcars_training %>%
  ml_random_forest(cyl ~ ., type = "regression")

# random forest (classification)
rf_model_class <- iris_training %>%
  ml_random_forest(Species ~ ., type = "classification")

# gradient boosted tree (regression)
gbt_model_reg <- mtcars_training %>%
  ml_gradient_boosted_trees(cyl ~ ., type = "regression")

# gradient boosted tree (binary). Multiclass classification is not implemented yet. When it is, I'll update this tutorial
gbt_model_bin <- iris_bin_training %>% 
  ml_gradient_boosted_trees(Species ~ ., type = "classification")

# logistic regression
lr_model <- iris_bin_training %>%
  ml_logistic_regression(Species ~ .)

# multilayer perceptron: layers = c(4, 3, 3) means 4 input neurons (one per
# feature), one hidden layer with 3 neurons, and 3 output neurons (one per class)
mlp_model <- iris_training %>%
  ml_multilayer_perceptron(Species ~ ., layers = c(4, 3, 3))

# support vector machine (binary). Multiclass classification is not implemented yet. When it is, I'll update this tutorial
svm_model <- iris_bin_training %>%
  ml_linear_svc(Species ~ .)

Predicting

Remember, the only thing you have to do is use pred <- sdf_predict(test_data, your_model). Let’s check three examples.

pred_lm <- sdf_predict(mtcars_test, lm_model) # linear regression - regression
pred_gbt_bin <- sdf_predict(iris_bin_test, gbt_model_bin) # gradient tree boosting - binary
pred_nb <- sdf_predict(iris_test, nb_model)   # naive bayes - multiclass

For the other models, just do the same…try it and have fun ;)
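In case you want to see them written out, the remaining models from the training section follow the exact same pattern; the only care needed is matching each model to the test set it was trained against:

```r
# same sdf_predict(test_data, model) pattern for every remaining model
pred_dt      <- sdf_predict(iris_test, dt_model)          # decision tree - multiclass
pred_rf      <- sdf_predict(iris_test, rf_model_class)    # random forest - multiclass
pred_rf_reg  <- sdf_predict(mtcars_test, rf_model_reg)    # random forest - regression
pred_gbt_reg <- sdf_predict(mtcars_test, gbt_model_reg)   # gradient boosted tree - regression
pred_lr      <- sdf_predict(iris_bin_test, lr_model)      # logistic regression - binary
pred_mlp     <- sdf_predict(iris_test, mlp_model)         # multilayer perceptron - multiclass
pred_svm     <- sdf_predict(iris_bin_test, svm_model)     # svm - binary
```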

Evaluating

In this part, I have to give you some details. For example, let’s take a look at ml_binary_classification_eval().

ml_binary_classification_eval(x, label_col = "label",
  prediction_col = "prediction", metric_name = "areaUnderROC")

Note that you can pass the parameters label_col and prediction_col. You don’t need to “worry” about them, because sparklyr creates these columns with the same default names when you use sdf_predict(). Look:

dplyr::glimpse(pred_nb)
## Observations: ??
## Variables: 14
## $ Sepal_Length           <dbl> 4.3, 4.4, 4.4, 4.7, 4.8, 4.9, 4.9, 5.0,...
## $ Sepal_Width            <dbl> 3.0, 2.9, 3.2, 3.2, 3.1, 2.5, 3.1, 3.0,...
## $ Petal_Length           <dbl> 1.1, 1.4, 1.3, 1.3, 1.6, 4.5, 1.5, 1.6,...
## $ Petal_Width            <dbl> 0.1, 0.2, 0.2, 0.2, 0.2, 1.7, 0.2, 0.2,...
## $ Species                <chr> "setosa", "setosa", "setosa", "setosa",...
## $ features               <list> [<4.3, 3.0, 1.1, 0.1>, <4.4, 2.9, 1.4,...
## $ label                  <dbl> 2, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2, 2, 2, ...
## $ rawPrediction          <list> [<-11.881783, -11.384971, -9.929243>, ...
## $ probability            <list> [<0.1031988, 0.1696044, 0.7271968>, <0...
## $ prediction             <dbl> 2, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2, 2, 2, ...
## $ predicted_label        <chr> "setosa", "setosa", "setosa", "setosa",...
## $ probability_virginica  <dbl> 0.10319876, 0.13955037, 0.11483333, 0.1...
## $ probability_versicolor <dbl> 0.1696044, 0.2175782, 0.1869394, 0.1807...
## $ probability_setosa     <dbl> 0.727196821, 0.642871385, 0.698227263, ...

Did you see? pred_nb contains columns named label and prediction. That is why you just need to pass pred_nb to your evaluator. Easy, isn’t it? Take a look at these examples:

# gradient tree boosting
ml_binary_classification_evaluator(pred_gbt_bin)
## [1] 0.825

# naive bayes
ml_multiclass_classification_evaluator(pred_nb)
## [1] 0.9184959

But why am I putting “worry” in quotation marks? It’s because some models don’t create the label column, for example, ml_linear_regression(). Let’s check out pred_lm!

dplyr::glimpse(pred_lm)
## Observations: ??
## Variables: 12
## $ mpg        <dbl> 10.4, 10.4, 14.3, 15.8, 18.1, 19.7, 21.4, 22.8
## $ cyl        <dbl> 8, 8, 8, 8, 6, 6, 4, 4
## $ disp       <dbl> 460.0, 472.0, 360.0, 351.0, 225.0, 145.0, 121.0, 140.8
## $ hp         <dbl> 215, 205, 245, 264, 105, 175, 109, 95
## $ drat       <dbl> 3.00, 2.93, 3.21, 4.22, 2.76, 3.62, 4.11, 3.92
## $ wt         <dbl> 5.424, 5.250, 3.570, 3.170, 3.460, 2.770, 2.780, 3.150
## $ qsec       <dbl> 17.82, 17.98, 15.84, 14.50, 20.22, 15.50, 18.60, 22.90
## $ vs         <dbl> 0, 0, 0, 0, 1, 0, 1, 1
## $ am         <dbl> 0, 0, 0, 1, 0, 1, 1, 0
## $ gear       <dbl> 3, 3, 3, 5, 3, 5, 4, 4
## $ carb       <dbl> 4, 4, 4, 4, 1, 6, 2, 2
## $ prediction <dbl> 14.63299, 15.72502, 13.12466, 24.98213, 21.58216, 1...

OMG!!! What am I supposed to do? Easy, just pass label_col = "your_output" as a parameter; in this case, mpg:

ml_regression_evaluator(pred_lm, label_col = "mpg")
## [1] 5.362564

That’s all folks

Liked it? You can share this tutorial using the buttons below. If you want to contribute to my website, you can fork me on GitHub. If you still have any doubts, feel free to contact me at samuelmacedo@recife.ifpe.edu.br.

See ya!