Installing and running Apache Spark (SparkR) from scratch

I assume that you use R 3.x.x (3.5-devel in my case) and RStudio > 1.1.x (1.2.308 in my case).
Updated on 25/3/2018 for the latest version of Spark.

# In RStudio, first install the package HelpersMG from CRAN, then update it from the author's repository:

install.packages("HelpersMG")
install.packages("http://www.ese.u-psud.fr/epc/conservation/CRAN/HelpersMG.tar.gz", repos=NULL, type="source")

# Then load the HelpersMG library, download the latest version of Spark and extract it:

library("HelpersMG")
wget("http://apache.crihan.fr/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz")
system("tar -xvzf spark-2.3.0-bin-hadoop2.7.tgz")
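
# The version is hard-coded in the two lines above. As a minimal sketch (the variable names
# spark_version, hadoop_version, spark_dir, spark_archive and spark_url are mine, not part of
# the original script), the file name and URL can be built from the version numbers so that
# only one place changes at the next update. The download with base R is left commented out
# because the mirror may not keep every version available.

```r
# Hypothetical variables: adapt the versions to the release you want
spark_version  <- "2.3.0"
hadoop_version <- "2.7"

# Directory created by the extraction, archive name, and URL on the mirror used above
spark_dir     <- paste0("spark-", spark_version, "-bin-hadoop", hadoop_version)
spark_archive <- paste0(spark_dir, ".tgz")
spark_url     <- paste0("http://apache.crihan.fr/dist/spark/spark-",
                        spark_version, "/", spark_archive)

# Base R alternative to wget() and tar (uncomment to run):
# download.file(spark_url, destfile = spark_archive, mode = "wb")
# untar(spark_archive)
```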

# Change the .profile and .Rprofile for future use

SPARK_HOME <- file.path(getwd(), "spark-2.3.0-bin-hadoop2.7")

HOME <- Sys.getenv("HOME")

# Append SPARK_HOME to ~/.profile (the file must already exist)
fileConn <- file(file.path(HOME, ".profile"))
total <- readLines(fileConn)
writeLines(c(total, paste0('SPARK_HOME="', SPARK_HOME, '"'), "export SPARK_HOME"), con = fileConn)
close(fileConn)

Sys.setenv(SPARK_HOME = SPARK_HOME)

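# Append a guarded Sys.setenv() call to ~/.Rprofile (created first if it does not exist yet)
rprofile_path <- file.path(HOME, ".Rprofile")
if (!file.exists(rprofile_path)) file.create(rprofile_path)
fileConn <- file(rprofile_path)
total <- readLines(fileConn)
writeLines(c(total,
             'if (nchar(Sys.getenv("SPARK_HOME")) < 1) {',
             paste0('  Sys.setenv(SPARK_HOME = "', SPARK_HOME, '")'),
             '}'),
           con = fileConn)
close(fileConn)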
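
# Running the two blocks above several times appends the same lines again each time.
# A minimal sketch of a way to avoid that, assuming a hypothetical helper append_once()
# (my name, not a function from HelpersMG or SparkR) that only writes the lines
# not already present in the file. It is demonstrated on a temporary file.

```r
# Hypothetical helper: append each line to the file only if it is not there yet
append_once <- function(path, lines) {
  existing <- if (file.exists(path)) readLines(path, warn = FALSE) else character(0)
  to_add <- lines[!(lines %in% existing)]
  if (length(to_add) > 0) {
    writeLines(c(existing, to_add), con = path)
  }
  invisible(to_add)
}

# Demonstration on a temporary file, so nothing in your home directory is touched
tmp <- tempfile()
new_lines <- c('SPARK_HOME="/path/to/spark"', "export SPARK_HOME")
append_once(tmp, new_lines)
append_once(tmp, new_lines)  # the second call finds both lines and adds nothing
readLines(tmp)
```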

# Now install the SparkR package shipped with Spark:

install.packages(file.path(SPARK_HOME, "R", "lib", "SparkR"), repos=NULL)


# Spark is now ready to be used
# Now start the master on your computer
# If you are coming back after a previous session, just start from here

SPARK_HOME <- Sys.getenv("SPARK_HOME")

system(paste0(file.path(SPARK_HOME, "sbin", "stop-master.sh"), ";", file.path(SPARK_HOME, "sbin", "start-master.sh")))

# And run a slave (worker) on your computer, just to test

x <- system("ifconfig", intern=TRUE)
IP <- rev(gsub("^(.*) ([0-9\\.]+) (.*)$", "\\2", x[grep("inet ", x)]))[1]
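
# The pattern above keeps the last purely numeric token that is surrounded by single spaces,
# which fits the ifconfig layout I get on my computer. On layouts where the netmask is also
# printed as a dotted address it may pick the wrong field. A minimal sketch of a stricter
# variant, assuming a hypothetical helper extract_ipv4() (my name) and made-up sample lines
# that only imitate ifconfig output:

```r
# Hypothetical helper: return the IPv4 address that follows "inet ", skipping loopback
extract_ipv4 <- function(lines) {
  inet <- grep("inet ", lines, value = TRUE, fixed = TRUE)
  # Keep the dotted quad right after "inet " (some layouts write "inet addr:")
  ips <- sub("^.*inet (addr:)?([0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+).*$", "\\2", inet)
  ips <- ips[grepl("^[0-9.]+$", ips)]
  ips[!startsWith(ips, "127.")]
}

# Illustrative lines only, not real output from any machine
sample_lines <- c("\tinet 127.0.0.1 netmask 0xff000000",
                  "\tinet6 fe80::1%lo0 prefixlen 64",
                  "\tinet 192.168.1.12 netmask 0xffffff00 broadcast 192.168.1.255",
                  "        inet 10.0.0.5  netmask 255.255.255.0  broadcast 10.0.0.255")
extract_ipv4(sample_lines)

# On a real machine: IP <- rev(extract_ipv4(system("ifconfig", intern = TRUE)))[1]
```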

system(paste0(file.path(SPARK_HOME, "sbin", "start-slave.sh"), " spark://", IP, ":7077"))


# Let's try to run a computation:

library("SparkR")

spark_link <- paste0("spark://", IP, ":7077")
sparkR.stop()
sc <- sparkR.session(master = spark_link,
                     appName = "Session name",
                     sparkConfig = list(spark.driver.memory = "2g"))

output <- spark.lapply(1:100, function(x) {x*2})
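
# spark.lapply() returns a list with one element per input value. A minimal sketch to check
# the result against the same computation done locally with lapply(); check_output() and
# expected are hypothetical names of mine, and the Spark calls are left as comments so the
# block also runs without a session.

```r
# Same computation done locally, used as the reference
expected <- lapply(1:100, function(x) {x * 2})

# Hypothetical helper: TRUE if the list has the expected length and values
check_output <- function(output) {
  is.list(output) && length(output) == length(expected) &&
    isTRUE(all.equal(unlist(output), unlist(expected)))
}

check_output(expected)

# With the session started above:
# check_output(output)
# sparkR.session.stop()
```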

Don't expect exceptional performance from such a configuration ;) But it works.
