Developing with Apache Spark

Launching apps on a cluster from Eclipse

In this post we show that you do not need to build and deploy your code with spark-submit every time you want to test a change: you can launch it faster from your remote machine directly from Eclipse, both in Java and in Scala.

Let’s suppose that we have a cluster set up with YARN and Apache Spark, and a remote machine with access to this cluster (no firewall, or the appropriate ports open) where we will develop our code with Eclipse. (We are using the Spark 1.4.1 libraries with Hortonworks 2.3.2.)

Under these assumptions, we would normally develop in Eclipse and debug locally until the code is ready to send to the cluster. This is the point where many developers think they need to build the jar files, copy them to the cluster, and test with spark-submit. Working like this means repeating the whole process for every change we want to test, which adds up to a huge amount of time over the life of the project; it is not efficient at all. If the driver program runs locally, we only need remote debugging when we want to debug the executors, so we recommend submitting the code (while debugging) from Eclipse. To do this, we need the following:

  1. Set up our Spark code to run as yarn-client:

     val conf = new SparkConf().setMaster("yarn-client").setAppName("app")
     val sc = new SparkContext(conf)

  2. Use the spark-yarn libraries.
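For step 2, the spark-yarn libraries can be pulled in through Maven; a minimal dependency sketch, assuming the Spark 1.4.1 / Scala 2.10 build mentioned above:

```xml
<!-- Coordinates assume the Spark 1.4.1 / Scala 2.10 build used in this post -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-yarn_2.10</artifactId>
  <version>1.4.1</version>
</dependency>
```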

Once we have this, we compile the code (with Maven, for instance) and generate a jar. We then run it with the run configuration shown in the picture.


As we can see, we add the Hadoop configuration files to our classpath (yarn-site, core-site, hdfs-site) and include the location where the jar will be created.
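The relevant entries of that run configuration look roughly like this (the paths are illustrative; they depend on your Hadoop installation and on where Maven places the jar):

```
Classpath (user entries):
    /etc/hadoop/conf      <- directory containing yarn-site.xml, core-site.xml, hdfs-site.xml
    target/our-app.jar    <- jar generated by the Maven build
```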

If the cluster is well configured and all the Spark libraries are on our classpath, the code should run from Eclipse (with the driver running locally). This way we can debug from Eclipse while the distributed functions run on every node of our cluster.

There are times when Spark does not load its libraries properly: for example, when we run our app and get a NullPointerException in the Eclipse logs, or a ClassNotFoundException in the cluster logs (actually YARN’s). (We have seen this with org.apache.spark.Logging, although it may not happen with other versions.) To sort this out, we point the spark.yarn.jar property at the Spark assembly jar for our version (this jar is available from the Spark installation on our cluster). Thus we would have:


(We recommend setting this and other properties via an XML configuration file rather than in the code; we set it explicitly here for illustration only.)
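As an illustration of that property (the HDFS path below is hypothetical; use the assembly jar that ships with your cluster’s own Spark installation):

```scala
// Hypothetical location; point this at the assembly jar of your installation
conf.set("spark.yarn.jar",
  "hdfs:///apps/spark/spark-assembly-1.4.1-hadoop2.6.0.jar")
```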

With this, we can run our code distributed across the cluster and debug all the driver code from Eclipse without recompiling. Everything, that is, except the functions that run distributed: if we need to change one of those, we must recompile so that the jar used by the workers is the new one.


Debugging an executor

Finally, if we want to debug an executor, we have to use the remote debugging that Eclipse provides. To do this, we set up remote debugging on an available port (for instance 10000) and start debugging in listening mode. After that, we run the code with this configuration:

conf.set("spark.executor.extraJavaOptions","-Xdebug -Xrunjdwp:transport=dt_socket,server=n,address=cliente01.pragsis.local:10000,suspend=n")


As you can see, we use only one executor so that there is no interference while debugging.
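One way to request that single executor is the standard YARN property shown below (a sketch; adjust the value if you need more executors outside debugging sessions):

```scala
// Ask YARN for a single executor so the remote debugger always attaches to the same JVM
conf.set("spark.executor.instances", "1")
```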

With all this in place, we can run our Spark app from Eclipse and debug both the driver and the executors.



Further reading: How To Debug a Remote Java Application.


Contributor: Óliver Caballero, Big Data architect.

Published: April 2016