Questions tagged [apache-spark]

Apache Spark is an open-source distributed data processing engine, written in Scala, that provides a unified API and distributed datasets for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrary operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

Spark is not tied to the two-stage MapReduce paradigm, and it promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive use as well as to iterative algorithms in machine learning or graph computing.

Spark can be used to tackle stream processing problems with many approaches: micro-batch processing, continuous processing (since 2.3), running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on; a minimal sketch of the micro-batch approach follows.
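
As an illustration of the micro-batch approach, here is a minimal Structured Streaming sketch in PySpark; the socket source and console sink are demo-only choices, and the host/port values are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

    # Demo source: read lines of text from a local socket.
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Classic streaming word count, recomputed every micro-batch.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Demo sink: print each batch's full result to stdout.
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()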

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, to specify the Spark version you're using (behavior often differs between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.

Recommended reference sources:

- Latest version
- Release Notes for Stable Releases
- Apache Spark GitHub Repository

81095 questions
410 votes, 20 answers

Spark - repartition() vs coalesce()

According to Learning Spark: "Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the…"
Praveen Sripati
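
A minimal PySpark sketch of the difference (the partition counts are illustrative): repartition() performs a full shuffle and can raise or lower the partition count, while coalesce() merely merges existing partitions, so it avoids a full shuffle but can only lower it.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("repartition-vs-coalesce").getOrCreate()

    df = spark.range(1_000_000)
    print(df.rdd.getNumPartitions())   # current partition count

    # Full shuffle: redistributes every row; count can go up or down.
    df_up = df.repartition(200)

    # No full shuffle: merges existing partitions; count can only go down.
    df_down = df.coalesce(10)
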
340 votes, 14 answers

Difference between DataFrame, Dataset, and RDD in Spark

I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]) in Apache Spark. Can you convert one to the other?
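
Dataset[T] itself exists only on the JVM side (Scala/Java); in PySpark you work with DataFrames and RDDs, and converting between the two is straightforward. A small sketch with illustrative data:

    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.appName("df-vs-rdd").getOrCreate()

    # DataFrame -> RDD of Row objects
    df = spark.createDataFrame([Row(name="a", age=1), Row(name="b", age=2)])
    rdd = df.rdd

    # RDD -> DataFrame (schema inferred from the Row fields)
    df2 = spark.createDataFrame(rdd)
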
327 votes, 25 answers

How to change dataframe column names in PySpark?

I come from a pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful using the simple command: df.columns = new_column_name_list. However, the same doesn't work in…
Shubhanshu Mishra
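
There is no direct df.columns = ... assignment in PySpark, but two common idioms cover it (the column names here are illustrative):

    df = spark.createDataFrame([(1, "a")], ["_1", "_2"])

    # Rename columns one at a time
    df = df.withColumnRenamed("_1", "id").withColumnRenamed("_2", "label")

    # Or replace all names at once, closest to the pandas style
    new_column_name_list = ["id", "label"]
    df = df.toDF(*new_column_name_list)
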
316 votes, 17 answers

How to show full column content in a Spark Dataframe?

I am using spark-csv to load data into a DataFrame. I want to do a simple query and display the content: val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("my.csv") df.registerTempTable("tasks") results =…
tracer
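
By default show() truncates every cell at 20 characters; the truncate parameter controls this. A quick sketch:

    df.show(truncate=False)       # print full cell contents
    df.show(n=50, truncate=100)   # first 50 rows, cells truncated at 100 chars
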
306 votes, 17 answers

What is the difference between map and flatMap and a good use case for each?

Can someone explain to me the difference between map and flatMap and what is a good use case for each? What does "flatten the results" mean? What is it good for?
Eran Witkon
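
A small PySpark sketch: map produces exactly one output element per input element, while flatMap can produce zero or more and flattens the results into a single collection.

    rdd = spark.sparkContext.parallelize(["hello world", "hi"])

    # map: one output per input -> a list of lists
    rdd.map(lambda line: line.split(" ")).collect()
    # [['hello', 'world'], ['hi']]

    # flatMap: 0..n outputs per input, flattened into one list
    rdd.flatMap(lambda line: line.split(" ")).collect()
    # ['hello', 'world', 'hi']
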
296 votes, 2 answers

What are workers, executors, cores in Spark Standalone cluster?

I read the Cluster Mode Overview and I still can't understand the different processes in the Spark Standalone cluster and the parallelism. Is the worker a JVM process or not? I ran bin\start-slave.sh and found that it spawned the worker, which is…
Manikandan Kannan
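
In short: a worker is a JVM process on each node that launches executors, and an executor is a JVM that runs tasks on its allotted cores. One way to inspect what a running application actually got (a sketch; the config keys fall back to a placeholder when unset):

    sc = spark.sparkContext

    print(sc.defaultParallelism)  # roughly: total cores across all executors
    print(sc.getConf().get("spark.executor.cores", "not set"))
    print(sc.getConf().get("spark.executor.memory", "not set"))
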
290 votes, 14 answers

Spark java.lang.OutOfMemoryError: Java heap space

My cluster: 1 master, 11 slaves, each node has 6 GB memory. My settings: spark.executor.memory=4g, -Dspark.akka.frameSize=512. Here is the problem: First, I read some data (2.19 GB) from HDFS to RDD: val imageBundleRDD =…
Hellen
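
Heap-space errors are usually about per-executor memory (and about collecting too much to the driver), not the cluster total. A hedged configuration sketch, with illustrative values that must fit each node:

    spark = (SparkSession.builder
             .appName("memory-tuning")
             # Per-executor heap; must fit within each node's memory.
             .config("spark.executor.memory", "4g")
             .getOrCreate())

    # collect() pulls the entire dataset into the driver's heap;
    # prefer take(n) or writing results out instead.
    sample = spark.range(10**8).take(5)

Note that the driver's own heap generally has to be set before its JVM starts, e.g. via spark-submit --driver-memory.
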
256 votes, 11 answers

Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects

Getting strange behavior when calling a function outside of a closure: when the function is in an object, everything works; when the function is in a class, I get: Task not serializable: java.io.NotSerializableException: testing. The problem is I need my…
Nimrod007
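
The question itself is about Scala/Java serialization, but the same closure-capture pitfall exists in PySpark (surfacing as a pickling error rather than a NotSerializableException). A rough sketch of the pattern and the usual fix:

    class Processor:
        def __init__(self):
            self.offset = 10
            self.conn = open("/dev/null")  # stands in for any unpicklable member

        def bad(self, rdd):
            # Referencing self inside the lambda ships the whole object,
            # including self.conn, to the executors -> serialization error.
            return rdd.map(lambda x: x + self.offset)

        def good(self, rdd):
            offset = self.offset  # copy just the needed value into a local
            return rdd.map(lambda x: x + offset)

The Scala fixes are analogous: move the function into an object, mark non-serializable fields @transient, or copy the needed values into local vals before the closure.
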
251 votes, 9 answers

Apache Spark: The number of cores vs. the number of executors

I'm trying to understand the relationship of the number of cores and the number of executors when running a Spark job on YARN. The test environment is as follows: Number of data nodes: 3 Data node machine spec: CPU: Core i7-4790 (# of cores: 4, #…
zeodtr
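
These knobs are usually passed to spark-submit as --num-executors, --executor-cores, and --executor-memory; the equivalent config keys can also be set when the session is built. The values below are illustrative, not a recommendation (a commonly cited rule of thumb keeps executors at around 5 cores for good HDFS throughput):

    spark = (SparkSession.builder
             .appName("resource-sizing")
             # Equivalent to --num-executors / --executor-cores / --executor-memory
             .config("spark.executor.instances", "5")
             .config("spark.executor.cores", "5")
             .config("spark.executor.memory", "16g")
             .getOrCreate())
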
239 votes, 6 answers

What is the difference between cache and persist?

In terms of RDD persistence, what are the differences between cache() and persist() in Spark?
user1261215
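
cache() is just persist() with the default storage level (MEMORY_ONLY for RDDs; DataFrames default to a memory-and-disk level), while persist() lets you pick the level explicitly. A small sketch:

    from pyspark import StorageLevel

    rdd1 = spark.sparkContext.parallelize(range(1000))
    rdd1.cache()  # same as rdd1.persist(StorageLevel.MEMORY_ONLY)

    rdd2 = spark.sparkContext.parallelize(range(1000))
    rdd2.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk if it doesn't fit
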
224 votes, 21 answers

How to stop INFO messages displaying on the Spark console?

I'd like to stop various messages that are coming on the Spark shell. I tried to edit the log4j.properties file in order to stop these messages. Here are the contents of log4j.properties: # Define the root logger with appender…
Vishwas
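
Besides editing log4j.properties, the log level can also be changed per application at runtime:

    # Suppress INFO/DEBUG chatter for this application only.
    spark.sparkContext.setLogLevel("WARN")
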
215 votes, 1 answer

Spark performance for Scala vs Python

I prefer Python over Scala. But, as Spark is natively written in Scala, I was expecting my code to run faster in Scala than in the Python version for obvious reasons. With that assumption, I thought to learn & write the Scala version of some very…
Mrityunjay
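
The usual explanation is that built-in DataFrame expressions run inside the JVM regardless of the front-end language, while RDD lambdas and Python UDFs ship every row through Python worker processes. A sketch of the two styles:

    from pyspark.sql import functions as F

    df = spark.range(10**7)

    # Runs in the JVM: Python only assembles the query plan.
    fast = df.select((F.col("id") * 2).alias("doubled"))

    # Serializes every row out to a Python worker and back.
    slow = df.rdd.map(lambda row: row.id * 2)
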
209
votes
7 answers

Add JAR files to a Spark job - spark-submit

True... it has been discussed quite a lot. However, there is a lot of ambiguity, and some of the answers provided ... including duplicating JAR references in the jars/executor/driver configuration or options. The ambiguous and/or omitted details: The…
YoYo
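
The most common route is the --jars flag of spark-submit (a comma-separated list shipped to both driver and executors); the same thing can be set as a config key when building the session. The JAR path below is a placeholder:

    spark = (SparkSession.builder
             .appName("with-extra-jars")
             # Comma-separated; distributed to the driver and all executors.
             .config("spark.jars", "/path/to/extra-lib.jar")
             .getOrCreate())
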
205 votes, 3 answers

How to add a constant column in a Spark DataFrame?

I want to add a column in a DataFrame with some arbitrary value (that is the same for each row). I get an error when I use withColumn as follows: dt.withColumn('new_column',…
Evan Zamir
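
withColumn expects a Column expression, not a bare Python value; wrapping the constant in lit() is the standard fix:

    from pyspark.sql import functions as F

    df = spark.range(3)
    # lit() turns a plain Python value into a Column expression.
    df = df.withColumn("new_column", F.lit(10))
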
199 votes, 14 answers

Show distinct column values in pyspark dataframe

With a pyspark dataframe, how do you do the equivalent of Pandas df['col'].unique()? I want to list out all the unique values in a pyspark dataframe column. Not the SQL way (registerTempTable and then a SQL query for distinct values). Also I don't need…
Satya
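
A short sketch (the collect() step pulls the values to the driver, so it is only sensible for low-cardinality columns):

    df = spark.createDataFrame([("a",), ("b",), ("a",)], ["col"])

    df.select("col").distinct().show()

    # As a plain Python list:
    values = [row["col"] for row in df.select("col").distinct().collect()]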