Questions tagged [apache-spark]

Apache Spark is an open-source distributed data processing engine, written in Scala, that provides a unified API and distributed datasets for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrary operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

Spark is not tied to the two-stage MapReduce paradigm, and it promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive use as well as to iterative algorithms in machine learning or graph computing.

Spark can be used to tackle stream processing problems with many approaches: micro-batch processing, continuous processing (since 2.3), running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on; a minimal sketch of the micro-batch approach follows.
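
As an illustration of the micro-batch approach, here is a minimal Structured Streaming sketch in PySpark; the socket source and console sink are demo-only choices, and the host/port values are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

    # Demo source: read lines of text from a local socket.
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Classic streaming word count, recomputed every micro-batch.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Demo sink: print each batch's full result to stdout.
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()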

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, to specify the Spark version you're using (behavior often differs between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.

Recommended reference sources:

- Latest version
- Release Notes for Stable Releases
- Apache Spark GitHub Repository

81095 questions
410 votes, 20 answers

Spark - repartition() vs coalesce()

According to Learning Spark: "Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the…"
Praveen Sripati
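
A minimal PySpark sketch of the difference (the partition counts are illustrative): repartition() performs a full shuffle and can raise or lower the partition count, while coalesce() merely merges existing partitions, so it avoids a full shuffle but can only lower it.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("repartition-vs-coalesce").getOrCreate()

    df = spark.range(1_000_000)
    print(df.rdd.getNumPartitions())   # current partition count

    # Full shuffle: redistributes every row; count can go up or down.
    df_up = df.repartition(200)

    # No full shuffle: merges existing partitions; count can only go down.
    df_down = df.coalesce(10)
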
340 votes, 14 answers

Difference between DataFrame, Dataset, and RDD in Spark

I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]) in Apache Spark. Can you convert one to the other?
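
Dataset[T] itself exists only on the JVM side (Scala/Java); in PySpark you work with DataFrames and RDDs, and converting between the two is straightforward. A small sketch with illustrative data:

    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.appName("df-vs-rdd").getOrCreate()

    # DataFrame -> RDD of Row objects
    df = spark.createDataFrame([Row(name="a", age=1), Row(name="b", age=2)])
    rdd = df.rdd

    # RDD -> DataFrame (schema inferred from the Row fields)
    df2 = spark.createDataFrame(rdd)
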
327 votes, 25 answers

How to change dataframe column names in PySpark?

I come from a pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful using the simple command: df.columns = new_column_name_list. However, the same doesn't work in…
Shubhanshu Mishra
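
There is no direct df.columns = ... assignment in PySpark, but two common idioms cover it (the column names here are illustrative):

    df = spark.createDataFrame([(1, "a")], ["_1", "_2"])

    # Rename columns one at a time
    df = df.withColumnRenamed("_1", "id").withColumnRenamed("_2", "label")

    # Or replace all names at once, closest to the pandas style
    new_column_name_list = ["id", "label"]
    df = df.toDF(*new_column_name_list)
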
316 votes, 17 answers

How to show full column content in a Spark Dataframe?

I am using spark-csv to load data into a DataFrame. I want to do a simple query and display the content: val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("my.csv") df.registerTempTable("tasks") results =…
tracer
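
By default show() truncates every cell at 20 characters; the truncate parameter controls this. A quick sketch:

    df.show(truncate=False)       # print full cell contents
    df.show(n=50, truncate=100)   # first 50 rows, cells truncated at 100 chars
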
306 votes, 17 answers

What is the difference between map and flatMap and a good use case for each?

Can someone explain to me the difference between map and flatMap and what is a good use case for each? What does "flatten the results" mean? What is it good for?
Eran Witkon
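
A small PySpark sketch: map produces exactly one output element per input element, while flatMap can produce zero or more and flattens the results into a single collection.

    rdd = spark.sparkContext.parallelize(["hello world", "hi"])

    # map: one output per input -> a list of lists
    rdd.map(lambda line: line.split(" ")).collect()
    # [['hello', 'world'], ['hi']]

    # flatMap: 0..n outputs per input, flattened into one list
    rdd.flatMap(lambda line: line.split(" ")).collect()
    # ['hello', 'world', 'hi']
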
296 votes, 2 answers

What are workers, executors, cores in Spark Standalone cluster?

I read the Cluster Mode Overview and I still can't understand the different processes in the Spark Standalone cluster and the parallelism. Is the worker a JVM process or not? I ran bin\start-slave.sh and found that it spawned the worker, which is…
Manikandan Kannan
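
In short: a worker is a JVM process on each node that launches executors, and an executor is a JVM that runs tasks on its allotted cores. One way to inspect what a running application actually got (a sketch; the config keys fall back to a placeholder when unset):

    sc = spark.sparkContext

    print(sc.defaultParallelism)  # roughly: total cores across all executors
    print(sc.getConf().get("spark.executor.cores", "not set"))
    print(sc.getConf().get("spark.executor.memory", "not set"))
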
290 votes, 14 answers

Spark java.lang.OutOfMemoryError: Java heap space

My cluster: 1 master, 11 slaves, each node has 6 GB memory. My settings: spark.executor.memory=4g, -Dspark.akka.frameSize=512. Here is the problem: First, I read some data (2.19 GB) from HDFS to RDD: val imageBundleRDD =…
Hellen
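
Heap-space errors are usually about per-executor memory (and about collecting too much to the driver), not the cluster total. A hedged configuration sketch, with illustrative values that must fit each node:

    spark = (SparkSession.builder
             .appName("memory-tuning")
             # Per-executor heap; must fit within each node's memory.
             .config("spark.executor.memory", "4g")
             .getOrCreate())

    # collect() pulls the entire dataset into the driver's heap;
    # prefer take(n) or writing results out instead.
    sample = spark.range(10**8).take(5)

Note that the driver's own heap generally has to be set before its JVM starts, e.g. via spark-submit --driver-memory.
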
256 votes, 11 answers

Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects

Getting strange behavior when calling a function outside of a closure: when the function is in an object, everything works; when the function is in a class, I get: Task not serializable: java.io.NotSerializableException: testing. The problem is I need my…
Nimrod007
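
The question itself is about Scala/Java serialization, but the same closure-capture pitfall exists in PySpark (surfacing as a pickling error rather than a NotSerializableException). A rough sketch of the pattern and the usual fix:

    class Processor:
        def __init__(self):
            self.offset = 10
            self.conn = open("/dev/null")  # stands in for any unpicklable member

        def bad(self, rdd):
            # Referencing self inside the lambda ships the whole object,
            # including self.conn, to the executors -> serialization error.
            return rdd.map(lambda x: x + self.offset)

        def good(self, rdd):
            offset = self.offset  # copy just the needed value into a local
            return rdd.map(lambda x: x + offset)

The Scala fixes are analogous: move the function into an object, mark non-serializable fields @transient, or copy the needed values into local vals before the closure.
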
251 votes, 9 answers

Apache Spark: The number of cores vs. the number of executors

I'm trying to understand the relationship of the number of cores and the number of executors when running a Spark job on YARN. The test environment is as follows: Number of data nodes: 3 Data node machine spec: CPU: Core i7-4790 (# of cores: 4, #…
zeodtr
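
These knobs are usually passed to spark-submit as --num-executors, --executor-cores, and --executor-memory; the equivalent config keys can also be set when the session is built. The values below are illustrative, not a recommendation (a commonly cited rule of thumb keeps executors at around 5 cores for good HDFS throughput):

    spark = (SparkSession.builder
             .appName("resource-sizing")
             # Equivalent to --num-executors / --executor-cores / --executor-memory
             .config("spark.executor.instances", "5")
             .config("spark.executor.cores", "5")
             .config("spark.executor.memory", "16g")
             .getOrCreate())
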
239 votes, 6 answers

What is the difference between cache and persist?

In terms of RDD persistence, what are the differences between cache() and persist() in Spark?
user1261215
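
cache() is just persist() with the default storage level (MEMORY_ONLY for RDDs; DataFrames default to a memory-and-disk level), while persist() lets you pick the level explicitly. A small sketch:

    from pyspark import StorageLevel

    rdd1 = spark.sparkContext.parallelize(range(1000))
    rdd1.cache()  # same as rdd1.persist(StorageLevel.MEMORY_ONLY)

    rdd2 = spark.sparkContext.parallelize(range(1000))
    rdd2.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk if it doesn't fit
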
224 votes, 21 answers

How to stop INFO messages displaying on the Spark console?

I'd like to stop various messages that are coming on the Spark shell. I tried to edit the log4j.properties file in order to stop these messages. Here are the contents of log4j.properties: # Define the root logger with appender…
Vishwas
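
Besides editing log4j.properties, the log level can also be changed per application at runtime:

    # Suppress INFO/DEBUG chatter for this application only.
    spark.sparkContext.setLogLevel("WARN")
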
215 votes, 1 answer

Spark performance for Scala vs Python

I prefer Python over Scala. But, as Spark is natively written in Scala, I was expecting my code to run faster in Scala than in the Python version for obvious reasons. With that assumption, I thought to learn & write the Scala version of some very…
Mrityunjay
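
The usual explanation is that built-in DataFrame expressions run inside the JVM regardless of the front-end language, while RDD lambdas and Python UDFs ship every row through Python worker processes. A sketch of the two styles:

    from pyspark.sql import functions as F

    df = spark.range(10**7)

    # Runs in the JVM: Python only assembles the query plan.
    fast = df.select((F.col("id") * 2).alias("doubled"))

    # Serializes every row out to a Python worker and back.
    slow = df.rdd.map(lambda row: row.id * 2)
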
209
votes
7 answers

Add JAR files to a Spark job - spark-submit

True... it has been discussed quite a lot. However, there is a lot of ambiguity, and some of the answers provided ... including duplicating JAR references in the jars/executor/driver configuration or options. The ambiguous and/or omitted details: The…
YoYo
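
The most common route is the --jars flag of spark-submit (a comma-separated list shipped to both driver and executors); the same thing can be set as a config key when building the session. The JAR path below is a placeholder:

    spark = (SparkSession.builder
             .appName("with-extra-jars")
             # Comma-separated; distributed to the driver and all executors.
             .config("spark.jars", "/path/to/extra-lib.jar")
             .getOrCreate())
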
205 votes, 3 answers

How to add a constant column in a Spark DataFrame?

I want to add a column in a DataFrame with some arbitrary value (that is the same for each row). I get an error when I use withColumn as follows: dt.withColumn('new_column',…
Evan Zamir
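
withColumn expects a Column expression, not a bare Python value; wrapping the constant in lit() is the standard fix:

    from pyspark.sql import functions as F

    df = spark.range(3)
    # lit() turns a plain Python value into a Column expression.
    df = df.withColumn("new_column", F.lit(10))
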
199 votes, 14 answers

Show distinct column values in pyspark dataframe

With a pyspark dataframe, how do you do the equivalent of Pandas df['col'].unique()? I want to list out all the unique values in a pyspark dataframe column. Not the SQL way (registerTempTable and then a SQL query for distinct values). Also I don't need…
Satya
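
A short sketch (the collect() step pulls the values to the driver, so it is only sensible for low-cardinality columns):

    df = spark.createDataFrame([("a",), ("b",), ("a",)], ["col"])

    df.select("col").distinct().show()

    # As a plain Python list:
    values = [row["col"] for row in df.select("col").distinct().collect()]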