24: RDD Source Code Analysis

Tags: SparkCore

1: Getting to Know Spark

Official site: http://spark.apache.org
Apache Spark is a unified analytics engine for large-scale data processing.
It has the following four characteristics:

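1.1: Speed

Compared with Hadoop, the programming model is different. MapReduce runs each task in its own process, and nearly every step writes its results to disk. Spark runs tasks as threads and pipelines the computation along a DAG.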
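The pipelining described above can be observed with a small local job. This is a minimal sketch, assuming Spark is on the classpath and a `local[2]` master; the object name `LazyPipelineDemo` and the accumulator name are made up for illustration. It counts how many records the `map` step has touched before and after the action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object, not part of Spark.
object LazyPipelineDemo {
  // Returns (records touched before the action, result, records touched after the action).
  def run(): (Long, List[Int], Long) = {
    val conf = new SparkConf().setAppName("LazyPipelineDemo").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val touched = sc.longAccumulator("touched")
      // map and filter are narrow transformations: at this point Spark only records them in the DAG.
      val pipeline = sc.parallelize(1 to 10, 2)
        .map { x => touched.add(1); x * 2 }
        .filter(_ % 3 == 0)
      val before = touched.sum
      // The action starts the job; each task streams its partition through map and then filter.
      val result = pipeline.collect().toList
      (before, result, touched.sum)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Nothing is computed until `collect()` is called, and no intermediate dataset is materialized between `map` and `filter`; both run inside the same task.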

1.2: Ease of use

Applications can be written in Scala, Python, Java, R, and SQL, and Spark offers more than 80 high-level operators.
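As an illustration of how compact the high-level operators are, here is a word count written with `flatMap`, `map`, and `reduceByKey`. It is only a sketch: the input lines and the object name `WordCountDemo` are invented, and it assumes a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object, not part of Spark.
object WordCountDemo {
  def run(): Map[String, Int] = {
    val conf = new SparkConf().setAppName("WordCountDemo").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    try {
      // Made-up input; a real job would typically read from a file or another data source.
      val lines = sc.parallelize(Seq("spark is fast", "spark is general"))
      lines
        .flatMap(_.split(" "))   // split each line into words
        .map(word => (word, 1))  // pair each word with a count of 1
        .reduceByKey(_ + _)      // sum the counts per word
        .collect()
        .toMap
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```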

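1.3: Generality

This shows in Spark's ecosystem stack: Spark SQL, Spark Streaming, MLlib, and GraphX all run on the same engine, so a wide range of problems can be handled effectively.

1.4: Runs everywhere

Spark runs on Hadoop (on YARN), Apache Mesos, Kubernetes (supported since Spark 2.3), standalone (Spark's own cluster manager), or in the cloud. It can access diverse data sources.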

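2: RDD Source Code

The Scaladoc on the RDD class reads:

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist.

Notes on the key terms:
- immutable: an RDD is never modified in place; a transformation such as map produces a new RDD.
- partitioned collection of elements: the data is split into partitions.
- operated on in parallel: you write the code as you would for a single machine, and Spark runs it over the partitions in parallel.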
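A short sketch of what "immutable" and "partitioned" mean in practice, assuming a local master; `ImmutabilityDemo` is a made-up name. Applying `map` leaves the source RDD untouched and returns a new one with the same number of partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object, not part of Spark.
object ImmutabilityDemo {
  // Returns (contents of the original RDD, contents of the mapped RDD, partition count).
  def run(): (List[Int], List[Int], Int) = {
    val conf = new SparkConf().setAppName("ImmutabilityDemo").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val numbers = sc.parallelize(1 to 6, 3) // 3 partitions, requested explicitly
      val doubled = numbers.map(_ * 2)        // a new RDD; `numbers` is not changed
      (numbers.collect().toList, doubled.collect().toList, doubled.getNumPartitions)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```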

2.1 Internally, each RDD is characterized by five main properties:

- A list of partitions
- A function for computing each split (partition)
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
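The properties listed above are visible through public members of any RDD: `partitions`, `dependencies`, `partitioner`, and `preferredLocations`. The following sketch inspects them on a small pair RDD before and after `partitionBy`. It assumes a local master, and the names `FivePropertiesDemo` and `Summary` are invented for the example.

```scala
import org.apache.spark.{HashPartitioner, OneToOneDependency, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical demo object, not part of Spark.
object FivePropertiesDemo {
  case class Summary(numPartitions: Int, dependencyKind: String, partitionerName: Option[String])

  private def describe(rdd: RDD[_]): Summary = {
    val kind = rdd.dependencies.headOption match {
      case Some(_: OneToOneDependency[_])      => "one-to-one (narrow)"
      case Some(_: ShuffleDependency[_, _, _]) => "shuffle (wide)"
      case Some(_)                             => "other"
      case None                                => "none"
    }
    Summary(rdd.partitions.length, kind, rdd.partitioner.map(_.getClass.getSimpleName))
  }

  // Returns (summary of the mapped RDD, summary of the repartitioned RDD,
  //          preferred locations of the first partition of the source RDD).
  def run(): (Summary, Summary, Seq[String]) = {
    val conf = new SparkConf().setAppName("FivePropertiesDemo").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
      val mapped = pairs.mapValues(_ + 1)                     // depends on `pairs` partition by partition
      val hashed = mapped.partitionBy(new HashPartitioner(4)) // introduces a shuffle and a partitioner
      // An in-memory collection is not expected to carry any location preference.
      val locations = pairs.preferredLocations(pairs.partitions.head)
      (describe(mapped), describe(hashed), locations)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

For an RDD backed by HDFS, the preferred locations would normally name the hosts that hold the corresponding blocks.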

RDD is an abstract class that extends Serializable and mixes in Logging:

abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {

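2.2 The five members that implement the five properties of an RDD:

Concrete subclasses such as HadoopRDD and JdbcRDD supply the actual implementations. compute and getPartitions are abstract; the other three have defaults that a subclass may override.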

  /**
   * Implemented by subclasses to compute a given partition.
   */
  def compute(split: Partition, context: TaskContext): Iterator[T]

  /**
   * Implemented by subclasses to return the set of partitions in this RDD. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   *
   * The partitions in this array must satisfy the following property:
   *   `rdd.partitions.zipWithIndex.forall { case (partition, index) => partition.index == index }`
   */
  protected def getPartitions: Array[Partition]

  /**
   * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   */
  protected def getDependencies: Seq[Dependency[_]] = deps

  /**
   * Optionally overridden by subclasses to specify placement preferences.
   */
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil

  /** Optionally overridden by subclasses to specify how they are partitioned. */
  @transient val partitioner: Option[Partitioner] = None

  // =======================================================================
  // Methods and fields available on all RDDs
  // =======================================================================

  /** The SparkContext that created this RDD. */
  def sparkContext: SparkContext = sc
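To make the role of these members concrete, here is a minimal sketch of a custom RDD that overrides only the two abstract ones, `getPartitions` and `compute`. The classes `SlicePartition` and `IntRangeRDD` are hypothetical and exist only for this example; real subclasses such as HadoopRDD are far more involved.

```scala
import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: the half-open slice [start, end) of an integer range.
class SlicePartition(val index: Int, val start: Int, val end: Int) extends Partition

// Hypothetical RDD producing 0 until n. It has no parent RDDs, so its dependency list is Nil.
class IntRangeRDD(sc: SparkContext, n: Int, numSlices: Int) extends RDD[Int](sc, Nil) {

  // Property 1: the list of partitions. Partition i must report index i.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices) { i =>
      new SlicePartition(i, i * n / numSlices, (i + 1) * n / numSlices)
    }

  // Property 2: the function that computes one partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val slice = split.asInstanceOf[SlicePartition]
    (slice.start until slice.end).iterator
  }
}

object CustomRddDemo {
  // Returns (number of partitions, collected contents).
  def run(): (Int, List[Int]) = {
    val conf = new SparkConf().setAppName("CustomRddDemo").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val rdd = new IntRangeRDD(sc, 10, 3)
      (rdd.partitions.length, rdd.collect().toList)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The defaults for `getDependencies`, `getPreferredLocations`, and `partitioner` are left untouched here, which is reasonable for a source RDD with no parents and no data locality.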

3: Initializing Spark

The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.

Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.
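One way to respect the one-context-per-JVM rule is `SparkContext.getOrCreate`, which returns the active context if there is one instead of constructing a second. The sketch below assumes a local master; the object name `SingleContextDemo` is made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object, not part of Spark.
object SingleContextDemo {
  // Returns (both getOrCreate calls gave the same instance,
  //          the first context is stopped, a fresh context could be created afterwards).
  def run(): (Boolean, Boolean, Boolean) = {
    val conf = new SparkConf().setAppName("SingleContextDemo").setMaster("local[2]")
    val first  = SparkContext.getOrCreate(conf)
    val second = SparkContext.getOrCreate(conf) // reuses the active context
    val sameInstance = first eq second

    first.stop()                                // must happen before a new context is created
    val replacement = new SparkContext(conf)
    try (sameInstance, first.isStopped, !replacement.isStopped)
    finally replacement.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```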

3.1: SparkContext

Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
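The three things mentioned above (RDDs, accumulators, and broadcast variables) can all be created from one context. A minimal sketch, assuming a local master; the lookup table, the keys, and the name `ContextFeaturesDemo` are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object, not part of Spark.
object ContextFeaturesDemo {
  // Returns (score per key, number of keys missing from the lookup table).
  def run(): (List[Int], Long) = {
    val conf = new SparkConf().setAppName("ContextFeaturesDemo").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val keys   = sc.parallelize(Seq("a", "b", "c", "a")) // an RDD
      val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))   // read-only value shipped to executors
      val misses = sc.longAccumulator("misses")            // counter that tasks can only add to

      val scores = keys.map { k =>
        lookup.value.get(k) match {
          case Some(score) => score
          case None        => misses.add(1); 0
        }
      }.collect().toList

      (scores, misses.sum)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```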

3.2: SparkConf

Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
Most of the time, you would create a SparkConf object with new SparkConf(), which will load
values from any spark.* Java system properties set in your application as well. In this case,
parameters you set directly on the SparkConf object take priority over system properties.
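The precedence rule can be checked without starting a context at all. In this sketch the application names are arbitrary strings and `SparkConfPrecedenceDemo` is a made-up object; `new SparkConf(false)` is the constructor variant that skips loading system properties.

```scala
import org.apache.spark.SparkConf

// Hypothetical demo object, not part of Spark.
object SparkConfPrecedenceDemo {
  // Returns (value picked up from the system property,
  //          value after setting it explicitly,
  //          whether a conf built with loadDefaults = false saw the property).
  def run(): (String, String, Boolean) = {
    System.setProperty("spark.app.name", "FromSystemProperty")
    try {
      val fromSystemProperty = new SparkConf().get("spark.app.name")
      val setExplicitly      = new SparkConf().setAppName("SetExplicitly").get("spark.app.name")
      val seenWithoutLoading = new SparkConf(false).contains("spark.app.name")
      (fromSystemProperty, setExplicitly, seenWithoutLoading)
    } finally System.clearProperty("spark.app.name") // leave the JVM as we found it
  }

  def main(args: Array[String]): Unit = println(run())
}
```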

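import org.apache.spark.{SparkConf, SparkContext}

object SparkContextApp {

  def main(args: Array[String]): Unit = {
    // Hardcoding the master is acceptable for local testing only.
    val conf = new SparkConf().setAppName("First").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // TODO: application logic goes here

    // Stop the context so that another one can be created in this JVM.
    sc.stop()
  }
}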

In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there. However, for local testing and unit tests, you can pass "local" to run Spark in-process.
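One possible way to follow this advice is to leave the master unset in code and fall back to a local master only when none was supplied. This is a sketch of that pattern, not the only option; `SubmitFriendlyApp` and `buildConf` are hypothetical names. It relies on `spark-submit --master ...` reaching the driver as the `spark.master` property, which `new SparkConf()` then loads.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical application object, not part of Spark.
object SubmitFriendlyApp {

  // Uses the master supplied from outside (for example by spark-submit) when present,
  // and falls back to an in-process master for local runs and unit tests.
  def buildConf(): SparkConf = {
    val conf = new SparkConf().setAppName("First")
    if (!conf.contains("spark.master")) conf.setMaster("local[2]")
    conf
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(buildConf())
    try println(sc.parallelize(1 to 100).sum())
    finally sc.stop()
  }
}
```

With this in place the same jar can be started from an IDE or launched on a cluster, where the `--master` value passed to spark-submit takes effect because the code never overrides it.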

Original article: https://blog.csdn.net/weizhonggui/article/details/88358370
