Spark Distributed Cluster Setup

詹学伟

Published on 2024-04-21


I. Runtime Environment

  • Spark can run on either Windows or Linux, but in practice it is almost always deployed on Linux. This course therefore runs everything on Linux, using CentOS 7.

  • Download site: https://archive.apache.org/

II. Deployment Environment

Machine address | Node name | Deployment path
--------------- | --------- | ----------------------
192.168.1.23    | node3     | /usr/local/spark-3.3.0
192.168.1.24    | node4     | /usr/local/spark-3.3.0
192.168.1.25    | node5     | /usr/local/spark-3.3.0
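The node names above must resolve to the listed IP addresses on every machine. A minimal /etc/hosts sketch, assuming the hostnames are not already managed by DNS:

# /etc/hosts on node3, node4 and node5 (sketch; adjust to your own network)
192.168.1.23   node3
192.168.1.24   node4
192.168.1.25   node5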

III. Deployment

1. Download

Address: https://archive.apache.org/dist/spark/spark-3.3.0/
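For example, the prebuilt Hadoop 3 package (the file unpacked in the next step) can be fetched with wget:

wget https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz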

2. Extract

tar -zvxf spark-3.3.0-bin-hadoop3.tgz 

3. Rename and move

mv spark-3.3.0-bin-hadoop3 spark-3.3.0

mv spark-3.3.0 /usr/local/

4. Modify configuration files

Enter the configuration directory:

cd /usr/local/spark-3.3.0/conf

  • Modify workers

cp workers.template workers
vim workers

node3
node4
node5
  • Modify spark-env.sh

cp spark-env.sh.template spark-env.sh

vim spark-env.sh

HADOOP_CONF_DIR=/usr/local/hadoop-3.3.0/etc/hadoop
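The article only sets HADOOP_CONF_DIR here, so that Spark can find the Hadoop/HDFS configuration. For a standalone cluster it is also common to pin the master host and port; the sketch below shows what a fuller spark-env.sh might look like (SPARK_MASTER_HOST=node3 and the port are assumptions, not part of the original article):

# spark-env.sh -- sketch; only HADOOP_CONF_DIR comes from the article
HADOOP_CONF_DIR=/usr/local/hadoop-3.3.0/etc/hadoop
export SPARK_MASTER_HOST=node3   # assumed: node3 acts as the standalone master
export SPARK_MASTER_PORT=7077    # assumed: default master RPC port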
  • Set JAVA_HOME in sbin/spark-config.sh

cd /usr/local/spark-3.3.0/sbin/

vim spark-config.sh

export JAVA_HOME=/usr/local/jdk1.8.0_191

IV. Distribute to the other nodes

cd /usr/local
scp -r spark-3.3.0/ root@node4:`pwd`
scp -r spark-3.3.0/ root@node5:`pwd`
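Both scp and the start-all.sh script used below rely on passwordless SSH from the master node to node4 and node5. If that is not yet configured, a minimal sketch (run on node3; this step is an assumption, it is not covered in the original article):

ssh-keygen -t rsa        # accept the defaults
ssh-copy-id root@node4
ssh-copy-id root@node5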

V. Start

Run on the node that is to act as the master (the Master daemon starts on whichever node runs this script):

/usr/local/spark-3.3.0/sbin/start-all.sh 
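To verify that the daemons started, jps (shipped with the JDK) can be run on each node; the node where start-all.sh was executed should show a Master process, and node3, node4 and node5 should each show a Worker process, since all three are listed in workers:

# on the node where start-all.sh was run: expect Master (plus Worker, as it is also in workers)
jps
# on the remaining worker nodes: expect Worker
jps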

VI. Access the Web UI

http://192.168.1.23:8080/

VII. spark-shell test

hadoop fs -mkdir -p /spark/in
hadoop fs -put xxxx.txt /spark/in
/usr/local/spark-3.3.0/bin/spark-shell 
# Local file
scala> val textFile = sc.textFile("file:///tmp/spark-test.txt");
# HDFS (the directory created above)
scala> val textFile = sc.textFile("hdfs:///spark/in/spark-test.txt");
scala> textFile.count();
res12: Long = 1
scala> textFile.first();
scala> textFile.map(line=>line.split("").size).reduce((a,b) => if (a > b) a else b);
res8: Int = 70
scala> val wordCount = textFile.flatMap(line =>line.split(" ")).map(x => (x,1)).reduceByKey((a,b) => a+b);
wordCount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[9] at reduceByKey at <console>:25
scala> wordCount.collect
res9: Array[(String, Int)] = Array((hive,2), (phone,2), (flink,4), (zookeeper,4), (map,2), (mysql,3), (list,2), (pig,4), (java,3), (root,2), (yarn,3), (file,3), (test,2), (desk,3), (spark,1), (hadoop,2), (if,2), (home,2), (mr,4), (flume,4), (else,1), (kudu,3), (try,1), (for,2), (dev,2), (hdfs,3), (hbase,4))
scala> val wordCount2 = textFile.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_);
wordCount2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[12] at reduceByKey at <console>:25
scala> wordCount2.collect
res10: Array[(String, Int)] = Array((hive,2), (phone,2), (flink,4), (zookeeper,4), (map,2), (mysql,3), (list,2), (pig,4), (java,3), (root,2), (yarn,3), (file,3), (test,2), (desk,3), (spark,1), (hadoop,2), (if,2), (home,2), (mr,4), (flume,4), (else,1), (kudu,3), (try,1), (for,2), (dev,2), (hdfs,3), (hbase,4))
scala> val wordCount3 = textFile.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1));
wordCount3: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[20] at map at <console>:25
scala> wordCount3.collect
res11: Array[(String, Int)] = Array((flink,4), (zookeeper,4), (pig,4), (mr,4), (flume,4), (hbase,4), (mysql,3), (java,3), (yarn,3), (file,3), (desk,3), (kudu,3), (hdfs,3), (hive,2), (phone,2), (map,2), (list,2), (root,2), (test,2), (hadoop,2), (if,2), (home,2), (for,2), (dev,2), (spark,1), (else,1), (try,1))
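To persist the sorted word counts back to HDFS, something like the following can be run in the same shell (the output directory hdfs:///spark/out is an assumption and must not already exist):

scala> wordCount3.saveAsTextFile("hdfs:///spark/out")
scala> :quit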

VIII. Run the first Spark program

  • Submit the job in normal mode

/usr/local/spark-3.3.0/bin/run-example SparkPi 10
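run-example is a thin wrapper around spark-submit. To submit the same example explicitly to the standalone cluster, a sketch along these lines can be used (the master URL spark://node3:7077 and the examples jar path are assumptions based on the default port and the Spark 3.3.0 distribution layout):

/usr/local/spark-3.3.0/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://node3:7077 \
  /usr/local/spark-3.3.0/examples/jars/spark-examples_2.12-3.3.0.jar \
  10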

This example estimates π with a Monte Carlo method: the program generates a large number of random points, and from the fraction that fall inside the unit circle it computes an increasingly accurate approximation of π.
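The same idea can be reproduced in a few lines of spark-shell code (a minimal sketch of the Monte Carlo estimate, not the actual SparkPi source):

val n = 1000000
val inside = sc.parallelize(1 to n).filter { _ =>
  val x = math.random * 2 - 1   // random point in the square [-1, 1] x [-1, 1]
  val y = math.random * 2 - 1
  x * x + y * y <= 1            // keep points that fall inside the unit circle
}.count()
println(s"Pi is roughly ${4.0 * inside / n}")   // circle/square area ratio = pi/4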

