spark on yarn error

The error says it is an out-of-bounds problem, but when I look at the code, the line it points to is val sc = new SparkContext(conf).
Full code:

package cn.chinahadoop.spark

import org.apache.spark.{SparkContext, SparkConf}

/**
* Created by chenchao on 14-3-1.
*/
object Analysis {

  def main(args: Array[String]) {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    conf.setAppName("analysis")
    //val rdd = data.map((_, 1)).reduceByKey(_ + _)
    //val rdd = data.filter(_.split('\t')(0) >= "00:00:00").filter(_.split("\t")(0) <= "12:00:00").count()
    //val rdd2 = data.count()
    //rdd.saveAsTextFile("/mnt/workspace/output")
    //val rdd = data.filter(_.split('\t')(3).split(" ")(0) == "1").filter(_.split('\t')(3).split(" ")(1) == "2").count()
    //print(rdd + " ")
    val data = sc.textFile("/test/input/sp.txt")
    val count = data.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).saveAsTextFile("/mnt/workspace")
    print("work done!!!!!!!!!!!!!")
  }

}
Error:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$setEnvFromInputString$1.apply(YarnSparkHadoopUtil.scala:122)
at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$setEnvFromInputString$1.apply(YarnSparkHadoopUtil.scala:120)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.setEnvFromInputString(YarnSparkHadoopUtil.scala:120)
at org.apache.spark.deploy.yarn.Client$$anonfun$setupLaunchEnv$5.apply(Client.scala:360)
at org.apache.spark.deploy.yarn.Client$$anonfun$setupLaunchEnv$5.apply(Client.scala:358)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:358)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:414)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:105)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:58)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:379)
at cn.chinahadoop.spark.Analysis$.main(Analysis.scala:12)
at cn.chinahadoop.spark.Analysis.main(Analysis.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

wangxiaolei

Upvoted by: fish

First, your machine has two Hadoop environments installed, one cdh5.4.2 and one cdh5.4.8. The Hadoop cluster that is actually running is cdh5.4.2, while the Spark you installed is built for cdh5.4.8, so the Hadoop-related variables have to be configured by hand to make sure spark on yarn submits to the cdh5.4.2 cluster.

In spark-env.sh you configured export SPARK_YARN_USER_ENV="/huanglang/hadoop/hadoop-2.6.0-cdh5.4.2/etc/hadoop". That setting is wrong and is what makes Spark fail during initialization. The correct setting is export SPARK_YARN_USER_ENV="CLASSPATH=/huanglang/hadoop/hadoop-2.6.0-cdh5.4.2/etc/hadoop". If you use /usr/lib/hadoop, there is no need to point at the Hadoop configuration files again.

I created a tmp directory under /mnt/workspace/ containing the recompiled chinahadoop-1.0-SNAPSHOT.jar and the cdh5.4.2 Spark rpm packages. After a successful run the output is under /mnt/workspace/output.

Also, it is best to set up passwordless login for the Spark cluster and to keep the CDH versions consistent.
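For background on why the original value blows up: SPARK_YARN_USER_ENV is treated as a comma-separated list of KEY=VALUE pairs, and YarnSparkHadoopUtil.setEnvFromInputString (the frame at line 122 in the stack trace) reads the part after the "=" for every entry. Below is a minimal sketch of that parsing, assuming each entry is simply split on "=" -- a simplification for illustration, not the actual Spark source:

object EnvParseSketch {
  import scala.collection.mutable.HashMap

  // Simplified stand-in for YarnSparkHadoopUtil.setEnvFromInputString:
  // split the input on "," into entries, then split each entry on "=".
  def setEnvFromInputString(env: HashMap[String, String], inputString: String): Unit = {
    inputString.split(",").foreach { entry =>
      val parts = entry.split("=")
      // parts(1) is the value; an entry with no "=" (like a bare path)
      // has only parts(0), so this throws ArrayIndexOutOfBoundsException: 1.
      env(parts(0).trim) = parts(1).trim
    }
  }

  def main(args: Array[String]): Unit = {
    val env = HashMap[String, String]()
    // The corrected setting parses cleanly into key "CLASSPATH" and a path value:
    setEnvFromInputString(env, "CLASSPATH=/huanglang/hadoop/hadoop-2.6.0-cdh5.4.2/etc/hadoop")
    // The original (wrong) setting is a bare path with no "=" and would throw:
    // setEnvFromInputString(env, "/huanglang/hadoop/hadoop-2.6.0-cdh5.4.2/etc/hadoop")
    println(env)
  }
}

With the CLASSPATH=... form every entry splits into a key and a value, so the out-of-bounds exception disappears.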

fish - Hadooper

Upvoted by:

What submit command did you use?

黄浪 - coder....

Upvoted by:

Script:

#!/usr/bin/env bash
$SPARK_HOME/bin/spark-submit \
  --master yarn-client \
  --executor-memory 60m \
  --total-executor-cores 1 \
  /mnt/workspace/chinahadoop-1.0-SNAPSHOT.jar

yanglei

Upvoted by:

The error happened while the SparkContext was being initialized.

fish - Hadooper

Upvoted by:

Where is your environment? Is it a cloud host? Can you log in to it remotely?
