Lesson 2: Spark Programming and Enterprise Application Cases

Please post any questions about Lesson 2 (Spark Programming and Enterprise Application Cases) as replies in this thread.
1. Reply to this thread with the questions you want to ask. (If someone has already asked the same question, upvote it to show your interest instead of posting it again.)
2. The instructor will answer the submitted questions in the last 30 minutes of the live session.
3. Reference answers will be compiled after class and posted as a reply to each question.
Post-lesson survey for Lesson 2:
https://wj.qq.com/s/1259269/fe17
or scan the QR code below to fill it in:
[attach]5568[/attach]
 

jhg22

Upvoted by: 李思宇isx, 张文山4tw

A problem from our production Spark environment: in our real-time computation pipeline (Kafka + Spark Streaming + Redis), skipped tasks show up every minute, as shown in the screenshots below:
[attach]5561[/attach]
[attach]5558[/attach]
[attach]5559[/attach]
[attach]5560[/attach]

I have not been able to find the cause. Teacher Dong, could you take a look?
Data is flowing in continuously.

tl_oni

Upvoted by: 探照灯儿

When I set up a development environment in IntelliJ IDEA with spark-2.10, Scala 2.11.8, and JDK 1.8, I get a "not found: type SparkConf" error, as shown in the screenshot. What causes this, and how do I fix it?
Environment:
Spark master on IP 116;
three workers on 101, 102, 103.

My development machine running IntelliJ IDEA is on IP 114.
[attach]5551[/attach]

jhg22

Upvoted by: tl_oni

Import the package:
import org.apache.spark._
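
For context, a minimal sketch of what a compilable entry point looks like once that import resolves; the app name and master URL are placeholders, and the spark-core dependency for your Scala version must also be on the project classpath, otherwise the type will still not be found.
[code]
import org.apache.spark.{SparkConf, SparkContext}

object SparkConfCheck {
  def main(args: Array[String]): Unit = {
    // Placeholder app name and master; point setMaster at your own cluster, or use local[*] for IDE runs.
    val conf = new SparkConf()
      .setAppName("SparkConfCheck")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Trivial job just to confirm the environment works end to end.
    val counts = sc.parallelize(Seq("a", "b", "a")).map((_, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)

    sc.stop()
  }
}
[/code]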

王云鹏

Upvoted by: Lotus丶

We are currently on Spark 1.6. If we upgrade to Spark 2.1, is it just a matter of reconfiguring? Is the code backward compatible? Thanks.

kendu

Upvoted by: Lotus丶

In Spark on YARN cluster mode, can we write Spark applications using the Python API and submit them to the Spark on YARN cluster? Thanks!
During Spark computation, is data written to disk? Under what conditions is it written to disk?
The error in the log is: ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
What are the possible causes?
In Lesson 1 there was a SQL statement that the instructor said used 4 MR jobs. How do you tell how many MR jobs a SQL statement uses?
Is the ResourceManager's workload heavy? In a real deployment, should the ResourceManager run on a dedicated machine, or be co-located with the NameNode?
Can sortByKey set the number of reducers, and if so how? Is the number of pre-split HBase regions the same as the number of reducers in the Spark job?
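On the sortByKey part, a minimal sketch under the assumption of an existing SparkContext named sc (as in spark-shell): sortByKey takes an optional numPartitions argument that controls how many partitions (and hence output tasks) the sorted result has. The data and partition count below are arbitrary.
[code]
// Key/value pairs to sort; the contents are made up for illustration.
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))

// The second argument sets the number of partitions of the sorted output.
val sorted = pairs.sortByKey(ascending = true, numPartitions = 8)

println(sorted.partitions.length)  // 8
[/code]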
Is a container a process? In what form does the YarnAllocator hand the allocated containers to the executors? In what form are tasks delivered to executors? What does Kryo's buffer.max refer to?
I have tested it: Spark now supports multiple SparkContexts, as long as multiple.sparkcontext is set to true, and it does run. Could you discuss what kinds of requirements this is meant to address?
A question about data collection:

In a log files -> Flume -> Kafka -> Spark Streaming architecture, collecting logs with tail -F, we ran into the following two problems:

1. When a log file rolls over (e.g. it reaches its size limit) and the file is switched, Flume loses data;
2. When the source produces input too fast, data is lost.

Teacher Dong, is there a recommended solution?
A question: must the files Spark processes be splittable by line? In other words, content that can be processed line by line in arbitrary order?
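Related to the line-splitting question, a sketch of the two input styles, assuming an existing SparkContext sc; textFile splits the input into line records, while wholeTextFiles hands each whole file to a task as a single string, which is one way to deal with content that is not line-oriented. The paths are placeholders.
[code]
// Line-oriented input: each element of the RDD is one line of text.
val lines = sc.textFile("hdfs:///path/to/logs")

// Whole-file input: each element is (file path, full file contents),
// useful when a file cannot be split and processed line by line.
val files = sc.wholeTextFiles("hdfs:///path/to/small-files")

println(lines.count())
println(files.keys.collect().mkString("\n"))
[/code]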
1) When caching, what happens if the data exceeds the available memory?
2) What happens if the number of cores configured exceeds the number of physical CPU cores?
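On the caching question, a small sketch of choosing the storage level explicitly, assuming an existing SparkContext sc; with the default cache() (MEMORY_ONLY), partitions that do not fit are simply dropped and recomputed when needed, whereas MEMORY_AND_DISK spills them to local disk. The path is a placeholder.
[code]
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///path/to/input")   // placeholder path

// MEMORY_ONLY (the cache() default): evicted partitions are recomputed from lineage.
// MEMORY_AND_DISK: partitions that don't fit in memory are spilled to local disk.
lines.persist(StorageLevel.MEMORY_AND_DISK)

println(lines.count())
[/code]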
(1) If a file is 200 MB and is stored as two blocks, are the two blocks split evenly as 100 MB + 100 MB, or as 128 MB + 72 MB?

(2) If a 200 MB file is split into two partitions, is it 100 MB + 100 MB, or 128 MB + 72 MB? If the latter, does that cause load-balancing problems?
Could you explain how to write log4j logs in a Spark program?
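A minimal sketch of one common pattern, assuming the log4j 1.x API bundled with Spark; the object name and messages are placeholders, and the logger is declared @transient lazy so it is created on first use in whichever JVM runs the code and is never dragged into serialized closures.
[code]
import org.apache.log4j.{Level, Logger}

object MyJob {
  // Created lazily in each JVM (driver or executor); @transient keeps it out of closure serialization.
  @transient lazy val log: Logger = Logger.getLogger(getClass.getName)

  def main(args: Array[String]): Unit = {
    // Optionally quiet Spark's own logging.
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    log.info("job starting")   // placeholder message
  }
}
[/code]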
Teacher Dong, between settings made in SparkConf and settings passed to spark-submit, which takes priority? Also, how long does an RDD stay cached? If it is never released, does it stay there forever?
At work we use Java exclusively. Can I skip Scala and learn the various Spark APIs directly in Java?
About the pipe call to an external script mentioned just now: does that script need to be present in the same directory on every executor?
Is there a way to submit an external script through spark-submit?
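On the pipe question, a minimal sketch assuming an existing SparkContext sc; rdd.pipe forks the given command on whichever executor processes each partition, feeding the partition's elements to its stdin one per line and turning its stdout lines into the result RDD, so the command does need to be resolvable on every worker node. The script path below is a placeholder.
[code]
val nums = sc.parallelize(1 to 100, numSlices = 4)

// Each partition is streamed through the external command:
// elements go to stdin (one per line), stdout lines become the output RDD.
// "/path/to/transform.sh" is a placeholder and must be executable on the workers.
val piped = nums.pipe("/path/to/transform.sh")

piped.collect().foreach(println)
[/code]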
Can IntelliJ on Windows run the program locally inside the IDE?
In yarn-client mode, does sc.textFile("file:///data") read the local file on the driver or a file on the executors?
When spark-submit --jars is used to submit n JAR packages with --master yarn-cluster, does every executor also need the same n JARs in the same directory?
When launching an application with spark-submit, comparing num-executors and executor-cores:
num-executors=100
executor-cores=1

num-executors=50
executor-cores=2

The total number of executor cores is the same, so what is the difference?
One more question: after an RDD repartition, the storage layout on HDFS does not change accordingly, right? In that case, is the RDD shuffled between different nodes?
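On repartition, a small sketch assuming an existing SparkContext sc; repartition reshuffles the RDD's rows across the executors into the requested number of partitions and does not touch how the original file is laid out on HDFS, while coalesce can reduce the partition count without a full shuffle. The path and counts are arbitrary.
[code]
val lines = sc.textFile("hdfs:///path/to/input")    // placeholder path

println(lines.partitions.length)    // partitions derived from the input splits

val wider = lines.repartition(200)  // full shuffle into 200 partitions
val fewer = lines.coalesce(10)      // narrows to 10 partitions without a full shuffle

println(wider.partitions.length)    // 200
[/code]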
Also, why does YARN mode require spark.yarn.jar to be specified? Is that JAR something you build yourself? If it includes your own application code, then there would be no need to worry about environment compatibility or JAR conflicts, right? I am not sure whether that understanding is correct.
Does Spark on YARN cluster mode place any requirements on server memory? Must every server have more than 8 GB of RAM?
Exception in thread "main" org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.

This error appears when launching against the YARN cluster; what could the problem be?
Teacher Dong, after starting Spark with ./spark-shell --master yarn, I ran:
[code=Java]
val userrdd = sc.textFile("/opt/data/movie/users.dat")
[/code]
When I call userrdd.count, I get the following error. What is going on?
[code=Java]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hadoop-2.5.1/nm-local-dir/usercache/root/filecache/28/__spark_libs__2285737196423591043.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/software/hadoop-2.5.1/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/04/16 18:54:54 INFO executor.CoarseGrainedExecutorBackend: Started daemon with process name: 24407@master
17/04/16 18:54:54 INFO util.SignalUtils: Registered signal handler for TERM
17/04/16 18:54:54 INFO util.SignalUtils: Registered signal handler for HUP
17/04/16 18:54:54 INFO util.SignalUtils: Registered signal handler for INT
17/04/16 18:54:55 INFO spark.SecurityManager: Changing view acls to: root
17/04/16 18:54:55 INFO spark.SecurityManager: Changing modify acls to: root
17/04/16 18:54:55 INFO spark.SecurityManager: Changing view acls groups to:
17/04/16 18:54:55 INFO spark.SecurityManager: Changing modify acls groups to:
17/04/16 18:54:55 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
17/04/16 18:54:56 INFO client.TransportClientFactory: Successfully created connection to /10.135.111.231:48282 after 104 ms (0 ms spent in bootstraps)
17/04/16 18:54:56 INFO spark.SecurityManager: Changing view acls to: root
17/04/16 18:54:56 INFO spark.SecurityManager: Changing modify acls to: root
17/04/16 18:54:56 INFO spark.SecurityManager: Changing view acls groups to:
17/04/16 18:54:56 INFO spark.SecurityManager: Changing modify acls groups to:
17/04/16 18:54:56 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
17/04/16 18:54:56 INFO client.TransportClientFactory: Successfully created connection to /10.135.111.231:48282 after 2 ms (0 ms spent in bootstraps)
17/04/16 18:54:56 INFO storage.DiskBlockManager: Created local directory at /opt/hadoop-2.5.1/nm-local-dir/usercache/root/appcache/application_1492328023585_0010/blockmgr-55d5a33c-e918-4dd8-af78-38c7d035c60b
17/04/16 18:54:56 INFO memory.MemoryStore: MemoryStore started with capacity 413.9 MB
17/04/16 18:54:57 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@10.135.111.231:48282
17/04/16 18:54:57 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
17/04/16 18:54:57 INFO executor.Executor: Starting executor ID 1 on host master
17/04/16 18:54:57 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43664.
17/04/16 18:54:57 INFO netty.NettyBlockTransferService: Server created on master:43664
17/04/16 18:54:57 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/04/16 18:54:57 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(1, master, 43664, None)
17/04/16 18:54:57 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(1, master, 43664, None)
17/04/16 18:54:57 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(1, master, 43664, None)
17/04/16 18:54:57 INFO executor.Executor: Using REPL class URI: spark://10.135.111.231:48282/classes
17/04/16 18:56:57 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 0
17/04/16 18:56:57 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
17/04/16 18:56:58 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
17/04/16 18:56:58 INFO client.TransportClientFactory: Successfully created connection to /10.135.111.231:35165 after 2 ms (0 ms spent in bootstraps)
17/04/16 18:56:58 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1889.0 B, free 413.9 MB)
17/04/16 18:56:58 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 179 ms
17/04/16 18:56:58 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 413.9 MB)
17/04/16 18:56:58 INFO rdd.HadoopRDD: Input split: hdfs://master:9000/opt/data/movie/users.dat:0+67184
17/04/16 18:56:58 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
17/04/16 18:56:58 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.4 KB, free 413.9 MB)
17/04/16 18:56:58 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 11 ms
17/04/16 18:56:58 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 296.9 KB, free 413.6 MB)
17/04/16 18:56:59 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/04/16 18:56:59 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/04/16 18:56:59 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/04/16 18:56:59 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/04/16 18:56:59 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/04/16 18:56:59 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSums(IILjava/nio/ByteBuffer;ILjava/nio/ByteBuffer;IILjava/lang/String;JZ)V
at org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSums(Native Method)
at org.apache.hadoop.util.NativeCrc32.verifyChunkedSums(NativeCrc32.java:59)
at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:301)
at org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:231)
at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:152)
at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:775)
at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:831)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:891)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:62)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:94)
at org.apache.hadoop.mapred.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:208)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:246)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:266)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:211)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1157)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
17/04/16 18:57:00 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1
17/04/16 18:57:00 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
17/04/16 18:57:00 INFO rdd.HadoopRDD: Input split: hdfs://master:9000/opt/data/movie/users.dat:67184+67184
17/04/16 18:57:00 ERROR executor.Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSums(IILjava/nio/ByteBuffer;ILjava/nio/ByteBuffer;IILjava/lang/String;JZ)V
at org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSums(Native Method)
at org.apache.hadoop.util.NativeCrc32.verifyChunkedSums(NativeCrc32.java:59)
at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:301)
at org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:231)
at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:152)
at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:775)
at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:831)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:891)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:62)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:94)
at org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:136)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:252)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:251)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:211)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
... (the same java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSums stack trace repeats verbatim for the remaining task attempts 0.1-0.3 and 1.1-1.3, TIDs 2 through 7) ...
[/code]
Teacher Dong, hello. When writing data to HBase with Spark, how should garbled Chinese characters be handled? Some of the Chinese text comes out garbled and some does not, while the original data we read in is not garbled. How do you deal with this kind of problem when doing data analysis with Spark?
When using Spark SQL for interactive queries, how can hundreds of GB of data be computed and returned in the shortest possible time? What storage medium should the data be kept on?
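One common ingredient, sketched here under the assumption of a Spark 1.6-style SQLContext named sqlContext with placeholder paths and table names, is to keep the hot data in a columnar format such as Parquet and pin it in executor memory so repeated interactive queries avoid re-reading HDFS; whether that alone meets the latency target depends on how much cluster memory is available.
[code]
// Placeholder path and table name; assumes an existing SQLContext `sqlContext`.
val events = sqlContext.read.parquet("hdfs:///warehouse/events")
events.registerTempTable("events")

// Caches the table in a compressed, columnar in-memory form on the executors.
sqlContext.cacheTable("events")

sqlContext.sql("SELECT country, COUNT(*) AS cnt FROM events GROUP BY country").show()
[/code]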
