Can anyone explain how to actually implement Docker on YARN? Ideally with a real-world case study.
The big data team I work on at Hulu has been doing exactly this. Here is an introduction to the Docker on YARN implementation developed by our Beijing team: Voidbox, a Docker-based DAG computing framework that runs on YARN. It is already used on several production lines at Hulu, with clear benefits.

1. Voidbox Motivation

YARN is the distributed resource management system in Hadoop 2.0. It schedules cluster resources for diverse high-level applications such as MapReduce and Spark. However, all existing frameworks on top of YARN are designed with the assumption of a specific system environment. How to support user applications with arbitrarily complex environment dependencies is still an open question.

Docker gives the answer. Docker is a very popular container virtualization technology and an open platform for developing, shipping, and running applications. It provides a way to run almost any application isolated in a container, and it automates the deployment of any application as a lightweight, portable, self-sufficient container that will run virtually anywhere.

To combine the advantages of Docker and YARN, the Hulu engineering team developed Voidbox. Voidbox enables any application encapsulated in a Docker image to run on a YARN cluster alongside MapReduce and Spark. Voidbox brings the following benefits:

Easier creation of distributed applications. Voidbox handles the most common issues in a distributed computation system, namely cluster discovery, elastic resource allocation, task coordination, and disaster recovery. With its well-designed interface, it is easy to implement a distributed application.

Simpler deployment. Without Voidbox, we need to create and maintain a dedicated VM for each application with a complex environment, even though the VM image is huge and not easy to deploy. With Voidbox, we can easily get resources allocated and run the application exactly when we need it. The additional maintenance work is eliminated.

Improved cluster efficiency. Because we can deploy Spark/MR and all kinds of Voidbox applications from different departments together, we can maximize cluster usage.
Thus, YARN's role as a big data operating platform is further consolidated and enhanced.

Voidbox supports the execution of Docker container-based DAG (Directed Acyclic Graph) tasks. Moreover, Voidbox provides several ways to submit applications, to meet the demands of both the production environment and the debugging environment. In addition, Voidbox can work with Jenkins, GitLab and a private Docker Registry to set up a complete development, testing and automatic release process.

2. Voidbox Architecture

2.1 YARN Architecture Overview

YARN enables multiple applications to share resources dynamically in a cluster. Here is the architecture of applications running in a YARN cluster: http://tech.hulu.com/blog/2014 ... rview/) is a video analysis application. It is implemented in C and depends on many graphics libraries. It can be optimized with Voidbox: first we package the whole face match program into a Docker image, then we write a Voidbox application to process multiple videos. Voidbox solves both the complex machine environment problem and the parallelism control problem.

Building complex workflows. Some tasks depend on each other. For example, user behaviors must be loaded first and only then analyzed, so the two steps have a sequential dependency. Voidbox's container-based programming model handles this case easily.

6. Differences from DockerContainerExecutor in YARN 2.6.0

DockerContainerExecutor (link: https://issues.apache.org/jira/browse/YARN-1964) was released in YARN 2.6.0 as an alpha version. It is not mature enough, and it is only an encapsulation layer above the default executor. It is also difficult for DockerContainerExecutor to coexist with other ContainerExecutors in one YARN cluster.

Voidbox features:
DAG programming model
Configurable container-level fault tolerance
A variety of running modes, covering both the development environment and the production environment
Shares YARN cluster resources with other Hadoop jobs
Graphical log view tool

7. Future work

Support more versions of YARN. Voidbox would like to support more versions besides YARN 2.6.0.

Voidbox Master fault tolerance, with persistent metadata to reduce the cost of retries. Currently, if a Voidbox Master crashes, YARN reclaims the resources belonging to this Voidbox application and restarts the Voidbox Master, which reruns tasks from the very beginning. Tasks that are already done or running should not need to be affected. We might keep some metadata in the State Server to reduce the cost of a Voidbox Master failure.

Voidbox Master as a permanent service. Voidbox will support a long-running Voidbox Master that receives streaming tasks.

Support long-running services. Voidbox will support long-running services if Voidbox Master downtime does not affect running tasks.
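
The Master fault-tolerance idea above (persist task metadata so that a restarted Master skips work that has already finished) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Voidbox's actual API: the `run_dag`, `load_done` and `save_done` functions, the JSON file standing in for the State Server, and the plain callables standing in for Docker container launches are all hypothetical.

```python
import json
import os
import tempfile
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def load_done(state_path):
    """Return the set of task names recorded as finished (empty if no state yet)."""
    if not os.path.exists(state_path):
        return set()
    with open(state_path) as f:
        return set(json.load(f))


def save_done(state_path, done):
    """Persist finished task names atomically: write a temp file, then rename it."""
    dir_name = os.path.dirname(os.path.abspath(state_path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp_path, state_path)


def run_dag(deps, actions, state_path):
    """Run tasks in dependency order, skipping tasks already recorded as done.

    deps:    {task: set of tasks it depends on}
    actions: {task: zero-argument callable}; a hypothetical stand-in for
             launching a Docker container for that task
    Returns the list of tasks actually executed during this call.
    """
    done = load_done(state_path)
    executed = []
    for task in TopologicalSorter(deps).static_order():
        if task in done:
            continue  # finished before the restart, so it is not rerun
        actions[task]()  # an exception here leaves the saved state untouched
        done.add(task)
        save_done(state_path, done)  # checkpoint after every finished task
        executed.append(task)
    return executed
```

For the two-step workflow mentioned earlier, `deps = {"analyze": {"load"}}` makes `load` run before `analyze`. If the process dies during `analyze`, `load` is already recorded, so the next `run_dag` call starts at `analyze`. A real implementation would also have to track containers that are still running, which this sketch ignores.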