In my previous blog entry I profiled Hadoop HDFS , now I will go one step higher in Hadoop architecture and introduce YARN. YARN was added for efficient resource management and scheduling relatively recently in Hadoop 2.0, previously MapReduce was a layer used in that role exclusively..
As you can see above this is relatively large change with YARN now taking its place as “data OS” on top of HDFS and MapReduce becoming another framework on top of YARN, just like many others.
YARN stands for “Yet-Another-Resource-Negotiator”. It is a new framework that facilitates writing arbitrary distributed processing frameworks and applications.
YARN provides the daemons and APIs necessary to develop generic distributed applications of any kind, handles and schedules resource requests (such as memory and CPU) from such applications, and supervises their execution.
YARN’s execution model is more generic than the earlier MapReduce implementation. YARN can run applications that do not follow the MapReduce model, unlike the original Apache Hadoop MapReduce.
YARN pretty radically changes MapReduce internals. Mapreduce 1.0 had following workflow:
Note there is a single point of failure here – JobTracker. Job Tracker is a “master” component of Task Trackers. Client submit MapReduce jobs to Job Tracker which distributes tasks to Task Trackers.Task Trackers run on Data Node and perform actual MapReduce jobs.
With the advent of YARN, there is no longer a single JobTracker to run jobs and a TaskTracker to run tasks of the jobs. The old MapReduce 1.0 framework was rewritten to run within a submitted application on top of YARN. This application was christened MR2, or MapReduce version 2. It is the familiar MapReduce execution underneath, except that each job now controls its own destiny via its own ApplicationMaster taking care of execution flow (such as scheduling tasks, handling speculative execution and failures, etc.). It is a more isolated and scalable model than the MR1\Map Reduce 1.0 system where a singular JobTracker does all the resource management, scheduling and task monitoring work.
The ResourceManager has two main components: Scheduler and ApplicationsManager:
- The Scheduler is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is pure scheduler in the sense that it performs no monitoring or tracking of status for the application. Also, it offers no guarantees about restarting failed tasks either due to application failure or hardware failures. The Scheduler performs its scheduling function based the resource requirements of the applications; it does so based on the abstract notion of a resource Container which incorporates elements such as memory, CPU, disk, network etc. In the first version, only memory is supported.
- The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application specific ApplicationMaster and provides the service for restarting the ApplicationMaster container on failure.
With the advent of YARN, you are no longer constrained by the simpler MapReduce paradigm of development, but can instead create more complex distributed applications. In fact, you can think of the MapReduce model as simply one more application in the set of possible applications that the YARN architecture can run, in effect exposing more of the underlying framework for customized development.
MRV2 maintains API compatibility with previous stable release (hadoop-1.x). This means that all Map-Reduce jobs should still run unchanged on top of MRv2 with just a recompile.
So how do I configure my YARN cluster settings\size my cluster correctly?
YARN configuration options are stored in the
/opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop/yarn-site.xml file and are editable by the
root user. This file contains configuration information that overrides the default values for YARN parameters. Overrides of the default values for core configuration properties are stored in the yarn-default.xml file.
Common parameters for yarn-site.xml can be found here – http://doc.mapr.com/display/MapR/yarn-site.xml
There is a nice reference from Hortonworks here that talks about a tool that gives some best practice suggestions for memory settings and also goes over how to manually set these values, good article from Cloudera as well – http://www.cloudera.com/content/cloudera/en/documentation/cdh4/v4-2-2/CDH4-Installation-Guide/cdh4ig_topic_11_4.html
For more on YARN see – http://hortonworks.com/blog/apache-hadoop-yarn-hdp-2-2-substantial-step-forward-enterprise-hadoop/, http://blog.sequenceiq.com/blog/2014/07/22/schedulers-part-1/, http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-22.214.171.124/bk_installing_manually_book/content/rpm_chap3.html, http://blogs.msdn.com/b/bigdatasupport/archive/2014/11/11/some-commonly-used-yarn-memory-settings.aspx , http://www.informit.com/articles/article.aspx?p=2190194&seqNum=2, http://arturmkrtchyan.com/how-to-setup-multi-node-hadoop-2-yarn-cluster