holynull · hadoop · 2016-06-25 20:27:54

Translated from Hadoop: The Definitive Guide, 4th Edition


This section describes how to install and configure a basic Hadoop cluster from scratch using the Apache Hadoop distribution on a Unix operating system. It provides background information on the things you need to think about when setting up Hadoop. For a production installation, most users and operators should consider one of the Hadoop cluster management tools listed at the beginning of this chapter.



Hadoop runs on both Unix and Windows operating systems, and requires Java to be installed. For a production installation, you should select a combination of operating system, Java, and Hadoop that has been certified by the vendor of the Hadoop distribution you are using. There is also a page on the Hadoop wiki that lists combinations that community members have run with success.



It’s good practice to create dedicated Unix user accounts to separate the Hadoop processes from each other, and from other services running on the same machine. The HDFS, MapReduce, and YARN services are usually run as separate users, named hdfs, mapred, and yarn, respectively. They all belong to the same hadoop group.
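The accounts above can be created with standard Unix tools. A minimal sketch, run as root (the group and user names follow the convention just described):

```shell
# Create the shared hadoop group, then one account per service.
# -g sets the primary group; -m creates a home directory.
groupadd hadoop
useradd -g hadoop -m hdfs
useradd -g hadoop -m mapred
useradd -g hadoop -m yarn
```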



Download Hadoop from the Apache Hadoop releases page, and unpack the contents of the distribution in a sensible location, such as /usr/local (/opt is another standard choice; note that Hadoop should not be installed in a user’s home directory, as that may be an NFS-mounted directory):



% cd /usr/local
% sudo tar xzf hadoop-x.y.z.tar.gz


You also need to change the owner of the Hadoop files to be the hadoop user and group:
% sudo chown -R hadoop:hadoop hadoop-x.y.z
It’s convenient to put the Hadoop binaries on the shell path too:
% export HADOOP_HOME=/usr/local/hadoop-x.y.z
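Setting HADOOP_HOME alone does not put the binaries on the path; a common follow-up (a sketch, assuming the install location above) is to add the bin and sbin directories to PATH as well:

```shell
export HADOOP_HOME=/usr/local/hadoop-x.y.z
# bin holds the user-facing commands (hadoop, hdfs); sbin holds the
# cluster control scripts (start-dfs.sh, start-yarn.sh).
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```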




The Hadoop control scripts (but not the daemons) rely on SSH to perform cluster-wide operations. For example, there is a script for stopping and starting all the daemons in the
cluster. Note that the control scripts are optional — cluster-wide operations can be performed by other mechanisms, too, such as a distributed shell or dedicated Hadoop management applications.


To work seamlessly, SSH needs to be set up to allow passwordless login for the hdfs and
yarn users from machines in the cluster.[69] The simplest way to achieve this is to generate
a public/private key pair and place it in an NFS location that is shared across the cluster.


First, generate an RSA key pair by typing the following. You need to do this twice, once as the hdfs user and once as the yarn user:
% ssh-keygen -t rsa -f ~/.ssh/id_rsa



Even though we want passwordless logins, keys without passphrases are not considered good practice (it’s OK to have an empty passphrase when running a local pseudo-distributed cluster, as described in Appendix A), so we specify a passphrase when prompted for one. We use ssh-agent to avoid the need to enter a passphrase for each connection.


Next, we need to make sure that the public key is in the ~/.ssh/authorized_keys file on all the machines in the cluster that we want to connect to. If the users’ home directories are stored on an NFS filesystem, the keys can be shared across the cluster by typing the following (first as hdfs and then as yarn):
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys



Test that you can SSH from the master to a worker machine by making sure ssh-agent is running,[70] and then run ssh-add to store your passphrase. You should be able to SSH to a worker without entering the passphrase again.
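The agent workflow just described can be sketched as follows (a non-authoritative example; worker1 is a hypothetical worker hostname):

```shell
# Start an agent for this shell session, then load the key,
# entering the passphrase once when ssh-add prompts for it.
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa

# Subsequent logins to cluster machines should not prompt again.
ssh worker1 hostname
```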



Hadoop must have its configuration set appropriately to run in distributed mode on a cluster. The important configuration settings to achieve this are discussed in Hadoop Configuration.



Before it can be used, a brand-new HDFS installation needs to be formatted. The formatting process creates an empty filesystem by creating the storage directories and the initial versions of the namenode’s persistent data structures. Datanodes are not involved in the initial formatting process, since the namenode manages all of the filesystem’s metadata, and datanodes can join or leave the cluster dynamically. For the same reason, you don’t need to say how large a filesystem to create, since this is determined by the number of datanodes in the cluster, which can be increased as needed, long after the filesystem is formatted.


Formatting HDFS is a fast operation. Run the following command as the hdfs user:
% hdfs namenode -format




Hadoop comes with scripts for running commands and starting and stopping daemons across the whole cluster. To use these scripts (which can be found in the sbin directory), you need to tell Hadoop which machines are in the cluster. There is a file for this purpose, called slaves, which contains a list of the machine hostnames or IP addresses, one per line. The slaves file lists the machines that the datanodes and node managers should run on. It resides in Hadoop’s configuration directory, although it may be placed elsewhere (and given another name) by changing the HADOOP_SLAVES setting in hadoop-env.sh. Also, this file does not need to be distributed to worker nodes, since it is used only by the control scripts running on the namenode or resource manager.
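As a sketch, a slaves file for a three-worker cluster might look like this (the hostnames are made up for illustration):

```shell
# Write a sample slaves file: one worker hostname per line.
printf '%s\n' worker1.example.com worker2.example.com worker3.example.com > slaves
cat slaves
```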


The HDFS daemons are started by running the following command as the hdfs user:
% start-dfs.sh



The machine (or machines) that the namenode and secondary namenode run on is determined by interrogating the Hadoop configuration for their hostnames. For example, the script finds the namenode’s hostname by executing the following:


% hdfs getconf -namenodes


By default, this finds the namenode’s hostname from fs.defaultFS. In slightly more detail, the start-dfs.sh script does the following:

  • Starts a namenode on each machine returned by executing hdfs getconf -namenodes[71]
  • Starts a datanode on each machine listed in the slaves file
  • Starts a secondary namenode on each machine returned by executing hdfs getconf -secondarynamenodes 


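Once start-dfs.sh completes, one way to sanity-check the result (a sketch, not part of the script itself) is to look at the running daemons and the registered datanodes:

```shell
# jps lists the local Java processes: NameNode on the namenode
# machine, DataNode on each worker.
jps

# dfsadmin -report shows the datanodes that have registered
# with the namenode, with their capacity and status.
hdfs dfsadmin -report
```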

The YARN daemons are started in a similar way, by running the following command as
the yarn user on the machine hosting the resource manager:
% start-yarn.sh



In this case, the resource manager is always run on the machine from which the start-yarn.sh script was run. More specifically, the script:

  • Starts a resource manager on the local machine
  • Starts a node manager on each machine listed in the slaves file


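As with HDFS, a quick check that the node managers have registered with the resource manager (a sketch; run once the daemons are up) is:

```shell
# Lists the node managers known to the resource manager,
# along with each node's state and running container count.
yarn node -list
```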

Also provided are stop-dfs.sh and stop-yarn.sh scripts to stop the daemons started by the corresponding start scripts.


These scripts start and stop Hadoop daemons using the hadoop-daemon.sh script (or the yarn-daemon.sh script, in the case of YARN). If you use the aforementioned scripts, you shouldn’t call hadoop-daemon.sh directly. But if you need to control Hadoop daemons from another system or from your own scripts, the hadoop-daemon.sh script is a good integration point. Likewise, hadoop-daemons.sh (with an “s”) is handy for starting the
same daemon on a set of hosts.

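For example, controlling a single daemon from your own scripts might look like this (a sketch of the integration point described above, run on the machine that should host the daemon):

```shell
# hadoop-daemon.sh takes an action (start/stop) and a daemon name,
# and affects only the local machine.
hadoop-daemon.sh start datanode
hadoop-daemon.sh stop datanode
```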

Finally, there is only one MapReduce daemon — the job history server, which is started as follows, as the mapred user:
% mr-jobhistory-daemon.sh start historyserver



Once you have a Hadoop cluster up and running, you need to give users access to it. This involves creating a home directory for each user and setting ownership permissions on it:
% hadoop fs -mkdir /user/username
% hadoop fs -chown username:username /user/username
This is a good time to set space limits on the directory. The following sets a 1 TB limit on the given user directory:
% hdfs dfsadmin -setSpaceQuota 1t /user/username


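A quota can be checked afterwards (a sketch; username is a placeholder, as above):

```shell
# -count -q reports the name and space quotas for the directory,
# along with how much of each remains.
hadoop fs -count -q /user/username
```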

