Hadoop Installation

Translated from Hadoop: The Definitive Guide, 4th Edition

Cluster Setup and Installation

This section describes how to install and configure a basic Hadoop cluster from scratch using the Apache Hadoop distribution on a Unix operating system. It provides background information on the things you need to think about when setting up Hadoop. For a production installation, most users and operators should consider one of the Hadoop cluster management tools listed at the beginning of this chapter.

Installing Java

Hadoop runs on both Unix and Windows operating systems, and requires Java to be installed. For a production installation, you should select a combination of operating system, Java, and Hadoop that has been certified by the vendor of the Hadoop distribution you are using. There is also a page on the Hadoop wiki that lists combinations that community members have run with success.

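Before going any further, it can help to confirm which Java is installed and what JAVA_HOME points to on each machine; this quick check is an addition to the source text:
% java -version
% echo $JAVA_HOME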

Creating Unix User Accounts

It’s good practice to create dedicated Unix user accounts to separate the Hadoop processes from each other, and from other services running on the same machine. The HDFS, MapReduce, and YARN services are usually run as separate users, named hdfs, mapred, and yarn, respectively. They all belong to the same hadoop group.

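The source text does not spell out the commands for creating these accounts, and they differ between operating systems; on a typical Linux distribution, one possible sketch is:
% sudo groupadd hadoop
% sudo useradd -g hadoop -m hdfs
% sudo useradd -g hadoop -m mapred
% sudo useradd -g hadoop -m yarn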

Installing Hadoop

Download Hadoop from the Apache Hadoop releases page, and unpack the contents of the distribution in a sensible location, such as /usr/local (/opt is another standard choice; note that Hadoop should not be installed in a user’s home directory, as that may be an NFS-mounted directory):

% cd /usr/local
% sudo tar xzf hadoop-x.y.z.tar.gz

You also need to change the owner of the Hadoop files to be the hadoop user and group:
% sudo chown -R hadoop:hadoop hadoop-x.y.z
It’s convenient to put the Hadoop binaries on the shell path too:
% export HADOOP_HOME=/usr/local/hadoop-x.y.z
% export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

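These export commands only affect the current shell session. One way to make them persistent (an assumption, not part of the source text) is to put the same two lines in a profile script such as /etc/profile.d/hadoop.sh, which most Linux systems read at login:
# /etc/profile.d/hadoop.sh
export HADOOP_HOME=/usr/local/hadoop-x.y.z
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin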

Configuring SSH

The Hadoop control scripts (but not the daemons) rely on SSH to perform cluster-wide operations. For example, there is a script for stopping and starting all the daemons in the cluster. Note that the control scripts are optional — cluster-wide operations can be performed by other mechanisms, too, such as a distributed shell or dedicated Hadoop management applications.

To work seamlessly, SSH needs to be set up to allow passwordless login for the hdfs and yarn users from machines in the cluster.[69] The simplest way to achieve this is to generate a public/private key pair and place it in an NFS location that is shared across the cluster.

First, generate an RSA key pair by typing the following. You need to do this twice, once as the hdfs user and once as the yarn user:
% ssh-keygen -t rsa -f ~/.ssh/id_rsa

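One way to run the key generation once per user (switching users with sudo here is just one option, not prescribed by the source text) is to open a login shell as each account in turn:
% sudo -i -u hdfs
% ssh-keygen -t rsa -f ~/.ssh/id_rsa
% exit
% sudo -i -u yarn
% ssh-keygen -t rsa -f ~/.ssh/id_rsa
% exit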

Even though we want passwordless logins, keys without passphrases are not considered good practice (it’s OK to have an empty passphrase when running a local pseudo-distributed cluster, as described in Appendix A), so we specify a passphrase when prompted for one. We use ssh-agent to avoid the need to enter a password for each connection.

Next, we need to make sure that the public key is in the ~/.ssh/authorized_keys file on all the machines in the cluster that we want to connect to. If the users’ home directories are stored on an NFS filesystem, the keys can be shared across the cluster by typing the following (first as hdfs and then as yarn):
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Test that you can SSH from the master to a worker machine by making sure ssh-agent is running,[70] and then run ssh-add to store your passphrase. You should be able to SSH to a worker without entering the passphrase again.

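Putting these steps together, a minimal test session might look like the following, where worker1 is a placeholder hostname rather than anything from the source text:
% eval "$(ssh-agent -s)"
% ssh-add
% ssh worker1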

Configuring Hadoop

Hadoop must have its configuration set appropriately to run in distributed mode on a cluster. The important configuration settings to achieve this are discussed in Hadoop Configuration.

Formatting the HDFS Filesystem

Before it can be used, a brand-new HDFS installation needs to be formatted. The formatting process creates an empty filesystem by creating the storage directories and the initial versions of the namenode’s persistent data structures. Datanodes are not involved in the initial formatting process, since the namenode manages all of the filesystem’s metadata, and datanodes can join or leave the cluster dynamically. For the same reason, you don’t need to say how large a filesystem to create, since this is determined by the number of datanodes in the cluster, which can be increased as needed, long after the filesystem is formatted.

Formatting HDFS is a fast operation. Run the following command as the hdfs user:
% hdfs namenode -format

Starting and Stopping the Daemons

Hadoop comes with scripts for running commands and starting and stopping daemons across the whole cluster. To use these scripts (which can be found in the sbin directory), you need to tell Hadoop which machines are in the cluster. There is a file for this purpose, called slaves, which contains a list of the machine hostnames or IP addresses, one per line. The slaves file lists the machines that the datanodes and node managers should run on. It resides in Hadoop’s configuration directory, although it may be placed elsewhere (and given another name) by changing the HADOOP_SLAVES setting in hadoop-env.sh. Also, this file does not need to be distributed to worker nodes, since it is used only by the control scripts running on the namenode or resource manager.

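For illustration, a slaves file for a three-node worker set might look like this (the hostnames are placeholders, not taken from the source text):
worker1.example.com
worker2.example.com
worker3.example.com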

The HDFS daemons are started by running the following command as the hdfs user:
% start-dfs.sh

The machine (or machines) that the namenode and secondary namenode run on is determined by interrogating the Hadoop configuration for their hostnames. For example, the script finds the namenode’s hostname by executing the following:

% hdfs getconf -namenodes

By default, this finds the namenode’s hostname from fs.defaultFS. In slightly more detail, the start-dfs.sh script does the following:

  • Starts a namenode on each machine returned by executing hdfs getconf -namenodes[71]
  • Starts a datanode on each machine listed in the slaves file
  • Starts a secondary namenode on each machine returned by executing hdfs getconf -secondarynamenodes 

The YARN daemons are started in a similar way, by running the following command as the yarn user on the machine hosting the resource manager:
% start-yarn.sh

In this case, the resource manager is always run on the machine from which the start-yarn.sh script was run. More specifically, the script:

  • Starts a resource manager on the local machine
  • Starts a node manager on each machine listed in the slaves file

Also provided are stop-dfs.sh and stop-yarn.sh scripts to stop the daemons started by the corresponding start scripts.

These scripts start and stop Hadoop daemons using the hadoop-daemon.sh script (or the yarn-daemon.sh script, in the case of YARN). If you use the aforementioned scripts, you shouldn’t call hadoop-daemon.sh directly. But if you need to control Hadoop daemons from another system or from your own scripts, the hadoop-daemon.sh script is a good integration point. Likewise, hadoop-daemons.sh (with an “s”) is handy for starting the same daemon on a set of hosts.

Finally, there is only one MapReduce daemon — the job history server, which is started as follows, as the mapred user:
% mr-jobhistory-daemon.sh start historyserver

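A quick way to check which Hadoop daemons are actually running on a node (this check is an addition to the source text) is the jps tool that ships with the JDK; it lists Java processes by class name, such as NameNode, DataNode, ResourceManager, NodeManager, and JobHistoryServer:
% jps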

Creating User Directories

Once you have a Hadoop cluster up and running, you need to give users access to it. This involves creating a home directory for each user and setting ownership permissions on it:
% hadoop fs -mkdir /user/username
% hadoop fs -chown username:username /user/username
This is a good time to set space limits on the directory. The following sets a 1 TB limit on the given user directory:
% hdfs dfsadmin -setSpaceQuota 1t /user/username
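To verify the quota afterwards (an added check, not part of the source text), the -count -q option reports the space quota and remaining space for a directory:
% hadoop fs -count -q /user/username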
