muxspace/cuny_msda_is622

Language: Shell

git: https://github.com/muxspace/cuny_msda_is622

Resources for CUNY MS Data Analytics course IS622 Big Data & Machine Learning

Overview

This script sets up a complete Hadoop and Spark environment for
a debian-based Linux system. It also installs the associated R bindings
to run these systems directly from R. Note that it assumes a new
environment, so if you have an existing Linux system, you should
review the script and comment out sections you do not want run.

If you don't have a Linux machine available, your options are to either
install Linux on a virtual machine or use a hosted cloud provider. I
recommend installing Ubuntu 15.04 Server.

RHadoop

Installation

Step 1: Install Hadoop and Spark

The script setup_reqs.sh installs a bunch of dependencies and downloads Hadoop and
Spark. If you have a pre-existing system, check the dependencies to ensure there are
no conflicts with your configuration.

./setup_reqs.sh

Installation follows the procedure for a single, local instance
described in this guide.
The guide recommends ~/Programs as the installation directory,
while the script uses ~/workspace/cuny_msda_is622. Note this location.

The script now configures your environment with default environment variables.
If you go off script or have a different configuration, you may need to
update these.
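
A quick way to confirm the variables are visible in your current shell
(SPARK_HOME is an assumed name; check ~/.bashrc for what the script actually exports):

echo $HADOOP_HOME
echo $SPARK_HOME
$HADOOP_HOME/bin/hadoop version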

Now configure HDFS and YARN as described in the guide. You will need to
change the paths in hdfs-site.xml. The other two configuration files can be
used verbatim, provided your machine's specs meet or exceed those of the
reference machine.
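
For example, if you keep the HDFS storage under $HADOOP_HOME/hdfs (the layout
assumed in the troubleshooting section below), create the directories that
dfs.namenode.name.dir and dfs.datanode.data.dir point to:

# Hypothetical storage locations referenced from hdfs-site.xml
mkdir -p $HADOOP_HOME/hdfs/namenode
mkdir -p $HADOOP_HOME/hdfs/datanode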

Step 2: Start Hadoop and YARN

Be sure that your environment variables are properly set. This will normally
happen at log in, but if you are in the same shell where setup_reqs.sh
was run, you'll need to manually load ~/.bashrc via

source ~/.bashrc

Initialize the name node. Only do this once.

$HADOOP_HOME/bin/hdfs namenode -format

Now start all daemons.

cd ~/workspace/cuny_msda_is622
./bin/start_all.sh

This will start the HDFS daemons and the YARN daemons. Behind the
scenes, the script simply runs the commands below.

# Start the namenode daemon
$HADOOP_HOME/sbin/hadoop-daemon.sh start namenode
# Start the datanode daemon
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode

# Start the resourcemanager daemon
$HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager
# Start the nodemanager daemon
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager
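
To confirm that everything came up, jps (bundled with the JDK) should show
all four daemons:

# Expect NameNode, DataNode, ResourceManager, and NodeManager in the listing
jps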

Step 3: Install RHadoop

At a minimum, the environment variables HADOOP_CMD and HADOOP_STREAMING
need to be exported. These point to the hadoop executable and the
streaming jar, respectively. The streaming jar is what enables Hadoop
to speak to R. Both have already been appended to your .bashrc in
Step 1.
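
If you ever need to set them by hand, the exports look roughly like this
(the streaming jar path assumes the standard Hadoop 2.x layout; adjust to your install):

export HADOOP_CMD=$HADOOP_HOME/bin/hadoop
export HADOOP_STREAMING=$(ls $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar)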

Run the bash script setup_rhadoop.sh to install the necessary R libraries.
You need devtools installed beforehand.

./setup_rhadoop.sh
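
If devtools is not already installed, something like this from a shell will do:

R -e 'install.packages("devtools", repos="https://cloud.r-project.org")'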

This will install Revolution Analytics' packages for rhdfs and rmr2.
Support for rhbase requires additional dependencies,
and for the sake of simplicity, it is omitted.

Verify RHadoop Integration

Step 1: Sanity check

First, do a simple sanity check to verify that R can speak to Hadoop.
This process is described in more detail in
this tutorial.

library(rmr2)
library(rhdfs)
hdfs.init()

# Push a small vector to HDFS, then square each value in a map-only job
small.ints <- to.dfs(1:1000)
fs.ptr <- mapreduce(input=small.ints, map=function(k,v) cbind(v, v^2))
result <- from.dfs(fs.ptr)
head(result$val)

Step 2: Read and process file

Download some data.

cd ~/workspace/cuny_msda_is622
mkdir data
cd data
wget http://download.bls.gov/pub/time.series/sm/sm.data.1.AllData

Now put it into Hadoop.

hadoop fs -mkdir -p /bls/employment
hadoop fs -copyFromLocal sm.data.1.AllData /bls/employment/state_metro.tsv
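
To confirm the file landed in HDFS:

hadoop fs -ls /bls/employment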

At this point, we have a raw TSV in HDFS. Before processing it, the TSV
needs to be in a format that RHadoop can work with.

# Read the tab-separated input and write comma-separated output
tsv.format <- make.input.format("csv", sep="\t")
csv.format <- make.output.format("csv", sep=",")
input <- '/bls/employment/state_metro.tsv'
output <- '/bls/employment/state_metro_1.csv'
# Map: key on the first column and emit a year.period label paired with the value;
# Reduce: emit one record per key
out.ptr <- mapreduce(input, input.format=tsv.format,
  output=output, output.format=csv.format,
  map=function(k,v) {
    keyval(v[[1]],
      cbind(yearmonth=sprintf("%s.%s", v[[2]], v[[3]]), value=v[[4]]))
  },
  reduce=function(k,v) {
    keyval(k, length(v))
  })
result <- from.dfs(out.ptr, format="csv")

For more on getting data in and out of HDFS with rmr2, see
https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/getting-data-in-and-out.md

Another Example

Iris
Export as CSV
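
A minimal sketch of the export, assuming the rmr2 setup from the previous
section (the HDFS path /bls/iris is illustrative):

library(rmr2)
# Write the built-in iris data frame to HDFS as comma-separated text
iris.out <- make.output.format("csv", sep=",")
to.dfs(iris, output="/bls/iris", format=iris.out)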

Read as CSV
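
And a sketch of reading it back under the same assumptions:

# Read the CSV back from HDFS into a key-value pair
iris.in <- make.input.format("csv", sep=",")
iris.kv <- from.dfs("/bls/iris", format=iris.in)
head(iris.kv$val)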

Other Commands

Here are some other commands that you will find useful.

Stop Hadoop

$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
$HADOOP_HOME/sbin/hadoop-daemon.sh stop namenode
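
If the YARN daemons from Step 2 are still running, they can be stopped the same way:

$HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager
$HADOOP_HOME/sbin/yarn-daemon.sh stop resourcemanager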

Troubleshooting

java.io.IOException: Incompatible clusterIDs

Somehow your namenode and datanode got assigned different cluster IDs. You can
edit the Hadoop configuration to make the two IDs match. First look at the IDs:

cd $HADOOP_HOME
grep clusterID hdfs/*/current/VERSION

If these do not match, then:
+ stop the datanode and namenode;
+ copy the cluster ID from one file;
+ edit the other file and replace the cluster ID;
+ restart the namenode and datanode.
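
A sketch of those steps, assuming directory names like hdfs/namenode and
hdfs/datanode under $HADOOP_HOME (use whatever paths the grep above reports):

$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
$HADOOP_HOME/sbin/hadoop-daemon.sh stop namenode
# Note the namenode's clusterID, then edit the datanode's VERSION file to match it
grep clusterID $HADOOP_HOME/hdfs/namenode/current/VERSION
vi $HADOOP_HOME/hdfs/datanode/current/VERSION
$HADOOP_HOME/sbin/hadoop-daemon.sh start namenode
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode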

In the $HADOOP_HOME/logs directory, verify that the datanode and namenode log
files do not have any FATAL errors.