Flume Study Notes

毛毛小妖 2020-03-03

I. Flume Overview

1. Definition

Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of log data, provided by Cloudera. It is based on a streaming architecture, which keeps it flexible and simple.

2. Flume Architecture

2.1. Agent

An Agent is a JVM process that moves data from a source to a destination in the form of events.

An Agent consists of three main components: Source, Channel, and Sink.

2.2. Source

The Source is the component that receives data into the Flume Agent. It can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.
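
For example, a minimal sketch of an exec source that tails a log file (the agent name a1, source name r1, and the file path are illustrative):

a1.sources = r1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log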

2.3. Sink

The Sink continuously polls the Channel for events and removes them in batches, writing them to a storage or indexing system or sending them on to another Flume Agent.

Sink destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and custom sinks.
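
As a hedged sketch, an HDFS sink might look like the following (the NameNode address and path are illustrative; useLocalTimeStamp lets the %Y%m%d escapes resolve without a timestamp header):

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop102:9000/flume/%Y%m%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.batchSize = 100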

2.4. Channel

The Channel is a buffer that sits between the Source and the Sink, so the Channel allows the Source and the Sink to operate at different rates.

Flume ships with two Channels: Memory Channel and File Channel.
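
The quick-start example below uses a memory channel. As a sketch, a file channel (durable across agent restarts) would be configured like this, with illustrative directories:

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint
a1.channels.c1.dataDirs = /opt/module/flume/data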

2.5. Event

The Event is the basic unit of Flume data transfer; data travels from source to destination in the form of Events. An Event consists of a Header and a Body: the Header holds attributes of the event as key-value pairs, and the Body holds the payload as a byte array.

II. Flume Quick Start

1. Flume Installation and Deployment

1) Upload apache-flume-1.7.0-bin.tar.gz to the /opt/software directory on the Linux machine.

2) Extract apache-flume-1.7.0-bin.tar.gz into the /opt/module/ directory:

[atguigu@hadoop102 software]$ tar -zxf apache-flume-1.7.0-bin.tar.gz -C /opt/module/

3) Rename apache-flume-1.7.0-bin to flume:

[atguigu@hadoop102 module]$ mv apache-flume-1.7.0-bin flume

2. Flume Getting-Started Example

2.1. Requirement

Use Flume to listen on a port, collect the data arriving at that port, and print it to the console.

2.2. Requirement Analysis

A netcat source listens on the port, a memory channel buffers the incoming events, and a logger sink prints them to the console.

2.3. Implementation Steps

1) Install the netcat tool:

[atguigu@hadoop102 software]$ sudo yum install -y nc

2) Create the Flume Agent configuration file.

Create a job folder under the flume directory and change into it:

[atguigu@hadoop102 flume]$ mkdir job
[atguigu@hadoop102 flume]$ cd job/

Create the Flume Agent configuration file flume-netcat-logger.conf in the job folder:

[atguigu@hadoop102 job]$ vim flume-netcat-logger.conf

Add the following content to flume-netcat-logger.conf:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
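# capacity: max number of events the channel can hold
# transactionCapacity: max number of events per put/take transaction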
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

3) Start the Flume agent listening on the port (-c/--conf is the Flume configuration directory, -n/--name is the agent name, -f/--conf-file is the job file; the -D option logs at INFO level to the console):

[atguigu@hadoop102 flume]$ bin/flume-ng agent -c conf/ -n a1 -f job/flume-netcat-logger.conf -Dflume.root.logger=INFO,console

4) Use netcat to send data to port 44444 on the local machine:

[atguigu@hadoop102 ~]$ nc localhost 44444
hello 
world

5) Observe the received data in the Flume agent's console.
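
If everything is wired up correctly, the logger sink prints each event roughly like this (a sketch; the exact formatting can vary by Flume version):

Event: { headers:{} body: 68 65 6C 6C 6F 20    hello  }
Event: { headers:{} body: 77 6F 72 6C 64       world }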

III. Flume Advanced

1. Flume Transactions

A Source writes events into the Channel inside a put transaction (doPut stages a batch, doCommit hands it to the Channel, doRollback discards it on failure), and a Sink reads events inside a take transaction (doTake stages a batch, doCommit removes it from the Channel, doRollback returns it), so every batch either succeeds as a whole or is retried.

2. Flume Agent Internals

Key Components

1) ChannelSelector

The ChannelSelector decides which Channel(s) an Event will be sent to. There are two types: Replicating and Multiplexing.

The ReplicatingSelector sends every Event to all of the configured Channels, while the Multiplexing selector routes different Events to different Channels according to a rule, typically the value of an event header.
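
A hedged sketch of both selector types (channel names and the state header are illustrative; a source uses one selector at a time):

# Replicating: every event is copied to both c1 and c2
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2

# Multiplexing: route by the value of the "state" header
# a1.sources.r1.selector.type = multiplexing
# a1.sources.r1.selector.header = state
# a1.sources.r1.selector.mapping.CZ = c1
# a1.sources.r1.selector.mapping.US = c2
# a1.sources.r1.selector.default = c2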

2) SinkProcessor

There are three types of SinkProcessor: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor.

DefaultSinkProcessor serves a single Sink, while LoadBalancingSinkProcessor and FailoverSinkProcessor operate on a Sink Group: LoadBalancingSinkProcessor provides load balancing across the group, and FailoverSinkProcessor provides failover (error recovery).
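
A hedged sketch of the two sink-group processors (group and sink names are illustrative; configure one processor type per group):

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# Load balancing: spread events across k1 and k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.backoff = true

# Failover: the higher-priority sink is active; k2 takes over if k1 fails
# a1.sinkgroups.g1.processor.type = failover
# a1.sinkgroups.g1.processor.priority.k1 = 10
# a1.sinkgroups.g1.processor.priority.k2 = 5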

3. Flume Topologies

3.1. Simple Chain

Several agents are connected in sequence: the avro sink of one agent sends events to the avro source of the next, and the last agent writes to the final destination.

3.2. Replication and Multiplexing

A single source fans events out to multiple channels, either copying every event to all of them or routing by header value (see the ChannelSelector section above).

3.3. Load Balancing and Failover

A sink group spreads events across several sinks, or fails over to a standby sink when the active one goes down (see the SinkProcessor section above).

3.4. Aggregation

Many agents, one per web or application server, send their events to one or more aggregating agents, which write the combined stream to HDFS or another store.

4. Flume Data Flow Monitoring

4.1. Ganglia Installation and Deployment

1) Install the httpd service and PHP:

[atguigu@hadoop102 flume]$ sudo yum -y install httpd php

2) Install other dependencies:

[atguigu@hadoop102 flume]$ sudo yum -y install rrdtool perl-rrdtool rrdtool-devel
[atguigu@hadoop102 flume]$ sudo yum -y install apr-devel

3) Install Ganglia:

[atguigu@hadoop102 flume]$ sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
[atguigu@hadoop102 flume]$ sudo yum -y install ganglia-gmetad 
[atguigu@hadoop102 flume]$ sudo yum -y install ganglia-web
[atguigu@hadoop102 flume]$ sudo yum install -y ganglia-gmond

4) Edit the configuration file /etc/httpd/conf.d/ganglia.conf:

[atguigu@hadoop102 flume]$ sudo vim /etc/httpd/conf.d/ganglia.conf

# Ganglia monitoring system php web frontend
Alias /ganglia /usr/share/ganglia
<Location /ganglia>
  Order deny,allow
  #Deny from all
  Allow from all
  # Allow from 127.0.0.1
  # Allow from ::1
  # Allow from .example.com
</Location>

5) Edit the configuration file /etc/ganglia/gmetad.conf:

[atguigu@hadoop102 flume]$ sudo vim /etc/ganglia/gmetad.conf

Change the data_source line to:

data_source "hadoop102" 192.168.1.102

6) Edit the configuration file /etc/ganglia/gmond.conf:

[atguigu@hadoop102 flume]$ sudo vim /etc/ganglia/gmond.conf

Change it to:

cluster {
  name = "hadoop102"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}

udp_send_channel {
  #bind_hostname = yes # Highly recommended, soon to be default.
                       # This option tells gmond to use a source address
                       # that resolves to the machine's hostname.  Without
                       # this, the metrics may appear to come from any
                       # interface and the DNS names associated with
                       # those IPs will be used to create the RRDs.
  # mcast_join = 239.2.11.71
  host = 192.168.1.102
  port = 8649
  ttl = 1
}

udp_recv_channel {
  # mcast_join = 239.2.11.71
  port = 8649
  bind = 192.168.1.102
  retry_bind = true
  # Size of the UDP buffer. If you are handling lots of metrics you really
  # should bump it up to e.g. 10MB or even higher.
  # buffer = 10485760
}

7) Edit the configuration file /etc/selinux/config:

[atguigu@hadoop102 flume]$ sudo vim /etc/selinux/config

Change it to:

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted

Tip: disabling SELinux only takes effect after a reboot. If you don't want to reboot right now, you can make it take effect temporarily:

[atguigu@hadoop102 flume]$ sudo setenforce 0

8) Start Ganglia:

[atguigu@hadoop102 flume]$ sudo service httpd start
[atguigu@hadoop102 flume]$ sudo service gmetad start
[atguigu@hadoop102 flume]$ sudo service gmond start

9) Open the Ganglia page in a browser:

http://192.168.1.102/ganglia

Tip: if you still see permission-denied errors after completing the steps above, change the permissions on the /var/lib/ganglia directory:

[atguigu@hadoop102 flume]$ sudo chmod -R 777 /var/lib/ganglia
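
Ganglia only provides the dashboard; for Flume to actually report metrics to it, start the agent with the monitoring properties, as in this sketch (reusing the netcat job from Part II and the gmond address configured above):

[atguigu@hadoop102 flume]$ bin/flume-ng agent -c conf/ -n a1 -f job/flume-netcat-logger.conf -Dflume.root.logger=INFO,console -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=192.168.1.102:8649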

 
