'Big Data' 카테고리의 글 목록

'Big Data'에 해당되는 글 6건

2017.04.11 [RethinkDB] 시작하기
2014.06.28 [ElasticSearch] ElasticSearch의 데이터 시각화 방법1
2013.09.12 [Hadoop] Mongo-Hadoop 에 대한 생각
2013.09.11 [Hadoop] MapReduce 직접 .jar 파일로 수행하기
2013.09.09 [Hadoop] Eclipse에서 Maven으로 하둡 코딩하기2
2013.09.09 [Hadoop] 개념이해 및 설치하기2

2017. 4. 11. 20:40 Big Data

[RethinkDB] 시작하기

RethinkDB는 리얼타임 웹을 위한 NoSQL 저장소이다.

설치하기

가이드에 따라 OS에 맞게 설치한다. (맥기준)

$ brew update && brew install rethinkdb

또는

다운로드 (.dmg)

docker로 실행할 경우

$ docker run -d -P --name rethink1 rethinkdb

저장소를 설치했으면 다음으로 NodeJS에서 사용할 클라이언트 드라이버를 설치한다.

실행하기

설치확인

$ which rethinkdb

/usr/local/bin/rethinkdb

실행하기

$ rethinkdb --bind all

Running rethinkdb 2.3.5 (CLANG 7.3.0 (clang-703.0.31))...

Running on Darwin 16.5.0 x86_64

Loading data from directory /Users/dowonyun/opensource/rethinkdb_data

Listening for intracluster connections on port 29015

Listening for client driver connections on port 28015

Listening for administrative HTTP connections on port 8080

브라우져기반 Admin(http://localhost:8080)을 통해 실시간 변경사항 조회, 테이블 생성, 서버관리, 데이터 조회등을 할 수 있다.

NodeJS환경에서 RethinkDB 접속하기

npm으로 노드환경을 만들고 rethink driver를 설치한다.

$ npm init -y

$ npm i rethinkdb --save

신규 파일을 만들고 Connect를 하고, 기본 데이터베이스인 test안에 신규 테이블을 생성해 보자.

// connect.js

r = require('rethinkdb');

let connection = null;

r.connect({ host: 'localhost', port: 28015 }, function (err, conn) {

if (err) throw err;

connection = conn;

r.db('test').tableCreate('authors').run(connection, function (err, result) {

if (err) throw err;

console.log(JSON.stringify(result, null, 2));

})

실행을 해보면 테이블이 생성되고 Admin화면에서도 확인해 볼 수 있다. rethinkDB 또한 Shard, Replica를 제공함을 알 수 있다.

$ node connect.js

{

"config_changes": [

{

"new_val": {

"db": "test",

"durability": "hard",

"id": "05f4b262-9e01-477e-a74c-3d4ccb14cf84",

"indexes": [],

"name": "authors",

"primary_key": "id",

"shards": [

{

"nonvoting_replicas": [],

"primary_replica": "DowonYunui_MacBook_Pro_local_bcb",

"replicas": [

"DowonYunui_MacBook_Pro_local_bcb"

]

}

"write_acks": "majority"

"old_val": null

}

"tables_created": 1

}

다음으로 rethinkDB는 자체 query 엔진을 가지고 있고, rethinkDB를 위한 쿼리 언어를 R eQL이라 한다. ReQL의 3가지 특성은 다음과 같다.

- ReQL embeds into your programming language: 펑션으로 만들어짐

- All ReQL queries are chainable: Dot ( . ) 오퍼레이터를 통해 체이닝 메소드 호출이다.

- All queries execute on the server: run을 호출해야 비로서 쿼리를 수행한다.

예제는 3가지의 특성을 보여주고 있다. query를 json 또는 SQL query statement으로 짜는게 아니라. 메소드 체이닝을 통해 쿼리를 만들고, 맨 나중에 Connection 객체를 파라미터로 넘겨주면서 run() 메소드를 실행한다.

r.table('users').pluck('last_name').distinct().count().run(conn)

일반 SQL 문과 자바스크립트 기반의 ReQL 차이점

구문의 차이저을 보자. row가 document이고, column이 field라는 차이점만 존재한다.

Insert

Select

다른 예제는 가이드 문서를 참조하자.

Horizon을 이용한 클라이언트 & 서버 구축하기

Horizon은 rehtinkDB를 사용해 realtime서비스 개발을 돕는 Node.js기반의 서버 프레임워크이면서 클라이언트에서 서버단의 접속을 RxJS 기반의 라이브러리를 통해 가능하다. 마치 Meteor 플랫폼과 유사한 리얼타임 아키텍트를 제공한다. Horizon의 CLI를 글로벌로 설치한다.

$ npm install -g horizon

호라이즌 실행은 hz 이다. 애플리케이션 스켈레톤을 hz을 이용해 만든다.

$ hz init realtime_app

Created new project directory realtime_app

Created realtime_app/src directory

Created realtime_app/dist directory

Created realtime_app/dist/index.html example

Created realtime_app/.hz directory

Created realtime_app/.gitignore

Created realtime_app/.hz/config.toml

Created realtime_app/.hz/schema.toml

Created realtime_app/.hz/secrets.toml

hz 서버 구동하고 http://localhost:55584/ 로 접속하면 http://localhost:8080 접속화면과 동일한 Admin화면을 볼 수 있다.

$ hz serve --dev

App available at http://127.0.0.1:8181

RethinkDB

├── Admin interface: http://localhost:55584

└── Drivers can connect to port 55583

Starting Horizon...

🌄 Horizon ready for connections

http://127.0.0.1:8181/을 호출하면 샘플 웹 화면을 볼 수 있다. 보다 자세한 사항은 홈페이지를 참조하고, 차후에 다시 다루어 보기로 하자.

<참조>

- rethinkDB 10분 가이드

- ReQL 가이드

- Pluralsight 가이드

'Big Data' 카테고리의 다른 글

[Hadoop] Mongo-Hadoop 에 대한 생각 (0)	2013.09.12
[Hadoop] MapReduce 직접 .jar 파일로 수행하기 (0)	2013.09.11
[Hadoop] Eclipse에서 Maven으로 하둡 코딩하기 (2)	2013.09.09
[Hadoop] 개념이해 및 설치하기 (2)	2013.09.09

posted by Peter Note

2014. 6. 28. 19:07 Big Data/ElasticSearch

[ElasticSearch] ElasticSearch의 데이터 시각화 방법

ElasticSearch(이하 ES) 엔진에 수집된 데이터에 대하여 Kibana 도움 없이 직접 Data Visualization 하는 기술 스택을 알아보고, 실데이터를 통한 화면 1본을 만들어 본다.

ES 시각화를 위한 다양한 방식의 시도들

사전조사-1) 직접 클라이언트 모듈 제작하여 Data Visualization 수행

- elasticsearch.js 또는 elatic.js 사용하지 않고 REST 호출 통해 데이터 시각화 수행

(P rotovis 는 D3.js 기반 Chart를 사용함)

- FullScale.co에서는 dangle 이라는 시각화 차트를 소개함

(D3.js 기반의 AngularJS Directives 차트)

- D3.js 기반의 전문적인 차트를 사용하는 방법을 익힐 수 있다. 하지만 제대로 갖춰진 ES 클라이언트 모듈의 필요성 대두

사전조사-2) elasticsearch.js 사용하여 Data Visualization 수행

- elasticsearch.js와 D3.js, jquery, require.js를 통해서 샘플 데이터를 시각화 수행

(AngularJS는 사용하지 않고, 전문적인 차트 모듈사용하지 않음)

- AngularJS 기반으로 Protovis 또는 Dangle 차트를 사용하여 표현해 본다.

사전조사-3) elastic.js 사용하여 Data Visualization 수행

- elastic.js 홈페이지에서 API를 숙지한다.

- DSL API를 살펴보자. DSL을 이해하기 위해서는 ES의 Search API에 대한 이해가 필요하다.

- Query를 작성하고 Filtering 하면 group by having과 유사한 facets (지금은 aggregation 을 사용)하여 검색을 한다.

Query -> Filter -> Aggregation에 대해 알면 DSL구성을 이해할 수 있다.

- 자신의 Twitter 데이터를 가지고 elastic.js와 angular.js를 사용하여 트윗 내용을 표현하는 방법 (GitHub 소스)

ES Data Visualization을 위한 나만의 Tech Stack 만들기

- ES 클라이언트 모듈 : elastic.js 의 DSL(Domain Specific Language)를 숙지한다.

elastic.js는 ElasticSearch의 공식 클라이언트 모듈인 elasticsearch.js 의 DSL 화 모듈로 namespace는 ejs 이다.

- 시각화를 위한 D3.js 의 개념 이해 (D3.js 배우기)

- Kibana에서 사용하고 있는 Frontend MV* Framework인 AngularJS (AngularJS 배우기)

- AngularJS 생태계 도구인 yeoman 을 통해 개발 시작하기 (generator-fullstack, Yeoman 사용방법)

- 물론 Node.js는 기본이다.

그래서 다음과 같이 정리해 본다.

- AngularJS 기반 Frontend 개발

1) Node.js 기초

2) Yeoman + generator-angular 기반

- D3.js 기반의 Chart 이며 AngularJS 바로 적용가능토록 Directives 화 되어 있는 차트 중 선택사용

1) Protovis

2) Dangle

3) Angular nvd3 charts (추천)

4) Angular-Charts

5) Angular Google Charts

- elasticsearch.js를 DSL로 만든 elastic.js 사용

그래서 다시 그림으로 정리해본 기술 스택

ES Data Visualization 개발을 위한 구성 stack 그림

== 이제 만들어 봅시다!!! ==

환경설정

- node.js 및 yeoman 설치 : npm install -g generator-angular-fullstack 설치 (generator-angular는 오류발생)

- twitter bootstrap RWD 메뉴화면 구성 (기본 화면 구성)

- angular-ui의 angular-bootstrap 설치 : bower install angular-bootstrap --save

- elasticsearch.js 설치 : bower install elasticsearch --save

- elastic.js 설치 : bower install elastic.js --save

- angular-nvd3-directives chart 설치 : bower install angularjs-nvd3-directives --save

angular layout 구성

- 애플리케이션 생성 : GitHub Repository를 만들어서 clone 한 디렉토리에서 수행하였다

$ git clone https://github.com/elasticsearch-kr/es-data-visualization-hackerton

$ cd es-data-visualization-hackerton

$ yo angular-fullstack esvisualization

- main.html 과 scripts/controllers/main.js 를 주로 수정함

// main.html 안에 angular-nvd3.directives html 태그 및 속성 설정

<button type="button" ng-click="getImpression()" class="btn btn-default">Impression Histogram</button>

</div>

<nvd3-multi-bar-chart

data="impressionData"

id="dataId"

xAxisTickFormat="xAxisTickFormatFunction()"

width="550"

height="350"

showXAxis="true"

showYAxis="true">

</nvd3-multi-bar-chart>

</div>

// main.js 안에서 elasticsearch.js를 직접 호출하여 사용함

angular.module('esvisualizationApp')

.controller('MainCtrl', function ($scope, $http) {

// x축의 값을 date으로 변환하여 찍어준다

$scope.xAxisTickFormatFunction = function(){

return function(d){

return d3.time.format('%x')(new Date(d));

}

// 화면에서 클릭을 하면 impression index 값을 ES에서 호출하여

// ES aggregation json 결과값을 파싱하여 차트 데이터에 맵핑한다

$scope.getImpression = function() {

// ES 접속을 위한 클라이언트를 생성한다

var client = new elasticsearch.Client({

host: '54.178.125.74:9200',

sniffOnStart: true,

sniffInterval: 60000,

});

// search 조회를 수행한다. POST 방식으로 body에 실 search query를 넣어준다

client.search({

index: 'impression',

size: 5,

body: {

"filter": {

"and": [

{

"range": {

"time": {

"from": "2013-7-1",

"to": "2014-6-30"

}

]

"aggs": {

"events": {

"terms": {

"field": "event"

"aggs" : {

"time_histogram" : {

"date_histogram" : {

"field" : "time",

"interval" : "1d",

"format" : "yyyy-MM-dd"

}

// End query.

}

}).then(function (resp) {

var impressions = resp.aggregations.events.buckets[0].time_histogram.buckets;

console.log(impressions);

var fixData = [];

angular.forEach(impressions, function(impression, idx) {

fixData[idx] = [impression.key, impression.doc_count];

});

// 결과에 대해서 promise 패턴으로 받아서 angular-nvd3-directives의 데이터 구조를 만들어 준다

// {key, values}가 하나의 series가 되고 배열을 가지면 다중 series가 된다.

$scope.impressionData = [

{

"key": "Series 1",

"values": fixData

}

];

// apply 적용을 해주어야 차트에 데이터가 바로 반영된다.

$scope.$apply();

});

}

});

결과 화면 및 해커톤 소감

- "Impression Histogram" 버튼을 클릭하면 차트에 표현을 한다.

- elastic.js의 DSL을 이용하면 다양한 파라미터에 대한 핸들링을 쉽게 할 수 있을 것같다.

- AngularJS 생태계와 elasticsearch.js(elastic.js)를 이용하면 Kibana의 도움 없이 자신이 원하는 화면을 쉽게 만들 수 있다.

- 관건은 역시 ES에 어떤 데이터를 어떤 형태로 넣느냐가 가장 먼저 고민을 해야하고, 이후 분석 query를 어떻게 짤 것인가 고민이 필요!

* 헤커톤 소스 위치 : https://github.com/elasticsearch-kr/es-data-visualization-hackerton

<참조>

- elasticsearch.js 공식 클라이언트 모듈을 DSL로 만든 elastic.js

- elastic.js를 사용한 ES Data Visualization을 AngularJS기반으로 개발

- Protovis 차트를 이용한 facet 데이터 표현

- ElasticSearch Data Visualization 방법

- D3.js 와 Angular.js 기반으로 Data Visualization

- ElasticSearch의 Facet과 Aggregation 수행 : 향후 Facet은 없어지고 Aggregation으로 대체될 것이다.

- AngularJS Directives Chart 비교

저작자표시 비영리 변경금지

posted by Peter Note

2013. 9. 12. 20:45 Big Data

[Hadoop] Mongo-Hadoop 에 대한 생각

몽고디비를 하둡의 Input/Output의 Store를 사용하면 어떨까? 어차피 몽고디비는 Document Store 이며 Scale-Out을 위한 무한한 Sharding(RDB의 Partitioning) 환경을 제공하니 충분히 사용할 수 있을 것이다. Store에 저장된 데이터의 Batch Processing Engine으로 하둡을 사용하면 될 일이다.

Mongo-Hadoop Connector 소개

- Hadoop을 통하면 Mongo안에 있는 데이터를 전체 코어를 사용하면서 병렬로 처리할 수 있다

- 하둡포멧으로 Mongo를 BSON format을 파일로 저장하거나, MongoDB에 바로 저장할 수 있는 Java API존재

- Pig + Hive를 사용할 수 있음

- AWS의 Amazon Elastic MapReduce 사용

Webinar: What's New with MongoDB Hadoop Integration from mongodb/10gen

Batch Processing Model 종류

- 사실 MongoDB에서도 Aggregation Framework을 제공하여 MapReduce프로그래밍을 JavaScript로 개발 적용할 수 있다다

- 시간단위 Batch Processing은 요렇게도 사용할 수 있겠다

- 데이터가 정말 Big 이면 하둡을 이용하여 Batch Processing을 해야겠다. 여기서 몽고디비를 "Raw Data Store" 와 "Result Data Store"로 사용한다

- MongoDB & Hadoop : Batch Processing Model 전체 내역을 보자

MongoDB & Hadoop: Flexible Hourly Batch Processing Model from Takahiro Inoue

<참조>

- 결국 처리된 데이터는 표현되어야 한다 : Data Visualization Resources

- MongoDB 넌 뭐니? NoSQL에 대한 이야기 (조대협)

저작자표시 비영리 변경금지

'Big Data' 카테고리의 다른 글

[RethinkDB] 시작하기 (0)	2017.04.11
[Hadoop] MapReduce 직접 .jar 파일로 수행하기 (0)	2013.09.11
[Hadoop] Eclipse에서 Maven으로 하둡 코딩하기 (2)	2013.09.09
[Hadoop] 개념이해 및 설치하기 (2)	2013.09.09

posted by Peter Note

2013. 9. 11. 19:20 Big Data

[Hadoop] MapReduce 직접 .jar 파일로 수행하기

Mapper & Reducer를 .jar로 배포하고 직접 하둡명령으로 수행하는 방법에 대하여 알아보자

MapReduce 프로그램

- Writable Interface는 Value에서 사용한다

- Mapper 인터페이스

Mapper<K1, V1, K2, V2>의 형태 : key는 WritableComparable를 구현해야 하며, value는 Writable를 구현해야 함.

- Reducer 인터페이스

reducer는 여러가지 매퍼로부터 생성된 결과를 받고, key/value 쌍의 key에 대해 데이터를 정렬하고 동일한 key에 대한 모든 값을 그룹핑 함.

- 이전의 WordCount에 대한 것을 직접 코딩하였는데, 맵퍼-TokenCountMapper-, 리듀서-LongSumReducer-를 사용해서 동일하게 만들 수 있다

hadoop 명령어로 .jar 직접 수행하기

- pom.xml 에 MapReduce Jar파일을 만들어 특정위치로 복사하는 플러그인 설정을 넣는다

<build>

<artifactId>maven-antrun-plugin</artifactId>

<tasks>

<copy file="target/${project.artifactId}-${project.version}.jar"

tofile="/Users/dowon/Documents/hadoop-jobs/${project.artifactId}-${project.version}.jar" />

</tasks>

</configuration>

<phase>install</phase>

<goals>

</goals>

</execution>

</executions>

</plugin>

</plugins>

</build>

- 기존 WordCount에 대한 WordCount3 복사본을 만들고 TokenCountMapper와 LongSumReducer로 변형한다

즉 직접 코딩하지 말고 하둡에서 제공하는 클래스를 사용한다

import java.io.IOException;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.FileInputFormat;

import org.apache.hadoop.mapred.FileOutputFormat;

import org.apache.hadoop.mapred.JobClient;

import org.apache.hadoop.mapred.JobConf;

import org.apache.hadoop.mapred.lib.LongSumReducer;

import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class WordCount3 {

public static void main(String[] args) throws IOException {

// 1. configuration Mapper & Reducer of Hadoop

JobConf conf = new JobConf(WordCount3.class);

conf.setJobName("wordcount3");

// 2. final output key type & value type

conf.setOutputKeyClass(Text.class);

conf.setOutputValueClass(LongWritable.class);

// 3. in/output format

conf.setMapperClass(TokenCountMapper.class);

conf.setCombinerClass(LongSumReducer.class);

conf.setReducerClass(LongSumReducer.class);

// 4. set the path of file for read files

// input path : args[0]

// output path : args[1]

FileInputFormat.setInputPaths(conf, new Path(args[0]));

FileOutputFormat.setOutputPath(conf, new Path(args[1]));

// 5. run job

JobClient client = new JobClient();

client.setConf(conf);

JobClient.runJob(conf);

}

- eclipse의 프로젝트를 선택하고 "Run As"에서 "Maven build..."를 선택하여 "clean install" 입력하고 "run"버튼을 클릭한다

- 결과로 배포가 성공으로 나오면 된다 : /Users/dowon/Documents/hadoop-jobs 디렉토리에 *.jar 파일 생성을 확인한다

- 하둡 데몬들을 수행하기 전 NameNode에 대해서 format을 하고 수행한다

// name node 포멧

$ hadoop namenode -format

// .bash_profile 에 PATH 설정

set -o vi

export JAVA_HOME=/Library/Java/Home

export H_HOME=~/Documents/hadoop-1.2.1

export PATH=.:$PATH:$JAVA_HOME/bin:$H_HOME/bin:/usr/bin

alias ll='ls -alrt'

alias cdh='cd $H_HOME'

// 하둡 데몬 수행

// 50030 : job-tracker 접속 포트

// 50070 : NameNode 접속 포트

$ start-all.sh

- input 의 위치를 지정하여 준다 (만일, NameNode를 포멧하였다면)

// 위치가 하기와 같다면

$ pwd

/Users/dowon/Documents/input

$ ls

total 16

-rw-r--r-- 1 dowon staff 22 9 11 19:26 file01

-rw-r--r-- 1 dowon staff 21 9 11 19:26 file02

// input 을 HDFS에 만든다

$ hadoop fs -put . input

1) http://localhost:50070/으로 접속하여 "Browser the filesystem"을 클릭하면 볼 수 있다

2) /user/dowon/input 경로로 만들어 졌음을 알 수 있다

- hadoop 명령어로 생성된 jar 파일을 수행해 보자

// 경로가 다음과 같고, WordCount3.class가 들어있는 .jar 파일이 존재한다

$ pwd

/Users/dowon/Documents/hadoop-jobs

$ ls

-rw-r--r-- 1 dowon staff 5391 9 11 19:45 MapReduce-1.0.0-SNAPSHOT.jar

// 명령어 수행

$ hadoop jar *.jar WordCount3 /user/dowon/input /user/dowon/output3

1) 네임노드에 접속해서 /user/dowon에 들어가 보면 "output3"이 생성된 것을 볼 수있다

2) output3으로 들어가면 결과값을 지니 파일이 존재한다

3) 만일 명령을 재수행하고 싶다면 output3 디렉토리리 삭제해야 한다

$ hadoop fs -rmr /user/dowon/output3

eclipse에서 수행하지 않고 반출된 .jar 파일을 가지고 hadoop명령으로 수행하는 방법을 알아보았다.

<참조>

없음

저작자표시 비영리 변경금지

'Big Data' 카테고리의 다른 글

[RethinkDB] 시작하기 (0)	2017.04.11
[Hadoop] Mongo-Hadoop 에 대한 생각 (0)	2013.09.12
[Hadoop] Eclipse에서 Maven으로 하둡 코딩하기 (2)	2013.09.09
[Hadoop] 개념이해 및 설치하기 (2)	2013.09.09

posted by Peter Note

2013. 9. 9. 21:52 Big Data

[Hadoop] Eclipse에서 Maven으로 하둡 코딩하기

Eclipse하에서 하둡코딩시 Maven을 기본으로 하여 외부 라이브러리 의존성을 관리하자.

Hadoop 역할

- 분산된 파일을 처리하는 순서

> input HDFS으로 들어오기

> Job 수행 : 읽어서 로직처리

> 결과를 파일 또는 DB에 넣는다

- Tera 단위의 데이터가 이미 HDFS에 있을 경우 해당 데이터를 처리하는데 하둡의 쓰임새가 있다

- HDFS와 MapReduce의 이해

Intro to HDFS and MapReduce from Ryan Tabora

Maven Project 만들기

- Maven Project 선택하고 "Create a simple project" 선택한다

- 메이븐의 GroupID와 ArtifactID 설정한다

- 최종 생성 내역

MapReduce 프로그래밍을 여기서 하게 되고, 단위 테스트 프로그래밍도 할 수 있다

- pom.xml 에 hadoop 관련 라이브러리 의존관계를 넣는다. (파란색이 추가부분)

추가하고 저장을 하면 자동으로 의존관계 라이브러리를 다운로드 받는다

이클립트 좌측 "Project Explorer"의 "Maven Dependencies"에서 관련 파일들이 추가된 것을 확인할 수 있다

// pom.xml 내역

<groupId>kr.mobiconsoft.hadoop</groupId>

<artifactId>MapReduce</artifactId>

<version>1.0.0-SNAPSHOT</version>

<groupId>org.apache.hadoop</groupId>

<artifactId>hadoop-core</artifactId>

</dependency>

</dependencies>

</project>

// 결과

Word Counting MapReduce 구현하기

- file 2개 생성하고 유사한 word를 넣는다

// file01

hello world bye world

// file02

hi world hello dowon

- Mapper Class를 생성

import java.io.IOException;

import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.MapReduceBase;

import org.apache.hadoop.mapred.Mapper;

import org.apache.hadoop.mapred.OutputCollector;

import org.apache.hadoop.mapred.Reporter;

/**

* K1 : read key type

* V2 : read value type

* K2 : write key type

* V2 : write value type

//public class WordCountMapper implements Mapper<K1, V1, K2, V2> {

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

// map 결과는 reducer로 자동으로 던져진다

public void map(LongWritable key, Text value,

OutputCollector<Text, IntWritable> output, Reporter reporter)

throws IOException {

// TODO Auto-generated method stub

String line = value.toString();

StringTokenizer tokenizer = new StringTokenizer(line);

while(tokenizer.hasMoreTokens()) {

Text outputKey = new Text(tokenizer.nextToken());

// Hadoop 에서 wrapping한 Integer 타입의 객체를 넣어줌

// param1: outputKey, param2: outputValue

output.collect(outputKey, new IntWritable(1));

}

- Reducer Class 생성

import java.io.IOException;

import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.MapReduceBase;

import org.apache.hadoop.mapred.OutputCollector;

import org.apache.hadoop.mapred.Reducer;

import org.apache.hadoop.mapred.Reporter;

/**

* K1 : Mapper의 K2 와 동일

* V1 : Mapper의 V2 와 동일

public class WordCountReducer extends MapReduceBase

implements Reducer<Text, IntWritable, Text, IntWritable> {

/**

* V1 에서 values는 Iterator이다. 실제 같은 단어가 여러개 일 경우

public void reduce(Text key, Iterator<IntWritable> values,

OutputCollector<Text, IntWritable> output, Reporter reporter)

throws IOException {

// TODO Auto-generated method stub

int sum = 0;

while(values.hasNext()) {

sum += values.next().get(); // get Integer value

}

output.collect(key, new IntWritable(sum));

}

- Job Tracker를 생성 : 하단 main 선택한다

import java.io.IOException;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.FileInputFormat;

import org.apache.hadoop.mapred.FileOutputFormat;

import org.apache.hadoop.mapred.JobClient;

import org.apache.hadoop.mapred.JobConf;

import org.apache.hadoop.mapred.TextInputFormat;

import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

public static void main(String[] args) throws IOException {

// 1. configuration Mapper & Reducer of Hadoop

JobConf conf = new JobConf();

conf.setJobName("wordcount");

conf.setMapperClass(WordCountMapper.class);

conf.setReducerClass(WordCountReducer.class);

// 2. final output key type & value type

conf.setOutputKeyClass(Text.class);

conf.setOutputValueClass(IntWritable.class);

// 3. in/output format

conf.setInputFormat(TextInputFormat.class);

conf.setOutputFormat(TextOutputFormat.class);

// 4. set the path of file for read files

// input path : args[0]

// output path : args[1]

FileInputFormat.setInputPaths(conf, new Path(args[0]));

FileOutputFormat.setOutputPath(conf, new Path(args[1]));

// 5. run job

JobClient.runJob(conf);

}

- 최종 모습

- eclipse 설정하기

main펑션이 있는 WordCount를 수행할 때 input path와 output path를 지정하여 준다

이때 output path의 디렉토리는 생성되어 있지 않아야 한다 (target/hadoop-result)

하단 우측 "run" 클릭

- 결과값

- 결국 이런 처리과정을 수행하게 된다

- Mapper와 Reducer 역할

Mapper : 소스를 쪼개어 key:value 맵을 여러개 만들고

Reducer : 여러 Map 값을 하나의 결과값으로 만들어 준다

단위 테스트 해보기

- pom.xml에 mrunit 추가

<groupId>org.apache.hadoop</groupId>

<artifactId>hadoop-core</artifactId>

</dependency>

<groupId>org.apache.mrunit</groupId>

<artifactId>mrunit</artifactId>

<version>0.8.0-incubating</version>

</dependency>

</dependencies>

- Mapper Test 클래스 생성

Run As... 에서 JUnit으로 테스트 하여 초록색-성공인지 체크한다

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mrunit.MapDriver;

import org.junit.Test;

/**

* 테스트를 통하여 Mapper와 Reducer를 테스트에서 수행하여 검증 할 수 있다

* @author dowon

public class WordCountMapperTest {

@Test

public void testMap() {

// 1. 설

Text value = new Text("Hello World Bye World");

MapDriver<LongWritable, Text, Text, IntWritable> mapDriver = new MapDriver();

mapDriver.withMapper(new WordCountMapper());

mapDriver.withInputValue(value);

// 2. 검정 및 실행

// 순서를 정확히 해야 에러없이 수행된다. 빼먹어도 에러가 난다

mapDriver.withOutput(new Text("Hello"), new IntWritable(1));

mapDriver.withOutput(new Text("World"), new IntWritable(1));

mapDriver.withOutput(new Text("Bye"), new IntWritable(1));

mapDriver.withOutput(new Text("World"), new IntWritable(1));

mapDriver.runTest();

}

- Reducer Test 클래스 생성

Run As... 에서 JUnit으로 테스트 하여 초록색-성공인지 체크한다

import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mrunit.ReduceDriver;

import org.junit.Test;

public class WordCountReducerTest {

@Test

public void testReducer() {

// 1. 설정

ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver = new ReduceDriver();

reduceDriver.withReducer(new WordCountReducer());

reduceDriver.withInputKey(new Text("World"));

reduceDriver.withInputValues(Arrays.asList(new IntWritable(1), new IntWritable(1)));

// 2. 검증 및 실행

reduceDriver.withOutput(new Text("World"), new IntWritable(2));

reduceDriver.runTest();

}

<참조>

- Maven 기초 사용법

저작자표시 비영리 변경금지

'Big Data' 카테고리의 다른 글

[RethinkDB] 시작하기 (0)	2017.04.11
[Hadoop] Mongo-Hadoop 에 대한 생각 (0)	2013.09.12
[Hadoop] MapReduce 직접 .jar 파일로 수행하기 (0)	2013.09.11
[Hadoop] 개념이해 및 설치하기 (2)	2013.09.09

posted by Peter Note

2013. 9. 9. 21:34 Big Data

[Hadoop] 개념이해 및 설치하기

왜 빅데이터가 이슈가 되고 있을까? HW와 SW의 가격은 저렴해지고, 표준은 평준화 되고 접근이 수워지고 있다. 그러나 데이터는 복제나 공유가 되지 않고 자사의 데이터가 돈이 되는 시대가 왔다. 그런 의미에서 하둡은 빅데이터를 처리하는 분야의 SW이다.

하둡 개념

- Input: 분석할 데이터, Output: 결과값

- MasterNode: HDFS-분산파일위치 정보지님 (NameNode)

- SlaveNode: 분산된 실 데이터를 저장 (DataNode)

- MapReduce/HDFS Layer 영역으로 나뉨

- 역할에 대한 이해하기

Hadoop & MapReduce from Newvewm

- JobTracker : Map -> Reduce 할때 Shuffle+Sort의 로직처리가 성능을 좌우한다.

즉, Map출력결과 (Mapper) -> Suffle+Sort -> Sorting된 Reduce 입력 (Reducer)

Mapper/Reducer 프로그래밍도 분산된 것이다

- 개념이해하기

Hadoop from Scott Leberknight

설치하기

- http://apache.tt.co.kr/hadoop/common/hadoop-1.2.1/ 에서 hadoop-1.2.1-bin.tar.gz 파일을 다운로드 받는다

- 기본 환경은 Mac OS를 사용한다

- .bash_profile 안에 JAVA_HOME을 설정한다 : Java버전은 반드시 1.6 이상이어야 한다

$ cat .bash_profile

alias ll='ls -alrt'

set -o vi

export JAVA_HOME=/Library/Java/Home

- 압축을 푼다. 설치 끝

Standalone 사용하기

- Document 메뉴에서 1.2.1로 이동하여 "Single Node Setup"을 클릭 : http://hadoop.apache.org/docs/r1.2.1/single_node_setup.html

- 간단한 수행

// 폴더를 하나 만들고 xml 환경파일을 복사한다

$ mkdir input

$ cp conf/*.xml input

// 하기 명령을 수행한다

// input 읽을 꺼리를 주고 결과값을 output에 담아라

$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

// output 폴더가 자동 생성되고 cat 하였을 때 존재하는 파일의 내역이 하기와 같이 보이면 성공!

$ cat output/*

1 dfsadmin

- 수행은 어떤 의미일까?

+ NameNode : 하둡전체 관리 = JobTracker + DataNode

+ JobTracker : 처리역할

+ DataNode : HDFS 위치 (분산파일을 하나의 파일 인것처럼 사용하게 해줄 수 있는것)

- 하둡은 특정 input이 있고 처리하고 output 결과가 나온다. 현재 예는 로컬 input에 있는 것을 읽고 로컬 output에 생성하였다.

그러나 실제에서는 분산으로 처리하므로 local이 아닐 것이다.

Hadoop 종류

- 종류

 Local (Standalone) Mode [로컬(독립)모드]

하둡의 기본모드(아무런 환경설정을 하지 않음): 로컬 머신에서만 실행

다른 노드와 통신할 필요가 없기 때문에 HDFS를 사용하지 않으며 다른 데몬들도 실행시키지 않음

독립적으로 MapReduce 프로그램의 로직을 개발하고 디버깅하는데 유용함

 Pseudo-Distributed Mode [가상분산 모드]

한대의 컴퓨터로 클러스터를 구성하고, 모든 데몬을 실행함.

독립실행(standalone) 모드 기능 보완

– 메모리 사용 정도, HDFS 입출력 관련 문제, 다른 데몬 과의 상호작용에서 발생하는 일을 검사

 Fully-Distributed Mode [완전분산 모드]

분산 저장과 분산 연산의 모든 기능이 갖추어진 클러스터를 구성함

master - 클러스터의 master 노드로, NameNode와 JobTracker 데몬을 제공

backup - SNN(Secondary NameNode 데몬)을 제공하는 서버

slaves - DataNode와 TaskTracker 데몬을 실행하는 slaves 들

가상의 Standalone Hadoop 실행하기

- 3가지 기본 환경을 추가한다

- conf/core-site.xml

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<name>fs.default.name</name>

<value>hdfs://localhost:9000</value>

</property>

</configuration>

- conf/hdfs-site.xml

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<name>dfs.replication</name>

</property>

</configuration>

- conf/mared-site.xml

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<name>mapred.job.tracker</name>

<value>localhost:9001</value>

</property>

</configuration>

- NameNode 초기화

NameNode 가 있는 곳에서 수행한다

$ bin/hadoop namenode -format

13/09/09 20:51:59 INFO namenode.NameNode: STARTUP_MSG:

/************************************************************

STARTUP_MSG: Starting NameNode

STARTUP_MSG: host = KOSTA17ui-iMac.local/192.168.0.15

STARTUP_MSG: args = [-format]

STARTUP_MSG: version = 1.2.1

STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013

STARTUP_MSG: java = 1.6.0_45

************************************************************/

13/09/09 20:52:00 INFO util.GSet: Computing capacity for map BlocksMap

... 중략 ...

13/09/09 20:52:00 INFO common.Storage: Image file /tmp/hadoop-dowon/dfs/name/current/fsimage of size 111 bytes saved in 0 seconds.

13/09/09 20:52:00 INFO common.Storage: Storage directory /tmp/hadoop-dowon/dfs/name has been successfully formatted.

- NameNode, JobTracker, DataNode를 한번에 띄우기

$ bin/start-all.sh

starting namenode, logging to /Users/dowon/Documents/hadoop-1.2.1/libexec/../logs/hadoop-dowon-namenode-KOSTA17ui-iMac.local.out

2013-09-09 20:53:36.430 java[7632:1603] Unable to load realm info from SCDynamicStore

// 수행후 브라우져에서 50030, 50070 포트 호출

http://localhost:50030/dfshealth.jsp : JobTracker

http://localhost:50070/dfshealth.jsp : NameNode

- 가상 HDFS 방식으로 수행해 보자

결국 결과값을 HDFS에 넣어주는 것이다

// NameNode의 HDFS에 conf/* 모든 파일을 input 디렉토리명으로 생성하여 복사한다

$ bin/hadoop fs -put conf input

2013-09-09 20:58:13.847 java[7966:1603] Unable to load realm info from SCDynamicStore

// 명령을 수행하면 JobTracker가 수행되고 HDFS에 output 디렉토리에 결과값이 생성된다

// 결과값은 JobTracker가 처리하여 생성된 것이다

$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

2013-09-09 21:01:57.770 java[8017:1603] Unable to load realm info from SCDynamicStore

13/09/09 21:01:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

13/09/09 21:01:58 WARN snappy.LoadSnappy: Snappy native library not loaded

13/09/09 21:01:58 INFO mapred.FileInputFormat: Total input paths to process : 17

13/09/09 21:01:58 INFO mapred.JobClient: Running job: job_201309092053_0001

13/09/09 21:01:59 INFO mapred.JobClient: map 0% reduce 0%

13/09/09 21:02:03 INFO mapred.JobClient: map 11% reduce 0%

13/09/09 21:02:05 INFO mapred.JobClient: map 23% reduce 0%

.. 중략..

// NameNode 결과 확인

// http://localhost:50070/ 에서 "Browser the filesystem" 을 클릭한다

// http://localhost:50075/browseDirectory.jsp?dir=%2Fuser%2Fdowon&namenodeInfoPort=50070

해당 브라우져 내역에 신규생성된 "user"밑의 "dowon"밑의 "input" 과 "output"이 보인다 (dowon은 계정명)

Name	Type	Size	Replication	Block Size	Modification Time	Permission	Owner	Group
input	dir				2013-09-09 20:58	rwxr-xr-x	dowon	supergroup
output	dir				2013-09-09 21:02	rwxr-xr-x	dowon	supergroup

// output에 결과값 : part-00000 에 결과내역이 write 되어 있다

Name	Type	Size	Replication	Block Size	Modification Time	Permission	Owner	Group
_SUCCESS	file	0 KB	1	64 MB	2013-09-09 21:02	rw-r--r--	dowon	supergroup
_logs	dir				2013-09-09 21:02	rwxr-xr-x	dowon	supergroup
part-00000	file	0.05 KB	1	64 MB	2013-09-09 21:02	rw-r--r--	dowon	supergrou

// JobTracker 처리현황 확인

// http://localhost:50030/jobtracker.jsp

다음에는 eclipse에서 하둡 프로그래밍을 해보자

<참조>

- Hadoop 튜토리얼

저작자표시 비영리 변경금지

'Big Data' 카테고리의 다른 글

[RethinkDB] 시작하기 (0)	2017.04.11
[Hadoop] Mongo-Hadoop 에 대한 생각 (0)	2013.09.12
[Hadoop] MapReduce 직접 .jar 파일로 수행하기 (0)	2013.09.11
[Hadoop] Eclipse에서 Maven으로 하둡 코딩하기 (2)	2013.09.09

posted by Peter Note

AI Convergence

Publication

Tag

Category

Recent Post

'Big Data'에 해당되는 글 6건

[RethinkDB] 시작하기

'Big Data' 카테고리의 다른 글

[ElasticSearch] ElasticSearch의 데이터 시각화 방법

[Hadoop] Mongo-Hadoop 에 대한 생각

'Big Data' 카테고리의 다른 글

[Hadoop] MapReduce 직접 .jar 파일로 수행하기

'Big Data' 카테고리의 다른 글

[Hadoop] Eclipse에서 Maven으로 하둡 코딩하기

'Big Data' 카테고리의 다른 글

[Hadoop] 개념이해 및 설치하기

'Big Data' 카테고리의 다른 글

티스토리툴바