[Hadoop] Eclipse에서 Maven으로 하둡 코딩하기

2013. 9. 9. 21:52 Big Data

[Hadoop] Eclipse에서 Maven으로 하둡 코딩하기

Eclipse하에서 하둡코딩시 Maven을 기본으로 하여 외부 라이브러리 의존성을 관리하자.

Hadoop 역할

- 분산된 파일을 처리하는 순서

> input HDFS으로 들어오기

> Job 수행 : 읽어서 로직처리

> 결과를 파일 또는 DB에 넣는다

- Tera 단위의 데이터가 이미 HDFS에 있을 경우 해당 데이터를 처리하는데 하둡의 쓰임새가 있다

- HDFS와 MapReduce의 이해

Intro to HDFS and MapReduce from Ryan Tabora

Maven Project 만들기

- Maven Project 선택하고 "Create a simple project" 선택한다

- 메이븐의 GroupID와 ArtifactID 설정한다

- 최종 생성 내역

MapReduce 프로그래밍을 여기서 하게 되고, 단위 테스트 프로그래밍도 할 수 있다

- pom.xml 에 hadoop 관련 라이브러리 의존관계를 넣는다. (파란색이 추가부분)

추가하고 저장을 하면 자동으로 의존관계 라이브러리를 다운로드 받는다

이클립트 좌측 "Project Explorer"의 "Maven Dependencies"에서 관련 파일들이 추가된 것을 확인할 수 있다

// pom.xml 내역

<groupId>kr.mobiconsoft.hadoop</groupId>

<artifactId>MapReduce</artifactId>

<version>1.0.0-SNAPSHOT</version>

<groupId>org.apache.hadoop</groupId>

<artifactId>hadoop-core</artifactId>

</dependency>

</dependencies>

</project>

// 결과

Word Counting MapReduce 구현하기

- file 2개 생성하고 유사한 word를 넣는다

// file01

hello world bye world

// file02

hi world hello dowon

- Mapper Class를 생성

import java.io.IOException;

import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.MapReduceBase;

import org.apache.hadoop.mapred.Mapper;

import org.apache.hadoop.mapred.OutputCollector;

import org.apache.hadoop.mapred.Reporter;

/**

* K1 : read key type

* V2 : read value type

* K2 : write key type

* V2 : write value type

//public class WordCountMapper implements Mapper<K1, V1, K2, V2> {

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

// map 결과는 reducer로 자동으로 던져진다

public void map(LongWritable key, Text value,

OutputCollector<Text, IntWritable> output, Reporter reporter)

throws IOException {

// TODO Auto-generated method stub

String line = value.toString();

StringTokenizer tokenizer = new StringTokenizer(line);

while(tokenizer.hasMoreTokens()) {

Text outputKey = new Text(tokenizer.nextToken());

// Hadoop 에서 wrapping한 Integer 타입의 객체를 넣어줌

// param1: outputKey, param2: outputValue

output.collect(outputKey, new IntWritable(1));

}

- Reducer Class 생성

import java.io.IOException;

import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.MapReduceBase;

import org.apache.hadoop.mapred.OutputCollector;

import org.apache.hadoop.mapred.Reducer;

import org.apache.hadoop.mapred.Reporter;

/**

* K1 : Mapper의 K2 와 동일

* V1 : Mapper의 V2 와 동일

public class WordCountReducer extends MapReduceBase

implements Reducer<Text, IntWritable, Text, IntWritable> {

/**

* V1 에서 values는 Iterator이다. 실제 같은 단어가 여러개 일 경우

public void reduce(Text key, Iterator<IntWritable> values,

OutputCollector<Text, IntWritable> output, Reporter reporter)

throws IOException {

// TODO Auto-generated method stub

int sum = 0;

while(values.hasNext()) {

sum += values.next().get(); // get Integer value

}

output.collect(key, new IntWritable(sum));

}

- Job Tracker를 생성 : 하단 main 선택한다

import java.io.IOException;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.FileInputFormat;

import org.apache.hadoop.mapred.FileOutputFormat;

import org.apache.hadoop.mapred.JobClient;

import org.apache.hadoop.mapred.JobConf;

import org.apache.hadoop.mapred.TextInputFormat;

import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

public static void main(String[] args) throws IOException {

// 1. configuration Mapper & Reducer of Hadoop

JobConf conf = new JobConf();

conf.setJobName("wordcount");

conf.setMapperClass(WordCountMapper.class);

conf.setReducerClass(WordCountReducer.class);

// 2. final output key type & value type

conf.setOutputKeyClass(Text.class);

conf.setOutputValueClass(IntWritable.class);

// 3. in/output format

conf.setInputFormat(TextInputFormat.class);

conf.setOutputFormat(TextOutputFormat.class);

// 4. set the path of file for read files

// input path : args[0]

// output path : args[1]

FileInputFormat.setInputPaths(conf, new Path(args[0]));

FileOutputFormat.setOutputPath(conf, new Path(args[1]));

// 5. run job

JobClient.runJob(conf);

}

- 최종 모습

- eclipse 설정하기

main펑션이 있는 WordCount를 수행할 때 input path와 output path를 지정하여 준다

이때 output path의 디렉토리는 생성되어 있지 않아야 한다 (target/hadoop-result)

하단 우측 "run" 클릭

- 결과값

- 결국 이런 처리과정을 수행하게 된다

- Mapper와 Reducer 역할

Mapper : 소스를 쪼개어 key:value 맵을 여러개 만들고

Reducer : 여러 Map 값을 하나의 결과값으로 만들어 준다

단위 테스트 해보기

- pom.xml에 mrunit 추가

<groupId>org.apache.hadoop</groupId>

<artifactId>hadoop-core</artifactId>

</dependency>

<groupId>org.apache.mrunit</groupId>

<artifactId>mrunit</artifactId>

<version>0.8.0-incubating</version>

</dependency>

</dependencies>

- Mapper Test 클래스 생성

Run As... 에서 JUnit으로 테스트 하여 초록색-성공인지 체크한다

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mrunit.MapDriver;

import org.junit.Test;

/**

* 테스트를 통하여 Mapper와 Reducer를 테스트에서 수행하여 검증 할 수 있다

* @author dowon

public class WordCountMapperTest {

@Test

public void testMap() {

// 1. 설

Text value = new Text("Hello World Bye World");

MapDriver<LongWritable, Text, Text, IntWritable> mapDriver = new MapDriver();

mapDriver.withMapper(new WordCountMapper());

mapDriver.withInputValue(value);

// 2. 검정 및 실행

// 순서를 정확히 해야 에러없이 수행된다. 빼먹어도 에러가 난다

mapDriver.withOutput(new Text("Hello"), new IntWritable(1));

mapDriver.withOutput(new Text("World"), new IntWritable(1));

mapDriver.withOutput(new Text("Bye"), new IntWritable(1));

mapDriver.withOutput(new Text("World"), new IntWritable(1));

mapDriver.runTest();

}

- Reducer Test 클래스 생성

Run As... 에서 JUnit으로 테스트 하여 초록색-성공인지 체크한다

import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mrunit.ReduceDriver;

import org.junit.Test;

public class WordCountReducerTest {

@Test

public void testReducer() {

// 1. 설정

ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver = new ReduceDriver();

reduceDriver.withReducer(new WordCountReducer());

reduceDriver.withInputKey(new Text("World"));

reduceDriver.withInputValues(Arrays.asList(new IntWritable(1), new IntWritable(1)));

// 2. 검증 및 실행

reduceDriver.withOutput(new Text("World"), new IntWritable(2));

reduceDriver.runTest();

}

<참조>

- Maven 기초 사용법

저작자표시 비영리 변경금지

'Big Data' 카테고리의 다른 글

[RethinkDB] 시작하기 (0)	2017.04.11
[Hadoop] Mongo-Hadoop 에 대한 생각 (0)	2013.09.12
[Hadoop] MapReduce 직접 .jar 파일로 수행하기 (0)	2013.09.11
[Hadoop] 개념이해 및 설치하기 (2)	2013.09.09

posted by Peter Note

AI Convergence

Publication

Tag

Category

Recent Post