Peter Note
Web & LLM FullStacker, Application Architecter, KnowHow Dispenser and Bike Rider


  1. 2013.09.13 [Hadoop] Processing data with Hadoop through the MongoDB Hadoop Connector
2013. 9. 13. 21:19 MongoDB/Prototyping

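Let's load Google Books search results into MongoDB, then analyze each book's Description with Hadoop to build book recommendations



Storing Google Books descriptions in MongoDB

  - Create a Maven project in Eclipse and set up pom.xml as follows

    Add the MongoDB driver for Java and a JSON library for parsing the Google search results (JSON)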

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

  <modelVersion>4.0.0</modelVersion>

  <groupId>com.mobiconsoft</groupId>

  <artifactId>booksearch</artifactId>

  <version>0.0.1-SNAPSHOT</version>

  <dependencies>

  <dependency>

  <groupId>org.mongodb</groupId>

  <artifactId>mongo-java-driver</artifactId>

  <version>2.11.2</version>

  </dependency>


  <dependency>

  <groupId>junit</groupId>

  <artifactId>junit</artifactId>

  <version>4.10</version>

  </dependency>

 

  <dependency>

  <groupId>org.json</groupId>

  <artifactId>json</artifactId>

  <version>20090211</version>

  </dependency>

  </dependencies>

</project>

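  - Write BookSearcher.java

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;

import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.Mongo;
import com.mongodb.MongoClient;
import com.mongodb.WriteConcern;
import com.mongodb.util.JSON;

public class BookSearcher {

  // Search Google Books and return the raw JSON response
  public String searchBooks(String keyword) {
    StringBuilder sb = new StringBuilder();
    BufferedReader br = null;
    try {
      // URL-encode the keyword so spaces and special characters stay valid in the query string
      URL url = new URL("https://www.googleapis.com/books/v1/volumes?q="
          + URLEncoder.encode(keyword, "UTF-8"));
      URLConnection urlConn = url.openConnection();
      br = new BufferedReader(new InputStreamReader(urlConn.getInputStream(), "utf-8"));
      String line;
      while((line = br.readLine()) != null) sb.append(line);
    } catch(IOException e) {
      // MalformedURLException is an IOException, so this handler covers both cases
      e.printStackTrace();
    } finally {
      if(br != null) {
        try { br.close(); } catch(IOException ignored) { }
      }
    }
    return sb.toString();
  }

  // Find "items" in the search result and save each item
  public void saveBooks(String books) {
    Mongo mongo = null;
    try{
      mongo = new MongoClient("localhost", 27017);
    } catch(Exception e) {
      // Keep the original exception as the cause
      throw new RuntimeException(e);
    }

    // The MongoDB database is books-db and the collection is books; both are created on first write
    // WriteConcern(1, 2000): w=1 with a write timeout of 2000 ms
    mongo.setWriteConcern(new WriteConcern(1, 2000));
    DB bookDB = mongo.getDB("books-db");
    DBCollection bookColl = bookDB.getCollection("books");

    try{
      JSONObject json = new JSONObject(books);
      JSONArray items = json.getJSONArray("items");
      for( int i=0; i<items.length(); i++) {
        DBObject doc = new BasicDBObject();
        // Each item is stored as the value of the search-book key
        doc.put("search-book", (DBObject)JSON.parse(items.getJSONObject(i).toString()));
        bookColl.save(doc);
      }
    } catch(JSONException e) {
      e.printStackTrace();
    } finally {
      // Release the connection once all items are saved
      mongo.close();
    }
  }
}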
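
searchBooks appends the keyword to the query string of the Google Books URL, so a keyword containing spaces or characters such as & needs to be encoded first. The following is a minimal sketch of that single step, using only the JDK. The class name SearchUrlSketch and the helper buildSearchUrl are made up for illustration and are not part of the original project.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class SearchUrlSketch {
  // Same endpoint the post uses for its search
  static final String BASE = "https://www.googleapis.com/books/v1/volumes?q=";

  // Hypothetical helper: encode the keyword before appending it to the query string
  static String buildSearchUrl(String keyword) {
    try {
      return BASE + URLEncoder.encode(keyword, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      // UTF-8 is a charset every JVM must support, so this should not happen
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(buildSearchUrl("nosql"));    // keyword is left unchanged
    System.out.println(buildSearchUrl("big data")); // the space becomes '+'
  }
}
```

With a single-word keyword such as "nosql" the URL is the same as without encoding, which is why the tests in this post work either way.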

  - Test it

    Start mongodb before running the JUnit tests

// MongoDB

$ ../bin/mongod --dbpath /Users/dowon/Documents/mongodb/database


// Test

import static org.junit.Assert.assertNotNull;

import org.junit.Before;
import org.junit.Test;

public class BookSearcherTest {

    private BookSearcher bookSearcher;

    @Before
    public void setUp() throws Exception {
      this.bookSearcher = new BookSearcher();
    }

    // Print the search result
    @Test
    public void testSearchBooks() throws Exception {
      String result = this.bookSearcher.searchBooks("nosql");
      assertNotNull(result);
      System.out.println(result);
    }

    // Save the data
    @Test
    public void testSaveBooks() throws Exception {
      String result = this.bookSearcher.searchBooks("nosql");
      this.bookSearcher.saveBooks(result);
    }
}


// After the tests pass, verify the data in the mongo shell

> use books-db

switched to db books-db

> show collections

books

system.indexes

> db.books.find().length();

10

> db.books.find()

{ "_id" : ObjectId("5232e69cda06561b2e11306c"), "search-book" : { "saleInfo" : { "saleability" : "NOT_FOR_SALE", "isEbook" : false, "country" : "KR" }, "id" : "tv5iO9MnObUC", "searchInfo" : { "textSnippet" : "They provide examples, practical solutions, and expert education in new technologies, all designed to help programmers do a better job. wrox.com Programmer Forums Join our Programmer to Programmer forums to ask and answer programming ..." }, "etag" : "HX8hesQgrJM", "volumeInfo" : { "pageCount" : 408, "averageRating" : 3, "infoLink" : "http://books.google.co.kr/books?id=tv5iO9MnObUC&dq=nosql&hl=&source=gbs_api", "printType" : "BOOK", "publisher" : "John Wiley & Sons", "authors" : [  "Shashank Tiwari" ], "canonicalVolumeLink" : "http://books.google.co.kr/books/about/Professional_NoSQL.html?hl=&id=tv5iO9MnObUC", "title" : "Professional NoSQL", "previewLink ... (truncated) ...
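
Each stored document nests the book data under search-book, and the Hadoop job later in this post reads the description field under volumeInfo. As a rough sketch of that lookup, the code below walks the same path on plain java.util.Map objects, which stand in for the driver's DBObject here. The class name DescriptionLookupSketch and the sample description text are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

final class DescriptionLookupSketch {
  // Walk search-book -> volumeInfo -> description; return null if any level is missing
  @SuppressWarnings("unchecked")
  static String description(Map<String, Object> doc) {
    Object item = doc.get("search-book");
    if (!(item instanceof Map)) return null;
    Object info = ((Map<String, Object>) item).get("volumeInfo");
    if (!(info instanceof Map)) return null;
    Object text = ((Map<String, Object>) info).get("description");
    return text instanceof String ? (String) text : null;
  }

  public static void main(String[] args) {
    // Hypothetical document with the same nesting as the shell output
    Map<String, Object> volumeInfo = new HashMap<String, Object>();
    volumeInfo.put("title", "Professional NoSQL");
    volumeInfo.put("description", "A sample description used only for this sketch");
    Map<String, Object> item = new HashMap<String, Object>();
    item.put("volumeInfo", volumeInfo);
    Map<String, Object> doc = new HashMap<String, Object>();
    doc.put("search-book", item);

    System.out.println(description(doc));
    System.out.println(description(new HashMap<String, Object>())); // missing levels give null
  }
}
```

Returning null for a missing level matters because not every search result is guaranteed to carry a description.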



Using the MongoDB Hadoop Connector

  - It provides a way to connect MongoDB and Hadoop

     Download the 1.1.x Core build from https://github.com/mongodb/mongo-hadoop (mongo-hadoop-core_1.1.2-1.1.0.jar)

  - Using MongoDB for both input and output

  - Running Pig or MapReduce jobs for analysis

  - ETL-style processing, where the results are sent on to a separate store

    ETL from MongoDB

    ETL to MongoDB


  - Create a new Book Search Mapper and Reducer project in Eclipse and write its pom.xml

    The mongo-hadoop-core file is not registered in Maven, so it has to be placed manually under .m2/repository

    Example)

    > Repository: /Users/dowon/.m2/repository

    > File location: mongo-hadoop-core/mongo-hadoop-core_1.1.2/1.1.0/mongo-hadoop-core_1.1.2-1.1.0.jar

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

  <modelVersion>4.0.0</modelVersion>

  <groupId>com.mobiconsoft</groupId>

  <artifactId>booksearch_mapreduce</artifactId>

  <version>0.0.1-SNAPSHOT</version>

  <dependencies>

    <dependency>

      <groupId>org.apache.hadoop</groupId>

      <artifactId>hadoop-core</artifactId>

      <version>1.1.2</version>

    </dependency>

    

    <dependency>

      <groupId>org.mongodb</groupId>

      <artifactId>mongo-java-driver</artifactId>

      <version>2.11.2</version>

    </dependency>

    

    <!-- installed manually -->

    <dependency>

      <groupId>mongo-hadoop-core</groupId>

      <artifactId>mongo-hadoop-core_1.1.2</artifactId>

      <version>1.1.0</version>

    </dependency>

  </dependencies>

  

  <build>

    <plugins>

      <plugin>

        <artifactId>maven-antrun-plugin</artifactId>

        <configuration>

          <tasks>

            <copy file="target/${project.artifactId}-${project.version}.jar"

              tofile="/Users/dowon/Documents/input/${project.artifactId}-${project.version}.jar" />

          </tasks>

        </configuration>

        <executions>

          <execution>

            <phase>install</phase>

            <goals>

              <goal>run</goal>

            </goals>

          </execution>

        </executions>

      </plugin>

    </plugins>

  </build>

  

</project>
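
The manually created file location has to line up with the groupId, artifactId and version declared for mongo-hadoop-core in the pom. In Maven's default local repository layout, the dots in the groupId become directories, followed by the artifactId, the version, and a jar named artifactId-version.jar. A small sketch of that mapping follows; the class name MavenRepoPathSketch and the helper jarPath are hypothetical.

```java
final class MavenRepoPathSketch {
  // Relative path of a jar in a local Maven repository (default layout):
  // groupId with '.' replaced by '/', then artifactId/version/artifactId-version.jar
  static String jarPath(String groupId, String artifactId, String version) {
    return groupId.replace('.', '/') + "/" + artifactId + "/" + version
        + "/" + artifactId + "-" + version + ".jar";
  }

  public static void main(String[] args) {
    // Coordinates taken from the pom.xml in this post
    System.out.println(jarPath("mongo-hadoop-core", "mongo-hadoop-core_1.1.2", "1.1.0"));
    System.out.println(jarPath("org.mongodb", "mongo-java-driver", "2.11.2"));
  }
}
```

For the mongo-hadoop-core coordinates this yields the same relative path as the file location in the example, so Maven can resolve the dependency without downloading anything.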

  - Write the Mapper and Reducer classes

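// Mapper (BookSearchMapper.java)
package booksearch_mapreduce;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.bson.BSONObject;

import com.mongodb.BasicDBObject;

public class BookSearchMapper extends Mapper<Object, BSONObject, Text, IntWritable> {

  // Each token counts as 1; the no-argument constructor would create a 0 and every sum would be 0
  private final static IntWritable ONE = new IntWritable(1);
  private Text word = new Text();

  @Override
  protected void map(Object key, BSONObject value, Context context)
    throws IOException, InterruptedException {
    // Skip documents that lack the expected nesting
    BasicDBObject anItem = (BasicDBObject)value.get("search-book");
    if(anItem == null) return;
    BasicDBObject volumeInfo = (BasicDBObject)anItem.get("volumeInfo");
    if(volumeInfo == null) return;
    String description = volumeInfo.getString("description");
    if(description == null || description.trim().length() <= 0) return;

    // Emit (token, 1) for every whitespace-separated token in the description
    StringTokenizer st = new StringTokenizer(description);
    while(st.hasMoreTokens()) {
      word.set(st.nextToken());
      context.write(word, ONE);
    }
  }
}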


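// Reducer (BookSearcherReducer.java)
package booksearch_mapreduce;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class BookSearcherReducer extends
  Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
    // Sum all counts emitted for this word
    int sum = 0;
    for(final IntWritable value : values) sum += value.get();
    context.write(key, new IntWritable(sum));
  }
}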
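
Taken together, the Mapper and Reducer above are a word count over the book descriptions. To make the data flow easier to follow without a Hadoop cluster, here is a plain-Java sketch of the same two steps. The class WordCountSketch, its method names and the sample sentence are assumptions for illustration; the real job receives its input from MongoDB through the connector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

final class WordCountSketch {
  // "Map" step: split a description into whitespace-separated tokens
  static List<String> mapTokens(String description) {
    List<String> tokens = new ArrayList<String>();
    if (description == null || description.trim().length() == 0) return tokens;
    StringTokenizer st = new StringTokenizer(description);
    while (st.hasMoreTokens()) tokens.add(st.nextToken());
    return tokens;
  }

  // "Reduce" step: add 1 for every occurrence of the same token
  static Map<String, Integer> reduceCounts(List<String> tokens) {
    Map<String, Integer> counts = new TreeMap<String, Integer>();
    for (String token : tokens) {
      Integer old = counts.get(token);
      counts.put(token, old == null ? 1 : old + 1);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Hypothetical description text
    Map<String, Integer> counts = reduceCounts(mapTokens("NoSQL databases and NoSQL tools"));
    System.out.println(counts); // {NoSQL=2, and=1, databases=1, tools=1}
  }
}
```

Because StringTokenizer splits only on whitespace, tokens are case-sensitive and keep their punctuation, so "NoSQL" and "NoSQL." are counted as different words in both the sketch and the job.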

  - Create the Job

public class MongoJob extends MongoTool {

  static {

    Configuration.addDefaultResource("mongo-default.xml");

    Configuration.addDefaultResource("mongo-book.xml");

  }

  

  public static void main(String[] args) throws Exception {

    Configuration conf = new Configuration();

    

    JobHelper.addJarForJob(conf, "/Users/dowon/.m2/repository/mongo-hadoop-core/mongo-hadoop-core_1.1.2/1.1.0/mongo-hadoop-core_1.1.2-1.1.0.jar:"

          + "/Users/dowon/.m2/repository/org/mongodb/mongo-java-driver/2.11.2/mongo-java-driver-2.11.2.jar");

    

    System.exit(ToolRunner.run(conf, new MongoJob(), args));

  }

}

  - Create the resource xml

    Copy the xml from https://github.com/mongodb/mongo-hadoop/blob/master/examples/treasury_yield/src/main/resources/mongo-defaults.xml

   to create mongo-book.xml, then edit the following properties

<property>

    <!-- Class for the mapper -->

    <name>mongo.job.mapper</name>

    <value>booksearch_mapreduce.BookSearchMapper</value>

  </property>

  <property>

    <!-- Reducer class -->

    <name>mongo.job.reducer</name>

    <value>booksearch_mapreduce.BookSearcherReducer</value>

  </property>

  <property>

    <!-- InputFormat Class -->

    <name>mongo.job.input.format</name>

    <value>com.mongodb.hadoop.MongoInputFormat</value>

  </property>

  <property>

    <!-- OutputFormat Class -->

    <name>mongo.job.output.format</name>

    <value>com.mongodb.hadoop.MongoOutputFormat</value>

  </property>

  <property>

    <!-- Output key class for the output format -->

    <name>mongo.job.output.key</name>

    <value>org.apache.hadoop.io.Text</value>

  </property>

  <property>

    <!-- Output value class for the output format -->

    <name>mongo.job.output.value</name>

    <value>com.mongodb.hadoop.io.BSONWritable</value>

  </property>

  <property>

    <!-- Output key class for the mapper [optional] -->

    <name>mongo.job.mapper.output.key</name>

    <value>org.apache.hadoop.io.Text</value>

  </property>

  <property>

    <!-- Output value class for the mapper [optional] -->

    <name>mongo.job.mapper.output.value</name>

    <value>org.apache.hadoop.io.IntWritable</value>

  </property>

  <property>

    <!-- Class for the combiner [optional] -->

    <name>mongo.job.combiner</name>

    <value>booksearch_mapreduce.BookSearcherReducer</value>

  </property>

  - Run "Maven build..." with clean install to produce the .jar file (attached under References)

  - Next, start the Hadoop runtime (start-all.sh)

  - Create a shell script that runs the Hadoop job

//////////////////////

// Contents of mongodb.sh

#!/bin/sh


export REPO=/Users/dowon/.m2/repository

export MONGO_DRIVER=$REPO/org/mongodb/mongo-java-driver/2.11.2/mongo-java-driver-2.11.2.jar

export MONGO_HADOOP=$REPO/mongo-hadoop-core/mongo-hadoop-core_1.1.2/1.1.0/mongo-hadoop-core_1.1.2-1.1.0.jar

export HADOOP_CLASSPATH=$MONGO_DRIVER:$MONGO_HADOOP

export HADOOP_USER_CLASSPATH_FIRST=true


hadoop jar booksearch_mapreduce-0.0.1-SNAPSHOT.jar booksearch_mapreduce.MongoJob



/////////////

/// Run it

$ mongodb.sh

2013-09-13 20:53:24.630 java[1431:1203] Unable to load realm info from SCDynamicStore

13/09/13 20:53:24 INFO util.MongoTool: Created a conf: 'Configuration: core-default.xml, core-site.xml, mongo-default.xml, mongo-book.xml, mapred-default.xml, mapred-site.xml' on {class booksearch_mapreduce.MongoJob} as job named '<unnamed MongoTool job>'

13/09/13 20:53:24 INFO util.MongoTool: Mapper Class: class booksearch_mapreduce.BookSearchMapper

13/09/13 20:53:24 INFO util.MongoTool: Setting up and running MapReduce job in foreground, will wait for results.  {Verbose? false}

13/09/13 20:53:25 INFO util.MongoSplitter: MongoSplitter calculating splits

13/09/13 20:53:25 INFO util.MongoSplitter: use range queries: false

... (truncated) ...

13/09/13 20:53:41 INFO mapred.JobClient:     Spilled Records=1428

13/09/13 20:53:41 INFO mapred.JobClient:     Map output bytes=14049

13/09/13 20:53:41 INFO mapred.JobClient:     Total committed heap usage (bytes)=269619200

13/09/13 20:53:41 INFO mapred.JobClient:     Combine input records=1299

13/09/13 20:53:41 INFO mapred.JobClient:     SPLIT_RAW_BYTES=195

13/09/13 20:53:41 INFO mapred.JobClient:     Reduce input records=714

13/09/13 20:53:41 INFO mapred.JobClient:     Reduce input groups=714

13/09/13 20:53:41 INFO mapred.JobClient:     Combine output records=714

13/09/13 20:53:41 INFO mapred.JobClient:     Reduce output records=714

13/09/13 20:53:41 INFO mapred.JobClient:     Map output records=1299
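
The counters show the combiner at work: Combine input records=1299 equals Map output records=1299, and Combine output records=714 equals Reduce input records=714. mongo-book.xml registers BookSearcherReducer as the combiner as well, which is safe here because summing partial sums gives the same total as summing every value directly. The sketch below illustrates that property in plain Java; the class CombinerSketch and the split contents are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CombinerSketch {
  // Same logic as the reducer: add up every count for one word
  static int sum(List<Integer> counts) {
    int total = 0;
    for (int c : counts) total += c;
    return total;
  }

  // Reduce directly over every value emitted by the mappers
  static int reduceDirect(List<List<Integer>> splits) {
    List<Integer> all = new ArrayList<Integer>();
    for (List<Integer> split : splits) all.addAll(split);
    return sum(all);
  }

  // Pre-sum each split (the combiner), then reduce over the partial sums
  static int reduceWithCombiner(List<List<Integer>> splits) {
    List<Integer> partials = new ArrayList<Integer>();
    for (List<Integer> split : splits) partials.add(sum(split));
    return sum(partials);
  }

  public static void main(String[] args) {
    // Hypothetical: one word seen 3 times in one split and 2 times in another
    List<List<Integer>> splits = Arrays.asList(
        Arrays.asList(1, 1, 1),
        Arrays.asList(1, 1));
    System.out.println(reduceDirect(splits));       // 5
    System.out.println(reduceWithCombiner(splits)); // 5
  }
}
```

A reducer whose result depends on seeing all values at once, such as one computing an average, could not be reused as a combiner this way.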



Checking the results in MongoDB

  - To check the results in a browser, start MongoDB with the --rest option; it can then be called RESTfully on port 28017

 $ ./bin/mongod --dbpath /Users/dowon/Documents/mongodb/database --rest

  - Result screen

    The results are created in the out collection


<References>

  - Book search data

books.json


  - Eclipse project workspace

booksearch.tar


  - Shell script to run after Hadoop has started

mongodb.sh


  - Exported booksearch MapReduce jar file

booksearch_mapreduce-0.0.1-SNAPSHOT.jar






posted by Peter Note