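Let's load Google Books search results into MongoDB, then analyze each book's Description with Hadoop to build book recommendations
Saving Google Books descriptions to MongoDB
- Create a Maven Project in Eclipse and set up pom.xml as follows
Add the MongoDB driver for Java and a JSON library for parsing the Google search results (JSON)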
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.mobiconsoft</groupId>
<artifactId>booksearch</artifactId>
<version>0.0.1-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.mongodb</groupId>
<artifactId>mongo-java-driver</artifactId>
<version>2.11.2</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.10</version>
</dependency>
<dependency>
<groupId>org.json</groupId>
<artifactId>json</artifactId>
<version>20090211</version>
</dependency>
</dependencies>
</project>
- Write BookSearcher.java (import statements omitted for brevity)
public class BookSearcher {
// Search Google Books and return the raw JSON response
public String searchBooks(String keyword) {
StringBuilder sb = new StringBuilder();
try {
// URL-encode the keyword (java.net.URLEncoder) so spaces and non-ASCII characters are safe in the query string
URL url = new URL("https://www.googleapis.com/books/v1/volumes?q=" + URLEncoder.encode(keyword, "UTF-8"));
URLConnection urlConn = url.openConnection();
BufferedReader br = new BufferedReader(new InputStreamReader(urlConn.getInputStream(), "utf-8"));
try {
String line;
while((line = br.readLine()) != null) sb.append(line);
} finally {
br.close();
}
} catch(IOException e) {
// MalformedURLException and UnsupportedEncodingException are both IOExceptions
e.printStackTrace();
}
return sb.toString();
}
// Extract "items" from the search result and save each item
public void saveBooks(String books) {
Mongo mongo = null;
try{
mongo = new MongoClient("localhost", 27017);
} catch(Exception e) {
e.printStackTrace();
// Keep the original exception as the cause
throw new RuntimeException(e);
}
// The database is books-db and the collection is books; both are created on first write
mongo.setWriteConcern(new WriteConcern(1, 2000));
DB bookDB = mongo.getDB("books-db");
DBCollection bookColl = bookDB.getCollection("books");
try{
JSONObject json = new JSONObject(books);
JSONArray items = json.getJSONArray("items");
for( int i=0; i<items.length(); i++) {
DBObject doc = new BasicDBObject();
// Each item is stored as the value of the "search-book" key
doc.put("search-book", (DBObject)JSON.parse(items.getJSONObject(i).toString()));
bookColl.save(doc);
}
} catch(JSONException e) {
e.printStackTrace();
} finally {
// Release the connection once the items are saved
mongo.close();
}
}
}
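The keyword ends up in the q parameter of the request URL, so a multi-word keyword needs URL encoding before it is appended. The following is a minimal sketch using only the JDK; the class name BookQueryUrl and the method buildSearchUrl are hypothetical names for illustration and are not part of the project above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, for illustration only
final class BookQueryUrl {
    // Same endpoint that searchBooks calls
    private static final String BASE = "https://www.googleapis.com/books/v1/volumes?q=";

    // Builds the search URL with the keyword URL-encoded as UTF-8
    static String buildSearchUrl(String keyword) {
        try {
            return BASE + URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a JVM, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A single word is unchanged; a space becomes '+'
        System.out.println(buildSearchUrl("nosql"));
        System.out.println(buildSearchUrl("nosql database"));
    }
}
```

With this helper, "nosql database" is sent as q=nosql+database instead of producing a URL with a raw space in it.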
- Let's test it
Start mongodb before running the JUnit tests
// MongoDB
$ ../bin/mongod --dbpath /Users/dowon/Documents/mongodb/database
// Tests
public class BookSearcherTest {
private BookSearcher bookSearcher;
@Before
public void setUp() throws Exception {
this.bookSearcher = new BookSearcher();
}
// Print the search result
@Test
public void testSearchBooks() throws Exception {
String result = this.bookSearcher.searchBooks("nosql");
//assertNotNull(result);
System.out.println(result);
}
// Save the data
@Test
public void testSaveBooks() throws Exception {
String result = this.bookSearcher.searchBooks("nosql");
this.bookSearcher.saveBooks(result);
}
}
// After the tests pass, verify the data in the mongo shell
> use books-db
switched to db books-db
> show collections
books
system.indexes
> db.books.find().length();
10
> db.books.find()
{ "_id" : ObjectId("5232e69cda06561b2e11306c"), "search-book" : { "saleInfo" : { "saleability" : "NOT_FOR_SALE", "isEbook" : false, "country" : "KR" }, "id" : "tv5iO9MnObUC", "searchInfo" : { "textSnippet" : "They provide examples, practical solutions, and expert education in new technologies, all designed to help programmers do a better job. wrox.com Programmer Forums Join our Programmer to Programmer forums to ask and answer programming ..." }, "etag" : "HX8hesQgrJM", "volumeInfo" : { "pageCount" : 408, "averageRating" : 3, "infoLink" : "http://books.google.co.kr/books?id=tv5iO9MnObUC&dq=nosql&hl=&source=gbs_api", "printType" : "BOOK", "publisher" : "John Wiley & Sons", "authors" : [ "Shashank Tiwari" ], "canonicalVolumeLink" : "http://books.google.co.kr/books/about/Professional_NoSQL.html?hl=&id=tv5iO9MnObUC", "title" : "Professional NoSQL", "previewLink ... (truncated) ...
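Using the MongoDB Hadoop Connector
- It provides a way to connect MongoDB and Hadoop
Download the 1.1.x Core from https://github.com/mongodb/mongo-hadoop (mongo-hadoop-core_1.1.2-1.1.0.jar)
- When MongoDB is used as the input and output
- When running Pig or MapReduce jobs for analysis
- When the results are handed off to a separate store after processing, as in ETL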
ETL from MongoDB
ETL to MongoDB
- Create a new Book Search Mapper and Reducer project in Eclipse and write its pom.xml
The mongo-hadoop-core jar is not available from the Maven repository, so it has to be placed under .m2/repository by hand
Example)
> Repository : /Users/dowon/.m2/repository
> File location : mongo-hadoop-core/mongo-hadoop-core_1.1.2/1.1.0/mongo-hadoop-core_1.1.2-1.1.0.jar
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.mobiconsoft</groupId>
<artifactId>booksearch_mapreduce</artifactId>
<version>0.0.1-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>1.1.2</version>
</dependency>
<dependency>
<groupId>org.mongodb</groupId>
<artifactId>mongo-java-driver</artifactId>
<version>2.11.2</version>
</dependency>
<!-- installed manually -->
<dependency>
<groupId>mongo-hadoop-core</groupId>
<artifactId>mongo-hadoop-core_1.1.2</artifactId>
<version>1.1.0</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<artifactId>maven-antrun-plugin</artifactId>
<configuration>
<tasks>
<copy file="target/${project.artifactId}-${project.version}.jar"
tofile="/Users/dowon/Documents/input/${project.artifactId}-${project.version}.jar" />
</tasks>
</configuration>
<executions>
<execution>
<phase>install</phase>
<goals>
<goal>run</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
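The manually copied jar is only found if it sits at the path Maven derives from the coordinates declared in the pom.xml above. The sketch below shows that derivation under the standard local repository layout (dots in the groupId become directories, followed by artifactId, version, and artifactId-version.jar). The class name MavenRepoPath and the method jarPath are hypothetical names used only for this illustration.

```java
// Hypothetical helper, for illustration only
final class MavenRepoPath {
    // Returns the jar path, relative to the repository root, for the given coordinates
    static String jarPath(String groupId, String artifactId, String version) {
        return groupId.replace('.', '/') + "/"
                + artifactId + "/"
                + version + "/"
                + artifactId + "-" + version + ".jar";
    }

    public static void main(String[] args) {
        // Coordinates of the manually installed connector
        System.out.println(jarPath("mongo-hadoop-core", "mongo-hadoop-core_1.1.2", "1.1.0"));
        // Coordinates of the MongoDB Java driver
        System.out.println(jarPath("org.mongodb", "mongo-java-driver", "2.11.2"));
    }
}
```

For the connector this yields mongo-hadoop-core/mongo-hadoop-core_1.1.2/1.1.0/mongo-hadoop-core_1.1.2-1.1.0.jar, which matches the file location in the example. As an alternative to copying by hand, the maven-install-plugin goal install:install-file (with -Dfile, -DgroupId, -DartifactId, -Dversion and -Dpackaging=jar) places the jar at the same path.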
- Write the Mapper and Reducer classes
// Mapper
public class BookSearchMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
// Each token counts as 1; the no-arg constructor would initialize this to 0
private final static IntWritable ONE = new IntWritable(1);
private Text word = new Text();
protected void map(Object key, BSONObject value, Context context)
throws IOException, InterruptedException {
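BasicDBObject anItem = (BasicDBObject)value.get("search-book");
if(anItem == null) return;
BasicDBObject volumeInfo = (BasicDBObject)anItem.get("volumeInfo");
// Skip documents without volumeInfo or description instead of failing with a NullPointerException
if(volumeInfo == null) return;
String description = volumeInfo.getString("description");
if(description == null || description.trim().length() <= 0) return;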
StringTokenizer st = new StringTokenizer(description);
while(st.hasMoreTokens()) {
word.set(st.nextToken());
context.write(word, ONE);
}
}
}
// Reducer
public class BookSearcherReducer extends
Reducer<Text, IntWritable, Text, IntWritable> {
protected void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for(final IntWritable value : values) sum += value.get();
context.write(key, new IntWritable(sum));
}
}
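To see what the Mapper and Reducer compute together without starting Hadoop, the same logic can be run over a plain list of descriptions. This is only a sketch that mimics the map step (split on whitespace, emit each token with a count of 1) and the reduce step (sum per token) in a single loop. The class name DescriptionWordCount and the sample strings are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-in for the MapReduce job, for illustration only
final class DescriptionWordCount {
    static Map<String, Integer> countWords(List<String> descriptions) {
        // TreeMap keeps the keys sorted so the printed result is stable
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String description : descriptions) {
            // Same guard as the mapper: skip missing or blank descriptions
            if (description == null || description.trim().length() == 0) continue;
            StringTokenizer st = new StringTokenizer(description);
            while (st.hasMoreTokens()) {
                String word = st.nextToken();
                // "Emit (word, 1)" and "sum per word" collapsed into one step
                Integer current = counts.get(word);
                counts.put(word, current == null ? 1 : current + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("NoSQL for NoSQL users", null, "  ", "for beginners");
        // prints {NoSQL=2, beginners=1, for=2, users=1}
        System.out.println(countWords(sample));
    }
}
```

Because StringTokenizer splits only on whitespace, tokens are case-sensitive and keep their punctuation, so "NoSQL" and "NoSQL," are counted as different words by both this sketch and the mapper.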
- Create the Job
public class MongoJob extends MongoTool {
static {
Configuration.addDefaultResource("mongo-default.xml");
Configuration.addDefaultResource("mongo-book.xml");
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
JobHelper.addJarForJob(conf, "/Users/dowon/.m2/repository/mongo-hadoop-core/mongo-hadoop-core_1.1.2/1.1.0/mongo-hadoop-core_1.1.2-1.1.0.jar:"
+ "/Users/dowon/.m2/repository/org/mongodb/mongo-java-driver/2.11.2/mongo-java-driver-2.11.2.jar");
System.exit(ToolRunner.run(conf, new MongoJob(), args));
}
}
- Create the resource xml
Copy the default xml settings into a new file, mongo-book.xml, and then edit the following properties
<property>
<!-- Class for the mapper -->
<name>mongo.job.mapper</name>
<value>booksearch_mapreduce.BookSearchMapper</value>
</property>
<property>
<!-- Reducer class -->
<name>mongo.job.reducer</name>
<value>booksearch_mapreduce.BookSearcherReducer</value>
</property>
<property>
<!-- InputFormat Class -->
<name>mongo.job.input.format</name>
<value>com.mongodb.hadoop.MongoInputFormat</value>
</property>
<property>
<!-- OutputFormat Class -->
<name>mongo.job.output.format</name>
<value>com.mongodb.hadoop.MongoOutputFormat</value>
</property>
<property>
<!-- Output key class for the output format -->
<name>mongo.job.output.key</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<!-- Output value class for the output format -->
<name>mongo.job.output.value</name>
<value>com.mongodb.hadoop.io.BSONWritable</value>
</property>
<property>
<!-- Output key class for the mapper [optional] -->
<name>mongo.job.mapper.output.key</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<!-- Output value class for the mapper [optional] -->
<name>mongo.job.mapper.output.value</name>
<value>org.apache.hadoop.io.IntWritable</value>
</property>
<property>
<!-- Class for the combiner [optional] -->
<name>mongo.job.combiner</name>
<value>booksearch_mapreduce.BookSearcherReducer</value>
</property>
- Run "Maven Build..." with clean install to build the .jar file (attached under References)
- Next, start the Hadoop runtime (start-all.sh)
- Write a shell script that runs the Hadoop job
//////////////////////
// Contents of mongodb.sh
#!/bin/sh
export REPO=/Users/dowon/.m2/repository
export MONGO_DRIVER=$REPO/org/mongodb/mongo-java-driver/2.11.2/mongo-java-driver-2.11.2.jar
export MONGO_HADOOP=$REPO/mongo-hadoop-core/mongo-hadoop-core_1.1.2/1.1.0/mongo-hadoop-core_1.1.2-1.1.0.jar
export HADOOP_CLASSPATH=$MONGO_DRIVER:$MONGO_HADOOP
export HADOOP_USER_CLASSPATH_FIRST=true
hadoop jar booksearch_mapreduce-0.0.1-SNAPSHOT.jar booksearch_mapreduce.MongoJob
/////////////
/// Run it
$ ./mongodb.sh
2013-09-13 20:53:24.630 java[1431:1203] Unable to load realm info from SCDynamicStore
13/09/13 20:53:24 INFO util.MongoTool: Created a conf: 'Configuration: core-default.xml, core-site.xml, mongo-default.xml, mongo-book.xml, mapred-default.xml, mapred-site.xml' on {class booksearch_mapreduce.MongoJob} as job named '<unnamed MongoTool job>'
13/09/13 20:53:24 INFO util.MongoTool: Mapper Class: class booksearch_mapreduce.BookSearchMapper
13/09/13 20:53:24 INFO util.MongoTool: Setting up and running MapReduce job in foreground, will wait for results. {Verbose? false}
13/09/13 20:53:25 INFO util.MongoSplitter: MongoSplitter calculating splits
13/09/13 20:53:25 INFO util.MongoSplitter: use range queries: false
... (truncated) ...
13/09/13 20:53:41 INFO mapred.JobClient: Spilled Records=1428
13/09/13 20:53:41 INFO mapred.JobClient: Map output bytes=14049
13/09/13 20:53:41 INFO mapred.JobClient: Total committed heap usage (bytes)=269619200
13/09/13 20:53:41 INFO mapred.JobClient: Combine input records=1299
13/09/13 20:53:41 INFO mapred.JobClient: SPLIT_RAW_BYTES=195
13/09/13 20:53:41 INFO mapred.JobClient: Reduce input records=714
13/09/13 20:53:41 INFO mapred.JobClient: Reduce input groups=714
13/09/13 20:53:41 INFO mapred.JobClient: Combine output records=714
13/09/13 20:53:41 INFO mapred.JobClient: Reduce output records=714
13/09/13 20:53:41 INFO mapred.JobClient: Map output records=1299
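Checking the results in MongoDB
- To check the results in a browser, start MongoDB with the --rest option; the results can then be queried RESTfully on port 28017
$ ./bin/mongod --dbpath /Users/dowon/Documents/mongodb/database --rest
- Result screen
The results are written to the out collection
<References>
- Book search data
- Eclipse project workspace
- Shell script to run after starting Hadoop
- Exported booksearch mapreduce jar file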
booksearch_mapreduce-0.0.1-SNAPSHOT.jar