Mahout Complete Guide) - Mr.LDA: what I'm working on...
I'm currently crawling news articles related to Facebook and testing how well they get classified.
1. To convert the corpus into the internal format used by Mr.LDA, run the following command:
[root@masters mahout]# hadoop jar mrlda-0.9.0-SNAPSHOT-fatjar.jar cc.mrlda.ParseCorpus -input facebook.txt -output jack-face -stoplist stoplist.txt
2. To examine the first 10 terms of the dictionary:
[root@masters mahout]# hadoop jar mrlda-0.9.0-SNAPSHOT-fatjar.jar edu.umd.cloud9.io.ReadSequenceFile jack-face/term 10
15/03/12 10:23:07 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/03/12 10:23:07 WARN zlib.ZlibFactory: Failed to load/initialize native-zlib library
15/03/12 10:23:07 INFO compress.CodecPool: Got brand-new decompressor
Reading jack-face/term...
Key type: class org.apache.hadoop.io.IntWritable
Value type: class org.apache.hadoop.io.Text
Record 0
Key: 1
Value: ...
----------------------------------------
Record 1
Key: 2
Value: Facebook
----------------------------------------
Record 2
Key: 3
Value: Twitter
----------------------------------------
Record 3
Key: 4
Value: The
----------------------------------------
Record 4
Key: 5
Value: posted
----------------------------------------
Record 5
Key: 6
Value: page
----------------------------------------
Record 6
Key: 7
Value: media
----------------------------------------
Record 7
Key: 8
Value: social
----------------------------------------
Record 8
Key: 9
Value: Facebook,
----------------------------------------
Record 9
Key: 10
Value: Facebook.
----------------------------------------
10 records read.
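
In the listing above, "Facebook", "Facebook," and "Facebook." come out as three separate dictionary terms, and "The" keeps its capital letter, so the stoplist may not catch it. One way to avoid this is to normalize the raw text before running ParseCorpus. The sketch below is a minimal example, not part of Mr.LDA: it assumes each input line is "title<TAB>content" with space-separated words (as I understand the Mr.LDA input format), and the function names and the output file name facebook.clean.txt are hypothetical.

```python
import string


def normalize_text(text):
    """Lowercase each token and trim punctuation from its edges.

    Only edge punctuation is removed, so inner characters such as the
    apostrophe in "don't" are kept. Tokens that become empty are dropped.
    """
    tokens = (tok.strip(string.punctuation).lower() for tok in text.split())
    return " ".join(tok for tok in tokens if tok)


def normalize_line(line):
    """Normalize the content part of one 'title<TAB>content' line.

    Assumption: the title comes before the first tab and is left untouched.
    A line without a tab is treated as content only.
    """
    title, sep, content = line.rstrip("\n").partition("\t")
    if not sep:
        return normalize_text(title)
    return title + "\t" + normalize_text(content)


def normalize_file(src_path, dst_path):
    """Write a normalized copy of src_path to dst_path, one document per line."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(normalize_line(line) + "\n")


if __name__ == "__main__":
    # Hypothetical file names: the cleaned copy would then be passed to
    # ParseCorpus via -input in place of facebook.txt.
    normalize_file("facebook.txt", "facebook.clean.txt")
```

After this step the three spellings collapse into the single term "facebook", and a lowercase stoplist entry such as "the" matches regardless of the original capitalization.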