
(Mahout: The Complete Guide) - Mr.LDA: things in progress...

Posted by 안단테에 on 2023. 2. 2. 13:29

I am currently crawling Facebook-related news articles and testing whether they get classified well.

 

1. To convert the corpus into the internal format used by Mr.LDA, run the following command:

[root@masters mahout]# hadoop jar mrlda-0.9.0-SNAPSHOT-fatjar.jar cc.mrlda.ParseCorpus -input facebook.txt -output jack-face -stoplist stoplist.txt
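The crawled articles have to be flattened into a single text file before ParseCorpus can read them. Below is a minimal sketch of that step, assuming the input layout is one document per line with the title and the content separated by a tab (this is my understanding of the Mr.LDA input format, so check it against your version). The function names and the sample articles are hypothetical and only serve as illustration.

```python
import re


def to_corpus_line(title, content):
    """Format one article as 'title<TAB>content' on a single line."""
    # Collapse tabs, newlines and repeated spaces so that the only tab left
    # on the line is the separator between title and content (assumed format).
    clean_title = re.sub(r"\s+", " ", title).strip()
    clean_content = re.sub(r"\s+", " ", content).strip()
    return f"{clean_title}\t{clean_content}"


def write_corpus(articles, path):
    """Write (title, content) pairs to `path`, one document per line."""
    with open(path, "w", encoding="utf-8") as out:
        for title, content in articles:
            out.write(to_corpus_line(title, content) + "\n")


# Hypothetical crawled articles, for illustration only.
sample_articles = [
    ("news-001", "Facebook posted\ta new\npage"),
    ("news-002", "Twitter and  Facebook are social media"),
]

for title, content in sample_articles:
    print(to_corpus_line(title, content))
```

Calling `write_corpus(sample_articles, "facebook.txt")` would produce a file in the shape passed to `-input` above. Stripping embedded tabs and newlines matters because a stray one would split a single article into several documents.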

 

 

2. To examine the first 10 terms of the dictionary:

[root@masters mahout]# hadoop jar mrlda-0.9.0-SNAPSHOT-fatjar.jar edu.umd.cloud9.io.ReadSequenceFile jack-face/term 10

15/03/12 10:23:07 INFO util.NativeCodeLoader: Loaded the native-hadoop library

15/03/12 10:23:07 WARN zlib.ZlibFactory: Failed to load/initialize native-zlib library

15/03/12 10:23:07 INFO compress.CodecPool: Got brand-new decompressor

Reading jack-face/term...

 

Key type: class org.apache.hadoop.io.IntWritable

Value type: class org.apache.hadoop.io.Text

 

Record 0

Key: 1

Value: ...

----------------------------------------

Record 1

Key: 2

Value: Facebook

----------------------------------------

Record 2

Key: 3

Value: Twitter

----------------------------------------

Record 3

Key: 4

Value: The

----------------------------------------

Record 4

Key: 5

Value: posted

----------------------------------------

Record 5

Key: 6

Value: page

----------------------------------------

Record 6

Key: 7

Value: media

----------------------------------------

Record 7

Key: 8

Value: social

----------------------------------------

Record 8

Key: 9

Value: Facebook,

----------------------------------------

Record 9

Key: 10

Value: Facebook.

----------------------------------------

10 records read.
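In the listing above, "Facebook", "Facebook," and "Facebook." appear as three separate terms, and "..." and "The" are in the dictionary as well. That suggests the text was split on whitespace without case folding or punctuation stripping in this run. A small sketch of normalizing the article text before writing facebook.txt, assuming that simple lowercasing and trimming of surrounding punctuation is acceptable for this corpus (the function name is hypothetical):

```python
import string


def normalize_tokens(text):
    """Lowercase whitespace-separated tokens and strip surrounding punctuation."""
    tokens = []
    for raw in text.split():
        token = raw.strip(string.punctuation).lower()
        if token:  # skips tokens that were punctuation only, such as "..."
            tokens.append(token)
    return tokens


print(normalize_tokens("Facebook, Facebook. The ... posted"))
# prints ['facebook', 'facebook', 'the', 'posted']
```

Joining the result with single spaces gives content in which the three variants of "Facebook" collapse into one term, which should leave more of the dictionary for words that actually distinguish topics.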

 
3. Mr.LDA implements LDA using variational inference. Here's an invocation for running 10 iterations with 5 topics on the parsed Facebook corpus:
hadoop jar mrlda-0.9.0-SNAPSHOT-fatjar.jar cc.mrlda.VariationalInference -input jack-face/document -output jack-face-lda -term 10000 -topic 5 -iteration 10 -mapper 50 -reducer 20
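To give a feel for what variational inference does with each document, here is a minimal single-machine sketch of the textbook per-document update for LDA: the word-topic responsibilities phi and the document-topic parameters gamma are refined in turn while the topics stay fixed. This is not Mr.LDA's code. The function names, the toy topics and the symmetric `alpha` are assumptions for illustration, and Mr.LDA, as I understand it, spreads this per-document work over the mappers and aggregates the topic statistics in the reducers.

```python
import math


def digamma(x):
    """Approximate the digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))


def e_step(word_counts, beta, alpha, iterations=20):
    """Variational update for one document.

    word_counts: dict mapping term id -> count in the document
    beta: list with one dict per topic, mapping term id -> probability
    alpha: symmetric Dirichlet prior on the topic mixture (assumed)
    """
    num_topics = len(beta)
    total = sum(word_counts.values())
    gamma = [alpha + total / num_topics] * num_topics
    phi = {}
    for _ in range(iterations):
        exp_psi = [math.exp(digamma(g)) for g in gamma]
        new_gamma = [alpha] * num_topics
        for term, count in word_counts.items():
            # phi[term][k] is proportional to beta[k][term] * exp(digamma(gamma[k]))
            weights = [beta[k][term] * exp_psi[k] for k in range(num_topics)]
            norm = sum(weights)
            phi[term] = [w / norm for w in weights]
            for k in range(num_topics):
                new_gamma[k] += count * phi[term][k]
        gamma = new_gamma
    return gamma, phi


# Toy example: two topics over three term ids, one short document.
toy_beta = [{0: 0.7, 1: 0.2, 2: 0.1}, {0: 0.1, 1: 0.2, 2: 0.7}]
toy_doc = {0: 3, 1: 1}
toy_gamma, toy_phi = e_step(toy_doc, toy_beta, alpha=0.1)
print(toy_gamma)
```

In the toy run the first topic gets the larger gamma, because term 0 dominates the document and is far more probable under that topic. The `-topic` and `-iteration` options in the command above correspond loosely to `len(beta)` and to the number of outer passes in which the topics themselves are re-estimated, which this sketch leaves out.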