Mahout LDA execution order

1. seqdirectory: Generate sequence files (of Text) from a directory
mahout seqdirectory -i content -o jack-seqdir -c UTF-8 -chunk 64 -xm sequential
Problem: if the input files are large, this dies with a heap-size error.
2. seq2sparse: Sparse vector generation from Text sequence files
mahout seq2sparse -i jack-seqdir -o jack-cvb -wt tf -seq -nv
3. rowid: Map SequenceFile&lt;Text,VectorWritable&gt; to {SequenceFile&lt;IntWritable,VectorWritable&gt;, SequenceFile&lt;IntWritable,Text&gt;}
mahout rowid ..
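One workaround for that heap-size error, as a sketch: the stock bin/mahout launcher reads MAHOUT_HEAPSIZE (in MB) when it builds the JVM command line, so raising it before rerunning may help (4096 here is an arbitrary example value, not a recommendation):

```shell
# Raise the driver JVM heap (in MB) before rerunning seqdirectory;
# bin/mahout picks this up when constructing the java invocation.
export MAHOUT_HEAPSIZE=4096
echo "MAHOUT_HEAPSIZE=${MAHOUT_HEAPSIZE}"
# Then rerun the same command:
#   mahout seqdirectory -i content -o jack-seqdir -c UTF-8 -chunk 64 -xm sequential
```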
Currently crawling news articles related to facebook and testing whether they get classified well. 1. To prepare the corpus into the internal format used by Mr.LDA, run the following command [root@masters mahout]# hadoop jar mrlda-0.9.0-SNAPSHOT-fatjar.jar cc.mrlda.ParseCorpus -input facebook.txt -output jack-face -stoplist stoplist.txt 2. And to examine the first 10 terms of the dictionary: [root@masters mahout]# hadoop jar mrlda-0.9.0-SNAPSHOT-fat..
To run Mr.LDA, download it from GitHub, import it as a Maven project, and do a Maven build. When you run it, you will probably get an error saying findCounter cannot be found; that is because the method changed starting with Hadoop 2.0.0, so switch the Hadoop dependency back to 1.1.1 and it works.
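A minimal sketch of that dependency switch, assuming the Hadoop version is declared in Mr.LDA's pom.xml (in the Hadoop 1.x line the artifact is hadoop-core):

```xml
<!-- pom.xml: pin Hadoop back to 1.1.1 so the findCounter call resolves;
     Hadoop 2.x changed the Counters API that Mr.LDA relies on. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>1.1.1</version>
</dependency>
```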
Usage: [--input --output --useKey --printKey --dictionary --dictionaryType --csv --namesAsComments --nameOnly --sortVectors --quiet --sizeOnly --numItems --vectorSize --filter [ ...] --help --tempDir --startPhase --endPhase ] Job-Specific Options: --input (-i) input Path to job input directory. --output (-o) output The directory pathname for output. --useKey (-u) useKey If the Key is a vector th..
Usage: [--input --output --maxIter --convergenceDelta --overwrite --num_topics --num_terms --doc_topic_smoothing --term_topic_smoothing --dictionary --doc_topic_output --topic_model_temp_dir --iteration_block_size --random_seed --test_set_fraction --num_train_threads --num_update_threads --max_doc_topic_iters --num_reduce_tasks --backfill_perplexity --help --tempDir --startPhase --endPhase ] Job..
========================================================================================================= Arguments Usage: [--input --output --help --tempDir --startPhase --endPhase ] Job-Specific Options: --input (-i) input Path to job input directory. --output (-o) output The directory pathname for output. --help (-h) Print out help --tempDir tempDir Intermediate output directory --startPhase startP..
Arguments Usage: [--minSupport --analyzerName --chunkSize --output --input --minDF --maxDFSigma --maxDFPercent --weight --norm --minLLR --numReducers --maxNGramSize --overwrite --help --sequentialAccessVector --namedVector --logNormalize] Options --minSupport (-s) minSupport (Optional) Minimum Support. Default Value: 2 --analyzerName (-a) analyzerName The class name of the analyzer --chunkSize (-chunk..
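For intuition about what `-wt tf` produces, here is a plain-Python sketch (the function and variable names are illustrative, not Mahout internals): each document becomes a sparse vector of raw term counts, keyed by integer ids from a shared dictionary.

```python
from collections import Counter

def tf_vectors(docs):
    """Build a shared term dictionary and raw term-frequency vectors,
    roughly what seq2sparse -wt tf emits (illustrative only)."""
    dictionary = {}                      # term -> integer id
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        vec = {}
        for term, n in counts.items():
            idx = dictionary.setdefault(term, len(dictionary))
            vec[idx] = n                 # sparse: store only nonzero counts
        vectors.append(vec)
    return dictionary, vectors

dictionary, vectors = tf_vectors(["facebook news news", "facebook ipo"])
print(dictionary)   # {'facebook': 0, 'news': 1, 'ipo': 2}
print(vectors)      # [{0: 1, 1: 2}, {0: 1, 2: 1}]
```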
mahout seqdirectory arguments [root@masters mahout-distribution-0.9]# mahout seqdirectory -i path -o path-seqdir -c UTF-8 -chunk 64 -xm sequential -s MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath. Running on hadoop, using /usr/local/hadoop/hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-1.1.1/conf MAHOUT-JOB: /usr/local/mahout/mahout-distribution-0.9/mahout-examples..
Chapter 9: order of steps to run LDA

$MAHOUT_HOME/bin/mahout seqdirectory \
  -i reuters \
  -o reuters-seqdir \
  -c UTF-8 \
  -chunk 64 \
  -xm sequential

mahout seq2sparse \
  -i reuters-seqdir \
  -o reuters-cvb -wt tf -seq -nv

mahout rowid -i reuters-cvb/tf-vectors -o reuters-cvb

mahout cvb -dict reuters-cvb/dictionary.file-0 -ow -i reuters-cvb/matrix/ -o reuters-topics -k 10 -x 20 -dt topics-output -mt topics-model

mahout v..
I fed the newly created vectors in as k-means input... and used classdump to check the results... but got the following error... MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath. Running on hadoop, using /usr/local/hadoop/hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-1.1.1/conf MAHOUT-JOB: /usr/local/mahout/mahout-distribution-0.9/mahout-examples-0.9-job.jar 15/03/09 20:15:03 INFO common.AbstractJob: Command line..
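For reference, the cluster-inspection tool that ships with Mahout 0.9 is clusterdump; a sketch of an invocation (all paths below are illustrative placeholders, not the directories used above):

```shell
# Hypothetical clusterdump invocation; -p points at the clustered points,
# -d/-dt resolve term ids back to words via the seq2sparse dictionary.
mahout clusterdump \
  -i kmeans-output/clusters-*-final \
  -o clusters.txt \
  -d vectors/dictionary.file-0 \
  -dt sequencefile \
  -p kmeans-output/clusteredPoints
```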