안단테 안단테
Mahout Complete Guide) - mahout seq2sparse arguments & process...
Arguments
Usage:
[--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize
<chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFSigma
<maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm>
--minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize>
--overwrite --help --sequentialAccessVector --namedVector --logNormalize]
Options
  --minSupport (-s) minSupport        (Optional) Minimum Support. Default
                                      Value: 2
  --analyzerName (-a) analyzerName    The class name of the analyzer
  --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. Default
                                      Value: 100MB
  --output (-o) output                The directory pathname for output.
  --input (-i) input                  Path to job input directory.
  --minDF (-md) minDF                 The minimum document frequency. Default
                                      is 1
  --maxDFSigma (-xs) maxDFSigma       What portion of the tf (tf-idf) vectors
                                      to be used, expressed in times the
                                      standard deviation (sigma) of the
                                      document frequencies of these vectors.
                                      Can be used to remove really high
                                      frequency terms. Expressed as a double
                                      value. Good value to be specified is 3.0.
                                      In case the value is less than 0 no
                                      vectors will be filtered out. Default is
                                      -1.0. Overrides maxDFPercent
  --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF.
                                      Can be used to remove really high
                                      frequency terms. Expressed as an integer
                                      between 0 and 100. Default is 99. If
                                      maxDFSigma is also set, it will override
                                      this value.
  --weight (-wt) weight               The kind of weight to use. Currently TF
                                      or TFIDF. Default: TFIDF
  --norm (-n) norm                    The norm to use, expressed as either a
                                      float or "INF" if you want to use the
                                      Infinite norm. Must be greater or equal
                                      to 0. The default is not to normalize
  --minLLR (-ml) minLLR               (Optional) The minimum Log Likelihood
                                      Ratio(Float) Default is 1.0
  --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.
                                      Default Value: 1
  --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to
                                      create (2 = bigrams, 3 = trigrams, etc)
                                      Default Value: 1
  --overwrite (-ow)                   If set, overwrite the output directory
  --help (-h)                         Print out help
  --sequentialAccessVector (-seq)     (Optional) Whether output vectors should
                                      be SequentialAccessVectors. If set true
                                      else false
  --namedVector (-nv)                 (Optional) Whether output vectors should
                                      be NamedVectors. If set true else false
  --logNormalize (-lnorm)             (Optional) Whether output vectors should
                                      be logNormalize. If set true else false
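
The help text says that --maxDFSigma overrides --maxDFPercent, and that a negative sigma (the default, -1.0) disables sigma filtering. The relationship between the three document-frequency options can be sketched in plain Python as below. This is only an illustration, not Mahout code: the function name, the sample counts, and the exact cutoff formula (sigma times the population standard deviation of the document frequencies) are assumptions, and Mahout's internal calculation may differ.

```python
from statistics import pstdev


def filter_by_document_frequency(doc_freq, num_docs, min_df=1,
                                 max_df_percent=99, max_df_sigma=-1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    doc_freq maps each term to the number of documents that contain it.
    (Hypothetical helper for illustration; not part of Mahout.)
    """
    if max_df_sigma >= 0:
        # Assumption: the cutoff is sigma times the population standard
        # deviation of the document frequencies. When sigma is set, the
        # percentage option is ignored, as the help text describes.
        max_df = max_df_sigma * pstdev(list(doc_freq.values()))
    else:
        # A negative sigma turns sigma filtering off, so the percentage
        # of documents is used as the upper bound instead.
        max_df = num_docs * max_df_percent / 100.0
    return {term: df for term, df in doc_freq.items()
            if min_df <= df <= max_df}


# Made-up document frequencies for a corpus of 100 documents.
doc_freq = {"the": 100, "vector": 10, "mahout": 5}
kept_by_percent = filter_by_document_frequency(doc_freq, num_docs=100)
kept_by_sigma = filter_by_document_frequency(doc_freq, num_docs=100,
                                             max_df_sigma=1.0)
```

With these sample counts, both calls drop "the", which appears in every document, and keep the two rarer terms.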
[root@masters mahout-distribution-0.9]# mahout seq2sparse -i path-seqdir -o path-cvb -wt tf -seq -nv
MAHOUT-JOB: /usr/local/mahout/mahout-distribution-0.9/mahout-examples-0.9-job.jar
15/03/11 09:33:39 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
15/03/11 09:33:39 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
15/03/11 09:33:39 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
15/03/11 09:33:39 INFO vectorizer.SparseVectorsFromSequenceFiles: Tokenizing documents in path-seqdir
15/03/11 09:34:10 INFO vectorizer.SparseVectorsFromSequenceFiles: Creating Term Frequency Vectors
15/03/11 09:34:10 INFO vectorizer.DictionaryVectorizer: Creating dictionary from path-cvb/tokenized-documents and saving at path-cvb/wordcount
15/03/11 09:35:50 INFO vectorizer.SparseVectorsFromSequenceFiles: Calculating IDF
15/03/11 09:36:19 INFO vectorizer.SparseVectorsFromSequenceFiles: Pruning
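
The log above goes through tokenizing, dictionary creation, term frequency vectors, IDF calculation, and pruning. A rough single-process sketch of the first stages is shown below to make the order of steps concrete. It rests on several assumptions: the real job runs as MapReduce stages and tokenizes with the analyzer class given by --analyzerName, while this sketch uses a whitespace tokenizer, and every function name and sample document here is hypothetical. Treating the "Calculating IDF" stage as document-frequency counting is also an assumption.

```python
from collections import Counter


def tokenize(text):
    # Stand-in for the analyzer: lowercase and split on whitespace.
    return text.lower().split()


def build_dictionary(tokenized_docs, min_support=2):
    # Count every token across the corpus and keep only those seen at
    # least min_support times (compare with --minSupport, default 2).
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    terms = sorted(t for t, c in counts.items() if c >= min_support)
    return {term: index for index, term in enumerate(terms)}


def term_frequency_vectors(tokenized_docs, dictionary):
    # One sparse vector per document: {term index: raw term count}.
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(tok for tok in doc if tok in dictionary)
        vectors.append({dictionary[t]: float(c) for t, c in counts.items()})
    return vectors


def document_frequencies(vectors):
    # Number of documents in which each term index occurs.
    df = Counter()
    for vector in vectors:
        df.update(vector.keys())
    return df


# Made-up sample documents.
docs = [
    "mahout builds sparse vectors",
    "sparse vectors from sequence files",
    "mahout reads sequence files",
]
tokenized = [tokenize(d) for d in docs]
dictionary = build_dictionary(tokenized)
tf = term_frequency_vectors(tokenized, dictionary)
df = document_frequencies(tf)
```

Terms that appear only once in the corpus ("builds", "from", "reads") never enter the dictionary, so they are absent from every vector.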