Mahout Complete Guide) - mahout seq2sparse arguments & process...

안단테에 2023. 2. 2. 13:26

Arguments

Usage:
 [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize <chunkSize>
 --output <output> --input <input> --minDF <minDF> --maxDFSigma <maxDFSigma>
 --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm> --minLLR <minLLR>
 --numReducers <numReducers> --maxNGramSize <ngramSize> --overwrite --help
 --sequentialAccessVector --namedVector --logNormalize]
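As a small illustration of how the long option names in the usage text fit together, here is a minimal Python sketch that assembles a seq2sparse argument list. The helper name build_seq2sparse_command is hypothetical (it is not part of Mahout), and the paths are the placeholder names used in the run log in this post. It only builds the command and does not execute it.

```python
import shlex


def build_seq2sparse_command(input_dir, output_dir, weight=None,
                             sequential_access=False, named_vector=False,
                             overwrite=False):
    """Assemble an argument list using long option names from the usage text.

    Hypothetical helper for illustration only; it does not run Mahout.
    """
    cmd = ["mahout", "seq2sparse", "--input", input_dir, "--output", output_dir]
    if weight is not None:
        cmd += ["--weight", weight]            # TF or TFIDF per the help text
    if sequential_access:
        cmd.append("--sequentialAccessVector")  # flag without a value
    if named_vector:
        cmd.append("--namedVector")             # flag without a value
    if overwrite:
        cmd.append("--overwrite")               # flag without a value
    return cmd


# Placeholder paths taken from the run log in this post.
command = build_seq2sparse_command(
    "path-seqdir", "path-cvb",
    weight="tf", sequential_access=True, named_vector=True,
)
print(shlex.join(command))
```

Passing the list to something like subprocess.run would be the natural next step, but that requires a working Mahout and Hadoop installation, so it is left out here.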

Options

--minSupport (-s) minSupport (Optional) Minimum Support. Default Value: 2

--analyzerName (-a) analyzerName The class name of the analyzer

--chunkSize (-chunk) chunkSize The chunkSize in MegaBytes. Default Value: 100MB

--output (-o) output The directory pathname for output.

--input (-i) input Path to job input directory.

--minDF (-md) minDF The minimum document frequency. Default is 1

--maxDFSigma (-xs) maxDFSigma What portion of the tf (tf-idf) vectors to be used, expressed in times the standard deviation (sigma) of the document frequencies of these vectors. Can be used to remove really high frequency terms. Expressed as a double value. Good value to be specified is 3.0. In case the value is less than 0 no vectors will be filtered out. Default is -1.0. Overrides maxDFPercent

--maxDFPercent (-x) maxDFPercent The max percentage of docs for the DF. Can be used to remove really high frequency terms. Expressed as an integer between 0 and 100. Default is 99. If maxDFSigma is also set, it will override this value.

--weight (-wt) weight The kind of weight to use. Currently TF or TFIDF. Default: TFIDF

--norm (-n) norm The norm to use, expressed as either a float or "INF" if you want to use the Infinite norm. Must be greater or equal to 0. The default is not to normalize

--minLLR (-ml) minLLR (Optional) The minimum Log Likelihood Ratio (Float). Default is 1.0

--numReducers (-nr) numReducers (Optional) Number of reduce tasks. Default Value: 1

--maxNGramSize (-ng) ngramSize (Optional) The maximum size of ngrams to create (2 = bigrams, 3 = trigrams, etc). Default Value: 1

--overwrite (-ow) If set, overwrite the output directory

--help (-h) Print out help

--sequentialAccessVector (-seq) (Optional) Whether output vectors should be SequentialAccessVectors. If set true else false

--namedVector (-nv) (Optional) Whether output vectors should be NamedVectors. If set true else false

--logNormalize (-lnorm) (Optional) Whether output vectors should be logNormalize. If set true else false
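To make the document-frequency options (minDF, maxDFPercent, maxDFSigma) and the norm option more concrete, here is a rough Python sketch of one way to read the help text above. It is not Mahout's implementation. The function names are hypothetical, the sigma threshold is assumed to be maxDFSigma times the population standard deviation of the document frequencies, and the boundary handling (whether a term exactly at the limit is kept) is also an assumption.

```python
import math
from collections import Counter
from statistics import pstdev


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts a term at most once
    return df


def prune_by_df(docs, min_df=1, max_df_percent=99, max_df_sigma=-1.0):
    """Return the terms that survive DF filtering (illustrative reading only)."""
    df = document_frequencies(docs)
    if not df:
        return set()
    if max_df_sigma >= 0:
        # Assumption: threshold = sigma * standard deviation of the DF values.
        # As in the help text, a non-negative sigma overrides the percentage.
        max_df = max_df_sigma * pstdev(list(df.values()))
    else:
        max_df = len(docs) * max_df_percent / 100.0
    return {term for term, n in df.items() if min_df <= n <= max_df}


def normalize(values, norm):
    """Divide a vector by its p-norm; norm is a positive number or "INF"."""
    if not values:
        return []
    if isinstance(norm, str):
        if norm.upper() != "INF":
            raise ValueError("norm must be a number or 'INF'")
        length = max(abs(v) for v in values)      # infinity norm
    else:
        p = float(norm)
        if p <= 0:
            raise ValueError("this sketch only handles p > 0")
        length = sum(abs(v) ** p for v in values) ** (1.0 / p)
    if length == 0:
        return list(values)                        # nothing to scale
    return [v / length for v in values]


docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "fish"], ["the"]]
print(prune_by_df(docs))                      # default 99 percent limit
print(prune_by_df(docs, max_df_sigma=1.0))    # sigma-based limit
print(normalize([3.0, 4.0], 2))
print(normalize([3.0, -4.0], "INF"))
```

In this toy corpus "the" appears in every document, so it exceeds the default 99 percent limit and is dropped, which is the kind of very high frequency term these options are meant to remove.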

=========================================================================================================

Process

[root@masters mahout-distribution-0.9]# mahout seq2sparse -i path-seqdir -o path-cvb -wt tf -seq -nv
MAHOUT-JOB: /usr/local/mahout/mahout-distribution-0.9/mahout-examples-0.9-job.jar
15/03/11 09:33:39 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
15/03/11 09:33:39 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
15/03/11 09:33:39 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
15/03/11 09:33:39 INFO vectorizer.SparseVectorsFromSequenceFiles: Tokenizing documents in path-seqdir
15/03/11 09:34:10 INFO vectorizer.SparseVectorsFromSequenceFiles: Creating Term Frequency Vectors
15/03/11 09:34:10 INFO vectorizer.DictionaryVectorizer: Creating dictionary from path-cvb/tokenized-documents and saving at path-cvb/wordcount
15/03/11 09:35:50 INFO vectorizer.SparseVectorsFromSequenceFiles: Calculating IDF
15/03/11 09:36:19 INFO vectorizer.SparseVectorsFromSequenceFiles: Pruning
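The first stages named in the log (tokenizing documents, creating a dictionary from word counts, creating term frequency vectors) can be sketched in a few lines of Python. This is only an illustration of the idea, not what Mahout runs. The whitespace tokenizer is a stand-in for the configured analyzer, all function names and sample documents are made up, and applying the minimum support value of 2 to total term counts is an assumption based on the option description.

```python
from collections import Counter


def tokenize(text):
    """Stand-in tokenizer: lowercase and split on whitespace (assumption)."""
    return text.lower().split()


def build_dictionary(tokenized_docs, min_support=2):
    """Map each sufficiently frequent term to an integer index."""
    counts = Counter(tok for tokens in tokenized_docs for tok in tokens)
    kept = sorted(term for term, c in counts.items() if c >= min_support)
    return {term: index for index, term in enumerate(kept)}


def tf_vector(tokens, dictionary):
    """Build a sparse term-frequency vector as {term index: count}."""
    vec = Counter()
    for tok in tokens:
        if tok in dictionary:          # terms outside the dictionary are dropped
            vec[dictionary[tok]] += 1
    return dict(vec)


# Made-up sample documents keyed by a document name.
raw_docs = {"doc1": "apple banana apple", "doc2": "banana cherry"}
tokenized = {name: tokenize(text) for name, text in raw_docs.items()}
dictionary = build_dictionary(tokenized.values(), min_support=2)
tf_vectors = {name: tf_vector(tokens, dictionary)
              for name, tokens in tokenized.items()}

print(dictionary)
print(tf_vectors)
```

Keeping the document name next to each vector loosely mirrors what the -nv (namedVector) flag in the command above is for, and the integer indices play the role of the dictionary that the log reports being created.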
