Mahout Complete Guide) - mahout seq2sparse arguments & process...

Posted by 안단테 on 2023. 2. 2. 13:26

Arguments

 

Usage:
 [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize
<chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFSigma
<maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm>
--minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize>
--overwrite --help --sequentialAccessVector --namedVector --logNormalize]

Options
  --minSupport (-s) minSupport        (Optional) Minimum Support. Default
                                      Value: 2
  --analyzerName (-a) analyzerName    The class name of the analyzer
  --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. Default
                                      Value: 100MB
  --output (-o) output                The directory pathname for output.
  --input (-i) input                  Path to job input directory.
  --minDF (-md) minDF                 The minimum document frequency.  Default
                                      is 1
  --maxDFSigma (-xs) maxDFSigma       What portion of the tf (tf-idf) vectors
                                      to be used, expressed in times the
                                      standard deviation (sigma) of the
                                      document frequencies of these vectors.
                                      Can be used to remove really high
                                      frequency terms. Expressed as a double
                                      value. Good value to be specified is 3.0.
                                      In case the value is less than 0 no
                                      vectors will be filtered out. Default is
                                      -1.0.  Overrides maxDFPercent
  --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF.
                                      Can be used to remove really high
                                      frequency terms. Expressed as an integer
                                      between 0 and 100. Default is 99.  If
                                      maxDFSigma is also set, it will override
                                      this value.
  --weight (-wt) weight               The kind of weight to use. Currently TF
                                      or TFIDF. Default: TFIDF
  --norm (-n) norm                    The norm to use, expressed as either a
                                      float or "INF" if you want to use the
                                      Infinite norm.  Must be greater or equal
                                      to 0.  The default is not to normalize
  --minLLR (-ml) minLLR               (Optional)The minimum Log Likelihood
                                      Ratio(Float)  Default is 1.0
  --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.
                                      Default Value: 1
  --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to
                                      create (2 = bigrams, 3 = trigrams, etc)
                                      Default Value:1
  --overwrite (-ow)                   If set, overwrite the output directory
  --help (-h)                         Print out help
  --sequentialAccessVector (-seq)     (Optional) Whether output vectors should
                                      be SequentialAccessVectors. If set true
                                      else false
  --namedVector (-nv)                 (Optional) Whether output vectors should
                                      be NamedVectors. If set true else false
  --logNormalize (-lnorm)             (Optional) Whether output vectors should
                                      be logNormalize. If set true else false
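
The three document-frequency options (--minDF, --maxDFPercent, --maxDFSigma) are easier to picture with a small example. The Python below is only a rough sketch of the idea described in the help text, not Mahout's actual implementation: the function names are hypothetical, and the exact way Mahout derives the sigma-based cutoff may differ from the plain "sigma times the standard deviation of the document frequencies" assumed here.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def prune_terms(docs, min_df=1, max_df_percent=99, max_df_sigma=-1.0):
    """Return the set of terms kept after document-frequency filtering.

    Assumption (not taken from Mahout's source): a term is kept when
    min_df <= df <= upper bound. The upper bound is max_df_percent percent
    of the number of documents, unless max_df_sigma >= 0, in which case it
    is max_df_sigma times the standard deviation of all document frequencies.
    """
    df = document_frequencies(docs)
    if not df:
        return set()
    if max_df_sigma >= 0:
        # A non-negative sigma overrides the percentage, as the help text says.
        values = list(df.values())
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        upper = max_df_sigma * std
    else:
        upper = len(docs) * max_df_percent / 100.0
    return {term for term, count in df.items() if min_df <= count <= upper}


docs = [["the", "cat"], ["the", "dog"], ["the", "bird"], ["the", "cat", "fish"]]
print(sorted(prune_terms(docs)))             # defaults: percent-based cutoff
print(sorted(prune_terms(docs, min_df=2)))   # also drop terms seen in one doc
```

In this toy corpus "the" appears in every document, so it falls above the default 99 percent cutoff and is dropped, which is the kind of high-frequency term these options are meant to remove.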

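
For --weight and --norm, a small sketch may help show what the two choices do to a single document vector. This is a textbook illustration under stated assumptions, not Mahout's code: the TF-IDF formula here is the plain tf * log(N / df) form (Mahout's own weighting may use a different variant), the function names are made up for the example, and only p > 0 and "INF" are handled even though the help text allows 0.

```python
import math


def weight_vector(term_counts, doc_freq, n_docs, weight="tfidf"):
    """Weight one document's term counts with textbook TF or TF-IDF."""
    if weight == "tf":
        return {t: float(c) for t, c in term_counts.items()}
    if weight == "tfidf":
        # Textbook form; assumed here, not necessarily what Mahout computes.
        return {t: c * math.log(n_docs / doc_freq[t])
                for t, c in term_counts.items()}
    raise ValueError("weight must be 'tf' or 'tfidf'")


def normalize(vec, norm):
    """Divide a vector by its p-norm; norm is a positive number or "INF"."""
    if not vec:
        return {}
    if norm == "INF":
        # Infinite norm: the largest absolute component.
        length = max(abs(v) for v in vec.values())
    else:
        p = float(norm)
        if p <= 0:
            raise ValueError("this sketch only handles p > 0 or 'INF'")
        length = sum(abs(v) ** p for v in vec.values()) ** (1.0 / p)
    if length == 0:
        return dict(vec)
    return {t: v / length for t, v in vec.items()}


counts = {"cat": 2, "the": 1}
doc_freq = {"cat": 2, "the": 4}
print(weight_vector(counts, doc_freq, n_docs=4, weight="tf"))
print(weight_vector(counts, doc_freq, n_docs=4, weight="tfidf"))
print(normalize({"a": 3.0, "b": 4.0}, 2))
print(normalize({"a": 3.0, "b": 4.0}, "INF"))
```

With TF the counts pass through unchanged, while with TF-IDF a term that appears in every document (here "the") ends up with weight 0 in this textbook form.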
 
=========================================================================================================
Process

 

[root@masters mahout-distribution-0.9]# mahout seq2sparse -i path-seqdir -o path-cvb -wt tf -seq -nv

 

MAHOUT-JOB: /usr/local/mahout/mahout-distribution-0.9/mahout-examples-0.9-job.jar

15/03/11 09:33:39 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1

15/03/11 09:33:39 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0

15/03/11 09:33:39 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1

15/03/11 09:33:39 INFO vectorizer.SparseVectorsFromSequenceFiles: Tokenizing documents in path-seqdir

 

15/03/11 09:34:10 INFO vectorizer.SparseVectorsFromSequenceFiles: Creating Term Frequency Vectors

 

15/03/11 09:34:10 INFO vectorizer.DictionaryVectorizer: Creating dictionary from path-cvb/tokenized-documents and saving at path-cvb/wordcount

 

15/03/11 09:35:50 INFO vectorizer.SparseVectorsFromSequenceFiles: Calculating IDF

 

15/03/11 09:36:19 INFO vectorizer.SparseVectorsFromSequenceFiles: Pruning
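
The timestamps in the log above show roughly how long each stage ran. As a convenience, here is a small Python sketch that reads such a log and reports the gap after each message. It assumes the timestamp layout seen above (yy/MM/dd HH:mm:ss followed by INFO and the logger name); the function name is hypothetical and the script is not part of Mahout.

```python
import re
from datetime import datetime

# Matches lines such as:
# 15/03/11 09:33:39 INFO vectorizer.SparseVectorsFromSequenceFiles: Pruning
LOG_LINE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO (\S+): (.*)$")


def stage_durations(log_text):
    """Return (message, seconds) pairs for every message followed by a gap.

    The gap before the next log line is attributed to the last message
    printed before it. Lines that do not match the pattern are skipped.
    """
    entries = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S")
            entries.append((stamp, match.group(3)))
    durations = []
    for (start, message), (end, _) in zip(entries, entries[1:]):
        seconds = (end - start).total_seconds()
        if seconds > 0:
            durations.append((message, seconds))
    return durations


sample = """
15/03/11 09:33:39 INFO vectorizer.SparseVectorsFromSequenceFiles: Tokenizing documents in path-seqdir
15/03/11 09:34:10 INFO vectorizer.SparseVectorsFromSequenceFiles: Creating Term Frequency Vectors
15/03/11 09:34:10 INFO vectorizer.DictionaryVectorizer: Creating dictionary from path-cvb/tokenized-documents and saving at path-cvb/wordcount
15/03/11 09:35:50 INFO vectorizer.SparseVectorsFromSequenceFiles: Calculating IDF
15/03/11 09:36:19 INFO vectorizer.SparseVectorsFromSequenceFiles: Pruning
"""
for message, seconds in stage_durations(sample):
    print(f"{seconds:6.0f}s  {message}")
```

For this run the gaps work out to 31 seconds for tokenizing, 100 seconds for building the dictionary and term frequency vectors, and 29 seconds for the IDF step. The log shown here stops at "Pruning", so that stage's duration cannot be read from it.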

 