HashingVectorizer

Convert a collection of text documents to a matrix of token occurrences.

It turns a collection of text documents into a scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm='l1' or projected onto the Euclidean unit sphere if norm='l2'.
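The two normalization modes can be pictured on a single row of counts. The following is a minimal, self-contained sketch, not the library's implementation; the `Norm` type and the `normalizeRow` helper are hypothetical names introduced only for illustration.

```typescript
// Hypothetical helper type: the two norms named above, or undefined for none.
type Norm = "l1" | "l2" | undefined;

// Scale one row of (possibly signed) counts; undefined leaves the values unchanged.
function normalizeRow(row: number[], norm: Norm): number[] {
  if (norm === undefined) return row.slice();
  const length =
    norm === "l1"
      ? row.reduce((sum, v) => sum + Math.abs(v), 0) // sum of absolute values
      : Math.sqrt(row.reduce((sum, v) => sum + v * v, 0)); // Euclidean length
  // An all-zero row has nothing to scale, so it is returned as is.
  return length === 0 ? row.slice() : row.map((v) => v / length);
}

const counts = [1, 2, 1];
const frequencies = normalizeRow(counts, "l1"); // [0.25, 0.5, 0.25]
const unitVector = normalizeRow([3, 4], "l2"); // [0.6, 0.8]
```

With 'l1' the absolute values of a row sum to 1, so entries read as token frequencies; with 'l2' the row has Euclidean length 1.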

This text vectorizer implementation uses the hashing trick to map each token string to an integer feature index.

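This strategy has several advantages: it needs very little memory and scales to large datasets because no vocabulary dictionary is stored; it is fast to serialize because it holds no state besides the constructor parameters; and it can be used in a streaming (partial fit) or parallel pipeline because no state is computed during fit.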
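To make the hashing trick concrete, here is a minimal, self-contained sketch. It is an illustration under stated assumptions, not the library's code: `fnv1a32` and `hashTokens` are hypothetical names, and FNV-1a is swapped in for brevity (scikit-learn itself relies on MurmurHash3), so the column indices will differ from the real vectorizer.

```typescript
// FNV-1a, 32-bit. A stand-in chosen for brevity; NOT the hash scikit-learn uses.
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash | 0; // reinterpret as a signed 32-bit integer
}

// Map tokens straight to column indices; no vocabulary is built or stored.
function hashTokens(
  tokens: string[],
  nFeatures: number,
  alternateSign: boolean = true
): Map<number, number> {
  const row = new Map<number, number>(); // sparse row: column index -> value
  for (const token of tokens) {
    const hash = fnv1a32(token);
    const column = Math.abs(hash) % nFeatures;
    // With alternateSign, the sign of the hash decides whether to add or subtract.
    const sign = alternateSign && hash < 0 ? -1 : 1;
    row.set(column, (row.get(column) || 0) + sign);
  }
  return row;
}

const sparseRow = hashTokens(["the", "cat", "sat", "on", "the", "mat"], 16);
```

Because the column is computed from the token alone, the same token always lands in the same column, and two different tokens can collide when `nFeatures` is small.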

Python Reference

Constructors

constructor()

Signature

new HashingVectorizer(opts?: object): HashingVectorizer;

Parameters

| Name | Type | Description |
| --- | --- | --- |
| opts? | object | - |
| opts.alternate_sign? | boolean | When true, an alternating sign is added to the features so as to approximately conserve the inner product in the hashed space even for small n_features. This approach is similar to sparse random projection. Default value: true |
| opts.analyzer? | "word" \| "char" \| "char_wb" | Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space. If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input. Default value: 'word' |
| opts.binary? | boolean | If true, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. Default value: false |
| opts.decode_error? | "ignore" \| "strict" \| "replace" | Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is 'strict', meaning that a UnicodeDecodeError will be raised. Other values are 'ignore' and 'replace'. Default value: 'strict' |
| opts.dtype? | any | Type of the matrix returned by fit_transform() or transform(). |
| opts.encoding? | string | If bytes or files are given to analyze, this encoding is used to decode. Default value: 'utf-8' |
| opts.input? | "filename" \| "file" \| "content" | If 'filename', the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze. Default value: 'content' |
| opts.lowercase? | boolean | Convert all characters to lowercase before tokenizing. Default value: true |
| opts.n_features? | number | The number of features (columns) in the output matrices. Small numbers of features are likely to cause hash collisions, but large numbers will cause larger coefficient dimensions in linear learners. |
| opts.ngram_range? | any | The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable. |
| opts.norm? | "l1" \| "l2" | Norm used to normalize term vectors. Use undefined for no normalization. Default value: 'l2' |
| opts.preprocessor? | any | Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. Only applies if analyzer is not callable. |
| opts.stop_words? | any[] \| "english" | If 'english', a built-in stop word list for English is used. There are several known issues with 'english' and you should consider an alternative (see Using stop words). If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'. |
| opts.strip_accents? | "ascii" \| "unicode" | Remove accents and perform other character normalization during the preprocessing step. 'ascii' is a fast method that only works on characters that have a direct ASCII mapping. 'unicode' is a slightly slower method that works on any character. undefined (default) does nothing. Both 'ascii' and 'unicode' use NFKD normalization from unicodedata.normalize. |
| opts.token_pattern? | string | Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator). If there is a capturing group in token_pattern then the captured group content, not the entire match, becomes the token. At most one capturing group is permitted. |
| opts.tokenizer? | any | Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'. |
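
Several of the options above (lowercase, stop_words, token_pattern, ngram_range) describe stages of one word-level analysis pipeline. The sketch below shows how those stages could fit together; it is an assumption-laden illustration, not the library's analyzer. `AnalyzerOptions` and `analyzeWords` are hypothetical names, and the regular expression is a JavaScript approximation of "two or more alphanumeric characters" (JavaScript's `\w` matches ASCII word characters only).

```typescript
// Hypothetical option bag mirroring a few of the constructor options.
interface AnalyzerOptions {
  lowercase: boolean;
  stopWords: string[];
  ngramRange: [number, number]; // [min_n, max_n], both inclusive
}

function analyzeWords(doc: string, options: AnalyzerOptions): string[] {
  const text = options.lowercase ? doc.toLowerCase() : doc;
  // Tokens of two or more word characters; punctuation acts as a separator.
  const matches: string[] = text.match(/\b\w\w+\b/g) || [];
  // Stop words are dropped before n-grams are built.
  const tokens = matches.filter((token) => options.stopWords.indexOf(token) === -1);
  const [minN, maxN] = options.ngramRange;
  const features: string[] = [];
  for (let n = minN; n <= maxN; n++) {
    for (let start = 0; start + n <= tokens.length; start++) {
      features.push(tokens.slice(start, start + n).join(" "));
    }
  }
  return features;
}

const features = analyzeWords("The quick brown fox", {
  lowercase: true,
  stopWords: ["the"],
  ngramRange: [1, 2],
});
// features: ["quick", "brown", "fox", "quick brown", "brown fox"]
```

Each resulting string is one feature; in a hashing vectorizer these strings would then be hashed to column indices.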

Returns

HashingVectorizer

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:27

Properties

_isDisposed

boolean = false

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:25

_isInitialized

boolean = false

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:24

_py

PythonBridge

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:23

id

string

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:20

opts

any

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:21

Accessors

py

Signature

py(): PythonBridge;

Returns

PythonBridge

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:136

Signature

py(pythonBridge: PythonBridge): void;

Parameters

| Name | Type |
| --- | --- |
| pythonBridge | PythonBridge |

Returns

void

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:140

Methods

build_analyzer()

Return a callable to process input data.

The callable handles preprocessing, tokenization, and n-grams generation.

Signature

build_analyzer(opts: object): Promise<any>;

Parameters

| Name | Type |
| --- | --- |
| opts | object |

Returns

Promise<any>

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:226

build_preprocessor()

Return a function to preprocess the text before tokenization.

Signature

build_preprocessor(opts: object): Promise<any>;

Parameters

| Name | Type |
| --- | --- |
| opts | object |

Returns

Promise<any>

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:256
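
As a rough picture of what a preprocessing function does before tokenization, the sketch below lowercases text and strips accents. It is a hypothetical stand-alone helper, not the function returned by build_preprocessor(): the names `StripAccents` and `preprocess` are assumptions, and the 'unicode' branch only removes marks in the U+0300 to U+036F block, which approximates removing every combining character.

```typescript
// Hypothetical type mirroring the strip_accents option values.
type StripAccents = "ascii" | "unicode" | undefined;

function preprocess(text: string, lowercase: boolean, stripAccents: StripAccents): string {
  let result = lowercase ? text.toLowerCase() : text;
  if (stripAccents !== undefined) {
    // NFKD splits an accented character into a base character plus combining marks;
    // the marks are then removed (approximation: Combining Diacritical Marks block only).
    result = result.normalize("NFKD").replace(/[\u0300-\u036f]/g, "");
  }
  if (stripAccents === "ascii") {
    // Drop anything that still has no ASCII representation.
    result = result.replace(/[^\x00-\x7f]/g, "");
  }
  return result;
}

const cleaned = preprocess("Café Déjà Vu", true, "unicode"); // "cafe deja vu"
```

The tokenizer and n-gram steps would then run on the cleaned string.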

build_tokenizer()

Return a function that splits a string into a sequence of tokens.

Signature

build_tokenizer(opts: object): Promise<any>;

Parameters

| Name | Type |
| --- | --- |
| opts | object |

Returns

Promise<any>

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:286

decode()

Decode the input into a string of unicode symbols.

The decoding strategy depends on the vectorizer parameters.

Signature

decode(opts: object): Promise<any>;

Parameters

| Name | Type | Description |
| --- | --- | --- |
| opts | object | - |
| opts.doc? | string | The string to decode. |

Returns

Promise<any>

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:318

dispose()

Disposes of the underlying Python resources.

Once dispose() is called, the instance is no longer usable.

Signature

dispose(): Promise<void>;

Returns

Promise<void>

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:207

fit()

Only validates estimator’s parameters.

This method exists to (i) validate the estimator’s parameters and (ii) keep the class consistent with the scikit-learn transformer API.

Signature

fit(opts: object): Promise<any>;

Parameters

| Name | Type | Description |
| --- | --- | --- |
| opts | object | - |
| opts.X? | any | Training data. |
| opts.y? | any | Not used, present for API consistency by convention. |

Returns

Promise<any>

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:355
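
Because fit() only validates parameters, the transformation depends on nothing learned from the data. The toy class below is a hypothetical stand-in (it is not HashingVectorizer and uses a deliberately simple string hash) that shows the practical consequence: transforming documents batch by batch gives the same rows as transforming them all at once.

```typescript
// Hypothetical stateless transformer used only to illustrate the idea.
class StatelessHasher {
  nFeatures: number;

  constructor(nFeatures: number) {
    this.nFeatures = nFeatures;
  }

  // Validates parameters and stores nothing derived from the documents.
  fit(_docs: string[]): StatelessHasher {
    if (!Number.isInteger(this.nFeatures) || this.nFeatures <= 0) {
      throw new Error("nFeatures must be a positive integer");
    }
    return this;
  }

  transform(docs: string[]): number[][] {
    return docs.map((doc) => {
      const row: number[] = new Array(this.nFeatures).fill(0);
      for (const token of doc.toLowerCase().split(/\s+/)) {
        if (token.length === 0) continue;
        let hash = 0;
        for (let i = 0; i < token.length; i++) {
          hash = (Math.imul(hash, 31) + token.charCodeAt(i)) | 0; // toy string hash
        }
        const column = Math.abs(hash) % this.nFeatures;
        row[column] = (row[column] || 0) + 1;
      }
      return row;
    });
  }
}

const hasher = new StatelessHasher(8).fit([]);
const batches = [["red fish", "blue fish"], ["one fish two fish"]];

// Transform each batch separately, as a streaming pipeline would.
const streamed: number[][] = [];
for (const batch of batches) {
  for (const row of hasher.transform(batch)) streamed.push(row);
}

// Transform everything in one call.
const allAtOnce = hasher.transform(["red fish", "blue fish", "one fish two fish"]);
```

Here `streamed` and `allAtOnce` hold identical rows, which is what makes out-of-core or parallel processing straightforward for a stateless transformer.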

fit_transform()

Transform a sequence of documents to a document-term matrix.

Signature

fit_transform(opts: object): Promise<any[]>;

Parameters

| Name | Type | Description |
| --- | --- | --- |
| opts | object | - |
| opts.X? | any | Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed. |
| opts.y? | any | Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline. |

Returns

Promise<any[]>

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:395

get_stop_words()

Build or fetch the effective stop words list.

Signature

get_stop_words(opts: object): Promise<any>;

Parameters

| Name | Type |
| --- | --- |
| opts | object |

Returns

Promise<any>

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:437

init()

Initializes the underlying Python resources.

This instance is not usable until the Promise returned by init() resolves.

Signature

init(py: PythonBridge): Promise<void>;

Parameters

| Name | Type |
| --- | --- |
| py | PythonBridge |

Returns

Promise<void>

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:149

partial_fit()

Only validates estimator’s parameters.

This method exists to (i) validate the estimator’s parameters and (ii) keep the class consistent with the scikit-learn transformer API.

Signature

partial_fit(opts: object): Promise<any>;

Parameters

| Name | Type | Description |
| --- | --- | --- |
| opts | object | - |
| opts.X? | any | Training data. |
| opts.y? | any | Not used, present for API consistency by convention. |

Returns

Promise<any>

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:469

set_output()

Set output container.

See Introducing the set_output API for an example of how to use the API.

Signature

set_output(opts: object): Promise<any>;

Parameters

| Name | Type | Description |
| --- | --- | --- |
| opts | object | - |
| opts.transform? | "default" \| "pandas" | Configure output of transform and fit_transform. |

Returns

Promise<any>

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:511

transform()

Transform a sequence of documents to a document-term matrix.

Signature

transform(opts: object): Promise<any[]>;

Parameters

| Name | Type | Description |
| --- | --- | --- |
| opts | object | - |
| opts.X? | any | Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed. |

Returns

Promise<any[]>

Defined in: generated/feature_extraction/text/HashingVectorizer.ts:546