Train a Classifier - Algorithms and Settings Overview

Abstract

This section provides an overview of the algorithms available for the PoolParty Semantic Classifier, how they work, and the settings and values you can use.

Refer to the Train a Classifier - Best Practices topic for a short overview of what to aim for when setting up a classifier.

Available Algorithms and Values

Note

The algorithm information given here covers only the bare minimum you need to know for a successful setup.

For a deeper understanding, please refer to online resources and data science literature.

For details on the implementation in PoolParty, you may want to refer to the Apache Spark machine learning library (MLlib) documentation.

For each algorithm below, you will find a description, the available settings, and their default values.

Logistic Regression

A regression model that predicts a categorical variable from a set of continuous input features.

It is useful if you need to estimate the probability of an outcome based on a number of statistical factors.

Simply put, values accumulated over a period of time in the past ('regression') are used as a basis for estimating the probability of similar future events.

The available settings additionally influence the outcome, since the regression is complemented with regularization: the regularization parameter controls the overall penalty strength, while the elastic net parameter mixes L1 and L2 penalties, which helps avoid prediction errors caused by overfitting.

  • Elastic Net Regularization (default: 0.8)

  • Max. Number of Iterations (default: 10)

  • Regularization (default: 0.3)
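Since the classifier builds on Spark ML, these settings likely correspond to the regParam, elasticNetParam and maxIter parameters of Spark's LogisticRegression. Their effect can be sketched in plain Python; this is a simplified illustration on toy data, not the PoolParty implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, max_iter=10, reg=0.3, alpha=0.8, lr=0.1):
    """Toy logistic regression with an elastic-net penalty.

    max_iter -> "Max. Number of Iterations"
    reg      -> "Regularization" (overall penalty strength)
    alpha    -> "Elastic Net Regularization" (0 = pure L2, 1 = pure L1)
    """
    w = [0.0] * len(X[0])
    for _ in range(max_iter):
        for j in range(len(w)):
            # average gradient of the log-loss for weight j
            grad = sum(
                (sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - yi) * x[j]
                for x, yi in zip(X, y)
            ) / len(X)
            # elastic-net term: alpha mixes the L1 (sign) and L2 (value) parts
            sign = 1.0 if w[j] > 0 else -1.0 if w[j] < 0 else 0.0
            grad += reg * (alpha * sign + (1 - alpha) * w[j])
            w[j] -= lr * grad
    return w

# toy data: feature 0 indicates class 0, feature 1 indicates class 1
X = [[1.0, 0.0], [0.9, 0.1], [0.1, 1.0], [0.0, 0.9]]
y = [0, 0, 1, 1]
weights = train_logistic(X, y)
```

With the default of 0.8, the penalty is mostly L1, which drives the weights of uninformative features toward zero.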

Linear Support Vector Machine (recommended)

Linear Support Vector Machine (LSVM) algorithms are considered among the most effective for classification tasks. They work by finding a hyperplane that separates the training data (and later the classification data) for the respective categories as cleanly as possible.

Their underlying calculation model is particularly well suited to categorization, and they are known to work best when the categories are clearly distinct from one another.

  • Max. Number of Iterations (default: 10)

  • Regularization (default: 0.3)
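In Spark ML the corresponding LinearSVC estimator minimizes the hinge loss. A minimal subgradient-descent sketch, illustrative only, assuming two classes labelled +1 and -1:

```python
def train_linear_svm(X, y, max_iter=10, reg=0.3, lr=0.1):
    """Toy linear SVM via hinge-loss subgradient descent (labels +1/-1)."""
    w = [0.0] * len(X[0])
    for _ in range(max_iter):          # "Max. Number of Iterations"
        for x, yi in zip(X, y):
            margin = yi * sum(wi * xi for wi, xi in zip(w, x))
            for j in range(len(w)):
                # L2 penalty ("Regularization") plus the hinge term,
                # which is only active for points inside the margin
                grad = reg * w[j] - (yi * x[j] if margin < 1 else 0.0)
                w[j] -= lr * grad
    return w

# toy data: feature 1 indicates the +1 class
X = [[1.0, 0.0], [0.9, 0.1], [0.1, 1.0], [0.0, 0.9]]
y = [-1, -1, 1, 1]
w = train_linear_svm(X, y)
score = sum(wi * xi for wi, xi in zip(w, [0.0, 1.0]))
```

Points on the +1 side of the learned hyperplane get a positive score, points on the -1 side a negative one.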

Decision Tree

Like the tree models used in some business management methods, this algorithm is based on a decision tree that offers several paths to a result.

This kind of classifier is a good choice if you want to do a multi-category classification.

n.a.

Values set by default.

Gradient Boosted Tree

Boosting is a machine learning technique in which so-called weak learners are combined and optimized to create a strong learner.

The Gradient Boosted Tree is a special variant of this kind of algorithm. It builds an ensemble of decision trees sequentially, each new tree correcting the prediction errors of the trees before it, which makes the combined result more reliable for classification and regression tasks.

Max. Number of Iterations (default: 10)
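The "each tree corrects the previous ones" idea can be sketched with one-split trees ("stumps") fitted to residuals. This is a toy illustration of gradient boosting for regression-style scores, where Max. Number of Iterations is the number of trees, not the PoolParty implementation:

```python
def fit_stump(X, residuals):
    """Find the single feature/threshold split that best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(x[j] for x in X)):
            left = [r for x, r in zip(X, residuals) if x[j] <= t]
            right = [r for x, r in zip(X, residuals) if x[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    _, j, t, lm, rm = best
    return lambda x: lm if x[j] <= t else rm

def gradient_boost(X, y, max_iter=10, lr=0.5):
    """Each new stump is fitted to the residual errors of the ensemble so far."""
    pred = [0.0] * len(X)
    stumps = []
    for _ in range(max_iter):          # "Max. Number of Iterations"
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        pred = [pi + lr * stump(x) for pi, x in zip(pred, X)]
    return lambda x: sum(lr * s(x) for s in stumps)

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 0.0, 1.0, 1.0]
model = gradient_boost(X, y)
```

Each added stump shrinks the remaining error, so the ensemble's predictions converge toward the training targets.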

Deep Learning (MLP)

Deep Learning algorithms are based on representation (feature) learning. They are particularly well suited for tasks such as machine translation, speech recognition and natural language processing.

This particular algorithm, the Multilayer Perceptron (MLP), consists of at least three layers of nodes (input, hidden and output), which allows it to distinguish data that is not linearly separable.

  • Block Size (default: 128)

  • Layers (default: 50, 10)

  • Max. Number of Iterations (default: 10)

  • Seed (default: 1234)

Naive Bayes

Naive Bayes algorithms have been used for classification for quite some time and belong to the baseline methods. For text classification they use word frequencies as features to decide which category a text belongs to.

They are further characterized by the 'naive' assumption that features are independent of one another, i.e. correlations between them are not taken into account.

n.a.

Values set by default.
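A minimal multinomial naive Bayes over word frequencies illustrates both points. The categories and texts are hypothetical; add-one smoothing is used, and class priors are omitted because the toy classes are the same size:

```python
import math
from collections import Counter

def train_nb(docs):
    """docs maps category -> list of training texts; returns a predictor."""
    counts = {c: Counter(w for t in texts for w in t.split())
              for c, texts in docs.items()}
    vocab = set(w for cnt in counts.values() for w in cnt)

    def predict(text):
        scores = {}
        for c, cnt in counts.items():
            total = sum(cnt.values())
            # sum of log P(word | class) with add-one smoothing;
            # the sum is the "naive" independence assumption at work
            scores[c] = sum(math.log((cnt[w] + 1) / (total + len(vocab)))
                            for w in text.split())
        return max(scores, key=scores.get)

    return predict

predict = train_nb({
    "sports": ["goal match team", "team wins match"],
    "finance": ["stock market rises", "market crash stock"],
})
predict("the team scored a goal")   # -> "sports"
```

Words unseen in training contribute equally to every class thanks to smoothing, so the decision rests on the words the classes have actually learned.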

Random Forest

These algorithms are based on decision trees but extend that approach by combining a multitude of decision trees at training time and returning the mode of the predicted classes (for classification) or the mean prediction (for regression).

The advantage of random forest algorithms is their ability to correct the overfitting to the training data set that is typical for single decision trees.

n.a.

Values set by default.
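The two ingredients can be sketched in a few lines: each tree trains on a bootstrap sample of the data (drawn with replacement), and the classification result is the mode of the individual tree votes. Illustrative only; the tree-building itself is omitted:

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Draw a training sample with replacement for one tree."""
    return [rng.choice(data) for _ in data]

def forest_vote(predictions):
    """The forest's classification is the mode of the tree predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap([1, 2, 3, 4, 5], rng)   # one tree's training set
vote = forest_vote(["cat", "dog", "cat"])  # -> "cat"
```

Because every tree sees a slightly different sample, their individual overfitting tends to average out in the vote.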

Note

Make sure to train the classifier well, but also take care to avoid overfitting: a statistical model that mirrors the training data in almost every particular will not predict well on future unknown data.

Vectorizer Settings

Vectorization converts the text into numeric feature vectors so that the classifier's algorithms can process it.

For each parameter below, you will find its description, default, recommended and allowed values.

Min. Document Frequency

Defines the minimum number of documents a term must occur in to be included in the vocabulary.

Default: 1.0

Recommended: 10

Allowed values: integer

Min. Term Frequency

Defines the minimum number of times a term must occur in a document to be counted for that document.

Default: 1.0

Recommended: 50-100

Allowed values: integer

Vocabulary Size

Defines the maximum size of the vocabulary, i.e. the number of concepts taken from the thesaurus.

Default value: 10000

Recommended values: 1000-5000, depending on the size of your thesaurus.
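In Spark ML these parameters likely correspond to CountVectorizer's minDF, minTF and vocabSize. A sketch of how they prune the vocabulary, on toy documents rather than the PoolParty implementation:

```python
from collections import Counter

def build_vocabulary(docs, min_df=10, min_tf=1, vocab_size=5000):
    """Build a capped vocabulary from raw text documents.

    min_tf     -> per document, count a term only if it occurs >= min_tf times
    min_df     -> keep a term only if it occurs in >= min_df documents
    vocab_size -> cap the vocabulary at the most frequent terms
    """
    df = Counter()
    for doc in docs:
        tf = Counter(doc.split())
        df.update(t for t, n in tf.items() if n >= min_tf)
    kept = [t for t, n in df.most_common() if n >= min_df]
    return kept[:vocab_size]

docs = ["spark spark ml", "spark tuning", "ml basics"]
vocab = build_vocabulary(docs, min_df=2, min_tf=1, vocab_size=10)
```

Raising min_df discards rare terms that occur in too few documents to generalize; raising min_tf discards terms that are only mentioned in passing.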

Available Cross-Validation Levels

Cross-validation is used to estimate how well the trained classifier will predict unseen data. The parameters reported are recall and precision; their harmonic mean is the F1 score.

We recommend using a cross-validation level of at least 5-fold and aiming for an overall result of at least 70% for all three parameters.
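What k-fold splitting and the F1 score mean can be sketched in plain Python (illustrative only):

```python
def kfold_indices(n, k=5):
    """Split n sample indices into k folds; each fold is held out once
    for validation while the other k-1 folds are used for training."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

splits = list(kfold_indices(10, k=5))   # 5 train/test splits
score = f1(0.8, 0.7)
```

With 5-fold validation, every sample is used for validation exactly once, and the reported precision, recall and F1 are averaged over the five held-out folds.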

Possible Levels

10-fold

5-fold

3-fold

No Validation