Train a Classifier - Algorithms and Settings Overview
This section provides an overview of the algorithms available for the PoolParty Semantic Classifier, how they work, and the settings and values you can use.
Refer to the Train a Classifier - Best Practices topic for a short overview of what to aim for when setting up classifier training.
Note
The algorithm information given here covers only the bare minimum you need to know for a successful setup.
For a deeper understanding, please refer to online resources and data science literature.
For details on the implementation in PoolParty, you may want to refer to the Apache Spark machine learning library (MLlib) documentation.
Name | Description | Available Settings | Values |
---|---|---|---|
Logistic Regression | A regression model in which a categorical variable is predicted from a set of constant values. It is useful if you need to make statements about the probability of an occurrence based on a number of constant statistical factors. Simply put, values accumulated over a period of time in the past ('regression') are used as a base for calculating the probability of similar future events. The settings available here additionally influence the outcome, since the regression is complemented with two kinds of regularization: simple and advanced regularization help avoid prediction errors. | | Defaults: |
Linear Support Vector Machine (recommended) | Linear Support Vector Machine (LSVM) algorithms are considered to be the most effective for classification tasks. They work by dividing training as well as classification data for the respective categories into two distinctly separated hyperplanes. Their underlying calculation model is particularly well suited for categorization, and they are known to work best when the data for the different categories is clearly separable. | | Defaults: |
Decision Tree | As in some business management models, this algorithm is based on a tree model that offers several paths to a result. This kind of classifier is a good choice if you want to do multi-category classification. | n.a. | Values set by default. |
Gradient Boosted Tree | Boosting is a machine learning technique in which so-called weak learners are combined and optimized to create a strong learner. The Gradient Boosted Tree is a special variant of this kind of algorithm: it combines decision tree algorithms for prediction and regression classification tasks, making the combined results more reliable and accurate. | Max. Number of Iterations | Default: 10 |
Deep Learning (MLP) | Deep Learning algorithms are based on representation (feature) learning. They are particularly well suited for classification tasks such as machine translation, speech recognition and natural language processing. This particular algorithm, the Multilayer Perceptron (MLP), consists of at least three layers of nodes and can distinguish data that is not linearly separable. | | Defaults: |
Naive Bayes | Naive Bayes algorithms have been used for classification for quite some time and belong to the baseline methods. They distinguish between texts belonging to one category or another, using word frequency as a feature. This family of algorithms is further characterized by the assumption that features are independent of each other, i.e. correlations between them are not taken into account. | n.a. | Values set by default. |
Random Forest | These algorithms are based on decision trees but extend that calculation by combining a multitude of decision trees at training time and then providing as result the mode of the predicted classes (classification) or the mean of the individual predictions (regression). The advantage of random forest algorithms is their ability to correct the overfitting to the training data set that is typical for decision trees. | n.a. | Values set by default. |
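As a rough illustration of how a trained logistic regression model turns accumulated feature weights into a category probability, and how a regularization penalty discourages extreme weights, here is a minimal pure-Python sketch. The weights, feature values and the regularization strength `lam` are made-up illustrative numbers; PoolParty's actual implementation relies on Apache Spark MLlib.

```python
import math

def predict_proba(weights, features, bias=0.0):
    """Logistic regression: squash the weighted feature sum into a probability."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid function

def l2_penalty(weights, lam):
    """Regularization term added to the training loss; it penalizes large
    weights and thus helps avoid overfitting (prediction errors on new data)."""
    return lam * sum(w * w for w in weights)

# Made-up example: two features, e.g. frequencies of two indicative terms.
weights = [1.2, -0.7]
probability = predict_proba(weights, [3.0, 1.0])
print(probability)                    # probability the document belongs to the category
print(l2_penalty(weights, lam=0.1))   # penalty added during training
```

With zero weights the model is maximally uncertain and returns 0.5; training moves the weights so that documents of the category get probabilities close to 1.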
Note
Make sure to train the classifier well, but also take care to avoid overfitting: the term for statistical models that mirror the training data in almost every detail, so that their predictions do not work well on future, unknown data.
Vectorization converts the training documents into numerical feature vectors so that the algorithms can process them.
Parameter | Values |
---|---|
Min. Document Frequency | Defines the minimum number of documents a term must occur in to be included. Default: 1.0 Recommended: 10 Allowed values: integer |
Min. Term Frequency | Defines the minimum number of times a term must occur in a document to be counted. Default: 1.0 Recommended: between 50 and 100 Allowed values: integer |
Vocabulary Size | Defines the number of concepts taken from the thesaurus. Default: 10000 Recommended: 1000-5000, depending on the size of your thesaurus. |
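The interplay of these three parameters can be sketched in pure Python. This is a simplified stand-in for the vectorizer's vocabulary pruning, not PoolParty's actual code; the parameter names mirror the table above, and the toy corpus is made up.

```python
from collections import Counter

def build_vocabulary(docs, min_df=1, min_tf=1, vocab_size=10000):
    """Keep a term only if it occurs in at least `min_df` documents and at
    least `min_tf` times overall; cap the result at the `vocab_size` most
    frequent terms."""
    doc_freq = Counter()   # in how many documents a term occurs
    term_freq = Counter()  # how often a term occurs overall
    for doc in docs:
        tokens = doc.lower().split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))  # count each term once per document
    kept = [t for t in term_freq
            if doc_freq[t] >= min_df and term_freq[t] >= min_tf]
    kept.sort(key=lambda t: (-term_freq[t], t))  # most frequent first
    return kept[:vocab_size]

docs = ["the cat sat", "the dog sat", "a bird flew"]
print(build_vocabulary(docs, min_df=2, min_tf=2, vocab_size=5))
```

Raising the minimum frequencies shrinks the vocabulary to terms that actually recur across the training corpus, which usually makes the resulting vectors more robust.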
Cross-validation is used to estimate how well the trained classifier will perform on data it has not seen before. The parameters reported are recall and precision; f1 is their harmonic mean.
We recommend using a cross-validation level of at least 5-fold and aiming for an overall result of at least 70% for all three parameters.
Possible Levels |
---|
10-fold |
5-fold |
3-fold |
No Validation |
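The mechanics behind these levels can be sketched in pure Python: the training data is split into k folds, each fold serves once as the held-out validation set, and the reported f1 is the harmonic mean of precision and recall. This is an illustrative sketch of the general technique, not PoolParty's implementation.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def k_fold_splits(items, k):
    """Yield (training, validation) pairs: each of the k folds is held out
    once for validation while the remaining folds are used for training."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation

# Toy example with 10 training documents and 5-fold cross-validation:
data = list(range(10))
for training, validation in k_fold_splits(data, 5):
    print(len(training), len(validation))  # 8 documents train, 2 validate per fold

print(f1_score(0.8, 0.7))  # f1 lies between precision and recall
```

A higher fold count gives a more reliable estimate at the cost of more training runs; "No Validation" skips this estimate entirely.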