
Train a Classifier - Best Practices

Abstract


This section summarizes our recommendations for training classifiers in PoolParty so that you get the best possible results.

Settings to Use
  • Do not use more than 50 categories per classifier, and aim for an average of 50-150 documents per category. Larger setups will impede performance considerably.

  • Create categories that fit both your training documents and the use case for this particular classifier.

  • In the Features tab, activate Concepts and Shadow Concepts if you want to refine your results, but test each time whether the results make sense for your use case.

  • For Cross-Validation, use at least 5-fold; depending on the amount of training material, 10-fold is recommended for statistical soundness (see the sketch after this list).

    • Note that higher fold values slow down the validation process, especially with many documents.

  • For the Vocabulary Size in the Vectorizer settings, set a value of 1000-5000, depending on the number of concepts taken from the thesaurus and the number of documents.
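
The following is a minimal sketch of these two settings. Since PoolParty's own training pipeline is not exposed as code, it uses scikit-learn as a stand-in to illustrate capping the vocabulary size and scoring a classifier with 10-fold cross-validation; the texts and labels are made-up placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Toy corpus; substitute your own labelled training documents.
texts = [
    "invoice payment due amount",
    "payment received with thanks",
    "server outage incident report",
    "incident escalated to operations",
] * 5
labels = ["finance", "finance", "it", "it"] * 5

pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000),  # cap the vocabulary size at 5000 terms
    LogisticRegression(),
)

# 10-fold cross-validation, scored with precision, recall and F1.
scores = cross_validate(
    pipeline, texts, labels, cv=10,
    scoring=["precision_macro", "recall_macro", "f1_macro"],
)
for key in ("test_precision_macro", "test_recall_macro", "test_f1_macro"):
    print(key, scores[key].mean())
```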

Documents to Use

Note

We strongly recommend that you do not use all of your available documents for training the classifier!

Hold back roughly 10% of them for testing the trained classifier before you use it on new and unknown documents (see the sketch after the list below).

  • Use documents that fit your use case and whose categories you already know.

    • This lets you evaluate the classifier's training results properly; otherwise, training may fail or become much more difficult.
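
A minimal sketch of the recommended hold-out, again assuming a scikit-learn-style setup rather than PoolParty's own document management; the texts and labels are placeholders.

```python
from sklearn.model_selection import train_test_split

# Placeholder corpus; substitute your own labelled documents.
texts = ["invoice payment due", "server outage report"] * 10
labels = ["finance", "it"] * 10

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels,
    test_size=0.1,    # hold back roughly 10% for final testing
    stratify=labels,  # keep the category proportions in both parts
    random_state=42,  # make the split reproducible
)
```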

Results to Aim for
  • The values in the Cross-Validation section are those of recall, precision and their harmonic mean, F1 (shown below).

    • We strongly recommend aiming for values above 70% for all three.
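
As a reminder, F1 is the harmonic mean of precision and recall, so one weak value pulls it down more than a simple average would. A quick calculation with example values:

```python
precision, recall = 0.80, 0.65  # example values only
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))             # 0.717, below the arithmetic mean of 0.725
```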

Avoid Overfitting

In statistical machine learning, this phenomenon is known as overfitting: if you choose training data that fits the categories too perfectly, the classifier will always reach the highest possible cross-validation values. Although such high values may seem impressive, they can mean the classifier has been trained too narrowly, so its predictions on future, unknown documents may not work well.
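
One way to spot this, sketched below under the same scikit-learn assumption as above, is to compare the score on the training data itself with the cross-validated score: a near-perfect training score combined with a much lower cross-validation score suggests the classifier is fitted too narrowly. The data is a made-up placeholder.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus; substitute your own labelled documents.
texts = [
    "invoice payment due",
    "payment received",
    "server outage",
    "incident report",
] * 5
labels = ["finance", "finance", "it", "it"] * 5

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

train_score = pipeline.score(texts, labels)                       # accuracy on the training data
cv_score = cross_val_score(pipeline, texts, labels, cv=5).mean()  # accuracy on held-out folds

print(f"train: {train_score:.2f}  cross-validation: {cv_score:.2f}")
# A large gap between these two numbers is a warning sign of overfitting.
```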