Add Documents to a Train Classifier

Abstract

This section explains how to add documents to a Train Classifier.

A Train Classifier is the starting point for later large-scale classification. Add documents whose content you already know well, that is, documents for which you have a good idea of the best possible classification results. That way you can train the classifier effectively later, tweaking its settings until the results are satisfactory. Supported file types are based on those of the Apache Tika library; PoolParty supports all text formats listed there.

Afterwards, you use the trained classifiers together with PoolParty's API to classify documents.

You have two options to add documents to a classifier:

Note

PoolParty currently supports training classifiers with up to 50 categories and about 50-150 documents per category.

We strongly recommend that you do not use all of your existing training documents to train the classifier. Set aside roughly 10% of them for testing the trained classifier before you use it on new and unknown documents.
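The 10% hold-out can be prepared outside PoolParty before you upload anything. The following is a minimal sketch in Python, not part of PoolParty itself: the function name `split_holdout`, the category names, and the file names are hypothetical, and it assumes you keep a list of document paths per category on your own machine.

```python
import random
from typing import Dict, List, Tuple


def split_holdout(
    docs_by_category: Dict[str, List[str]],
    test_fraction: float = 0.1,
    seed: int = 0,
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Split each category's documents into a training and a test set.

    Hypothetical helper: PoolParty does not provide this function.
    The split is done per category so that every category keeps
    roughly the same share of held-out documents.
    """
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train: Dict[str, List[str]] = {}
    test: Dict[str, List[str]] = {}
    for category, docs in docs_by_category.items():
        shuffled = list(docs)  # copy so the caller's list is not modified
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test[category] = shuffled[:n_test]
        train[category] = shuffled[n_test:]
    return train, test


# Example with made-up file names: two categories of 100 documents each.
documents = {
    "finance": [f"finance_{i}.pdf" for i in range(100)],
    "health": [f"health_{i}.pdf" for i in range(100)],
}
train_docs, test_docs = split_holdout(documents)
```

With these assumptions, you would add only the files in `train_docs` to the Train Classifier and keep the files in `test_docs` aside for checking the trained classifier later.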