Web Service Method: Extract from File

Abstract

Description
[file] Extracts and returns meaningful metadata, such as concepts and terms, from an uploaded file.

URL: /extractor/api/extract

Request

Supported Methods
POST

Content-Type: multipart/form-data

HTTP Parameters

Parameter	Type	Required	Description
categorizationWithPpxBoost	boolean	false	Use Extractor boosting, default = false
categorize	boolean	false	Categorization extraction, default = false
charset	String	false	Character set used in the File
conceptMinimumScore	Double	false	Minimum required score of concepts, default = 0
conceptSchemeFilters	Array of String	false	Concept scheme URI filters
corpusScoring	Array of String	false	Corpus term scoring. Enabled if corpusIds (UUID) are provided.
customAttributeFilters	Array of CustomProperty	false	Custom attribute (property uri and string value) filters
customClassFilters	Array of String	false	Custom class URI filters
disambiguate	boolean	false	Use thesaurus based disambiguation, default = false
displayText	boolean	false	Include the extracted text in the response, default = false
documentClassifierIds	Array of String	false	Enable document classification by giving the document classifier IDs as input.
documentId	String	false	Internal ID of the document
extraConceptLanguages	Array of PPLocale	false	Additional languages used for concept extraction (en|de|es|fr|...). Also supports the wildcard * for all languages.
extractorVersion	String	false	Version of PPX Extractor used
file	MultipartFile	true	File to be extracted (Word, Excel, PowerPoint, PDF, OpenDocument). The request must be sent as 'multipart/form-data'.
filterNestedConcepts	boolean	false	Remove concepts matches which are contained within other matches, default = false
findPersonNames	boolean	false	Deprecated (use nerParameters) - extracts person names from the given text
language	PPLocale	false	Extraction language (en|de|es|fr|...)
lemmatization	boolean	false	Use lemmatization, default = false
locationExtraction	boolean	false	Deprecated (use nerParameters) - extracts locations from the given text
metadata	String	false	Metadata of the document (concatenated fields with delimiter: '.')
nerParameters	Array of NERConfig	false	Array of models that are used for Named Entity Recognition
numberOfConcepts	Integer	false	Retrieve number of concepts, default = 25
numberOfTerms	Integer	false	Retrieve number of terms, default = 25
phraseLength	Integer	false	Phrase length, default = 4
projectId	Array of String	false	Thesaurus projectIds
properties	Array of String	false	Array of custom class attributes and relations that will be fetched by providing their property URIs as input. Set to "all" to fetch all properties.
regexFilename	String	false	File name for regex patterns
sentimentAnalysis	boolean	false	Sentiment analysis, default = false
shadowConceptCorpusId	Array of String	false	Shadow concepts calculation. Enabled if corpusIds (UUID) are provided.
showMatchingDetails	boolean	false	Shows which concept labels were found inside the text, default = false
showMatchingPosition	boolean	false	Shows the position of the matched text. Only shown if showMatchingDetails = true, default = false
tfidfScoring	boolean	false	Use TFIDF scoring
title	String	false	Title of the document
useRelatedConcepts	boolean	false	Retrieve related concepts, default = false
useTransitiveBroaderConcepts	boolean	false	Retrieve transitive broader concepts, default = false
useTransitiveBroaderTopConcepts	boolean	false	Retrieve transitive broader top concepts, default = false
useTypes	boolean	false	Retrieve custom types for concepts, default = false
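
As an illustration of the parameter table above, the following Python sketch assembles a multipart/form-data POST request for /extractor/api/extract using only the standard library. It builds the request without sending it. The function name build_extract_request, the host https://example.org and the parameter values are hypothetical. Two encoding choices are assumptions that this page does not confirm: array parameters are sent as repeated form fields, and booleans are sent as lowercase "true"/"false". Authentication is not covered here.

```python
import uuid
import urllib.request


def build_extract_request(base_url, file_name, file_bytes, params):
    """Build (but do not send) a multipart/form-data POST for /extractor/api/extract."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in params.items():
        # Assumption: array parameters are sent as repeated form fields.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            # Assumption: booleans are serialized as lowercase "true"/"false".
            text = str(item).lower() if isinstance(item, bool) else str(item)
            parts.append(
                f'--{boundary}\r\n'
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f'{text}\r\n'.encode("utf-8")
            )
    # The required "file" parameter carries the document itself.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        base_url.rstrip("/") + "/extractor/api/extract",
        data=b"".join(parts),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


# Hypothetical usage: project ID, language and file content are placeholders.
request = build_extract_request(
    "https://example.org",
    "report.pdf",
    b"%PDF-1.4 placeholder",
    {"projectId": ["my-project-id"], "language": "en", "numberOfConcepts": 10},
)
```

The resulting object could then be passed to urllib.request.urlopen once the credentials required by the server have been added.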

CustomProperty

Custom property

Attribute	Type	Comment
property	String	Property
value	String	Value

PPLocale

A PPLocale object

Attribute	Type	Comment
ALL_LANGUAGES	PPLocale
DUTCH	PPLocale
ENGLISH	PPLocale
FRENCH	PPLocale
GERMAN	PPLocale
RUSSIAN	PPLocale
SPANISH	PPLocale
VALID	PPLocale
country	String
language	String
languageTag	String

MultipartFile

A MultipartFile object

NERConfig

Named Entity Recognition configuration

Attribute	Type	Required	Comment
classUri	String	false	Class URI given to identified Named Entities
method	Method	false	Method used for Named Entity Extraction (RULE_BASED|MAXIMUM_ENTROPY), default = MAXIMUM_ENTROPY
type	String	false	Type of Named Entity Model. Pre-defined models for MAXIMUM_ENTROPY: person, organization, location

Example of a Named Entity Recognition configuration:

{
  "classUri": "some classUri",
  "method": "RULE_BASED",
  "type": "https://semantic-web.com/api/type#13359"
}
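
A NERConfig object such as the one above can also be produced programmatically. The sketch below is a minimal helper that checks the method against the two documented values and serializes a list of configurations to JSON. The helper names ner_config and ner_parameters_json are hypothetical, and passing nerParameters as a JSON array string is an assumption that this page does not confirm.

```python
import json

# The two methods listed in the NERConfig table.
ALLOWED_METHODS = {"RULE_BASED", "MAXIMUM_ENTROPY"}


def ner_config(model_type, method="MAXIMUM_ENTROPY", class_uri=None):
    """Return one NERConfig as a dict, rejecting undocumented methods."""
    if method not in ALLOWED_METHODS:
        raise ValueError(f"method must be one of {sorted(ALLOWED_METHODS)}")
    config = {"method": method, "type": model_type}
    if class_uri is not None:
        config["classUri"] = class_uri
    return config


def ner_parameters_json(configs):
    """Serialize NERConfig dicts (assumption: sent as a JSON array string)."""
    return json.dumps(list(configs))


# The pre-defined MAXIMUM_ENTROPY models named in the table above.
payload = ner_parameters_json(
    [ner_config("person"), ner_config("organization"), ner_config("location")]
)
```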

Response

Content-Type: application/json

Arrays of Response Attributes

FileExtractionResponse

Results of a file-based text extraction request. Properties with no entries are not present.

Attribute	Type	Comment
document	ExtractionResponse	Extraction result
metadata	ExtractionResponse	Metadata extraction result
text	String	File text content
title	String	File title

ExtractionResponse

Results of a text extraction request. Properties with no entries are not present.

Attribute	Type	Comment
categories	Array of Category	Categories of the document
classificationResults	Array of DocumentClassification	Document classification results
concepts	Array of ThesaurusConcept	Matched concepts
detectedLanguage	PPLocale	Detected Language of the document
extractedTerms	Array of ExtractedTerm	Extracted freeTerms
locations	Array of Location	Matched locations
namedEntities	Array of NamedEntityResponse	Named Entities
personNames	Array of String	Deprecated
regexMatches	Array of RegexMatches	Regex token matches
sentiments	Array of Sentiment	Matched sentiments
shadowConcepts	Array of ShadowConceptResponse	Shadow Concepts
text	String	Text as extracted from url or file
title	String	Title as extracted from url or file
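
Because properties with no entries are omitted from the response, client code should not assume that any key is present. The following sketch reads a FileExtractionResponse defensively, falling back to empty defaults. The function name summarize_file_extraction and the sample payload are hypothetical; only the attribute names come from the tables above.

```python
import json


def summarize_file_extraction(response_json):
    """Collect a few fields from a FileExtractionResponse, tolerating absent keys."""
    data = json.loads(response_json)
    # "document" holds the ExtractionResponse; it may be missing entirely.
    document = data.get("document", {})
    return {
        "title": data.get("title"),
        "text": data.get("text"),
        "concept_uris": [c.get("uri") for c in document.get("concepts", [])],
        "terms": [t.get("textValue") for t in document.get("extractedTerms", [])],
    }


# Hypothetical payload, for illustration only.
sample = '{"title": "Report", "document": {"concepts": [{"uri": "http://example.org/c1"}]}}'
summary = summarize_file_extraction(sample)
```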

Category

Categorization result

Attribute	Type	Comment
categoryConceptResults	Array of ConceptCategory	Categorized concepts
prefLabel	String	Preferred label
score	double	Score between 0.0-100.0
uri	String	Category URI

ConceptCategory

Categorized concept

Attribute	Type	Comment
prefLabel	String	Preferred label
score	double	Score from 0.0 to 100.0
uri	String	URI

DocumentClassification

A DocumentClassification object.

Attribute	Type	Comment
predictedLabel	String	predictedLabel
probabilities	Array of Prediction	Probabilities
uri	String	URI of the classifier

ThesaurusConcept

Concept from a PoolParty thesaurus project.

Attribute	Type	Comment
altLabels	Map of PPLocale	Alternative labels
broaderConcepts	Array of String	URIs of all direct broader concepts
conceptSchemes	Array of ThesaurusConceptScheme	The concept schemes this concept resides in.
corporaScore	Double	Relevance score - e.g. when extracted from a text.
customAttributes	Array of CustomAttribute	Custom attributes
customRelations	Array of CustomRelation	Custom relations
customSchemeTypes	Array of CustomSchemeType	URIs of the custom types assigned to the concept
frequencyInDocument	int	Frequency of the concept in the text
frequencyInDocuments	int	Frequency of the concept in the text
hiddenLabels	Map of PPLocale	Hidden labels
id	String	Concept id
languages	Array of PPLocale	Language of the prefLabel, altLabels and hiddenLabels of this localized view of the concept.
matchingLabels	Array of MatchingLabel	Matching labels
prefLabels	Map of PPLocale	Preferred label
project	String	UUID of the containing PoolParty project
relatedConcepts	Array of String	URIs of all related concepts
score	double	Normalized relevance score - e.g. when extracted from a text.
transitiveBroaderConcepts	Array of String	URIs of all transitive broader concepts
transitiveBroaderTopConcepts	Array of String	URIs of all top concepts that this concept is connected to via a transitive broader-chain.
uri	String	Uniform resource identifier
wordForms	Array of String	Lemmatized word forms
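
The concepts array can be post-processed on the client, for example to keep only the highest-scoring matches. The sketch below filters and ranks ThesaurusConcept entries by their score attribute. The function name top_concepts and the sample values are hypothetical, and treating a missing score as 0.0 is an assumption made for this example.

```python
def top_concepts(concepts, minimum_score=0.0, limit=25):
    """Return (uri, score) pairs sorted by descending score."""
    scored = [
        # Assumption: a concept without a score is treated as 0.0.
        (concept.get("uri"), float(concept.get("score", 0.0)))
        for concept in concepts
    ]
    kept = [pair for pair in scored if pair[1] >= minimum_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


# Hypothetical concepts, for illustration only.
ranked = top_concepts(
    [
        {"uri": "http://example.org/a", "score": 40.0},
        {"uri": "http://example.org/b", "score": 90.0},
        {"uri": "http://example.org/c"},
    ],
    minimum_score=10.0,
)
```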

ThesaurusConceptScheme

ConceptScheme from a PoolParty thesaurus project - acts as a container for concepts.

Attribute	Type	Comment
title	String	The localized title of this concept scheme
uri	String	Uniform resource identifier

CustomAttribute

Custom attribute

Attribute	Type	Comment
literal	Literal	Literal
property	String	Property

CustomRelation

Custom Relation

Attribute	Type	Comment
object	String	Object
property	String	Property

CustomSchemeType

(PoolParty) concept scheme - acts as a container for concepts

Attribute	Type	Comment
title	String	The name of this custom scheme type
uri	String	Uniform resource identifier

ExtractedTerm

Phrase extracted from a text that does not match any concepts

Attribute	Type	Comment
corporaScore	Double	Corpora score
frequencyInDocument	int	Frequency within the document where it was extracted.
frequencyInDocuments	int	Frequency within the documents where it was extracted.
score	Double	Relevance score
textValue	String	The term phrase

Location

A geographical location extracted from a text.

Attribute	Type	Comment
countryCode	String	ISO 3166-1 alpha-2 country code
latitude	float	Latitude
longitude	float	Longitude
matchedLabel	String	The location label that was found in the text
name	String	Common name of the location
score	Double	Relevance score
type	LocationType	Location type, either city or country (City|Country)
uri	String	Uniform resource identifier of the location.

NamedEntityResponse

Named Entity

Attribute	Type	Comment
frequency	int	Frequency in document
metadata	Map of String	Metadata
method	String	Method
positions	Array of SimpleTokenPosition	Position
score	double	Score
textValue	String	Matched text
type	String	Type

RegexMatches

Regex match

Attribute	Type	Comment
regexMatches	Array of String	Tokens from the input text that match the regex pattern
regexPattern	String	The original pattern used to match

Sentiment

Sentiment result

Attribute	Type	Comment
negativeTerms	Array of String	List of negative terms
positiveTerms	Array of String	List of positive terms
score	float	Score
sentiment	String	Sentiment

ShadowConceptResponse

Shadow Concept

Attribute	Type	Comment
altLabels	Map of PPLocale	Alternative labels
broaderConcepts	Array of String	URIs of all direct broader concepts
conceptSchemes	Array of ThesaurusConceptScheme	The concept schemes this concept resides in
corporaScore	Double	Relevance score - e.g. when extracted from a text
customAttributes	Array of CustomAttribute	Custom attributes
customRelations	Array of CustomRelation	Custom relations
customSchemeTypes	Array of CustomSchemeType	URIs of the custom types assigned to the concept
hiddenLabels	Map of PPLocale	Hidden labels
id	String	Concept id
languages	Array of PPLocale	Language of the prefLabel, altLabels and hiddenLabels of this localized view of the concept
prefLabels	Map of PPLocale	Preferred label
project	String	UUID of the containing PoolParty project
relatedConcepts	Array of String	URIs of all related concepts
score	double	Normalized relevance score - e.g. when extracted from a text
shadowConceptTerms	Array of ShadowTerm	Extracted terms that contribute to calculation of the shadow concept
transitiveBroaderConcepts	Array of String	URIs of all transitive broader concepts
transitiveBroaderTopConcepts	Array of String	URIs of all top concepts that this concept is connected to via a transitive broader-chain
uri	String	Uniform resource identifier

ShadowTerm

Phrase extracted from a text that does not match any Concepts

Attribute	Type	Comment
score	double	Relevance score
textValue	String	The term phrase
