TFIDF (term frequency-inverse document frequency) is a technique used in machine learning and information retrieval to quantify the importance of terms in a text corpus. It is useful not only for preparing training data by removing uninformative terms, but also in information retrieval, for finding the documents relevant to a given query.
The TFIDF score is computed by multiplying the term frequency (TF) and the inverse document frequency (IDF), which are defined as follows:
Term Frequency
The term frequency measures how often a given term appears in a single document. Most of the time it is computed by simply counting the occurrences of each term individually, though other weighting schemes, such as boolean frequencies (1 if the term occurs in the document, 0 otherwise), may also be used.
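Both variants described above can be sketched in a few lines of Python. This is a minimal illustration, assuming simple whitespace tokenization and lowercasing; the function names are chosen for this example only:

```python
from collections import Counter

def term_frequency(document):
    """Raw term frequency: count each term's occurrences in one document."""
    tokens = document.lower().split()
    return Counter(tokens)

def boolean_frequency(document):
    """Boolean variant: 1 if the term occurs in the document, 0 otherwise."""
    return {term: 1 for term in set(document.lower().split())}

tf = term_frequency("the cat sat on the mat")
# tf["the"] == 2, tf["cat"] == 1
```

Real systems usually replace the naive `split()` tokenizer with proper tokenization and normalization, but the counting logic stays the same.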
Inverse Document Frequency
IDF measures the importance of a given term across a collection of documents. If a term appears in a large number of documents, it is assumed to carry little distinguishing information.
IDF is calculated by dividing the total number of documents by the number of documents a given term appears in, and log-scaling the result. This means that a term only gets a high IDF value if it appears in a small number of documents.
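The calculation just described can be written directly in Python. This is a sketch under the same whitespace-tokenization assumption as above; it uses the plain, unsmoothed formula log(N / df), so it assumes the term occurs in at least one document:

```python
import math

def inverse_document_frequency(term, documents):
    """IDF: log of (total number of documents / documents containing the term)."""
    tokenized = [set(doc.lower().split()) for doc in documents]
    containing = sum(1 for tokens in tokenized if term in tokens)
    # Assumes containing > 0; production variants add smoothing to avoid
    # division by zero for unseen terms.
    return math.log(len(documents) / containing)

docs = ["the cat sat", "the dog barked", "a cat and a dog"]
# "the" appears in 2 of 3 documents -> low IDF
# "barked" appears in only 1 of 3 -> higher IDF
```

Libraries such as scikit-learn use smoothed variants of this formula, but the intuition is identical: rarer terms get larger weights.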
TFIDF is a way to vectorize a text corpus by multiplying the TF and IDF of every term. For a given document, the TFIDF score of a term is high if the term frequency within that document is high and the term does not appear in a large number of documents.
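Putting the two parts together, a small end-to-end sketch might look as follows. As before, tokenization is naive whitespace splitting and the IDF is unsmoothed; the function name `tfidf_vectors` is chosen for this example:

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Compute a TF-IDF weight for every term in every document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # TF-IDF = raw count * log(total docs / docs containing the term)
        vectors.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors

docs = ["the cat sat", "the dog barked", "the cat chased the dog"]
vecs = tfidf_vectors(docs)
# "the" appears in every document, so its IDF (and hence TF-IDF) is log(3/3) = 0
```

Note how a term like "the", which occurs in every document, is weighted down to zero, while document-specific terms like "barked" keep a positive weight.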
More information about Natural Language Processing can be found in our article series NLP Insights.