Calculating the word frequencies and ranking
Posted: Mon Dec 23, 2024 5:26 am
2. Word Frequencies
When there is a large volume of text items to consider, the first task is to find which words occur most frequently, to get an idea of the important topics. This involves “shredding” the text into words. The results can be displayed in a table or a Word Cloud, which uses font size to reflect the relative frequencies.
This can work well when comparing reviews of two products or trying to distil the essence of a large number of reviews. The two word clouds above are for horror films and comedies. In practice you need to be able to exclude very common words and also those that are generic in your topic area, e.g. “hotel” for a holiday company. To analyse hashtags you need to filter to only include words prefixed by #.
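As a rough illustration of the shredding and filtering described above, here is a minimal Python sketch. It is not how FastStats does it; the function name word_frequencies, the two exclusion lists and the sample reviews are all made up for the example, and a real stop-word list would be far longer.

```python
import re
from collections import Counter

# Hypothetical exclusion lists, for illustration only.
COMMON_WORDS = {"the", "a", "an", "and", "was", "is", "in", "of", "to", "it"}
TOPIC_WORDS = {"hotel"}  # generic word for a holiday company

def word_frequencies(texts, exclude=frozenset(), hashtags_only=False):
    """Shred each text into lower-case words and count how often each occurs."""
    # Either keep only hashtags, or keep plain words and hashtags alike.
    pattern = r"#\w+" if hashtags_only else r"#?\w+(?:'\w+)?"
    counts = Counter()
    for text in texts:
        for word in re.findall(pattern, text.lower()):
            if word not in exclude:
                counts[word] += 1
    return counts

# Made-up sample reviews.
reviews = [
    "The hotel was clean and the pool was great #holiday",
    "Great pool, but the hotel food was poor #holiday #neveragain",
]

freq = word_frequencies(reviews, exclude=COMMON_WORDS | TOPIC_WORDS)
tags = word_frequencies(reviews, hashtags_only=True)

print(freq.most_common(3))   # the most frequent remaining words
print(tags.most_common())    # hashtag counts only
```

The counts returned here are what a table or word cloud would be drawn from, with font size scaled by each word's count.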
3. Scoring Words
When we read a review we can use our knowledge of language, the subject area and the audience to quickly judge whether it expresses a good, bad or neutral opinion. Faced with large quantities of reviews, it would be useful to automate this process. One approach is to train a model to recognise the keywords associated with known good or bad reviews, then use this model to “score” new reviews.
The keywords and their coefficients define the model. For example, using our knowledge of language and some gut feeling, we could assign the following values to keywords
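To show how a keyword model like this might be applied, here is a small Python sketch. The keywords and values in COEFFICIENTS are invented for the example and are not the values from the original post, and score_review is a hypothetical helper. It assumes a review's score is simply the sum of the coefficients of the keywords it contains.

```python
import re

# Hypothetical gut-feeling coefficients, for illustration only.
COEFFICIENTS = {"excellent": 2.0, "good": 1.0, "poor": -1.0, "terrible": -2.0}

def score_review(text, coefficients):
    """Sum the coefficients of the distinct keywords present in the review."""
    words = set(re.findall(r"\w+", text.lower()))
    return sum(coef for keyword, coef in coefficients.items() if keyword in words)

# One positive and one negative keyword: 2.0 + (-1.0)
print(score_review("Excellent location but poor breakfast", COEFFICIENTS))
```

A positive total suggests a good review, a negative total a bad one, and a total near zero a neutral or mixed one.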
The model is improved by basing the coefficients on the odds of the keyword appearing in good vs. bad reviews. Using this technique FastStats can calculate the coefficient estimates from a training set of reviews.
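One way to base coefficients on odds is a smoothed log odds ratio, sketched below in Python. This is an assumption about the general idea, not a description of the calculation FastStats actually performs. The function keyword_log_odds, the smoothing constant and the tiny training set are all made up for the example.

```python
import math
import re

def keyword_log_odds(good_reviews, bad_reviews, keywords, smoothing=0.5):
    """Estimate one coefficient per keyword as a smoothed log odds ratio."""
    # Shred each review once into its set of distinct lower-case words.
    good_sets = [set(re.findall(r"\w+", r.lower())) for r in good_reviews]
    bad_sets = [set(re.findall(r"\w+", r.lower())) for r in bad_reviews]

    coefficients = {}
    for keyword in keywords:
        g = sum(keyword in words for words in good_sets)  # good reviews containing it
        b = sum(keyword in words for words in bad_sets)   # bad reviews containing it
        # Odds of the keyword appearing vs. not appearing in each group.
        # The smoothing term avoids division by zero and log(0).
        odds_good = (g + smoothing) / (len(good_sets) - g + smoothing)
        odds_bad = (b + smoothing) / (len(bad_sets) - b + smoothing)
        coefficients[keyword] = math.log(odds_good / odds_bad)
    return coefficients

# Made-up training set of known good and bad reviews.
good = ["great food and friendly staff", "great view", "friendly and clean"]
bad = ["dirty room and rude staff", "rude and noisy", "great location but dirty"]

coefs = keyword_log_odds(good, bad, ["great", "dirty", "staff"])
for keyword, coef in coefs.items():
    print(keyword, round(coef, 2))
```

Keywords seen mostly in good reviews get positive coefficients, those seen mostly in bad reviews get negative ones, and a keyword that appears equally often in both ends up at zero.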