An Introduction to Topics
Topics are the core of the text analysis in Gavagai Explorer. They are the main subjects in your data. For instance, in hotel reviews, topics like hotel, staff, room and restaurant are usually among the main topics. In addition, there is usually another type of topic in the texts, one that is not related to hotels in general but becomes a topic for one specific hotel because respondents mention it frequently. For example, if your data consists of hotel reviews and many of the respondents describe many things in the hotel as great (e.g. the hotel, the restaurant, their room, etc.), then “great” becomes a topic in your project. Likewise, if many of the respondents complain about a specific issue in the hotel, e.g. renovations, then that issue also becomes a topic in your project.
Each topic in Explorer has a name and includes one or more terms. The topic name is a label assigned by the system or the user for referring to the topic, while the terms are the actual words (or sequences of words) from the data that the topic is composed of. Each term can be included in only one topic. We say a document includes a topic if it includes at least one of the terms in that topic. The frequency of a topic is then defined as the number of documents that include that topic, divided by the total number of documents.
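The frequency definition above can be illustrated with a minimal Python sketch. The function names, the sample reviews and the plain substring matching are assumptions made for illustration only; Explorer's own term matching is more sophisticated than this.

```python
def contains_topic(document: str, terms: list[str]) -> bool:
    """A document includes a topic if it contains at least one of its terms."""
    text = document.lower()
    # Assumption: naive substring matching stands in for Explorer's matching.
    return any(term.lower() in text for term in terms)


def topic_frequency(documents: list[str], terms: list[str]) -> float:
    """Number of documents that include the topic, divided by all documents."""
    if not documents:
        return 0.0
    hits = sum(contains_topic(doc, terms) for doc in documents)
    return hits / len(documents)


# Hypothetical sample data.
reviews = [
    "The staff were friendly.",
    "Great room, great view.",
    "The personnel helped us with our bags.",
    "Breakfast was cold.",
]
staff_topic = ["staff", "personnel"]
print(topic_frequency(reviews, staff_topic))  # 2 of 4 documents -> 0.5
```

Note that a document mentioning several terms of the same topic still counts only once, since frequency is counted per document and not per mention.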
Explorer finds the main topics in your data and lets you revise them. When you run Explorer on a dataset for the first time, it shows you the 30 most frequent topics across all the texts.
For each topic, Explorer reserves a specific area for that topic in both the working panel and the Detail panel. We refer to these areas together simply as the topic area. In addition, each topic has a blue box in the working panel that contains the name of the topic. We refer to this box as the topic box. To the right of each topic box, you can see the topic expander, which contains two numbers. Here you can see the topic box and its expander for the topic “hotel” in a data set of hotel reviews.
The first number (in green) is the number of terms included in that topic and the second one (in blue) is the number of suggestions (see next section). When you click on the topic expander, you can see the included terms and the suggestions.
In the following, we explain the main elements in each topic and we guide you through revising them.
Terms in Explorer
An Introduction to Explorer Language Capabilities
Explorer searches for terms in your data in a more sophisticated way than a simple search engine. Let us explain with an example. Consider the word “income” in the following sentences:
Many people are dissatisfied with their income.
I can't get by on such a small income.
The company's gross income grew considerably this year.
The single most important measure of a company's profitability is net income.
Compensation is far below the market.
In the first two sentences, income is used as a general term, while in the third and fourth sentences the terms gross income and net income are specific types of income. In the last sentence, the word income is not present, but we have the word compensation, which is a synonym of it. Words like income, net income, gross income and compensation are called paradigmatic neighbours. Paradigmatic neighbours are semantically related words which are used in text or speech for related purposes. They can be synonyms, like “income” and “compensation” in the examples above. They can also be words that do not have the same meaning but are related in some way, e.g. “fork” and “spoon”, which are both used for eating.
An n-gram is a sequence of words which frequently appears in the same order in text or speech. Generally, it is important not to split n-grams, as doing so might result in information loss. For instance, “water supply” is an n-gram consisting of the two words “water” and “supply”, and neither of these words can define it individually. Likewise, “San Francisco” is an n-gram which is different from both “San” and “Francisco”. Another capability of Explorer is to identify n-grams in your data and not split them. From now on, when we mention terms in this document we mean n-grams. The value of n depends on the language; it is usually between 1 and 3, but can be higher.
Topic Terms and Synonyms
As mentioned, each topic can include one or more terms. The terms included in a topic are words that can each define that topic independently. Each term can belong to only one topic. For each topic, Explorer shows examples of texts that include the terms in that topic. You can find them in Explorer's Detail panel. Only the 10 most frequent terms are shown when the exploration results load. If there are more than 10 terms, you can see all of them by clicking the ellipsis button. You can also filter the examples by clicking on specific term(s).
In the Detail panel, each term can be either selected (shown in orange), or non-selected (shown in green). When you explore a project, all terms in all topics become selected by default. You can deselect a term by clicking on it in the Detail panel. The examples and sentiments which are shown for a topic in the Detail panel are derived from the selected terms for that topic. Therefore, when you select or deselect a term, you need to press the show sentiments and show examples buttons to see the updated results.
When the project is explored for the first time, the Gavagai Explorer uses its language resources and algorithms to automatically merge words that we consider "synonyms" into the same topic, and these synonyms are taken into consideration when performing the analysis. There are multiple reasons why terms can be considered "synonyms" and merged into the same topic:
The Gavagai Explorer's living lexicon constantly looks for new words that it considers synonyms. Once it has identified words that are strongly related, a Gavagai administrator can mark these words as Synonyms so that they are included as part of the same topic during exploration for all users.
The Gavagai Explorer also utilizes users' anonymized topic models to automatically merge terms into the same topic. If a certain number of users, individually and without knowledge of each other, accept the same synonym suggestions, it becomes more likely that those terms will automatically be included as part of the same topic for other users in the future. This never happens with a single accepted suggestion, or even with several accepted by a single account.
In most languages, there are several words that have the same root but have slight variation due to grammatical rules (such as 'teachers' and 'teacher'). This is especially true in certain languages such as Finnish and Croatian, in which there can be over 100 variants of the same word. In certain languages, the Gavagai Explorer also merges the morphological variants of words together into the same topic to ensure the exploration results are more accurate.
Topics with Sentiment words
Sentiment words are words that have some inherent sentiment or emotion; for example, "good" or "horrible" are sentiment words since they are positive and negative words respectively.
While there are benefits to having "sentiment topics" (topics containing sentiment words), any sentiment results computed for these topics will be adversely skewed, since the sentiment calculation algorithm does not include the topic terms themselves when checking for sentiment words about the topic terms.
As a consequence, sentences such as "the house was nice" will receive a positivity score of 0 when sentiment analysis is performed for a topic containing the word "nice".
Since sentiment analysis plays an important part in the insights retrieved from Gavagai Explorer, sentiment words will never be automatically included in any topics created by the system. That being said, the Gavagai Explorer will always allow you to manually create topics which contain sentiment words, but the sentiment results of these topics may be skewed due to the reasons listed above.
For the terms included in a topic, Explorer automatically finds their paradigmatic neighbours and shows them to you as Suggestions. You can see more suggestions by clicking on the Get words button. When you hover over a suggestion in the working panel, a question mark will appear. You can see examples of the texts that include that suggestion by clicking on this question mark. When you add new terms to your topic, Explorer updates the suggestions list by adding the new possible suggestions. The number of new suggestions is indicated by a little red square under the total number of suggestions. In addition, the new suggestions are shown in a slightly different color from the old ones, with a small red square at their top right corner.
Terms and suggestions are listed in descending order according to their number of mentions. This number is indicated at the bottom right when hovering over a word.
Adding and Removing Terms
You can always expand your topics by adding new terms to them. To add terms, you need to open terms and suggestions by clicking on the topic expander. Then you can either click on suitable suggestions or write them manually in the terms entry field. An auto-complete drop down menu presents terms from topics and related topics as candidates, and you can either select several options or add all of them at once.
To see more suggestions, you can click on the Get words button.
Note that a term included in one topic might appear in the suggestions of another topic. For instance, consider the topic income including the terms “income”, “salary” and “salaries” and the topic compensation including the terms “compensation” and “compensations”. Since the two words compensation and income can be used interchangeably in the same contexts, you see the term “compensation” as a suggestion for the topic income.
In this case, if you choose to accept compensation for the topic income, Explorer will merge the two topics automatically (read more about merging here). This is because the terms in each topic are assumed to be synonyms (so if two terms in two different groups are synonyms, then all terms in those groups are synonyms). In the figure below you can see the new topic income after automatic merging. The topic compensation will become disabled. The next time that you press Explore, you will see the new statistics for the new topic income, and the topic compensation will be removed from the list.
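The merging rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `accept_suggestion` and the dictionary layout are hypothetical, and the sketch removes the merged topic immediately, whereas Explorer first disables it and removes it on the next exploration.

```python
def accept_suggestion(topics: dict[str, set[str]], target: str, term: str) -> None:
    """Add `term` to `target`; if another topic owns it, merge that topic in."""
    owner = next(
        (name for name, terms in topics.items() if term in terms and name != target),
        None,
    )
    if owner is None:
        topics[target].add(term)
    else:
        # All terms of the owning topic move over, so every term still
        # belongs to exactly one topic after the merge.
        topics[target] |= topics.pop(owner)


topics = {
    "income": {"income", "salary", "salaries"},
    "compensation": {"compensation", "compensations"},
}
accept_suggestion(topics, "income", "compensation")
print(sorted(topics["income"]))
# ['compensation', 'compensations', 'income', 'salaries', 'salary']
```

Moving the whole set of terms, rather than the single accepted term, reflects the assumption stated above that all terms within a topic are synonyms of each other.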
The full search feature of the auto-complete drop-down menu presents terms from the project that are not yet members of any topic. As you type your auto-complete term, the full search finds all terms and multi-word expressions that match. Beware, however, that the search only finds multi-word expressions that the Explorer system knows about. This means that you cannot search for expressions of two or more words unless these have been recognized as multi-word expressions in the system. Here is an example: the multi-word expression "san francisco" is a valid search term since it is so prevalent in ordinary language that our system knows about it as such. On the other hand, "daniel san" is perhaps not as common, and therefore a search for those two words will not get a hit even if the project in fact contains one or more texts with those words in sequence. You might ask why the search does not find any arbitrary sequence of words. Like much of Explorer's functionality in general, the full search feature is focused on finding the most important expressions as opposed to everything in detail. This design is a careful balance of utility and performance considerations.
Results from the full search are shown in the auto-complete drop-down menu under the heading "Full search".
You can remove terms from topics by clicking on them in the working panel. When you remove a term, you might still see it in the list of suggestions of other topics and therefore you can add it to them. Note that when you remove a term from a topic, Explorer automatically pins that topic so that you would not lose the topic in the list of topics in case the topic becomes infrequent.
You can also ignore a term in all of your topics by writing it in the text box under Ignore Terms at the top of the details panel and pressing add term (previous figure). When you ignore a term, Explorer ignores it in your analysis. Note, however, that texts including that term can still contribute to the topics.
For each topic, Explorer finds the words that are tightly connected to the terms in that topic, and shows them to you as "related topics". By tightly connected we mean words that appear together repeatedly and closely in the same sentences in different texts. For each topic, the list of related topics can be found under the topic box. You can also find the frequency of each related topic with respect to that topic in front of it. Related topics are ordered in the related topics list based on their frequency and also their closeness to the topics.
Related topics can give you a better understanding of the topics. For example, consider the topic “the restaurant” which is a frequent term in a hotel reviews data set. As you see, the terms “closed”, “renovation” and “under construction” are tightly connected to this topic which means most respondents are complaining about renovations when they mention restaurant in their responses.
You might have noticed that the term “breakfast” is at the bottom of the list although it is more frequent in the topic compared to the terms above it. This is because both frequency and closeness are taken into account when listing the related topics.
For any subset of the set of terms and related topics, you can filter the text examples to those that contain all the words in that subset. You only need to click on the words in the detail panel to filter the texts.
Gavagai Explorer applies word-based (lexical) sentiment analysis principles to quantify the sentiments behind expressed opinions. In the Details Panel, you can see the quantity of three basic sentiments for each topic or group, namely Negativity, Positivity and Skepticism (next figure). When you filter texts by selecting terms and related topics in the details panel, the sentiment values will be updated as well. When you export the result of your analysis to Excel, you can see 8 sentiment values for each text (see the sentiment section). You can also select one or multiple topics for sentiment analysis when you export. This will give the sentiment scores for your selected topics in the report (see Sentiments Per Topic). Moreover, you can model your own sentiments by using Explorer Concept Modeler, and Explorer will then analyze your data for these Concepts (read more in Explorer Concept Modeler).
More about Sentiment Analysis in Explorer
Explorer performs two different types of sentiment analysis: new and classic. Explorer selects the new sentiment algorithm by default. You can switch to the classic sentiment algorithm in your advanced configuration settings located on the Account page. Explorer will strictly apply only one algorithm for the entire Explorer project. When you set an algorithm for your account, you must click Update and Save to apply the new algorithm.
The classic sentiment algorithm performs a sentence level sentiment analysis for each topic. The new algorithm performs a topic level sentiment analysis for the topics. Both algorithms are described further in The Sentiment Analysis Algorithms that Determine the Score.
This difference is important when a sentence has different topics with different sentiments. If a sentence is short and has only one topic, for example "The room is good", then the new and the classic algorithm give the same score.
The Sentiment Scoring System
The system's sentiment scoring gives a score for each sentiment word used to describe a topic. Note that amplification words such as "very" increase the score, while negations such as "not" reduce it.
"The room was good." has a 1 for Sent: Positivity.
"The room was really good." has a 2 for Sent: Positivity.
"The room was not good." has a 1 for Sent: Negativity.
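The three example sentences above can be reproduced with a small Python sketch. It is not Explorer's scoring implementation: the tiny word lists are hypothetical, and the rules used here (each amplifier adds 1, a negation moves the score from Positivity to Negativity) are assumptions inferred only from those three examples.

```python
POSITIVE = {"good"}              # hypothetical, tiny lexicon
AMPLIFIERS = {"really", "very"}  # assumed to add 1 to the score
NEGATIONS = {"not"}              # assumed to flip Positivity to Negativity


def score(sentence: str) -> dict[str, int]:
    scores = {"Positivity": 0, "Negativity": 0}
    words = sentence.lower().strip(".").split()
    for i, word in enumerate(words):
        if word not in POSITIVE:
            continue
        value = 1
        negated = False
        # Walk backwards over the modifiers directly before the sentiment word.
        j = i - 1
        while j >= 0 and words[j] in AMPLIFIERS | NEGATIONS:
            if words[j] in AMPLIFIERS:
                value += 1
            else:
                negated = not negated
            j -= 1
        scores["Negativity" if negated else "Positivity"] += value
    return scores


print(score("The room was good."))         # {'Positivity': 1, 'Negativity': 0}
print(score("The room was really good."))  # {'Positivity': 2, 'Negativity': 0}
print(score("The room was not good."))     # {'Positivity': 0, 'Negativity': 1}
```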
There are eight sentiment categories: Skepticism, Fear, Violence, Hate, Negativity, Love, Positivity, and Desire. The categories avoid ambiguity in sentiment scoring by not containing words that could belong to several categories. For example, "cheap" can carry different sentiments: the sentence "X is cheap" is ambiguous. If X is beer, it is positive; if X is a wedding ring, it is negative. Thus, if there is a word that you believe should be in a sentiment category, please try using it in sentences that express that sentiment and in sentences that express the opposite sentiment, to check. In contrast, words such as "good" that express only one sentiment are added to their respective sentiment category.
When a word from one of the eight categories, for example "good" from Positivity, is used to describe a topic such as hotel, the sentiment score gives a simple 1 for the topic hotel. The sentence "The hotel is good." would receive a 1 for Positivity and 0 for the seven other sentiment categories. If you would like to see the actual numbers for the sentiment scores for each topic, you need to export the analysis as an Excel or CSV file and select the topic you would like to see. See Sentiments Per Topic for more. The sentiment score given for each topic varies when words are combined, as in "really good".
The Sentiment Graphs in the Web Application
In the web application, the sentiments are shown on the right-hand side, both as a percentage breakdown and in absolute values. We see the sentiments Negativity, Skepticism, and Positivity in a bar. This sentiment percentage bar is calculated for the topic that the bar refers to. For each sentiment category, the topic's scores are summed; this sum can be found next to the category's name above the sentiment bar. Then each of the Positivity, Negativity, and Skepticism sums is divided by their total to give a percentage. To see the text examples which are relevant to one specific sentiment category, you can click either on its name or on its color in the bar. It is also possible to replace the basic sentiments in the bar with any other sentiment Gavagai Explorer supports: Love, Hate, Violence, Desire, Fear. You can even have a Neutral sentiment calculated and shown in the bar; this can be adjusted in Project Settings or Account Settings. See Project Settings - Sentiments.
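As a rough illustration of how the percentages in the bar relate to the summed scores, consider the following Python sketch. The function name and the score values are invented for the example; only the arithmetic (each category's sum divided by the total of the displayed categories) follows the description above.

```python
def sentiment_bar(scores: dict[str, int]) -> dict[str, float]:
    """Turn per-category score sums into percentages of their total."""
    total = sum(scores.values())
    if total == 0:
        # No sentiment found for the topic: nothing to show in the bar.
        return {name: 0.0 for name in scores}
    return {name: 100 * value / total for name, value in scores.items()}


# Hypothetical summed scores for the topic "room".
room = {"Positivity": 6, "Negativity": 3, "Skepticism": 1}
print(sentiment_bar(room))
# {'Positivity': 60.0, 'Negativity': 30.0, 'Skepticism': 10.0}
```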
The sentiments are also detailed in the Dashboard, represented in a circular chart.
The Sentiment Analysis Algorithms that Determine the Score
Consider the following text uploaded to The Explorer:
- The room was good and the staff are bad.
The text includes two different topics, and a different opinion is expressed about each. The new sentiment algorithm is more precise in capturing the opinions expressed about individual topics and calculating the sentiment scores. The room topic has a 1 for positivity, and the staff topic has a 1 for negativity. In the web application, there is 100% positivity for room and 100% negativity for staff.
Moreover, for the new sentiment algorithm, not all topics are relevant for sentiment analysis. Explorer only performs sentiment analysis for topics (or topic terms) that are not sentiment terms themselves.
Consider the sentence "The room is nice". If we ask what opinion is expressed about room in this sentence, one can say it is positive: it is nice. Now suppose that instead of room we focus on nice as a topic. Does it make sense to ask "what is the expressed opinion about nice in this sentence"? The answer is no. In fact, nice is not a focus topic in this sentence. The sentence is describing room and not nice. Nice is the word by which the sentence expresses an opinion about room. This is why we do not perform sentiment analysis for sentiment terms.
The classic algorithm takes into account the sentiment words of an entire sentence, regardless of which topics those words describe. For example, the sentence "The hotel was good and the staff are good." will return a 2 for positivity for both the topic hotel and the topic staff. Thus, there is less precision in analyzing the sentiment. In addition, in the web application, the detail panel only displays sentiment for a topic when at least 10 texts have sentiment for it.
The sentence uploaded as an Excel file to Explorer (repeated 10 times since we need at least 10 texts):
- The room was good and the staff are bad.
returns, for the topic room, 1 for Positivity and 1 for Negativity. In the web application, we see that the topic room has 50% Negativity and 50% Positivity. This is a coarse approach and can be inaccurate when calculating sentiments for topics.
However, there are some benefits. The classic algorithm is useful for calculating the sentiment at the sentence level, which is important for comparing individual documents. In addition, there is a slight speed advantage.
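The difference between sentence level and topic level scoring can be made concrete with a short Python sketch. The word lists are hypothetical. How Explorer actually attaches a sentiment word to a topic is not described here, so the topic level function uses an assumed rule (a sentiment word counts for the closest preceding topic term) purely for illustration.

```python
from collections import Counter

TOPIC_TERMS = {"room": "room", "staff": "staff", "hotel": "hotel"}  # term -> topic
SENTIMENT_WORDS = {"good": "Positivity", "bad": "Negativity"}


def tokenize(sentence: str) -> list[str]:
    return sentence.lower().strip(".").split()


def classic_scores(sentence: str) -> dict[str, dict[str, int]]:
    """Sentence level: every topic in the sentence gets all sentiment words."""
    words = tokenize(sentence)
    sentence_score = Counter(SENTIMENT_WORDS[w] for w in words if w in SENTIMENT_WORDS)
    return {TOPIC_TERMS[w]: dict(sentence_score) for w in words if w in TOPIC_TERMS}


def topic_level_scores(sentence: str) -> dict[str, dict[str, int]]:
    """Topic level, with an ASSUMED attribution rule: a sentiment word
    counts for the closest topic term that precedes it."""
    scores: dict[str, Counter] = {}
    current = None
    for w in tokenize(sentence):
        if w in TOPIC_TERMS:
            current = TOPIC_TERMS[w]
            scores.setdefault(current, Counter())
        elif w in SENTIMENT_WORDS and current is not None:
            scores[current][SENTIMENT_WORDS[w]] += 1
    return {topic: dict(counts) for topic, counts in scores.items()}


text = "The room was good and the staff are bad."
print(classic_scores(text)["room"])      # {'Positivity': 1, 'Negativity': 1}
print(topic_level_scores(text)["room"])  # {'Positivity': 1}
```

With the sentence level function, the topic room inherits the negative word that was aimed at staff, which is the 50% / 50% split described above; the topic level function keeps the two opinions apart.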
Sentiment Analysis and the Text Examples in the Web Application
In the detail panel of the web application, Explorer shows examples of texts which are included in the topics. These texts are examples featuring the terms of the topics selected in the detail panel (shown in orange). At the top of each example area, Explorer shows the sentence(s) that feature the topic terms as bullet points. To the right of each sentence, its topic-related sentiments are shown as colored circles, where each color stands for the same sentiment as in the sentiment bar. You can see the complete texts by clicking on 'Show original text' at the bottom right corner of the examples. Here you can also see the sentiments found in the entire text (not specific to the particular topics). Note that Explorer only shows those sentiments that you have chosen in your project.
You can view the text examples corresponding to each sentiment by clicking the sentiment in the sentiment bar or in the sentiment legend; these examples are filtered by the selected sentiment. If you would like to see all examples matching the topic terms, unfiltered by sentiment, you can click on the "Back to all examples" button at the top of the text examples list.
Overall Sentiment of the Project
At the top of the detail panel, under Project Summary, the overall sentiment of the project is shown in a colorful bar. This feature depends on your active sentiments, your sentiment settings and whether the neutral sentiment is on or off. Based on these settings and the contribution of each verbatim to each sentiment, the percentage of each sentiment is calculated for all verbatims. Each time you load or re-explore a project, you need to click on the show sentiment button to see the sentiment bar. Here you can also see text examples by clicking on the show examples button, and you can filter the examples for a specific sentiment by clicking on the related color on the sentiment bar.
More about n-grams in Explorer
Now that you have learned about topics, topic terms and sentiment analysis, it is worth learning a bit more about n-grams in Explorer as well. As mentioned before, n-grams are topic terms consisting of multiple words, for example "San Francisco". Explorer identifies n-grams in your data and shows them to you as topics if they are frequent enough. You can also add n-grams to your topics manually. The most important characteristic of n-grams is that they are treated as one single entity, and therefore the uni-grams included in an n-gram cannot contribute to topics individually. This is the case for both topic counts and sentiment analysis. As an example, suppose that you have an n-gram "junk food". Also assume that you have a topic FOOD which includes "food" as a topic term but not "junk food". Now, for the sentence "I don't like junk food at all", Explorer does not consider the topic FOOD to be included, because "junk food" is a bi-gram.
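The "junk food" example can be sketched in Python as follows. The set of known n-grams, the function names and the greedy longest-match tokenization are assumptions made for illustration; they are not a description of how Explorer detects n-grams internally.

```python
KNOWN_NGRAMS = {("junk", "food"), ("san", "francisco")}  # hypothetical
MAX_N = 2  # longest n-gram in KNOWN_NGRAMS


def tokenize(sentence: str) -> list[str]:
    """Split into terms, keeping known n-grams together as single terms."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    tokens = []
    i = 0
    while i < len(words):
        # Prefer the longest known n-gram starting at position i.
        for n in range(MAX_N, 1, -1):
            candidate = tuple(words[i:i + n])
            if candidate in KNOWN_NGRAMS:
                tokens.append(" ".join(candidate))
                i += n
                break
        else:
            tokens.append(words[i])
            i += 1
    return tokens


def includes_topic(sentence: str, terms: set[str]) -> bool:
    return any(token in terms for token in tokenize(sentence))


FOOD = {"food"}
print(tokenize("I don't like junk food at all"))
# ['i', "don't", 'like', 'junk food', 'at', 'all']
print(includes_topic("I don't like junk food at all", FOOD))  # False
print(includes_topic("The food was great", FOOD))             # True
```

Because "junk food" is emitted as one term, the uni-gram "food" never appears on its own in the first sentence, so the topic FOOD is not matched.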
Auto-Add Terms in Explorer
When working with topics, you may notice that there are several n-grams containing a general topic term you are interested in (for example, you could be interested in the general topic term 'food' and the Gavagai Explorer detects the n-grams 'fast food', 'junk food' and 'food delivery'), and you may want to include all variants of the general topic term in the topic. One way of achieving this is to use the auto-complete drop-down menu and select all variants of the term. The drawback of doing this is that once data is appended to the project and new n-grams are detected in the data, the same process must be repeated for all such general terms if you wish to keep the topic definition up to date with the current data.
Alternatively, you can utilize the 'Auto-Add Terms' feature:
When you add a term in the 'Auto-Add Terms' section of the topic, Explorer will automatically add any n-grams containing this term to the topic when the project is explored. Any n-gram which matches an Auto-Add Term but which is already part of another topic will, however, not be included in the topic. Once a project is explored after data is appended to it, newly detected n-grams in the new dataset matching any Auto-Add Terms will be added to the appropriate topics and an email notification will be sent specifying the terms which were added to the topics.
If there are terms which have automatically been added to a topic as part of an Auto-Add Term, and you would like to exclude one or more of these terms from the topic (for example, if you have an Auto-Add Term 'food' and you want to exclude the automatically added n-gram 'cat food' from the topic), you can always click the terms, as usual, to remove them. However, such a term will now be added to the project's Ignore Terms to ensure that it is not automatically re-added to the topic. If you want this term added to another topic instead, remove the term from the project's Ignore Terms and then add the term to the other topic.
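The Auto-Add behaviour described in this section can be summarized in a short Python sketch. All names and the data layout are hypothetical, and matching an n-gram by checking whether it contains the general term as a whole word is an assumption; the returned dictionary merely stands in for the information sent in the email notification.

```python
def apply_auto_add(
    topics: dict[str, set[str]],
    auto_add: dict[str, list[str]],
    detected_ngrams: list[str],
    ignored: set[str],
) -> dict[str, list[str]]:
    """Add matching n-grams to topics in place; return what was added."""
    added: dict[str, list[str]] = {}
    for topic, general_terms in auto_add.items():
        for ngram in detected_ngrams:
            words = ngram.split()
            # Only multi-word terms qualify, and Ignore Terms are never re-added.
            if len(words) < 2 or ngram in ignored:
                continue
            if not any(term in words for term in general_terms):
                continue
            # An n-gram that already belongs to a topic is left where it is.
            if any(ngram in terms for terms in topics.values()):
                continue
            topics[topic].add(ngram)
            added.setdefault(topic, []).append(ngram)
    return added


topics = {"food": {"food"}, "delivery": {"food delivery"}}
auto_add = {"food": ["food"]}
detected = ["fast food", "junk food", "food delivery", "cat food", "room service"]
ignored = {"cat food"}

added = apply_auto_add(topics, auto_add, detected, ignored)
print(added)  # {'food': ['fast food', 'junk food']}
```

Here 'food delivery' is skipped because it already belongs to another topic, and 'cat food' is skipped because it was placed in Ignore Terms after being removed from the topic.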