Textsearch

The textsearch module is an interface for searching a database of texts. The implementation naively searches for the given text fragments and returns their absolute frequencies, which can then be visualised.
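
To make "absolute frequency" concrete, here is a minimal sketch of naive fragment counting. It is not the module's actual implementation: the function name count_fragment, the sample domains and texts, and the case-insensitive matching are assumptions made purely for illustration.

```python
def count_fragment(text: str, fragment: str) -> int:
    """Count non-overlapping occurrences of `fragment` in `text`.

    Assumption: matching is case-insensitive. The real service may differ.
    """
    return text.lower().count(fragment.lower())


# Hypothetical texts keyed by domain (not real crawl data).
texts = {
    "a.example": "Divestment now. Fossil divestment works.",
    "b.example": "No mention here.",
}

# Absolute frequency of one fragment per domain.
frequencies = {
    domain: count_fragment(text, "divestment") for domain, text in texts.items()
}
print(frequencies)
```

An absolute frequency is a raw count; it is not normalised by the length of the text.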

First, you need access to the interface. If you are part of the DSS-Chair, please request access from Olaf Kellermeier. You also need to request the so-called Snapshot-ID (you may need more than one). Once you have received your access token and your Snapshot-ID, you are ready to start.

Note

For advanced users: The Snapshot-ID can also be requested directly at the API endpoint. How to do this is described in the documentation of the WDC endpoint: https://dss-wdc.wiso.uni-hamburg.de/

Searching for frequencies of specific terms

Import your token as described in the introduction to the API.

Now we create an instance of a TextSearch class:

import dsslab.net_bench as dnb
ts = dnb.TextSearch("20210424_cliccs", token=wdc_token)

If you do not know the name of the snapshot yet, you can look it up like this:

ts = dnb.TextSearch(None, token=wdc_token)
snapshots = ts.get_snapshots(name_tag="cliccs")
The variable snapshots holds all snapshots that are available under that name tag. If name_tag is omitted, all available snapshots are listed. Once you know the snapshot name, pass it instead of None when creating the TextSearch instance.

After defining the search terms, we can send a query:

terms = [
    '"divestment"',
    '"sustainable investment"',
]
# The `graph` object needs to be created beforehand.
# `er` holds the empty responses, if there are any.
graph, er = ts.search(graph, terms=terms)

Warning

In this case, '"sustainable investment"' is additionally wrapped in double quotation marks because it contains a space. With single quotation marks only, Textsearch would search for the two words separately: "sustainable" and "investment". What we want to search for is the exact phrase "sustainable investment", which is achieved by adding the double quotation marks inside the string.
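
The quoting rule can be sketched as a small helper. The function quote_phrase is hypothetical and not part of dsslab; it only wraps terms that contain a space and leaves single words unchanged (the example above also quotes the single word "divestment", which this sketch does not do).

```python
def quote_phrase(term: str) -> str:
    """Wrap a multi-word term in double quotes so it is searched as one phrase."""
    term = term.strip()
    already_quoted = len(term) >= 2 and term.startswith('"') and term.endswith('"')
    if " " in term and not already_quoted:
        return f'"{term}"'
    return term


# Single words pass through, phrases get inner double quotes.
terms = [quote_phrase(t) for t in ["divestment", "sustainable investment"]]
print(terms)
```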

In addition to the graph itself, any empty responses are returned. An empty response occurs if the crawler couldn't download text because of technical problems or because the website no longer exists.

From now on, the graph object holds the absolute frequencies for the texts as part of its attributes.

Note

Alternatively, you can pass a list of domains to the search if you only want to look at some domains and don't have a graph object yet.

Complex search terms and their combinations

ts.search(graph, terms=terms) accepts a multitude of types for terms:

  • List of Strings: ["sustainable", "investment"]. Each string in the list is searched separately. The search term also serves as the key in this case.
  • List of List of Strings: [["sustainable", "investment"], ["divestment"]]. Each of the inner lists is accumulated into a single value. This means that the frequencies of "sustainable" and "investment" are added up and returned under the key sustainable (the first value of each inner list is used as the key).
  • Dictionary of Strings: {"Sustainability": "sustainable"}. This returns the frequency of the search term "sustainable" under the key Sustainability. Complex queries are possible this way: {"Sustainability": "sustainable OR Sustainability OR 'sustainable economy'"} returns the value for the search term sustainable OR Sustainability OR 'sustainable economy' under the key Sustainability.
  • Dictionary of Strings as keys and Lists as values: {"sustainable": ["sustainable", "renewable"], "divestment": ["divestment"]}. The values of a key are accumulated under the same key.
  • pandas Series: Works like a Dictionary of Strings.
  • pandas DataFrame: Each column is a collection of search terms whose frequencies are accumulated per column. The key is the name of the column.

Mixing the types is not possible. Dictionaries are preferable to Lists, especially if you want to create legends.
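
The key and accumulation rules for the list and dictionary forms can be summarised in a short sketch. This is not the library's code: normalise_terms is a hypothetical helper that maps each accepted shape to "key -> search terms whose frequencies are summed", and the pandas variants are left out to keep the example dependency-free.

```python
from typing import Dict, List, Union

Terms = Union[List[str], List[List[str]], Dict[str, str], Dict[str, List[str]]]


def normalise_terms(terms: Terms) -> Dict[str, List[str]]:
    """Map each accepted shape to {key: [terms whose frequencies are summed]}."""
    if isinstance(terms, dict):
        # A string value is a single term; a list value is summed under its key.
        return {
            key: [value] if isinstance(value, str) else list(value)
            for key, value in terms.items()
        }
    if all(isinstance(item, str) for item in terms):
        # List of strings: every term is its own key.
        return {item: [item] for item in terms}
    if all(isinstance(item, (list, tuple)) for item in terms):
        # List of lists: the first term of each inner list doubles as the key.
        return {group[0]: list(group) for group in terms}
    raise TypeError("mixing strings and lists is not supported")


print(normalise_terms([["sustainable", "investment"], ["divestment"]]))
print(normalise_terms({"Sustainability": "sustainable"}))
```

With a dictionary the key is chosen explicitly, which is why dictionaries are the more convenient form when the key is later reused as a legend label.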

Drawing of Attributes

The integration writes the data returned by the search query directly into the graph object under a specific string key. To use a term from the graph for visualisation, use the helper class dnb.Code:

divt = str(dnb.Code("divestment", dnb.Category.TEXT))

Here, multiple things are happening simultaneously:

  • dnb.Category.TEXT assigns the type (meaning the origin of the attribute). dnb.Category.TEXT signifies that the origin is the text base, while dnb.Category.MANUAL signifies manually coded attributes from dssCode.
  • dnb.Code("divestment", dnb.Category.TEXT) creates an object that looks up the attribute "divestment" in the text base.
  • The object is directly converted into a str. For this example that would be "text:divestment".

Note

If you want, you can of course also create the string above manually.
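
Creating the string manually could look like the following sketch. Only the "text:" prefix is taken from the example above; the helper name text_attribute is hypothetical, and the prefixes of other categories are not covered here.

```python
def text_attribute(term: str) -> str:
    """Build the attribute key for a text-search term, e.g. "text:divestment"."""
    return f"text:{term}"


divt = text_attribute("divestment")
print(divt)
```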

divt can now be passed as an argument, for example to nodes.set_sizes():

ig.nodes.set_sizes(sequential(divt, out_range=(5,500)))

The same can be done for colouring:

ig.nodes.set_colors(sequential(divt, cmap="viridis"))
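
As a rough illustration of what mapping an attribute onto out_range=(5,500) could mean, the following sketch rescales frequencies linearly. Whether sequential actually scales linearly is an assumption; linear_scale and the sample data are hypothetical and not part of dsslab.

```python
from typing import Dict, Tuple


def linear_scale(
    values: Dict[str, float], out_range: Tuple[float, float]
) -> Dict[str, float]:
    """Rescale values linearly so the minimum maps to out_range[0]
    and the maximum maps to out_range[1]."""
    if not values:
        return {}
    low, high = out_range
    v_min, v_max = min(values.values()), max(values.values())
    span = v_max - v_min
    if span == 0:
        # All values are equal: fall back to the lower bound.
        return {key: float(low) for key in values}
    return {
        key: low + (value - v_min) / span * (high - low)
        for key, value in values.items()
    }


# Hypothetical absolute frequencies per domain.
frequencies = {"a.example": 0, "b.example": 5, "c.example": 10}
sizes = linear_scale(frequencies, out_range=(5, 500))
print(sizes)
```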