Script 1/2: retrieving data from the dispatch (date, article number and place names)
This script allows you to apply a specific date as a filter in order to reduce the number of results. This is necessary because without a filter the SNL model is too heavy: besides causing memory errors while loading, the resulting network is too complex to understand. The script uses a counter for articles and a counter for places, combined with the article date, to create unique IDs.
from bs4 import BeautifulSoup
import re
import os
newPathToFolder = "----path to target folder to save files----"
pathToFolder = "----path to source folder with dispatch xml files----"
listOfFiles = os.listdir(pathToFolder)
# the date_filter argument defines a date prefix (e.g. "1860-11"); only dispatches whose issue date starts with it are processed
def generate(date_filter):
    placeNames = []
    dicFreq = {}
    resultsCSV = []
    # loop over all files in the source folder
    for f in listOfFiles:
        soup = BeautifulSoup(open(pathToFolder + f, "r", encoding="utf8"), features="html.parser")
        # searches for "date" tags and returns at most two elements
        issue_date = soup.find_all("date", limit=2)[1]  # only the second match is needed
        issue_date = issue_date.get("value")
        # searches for all "div3" tags that have a "type" attribute and stores them in "articles"
        articles = soup.find_all("div3", type=True)
        if issue_date.startswith(date_filter):
            for a, item in enumerate(articles):
                counter = str(a) + "-" + str(issue_date)  # combines the article count with the issue date of the dispatch
                places = item.find_all("placename", key=True)  # finds all placenames that carry a "key" attribute
                article = "article-" + counter  # source label: "article-" plus article count and issue date
                print(article)
                for p, place_item in enumerate(places):
                    counter2 = str(p) + "-" + counter  # counter2 is the unique identifier (ID) for each row
                    place = place_item.get_text()  # the place name is the text content of the placename tag
                    key = place_item["key"].split(";")  # the "key" attribute holds one or more tgn references separated by ";"
                    print(str(place))
                    for k in key:
                        digits = [d for d in k if d.isdigit()]  # keep digits only to clean the tgn number
                        tag_id = ''.join(digits)
                        placeList = "\t".join([counter2, issue_date, article, place, tag_id])  # one tab-separated row: id, date, source, target, tgn
                        placeNames.append(placeList)  # append the row to the list
    for i in placeNames:  # count how often each row occurs
        if i in dicFreq:
            dicFreq[i] += 1
        else:
            dicFreq[i] = 1
    for key, value in dicFreq.items():  # keep only rows that occur exactly once
        if value < 2:  # this excludes items with a frequency higher than 1 - we want unique rows
            newVal = "%09d\t%s" % (value, key)
            # newVal prepends the zero-padded frequency to the tab-separated row
            resultsCSV.append(newVal)
    resultsCSV = sorted(resultsCSV, reverse=True)  # sorting the results
    print(len(resultsCSV))  # prints the number of rows in the list
    resultsToSave = "\n".join(resultsCSV)  # joining the results line by line
    # path of the output file in the target folder
    newfile = newPathToFolder + "placesFull_3.3.csv"
    # header for the final file
    header = "freq\tid\tdate\tsource\ttarget\ttgn\n"
    # opens the new file and writes the header followed by all rows
    with open(newfile, "w", encoding="utf8") as f8:
        f8.write(header + resultsToSave)
# how to use the function
generate("1860-11")
Results for Script 1/2:
Result: Network Edges List usable for Gephi
Total of: 16,832 entries
freq = frequency of how many duplicated rows are in the dataset (1 = unique)
id = place counter + article counter + issue date
date = issue date of the article
source = article label (article count for each dispatch plus issue date)
target = place name
tgn = tgn number that references a geolocation
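A quick way to check the edge list before loading it into Gephi is to read it back with Python's csv module. This is just a sketch: it assumes the file written by the script above and reuses the same path placeholder.
import csv

newPathToFolder = "----path to target folder to save files----"
with open(newPathToFolder + "placesFull_3.3.csv", encoding="utf8") as f:
    reader = csv.DictReader(f, delimiter="\t")
    rows = list(reader)
print(len(rows))                             # number of edges in the list
print(rows[0]["source"], rows[0]["target"])  # first edge: an article label and a place name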
I tried to load the complete dispatch, but the amount of results is too heavy for SNL. The second try was to load one year, “1860”, only, but even this was too much for Gephi. The third try, which worked, was one month of a specific year: I chose November of 1860 (“1860-11”). The network is still too complex, but what appear to be little moons are the labels of articles that reference a specific location. Richmond is mentioned the most, followed by the other places.
High zoom into the Richmond Node surrounded by all articles that reference to this place:
Medium Zoom of several nodes:
Another SNL network, produced with the second script (2/2).
That script was not good enough since it did not produce unique IDs.
Unique IDs are necessary for SNL, or the double entries will be ignored. However, the network itself looked interesting, which is why I keep it online. The script for it is below.
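A minimal sketch (with made-up, simplified rows) of why the ID column matters: without it, two mentions of the same place in the same article produce identical rows and collapse into one; with the ID, both mentions stay separate.
rows_without_id = [
    ("1860-11-05", "article-3", "Richmond"),
    ("1860-11-05", "article-3", "Richmond"),  # second mention in the same article
]
rows_with_id = [
    ("0-3-1860-11-05", "1860-11-05", "article-3", "Richmond"),
    ("1-3-1860-11-05", "1860-11-05", "article-3", "Richmond"),
]
print(len(set(rows_without_id)))  # 1 - the double entry is ignored
print(len(set(rows_with_id)))     # 2 - both mentions survive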
Result: Network Edges List usable for Gephi
Total of: 47,429 entries
Important note: the script above does not include place names that are mentioned once only.
I created an edges list that also includes place names mentioned once, but that list is not workable for Gephi because the network gets too dense (one way to produce such a list is sketched below).
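A minimal sketch of that variant, assuming dicFreq and resultsCSV are built exactly as in the scripts (the two rows here are made up): the frequency threshold is simply dropped, so rows with a frequency of 1 are written as well.
dicFreq = {"1860-11-05\tarticle-3\tRichmond\t1234567": 3,
           "1860-11-05\tarticle-3\tPetersburg\t7654321": 1}
resultsCSV = []
for key, value in dicFreq.items():
    # no frequency check here, so place names mentioned only once are kept as well
    resultsCSV.append("%09d\t%s" % (value, key))
print(len(resultsCSV))  # 2 - both rows survive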
Gephi SNL Graphs
1:
Zoom into the connected network:
The cities most connected to articles are: Richmond, Virginia, Washington, “Confederate States”, “United States”, “England”, North Carolina, …
The articles most connected to cities are: article-99, article-9, article-96, article-97, …
Clusters: the cities, states and articles mentioned above also form various clusters. Richmond, Virginia and Washington seem to be the most relevant.
2:
The original picture without zoom shows thousands of cities, states and articles that are not connected at all according to Gephi.
Script 2/2: Not valid because it does not produce unique IDs
from bs4 import BeautifulSoup
import re
import os
newPathToFolder = "----path to target folder to save files----"
pathToFolder = "----path to source folder with dispatch xml files----"
listOfFiles = os.listdir(pathToFolder)
dicFreq = {}
placeNames = []
resultsCSV = []
# loop that opens all files of the source folder and parses each with BeautifulSoup
for f in listOfFiles:
    soup = BeautifulSoup(open(pathToFolder + f, "r", encoding="utf8"), features="html.parser")
    # searches for "date" tags and returns at most two elements
    issue_date = soup.find_all("date", limit=2)[1]  # only the second match is needed
    issue_date = issue_date.get("value")
    # searches for all "div3" tags that have a "type" attribute and stores them in "articles"
    articles = soup.find_all("div3", type=True)
    # counter to count each article
    counter = 0
    for a in articles:
        counter += 1  # increases the counter for each article
        places = a.find_all("placename", key=True)  # finds all placenames that carry a "key" attribute
        article = "article-" + str(counter)  # source label: "article-" plus the article counter only (without the issue date)
        print(article)
        for p in places:
            place = p.get_text()  # the place name is the text content of the placename tag
            key = p["key"].split(";")  # the "key" attribute holds one or more tgn references separated by ";"
            print(str(place) + "-" + str(key))
            for k in key:
                digits = [d for d in k if d.isdigit()]  # keep digits only to clean the tgn number
                tag_id = ''.join(digits)
                placeList = "\t".join([issue_date, article, place, tag_id])  # one tab-separated row: date, source, target, tgn (no unique id)
                placeNames.append(placeList)  # append the row to the list
for i in placeNames:  # count how often each row occurs
    if i in dicFreq:
        dicFreq[i] += 1
    else:
        dicFreq[i] = 1
for key, value in dicFreq.items():  # removes all place names that are mentioned once only
    if value > 1:  # this excludes items with frequency 1
        newVal = "%09d\t%s" % (value, key)
        # newVal will look like: `000005486 TAB Richmond`
        resultsCSV.append(newVal)
resultsCSV = sorted(resultsCSV, reverse=True)  # sorting the results
print(len(resultsCSV))  # prints the number of rows in the list
resultsToSave = "\n".join(resultsCSV)  # joining the results line by line
# path of the output file in the target folder
newfile = newPathToFolder + "resultsFull_3.csv"
# header for the final file (the first column holds the frequency, used here as id)
header = "id\tdate\tsource\ttarget\ttgn\n"
# saving
with open(newfile, "w", encoding="utf8") as f8:
    f8.write(header + resultsToSave)