When Enron collapsed in 2001, about 500.000 internal emails were made public (more info here). There are alredy some publications based on the data (also to be found in the former link). Jitesh Shetty and Jafar Adibi cleaned the data and put it in a MySQL database. The dump is still online, but was created under MySQL 4 and cannot be imported in MySQL 5 without changes in the dump-file (espacially changing “TYPE=MyISAM” to “ENGINE=MyISAM”). Addionally they tried to identify the former position of the employees.
I “repaired” the dump, found another list with the positions of the former employees and updated the old one. I also noticed that one person often has more than one email address. I switches through the mails and updated all addresses to a unique one, found under Email_id in table/dataframe employeelist. All replaced addresses are in the coloumns Email2 to Email4. I am not sure if the matching is totally correct or if I got all email addresses, so there is not a guarantee for a 100% consistent dataset.
The data can be downloaded as a single sql-file and as a RData-file. The RData-file is simply an import from the database with some simple data management (changing data-types). I will work with the data in the next weeks, so there will be more R data available soon.
the tables/dataframes contain the following coloumns:
Employeelist
- eid: Employee-ID
- firstName: First name
- lastName: Last name
- Email_id: Email address (primary). This one can be found in the other tables/dataframes and is useful for matching.
- Email2: Additionally E-Mail-Address that was replace by the primary one.
- Email3: See above
- Email4: See above
- folder: The user’s folder in the original data dump.
- status: Last position of the employee. “N/A”s could not be found out.
Message
- mid: Message-ID. Refers to the rows in recipientinfo and referenceinfo.
- sender: Email address (updated)
- date: Date.
- message_id: Internal message-ID from the mailserver.
- subject: Email subject
- body: Email body. Can be truncated in the R-Version!
- folder: Exact folder of the e-mail inclusing subfolders.
Recipientinfo
(Note: If an E-Mail is sent to multiple recipients, there is a new row for every recipient!)
- rid: Reference-ID
- mid: Message-ID from the message-table/-dataframe
- rtype: Shows if the reciever got the mail normally (“to”), as a carbon copy (“cc”) or a blind carbon copy (“bcc”).
- rvalue: The recipient’s email address.
Referenceinfo
- rfid: referenceinfo-ID
- mid: Message-ID
- reference: Contains the whole email with shortend headers.