Commit ef82666f by Sanjay Krishnan

Added examples to the course repository.

parent 92fd772c
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Aquisition and Formatting\n",
"\n",
"Data can literally be anything from images to numbers to text. The way that we *choose* to organize it can have a profound impact on our ability to extract value from it. This lecture describes some basic tradeoffs in data analysis.\n",
"\n",
"## Data Formats and Serialization\n",
"It is easy to communicate information between fuctions within a single program with lists, dictionaries, sets, and classes. However, we often need to move data across program and device boundaries. For example, a sensor might output raw data to a file and we may have to load it into Python for analysis. Or, our web application written in JavaScript might need to send data to our Database server based in SQL for durable storage. A data format is a specification of how data should be communicated between different programs. \n",
"\n",
"\"Serialization\" is the process of taking a native object in one programming language to generating a string-valued output that can be transferred to another program to be able to reconstruct that object. \n",
"\n",
"Let's consider a concrete use case. You want to create an object in python and email it to friend so they can reconstruct the same object. Let's create a dictionary:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'a': 1, 'b': 2, 'c': [1, 2, 3]}\n"
]
}
],
"source": [
"my_dict = {}\n",
"my_dict['a'] = 1\n",
"my_dict['b'] = 2\n",
"my_dict['c'] = [1,2,3]\n",
"\n",
"print(my_dict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python provides the \"pickle\" library to be able to save objects for later use. Now let's \"pickle\" this dictionary:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"pickle.dump(my_dict, open('output.pkl','wb')) #opens a file in binary mode for writing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This data has been fully stored into a file and now we can retrieve this data in another Python program"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'a': 1, 'b': 2, 'c': [1, 2, 3]}\n"
]
}
],
"source": [
"new_dict = pickle.load(open('output.pkl','rb'))#opens a file in binary mode for reading\n",
"print(new_dict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The python pickle is useful for communication across python programs, but it has a couple of key downsides. \n",
"\n",
"## Problem 1. Human Readable Formats\n",
"I encourage you to open the file 'output.pkl' in a text editor with binary support (such as Sublime). You'll see something that looks like this:\n",
"\n",
"```\n",
"8003 7d71 0028 5801 0000 0061 7101 4b01\n",
"5801 0000 0062 7102 4b02 5801 0000 0063\n",
"7103 5d71 0428 4b01 4b02 4b03 6575 2e\n",
"```\n",
"Imagine that you were the friend that just recieved this file. Without loading it into python you would not be able to know what this data represents. Luckily, there are several serialization formats that are \"human readable\"; that is, they are represented in ASCII strings that you can inspect in a text editor. One of these formats is JavaScript Object Notation (JSON):"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\"a\": 1, \"b\": 2, \"c\": [1, 2, 3]}\n"
]
}
],
"source": [
"import json\n",
"\n",
"json.dump(my_dict, open('output.json','w'))\n",
"\n",
"print(json.dumps(my_dict)) #print as a string i.e., what happens if you load the file"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'a': 1, 'b': 2, 'c': [1, 2, 3]}\n"
]
}
],
"source": [
"new_dict = json.load(open('output.json','r'))\n",
"\n",
"print(new_dict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"JSON is conviently readable. You can open up the file and inspect the contents. JSON is further a nested data format, where it can handle arbitrarily nested dictionaries, lists of strings, numbers, and Booleans. In your assignment, you will work with another such format called XML which is less easy. \n",
"\n",
"Now, let's look at a concrete example of JSON serialization. Twitter's API returns tweets in a serialized json format. See `twitter.sample.json`. Let's load this data into python and see how it looks like:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'created_at': 'Tue Feb 27 21:11:40 +0000 2018', 'id': 968594506663669800, 'id_str': '968594506663669760', 'text': 'RT @honeydrop_506: 180222 ICN #백현 #BAEKHYUNnn나의 겨울과 너nn#iHeartAwards #BestFanArmy #EXOL @weareoneEXO https://t.co/hg7I3xAlBg', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 4448809940, 'id_str': '4448809940', 'name': 'ayah', 'screen_name': 'lovbyun', 'location': 'bbh iu jjh pcy kjd dks', 'url': 'https://curiouscat.me/baekhyun-l', 'description': 'hi hello I love exo', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 1142, 'friends_count': 125, 'listed_count': 20, 'favourites_count': 5712, 'statuses_count': 4011, 'created_at': 'Fri Dec 04 03:44:59 +0000 2015', 'utc_offset': -28800, 'time_zone': 'Pacific Time (US & Canada)', 'geo_enabled': False, 'lang': 'en', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': '000000', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': False, 'profile_link_color': 'F58EA8', 'profile_sidebar_border_color': '000000', 'profile_sidebar_fill_color': '000000', 'profile_text_color': '000000', 'profile_use_background_image': False, 'profile_image_url': 'http://pbs.twimg.com/profile_images/967130320259526656/0xZ-wxXJ_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/967130320259526656/0xZ-wxXJ_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/4448809940/1519340670', 'default_profile': False, 'default_profile_image': False, 'following': None, 'follow_request_sent': None, 'notifications': None}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'retweeted_status': {'created_at': 'Mon Feb 26 
14:25:59 +0000 2018', 'id': 968130023566684200, 'id_str': '968130023566684160', 'text': '180222 ICN #백현 #BAEKHYUNnn나의 겨울과 너nn#iHeartAwards #BestFanArmy #EXOL @weareoneEXO https://t.co/hg7I3xAlBg', 'display_text_range': [0, 81], 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 785503802761683000, 'id_str': '785503802761682944', 'name': 'HONEY DROP 🍯', 'screen_name': 'honeydrop_506', 'location': '백현 마음 속 ‘ㅅ’', 'url': None, 'description': '좋아해, 백현아 ❤️ / 로고크롭 2차가공 상업적이용 ❌ 고화질은 마음 / Do not QT / 사담 @BHoney___', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 32861, 'friends_count': 10, 'listed_count': 746, 'favourites_count': 184, 'statuses_count': 1220, 'created_at': 'Mon Oct 10 15:34:35 +0000 2016', 'utc_offset': 32400, 'time_zone': 'Seoul', 'geo_enabled': False, 'lang': 'ko', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': '000000', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': False, 'profile_link_color': 'F58EA8', 'profile_sidebar_border_color': '000000', 'profile_sidebar_fill_color': '000000', 'profile_text_color': '000000', 'profile_use_background_image': False, 'profile_image_url': 'http://pbs.twimg.com/profile_images/956172776506761217/TCBwtjU1_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/956172776506761217/TCBwtjU1_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/785503802761682944/1516948135', 'default_profile': False, 'default_profile_image': False, 'following': None, 'follow_request_sent': None, 'notifications': None}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'quote_count': 15, 
'reply_count': 1, 'retweet_count': 4119, 'favorite_count': 1610, 'entities': {'hashtags': [{'text': '백현', 'indices': [11, 14]}, {'text': 'BAEKHYUN', 'indices': [15, 24]}, {'text': 'iHeartAwards', 'indices': [36, 49]}, {'text': 'BestFanArmy', 'indices': [50, 62]}, {'text': 'EXOL', 'indices': [63, 68]}], 'urls': [], 'user_mentions': [{'screen_name': 'weareoneEXO', 'name': 'EXO', 'id': 873115441303924700, 'id_str': '873115441303924736', 'indices': [69, 81]}], 'symbols': [], 'media': [{'id': 968129121061490700, 'id_str': '968129121061490690', 'indices': [82, 105], 'media_url': 'http://pbs.twimg.com/media/DW98TmWU0AItmlv.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DW98TmWU0AItmlv.jpg', 'url': 'https://t.co/hg7I3xAlBg', 'display_url': 'pic.twitter.com/hg7I3xAlBg', 'expanded_url': 'https://twitter.com/honeydrop_506/status/968130023566684160/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'medium': {'w': 800, 'h': 1200, 'resize': 'fit'}, 'small': {'w': 453, 'h': 680, 'resize': 'fit'}, 'large': {'w': 1000, 'h': 1500, 'resize': 'fit'}}}]}, 'extended_entities': {'media': [{'id': 968129121061490700, 'id_str': '968129121061490690', 'indices': [82, 105], 'media_url': 'http://pbs.twimg.com/media/DW98TmWU0AItmlv.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DW98TmWU0AItmlv.jpg', 'url': 'https://t.co/hg7I3xAlBg', 'display_url': 'pic.twitter.com/hg7I3xAlBg', 'expanded_url': 'https://twitter.com/honeydrop_506/status/968130023566684160/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'medium': {'w': 800, 'h': 1200, 'resize': 'fit'}, 'small': {'w': 453, 'h': 680, 'resize': 'fit'}, 'large': {'w': 1000, 'h': 1500, 'resize': 'fit'}}}, {'id': 968129133724053500, 'id_str': '968129133724053505', 'indices': [82, 105], 'media_url': 'http://pbs.twimg.com/media/DW98UVhUMAE-81s.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DW98UVhUMAE-81s.jpg', 'url': 'https://t.co/hg7I3xAlBg', 'display_url': 
'pic.twitter.com/hg7I3xAlBg', 'expanded_url': 'https://twitter.com/honeydrop_506/status/968130023566684160/photo/1', 'type': 'photo', 'sizes': {'medium': {'w': 800, 'h': 1200, 'resize': 'fit'}, 'large': {'w': 1000, 'h': 1500, 'resize': 'fit'}, 'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'small': {'w': 453, 'h': 680, 'resize': 'fit'}}}, {'id': 968129145354952700, 'id_str': '968129145354952704', 'indices': [82, 105], 'media_url': 'http://pbs.twimg.com/media/DW98VA2VoAAdEU4.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DW98VA2VoAAdEU4.jpg', 'url': 'https://t.co/hg7I3xAlBg', 'display_url': 'pic.twitter.com/hg7I3xAlBg', 'expanded_url': 'https://twitter.com/honeydrop_506/status/968130023566684160/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'small': {'w': 453, 'h': 680, 'resize': 'fit'}, 'medium': {'w': 800, 'h': 1200, 'resize': 'fit'}, 'large': {'w': 1000, 'h': 1500, 'resize': 'fit'}}}, {'id': 968129158457966600, 'id_str': '968129158457966593', 'indices': [82, 105], 'media_url': 'http://pbs.twimg.com/media/DW98VxqVwAE_IM3.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DW98VxqVwAE_IM3.jpg', 'url': 'https://t.co/hg7I3xAlBg', 'display_url': 'pic.twitter.com/hg7I3xAlBg', 'expanded_url': 'https://twitter.com/honeydrop_506/status/968130023566684160/photo/1', 'type': 'photo', 'sizes': {'medium': {'w': 800, 'h': 1200, 'resize': 'fit'}, 'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'large': {'w': 1000, 'h': 1500, 'resize': 'fit'}, 'small': {'w': 453, 'h': 680, 'resize': 'fit'}}}]}, 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'filter_level': 'low', 'lang': 'ko'}, 'is_quote_status': False, 'quote_count': 0, 'reply_count': 0, 'retweet_count': 0, 'favorite_count': 0, 'entities': {'hashtags': [{'text': '백현', 'indices': [30, 33]}, {'text': 'BAEKHYUN', 'indices': [34, 43]}, {'text': 'iHeartAwards', 'indices': [55, 68]}, {'text': 'BestFanArmy', 'indices': [69, 81]}, {'text': 'EXOL', 'indices': [82, 
87]}], 'urls': [], 'user_mentions': [{'screen_name': 'honeydrop_506', 'name': 'HONEY DROP 🍯', 'id': 785503802761683000, 'id_str': '785503802761682944', 'indices': [3, 17]}, {'screen_name': 'weareoneEXO', 'name': 'EXO', 'id': 873115441303924700, 'id_str': '873115441303924736', 'indices': [88, 100]}], 'symbols': [], 'media': [{'id': 968129121061490700, 'id_str': '968129121061490690', 'indices': [101, 124], 'media_url': 'http://pbs.twimg.com/media/DW98TmWU0AItmlv.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DW98TmWU0AItmlv.jpg', 'url': 'https://t.co/hg7I3xAlBg', 'display_url': 'pic.twitter.com/hg7I3xAlBg', 'expanded_url': 'https://twitter.com/honeydrop_506/status/968130023566684160/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'medium': {'w': 800, 'h': 1200, 'resize': 'fit'}, 'small': {'w': 453, 'h': 680, 'resize': 'fit'}, 'large': {'w': 1000, 'h': 1500, 'resize': 'fit'}}, 'source_status_id': 968130023566684200, 'source_status_id_str': '968130023566684160', 'source_user_id': 785503802761683000, 'source_user_id_str': '785503802761682944'}]}, 'extended_entities': {'media': [{'id': 968129121061490700, 'id_str': '968129121061490690', 'indices': [101, 124], 'media_url': 'http://pbs.twimg.com/media/DW98TmWU0AItmlv.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DW98TmWU0AItmlv.jpg', 'url': 'https://t.co/hg7I3xAlBg', 'display_url': 'pic.twitter.com/hg7I3xAlBg', 'expanded_url': 'https://twitter.com/honeydrop_506/status/968130023566684160/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'medium': {'w': 800, 'h': 1200, 'resize': 'fit'}, 'small': {'w': 453, 'h': 680, 'resize': 'fit'}, 'large': {'w': 1000, 'h': 1500, 'resize': 'fit'}}, 'source_status_id': 968130023566684200, 'source_status_id_str': '968130023566684160', 'source_user_id': 785503802761683000, 'source_user_id_str': '785503802761682944'}, {'id': 968129133724053500, 'id_str': '968129133724053505', 'indices': [101, 124], 
'media_url': 'http://pbs.twimg.com/media/DW98UVhUMAE-81s.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DW98UVhUMAE-81s.jpg', 'url': 'https://t.co/hg7I3xAlBg', 'display_url': 'pic.twitter.com/hg7I3xAlBg', 'expanded_url': 'https://twitter.com/honeydrop_506/status/968130023566684160/photo/1', 'type': 'photo', 'sizes': {'medium': {'w': 800, 'h': 1200, 'resize': 'fit'}, 'large': {'w': 1000, 'h': 1500, 'resize': 'fit'}, 'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'small': {'w': 453, 'h': 680, 'resize': 'fit'}}, 'source_status_id': 968130023566684200, 'source_status_id_str': '968130023566684160', 'source_user_id': 785503802761683000, 'source_user_id_str': '785503802761682944'}, {'id': 968129145354952700, 'id_str': '968129145354952704', 'indices': [101, 124], 'media_url': 'http://pbs.twimg.com/media/DW98VA2VoAAdEU4.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DW98VA2VoAAdEU4.jpg', 'url': 'https://t.co/hg7I3xAlBg', 'display_url': 'pic.twitter.com/hg7I3xAlBg', 'expanded_url': 'https://twitter.com/honeydrop_506/status/968130023566684160/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'small': {'w': 453, 'h': 680, 'resize': 'fit'}, 'medium': {'w': 800, 'h': 1200, 'resize': 'fit'}, 'large': {'w': 1000, 'h': 1500, 'resize': 'fit'}}, 'source_status_id': 968130023566684200, 'source_status_id_str': '968130023566684160', 'source_user_id': 785503802761683000, 'source_user_id_str': '785503802761682944'}, {'id': 968129158457966600, 'id_str': '968129158457966593', 'indices': [101, 124], 'media_url': 'http://pbs.twimg.com/media/DW98VxqVwAE_IM3.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DW98VxqVwAE_IM3.jpg', 'url': 'https://t.co/hg7I3xAlBg', 'display_url': 'pic.twitter.com/hg7I3xAlBg', 'expanded_url': 'https://twitter.com/honeydrop_506/status/968130023566684160/photo/1', 'type': 'photo', 'sizes': {'medium': {'w': 800, 'h': 1200, 'resize': 'fit'}, 'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'large': {'w': 1000, 
'h': 1500, 'resize': 'fit'}, 'small': {'w': 453, 'h': 680, 'resize': 'fit'}}, 'source_status_id': 968130023566684200, 'source_status_id_str': '968130023566684160', 'source_user_id': 785503802761683000, 'source_user_id_str': '785503802761682944'}]}, 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'filter_level': 'low', 'lang': 'ko', 'timestamp_ms': '1519765900661'}\n"
]
}
],
"source": [
"tweet_struct = json.load(open('twitter.sample.json','r'))\n",
"print(tweet_struct)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"created_at\n",
"id\n",
"id_str\n",
"text\n",
"truncated\n",
"in_reply_to_status_id\n",
"in_reply_to_status_id_str\n",
"in_reply_to_user_id\n",
"in_reply_to_user_id_str\n",
"in_reply_to_screen_name\n",
"user\n",
"geo\n",
"coordinates\n",
"place\n",
"contributors\n",
"retweeted_status\n",
"is_quote_status\n",
"quote_count\n",
"reply_count\n",
"retweet_count\n",
"favorite_count\n",
"entities\n",
"extended_entities\n",
"favorited\n",
"retweeted\n",
"possibly_sensitive\n",
"filter_level\n",
"lang\n",
"timestamp_ms\n"
]
}
],
"source": [
"#get the first level of keys\n",
"for key in tweet_struct:\n",
" print(key)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How could you get all the sub keys as well?"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"created_at\n",
"id\n",
"id_str\n",
"text\n",
"truncated\n",
"in_reply_to_status_id\n",
"in_reply_to_status_id_str\n",
"in_reply_to_user_id\n",
"in_reply_to_user_id_str\n",
"in_reply_to_screen_name\n",
"user\n",
"\tid\n",
"\tid_str\n",
"\tname\n",
"\tscreen_name\n",
"\tlocation\n",
"\turl\n",
"\tdescription\n",
"\ttranslator_type\n",
"\tprotected\n",
"\tverified\n",
"\tfollowers_count\n",
"\tfriends_count\n",
"\tlisted_count\n",
"\tfavourites_count\n",
"\tstatuses_count\n",
"\tcreated_at\n",
"\tutc_offset\n",
"\ttime_zone\n",
"\tgeo_enabled\n",
"\tlang\n",
"\tcontributors_enabled\n",
"\tis_translator\n",
"\tprofile_background_color\n",
"\tprofile_background_image_url\n",
"\tprofile_background_image_url_https\n",
"\tprofile_background_tile\n",
"\tprofile_link_color\n",
"\tprofile_sidebar_border_color\n",
"\tprofile_sidebar_fill_color\n",
"\tprofile_text_color\n",
"\tprofile_use_background_image\n",
"\tprofile_image_url\n",
"\tprofile_image_url_https\n",
"\tprofile_banner_url\n",
"\tdefault_profile\n",
"\tdefault_profile_image\n",
"\tfollowing\n",
"\tfollow_request_sent\n",
"\tnotifications\n",
"geo\n",
"coordinates\n",
"place\n",
"contributors\n",
"retweeted_status\n",
"\tcreated_at\n",
"\tid\n",
"\tid_str\n",
"\ttext\n",
"\tdisplay_text_range\n",
"\ttruncated\n",
"\tin_reply_to_status_id\n",
"\tin_reply_to_status_id_str\n",
"\tin_reply_to_user_id\n",
"\tin_reply_to_user_id_str\n",
"\tin_reply_to_screen_name\n",
"\tuser\n",
"\t\tid\n",
"\t\tid_str\n",
"\t\tname\n",
"\t\tscreen_name\n",
"\t\tlocation\n",
"\t\turl\n",
"\t\tdescription\n",
"\t\ttranslator_type\n",
"\t\tprotected\n",
"\t\tverified\n",
"\t\tfollowers_count\n",
"\t\tfriends_count\n",
"\t\tlisted_count\n",
"\t\tfavourites_count\n",
"\t\tstatuses_count\n",
"\t\tcreated_at\n",
"\t\tutc_offset\n",
"\t\ttime_zone\n",
"\t\tgeo_enabled\n",
"\t\tlang\n",
"\t\tcontributors_enabled\n",
"\t\tis_translator\n",
"\t\tprofile_background_color\n",
"\t\tprofile_background_image_url\n",
"\t\tprofile_background_image_url_https\n",
"\t\tprofile_background_tile\n",
"\t\tprofile_link_color\n",
"\t\tprofile_sidebar_border_color\n",
"\t\tprofile_sidebar_fill_color\n",
"\t\tprofile_text_color\n",
"\t\tprofile_use_background_image\n",
"\t\tprofile_image_url\n",
"\t\tprofile_image_url_https\n",
"\t\tprofile_banner_url\n",
"\t\tdefault_profile\n",
"\t\tdefault_profile_image\n",
"\t\tfollowing\n",
"\t\tfollow_request_sent\n",
"\t\tnotifications\n",
"\tgeo\n",
"\tcoordinates\n",
"\tplace\n",
"\tcontributors\n",
"\tis_quote_status\n",
"\tquote_count\n",
"\treply_count\n",
"\tretweet_count\n",
"\tfavorite_count\n",
"\tentities\n",
"\t\thashtags\n",
"\t\turls\n",
"\t\tuser_mentions\n",
"\t\tsymbols\n",
"\t\tmedia\n",
"\textended_entities\n",
"\t\tmedia\n",
"\tfavorited\n",
"\tretweeted\n",
"\tpossibly_sensitive\n",
"\tfilter_level\n",
"\tlang\n",
"is_quote_status\n",
"quote_count\n",
"reply_count\n",
"retweet_count\n",
"favorite_count\n",
"entities\n",
"\thashtags\n",
"\turls\n",
"\tuser_mentions\n",
"\tsymbols\n",
"\tmedia\n",
"extended_entities\n",
"\tmedia\n",
"favorited\n",
"retweeted\n",
"possibly_sensitive\n",
"filter_level\n",
"lang\n",
"timestamp_ms\n"
]
}
],
"source": [
"def get_skeleton(struct, prefix=''):\n",
" \n",
" #base case\n",
" if not type(struct) is dict:\n",
" return\n",
" \n",
" for key in struct:\n",
" print(prefix+key)\n",
" get_skeleton(struct[key], prefix + '\\t')\n",
"\n",
"get_skeleton(tweet_struct)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note the recursive structure of this function. "
]
},
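{
"cell_type": "markdown",
"metadata": {},
"source": [
"One limitation worth noting: `get_skeleton` only descends into dictionaries, so keys nested inside lists (for example, the items under `hashtags`) are never printed. A minimal sketch of one way to handle lists as well is shown below; the function name `get_skeleton_with_lists` and the tiny `sample` record are made up for illustration and are not part of the course code.\n",
"\n",
"```python\n",
"def get_skeleton_with_lists(struct, prefix=''):\n",
"    # dictionaries: print each key, then recurse into its value\n",
"    if isinstance(struct, dict):\n",
"        for key in struct:\n",
"            print(prefix + key)\n",
"            get_skeleton_with_lists(struct[key], prefix + '\\t')\n",
"    # lists: recurse into each element at the same depth\n",
"    elif isinstance(struct, list):\n",
"        for item in struct:\n",
"            get_skeleton_with_lists(item, prefix)\n",
"    # anything else (strings, numbers, None) is a leaf: nothing to print\n",
"\n",
"# hypothetical miniature record, not real tweet data\n",
"sample = {'entities': {'hashtags': [{'text': 'a', 'indices': [0, 1]}]}}\n",
"get_skeleton_with_lists(sample)\n",
"```\n",
"\n",
"With lists of many similar items, this prints the same keys once per item, so in practice you might prefer to inspect only the first element of each list."
]
},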
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Problem 2. Too Much Freedom?\n",
"\n",
"Both JSON and Pickle are very general serialization formats. This generality can sometimes make it harder to program with them. Let's image that we were storing test scores for students in a class."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"scores = [{'name': 'Jared', 'score': 77}, {'name': 'Sylvie', 'score': 82}, {'name': 'Bud', 'score': 66} ]\n",
"\n",
"json.dump(scores, open('tests.json','w'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we wanted to calculate the average score, we can do it as follows:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"75.0\n"
]
}
],
"source": [
"sum = 0\n",
"cnt = 0\n",
"for record in json.load(open('tests.json','r')):\n",
" cnt += 1\n",
" sum += record['score']\n",
"\n",
"print(sum/cnt)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What is problematic about this code? \n",
"\n",
"The key 'score' is not defined anywhere and is not guaranteed to be present in every record. What if due to a data entry error we added a new record that looked like this:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"scores += [{'name': 'Fred', 'grade': 76}]"
]
},
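{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to catch this kind of problem early is to check every record against the keys we expect before computing anything. The sketch below assumes the expected keys are `name` and `score`; the helper `find_bad_records` and the `example` list are hypothetical names introduced only for illustration.\n",
"\n",
"```python\n",
"def find_bad_records(records, required_keys):\n",
"    # collect (position, missing keys) for every record that lacks a required key\n",
"    bad = []\n",
"    for i, record in enumerate(records):\n",
"        missing = [k for k in required_keys if k not in record]\n",
"        if missing:\n",
"            bad.append((i, missing))\n",
"    return bad\n",
"\n",
"# example data mirroring the records above, including the mistyped key\n",
"example = [{'name': 'Jared', 'score': 77}, {'name': 'Fred', 'grade': 76}]\n",
"print(find_bad_records(example, ['name', 'score']))\n",
"```\n",
"\n",
"Here the second record is reported as missing `score`. A check like this works, but it has to be written and maintained by hand for every dataset."
]
},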
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We really need a way that we can agree upon a fixed \"schema\" (keys that we are going use) in advance for consistent programming. This need leads to the concept of \"structured\" data (or tabular data).\n",
"\n",
"## Structured or Tabular Data\n",
"\n",
"In a structured dataset, we have rows and columns (also called attributes). Each column has a consistent name. In Python, the pandas library gives us a table structure that we can use:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>score</th>\n",
" <th>grade</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Jared</td>\n",
" <td>77.0</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Sylvie</td>\n",
" <td>82.0</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Bud</td>\n",
" <td>66.0</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Fred</td>\n",
" <td>NaN</td>\n",
" <td>76.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name score grade\n",
"0 Jared 77.0 NaN\n",
"1 Sylvie 82.0 NaN\n",
"2 Bud 66.0 NaN\n",
"3 Fred NaN 76.0"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"score_table = pd.DataFrame(scores)\n",
"score_table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note how a table represents the inconsistency that we previously introduced. Rather than hiding the key, it inserts NULL values to represent missing information. Structured data allows you to program against the data without actually worrying about what each particular record is---because you are guaranteed that each record has the same schema.\n",
"\n",
"This is why structured representations are often the main choice for data science applications. They enable reliable sharing of datasets. "
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"75.0"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"score_table['score'].mean()"
]
},
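{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `mean()` skips missing values by default, it can be worth checking how many values are missing in each column before trusting a summary statistic. A small sketch follows; the records are rebuilt here under the assumed names `records` and `table` so that the example is self-contained.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"records = [{'name': 'Jared', 'score': 77}, {'name': 'Sylvie', 'score': 82},\n",
"           {'name': 'Bud', 'score': 66}, {'name': 'Fred', 'grade': 76}]\n",
"table = pd.DataFrame(records)\n",
"\n",
"# count the missing entries in each column\n",
"print(table.isna().sum())\n",
"\n",
"# show the rows where 'score' is missing\n",
"print(table[table['score'].isna()])\n",
"```\n",
"\n",
"In this table, Fred's record has no `score`, so it does not contribute to the average at all."
]
},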
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Extraction\n",
"\n",
"Data extraction is the process of moving between different data formats. This process is necessarily lossy because formats are not always equivalent. The general strategy is to decompose one serialized format into its atomic pieces."
]
},
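{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration of how information can be lost when moving between formats, consider sending a Python object through JSON and back. JSON has no tuple type and only allows string keys, so the sketch below (using a made-up `original` dictionary) does not return exactly what went in.\n",
"\n",
"```python\n",
"import json\n",
"\n",
"# made-up example: an integer key and a tuple value\n",
"original = {1: (10, 20)}\n",
"\n",
"round_tripped = json.loads(json.dumps(original))\n",
"\n",
"print(original)                   # {1: (10, 20)}\n",
"print(round_tripped)              # {'1': [10, 20]}\n",
"print(original == round_tripped)  # False\n",
"```\n",
"\n",
"The values survive, but the key type and the tuple/list distinction do not."
]
},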
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'.created_at': 'Tue Feb 27 21:11:40 +0000 2018',\n",
" '.id': 968594506663669800,\n",
" '.id_str': '968594506663669760',\n",
" '.text': 'RT @honeydrop_506: 180222 ICN #백현 #BAEKHYUNnn나의 겨울과 너nn#iHeartAwards #BestFanArmy #EXOL @weareoneEXO https://t.co/hg7I3xAlBg',\n",
" '.truncated': False,\n",
" '.in_reply_to_status_id': None,\n",
" '.in_reply_to_status_id_str': None,\n",
" '.in_reply_to_user_id': None,\n",
" '.in_reply_to_user_id_str': None,\n",
" '.in_reply_to_screen_name': None,\n",
" '.user.id': 4448809940,\n",
" '.user.id_str': '4448809940',\n",
" '.user.name': 'ayah',\n",
" '.user.screen_name': 'lovbyun',\n",
" '.user.location': 'bbh iu jjh pcy kjd dks',\n",
" '.user.url': 'https://curiouscat.me/baekhyun-l',\n",
" '.user.description': 'hi hello I love exo',\n",
" '.user.translator_type': 'none',\n",
" '.user.protected': False,\n",
" '.user.verified': False,\n",
" '.user.followers_count': 1142,\n",
" '.user.friends_count': 125,\n",
" '.user.listed_count': 20,\n",
" '.user.favourites_count': 5712,\n",
" '.user.statuses_count': 4011,\n",
" '.user.created_at': 'Fri Dec 04 03:44:59 +0000 2015',\n",
" '.user.utc_offset': -28800,\n",
" '.user.time_zone': 'Pacific Time (US & Canada)',\n",
" '.user.geo_enabled': False,\n",
" '.user.lang': 'en',\n",
" '.user.contributors_enabled': False,\n",
" '.user.is_translator': False,\n",
" '.user.profile_background_color': '000000',\n",
" '.user.profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',\n",
" '.user.profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',\n",
" '.user.profile_background_tile': False,\n",
" '.user.profile_link_color': 'F58EA8',\n",
" '.user.profile_sidebar_border_color': '000000',\n",
" '.user.profile_sidebar_fill_color': '000000',\n",
" '.user.profile_text_color': '000000',\n",
" '.user.profile_use_background_image': False,\n",
" '.user.profile_image_url': 'http://pbs.twimg.com/profile_images/967130320259526656/0xZ-wxXJ_normal.jpg',\n",
" '.user.profile_image_url_https': 'https://pbs.twimg.com/profile_images/967130320259526656/0xZ-wxXJ_normal.jpg',\n",
" '.user.profile_banner_url': 'https://pbs.twimg.com/profile_banners/4448809940/1519340670',\n",
" '.user.default_profile': False,\n",
" '.user.default_profile_image': False,\n",
" '.user.following': None,\n",
" '.user.follow_request_sent': None,\n",
" '.user.notifications': None,\n",
" '.geo': None,\n",
" '.coordinates': None,\n",
" '.place': None,\n",
" '.contributors': None,\n",
" '.retweeted_status.created_at': 'Mon Feb 26 14:25:59 +0000 2018',\n",
" '.retweeted_status.id': 968130023566684200,\n",
" '.retweeted_status.id_str': '968130023566684160',\n",
" '.retweeted_status.text': '180222 ICN #백현 #BAEKHYUNnn나의 겨울과 너nn#iHeartAwards #BestFanArmy #EXOL @weareoneEXO https://t.co/hg7I3xAlBg',\n",
" '.retweeted_status.display_text_range': [0, 81],\n",
" '.retweeted_status.truncated': False,\n",
" '.retweeted_status.in_reply_to_status_id': None,\n",
" '.retweeted_status.in_reply_to_status_id_str': None,\n",
" '.retweeted_status.in_reply_to_user_id': None,\n",
" '.retweeted_status.in_reply_to_user_id_str': None,\n",
" '.retweeted_status.in_reply_to_screen_name': None,\n",
" '.retweeted_status.user.id': 785503802761683000,\n",
" '.retweeted_status.user.id_str': '785503802761682944',\n",
" '.retweeted_status.user.name': 'HONEY DROP 🍯',\n",
" '.retweeted_status.user.screen_name': 'honeydrop_506',\n",
" '.retweeted_status.user.location': '백현 마음 속 ‘ㅅ’',\n",
" '.retweeted_status.user.url': None,\n",
" '.retweeted_status.user.description': '좋아해, 백현아 ❤️ / 로고크롭 2차가공 상업적이용 ❌ 고화질은 마음 / Do not QT / 사담 @BHoney___',\n",
" '.retweeted_status.user.translator_type': 'none',\n",
" '.retweeted_status.user.protected': False,\n",
" '.retweeted_status.user.verified': False,\n",
" '.retweeted_status.user.followers_count': 32861,\n",
" '.retweeted_status.user.friends_count': 10,\n",
" '.retweeted_status.user.listed_count': 746,\n",
" '.retweeted_status.user.favourites_count': 184,\n",
" '.retweeted_status.user.statuses_count': 1220,\n",
" '.retweeted_status.user.created_at': 'Mon Oct 10 15:34:35 +0000 2016',\n",
" '.retweeted_status.user.utc_offset': 32400,\n",
" '.retweeted_status.user.time_zone': 'Seoul',\n",
" '.retweeted_status.user.geo_enabled': False,\n",
" '.retweeted_status.user.lang': 'ko',\n",
" '.retweeted_status.user.contributors_enabled': False,\n",
" '.retweeted_status.user.is_translator': False,\n",
" '.retweeted_status.user.profile_background_color': '000000',\n",
" '.retweeted_status.user.profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',\n",
" '.retweeted_status.user.profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',\n",
" '.retweeted_status.user.profile_background_tile': False,\n",
" '.retweeted_status.user.profile_link_color': 'F58EA8',\n",
" '.retweeted_status.user.profile_sidebar_border_color': '000000',\n",
" '.retweeted_status.user.profile_sidebar_fill_color': '000000',\n",
" '.retweeted_status.user.profile_text_color': '000000',\n",
" '.retweeted_status.user.profile_use_background_image': False,\n",
" '.retweeted_status.user.profile_image_url': 'http://pbs.twimg.com/profile_images/956172776506761217/TCBwtjU1_normal.jpg',\n",
" '.retweeted_status.user.profile_image_url_https': 'https://pbs.twimg.com/profile_images/956172776506761217/TCBwtjU1_normal.jpg',\n",
" '.retweeted_status.user.profile_banner_url': 'https://pbs.twimg.com/profile_banners/785503802761682944/1516948135',\n",
" '.retweeted_status.user.default_profile': False,\n",
" '.retweeted_status.user.default_profile_image': False,\n",
" '.retweeted_status.user.following': None,\n",
" '.retweeted_status.user.follow_request_sent': None,\n",
" '.retweeted_status.user.notifications': None,\n",
" '.retweeted_status.geo': None,\n",
" '.retweeted_status.coordinates': None,\n",
" '.retweeted_status.place': None,\n",
" '.retweeted_status.contributors': None,\n",
" '.retweeted_status.is_quote_status': False,\n",
" '.retweeted_status.quote_count': 15,\n",
" '.retweeted_status.reply_count': 1,\n",
" '.retweeted_status.retweet_count': 4119,\n",
" '.retweeted_status.favorite_count': 1610,\n",
" '.retweeted_status.entities.hashtags': [{'text': '백현', 'indices': [11, 14]},\n",
" {'text': 'BAEKHYUN', 'indices': [15, 24]},\n",
" {'text': 'iHeartAwards', 'indices': [36, 49]},\n",
" {'text': 'BestFanArmy', 'indices': [50, 62]},\n",
" {'text': 'EXOL', 'indices': [63, 68]}],\n",
" '.retweeted_status.entities.urls': [],\n",
" '.retweeted_status.entities.user_mentions': [{'screen_name': 'weareoneEXO',\n",
" 'name': 'EXO',\n",
" 'id': 873115441303924700,\n",
" 'id_str': '873115441303924736',\n",
" 'indices': [69, 81]}],\n",
" '.retweeted_status.entities.symbols': [],\n",
" '.retweeted_status.entities.media': [{'id': 968129121061490700,\n",
" 'id_str': '968129121061490690',\n",
" 'indices': [82, 105],\n",
" 'media_url': 'http://pbs.twimg.com/media/DW98TmWU0AItmlv.jpg',\n",
" 'media_url_https': 'https://pbs.twimg.com/media/DW98TmWU0AItmlv.jpg',\n",
" 'url': 'https://t.co/hg7I3xAlBg',\n",
" 'display_url': 'pic.twitter.com/hg7I3xAlBg',\n",
" 'expanded_url': 'https://twitter.com/honeydrop_506/status/968130023566684160/photo/1',\n",
" 'type': 'photo',\n",
" 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},\n",
" 'medium': {'w': 800, 'h': 1200, 'resize': 'fit'},\n",
" 'small': {'w': 453, 'h': 680, 'resize': 'fit'},\n",
" 'large': {'w': 1000, 'h': 1500, 'resize': 'fit'}}}],\n",
" '.retweeted_status.extended_entities.media': [{'id': 968129121061490700,\n",
" 'id_str': '968129121061490690',\n",
" 'indices': [82, 105],\n",
" 'media_url': 'http://pbs.twimg.com/media/DW98TmWU0AItmlv.jpg',\n",
" 'media_url_https': 'https://pbs.twimg.com/media/DW98TmWU0AItmlv.jpg',\n",
" 'url': 'https://t.co/hg7I3xAlBg',\n",
" 'display_url': 'pic.twitter.com/hg7I3xAlBg',\n",
" 'expanded_url': 'https://twitter.com/honeydrop_506/status/968130023566684160/photo/1',\n",
" 'type': 'photo',\n",
" 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},\n",
" 'medium': {'w': 800, 'h': 1200, 'resize': 'fit'},\n",
" 'small': {'w': 453, 'h': 680, 'resize': 'fit'},\n",
" 'large': {'w': 1000, 'h': 1500, 'resize': 'fit'}}},\n",
" {'id': 968129133724053500,\n",
" 'id_str': '968129133724053505',\n",
" 'indices': [82, 105],\n",
" 'media_url': 'http://pbs.twimg.com/media/DW98UVhUMAE-81s.jpg',\n",
" 'media_url_https': 'https://pbs.twimg.com/media/DW98UVhUMAE-81s.jpg',\n",
" 'url': 'https://t.co/hg7I3xAlBg',\n",
" 'display_url': 'pic.twitter.com/hg7I3xAlBg',\n",
" 'expanded_url': 'https://twitter.com/honeydrop_506/status/968130023566684160/photo/1',\n",
" 'type': 'photo',\n",
" 'sizes': {'medium': {'w': 800, 'h': 1200, 'resize': 'fit'},\n",
" 'large': {'w': 1000, 'h': 1500, 'resize': 'fit'},\n",
" 'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},\n",
" 'small': {'w': 453, 'h': 680, 'resize': 'fit'}}},\n",
" {'id': 968129145354952700,\n",
" 'id_str': '968129145354952704',\n",
" 'indices': [82, 105],\n",
" 'media_url': 'http://pbs.twimg.com/media/DW98VA2VoAAdEU4.jpg',\n",
" 'media_url_https': 'https://pbs.twimg.com/media/DW98VA2VoAAdEU4.jpg',\n",
" 'url': 'https://t.co/hg7I3xAlBg',\n",
" 'display_url': 'pic.twitter.com/hg7I3xAlBg',\n",
" 'expanded_url': 'https://twitter.com/honeydrop_506/status/968130023566684160/photo/1',\n",
" 'type': 'photo',\n",
" 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},\n",