"A JOIN operation is used to combine rows from two or more tables based on related data shared in them. Let's overview some of the practical details of these opeerations.\n",
"\n",
"## Pandas Merge\n",
"The pandas package implements efficient \"equality\" joins. This function is called `merge` (pandas also has a `join` function which behaves slightly differently but similar idea!). Let's think of a simple example:"
"This function merges the two tables together on the category column and automatically removes the redundancy (1 single column is left). The behavior of this function can be subtle. Suppose, we change the category field to D for one of the rows:"
"The row gets dropped from the result! In the basic operating mode of the merge command any row that doesn't have a match gets dropped. There is a key word `how` that can modify this behavior. Suppose, we want the left rows that don't match:"
"It returns those rows but with any additional columns null or nan, depending on the data type. If you set how to right you'll get the same answer as before (why?)"
"Pandas has an efficient implementation for equality join problems. Let's mock up a general join algorithm (any filter condition). Let's ignore the efficiency problem for a bit."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name_x</th>\n",
" <th>category_x</th>\n",
" <th>dummy</th>\n",
" <th>name_y</th>\n",
" <th>category_y</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>John Doe</td>\n",
" <td>A</td>\n",
" <td>1</td>\n",
" <td>John Doe</td>\n",
" <td>A</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>John Doe</td>\n",
" <td>A</td>\n",
" <td>1</td>\n",
" <td>Jane Smith</td>\n",
" <td>B</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>John Doe</td>\n",
" <td>A</td>\n",
" <td>1</td>\n",
" <td>Alex Taylor</td>\n",
" <td>D</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>John Doe</td>\n",
" <td>A</td>\n",
" <td>1</td>\n",
" <td>Brett Daniels</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Jane Smith</td>\n",
" <td>B</td>\n",
" <td>1</td>\n",
" <td>John Doe</td>\n",
" <td>A</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Jane Smith</td>\n",
" <td>B</td>\n",
" <td>1</td>\n",
" <td>Jane Smith</td>\n",
" <td>B</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Jane Smith</td>\n",
" <td>B</td>\n",
" <td>1</td>\n",
" <td>Alex Taylor</td>\n",
" <td>D</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Jane Smith</td>\n",
" <td>B</td>\n",
" <td>1</td>\n",
" <td>Brett Daniels</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Alex Taylor</td>\n",
" <td>D</td>\n",
" <td>1</td>\n",
" <td>John Doe</td>\n",
" <td>A</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Alex Taylor</td>\n",
" <td>D</td>\n",
" <td>1</td>\n",
" <td>Jane Smith</td>\n",
" <td>B</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>Alex Taylor</td>\n",
" <td>D</td>\n",
" <td>1</td>\n",
" <td>Alex Taylor</td>\n",
" <td>D</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>Alex Taylor</td>\n",
" <td>D</td>\n",
" <td>1</td>\n",
" <td>Brett Daniels</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>Brett Daniels</td>\n",
" <td>C</td>\n",
" <td>1</td>\n",
" <td>John Doe</td>\n",
" <td>A</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>Brett Daniels</td>\n",
" <td>C</td>\n",
" <td>1</td>\n",
" <td>Jane Smith</td>\n",
" <td>B</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>Brett Daniels</td>\n",
" <td>C</td>\n",
" <td>1</td>\n",
" <td>Alex Taylor</td>\n",
" <td>D</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>Brett Daniels</td>\n",
" <td>C</td>\n",
" <td>1</td>\n",
" <td>Brett Daniels</td>\n",
" <td>C</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name_x category_x dummy name_y category_y\n",
"0 John Doe A 1 John Doe A\n",
"1 John Doe A 1 Jane Smith B\n",
"2 John Doe A 1 Alex Taylor D\n",
"3 John Doe A 1 Brett Daniels C\n",
"4 Jane Smith B 1 John Doe A\n",
"5 Jane Smith B 1 Jane Smith B\n",
"6 Jane Smith B 1 Alex Taylor D\n",
"7 Jane Smith B 1 Brett Daniels C\n",
"8 Alex Taylor D 1 John Doe A\n",
"9 Alex Taylor D 1 Jane Smith B\n",
"10 Alex Taylor D 1 Alex Taylor D\n",
"11 Alex Taylor D 1 Brett Daniels C\n",
"12 Brett Daniels C 1 John Doe A\n",
"13 Brett Daniels C 1 Jane Smith B\n",
"14 Brett Daniels C 1 Alex Taylor D\n",
"15 Brett Daniels C 1 Brett Daniels C"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def all_pairs(df):\n",
" new_df = df.copy() # make a copy of the data frame\n",
" new_df['dummy'] = 1\n",
" \n",
" return new_df.merge(new_df, on='dummy')\n",
"\n",
"all_pairs(table1_df)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name_x</th>\n",
" <th>category_x</th>\n",
" <th>dummy</th>\n",
" <th>name_y</th>\n",
" <th>category_y</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>John Doe</td>\n",
" <td>A</td>\n",
" <td>1</td>\n",
" <td>Jane Smith</td>\n",
" <td>B</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>John Doe</td>\n",
" <td>A</td>\n",
" <td>1</td>\n",
" <td>Alex Taylor</td>\n",
" <td>D</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>John Doe</td>\n",
" <td>A</td>\n",
" <td>1</td>\n",
" <td>Brett Daniels</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Jane Smith</td>\n",
" <td>B</td>\n",
" <td>1</td>\n",
" <td>Alex Taylor</td>\n",
" <td>D</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Jane Smith</td>\n",
" <td>B</td>\n",
" <td>1</td>\n",
" <td>Brett Daniels</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>Brett Daniels</td>\n",
" <td>C</td>\n",
" <td>1</td>\n",
" <td>Alex Taylor</td>\n",
" <td>D</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name_x category_x dummy name_y category_y\n",
"1 John Doe A 1 Jane Smith B\n",
"2 John Doe A 1 Alex Taylor D\n",
"3 John Doe A 1 Brett Daniels C\n",
"6 Jane Smith B 1 Alex Taylor D\n",
"7 Jane Smith B 1 Brett Daniels C\n",
"14 Brett Daniels C 1 Alex Taylor D"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pair_df = all_pairs(table1_df)\n",
"pair_df[pair_df['name_x'] > pair_df['name_y']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Self-Join\n",
"There are even scenarios when you might want to join a table with itself! Consider the following example."