Commit 7da0fcb6 by Sanjay Krishnan

Clean up for the new quarter

parent d4e2575b
@@ -19,11 +19,11 @@ Git is installed on all of the CSIL computers, and to install git on your machine,
[https://git-scm.com/book/en/v2/Getting-Started-Installing-Git]
Every student in the class has a git repository (a place where you can store completed assignments). This git repository can be accessed from:
[https://mit.cs.uchicago.edu/cmsc13600-spr-21/<your cnetid>.git]
The first thing to do is to open your terminal application, and ``clone`` this repository (NOTE skr is ME, replace it with your CNET id!!!):
```
$ git clone https://mit.cs.uchicago.edu/cmsc13600-spr-21/skr.git cmsc13600-submit
```
Your username is your CNET ID and your password is your CNET password. This will create a new, empty folder titled cmsc13600-submit. There is similarly a course repository where all of the homework materials will be stored. You should clone this repository as well:
```
@@ -45,4 +45,4 @@ After adding your files, to submit your code you must run:
$ git commit -m"My submission"
$ git push
```
We will NOT grade any code that is not added, committed, and pushed to your submission repository. You can confirm your submission by visiting the web interface: [https://mit.cs.uchicago.edu/cmsc13600-spr-21/skr]
# Homework 1. Introduction to Data Extraction
In this assignment, you will extract meaningful information from unstructured data.
Due Date: *Friday April 9, 2021 11:59 pm*
## Initial Setup
These initial setup instructions assume you've done ``hw0``. Before you start an assignment you should sync your cloned repository with the online one:
@@ -10,9 +10,9 @@ $ cd cmsc13600-materials
$ git pull
```
Copy the folder ``hw1`` to your submission repository. Enter that repository from the command line and enter the copied ``hw1`` folder. In this homework assignment, you will only modify ``extract.py``. Once you are done, you must add ``extract.py`` to git:
```
$ git add extract.py
```
After adding your files, to submit your code you must run:
```
@@ -21,65 +21,88 @@ $ git push
```
We will NOT grade any code that is not added, committed, and pushed to your submission repository. You can confirm your submission by visiting the web interface: [https://mit.cs.uchicago.edu/cmsc13600-spr-21/skr]
## Background
RSS is a web standard that allows users and applications to access updates to websites in a standardized, computer-readable format. These feeds can, for example, allow a user to keep track of many different websites in a single news aggregator. The news aggregator will automatically check the RSS feed for new content, allowing updates to be passed automatically from website to website or from website to user.
Recent events have shown how important tracking online media is for financial markets.
The instructions in this assignment are purposefully incomplete; you are expected to read Python's documentation and figure out how the different functions work. All of the necessary parts that you need to write are marked with *TODO*.
In this project, you will scan through a series of reddit posts and count how frequently certain stock ticker symbols are mentioned. You're essentially implementing one important function that scans through the posts in an XML file and returns the number of posts in which each ticker symbol occurs:
```
>>> count_ticker('reddit.xml')
{'$HITI': 1, '$GME': 3, '$MSFT': 1, '$ISWH': 1, '$ARBKF': 1, '$HCANF': 1, '$AMC': 1, '$OZOP': 1, '$VMNT': 2, '$CLIS': 1, '$EEENF': 2, '$GTII': 1}
```
However, before we get there, we will break the implementation up into a few smaller parts.
## Data Files
You are given a data file to process, `reddit.xml`. This file contains an RSS feed taken from a few Reddit pages covering stocks. RSS feeds are stored in a semi-structured format called XML. XML defines a tree of elements, where you have items and subitems (which can be named). This example is shamelessly taken from this link: https://stackabuse.com/reading-and-writing-xml-files-in-python/
```
<data>
<items>
<item name="item1">item1abc</item>
<item name="item2">item2abc</item>
</items>
</data>
```
These tags can be extracted using built-in modules in most programming languages. Let's see what happens when we process this with Python. I stored the above data in a test file, ``test.xml`` (also included). We can first try to extract the *item* tags.
```
from xml.dom import minidom
mydoc = minidom.parse('test.xml')
items = mydoc.getElementsByTagName('item')
for elem in items:
    print(elem.firstChild.data)
```
The code above: (1) gets all the tags labeled *item*, (2) then iterates over those items, and (3) gets the child data, i.e. the data contained between `> * </`. The output is:
```
item1abc
item2abc
```
If I wanted to grab the names instead, I could write the following code:
```
from xml.dom import minidom
mydoc = minidom.parse('test.xml')
items = mydoc.getElementsByTagName('item')
for elem in items:
    print(elem.attributes['name'].value)
```
The code above: (1) gets all the tags labeled *item*, (2) then iterates over those items, and (3) gets the attribute data, i.e. the data contained in `< * attr=value >`.
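The attribute-based approach matters when a tag has no child text at all. For example, the ``<link href="..." />`` elements in ``reddit.xml`` are self-closing, so the URL lives in the ``href`` attribute rather than in ``firstChild``. Here is a minimal sketch of that pattern (the tag and attribute names are taken from the provided feed; adapt it to your own code):
```
from xml.dom import minidom

# Sketch: pull the URL out of self-closing <link href="..."/> elements.
# Note: this prints every <link> in the feed, including the feed-level
# ones, not just the per-post links.
mydoc = minidom.parse('reddit.xml')
for elem in mydoc.getElementsByTagName('link'):
    if elem.hasAttribute('href'):
        print(elem.attributes['href'].value)
```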
### TODO 1. Extract Title, Links, and Post Times
Your first todo will be to use the examples above to extract *titles*,
```
<title>$EEENF Share Price Valuation Model | Low Range Estimate increase of 1700% in Current Share Price Equaling $0.49 Per Share| Average Range Estimate Increase in Current Share Price of 3600% Equaling $1.05 Per Share</title>
```
then extract *links* (only extract the URL),
```
<link href="https://www.reddit.com/r/pennystocks/comments/mdexsk/eeenf_share_price_valuation_model_low_range/" />
```
and *post times* for each reddit post in the RSS feed:
```
<updated>2021-03-26T02:44:10+00:00</updated>
```
It is up to you to read the documentation on the Python xml module if you are confused about how to use it. You must write a helper function:
```
def _reddit_extract(file)
```
that returns a Pandas DataFrame with three columns (*title*, *link*, *updated*). On `reddit.xml` your output should be a 25-row, 3-column DataFrame.
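If you have not used Pandas before, here is a minimal sketch of just the DataFrame-construction step, assuming you have already parsed the feed into parallel Python lists (the variable names and single placeholder row below are illustrative only):
```
import pandas as pd

# Sketch: build a DataFrame with the expected column names from
# parallel lists of extracted values (placeholder values shown here).
titles = ['$GME to the Moon!']
links = ['https://www.reddit.com/r/wallstreetbets2/comments/md7zya/gme_to_the_moon/']
updates = ['2021-03-25T20:46:30+00:00']

df = pd.DataFrame({'title': titles, 'link': links, 'updated': updates})
print(df.shape)  # (1, 3) here; with all 25 posts extracted it should be (25, 3)
```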
### TODO 2. Extract Ticker Symbols
Each title of a reddit post might mention a stock of interest, and most posts use a consistent format to denote a ticker symbol (starting with a dollar sign). For example: "$ISWH Takes Center Stage at Crypto Conference". You will now write a function called ``_ticker_extract`` which, given a single title, extracts all of the ticker symbols present in the title:
```
def _ticker_extract(title)
```
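As a rough illustration of one possible approach (a sketch only, not necessarily the intended solution), you could split the title on whitespace and keep the tokens that start with a dollar sign; the helper name below is made up for this example:
```
def _ticker_extract_sketch(title):
    # Sketch: collect whitespace-separated tokens that begin with '$',
    # stripping common trailing punctuation. Real titles may need more cleanup.
    tickers = set()
    for token in title.split():
        if token.startswith('$') and len(token) > 1:
            tickers.add(token.rstrip('.,!?:;)'))
    return tickers

print(_ticker_extract_sketch("The quick $BRWN fox jumped over the $LZY $DAWG"))
# {'$BRWN', '$LZY', '$DAWG'}
```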
### TODO 3. Count Ticker Frequency
Using the two helper functions you defined above, you will finally count the frequency (the number of posts) in which each ticker symbol occurs.
```
def count_ticker(file)
```
Your result should be a dictionary mapping each ticker to its count, and should look as follows:
```
>>> count_ticker('reddit.xml')
{'$HITI': 1, '$GME': 3, '$MSFT': 1, '$ISWH': 1, '$ARBKF': 1, '$HCANF': 1, '$AMC': 1, '$OZOP': 1, '$VMNT': 2, '$CLIS': 1, '$EEENF': 2, '$GTII': 1}
```
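A minimal sketch of how the pieces could fit together, assuming your two helpers behave as described above (this is an outline, not the required implementation):
```
def count_ticker_sketch(file):
    # Sketch: count the number of posts whose title mentions each ticker.
    df = _reddit_extract(file)
    counts = {}
    for title in df['title']:
        for ticker in _ticker_extract(title):
            counts[ticker] = counts.get(ticker, 0) + 1
    return counts
```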
## Testing
We've provided a sample dataset ``reddit.xml`` which can be used to test your code by seeing if you can reproduce the output above. We have also provided an autograder that will check for some basic issues.
from extract import _reddit_extract, \
                    _ticker_extract, \
                    count_ticker

def testExtraction():
    df = _reddit_extract('reddit.xml')
    try:
        cols = list(df.columns.values)
        cols.sort()
    except:
        return "[ERROR] The output of _reddit_extract doesn't look like a dataframe"
    if cols != ['link', 'title', 'updated']:
        return "[ERROR] Expected ['link', 'title', 'updated'], got = " + str(cols)
    N = len(df)
    if N != 25:
        return "[ERROR] Seems like you have the wrong number of rows"
    return "[PASSED] testExtraction()"

def testTicker():
    ticker = "The quick $BRWN fox jumped over the $LZY $DAWG"
    val = _ticker_extract(ticker)
    expected = set(['$BRWN', '$LZY', '$DAWG'])
    if val != expected:
        return "[ERROR] Expected set(['$BRWN','$LZY','$DAWG']), got = " + str(val)
    return "[PASSED] testTicker()"

def testAll():
    expected = {'$HITI': 1, '$GME': 3, '$MSFT': 1, '$ISWH': 1,
                '$ARBKF': 1, '$HCANF': 1, '$AMC': 1, '$OZOP': 1,
                '$VMNT': 2, '$CLIS': 1, '$EEENF': 2, '$GTII': 1}
    val = count_ticker('reddit.xml')
    if expected != val:
        return "[ERROR] Expected " + str(expected) + " , got = " + str(val)
    return "[PASSED] testAll()"

print(testExtraction())
print(testTicker())
print(testAll())
'''extract.py
In this first assignment, you will learn the basics of Python
data manipulation. You will process an XML file of reddit posts
relating to stocks and count the frequency with which certain
ticker symbols appear.
'''
#Two libraries that we will need for the helper functions
import pandas as pd
from xml.dom import minidom
'''count_ticker is the main function that you will implement.
Input: filename of a reddit RSS feed in XML format
Output: dictionary mapping
ticker symbols => frequency of occurrence in the post titles
Example Usage:
>>> count_ticker('reddit.xml')
{'$HITI': 1, '$GME': 3, '$MSFT': 1,
'$ISWH': 1, '$ARBKF': 1, '$HCANF': 1,
'$AMC': 1, '$OZOP': 1, '$VMNT': 2,
'$CLIS': 1, '$EEENF': 2, '$GTII': 1}
'''
def count_ticker(file):
    raise ValueError('Count Ticker Not Implemented')
# TODO1 Helper Function to Extract XML
'''_reddit_extract is a helper function that extracts
the post title, timestamp, and link into a pandas dataframe.
Input: filename of a reddit RSS feed in XML format
Output: 3 col pandas dataframe ('title', 'updated', 'link')
with each row a reddit post from the RSS XML file.
'''
def _reddit_extract(file):
    raise ValueError('Reddit Extract Not Implemented')
# TODO2 Helper Function to Extract Tickers
'''_ticker_extract is a helper function that extracts
the mentioned ticker symbols in each title.
Input: string representing a post title
Output: set of ticker symbols mentioned each in consistent
notation $XYZ
'''
def _ticker_extract(title):
    raise ValueError('Ticker Extract Not Implemented')
<?xml version="1.0" encoding="UTF-8"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/"><category term="multi" label="r/multi"/><updated>2021-03-26T16:26:15+00:00</updated><id>/r/pennystocks+investing+wallstreetbets2+stocks/.rss</id><link rel="self" href="https://www.reddit.com/r/pennystocks+investing+wallstreetbets2+stocks/.rss" type="application/atom+xml" /><link rel="alternate" href="https://www.reddit.com/r/pennystocks+investing+wallstreetbets2+stocks/" type="text/html" /><title>posts from investing, stocks, pennystocks, wallstreetbets2</title><entry><author><name>/u/juaggo_</name><uri>https://www.reddit.com/user/juaggo_</uri></author><category term="stocks" label="r/stocks"/><content type="html">&lt;!-- SC_OFF --&gt;&lt;div class=&quot;md&quot;&gt;&lt;p&gt;So lately the market has been going down and people might have gotten some bloody days in their portfolios. The correction has affected tech the most as the Nasdaq is about 8% from its all time highs.&lt;/p&gt; &lt;p&gt;The correction has happened because of number one: Rising treasury yields and number two: Sector rotation. Reopening plays are currently the trend that big money likes and money has gone there recently. &lt;/p&gt; &lt;p&gt;This doesn’t mean that tech is bad in the long term. Stocks go down sometimes and this is the moment that it’s happening. But there is a silver lining to this story...&lt;/p&gt; &lt;p&gt;This gives us a good opportunity get your favourite stocks at a cheaper price. Averaging down is a very delightful thing to do and this is a perfect opportunity. And even if we continue to go down, it’s ok, since you can average down even more.&lt;/p&gt; &lt;p&gt;Another thing that I want to say is that you shouldn’t listen to the media too much. It’s their job to create havoc and drama in the stock market. Their opinions change every week almost, and it’s kinda funny sometimes. One week they say that you shouldn’t sell and another day CNBC reporters tell us how big tech is in a bad place and you should move to industrials, travel, etc. &lt;/p&gt; &lt;p&gt;You have YOUR own plan. Do your plan and don’t listen to those whose job is to dramatize things. The stock market needs patience. Investing is for the long run.&lt;/p&gt; &lt;p&gt;Don’t look at the 1 day chart all the time. It can be very toxic for yourself, especially during a red day. So just chill and remember that your time horizon is in 10 years, not tomorrow.&lt;/p&gt; &lt;p&gt;That’s my 2 cents, have good one everyone!&lt;/p&gt; &lt;/div&gt;&lt;!-- SC_ON --&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/juaggo_&quot;&gt; /u/juaggo_ &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/stocks/&quot;&gt; r/stocks &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/stocks/comments/mdn3nz/tech_is_tanking_at_the_moment_but_it_will_come/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/stocks/comments/mdn3nz/tech_is_tanking_at_the_moment_but_it_will_come/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_mdn3nz</id><link href="https://www.reddit.com/r/stocks/comments/mdn3nz/tech_is_tanking_at_the_moment_but_it_will_come/" /><updated>2021-03-26T12:04:26+00:00</updated><title>Tech is tanking at the moment, but it will come back up eventually. 
Don’t listen to the big media platforms too much!</title></entry><entry><author><name>/u/vanchman11</name><uri>https://www.reddit.com/user/vanchman11</uri></author><category term="investing" label="r/investing"/><content type="html">&lt;!-- SC_OFF --&gt;&lt;div class=&quot;md&quot;&gt;&lt;p&gt;&lt;a href=&quot;https://www.cnbc.com/2021/03/26/sofi-to-give-amateur-investors-early-access-to-ipos-in-break-from-wall-street-tradition-.html&quot;&gt;https://www.cnbc.com/2021/03/26/sofi-to-give-amateur-investors-early-access-to-ipos-in-break-from-wall-street-tradition-.html&lt;/a&gt;&lt;/p&gt; &lt;p&gt;Online finance start-up SoFi is lowering the barrier for amateur investors to buy shares of companies as they go public.&lt;/p&gt; &lt;p&gt;These IPO shares have historically been set aside for Wall Street&amp;#39;s institutional investors or high-net worth individuals. Retail traders don&amp;#39;t have a way to buy into newly listed companies until those shares begin actually trading on the exchange. By that time, the price has often gapped higher.&lt;/p&gt; &lt;p&gt;&amp;quot;Main Street will have access to investing in a way they wouldn&amp;#39;t have before,&amp;quot; SoFi CEO Anthony Noto said in a phone interview. &amp;quot;It gives more differentiation, and more access so people can build diversified portfolios.&amp;quot;&lt;/p&gt; &lt;p&gt;SoFi itself will be an underwriter in these deals, meaning it works with companies to determine a share price, buys securities from the issuer then sells them back to certain investors. It&amp;#39;s common for brokerage firms to get a portion of IPO shares in that process. But they don&amp;#39;t typically offer them to the everyday investor.&lt;/p&gt; &lt;p&gt;Noto worked on more than 50 IPOs, including Twitter&amp;#39;s debut, in his former role as partner and head of the technology media and telecom group at Goldman Sachs. Firms like Goldman generate revenue from Wall Street funds, which often choose to get in on an IPO &amp;quot;based on the access they get to that unique product,&amp;quot; he said.&lt;/p&gt; &lt;p&gt;&amp;quot;Individual investors don&amp;#39;t generate those types of revenues, therefore they don&amp;#39;t get access to the unique product,&amp;quot; Noto said. 
&amp;quot;The cost of serving retail, if they did decide to do that, would be too high.&amp;quot;&lt;/p&gt; &lt;p&gt;SoFi clients who have at least $3,000 in account value will be able enter the amount of shares they want as a &amp;quot;reservation.&amp;quot; The app will alert them when it&amp;#39;s time to confirm an order.&lt;/p&gt; &lt;/div&gt;&lt;!-- SC_ON --&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/vanchman11&quot;&gt; /u/vanchman11 &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/investing/&quot;&gt; r/investing &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/investing/comments/mdppti/sofi_to_give_amateur_investors_early_access_to/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/investing/comments/mdppti/sofi_to_give_amateur_investors_early_access_to/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_mdppti</id><link href="https://www.reddit.com/r/investing/comments/mdppti/sofi_to_give_amateur_investors_early_access_to/" /><updated>2021-03-26T14:20:52+00:00</updated><title>SoFi to give amateur investors early access to IPOs</title></entry><entry><author><name>/u/Muznick</name><uri>https://www.reddit.com/user/Muznick</uri></author><category term="pennystocks" label="r/pennystocks"/><content type="html">&lt;!-- SC_OFF --&gt;&lt;div class=&quot;md&quot;&gt;&lt;p&gt;Another day another store opened! &lt;/p&gt; &lt;p&gt;CALGARY, AB, March 26, 2021 /CNW/ - High Tide Inc. (&amp;quot;High Tide&amp;quot; or the &amp;quot;Company&amp;quot;) (TSXV: $HITI) (OTCQB: $HITIF) (FRA: $2LY), a retail-focused cannabis corporation enhanced by the manufacturing and distribution of consumption accessories, announced today that its new Canna Cabana retail store, located at 3505 Upper Middle Road, Unit D3, in Burlington, Ontario, has begun selling recreational cannabis products for adult use. The new store represents High Tide&amp;#39;s 80th branded retail location across Canada selling recreational cannabis products and consumption accessories, and the Company&amp;#39;s eighth new organically built store in the month of March alone. The new Burlington store is strategically located within a popular commercial plaza with a major grocery anchor and several national restaurant chains nearby.&lt;/p&gt; &lt;p&gt;&amp;quot;The new store is another step towards our commitment of reaching 30 branded retail locations within Ontario by September of this year. I am so proud of the work our team has put into launching eight new stores this month alone. March has been the busiest month in terms of new store openings since High Tide&amp;#39;s inception,&amp;quot; said Raj Grover, President and Chief Executive Officer of High Tide. &amp;quot;Continued expansion in Canada&amp;#39;s largest province remains a core part of High Tide&amp;#39;s organic growth strategy. We will continue to execute this strategy by bringing our one-stop cannabis shop concept to high traffic areas like the new Upper Middle Store in Burlington,&amp;quot; added Mr. Grover.&lt;/p&gt; &lt;p&gt;About High Tide Inc.&lt;/p&gt; &lt;p&gt;High Tide is a retail-focused cannabis company enhanced by the manufacturing and distribution of consumption accessories. The Company is the largest Canadian retailer of recreational cannabis as measured by revenue, with 80 branded retail locations spanning Ontario, Alberta, Manitoba and Saskatchewan. 
High Tide&amp;#39;s retail segment features the Canna Cabana, KushBar, Meta Cannabis Co., Meta Cannabis Supply Co. and NewLeaf Cannabis banners, with additional locations under development across the country. High Tide has been serving consumers for over a decade through its numerous consumption accessory businesses including e-commerce platforms Grasscity.com, Smokecartel.com and CBDcity.com, and its wholesale distribution division under Valiant Distribution, including the licensed entertainment product manufacturer Famous Brandz. High Tide&amp;#39;s strategy as a parent company is to extend and strengthen its integrated value chain, while providing a complete customer experience and maximizing shareholder value. Key industry investors in High Tide include Aphria Inc. (TSX: $APHA) (NYSE: $APHA) and Aurora Cannabis Inc. (NYSE: $ACB) (TSX: $ACB).&lt;/p&gt; &lt;p&gt;Neither the TSX Venture Exchange (the &amp;quot;TSXV&amp;quot;) nor its Regulation Services Provider (as that term is defined in the policies of the TSXV) accepts responsibility for the adequacy or accuracy of this release.&lt;/p&gt; &lt;/div&gt;&lt;!-- SC_ON --&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/Muznick&quot;&gt; /u/Muznick &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/pennystocks/&quot;&gt; r/pennystocks &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/mdlyq8/guess_what_high_tide_opened_a_new_store/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/mdlyq8/guess_what_high_tide_opened_a_new_store/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_mdlyq8</id><link href="https://www.reddit.com/r/pennystocks/comments/mdlyq8/guess_what_high_tide_opened_a_new_store/" /><updated>2021-03-26T10:51:46+00:00</updated><title>Guess what, $HITI opened a new store.</title></entry><entry><author><name>/u/TobiasKGordon</name><uri>https://www.reddit.com/user/TobiasKGordon</uri></author><category term="wallstreetbets2" label="r/wallstreetbets2"/><content type="html">&amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/TobiasKGordon&quot;&gt; /u/TobiasKGordon &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/wallstreetbets2/&quot;&gt; r/wallstreetbets2 &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://i.redd.it/y6cvhu4uk8p61.jpg&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/wallstreetbets2/comments/md7zya/gme_to_the_moon/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_md7zya</id><link href="https://www.reddit.com/r/wallstreetbets2/comments/md7zya/gme_to_the_moon/" /><updated>2021-03-25T20:46:30+00:00</updated><title>$GME to the Moon!</title></entry><entry><author><name>/u/DangerStranger138</name><uri>https://www.reddit.com/user/DangerStranger138</uri></author><category term="wallstreetbets2" label="r/wallstreetbets2"/><content type="html">&amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/DangerStranger138&quot;&gt; /u/DangerStranger138 &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/wallstreetbets2/&quot;&gt; r/wallstreetbets2 &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://markets.businessinsider.com/news/stocks/mormon-church-fund-sells-big-tech-buys-tesla-gamestop-stock-2021-3-1030242012&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; 
&lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/wallstreetbets2/comments/mdhn4z/mormons_love_the_stocks/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_mdhn4z</id><link href="https://www.reddit.com/r/wallstreetbets2/comments/mdhn4z/mormons_love_the_stocks/" /><updated>2021-03-26T05:34:38+00:00</updated><title>Mormons love the stocks $GME</title></entry><entry><author><name>/u/coolcomfort123</name><uri>https://www.reddit.com/user/coolcomfort123</uri></author><category term="stocks" label="r/stocks"/><content type="html">&lt;!-- SC_OFF --&gt;&lt;div class=&quot;md&quot;&gt;&lt;p&gt;&lt;a href=&quot;https://www.cnbc.com/2021/03/09/microsoft-closes-bethesda-acquisition-aiming-to-take-on-sony.html&quot;&gt;https://www.cnbc.com/2021/03/09/microsoft-closes-bethesda-acquisition-aiming-to-take-on-sony.html&lt;/a&gt;&lt;/p&gt; &lt;p&gt;$MSFT has closed its $7.5 billion acquisition of ZeniMax, the parent company of Bethesda.&lt;/p&gt; &lt;p&gt;Microsoft confirmed that some new Bethesda games would be exclusive to Xbox consoles and PCs.&lt;/p&gt; &lt;p&gt;The firm has often been seen as lagging behind Sony when it comes to major first-party releases.&lt;/p&gt; &lt;p&gt;This is a positive news as msft could improve the gaming business. It will be more competitive to sony and able to generate more subscription revenue. The stock is trading around $230 and it is an attractive entry point for long term investors.&lt;/p&gt; &lt;/div&gt;&lt;!-- SC_ON --&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/coolcomfort123&quot;&gt; /u/coolcomfort123 &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/stocks/&quot;&gt; r/stocks &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/stocks/comments/mdnzup/microsoft_closes_75_billion_bethesda_acquisition/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/stocks/comments/mdnzup/microsoft_closes_75_billion_bethesda_acquisition/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_mdnzup</id><link href="https://www.reddit.com/r/stocks/comments/mdnzup/microsoft_closes_75_billion_bethesda_acquisition/" /><updated>2021-03-26T12:55:18+00:00</updated><title>$MSFT closes $7.5 billion Bethesda acquisition, aiming to take on Sony with exclusive games</title></entry><entry><author><name>/u/TheSubwayTrader</name><uri>https://www.reddit.com/user/TheSubwayTrader</uri></author><category term="wallstreetbets2" label="r/wallstreetbets2"/><content type="html">&amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/TheSubwayTrader&quot;&gt; /u/TheSubwayTrader &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/wallstreetbets2/&quot;&gt; r/wallstreetbets2 &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://i.redd.it/ml6fd7gam6p61.gif&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/wallstreetbets2/comments/mcz3z9/amc_10_9_8_7_6_5_4_3_2_1/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_mcz3z9</id><link href="https://www.reddit.com/r/wallstreetbets2/comments/mcz3z9/amc_10_9_8_7_6_5_4_3_2_1/" /><updated>2021-03-25T14:09:57+00:00</updated><title>AMC 10, 9, 8, 7, 6, 5, 4, 3, 2, 1</title></entry><entry><author><name>/u/Hichek2</name><uri>https://www.reddit.com/user/Hichek2</uri></author><category term="investing" label="r/investing"/><content type="html">&lt;!-- SC_OFF --&gt;&lt;div class=&quot;md&quot;&gt;&lt;p&gt;&lt;a 
href=&quot;https://www.cnbc.com/2021/03/25/fed-says-banks-will-have-to-wait-until-june-30-to-start-issuing-buybacks-and-bigger-dividends.html&quot;&gt;https://www.cnbc.com/2021/03/25/fed-says-banks-will-have-to-wait-until-june-30-to-start-issuing-buybacks-and-bigger-dividends.html&lt;/a&gt;&lt;/p&gt; &lt;p&gt;Fed says banks will have to wait until June 30 to start issuing buybacks and bigger dividends&lt;/p&gt; &lt;ul&gt; &lt;li&gt;&lt;strong&gt;Big banks will be allowed to resume normal levels of dividend payouts and share repurchases as of June 30, as long as they pass this year’s stress test.&lt;/strong&gt;&lt;/li&gt; &lt;li&gt;&lt;strong&gt;Payouts had been restricted based on income, as a precautionary move during the Covid-19 pandemic.&lt;/strong&gt;&lt;/li&gt; &lt;li&gt;&lt;strong&gt;Banks that fail the stress test will have to wait until Sept. 30, and face even more stringent measures if they still don’t meet capital requirements by then.&lt;/strong&gt;&lt;/li&gt; &lt;/ul&gt; &lt;p&gt;Banks will be able to accelerate dividends and buybacks to shareholders this year, but not until June 30 and provided they pass the current round of stress tests, the Federal Reserve announced on Thursday.&lt;/p&gt; &lt;p&gt;The biggest Wall Street institutions have been limited based on income in their ability to do both for nearly the past year as a precautionary measure during &lt;a href=&quot;https://www.cnbc.com/2021/03/25/covid-live-updates.html&quot;&gt;the Covid-19 pandemic&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;The Fed had said late last year that it would begin allowing regular disbursements in the first quarter of 2021, so the Thursday announcement pushes that date back.&lt;/p&gt; &lt;p&gt;ADVERTISING&lt;/p&gt; &lt;p&gt;“The banking system continues to be a source of strength and returning to our normal framework after this year’s stress test will preserve that strength,” Vice Chair for Supervision Randal Quarles said in a statement.&lt;/p&gt; &lt;p&gt;Bank stocks rose in after-hours trading on the news, with Wells Fargo and $JPM up around 1%.&lt;/p&gt; &lt;p&gt;Lifting the restrictions only applies to institutions that maintain proper capital levels as evaluated through the stress tests. Under normal circumstances, capital distributions are guided by a bank’s “stress capital buffer,” a measure of capital that each bank should carry based on the riskiness of its holdings.&lt;/p&gt; &lt;p&gt;The income-based measures &lt;a href=&quot;https://www.cnbc.com/2020/12/18/fed-to-allow-big-banks-to-resume-share-buybacks-with-limitations.html&quot;&gt;were put in place&lt;/a&gt; as a safeguard to make sure banks had enough capital as the pandemic tore through the U.S. economy.&lt;/p&gt; &lt;p&gt;Any bank not reaching the target will have the pandemic-era restrictions reimposed until Sept. 30. Banks that still can’t meet the required capital levels will face even stricter limitations.&lt;/p&gt; &lt;p&gt;The financial sector is one of the stock market’s leaders this year, with the group up 14.7% year to date on the S&amp;amp;P 500. 
People’s United, Fifth Third and Wells Fargo have led the banking space.&lt;/p&gt; &lt;p&gt;The announcement comes a day after Treasury Secretary Janet Yellen, who chaired the Fed from 2014-18, said &lt;a href=&quot;https://www.cnbc.com/2021/03/24/yellen-supports-buybacks-warren-wants-blackrock-deemed-too-big-to-fail.html&quot;&gt;she would be comfortable&lt;/a&gt; with lifting the restrictions on dividends and buybacks.&lt;/p&gt; &lt;p&gt;At a congressional hearing Wednesday, Yellen said she agreed both with the decision to suspend capital disbursements, and to resume them.&lt;/p&gt; &lt;p&gt;“I have been opposed earlier when we were very concerned about the situation the banks would face about stock buybacks,” Yellen said. “But financial institutions look healthier now, and I believe they should have some of the liberty provided by the rules to make returns to shareholders.”&lt;/p&gt; &lt;p&gt;Banks bought back just $80.7 billion of their shares in 2020, with most coming before the pandemic hit.&lt;/p&gt; &lt;/div&gt;&lt;!-- SC_ON --&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/Hichek2&quot;&gt; /u/Hichek2 &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/investing/&quot;&gt; r/investing &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/investing/comments/md937s/fed_says_banks_will_have_to_wait_until_june_30_to/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/investing/comments/md937s/fed_says_banks_will_have_to_wait_until_june_30_to/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_md937s</id><link href="https://www.reddit.com/r/investing/comments/md937s/fed_says_banks_will_have_to_wait_until_june_30_to/" /><updated>2021-03-25T21:37:04+00:00</updated><title>Fed says banks will have to wait until June 30 to start issuing buybacks and bigger dividends</title></entry><entry><author><name>/u/TheSubwayTrader</name><uri>https://www.reddit.com/user/TheSubwayTrader</uri></author><category term="wallstreetbets2" label="r/wallstreetbets2"/><content type="html">&amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/TheSubwayTrader&quot;&gt; /u/TheSubwayTrader &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/wallstreetbets2/&quot;&gt; r/wallstreetbets2 &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://i.redd.it/cx9w20i3l7p61.gif&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/wallstreetbets2/comments/md3jio/gme_roller_coaster_of_love/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_md3jio</id><link href="https://www.reddit.com/r/wallstreetbets2/comments/md3jio/gme_roller_coaster_of_love/" /><updated>2021-03-25T17:25:17+00:00</updated><title>GME Roller Coaster of Love</title></entry><entry><author><name>/u/TheSubwayTrader</name><uri>https://www.reddit.com/user/TheSubwayTrader</uri></author><category term="wallstreetbets2" label="r/wallstreetbets2"/><content type="html">&lt;!-- SC_OFF --&gt;&lt;div class=&quot;md&quot;&gt;&lt;p&gt;&lt;strong&gt;EL CAJON, CA / ACCESSWIRE / March 25, 2021 /&lt;/strong&gt; &lt;a href=&quot;https://pr.report/DoRzBe1j&quot;&gt;Solar Integrated Roofing Corp.&lt;/a&gt; (OTC Pink:&lt;a href=&quot;https://marketwirenews.com/stock/sirc/&quot;&gt;SIRC&lt;/a&gt;), an integrated, single-source solar power and roofing systems installation company, today provided a corporate update on near-term operational 
and capital markets milestone achievements.&lt;/p&gt; &lt;p&gt;&amp;quot;As we transition into a national brand with various portfolio companies across the country, we will seek to uplist to the $OTCQB in the near-term with a goal of uplisting to Nasdaq thereafter,&amp;quot; said David Massey, Chief Executive Officer of Solar Integrated Roofing Corp. &amp;quot;This marks a new era for our shareholders as we continue to mature and improve our prestige within the capital markets community.&lt;/p&gt; &lt;p&gt;&lt;a href=&quot;https://marketwirenews.com/news-releases/isw-holdings-to-take-center-stage-at-prestigious-min-5665692495868452.html&quot;&gt;https://marketwirenews.com/news-releases/isw-holdings-to-take-center-stage-at-prestigious-min-5665692495868452.html&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&amp;#x200B;&lt;/p&gt; &lt;p&gt;#parler #stocks #pennystocks&lt;/p&gt; &lt;/div&gt;&lt;!-- SC_ON --&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/TheSubwayTrader&quot;&gt; /u/TheSubwayTrader &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/wallstreetbets2/&quot;&gt; r/wallstreetbets2 &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/wallstreetbets2/comments/mdny9i/iswh_takes_center_stage_at_crypto_conference/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/wallstreetbets2/comments/mdny9i/iswh_takes_center_stage_at_crypto_conference/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_mdny9i</id><link href="https://www.reddit.com/r/wallstreetbets2/comments/mdny9i/iswh_takes_center_stage_at_crypto_conference/" /><updated>2021-03-26T12:53:01+00:00</updated><title>$ISWH Takes Center Stage at Crypto Conference 'Mining Disrupt'</title></entry><entry><author><name>/u/AutoModerator</name><uri>https://www.reddit.com/user/AutoModerator</uri></author><category term="wallstreetbets2" label="r/wallstreetbets2"/><content type="html">&lt;!-- SC_OFF --&gt;&lt;div class=&quot;md&quot;&gt;&lt;p&gt;Buy? Sell? Call? Put? Iron triangle? Steel curtain? MEAT CURTAIN? 
You tell us&lt;/p&gt; &lt;/div&gt;&lt;!-- SC_ON --&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/AutoModerator&quot;&gt; /u/AutoModerator &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/wallstreetbets2/&quot;&gt; r/wallstreetbets2 &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/wallstreetbets2/comments/mdkfns/daily_plays_positions_and_problems_thread/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/wallstreetbets2/comments/mdkfns/daily_plays_positions_and_problems_thread/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_mdkfns</id><link href="https://www.reddit.com/r/wallstreetbets2/comments/mdkfns/daily_plays_positions_and_problems_thread/" /><updated>2021-03-26T09:00:18+00:00</updated><title>Daily Plays, Positions, and Problems Thread!</title></entry><entry><author><name>/u/dufusoftheriver</name><uri>https://www.reddit.com/user/dufusoftheriver</uri></author><category term="pennystocks" label="r/pennystocks"/><content type="html">&lt;table&gt; &lt;tr&gt;&lt;td&gt; &lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/mcx9l9/some_weirdo_tried_to_sell_me_something_he/&quot;&gt; &lt;img src=&quot;https://a.thumbs.redditmedia.com/y6h3vi6WRca5QI0scXWgTZiIDq6XBzhuTy90rpWDCC4.jpg&quot; alt=&quot;Some weirdo tried to sell me something, he messaged me after I commented on a thread here.&quot; title=&quot;Some weirdo tried to sell me something, he messaged me after I commented on a thread here.&quot; /&gt; &lt;/a&gt; &lt;/td&gt;&lt;td&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/dufusoftheriver&quot;&gt; /u/dufusoftheriver &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/pennystocks/&quot;&gt; r/pennystocks &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/gallery/mcx9l9&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/mcx9l9/some_weirdo_tried_to_sell_me_something_he/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt; &lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</content><id>t3_mcx9l9</id><media:thumbnail url="https://a.thumbs.redditmedia.com/y6h3vi6WRca5QI0scXWgTZiIDq6XBzhuTy90rpWDCC4.jpg" /><link href="https://www.reddit.com/r/pennystocks/comments/mcx9l9/some_weirdo_tried_to_sell_me_something_he/" /><updated>2021-03-25T12:38:57+00:00</updated><title>Some weirdo tried to sell me something, he messaged me after I commented on a thread here.</title></entry><entry><author><name>/u/TowerTom</name><uri>https://www.reddit.com/user/TowerTom</uri></author><category term="pennystocks" label="r/pennystocks"/><content type="html">&lt;!-- SC_OFF --&gt;&lt;div class=&quot;md&quot;&gt;&lt;p&gt;&lt;a href=&quot;Https://www.proactiveinvestors.co.uk/companies/news/944988/argo-blockchain-enters-partnership-to-launch-first-clean-energy-bitcoin-mining-pool-944988.html&quot;&gt;Https://www.proactiveinvestors.co.uk/companies/news/944988/argo-blockchain-enters-partnership-to-launch-first-clean-energy-bitcoin-mining-pool-944988.html&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&amp;quot;Argo Blockchain PLC (LON: $ARB) (OTCQX: $ARBKF) said it has signed a memorandum of understanding (MOU) with DMG Blockchain Solutions, a blockchain and cryptocurrency technology firm, to launch the first Bitcoin mining pool powered exclusively by clean energy.&lt;/p&gt; &lt;p&gt;Under the terms of the deal, Argo and DMG will jointly launch Terra 
Pool, which will initially consist of both companies processing power which is mostly generated by hydroelectric resources&amp;quot;&lt;/p&gt; &lt;p&gt;I&amp;#39;ve been in since before it listed on OTC, first at £0.30 on LSE after finding it here and have averaged up on some dips to £0.85 - it&amp;#39;s almost 20% of my portfolio by now. &lt;/p&gt; &lt;p&gt;It&amp;#39;s held steady this past month, even with some btc dips it&amp;#39;s held well. More catalysts in the pipeline include their new facility in Texas and potential NASDAQ listing after earlier this year appointing same firm Riot used for better US stocks exposure. I see this matching if not at least getting half what Riot &amp;amp; Mara by eoy with how much bitcoin has and will be accepted further in the mainstream finance world now.&lt;/p&gt; &lt;/div&gt;&lt;!-- SC_ON --&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/TowerTom&quot;&gt; /u/TowerTom &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/pennystocks/&quot;&gt; r/pennystocks &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/mdmmk7/arbkf_arbl_argo_blockchain_to_launch_first/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/mdmmk7/arbkf_arbl_argo_blockchain_to_launch_first/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_mdmmk7</id><link href="https://www.reddit.com/r/pennystocks/comments/mdmmk7/arbkf_arbl_argo_blockchain_to_launch_first/" /><updated>2021-03-26T11:36:03+00:00</updated><title>$ARBKF £ARB.L - Argo Blockchain to launch first Bitcoin mining pool powered exclusively by clean energy.</title></entry><entry><author><name>/u/flobbley</name><uri>https://www.reddit.com/user/flobbley</uri></author><category term="stocks" label="r/stocks"/><content type="html">&lt;!-- SC_OFF --&gt;&lt;div class=&quot;md&quot;&gt;&lt;p&gt;I&amp;#39;ve seen a lot of comments along the lines of &amp;quot;tech stocks are down, but they&amp;#39;ll go back up, there&amp;#39;s no reason not to buy them&amp;quot; but that line of thinking is flawed. Of course tech stocks will go back up, but what&amp;#39;s important to your returns is if you think they will rise faster than whatever index you choose to compare them too. Because if they don&amp;#39;t, you could have just bought that index fund, had less risk, less work, and better returns. You should always be thinking about your stock purchases in the context of &amp;quot;Do I think this will beat the index?&amp;quot; &lt;em&gt;not&lt;/em&gt; &amp;quot;Do I think this will go up in the future?&amp;quot;&lt;/p&gt; &lt;p&gt;Edit: this post is not to advocate for index funds, it is to educate about how to measure stocks returns. This is stocks 101, how well a stock has performed is measured in reference to an index, the difference in returns between the two is called Alpha. 
Honestly if this concept is new to you then you need to do a lot more research before you try to start picking stocks.&lt;/p&gt; &lt;/div&gt;&lt;!-- SC_ON --&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/flobbley&quot;&gt; /u/flobbley &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/stocks/&quot;&gt; r/stocks &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/stocks/comments/mdorgy/remember_it_doesnt_matter_how_much_a_stock_goes/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/stocks/comments/mdorgy/remember_it_doesnt_matter_how_much_a_stock_goes/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_mdorgy</id><link href="https://www.reddit.com/r/stocks/comments/mdorgy/remember_it_doesnt_matter_how_much_a_stock_goes/" /><updated>2021-03-26T13:34:28+00:00</updated><title>Remember, it doesn't matter how much a stock goes up, what matters is how much more it goes up than the index</title></entry><entry><author><name>/u/Juswatchingthis</name><uri>https://www.reddit.com/user/Juswatchingthis</uri></author><category term="pennystocks" label="r/pennystocks"/><content type="html">&lt;!-- SC_OFF --&gt;&lt;div class=&quot;md&quot;&gt;&lt;p&gt;I worked a bit on it to try and make it better for you guys. No push or unrealistic price targets here !&lt;/p&gt; &lt;p&gt;&amp;#x200B;&lt;/p&gt; &lt;p&gt;So as there are a lot of weed stocks out there let me explain to you why I think &lt;strong&gt;$HCANF&lt;/strong&gt; will be more valuable than others.&lt;/p&gt; &lt;p&gt;&amp;#x200B;&lt;/p&gt; &lt;p&gt;&lt;strong&gt;Some major information&lt;/strong&gt;&lt;/p&gt; &lt;p&gt;&amp;#x200B;&lt;/p&gt; &lt;p&gt;- founded in 2015 in medford&lt;/p&gt; &lt;p&gt;- sold about 8 million grams of oil and concentrates to this day&lt;/p&gt; &lt;p&gt;- constantly growing and expanding into different locations in Afrika, Europe, Kanada and the US&lt;/p&gt; &lt;p&gt;- head quarters in West Vancouver / Canada&lt;/p&gt; &lt;p&gt;- positive EBITDA will be expected in the Q4 2020 numbers&lt;/p&gt; &lt;p&gt;- they will expand into the shroom business and will take agreement&amp;#39;s. &lt;/p&gt; &lt;p&gt;&amp;#x200B;&lt;/p&gt; &lt;p&gt;&lt;strong&gt;After these very general informations lets get into some more detail.&lt;/strong&gt;&lt;/p&gt; &lt;p&gt;&amp;#x200B;&lt;/p&gt; &lt;p&gt;- many investors see 2021 as a huge chance for halo due to the fact that if they can keep increasing their profits this following year will be their turning point&lt;/p&gt; &lt;p&gt;- the potential is huge due to the fact that we have a growing market and with ongoing legalization the demanded amounts will become bigger&lt;/p&gt; &lt;p&gt;- they survived as a penny stock through hard times and could become profitable the first time in their history&lt;/p&gt; &lt;p&gt;- &lt;strong&gt;FINANCIALS for the Q4 of 2020&lt;/strong&gt; which will decide if they continue to rise &lt;strong&gt;will be published on 31st march 2021&lt;/strong&gt;&lt;/p&gt; &lt;p&gt;- there is am option to carry out a reverse split in 2021 to move out of the pennystock price range and attempt to be listed on NASDAQ - no further informations about a date so far&lt;/p&gt; &lt;p&gt;- &lt;strong&gt;$HCANF&lt;/strong&gt; is set to become a leading company with an engaged &lt;strong&gt;CEO Kiran Sidhu&lt;/strong&gt;. 
He has experience with start up companies and developed them from growing, over making oil, till selling it to the costumer&lt;/p&gt; &lt;p&gt;- &lt;strong&gt;some prominent support comes from the rapper G-Eazy&lt;/strong&gt;. He will launch his own products by flower shop and will be supplied by Halo Collective. The opening of the Flowershop Store is planed in summer 2021 in the centre of Hollywood&lt;/p&gt; &lt;p&gt;- has huge farming spaces, actually the biggest ones in Africa and is shortly before getting a GACP -certification for Bophelo location which will open the door for expansions into Europe and finally the whole world. &lt;/p&gt; &lt;p&gt;&amp;#x200B;&lt;/p&gt; &lt;p&gt;&lt;strong&gt;In the following lines I want to show you some of their upcoming projects&lt;/strong&gt;&lt;/p&gt; &lt;p&gt;&amp;#x200B;&lt;/p&gt; &lt;p&gt;- expanding to the UK, USA, EUROPE&lt;/p&gt; &lt;p&gt;- opening 3 stores in Hollywood&lt;/p&gt; &lt;p&gt;&lt;strong&gt;&lt;em&gt;- the CEO gave an interview where he concluded the upcoming projects - for those that are interested in it&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; &lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=2BI4Kdc_Fg8&quot;&gt;&lt;strong&gt;&lt;em&gt;https://www.youtube.com/watch?v=2BI4Kdc_Fg8&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt; &lt;/div&gt;&lt;!-- SC_ON --&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/Juswatchingthis&quot;&gt; /u/Juswatchingthis &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/pennystocks/&quot;&gt; r/pennystocks &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/mdpgom/hcanf_halo_collective_hot_stock_imo_2021_following/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/mdpgom/hcanf_halo_collective_hot_stock_imo_2021_following/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_mdpgom</id><link href="https://www.reddit.com/r/pennystocks/comments/mdpgom/hcanf_halo_collective_hot_stock_imo_2021_following/" /><updated>2021-03-26T14:08:46+00:00</updated><title>$HCANF -Halo Collective - Hot Stock imo 2021 following</title></entry><entry><author><name>/u/Ex_President35</name><uri>https://www.reddit.com/user/Ex_President35</uri></author><category term="wallstreetbets2" label="r/wallstreetbets2"/><content type="html">&amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/Ex_President35&quot;&gt; /u/Ex_President35 &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/wallstreetbets2/&quot;&gt; r/wallstreetbets2 &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://v.redd.it/3ocz77ytxap61&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/wallstreetbets2/comments/mdhymx/whats_good_video_editor_amc_apes/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_mdhymx</id><link href="https://www.reddit.com/r/wallstreetbets2/comments/mdhymx/whats_good_video_editor_amc_apes/" /><updated>2021-03-26T05:58:57+00:00</updated><title>What’s good video editor $AMC APES?</title></entry><entry><author><name>/u/13ry4n</name><uri>https://www.reddit.com/user/13ry4n</uri></author><category term="pennystocks" label="r/pennystocks"/><content type="html">&lt;!-- SC_OFF --&gt;&lt;div class=&quot;md&quot;&gt;&lt;p&gt;Released yesterday&lt;/p&gt; &lt;p&gt;&lt;a 
href=&quot;https://www.globenewswire.com/news-release/2021/03/25/2199442/0/en/Ozop-Energy-OZSC-Develops-the-Neo-Grids-Supply-Chain.html&quot;&gt;https://www.globenewswire.com/news-release/2021/03/25/2199442/0/en/Ozop-Energy-OZSC-Develops-the-Neo-Grids-Supply-Chain.html&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&amp;quot;The exponential growth that the energy production industry is starting to experience is only part of the solution to meeting the needs of the US energy using population. Ozop has positioned itself to supply these markets whether on grid, micro grids, and delayed distribution models. For this model, we will be executing the first step of meeting orders of &lt;strong&gt;&lt;em&gt;$1 million per month&lt;/em&gt;&lt;/strong&gt;, FOB East Coast, with direct drop shipments to the developer’s actual sites. We expect the container ships arriving every 3-5 weeks providing the &lt;em&gt;first&lt;/em&gt; step to being operationally neutral.&amp;quot;&lt;/p&gt; &lt;p&gt;That&amp;#39;s right, you read that right, &lt;strong&gt;$1 million per month&lt;/strong&gt;.&lt;/p&gt; &lt;p&gt;Also release a few days ago, March 22 2021, the hiring of &lt;strong&gt;Dr. Martello&lt;/strong&gt;&lt;/p&gt; &lt;p&gt;&lt;a href=&quot;http://deltadiligence.com/dr-steven-a-martello/&quot;&gt;http://deltadiligence.com/dr-steven-a-martello/&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&lt;a href=&quot;https://www.globenewswire.com/news-release/2021/03/22/2197128/0/en/OZSC-Ozop-Energy-Systems-Announces-New-Corporate-Advisor.html&quot;&gt;https://www.globenewswire.com/news-release/2021/03/22/2197128/0/en/OZSC-Ozop-Energy-Systems-Announces-New-Corporate-Advisor.html&lt;/a&gt;&lt;/p&gt; &lt;p&gt;M&amp;amp;A. Dr Martello was co-founder and Director of the International Commodity Exchange (ICE) in Moscow, Russia, Chancellor of the International Management Institute, co- founder and CEO/ Chairman of the Nasdaq listed company Alcohol Sensors International, various mortgage and private equity banking operations, and currently as Managing Director and Chairman, of Delta Strategic Solutions and Delta Diligence.&lt;/p&gt; &lt;p&gt;The entire OTC has been shit on by manipulation and it&amp;#39;s been horrid but a great time to buy in or average down if you bought high.&lt;/p&gt; &lt;p&gt;If you would like more info, feel free to join the discord channel which is &lt;em&gt;located at the bottom of the PRs&lt;/em&gt;.&lt;/p&gt; &lt;p&gt;I do hold shares in this company (85% of my portfolio) &amp;amp; have been buying the dips whenever I have some extra cash on hand.&lt;/p&gt; &lt;/div&gt;&lt;!-- SC_ON --&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/13ry4n&quot;&gt; /u/13ry4n &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/pennystocks/&quot;&gt; r/pennystocks &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/mdq1jt/ozop_energy_solutions_develops_the_neo_grids/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/mdq1jt/ozop_energy_solutions_develops_the_neo_grids/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_mdq1jt</id><link href="https://www.reddit.com/r/pennystocks/comments/mdq1jt/ozop_energy_solutions_develops_the_neo_grids/" /><updated>2021-03-26T14:36:53+00:00</updated><title>$OZOP Energy Solutions Develops the Neo Grids Supply + New Corporate 
Advisor</title></entry><entry><author><name>/u/WeenisWrinkle</name><uri>https://www.reddit.com/user/WeenisWrinkle</uri></author><category term="investing" label="r/investing"/><content type="html">&lt;!-- SC_OFF --&gt;&lt;div class=&quot;md&quot;&gt;&lt;p&gt;&lt;a href=&quot;https://www.reuters.com/article/ousivMolt/idUSKBN2BH1TQ&quot;&gt;https://www.reuters.com/article/ousivMolt/idUSKBN2BH1TQ&lt;/a&gt;&lt;/p&gt; &lt;p&gt;The number of Americans filing new claims for unemployment benefits dropped to a one-year low last week, providing a powerful boost to an economy on the verge of stronger growth as the public health situation improves and temperatures rise.&lt;/p&gt; &lt;p&gt;But the labor market is not out of the woods yet, with the weekly jobless claims report from the Labor Department on Thursday showing a staggering 18.953 million people were still receiving unemployment checks in early March. It will likely take years for a full recovery from the pandemic’s scarring.&lt;/p&gt; &lt;p&gt;Initial claims for state unemployment benefits tumbled 97,000 to a seasonally adjusted 684,000 for the week ended March 20, the lowest since mid-March. Data for the prior week was revised to show 11,000 more applications received than previously reported. Economists polled by Reuters had forecast 730,000 applications for the latest week.&lt;/p&gt; &lt;/div&gt;&lt;!-- SC_ON --&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/WeenisWrinkle&quot;&gt; /u/WeenisWrinkle &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/investing/&quot;&gt; r/investing &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/investing/comments/md3d5a/us_weekly_jobless_claims_hit_oneyear_low_in_boost/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/investing/comments/md3d5a/us_weekly_jobless_claims_hit_oneyear_low_in_boost/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_md3d5a</id><link href="https://www.reddit.com/r/investing/comments/md3d5a/us_weekly_jobless_claims_hit_oneyear_low_in_boost/" /><updated>2021-03-25T17:17:09+00:00</updated><title>U.S. weekly jobless claims hit one-year low in boost to economic outlook $VMNT</title></entry><entry><author><name>/u/NEOstockhacker</name><uri>https://www.reddit.com/user/NEOstockhacker</uri></author><category term="pennystocks" label="r/pennystocks"/><content type="html">&lt;table&gt; &lt;tr&gt;&lt;td&gt; &lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/mdojop/vmnt_qb_new_filings_institutional_investors_gets/&quot;&gt; &lt;img src=&quot;https://b.thumbs.redditmedia.com/YN3Sc1fkSiU51MeATJNdGb7wsDo0MNOy_WTEqqJxubo.jpg&quot; alt=&quot;$VMNT QB NEW FILINGS | Institutional Investors Gets Green Light #Defi #Fintech #StableCoin&quot; title=&quot; $VMNT QB NEW FILINGS | Institutional Investors Gets Green Light #Defi #Fintech #StableCoin&quot; /&gt; &lt;/a&gt; &lt;/td&gt;&lt;td&gt; &lt;!-- SC_OFF --&gt;&lt;div class=&quot;md&quot;&gt;&lt;p&gt;In a tweet 2 days ago CEO, Tan Tran said&lt;/p&gt; &lt;p&gt;&amp;quot;BTW, institutional investors have been waiting for Vemanti $VMNT to finish its 2020 financial audit. We got it done yesterday!&amp;quot;&lt;/p&gt; &lt;p&gt;The next day OTCQB Certification and 12/31/2020 Annual report gets published, don&amp;#39;t ever doubt this man. 
&lt;/p&gt; &lt;p&gt;&amp;#x200B;&lt;/p&gt; &lt;p&gt;&lt;a href=&quot;https://preview.redd.it/vkpzycvehdp61.png?width=1110&amp;amp;format=png&amp;amp;auto=webp&amp;amp;s=47d67b6c3a54b0f23165e78afb7abcef3df65dc7&quot;&gt;OTCQB Certification &lt;/a&gt;&lt;/p&gt; &lt;p&gt;&amp;#x200B;&lt;/p&gt; &lt;p&gt;&lt;a href=&quot;https://preview.redd.it/dbwq9uqghdp61.png?width=1127&amp;amp;format=png&amp;amp;auto=webp&amp;amp;s=897777169e68ff5c89d826ab249204bb9a4f560c&quot;&gt;12/31/2020 Annual report&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&amp;#x200B;&lt;/p&gt; &lt;p&gt;CEO also kept his word that RJI international CAPs will be doing the auditing.&lt;/p&gt; &lt;p&gt;RJI CPAs Recognized as one of America’s Top Tax and Accounting Firms by Forbes.&lt;/p&gt; &lt;p&gt;&amp;#x200B;&lt;/p&gt; &lt;p&gt;Most new traders wouldn&amp;#39;t understand the value of having a credible auditor like RJI CPAs means. To put everything into perspective in regards to the auditor, this just adds another stamp of legitimacy and quality. We&amp;#39;re dealing with institutional investors now. Quality is key.&lt;/p&gt; &lt;p&gt;&amp;#x200B;&lt;/p&gt; &lt;p&gt;&lt;a href=&quot;https://preview.redd.it/dog7g4f7idp61.png?width=1883&amp;amp;format=png&amp;amp;auto=webp&amp;amp;s=7e0245d121c873d5f31effae307f9d399fe938ca&quot;&gt;RJI CAPs&lt;/a&gt;&lt;/p&gt; &lt;/div&gt;&lt;!-- SC_ON --&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/NEOstockhacker&quot;&gt; /u/NEOstockhacker &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/pennystocks/&quot;&gt; r/pennystocks &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/mdojop/vmnt_qb_new_filings_institutional_investors_gets/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/mdojop/vmnt_qb_new_filings_institutional_investors_gets/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt; &lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</content><id>t3_mdojop</id><media:thumbnail url="https://b.thumbs.redditmedia.com/YN3Sc1fkSiU51MeATJNdGb7wsDo0MNOy_WTEqqJxubo.jpg" /><link href="https://www.reddit.com/r/pennystocks/comments/mdojop/vmnt_qb_new_filings_institutional_investors_gets/" /><updated>2021-03-26T13:23:09+00:00</updated><title>$VMNT QB NEW FILINGS | Institutional Investors Gets Green Light #Defi #Fintech #StableCoin</title></entry><entry><author><name>/u/Printer84</name><uri>https://www.reddit.com/user/Printer84</uri></author><category term="pennystocks" label="r/pennystocks"/><content type="html">&lt;!-- SC_OFF --&gt;&lt;div class=&quot;md&quot;&gt;&lt;p&gt;&lt;a href=&quot;https://finance.yahoo.com/news/clickstreams-heypal-tm-app-surpassed-123000387.html&quot;&gt;https://finance.yahoo.com/news/clickstreams-heypal-tm-app-surpassed-123000387.html&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&lt;strong&gt;March 26, 2021&lt;/strong&gt; / ClickStream Corp. (OTC PINK:CLIS) a technology company focused on developing apps and digital platforms that disrupt conventional industries announces its subsidiary Nebula Software Corp.&amp;#39;s HeyPal™ app received a total of over 400,000 messages, 100,000 translations, 20,000 likes between almost 10,000 members since the app was beta-released on Monday, February 8th in a select group of countries as part of the beta soft launch program. 
HeyPal™ is currently live in 15 countries, including: Australia, Taiwan, Spain, Ireland, Switzerland, Morocco, Ukraine, Turkey, Colombia, Israel, United Kingdom, Brazil, Germany, Italy &amp;amp; South Korea.&lt;/p&gt; &lt;p&gt;&lt;strong&gt;HeyPal™&lt;/strong&gt;, by way of ClickStream subsidiary Nebula Software Corp., is a language learning app that focuses on &amp;quot;language exchanging&amp;quot; between users around the world.&lt;/p&gt; &lt;p&gt;&lt;strong&gt;Nifter™&lt;/strong&gt;, by way of ClickStream subsidiary Rebel Blockchain Inc., is a music NFT marketplace that allows artists to create, sell and discover unique music and sound NFTs on the Nifter™ marketplace. &lt;/p&gt; &lt;p&gt;HeyPal and Nifter are going to keep bringing this stock up!&lt;/p&gt; &lt;/div&gt;&lt;!-- SC_ON --&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/Printer84&quot;&gt; /u/Printer84 &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/pennystocks/&quot;&gt; r/pennystocks &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/mdo6qx/clis_heypaltm_app_has_surpassed_expectations_by/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/mdo6qx/clis_heypaltm_app_has_surpassed_expectations_by/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_mdo6qx</id><link href="https://www.reddit.com/r/pennystocks/comments/mdo6qx/clis_heypaltm_app_has_surpassed_expectations_by/" /><updated>2021-03-26T13:04:57+00:00</updated><title>$CLIS HeyPal(TM) App has Surpassed Expectations by Reaching Over 400,000 Messages, 100,000 translations, 20,000 likes and almost 10,000 Members During the First 6 Weeks of Beta Soft Launch in Initial 15 Countries</title></entry><entry><author><name>/u/Top-Acanthocephala46</name><uri>https://www.reddit.com/user/Top-Acanthocephala46</uri></author><category term="wallstreetbets2" label="r/wallstreetbets2"/><content type="html">&amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/Top-Acanthocephala46&quot;&gt; /u/Top-Acanthocephala46 &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/wallstreetbets2/&quot;&gt; r/wallstreetbets2 &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://i.redd.it/5ae4eztfz5p61.jpg&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/wallstreetbets2/comments/mcwsrf/here_we_go_game_on/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_mcwsrf</id><link href="https://www.reddit.com/r/wallstreetbets2/comments/mcwsrf/here_we_go_game_on/" /><updated>2021-03-25T12:12:35+00:00</updated><title>Here we go. $GME on</title></entry><entry><author><name>/u/kayaarr</name><uri>https://www.reddit.com/user/kayaarr</uri></author><category term="stocks" label="r/stocks"/><content type="html">&lt;!-- SC_OFF --&gt;&lt;div class=&quot;md&quot;&gt;&lt;p&gt;Online finance start-up SoFi ($IPOE) is lowering the barrier for amateur investors to buy shares of companies as they go public.&lt;/p&gt; &lt;p&gt;These IPO shares have historically been set aside for Wall Street’s institutional investors or high-net worth individuals. Retail traders don’t have a way to buy into newly listed companies until those shares begin actually trading on the exchange. 
By that time, the price has often gapped higher.&lt;/p&gt; &lt;p&gt;“Main Street will have access to investing in a way they wouldn’t have before,” SoFi CEO Anthony Noto said in a phone interview. “It gives more differentiation, and more access so people can build diversified portfolios.&lt;/p&gt; &lt;p&gt;SoFi itself will be an underwriter in these deals, meaning it works with companies to determine a share price, buys securities from the issuer then sells them back to certain investors. It’s common for brokerage firms to get a portion of IPO shares in that process. But they don’t typically offer them to the everyday investor. &lt;/p&gt; &lt;p&gt;&lt;a href=&quot;https://www.cnbc.com/2021/03/26/sofi-to-give-amateur-investors-early-access-to-ipos-in-break-from-wall-street-tradition-.html&quot;&gt;Source&lt;/a&gt;&lt;/p&gt; &lt;/div&gt;&lt;!-- SC_ON --&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/kayaarr&quot;&gt; /u/kayaarr &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/stocks/&quot;&gt; r/stocks &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/stocks/comments/mdoppt/sofi_to_give_amateur_investors_early_access_to/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/stocks/comments/mdoppt/sofi_to_give_amateur_investors_early_access_to/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_mdoppt</id><link href="https://www.reddit.com/r/stocks/comments/mdoppt/sofi_to_give_amateur_investors_early_access_to/" /><updated>2021-03-26T13:31:56+00:00</updated><title>SoFi to give amateur investors early access to IPOs in break from Wall Street tradition</title></entry><entry><author><name>/u/ReasonableWindow7383</name><uri>https://www.reddit.com/user/ReasonableWindow7383</uri></author><category term="pennystocks" label="r/pennystocks"/><content type="html">&lt;!-- SC_OFF --&gt;&lt;div class=&quot;md&quot;&gt;&lt;h1&gt;Analyzing $EEENF &amp;#39;s Path before imminent news.&lt;/h1&gt; &lt;p&gt;📷&lt;/p&gt; &lt;p&gt;Today started off rather rough, with an opening less ideal than we all expected. A classic dip around 11 emerged, stumping many. This can be easily justified by major funds shorting the stock in conjunction with buying it at that lower dip. People panic selling didn&amp;#39;t help either. However, FOMO helped us push through all of this downtrading pressure, and in the end we held a rally as shorts covered and funds bought. The big question really is tomorrow, and the price we will see for the stock. There are many factors. 1. Social media presence and more importantly &lt;a href=&quot;https://www.reddit.com/r/pennystocks/&quot;&gt;r/pennystocks&lt;/a&gt; presence and knowing about this ticker. The best way to win big on great news is to have a lot of people find out. Tomorrow will likely open with a small dip from selling and shorting, followed by retail investors scooping up shares in speculation. Breaking .030 sometime during the day is almost a definite, as we are flying through every resistance point and approaching a blue sky breakout, similar to MDMP pre announcement. Next is the news. Naturally, I&amp;#39;m a pessimist. 
However, I&amp;#39;m optimistic on EEENF.&lt;/p&gt; &lt;ol&gt; &lt;li&gt;9 out of 10 Wells located north of Merlin-1 have found viable industrial grade oil.&lt;/li&gt; &lt;li&gt;David Wall repeating multiple times in interviews that his goal is to strike oil and sell his company to the highest bidder.&lt;/li&gt; &lt;li&gt;ELKO receiving shares for their labor. Some may think of this as a spur of the moment decision, but it isn&amp;#39;t. They were informed of the project before December, at a price lower than the 6.5 million range they are being payed now. The increased price is from delays and confirmation to start drilling.&lt;/li&gt; &lt;li&gt;Mud logging, a process of estimating with oil in the area, apparently should be known by Sunday night. A bad result in the beginning oil field layers may have resulted in an early announcement declaring absence of oil.&lt;/li&gt; &lt;/ol&gt; &lt;p&gt;In the end, hold if you got in below .025 range, or buy if you don&amp;#39;t have a position yet. My price target is .30 cents based upon O/S and hype.&lt;/p&gt; &lt;/div&gt;&lt;!-- SC_ON --&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/ReasonableWindow7383&quot;&gt; /u/ReasonableWindow7383 &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/pennystocks/&quot;&gt; r/pennystocks &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/md7o1c/eeenf_read_this/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/md7o1c/eeenf_read_this/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_md7o1c</id><link href="https://www.reddit.com/r/pennystocks/comments/md7o1c/eeenf_read_this/" /><updated>2021-03-25T20:31:11+00:00</updated><title>$EEENF READ THIS</title></entry><entry><author><name>/u/FckMyStudentLoans</name><uri>https://www.reddit.com/user/FckMyStudentLoans</uri></author><category term="pennystocks" label="r/pennystocks"/><content type="html">&lt;!-- SC_OFF --&gt;&lt;div class=&quot;md&quot;&gt;&lt;p&gt;&lt;a href=&quot;https://finance.yahoo.com/news/global-tech-industries-group-inc-143000208.html&quot;&gt;https://finance.yahoo.com/news/global-tech-industries-group-inc-143000208.html&lt;/a&gt;&lt;/p&gt; &lt;/div&gt;&lt;!-- SC_ON --&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/FckMyStudentLoans&quot;&gt; /u/FckMyStudentLoans &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/pennystocks/&quot;&gt; r/pennystocks &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/mdq9j6/gtii_global_tech_industries_group_nft_acquisition/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/mdq9j6/gtii_global_tech_industries_group_nft_acquisition/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content><id>t3_mdq9j6</id><link href="https://www.reddit.com/r/pennystocks/comments/mdq9j6/gtii_global_tech_industries_group_nft_acquisition/" /><updated>2021-03-26T14:47:58+00:00</updated><title>$GTII - Global Tech Industries Group NFT Acquisition PR Today - Up 25%</title></entry><entry><author><name>/u/wetdirtkurt</name><uri>https://www.reddit.com/user/wetdirtkurt</uri></author><category term="pennystocks" label="r/pennystocks"/><content type="html">&lt;table&gt; &lt;tr&gt;&lt;td&gt; &lt;a 
href=&quot;https://www.reddit.com/r/pennystocks/comments/mdexsk/eeenf_share_price_valuation_model_low_range/&quot;&gt; &lt;img src=&quot;https://b.thumbs.redditmedia.com/5e182t7HIlMaPFdCecboaCJq1NChreDDb7KjZy3aG3I.jpg&quot; alt=&quot;$EEENF Share Price Valuation Model | Low Range Estimate increase of 1700% in Current Share Price Equaling $0.49 Per Share| Average Range Estimate Increase in Current Share Price of 3600% Equaling $1.05 Per Share | 🚀🚀🚀&quot; title=&quot;$EEENF Share Price Valuation Model | Low Range Estimate increase of 1700% in Current Share Price Equaling $0.49 Per Share| Average Range Estimate Increase in Current Share Price of 3600% Equaling $1.05 Per Share | 🚀🚀🚀&quot; /&gt; &lt;/a&gt; &lt;/td&gt;&lt;td&gt; &lt;!-- SC_OFF --&gt;&lt;div class=&quot;md&quot;&gt;&lt;p&gt;I posted this on the &lt;a href=&quot;/r/eeenf&quot;&gt;/r/eeenf&lt;/a&gt; but didn&amp;#39;t get much feedback or rocket emojis so I am reposting here with some additional commentary to provide what is going on in my model.&lt;/p&gt; &lt;p&gt;Using the estimated output of Merlin-1 of 645M Barrels of Oil (+ or - 10% to give a buffer for the estimate and to establish a high and low range for share price), the current price of a barrel of oil, the 100 day moving average on a barrel of oil (an increase of $20.62 at current price), the estimated break-even price of drilling for a single barrel of oil in Alaska of $40, the current market capitalization as of today&amp;#39;s close (according to Yahoo Finance $316M), the company&amp;#39;s own December 31, 2020 Financial Statements to gather cash and debt balances, current share price ($.0286), and some benchmark industry revenue multiples for oil - I made four separate valuation matrixes using three different methodologies to estimate future share price if Merlin-1 is successful. Two of them (the two rightmost) are the same method, just different sources for the multiples.&lt;/p&gt; &lt;p&gt;The valuation methodologies are generally accepted as standard in finance. Each calculation is referenced between each data point in blue (A, B, C...etc). Each cell that is highlighted a color are related to other cells of the same color. At the bottom, you get the ranges for each methodology using the estimated output plus or minus the 10% buffer and the current and expected future price of oil.&lt;/p&gt; &lt;p&gt;Finally, I highlight the absolute highest and lowest estimate of each methodology as well as the overall average by each methodology. 
I do this again, but use only the highest and lowest estimate of all the methodologies and then average all of the estimates that are shown arriving us at $1.05 per share.&lt;/p&gt; &lt;p&gt;Here are the references I used, although not all of them.&lt;/p&gt; &lt;p&gt;&lt;strong&gt;R1 -&lt;/strong&gt; &lt;a href=&quot;https://www.spglobal.com/marketintelligence/en/news-insights/latest-news-headlines/half-of-producing-shale-oil-wells-are-profitable-at-40-bbl-analyst-says-60035427&quot;&gt;https://www.spglobal.com/marketintelligence/en/news-insights/latest-news-headlines/half-of-producing-shale-oil-wells-are-profitable-at-40-bbl-analyst-says-60035427&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&lt;strong&gt;R2/R3 -&lt;/strong&gt; &lt;a href=&quot;https://www.equidam.com/ebitda-multiples-trbc-industries/&quot;&gt;https://www.equidam.com/ebitda-multiples-trbc-industries/&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&lt;strong&gt;R4 -&lt;/strong&gt; &lt;a href=&quot;http://pages.stern.nyu.edu/%7Eadamodar/New_Home_Page/datafile/vebitda.html&quot;&gt;http://pages.stern.nyu.edu/~adamodar/New_Home_Page/datafile/vebitda.html&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&lt;strong&gt;R5 -&lt;/strong&gt; &lt;a href=&quot;https://www.eval.tech/valuation-multiples-by-industry&quot;&gt;https://www.eval.tech/valuation-multiples-by-industry&lt;/a&gt;&lt;/p&gt; &lt;p&gt;Here is the link to the google sheet if you want to poke around a little bit. Commenting is turned on if you want to give feedback. I will change anything I agree with. &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1obQVjCeDD3WmyuNZEC4a_FZK3kEh_85m3JkG6lLqXes/edit?usp=sharing&quot;&gt;https://docs.google.com/spreadsheets/d/1obQVjCeDD3WmyuNZEC4a_FZK3kEh_85m3JkG6lLqXes/edit?usp=sharing&lt;/a&gt;&lt;/p&gt; &lt;p&gt;DISCLAIMER: this is not financial advice. I am not a financial adviser. 
I am just some fucking guy.&lt;/p&gt; &lt;p&gt;&lt;a href=&quot;https://preview.redd.it/mhzkpdjzcap61.png?width=1462&amp;amp;format=png&amp;amp;auto=webp&amp;amp;s=f4103e7176d40b4f4d70d0a2b2ddd6319b49fc2e&quot;&gt;https://preview.redd.it/mhzkpdjzcap61.png?width=1462&amp;amp;format=png&amp;amp;auto=webp&amp;amp;s=f4103e7176d40b4f4d70d0a2b2ddd6319b49fc2e&lt;/a&gt;&lt;/p&gt; &lt;/div&gt;&lt;!-- SC_ON --&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/wetdirtkurt&quot;&gt; /u/wetdirtkurt &lt;/a&gt; &amp;#32; to &amp;#32; &lt;a href=&quot;https://www.reddit.com/r/pennystocks/&quot;&gt; r/pennystocks &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/mdexsk/eeenf_share_price_valuation_model_low_range/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/pennystocks/comments/mdexsk/eeenf_share_price_valuation_model_low_range/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt; &lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</content><id>t3_mdexsk</id><media:thumbnail url="https://b.thumbs.redditmedia.com/5e182t7HIlMaPFdCecboaCJq1NChreDDb7KjZy3aG3I.jpg" /><media:preview url="https://external-preview.redd.it/mM5WAvgi9ejJ7ZdiGnRbz06VzXQnQ0sOdFpluH-fqOY.jpg?width=108&amp;crop=smart&amp;auto=webp&amp;s=f9ce8bd49f4114abe1a64a1c96d969287e409bc3" width="108" height="80" /><link href="https://www.reddit.com/r/pennystocks/comments/mdexsk/eeenf_share_price_valuation_model_low_range/" /><updated>2021-03-26T02:44:10+00:00</updated><title>$EEENF Share Price Valuation Model | Low Range Estimate increase of 1700% in Current Share Price Equaling $0.49 Per Share| Average Range Estimate Increase in Current Share Price of 3600% Equaling $1.05 Per Share</title></entry></feed>
\ No newline at end of file
<data>
<items>
<item name="item1">item1abc</item>
<item name="item2">item2abc</item>
</items>
</data>
\ No newline at end of file
# Homework 2. Bloom Filter
This homework assignment introduces an advanced use of hashing called a Bloom filter.
# Homework 1. Introduction to Python and File I/O
This homework assignment is meant to be an introduction to Python programming and introduces some basic concepts of encoding and decoding.
Due Date: *Friday May 1st, 2020 11:59 pm*
Due Date: *Friday April 17, 2020 11:59 pm*
## Initial Setup
These initial setup instructions assume you've done ``hw0``. Before you start an assignment you should sync your cloned repository with the online one:
......@@ -10,9 +10,9 @@ $ cd cmsc13600-materials
$ git pull
```
Copy the folder ``hw2`` to your newly cloned submission repository. Enter that repository from the command line and enter the copied ``hw2`` folder. In this homework assignment, you will only modify ``bloom.py``. Once you are done, you must add 'bloom.py' to git:
Copy the folder ``hw1`` to your newly cloned submission repository. Enter that repository from the command line and enter the copied ``hw1`` folder. In this homework assignment, you will only modify ``encoding.py``. Once you are done, you must add 'encoding.py' to git:
```
$ git add bloom.py
$ git add encoding.py
```
After adding your files, to submit your code you must run:
```
......@@ -21,44 +21,65 @@ $ git push
```
We will NOT grade any code that is not added, committed, and pushed to your submission repository. You can confirm your submission by visiting the web interface[https://mit.cs.uchicago.edu/cmsc13600-spr-20/skr]
## Bloom filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set." Elements can be added to the set, but not removed (though this can be addressed with the counting Bloom filter variant); the more items added, the larger the probability of false positives. All of the necessary parts that you need to write are marked with *TODO*.
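To make the interface concrete before you start, here is a small usage sketch based on the calls the autograder makes (`Bloom(m, k, seed)`, `put`, and `contains`); it assumes your finished `bloom.py` and is illustrative only:
```
# Illustrative usage only: assumes the Bloom class you will write in bloom.py,
# with the constructor and methods that autograder.py exercises.
from bloom import Bloom

b = Bloom(100, 10, seed=0)     # m = 100 slots, k = 10 hash functions
b.put('chicago')               # add an element
print(b.contains('chicago'))   # True: no false negatives
print(b.contains('evanston'))  # usually False, but a false positive is possible
```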
## Delta Encoding
Delta encoding is a way of storing or transmitting data in the form of differences (deltas) between sequential data rather than complete files.
In this first assignment, you will implement a delta encoding module in python.
The module will:
* Load a file of integers
* Delta encode them
* Write back a file in binary form
Here's how the basic Bloom filter works:
The instructions in this assignment are purposefully incomplete so that you have to read Python's API and understand how the different functions work. All of the necessary parts that you need to write are marked with *TODO*.
### Initialization
* An empty Bloom filter is initialized with an array of *m* elements each with value 0.
* Generate *k* independent hash functions whose output domain are integers {0,...,m}.
## TODO 1. Loading the data file
In `encoding.py`, your first task is to write `load_orig_file`. This function reads from a specified filename and returns a list of integers in the file. You may assume the file is formatted like ``data.txt`` provided with the code, where each line contains a single integer number. The input of this function is a filename and the output is a list of numbers. If the file does not exist you must raise an exception.
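A minimal sketch of what `load_orig_file` could look like (the name and file format come from the assignment; the exact error handling is up to you, as long as a missing file raises an exception):
```
def load_orig_file(filename):
    # data.txt has one integer per line; a missing file raises FileNotFoundError.
    with open(filename, 'r') as f:
        return [int(line) for line in f if line.strip()]
```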
### Adding An Item e
* For each hash function calculate the hash value of the item "e" (should be a number from 0 to m).
* Treat those calculated hash values as indices for the array and set each corresponding index in the array to 1 (if it is already 1 from a previous addition keep it as is).
## TODO 2. Compute the basic encoding
In `encoding.py`, your next task is to write `delta_encoding`. This function takes a list of numbers and computes the delta encoding. The delta encoding encodes the list in terms of successive differences from the previous element. The first element is kept as is in the encoding.
### Contains An Item e
* For each hash function calculate the hash value of the item "e" (should be a number from 0 to m).
* Treat those calculated hash values as indices for the array and retrieve the array value for each corresponding index. If any of the values is 0, we know that "e" could not have possibly been inserted in the past.
For example:
```
> data = [1,3,4,3]
> enc = delta_encoding(data)
1,2,1,-1
```
## TODO 1. Generate K independent Hash Functions
Your first task is to write the function `generate_hashes`. This function is a higher-order function that returns a list of *k* random hash functions each with a range from 0 to *m*. Here are some hints that will help you write this function.
Or,
```
> data = [1,0,6,1]
> enc = delta_encoding(data)
1,-1,6,-5
```
Your job is to write a function that computes this encoding. Pay close attention to how Python passes around references and where you make copies of lists versus modifying a list in place.
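One possible sketch that builds a new list rather than modifying the input in place (this is not the only correct answer):
```
def delta_encoding(numbers):
    # Keep the first element as is; every later element is the difference
    # from the element that preceded it in the original list.
    encoded = []
    previous = 0
    for i, value in enumerate(numbers):
        encoded.append(value if i == 0 else value - previous)
        previous = value
    return encoded

# delta_encoding([1, 3, 4, 3]) returns [1, 2, 1, -1]
```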
* Step 1. Review the "linear" hash function described in lecture and write a helper function that generates such a hash function for a pre-defined A and B. How would you restrict the output of this hash function to be within 0 to m?
## TODO 3. Integer Shifting
When we write this data to a file, we will want to represent each encoded value as an unsigned byte (a single byte of data). To do so, we have to "shift" all of the values upwards so there are no negatives. You will write a function `shift` that adds a pre-specified offset to each value.
* Step 2. Generate k of such functions with different random settings of A and B. Pay close attention to how many times you call "random.x" because of how the seeded random variable works.
## TODO 4. Write Encoding
Now, we are ready to write the encoded data to disk. In the function `write_encoding`, you will do the following steps:
* Open the specified filename in the function arguments for writing
* Convert the encoded list of numbers into a bytearray
* Write the bytearray to the file
* Close the file
* Step 3. Return the functions themselves so they can be applied to data. Look at the autograder to understand what inputs these functions should take.
Reading from such a file is a little tricky, so we've provided that function for you.
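A sketch of `shift` and `write_encoding` along the lines described above; note that every shifted value must land in 0..255 for `bytearray` to accept it:
```
def shift(numbers, offset):
    # Return a new list with the offset added to every value.
    return [n + offset for n in numbers]

def write_encoding(encoded, filename):
    # Each encoded value must be in 0..255 after shifting, or bytearray raises.
    with open(filename, 'wb') as f:
        f.write(bytearray(encoded))
```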
## TODO 2. Put
Write a function that uses the algorithm listed above to add a string to the bloom filter. In pseudo-code:
* For each of the k hash functions:
* Compute the hash code of the string, and store the code in i
* Set the ith element of the array to 1
## TODO 5. Delta Decoding
Finally, you will write a function that takes a delta encoded list and recovers the original data. This should do the opposite of what you did before. Don't forget to unshift the data when you are testing!
## TODO 3. Get
Write a function that uses the algorithm listed above to test whether the bloom filter possibly contains the string. In pseudo-code:
* For each of the k hash functions:
* Compute the hash code of the string, and store the code in i
* if the ith element is 0, return false
* if all code-indices are 1, return true
For example:
```
> enc = [1,2,1,-1]
> data = delta_decoding(enc)
1,3,4,3
```
Or,
```
> enc = [1,-1,6,-5]
> data = delta_decoding(enc)
1,0,6,1
```
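A decoding sketch that inverts the encoding with a running sum; `unshift` is shown as the mirror image of `shift`, since the autograder unshifts before decoding:
```
def unshift(numbers, offset):
    # Undo shift() by subtracting the same offset.
    return [n - offset for n in numbers]

def delta_decoding(encoded):
    # Rebuild the original values by accumulating the deltas.
    decoded = []
    total = 0
    for delta in encoded:
        total += delta
        decoded.append(total)
    return decoded

# delta_decoding([1, 2, 1, -1]) returns [1, 3, 4, 3]
```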
## Testing
We've provided an autograder script `autograder.py` which runs a bunch of interesting tests. The autograder is not comprehensive but it is a good start. It's up to you to figure out what the tests do and why they work.
We've provided a sample dataset ``data.txt`` which can be used to test your code as well as an autograder script `autograder.py` which runs a bunch of interesting tests. The autograder is not comprehensive but it is a good start. It's up to you to figure out what the tests do and why they work.
import random
import string
from encoding import *

def generate_random_string(seed=True):
    chars = string.ascii_uppercase + string.digits
    size = 10
    return ''.join(random.choice(chars) for x in range(size))

def test_load():
    data = load_orig_file('data.txt')
    try:
        assert(sum(data) == 1778744)
    except AssertionError:
        print('TODO 1. Failure check your load_orig_file function')

def test_encoding():
    data = load_orig_file('data.txt')
    encoded = delta_encoding(data)
    try:
        assert(sum(encoded) == data[-1])
        assert(sum(encoded) == 26)
        assert(len(data) == len(encoded))
    except AssertionError:
        print('TODO 2. Failure check your delta_encoding function')

def test_shift():
    data = load_orig_file('data.txt')
    encoded = delta_encoding(data)
    N = len(data)
    try:
        assert(sum(shift(data, 10)) == N*10 + sum(data))
        assert(all([d >= 0 for d in shift(encoded, 4)]))
    except AssertionError:
        print('TODO 3. Failure check your shift function')

def test_decoding():
    data = load_orig_file('data.txt')
    encoded = delta_encoding(data)
    sencoded = shift(encoded, 4)
    data_p = delta_decoding(unshift(sencoded, 4))
    try:
        assert(data == data_p)
    except AssertionError:
        print('TODO 5. Cannot recover data with delta_decoding')

def generate_file(size, seed):
    FILE_NAME = 'data.gen.txt'
    f = open(FILE_NAME, 'w')
    initial = seed
    for i in range(size):
        f.write(str(initial) + '\n')
        initial += random.randint(-4, 4)

def generate_random_tests():
    SIZES = (1,1000,16,99)
    SEEDS = (240,-3, 9, 1)
    cnt = 0
    for trials in range(10):
        generate_file(random.choice(SIZES), random.choice(SEEDS))
        data = load_orig_file('data.gen.txt')
        encoded = delta_encoding(data)
        sencoded = shift(encoded, 4)
        write_encoding(sencoded, 'data_out.txt')
        loaded = unshift(read_encoding('data_out.txt'), 4)
        decoded = delta_decoding(loaded)
        cnt += (decoded == data)
    try:
        assert(cnt == 10)
    except AssertionError:
        print('Failed Random Tests', str(10-cnt), 'out of 10')

test_load()
test_encoding()
test_shift()
test_decoding()
generate_random_tests()
\ No newline at end of file
# HW3 String Matching
# Homework 2. Bloom Filter
This homework assignment introduces an advanced use of hashing called a Bloom filter.
*Due 5/14/20 11:59 PM*
Entity Resolution is the task of disambiguating manifestations of real world entities in various records or mentions by linking and grouping. For example, there could be different ways of addressing the same person in text, different addresses for businesses, or photos of a particular object. In this assignment, you will link two product catalogs.
Due Date: *Friday May 1st, 2020 11:59 pm*
## Getting Started
First, pull the most recent changes from the cmsc13600-public repository:
## Initial Setup
These initial setup instructions assume you've done ``hw0``. Before you start an assignment you should sync your cloned repository with the online one:
```
$ cd cmsc13600-materials
$ git pull
```
Then, copy the `hw3` folder to your submission repository. Change directories to enter your submission repository. Your code will go into `analyze.py`. You can add the files to the repository using `git add`:
Copy the folder ``hw2`` to your newly cloned submission repository. Enter that repository from the command line and enter the copied ``hw2`` folder. In this homework assignment, you will only modify ``bloom.py``. Once you are done, you must add 'bloom.py' to git:
```
$ git add analyze.py
$ git commit -m'initialized homework'
$ git add bloom.py
```
You will also need to fetch the datasets used in this homework assignment:
After adding your files, to submit your code you must run:
```
https://www.dropbox.com/s/vq5dyl5hwfhbw98/Amazon.csv?dl=0
https://www.dropbox.com/s/fbys7cqnbl3ch1s/Amzon_GoogleProducts_perfectMapping.csv?dl=0
https://www.dropbox.com/s/o6rqmscmv38rn1v/GoogleProducts.csv?dl=0
$ git commit -m"My submission"
$ git push
```
Download each of the files and put it into your `hw3` folder.
We will NOT grade any code that is not added, committed, and pushed to your submission repository. You can confirm your submission by visiting the web interface[https://mit.cs.uchicago.edu/cmsc13600-spr-20/skr]
Before we can get started, let us understand the main APIs in this project. We have provided a file named `core.py` for you. This file loads and processes the data that you've just downloaded. For example, you can load the Amazon catalog with the `amazon_catalog()` function. This returns an iterator over data tuples in the Amazon catalog. The fields are id, title, description, mfg (manufacturer), and price if any:
```
>>>for a in amazon_catalog():
... print(a)
... break
## Bloom filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set." Elements can be added to the set, but not removed (though this can be addressed with the counting Bloom filter variant); the more items added, the larger the probability of false positives. All of the necessary parts that you need to write are marked with *TODO*.
{'id': 'b000jz4hqo', 'title': 'clickart 950 000 - premier image pack (dvd-rom)', 'description': '', 'mfg': 'broderbund', 'price': '0'}
```
You can do the same for the Google catalog:
```
>>>for a in google_catalog():
... print(a)
... break
Here's how the basic Bloom filter works:
{'id': 'http://www.google.com/base/feeds/snippets/11125907881740407428', 'title': 'learning quickbooks 2007', 'description': 'learning quickbooks 2007', 'mfg': 'intuit', 'price': '38.99'}
```
A matching is a pairing between id's in the Google catalog and the Amazon catalog that refer to the same product. The ground truth is listed in the file `Amzon_GoogleProducts_perfectMapping.csv`. Your job is to construct a list of pairs (or iterator of pairs) of `(amazon.id, google.id)`. These matchings can be evaluated for accuracy using the `eval_matching` function:
```
>>> my_matching = [('b000jz4hqo', 'http://www.google.com/base/feeds/snippets/11125907881740407428'),...]
>>> eval_matching(my_matching)
{'false positive': 0.9768566493955095, 'false negative': 0.43351268255188313, 'accuracy': 0.04446992095577143}
```
False positive refers to the false positive rate, false negative refers to the false negative rate, and accuracy refers to the overall accuracy.
### Initialization
* An empty Bloom filter is initialized with an array of *m* elements each with value 0.
* Generate *k* independent hash functions whose output domain are integers {0,...,m}.
## Assignment
Your job is to write the `match` function in `analyze.py`. You can run your code by running:
```
python3 auto_grader.py
```
Running the code will print out a result report as follows (accuracy, precision, and recall):
```
----Accuracy----
0.5088062622309197 0.6998654104979811 0.3996925441967717
---- Timing ----
168.670348 seconds
### Adding An Item e
* For each hash function calculate the hash value of the item "e" (should be a number from 0 to m).
* Treat those calculated hash values as indices for the array and set each corresponding index in the array to 1 (if it is already 1 from a previous addition keep it as is).
```
*For full credit, you must write a program that achieves at least 50% accuracy in less than 5 mins on a standard laptop.*
### Contains An Item e
* For each hash function calculate the hash value of the item "e" (should be a number from 0 to m).
* Treat those calculated hash values as indices for the array and retrieve the array value for each corresponding index. If any of the values is 0, we know that "e" could not have possibly been inserted in the past.
The project is completely unstructured, and it is up to you to figure out how to make this happen. Here are some hints:
## TODO 1. Generate K independent Hash Functions
Your first task is to write the function `generate_hashes`. This function is a higher-order function that returns a list of *k* random hash functions each with a range from 0 to *m*. Here are some hints that will help you write this function.
* The amazon product database is redundant (multiple same products), the google database is essentially unique.
* Step 1. Review the "linear" hash function described in lecture and write a helper function that generates such a hash function for a pre-defined A and B. How would you restrict the output of this hash function to be within 0 to m?
* Jaccard similarity will be useful but you may have to consider "n-grams" of words (look at the lecture notes!) and "cleaning" up the strings to strip formatting and punctuation.
* Step 2. Generate k of such functions with different random settings of A and B. Pay close attention to how many times you call "random.x" because of how the seeded random variable works.
* Price and manufacturer will also be important attributes to use.
* Step 3. Return the functions themselves so they can be applied to data. Look at the autograder to understand what inputs these functions should take.
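A sketch that follows the three steps above. The exact "linear" hash family and the string-to-integer step should come from lecture, so treat the ones used here as stand-ins:
```
import random

def make_linear_hash(a, b, m):
    # h(s) = (a*x + b) mod m, where x is a simple integer encoding of the string.
    def h(s):
        x = sum(ord(c) for c in str(s))
        return (a * x + b) % m
    return h

def generate_hashes(k, m, seed=0):
    # Seed once so the same k (A, B) pairs are drawn on every run.
    random.seed(seed)
    return [make_linear_hash(random.randint(1, 10 ** 6),
                             random.randint(0, 10 ** 6), m)
            for _ in range(k)]
```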
## Submission
After you finish the assignment you can submit your code with:
```
$ git push
```
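To make the Jaccard/n-gram hint above concrete, here is a small similarity sketch; the cleaning rule and the n-gram size are illustrative choices, not requirements:
```
import re

def ngrams(text, n=3):
    # Lowercase, strip punctuation, then slide a window of n characters.
    cleaned = re.sub(r'[^a-z0-9 ]', '', text.lower())
    if len(cleaned) < n:
        return {cleaned}
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}

def jaccard(a, b, n=3):
    # Size of the intersection divided by the size of the union of the n-gram sets.
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if (ga | gb) else 0.0

# jaccard('learning quickbooks 2007', 'learning quickbooks 2007') is 1.0
```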
## TODO 2. Put
Write a function that uses the algorithm listed above to add a string to the bloom filter. In pseudo-code:
* For each of the k hash functions:
* Compute the hash code of the string, and store the code in i
* Set the ith element of the array to 1
## TODO 3. Get
Write a function that uses the algorithm listed above to test whether the bloom filter possibly contains the string. In pseudo-code:
* For each of the k hash functions:
* Compute the hash code of the string, and store the code in i
* if the ith element is 0, return false
* if all code-indices are 1, return true
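Putting the pseudo-code together, a minimal sketch of `put` and `contains`; the `self.array` and `self.hashes` attribute names are the ones the autograder inspects, and `__init__` (which builds them) is omitted here:
```
class Bloom:
    # __init__ should set self.array to m zeros and self.hashes to the k
    # functions from generate_hashes; only put/contains are sketched here.
    def put(self, s):
        for h in self.hashes:
            self.array[h(s)] = 1

    def contains(self, s):
        for h in self.hashes:
            if self.array[h(s)] == 0:
                return False
        return True
```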
## Testing
We've provided an autograder script `autograder.py` which runs a bunch of interesting tests. The autograder is not comprehensive but it is a good start. It's up to you to figure out what the tests do and why they work.
import datetime
import csv
from analyze import match
def eval_matching(your_matching):
    f = open('Amzon_GoogleProducts_perfectMapping.csv', 'r', encoding = "ISO-8859-1")
    reader = csv.reader(f, delimiter=',', quotechar='"')
    matches = set()
    proposed_matches = set()

    tp = set()
    fp = set()
    fn = set()
    tn = set()

    for row in reader:
        matches.add((row[0],row[1]))
        #print((row[0],row[1]))

    for m in your_matching:
        proposed_matches.add(m)
        if m in matches:
            tp.add(m)
        else:
            fp.add(m)

    for m in matches:
        if m not in proposed_matches:
            fn.add(m)

    if len(your_matching) == 0:
        prec = 1.0
    else:
        prec = len(tp)/(len(tp) + len(fp))
    rec = len(tp)/(len(tp) + len(fn))

    return {'precision': prec,
            'recall': rec,
            'accuracy': 2*(prec*rec)/(prec+rec) }

#prints out the accuracy
now = datetime.datetime.now()
out = eval_matching(match())
timing = (datetime.datetime.now()-now).total_seconds()

print("----Accuracy----")
print(out['accuracy'], out['precision'] ,out['recall'])
print("---- Timing ----")
print(timing,"seconds")
import random
import string
from bloom import *
def generate_random_string(seed=True):
    chars = string.ascii_uppercase + string.digits
    size = 10
    return ''.join(random.choice(chars) for x in range(size))

def test_hash_generation():
    b = Bloom(5,10)
    try:
        assert(len(b.hashes) == 10)
    except:
        print('[#1] Failure the number of generated hashes is wrong')

    try:
        for h in b.hashes:
            h(generate_random_string())
    except:
        print('[#2] The hashes are not properly represented as a lambda')

    s = generate_random_string()
    try:
        for h in b.hashes:
            assert(h(s) == h(s))
    except:
        print('[#3] Hashes are not deterministic')

    try:
        b = Bloom(100,10)
        b1h = b.hashes[0](s)
        b = Bloom(100,10)
        b2h = b.hashes[0](s)
        assert(b1h == b2h)
    except:
        print('[#4] Seeds are not properly set')

    try:
        b = Bloom(100,10)
        for h in b.hashes:
            for i in range(10):
                assert( h(generate_random_string()) < 100 )
    except:
        print('[#5] Hash exceeds range')

    try:
        b = Bloom(1000,2)
        s = generate_random_string()
        bh1 = b.hashes[0](s)
        bh2 = b.hashes[1](s)
        assert(bh1 != bh2)
    except:
        print('[#6] Hashes generated are not independent')

def test_put():
    b = Bloom(100,10,seed=0)
    b.put('the')
    b.put('university')
    b.put('of')
    b.put('chicago')
    try:
        assert(sum(b.array) == 30)
    except:
        print('[#7] Unexpected Put() Result')

def test_put_get():
    b = Bloom(100,5,seed=0)
    b.put('the')
    b.put('quick')
    b.put('brown')
    b.put('fox')
    b.put('jumped')
    b.put('over')
    b.put('the')
    b.put('lazy')
    b.put('dog')
    results = [b.contains('the'),\
               b.contains('cow'), \
               b.contains('jumped'), \
               b.contains('over'),\
               b.contains('the'), \
               b.contains('moon')]
    try:
        assert(results == [True, False, True, True, True, False])
    except:
        print('[#8] Unexpected contains result')

test_hash_generation()
test_put()
test_put_get()
# Extract-Transform-Load
# HW3 String Matching
*Due Friday 5/22/20 11:59 PM*
Extract, transform, load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source(s). In this project, you will write some of the core primitives in an ETL system.
*Due 5/14/20 11:59 PM*
Entity Resolution is the task of disambiguating manifestations of real world entities in various records or mentions by linking and grouping. For example, there could be different ways of addressing the same person in text, different addresses for businesses, or photos of a particular object. In this assignment, you will link two product catalogs.
## Getting Started
First, pull the most recent changes from the cmsc13600-public repository:
```
$ git pull
```
Then, copy the `hw4` folder to your submission repository. Change directories to enter your submission repository. Your code will go into the `etl.py` and `etl_programs.py` files. You can add the files to the repository using `git add`:
Then, copy the `hw3` folder to your submission repository. Change directories to enter your submission repository. Your code will go into `analyze.py`. You can add the files to the repository using `git add`:
```
$ git add *.py
$ git add analyze.py
$ git commit -m'initialized homework'
```
You will additionally have to install the Pandas library to do this assignment:
```
$ pip3 install pandas
```
Feel free to skip this section if you already know how Pandas works. Pandas is a data analysis toolkit for Python that makes it easy to work with tabular data. We organize our tutorial of this library around an exploration of data from the 2015 New York City Street Tree Survey, which is freely available from the New York City open data portal (https://data.cityofnewyork.us). This survey was performed by the New York City Department of Parks and Recreation with help from more than 2000 volunteers. The goal of the survey is to catalog the trees planted on the City right-of-way, typically the space between the sidewalk and the curb, in the five boroughs of New York. The survey data is stored in a CSV file that has 683,789 lines, one per street tree. (Hereafter we will refer to trees rather than street trees for simplicity.) The census takers record many different attributes for each tree, such as the common name of the species, the location of the tree, etc. Of these values, we will use the following:
* boroname: the name of the borough in which the tree resides;
* health: an estimate of the health of the tree: one of good, fair, or poor;
* latitude and longitude : the location of the tree using geographical coordinates;
* spc_common: the common, as opposed to Latin, name of the species;
* status: the status of the tree: one of alive, dead, or stump;
* tree_id: a unique identifier
Some fields are not populated for some trees. For example, the health field is not populated for dead trees and stumps and the species field (spc_common) is not populated for stumps and most dead trees.
To use pandas, you can simply import it as follows:
```
>>> import pandas as pd
```
DataFrames are the the main data structure in Pandas. The library function pd.read_csv takes the name of a CSV file and loads it into a data frame. Let’s use this function to load the tree data from a file named 2015StreetTreesCensus_TREES.csv:
```
>>> trees = pd.read_csv("2015StreetTreesCensus_TREES.csv")
```
The variable trees now refers to a Pandas DataFrame. Let’s start by looking at some of the actual data. You can similarly create a DataFrame from lists:
```
>>> df = pd.DataFrame([['Bob', 'Stewart'],
['Anna', 'Davis'],
['Jerry', 'Dole'],
['John', 'Marsh']],
columns=['first_name', 'last_name'])
```
We’ll explain the various ways to access data in detail later. For now, just keep in mind that the columns have names (for example, “Latitude”, “longitude”, “spc_common”, etc) and leverage the intuition you’ve built up about indexing in other data structures.
Here, for example, are a few columns from the first ten rows of the dataset:
```
>>> trees10 = trees[:10]
>>> trees10[["Latitude", "longitude", "spc_common", "health", "boroname"]]
Latitude longitude spc_common health boroname
0 40.723092 -73.844215 red maple Fair Queens
1 40.794111 -73.818679 pin oak Fair Queens
2 40.717581 -73.936608 honeylocust Good Brooklyn
3 40.713537 -73.934456 honeylocust Good Brooklyn
4 40.666778 -73.975979 American linden Good Brooklyn
5 40.770046 -73.984950 honeylocust Good Manhattan
6 40.770210 -73.985338 honeylocust Good Manhattan
7 40.762724 -73.987297 American linden Good Manhattan
8 40.596579 -74.076255 honeylocust Good Staten Island
9 40.586357 -73.969744 London planetree Fair Brooklyn
```
Notice that the result looks very much like a table in which both the columns and the rows are labelled. In this case, the column labels came from the first row in the file and the rows are simply numbered starting at zero.
Here’s the full first row of the dataset with all 41 attributes:
```
>>> trees.iloc[0]
created_at 08/27/2015
tree_id 180683
block_id 348711
the_geom POINT (-73.84421521958048 40.723091773924274)
tree_dbh 3
stump_diam 0
curb_loc OnCurb
status Alive
health Fair
spc_latin Acer rubrum
spc_common red maple
steward None
guards None
sidewalk NoDamage
user_type TreesCount Staff
problems None
root_stone No
root_grate No
root_other No
trnk_wire No
trnk_light No
trnk_other No
brnch_ligh No
brnch_shoe No
brnch_othe No
address 108-005 70 AVENUE
zipcode 11375
zip_city Forest Hills
cb_num 406
borocode 4
boroname Queens
cncldist 29
st_assem 28
st_senate 16
nta QN17
nta_name Forest Hills
boro_ct 4073900
state New York
Latitude 40.7231
longitude -73.8442
x_sp 1.02743e+06
y_sp 202757
Name: 0, dtype: object
You will also need to fetch the datasets used in this homework assignment:
```
and here are a few specific values from that row:
https://www.dropbox.com/s/vq5dyl5hwfhbw98/Amazon.csv?dl=0
https://www.dropbox.com/s/fbys7cqnbl3ch1s/Amzon_GoogleProducts_perfectMapping.csv?dl=0
https://www.dropbox.com/s/o6rqmscmv38rn1v/GoogleProducts.csv?dl=0
```
>>> first_row = trees.iloc[0]
>>> first_row["Latitude"]
40.72309177
>>> first_row["longitude"]
-73.84421522
>>> first_row["boroname"]
'Queens'
```
Notice that the latitude and longitude values are floats, while the borough name is a string. Conveniently, read_csv analyzes each column and if possible, identifies the appropriate type for the data stored in the column. If this analysis cannot determine a more specific type, the data will be represented using strings.
Download each of the files and put it into your `hw3` folder.
We can also extract data for a specific column:
```
>>> trees10["boroname"]
0 Queens
1 Queens
2 Brooklyn
3 Brooklyn
4 Brooklyn
5 Manhattan
6 Manhattan
7 Manhattan
8 Staten Island
9 Brooklyn
Name: boroname, dtype: object
Before we can get started, let us understand the main APIs in this project. We have provided a file named `core.py` for you. This file loads and processes the data that you've just downloaded. For example, you can load the Amazon catalog with the `amazon_catalog()` function. This returns an iterator over data tuples in the Amazon catalog. The fields are id, title, description, mfg (manufacturer), and price if any:
```
>>>for a in amazon_catalog():
... print(a)
... break
and we can easily do useful things with the result, like count the number of times each unique value occurs:
{'id': 'b000jz4hqo', 'title': 'clickart 950 000 - premier image pack (dvd-rom)', 'description': '', 'mfg': 'broderbund', 'price': '0'}
```
>>> trees10["boroname"].value_counts()
Brooklyn 4
Manhattan 3
Queens 2
Staten Island 1
Name: boroname, dtype: int64
You can do the same for the Google catalog:
```
Now that you have some feel for the data, we’ll move on to discussing some useful attributes and methods provided by data frames. The shape attribute yields the number of rows and columns in the data frame:
>>>for a in google_catalog():
... print(a)
... break
{'id': 'http://www.google.com/base/feeds/snippets/11125907881740407428', 'title': 'learning quickbooks 2007', 'description': 'learning quickbooks 2007', 'mfg': 'intuit', 'price': '38.99'}
```
>>> trees.shape
(683788, 42)
A matching is a pairing between id's in the Google catalog and the Amazon catalog that refer to the same product. The ground truth is listed in the file `Amzon_GoogleProducts_perfectMapping.csv`. Your job is to construct a list of pairs (or iterator of pairs) of `(amazon.id, google.id)`. These matchings can be evaluated for accuracy using the `eval_matching` function:
```
The data frame has fewer rows (683,788) than lines in the file (683,789), because the header row is used to construct the column labels and does not appear as a regular row in the data frame. To access a row using the row number, that is, its position in the data frame, we use the iloc operator and square brackets:
>>> my_matching = [('b000jz4hqo', 'http://www.google.com/base/feeds/snippets/11125907881740407428'),...]
>>> eval_matching(my_matching)
{'false positive': 0.9768566493955095, 'false negative': 0.43351268255188313, 'accuracy': 0.04446992095577143}
```
>>> trees.iloc[3]
created_at 08/27/2015
block_id 348711
the_geom POINT (-73.84421521958048 40.723091773924274)
tree_dbh 3
stump_diam 0
curb_loc OnCurb
status Alive
health Fair
spc_latin Acer rubrum
spc_common red maple
steward None
guards None
sidewalk NoDamage
user_type TreesCount Staff
problems None
root_stone No
root_grate No
root_other No
trnk_wire No
trnk_light No
trnk_other No
brnch_ligh No
brnch_shoe No
brnch_othe No
address 108-005 70 AVENUE
zipcode 11375
zip_city Forest Hills
cb_num 406
borocode 4
boroname Queens
cncldist 29
st_assem 28
st_senate 16
nta QN17
nta_name Forest Hills
boro_ct 4073900
state New York
Latitude 40.7231
longitude -73.8442
x_sp 1.02743e+06
y_sp 202757
Name: 180683, dtype: object
```
In both cases the result of evaluating the expression has type Pandas Series:
We can extract the values in a specific column using square brackets with the column name as the index:
False positive refers to the false positive rate, false negative refers to the false negative rate, and accuracy refers to the overall accuracy.
## Assignment
Your job is to write the `match` function in `analyze.py`. You can run your code by running:
```
>>> trees10["spc_common"]
tree_id
180683 red maple
200540 pin oak
204026 honeylocust
204337 honeylocust
189565 American linden
190422 honeylocust
190426 honeylocust
208649 American linden
209610 honeylocust
192755 London planetree
Name: spc_common, dtype: object
python3 auto_grader.py
```
We can also use dot notation to access a column, if the corresponding label conforms to the rules for Python identifiers and does not conflict with the name of a DataFrame attribute or method:
```
>>> trees10.spc_common
tree_id
180683 red maple
200540 pin oak
204026 honeylocust
204337 honeylocust
189565 American linden
190422 honeylocust
190426 honeylocust
208649 American linden
209610 honeylocust
192755 London planetree
Name: spc_common, dtype: object
Running the code will print out a result report as follows (accuracy, precision, and recall):
```
----Accuracy----
0.5088062622309197 0.6998654104979811 0.3996925441967717
---- Timing ----
168.670348 seconds
The tree dataset has many columns, most of which we will not be using to answer the questions posed at the beginning of the chapter. As we saw above, we can extract the desired columns using a list as the index:
```
>>> cols_to_keep = ['spc_common', 'status', 'health', 'boroname', 'Latitude', 'longitude']
>>> trees_narrow = trees[cols_to_keep]
>>> trees_narrow.shape
(683788, 6)
```
*For full credit, you must write a program that achieves at least 50% accuracy in less than 5 mins on a standard laptop.*
This new data frame has the same number of rows and the same index as the original data frame, but only six columns instead of the original 41.
If we know in advance that we will be using only a subset of the columns, we can specify the names of the columns of interest to pd.read_csv and get the slimmer data frame to start. Here’s a function that uses this approach to construct the desired data frame:
```
>>> def get_tree_data(filename):
... '''
... Read slim version of the tree data and clean up the labels.
...
... Inputs:
... filename: name of the file with the tree data
...
... Returns: DataFrame
... '''
... cols_to_keep = ['tree_id', 'spc_common', 'status', 'health', 'boroname',
... 'Latitude', 'longitude']
... trees = pd.read_csv(filename, index_col="tree_id",
... usecols=cols_to_keep)
... trees.rename(columns={"Latitude":"latitude"}, inplace=True)
... return trees
...
...
>>> trees = get_tree_data("2015StreetTreesCensus_TREES.csv")
```
The project is complete unstructured and it is up to you to figure out how to make this happen. Here are some hints:
A few things to notice about this function: first, the index column, tree_id, needs to be included in the value passed with the usecols parameter. Second, we used the rename method to fix a quirk with the tree data: “Latitude” is the only column name that starts with a capital letter. We fixed this problem by supplying a dictionary that maps the old name of a label to a new name using the columns parameter. Finally, by default, rename constructs a new dataframe. Calling it with the inplace parameter set to True, causes frame updated in place, instead.
* The amazon product database is redundant (multiple same products), the google database is essentially unique.
We encourage you to read the Pandas API before you do this homework, most of the functions that you will implement are trivial if you have the right Pandas library routine!
https://pandas.pydata.org/pandas-docs/stable/reference/index.html
* Jaccard similarity will be useful but you may have to consider "n-grams" of words (look at the lecture notes!) and "cleaning" up the strings to strip formatting and punctuation.
* Price and manufacturer will also be important attributes to use.
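To make the hints above concrete, here is a minimal sketch of one way a `match` function could be organized: character n-grams plus Jaccard similarity over product titles. The filenames `Amazon.csv` and `GoogleProducts.csv`, the column positions, the presence of a header row, and the 0.4 threshold are all assumptions for illustration only, and a brute-force scan like this may need additional filtering or blocking to meet the timing requirement.
```
import csv

def ngrams(s, n=3):
    # character n-grams of a lower-cased, cleaned-up string
    s = "".join(ch.lower() for ch in s if ch.isalnum() or ch.isspace())
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def jaccard(a, b):
    # Jaccard similarity of two sets
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def load(filename):
    # assumes column 0 is the id and column 1 is the product title/name,
    # and that the first row is a header -- adjust to the real files
    with open(filename, 'r', encoding="ISO-8859-1") as f:
        reader = csv.reader(f)
        next(reader)
        return [(row[0], ngrams(row[1])) for row in reader]

def match():
    amazon = load('Amazon.csv')          # hypothetical filename
    google = load('GoogleProducts.csv')  # hypothetical filename
    pairs = []
    for a_id, a_grams in amazon:
        # keep the most similar Google product, if it clears a threshold
        best_id, best_sim = None, 0.0
        for g_id, g_grams in google:
            sim = jaccard(a_grams, g_grams)
            if sim > best_sim:
                best_id, best_sim = g_id, sim
        if best_id is not None and best_sim > 0.4:  # threshold chosen arbitrarily
            pairs.append((a_id, best_id))
    return pairs
```
Note that this sketch returns a list rather than a generator so that `eval_matching` can call `len()` on the result.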
## Submission
After you finish the assignment you can submit your code with:
......
```
import datetime
import csv
from analyze import match

def eval_matching(your_matching):
    f = open('Amzon_GoogleProducts_perfectMapping.csv', 'r', encoding = "ISO-8859-1")
    reader = csv.reader(f, delimiter=',', quotechar='"')
    matches = set()
    proposed_matches = set()

    tp = set()
    fp = set()
    fn = set()
    tn = set()

    for row in reader:
        matches.add((row[0], row[1]))
        #print((row[0],row[1]))

    for m in your_matching:
        proposed_matches.add(m)
        if m in matches:
            tp.add(m)
        else:
            fp.add(m)

    for m in matches:
        if m not in proposed_matches:
            fn.add(m)

    if len(your_matching) == 0:
        prec = 1.0
    else:
        prec = len(tp)/(len(tp) + len(fp))
    rec = len(tp)/(len(tp) + len(fn))

    return {'precision': prec,
            'recall': rec,
            'accuracy': 2*(prec*rec)/(prec+rec) }

#prints out the accuracy
now = datetime.datetime.now()
out = eval_matching(match())
timing = (datetime.datetime.now()-now).total_seconds()

print("----Accuracy----")
print(out['accuracy'], out['precision'], out['recall'])
print("---- Timing ----")
print(timing, "seconds")
```
# Extract-Transform-Load
*Due Friday 5/22/20 11:59 PM*
Extract, transform, load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source(s). In this project, you will write some of the core primitives in an ETL system.
## Getting Started
First, pull the most recent changes from the cmsc13600-public repository:
```
$ git pull
```
Then, copy the `hw4` folder to your submission repository. Change directories to enter your submission repository. Your code will go into the `etl.py` and `etl_programs.py` files. You can add the files to the repository using `git add`:
```
$ git add *.py
$ git commit -m'initialized homework'
```
You will additionally have to install the Pandas library to do this assignment:
```
$ pip3 install pandas
```
Feel free to skip this section if you already know how Pandas works. Pandas is a data analysis toolkit for Python that makes it easy to work with tabular data. We organize our tutorial of this library around an exploration of data from the 2015 New York City Street Tree Survey, which is freely available from the New York City open data portal (https://data.cityofnewyork.us). This survey was performed by the New York City Department of Parks and Recreation with help from more than 2000 volunteers. The goal of the survey is to catalog the trees planted on the City right-of-way, typically the space between the sidewalk and the curb, in the five boroughs of New York. The survey data is stored in a CSV file that has 683,789 lines, one per street tree. (Hereafter we will refer to trees rather than street trees for simplicity.) The census takers record many different attributes for each tree, such as the common name of the species, the location of the tree, etc. Of these values, we will use the following:
* boroname: the name of the borough in which the tree resides;
* health: an estimate of the health of the tree: one of good, fair, or poor;
* latitude and longitude: the location of the tree using geographical coordinates;
* spc_common: the common, as opposed to Latin, name of the species;
* status: the status of the tree: one of alive, dead, or stump;
* tree_id: a unique identifier
Some fields are not populated for some trees. For example, the health field is not populated for dead trees and stumps and the species field (spc_common) is not populated for stumps and most dead trees.
To use pandas, you can simply import it as follows:
```
>>> import pandas as pd
```
DataFrames are the main data structure in Pandas. The library function pd.read_csv takes the name of a CSV file and loads it into a data frame. Let’s use this function to load the tree data from a file named 2015StreetTreesCensus_TREES.csv:
```
print("The max size of m is: ", m.limit)
>>> trees = pd.read_csv("2015StreetTreesCensus_TREES.csv")
```
The variable trees now refers to a Pandas DataFrame. Let’s start by looking at some of the actual data. You can similarly create a DataFrame from lists:
```
>>> df = pd.DataFrame([['Bob', 'Stewart'],
['Anna', 'Davis'],
['Jerry', 'Dole'],
['John', 'Marsh']],
columns=['first_name', 'last_name'])
```
We’ll explain the various ways to access data in detail later. For now, just keep in mind that the columns have names (for example, “Latitude”, “longitude”, “spc_common”, etc) and leverage the intuition you’ve built up about indexing in other data structures.
Here, for example, are a few columns from the first ten rows of the dataset:
```
>>> trees10 = trees[:10]
>>> trees10[["Latitude", "longitude", "spc_common", "health", "boroname"]]
Latitude longitude spc_common health boroname
0 40.723092 -73.844215 red maple Fair Queens
1 40.794111 -73.818679 pin oak Fair Queens
2 40.717581 -73.936608 honeylocust Good Brooklyn
3 40.713537 -73.934456 honeylocust Good Brooklyn
4 40.666778 -73.975979 American linden Good Brooklyn
5 40.770046 -73.984950 honeylocust Good Manhattan
6 40.770210 -73.985338 honeylocust Good Manhattan
7 40.762724 -73.987297 American linden Good Manhattan
8 40.596579 -74.076255 honeylocust Good Staten Island
9 40.586357 -73.969744 London planetree Fair Brooklyn
```
Notice that the result looks very much like a table in which both the columns and the rows are labelled. In this case, the column labels came from the first row in the file and the rows are simply numbered starting at zero.
Here’s the full first row of the dataset with all 41 attributes:
```
>>> trees.iloc[0]
created_at 08/27/2015
tree_id 180683
block_id 348711
the_geom POINT (-73.84421521958048 40.723091773924274)
tree_dbh 3
stump_diam 0
curb_loc OnCurb
status Alive
health Fair
spc_latin Acer rubrum
spc_common red maple
steward None
guards None
sidewalk NoDamage
user_type TreesCount Staff
problems None
root_stone No
root_grate No
root_other No
trnk_wire No
trnk_light No
trnk_other No
brnch_ligh No
brnch_shoe No
brnch_othe No
address 108-005 70 AVENUE
zipcode 11375
zip_city Forest Hills
cb_num 406
borocode 4
boroname Queens
cncldist 29
st_assem 28
st_senate 16
nta QN17
nta_name Forest Hills
boro_ct 4073900
state New York
Latitude 40.7231
longitude -73.8442
x_sp 1.02743e+06
y_sp 202757
Name: 0, dtype: object
```
and here are a few specific values from that row:
```
>>> first_row = trees.iloc[0]
>>> first_row["Latitude"]
40.72309177
>>> first_row["longitude"]
-73.84421522
>>> first_row["boroname"]
'Queens'
```
Notice that the latitude and longitude values are floats, while the borough name is a string. Conveniently, read_csv analyzes each column and if possible, identifies the appropriate type for the data stored in the column. If this analysis cannot determine a more specific type, the data will be represented using strings.
We can also extract data for a specific column:
```
>>> trees10["boroname"]
0 Queens
1 Queens
2 Brooklyn
3 Brooklyn
4 Brooklyn
5 Manhattan
6 Manhattan
7 Manhattan
8 Staten Island
9 Brooklyn
Name: boroname, dtype: object
```
and we can easily do useful things with the result, like count the number of times each unique value occurs:
```
>>> trees10["boroname"].value_counts()
Brooklyn 4
Manhattan 3
Queens 2
Staten Island 1
Name: boroname, dtype: int64
```
Now that you have some feel for the data, we’ll move on to discussing some useful attributes and methods provided by data frames. The shape attribute yields the number of rows and columns in the data frame:
```
>>> trees.shape
(683788, 42)
```
The data frame has fewer rows (683,788) than lines in the file (683,789), because the header row is used to construct the column labels and does not appear as a regular row in the data frame. To access a row using the row number, that is, its position in the data frame, we use iloc operator and square brackets:
```
>>> trees.iloc[3]
created_at 08/27/2015
block_id 348711
the_geom POINT (-73.84421521958048 40.723091773924274)
tree_dbh 3
stump_diam 0
curb_loc OnCurb
status Alive
health Fair
spc_latin Acer rubrum
spc_common red maple
steward None
guards None
sidewalk NoDamage
user_type TreesCount Staff
problems None
root_stone No
root_grate No
root_other No
trnk_wire No
trnk_light No
trnk_other No
brnch_ligh No
brnch_shoe No
brnch_othe No
address 108-005 70 AVENUE
zipcode 11375
zip_city Forest Hills
cb_num 406
borocode 4
boroname Queens
cncldist 29
st_assem 28
st_senate 16
nta QN17
nta_name Forest Hills
boro_ct 4073900
state New York
Latitude 40.7231
longitude -73.8442
x_sp 1.02743e+06
y_sp 202757
Name: 180683, dtype: object
```
In both cases, the result of evaluating the expression has type Pandas Series.
We can extract the values in a specific column using square brackets with the column name as the index:
```
>>> trees10["spc_common"]
tree_id
180683 red maple
200540 pin oak
204026 honeylocust
204337 honeylocust
189565 American linden
190422 honeylocust
190426 honeylocust
208649 American linden
209610 honeylocust
192755 London planetree
Name: spc_common, dtype: object
```
We can also use dot notation to access a column, if the corresponding label conforms to the rules for Python identifiers and does not conflict with the name of a DataFrame attribute or method:
```
>>> trees10.spc_common
tree_id
180683 red maple
200540 pin oak
204026 honeylocust
204337 honeylocust
189565 American linden
190422 honeylocust
190426 honeylocust
208649 American linden
209610 honeylocust
192755 London planetree
Name: spc_common, dtype: object
```
The tree dataset has many columns, most of which we will not be using to answer the questions posed at the beginning of the chapter. As we saw above, we can extract the desired columns using a list as the index:
```
>>> cols_to_keep = ['spc_common', 'status', 'health', 'boroname', 'Latitude', 'longitude']
>>> trees_narrow = trees[cols_to_keep]
>>> trees_narrow.shape
(683788, 6)
```
This new data frame has the same number of rows and the same index as the original data frame, but only six columns instead of the original 41.
If we know in advance that we will be using only a subset of the columns, we can specify the names of the columns of interest to pd.read_csv and get the slimmer data frame to start. Here’s a function that uses this approach to construct the desired data frame:
```
>>> def get_tree_data(filename):
... '''
... Read slim version of the tree data and clean up the labels.
...
... Inputs:
... filename: name of the file with the tree data
...
... Returns: DataFrame
... '''
... cols_to_keep = ['tree_id', 'spc_common', 'status', 'health', 'boroname',
... 'Latitude', 'longitude']
... trees = pd.read_csv(filename, index_col="tree_id",
... usecols=cols_to_keep)
... trees.rename(columns={"Latitude":"latitude"}, inplace=True)
... return trees
...
...
>>> trees = get_tree_data("2015StreetTreesCensus_TREES.csv")
```
A few things to notice about this function: first, the index column, tree_id, needs to be included in the value passed with the usecols parameter. Second, we used the rename method to fix a quirk with the tree data: “Latitude” is the only column name that starts with a capital letter. We fixed this problem by supplying a dictionary that maps the old name of a label to a new name using the columns parameter. Finally, by default, rename constructs a new dataframe. Calling it with the inplace parameter set to True causes the frame to be updated in place instead.
We encourage you to read the Pandas API before you do this homework; most of the functions that you will implement are trivial if you have the right Pandas library routine!
https://pandas.pydata.org/pandas-docs/stable/reference/index.html
## Implementing ETL Functions
The ETL class defines basic language primitives for manipulating Pandas
DataFrames. It takes a DataFrame in and outputs a transformed DataFrame. You will implement several of the routines to perform these transformations.
Here is how we intend the `ETL` class to be used. You can create DataFrame and create an ETL class that takes the DataFrame as input.
```
>> df1 = pd.DataFrame([['Bob', 'Stewart'],
['Anna', 'Davis'],
['Jerry', 'Dole'],
['John', 'Marsh']],
columns=["first_name", "last_name"])
>> etl = ETL(df1)
```
For example, the add() function creates a new column with a specified value. We might want to add a new column to represent ages:
```
>> etl.add("age", 0)
>> etl.df
first_name last_name age
0 Bob Stewart 0
1 Anna Davis 0
2 Jerry Dole 0
3 John Marsh 0
```
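If it helps to picture what the class is doing, here is a minimal sketch of how `ETL` might wrap a DataFrame and how `add` could be implemented. This is an illustration only; the starter code in `etl.py` defines the actual class and constructor you should use.
```
import pandas as pd

class ETL:
    def __init__(self, df):
        # the object simply wraps a DataFrame and transforms it in place
        self.df = df

    def add(self, colname, value):
        # create a new column where every row holds the same value
        self.df[colname] = value
```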
### Drop and Copy
As a warm-up, the first functions that you will write are `drop(colname)` which drops a column from the dataset with a specific column name.
```
>> etl.drop(colname="first_name")
>> etl.df
last_name
0 Stewart
1 Davis
2 Dole
3 Marsh
```
and `copy(colname, new_colname)` which duplicates a column and saves it to the new name:
```
>> etl.copy(colname="first_name", new_colname="first_name2")
>> etl.df
first_name last_name first_name2
0 Bob Stewart Bob
1 Anna Davis Anna
2 Jerry Dole Jerry
3 John Marsh John
```
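One possible way to implement these two methods, continuing the hypothetical `ETL` sketch above (your starter code may expect something slightly different):
```
    # methods inside the ETL class
    def drop(self, colname):
        # remove the named column from the DataFrame
        self.df = self.df.drop(columns=[colname])

    def copy(self, colname, new_colname):
        # duplicate an existing column under a new name
        self.df[new_colname] = self.df[colname]
```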
### Split/Merge
Next, you will write a `split(colname, new_colname, splitter)` function. This function takes an input dataframe and splits all values in colname on a delimiter. It puts the substring before the delimiter in colname, and the substring after the delimiter in a new column. For example,
```
>>> df1 = pd.DataFrame([['Bob-Stewart'],
['Anna-Davis'],
['Jerry-Dole'],
['John']],
columns=["name"])
>>> etl = ETL(df1)
>> etl.split("name", "last_name","-")
>> etl.df
name last_name
0 Bob Stewart
1 Anna Davis
2 Jerry Dole
3 John
```
When a value does not contain the delimiter, new_colname is set to the empty string. The `merge` function does the opposite of `split`: `merge(col1, col2, splitter)` replaces col1
with the values of col1 and col2 concatenated and separated by the delimiter. If the value in either col1 or col2 is an empty string, then the delimiter is ignored:
```
>> etl.df
name last_name
0 Bob Stewart
1 Anna Davis
2 Jerry Dole
3 John
>> pw1.merge("name", "last_name","-")
name last_name
0 Bob-Stewart Stewart
1 Anna-Davis Davis
2 Jerry-Dole Dole
3 John
```
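Here is one possible sketch of `split` and `merge` using `apply`, again written as methods inside the hypothetical class above. It mirrors the behavior in the examples (an empty string when the delimiter is missing, and the delimiter skipped when either value is empty), but treat it only as a starting point.
```
    # methods inside the ETL class
    def split(self, colname, new_colname, splitter):
        # substring before the first delimiter stays in colname;
        # the substring after it goes into new_colname ("" if no delimiter)
        original = self.df[colname]
        self.df[new_colname] = original.apply(
            lambda v: v.split(splitter, 1)[1] if splitter in v else "")
        self.df[colname] = original.apply(
            lambda v: v.split(splitter, 1)[0])

    def merge(self, col1, col2, splitter):
        # concatenate col1 and col2 with the delimiter, skipping the
        # delimiter when either value is the empty string
        def join(row):
            a, b = row[col1], row[col2]
            return a + b if a == "" or b == "" else a + splitter + b
        self.df[col1] = self.df.apply(join, axis=1)
```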
### Format
Next, you will write a `format` function that transforms values in a specified column. Format applies an input function to every value in a column. For example,
```
df1 = pd.DataFrame([['Bob-Stewart'],
['Anna-Davis'],
['Jerry-Dole'],
['John']],
columns=["name"])
etl = ETL(df1)
etl.format("name", lambda x: x.replace("-",","))
>> etl.df
name
0 Bob,Stewart
1 Anna,Davis
2 Jerry,Dole
3 John
```
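`format` is essentially a thin wrapper around `apply`; a sketch, in the same hypothetical class:
```
    # method inside the ETL class
    def format(self, colname, fn):
        # apply the supplied function to every value in the column
        self.df[colname] = self.df[colname].apply(fn)
```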
### Divide
Divide conditionally divides a column, sending values that satisfy the condition into one new column and values that do not into another. For example, consider the data frame below that has names delimited by two different delimiters. Divide can be used to separate these:
```
df1 = pd.DataFrame([['Bob-Stewart'],
['Anna-Davis'],
['Jerry-Dole'],
['John,Smith']],
columns=["name"])
etl = ETL(df1)
etl.divide("name", "dash", "comma", lambda x: '-' in x)
>> etl.df
name dash comma
0 Bob-Stewart Bob-Stewart
1 Anna-Davis Anna-Davis
2 Jerry-Dole Jerry-Dole
3 John,Smith John,Smith
```
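A sketch of `divide` in the same style; values that fail the condition are left as empty strings, matching the blank cells in the example above.
```
    # method inside the ETL class
    def divide(self, colname, new_colname1, new_colname2, condition):
        # route each value into one of the two new columns depending on
        # whether it satisfies the condition; the other column gets ""
        values = self.df[colname]
        self.df[new_colname1] = values.apply(lambda v: v if condition(v) else "")
        self.df[new_colname2] = values.apply(lambda v: "" if condition(v) else v)
```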
In: "the", "cow", "jumped", "over", "the", "moon"
Out: ("the",2), ("cow",1), ("jumped",1), ("over",1), ("moon",1)
## ETL Programs
Now, you will use the functions that you wrote to write ETL programs. The remainder of this homework must be completed using a sequence of functions from the ETL class. Your code goes into `etl_programs.py`. As an example, suppose we are given the data:
```
df1 = pd.DataFrame([['Bob-Stewart'],
['Anna-Davis'],
['Jerry-Dole'],
['John,Smith']],
columns=["name"])
```
Some of the names are delimited by dashes and some by commas in a single column `name`. We want to transform this dataframe to have two columns `first_name` and `last_name` with the appropriate names extracted from the dataframe. We could do the following:
```
>> pw1 = ETL(df1)
>> pw1.divide("name", "dash", "comma", lambda x: '-' in x) #divide on a dash
>> pw1.split("dash", "last_name_dash", "-") #split the dash column
>> pw1.split("comma", "last_name_comma", ",") #split the comma column
>> pw1.add("last_name", "")
>> pw1.add("first_name", "")
>> pw1.merge("first_name", "dash", "") #add the first names
>> pw1.merge("first_name", "comma", "")
>> pw1.merge("last_name", "last_name_dash", "") #add the lastnames
>> pw1.merge("last_name", "last_name_comma", "")
>> pw1.drop("name") #drop uncessary columns
>> pw1.drop("dash")
>> pw1.drop("comma")
>> pw1.drop("last_name_dash")
>> pw1.drop("last_name_comma")
>> pw1.df
last_name first_name
0 Stewart Bob
1 Davis Anna
2 Dole Jerry
3 Smith John
```
The following functions are case insensitive.
### phone
You are given an input dataframe as follows:
```
In: "a", "b", "b", "a", "c"
Out: ("c",1),("b",2), ("a", 2)
df = pd.DataFrame([['(408)996-758'],
['+1 667 798 0304'],
['(774)998-758'],
['+1 442 030 9595']],
columns=["phoneno"])
```
Write an ETL program that results in a dataframe with two columns: area_code, phone_number. area_code must be formatted as a number with only digits (no parens), and the phone number must be of the form xxx-xxxx.
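One possible composition of the primitives, sketched under the assumption that `divide`, `copy`, `format`, `merge`, `add`, and `drop` behave exactly as in the examples above; the intermediate column names are arbitrary, and there are many other valid programs.
```
etl = ETL(df)
# separate the two formats: "(408)996-758" vs "+1 667 798 0304"
etl.divide("phoneno", "paren", "plus", lambda x: x.startswith("("))
# parenthesized numbers: the area code sits between "(" and ")"
etl.copy("paren", "paren_area")
etl.format("paren_area", lambda x: x[1:x.find(")")] if x else "")
etl.format("paren", lambda x: x[x.find(")") + 1:] if x else "")
# "+1 ..." numbers: drop the country code, area code is the next 3 digits
etl.format("plus", lambda x: x.replace("+1 ", ""))
etl.copy("plus", "plus_area")
etl.format("plus_area", lambda x: x[:3])
etl.format("plus", lambda x: x[4:].replace(" ", "-"))
# stitch the two cases back together into the required columns
etl.add("area_code", "")
etl.add("phone_number", "")
etl.merge("area_code", "paren_area", "")
etl.merge("area_code", "plus_area", "")
etl.merge("phone_number", "paren", "")
etl.merge("phone_number", "plus", "")
for col in ["phoneno", "paren", "plus", "paren_area", "plus_area"]:
    etl.drop(col)
```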
### date
You are given an input dataframe as follows:
```
df = pd.DataFrame([['03/2/1990'],
['2/14/1964'],
['1990-04-30'],
['7/9/2012'],
['1989-09-13'],
['1994-08-21'],
['1996-11-30'],
['2004-12-23'],
['4/21/2016']],
columns=["date"])
```
Write an ETL program that results in a dataframe with three columns: day, month, year. The day must be in two-digit format, e.g., 01, 08. The month must be the full month name, e.g., "May". The year must be in YYYY format.
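A sketch of one possible program, under the same assumptions about the primitives; the month-name dictionary and the intermediate column names are illustrative only.
```
month_names = {"01": "January", "02": "February", "03": "March",
               "04": "April", "05": "May", "06": "June",
               "07": "July", "08": "August", "09": "September",
               "10": "October", "11": "November", "12": "December"}

etl = ETL(df)
# slash dates are month/day/year, dash dates are year-month-day
etl.divide("date", "slash", "dash", lambda x: "/" in x)
etl.split("slash", "slash_rest", "/")        # slash=month, slash_rest=day/year
etl.split("slash_rest", "slash_year", "/")   # slash_rest=day, slash_year=year
etl.split("dash", "dash_rest", "-")          # dash=year, dash_rest=month-day
etl.split("dash_rest", "dash_day", "-")      # dash_rest=month, dash_day=day
# combine the two cases, then normalize each part
etl.add("day", "")
etl.merge("day", "slash_rest", "")
etl.merge("day", "dash_day", "")
etl.format("day", lambda x: x.zfill(2))
etl.add("month", "")
etl.merge("month", "slash", "")
etl.merge("month", "dash_rest", "")
etl.format("month", lambda x: month_names[x.zfill(2)])
etl.add("year", "")
etl.merge("year", "slash_year", "")
etl.merge("year", "dash", "")
for col in ["date", "slash", "slash_rest", "slash_year",
            "dash", "dash_rest", "dash_day"]:
    etl.drop(col)
```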
### name
You are given an input dataframe as follows:
```
df = pd.DataFrame([['Such,Bob', ''],
['Ann', 'Davis'],
['Dole,Jerry', ''],
['Joan', 'Song']],
columns=["first_name", "last_name"])
```
Some of the names are incorrectly formatted: the "first_name" column actually contains the person's "last_name,first_name". Write an ETL program that correctly formats the names
into first_name and last_name, so all the cells are appropriately filled.
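One sketch, with the same caveats as above: separate the rows where first_name actually holds "last,first", split them, and merge everything back. The intermediate column names are arbitrary.
```
etl = ETL(df)
# rows containing a comma have the name flipped into the first_name column
etl.divide("first_name", "flipped", "plain", lambda x: "," in x)
etl.split("flipped", "flipped_first", ",")   # flipped=last name, flipped_first=first name
etl.add("fixed_first", "")
etl.merge("fixed_first", "plain", "")
etl.merge("fixed_first", "flipped_first", "")
etl.merge("last_name", "flipped", "")        # fill in the empty last names
for col in ["first_name", "flipped", "plain", "flipped_first"]:
    etl.drop(col)
etl.copy("fixed_first", "first_name")
etl.drop("fixed_first")
```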
## Submission
After you finish the assignment you can submit your code with:
```
$ git push
```
# Out-of-Core Group By Aggregate
*Graduating Seniors: Due 6/5/20 11:59 PM*
*Everyone else: Due 6/8/20 11:59 PM*
In this assignment, you will implement an out-of-core
version of the group by aggregate (aggregation by key)
seen in lecture. You will have a set memory limit and
you will have to count the number of times a string shows
up in an iterator. Your program should work for any limit
greater than 20.
## Getting Started
First, pull the most recent changes from the cmsc13600-public repository:
```
$ git pull
```
Then, copy the `hw5` folder to your submission repository. Change directories to enter your submission repository. Your code will go into `countD.py`; this is the only file that you will modify. Finally, add `countD.py` using `git add`:
```
$ git add countD.py
$ git commit -m'initialized homework'
```
Now, you will need to fetch the data used in this assignment. Download title.csv and put it in the hw5 folder:
https://www.dropbox.com/s/zl7yt8cl0lvajxg/title.csv?dl=0
DO NOT ADD title.csv to the git repo! After downloading the dataset, you can use the Python module `core.py`, which is provided for you and reads the dataset. This module loads the data in as
an iterator through two functions, `imdb_years()` and `imdb_title_words()`:
```
>> for i in imdb_years():
... print(i)
1992
1986
<so on>
```
Play around with both `imdb_years()` and `imdb_title_words()` to get a feel for how the data works.
## MemoryLimitedHashMap
In this project, the main data structure is the `MemoryLimitedHashMap`. This is a hash map that has an explicit limit on the number of keys it can store. To create one of these data structures, you can import it from the core module:
```
from core import *
#create a memory limited hash map
m = MemoryLimitedHashMap()
```
To find out what the limit of this hash map is, you can:
```
print("The max size of m is: ", m.limit)
```
The data structure can be constructed with an explicit limit (the default is 1000), e.g., `MemoryLimitedHashMap(limit=10)`.
Adding data to this hash map works much like the hash maps you have probably seen before in a data structures class. There is a `put` function that takes in a key and assigns that key a value:
```
# put some keys
m.put('a', 1)
m.put('b', 45)
print("The size of m is: ", m.size())
```
You can fetch the data using the `get` function and `keys` function:
```
# get keys
for k in m.keys():
print("The value at key=", k, 'is', m.get(k))
# You can test to see if a key exists
print('Does m contain a?', m.contains('a'))
print('Does m contain c?', m.contains('c'))
```
When a key does not exist in the data structure the `get` function will raise an error:
```
#This gives an error:
m.get('c')
```
Similarly, if you assign too many unique keys (more than the limit) you will get an error:
```
for i in range(0,1001):
m.put(str(i), i)
```
The `MemoryLimitedHashMap` allows you to manage this limited storage with a `flushKey` function that persists a key and its assigned value to disk. When you flush a key, it is removed from the data structure and the limit is decremented. `flushKey` takes a key as a parameter.
```
m.flushKey('a')
print("The size of m is: ", m.size())
```
Note that the disk is not intelligent! If you flush a key multiple times it simply appends the flushed value to a file on disk:
```
m.flushKey('a')
<some work...>
m.flushKey('a')
```
Once a key has been flushed it can be read back using the `load` function (which takes a key as a parameter). This loads back *all* of the flushed values:
```
#You can also load values from disk
for k,v in m.load('a'):
print(k,v)
```
If you try to load a key that has not been flushed, you will get an error:
```
#Error!!
for k,v in m.load('d'):
print(k,v)
```
If you want multiple flushes of the same key to be differentiated, you can set a *subkey*:
```
#first flush
m.flushKey('a', '0')
<some work...>
#second flush
m.flushKey('a', '1')
```
The `load` function allows you to selectively pull
certain subkeys:
```
# pull only the first flush
m.load('a', '0')
```
We can similarly iterate over all of the flushed data (which optionally takes a subkey as well!):
```
for k,v in m.loadAll():
print(k,v)
```
It also takes in an optional parameter that includes the in memory keys as well:
```
for k,v in m.loadAll(subkey='myskey', inMemory=True):
print(k,v)
```
Since there are some keys in memory and some flushed to disk there are two commands to iterate over keys.
```
m.keys() #returns all keys that are in memory
```
There is also a way to iterate over all of the flushed keys (will strip out any subkeys):
```
m.flushed() #return keys that are flushed.
```
## Count Per Group
In this assignment, you will implement an out-of-core count operator that, for each distinct string in an iterator, returns
the number of times it appears (in no particular order).
For example,
```
In: "the", "cow", "jumped", "over", "the", "moon"
Out: ("the",2), ("cow",1), ("jumped",1), ("over",1), ("moon",1)
```
Or,
```
In: "a", "b", "b", "a", "c"
Out: ("c",1),("b",2), ("a", 2)
```
The catch is that you CANNOT use a list, dictionary, or set from
Python. We provide a general purpose data structure called a MemoryLimitedHashMap (see ooc.py). You must maintain the iterator
state using this data structure.
The class that you will implement is called Count (in countD.py).
The constructor is written for you, and it takes in an input iterator and a MemoryLimitedHashMap. You will use these objects
in your implementation. You will have to implement `__next__` and `__iter__`. Any solution using a list, dictionary, or set inside `Count` will receive 0 points.
The hint is to do this in multiple passes and use a subkey to track keys flushed between different passes.
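To make the hint a little more concrete, here is a rough sketch of the counting idea only. It is not a drop-in `Count.__next__`, the helper name `count_stream` is made up, and it assumes the `MemoryLimitedHashMap` behaves exactly as documented above (in particular, that `flushed()` yields each flushed key once). Turning this into the required iterator class, and checking that it respects the no-list/dict/set rule, is the actual assignment.
```
def count_stream(words, m):
    # Pass 1: accumulate counts in the MemoryLimitedHashMap, evicting an
    # arbitrary in-memory key to disk whenever the map is full.
    for w in words:
        if m.contains(w):
            m.put(w, m.get(w) + 1)
        else:
            if m.size() >= m.limit:
                victim = next(iter(m.keys()))
                m.flushKey(victim, 'pass1')
            m.put(w, 1)

    # Pass 2: for every key that was ever flushed, combine its partial
    # counts from disk with whatever is still in memory.
    for k in m.flushed():
        total = sum(v for _, v in m.load(k, 'pass1'))
        if m.contains(k):
            total += m.get(k)
            m.flushKey(k, 'done')   # remove it from memory so it is not re-emitted
        yield (k, total)

    # Keys that were never flushed still live only in memory.
    for k in m.keys():
        yield (k, m.get(k))
```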
## Testing and Submission
We have provided a series of basic tests in `auto_grader.py`; these tests are incomplete and are not meant to comprehensively grade your assignment. There is a file `years.json` with an expected output. After you finish the assignment you can submit your code with:
```
$ git push
```