CryptoDataScience Part 2: Exploratory Data Analysis on CryptoPunks Transactions

Jeanna Schoonmaker
8 min read · Nov 3, 2021


Cleaning Blockchain Data for Data Science and Analysis

gm!

In my previous story about crypto data science, I discussed several possible sources for accessing and downloading blockchain data. In Part 2, I will explore data from two of the sources mentioned in Part 1: cleaning it, analyzing it, and prepping it for the machine learning work that will be Part 3.

The two data sources I’m using for this article are the punks.csv dataset from the Omni Analytics course Learning Data Science on the Ethereum Blockchain, found here: https://github.com/Omni-Analytics-Group/eth-data-science-course, and the Flipside Crypto API, created by querying the Ethereum blockchain with the code shown in Part 1.

First, let’s start with the easy part — loading data from the csv.

CSV Data

import pandas as pd

punks_url = 'https://raw.githubusercontent.com/Omni-Analytics-Group/eth-data-science-course/master/courses/Module%201/punks.csv'
df = pd.read_csv(punks_url, encoding='unicode_escape')
df.head()

will give us this:

5 rows of a dataset showing CryptoPunks transactions

Let’s dig into the data a little more.

Table of metadata collected about the dataframe from the command df.info

It looks like the “Txn” field has our date info, but is in string format. Let’s change it to datetime format and then take a look at the max and min dates in this dataset.

Code showing converting a column to datetime, then finding the min and max dates within that column
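The conversion shown above can be sketched like this, using a toy frame in place of the real data (the date column is named “Txn” in the dataset):

```python
import pandas as pd

# Toy frame standing in for the punks data; the real 'Txn' column holds date strings
df = pd.DataFrame({'Txn': ['Jun 23, 2017', 'Dec 30, 2020']})
df['Txn'] = pd.to_datetime(df['Txn'])  # string -> datetime64
print(df['Txn'].min(), df['Txn'].max())
```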

Let’s explore further.

df.describe and the resulting data table showing statistics associated with the fields

Initial thoughts after first pass at this data:

  1. There are 17554 transactions in this dataset, and 10019 of them occurred on Jun 23, 2017 when the punks were first made available. More than half of our transactions in this dataset are from the initial ‘claims’ of the punks.
  2. The “From” and “To” fields indicate the number of unique users who transacted in this dataset. Interesting that with 10,000 punks, there are only 682 unique users on the selling side of the transaction, with 1524 users on the buying/claiming side of the transaction.
  3. The “Sex,” “Type,” and “Skin” columns refer to descriptive Punks attributes. The ID is the Punk identifier. “Slots” and “Rank” offer additional data about the sales of the Punks.
  4. Ethereum addresses are 40-character hexadecimal strings. However, in this dataset, the “From” and “To” columns, which show the users buying and selling the Punks, have been shortened to 8 characters. Some addresses also use a name instead of the shortened hex string.
  5. The column titled “Crypto” is referring to the cryptocurrency Ether. With a mean of 2.39 Eth/857.31 USD and a standard deviation of 6.12 Eth/2833.27 USD, we have a lot of variation in our data. Let’s take a look at it in a scatterplot.
Scatterplot graph of CryptoPunks transactions through 12/30/2020

CryptoPunks sales activity was scarce but steady until 2020, then picked up considerably from summer through fall of 2020. There also appear to be a few outliers in our data.

One more quick analysis of this data — with only 682 unique sellers and 1524 unique buyers behind 17554 transactions, let’s see which users are responsible for the most buying and selling in this dataset.

Data showing top users and sellers of CryptoPunks through 12/30/2020
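One way to produce counts like those above is value_counts on the “From” and “To” columns; a minimal sketch with made-up addresses:

```python
import pandas as pd

# Toy data in place of the real punks dataframe; addresses are hypothetical
df = pd.DataFrame({
    'From': ['0xc352b5', '0xc352b5', '0x00d7c9'],
    'To':   ['0x69e273', '0xf5099e', '0x69e273'],
})
top_sellers = df['From'].value_counts().head(10)  # most frequent sellers
top_buyers = df['To'].value_counts().head(10)     # most frequent buyers
print(top_sellers)
print(top_buyers)
```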

It is great to be looking at understandable blockchain data, and pd.read_csv is one of the easiest ways to start exploring.

But what if you need to access raw blockchain data and clean it yourself?

Now let’s bring in 2021’s data from Flipside Crypto using the process I introduced in Part 1.

Flipside Crypto Data

Here is the code to bring in data from Flipside Crypto’s API:

import requests

url = 'https://api.flipsidecrypto.com/api/v2/queries/99990404-5727-40cb-989a-f9d66738bdca/data/latest'
response = requests.get(url)
print(f"Request returned {response.status_code} : '{response.reason}'")
payload = response.json()

json_normalize helps unpack the JSON response. Then let’s look at the first few rows of data.

df_flip = pd.json_normalize(payload)
df_flip.head()
First few rows of transaction data retrieved from Flipside Crypto about CryptoPunks transactions

Looking interesting already, but we still have a nested json result in our EVENT_INPUTS column. We’ll use flat-table to unpack that, then take another look.

import json
import flat_table

df_flip['EVENT_INPUTS'] = df_flip['EVENT_INPUTS'].apply(lambda x: json.loads(x))
df_flip = flat_table.normalize(df_flip)
df_flip.head()
First few rows of modified data table with more unnested json data now visible

We can now see the EVENT_INPUTS values in their own columns based on the key names.

Unfortunately, we can also see that the EVENT_INPUTS.value (which is the amount paid for a transaction) has a negative value in a field which should not have negative values.
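A quick sanity check like the one that surfaced this problem can be sketched as follows (toy data; “EVENT_INPUTS.value” is the flattened column name flat-table produces from the nested key):

```python
import pandas as pd

# Toy frame mimicking the flattened Flipside result; values are in wei
df_flip = pd.DataFrame({'EVENT_INPUTS.value': [1.0e18, -5.0e17, 2.5e18]})

# A payment amount should never be negative, so flag any rows that are
bad_rows = df_flip[df_flip['EVENT_INPUTS.value'] < 0]
print(len(bad_rows))  # any non-zero count means the payment field is suspect
```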

While I originally planned for this part to have EDA on datasets from both a csv and from the Flipside Crypto API, the data from Flipside is not accurate right now. I have submitted a ticket to Flipside to determine what has gone wrong. In the meantime I will share my thoughts and then move on with a different data source.

Thoughts from this dataset:

  1. The csv was much more nicely formatted. As is generally the case, raw data is messy, noisy, and likely needs quite a bit of work before it is understandable and/or useful.
  2. I see several columns we don’t need — as mentioned in Part 1, the contract address refers to the original address of the CryptoPunks contract, so it will be the same value for every row and can be dropped. Same with the “PunkBought” event_name. Next steps will include dropping unnecessary columns.
  3. The To and From addresses in this dataset contain the entire hexadecimal string instead of just the first 8 characters. It does not include any named accounts, so we will need to reconcile named accounts with hex strings if we plan to join the csv dataset with this one.
  4. The EVENT_INPUTS.value field looks like our cryptocurrency amount for the transaction — but why is the number so large? Because it shows the transaction amount in wei (see here for more info: https://www.investopedia.com/terms/w/wei.asp) which we will convert to Ether by dividing by 1e18.
  5. A key difference from our csv dataset — we don’t have the USD conversion in this dataset, so we will need to add that in ourselves.
  6. Finally — data projects often don’t go as planned. Being able to pivot and augment or use alternate sources is an important skill when working with data!
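The wei-to-Ether conversion from point 4 is just a division by 1e18; a minimal helper:

```python
WEI_PER_ETH = 10**18  # 1 Ether = 10^18 wei

def wei_to_eth(wei: int) -> float:
    """Convert a wei amount (as stored on-chain) to Ether."""
    return wei / WEI_PER_ETH

print(wei_to_eth(4_200_000_000_000_000_000_000))  # 4200.0 Eth
```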

As an alternate data source, I’ll be using the CryptoPunks transaction data from Kaggle, also mentioned in Part 1. You can either download the file to your local computer, then upload it to your google drive to access it in Google Colab (as I’ve done below), or you can use it directly in Kaggle at https://www.kaggle.com/tunguz/cryptopunks

Table showing Punks transaction data

I am only interested in Punks sales and Punks transfers, and since our other csv has data through 12/30/20, we’ll filter this one to start where that one left off.

df_alt['date'] = pd.to_datetime(df_alt['date']).dt.date
start_date = pd.to_datetime('2020-12-30').date()
df_alt = df_alt.loc[(df_alt['txn_type'] == 'Sold') | (df_alt['txn_type'] == 'Transfer')].copy()
df_alt = df_alt.loc[df_alt['date'] > start_date].copy()
df_alt.describe() results

Interesting to note — in the 3.5 year period from Jun 2017 through Dec 2020, there were 7554 sale transactions of CryptoPunks to 1524 unique users.

In the 9.5 month time frame from 12/31/20 to 10/7/21, there were over 11,400 transactions of Punks to 5256 unique users.

The max price paid for a Punk through 12/30/20 was 189.99 Eth, worth roughly $140k at the time.

The max price paid for a Punk from 12/31/20 to Oct 2021 was 4200 Eth, worth roughly — well, I’m not sure yet. We need to put a USD-Eth conversion rate in our dataset!

Many thanks to this blog post that walks through how to access conversion rates and create a dataframe: https://blog.patricktriest.com/analyzing-cryptocurrencies-python/. I will follow his process, so see his post for explanation and code, or see it in my Github here.

Here’s how our dataframe of conversion rates looks:

daily dataframe of Ether to USD conversion rates

Now that we have the conversion rate table, next steps include:

  • Moving the date from being an index column to a regular date column
  • Renaming the ‘close’ column as our conversion rate column
  • Joining the conversion rate table to the df_alt Punks dataframe
crypto_price_df.reset_index(inplace=True)
crypto_price_df['date_join'] = crypto_price_df['date'].dt.date
rates = crypto_price_df[['date_join', 'close']].copy()
rates = rates.rename(columns={'date_join': 'date', 'close': 'eth_usd_conversion_rate'})
df_alt2 = pd.merge(df_alt, rates, on='date', how='left')

Next we’ll multiply the Eth amount by the conversion rate to get USD for each transaction.

df_alt2['USD'] = df_alt2['eth'] * df_alt2['eth_usd_conversion_rate']
df_alt2['USD'] = pd.to_numeric(df_alt2['USD'])
df_alt2.fillna(0, inplace=True)
df_alt2['type_str'] = df_alt2['type'].apply(lambda x: ','.join(map(str, x)))

which gives us

df_alt2 describe info

The minimum transaction amount in this dataset is 0 USD, but the max is a whopping $7,658,401!

Let’s do another scatterplot.

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns

sns.set(style="darkgrid")
fig, ax = plt.subplots(figsize=(20, 8))
g = sns.scatterplot(x=df_alt2['date'], y=df_alt2['USD'], hue=df_alt2['type_str']).set(title='2021 Punks Purchases')
ax.yaxis.set_major_formatter(ticker.EngFormatter())
Scatterplot of 2021 Punks Purchases

The biggest difference between our previous scatterplot on 2017–2020 data and this one is the scale of the y axis.

Our previous scatterplot’s y axis topped out at 140k.

This scatterplot’s y axis is now scaled to millions!
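The millions-scale tick labels come from matplotlib’s EngFormatter, which renders numbers with engineering (SI) suffixes; a quick illustration using our max transaction amount:

```python
from matplotlib.ticker import EngFormatter

fmt = EngFormatter()
print(fmt(7_658_401))  # rendered with the 'M' (mega) suffix
```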

Finally, let’s take a look at this dataset’s top sellers and buyers:

List of users with the most selling transactions in 2021 Punks data
list of most common buyers in 2021 CryptoPunks dataset

Do any of these users look familiar from the top buyers and sellers in the 2017–2020 dataset…?

Now that we have loaded, cleaned, and explored our data, we’re ready to move forward with the final step in our data science project. More to come in Part 3!

All code used for this story can be found here: https://github.com/JSchoonmaker/blockchain_projects/blob/main/CryptoPunks_EDA_Part_2.ipynb
