CryptoDataScience Part 3: Two Methods for Anomaly Detection on the CryptoPunk Blockchain

Comparing Human in the Loop and Unsupervised Machine Learning Methods

Jeanna Schoonmaker
13 min read · Nov 10, 2021

A record-setting CryptoPunks sale occurred recently — on Oct 30, 2021, Punk 9998 sold for over 500 million dollars.

Or did it?

According to news.slashdot.org:

Larva Labs, which created the CryptoPunks, said on Twitter that “someone bought this punk from themself with borrowed money and repaid the loan in the same transaction.” Evidently, this isn’t the first time this has happened. “Some recent large bids were done the same way. The ether is offered and removed in a single transaction. So, while technically briefly valid, the bid can never be accepted. We’ll add filtering to avoid generating notifications for these kinds of transactions in the future.”

In conventional, regulated securities markets, this would be called wash trading, which is banned on grounds that trading with yourself can artificially inflate prices and suggest more demand than really exists.

In this case, wash trading was obvious. Are suspicious transactions always so easy to spot? That’s what we’ll be exploring in today’s post.

My CryptoDataScience posts thus far have involved:

Part 1 — Accessing blockchain data

Part 2 — Exploring blockchain data

all of which leads up to

Part 3 — Sciencing blockchain data

Now that we have determined how to access and explore blockchain data, I’ll be examining the CryptoPunks transaction data more closely to look for patterns of buying and selling behavior. This is our ‘human in the loop’ standard for determining which transactions look odd.

Following that, I’ll run an unsupervised machine learning anomaly detection algorithm on the data to see which transactions are caught in the algorithm’s net.

What are we looking for?

First, let’s define what we’re looking for in the blockchain transactions.

From Investopedia — the definition of wash trading is:

Wash trading is a process whereby a trader buys and sells a security for the express purpose of feeding misleading information to the market. In some situations, wash trades are executed by a trader and a broker who are colluding with each other, and other times wash trades are executed by investors acting as both the buyer and the seller of the security. Wash trading is illegal under U.S. law.

What does the data show about the possibility of wash trading in CryptoPunks transactions?

As mentioned in the definition and example above, wash trading is when one person “bids up” their own asset with a secondary account in an attempt to establish increased demand and justify a higher asking price for their asset.

Other than accounts that choose to name themselves or publicly post their wallet address, ownership of crypto wallets is pseudonymous (not fully anonymous: wallet owners are identifiable by their wallet address, but their identity, location, etc. are unknown), and a person can sign up for as many wallets as they want.

Thus, it is difficult to PROVE wash trading unless a person admits to owning several wallets that are shown to be buying from each other — or unless a person buys and sells a Punk from the same address, as occurred with Punk 9998.

However, it is possible to see patterns in the data. If the transactions identified within those patterns are also identified as anomalies by a machine learning algorithm, it strongly suggests that the transactions are suspicious and at least worth an eyebrow raise, even if nothing has been proven.

An example of wash trading in blockchain NFT transactions could look something like this:

  1. Person 1 ‘claims’ the original Punk #6276 in its initial offering.
  2. Person 1 sells Punk #6276 to Buyer A for $50.
  3. Buyer A sells Punk #6276 to Buyer B for $100.
  4. Buyer B sells Punk #6276 to Buyer C for $200.
  5. Buyer C sells Punk #6276 to Buyer A for $350.
  6. Buyer A sells Punk #6276 to New Buyer for $600.

And imagine steps 2–6 all happen within a few days or weeks.

This is a plausible turn of events — perhaps a little strange that Buyer A rebought the Punk after already selling it once, but maybe buyer’s remorse played a part?

However, what crosses the line from strange buying behavior into wash trading is if Buyer A, B, and C are all the same person or a group of people colluding together to buy and sell the Punk to each other in an effort to raise its perceived value, both because of increased volume of trades, and because of a steady increase in the asset price.

Buyer A/B/C has paid the original cost of the Punk ($50 in this example) and the gas fees required to ‘write’ the purchase transactions to the blockchain (let’s say $1 per transaction, so $4 total) but has sold the Punk for $600. The cryptocurrency amounts per transaction are transferred from the wallet of Buyer A to Buyer B to Buyer C, but since they are the same person, it’s the equivalent of transferring funds from one account to another, albeit with a small transaction fee. The wash trader has spent $54 and been paid $600 from New Buyer, for a net gain of $546.
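The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the hypothetical scenario above: the dollar amounts come from the example, and the variable names are illustrative assumptions, not real data.

```python
# Hypothetical wash-trade scenario from the example above (not real data).
original_cost = 50          # Buyer A's purchase from Person 1 (step 2)
gas_fee_per_purchase = 1    # assumed flat fee per purchase transaction
purchases = 4               # steps 2-5: the purchases made by Buyer A/B/C
final_sale = 600            # paid by New Buyer in step 6

# Transfers among A, B, and C cancel out because they are the same person,
# so only the original cost and the gas fees count as real spending.
total_spent = original_cost + gas_fee_per_purchase * purchases
net_gain = final_sale - total_spent

print(f"Spent ${total_spent}, received ${final_sale}, net gain ${net_gain}")
```

The intermediate prices ($100, $200, $350) never appear in the calculation, which is the point: they only exist to make the final price look justified.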

Now multiply that by several Punks, and multiply the cryptocurrency value by up to 10x, and it becomes clear why fraudulent transactions occur.

Let’s dig in and see what we find.

The Human In the Loop Approach

Plotting a histogram of the CryptoPunks data over time shows 3 distinct time periods when counts of Punk sales transactions were elevated — fall of 2020, spring 2021, and summer/fall 2021.

Histogram of CryptoPunks sales transactions over time

We also learned during the EDA that the scale of transactions changed significantly between 2020 and 2021, going from sales worth $140k per Punk up to sales worth millions. Because the anomaly detection algorithm we will use depends on looking for outliers in the data, I will split the data into 3 subsets representing the distinct time periods with heavier sales traffic.

Today’s post will center on the first dataset, which covers 2017 to 2020, with the main focus on the fall of 2020.

As noted at the end of CryptoDataScience Part 2, there are a few users/addresses that turn up many times in our CryptoPunks datasets as frequent sellers and buyers.

Next, I will take it a step further and identify the accounts that bought and sold the SAME Punk NFT more than once. Similar to the Buyer A/B/C example, while doing so might not be a concrete indicator of wash trading, it justifies another look.

Let’s create a subset of the 2017–2020 data with only the From, To, and ID fields — remember that ‘ID’ in this data refers to the CryptoPunk id, which is a unique identifier for each of the 10k Punks that were minted.

wash_check = df_punks[['From', 'To', 'ID']].copy()

Next, we’ll check for transactions where the same seller sold a Punk multiple times OR the same buyer bought a Punk multiple times, and we’ll put those Punk IDs into a list for further analysis.

Keep in mind, this will only identify accounts that transact on the same Punk multiple times. Due to the pseudonymous nature of crypto wallets, it is possible that there are suspicious trades that occur that aren’t detectable through this process.

As mentioned earlier, it’s also possible that none of these transactions are fraudulent. However, the ‘human in the loop’ baseline in this case is identifying accounts that repeatedly bought and sold the same asset, which merits delving into the transaction details for more information.

# Rows where the same seller sold the same Punk more than once
dupes = wash_check[wash_check.duplicated(['From', 'ID'])]
suss_ids = dupes['ID'].tolist()
# Rows where the same buyer bought the same Punk more than once
dupes2 = wash_check[wash_check.duplicated(['To', 'ID'])]
suss_ids2 = dupes2['ID'].tolist()
all_suss = suss_ids + suss_ids2
print('The number of transactions we want to examine further is:')
print(len(all_suss))

which gives us:

The number of transactions we want to examine further is:
135

135 transactions that meet the criteria, involving 74 punks. There were 7,554 transactions in this dataset, so roughly 1.8% of the transactions merit further analysis. Keep this number in mind for later on.
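To see how the duplicate check behaves, here is a minimal sketch on toy data built from the hypothetical Buyer A/B/C example. The rows, the extra Punk 1234, and the variable names are assumptions for illustration only, not real transactions.

```python
import pandas as pd

# Toy data based on the hypothetical Buyer A/B/C example (not real transactions).
toy = pd.DataFrame({
    "From": ["Person 1", "Buyer A", "Buyer B", "Buyer C", "Buyer A", "Seller X"],
    "To":   ["Buyer A", "Buyer B", "Buyer C", "Buyer A", "New Buyer", "Buyer Y"],
    "ID":   [6276, 6276, 6276, 6276, 6276, 1234],
})

# duplicated() marks the second and later rows of each repeated (account, ID) pair.
repeat_sellers = toy[toy.duplicated(["From", "ID"])]
repeat_buyers = toy[toy.duplicated(["To", "ID"])]

flagged = repeat_sellers["ID"].tolist() + repeat_buyers["ID"].tolist()
print(flagged)        # one entry per flagged row, so IDs can repeat
print(set(flagged))   # unique Punk IDs worth a closer look
```

Because the combined list holds one entry per flagged row, the same Punk ID can appear several times, which is consistent with the count of flagged transactions being larger than the number of Punks involved.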

Now that we have a list of Punk IDs with odd transactions, let’s create a new dataframe that shows all sales transactions associated with those IDs.

boolean_check = df_punks.ID.isin(all_suss)
df_p = df_punks[boolean_check]
df_p.sort_values(by=['ID', 'Txn', 'Crypto']).head()

Looking through the Punks transactions, it seems some names/wallet addresses show up often, just as was noted during exploratory data analysis. Let’s get a list of all accounts showing in the From/To fields of the specified transactions and then filter it to accounts that show up more than 10 times.

from collections import Counter

# Count how often each account appears as either seller or buyer
accounts = df_p['From'].to_list() + df_p['To'].to_list()
most_common = Counter(accounts).most_common()
most_common = [x for x in most_common if x[1] > 10]

gives us

[('Pranksy', 165), ('0x5aaeb9', 32), ('Hemba', 26), ('MrNFT', 24), ('0x00d7c9', 21), ('Carlini8', 20), ('natealex', 18), ('Dude_Nak…', 18), ('0x11c9a7', 17), ('evkort', 16), ('ross_VRO…', 14), ('Goop', 14), ('NeonNFT', 12), ('Zieg', 11)]

Let’s take a closer look at a few of the transactions that involve the top account listed.

NOTE: After a Google search, it seems Pranksy is an active NFT collector and early adopter of many NFTs. Analyzing this transaction history is not meant to be an accusation of Pranksy or any other accounts mentioned in the analysis.

dataframe showing sales transaction for CryptoPunk 3260

Let’s look at Punk 3260’s transaction history, step by step:

a. Starting at the top of the table, you can see that Hemba was Punk 3260’s first owner. On 9/22/20, Pranksy purchased 3260 for 1.8 Eth (‘Crypto’), which was equivalent to $613.

b. Two days later on 9/24, Pranksy sold 3260 to 0x5aaeb9 for 2.19 Eth or $757.

c. On that same day, 9/24, 0x5aaeb9 sold 3260 to 0xef764b for 2.29 Eth/$792.

d. Four days later, on 9/28/20, 0xef764b sold 3260 back to Pranksy for 3.99 Eth, or $1427.

e. Later that day, Pranksy sold 3260 to 0x13816f for 4.79 Eth, or $1714.
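The steps above can be summarized in a short sketch that rebuilds the five sales by hand and checks which accounts bought the Punk more than once. The tuple layout and variable names are assumptions for illustration; the figures are the ones listed in steps a through e.

```python
from collections import Counter

# Punk 3260 sales as described in steps a-e: (date, seller, buyer, ETH, USD).
sales_3260 = [
    ("2020-09-22", "Hemba",    "Pranksy",  1.80,  613),
    ("2020-09-24", "Pranksy",  "0x5aaeb9", 2.19,  757),
    ("2020-09-24", "0x5aaeb9", "0xef764b", 2.29,  792),
    ("2020-09-28", "0xef764b", "Pranksy",  3.99, 1427),
    ("2020-09-28", "Pranksy",  "0x13816f", 4.79, 1714),
]

# Accounts that bought this Punk more than once.
buy_counts = Counter(buyer for _, _, buyer, _, _ in sales_3260)
repeat_buyers = [acct for acct, n in buy_counts.items() if n > 1]

# Price change between the first and last sale in the table.
first_usd, last_usd = sales_3260[0][4], sales_3260[-1][4]
pct_change = (last_usd - first_usd) / first_usd * 100

print(repeat_buyers)
print(f"USD price change over {len(sales_3260)} sales: {pct_change:.0f}%")
```

A repeat buyer combined with a steep price increase over a few days is the pattern the ‘human in the loop’ review is looking for, though by itself it proves nothing.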

See additional examples below, both of which also include sales between Pranksy and 0x5aaeb9.

Dataframe showing sales transaction for CryptoPunk 2880
dataframe showing sales transaction for CryptoPunk 6755

To be fair, while Pranksy was a part of several of the transactions we are analyzing, there were also transactions that don’t register as anomalous, like this one:

dataframe showing sales transaction for CryptoPunk 3379

There are also transactions which met the ‘human in the loop’ standards for further review that involve other accounts, such as the one below:

dataframe showing sales transaction of CryptoPunk 1845

One transaction not shown in the above data (because it was not a sale) is when the Punk was “wrapped” between the first and second rows, which is why the ‘To’ and ‘From’ fields don’t match between those rows. This is also one of the few cases where the Punk was sold for $500 LESS than it had been previously. In fact, a couple of the transactions for this Punk lose value, which is an oddity in itself.

The strangest Punk buying and selling behavior that I saw when exploring the datasets, though, was on Punk 6276 from the 2021 dataset. Despite a previous sale of $40k, this Punk has been repeatedly sold and transferred…for $0. The transactions are a little jumbled due to multiple transactions on the same date, but the long list of transfers and sales back and forth to the same addresses for $0 is definitely strange:

dataframe of sales transactions for CryptoPunk 6276

Several examples fit our criteria of odd transactions that might suggest wash trading is occurring when using a ‘human in the loop’ approach to labeling.

Next, let’s train an unsupervised machine learning model to identify anomalous transactions.

The Isolation Forest Approach

The Isolation Forest algorithm takes a different approach to finding anomalies. Instead of building a profile of what a ‘normal’ transaction looks like and classifying anything outside that profile as an outlier, it tries to isolate each data point directly by repeatedly splitting the data at random. Outliers tend to be separated from the rest after only a few splits, while normal points take many more, which results in a faster model that uses less memory.

Because we do not have any labeled data or concrete examples of fraudulent transactions, we need an unsupervised machine learning method.
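To make the isolation idea concrete, here is a minimal one-dimensional sketch in plain Python. It is not how PyOD or any library implements Isolation Forest (real implementations build many trees over random features and subsamples and convert path lengths into scores); the function names and the price values are assumptions for illustration.

```python
import random

def isolation_depth(point, values, rng, max_depth=50):
    """Count random splits needed until `point` is alone in its partition."""
    depth = 0
    while len(values) > 1 and depth < max_depth:
        lo, hi = min(values), max(values)
        if lo == hi:            # remaining values are identical; cannot split
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the point.
        if point < split:
            values = [v for v in values if v < split]
        else:
            values = [v for v in values if v >= split]
        depth += 1
    return depth

def average_depth(point, values, trials=200, seed=0):
    """Average the isolation depth over many random runs."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, values, rng) for _ in range(trials)) / trials

# Hypothetical prices: a tight cluster plus one extreme value.
prices = [1.8, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 40.0]

print("typical point:", average_depth(2.3, prices))
print("extreme point:", average_depth(40.0, prices))
```

The extreme value is usually cut off by the very first split, while a point inside the cluster needs several more, and that difference in depth is what the algorithm turns into an anomaly score.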

I will be using the PyOD toolkit to access the IForest model. My thanks to JustintoData for their tutorial using IForest on financial transactions!

Two additional preprocessing steps we need to do before training the IForest model:

  1. Converting the dataframe’s date field into the index.

df_punks = df_punks.set_index('Txn')
df_punks = df_punks.sort_index()

  2. Creating a count of transactions within a 5 day window per Punk.

from datetime import timedelta

df_punks['count_5days'] = df_punks.groupby('ID')['USD'].transform(lambda s: s.rolling(timedelta(days=5)).count())
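A small sketch on toy data shows what the rolling count produces. The two Punk IDs, dates, and prices below are made up for illustration, and the window is written as the offset string "5D", which should be equivalent to the five-day timedelta.

```python
import pandas as pd

# Toy sales for two hypothetical Punks, indexed by transaction date.
toy = pd.DataFrame(
    {
        "ID": [1, 1, 1, 1, 2, 2],
        "USD": [100, 120, 150, 400, 90, 95],
    },
    index=pd.to_datetime(
        ["2020-09-22", "2020-09-24", "2020-09-25", "2020-10-20",
         "2020-09-23", "2020-10-15"]
    ),
).sort_index()

# For each Punk, count its sales in the 5 days up to and including each row.
toy["count_5days"] = toy.groupby("ID")["USD"].transform(
    lambda s: s.rolling("5D").count()
)

print(toy)
```

Punk 1's three sales in late September push its count up to 3, while the isolated October sale drops back to 1, so a burst of trades on one Punk stands out in this feature.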

Next, we need to give the model a parameter for how many anomalous transactions we expect it to find. Because we don’t have labeled data, this parameter keeps the model to a reasonable output.

Remember earlier in the human in the loop section when I said to keep the figure of 1.8% of transactions in mind? That’s the same proportion we will use now. We’ll instruct the Isolation Forest model to expect 1.8% of the transactions to be outliers, and then we will fit the model on the 5 day count of transactions along with the amount of ether (‘Crypto’) spent on the transaction.

from pyod.models.iforest import IForest

anomaly_proportion = 0.018  # expected share of outliers (1.8%)
clf_name = 'Anomaly Detection - Isolation Forest'
clf = IForest(contamination=anomaly_proportion)
X = df_punks[['count_5days', 'Crypto']]
clf.fit(X)

We can now get the outlier labels and raw outlier scores from the model and add them to our dataframe. A label of 1 indicates the model identified the transaction as an outlier, and the higher the raw outlier score, the more anomalous the transaction was found to be.

df_punks['y_pred'] = clf.labels_
df_punks['y_scores'] = clf.decision_scores_
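The relationship between the scores, the labels, and the expected outlier proportion can be sketched in plain Python. This is an illustration of the idea only; PyOD's exact thresholding may differ in detail, and the function name and score values below are hypothetical.

```python
import math

def label_outliers(scores, contamination):
    """Label the highest-scoring fraction of points as outliers (1), the rest as inliers (0)."""
    n_outliers = math.ceil(len(scores) * contamination)
    # Indices of the n_outliers largest scores.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_outliers])
    return [1 if i in flagged else 0 for i in range(len(scores))]

# Hypothetical anomaly scores for ten transactions; higher means more anomalous.
scores = [0.02, -0.10, 0.31, -0.05, 0.01, -0.12, 0.08, -0.03, 0.00, -0.07]
labels = label_outliers(scores, contamination=0.2)
print(labels)
```

With a proportion of 0.018 and 7,554 transactions, the same logic would flag roughly 136 rows, which is why the parameter effectively decides how many transactions come back labeled as outliers.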

You may recognize a few of the accounts on this list:

dataframe showing anomalous transactions

Next, we’ll create a visualization of the inliers and outliers of our Isolation Forest model, along with the boundaries the model established between normal and anomalous transactions.

Isolation Forest visualization

Note that the model was trained on the price per transaction along with the count of transactions per 5 day window, which is reflected in how the model identified outliers. In our human in the loop approach, I paid little attention to the price of a transaction simply because the volatility of the NFT market and the lack of established valuations means determining a “normal” amount to pay for an NFT is very difficult to do.

The Isolation Forest approach also identifies specific transactions, whereas our human in the loop approach looked at a Punk’s entire sales history. Still, many of the transactions identified by the IForest algorithm involved Punks that were also identified by our human in the loop approach.

Now that we’ve scienced the data and used a couple of methods for identifying outliers and possible wash trading in the CryptoPunks transactions — what did I learn from all this?

Takeaways from my CryptoDataScience adventure:

  1. The NFT/web3/blockchain space is still very new. Even during the few weeks while I was working on this project, major news stories occurred — the $500 million sale of a CryptoPunk that never was, the newly passed bill that has tax and reporting implications for NFTs, and the continued interest in NFTs by celebrities, just to name a few.
  2. There’s amazing potential, especially (in my opinion) for data science and analytics on blockchain data. The data is always stored on chain, but that doesn’t make it easy to access or clean. Those who find ways to add insights will likely have plenty of opportunities to do so, and get paid while doing it.
  3. Caveat emptor. The deregulated, decentralized, pseudonymous nature of blockchain transactions, including NFTs (at least for now), means high volatility and high risk. It is very much a gamble, and chances are there is fraudulent activity occurring. If you plan to wade in, keep that in mind.
  4. The hype cycle for web3 and blockchain is in full effect. Even though I just mentioned the potential in this space myself, it is also worth noting that almost everyone* who advocates for this tech has a financial incentive to do so because they own eth or bitcoin or other altcoins and benefit when others buy in and increase scarcity and hype. This makes honest valuations tough to come by. It doesn’t mean people don’t honestly believe in the tech, but it does mean they likely benefit financially when you buy into it too. (* as of writing, I do not own a crypto wallet or any NFTs)
  5. Outcomes from data projects are often ambiguous, and crypto data projects are no exception. It would be amazing if every data science project resulted in a crystal clear answer for how to easily implement or identify needed changes. Despite their outlier status and suspicious sales patterns, I can’t immediately identify clearly fraudulent behavior in CryptoPunks transactions. And there are experts in the space who don’t see evidence of wash trading in the NFT market, though there are also some who do. So was it worth it? Yes, because…
  6. The real treasure is the lessons learned along the way. Okay, a wallet full of CryptoPunks would also qualify as treasure, but given that I don’t own any of those, I’ll take the lessons learned instead. This was a side project for me, unrelated to my job, but exploring it gave me the chance to learn a lot about previously unknown (to me) tech, and even join a DAO myself: https://www.linkedin.com/company/charliedao/about/. If you work in data and are interested in this space, I encourage you to jump in and explore!
  7. Possible next steps could include applying the same two methods described in this post to the other CryptoPunks datasets from 2021. It would also be interesting to go beyond CryptoPunks and explore other NFTs, or explore other blockchain data related to DeFi (decentralized finance) or DAOs (decentralized autonomous organizations).
  8. Data science rules. The end.

You can find all code used (plus bonus odd transactions not shared here for the sake of space) at this GitHub link. My thanks to Omni Analytics and Bojan Tunguz for the datasets I used for this project.

Revisit Part 1 here: https://jeanna-schoonmaker.medium.com/crypto-data-science-comparing-five-options-for-accessing-blockchain-data-619f5f4e2f70

Or Part 2 here: https://jeanna-schoonmaker.medium.com/cryptodatascience-part-2-exploratory-data-analysis-on-cryptopunks-transactions-910e6dcb14bd
