Crypto Data Science: Comparing Five Options for Accessing Blockchain Data

Jeanna Schoonmaker
8 min readOct 20, 2021

--

Part 1 in a series of posts about data science and machine learning on blockchain data

If you are new to learning about blockchain, I recommend reading an intro like this: https://www.geeksforgeeks.org/blockchain-technology-introduction/ My post assumes the reader has a basic understanding of how blockchains function.

gm!

(that’s ‘good morning’ in crypto circles)

On any given day, I see dozens of tweets about blockchain, NFTs, web3, crypto, etc. Many will address the potential, the tech, and the hype. Others will point out the scams, the lack of regulations, and well, the hype. Like a lot of hot topics right now, the polarization of opinions seems stark. You’re either all in on the crypto train, or you’re dismissing it as a Star Registry-esque grift.

The Why

So why explore blockchain data?

As exciting as hype cycles are, and as promising as tech twitter makes blockchain seem, W. Edward Deming said:

“In God we trust.

All others must bring data.”

That goes for both the pro and anti crypto camps, in my opinion. Reading about emerging tech whose selling points include data that is readily available and immutable was enough to get me interested, but the critics of crypto may be right. Let’s dig into the data and see what we find. We’ll start with a well known NFT, or non-fungible token, example — transactions for CryptoPunks, which are on the Ethereum blockchain.

Please note, I am new to working with crypto data, so I am not representing any of my research or recommendations here as an expert opinion. Feel free to add a comment about options I missed, things that could have been done better, or new things to try!

The Where

With ease of access to blockchain data being a major talking point of the blockchain system, I was surprised to find out that getting a csv of a subset of transactions from a blockchain was not as easy as I expected.*

It is an option to install web3.py and query a blockchain directly, but as of 10/18/21 when writing this, bitinfocharts.com shows that the Ethereum blockchain size is over 345 GB. It seemed unnecessary to download the entire Ethereum blockchain when I only wanted to look at a small subset of transactions that have occurred on it.

Instead, I explored 5 services that provide options for querying Ethereum data. They are:

  1. Etherscan.io
  2. Thegraph.com
  3. Dune Analytics
  4. Bitquery.io
  5. Flipside Crypto

My criteria for using a service to query the data are pretty simple and encompass what I feel most data scientists would find reasonable for a personal project:

  • No crypto wallet required. Much like the annoyance of needing to enter a credit card number when signing up for a free trial, I had no interest in going through the steps to set up a wallet. Perhaps someday having a crypto wallet (or multiple wallets) will be as ubiquitous as having an email address. For now, though, I simply wanted to access data.
  • It should be free. Writing to a blockchain has a cost, but reading/viewing a blockchain is free, and since I am new to learning about this data space, I didn’t want to rack up a lot of charges without even knowing if I’d get the data I wanted.
  • It should use a language most data scientists know, like SQL.
  • It should output the data into either a csv or an API for further use in machine learning tasks.
  • While not required, a ‘playground’ to test and see the results from queries is super useful, especially when you’re new and don’t know what you’re doing (hi!). Connecting to an API only to find out your dataset is empty is beyond frustrating.

So let’s review our options and how they stack up to the criteria.

Etherscan.io is a powerful search platform for crypto transactions that I used several times while checking info for this post, but in order to access NFT-specific transactions via their API, you have to sign up for a Pro account, which has a cost.

Screenshot of Etherscan.io’s CryptoPunks page (image by author)

Bitquery.io uses a graphql IDE, which is a very intriguing option that I’d like to learn more about in future projects — but for the purposes of this project, there were enough new pieces in play without needing to adopt a new language syntax as well.

Screenshot of Bitquery.io’s graphql IDE (image by author)

Dune.xyz is a terrific analytics platform for blockchain data, with many publicly shared queries by people who write much better SQL than I do. The output of Dune’s queries can be graphs or dashboards, but only paid users can output to csvs, and no APIs are offered.

Screenshot of Dune.xyz dashboard by user RantumBits (image by author)

Flipside Crypto and Thegraph.com were both worth exploring further for querying blockchain data.

Thegraph.com does require a crypto wallet if you are accessing their Subgraphs service, but if you are accessing legacy data via their Hosted Service, you do not need to connect a wallet. According to their docs, while the Hosted Service will eventually no longer be supported, it will still be around for a while. Thegraph actually uses graphql like bitquery.io, but their example queries made their service accessible despite that.

Flipside Crypto requires logging in with an email address and password, and is set up to provide bounties (usually in the form of crypto tokens) for doing analytics tasks on blockchain data. However, I’m most impressed by how they allow you to either download a csv of query results OR access the results data via an API.

The What

What will we be using Flipside Crypto and Thegraph.com to explore? As mentioned earlier, we will be querying the Ethereum blockchain for CryptoPunks sales transactions.

Cryptopunk image of female punk with brown hair and glasses
CryptoPunk 8771. I think there’s a bit of a resemblance. (img from CryptoPunks: Details for Punk #8771 (larvalabs.com))

CryptoPunks are 24x24 pixel portraits that were originally offered by Larva Labs for free in June 2017. The only cost associated with them was the gas fee needed to write the transaction to the blockchain and ‘claim’ one of the 10,000 Punks. At the time, the gas fee was as low as 11 cents.

Following their inauspicious beginnings, CryptoPunks exploded in popularity in late 2020 and 2021. Prices for Punks now range from 30k to as much as 10 million dollars. (https://techcrunch.com/2021/04/08/the-cult-of-cryptopunks)

From my reading, CryptoPunks seem to be the OG NFTs that paved the way for CryptoKitties, BoredApeYachtClubs, and many other NFTs that have followed.

The How

Now that we know what we’re looking at, why, and where to find it, how do we access the data?

Flipside Crypto

  1. Log in to your account on app.flipsidecrypto.com, then click on the ‘New’ button in the upper right hand corner of your screen and then click on ‘New Query.’
  2. On the left hand side of your screen, there is a dropdown menu that lets you choose from the blockchains available to query on Flipside.
  3. Click on Ethereum to see the tables you can query within that blockchain.
  4. Click on the small table avi next to one of the table names to get a list of columns within that table.
  5. Type your SQL query in the middle window.
  6. Click Run, and get immediate feedback on whether your query worked or not!

To begin, I recommend trying an easy query just to get a better idea of what the fields are and what you might want to filter by. Here’s an example of when I selected 3 columns (contract_name, contract_address, and block_timestamp) from the ethereum.events_emitted table, limiting my results to 10:

Query results from the Ethereum blockchain by Flipside Crypto (img by author)

Contract_address is a reference to the 42 character hexadecimal address location of the actual token contract that manages the logic for the tokens. For Cryptopunks, the contract address is ‘0xb47e3cd837dDF8e4c57F05d70Ab865de6e193BBB’. That address will not change and will be referenced every time a CryptoPunks transaction occurs. Contract addresses can be found by using Etherscan.io to look up an NFT or token by name, or by googling the token.

Flipside Crypto note — while contract addresses are not case sensitive elsewhere, on Flipside all contract addresses are in lower case. Make sure you either modify any contract addresses you copy/paste in or wrap them in the lower() SQL command to save yourself from empty result dataset frustration!

Now let’s actually query for CryptoPunks sales transactions. ALL transactions (including bids offered, bids withdrawn, transfers, etc) get recorded on the blockchain, but I am only interested in purchases, so I filtered by the event_name ‘PunkBought.’

There are much better ways to write the SQL query so that you pull in less unnecessary data and return only what you need in your final dataset, but since this is an exploratory project, I selected everything from these transactions and will later decide what to keep and what to drop.

My SQL query for Flipside Crypto:

select *
from ethereum.events_emitted
WHERE contract_address = lower('0xb47e3cd837dDF8e4c57F05d70Ab865de6e193BBB')
AND event_name = 'PunkBought'
AND TX_SUCCEEDED = 'TRUE'
LIMIT 100000

When my results displayed, I clicked on the ‘API’ button, verified that it contained the data I expected, then copied and pasted the API’s url to use in my code, which we’ll see after the next section.

Thegraph.com

Using this url: https://thegraph.com/hosted-service/subgraph/itsjerryokolo/cryptopunks?query=Example%20query, we can see an example of how to query the specific blockchain transactions that are related to CryptoPunks.

TheGraph.com’s Hosted Service Query Playground (img by author)
  1. On the right hand panel, there is a list of the schema available in this subgraph.
  2. Click on the ‘Sale’ schema to see the columns available from that table.
  3. Using the example query in the left hand panel as a guide, rewrite the query to look at all of the available information in the Sale schema (see query below).
  4. Then hit the purple ‘play’ button over the middle panel, and your results will show up in that middle panel. Note — Thegraph limits the results returned to 1000, so this dataset will be incomplete and we will either need to download multiple versions of it or figure out another way to access more of the data from Thegraph.com.
{sales(first: 1000){
id
to {
id
}
amount
from {
id
}
nft {
id
}
timestamp
type
contract {
id
}}}

Despite the fact that we are pulling the same information from the same blockchain, you can see there are much different approaches to accessing the data depending on which service you use.

Now that we have working queries for both services, let’s connect to them using a Google Colab notebook and pull the data into dataframes for exploratory analysis and eventually, anomaly detection. (spoiler alert!)

Using Flipside Crypto:

Querying the Ethereum blockchain for CryptoPunks sales data using Flipside Crypto

Using Thegraph.com:

Querying the Ethereum blockchain for CryptoPunks sales data using Thegraph.com

That’s it for Part 1. Part 2 will explore the CryptoPunks transaction data itself. See you then!

gn.

* NOTE: There are some already-existing csvs of Cryptopunks data available, including one that was added while I was researching this post: https://www.kaggle.com/tunguz/cryptopunks

You can also find cryptopunks transaction data through 12/30/2020 in this dataset provided by Omni Analytics for their Learning Data Science on the Ethereum Blockchain: https://github.com/Omni-Analytics-Group/eth-data-science-course

Despite the immutability of blocks already written to the chain, new blocks are added constantly, so be sure to check how often datasets are updated if the recency of your blockchain data matters.

--

--

Jeanna Schoonmaker
Jeanna Schoonmaker

Written by Jeanna Schoonmaker

Data scientist. Machine Learning. Python. Forever in search of another dataset and another set of clamps.

No responses yet