Working with Data Feeds

  • Published: 2018-09-24 05:13:05

As companies become more data-driven, they often end up consuming a growing amount of data from both internal sources and third parties. Rarely have I seen firms try to centralise where datasets are stored. Instead, data is often copied onto infrastructure for individual teams and departments. This lets teams work without disrupting others and avoid being disrupted by other teams' work.

Data sources are often refreshed in batches, anywhere from every few minutes to monthly. The file formats, compression schemes and encryption systems used to distribute these datasets can vary greatly.

There is no single tool I use for collecting and analysing new datasets. I do my best to pick tools that help me avoid writing a lot of bespoke code while taking advantage of the hardware available on whichever system I'm using.

In this guide I'll walk through an exercise in consuming, transforming and analysing a data dump of the English-language version of Wikipedia.

Installing Prerequisites

The following commands were run on a fresh install of Ubuntu 16.04.2 LTS on a system with 16 GB of RAM.

I'll first add the repository information needed to install ClickHouse, an open source analytics database, from Debian packages.

$ sudo apt-key adv \
    --keyserver hkp://keyserver.ubuntu.com:80 \
    --recv E0C56BD4
$ echo "deb http://repo.yandex.ru/clickhouse/deb/stable/ main/" | \
    sudo tee /etc/apt/sources.list.d/clickhouse.list
$ sudo apt-get update

I'll then install ClickHouse, Python, a JSON manipulation tool and a few decompression utilities (some of which can take advantage of multiple CPU cores).

$ sudo apt-get install \
    clickhouse-client \
    clickhouse-server-common \
    jq \
    lbzip2 \
    pigz \
    python-pip \
    python-virtualenv \
    software-properties-common \
    unzip

I'll be using a number of Java-based utilities in this exercise so I'll install Oracle's Java 8 distribution.

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt update
$ sudo apt install oracle-java8-installer

I'll create a Python virtual environment and install a couple data transformation and analysis utilities.

$ virtualenv ~/.feed
$ source ~/.feed/bin/activate
$ pip install \
    csvkit \
    pandas \
    xmltodict

Finally, I'll download and unpack a few libraries for the Hadoop-centric file formats I'll be working with.

$ wget -c https://repo1.maven.org/maven2/org/apache/orc/orc-tools/1.5.2/orc-tools-1.5.2-uber.jar
$ wget -c https://www-eu.apache.org/dist/avro/avro-1.8.2/java/avro-tools-1.8.2.jar
$ wget -c https://github.com/jairamc/csv-avro-converter/releases/download/v0.1-alpha/csv-avro-converter-cli-1.0.zip
$ unzip csv-avro-converter-cli-1.0.zip

Downloading Wikipedia

Wikipedia publishes dumps of page contents and metadata for its English-language site every couple of weeks. As of this writing the dump is made up of 55 bzip2-compressed XML files. Each file is around 300 MB compressed and 1.5 GB uncompressed.

The following will download the 55 compressed XML files, creating a little over 14 GB of compressed content. This command will pull the URLs from a manifest file and transform them so that they are complete URLs. The URLs are piped through xargs so that two files will be downloaded in parallel at any one time.

$ wget -qO- https://dumps.wikimedia.org/enwiki/20180920/dumpstatus.json \
    | jq '.jobs.articlesdump.files[].url' \
    | sed 's/\"//g' \
    | sed 's/^/https:\/\/dumps.wikimedia.org/' \
    | xargs -n 1 \
            -P 2 \
            wget -c

Converting XML into JSON

XML files can be challenging to work with. Some libraries expect to work with files as a whole and can quickly exhaust system memory with simple operations. Other libraries can be unintuitive to work with and require a great deal of cognitive effort for simple tasks. For these reasons I try and convert XML files into JSON as soon as I receive them.

I'll stream records out of each bzip2-compressed XML file and dump each record onto its own line using JSON serialisation. The xmltodict library will first convert the XML into Python dictionaries and stream them to stdout in marshalled form. The Python script below will then read those dictionaries and convert each one into a JSON string.

$ vi dump_pages.py
import json
import marshal
import sys

while True:
    try:
        # xmltodict's streaming mode marshals (path, record) pairs to stdout;
        # keep just the record.
        _, page = marshal.load(sys.stdin)
        print json.dumps(page)
    except EOFError:
        break

Below is a bash script that will find each of the bzip2-compressed XML files. Each file will be decompressed using lbunzip2, a multi-core-optimised bzip2 utility. The decompressed XML will be converted into Python dictionaries with the xmltodict library. The script above will then convert each record into a single-line JSON string. Finally, pigz, a multi-core-optimised drop-in replacement for gzip, will compress the output.

$ for FILE in *.bz2; do
      echo $FILE
      lbunzip2 --stdout $FILE \
        | python xmltodict.py 2 \
        | python dump_pages.py \
        | pigz > $FILE.json.gz
  done

The content grew by around 33% once it was transformed from bzip2-compressed XML into line-delimited, gzip-compressed JSON.
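If you want to verify the size difference yourself, du can report totals for each set of files. This is a quick check of my own rather than part of the original workflow.

$ du -ch *.bz2 | tail -n1
$ du -ch *.json.gz | tail -n1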

Converting JSON into CSV

Now that I have the data in compressed JSON format I want to be able to flatten out the nested elements into their own columns. For reference, this is a truncated and pretty-printed view of what one JSON record currently looks like:

{    "id": "41067206",    "ns": "0",    "revision": {        "comment": "###",        "contributor": {            "id": "244263",            "username": "Czar"        },        "format": "text/x-wiki",        "id": "817456361",        "model": "wikitext",        "parentid": "788230430",        "sha1": "2egdyphfnavxhsjrxvhjpfvb6ndh0wi",        "text": {            "#text": "###",            "@xml:space": "preserve"        },        "timestamp": "2017-12-28T14:11:08Z"    },    "title": "###"}

I'm looking to transform the above into the following columns. Note that I'm using an underscore to denote a nested child node; a small sketch of this flattening follows the list.

id
ns
redirect_@title
revision_comment
revision_contributor_id
revision_contributor_username
revision_format
revision_id
revision_model
revision_parentid
revision_sha1
revision_text_#text
revision_text_@xml:space
revision_timestamp
title
revision_contributor_ip
revision_minor
restrictions
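Here's a minimal sketch of that flattening, using the same nested_to_record helper as the conversion script below, applied to a hypothetical cut-down record.

from pandas.io.json.normalize import nested_to_record

# Hypothetical, cut-down record used purely for illustration.
record = {'id': '41067206',
          'revision': {'contributor': {'id': '244263',
                                       'username': 'Czar'},
                       'timestamp': '2017-12-28T14:11:08Z'}}

print(nested_to_record(record, sep='_'))
# Keys become: id, revision_contributor_id,
# revision_contributor_username, revision_timestamp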

The following Python script will iterate through each of the compressed JSON files, flatten out their nested data structures and save the data as a gzip-compressed CSV file. The only dataset-specific check I have in this code is making sure there is a title element in each record. This distinguishes the page records from other types held in the dump.

$ python
from   glob import glob
import gzip
import json

import pandas as pd
from   pandas.io.json.normalize import nested_to_record


for filename in glob('*.json.gz'):
    df = pd.DataFrame()

    for line in gzip.open(filename, 'rb'):
        page = nested_to_record(json.loads(line),
                                sep='_')

        if 'title' in page.keys():
            df = df.append(page, ignore_index=True)

    df.to_csv('%s.csv.gz' % filename.split('.')[0],
              index=False,
              encoding='utf-8',
              compression='gzip')

The gzip-compressed CSV files are now a little over double the size of the bzip2-compressed XML files.

Note that the columns in each CSV file may not align with the columns in other CSV files if the datasets vary in structure. You can examine the consistency of the headers across the gzip-compressed CSV files with the following:

$ for FILE in *.csv.gz; do
      echo $FILE
      pigz -d -c $FILE | head -n1
  done
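If there are too many files to compare by eye, a small variation on the same loop (my own addition) counts how many files share each distinct header line:

$ for FILE in *.csv.gz; do
      pigz -d -c $FILE | head -n1
  done | sort | uniq -c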

Cherry-Picking Columns of Interest

I want to cut down on the amount of data I'll be working with in this exercise. To start, I'll examine the first thousand lines of one of the gzip-compressed CSV files and see what field names and data types it contains.

Note I've set the maximum field size csvstat will allow to eight million bytes so that none of the lengthy page text fields cause an issue. I've also had to declare the delimiter because the 1,000th line cuts the last record off mid-way (records can span multiple lines), which throws off csvstat's delimiter auto-detection.

$ pigz -d -c enwiki-20180920-pages-articles1.csv.gz \
    | head -n1000 \
    | csvstat --maxfieldsize 8000000 \
              --delimiter ',' \
              --type
 1. id: Number
 2. ns: Boolean
 3. redirect_@title: Text
 4. revision_comment: Text
 5. revision_contributor_id: Number
 6. revision_contributor_username: Text
 7. revision_format: Text
 8. revision_id: Number
 9. revision_model: Text
10. revision_parentid: Number
11. revision_sha1: Text
12. revision_text_#text: Text
13. revision_text_@xml:space: Text
14. revision_timestamp: DateTime
15. title: Text
16. revision_contributor_ip: Text
17. revision_minor: Boolean
18. restrictions: Boolean

I'll create a new gzip-compressed CSV file made up of five columns of interest.

$ pigz -d -c enwiki-20180920-pages-articles1.csv.gz \
    | csvcut --maxfieldsize 8000000 \
             --columns 1,5,6,14,15 \
    | pigz > metadata.csv.gz

I'll now examine the statistics of each column in the newly-created gzip-compressed CSV file. Note that I can examine the compressed file directly; there's no need to decompress it ahead of time.

$ csvstat metadata.csv.gz
1. "id"      Type of data:          Number      Contains null values:  False      Unique values:         19833      Smallest value:        10      Largest value:         30,302      Sum:                   305,772,174      Mean:                  15,417.344      Median:                15,524      StDev:                 8,567.878      Most common values:    10 (1x)                             12 (1x)                             13 (1x)                             14 (1x)                             15 (1x)2. "revision_contributor_id"      Type of data:          Number      Contains null values:  True (excluded from calculations)      Unique values:         5197      Smallest value:        0      Largest value:         34,699,348      Sum:                   204,280,488,782      Mean:                  11,801,299.179      Median:                9,612,106      StDev:                 11,423,342.666      Most common values:    None (2523x)                             9,784,415 (1248x)                             27,015,025 (462x)                             212,624 (412x)                             194,203 (390x)3. "revision_contributor_username"      Type of data:          Text      Contains null values:  True (excluded from calculations)      Unique values:         5198      Longest value:         51 characters      Most common values:    None (2523x)                             Tom.Reding (1248x)                             InternetArchiveBot (462x)                             MZMcBride (412x)                             Graham87 (390x)4. "revision_timestamp"      Type of data:          DateTime      Contains null values:  False      Unique values:         19788      Smallest value:        2002-02-25 15:43:11+00:00      Largest value:         2018-09-20 11:47:36+00:00      Most common values:    2002-02-25 15:51:15+00:00 (11x)                             2002-02-25 15:43:11+00:00 (8x)                             2018-09-16 18:47:23+00:00 (2x)                             2018-09-20 06:53:14+00:00 (2x)                             2018-09-13 20:16:20+00:00 (2x)5. "title"      Type of data:          Text      Contains null values:  True (excluded from calculations)      Unique values:         19833      Longest value:         97 characters      Most common values:    AccessibleComputing (1x)                             Anarchism (1x)                             AfghanistanHistory (1x)                             AfghanistanGeography (1x)                             AfghanistanPeople (1x)

The above is incredibly handy for building data fluency. I have enough information to create a fairly granular schema in a structured data store.

Note, the values printed out have been transformed to be more consistent and readable. Timestamps share the same format and numbers are printed using commas.

The following will print the first ten records of the new CSV file. Note the data has been formatted for easier reading here as well.

$ pigz -d -c metadata.csv.gz \
    | head -n10 \
    | csvlook
| id | revision_contributor_id | revision_contributor_username |        revision_timestamp | title                          |
| -- | ----------------------- | ----------------------------- | ------------------------- | ------------------------------ |
| 10 |              23,257,138 | Godsy                         | 2018-08-14 06:47:24+00:00 | AccessibleComputing            |
| 12 |                         |                               | 2018-09-19 12:07:26+00:00 | Anarchism                      |
| 13 |               9,784,415 | Tom.Reding                    | 2017-06-05 04:18:18+00:00 | AfghanistanHistory             |
| 14 |               9,784,415 | Tom.Reding                    | 2017-06-05 04:18:23+00:00 | AfghanistanGeography           |
| 15 |               9,784,415 | Tom.Reding                    | 2017-06-05 04:19:42+00:00 | AfghanistanPeople              |
| 18 |               9,784,415 | Tom.Reding                    | 2017-06-05 04:19:45+00:00 | AfghanistanCommunications      |
| 19 |               9,784,415 | Tom.Reding                    | 2017-06-04 21:42:11+00:00 | AfghanistanTransportations     |
| 20 |               9,784,415 | Tom.Reding                    | 2017-06-04 21:43:11+00:00 | AfghanistanMilitary            |
| 21 |               9,784,415 | Tom.Reding                    | 2017-06-04 21:43:14+00:00 | AfghanistanTransnationalIssues |

If you want to see the data in tabular form but without those transformations, run the command again with the --no-inference flag.

$ pigz -d -c metadata.csv.gz \
    | head -n10 \
    | csvlook --no-inference
| id | revision_contributor_id | revision_contributor_username | revision_timestamp   | title                          |
| -- | ----------------------- | ----------------------------- | -------------------- | ------------------------------ |
| 10 | 23257138                | Godsy                         | 2018-08-14T06:47:24Z | AccessibleComputing            |
| 12 |                         |                               | 2018-09-19T12:07:26Z | Anarchism                      |
| 13 | 9784415                 | Tom.Reding                    | 2017-06-05T04:18:18Z | AfghanistanHistory             |
| 14 | 9784415                 | Tom.Reding                    | 2017-06-05T04:18:23Z | AfghanistanGeography           |
| 15 | 9784415                 | Tom.Reding                    | 2017-06-05T04:19:42Z | AfghanistanPeople              |
| 18 | 9784415                 | Tom.Reding                    | 2017-06-05T04:19:45Z | AfghanistanCommunications      |
| 19 | 9784415                 | Tom.Reding                    | 2017-06-04T21:42:11Z | AfghanistanTransportations     |
| 20 | 9784415                 | Tom.Reding                    | 2017-06-04T21:43:11Z | AfghanistanMilitary            |
| 21 | 9784415                 | Tom.Reding                    | 2017-06-04T21:43:14Z | AfghanistanTransnationalIssues |

The csvlook utility does support a --max-rows parameter as well, but it takes a lot longer to pull the first ten records from a compressed CSV file than pigz and head do.
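For reference, the slower single-command equivalent would look something like this (csvkit handles the gzip decompression itself based on the file extension):

$ csvlook --max-rows 10 metadata.csv.gz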

Converting CSV to AVRO

AVRO files store data in a self-describing form: the schema used to write the records is stored alongside the data itself. Data migrations are supported as well, with schema versioning for both reading and writing records. AVRO files can be read and worked with using a wide variety of tools in the Hadoop ecosystem as well as with various Python and Java libraries. These features make AVRO an excellent transport format for micro-batches of data.

Below I'll use a simple CSV to AVRO conversion tool. It doesn't support reading data from gzip-compressed files so I'll decompress the CSV file first.

$ pigz -d --keep metadata.csv.gz
$ ~/csv-avro-converter-cli-1.0/bin/csv-avro-converter-cli \
    --in  metadata.csv \
    --out metadata.avro

The file sizes of the uncompressed CSV and AVRO are about the same.
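A quick way to compare the two (my own addition rather than part of the original walk-through):

$ ls -lh metadata.csv metadata.avro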

The tool will perform some basic analysis on the contents of the CSV data and put together a schema for the dataset. Note, the timestamp column ended up being stored as a string rather than as any sort of smaller timestamp column type. Below is a dump of the schema the conversion tool put together.

$ java -jar ~/avro-tools-1.8.2.jar \
       getschema \
       metadata.avro
{  "type" : "record",  "name" : "csv",  "namespace" : "metadata",  "fields" : [ {    "name" : "id",    "type" : "int"  }, {    "name" : "revision_contributor_id",    "type" : [ "null", "int" ],    "default" : null  }, {    "name" : "revision_contributor_username",    "type" : [ "null", "string" ],    "default" : null  }, {    "name" : "revision_timestamp",    "type" : "string"  }, {    "name" : "title",    "type" : "string"  } ]}

Below I'll extract a record from the AVRO file. Note that the data hasn't been truncated or degraded in any way.

$ java -jar ~/avro-tools-1.8.2.jar \
       tojson \
       metadata.avro \
    | head -n1 \
    | python -mjson.tool
{    "id": 10,    "revision_contributor_id": {        "int": 23257138    },    "revision_contributor_username": {        "string": "Godsy"    },    "revision_timestamp": "2018-08-14T06:47:24Z",    "title": "AccessibleComputing"}

Converting CSV to ORC

ORC files store data in a columnar and compressed form. The project itself began in early 2013 and has been led by Hortonworks co-founder and Technical Fellow Owen O'Malley. The format is very popular in the Hadoop ecosystem and in 2014 Facebook announced they had successfully converted tens of petabytes of data in their then-300 PB data warehouse into ORC format.

Statistics are kept for each column throughout an ORC file and for each group of 10,000 rows within each column. This allows queries to skip the parts of the file that aren't relevant and does a good job of speeding up queries by narrowing down the search space.

When the ORC project started it was closely tied to Hive, but since 2016 releases no longer depend on it. Different Hadoop projects often don't share the same version of the underlying ORC library, but this year it was announced that the next release of Spark, version 2.4, will use the same version of the ORC library that Hive 3.0 uses.

Below I'll convert the compressed CSV file into an ORC file. The tool I'm using doesn't support feeding in compressed data nor can it skip the header row so I'll decompress all but the first line of data into a new CSV file.

$ pigz -d -c metadata.csv.gz \
    | tail -n +2 \
    > metadata.noheader.csv

ORC files have schemas with field names and data types. Below I'll define the schema for the five fields within the CSV file. I've kept the field names a, b, c, d and e in order to keep the command short. Longer field names are supported if you don't wish to be so ambiguous.

$ java -jar ~/orc-tools-1.5.2-uber.jar \
     convert \
     metadata.noheader.csv \
     --schema "struct<a:int,b:int,c:string,d:timestamp,e:string>" \
     --timestampformat "yyyy-MM-dd'T'HH:mm:ss'Z'" \
     --output metadata.orc

The resulting ORC file is around 25% of the size of the uncompressed CSV data.

I'll print out the properties of the resulting ORC file. Below you can see the file is compressed using zlib, has a single stripe, and that minimum, maximum and sum statistics are given for each column. The third column, which contains the revision contributor usernames, is dictionary-encoded.

$ java -jar ~/orc-tools-1.5.2-uber.jar \
     meta \
     metadata.orc
Processing data file metadata.orc [length: 341494]
Structure for metadata.orc
File Version: 0.12 with ORC_135
Rows: 19833
Compression: ZLIB
Compression size: 262144
Type: struct<a:int,b:int,c:string,d:timestamp,e:string>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 19833 hasNull: false
    Column 1: count: 19833 hasNull: false bytesOnDisk: 5511 min: 10 max: 30302 sum: 305772174
    Column 2: count: 17310 hasNull: true bytesOnDisk: 44996 min: 0 max: 34699348 sum: 204280488782
    Column 3: count: 17310 hasNull: true bytesOnDisk: 64840 min: "Serious" Sam Stone max: 칼빈500 sum: 173207
    Column 4: count: 19833 hasNull: false bytesOnDisk: 70303 min: 2002-02-25 15:43:11.0 max: 2018-09-20 11:47:36.0
    Column 5: count: 19833 hasNull: false bytesOnDisk: 154814 min: "Hello, World!" program max: ♯P sum: 312852

File Statistics:
  Column 0: count: 19833 hasNull: false
  Column 1: count: 19833 hasNull: false bytesOnDisk: 5511 min: 10 max: 30302 sum: 305772174
  Column 2: count: 17310 hasNull: true bytesOnDisk: 44996 min: 0 max: 34699348 sum: 204280488782
  Column 3: count: 17310 hasNull: true bytesOnDisk: 64840 min: "Serious" Sam Stone max: 칼빈500 sum: 173207
  Column 4: count: 19833 hasNull: false bytesOnDisk: 70303 min: 2002-02-25 15:43:11.0 max: 2018-09-20 11:47:36.0
  Column 5: count: 19833 hasNull: false bytesOnDisk: 154814 min: "Hello, World!" program max: ♯P sum: 312852

Stripes:
  Stripe: offset: 3 data: 340464 rows: 19833 tail: 144 index: 412
    Stream: column 0 section ROW_INDEX start: 3 length 21
    Stream: column 1 section ROW_INDEX start: 24 length 62
    Stream: column 2 section ROW_INDEX start: 86 length 62
    Stream: column 3 section ROW_INDEX start: 148 length 115
    Stream: column 4 section ROW_INDEX start: 263 length 62
    Stream: column 5 section ROW_INDEX start: 325 length 90
    Stream: column 1 section DATA start: 415 length 5511
    Stream: column 2 section PRESENT start: 5926 length 1686
    Stream: column 2 section DATA start: 7612 length 43310
    Stream: column 3 section PRESENT start: 50922 length 1686
    Stream: column 3 section DATA start: 52608 length 27735
    Stream: column 3 section LENGTH start: 80343 length 3229
    Stream: column 3 section DICTIONARY_DATA start: 83572 length 32190
    Stream: column 4 section DATA start: 115762 length 70288
    Stream: column 4 section SECONDARY start: 186050 length 15
    Stream: column 5 section DATA start: 186065 length 140495
    Stream: column 5 section LENGTH start: 326560 length 14319
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2
    Encoding column 3: DICTIONARY_V2[5197]
    Encoding column 4: DIRECT_V2
    Encoding column 5: DIRECT_V2

File length: 341494 bytes
Padding length: 0 bytes
Padding ratio: 0%

I'll dump out the first ten rows of data from the ORC file and as you can see, the data looks to be in place.

$ java -jar ~/orc-tools-1.5.2-uber.jar \
        data \
        metadata.orc \
    | head
{"a":10,"b":23257138,"c":"Godsy","d":"2018-08-14 06:47:24.0","e":"AccessibleComputing"}{"a":12,"b":null,"c":null,"d":"2018-09-19 12:07:26.0","e":"Anarchism"}{"a":13,"b":9784415,"c":"Tom.Reding","d":"2017-06-05 04:18:18.0","e":"AfghanistanHistory"}{"a":14,"b":9784415,"c":"Tom.Reding","d":"2017-06-05 04:18:23.0","e":"AfghanistanGeography"}{"a":15,"b":9784415,"c":"Tom.Reding","d":"2017-06-05 04:19:42.0","e":"AfghanistanPeople"}{"a":18,"b":9784415,"c":"Tom.Reding","d":"2017-06-05 04:19:45.0","e":"AfghanistanCommunications"}{"a":19,"b":9784415,"c":"Tom.Reding","d":"2017-06-04 21:42:11.0","e":"AfghanistanTransportations"}{"a":20,"b":9784415,"c":"Tom.Reding","d":"2017-06-04 21:43:11.0","e":"AfghanistanMilitary"}{"a":21,"b":9784415,"c":"Tom.Reding","d":"2017-06-04 21:43:14.0","e":"AfghanistanTransnationalIssues"}{"a":23,"b":9784415,"c":"Tom.Reding","d":"2017-06-05 04:19:50.0","e":"AssistiveTechnology"}

For comparison, this is the data for the first ten rows in the source CSV file.

$ csvlook --no-header-row \
          --no-inference \
          --max-rows 10 \
          metadata.noheader.csv
| a  | b        | c          | d                    | e                              |
| -- | -------- | ---------- | -------------------- | ------------------------------ |
| 10 | 23257138 | Godsy      | 2018-08-14T06:47:24Z | AccessibleComputing            |
| 12 |          |            | 2018-09-19T12:07:26Z | Anarchism                      |
| 13 | 9784415  | Tom.Reding | 2017-06-05T04:18:18Z | AfghanistanHistory             |
| 14 | 9784415  | Tom.Reding | 2017-06-05T04:18:23Z | AfghanistanGeography           |
| 15 | 9784415  | Tom.Reding | 2017-06-05T04:19:42Z | AfghanistanPeople              |
| 18 | 9784415  | Tom.Reding | 2017-06-05T04:19:45Z | AfghanistanCommunications      |
| 19 | 9784415  | Tom.Reding | 2017-06-04T21:42:11Z | AfghanistanTransportations     |
| 20 | 9784415  | Tom.Reding | 2017-06-04T21:43:11Z | AfghanistanMilitary            |
| 21 | 9784415  | Tom.Reding | 2017-06-04T21:43:14Z | AfghanistanTransnationalIssues |
| 23 | 9784415  | Tom.Reding | 2017-06-05T04:19:50Z | AssistiveTechnology            |

Importing CSVs into ClickHouse

Once I've gotten a large sample of a data feed together, it's helpful to build fluency in the dataset using an analytics-focused database. For this exercise I'll import the CSV data into ClickHouse.

ClickHouse has done well in my 1.1 Billion Taxi Rides benchmarks and, even though it doesn't execute queries on GPUs, it is very performant, even on older Intel CPUs. Installation is done via a simple apt install command, the default configuration works well and there are no licence fees for commercial use.

I'll make sure the ClickHouse server is running and then connect to it with the client.

$ sudo service clickhouse-server start
$ clickhouse-client

I'll first create a table called pages that will store the data in more or less the same form as it appears in the CSV file. Since ClickHouse won't interpret the Z in the timestamp strings properly, I'll store the timestamp as a string initially and transform it later. The Z in the timestamps refers to Zulu time / UTC.

CREATE TABLE pages (
    page_id                       UInt32,
    revision_contributor_id       UInt32,
    revision_contributor_username Nullable(String),
    revision_timestamp            String,
    title                         Nullable(String)
) ENGINE=Log;

The following will feed the decompressed CSV data into the table while skipping the header row.

$ pigz -d -c metadata.csv.gz \
    | tail -n +2 \
    | clickhouse-client \
        --query="INSERT INTO pages FORMAT CSV"
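A quick sanity check of my own to confirm the rows landed:

$ clickhouse-client --query="SELECT count(*) FROM pages"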

Now I can create a new pages_mt table which will use the faster MergeTree engine. This will do a good job of speeding up queries on this dataset.

$ clickhouse-client
CREATE TABLE pages_mt ENGINE=MergeTree(date_, page_id, 8192) AS
    SELECT page_id,
           revision_contributor_id,
           revision_contributor_username,
           toDateTime(replaceOne(revision_timestamp, 'Z', '')) as revision_timestamp_,
           toDate(replaceOne(revision_timestamp, 'Z', '')) as date_,
           title
    FROM pages;

The following is an example query looking for the ten most frequent years for revisions.

SELECT toYear(date_) AS year,
       count(*) AS revisions
FROM pages_mt
GROUP BY year
ORDER BY 2 DESC
LIMIT 10
┌─year─┬─revisions─┐
│ 2002 │       289 │
│ 2003 │       176 │
│ 2004 │       193 │
│ 2005 │       159 │
│ 2006 │       211 │
│ 2007 │       125 │
│ 2008 │       129 │
│ 2009 │       186 │
│ 2010 │       200 │
│ 2011 │       188 │
└──────┴───────────┘
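As a further example against the same table, and purely my own addition, a similar query can surface the most prolific contributors in this sample:

SELECT revision_contributor_username,
       count(*) AS revisions
FROM pages_mt
WHERE isNotNull(revision_contributor_username)
GROUP BY revision_contributor_username
ORDER BY revisions DESC
LIMIT 10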