NOAA generates tens of terabytes of data a day from satellites, radars, ships, weather models, and other sources. Although this vast wealth of data represents a substantial economic opportunity, it also creates a host of new problems, particularly with performance. As the data keeps growing, it can in turn slow down dashboards, models, and reports.

Open source technology stack (Apache Pinot, Superset). Apache Pinot is a distributed, near real-time OLAP data store that offers a SQL query interface on top of a custom-written data store. It supports near real-time ingestion of events from various data sources (such as Hadoop HDFS, Amazon S3, Azure ADLS, and Google Cloud Storage) as well as stream data sources (such as Apache Kafka, HTTP, and FTP data servers). It scales horizontally and linearly as data size or query load increases. Superset is a modern data exploration and visualization platform. It can connect to any SQL-based data source through SQLAlchemy, including modern cloud-native databases and engines at petabyte scale. It offers seamless, in-memory asynchronous caching and queries, and a cloud-native architecture designed from the ground up for scale.
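Because Superset reaches Pinot through SQLAlchemy, the connection is expressed as a single URI. The sketch below builds one in the form documented for the pinotdb SQLAlchemy dialect. The host names are hypothetical placeholders, the ports 8099 (broker) and 9000 (controller) are Pinot's usual defaults rather than values from this project, and the exact URI shape should be checked against the installed dialect version.

```python
def pinot_sqlalchemy_uri(broker_host, broker_port=8099,
                         controller_url="http://localhost:9000/"):
    """Build a SQLAlchemy URI for Pinot (assumed pinotdb dialect format).

    broker_host and controller_url are illustrative placeholders; the
    default ports are Pinot's common defaults, not project settings.
    """
    return (
        f"pinot://{broker_host}:{broker_port}/query/sql"
        f"?controller={controller_url}"
    )


if __name__ == "__main__":
    # Hypothetical container names on a shared Docker network.
    print(pinot_sqlalchemy_uri("pinot-broker", 8099,
                               "http://pinot-controller:9000/"))
```

The resulting string is what would be pasted into Superset's database connection form as the SQLAlchemy URI.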

This project uses a real-time distributed OLAP (Online Analytical Processing) data store, designed to answer OLAP queries with low latency. Open source tools, like Apache Pinot, can be used to ingest, query, and visualize millions of climate events sourced from the NOAA database. The OLAP data store serves as the analytics backend that does most of the heavy lifting, and it is paired with another open source tool, Superset, to create real-time dashboards.
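To illustrate what an OLAP query against the backend might look like, here is a minimal sketch that sends an aggregation query to a Pinot broker over its SQL REST endpoint (POST to /query/sql with a JSON body). The table name storm_events, the column state, and the broker address are assumptions made for illustration; they are not taken from the project.

```python
import json
import urllib.request


def build_event_count_query(table, group_column, limit=10):
    """Return a SQL string counting events per value of group_column."""
    return (
        f"SELECT {group_column}, COUNT(*) FROM {table} "
        f"GROUP BY {group_column} ORDER BY COUNT(*) DESC LIMIT {limit}"
    )


def build_broker_request(broker_url, sql):
    """Wrap the SQL in the JSON body expected by the broker's /query/sql."""
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        f"{broker_url.rstrip('/')}/query/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(broker_url, sql, timeout=10):
    """Send the query and return the decoded JSON response."""
    request = build_broker_request(broker_url, sql)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical table and column names; requires a running broker.
    query = build_event_count_query("storm_events", "state", limit=5)
    print(run_query("http://localhost:8099", query))
```

A dashboard tool such as Superset issues queries of this kind on the user's behalf; running one by hand is mainly useful for checking that the table is populated and that latency is acceptable.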

The pipeline connects to big data on NOAA’s HTTP server with a custom bash script that uses wget on Ubuntu to ingest the data in near real time. A Docker network provides containerized resource allocation, Apache Pinot serves as the OLAP data store, and Superset provides web-based BI.
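The project's own ingestion step is a bash script built around wget, which is not reproduced here. As a rough Python analogue of the same idea, the sketch below works out which files are still missing locally and fetches only those over HTTP. The base URL, file names, and destination directory are hypothetical placeholders, not the actual NOAA paths used by the project.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urljoin


def plan_downloads(base_url, filenames, dest_dir):
    """Return (url, local_path) pairs for files not yet present locally."""
    base = base_url.rstrip("/") + "/"  # urljoin needs a trailing slash
    dest = Path(dest_dir)
    plan = []
    for name in filenames:
        target = dest / name
        if not target.exists():
            plan.append((urljoin(base, name), target))
    return plan


def fetch_all(plan, timeout=60):
    """Download every planned file, creating directories as needed."""
    for url, target in plan:
        target.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url, timeout=timeout) as response:
            target.write_bytes(response.read())


if __name__ == "__main__":
    # Placeholder server and file names, for illustration only.
    pending = plan_downloads(
        "https://example.invalid/noaa/csvfiles",
        ["events_2020.csv.gz", "events_2021.csv.gz"],
        "data/raw",
    )
    fetch_all(pending)
```

Skipping files that already exist keeps repeated runs cheap, which matters when the script is scheduled frequently to approximate near real-time ingestion.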


