GROUP BY on a GROUP BY

Here’s something I’ve been working on lately: a number of metrics across large data sets where I want to determine aggregates of aggregates. I can’t present the actual data I’m working with here, so consider the illustrative data set below.

SQL> SELECT   *
  2  FROM     shop_orders;

        ID CATEGORY   ITEM_N
---------- ---------- --------------------
         1 Office     Stapler
         2 Fruit      Banana
         3 Bread      French Bread
         4 Steak      Ribeye
         5 Fish       Bass
         6 Crackers   Saltines
         7 Fish       Swordfish
         8 Fruit      Pear
         9 Bread      Rolls
        10 Crackers   Chips
        11 Steak      Prime Rib
        12 Bakery     Birthday Cake
        13 Fish       Salmon
        14 Fish       Flounder
        15 Cleaners   Detergent
        16 Bakery     Donuts
        17 Office     Ink Cartridges
        18 Steak      Filet Mignon
        19 Vegetable  Broccoli
        20 Steak      Flank
        21 Fruit      Raspberry
        22 Sundries   Shampoo
        23 Steak      Salisbury
        24 Vegetable  Radish
        25 Vegetable  Mushroom
        26 Pharmacy   Band-aids
        27 Pharmacy   Aspirin
        28 Sundries   Toothbrush
        29 Sundries   Shampoo
        30 Cereal     Bran
        31 Cereal     Corn Flakes
        32 Cleaners   Soap
        33 Cereal     Oatmeal

33 rows selected.

One glance at the data and you can see the category values repeat. But how often do they repeat? Which categories recur most frequently? What if you needed to determine which categories have the greatest number of items, and which have the fewest?

To determine this, we can start with a GROUP BY, like this.

SQL> SELECT   category, COUNT(*) CT
  2  FROM     shop_orders
  3  GROUP BY category;

CATEGORY       CT
---------- ----------
Cereal              3
Fruit               3
Crackers            2
Vegetable           3
Office              2
Steak               5
Pharmacy            2
Fish                4
Bread               2
Cleaners            2
Sundries            3
Bakery              2

12 rows selected.

Now let’s move the GROUP BY into an inline view and do a GROUP BY on the GROUP BY, like this.

SQL> SELECT COUNT(b.category) ct_category, b.ct ct_item_n
  2  FROM
  3    ( SELECT   a.category, COUNT(a.item_n) CT
  4      FROM     shop_orders a
  5      GROUP BY a.category
  6    ) b
  7  GROUP BY b.ct
  8  ORDER BY b.ct DESC;

CT_CATEGORY  CT_ITEM_N
----------- ----------
          1          5
          1          4
          4          3
          6          2

Voila – there is one category that recurs five times, one that recurs four times, four that recur three times each, and six that have two items each.

In a relatively small data set like this, the answers might be obvious. But I’ve been working with tables whose row counts run into the tens of millions, and queries like this have been invaluable in confirming data shape and patterns.
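If you also want to see which categories land in each bucket, the same inline view can feed a LISTAGG. Here’s a minimal sketch, assuming Oracle 11g Release 2 or later (the release that introduced LISTAGG):

-- For each item count, list the categories that share it.
SELECT   b.ct       ct_item_n
       , COUNT(*)   ct_category
       , LISTAGG(b.category, ', ')
           WITHIN GROUP (ORDER BY b.category) categories
FROM   ( SELECT   a.category, COUNT(a.item_n) ct
         FROM     shop_orders a
         GROUP BY a.category
       ) b
GROUP BY b.ct
ORDER BY b.ct DESC;

Against the sample data, this should report Steak alone in the five-item bucket, Fish alone in the four-item bucket, and so on down the line.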

If you wish to experiment with these SQL statements, here is the code to create the sample table and sample data.

DROP   TABLE shop_orders;
CREATE TABLE shop_orders
(   id       NUMBER
  , category VARCHAR2(15)
  , item_n   VARCHAR2(20)
);
INSERT INTO shop_orders VALUES ( 1, 'Office', 'Stapler');
INSERT INTO shop_orders VALUES ( 2, 'Fruit', 'Banana');
INSERT INTO shop_orders VALUES ( 3, 'Bread', 'French Bread');
INSERT INTO shop_orders VALUES ( 4, 'Steak', 'Ribeye');
INSERT INTO shop_orders VALUES ( 5, 'Fish',  'Bass');
INSERT INTO shop_orders VALUES ( 6, 'Crackers', 'Saltines');
INSERT INTO shop_orders VALUES ( 7, 'Fish', 'Swordfish');
INSERT INTO shop_orders VALUES ( 8, 'Fruit', 'Pear');
INSERT INTO shop_orders VALUES ( 9, 'Bread', 'Rolls');
INSERT INTO shop_orders VALUES (10, 'Crackers', 'Chips');
INSERT INTO shop_orders VALUES (11, 'Steak', 'Prime Rib');
INSERT INTO shop_orders VALUES (12, 'Bakery', 'Birthday Cake');
INSERT INTO shop_orders VALUES (13, 'Fish', 'Salmon');
INSERT INTO shop_orders VALUES (14, 'Fish', 'Flounder');
INSERT INTO shop_orders VALUES (15, 'Cleaners', 'Detergent');
INSERT INTO shop_orders VALUES (16, 'Bakery', 'Donuts');
INSERT INTO shop_orders VALUES (17, 'Office', 'Ink Cartridges');
INSERT INTO shop_orders VALUES (18, 'Steak', 'Filet Mignon');
INSERT INTO shop_orders VALUES (19, 'Vegetable', 'Broccoli');
INSERT INTO shop_orders VALUES (20, 'Steak', 'Flank');
INSERT INTO shop_orders VALUES (21, 'Fruit', 'Raspberry');
INSERT INTO shop_orders VALUES (22, 'Sundries', 'Shampoo');
INSERT INTO shop_orders VALUES (23, 'Steak', 'Salisbury');
INSERT INTO shop_orders VALUES (24, 'Vegetable', 'Radish');
INSERT INTO shop_orders VALUES (25, 'Vegetable', 'Mushroom');
INSERT INTO shop_orders VALUES (26, 'Pharmacy', 'Band-aids');
INSERT INTO shop_orders VALUES (27, 'Pharmacy', 'Aspirin');
INSERT INTO shop_orders VALUES (28, 'Sundries', 'Toothbrush');
INSERT INTO shop_orders VALUES (29, 'Sundries', 'Shampoo');
INSERT INTO shop_orders VALUES (30, 'Cereal', 'Bran');
INSERT INTO shop_orders VALUES (31, 'Cereal', 'Corn Flakes');
INSERT INTO shop_orders VALUES (32, 'Cleaners', 'Soap');
INSERT INTO shop_orders VALUES (33, 'Cereal', 'Oatmeal');

COMMIT;

There you go! Have fun!

Generating Test Data

Here’s a quick way to generate dummy data for performing tests that require lots of records.

SQL> DROP   TABLE lots_o_data;

Table dropped.

SQL> CREATE TABLE lots_o_data
  2  (   id       NUMBER
  3    , name     VARCHAR2(5)
  4    , category VARCHAR2(26)
  5  );

Table created.

SQL> INSERT INTO lots_o_data VALUES (1, 'Lotso', 'Wow this is a lot of data.');

1 row created.

SQL> INSERT INTO lots_o_data VALUES (2, 'More',  'Bunches and bunches.');

1 row created.

SQL> COMMIT;

Commit complete.

SQL> INSERT INTO lots_o_data
  2     SELECT a.id, a.name, b.category
  3     FROM   lots_o_data a CROSS JOIN lots_o_data b;

4 rows created.

SQL> COMMIT;

Commit complete.

SQL> INSERT INTO lots_o_data
  2     SELECT a.id, a.name, b.category
  3     FROM   lots_o_data a CROSS JOIN lots_o_data b;

36 rows created.

SQL> COMMIT;

Commit complete.

SQL> INSERT INTO lots_o_data
  2     SELECT a.id, a.name, b.category
  3     FROM   lots_o_data a CROSS JOIN lots_o_data b;

1764 rows created.

SQL> COMMIT;

Commit complete.

SQL> INSERT INTO lots_o_data
  2     SELECT a.id, a.name, b.category
  3     FROM   lots_o_data a CROSS JOIN lots_o_data b;

3261636 rows created.

SQL> COMMIT;

Commit complete.

SQL> SELECT COUNT(*)
  2  FROM   lots_o_data;
  COUNT(*)
----------
   3263442

Note the following:

  • With only four INSERT … SELECT statements based on a Cartesian product, we were able to generate over 3 million rows of data. Each statement cross joins the table with itself and so inserts the square of the current row count: 2 rows grow to 6, then 42, then 1,806, and finally 3,263,442.
  • A few more INSERT … SELECT statements could easily result in billions of rows of data.

If you need a lot of data for performance testing or capacity testing, this is an easy way to get however much you require.
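As an aside, when you need an exact number of rows rather than the exponential growth above, Oracle’s row-generator idiom with CONNECT BY LEVEL against DUAL does it in a single statement. Here’s a minimal sketch reusing the lots_o_data table; the generated names and values are just made-up filler:

-- Generate exactly one million rows in one statement; LEVEL runs 1..N.
INSERT INTO lots_o_data
   SELECT LEVEL                       -- id
        , 'Gen' || MOD(LEVEL, 5)      -- name: Gen0 .. Gen4, fits VARCHAR2(5)
        , 'Row number ' || LEVEL      -- category, fits VARCHAR2(26)
   FROM   dual
   CONNECT BY LEVEL <= 1000000;

COMMIT;

The Cartesian product approach piles up rows faster per statement; CONNECT BY LEVEL gives you precise control over how many you get.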

I have been working on building a data lake environment ahead of a series of large data feeds I’m expecting soon, and I found this to be a useful approach to preparing the environment.

Conference in Williamsburg, Sqoop, and Big Data Connectors

Yesterday – April 24, 2014 – I had the privilege of presenting at the Virginia Oracle User Group (VOUG) annual Oracle Conference on the James, or OCOJ for short. The “James” is the James River, right there in Williamsburg, Virginia. I promised my audience I would publish the PowerPoint slides, so here they are:

Yesterday could not have been a more glorious day. The Doubletree Hotel in Williamsburg is really a conference center, with a variety of beautiful meeting rooms, great skylights, and large glass walls. The room in which I presented was unusual in that it had one door that led directly outside. The hotel propped it open and the beautiful air flowed in; it was remarkably refreshing. I particularly enjoyed it, given the many snow days we had this past winter and even into the spring. The air was fresh and the temperatures just perfect. What a great day.

I was definitely in with some illustrious company – other presenters throughout the day included Mary Gable, David Mann, Craig Shallahamer, Greg Mays, Scott Poteet, Bill Myers, and Oracle’s own Bob Bunting, as well as Robert Freeman. The legendary Tom Kyte (of Ask Tom fame) was the keynote speaker. Brilliant talent was on display everywhere.

For more information about VOUG, visit their website here:  http://www.voug.org.

Here’s a copy of the full conference agenda:  VOUG OCOJ Conference Agenda.

Thanks to one of my great audience members who took the photo.  And a huge thanks to Linda Hoover for making the entire event possible – thank you Linda!