MJFChat: Sorting out data warehousing and data lakes on Azure

We are now in the second year of our twice-monthly interview show on Petri.com that is dedicated to covering topics of interest to our tech-professional audience. We have branded this show “MJFChat.”

In my role as Petri’s Community Magnate, I will be interviewing a variety of IT-savvy technology folks. Some of these will be Petri contributors; some will be tech-company employees; some will be IT pros. We will be tackling various subject areas in the form of 30-minute audio interviews. I will be asking the questions, the bulk of which we’re hoping will come from you, our Petri.com community of readers.

We will ask for questions a week ahead of each chat. Readers can submit questions via Twitter, Instagram, Facebook and/or LinkedIn using the #AskMJF hashtag. Once the interviews are completed, we will post the audio and associated transcript in the forums for readers to digest at their leisure. (By the way, did you know MJFChats are now available in podcast form? You can find them on Spotify, on Apple Podcasts on iTunes and on Google Play.)

Our next MJFChat, scheduled for Monday, February 3, is all about data warehousing and data lakes on Azure. My special guest is Andrew Brust, founder of Blue Badge Insights. Brust also is a Microsoft Regional Director and Most Valuable Professional (MVP).

We want you to submit your best questions for Andrew ahead of our chat. If you’ve got questions about data warehousing, data lakes, Microsoft’s new Azure Synapse Analytics offering or Microsoft’s database strategy in general, he is your guy. If there are any specific topics or scenarios you’d like Andrew to cover, make sure to chime in ahead of time.

Also: If you know someone you’d like to see interviewed on the MJFChat show, including yourself, send me a note at [email protected]. (Let me know why you think this person would be an awesome guest and what topics you’d like to see covered.) We’ll take things from there….

Transcript of the conversation:

Mary Jo Foley: 00:01 Hi, you’re listening to the Petri.com MJFChat show. I am Mary Jo Foley, aka your Petri.com Community Magnate. I’m here to interview tech industry experts about various topics that you, our readers and listeners, want to know about. Today’s MJFChat is going to be all about data warehousing and data lakes on Azure. My special guest today is Andrew Brust, the founder of Blue Badge Insights. Andrew also is a Microsoft Regional Director and an MVP and a longtime friend of mine. Welcome, Andrew, and thank you so much for doing this chat.

Andrew Brust: 00:43 Oh, thank you for having me. You know, I always like talking about data stuff.

Mary Jo Foley: 00:47 I know.

Andrew Brust: 00:48 In the world of Microsoft or elsewhere.

Mary Jo Foley: 00:51 Exactly, exactly. So I’m excited about this because I feel like I mentioned data lakes and data warehousing from time to time in my coverage, but I’m always thinking to myself, do I really even understand what I’m saying? So I’m excited to get some good insights and background from you about this. I think a good place to start might be with definitions, just for level setting here. So maybe you could give us a quick definition of data warehouse, data lake and how those two things are different or how they compete and complement each other.

Andrew Brust: 01:28 Right. And how they’re converging even.

Mary Jo Foley: 01:31 Yes.

Andrew Brust: 01:32 Just to have a little bit of a paradox there. The data warehousing story is really kind of about the circle of life. We’ve had data warehouse technology for decades and it has in many ways always been what it’s about today. It’s about taking the data from transactional, operational databases, getting it all in one place, and getting it conformed and optimized, in structure and otherwise, for doing analysis as opposed to just doing operational things like looking up a particular record and updating it, or inserting a new record, or maybe doing a query over a small set of them. That’s what transactional, operational databases are great at. But when you want to start aggregating across huge, huge numbers of rows of data, that’s what the warehouse is for. And as I said, it’s been there for decades. I mean, Teradata’s been around doing stuff in this world and lots of interesting innovations have come out of that.
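
Editor’s note: the contrast Andrew draws here, between looking up or updating one record and aggregating across many rows, can be illustrated with a minimal sketch. It uses Python’s built-in sqlite3 module purely as a stand-in; the `orders` table and its values are hypothetical and were not part of the conversation, and a real warehouse would run the second kind of query over far more data on very different infrastructure.

```python
import sqlite3

# Hypothetical toy "orders" table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (id, region, amount) VALUES (?, ?, ?)",
    [(1, "East", 100.0), (2, "West", 250.0), (3, "East", 50.0), (4, "West", 25.0)],
)

# Operational (transactional) pattern: find one record and update it.
conn.execute("UPDATE orders SET amount = 120.0 WHERE id = 1")
single_order = conn.execute(
    "SELECT id, region, amount FROM orders WHERE id = 1"
).fetchone()

# Analytical (warehouse-style) pattern: aggregate across every row.
totals_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(single_order)      # the one updated row
print(totals_by_region)  # one total per region
```

The first query touches a single row by key; the second has to read every row, which is the workload a warehouse is structured and optimized for.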

Andrew Brust: 02:41 The thing is though, that we got into kind of a rut for at least a decade where data warehouses were, well let’s just say very expensive, therefore very limited in terms of who in the organization could use it and do the analysis work. And really most of the companies were monetizing on the hardware and especially on the storage. Data warehouses were sold physically as appliance devices. And there was only so much storage in them. So eventually you ran out and when you needed more, that’s when you, let’s just say, didn’t have a lot of leverage and when you had to get ready to open your pocketbook. That really encouraged a kind of mentality and practices to be very sparing about what you put in the warehouse and sparing about who could use it. Then along came what we used to call big data.

Andrew Brust: 03:38 You remember big data, you remember it.

Mary Jo Foley: 03:39 I do.

Andrew Brust: 03:39 And although purportedly that was all about being able to deal with larger volumes of data, which to some extent was true, it really, with hindsight, was about something different. So, if warehouses were all about proprietary technology, then big data was all about commodity technology. And especially on the storage side, instead of using premium enterprise storage, we were using, you know, the most commodity direct-attached disk drives on the most commodity servers that we could. And that changed the whole mentality. That said, instead of being sparing about what we store, we’re just going to actually err on the side of being inclusive and put everything there. And the place where we end up storing everything has come to be called a data lake. Now we’re still doing the same kinds of analysis on those platforms, but with a totally different mentality.

Andrew Brust: 04:47 And then the revenge of the data warehouse really started with Amazon and its Redshift product, which said, you know what, we’re going to make storage commodity on the warehouse as well and we’re going to set it up in the cloud. So, this whole appliance model, we’re going to blow that out and now people can go back to the platforms that they were familiar with, right? But the economics totally changed. And that then caused a challenge to Hadoop. So you’ve got the warehouse and the lake, and now they’re both primarily using cloud object storage, and you’re starting to see different data warehouse products actually onboard data lake technologies. And in the Microsoft world, the latest two examples of that, and it’s a little problematic that there are two, are that Azure SQL Data Warehouse has become Azure Synapse Analytics, and it onboards data lake technology and Apache Spark, and SQL Server 2019 Big Data Clusters kind of do likewise. So now we’ve got everything coming together and ultimately that makes sense. You have to go out into the wilderness, innovate with very new technology, and then eventually the sleeping giants wake up and say, okay, well, we’ll add some features and capabilities so that we can do that too.

Mary Jo Foley: 06:10 So this may be a can of worms question that I’m about to open here, but now that those two things are converging so much, how do you choose? Like how do you decide I’m putting this in a data warehouse or a data lake?

Andrew Brust: 06:23 Yeah, that’s an excellent question. And you know, like many things, you’re going to get different answers from different vendors as it suits their messaging and their marketing. There will be data warehouse vendors who say, just put everything in the data warehouse, and in fact the data warehouse can take on data lake workloads. And you’re going to get that same argument, although the converse, from the data lake vendors who say, wow, there’s really no reason for data warehouses anymore. But whether we’re talking about products or not, there really is at least a methodology difference between what the two things are. Data warehouses are really places for very vetted, highly structured, very analysis-ready data that everyone in the organization can agree on in terms of its being included and how it’s structured and how it’s related. Data lakes are much more a place for, as I was saying before, erring on the side of inclusion, putting lots of stuff in there and leaving it in a relatively raw form until it comes time to analyze it. And what you’ll probably see is that in different organizations, if they start analyzing data in the lake repeatedly, the same data repeatedly in the same way, it can kind of transition or graduate into being in the warehouse. So there’s at least a thin dotted line between the two in terms of philosophy.
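
Editor’s note: one way to picture the lake side of that dotted line is data kept unprocessed, with structure applied only when someone analyzes it. The sketch below is an illustration under stated assumptions, not something described in the chat: the JSON events, the field names and the `read_temperatures` helper are all hypothetical.

```python
import json

# Hypothetical raw events as they might land in a lake: one JSON document
# per line, with no agreed schema (fields vary from record to record).
raw_lines = [
    '{"device": "t1", "temp_c": 21.5, "ts": "2020-02-03T10:00:00"}',
    '{"device": "t2", "temp_c": 19.0, "ts": "2020-02-03T10:00:05", "battery": 0.8}',
    '{"device": "t1", "humidity": 40}',
]

def read_temperatures(lines):
    """Apply structure at read time: keep only records that carry a temperature."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if "temp_c" in record:
            rows.append((record["device"], float(record["temp_c"])))
    return rows

# If this same analysis is run again and again, its fixed (device, temp_c)
# shape is a candidate to "graduate" into a vetted warehouse table.
curated_rows = read_temperatures(raw_lines)
average_temp = sum(temp for _, temp in curated_rows) / len(curated_rows)

print(curated_rows)
print(average_temp)
```

Nothing is rejected when the data is stored; the record without a temperature is simply skipped by this particular analysis and remains available for others.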

Mary Jo Foley: 07:51 Okay. Is there also any connection between the type of data and which one you choose? So I’m thinking like IoT sensor data. Like, does it make obvious sense that that goes in a data lake or a data warehouse, or does it depend?

Andrew Brust: 08:05 I would say in most cases that should go in the data lake. And even if you’re saying, well, it’s going to go in the warehouse, then I would say fine, but then in effect, you’re using your warehouse as a data lake, which is, you know, which is okay, depending. But yeah, the more raw data, the less structured, the more kind of time series data like you’re talking about typically want to put that in the lake.

Mary Jo Foley: 08:27 Okay, cool. So now let’s make this even more complicated. What about BI and how business insights and analytics fit in, right?

Andrew Brust: 08:37 Yeah. Yeah. So once upon a time you had BI tools and they were really the go-to place for doing the analysis for data that was in the warehouse or for stuff that was in BI platforms like OLAP cubes and so forth. I’m very careful not to disparage those because if it weren’t for OLAP and cubes, I wouldn’t be in this part of the industry. That’s really where I got hooked. And so BI tools are actually, even to this day, based on the paradigms that came out of OLAP, which is that you have a structure for the data where you have things called measures, which are the metrics that you’re analyzing, and things called dimensions, which are the categories that you’re going to drill down the metrics by. And in the world of OLAP, those are very legislative kind of objects, right?

Andrew Brust: 09:34 In warehouses and certainly in data lakes it’s a little bit more, kind of conventional, like you say, well, I’m using this as a measure and I’m using this as the dimension. Anyway, so BI tools are really the place where you’re doing the analytics on both the warehouse and the data lake. And you’re starting to see, in the Microsoft case, a tool like Power BI get more and more capabilities where it can talk to either, and more and more capabilities where in fact, even though it’s using the same OLAP engine that we had in SQL Server Analysis Services going back 20 years, regardless of that, it’s still able to kind of tolerate things being not in a cube format but in a very tabular format. And then it kinda picks things up on the fly based on how you use them. So BI tools, you know, they were where it’s at. They’re still where it’s at. That’s why Salesforce paid huge gobs of money for Tableau. And that’s why Microsoft is having so much success with Power BI, and Power BI isn’t just successful in its own right. It’s a huge driver to get customers on to Azure even though technically it’s not even part of Azure.
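
Editor’s note: the measures-and-dimensions idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not how Power BI or Analysis Services work internally: the `facts` rows and the `aggregate` helper are hypothetical, with `revenue` treated as the measure and `region` and `year` as dimensions chosen at query time.

```python
from collections import defaultdict

# Hypothetical sales facts: "region" and "year" act as dimensions,
# "revenue" as the measure. Values are illustrative only.
facts = [
    {"region": "East", "year": 2019, "revenue": 100},
    {"region": "East", "year": 2020, "revenue": 150},
    {"region": "West", "year": 2019, "revenue": 80},
    {"region": "West", "year": 2020, "revenue": 120},
]

def aggregate(rows, dimensions, measure):
    """Sum the measure for each distinct combination of dimension values."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

# Same measure, different levels of drill-down.
by_region = aggregate(facts, ["region"], "revenue")
by_region_year = aggregate(facts, ["region", "year"], "revenue")

print(by_region)
print(by_region_year)
```

Adding a dimension to the list is the "drill down" step: the measure stays the same and only the grouping becomes finer.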

Mary Jo Foley: 10:54 Yup. So yeah, speaking of Azure, it feels like the cloud has changed everything when it comes to database, data warehouse, and data lakes. And then Kubernetes, same thing, changing the whole scene all over again. Can you talk a little bit about how things have changed because of those two things?

Andrew Brust: 11:15 Yeah. Well, you know, if Hadoop changed the whole economics of storage because it used commodity disk drives, the cloud changed it even more because it said, well, if you’re on Amazon, you’re going to put all your data in S3, and if you’re in Azure, you’re going to put all your data in Blob Storage, or these days you put it in something called Azure Data Lake Storage, which is like Blob Storage plus plus plus. But it was the same idea, but it’s even easier. It’s even less friction. And then once the data’s there, then you can be using BI platforms or data warehouse platforms or data lake platforms to talk to the technology. However, let’s not get ahead of ourselves, as much as we all kind of love jumping on the cloud bandwagon. You know better, probably, than most that there are still plenty of organizations that can’t move to the cloud or can’t move whole hog to the cloud. There is various data in various industries that needs to remain on-premises. And there are also lots of customers who move back and forth between different clouds and maybe have their own cloud-like infrastructure set up. So we start to need technology that can make things portable between all those environments.

Andrew Brust: 12:36 That is where Kubernetes has come in, because the Kubernetes services (for example, Elastic Kubernetes Service on Amazon or Azure Kubernetes Service on Azure, and of course various on-premises Kubernetes services) let you move stuff up and down between clouds and from on-prem to the cloud. And so the SQL Server 2019 Big Data Clusters that I mentioned before, that’s all based on Kubernetes. So you can run that on-premises; you would think so, since it’s an on-premises database, but actually the easiest way to run SQL 2019 BDC is in Azure Kubernetes Service. So it’s an on-premises product that runs in the cloud and, gosh, you know, it uses the Hadoop Distributed File System on-premises, but really that translates to using cloud storage when it’s running in the cloud. And so you’re starting to get a lot of fluidity where you can move between environments, and it’s all because of Azure. And then in the SQL Server space, it’s all because of the foundational work that was done to make SQL Server run on Linux and in containers. And now that’s all paying off where you can move it pretty much anywhere you want. It turns out that was the plan all along.

Mary Jo Foley: 13:59 Who knew?

Andrew Brust: 14:02 Yeah, exactly.

Mary Jo Foley: 14:02 Okay. There was a big announcement at Ignite this past year about Azure Synapse Analytics, which I think is partially a rebranding if not completely a rebranding, but could you explain that a bit more and how that fits in with data warehousing and data lakes?

Andrew Brust: 14:19 Yeah, absolutely. And we chatted about it earlier, but I can go into detail here. So, yes. We had this thing called Azure SQL Data Warehouse, which kind of looks like a companion to Azure SQL Database and it’s based on the technology that Microsoft, once upon a time called Parallel Data Warehouse on-premises. It never really you know, caught on and had lots and lots of adoption. But in the cloud, it’s been very successful. So that team has said, you know what we’re going to do? We’re going to handle things so that anything you have in the warehouse, we’re also going to shadow and have that in, in cloud storage as well. And give you the option to use Apache Spark to do analytics on it or to use the warehouse to do analytics on it.

Andrew Brust: 15:10 And that’s the difference between SQL Data Warehouse and Synapse. However, all that Spark goodness is not actually released yet. And so for all practical, you know, concerns, it is just a rebrand. Azure SQL Data Warehouse became Synapse Analytics. Now what’s nutty is that Synapse Analytics and SQL 2019 Big Data Clusters, and SQL 2019 overall, were both announced at Ignite. So same day, same event. And you’ve got two different ways to converge SQL Server technology-based analytics with Spark technology-based analytics. They’re different, but to release them on the same day at the same event has already created some market confusion, shall we say?

Mary Jo Foley: 16:11 Yeah. There was a lot of rebranding confusion, I felt like, at Ignite this year, not just in the database space, but also in the Azure space in general and the Edge. A lot of new names to get used to and figure out.

Andrew Brust: 16:26 Yeah. And you know, there’s a competitive necessity to stand up to Snowflake, which is an independent data warehouse company, although the rumors are that its independence may be waning.

Mary Jo Foley: 16:41 Oh, yeah.

Andrew Brust: 16:44 Looks like there’s a buyer out there.

Mary Jo Foley: 16:44 Oh really? Who? Who?

Andrew Brust: 16:46 I’m hearing rumors of Salesforce.

Mary Jo Foley: 16:49 Oh, interesting. Very interesting.

Andrew Brust: 16:51 Yup. So you may end up with Snowflake and Tableau under the same roof, which could get interesting. And then Salesforce starts to look a lot more like SAP, which is really

Mary Jo Foley: 17:04 Kind of crazy. Yeah.

Andrew Brust: 17:06 Yeah, well, it makes sense though. But anyway, it’s important to compete against that. It’s important to say, look, let’s not let the lake disrupt the warehouse. Let’s not let the warehouse preclude the lake. Let’s be inclusive. Let’s have a big tent and let’s try and make these things work together even though they’re based on different technologies and toolchains. But the real driver here, in my opinion, is AI, because the data scientists of the world like to use the open-source stack and notebooks and, for that matter, Apache Spark. This then allows the enterprise database folks and DBAs and the data scientists to start to come together and be using roughly the same data on the same overall platform. And that’s important too. By the way, Synapse Analytics will also have its own studio, which will put a lot of stuff together, both sides of the warehouse product. And then it will also bring things like Azure Data Factory into the experience, quote-unquote, as well. So part of this is really just a play to take the rather sprawled pieces of the Azure stack and try and bring them together in a more integrated way.

Mary Jo Foley: 18:33 This is a big question I’m going to ask you, but do you feel like Amazon with AWS and the Google Cloud Platform are kind of looking at this the same way as Microsoft? Or do you feel like they all are approaching the space of data warehouses and data lakes in very different ways that you could distinguish?

Andrew Brust: 18:56 Yeah, I mean, gosh, at least in terms of the services offered, I haven’t seen Amazon or Google try to put the lake technology and the warehouse technology directly together the way Microsoft is trying with two different products, Synapse and SQL 2019. I think philosophically there’s a big difference if we just think about Amazon and Microsoft for now. Amazon loves to have lots, and it’s funny because we used to say this about Microsoft, Amazon likes to have a platform and lots of building blocks and leave it up to practitioners and partners to put them together. Microsoft, while they may kind of be faulted for doing that, they at least have aspirations of bringing things more together, making them more turnkey, and allowing, you know, each one to kind of accentuate the capabilities and the advantages of the other. So I don’t know that we’ll see, for example, Elastic MapReduce and Redshift on the Amazon side come together. On the Google side, same thing.

Mary Jo Foley: 20:13 Yeah.

Andrew Brust: 20:14 Google’s Data Warehouse is interesting too, because the tech it’s based on is not typical data warehouse technologies. So they have this thing called BigQuery. And BigQuery is not based on the same massively parallel processing, columnar storage kinds of technologies that most data warehouse products are. So Google’s different altogether, although that gets pretty much into the technical weeds. So I won’t go any further, but yeah, they’re all a little different.

Mary Jo Foley: 20:43 Yeah. And then what about the old guard companies like Teradata, Oracle, SAP? Are they playing in the space at all or are they just kind of saying, you know what, people still need the tried and true solutions that we’ve always sold?

Andrew Brust: 20:57 Yeah. Well, I mean, what’s interesting is that, almost a decade ago, Teradata did a bunch of acquisitions to bring big data technology in house and marry it up with their data warehouse technology. So they were way ahead of their time on that. I think the execution was kind of weak, and that ironically, now they’re sort of behind and it’s mostly because none of what they were doing had much to do with the cloud. Now they’re playing catch up and you know, their legacy is everything I said before about, you know, really difficult economics and appliance form factors, and you know, making the change there has been, you know, not exactly second nature to them. SAP has their HANA technology and then lots of things based on that. They are constantly rebranding to the point where I’ve lost track of what’s called what now, but because they’re very focused of course on applications, right?

Andrew Brust: 22:06 That’s their mainline business. And then making their analytics work with the data that their applications are tracking. They have a built in value proposition that is very, you know, sensible and concrete. And so even if their tack is a little bit more scattered, they’re still in pretty good shape. If you’re an SAP shop in terms of, you know, their ERP software or so many other, you know, CRM and so many other categories that they have acquired over the, you know, last couple of decades then using their analytics technology just kind of makes sense. It’s adjacent, it’s integrated, it just works. And then in terms of Oracle, gosh, I’m not sure what to make of Oracle these days. They have their cloud and you know, they have their kind of alliance with Microsoft to make their cloud stuff run on Azure. You know, they’ve long had Exadata as a data warehouse product and they have Bruno Aziza running their analytics and AI and he once upon a time was in the BI world at Microsoft, so he knows.

Mary Jo Foley: 23:16 Oh really?

Andrew Brust: 23:16 Yup. Yup. That’s where I met him and he knows lots and lots. But getting all those pieces put together and you know, really firing on all cylinders I haven’t seen that happen yet. I would, to answer your original question in much shorter verbiage, the old guard is challenged here. The only really old guard company that I would say is rising to the challenge is Microsoft, right? They’re an old guard enterprise company, and the new guard cloud company, and they’ve been able to bridge the two eras. Nobody else really has done that.

Mary Jo Foley: 23:53 I think to your point about Oracle and Microsoft having an alliance and then more recently Oracle saying they’re going to be co-locating some of their data centers with Azure. SAP committing to using Azure and Azure Data Lake as a backbone to some of their products. It kind of, I mean all these companies have alliances with all the other cloud vendors too. But it feels like there’s a lot of momentum with these data warehouse, database, data lake companies and Microsoft. Yeah.

Andrew Brust: 24:27 Yeah. Well, at this point, the barrier to entry for building one’s own cloud is so high that, you know, if you’re not one of the big three, and you know, arguably two of those big three are much bigger than the third. But if you’re not one of the big three, you’ve kind of given up and said, all right, we’re not going to be in the hyperscale cloud business, you know, we’re going to run on those clouds. SAP has said that very explicitly. Oracle has said it implicitly, and what Teradata, Vertica, Greenplum and others are saying is less clear. Greenplum is now effectively part of Dell EMC and Vertica is part of Micro Focus, so it’s sort of old guard being acquired by older old guard. And in terms of cloudifying that, I don’t know. There’s a lot of steps to getting that.

Mary Jo Foley: 25:24 A lot of moving parts.

Andrew Brust: 25:27 Yup. Yup.

Mary Jo Foley: 25:28 Yeah. Alright. I think we’re going to end on that note. I just, I always like going back and thinking about the history part, you know, and kind of bringing that into the present. So I think that’s a good ending spot for this chat.

Andrew Brust: 25:43 Absolutely. Well, Microsoft’s got the history, they’ve been there with SQL Server since the late 80’s, early 90’s and they’re still leveraging that platform. It’s super impressive. So timeless.

Mary Jo Foley: 25:54 I know. You know, it’s funny, one of the very first areas that I had to cover in-depth when I started covering Microsoft, was databases. And it was kind of the time when SQL Server was seen as a toy database and kind of a joke almost in the industry. And then to see how that’s evolved as I’ve been doing my coverage of the company has been really interesting to me.

Andrew Brust: 26:17 Once upon a time Exchange was seen as a toy and a joke. Now look where we are.

Mary Jo Foley: 26:25 Exactly. Hey, Andrew, thank you so much.

Andrew Brust: 26:25 Look where Lotus Notes is. Yeah, thanks, it was a pleasure.

Mary Jo Foley: 26:29 It’s always fun to chat with you and thanks again for doing this.

Andrew Brust: 26:35 All right take care.

Mary Jo Foley: 26:35 Yeah. For everybody else who’s listening to this chat, we’re getting ready right now for our next chat. I’ll be posting that information on Petri very soon, and then you can submit your questions right on Twitter, Facebook, or LinkedIn for my next guest. In the meantime, if you or someone you know might make a good guest for one of these MJF chats, please do not hesitate to drop me a note. All my contact information is available on Petri.com. Thanks again.