Process more files than ever and use Parquet with Azure Data Lake Analytics

Azure Data Lake Analytics
Spread the love

Azure Data Lake Analytics (ADLA) is a serverless PaaS service in Azure to prepare and transform large amounts of data stored in Azure Data Lake Store or Azure Blob Storage at unparalleled scale.

ADLA now offers some new, unparalleled capabilities for processing files of any formats including Parquet at tremendous scale.

Previously: Handling tens of thousands of files is painful!

Many of our customers tell us that handling a large number of files is challenging – if not downright painful in all the big data systems that they have tried. Figure 1 shows the distribution of files in common data lake systems. Most files are less than one GB, although a few may be huge.

ADLA has been developed from a system that was originally designed to operate on very large files that have internal structure that help with scale-out, but it only operated on a couple of hundred to about 3,000 files. It also over-allocated resources when processing small files by giving one extract vertex to a file (a vertex is a compute container that will execute a specific part of the script on a partition of the data and take time to create and destroy). Thus, like other analytics engines, it was not well-equipped to handle the common case of many small files.

Process hundreds of thousands of files in a single U-SQL job!

With the recent release, ADLA takes the capability to process large amounts of files of many different formats to the next level.

Figure 2: Improving the handling of many small files

ADLA increases the scale limit to schematize and process several hundred thousand files in a single U-SQL job using so-called file sets. In addition, we have improved the handling of many small files (see Figure 2) by grouping up to 200 files of at most 1 GB of data into a single vertex (in preview). Note that the extractor will still handle one file at a time, but now the vertex creation is amortized across more data and the system uses considerably fewer resources (Analytics Units), and costs less, during the extraction phase.

Figure 3 shows on the left how a job spent about 5 seconds to create a vertex and then processed the data in less than 1 second and used many more vertices. In the image on the right, with the improved files sets handling, it was able to spend much fewer resources (AUs) and made better use of the allocated resources.


Figure 3: Job Vertex usage when extracting from many small files: without file grouping and with file grouping

Customers who use these file set improvements have seen up to a 10-fold performance improvement in their jobs resulting in a much lower cost, and they can process many more files in a single job!

We still recommend though to maximize the size of the files for processing, because the system will continue to be even better with large files.

Plus today, ADLA now natively supports Parquet files

U-SQL offers both built-in native extractors to schematize files and outputters to write data back into files, as well as the ability for users to add their own extractors. In this latest release, ADLA adds a public preview of the native extractor and outputter for the popular Parquet file format and a “private” preview for ORC, making it easy to both consume and produce these popular data formats at large scale. This provides a more efficient way to exchange data with the open source Big Data Analytics solutions such as HDInsight’s Hive LLAP and Azure DataBricks than previously using CSV formats.

Head over to our Azure Data Lake Blog to see an end-to-end example of how we put this all together to cook a 3 TB file into 10,000 Parquet files and then process them both with the new file set scalability in U-SQL and query them with Azure Databricks’ Spark.

But wait, there’s more!

There are many addition new features such as  a preview of dynamically partitioned file generation (the user voice top ask!) a new AU modeler to optimize your jobs cost/performance trade-off, the ability to extract from files that use the Windows code pages, the ability to augment your script with job information through @JobInfo, and even a light-weight, self-contained script development with in-script C# named lambdas and script-scoped U-SQL objects. Go to the spring release notes to see all of the newly introduced capabilities.

Read More..

11 Comments on “Process more files than ever and use Parquet with Azure Data Lake Analytics”

  1. Hey there I am so excited I found your site, I really found you by accident,
    while I was browsing on Bing for something else, Anyways I am here now and would just like to say cheers for a marvelous post and a all round entertaining blog (I also love the theme/design), I don’t have time to browse it all at the minute but
    I have bookmarked it and also added in your RSS feeds,
    so when I have time I will be back to read a lot
    more, Please do keep up the superb work.

  2. Woah! I’m really digging the template/theme of this website.
    It’s simple, yet effective. A lot of times it’s challenging to
    get that “perfect balance” between usability and visual appearance.
    I must say that you’ve done a excellent job with this.
    Also, the blog loads super quick for me on Chrome.

    Outstanding Blog!

  3. Hello, i think that i saw you visited my website thus
    i came to “return the favor”.I’m trying to find things to improve my web site!I suppose its ok to use a few of your

  4. I like the valuable info you provide in your articles. I’ll bookmark your blog and check again here regularly.
    I’m quite certain I’ll learn many new stuff right here!
    Good luck for the next!

  5. Greetings! Quick question that’s totally off topic.
    Do you know how to make your site mobile friendly?

    My web site looks weird when viewing from my apple iphone.
    I’m trying to find a template or plugin that might
    be able to resolve this issue. If you have any suggestions, please share.
    With thanks!

  6. Wow, fantastic blog layout! How long have you been blogging for?

    you made blogging look easy. The overall look of your web site is
    wonderful, as well as the content!

  7. This is really interesting, You’re a very skilled blogger. I’ve joined your rss feed and look forward to seeking more of your wonderful post. Also, I’ve shared your web site in my social networks!

Leave a Reply

Your email address will not be published. Required fields are marked *