A short look at the latest developments in Big Data, AI, machine learning, IoT, cloud, and more.
“The thing that’s going to make artificial intelligence so powerful is its ability to learn, and the way AI learns is to look at human culture.” – Dan Brown
Many companies and organizations are now on the journey to AI. Most of them have reached the data collection stage: they know how to build fast and robust data pipelines and have created huge data warehouses and data lakes. But now they’re trying to apply machine learning models and algorithms to this data, and solving this problem has proven painful for many big data players.
Many talks suggested moving from a “process-driven” to a “data-driven” organization. Listening to our customers, analyzing our internal metrics, and finding insight in all our data are required to stay competitive nowadays, and that’s where machine learning (“ML”) is expected to help.
“Big data knows and can deduce more about you than Big Brother ever could.” – Toomas Hendrik Ilves
DataWorks Summit is a huge conference with about 70 track sessions, crash courses, and birds-of-a-feather sessions. However, each of them falls into one of the groups below.
- Data Acquisition and Data Quality — this is the start for any big data company. You must know how to gather data, create ETL processes, and guarantee data quality. Apache NiFi was the most used tool for data transfer automation. Apache Spark was quite popular, and different aspects of it were discussed, such as new features in Spark 3.0, running it on Kubernetes, and using it for machine learning.
- Data Security — data privacy and security are central to any big data organization. Many sessions were related to secure data storage, secure traffic, roles and permissions management, and the like. Regulations like GDPR and HIPAA must be carefully and thoroughly enforced. Many talks predicted a huge need for cybersecurity specialists in a few years. The hottest tools in this space were Apache Ranger, Apache Metron, and Apache Knox.
- Enterprise Data Pipelines — big companies like IBM and Cloudera are trying to simplify the big data and ML journey for enterprises. They provide “everything you need” platforms, where you can establish a complex data acquisition, processing, and analysis pipeline with almost zero coding skills. Additionally, they provide secure data storage solutions.
- Machine Learning — everyone is now trying to solve the problem of applying ML to their data. Huge companies have huge data warehouses and are looking for new ways to glean insight from them. They’re all building ML pipelines, and they all do it differently, so a large zoo of technologies is in play here.
- DataOps — a much-needed new profession is emerging today, to handle lots of infrastructure work related to data and ML pipelines. In most cases, data scientists don’t have enough knowledge for this work since it requires expertise in so many areas, like networking, cloud technologies, CI/CD tools, etc.
- Streaming — companies are moving towards real time. They are replacing traditional batch processing with streaming tools. Apache Kafka is the most used instrument in this area, complemented by Spark Streaming. Apache Druid, a high-performance real-time analytics database, was mentioned several times. We are expecting a huge technological boost related to streaming and time series processing, especially with the constant growth of IoT companies. And this raises a question: how can we apply and improve ML models for streaming data?
- Data Pipeline Testing — this is the logical endpoint of any data pipeline’s evolution: how can we guarantee a pipeline’s quality, and how can we automate that process? Today, there’s no simple way to do this. You must control each step, from unit testing SQL queries and Spark jobs to performing high-load infrastructure tests. More and more tools will emerge in this area.
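To make the "unit testing SQL queries" step from the last point a bit more concrete, here is a minimal sketch. It is not something shown at the summit: the `orders` table, the `TOTALS_SQL` query, and the `run_query` helper are all made up for illustration, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Hypothetical query under test: total order amount per customer.
TOTALS_SQL = """
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id
ORDER BY customer_id
"""


def run_query(rows):
    """Load fixture rows into a throwaway in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        return conn.execute(TOTALS_SQL).fetchall()
    finally:
        conn.close()


def test_totals_are_grouped_per_customer():
    # Two orders for customer 1 should collapse into a single total.
    fixture = [(1, 10.0), (2, 5.0), (1, 2.5)]
    assert run_query(fixture) == [(1, 12.5), (2, 5.0)]


def test_empty_table_returns_no_rows():
    # An empty input table should yield an empty result, not an error.
    assert run_query([]) == []
```

Tests like these can be collected by a runner such as pytest and executed on every commit, which covers the smallest step of the chain. Spark jobs and high-load infrastructure tests need heavier tooling than this sketch.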
Inspired by the possibilities of ML and AI, we’d like to list a few points from the vision of the future presented by futurist Sophie Hackford.
- Platforms for intelligent avatars — everyone will have an avatar, which will help to simplify our human life. Their goal will be to solve all complex problems in finance, law, insurance, and other areas.
- Digital Immortality — there are so many questions about digital immortality now. Should we remove all of a person’s digital resources, like a Facebook page or a Twitter account, after their death?
- Human source code — the idea of representing a human as source code, as we can for computer programs today. This would enable us to do so many unbelievable things, including teleportation.
- Infinity machines and quantum computers — they are coming and with them we’ll be able to solve many complex problems in genetics, physics, security, etc.
I hope you enjoyed this little summit report. #everythingwillbebigdata