This post tries to answer three questions:

- What's the total parallelism of the whole Cloud Composer cluster?
- How to choose the right virtual machine type, and how to configure Apache Airflow to fully utilize the allocated resources?
- What are the most important Cloud Composer performance metrics to monitor?

Let's begin with the basic Apache Airflow unit of work: the task. There are two main kinds of tasks: operators and sensors. In my Cloud Composer installation, operators are mainly responsible for creating ephemeral Dataproc clusters and submitting Apache Spark batch jobs to those clusters. The sensors, in contrast, wait for BigQuery data, the payload for the Spark jobs.

From the performance perspective, operators are much more resource-heavy than sensors. BigQuery sensors are short-lived tasks if configured in reschedule mode. The sensor checks for the data, and if the data exists, the sensor quickly finishes. If the data isn't available yet, the sensor finishes as well, but it's also rescheduled for the next execution after the configured interval.

On the contrary, a Spark operator allocates resources for the whole Spark job's execution time, which could take several minutes or even hours. For most of that time the operator doesn't do much more than check the Spark job's status, so it's a memory-bound process. Even if there is a shortage of CPU time slots on the Cloud Composer worker, the negative impact on the Spark job itself is negligible. So for further capacity planning we should mainly count the memory allocated by operators and add some safety margin for the sensors.

How to check how much memory is allocated by the operators? It's not an easy task: Cloud Composer workers form a Kubernetes cluster. You could try to connect to the Kubernetes airflow-worker pod or … run a dozen tasks and measure the real resource utilization. I would opt for the second, more practical option.

For 12 concurrently running operators, the workers' memory utilization increased from the steady state of 1.76GiB to 4.36GiB. We have the first insight of our tuning journey: every operator allocates approximately (4.36GiB − 1.76GiB) / 12 ≈ 220MiB of RAM.

Whaaaat – 220MiB allocated just for a REST call to the Dataproc cluster API? Unfortunately, it isn't only the remote call. The architecture of Apache Airflow is quite complex: every task is executed as a Celery worker with its own overhead, and every task also needs a connection to the Apache Airflow database, which consumes resources as well. The memory usage may also vary between different Apache Airflow operators, so it's important to measure the task memory usage yourself – all further calculations heavily depend on it.
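A reschedule-mode BigQuery sensor as described above might look like this. This is a minimal sketch, not my production code: it assumes the `BigQueryTableExistenceSensor` from the Google provider package, and the project, dataset, and table names are hypothetical placeholders.

```python
from airflow.providers.google.cloud.sensors.bigquery import (
    BigQueryTableExistenceSensor,
)

# Hypothetical identifiers -- replace with your own project/dataset/table.
wait_for_payload = BigQueryTableExistenceSensor(
    task_id="wait_for_payload",
    project_id="my-project",
    dataset_id="my_dataset",
    table_id="daily_payload",
    mode="reschedule",    # release the worker slot between checks
    poke_interval=300,    # re-check every 5 minutes
    timeout=6 * 60 * 60,  # give up after 6 hours
)
```

In `reschedule` mode the sensor task exits between pokes instead of occupying a worker slot, which is exactly why such sensors stay short-lived and cheap compared to operators.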
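The first option mentioned above, connecting to the Kubernetes airflow-worker pod, could look roughly like this. The namespace and pod-name suffix are placeholders you'd have to fill in for your own cluster:

```shell
# Find the worker pods (namespace name is environment-specific):
kubectl get pods --namespace <composer-namespace> | grep airflow-worker

# Per-pod CPU/memory as reported by the metrics server:
kubectl top pod airflow-worker-<hash> --namespace <composer-namespace>

# Or open a shell inside a worker to inspect processes directly:
kubectl exec -it airflow-worker-<hash> --namespace <composer-namespace> -- /bin/bash
```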
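If you'd rather sample memory from inside a task than from Kubernetes, the Python standard library gives a rough number. This is my own sketch, not part of the original setup, and it only sees the task's Python process – the Celery worker and database-connection overhead discussed above are not included:

```python
import resource
import sys

def peak_rss_mib() -> float:
    """Peak resident set size of the current process, in MiB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    if sys.platform == "darwin":
        rss /= 1024
    return rss / 1024

# Log this at the end of an operator's execute() to sample per-task usage.
print(f"peak RSS: {peak_rss_mib():.1f} MiB")
```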
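The back-of-the-envelope numbers above are easy to reproduce. A sketch with the measured values, plus a capacity estimate for a worker with 8 GiB of RAM – the 8 GiB figure is my assumption for illustration, not one of the measurements:

```python
# Measured on the Cloud Composer workers: steady state vs. 12 running operators.
steady_state_gib = 1.76
loaded_gib = 4.36
concurrent_operators = 12

per_operator_mib = (loaded_gib - steady_state_gib) * 1024 / concurrent_operators
print(f"per operator: ~{per_operator_mib:.0f} MiB")  # ~222 MiB, i.e. roughly 220 MiB

# Capacity-planning sketch for a hypothetical worker with 8 GiB of RAM:
worker_gib = 8.0
fits = int((worker_gib - steady_state_gib) / (per_operator_mib / 1024))
print(f"operators that fit: {fits}")
```

Remember the earlier caveat: substitute your own measured steady-state and per-operator numbers, and keep a safety margin for the sensors.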