Apache NiFi – an awesome integration tool

Apache NiFi is a tool which the US government created for ingesting large amounts of data on its citizens from a variety of sources. The requirement was to support real-time data integration and enrichment. NiFi was subsequently released as open source; the website is at https://nifi.apache.org/.

Having well over a decade of experience performing integration and ETL (Extract, Transform, Load) work, I was very pleasantly surprised at the range of adaptors and transforms (processors, in NiFi terminology). The UI is consistent and very professionally created, something that is often less polished in open source software.

In general, when performing integration work it is preferable to use tooling over custom code for each ingest. Using standard tooling enables:

  • Quicker development (once past the initial learning, which is straightforward).
  • More maintainable solutions.
  • Greater efficiency through reduced development time.
  • Standardised work, allowing devs to move between teams with less to relearn.
  • Less friction when transitioning work between devs and dev-ops.

A common question is what the difference is between Apache NiFi and Apache Kafka. Both are designed for efficient integrations (though Kafka can also be used as a message bus) and utilise flow-based programming, but each approaches integration from a different end: NiFi assumes integrations will vary and has an array of adaptors to accommodate this, whilst Kafka requires all processes to fit its interfaces. It strikes me that NiFi is better suited to external integrations where you (typically) do not control the source; for internal integrations where both source and destination are owned, the two perform similar roles (though I have not used Kafka, so am open to learning otherwise).

The real power of NiFi comes from it being clearly built from the ground up by computer scientists; its strengths include:

  • Flow-based programming, with flow control and recovery, giving high performance and ease of transformation.
  • Immense configurability, with an array of adaptors, protocols and transformations supported, including those for alerting; see the left-hand side of https://nifi.apache.org/docs.html
  • Extensibility – this is at the heart of the tool: a single Java class will create a processor (see the sketch after this list; in a future post I'll show how simple this is in practice).
  • Crucially, it processes in memory for speed whilst being durable for recovery.
  • Statistics, logging and dev-ops support.
  • Multi-tenancy-aware ingests – the clusters are high performance, not just many servers doing little, so one can get high throughput on a few machines.
  • Clustering – in a cluster configuration a machine/JVM failure does not fail the job; the work is distributed to the other active nodes.
  • Latency vs throughput can easily be tilted to favour one or the other with a slider (representing time) that aggregates events for efficiency if desired.
  • It can join an ingest back to the data store (e.g. HDFS) to validate results, using federated queries across the cluster!
  • Back-pressure – if there is too much work to do, rather than suffering an internal denial-of-service, NiFi simply halts the preceding processors until the queues start to empty.
  • Each subprocess can be paused and the data (flow files) in the queue inspected.
  • Passwords are stored securely outside of the template and must be re-entered on deployment.
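
To give a flavour of the extensibility point above, here is a minimal sketch of a processor class. The class name and attribute are my own invention, and a real processor also needs a META-INF/services/org.apache.nifi.processor.Processor entry plus NAR packaging, but the heart of it really is this small: take a FlowFile from the incoming queue, work on it, route it onwards.

    import java.util.Collections;
    import java.util.Set;

    import org.apache.nifi.annotation.documentation.CapabilityDescription;
    import org.apache.nifi.annotation.documentation.Tags;
    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;

    @Tags({"example"})
    @CapabilityDescription("Stamps each FlowFile with an attribute (illustrative only).")
    public class StampAttributeProcessor extends AbstractProcessor {

        // Downstream connections in the flow are drawn against named relationships.
        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success")
                .description("FlowFiles that were stamped")
                .build();

        @Override
        public Set<Relationship> getRelationships() {
            return Collections.singleton(REL_SUCCESS);
        }

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session) {
            FlowFile flowFile = session.get();  // next FlowFile off the incoming queue
            if (flowFile == null) {
                return;                         // nothing queued; the framework will call again
            }
            flowFile = session.putAttribute(flowFile, "stamped", "true");
            session.transfer(flowFile, REL_SUCCESS);
        }
    }

Pleasingly, the nifi-mock module (TestRunners in org.apache.nifi.util) makes such processors straightforward to unit test:

    TestRunner runner = TestRunners.newTestRunner(new StampAttributeProcessor());
    runner.enqueue("hello".getBytes());  // queue a FlowFile with this content
    runner.run();
    runner.assertAllFlowFilesTransferred(StampAttributeProcessor.REL_SUCCESS, 1);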

If you want a great quick video overview of the broad aims and architecture of NiFi, consider watching the video entitled OSCON 2015 at https://nifi.apache.org/videos.html

The getting-started guide from NiFi is comprehensive for its aims and not too long; if this has piqued your interest I heartily recommend you read and try it: https://nifi.apache.org/docs/nifi-docs/html/getting-started.html

If you do use NiFi, things you will want to consider are (most are common to integration tools in general):

  • Create templates for integrations and reusable processes.
  • Back up the templates in source control; they are easy to export as XML (see the sketch after this list).
  • When you create services which NiFi depends upon, create a build process and/or document them so you can easily spin up another node.
  • Always use process groups so you can separate different integrations.
  • Test services on all the nodes to ensure jobs can fail over successfully to nodes other than the one originally tested.
  • The documentation is detailed and worth checking, as attribute names can occasionally be misleading. What is great is that, within the UI, you can access the documentation of any processor.
  • Try to avoid using merges or locking mechanisms which go against the flow (pardon the pun) of flow-based programming.
  • Separate the flows even if you wish to perform the same subsequent action; doing so makes debugging much easier, as you can pause the downstream processor and examine the queue to understand what is happening.
  • Name processors, to aid understanding of the process groups.
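
On backing up templates: as well as exporting them through the UI, the template XML can be pulled from NiFi's REST API, which makes the backup scriptable. A rough sketch, assuming an unsecured instance on localhost:8080 and a made-up template id (real ids can be listed via GET /nifi-api/flow/templates):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class TemplateBackup {
        public static void main(String[] args) throws Exception {
            // Placeholder template id - substitute one listed by your own instance.
            String templateId = "3b80ba0f-a6c0-48db-b721-4dbc04cef28e";
            URL url = new URL("http://localhost:8080/nifi-api/templates/" + templateId + "/download");

            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            try (InputStream in = conn.getInputStream()) {
                Path out = Paths.get("templates", templateId + ".xml");
                Files.createDirectories(out.getParent());
                // The response body is the template XML; commit the written file to source control.
                Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
            } finally {
                conn.disconnect();
            }
        }
    }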

In summary, NiFi greatly simplifies integration work: real-time data is ingested at speed, with inbuilt status reporting and processors for alerting, and both integration developers and dev-ops can understand, debug and fix the workflow. It is quick to create new ingests which can scale to big-data sizes – it just works!