It is not uncommon in the enterprise world for applications to consume data from files that run to millions of records and often contain duplicates. The requirement is typically to filter the data and insert it into a database or a queue for further processing. Camel has a component called Idempotent Consumer for exactly this purpose. Combined with the File component, let us examine how efficient it can be.
We are going to use a file with 1 million orders (111,112 of the records are duplicates), enclosed by an orders tag as shown below. We will read this file, split it, and then insert the order records into the Order table and the product records into the Product table.
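A minimal sketch of the assumed file layout (the element and field names here are illustrative assumptions; only the enclosing orders tag and the order and product records come from the description above):

<orders>
    <order>
        <orderId>1</orderId>
        <product>
            <productId>P-100</productId>
            <name>Sample product</name>
        </product>
    </order>
    <!-- roughly 1 million <order> records in total, including duplicates -->
</orders>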
Now, obviously, loading this file completely into memory to process it is not always a feasible approach, so Camel provides us with the Splitter EIP along with its streaming capability. This reads the file chunk by chunk, based on the token provided as the delimiter, and streams the results (the advantage being that no XML model is loaded into memory; we receive each chunk as a plain string).
The code snippet reads the XML file, splits it at each order, unmarshals the tokenized XML into an Order object, and aggregates 1,000 orders before sending them to the next endpoint. Observe that two completion strategies are provided to the aggregator: a completion size and a completion timeout. The timeout ensures that the aggregator does not wait indefinitely for the configured 1,000 records to arrive (which would make the program hang forever if they never come) and instead completes once the configured milliseconds have elapsed.
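The original snippet is on GitHub; a minimal sketch of such a route in Camel's Java DSL might look like the following. The file location, JAXB package, endpoint name and the 5-second timeout are assumptions, and the AggregationStrategy import shown is the Camel 2.x location.

import java.util.ArrayList;
import java.util.List;

import org.apache.camel.Exchange;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.processor.aggregate.AggregationStrategy; // Camel 2.x package

public class OrderBatchRoute extends RouteBuilder {

    // Collects the unmarshalled order bodies into a List so they can be batch-inserted.
    static class ListAggregationStrategy implements AggregationStrategy {
        @Override
        public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
            if (oldExchange == null) {
                List<Object> batch = new ArrayList<>();
                batch.add(newExchange.getIn().getBody());
                newExchange.getIn().setBody(batch);
                return newExchange;
            }
            List<Object> batch = oldExchange.getIn().getBody(List.class);
            batch.add(newExchange.getIn().getBody());
            return oldExchange;
        }
    }

    @Override
    public void configure() throws Exception {
        from("file:data/inbox?fileName=orders.xml&noop=true")      // assumed file location
            // stream the file, emitting one <order>...</order> chunk at a time
            .split().tokenizeXML("order").streaming()
                .unmarshal().jaxb("com.example.orders")             // assumed JAXB package
                // group 1,000 orders per batch before handing them to the DB route
                .aggregate(constant(true), new ListAggregationStrategy())
                    .completionSize(1000)
                    // complete a partial batch after 5 seconds so it is never stuck waiting
                    .completionTimeout(5000)
                    .to("direct:saveToDb");                         // assumed DB endpoint
    }
}

Newer Camel releases also ship a grouped-body aggregation strategy out of the box that could replace the hand-rolled one above.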
So the above code reads a file, splits the XML into tokens, and aggregates them in batches of 1,000 before sending them to the DB endpoint. What about the filtering of records that we talked about? Would you believe that adding four lines of code now enables us to do so? That is exactly how powerful Camel can get.
Just add the value that needs to be unique (it could be a node in the XML, or the value of a node) to a header, add an idempotent repository component, and provide that header to the repository as the key used for duplicate filtering. And just like that, we have code that reads a large file, filters out duplicate orders, and sends the rest to a different endpoint for further processing, as sketched below.
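A minimal sketch of those extra lines, slotted into the route above right after the unmarshal step. The orderId header and the getOrderId() accessor on the Order bean are assumptions; the in-memory repository used here lives in org.apache.camel.processor.idempotent in Camel 2.x (the package differs in Camel 3.x).

// inside the split, immediately after .unmarshal().jaxb(...)
.setHeader("orderId", simple("${body.orderId}"))    // the value that must be unique
.idempotentConsumer(header("orderId"),
        MemoryIdempotentRepository.memoryIdempotentRepository(1_000_000))
// only non-duplicate orders continue on to the aggregator and the DB endpoint

By default the idempotent consumer simply skips duplicates, which is exactly the filtering behaviour described here.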
I was able to achieve about 10,582 records inserted into the DB per second, and I have not even started looking into reducing memory usage. Hit me with your comments.
For the complete code and the processing details, refer to my GitHub link here.