Post Thumbnail

Process Large Data from Excel File with a Breeze

1. Overview

Doing bulk upload of data from excel files reliably can be a nut-job at times. This is how to tame the beast with a simple JavaScript library.

Instead of first uploading the entire file to the server and then processing it, this library will save you some bandwidth by processing the file on the local machine and uploading the extracted data (which is what you actually want) to the server in a batch.

Because it does the upload in batches, issues such as connection timeout are taken care of. In addition to this, it reports the data that cause an exception on the server and make them available for download as an Excel file.

2. The Problem

I need to do a bulk upload of user records from an excel file ranging from 3000 to 16000 rows. Each excel record (a row) will be used to create a user account in the system. This involves trips to more than one database tables, sending email, data validation and one or more loops.

The first technique that comes to mind is to look for an Excel library, upload the excel file to the server, process the excel file and finally return success or error message — all in one HTTP Request!

The shortcomings of this method are:

  • HTTP Request timeout, especially with programming languages that are not multithreaded
  • Unreliable error reporting

and so on.

3. The Solution

What I did is to extract the data with JavaScript on the client browser and push these data to the server in batches.

On the server-side of things, the data received is processed in a try ... catch block and data that raise an exception is stored in an error array. Once, the server is done processing the current batch it sends an OK response to the client, and alongside, the error data — if any.

One other technique that is key to this process is Database Transaction Management. All database interactions for a particular record is done in the confines of a DB transaction - on the server side. If an error occurred during database interaction, the transaction is rolled back otherwise it is committed.

In programming, there is usually more than one way to achieve a feat, each with its own shortcomings. However, this technique is reliable in terms of error reporting and HTTP connection management.

4. Using The excel_uploader.js JavaScript Library

I implemented the solution above in a simple JavaScript file called excel_uploader.js. To use the JavaScript file in a project: we need to get the library from its Github repo and include it in the project alongside jQuery, xlsx.min.js, FileSaver.js and optionally bootstrap 4 and sweetalert.js:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
<!-- required -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>

<!--- optional -->
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/js/bootstrap.min.js"></script>
<script src="https://unpkg.com/sweetalert/dist/sweetalert.min.js"></script>

<!-- needed by older browsers -->
<script src="https://github.com/eligrey/Blob.js"></script>

<!-- Required -->
<script src="https://rawgit.com/eligrey/FileSaver.js/src/FileSaver.js"></script>
<script src="https://unpkg.com/xlsx/dist/xlsx.full.min.js"></script>
<script src="https://cdn.rawgit.com/SeunMatt/excel_uploader/4f4ebd93/src/excel_uploader.js"></script>

Note that Blob.js is required on older browsers for FileSaver.js to work.

Next, we’ll add this to the HTML page we want the excel upload to take place:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
<script>
    $(document).ready( function () {
        new ExcelUploader({
            maxInAGroup: 1000,
            serverColumnNames: ["Name", "Email", "Phone Number"],
            importTypeSelector: "#dataType",
            fileChooserSelector: "#fileUploader",
            outputSelector: "#tableOutput",
            extraData: {_token: "23333323323223323232"}
        });
    });
</script>

From the snippet above, we instantiate the ExcelUploader object with configuration parameters. Following is an explanation of each parameter and what it does.

maxInAGroup: This parameter controls how many records to be sent to the server per time. The default is 1000 so if there are 5000 records in the source excel file, the data will be uploaded to the server in 5 batches. It can be fine-tuned to our taste base on available resources.

serverColumnNames: This is an array of the columns the server is expecting. Every time excel_uploader pushes data to the server, it also sends a column_map.

The column map is generated by using the lowercase form of the elements in serverColumnNames as the key and the index of the corresponding excel column as value.

For example, if the server is expecting the uploaded excel data to contain 3 columns — Name, Email and Phone Number.

The user will first upload an excel file, then map which column represents what data in the Excel. Then excel_uploader will compose the server column names using the lowercase of the serverColumnNames, replacing white spaces with an underscore (_) character.

So the server at the end will receive a JSON payload like this:

1
2
3
4
5
6
7
8
9
{
  "column_map":{"name":0,"email":2,"phone_number":1}, 
  "data":[ 
        ["name1", "08010xxx", "[email protected]"],
        ["name2", "08020xxx", "[email protected]"],
        ["name3", "08030xxx", "[email protected]"],
        ...
     ]
}

Which can be interpreted as the column name is at index zero (0) of the data, email is at index two (2) and phone_number is at index 1.

importTypeSelector: This is the jQuery selector for the HTML select field. The value of the Select options is the actual URL to send the excel data to if that option is selected. This way it’s possible to use the same page for multiple uploads. Just add the destination URL as an option in the Import Type Selector.

fileChooserSelector: This is the jQuery selector for the HTML file input field. For actually choosing the excel file from local storage by the user.

outputSelector: This is the jQuery selector for the HTML div that is used for displaying progress, status and error messages.

extraData: This is the data to be sent to the server alongside the uploaded data and column mapping. For example, CSRF token. It is an optional parameter.

Now that we have looked at the options, we must have noticed we need to add some markups to our web page for this to work.

Hence the need for the following HTML:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
<!--This is the import type selector-->
<div class="form-group">
    <select class="form-control" id="dataType">
        <option value="-1" disabled selected>Select Data to Import</option>
        <option selected value="http://localhost/excel_uploader/demo/php/demo.php">Import Posts Data</option>
    </select>
</div>

<!-- This is the file chooser input field-->
<div class="form-group">
    <input type="file" id="fileUploader" class="btn btn-fill btn-primary btn-large" />
</div>

<!-- This is the Blank output/progress div-->
<div id="tableOutput"></div>

The complete HTML can be found on the demo.html page.

NOTE: The respective Ids of the HTML elements were used to configure the ExcelUploader object in the JavaScript snippet above. That’s all we need to do on the front-end. Simple isn’t it?

5. A Little Bit of Server-Side Processing

excel_uploader.js library is server-side agnostic! It will work with almost any server-side technology. It only needs to know the URL to push data via the importTypeSelector and that’s all.

It will push data to the server in a POST request using jQuery ajax. The POST request payload will contain two compulsory entries: data and column_map.

data: This an array of an array - representing the data from the Excel file uploaded by the user. It is in JSON format and should be decoded to a key-value data structure such as an associative array in PHP or a Map in Java or its equivalent in other server-side programming languages.

The outer array represents the whole, while the inner array represents the data on each row of the Excel file. Hence the number of the length of data[0] will is the number of columns mapped during upload while the length of data array itself is the total number of rows/records available in the uploaded file.

This is an example data received in a post request:

1
2
3
4
5
6
7
8
9
{
  "column_map":{"name":0,"email":2,"phone_number":1}, 
  "data":[ 
        ["name1", "08010xxx", "[email protected]"],
        ["name2", "08020xxx", "[email protected]"],
        ["name3", "08030xxx", "[email protected]"],
        ...
     ]
}

5.1. Accessing the Data

We understood, from the previous section, that the column_map associate column names to their corresponding index in the arrays contained in data.

We can, therefore, access the data by parsing the JSON payload to a key-value data structure like an associative array (PHP) or a Map (Java).

In PHP CodeIgniter, we can do this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
$columnMap = json_decode($this->input->post("column_map"), true);

//to get the name of the first row
$name = $data[0][$columnMap["name"]];

//to get the email of the first row
$email = $data[0][$columnMap["email"]];

//to get the phone_number in first row
$phoneNumber = $data[0][$columnMap["phone_number"]];

//we can definitely use a loop to process all the data

If you noticed, from the snippet above, we did not have to memorize the corresponding index of each column in the data array. We simply use the column_map which is much easier.

The same thing is applicable if the server-side is using Java programming language in a framework like Spring Boot:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
//extract the index of the columns we're expecting from the map
int nameIndex = Integer.parseInt(columnMap.get("name").toString());
int emailIndex = Integer.parseInt(columnMap.get("email").toString());
int phoneIndex = Integer.parseInt(columnMap.get("phone_number").toString());

//save a user
User user = new User();
user.setName(data.get(nameIndex));
user.setEmail(data.get(emailIndex));
user.setPhone(data.get(phoneIndex));
user.setPassword("newHashedPassword");
User savedUser = userRepository.save(user);

The complete demo code for PHP and Java can be found here.

5.2. Processing the Data

One of the best ways to handle the incoming POST data is to have two methods (or functions).

One in which the POST payload will be decoded and the data array will be looped over let’s call it uploadUserData().

In this function, there will be an $errorArray variable which itself must be an array of array. The data array, decoded from the POST payload, will be looped over and each element of that data (corresponding to a row on the excel file) will be processed by another method doUserUpload().

The method doUserUpload will perform database interactions in a try...catch block wrapped in a database transaction. If any of the database interaction fails, it will rollback the database changes and return false otherwise it returns true.

If doUserUpload returns false, the particular record/row under processing will be added to $errorArray

Example implementation of uploadUserData() in PHP:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
$errorArray = [];

//this is where the data will be handled and process
foreach ($data as $datum) {
   if(!empty($datum) && !$this->doUserUpload($datum,  $columnMap)) {
     //that means there was an error processing the file
     //so we will just add this $datum to the errorArray
     array_push($errorArray, $datum);
   }
}

//....

//after all the data has been processed, let's respond to the client
if(!empty($errorArray)) {
    //there was an error somewhere. Let's send the client the affected data
    $this->respond(json_encode(["data" => $errorArray]));
} else{
    $this->respond(json_encode(["success" => "OK"]));
}

The complete example server-side processing can be found in demo.php. The corresponding Java demo can also be found here.

5.3. Server Response

The key step to this flow is the server response. The server is expected to respond in JSON and the content-type should be application/json.

If any of the uploaded data result in an error while processing, and it’s added to the errorArray, the resulting error Array should be sent to the excel_uploader.js, otherwise, a success message should be sent:

1
2
3
4
5
6
if(!empty($errorArray)) {
    //there was an error somewhere. Let's send the user some data that are affected.
    $this->respond(json_encode(["data" => $errorArray]));
} else{
    $this->respond(json_encode(["success" => "OK"]));
}

Or in Java:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
if(errors.isEmpty()) {
    //all is well during the process
    responseMap.put("success", "OK");
    return ResponseEntity.ok(responseMap);
}

//there are errors and some data are not processed.
// Let's send those data back to the client
//note that we're sending the OK response 200
//it's just that we're appending some error data that might have been present
responseMap.put("data", errors);
return ResponseEntity.ok(mapper.writeValueAsString(responseMap));

If we noticed an invalid data is being sent and we’ll like to notify the user, we can send a JSON response with an error entry and a 200 status code - because we actually handled the error ourselves:

1
2
3
4
if(count($data[0]) < 3 || count($columnMap) < 3) {
    $this->respond(json_encode([
      "error" => "The Data Seems not to be complete! " . count($columnMap) . " columns were mapped instead of 3"]), 200);
}

Note: the JSON response should have a STATUS code of 200 for excel_uploader.js to be able to process it.

If there were any other server-generated errors, they will be handled in the fails() callback of the ajax method.

6. The Soup is Ready!

With just a little bit of JavaScript, a sprinkle of config and HTML, we are good to go. The good thing is that we can use the same page to upload data to many endpoints even with an additional payload. Just add the URL as another option in the importTypeSelector and we are good.

I want to hear about your experiences integrating this simple tool into your project. If you find any bug, kindly report it to Github.

Github: [https://github.com/SeunMatt/excel_uploader](Github: https://github.com/SeunMatt/excel_uploader)

Seun Matt

Results-driven Engineer, dedicated to building elite teams that consistently achieve business objectives and drive profitability. With over 9 years of experience, spannning different facets of the FinTech space; including digital lending, consumer payment, collections and payment gateway using Java/Spring Boot technologies, PHP and Ruby on Rails