Speeding Up Bulk Imports in Rails

This article is about speeding up bulk imports in Rails. Let’s imagine we have 50,000 products to import… not an absurd amount, but enough to start noticing some pain, and enough that we wouldn’t want to keep the user sitting there while the system imports them all.

That is, unless we can figure out a way to import these records quickly.

The Data We’re Working With

We’ll be working with a Product model that has the columns sku, name, origin, and msrp_cents. To populate our database, we receive a nightly dump from the manufacturer, which arrives as a CSV file. The data looks like this:

0abcd,0 Product,China,3000
1abcd,1 Product,Mexico,4000
2abcd,2 Product,Canada,2000
3abcd,3 Product,USA,5000
4abcd,4 Product,Canada,4000

Here is the schema:

create_table "products", force: :cascade do |t|
  t.string   "sku",        null: false
  t.string   "name",       null: false
  t.string   "origin",     null: false
  t.integer  "msrp_cents", null: false
  t.datetime "created_at", null: false
  t.datetime "updated_at", null: false
end

add_index "products", ["sku"], name: "index_products_on_sku", unique: true, using: :btree

The data is obviously fake, but that doesn’t matter; the principle is the same. The only validation our model has is that the sku field must be unique and the other fields can’t be null.
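
The model at this point looks roughly like the sketch below (the exact presence checks are an assumption based on the description above, backed up by the null: false columns in the schema):

class Product < ActiveRecord::Base
  # sku must be unique; the remaining fields just have to be present
  # (mirroring the null: false constraints in the schema above).
  validates :sku, uniqueness: true
  validates :name, :origin, :msrp_cents, presence: true
end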

Attempt 1 – The original (59 seconds)

My first attempt was the most straightforward one… nothing fancy is going on. I’m taking the CSV file, looping over the records, and saving them.

task import_products1: :environment do
  require 'csv'

  CSV.foreach("#{Rails.root}/tmp/products.csv") do |p|
    product = Product.new(sku: p[0], name: p[1], origin: p[2], msrp_cents: p[3])
    product.save!
  end
end

Running this with the time program produces the following output:

rake speedy:import_products1  59.45s user 4.44s system 77% cpu 1:22.59 total

Is 59 seconds fast? Slow? At this point we don’t really know. But let’s take a look at what’s going on for each of the records.

(0.1ms)  BEGIN
Product Exists (0.6ms)  SELECT  1 AS one FROM "products" WHERE "products"."sku" = '123' LIMIT 1
SQL (0.3ms)  INSERT INTO "products" ("sku", "name", "origin", "msrp_cents", "created_at", "updated_at") VALUES ($1, $2, $3, $4, $5, $6) RETURNING "id"  [["sku", "123"], ["name", "n"], ["origin", "Russia"], ["msrp_cents", 1000], ["created_at", "2015-12-05 01:34:41.896705"], ["updated_at", "2015-12-05 01:34:41.896705"]]
(6.1ms)  COMMIT

Here is what’s happening step by step:

  1. BEGIN starts a database transaction. If there is a failure of any kind, it will run the ROLLBACK command to avoid finalizing these changes in the database.
  2. Next, because we have a uniqueness validation in the model, it queries to check whether this sku already exists.
  3. At this point it runs the actual INSERT statement, inserting the data into the database.
  4. Because there were no failures, it runs COMMIT to finalize all changes made within this database transaction.

What we don’t see is that there is a lot going on behind the scenes as well. Rails has to talk to the database over TCP four times for each record, and it has to build two different SQL statements each time. Luckily, that last part is quick thanks to Tenderlove’s work on AdequateRecord Pro, which caches the query construction.

Attempt 2 – Rely on DB for unique validation (44 seconds)

The first thing we’ll do to speed up the import is to rely on our database to ensure the skus are unique rather than on Rails. This allows us to skip running that extra SELECT statement for every record.

You should always have a unique index on any field you want to actually be unique. Rails will do its best, but it can’t guarantee uniqueness due to race conditions.

The migration to add this unique index looks like this:

add_index :products, [:sku], unique: true

If a duplicate sku does get inserted, the database will reject it and Rails will raise an ActiveRecord::RecordNotUnique exception, which you can rescue from. To skip this validation we could pass the save! method the option validate: false, but that would stop all validations from being run, not just the one we want to avoid.
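
Rescuing that exception in the import loop might look something like this sketch (it assumes the model-level uniqueness check has been skipped, as described next, and it simply ignores the duplicate row; you might prefer to log it or update the existing record):

CSV.foreach("#{Rails.root}/tmp/products.csv") do |p|
  product = Product.new(sku: p[0], name: p[1], origin: p[2], msrp_cents: p[3])
  begin
    product.save!
  rescue ActiveRecord::RecordNotUnique
    # The unique index rejected a duplicate sku; skip this row and move on.
  end
end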

Below is a technique for avoiding a single validation. It creates an attr_accessor called skip_uniqueness, which we use in conjunction with the unless option that we can pass to the validates method.

class Product < ActiveRecord::Base
  attr_accessor :skip_uniqueness
  validates :sku, uniqueness: true, unless: :skip_uniqueness

  def save_for_bulk_import!(options = {})
    self.skip_uniqueness = true
    save! options
  end
end

Using this technique, we created our own version of the save! method which sets the skip_uniqueness attribute to true, thereby skipping the uniqueness validation.

And finally, here’s the code that was run to produce this result:

task import_products2: :environment do
  require 'csv'

  CSV.foreach("#{Rails.root}/tmp/products.csv") do |p|
    product = Product.new(sku: p[0], name: p[1], origin: p[2], msrp_cents: p[3])
    product.save_for_bulk_import!
  end
end

Now after running the import, we get a time of 44 seconds… about 15 seconds faster than the original!

rake speedy:import_products2  44.10s user 3.56s system 78% cpu 1:00.57 total

Attempt 3 – Wrap it in a transaction (34 seconds)

The way we have our code now, each of the 50,000 products gets its own database transaction. This is how Rails handles saving a model by default… unless of course we wrap it in our own transaction, a single one that we can use for all 50,000 records.

This means that if there is any failure, none of them will be inserted.

task import_products3: :environment do
  require 'csv'

  Product.transaction do
    CSV.foreach("#{Rails.root}/tmp/products.csv") do |p|
      product = Product.new(sku: p[0], name: p[1], origin: p[2], msrp_cents: p[3])
      product.save_for_bulk_import!
    end
  end
end

Now when we run the code we see that it is down to 34 seconds… another 10 seconds faster!

rake speedy:import_products3  34.14s user 1.47s system 89% cpu 39.696 total

Attempt 4 – Bulk insert (4.6 seconds)

Our code still has to connect to the database 50,000 times. I had a feeling this was the real reason for the slowness, because if I instantiate 50,000 products without saving them, the script only takes about 5 seconds.
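
That “instantiate only” check was essentially the following measurement (a sketch; Benchmark ships with Ruby’s standard library, and your numbers will differ):

require 'benchmark'
require 'csv'

# Build all of the Product objects, but never call save.
puts Benchmark.measure {
  CSV.foreach("#{Rails.root}/tmp/products.csv") do |p|
    Product.new(sku: p[0], name: p[1], origin: p[2], msrp_cents: p[3])
  end
}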

To solve this, let’s take advantage of something both MySQL and PostgreSQL support: Inserting multiple values with a single INSERT statement.

INSERT INTO films (code, title, did, date_prod, kind) VALUES
    ('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
    ('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');

As with many things in Ruby, there is a gem that helps us do this called activerecord-import. Now we’ll read all of the products into an array and insert them into the database all at once:

task import_products4: :environment do
  require 'csv'

  Product.transaction do
    products = CSV.read("#{Rails.root}/tmp/products.csv")
    columns = [:sku, :name, :origin, :msrp_cents]
    Product.import columns, products, validate: false
  end
end

This lowers our time all the way down to 4.6 seconds!

Keep in mind that in this case I’ve turned off validations entirely. As a result, you should be sure that your incoming data is valid, perhaps by running it through a cleansing phase before it gets to this point. That being said, even with the validations running, the import took about 25 seconds, which is the second fastest time I was able to get.

rake speedy:import_products4  4.64s user 0.30s system 75% cpu 6.527 total
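
For reference, that roughly 25-second run was essentially the same task with validations left on, something like this sketch (the task name here is made up):

task import_products4_with_validations: :environment do
  require 'csv'

  Product.transaction do
    products = CSV.read("#{Rails.root}/tmp/products.csv")
    columns = [:sku, :name, :origin, :msrp_cents]
    # Same bulk insert, but let activerecord-import run the model validations.
    Product.import columns, products, validate: true
  end
end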

Conclusion

Take advantage of your database! It’s very efficient at what it does, and by relying on its functionality, you’ll be able to figure out different ways to make your code more efficient.

In this article, we relied upon our not-null database constraints along with a unique index, and then we used the ability to insert multiple records with a single SQL query. All in all, we took the time to get 50,000 records into the database from 59 seconds all the way down to 4.6 seconds, around 12 times faster.


Join the Discussion


  • Adrian

    If you have a CSV file as the source and you are on PostgreSQL, you can take a look at COPY FROM, which would be the fastest way to put data into the database.

    • Leigh Halliday

      For sure! I was playing with that and have used it before but I ended up just sticking with what I had. Postgres is great because you can query and produce the result as CSV too, which is waaaay faster than CSV Builder or anything you can do in Ruby.
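
      Producing CSV straight from Postgres like that can go through the pg gem’s COPY support; a rough sketch (the query, output path, and HEADER option are assumptions):

      pgconn = Product.connection.raw_connection
      File.open("#{Rails.root}/tmp/products_export.csv", "w") do |file|
        pgconn.copy_data("COPY (SELECT sku, name, origin, msrp_cents FROM products) TO STDOUT CSV HEADER") do
          # Stream each CSV row from the server straight into the file.
          while row = pgconn.get_copy_data
            file.write(row)
          end
        end
      end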

  • Now after running the import, we get a time of 44 seconds… about 15 seconds slower than the original!

    Now when we run the code we see that it is down to 34 seconds… another 10 seconds slower!

    Both of these should say “faster” instead of “slower”, right?

    • Leigh Halliday

      Haha… you’re totally right! I think I was thinking about “less” and translated that to “slower”. I’ll ask Codeship to update the text :) Thanks!!

      • Glad I could help. Nice article BTW!

  • Janko Marohnić

    That’s why I really like Sequel instead of ActiveRecord – it has this built in! You can either do a multi-INSERT statement, or – the fastest option – use Postgres’ COPY command.


    require "sequel"
    require "csv"

    DB = Sequel.connect("postgres:///app-database")
    columns = [:sku, :name, :origin, :msrp_cents]

    # 1. Using the multi-INSERT statement
    products = CSV.read("tmp/products.csv")
    DB[:products].import(columns, products)

    # 2. Using Postgres' COPY command
    DB.copy_into(:products, format: :csv, columns: columns, data: File.read("tmp/products.csv"))

    The multi-INSERT statement wasn’t enough at my work, because we were working with millions of records, and Postgres’ COPY command really saved us. It’s the fastest possible way to get data into Postgres, since you’re using Postgres’ optimized C implementation.

    • Leigh Halliday

      That’s cool! I’ve never used Sequel before, but I’ve heard some great things about it. And definitely, using the COPY command would be faster, but I was pretty happy with 4.6 seconds so decided not to keep going. Do you know how you can fill in the created_at and updated_at columns using the COPY command? The DB has not-null constraints and I didn’t feel like re-generating the data with the created_at/updated_at values pre-filled.

      • Janko Marohnić

        I’m using database triggers for setting these columns, so they’re always set regardless of how I get data into the database. The author of Sequel wrote a really nice sequel_postgres_triggers gem, which gives you helper methods for various kinds of triggers that you can run in migrations:

        Sequel.migration do
          up do
            pgt_created_at(:products, :created_at)
            pgt_updated_at(:products, :updated_at)
          end
        end

  • You could also delay the CSV line reads with a lazy enumerator and a batch strategy, to be able to “stream” the CSV to the importer for big files without needing 1TB of RAM.
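
    A rough sketch of that batching idea, reusing the Product model and activerecord-import from the article (the batch size is an arbitrary assumption; CSV.foreach already reads the file one row at a time, so nothing is loaded up front):

    require 'csv'

    columns = [:sku, :name, :origin, :msrp_cents]

    # CSV.foreach without a block returns an enumerator that yields one row at
    # a time; each_slice hands the importer batches of 5,000 rows.
    CSV.foreach("#{Rails.root}/tmp/products.csv").each_slice(5_000) do |batch|
      Product.import columns, batch, validate: false
    end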

  • Alexander Maslov

    Nice article! Do you know of any gems to bulk update records? I finished up with this solution:

    new_values = { 10 => "test1", 20 => "test2" }

    values = new_values.map { |product_id, value| "(#{product_id}, '#{value}')" }.join(",")

    ActiveRecord::Base.connection.execute(%(
      UPDATE products AS p SET value = v.value FROM (VALUES #{values}) AS v(id, value) WHERE p.id = v.id
    ))

    https://gist.github.com/drakmail/7b337827907ccfcec0af

    It sped up my update of 250,000 records by about 15x. But I think it isn’t clean enough and could be better.

    • Janko Marohnić

      Well, I usually write a CASE statement for this; I wrote one for rubygems.org some time ago. https://github.com/rubygems/rubygems.org/blob/master/lib/tasks/gemcutter.rake#L64-L74

      It still requires building SQL strings, though; there probably isn’t any way around that with ActiveRecord, since its query interface is relatively primitive. With Sequel you would be able to write this:

      DB = Sequel.connect("postgres:///my-database")
      products = {10 => "test1", 20 => "test2"}
      dataset = Product.where(id: products.keys)
      dataset.update(value: Sequel.case(products, nil, :id))
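
      For comparison, the hand-built CASE statement approach described above might look roughly like this with ActiveRecord (a sketch; the column name and values are made up):

      new_values = { 10 => "test1", 20 => "test2" }

      # Build the "WHEN <id> THEN <value>" branches, quoting each value safely.
      cases = new_values.map do |id, value|
        "WHEN #{id.to_i} THEN #{ActiveRecord::Base.connection.quote(value)}"
      end.join(" ")

      ActiveRecord::Base.connection.execute(<<-SQL)
        UPDATE products
        SET value = CASE id #{cases} END
        WHERE id IN (#{new_values.keys.join(',')})
      SQL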

      • Alexander Maslov

        It will be interesting to benchmark `SET FROM values` and `CASE` approaches, but I think there’s no significant difference.

        Very nice example with Sequel – its DSL seems really awesome!

  • Rutger Geelen

    We did similar tests and activerecord-import always wins. Would be nice to make it part of rails core.

  • Lars Kanis

    Thanks for this comparison! Unfortunately this is not suitable if you have millions of rows to insert. That is what COPY is made for. It can stream the data to the server, is much faster, and is already built into the pg gem. It can be used to stream row arrays like this:


    pgconn = Product.connection.raw_connection
    enco = PG::TextEncoder::CopyRow.new
    times = [Time.now.to_s] * 2
    pgconn.copy_data("COPY products (sku, name, origin, msrp_cents, created_at, updated_at) FROM STDIN", enco) do
      CSV.foreach("products.csv") do |row|
        pgconn.put_copy_data(row + times)
      end
    end

    This is around 6 times faster than the bulk insert solution, at least on my netbook.

    If the data is already in a file format like CSV, it can be processed entirely by the server:


    pgconn = Product.connection.raw_connection
    pgconn.copy_data("COPY products (sku, name, origin, msrp_cents) FROM STDIN CSV") do
      File.open("products.csv") do |fd|
        while data = fd.read(100000)
          pgconn.put_copy_data data
        end
      end
    end

    This is 20 times faster than the bulk inserts, with almost no CPU time spent in the Ruby process. To get the timestamps filled in this case, the columns can be set to the default value “now()”. This can be done at the start of the transaction and reverted at the end.
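
    Setting those defaults from Ruby might look roughly like this sketch (reverting them afterwards is left out, and the exact SQL is an assumption):

    # Inside the import transaction: let Postgres fill in the timestamps itself.
    ActiveRecord::Base.connection.execute(
      "ALTER TABLE products " \
      "ALTER COLUMN created_at SET DEFAULT now(), " \
      "ALTER COLUMN updated_at SET DEFAULT now()"
    )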

  • KrzysztofObcy

    Last time I had a 1 GB CSV file (2 million records) and huge performance issues inserting it into the database. I checked all the available solutions (except maybe importing directly into Postgres) and came up with this: bulk inserts split across processes/threads. Six workers working at the same time on the same (chunked) file – result: 6x the bulk insert speed.
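
    A rough sketch of that chunked, multi-process idea, reusing activerecord-import from the article (the worker count and chunking math are assumptions):

    require 'csv'

    rows    = CSV.read("#{Rails.root}/tmp/products.csv")
    columns = [:sku, :name, :origin, :msrp_cents]
    workers = 6

    rows.each_slice((rows.size / workers.to_f).ceil) do |chunk|
      Process.fork do
        # Each forked worker needs its own database connection.
        ActiveRecord::Base.establish_connection
        Product.import columns, chunk, validate: false
      end
    end
    Process.waitall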

  • Phil Coggins

    I would actually argue against doing some of these things – specifically generating raw SQL for bulk imports.

    What happens when you start adding callbacks into your model? activerecord-import does not run callbacks on these records, so you have to ensure that your model has no callbacks and never will have callbacks – that is if you want to do your due diligence of maintaining the integrity of your data.

    Running large bulk inserts inside a transaction can also lock rows in your database and block reads / writes for your users. If I’m running Passenger in a production environment with a pool size of 6, and 6 people try to access the same resource that is blocked, then further visitors to the page will not have access until the import completes.

    Wrapping imports in transactions is good because you can fail the whole import if any one insert fails, but you should definitely understand what rows / tables will have blocked access before doing this in a busy production environment.

  • Connor McDonald

    Just as an endorsement of taking advantage of the capabilities of whatever database you happen to be using, here’s the same exercise on an Oracle 12c database (no fancy kit – just running on a laptop):

    --
    -- reference to the flat file
    --
    SQL> CREATE TABLE product_ext (
           sku         VARCHAR2(5),
           name        VARCHAR2(20),
           origin      VARCHAR2(20),
           mrsp_cents  INT
         )
         ORGANIZATION EXTERNAL (
           TYPE ORACLE_LOADER
           DEFAULT DIRECTORY TEMP
           ACCESS PARAMETERS (
             RECORDS DELIMITED BY NEWLINE
             FIELDS TERMINATED BY ','
             MISSING FIELD VALUES ARE NULL
             (
               sku         CHAR(5),
               name        CHAR(20),
               origin      CHAR(20),
               mrsp_cents  INTEGER
             )
           )
           LOCATION ('product.dat')
         )
         REJECT LIMIT UNLIMITED;

    Table created.

    --
    -- the target table
    --
    SQL> CREATE TABLE product (
           sku         VARCHAR2(5),
           name        VARCHAR2(20),
           origin      VARCHAR2(20),
           mrsp_cents  INT,
           created_at  date,
           updated_at  date
         );

    Table created.

    SQL> create unique index product_ix on product ( sku );

    Index created.

    --
    -- and now insert approx 100,000 rows
    --
    SQL> set timing on
    SQL> insert into product (sku,name,origin,mrsp_cents)
         select sku,name,origin,mrsp_cents from product_ext;

    99999 rows created.

    Elapsed: 00:00:00.27

    SQL> commit;

    Commit complete.

    Elapsed: 00:00:00.00

    --
    -- or without the index and then build the index afterwards
    --
    SQL> drop index product_ix;

    Index dropped.

    SQL> truncate table product;

    Table truncated.

    SQL> insert /*+ APPEND */ into product (sku,name,origin,mrsp_cents)
         select sku,name,origin,mrsp_cents from product_ext;

    99999 rows created.

    Elapsed: 00:00:00.14

    SQL> commit;

    Commit complete.

    Elapsed: 00:00:00.00

    SQL> create unique index product_ix on product ( sku );

    Index created.

    Elapsed: 00:00:00.06

    So, around 0.27 seconds with the index in place, and 0.14 seconds without. Database tech (no matter what your vendor) is generally pretty cool if you can take advantage of it.

  • Chien Tran

    Hey @leigh_halliday, thanks for this well-written article. Do you have any ideas on how we can bulk insert relational data? I have data like this:
    Record 1
    + A
    + B
    + C

    Record 2
    + X
    + Y
    + Z

    There are about 1 million records, and each record has several rows of data. I have two tables: one for the records and another for storing their children. If we bulk insert the records, we don’t have the reference IDs to build the relationship between each record and its data.

    Thanks

    • Chien Tran

      I have just realized that we can pass our own IDs to Postgres instead of letting it generate them automatically. This solves my problem.
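
      A rough sketch of that idea, reserving the IDs from Postgres up front so the child rows can reference their parents before anything is inserted (the Record/Child models, column lists, sample data, and the records_id_seq sequence name are all assumptions):

      # Stand-in for the real parsed data: each parent with its child values.
      parsed_rows = [
        { name: "Record 1", children: ["A", "B", "C"] },
        { name: "Record 2", children: ["X", "Y", "Z"] },
      ]

      record_columns = [:id, :name]
      child_columns  = [:record_id, :value]

      # Reserve one ID per parent record from the table's sequence.
      ids = ActiveRecord::Base.connection.select_values(
        "SELECT nextval('records_id_seq') FROM generate_series(1, #{parsed_rows.size})"
      ).map(&:to_i)

      records  = []
      children = []

      parsed_rows.zip(ids).each do |row, id|
        records << [id, row[:name]]
        row[:children].each { |child| children << [id, child] }
      end

      Record.import record_columns, records, validate: false
      Child.import child_columns, children, validate: false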
