The Sesame Native Store is reported to scale up to about 100-150 million triples (depending on hardware and dataset characteristics). However, getting that number of triples into the store is not always a trivial task, so I wanted to go over several possible strategies you can employ to get best performance when trying to upload large datasets into the Sesame Native Store.

In this recipe, we will look at simple uploading and its limitations, splitting your input data into several files (and how to deal with blank node identity), as well as programmatically chunked uploads and several tweaks you can employ to improve performance.

[toc]

Simple upload

This method consists of simply opening a connection, opening your file, and supplying it to the connection:

[java]
// create a local Sesame Native Store (the store needs a data directory on disk)
Repository nativeRep = new SailRepository(new NativeStore(new File("/path/to/datadir")));
nativeRep.initialize();

String fileName = "/path/to/example.rdf";
File dataFile = new File(fileName);
RepositoryConnection conn = nativeRep.getConnection();
try {
    conn.add(dataFile, "file://" + fileName, RDFFormat.forFileName(fileName));
}
finally {
    conn.close();
}
[/java]

This is the easiest and quickest way to programmatically add a file to a native store. However, when the input file grows beyond a certain size (depending on available memory, CPU, and disk I/O performance, but generally speaking when larger than roughly 6,000,000 triples or 500MB), this method of uploading can become very slow, and we will need a slightly more involved way to get our data into our native store.

Splitting the input

The other strategies for loading large data sets all involve having the data offered piece by piece, or chunked, in some way. The first and easiest way to achieve chunking is to have your input data file split into several smaller files, and then upload each file separately.

If your input data file is in N-Triples format, contains no blank nodes, and you’re lucky enough to be on Linux, you can create your own split simply with the split shell command:

[bash]
split -l 100000 /path/to/large/rdf/file.nt
[/bash]

The above command will split your input data file into several smaller files, of 100,000 lines each. Vary chunk size to taste (between 100,000 and 500,000 seems to generally give best results, but a lot, again, depends on the hardware).
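
One thing to watch out for: by default, split names its output files xaa, xab, and so on, without a file extension, while the upload code shown below relies on RDFFormat.forFileName to detect the format from the file name. Depending on your version of split, you may be able to keep the .nt extension on the chunks directly; the following is a sketch, assuming a reasonably recent GNU coreutils split:

[bash]
# produce chunk_aa.nt, chunk_ab.nt, ... so the RDF format can still be detected from the file name
split -l 100000 --additional-suffix=.nt /path/to/large/rdf/file.nt chunk_
[/bash]

Alternatively, simply rename the chunks afterwards, or pass the RDFFormat explicitly when uploading.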

As said, the above method of splitting only works if the data is in N-Triples format and contains no blank nodes. N-Triples format is a line-by-line format with no headers at the start of the file, so when you cut away the first 100,000 lines, you are still left with a valid N-Triples file. In contrast, Turtle, N3 and (even worse) RDF/XML can be less easily split using such a simple method, since they contain header information (namespace declarations) and apart from that can be invalidated if you split on the wrong line.

Blank nodes are a concern because a blank node reference is scoped to the local file: if you split a file containing a blank node _:a which is referenced in several places, and these references end up in two separate split files, the fact that they are talking about the same blank node is lost. There is a trick in Sesame to cope with this, more about that below.

Once we have succeeded in splitting our files, we can upload each file separately, like so:

[java]
// create a local Sesame Native Store (again giving it a data directory on disk)
Repository nativeRep = new SailRepository(new NativeStore(new File("/path/to/datadir")));
nativeRep.initialize();

String path = "/path/to/split/files/";
File directory = new File(path);
File[] files = directory.listFiles();
RepositoryConnection conn = nativeRep.getConnection();
try {
    for (File file : files) {
        String fileName = file.getAbsolutePath();
        conn.add(file, "file://" + fileName, RDFFormat.forFileName(fileName));
    }
}
finally {
    conn.close();
}
[/java]

Preserving blank node identity

As said above, when the data set we want to split contains blank nodes, we face the problem that blank node identifiers are locally scoped: blank node _:a in file 1 is not the same blank node as blank node _:a in file 2. Normally, when data is added to a Sesame repository, blank node identifiers as found in the source are thrown away and Sesame instead assigns its own internal blank node identifiers (Sesame does this because it is normally the desired behavior: most of the time you don't want blank nodes from different files to refer to the same thing).

However, we can configure the parser that reads our input data to preserve blank node identifiers. Effectively, we tell Sesame not to assign its own internal blank node identifiers, but instead just use the blank node ids it finds in the file. In the case of split files, this is exactly what we need.

We can tune the parser configuration as follows (using explicit boolean vars here for clarity):

[java]
// create a parser config with preferred settings
boolean verifyData = true;
boolean stopAtFirstError = true;
boolean preserveBnodeIds = true;
ParserConfig config = new ParserConfig(verifyData, stopAtFirstError, preserveBnodeIds, RDFParser.DatatypeHandling.VERIFY);

// set the parser configuration for our connection
conn.setParserConfig(config);
[/java]

Once you’ve reconfigured the parser as above, you can use the connection to upload your split files as shown in the previous section, and even if the split files contain blank node references, these will be handled properly.

There is one caveat to this: if you are loading data into a repository that already contains data from a different source, and this existing data contains blank nodes as well, you could in theory run into problems when a blank node identifier in the newly uploaded data set is identical to one of the blank node identifiers in your existing data.

Converting to N-Triples

If your input data file is not in N-Triples format, it is more difficult to split into separate files. However, you can convert it to N-Triples format – even for very large files this is relatively quick. See the Cookbook recipe on Parsing and Writing RDF with RIO for details on how to cook up your own Java program to do this. Alternatively, you can use a command line tool, such as RDFConvert.
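
For reference, here is a minimal sketch of such a conversion using Rio directly, assuming the input is in RDF/XML format (the file paths are just placeholders):

[java]
// parse the input file and pipe the parser output straight into an N-Triples writer
FileInputStream in = new FileInputStream("/path/to/example.rdf");
FileOutputStream out = new FileOutputStream("/path/to/example.nt");
try {
    RDFParser rdfParser = Rio.createParser(RDFFormat.RDFXML);
    RDFWriter rdfWriter = Rio.createWriter(RDFFormat.NTRIPLES, out);

    // an RDFWriter is itself an RDFHandler, so it can consume the parser output directly
    rdfParser.setRDFHandler(rdfWriter);
    rdfParser.parse(in, "file:///path/to/example.rdf");
}
finally {
    in.close();
    out.close();
}
[/java]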

Programmatic Chunking

If you cannot split the file into several chunks manually (or programmatically in a preprocessing step), you can still do a chunked load of a single large file. The default add method of the Sesame RepositoryConnection, which we have used so far to upload files, takes the entire input file and adds its contents in a single transaction, only committing when the entire contents have been read. However, if we create our own parser and listener and count the triples processed by the parser, we can force intermediate commits of the data during upload, achieving more or less the same effect as having several input files.

In the recipe on Parsing and Writing RDF with RIO we have seen how we can create our own parser and parser listener. We are going to use that here. The code to create our own parser and use it to process the big RDF file is as follows:

[java]
Repository nativeRep = new SailRepository(new NativeStore(datadir));
nativeRep.initialize();

RepositoryConnection conn = nativeRep.getConnection();

// we set autocommit to false to make sure we can insert individual statements
// without immediately committing
conn.setAutoCommit(false);

String fileName = "/path/to/example.rdf";
RDFParser parser = Rio.createParser(RDFFormat.forFileName(fileName));

// add our own custom RDFHandler to the parser. This handler takes care of adding
// triples to our repository and doing intermittent commits
parser.setRDFHandler(new ChunkCommitter(conn));

File file = new File(fileName);
FileInputStream is = new FileInputStream(file);
try {
    parser.parse(is, "file://" + file.getCanonicalPath());
    conn.commit();
}
finally {
    is.close();
    conn.close();
}
[/java]

In the above code, we have supplied an object of type ChunkCommitter to our RDF parser. This is a custom object which receives the parser output and adds it to the repository, but which also keeps track of the number of triples added so far, and does an intermittent commit every X triples. An example implementation could look like this:

[java]
class ChunkCommitter implements RDFHandler {

    private RDFInserter inserter;
    private RepositoryConnection conn;

    private long count = 0L;

    // do intermittent commit every 500,000 triples
    private long chunksize = 500000L;

    public ChunkCommitter(RepositoryConnection conn) {
        inserter = new RDFInserter(conn);
        this.conn = conn;
    }

    @Override
    public void startRDF() throws RDFHandlerException {
        inserter.startRDF();
    }

    @Override
    public void endRDF() throws RDFHandlerException {
        inserter.endRDF();
    }

    @Override
    public void handleNamespace(String prefix, String uri)
            throws RDFHandlerException {
        inserter.handleNamespace(prefix, uri);
    }

    @Override
    public void handleStatement(Statement st) throws RDFHandlerException {
        inserter.handleStatement(st);
        count++;
        // do an intermittent commit whenever the number of triples
        // has reached a multiple of the chunk size
        if (count % chunksize == 0) {
            try {
                conn.commit();
            } catch (RepositoryException e) {
                throw new RDFHandlerException(e);
            }
        }
    }

    @Override
    public void handleComment(String comment) throws RDFHandlerException {
        inserter.handleComment(comment);
    }
}
[/java]

As an aside: this method is also useful if you are interested in getting feedback on loading progress. You can add whatever logging messages or println statements you want to the ChunkCommitter.
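
For instance, a hypothetical progress message could be printed right after each intermittent commit; the following is a sketch of the relevant fragment of handleStatement, with a plain println standing in for whatever logging mechanism you prefer:

[java]
// inside ChunkCommitter.handleStatement(), after a successful intermittent commit:
if (count % chunksize == 0) {
    try {
        conn.commit();
        System.out.println("committed " + count + " triples so far");
    } catch (RepositoryException e) {
        throw new RDFHandlerException(e);
    }
}
[/java]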

Compressing the input using gzip

The Sesame Native Store is a disk-based store, and its performance is heavily dependent on disk I/O performance. Without going into too much detail here, performance can be improved by compressing our data file(s) beforehand. This reduces disk read operations and allows the OS's page cache to be focused more on the contents of the native store itself, improving overall access speed.

By default, Sesame’s RepositoryConnection accepts input files in gzip compression format. Simply compress your input file(s) using gzip, and supply the compressed file(s) directly via the RepositoryConnection.add(File,...) method.
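
For example, the following sketch (with placeholder paths) uploads a gzip-compressed N-Triples file; the RDFFormat is passed explicitly here, since the .gz extension may confuse format detection by file name:

[java]
// upload a gzip-compressed N-Triples file directly
File gzFile = new File("/path/to/large/rdf/file.nt.gz");
conn.add(gzFile, "file://" + gzFile.getAbsolutePath(), RDFFormat.NTRIPLES);
[/java]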

However, if we are using the programmatic chunking method described above, we need to take care of decompressing our files ourselves, since the parser itself does not accept compressed data. Fortunately, this is easily achieved: all we have to do is wrap the FileInputStream in a java.util.zip.GZIPInputStream before passing it to the parser:

[java]
File file = new File(fileName);
InputStream is = new GZIPInputStream(new FileInputStream(file));
[/java]

Easy as pi.

Limits

The above methods should be able to get you decent performance when loading large datasets into a Sesame Native Store. Even on relatively modest hardware, these strategies should enable you to load data at a rate of roughly 6-10 million triples per hour (the average rate goes down as the total size of the store increases).

However, it should be stressed again that the Sesame Native Store is designed with medium-sized datasets in mind. Its practical limit is around 100-150 million triples (a lot depends on hardware as well as the characteristics of the dataset). If you are working with datasets that are significantly larger than that, you should either consider splitting your dataset over multiple native stores, or go for a different Sesame-based store altogether, one that is specifically designed to work with large datasets. Ontotext's OWLIM is a good candidate, but there are several possible backends to choose from, by various vendors.
