The Sesame documentation is pretty extensive, but also perhaps a bit daunting, especially if you are relatively new to Sesame, or indeed to tools like Eclipse or Apache Maven.

To help people who are new to Sesame get started, I have written a step-by-step tutorial on how to use these tools together to build a simple Semantic Web application.

The tutorial is available on the Sesame documentation website:

The Sesame Native Store is reported to scale up to about 100-150 million triples (depending on hardware and dataset characteristics). However, getting that number of triples into the store is not always a trivial task, so I wanted to go over several possible strategies you can employ to get best performance when trying to upload large datasets into the Sesame Native Store.

In this recipe, we will look at simple uploading and its limitations, splitting your input data into several files (and how to deal with blank node identity), as well as programmatically chunked uploads and several tweaks you can employ to improve performance.
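
To give a first impression of the chunked approach, here is a minimal sketch in Java of a chunked upload using the Rio parser and the Repository API (the data directory, input file name and the chunk size of 500,000 statements are illustrative placeholders, not recommendations):

import java.io.File;
import java.io.FileInputStream;

import org.openrdf.model.Statement;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.RepositoryException;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.repository.util.RDFInserter;
import org.openrdf.rio.RDFFormat;
import org.openrdf.rio.RDFHandlerException;
import org.openrdf.rio.RDFParser;
import org.openrdf.rio.Rio;
import org.openrdf.sail.nativerdf.NativeStore;

public class ChunkedUploadSketch {

    private static final long CHUNK_SIZE = 500000L; // illustrative figure

    public static void main(String[] args) throws Exception {
        Repository rep = new SailRepository(new NativeStore(new File("/path/to/datadir")));
        rep.initialize();

        final RepositoryConnection con = rep.getConnection();
        try {
            con.setAutoCommit(false); // manage transaction boundaries ourselves

            RDFParser parser = Rio.createParser(RDFFormat.NTRIPLES);
            parser.setRDFHandler(new RDFInserter(con) {
                private long count = 0;

                @Override
                public void handleStatement(Statement st) throws RDFHandlerException {
                    super.handleStatement(st);
                    if (++count % CHUNK_SIZE == 0) {
                        try {
                            con.commit(); // flush the current chunk to the store
                        }
                        catch (RepositoryException e) {
                            throw new RDFHandlerException(e);
                        }
                    }
                }
            });

            parser.parse(new FileInputStream("/path/to/data.nt"), "http://example.org/");
            con.commit(); // commit the final (partial) chunk
        }
        finally {
            con.close();
            rep.shutDown();
        }
    }
}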


Since release 2.5.0, Sesame supports reading and writing a binary RDF serialization format. Its main features are reduced parsing overhead and minimal memory requirements (for handling really long literals, amongst other things).

Unfortunately, the format is currently not documented in the Sesame manuals. I have therefore done a bit of reverse-engineering, trying to reconstruct from the code how the format works. This post is a first draft of my reconstruction of the format.

Warning: dry technical stuff coming up. Read at your peril.

MIME Content Type

Sesame assigns the content type application/x-binary-rdf to its format.

Overall design

A data stream encoded in the Sesame Binary RDF format consists of a header, followed by zero or more records, and closes with an END_OF_DATA marker (see below). Values are stored in network byte order (big-endian).

All string values use UTF-16 encoding. Reference ids are assigned to recurring values to avoid having to repeat long strings.

Header

The header is 8 bytes long:

  • Bytes 0-3 contain a magic number, namely the ASCII codes for the string
    “BRDF”, which stands for Binary RDF.
  • Bytes 4-7 specify the format version (a 4-byte signed integer).

For example, a header for a result in format version 1 will look like this:

  byte: 0  1  2  3 |  4  5  6  7 | 
-------------------+-------------+
 value: B  R  D  F |  0  0  0  1 |
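
By way of illustration, here is the start of a small writer sketch in Java (my own hypothetical code, not the actual Rio implementation); java.io.DataOutputStream conveniently writes multi-byte values in network order:

import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch of a writer for this format (not the Sesame/Rio code).
class BinaryRdfWriterSketch {

    private final DataOutputStream out;

    BinaryRdfWriterSketch(DataOutputStream out) {
        this.out = out;
    }

    void writeHeader() throws IOException {
        out.writeByte('B'); // bytes 0-3: the magic number "BRDF"
        out.writeByte('R');
        out.writeByte('D');
        out.writeByte('F');
        out.writeInt(1);    // bytes 4-7: format version (4-byte signed integer)
    }
}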

Content records

Zero or more records follow after the header. Each record can be a namespace declaration, a comment, a value reference declaration, or a statement.

Each record starts with a record type marker (a single byte). The following record types are defined in the current format:

  • NAMESPACE_DECL (byte value: 0):
    This indicates a namespace declaration record.
  • STATEMENT (byte value: 1):
    This indicates an RDF statement record.
  • COMMENT (byte value: 2):
    This indicates a comment record.
  • VALUE_DECL (byte value: 3):
    This indicates a value declaration.
  • END_OF_DATA (byte value: 127):
    This indicates the end of the data stream has been reached.
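
In the writer sketch started above, these markers are simply byte constants (the later snippets below assume they are part of the same hypothetical class):

// Record type markers, taken from the list above.
static final int NAMESPACE_DECL = 0;
static final int STATEMENT      = 1;
static final int COMMENT        = 2;
static final int VALUE_DECL     = 3;
static final int END_OF_DATA    = 127;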

Strings

All strings are encoded as UTF-16 byte arrays. A string is preceded by a 4-byte signed integer that encodes the length of the string (specifically, the number of UTF-16 code units, not the number of bytes). For example, the string ‘foo’ will be encoded as follows:

 byte: 0 1 2 3 | 4 6 8 |
---------------+-------+
value: 0 0 0 3 | f o o |
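
Note that the byte offsets 4, 6 and 8 in the table are the starting offsets of each two-byte character. In the hypothetical writer sketch, DataOutputStream already does exactly what is needed here (writeChars emits each character as two bytes, most significant byte first):

// Write a string as a 4-byte length (in UTF-16 code units, not bytes),
// followed by the UTF-16 encoded characters.
void writeString(String s) throws IOException {
    out.writeInt(s.length()); // for "foo": 0 0 0 3
    out.writeChars(s);        // two bytes per character, big-endian
}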

RDF Values

Each RDF value type has its own specific 1-byte record type marker:

  • NULL_VALUE (byte value: 0)
    marks an empty RDF value (this is used, for example, in encoding of context in statements)
  • URI_VALUE (byte value: 1)
    marks a URI value
  • BNODE_VALUE (byte value: 2)
    marks a blank node value
  • PLAIN_LITERAL_VALUE (byte value: 3)
    marks a plain literal value
  • LANG_LITERAL_VALUE (byte value: 4)
    marks a language-tagged literal value
  • DATATYPE_LITERAL_VALUE (byte value: 5)
    marks a datatyped literal value

URIs

URIs are recorded by the URI_VALUE marker followed by the URI encoded as a string.

Blank nodes

Blank nodes are recorded by the BNODE_VALUE marker followed by the id of the blank node encoded as a string.

Literals

Depending on the specific literal type (plain, language-tagged, datatyped), a literal is recorded by one of the markers PLAIN_LITERAL_VALUE, LANG_LITERAL_VALUE or DATATYPE_LITERAL_VALUE. This is followed by the lexical label of the literal as a string. For a LANG_LITERAL_VALUE the label is then followed by the language tag encoded as a string; for a DATATYPE_LITERAL_VALUE it is followed by the datatype URI encoded as a string.
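
Putting the value markers and the string encoding together, the hypothetical writer sketch could encode a value record roughly as follows (using the org.openrdf.model interfaces):

// RDF value type markers, taken from the list above.
static final int NULL_VALUE             = 0;
static final int URI_VALUE              = 1;
static final int BNODE_VALUE            = 2;
static final int PLAIN_LITERAL_VALUE    = 3;
static final int LANG_LITERAL_VALUE     = 4;
static final int DATATYPE_LITERAL_VALUE = 5;

// Write a single RDF value: a 1-byte type marker followed by its string part(s).
void writeValue(org.openrdf.model.Value value) throws IOException {
    if (value == null) {
        out.writeByte(NULL_VALUE);
    }
    else if (value instanceof org.openrdf.model.URI) {
        out.writeByte(URI_VALUE);
        writeString(value.stringValue());
    }
    else if (value instanceof org.openrdf.model.BNode) {
        out.writeByte(BNODE_VALUE);
        writeString(((org.openrdf.model.BNode)value).getID());
    }
    else {
        org.openrdf.model.Literal lit = (org.openrdf.model.Literal)value;
        if (lit.getLanguage() != null) {
            out.writeByte(LANG_LITERAL_VALUE);
            writeString(lit.getLabel());
            writeString(lit.getLanguage());
        }
        else if (lit.getDatatype() != null) {
            out.writeByte(DATATYPE_LITERAL_VALUE);
            writeString(lit.getLabel());
            writeString(lit.getDatatype().stringValue());
        }
        else {
            out.writeByte(PLAIN_LITERAL_VALUE);
            writeString(lit.getLabel());
        }
    }
}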

Value reference declaration records

To enable further compression of the byte stream, the Binary RDF format enables encoding of reference-identifiers for often-repeated RDF values. A value reference declaration starts with a VALUE_DECL record marker (1 byte, value 3), followed by a 4-byte signed integer that encodes the reference id. This is followed by the actual value, encoded as an RDF value (see above).

For example, a declaration that assigns id 42 to the URI ‘http://example.org/HHGTTG’ will look like this:

  byte: 0 | 1 2 3 4 | 5 | 6 7 8 9 | 10 12 14 16 18 (etc) | 
----------+---------+---+---------+----------------------+
 value: 3 | 0 0 0 42| 1 | 0 0 0 25| h  t  t  p  :  (etc) |

Explanation: byte 0 marks the record as a VALUE_DECL, bytes 1-4 encode the reference id, byte 5 encodes the value type (URI_VALUE), bytes 6-9 encode the length of the string value, and bytes 10 and further encode the actual string value as a UTF-16 encoded byte array.

Note that the format allows the same reference id to be assigned more than once. When a second value declaration occurs, it effectively overwrites a previous declaration, reassigning the id to a new value for all following statements.
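
In the hypothetical writer sketch, emitting such a declaration is just a matter of combining the pieces shown so far:

// Declare a reference id for a value: VALUE_DECL marker, 4-byte id, then the value.
void writeValueDecl(int refId, org.openrdf.model.Value value) throws IOException {
    out.writeByte(VALUE_DECL); // byte 0: record type marker (3)
    out.writeInt(refId);       // bytes 1-4: the reference id
    writeValue(value);         // the value itself, encoded as shown above
}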

Namespace records

A namespace declaration is recorded by the NAMESPACE_DECL marker, followed by the namespace prefix as a string, and then the namespace URI as a string.

For example, a namespace declaration record for prefix ‘ex’ and namespace URI ‘http://example.org/’ will look like this:

  byte: 0 | 1 2 3 4 | 5 6 | 7 8 9 10 | 11 13 15 17 19 (etc) |
----------+---------+-----+----------+----------------------+
 value: 0 | 0 0 0 2 | e x | 0 0 0 19 | h  t  t  p  :  (etc) |

Comment records

A comment is recorded by the COMMENT marker, followed by the comment text encoded as a string.

For example, a record for the comment ‘example’ will look like this:

  byte: 0 | 1 2 3 4 | 5 7 9 11 13 15 17 |
----------+---------+-------------------+
 value: 2 | 0 0 0 7 | e x a m  p  l  e  |

Statement records

Each statement record starts with a STATEMENT marker (1 byte, value 1). For the encoding of the statement’s subject, predicate, object and context, either the RDF value is encoded directly, or a previously assigned value reference (see ‘Value reference declaration records’ above) is reused. A value reference is recorded with the VALUE_REF marker (1 byte, value 6), followed by the reference id as a 4-byte signed integer.

An example statement

Consider the following RDF statement:


<http://example.org/George> <http://example.org/name> "George" .

Assume that the subject and predicate have previously been assigned reference ids (42 and 43, respectively). The object value has not been assigned a reference id.

This statement would then be recorded as follows:

 byte: 0 | 1 | 2 3 4 5 | 6 | 7 8 9 10| 11 | 12 13 14 15 | 16 18 20 22 24 26 | 28 |
---------+---+---------+---+---------+----+-------------+-------------------+----+
value: 1 | 6 | 0 0 0 42| 6 | 0 0 0 43|  3 |  0  0  0  6 |  G  e  o  r  g  e |  0 |

Explanation: byte 0 marks the record as a STATEMENT. Byte 1 marks the subject of the statement as a VALUE_REF. Bytes 2-5 encode the reference id of the subject. Byte 6 marks the predicate of the statement as a VALUE_REF. Bytes 7-10 encode the reference id of the predicate. Byte 11 marks the object of the statement as a PLAIN_LITERAL_VALUE, bytes 12-15 encode the length of the lexical value of the literal, and bytes 16-27 encode the literal’s lexical value as a UTF-16 encoded byte array. Finally, byte 28 marks the context field of the statement as a NULL_VALUE.
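
In the hypothetical writer sketch, a statement record could be produced along these lines (refIds is an assumed map from previously declared values to their reference ids, maintained wherever writeValueDecl is called):

static final int VALUE_REF = 6; // marker for a reference to a declared value

// Assumed bookkeeping: values that currently have a declared reference id.
final java.util.Map<org.openrdf.model.Value, Integer> refIds =
        new java.util.HashMap<org.openrdf.model.Value, Integer>();

// Write one statement: STATEMENT marker, then subject, predicate, object and
// context, each either as a value reference or as a directly encoded value.
void writeStatement(org.openrdf.model.Statement st) throws IOException {
    out.writeByte(STATEMENT);
    writeValueOrRef(st.getSubject());
    writeValueOrRef(st.getPredicate());
    writeValueOrRef(st.getObject());
    writeValueOrRef(st.getContext()); // a null context is written as NULL_VALUE
}

void writeValueOrRef(org.openrdf.model.Value value) throws IOException {
    Integer refId = (value == null) ? null : refIds.get(value);
    if (refId != null) {
        out.writeByte(VALUE_REF); // reuse a previously declared value
        out.writeInt(refId);
    }
    else {
        writeValue(value);        // encode the value directly
    }
}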

Buffering and value reference handling

The binary RDF format enables declaration of value references for more compressed representation of often-repeated values.

A binary RDF producer may choose to introduce a reference for every RDF value. This is a simple approach, but it produces a suboptimal compression (because for values which occur only once, direct encoding of the value uses fewer bytes than introducing a reference for it).

Another approach is to introduce a buffered writing strategy: statements to be serialized are put on a queue with a certain capacity, and for each RDF value in these queued statements the number of occurrences in the queue is determined. As the queue is emptied and each statement is serialized, all values that occur more than once in the queue are assigned a reference id. This is, in fact, the strategy employed by the OpenRDF Rio implementation of this format.

It is also important to note that reference ids are not necessarily global over the entire document: ids are assigned on the basis of number of occurrences of a value in the current statement queue. If that number drops to zero, the reference id for that value can be ‘recycled’, that is, reassigned to another value. This ensures that we never run out of reference ids, even for very large datasets.
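
A rough sketch of such a buffered strategy, continuing the hypothetical writer from the previous snippets (heavily simplified; the queue capacity of 100 statements is an illustrative figure and the actual Rio implementation differs in its details):

private final java.util.ArrayDeque<org.openrdf.model.Statement> queue =
        new java.util.ArrayDeque<org.openrdf.model.Statement>();
private final java.util.Map<org.openrdf.model.Value, Integer> inQueueCounts =
        new java.util.HashMap<org.openrdf.model.Value, Integer>();
private static final int QUEUE_CAPACITY = 100; // illustrative figure
private int nextRefId = 0;

// Queue a statement and keep track of how often each value occurs in the queue.
void bufferStatement(org.openrdf.model.Statement st) throws IOException {
    queue.add(st);
    for (org.openrdf.model.Value v : valuesOf(st)) {
        Integer c = inQueueCounts.get(v);
        inQueueCounts.put(v, c == null ? 1 : c + 1);
    }
    if (queue.size() >= QUEUE_CAPACITY) {
        flushOne();
    }
}

// Serialize the oldest queued statement, declaring references only for values
// that occur more than once in the queue.
void flushOne() throws IOException {
    org.openrdf.model.Statement st = queue.remove();
    for (org.openrdf.model.Value v : valuesOf(st)) {
        if (inQueueCounts.get(v) > 1 && !refIds.containsKey(v)) {
            int id = nextRefId++;
            writeValueDecl(id, v);
            refIds.put(v, id);
        }
        int remaining = inQueueCounts.get(v) - 1;
        if (remaining == 0) {
            inQueueCounts.remove(v);
            refIds.remove(v); // this value's id could now be recycled
        }
        else {
            inQueueCounts.put(v, remaining);
        }
    }
    writeStatement(st);
}

// Drain the queue and close the stream with the END_OF_DATA marker.
void finish() throws IOException {
    while (!queue.isEmpty()) {
        flushOne();
    }
    out.writeByte(END_OF_DATA);
}

private org.openrdf.model.Value[] valuesOf(org.openrdf.model.Statement st) {
    return new org.openrdf.model.Value[] { st.getSubject(), st.getPredicate(), st.getObject() };
}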

Conclusions

The record marking approach ensures that the format can be read and written in a completely streaming fashion, reducing memory overhead and making it very suitable for large datasets.

The next step is to actually do a bit of analysis on this format. For example, I’d be keen to compare this approach in more detail with the HDT format, currently a W3C submission. Aduna’s motivation for implementing its own custom format had to do with optimizing performance for datasets with very long literal values, but I haven’t yet seen any comparative figures on byte size and processing speed.

I recently became aware that the implementation of arbitrary-length property paths I created for Sesame’s SPARQL engine suffered a serious flaw: it didn’t scale. At all. The chosen strategy (see my earlier blog post on this topic) of evaluating paths of ever-increasing length using multi-joins slowed down to a crawl (through tar) on any serious dataset at about length 7 or 8. Queries involving such paths took 15 minutes or more to complete. Clearly, this would need a little improvement.

Looking at the implementation, it turned out there were two main bottlenecks in the performance:

  1. complete re-evaluation of all paths of length (n – 1) whenever we expand the path length
  2. a computationally heavy cycle-detection check

After fiddling about with some ineffectual half-measures for one or both of these problems I decided to take a step back, and completely rethink the algorithm. It occurred to me that if we did a proper breadth-first search implementation we could eliminate the computationally expensive multi-join altogether (and even more importantly, the hugely inefficient repeat performances on known paths for every increase of the length). We could simply use a FIFO queue to put intermediate values on, and keep doing simple lookups of patterns of length 1.

Let me clarify with an example. Suppose you have the following path-expression:

  ex:alice foaf:knows+ ?x 

In the naive join-based algorithm, this would be evaluated by first looking up all paths of length 1:

  BGP(ex:alice, foaf:knows, ?x)

If the result of this is not empty, we continue by looking up paths of length 2, adding a filter-condition to guard against cycles:

Filter(
 And(?a1 != ex:alice, ?a1 != ?x),
 Join(
  BGP(ex:alice, foaf:knows, ?a1),
  BGP(?a1, foaf:knows, ?x)))

I won’t bore you with the details for paths of length 3 and further, but I trust you appreciate that this can quickly become quite big and complex. Also, notice the fact that in the expression for length 2, we have BGP(ex:alice, foaf:knows, ?a1), which is essentially the same expression we already evaluated for paths of length 1. We already know the answers to this bit, yet we still create a new join which will re-evaluate the damn thing, for every increase of the length, again and again.

Enter the new algorithm. I got rid of joins completely. Instead, I introduced a FIFO queue of values. We iterate over results for length 1, reporting all resulting solutions, and putting all bindings found for ?x onto the queue. Then, when all results for length 1 are exhausted, we examine the queue. If it is not empty, we pop the first value, and create a new BGP where we replace the start with the popped value. We iterate over solutions to this BGP, and again, we report all solutions and add all bindings of ?x thus found to the queue. We continue in this fashion until finally the last iteration is exhausted and the queue is empty.
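
A minimal sketch of this queue-based evaluation (the method names lookupEndpoints and report are placeholders for the actual Sesame machinery that evaluates a single BGP and reports a query solution; the duplicate check is explained in more detail below):

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

import org.openrdf.model.Value;

// Sketch of the queue-based (breadth-first) evaluation of ex:alice foaf:knows+ ?x.
class PathEvaluationSketch {

    void evaluatePlusPath(Value start) {
        Queue<Value> queue = new ArrayDeque<Value>();
        // end values already reported for this start value; since the start is
        // fixed here, recording the end values is enough to detect cycles
        Set<Value> reported = new HashSet<Value>();

        queue.add(start);
        while (!queue.isEmpty()) {
            Value node = queue.remove();
            // evaluate the length-1 pattern BGP(node, foaf:knows, ?x)
            for (Value x : lookupEndpoints(node)) {
                if (reported.add(x)) { // false if already reported: cycle or duplicate
                    report(start, x);  // report a solution for ?x
                    queue.add(x);      // and continue the traversal from it later
                }
            }
        }
    }

    // Placeholder: evaluate BGP(node, foaf:knows, ?x) and return the bindings of ?x.
    List<Value> lookupEndpoints(Value node) {
        throw new UnsupportedOperationException("placeholder");
    }

    // Placeholder: report a query solution binding ?x to 'end'.
    void report(Value start, Value end) {
        throw new UnsupportedOperationException("placeholder");
    }
}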

This is really classic breadth-first search, so no new wheels have been invented here, but it is amazing (although in retrospect really quite understandable) how much more performant this graph traversal mechanism is than the earlier join-based approach. One experiment I did (using the DBPedia category thesaurus) was the following query:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?sub
WHERE {
   ?sub skos:broader+ <http://dbpedia.org/resource/Category:Business>
}

This query (which effectively retrieves all subcategories of Category:Business) took over 15 minutes to complete in the original approach. After switching to the new BFS-based approach, it completes in roughly 120ms.

One thing I haven’t touched on yet in this new implementation is how to detect cycles. In the original approach, we retrieved solutions that encompassed the complete path (joins with intermediate variables) and we used filters to check that all of the intermediate vars are pairwise distinct. Clearly, we can’t do that in this approach, because we no longer have all these intermediate vars available at once.

The solution is actually rather simple, if you stop and think of what a cycle actually is: a cycle is a path where at the end you have gotten back where you started (in other news: water is quite wet). Therefore, you simply have to mark where you’ve been before (that is, which solutions you’ve previously reported), and you’re home free.

A list internally records all reported combinations of the start value (ex:alice) and the end value (each binding of ?x) of our path. Whenever we find a new solution during iteration, we check if the reported combination already exists in that list, and if so, we discard it.

Why not just record the end values? Well… Think about it. Do all path expressions start with a fixed start point?

This approach to cycle detection has two consequences:

  • cycle detection has become a cheap lookup in a list.
    This is quick, but puts a burden on memory space. In practice, however, the result of property-paths stays small enough to be able to keep such a list without problems.
  • We now automatically filter out duplicate results

The second consequence is really useful in practice: the old approach reported back all possible paths of all possible lengths, so if for example Bob is a friend of Alice but also a friend of a friend of Alice, we would get back Bob twice – which is clearly redundant information. In the new algorithm, Bob is only reported once, saving bandwidth and computation time, and sparing the user from having to jump through a lot of hoops to get unique results.

The new and improved algorithm will be available in the next Sesame release (2.6.1).

This just in from OpenRDF.org: We are very pleased to announce the release of Sesame 2.6.0. This is a major new release with some very exciting features, most importantly:

  • SPARQL 1.1 Federation support

    Sesame 2.6.0 features support for the SPARQL 1.1 Query Language Federation extension as specified in the Last Call Working Draft. This enables the incredibly powerful possibility to directly query and integrate data from any public SPARQL endpoint.

  • SPARQL Repository

    The SPARQLRepository enables you to programmatically connect to any SPARQL endpoint via the Repository API. It also offers the option to send custom HTTP headers (for example API keys) to endpoints which require this.

  • User-configurable parser behavior in Repository API

    You can now configure the way the parser handles validation, errors, and datatype verification when uploading a file directly via the Repository API.

Apart from these features the release includes a number of bugfixes and minor improvements, see the Changelog for full details.

The SPARQL 1.1 federation extensions in Sesame 2.6.0 have been contributed by fluidOps, in cooperation with Ontotext and Aduna.

As usual, the latest Sesame release can be downloaded at OpenRDF.org.

It’s not yet up on the openrdf.org news page (mainly due to New Zealand being so far ahead of Europe – ha!), but Sesame 2.5.1 is now available for download.

This is a maintenance release that fixes a number of issues in the newly developed SPARQL 1.1 Query/Update support. Additionally, the Rio parser has improved support for validation of XML Schema datatypes and language tags.

For a full overview of changes, see the ChangeLog.

From OpenRDF.org: We are very pleased to announce the release of Sesame 2.5.0. This is a major new release with some very exciting features:

  • SPARQL 1.1 Query Language support
    Sesame 2.5 features near-complete support for the SPARQL 1.1 Query Language Last Call Working Draft, including all new builtin functions and operators, improved aggregation behavior and more.
  • SPARQL 1.1 Update support
    Sesame 2.5 has full support for the new SPARQL 1.1 Update Working Draft. The Repository API has been extended to support creation of SPARQL Update operations, and the SAIL API has been extended to allow Update operations to be passed directly to the underlying backend implementation for optimized execution. Also, the Sesame Workbench application has been extended to allow easy execution of SPARQL Update operations on your repositories.
  • SPARQL 1.1 Protocol support
    Sesame 2.5 fully supports the SPARQL 1.1 Protocol for RDF Working Draft. The Sesame REST protocol has been extended to allow update operations via SPARQL on repositories. A Sesame server therefore now automatically publishes any repository as a fully compliant SPARQL endpoint.
  • Binary RDF support
    Sesame 2.5 includes a new binary RDF serialization format. This format has been derived from the existing binary tuple results format. Its main features are reduced parsing overhead and minimal memory requirements (for handling really long literals, amongst other things).

Sesame 2.5 SPARQL improvements have been made possible by Ontotext, in cooperation with Aduna. We hope you enjoy the new features and look forward to receiving your feedback.

See the release notes for a full overview of all changes.

The Sesame framework comes with a pre-packaged web service (often referred to as the Sesame server). This web service enables deployment of Sesame as an online RDF database server, with multiple SPARQL query endpoints and full update capabilities.

In the default setup of the Sesame server, however, there is no security at all: everybody can access all available Sesame repositories, can query them, add data, remove data, and even change the server configuration (e.g. creating new databases or removing existing ones). Clearly this is not a desirable setup for a server which is publicly accessible.

Fortunately, it is possible to restrict access to a Sesame server, using standard Java web application technology: the Deployment Descriptor.

In this recipe, we will look at setting up some basic security constraints for a Sesame server, using the web application’s deployment descriptor, in combination with basic HTTP authentication. For the purpose of this explanation, we assume you have a default Sesame server running on Apache Tomcat 6.

Sesame HTTP REST protocol

The Sesame server implements a RESTful protocol for HTTP communication (it’s a superset of the SPARQL protocol). What that comes down to is that the protocol defines specific locations, reachable by a specific URL using a specific HTTP method (GET, POST, etc.), for each repository and each operation on a repository.

This is good news for our security, as it means we can easily reuse HTTP-based security restrictions: since each operation on a repository maps to a specific URL and a specific method, we can have fairly fine-grained access control by simply restricting access to particular URL patterns and/or HTTP methods.
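
On the client side of such a setup, the Repository API can supply the required credentials. Here is a minimal sketch, assuming a Sesame server at http://localhost:8080/openrdf-sesame with a repository called ‘test’ (the URL, repository id and credentials are placeholders); HTTPRepository can be given basic-authentication credentials via setUsernameAndPassword:

import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.http.HTTPRepository;

// Sketch: connect to a secured Sesame server with basic-authentication credentials.
public class SecuredClientSketch {

    public static void main(String[] args) throws Exception {
        HTTPRepository rep = new HTTPRepository("http://localhost:8080/openrdf-sesame", "test");
        rep.setUsernameAndPassword("myuser", "mypassword"); // placeholder credentials
        rep.initialize();

        RepositoryConnection con = rep.getConnection();
        try {
            System.out.println("repository size: " + con.size());
        }
        finally {
            con.close();
            rep.shutDown();
        }
    }
}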

The SPARQL query language is extensible by nature: it allows implementors to define their own custom operators if the standard set of operators is not sufficient for the needs of some application.

Sesame’s SPARQL engine has been designed with this extensibility in mind: it allows you to define your own custom functions and use them as part of your SPARQL queries, like any other function. In this new recipe, I’ll show how to create a simple custom function. Specifically, we are going to implement a boolean function that detects if some string literal is a palindrome.
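
To give a flavour of what such a function looks like, here is a sketch of a palindrome check implementing Sesame’s Function interface (the function URI below is just an example namespace I made up):

import org.openrdf.model.Literal;
import org.openrdf.model.Value;
import org.openrdf.model.ValueFactory;
import org.openrdf.query.algebra.evaluation.ValueExprEvaluationException;
import org.openrdf.query.algebra.evaluation.function.Function;

// Sketch of a custom SPARQL function that checks whether a string literal is a
// palindrome. The function URI below is an example, not an official namespace.
public class PalindromeFunction implements Function {

    public String getURI() {
        return "http://example.org/custom-function/palindrome";
    }

    public Value evaluate(ValueFactory valueFactory, Value... args)
        throws ValueExprEvaluationException
    {
        if (args.length != 1 || !(args[0] instanceof Literal)) {
            throw new ValueExprEvaluationException(
                    "palindrome function requires a single literal argument");
        }
        String label = ((Literal)args[0]).getLabel();
        String reversed = new StringBuilder(label).reverse().toString();
        return valueFactory.createLiteral(label.equalsIgnoreCase(reversed));
    }
}

To use it in a query the function still needs to be made known to the SPARQL engine, typically through Sesame’s FunctionRegistry or the Java service provider mechanism.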

From OpenRDF.org: Sesame 2.4.1 is a bugfix release, fixing various reported issues in the new SPARQL 1.1 query functionality and bringing a number of stability/scalability improvements. Some highlights:

  • SPARQL GROUP BY with complex expressions as arguments is now evaluated correctly (SES-774)
  • SUM and AVG now silently ignore non-numeric values (SES-788)

  • Performance improvements in processing of aggregate operators (SES-789)
  • Fix for native store data corruption issue (SES-776)
  • Various other fixes and improvements in handling of property paths and aggregate operators

See the release notes for a full overview of all changes.