Making Indexes, Classification Rules and Searches Compatible

When creating a Lucene index, the index compiler is presented with a number of choices about how the indexable content is to be analysed for subsequent searching and classification. It is therefore necessary to choose the analysis that will best support the type of search that is likely to occur and, more importantly, that is amenable to the sorts of classification rules that will most effectively classify the content.

There are a number of ways that the analysis can be done, and the choice made has a profound effect on the search results. The choices include, for example, whether exact words are indexed, which would support exact-match searches; for this the Standard Analyser would be selected for indexing. On the other hand, the Snowball Analyser might be selected. This is a Porter-stemming analyser, which removes the distinction between words that share the same stem. For example, the words ‘innovation’, ‘innovative’ and ‘innovator’ would be stemmed such that only the term ‘innovat’ would be indexed. This approach is useful for generalizing searches. Nevertheless, in some circumstances preserving the distinction between the un-stemmed words might better serve the purpose of the resource repository.
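The effect of this choice on the index can be illustrated with a toy sketch in Python. It is neither Lucene nor the Porter algorithm: the `crude_stem` function, its suffix list and the sample documents are assumptions made purely for illustration, chosen so that the three example words collapse to one term.

```python
from collections import defaultdict

# Hypothetical suffix list, chosen only so that the three example words
# collapse to one term; a real Porter stemmer applies many more rules.
SUFFIXES = ("ion", "ive", "or")

def crude_stem(word):
    """Strip the first matching suffix from a lower-cased word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(docs, analyse):
    """Map each analysed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[analyse(token)].add(doc_id)
    return dict(index)

# Invented sample documents.
docs = {1: "innovation in gear design", 2: "an innovative gearbox", 3: "the innovator"}

exact_index = build_index(docs, str.lower)     # exact words kept apart
stemmed_index = build_index(docs, crude_stem)  # related words share one term
```

With the exact index a search for ‘innovation’ reaches only the first document, whereas in the stemmed index all three documents sit under the single term.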

Thus, the choice of the analyser is best informed by the classification rules that will be written and the sorts of searches that will be appropriate to the end-user base. Once again, as alluded to in an earlier blog, expertise is the only really useful arbiter of both the development of the system and the validation of its performance.

Posted in Technical development

Getting the Content for Resource Description

Such is the immaturity of the provision of open education resources that there is currently no standard for representing the description of a resource in terms of content or syntax. As a result of this, it is often the case that material suitable for, and made available for use as an OER, is sub-optimally described for easy – especially automatic – processing.

For Delores Extensions a Waypoint Proxy Document (WPD) is used to describe each resource. This document contains just enough information, in a prescribed XML format, to allow the resource to be identified uniquely together with its provenance and licensing status, and to be minimally classified.

Sometimes all of the data necessary to populate an instance of the WPD is available in the ‘top-level’ description of a resource. This, however, is fairly uncommon. Where the information is missing it is necessary to resort to some more complex interrogation of the resource content and the context it is embedded in. So, for example, the licensing status of a particular resource might be found only, say, on the title page of the resource itself. Finding such information may mean not only searching the text, but also transforming the content from one format to another to make automatic search possible. For example, in processing content for Delores Extensions it has been necessary to manually select files for conversion from PDF to text, using PDFBox to achieve the format migration.
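Once the text has been extracted, the search for a licence statement might be sketched as follows in Python. This is a minimal illustration rather than the code used in the project: the patterns, the function name and the sample title-page text are all assumptions, and real licence statements are worded in many more ways than these patterns cover.

```python
import re

# Illustrative patterns only; real licence statements are worded in many ways.
LICENCE_PATTERNS = [
    re.compile(r"creative\s+commons[^.\n]*", re.IGNORECASE),
    re.compile(r"\bCC[ -]BY(?:-(?:NC|SA|ND))*\b"),
    re.compile(r"all\s+rights\s+reserved", re.IGNORECASE),
]

def find_licence_statements(text):
    """Return every fragment of text that looks like a licence statement."""
    found = []
    for pattern in LICENCE_PATTERNS:
        found.extend(match.group(0).strip() for match in pattern.finditer(text))
    return found

# Hypothetical title-page text, standing in for the output of a PDF-to-text step.
sample = (
    "Introduction to Gear Design\n"
    "This work is licensed under a Creative Commons Attribution licence.\n"
)
statements = find_licence_statements(sample)
```

An empty result does not show that a resource is unlicensed; it only flags the resource for the kind of manual checking described here.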

Often, the missing descriptive data will be found remote from the resource itself. This is often the case where licensing information which applies to a set of resources is quoted only once, say, on the source home page.

When assembling the Delores Extensions collection of resources much ‘manual sleuthing’ has gone on. Much of this labour would be made redundant at a stroke were a standard for the description of OERs to be developed by the community and embraced by the OER providers.

Posted in Uncategorized

WordPress for hosting and describing learning resources

On 5 August I gave a presentation about Delores Selections with the above title to the CETIS Advances in Open Systems for Learning Resources workshop at the Edinburgh Repository Fringe meeting. Below is the PowerPoint presentation I used and the (lightly edited) notes taken by Nicola Osborne in her live blog of the event.

Slide 1
Delores is: Delivering Open Educational Resources for Engineering Design.

Slide 2
We have static and dynamic collections of university level OERs and other openly available resources relevant to Engineering Design. A static collection may include dynamic resources but the collection itself is static: once set up it stays as it is. Dynamic collections can have new materials added or taken away or developed.

ICBL, School of Mathematical and Computer Sciences, Heriot-Watt University and the University of Bath worked together on this project, funded by HEA and JISC under OER Phase 2.

Slide 3
We used WordPress to gather resources selected by experts in design engineering as being of high quality and usefulness for the collection. We aimed for about 100 objects in that collection of materials. The dynamic collection is everything underneath that. We use a tool called Sux0r which does Bayesian filtering of content – this is how Spam filtering works. We are using that idea the other way around – filtering to detect likely design engineering materials. Then we put material through a tool designed by Bath called Waypoint which enables faceted searching by automatic classification. Because Sux0r pulls RSS feeds from collections we know of, those feeds are continually updated and the collection presented by Waypoint continues to grow. I am going to focus on WordPress but I mention this context to point out that the technically difficult stuff, the effort, the hard thinking wasn’t really in the bit I am talking about.

Slide 4
So, starting off: what do we think we need in order to have this static collection? What are the needs for describing these OERs? First up you may not want to hold an actual copy of the resources. We decided we didn’t want to hold a copy of the resource; these are pre-existing resources hosted elsewhere. What metadata do we need? Title, description, authors, origin, date, subject, classification of some sort, licence, and probably something about the resource type. Users want to see that information, not necessarily locked up in an XML file. We want to embed a preview. We may or may not want to allow comments – but we don’t want to have to manage and spam filter those for the long term. We want something with a good web presence (and findable by Google) and something that has good participation (links in many directions, embedded material, widgets etc.; we want it to take part in the web). We want RSS feeds – great for pushing metadata around; we want embedded metadata (thinking RDFa, microformats etc.); we want flexibility; we want something easy to use and maintain (perhaps familiar); and possibly the option to export metadata.

Slide 5
The idea that we had was to use WordPress. One blog post per resource – if required you can attach resources that are single files to the post. This gives you a basic description and a good web presence. WP handles author, date and title, and you have tags and topics for classification. There are also extensions for metadata and additional functionality (a big developer community there).

Slide 6
We weren’t the first people with this idea…

Slide 7 & 8
Oxford’s Triton Project are running the Politics InSpires blog. They are creating OERs within WordPress – describing and commenting on current affairs and other items. They have focused on add-ons around that blog.

Slide 9 & 10
Edinburgh University have an initiative called OpenMed.

Slide 11 & 12
CETIS has been exploring the use of WordPress to disseminate our publications. We see a sneak preview and should note that resources are attached to posts and that it looks nothing like a blog.

Slide 13
Scriblio (formerly WPopac) – WordPress theme to create an OPAC using WordPress

Slide 14 & 15
How were our goals met? Well most of what we wanted was possible.
All those question marks are where WordPress gives you information about the post, not the resource described in the post, which matters for us because we are describing third party resources produced and hosted elsewhere. That is, you get the date, author etc. of the WordPress post you wrote to describe the resource, which isn’t really what you wanted. You get RSS feeds which link to and are about the descriptions in WordPress, not the resources.

But you do get a good website that is easy to use and maintain and familiar – though the more flexibility you use, the harder it is to maintain.

An Aside
One thing I like about WordPress is Trackbacks – you can see when you’ve been blogged or linked to – people can write about you and you can then aggregate those comments on your post.

Slide 16
So some customisation…

We used WordPress’s custom fields and we adapted a theme so that these are displayed. And we will have either a Plugin or theme extension written so that the right metadata goes into the RSS feed.

Slide 17 and demo of Selections site
So let’s have a look in the system for bridges.

We can find a description and preview of the resource, links to it etc. Looking at the admin screen you can see we are using custom fields to include metadata about the object and we have set up categories that fit the curriculum. Lesa in the audience here wrote all of the resource descriptions – she is a trained librarian and that has really been helpful here.

FINIT

Posted in Dissemination

Writing Query Rules to Classify Resources in Delores Extensions

The power of Waypoint comes from the fact that a set of classification facets is used to pigeon-hole individual OERs. The user can then ‘filter’ the content by selecting qualifying criteria from any number of facets.

In Waypoint, individual resources are classified against each of the facets selected to describe them, using rules which have been written for a particular domain. Lucene, which does the indexing, has a query language syntax which is based on Boolean logic but has additional terms which may be used to increase the power of discrimination through such things as proximity and range measures and fuzzy search.

So, a standard Boolean rule might be written: ‘gear AND machining’, which, not surprisingly, would find all content which contains both ‘gear’ and ‘machining’.

Alternatively, a phrase: ‘ “gear machining” ’ with the two words in double-quotes will return only those descriptions which have these two words next to each other in this order.

This can be modified to a proximity query – so: ‘“gear machining”~10’ will find any occurrence of these two words where they occur within 10 words of each other in either order.
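The difference between these three kinds of rule can be illustrated with a toy matcher in Python. It is not Lucene: it implements the rules as they are described here over a simple list of words, the function names and the sample description are assumptions, and Lucene’s own analysis, scoring and proximity calculation differ in detail.

```python
def tokens(text):
    """Lower-case a description and split it into words."""
    return text.lower().split()

def positions(words, term):
    """Return the positions at which term occurs in the word list."""
    return [i for i, word in enumerate(words) if word == term]

def matches_and(text, first, second):
    """Boolean rule: both terms appear somewhere in the text."""
    words = tokens(text)
    return first in words and second in words

def matches_phrase(text, first, second):
    """Phrase rule: the second term immediately follows the first."""
    words = tokens(text)
    return any(i + 1 in positions(words, second) for i in positions(words, first))

def matches_proximity(text, first, second, distance):
    """Proximity rule: the terms occur within `distance` words, in either order."""
    words = tokens(text)
    return any(
        abs(i - j) <= distance
        for i in positions(words, first)
        for j in positions(words, second)
    )

# Invented resource description.
description = "machining of a spur gear on a milling machine"
```

For this description the Boolean rule and the proximity rule with a distance of 10 both match, while the phrase rule does not, because the two words are neither adjacent nor in the required order.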

The Lucene syntax is very powerful. This, however, brings its own problems, since the performance of the classification is dependent on the rules that have been written, and the rule-writer is spoilt for choice. The fact is that rule-writing is more an art than a science, and expertise both in the practice of rule writing and in the domain is essential to a good domain classification.

For the novice, the best approach is to limit, in the first instance, rules to those using simple Boolean expressions. Then it is a matter of assessing the classification performance and making step-wise adjustments. Thus, the rules are tuned to eliminate invalid classifications and to incorporate new rules to deal with missed resources. It is always necessary for the classification performance to be checked by domain experts. We are taking this step-wise and expert-check approach in developing the classification for Delores Extensions.
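One way of assessing performance at each step is to compare the resources a rule selects with those a domain expert would place under the same facet value. The Python sketch below assumes that approach; the resource identifiers and judgements are invented, and precision and recall are standard measures rather than ones the project is known to have used.

```python
def precision_recall(selected, expected):
    """Compare a rule's selections with an expert's judgement."""
    selected, expected = set(selected), set(expected)
    true_positives = selected & expected
    precision = len(true_positives) / len(selected) if selected else 0.0
    recall = len(true_positives) / len(expected) if expected else 0.0
    return precision, recall

# Hypothetical resource ids: what the rule matched and what the expert accepts.
rule_matches = {"r1", "r2", "r3", "r4"}
expert_judgement = {"r2", "r3", "r4", "r5", "r6"}

precision, recall = precision_recall(rule_matches, expert_judgement)
invalid = rule_matches - expert_judgement   # candidates for tightening the rule
missed = expert_judgement - rule_matches    # candidates for a new rule
```

The `invalid` set corresponds to the invalid classifications to be tuned out, and the `missed` set to the resources that a new rule would need to catch.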

It should be remembered, however, that the OERs in Delores Extensions are pre-filtered by sux0r into two streams, so that only those engineering resources that apply specifically to engineering design are delivered to Waypoint for further classification.

Posted in Uncategorized

Training sux0r to recognise design engineering

I spent some time last month training sux0r to recognise what is and isn’t relevant to design engineering, and also to recognise what is relevant to each of the top level topics in the SEED curriculum that we are using to categorise resources. We should see the fruits of this training in filtered feeds coming out of sux0r as we add new feeds and as new resources are added to existing feeds. So where do we look for that? For simplicity I shall focus on the more important design relevant/not relevant categorisation here.

The sux0r API provides access to various feeds and other information relevant to the filtering; Santy has written in general terms about what is where in the Delores installation of sux0r. More specifically:

We can see from the Return vectors call that the dimension relevant to the categorisation is DesignEngRelevant, which has a vectorID of 3.

Using this in the ReturnCategories per vector call we get the categories categoryName=isDesignEng, categoryID=10; and categoryName=notDesignEng, categoryID=11.

To get the feeds we use this information in the ReturnItems per Category call. So

Immediate feedback from Chris, who is the project team member who knows about Design Engineering, is that the automatic categorisation is working well at the is/isn’t relevant level.
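For readers unfamiliar with this kind of training, the following is a minimal Python sketch of Bayesian text categorisation. It is not sux0r’s code and makes no claim about how sux0r is implemented: the class, the smoothing choice and the training phrases are assumptions for illustration, and only the two category names are taken from the post.

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = {}        # category -> Counter of words
        self.doc_counts = Counter()  # category -> number of training documents

    def train(self, category, text):
        self.word_counts.setdefault(category, Counter()).update(text.lower().split())
        self.doc_counts[category] += 1

    def classify(self, text):
        vocabulary = set()
        for counts in self.word_counts.values():
            vocabulary.update(counts)
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, -math.inf
        for category, counts in self.word_counts.items():
            # Log prior plus the smoothed log likelihood of each word.
            score = math.log(self.doc_counts[category] / total_docs)
            total_words = sum(counts.values())
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
            if score > best_score:
                best_category, best_score = category, score
        return best_category

# Invented training phrases; real training uses the text of feed items.
classifier = NaiveBayes()
classifier.train("isDesignEng", "gear design tolerance stress analysis")
classifier.train("isDesignEng", "bridge design load materials")
classifier.train("notDesignEng", "poetry reading medieval history")
classifier.train("notDesignEng", "history of modern art")

label = classifier.classify("design of a gear train")
```

Each item marked as relevant or not relevant adds to the word counts for that category, which is why the filtering should improve as more items are trained.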

Posted in Resource selection, Technical development

Poster at the HE Academy’s conference

Unfortunately none of us could attend the HE Academy’s annual conference, but we were able to send a poster. Here’s an image of the poster; click on it to get the (higher quality) pdf version.

Delores project poster from HE Academy conference

Posted in Dissemination

OERs Where are You?

The expectation of the DelOREs Project was that there would be a large number of engineering design OERs available suitable for use in teaching and learning at undergraduate level. We have found, however, that whilst there is a very great deal of material that could be of use either to students or to teachers in this discipline, there is a very limited amount of material that is offered expressly for ‘open’ use and which conforms to the required licensing criteria.

To provide a useful resource, it seems to us that material must conform to the minimal ideal requirements, these being:

  1. Subject appropriateness
  2. A specified quality
  3. Alignment with target audience needs
  4. Explicitness of conditions of use which define an OER

Item 4 could be relaxed in some circumstances, provided that the legitimate use is clearly flagged by the licence/copyright statement and would be of benefit to the target user group.

Candidates for OERs cover, of course, a wide spectrum, ranging from single stand-alone documents and web pages, to fully structured web sites, through webcasts to sets of lecture notes for a particular taught module of an engineering design course.

There are a number of OER collections and repositories whose brief is to provide material of this sort; however, that which has been provided specifically for engineering design students and teachers appears to be very restricted in both scope and quantity.

Frustratingly, there is a very great deal of material accessible on the World Wide Web which might be of value to both students and teachers, much of which has clearly been placed in public sight with the intention of allowing its use for education purposes. However, the legitimate usage of this material remains unclear. This might be because no formal indication is given of the terms of re-use, because the manner in which the copyright is expressed is more restrictive than intended, because the terms of re-use are expressed informally and ambiguously, and so on.

There is a certain difficulty, too, of catering in the same collection for teachers’ and students’ needs, since they are sometimes different. Students, for example, could quite legitimately read, mark and digest content contained within a resource the copyright of which was highly restrictive, but a teacher might run into trouble if they were to use it in any way that might be useful to them.

Material of this sort, notwithstanding its usefulness and the intentions of the provider, cannot be offered easily through a subject collection because of the danger of inadvertent encouragement to reuse material in a non-legitimate manner.

If nothing else this situation reinforces the belief that it is worthwhile pursuing the development of OER content – that is, material that is clearly marked as being ‘open’ –  and encouraging education, largely of the educators, in its usefulness to teachers and their students.

Posted in Resource selection