Sunday 2 October 2011

Cooking and feeding

This morning I updated my pog website. After church I cooked dinner.
In the afternoon I was doing various stuff on my comp. One of them was making sure my VMs I use for testing websites on the different IE versions were all up to date. But strangely when it came to running Windows Update, I got a message that I needed to add various domains to the trusted sites options in IE:
See which updates failed to install To continue, you must first add this website to your trusted sites in Internet Explorer.
The site cannot determine which updates apply to your computer or display those updates unless you change your security settings to allow ActiveX controls and active scripting. The best way to do this without lowering your security settings is to make this site a trusted website. Your security settings will continue to block potentially harmful ActiveX controls and scripting from other sites but you will be able to get updates.
To make this site a trusted website:
  1. In Internet Explorer, click Tools, and then click Internet Options.
  2. On the Security tab, click the Trusted Sites icon.
  3. Click Sites and under Add this website to the zone, copy and paste these website addresses. You can only add one address at a time and you must click Add after each one. Note that you may need to uncheck "Require server verification (https:) for all sites in this zone."
    • http://update.microsoft.com
    • https://update.microsoft.com
    • http://*.update.microsoft.com
    • https://*.update.microsoft.com
    • http://download.windowsupdate.com
Note: The asterisks and different addresses allow your computer to work with the site, no matter how you try to access it from your computer or the Web.
However, even after adding these sites to the trusted sites list, and restarting the machine, I still got the same message. It seems a bit strange that microsoft have decided to make it difficult to check for updates - I'd have thought they would want people to keep windows up to date.
After this I looked at feeds on some different photo galleries, and they seem to take quite a different approach to that suggested in the IBM article An overview of the Atom 1.0 Syndication Format. Rather than using

<summary>A picture of my new car</summary>
<content src="/mypng2.png" type="image/png" />

for the content (which doesn't display the image in FF and IE), the sites I looked at include the the image and summary as HTML for the content. The sites I looked at were Flickr (example feed), Red Bubble (example feed), and Picasa (example feed).

I also noticed that they use category elements for tags, and use encoded HTML e.g.
<div> converted to &lt;div&gt; rather than wrapping the HTML in CDATA tags.

After that I started making some more chocolate horns, but this time using an English recipe that uses pastry instead of bread.

I looked at using <category> tags for my feed, and found this article: Representing tags in Atom, which while good, doesn't come to any actual conclusions about a standard recommendation. Studying the same feeds as earlier, Red Bubble uses:
<category term="photography"/>

Flickr uses:
<category term="road" scheme="http://www.flickr.com/photos/tags/" />

Picasa doesn't use the category for tags, just specifying that it is a photo.
Wordpress uses:
<category><![CDATA[Photography]]></category>


Personally I like suggestion #1 from the aforementioned article best (e.g.
<category scheme="http://edward.oconnor.cx/tags/"
          term="foo" label="Foo" />

But I don't currently have any way for people to view a list of tags or photos with a certain tag on my website. So although I could use the above representation, the scheme URL would either lead to a 404, or I could put up a blank page there, which would be just as useful.

So I think what I am going to do for the moment is use RedBubble's method. Then if or when I add the ability to browse photos by tags, I can add in the scheme url, ala Flickr and Edward O’Connor's suggestion 1.

Now I knew how I wanted the tags formatted, I just had to look at how to extract them from the database and format them appropriately. With MySQL you can only have scalar values, so you can't select each image record with the keywords as an array. Instead you can:

  • Select the keywords as an array by doing a separate query for each record you want keywords for
  • Select the keywords and the record ids in a single query, then use the record ids as array keys when pulling out the rows e.g.
    $result = query('SELECT img_keywords.img_id, keywords.Subject
    FROM keywords
    LEFT JOIN img_keywords ON keywords.id = img_keywords.keywords_id
    WHERE img_keywords.img_id IN (list of ids you want records for or a subquery to select records here)'
    while($row = $result->fetch_assoc()){
    $images[$img_id]['keywords'][] = $row['Subject'];
    }
  • Select the keywords for each record as a comma separated string (using GROUP_CONCAT), then explode this into an array in PHP.
  • Don't try to get the keywords in an array, and just get them formatted how you want in the query (using GROUP_CONCAT)

There may be other methods you could use as well, but I think in theory that the last option above should be the most efficient, though I don't really like it as it is mixing data with presentation/markup.

Anyway, when I added the code for that last method to my query, the query was taking about 1s!!! So I spent quite a while trying to find out what was wrong. I looked at suggested keyword table structures and indexes, but it was actually the way my query was structured that was slowing it down.

This was the original query (before adding in anything to do with keywords / tabs, this took 0.056s to run:

SELECT images.id, images.Headline, images.filename, images.Rating, images.url_name,
SUBSTRING_INDEX(imageData.ImageDescription, '\n', 1) AS summary, imageData.GPSLatitude, imageData.GPSLongitude, imageData.GPSAltitude, DATE_FORMAT(imageData.Last_updated,'%Y-%m-%dT%TZ') AS Last_updated
FROM images
LEFT JOIN imageData ON images.id = imageData.id
LEFT JOIN img_categories ON img_categories.img_id = images.id
LEFT JOIN categories AS node ON node.id = img_categories.categories_id,
categories AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
AND parent.id = 14
GROUP BY images.id
ORDER BY images.id DESC
LIMIT 0,25

By moving the part of the query that selects all matching image ids into a subquery, the time was dramatically cut to around 0.0025s.

SELECT images.id, images.Headline, images.filename, images.Rating, images.url_name,
SUBSTRING_INDEX(imageData.ImageDescription, '\n', 1) AS summary, imageData.GPSLatitude, imageData.GPSLongitude, imageData.GPSAltitude, DATE_FORMAT(imageData.Last_updated,'%Y-%m-%dT%TZ') AS Last_updated
FROM images
LEFT JOIN imageData ON images.id = imageData.id
WHERE images.id IN(SELECT DISTINCT img_categories.img_id
FROM img_categories
LEFT JOIN categories AS node ON node.id = img_categories.categories_id,
categories AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
AND parent.id = 14)
GROUP BY images.id
ORDER BY images.id DESC
LIMIT 0,25

Finally, I also tried using the subquery just to select the category ids we wanted to show images for, and then selecting the image ids that match these in the main query, but this didn't seem to have any benefit over the above, with the run time still around 0.0025s.

SELECT images.id, images.Headline, images.filename, images.Rating, images.url_name,
SUBSTRING_INDEX(imageData.ImageDescription, '\n', 1) AS summary, imageData.GPSLatitude, imageData.GPSLongitude, imageData.GPSAltitude, DATE_FORMAT(imageData.Last_updated,'%Y-%m-%dT%TZ') AS Last_updated
FROM images
LEFT JOIN imageData ON images.id = imageData.id
LEFT JOIN img_categories ON img_categories.img_id = images.id
WHERE img_categories.categories_id IN(SELECT DISTINCT node.id
FROM categories AS node,
categories AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
AND parent.id = 14)
GROUP BY images.id
ORDER BY images.id DESC
LIMIT 0,25

Now when I add in the code to get the keywords, the query is much faster than the 1s it was taking previously - 0.0035s, a massive difference from a simple change.

#0.0035
SELECT images.id, images.Headline, images.filename, images.Rating, images.url_name,
SUBSTRING_INDEX(imageData.ImageDescription, '\n', 1) AS summary, imageData.GPSLatitude, imageData.GPSLongitude, imageData.GPSAltitude, DATE_FORMAT(imageData.Last_updated,'%Y-%m-%dT%TZ') AS Last_updated,
CONCAT('\n') AS tags
FROM images
LEFT JOIN imageData ON images.id = imageData.id
LEFT JOIN img_keywords ON images.id = img_keywords.img_id
LEFT JOIN keywords ON keywords.id = img_keywords.keywords_id
WHERE images.id IN(SELECT DISTINCT img_categories.img_id
FROM img_categories
LEFT JOIN categories AS node ON node.id = img_categories.categories_id,
categories AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
AND parent.id = 14)
GROUP BY images.id
ORDER BY images.id DESC
LIMIT 0,25

If you're good with MySQL, then the above may come as no surprise, but I had been under the general impression that it's more efficient to keep everything in one query than to have to use separate queries or subqueries. I will probably see if there's anything else I can do to improve it a bit more tomorrow.

No comments: