Introducing Clippit, get your slides out of PPTX.

logoTL;TR

Clippit is .NETStandard 2.0 library that allows you to easily and efficiently extract all slides from PPTX presentation into one-slide presentations or compose slides back together into one presentation.

Why?

PowerPoint is still the most popular way to present information. Sales and marketing people regularly produce new presentations. But when they work on new presentation they often reuse slides from previous ones. Here is the gap: when you compose presentation you need slides that can be reused, but result of your work is the presentation and you are not generally interested in “slide management”.

One of my projects is enterprise search solution, that help people find Office documents across different enterprise storage systems. One of cool features is that we let our users find particular relevant slide from presentation rather than huge pptx with something relevant on slide 57.

How it was done before

Back in the day, Microsoft PowerPoint allowed us to “Publish slides” (save each individual slide into separate file). I am absolutely sure that this button was in PowerPoint 2013 and as far as I know was removed from PowerPoint 365 and 2019 versions.

d596b287-af31-4ecb-a8e0-092b4dbbed2b.jpeg

When this feature was in the box, you could use Microsoft.Office.Interop.PowerPoint.dll to start instance of PowerPoint and communicate with it using COM Interop.

But server-side Automation of Office has never been recommended by Microsoft. You were never able to reliably scale your automation, like start multiple instances to work with different document (because you need to think about active window and any on them may stop responding to your command). There is no guarantee that File->Open command will return control to you program, because for example if file is password-protected Office will show popup and ask for password and so on.

That was hard, but doable. PowerPoint guarantees that published slides will be valid PowerPoint documents that user will be able to open and preview. So it is worth playing the ceremony of single-thread automation with retries, timeouts and process kills (when you do not receive control back).

Over the time it became clear that it is the dead end. We need to keep the old version of PowerPoint on some VM and never touch it or find a better way to do it.

The History

Windows only solution that requires MS Office installed on the machine and COM interop is not something that you expect from modern .NET solution.

Ideally it should be .NETStandard library on NuGet that platform-agnostic and able to solve you task anywhere and as fast as possible. But there was nothing on Nuget few months ago.

If you ever work with Office documents from C# you know that there is an OpenXml library that opens office document, deserialize it internals to an object model, let you modify it and then save it back. But OpenXml API is low-level and you need to know a lot about OpenXml internals to be able to extract slides with images, embedding, layouts, masters and cross-references into new presentation correctly.

If you google more you will find that there is a project “Open-Xml-PowerTools” developed by Microsoft since 2015 that have never been officially released on NuGet. Currently this project is archived by OfficeDev team and most actively maintained fork belongs to EricWhiteDev (No NuGet.org feed at this time).

Open-Xml-PowerTools has a feature called PresentationBuilder that was created for similar purpose – compose slide ranges from multiple presentations into one presentation. After playing with this library, I realized that it does a great job but does not fully satisfy my requirements:

  • Resource usage are not efficient, same streams are opened multiple times and not always properly disposed.
  • Library is much slower than it could be with proper resource management and less GC pressure.
  • It generates slides with total size much larger than original presentation, because it copies all layouts when only one is needed.
  • It does not properly name image parts inside slide, corrupt file extensions and does not support SVG.

It was a great starting point, but I realized that I can improve it. Not only fix all mentioned issues and improve performance 6x times but also add support for new image types, properly extract slide titles and convert then into presentation titles, propagate modification date and erase metadata that does not belong to slide.

How it is done today

So today, I am ready to present the new library Clippit that is technically a fork of most recent version of EricWhiteDev/Open-Xml-PowerTools that is extended and improved for one particular use case: extracting slides from presentation and composing them back efficiently.

All classes were moved to Clippit namespace, so you can load it side-by-side with any version of Open-Xml-PowerTools if you already use it.

The library is already available on NuGet, it is written using C# 8.0 (nullable reference types), compiled for .NET Standard 2.0, tested with .NET Core 3.1. It works perfectly on macOS/Linux and battle-tested on hundreds of real world PowerPoint presentations.

New API is quite simple and easy to use

P.S. Stay tuned, there will be more OpenXml goodness.

SharePoint 2013: Content Enrichment and performance of 3rd party services

There are a lot of cases when people write Content Enrichment service for SharePoint 2013 to integrate it with some 3rd party tagging/enrichment API. Such integration very often leads to huge crawl time degradation, the same happened in my case too.

The first thought was to measure the time that CPES (Content Processing Enrichment Service) spends on calls. I wrapped all my calls to external system in timers and run Full Crawl again. The results were awful, calls were in average from 10 to 30 times slower from CPES than from POC console app. Code was the same and of course perfect, so, probably my 3rd party service was not able to handle such an incredible load that was generated by my super-fast SharePoint Farm.

If you are in the same situation with fast SharePoint, awesome CPES implementation and bad 3rd party service, you probably should try async integration approach described in “Advanced Content Enrichment in SharePoint 2013 Search” by Richard diZerega.

But what if the problem is in my source code… what if execution environment affects execution time of REST calls… Some time later, I found a nice post “Quick Ways to Boost Performance and Scalability of ASP.NET, WCF and Desktop Clients” from Omar Al Zabir. This article describes the notion of Connection Manager:

By default, system.net has two concurrent connections per IP allowed. This means on a webserver, if it’s calling WCF service on another server or making any outbound call to a particular server, it will only be able to make two concurrent calls. When you have a distributed application and your web servers need to make frequent service calls to another server, this becomes the greatest bottleneck

WHOA! That means that it does not matter how fast our SharePoint Search service is. Our CPES executes no more than two calls to 3rd party service at the same time and queues other. Let’s fix this unpleasant injustice in web.config:

 <system.net>
   <connectionManagement>
     <add address="*" maxconnection="128"/>
   </connectionManagement>
 </system.net>

Now performance degradation should be fixed and if your 3rd party service is good enough your crawl time will be reasonable, otherwise you need to go back to advice from Richard diZereg.

SharePoint 2013: Content Enrichment for Large Files

There are a couple of guides on how to write Content Enrichment services for SharePoint 2013. One of them is official MSDN article “How to: Use the Content Enrichment web service callout for SharePoint Server“.

This article advice you two configuration steps to adjust max size of document that will be processed by CPES (Content Processing Enrichment Service).

  1. Modify web.config to accept messages up to 8 MB, and configure readerQuotas to be a sufficiently large value.
    <bindings>
     <basicHttpBinding>
       <!-- The service will accept a maximum blob of 8 MB. -->
       <binding maxReceivedMessageSize = "8388608">
       <readerQuotas maxDepth="32"
         maxStringContentLength="2147483647"
         maxArrayLength="2147483647" 
         maxBytesPerRead="2147483647" 
         maxNameTableCharCount="2147483647" /> 
       <security mode="None" />
       </binding>
     </basicHttpBinding>
    </bindings>
    
  2. Modify SPEnterpriseSearchContentEnrichmentConfiguration.
    $ssa = Get-SPEnterpriseSearchServiceApplication
    $config = New-SPEnterpriseSearchContentEnrichmentConfiguration
    $config.Endpoint = http://Site_URL/ContentEnrichmentService.svc
    $config.InputProperties = "Author", "Filename"
    $config.OutputProperties = "Author"
    $config.SendRawData = $True
    $config.MaxRawDataSize = 8192
    Set-SPEnterpriseSearchContentEnrichmentConfiguration –SearchApplication
    $ssa –ContentEnrichmentConfiguration $config
    

The concept is generally good, but what if you need to process files larger than 8MB? Let’s try to increase this number up to 300Mb for example (I think that ideally this limit should be not less than max file size allowed for your web apps).

Let’s change both values and run full crawl of SharePoint site. After that, if you are lucky, you will see something like that in your “Error Breakdown”:

crawl_errors_001

WAT? Something went wrong, but what it was … Let’s investigate ULS logs on the machine with Search Service. After a couple of unforgettable minutes of reading ULS logs, I’ve found the following error message:

[Microsoft.CrawlerFlow-cb9134ec-91c6-4bac-89f9-a0cc9fe1e481] Microsoft.Ceres.Evaluation.Engine.ErrorHandling.HandleExceptionHelper : Evaluation failure detected: Operator : ContentEnrichment Operator type : ContentEnrichmentClient Error id : 3206 Correlation id : 60ef1afd-038a-4f64-8230-2b2493923f80 Partition id : 0c37852b-34d0-418e-91c6-2ac25af4be5b Message : Failed to send the item to the content processing enrichment service. 49691C90-7E17-101A-A91C-08002B2ECDA9:#9: https://mysite.com/MyDoc.pptx id : ssic://780174 System.ServiceModel.EndpointNotFoundException: There was no endpoint listening
at http://MyServer:8081/ContentProcessingEnrichmentService.svc that could accept the message. This is often caused by an incorrect address or SOAP action. See InnerException, if present, for more details. —> System.Net.WebException: The remote server returned an error: (404) Not Found.
at System.Net.HttpWebRequest.GetResponse()
at System.ServiceModel.Channels.HttpChannelFactory`1.HttpRequestChannel.HttpChannelRequest.WaitForReply(TimeSpan timeout) –

WAT? “The remote server returned an error: (404) Not Found”. Does the service not exist sometimes? How could it be? Let’s go to IIS log (on the machine where your CPES is installed). Path to the IIS logs should look similar to this c:\inetpub\logs\LogFiles\W3SVC3\.

iislog

It is true – CPES sometimes returns 404.13 status. Let’s google what this status code means.

404.13 – Content length too large. The request contains a Content-Length header. The value of the Content-Length header is larger than the limit that is allowed for the server.

Seems that IIS is not ready yet to receive our 300Mb files. There is one more parameter in web config that should be tweaked to  handle really large files, this parameter is maxAllowedContentLength (default value is 30000000, that is ~30Mb). Let’s change it in web.config:

 <system.webServer>
   <security>
     <requestFiltering>
       <requestLimits maxAllowedContentLength="314572800" />
     </requestFiltering>
   </security>
 </system.webServer>

Recrawl your content once again, and Voila, strange errors gone! Enjoy your content enrichment!)