Yossi Dahan [BizTalk]


Friday, December 11, 2009

What not to do – avoid reading entire messages into memory unnecessarily

This one is fairly common, I suspect, and I can certainly see why – the temptation is simply too big – but too many pipeline components start by reading the message into memory when, with a little bit more effort, this could have been avoided.

One pipeline component I’ve seen, for example, receives a flat file and needs to remove records already processed (duplicate elimination) – quite a good thing to do in a pipeline, and I also liked the approach of doing so before the disassembler, to make the XML produced smaller.

V1 of the component used a memory stream – the incoming stream was read line by line, each line was assessed, and – if it was not a duplicate – it would get written to the memory stream.

When the component had finished going through the entire incoming stream, the memory stream would be assigned to the message, replacing the original stream, and the message would get returned to the pipeline.
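To make the V1 behaviour concrete, here is a minimal sketch of it. It is written in Python rather than .NET purely for illustration, and the names `remove_duplicates_eagerly` and `is_duplicate` are hypothetical; the duplicate check is passed in as a function because the post does not say how the real component recognised an already-processed record.

```python
import io
from typing import Callable


def remove_duplicates_eagerly(
    incoming: io.BufferedIOBase,
    is_duplicate: Callable[[bytes], bool],
) -> io.BytesIO:
    """V1 style: read the whole incoming stream before returning anything.

    `is_duplicate` is a hypothetical stand-in for the real duplicate check.
    """
    buffered = io.BytesIO()
    for line in incoming:              # the entire incoming stream is consumed here
        if not is_duplicate(line):
            buffered.write(line)       # every surviving record is held in memory
    buffered.seek(0)                   # rewind so the next reader starts at the top
    return buffered
```

By the time this function returns, the incoming stream has been read to the end and the whole outgoing message sits in the in-memory buffer.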

There are two downsides to this approach. The first is memory consumption: the component will always consume at least as much memory as the size of the (outgoing) message. Done properly, BizTalk would then release this memory, but only after it has completed processing the message.

The second downside is a potentially unnecessary delay in further processing of the message. One of the huge benefits of the pipeline, in my view, is its streaming fashion, where subsequent components, if developed in the correct manner, can start working on parts of the message before preceding components have completed their processing. Basically, each component passes back to the pipeline the portion of the message it has already processed, whilst working on the next portion.

It appears that the customer in question encountered memory issues, as the component’s code was later changed to use a virtual stream instead of a memory stream; a virtual stream is effectively a stream that uses disk for storage instead of memory.

This solves the memory consumption issue, but merely replaces it with IO operations which may have an even bigger impact on the server’s overall performance (and does not address the processing delay point at all).

What would have been the correct way to implement this in my view?

The component should have created a custom stream wrapping the original stream from the message. It would then replace the message’s stream with the custom stream and immediately return the message to the pipeline. Note that so far the component has not touched the message stream – zero bytes have been read.

As BizTalk (and not the component!) reads the message (for instance when persisting it to the message box), the custom stream’s read function is called. That function contains the logic that reads the underlying stream (the original stream received by the component), probably buffering reads until the end of a line for simplicity (although in many cases this is not necessary), and assesses whether the record is a duplicate or not. If it is a duplicate, the function simply reads the next line, and so on until a non-duplicate record is found, at which point that line is returned as the output bytes of the read function.

This effectively means that the next component, or the message box, will receive the message line by line, duplicate records removed, without having to wait for the component to process the entire message, and with only a maximum of one line ever loaded into memory.
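The wrapping idea can be sketched as follows. This is an illustration in Python, not BizTalk code: `DuplicateFilterStream`, `Message`, `execute` and the `is_duplicate` callback are all hypothetical names, and a real component would derive from the .NET `Stream` class and work with the pipeline's own message type instead.

```python
import io
from typing import Callable


class DuplicateFilterStream(io.RawIOBase):
    """Wraps a source stream and drops duplicate lines as it is read."""

    def __init__(self, source, is_duplicate: Callable[[bytes], bool]):
        super().__init__()
        self._source = source
        self._is_duplicate = is_duplicate   # hypothetical duplicate check
        self._pending = b""                 # unread remainder of the current line

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        # Pull from the source only once the previous line is fully handed out.
        while not self._pending:
            line = self._source.readline()
            if not line:
                return 0                    # underlying stream is exhausted
            if not self._is_duplicate(line):
                self._pending = line        # duplicates are skipped silently
        count = min(len(buffer), len(self._pending))
        buffer[:count] = self._pending[:count]
        self._pending = self._pending[count:]
        return count


class Message:
    """Hypothetical stand-in for a pipeline message with a body stream."""

    def __init__(self, body):
        self.body = body


def execute(message: Message, is_duplicate: Callable[[bytes], bool]) -> Message:
    """Component side: swap the stream and return at once, reading nothing."""
    message.body = DuplicateFilterStream(message.body, is_duplicate)
    return message
```

Nothing is read from the original stream until whoever receives the message calls `read` on its body, and each call pulls only as many lines from the source as it needs to produce the next non-duplicate record.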


5 Comments:

  • Hi Yossi,

    Any chance of posting some sample code to demonstrate this technique? Sounds very powerful but probably easier to explain via an example.

    Cheers
    Mark

    By Blogger Mark, at 11/12/2009 16:11  

  • +1 on the request for a sample.
    Cheers
    Benjy

    By Blogger Benjy, at 12/12/2009 18:26  

  • I'll look at producing a sample and posting it (as a new post, as it's easier to find) as soon as I can; it might take a couple of weeks though, given my work schedule.

    Have to say though that the FixMsg sample in the SDK demonstrates this pretty well.

    The stream implementation there is pretty simple, but the pipeline component side is all you need.

    The stream will be as simple or as complex as the specific requirements dictate, and that's 'standard' stream coding, nothing BizTalk specific; I suggest you guys take a look at that as well

    By Blogger Yossi Dahan, at 15/12/2009 20:37  

  • I don't quite get this. AFAIK there are only two types of streams: either memory-based or IO-based. correct me if i'm wrong, but this blog entry basically says you shouldn't use either, but instead use a "custom stream." what exactly is this "custom stream" you are talking about?

    By Anonymous Dexter Legaspi, at 15/01/2010 13:20  

  • Mark, Benjy, Dexter - 'watch this space' as there's more to come shortly.

    Dexter - in short - the main idea is that the component itself does not call Read() even once.

    Instead it creates an instance of some custom stream, which wraps the stream of the message received (doesn't care if this is a memory stream or some IO stream) and places that in the message.

    The 'magic' happens in this custom stream - when it is read (by something further down the line - ideally BizTalk itself) - at this point the custom stream is likely to call the underlying stream's Read method, but would then apply some behaviour to the bytes read before returning them to the reader.

    As I said - more on this to be published shortly (I hope)

    By Blogger Yossi Dahan, at 15/01/2010 14:30  
