bafprp


Here is a problem for all the theoretical computer scientists out there.  Say you want to write a program that parses a certain file structure and extracts useful data.  A single file is set up as a bunch of independent records, each built from a handful of fields.  There are almost 1000 different kinds of fields and close to 200 different kinds of records.  The requirements for output are standard duplicate checking and removal, and programmable output, meaning you can change how the output looks or is handled without changing the program.

If you can think of a good solution, feel free to post your ideas. Today I will be writing about my solution: the things that worked, the things that did not, and the things that still do not.



I set up the release of version 1.0 of bafprp the other day. It has been a fun project, and I hope it is useful to others like myself who need an improvement over bafview without going to the big software companies. For today’s post I want to list some of the helpful tips I picked up while writing the program.

DSN-Less ODBC Connection String

I found it really annoying that in order to connect through ODBC you had to set up a data source. On Windows this process involves going to the Control Panel, Administrative Tools, and Data Sources. On Linux, it involves setting up the ODBC driver, setting up unixODBC to recognize the driver, and finally setting up the server information and data source in both your driver and unixODBC.

I spent some time looking online and found a nice and easy way to connect without an external data source on Windows. Basically, it involves supplying all the information in one string, like so:

std::string dsn = "DRIVER=sql server;DATABASE=" + _database + ";SERVER=" + _server + ";Uid=" + _user + ";Pwd=" + _password + ";";

You must also use SQLDriverConnect instead of the usual SQLConnect function, since SQLDriverConnect accepts a full connection string instead of just a DSN name and login info.

On Linux it is still a little annoying. You must still define your driver in unixODBC for starters, but you do not need to set up a DSN or give server type information. Also, support for DSN-less connections depends on the driver you choose. Since my primary use for bafprp was to connect to an MS SQL database, I used FreeTDS, which as of a recent update does have DSN-less capability. If you are using something else, please consult its documentation about this as well.

In FreeTDS the basic connection string goes like this:

std::string dsn = "DRIVER=FreeTDS;SERVER=" + _server + ";Uid=" + _user + ";Pwd=" + _password + ";DATABASE=" + _database + ";TDS_VERSION=8.0;Port=1433;";

TDS_VERSION and Port are FreeTDS-specific settings, but the basic idea remains the same. Note also that different versions of MS SQL require a different TDS_VERSION. If you are using FreeTDS, it is important to know the correct version.

As a final side note about connection strings, make sure to include the terminating semicolon. If you find your string unable to connect no matter what you do, this is probably the problem.
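One way to avoid forgetting the terminating semicolon is to build the string from key/value pairs. The sketch below is only an illustration: buildConnectionString and freeTdsConnectionString are hypothetical helpers that are not part of bafprp, and the attribute values are the ones from the FreeTDS example above. The result would be passed to SQLDriverConnect as described earlier.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of bafprp): joins KEY=value pairs into an
// ODBC connection string and ends every pair, including the last, with ';'.
std::string buildConnectionString(
    const std::vector<std::pair<std::string, std::string>>& attributes)
{
    std::string result;
    for (const auto& attribute : attributes) {
        result += attribute.first + "=" + attribute.second + ";";
    }
    return result;
}

// The FreeTDS string from the text, assembled through the helper.
// TDS_VERSION and Port are the example values used above; adjust as needed.
std::string freeTdsConnectionString(const std::string& server,
                                    const std::string& database,
                                    const std::string& user,
                                    const std::string& password)
{
    return buildConnectionString({
        {"DRIVER", "FreeTDS"},
        {"SERVER", server},
        {"Uid", user},
        {"Pwd", password},
        {"DATABASE", database},
        {"TDS_VERSION", "8.0"},
        {"Port", "1433"},
    });
}
```

Because every pair is written with its own trailing semicolon, the finished string cannot end without one.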

File Output

Sometimes when your program terminates unexpectedly, text being written to a file can be lost. If you are using fprintf, fwrite, or any of the other stdio functions for logging, you will probably come across this problem, because the output sits in a buffer until it is flushed. The solution I settled on to make sure the text gets written is to use std::fstream and call the flush() method when you are done writing each message. flush() pushes the buffered text out before returning, so it will be a bit slower, but for something as important as logging it is worth the cost.
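A minimal sketch of that idea follows. FlushingLog is a hypothetical class made up for this example and is not the logger used in bafprp; it uses std::ofstream and flushes after every line.

```cpp
#include <fstream>
#include <string>

// Hypothetical minimal logger: appends one line per call and flushes
// immediately, so the text has left the stream's buffer before write()
// returns.
class FlushingLog {
public:
    explicit FlushingLog(const std::string& path)
        : _file(path.c_str(), std::ios::out | std::ios::app) {}

    bool isOpen() const { return _file.is_open(); }

    void write(const std::string& message)
    {
        _file << message << '\n';
        _file.flush();  // hand the buffered text to the operating system now
    }

private:
    std::ofstream _file;
};
```

Note that flush() hands the data to the operating system. It protects against the program crashing, not against the machine losing power.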

Duplicate Removal

I remember reading about a similar situation in Programming Pearls. It involved making a hash of your data and comparing collisions, I believe. The situation I was in was that I had thousands of records, any of which could be a byte-for-byte duplicate of any other record in the original binary file. As Programming Pearls states, comparing each record against every other record is a joke. Hashing is definitely a better solution, but you need not create a complex hash table for something like this. I ended up pulling in a crc32 method to calculate the crc of the original bytes in each record. After the record parsing was completed I sorted the array of records by their crc value. It was then a very easy procedure to remove any duplicates, since they would be sitting right next to one another with the same crc.

One thing to note, however: std::unique in <algorithm> seems like a wonderful function, but I could not get it to work for the life of me. I expected it to sort the array and place any duplicates at the end, returning an iterator pointing at the start of the duplicates, so that everything after that point could simply be removed. That is not quite what it does. std::unique does not sort; it only removes consecutive equal elements, so the range must already be sorted, and it returns an iterator to the new logical end. The elements after that iterator are left in an unspecified state, they are not the duplicates, and they are meant to be dropped with the container's erase rather than std::remove. I managed to get the list sorted and std::unique did identify the correct number of duplicates (the number of elements after the returned iterator matched the number of duplicates I later removed by hand), but the tail of the array did not hold the real duplicates. I ended up removing valid records that were unlucky enough to have a high crc and thus were at the end of the array.

So in the end I went through the entire list and removed neighbors with the same crc, which worked quite nicely.
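For reference, here is a rough sketch of the sort-then-remove-neighbors approach written with std::unique and erase. It is not the code from bafprp: Record, computeCrc32, makeRecord and removeDuplicates are names invented for this example, the CRC routine is a plain bitwise CRC-32 standing in for whatever implementation you already have, and the extra byte comparison is my own assumption to guard against two different records sharing a crc.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical record type: the raw bytes plus a checksum cached at parse time.
struct Record {
    std::vector<unsigned char> bytes;
    std::uint32_t crc;
};

// Plain bitwise CRC-32 (reflected polynomial 0xEDB88320).
std::uint32_t computeCrc32(const std::vector<unsigned char>& data)
{
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

Record makeRecord(const std::vector<unsigned char>& bytes)
{
    return Record{bytes, computeCrc32(bytes)};
}

// Sort so that identical records end up adjacent, then erase the repeats.
void removeDuplicates(std::vector<Record>& records)
{
    // Order by crc first (cheap), by bytes only to break crc ties.
    std::sort(records.begin(), records.end(),
              [](const Record& a, const Record& b) {
                  if (a.crc != b.crc) return a.crc < b.crc;
                  return a.bytes < b.bytes;
              });

    // std::unique keeps the first of each run of equal neighbors and
    // returns the new logical end; erase drops the leftover tail.
    auto newEnd = std::unique(records.begin(), records.end(),
                              [](const Record& a, const Record& b) {
                                  return a.crc == b.crc && a.bytes == b.bytes;
                              });
    records.erase(newEnd, records.end());
}
```

The important detail is the last line: the iterator returned by std::unique goes to the container's erase, and nothing after it is inspected.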

Static Factories

I do not believe I have covered this concept here before, so I will do a brief summary. This subject deserves a much more detailed post, but here is the short version. If you are familiar with abstract factories you might know it's a bit of a pain to add a new object. If you are using some kind of enumeration you need to add the id to that list, and add the correct new-object code to the create method of your factory. Eventually you end up with enumerations of 100+ elements and a very scary switch statement. Fear not, however: there is a better way!

Imagine a system where all you need to do to add a new object to your factory is compile a .cpp file. Thanks to static factories this is not just a dream. The trick relies on a very natural side effect of static objects. The basic idea is simple. You have a main ‘maker’ class with a static registry variable that stores the names of, and pointers to, other maker classes. When you want an object to be built through this factory, you create a simple maker class for your object with a method called make, which is declared in the parent maker as pure virtual. The child maker defines a static instance of itself, and thus when the program starts that instance is created.

When the child gets created it calls the parent constructor with the name, or some other form of identification, of the object it creates. The parent maker then adds the information to its static database. When the programmer needs an instance of that object, they simply call the parent’s make method, which looks at the database, pulls up the correct child maker, and has the child make the object.

This technique is quite powerful if used correctly. It is absolutely necessary for data-driven applications in my opinion, and very handy when working with any kind of file data. Using this method you can separate file structure from logic in a very effective and pretty design.
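The pattern described above can be sketched as follows. This is a simplified illustration rather than the code in bafprp: Field, FieldMaker, DateField and DateFieldMaker are invented names, the static lookup is called newField here to keep it distinct from the virtual make, and the registry lives in a function-local static (my own choice) so that it exists before any child maker tries to register.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical product base class.
class Field {
public:
    virtual ~Field() {}
    virtual std::string name() const = 0;
};

// Parent maker: owns the registry and looks up child makers by name.
class FieldMaker {
public:
    virtual ~FieldMaker() {}

    // Build an object by name; returns an empty pointer for unknown names.
    static std::unique_ptr<Field> newField(const std::string& type)
    {
        auto it = registry().find(type);
        if (it == registry().end()) return std::unique_ptr<Field>();
        return it->second->make();
    }

protected:
    // Child makers call this constructor to register themselves.
    explicit FieldMaker(const std::string& type) { registry()[type] = this; }

    virtual std::unique_ptr<Field> make() const = 0;

private:
    // Function-local static: constructed on first use, so it is ready
    // no matter in which order the static child makers are initialized.
    static std::map<std::string, FieldMaker*>& registry()
    {
        static std::map<std::string, FieldMaker*> instance;
        return instance;
    }
};

// --- Everything below would normally live in its own .cpp file. ---
class DateField : public Field {
public:
    std::string name() const override { return "date"; }
};

class DateFieldMaker : public FieldMaker {
public:
    DateFieldMaker() : FieldMaker("date") {}

protected:
    std::unique_ptr<Field> make() const override
    {
        return std::unique_ptr<Field>(new DateField());
    }

private:
    static const DateFieldMaker registerThis;  // created at program start
};

const DateFieldMaker DateFieldMaker::registerThis;
```

With this in place, FieldMaker::newField("date") returns a DateField, and adding another field type means adding another .cpp file with its own maker; the parent class and its callers stay untouched.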


Log Files

While working on the BAF file project I learned a very interesting life lesson. Like most new computer science graduates, I have worked on few actual projects, and quite frankly the few I have finished could not be considered corporation ready. While writing a few console applications, file parsers, CORBA interfaces, etc., I learned a valuable lesson about documentation and log files. Log files can save your life if done correctly, and if you are really up to speed they can save many long hours of debugging too. In my case I fell in love with trace log messages. Trace logs are the log files that trace a program's execution so you can get a general idea of where your software is failing or producing an error. Most trace logs I have had the pleasure to work with fall short of what I wanted from the trace logs in my own program. Most of the time they are only a little more helpful than debug messages when trying to pinpoint an application’s problem. So when I faced the decision to add trace messages, I decided to go all out.

My first logging framework was log4cplus, which is a very nice and complete logging framework for C++. I used it for one Linux application I wrote and it works perfectly despite being several years old. Unfortunately, I was uncomfortable with how much work it took to write a message in each new function (two lines instead of one), and it would not compile on Windows, so for my next application, bafprp, I set out to write my own.

When I completed the class I was left with one macro per level to print a log message, i.e., LOG_TRACE( string ), LOG_DEBUG( string ), etc. I then made sure I was adding trace messages to each and every function as I wrote them, instead of later on in the design. For example, an empty function would look like this:


void BafRecord::getType()
{
    LOG_TRACE( "BafRecord::getType" );   // entry trace
    LOG_TRACE( "/BafRecord::getType" );  // exit trace
    return;
}

Each function would have a start and stop trace message.

Now this might raise an eyebrow or two, but I assure you it definitely helps when your program crashes and you do not have a nice debugger on hand, for example when a non-technical user is running it.

This is how my Linux application was programmed, and in bafprp I went the extra step of working these in as I wrote each function. Enough about this, though. A short while ago, after adding one of the major structure types to the program, my application slowed down from parsing a 12 MB file in 2 seconds to taking 2 minutes, and I was greatly concerned about the well-being of my design.

I tried many things. First I greatly reduced the number of memory copies my program executed. Then I changed my file input so that it would read the entire file at the start and reference a data bank instead of reading the file each cycle. However, none of these things put a dent in the processing time. So then I went online to try to find a nice and easy code profiler. A code profiler basically watches your program execute and tells you which functions your program is spending the most time in. I ended up finding LTProf, which allowed me to profile my program without recompiling or changing it at all. I am actually very surprised at how well it works with compiled binaries. Running against a release build of my program, it was still able to accurately determine function names and operate as if it had a window into the source code.

I found that my time stamp function, NowTime, which returns the current time as a string, was taking a noticeably large amount of my program’s time. Thinking back to what code uses this function, I discovered the flaw.

When I wrote the logger I put the log level check into the log class. This way, if the program tried to log a trace message it would get sent to the output class, which would then pass it on to the logs if the level was at or below the log level type. However, this was not enough. As it turns out, simply creating the log message twice in each function had a substantial effect on the processing speed of my program.

Needless to say, as soon as I moved the log level check to the macro the processing time dropped from 2 minutes to 30 seconds. Now some might say that trace logs this detailed are excessive, but I believe they can help greatly when dealing with a malfunctioning program. So as a final statement, be careful when you design systems, and be aware of how to use your tools. And if your program runs 3 times longer than a similar program, there is probably something horribly wrong.
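To illustrate what moving the check into the macro means, here is a small sketch. It is not the bafprp logger: the level names, logThreshold, writeLog, nowTime and the nowTimeCalls counter are all made up for this example, and the convention that a higher number means a more important message is an assumption.

```cpp
#include <ctime>
#include <iostream>
#include <string>

// Hypothetical levels: a higher value means a more important message.
enum LogLevel { LOG_LEVEL_TRACE = 0, LOG_LEVEL_DEBUG = 1, LOG_LEVEL_ERROR = 2 };

// Counts how often the (relatively expensive) timestamp string is built.
inline int& nowTimeCalls()
{
    static int calls = 0;
    return calls;
}

inline std::string nowTime()
{
    ++nowTimeCalls();
    char buffer[32];
    std::time_t now = std::time(nullptr);
    std::strftime(buffer, sizeof(buffer), "%Y-%m-%d %H:%M:%S",
                  std::localtime(&now));
    return buffer;
}

// Messages below this level are skipped.
inline LogLevel& logThreshold()
{
    static LogLevel level = LOG_LEVEL_DEBUG;
    return level;
}

inline void writeLog(LogLevel level, const std::string& message)
{
    std::clog << nowTime() << " [" << level << "] " << message << '\n';
}

// The level check lives in the macro, so a filtered-out message never
// reaches writeLog: no std::string is constructed and no timestamp is built.
#define LOG(level, message)                  \
    do {                                     \
        if ((level) >= logThreshold()) {     \
            writeLog((level), (message));    \
        }                                    \
    } while (0)

#define LOG_TRACE(message) LOG(LOG_LEVEL_TRACE, message)
#define LOG_DEBUG(message) LOG(LOG_LEVEL_DEBUG, message)
```

With the threshold at LOG_LEVEL_DEBUG, a LOG_TRACE call costs one integer comparison, which matters when there are two of them in every function.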


I finally got around to setting up svn for my project. Thanks in part to some new tools I recently came across, the transition was fairly painless, and after about 2 hours of cleaning up my old project and trimming the fat I uploaded what I will be working on to http://code.google.com/p/anaa/

I also created a second, more work-related project. It deals with reading and parsing Bellcore BAF files created by various soft switches in use around the world. These files contain call records that are highly coded and formatted according to the Telcordia specification in their GR-1100 document. This document is not very cheap to come by, so I have started work on improving one of the best free parsers available today, bafview. You can find this project here: http://code.google.com/p/bafprp/

