User:Alan/Software Development
From New Orleans Wiki
Relay is working out well development wise. Scale is a huge concern. Scale is always a huge concern. Relay could become much faster, but it would require profiling tools that I do not have.
The database design will pivot around the tag table. I'll use tags for most associations. I am going to create a boolean field to indicate private tags, or tag type.
I'm realizing that tags require some wicked joins.
Each posting gets a directory that contains an XML added to that posting. These may be comments or form data. Lucene will step in and help with that mess.
Attached forms, say a classification of a historic photograph, or some sort of structured comment form. Does one create a MySQL table for that particular form? Or does one create a field and type table? How much information can become structured? It's simple enough to pull out hyperlinks from syndicated feeds, to determine where they are coming from. Keeping the XML around means creating ad hoc tables.
Perhaps tables for the obvious things. Person, image, hyperlink, address, geocode, organization, and these can be optimized. Some obvious relationships exist. Others can be tagged into existence. Two stage search? Find on structured table, then find associations? Agents?
Need a scripting language for such a search. Thinking about this anyway.
Tagging is cool, but data needs a home. Looking to create containers, based on community or relationships. People can join a tag group and have page constructed from their tags. Using the box column model, then can place resources in different columns and tag to fill the columns. Somethings are sticky.
Continue hacking mozile to create an easy feed generator, almost a blog, but not quite. Limited formatting. Keep it semantic.
|
[edit] Squandered Heritage
The end goal is to place the demolitions on a map of New Orleans.
What are the minimum steps necessary to get something over the wall?
This project can stop and start. The source material will be stored in WordPress. If there are problems with storage, the source material can be reclaimed if needed. One must not be too concerned about translating something incorrectly. The most important thing is to have a consistant id for the feed and the item, so that information that is assigned to an item can be reassigned if the feed needs to be reimported.
The feed needs to be converted from RSS 2.0 to Atom 1.0. There is an existing converstion script that needs to be revived. One must not concern oneself with the layout of the feeds. It does not have to a generic, simply an import of Squandered Heritage.
It is not important that the import of the feed header be flawless. If there are problems, I can fix those headers by hand. Comments feeds do not need to be recorded either. I can iterate over the document collection and persue the comments later.
The first step of an import is a difference to determine if the item has already been imported. We extract the same count of most recent items in the stored feed as there are in the imported feed. We then check for GUIDs that do not already exist.
We will have to find a way to iterate through the document collection, to obtain previous articles.
To begin the import, we do not need to do the difference. We can, in fact, wipe the document collection and reimport.
We will do the split and import of the articles using whatever we have on hand. There will be no, near term changes to the import mechanism.
- Create an RSS 2.0 to Atom 1.0 converstion stylesheet.
- Create a converstion pipeline for Squandered Heritiage.
- Convert content to XHTML.
- Review import scripts.
[edit] Detecting Feed Changes
Issues to resolve.
- Going beyond the first elements in the feed.
- Updated dates?
- Differencing content.
[edit] Upgrades and Issue Fixes
- For each document iterator.
[edit] Feed Import
One must first convert the RSS 2.0 feed of Squandered Heritage into syndibase. This requires first converting the RSS 2.0 feed to Atom 1.0. Then the Atom 1.0 feed is differenced against the existing collection of feeds in the database.
The difference can be performed by checking the index of the documents in the database for each item, or the most recent entries can be pulled from the database, a document materialized, and the difference is performed in XSLT. Those that are not in the database are saved into the database.
Concurrency concerns, so long as only one thread imports a feed, so long as two feeds are not imported at once, we are going to be okay.
[edit] RSS 2.0 to Atom 1.0
Rather than write out how I'm going to convert the two, I'm simply going to convert the two.
- http://www.atomenabled.org/
- http://blogs.law.harvard.edu/tech/rss
- http://www.dpawson.co.uk/xsl/sect2/pretty.html
[edit] Syndibase
- http://www-128.ibm.com/developerworks/xml/library/x-extatom2.html
- http://www-128.ibm.com/developerworks/xml/library/x-extatom1/
[edit] Caching
- Gradually reintroduce MRU caching of documents.
[edit] Feeds
- A photo feed, probably Flickr.
- Craig's List.
[edit] Importing
Regarding importing feeds. The first, most obvious thing to do, is to get everything into Atom 1.0, my storage format. Then I can hit it with the feed annotation tools. The calendars and whatnot.
When I am importing parsed forums, it has been necessary for me to perform a difference based on values that that are not otherwise available to
The end goal is to place the demolitions on a map of New Orleans.
What are the minimum steps necessary to get something over the wall?
This project can stop and start. The source material will be stored in WordPress. If there are problems with storage, the source material can be reclaimed if needed. One must not be too concerned about translating something incorrectly. The most important thing is to have a consistant id for the feed and the item, so that information that is assigned to an item can be reassigned if the feed needs to be reimported.
The feed needs to be converted from RSS 2.0 to Atom 1.0. There is an existing converstion script that needs to be revived. One must not concern oneself with the layout of the feeds. It does not have to a generic, simply an import of Squandered Heritage.
It is not important that the import of the feed header be flawless. If there are problems, I can fix those headers by hand. Comments feeds do not need to be recorded either. I can iterate over the document collection and persue the comments later.
The first step of an import is a difference to determine if the item has already been imported. We extract the same count of most recent items in the stored feed as there are in the imported feed. We then check for GUIDs that do not already exist.
We will have to find a way to iterate through the document collection, to obtain previous articles.
To begin the import, we do not need to do the difference. We can, in fact, wipe the document collection and reimport.
We will do the split and import of the articles using whatever we have on hand. There will be no, near term changes to the import mechanism.
- Create an RSS 2.0 to Atom 1.0 converstion stylesheet.
- Create a converstion pipeline for Squandered Heritiage.
[edit] Detecting Feed Changes
Issues to resolve.
- Going beyond the first elements in the feed.
- Updated dates?
- Differencing content.
[edit] Upgrades and Issue Fixes
- For each document iterator.
[edit] Feed Import
One must first convert the RSS 2.0 feed of Squandered Heritage into syndibase. This requires first converting the RSS 2.0 feed to Atom 1.0. Then the Atom 1.0 feed is differenced against the existing collection of feeds in the database.
The difference can be performed by checking the index of the documents in the database for each item, or the most recent entries can be pulled from the database, a document materialized, and the difference is performed in XSLT. Those that are not in the database are saved into the database.
Concurrency concerns, so long as only one thread imports a feed, so long as two feeds are not imported at once, we are going to be okay.
[edit] RSS 2.0 to Atom 1.0
Rather than write out how I'm going to convert the two, I'm simply going to convert the two.
- http://www.atomenabled.org/
- http://blogs.law.harvard.edu/tech/rss
- http://www.dpawson.co.uk/xsl/sect2/pretty.html
No detailed accounting, except perhaps in
[edit] Mix
Now made of Groovy.
Compard to Gant? Gant is a Groovy builder for Ant. I still do not like Ant and feel that it could easily be replaced by something that uses the AntBuilder to make use of historic tasks.
Here are the components of a Groovy based depenency engine.
- File globbing.
- File iteration interface, including filtering, and chaining of filters.
- Dependency graph.
- Caching of dependency, dependency database.
How hard can it really be?
[edit] XStrategy
Where does all this belong?
[edit] Enfilade
In order to create Enfilade, I need to create somethings that are neeto with XRegex. Like the ability to intercept and forward. I've decided against expressing XRegex in XML, of course, and would prefer to express it in Java or Groovy, which I've yet to embrace fully. (A full embrace of Groovy, the sooner the better, would be a great thing. Using something expressive halfway is a waste, or a lot of time is wasted fining lines.)
[edit] Relay
Goals for Relay are:
- Easy caching of generated documents.
- Easy to create a new runner.
- Easy to specify keys for cached documents and cached runners.
- Cached complex builders, like compiled scripts (XSLT or Groovy), caching for both thread safe and pooled copies.
- Composition of applications using XML, XSLT and pipelines XSLT By Convention.
Goals that might not be met.
- Expression of relays in XML for creation in XSLT.
[edit] XSLT By Convention
If what I wanted was a simple table view, simple crud of a MySQL table somewhere, where would I begin?
[edit] Scaffolding
Common to all of the object models that I see is concept of scaffolding. This means generated HTML displays of the object model. It is not entirely reflected, or conjured. A template is emitted, so that a person can begin to work with the template to create the view that they wish to see.
At first, I thought it would be nice to have a default grid, form, etc, view of the data, but in the end, I realized that this starter template helps a body work though the creation of the view, and give the programmer much more control, and is anyway, a faster path to development of the framework.
[edit] Model View Controller
[edit] Linking
[edit] Saxon
Two things to add to Saxon. Slurper support with the builder. Ugly? Perhaps. Also, build Relays in XSLT.
[edit] JavaScript Template Engine
- Velocity - Example syntax.
- FreeMarker - Another example.
[edit] Groovy
Create a Groovy source runner that will act like a Groovlet, but generate SAX. Rather than using the full HttpRequest and HttpResponse, simply pass in parameters. Those parameters can take the form of a parameters hash table, that is a simple hash of lists, like Saxon parameters or, Parameters parameters.
In order to support Groovy as a source in Relay, I'll have to upgrade Groovy, probably building my own Groovy from a checkout (already checked out). The resulting jar will have to be checked into the Mix repository prior to a roll out, or else, I'll have to add a special build step.
- Groovy Markup - The Groovy Markup start page.
- Getting Groovy with XML - A document on the different XML things you can do with Groovy. Nothing here is all that different from what I can do with XSLT 2.0.
- Interview with Guillaume - Where the project lead sees Goovy going.
- Groovy JSR To Do List - The tasks at hand for Groovy, markup generation is a big one.
- Mark it up with Groovy Builders - An article on Groovy Markup by Andrew Glover.
- Eclipse Plugin - This project looks stuck.
- Grails - Groovy rails.
[edit] Other Frameworks
- Rife - Rife.
- Trails - Java Rails.
- Naked Objects.
[edit] Foo
Coding by convention. I'm reading through Grails. The biggest difference between Grails and Relay is the controller concept, which is very simple, simpler than what I have in my mind. The orginal idea was some simple logic built into the relay description. Then I decided that the description would be created through XSLT. Having a controller class for each thing you might want to control is not a bad convention, however.
Maybe I start with the Grails controller and breakout in different direction? Or maybe I need to crib from it. Because what am I trying to control? The idea that I've had since, is the idea of using a URI to reference documents. The difference between my thoughts on this and Grails is that I don't see so many distinct objects. I'm also seeing things composed of many different little components, more pipelines, or controllers.
Though I've never been happy with validation in Relay. What is the next step? How do I introduce Groovy? I could start by using Groovy to query the database, using the StreamingSAXBuilder. That would pull Groovy into Syndibase at least.
Where can I use Grails? Can I create a parallel project that uses Grails to look at tables in MySQL? How many tables am I going to have? I'd like to move away, and toward an XML database. Then at least, I'd know what a controller is supposed to feel like.
Of course, a Relay builder could simply emit XML and that would not require me to rethink how to construct the key, or else how hard it to add the key when you are in a Groovy builder, not very, eh?
Groovy adoption is going to be slow.
Basic thing would be a configuraiton document, that document has a merge functionality, so you can override defaults by writing your down document, where a node at the same level replaces the node in the starting document.
Looking at the scaffolding for Grails, and I have to wonder, why not generate that directly from the description of the MySQL database? Also, where does Groovy fit in all this? Maybe I store Groovy in XML, maybe I add lines of Groovy through a web interface.
It really starts to occur to me that with coding by convention, objects almost get in the way. Can I express most of the conventions using scripted documents? What would that look like? What langauge do I use?
Do I use JavaScript for continuity with the client side? No. There is nothing to share, really, the server side should script in something Java like. Groovy is right for scripting.
A bit of glue, is what Groovy is. A compact expression of validation, is what Groovy is. That's what it is.
Logic still looks like it is expressed in XSLT, oddly enough. Who would have thought? (Yes, rambling, it's late, I'm tired.) Goals for Relay are easy caching, easy to create a new runner, easy to specify keys for documents and constructed runners, complex builders, like compiled scripts (XSLT or Groovy) get cached, caching for both thread safe and pooled copies, and all this comes form XSLT, the express of the things that need to get done as XML, and XSLT to turn that into the Relays that get it done.
Tired, yes.
Scaffolding? It seems that the objects get in the way, because an XML description of a table can be transformed into all sorts of tools, then rather than programming new interfaces, why not take the scaffolding a bit further, like have a grid scaffold, but choose the editors in the grid, simple enough. Or a password field, simply scaffold the conformation in there, and a Groovy snippit to MD5 it.
Relationships can be mapped easily, possibly inferred from the data dictionary.
Besides I'm going to big file XML, clustered, so why does it matter? Well, I guess RDBMS will always be there, but I don't know if I want a framework, just applications. The design goal now is reuse through declarations.
[edit] Won't Do
Relay cannot be used as Ant, since dependencies are resolved as you build the item, then the dependencies are kept, while with Ant, you must build a tree before hand.
[edit] Merge
With two basic types of merge in computer science, it seems that one could create a merging engine, one that merges queries that query flat structures, and merge those structures with annotations, or reduce the result set by those that are match a certian criteria.
This divides the problems of queries into selecting from a table or document collection first, then joining the table or document collection against another.
There is a query langauage build on URLs that can be used to express these queries.
[edit] Documents
Relay is going to assume an email or syndicated view of the world. There is an xml document, with meta data, such as gecoding, tagging, columns, and timestamps. Sometimes the meta data is stored remotely. Documents have relationships, such as heirarchical relationships, imagine a conversation thread.
[edit] Sheaf
Every once in a while, I think that the trick to Sheaf is memory mapped pages, then I have to research again and read that memory mapped pages are a performance penalty for small files.
[edit] Strata
Thus, the short-term objectives for Strata are as follows.
- Leaf nodes only to store references to file positions, not store the data itself.
- Branches also store only references to file positions, and hold a list of objects read from those file positions to use as keys.
- Thread-safe.
- File backed, paged.
- Supports efficent insertion and deletion, while maintaining the balance of the B-Tree.
- Equality comparison query.
- Compares byte buffer ranges, not objects.
Crazy thoughts included storing partial in the inner tiers on pages, but this presents a strange case where a partial key might not fit on a page, if the length of a key is unbounded.
[edit] Leaf Linking
The advantage of a B+Tree is that the linking of the leaf nodes makes for easier range queries and queries of values less or greater than a specific value.
This advantage comes with linking leaf nodes, so that one can move from one leaf node to the next leaf node containing value that are greater than or equal to the values than the given leaf node.
As leaf nodes are split, then are so linked.
[edit] Split Logic
The inner tree will only contain a particular key once. This is enforced at the split of the leaf. When choosing the pivot, we make sure that we do not choose the key that is also at the end of the full leaf tier.
When we split a leaf node, we use the last element of the left split as the key. (Keep in mind that this is a B+Tree so the key is not removed from the leaf.) As a result, it is impossible, during an insertion, to insert an object with a greater value than the pivot value at the end of that leaf.
When we choose the next pivot, we make sure that it is not the same as the object at the end of the leaf, so that we do not repeat keys in the inner tree.
The split will not find a candidate when the entire leaf is full of objects of the same value. In this case, no split occurs. Rather, a new leaf is created and linked to the end of the full leaf. This creates a linked list of identical values.
The only leaf node that is the exception is the right most leaf node. This leaf node contains only those objects who's values are greater than any value in the tree, thus any object is a candidate for promotion as a pivot into the inner tree.
Now the copacetic method must ensure that the inner tree contains each object only once.
[edit] Duplicate Values
Duplicate values are stored in leaf nodes. The presenence of duplicate values in leaves does not mean that the keys in the inner tree will contain duplicate values.
When splitting a leaf, we make sure that we do not choose a pivot object that is also the last object in the leaf, because according to our split logic, that last object in the leaf is present in the inner tree, except for the case of the right most leaf.
When a leaf is filled entirely with objects of equal value, it cannot be split. Given of leaf of this nature, the first time we attempt to add a addional equal value we create a new leaf linked to the tail. This new leaf will continue to accmulate objects of the specified value. Thus, when we encounter a leaf filled entirely with objects of equal values, we consider it the first leaf a sub list of leaves of identical values. The additional leaves are not linked from the inner tree, but are found by iterating the list of leaves. The last element in this sub list links to the next leaf in the list of leaves, one that is found through the inner tree.
Adding a value of lesser value will cause a split where a new leaf is created to the left containing only the additional value. Natually, the new left leaf links to this leaf
Adding a value of greater value indicates that we are at the right most leaf in the tree. We find the end of the sub list, and attach a new next leaf containing only the additional greater value. The split is between the head of the sub list and the newly created leaf. The pivot is the duplicated value in the sub list. The newly created leaf becomes right most leaf.
[edit] Choosing A Partition
Since each key only occurs once in the inner tree, choosing a parition means choosing the key in the center of an inner tier.
I am giong to create a Partition interface for the leaf tier to allow for different implementations.
Choosing the partition will require choosing from among duplicate keys, so that it might be better to create an array of the unique values, and choose the center most value. If the B+Tree is used to index unique values it would be faster to use the method for inner tiers.
[edit] Deleting a value
If the value exists only in the leaf and is not a key in the inner tree, then the value is removed from the list.
If the value exists as a key in the inner tree, and the value is duplicated in the B+Tree, then the key object is replaced with one of the duplicates. (This is necessary, since the object may be a proxy. Although it appears present in the tree, removal from the tree may preceed the deletion of the proxied object.)
If the value exists as a key in the inner tree, and the value is not duplicated in the B-Tree, then the key object is replaced by object before it the leaf in which the object is stored.
If there is no item before the object in which the leaf is stored, then if the key exists on the bottom most inner tier, we delete the branch that points to the tier. If the key exists on a tier above the bottom most inner tier, we use the key of the branch to the left of the branch that refernences the leaf teir and then delete the branch that references the leaf tier.
[edit] Merging after deletion
When merging two leaves, the only use of the a leaf object in a key is in the leaf above. In the case of the last leaf, so long as we merge the left node into the right, there is no need to concern ourselves with the presence of the key of the leave in the tree above the parent. We must assert that we have only seen the key once, and that the key only exists in the parent.
When merging two inner tiers, the tier that links to leaves is a special case, where we merge the left node into the right,
Merging takes place after deletion of leaf nodes.
We always merge the left tier into the right tier. Deleting the tier to the left.
When merging a leaf, the branch node in the parent tier for the left leaf tier is deleted.
When merging an inner tier, the branch to the left tier is deleted. It's object value is used as the key value of the terminal branch of the left tier. The branch to the right tier is set to reference the new merged teir.
Tiers are merged when two consecutive tiers contain half of their maxium number number of branches or objects.
The algorithm ensures that key values are deleted from the inner tree only at from the bottom most inner tier.
[edit] Concurrency
[edit] Reference
[edit] Memento 0.9
Thinking of getting rid of the alternate spelling, Momento. In any case, these are the tasks at hand.
- Documentation
- Write out the cluster linking logic.
- Organization
- Change Momento into an object, not an interface.
- Move first page interpretation to a static class.
[edit] Memento 1.0
A node contains three long values. For a common node these values are...
- Node type - One of element, local name, namespace uri, attribute, text, version, cluster, other internal.
- First child - The address of the first child of this node.
[edit] Swag
Tracing objects are on my mind. At one point, I imagined a thread-safe data structure that did not need to be locked to iterate. The head object is syncrhonized. Setting the head pointer is a synced operation where the head pointer points to a new head node where the next pointer of that node is set to point to the current head node.
Creating an interator for this list is synchronized operation that reads the current head node and assigns it to the iterator.
Otherwise, iteration is not syncrhonized. Reading the next pointer of each node is not a synchronized operation. The nodes themselves are immutable.
Currently, I flush the data structure by resetting the head node every 500th node.
This is an in memory log, so that ought to be okay.
This does mean that you might seek to see what happened in a recent operation, and find that the results are truncated. A data structure that always included some history would be prefereable.
The list would contain an array of these list structures. Appending would count the current size of the head of the list. When the 100th item is reached, the array is shifted so that the last list pops off the end. A new list is created at the head of the array. This is a syncrhonized operation.
A version count is kept for the array shifts. When an interator is constructed it receives the array and the version and a reference to the version in the list. When it reaches the end of a linked list, it uses difference of the version of the iterator against the version of the list as an offset to the next element in the array.
[edit] Getting Software Done
- Inner FUD - Working today, and seeing that Java has picked up JRuby. Okay, why am I not a Rails programmer? Because, this is round one. Round two is going to be working with the feeds.
Thus, I'm running around, reading the Groovy mailing list, reading up on JRuby. Frustrated, this morning about strong typing in Java, and how it makes you type out so much non-sense. Then back to where I started, nothing changed, not off the path, working on my B-Tree. No, I'm not going to adopt a scripting language, nor re-implement in a different language.
How do I allow myself to become so distracted? How do I prevent myself from flowing down another path? I'm readying Ruby blogs and wondering at how I can ape that breathless software velocity marketing for my open applications, or for anything really. When are people going to get sick of it and move on? Programmer mind. Back to my Java. Must not get off track. If I want to be a part of Groovy or JRuby, or want to make this an open project, with others involved, it only comes when I have some people using the software. How do I layout my game plan?
I want this to be the place where I go first, to keep from wandering off into the wilderness of doubt and evaluation.
Java has the best IDE I've ever used. It is typed strongly enough for name completion. Making wide ranging changes is made simple with strong typing, since it finds all the places where the signatures of the methods do not mach, where types have changed or gone missing. Eclipse holds my code for me, holds my decisions for me. Work with it.
Do not be afraid to cast. It is fair enough to determine the instance, then base your actions on that instance. It is a poor-man's generic.
Could you really implement some of the things you want to create in Ruby? Where is support for XSLT 2.0? Perhaps you need to look closer at Lucene, Saxon and ANTLR, the software built from Java that you truely respect.

