Installing Apache Solr on Linux as a Container

Have you ever needed to build a recommendation engine (something akin to Amazon’s) for a website? Or rank search results? Search at scale is hard. Scale can mean performance (how long are you really willing to wait for a result to come back from a website?), size (the Amazon search database is huge, I’m sure) or velocity (have you ever seen the rate at which a domain controller produces data?). Now, I’m not saying “Amazon uses Solr” here – I’m saying you can do it too, but you have to have the right tool.

Search is hard

I’m going to throw it out there. Search is hard. At least, it’s hard when you need to do it at scale. Many companies exist to ensure you don’t need to know how hard it is. Just think what happens when you type a search keyword into the Google search box. It’s just plain hard.

You can simplify the problem by understanding the data you want to search, how you want the results to be presented and how you intend to get there; and then utilizing the appropriate tool for the job.

Choose by Understanding the data

Are you searching structured data? Does it look like an Excel table with headers at the top? Do you know what sort of data is in each column? If so, you’ve got a great case for a relational database running SQL. If you need enterprise features, I’d suggest SQL Server or Oracle. If not, then try MySQL or PostgreSQL.

Are you searching small messages, like events, in a time-series manner? Do you want to run searches that start with “tell me what happened between these two points in time”? Then you want something like ELK or Splunk.

Smaller blobs of data? Well, if those blobs are key-value pairs, then try a NoSQL solution like MongoDB. If they are JSON, try Firebase.

How about bigger bits of data, like full documents? Then you want to go to a specific Document-based search system, like Apache Solr.

Sure, everyone in those environments will tell you that you can store all data in all of these databases. But sometimes, the database is tuned for a specific use – SQL databases for structured data, Splunk for time-series data, Apache Solr for document data.

Installing Apache Solr

My favorite method of trying out these technologies right now is to use Docker. I can spin up a new image quickly, try it out and then shut it down. That’s not to say that I would want to run a container in production. I’m using Docker as a solution to stamp out an example environment quickly.

To install and start running Apache Solr quickly, use the following:

docker run -d -p 8983:8983 -t makuk66/docker-solr

Apparently, there are companies out there that run clusters of Apache Solr in the hundreds of machines. If that is the case, I’m not worried at this point about scaling (although I do wonder what they could be searching!)

Before I can use Apache Solr, I need to create a collection. First of all, I need to know what the name of the container is. I use docker ps to find that out:

[Screenshot: docker ps output showing the running Solr container]
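The output looked roughly like the sketch below (columns trimmed) – the container ID and the auto-generated name are the ones referenced in the commands that follow:

docker ps
CONTAINER ID   IMAGE                 ...   PORTS                    NAMES
f4dc01217dd3   makuk66/docker-solr   ...   0.0.0.0:8983->8983/tcp   jolly_lumier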

In my case, the name is jolly_lumier. Don’t like the name? Well, let’s stop that instance and re-run it with a new name:

docker stop f4dc01217dd3
docker rm f4dc01217dd3
docker run -d -p 8983:8983 --name solr1 -t makuk66/docker-solr

Now I can reference the container by the name I chose. To create a collection:

[Screenshot: creating the collection inside the solr1 container]

Note that I am referencing the container by name (solr1) rather than by its ID.
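I no longer have that screenshot, but the command is essentially a docker exec into the named container to run Solr’s collection-creation script – a sketch, assuming the Solr binaries live under /opt/solr in the makuk66/docker-solr image:

docker exec -it solr1 /opt/solr/bin/solr create -c collection1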

But what are collections and shards?

Great question. Terminology tends to bite me a lot. Every technology has its own terminology, and usually it isn’t defined well – there is an assumption that you already know what they are talking about. Fortunately, Apache Solr has a Getting Started document that defines some things.

A collection is a set of documents that have been indexed together. It’s the raw data plus the index together. Collections implement a scaling technique called sharding in which the collection is split into multiple shards in order to scale up the number of documents in a collection beyond what could physically fit on a single server. Those shards can exist on one or more servers.

If you are familiar with map-reduce, then this will sound familiar. Incoming search queries are distributed to every shard in the collection. The shards all respond and then the results are merged. Check out the Apache Solr Wiki for more information on this. For right now, it’s enough to know that a collection is a set of data you want to search and shards are where that data is actually stored.

The Admin Interface

Apache Solr has a web-based admin interface. In my case, I’ve forwarded local port 8983 to port 8983 on the container, so I can access the admin interface at http://localhost:8983/solr. You will note that I could have created a collection (called a Core in the non-cloud version of Apache Solr) from the web interface.

Sending Documents to Solr

Solr provides a Linux script called bin/post for posting data. It’s a wrapper around a Java class, which means you need Java on your system in order to use it. Want to index the entire Project Gutenberg archive? You can do it, but there are some extra steps – most notably installing Solr and Java on your client machine.
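As a sketch of the tool (run from the Solr install directory on the client; the collection name matches the one created above and the file path is just an example):

bin/post -c collection1 ~/gutenberg/*.txt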

For my first test, I wanted to index some data I already had. I have a Dungeons and Dragons Spell List with all the statistics of the individual spells. This is in a single CSV file. To do this, I can do the following:

curl 'http://localhost:8983/solr/collection1/update?commit=true' --data-binary @Data/spells.csv -H 'Content-type:application/csv'

You will get something like this back:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">902</int></lst>
</response>

According to the manual, that means success (since the status is 0). Non-zero status means a failure of some description.

Searching Solr

Now, let’s take a look at the data. We can do that by using the web-UI at http://localhost:8983/solr/collection1/browse – it’s fairly basic and can search for all sorts of things. I can not only search for things like “Fire” (and my favorite Fireball spell), but also for things like “Druid=Yes” to find all the Druid spells.

My keen interest is in using this programmatically, however. I don’t want my users to even be aware of the search capabilities of the Solr system. I’d prefer them not to know what I’m running. After all, do you think “that’s a nice Solr implementation” when browsing your favorite web shop?

If I want to look for the Fireball spell, I do the following:

curl http://localhost:8983/solr/collection1/select?q=Fireball

The syntax and options for queries are extensive. You can read all about them on the Solr wiki. The response is an XML document. If I want it in another format, I use the wt parameter:

curl 'http://localhost:8983/solr/collection1/select?q=Fireball&wt=json'

It’s good practice to put quotes around your URL so that the shell doesn’t interpret special characters for you (like the ampersand, which sends a command to the background).

What else can Solr do?

Turns out – lots of things. Here are a bunch of my favorite things:

  1. Faceting – when you search for something and you get a table that says “keywords (12)” – that’s faceting. It groups things together to allow for better drill-down navigation (see the example after this list).
  2. Geo-spatial – location-based search (find something “near here”)
  3. Query Suggestions – that drop-down from google that suggests searches? Yep – you can do that too
  4. Clustering – automatically discover groups of related search hits
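For example, a facet query against my spell collection might look something like the sketch below. The School field name is an assumption about my CSV; rows=0 just suppresses the normal result list so only the facet counts come back:

curl 'http://localhost:8983/solr/collection1/select?q=*:*&facet=true&facet.field=School&rows=0&wt=json'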

What’s Missing

There is a lot missing from a base Apache Solr deployment. I’ll list some of the more important gaps here, but there is a solution – check out LucidWorks. LucidWorks was founded by the people who wrote Solr, and their Fusion product adds a lot of the enterprise features that you will want.

  1. Authentication – talking of enterprise features, top of the list is authentication. That’s right – Solr has no authentication – not even an encrypted channel. That means anyone (out of the box) can just submit a document to your Solr instance if they have a route to the port that it’s running on. It relies on the web container (Jetty, Tomcat or JBoss for example) to do the authentication. This isn’t really a big problem as authentication is pretty well documented. Incidentally, the Docker image uses Jetty for the web container.
  2. Getting Data In – I was going to call this crawling. However, it is more than that. If you have a fairly static set of data, then maybe the APIs and command-line tools are good enough. What if you want to index the data in your SharePoint application? How about all the emails flowing through your Exchange server? You will need to write (quite complex) code for this purpose.
  3. Monitoring – if you are running a large Solr deployment, then you will want to monitor those instances. Solr exposes this stuff via JMX – not exactly the friendliest approach.
  4. Orchestration – this is only important if you have gone into production with a nice resilient cluster. How do you bring up additional nodes when the load gets high, and how do you run multi-node systems? The answer is ZooKeeper, and it’s not pretty to set up and has several issues of its own.

What’s Solr not good at

Solr isn’t good at time-series data. Ok – it’s not that hard, but it’s still not the best thing for the job. Similarly, if you are doing per-row or per-field updates to records, then perhaps you should be using a relational database instead.

If you are indexing documents, however, then this is the tool to use. It’s easy to set up and get started. It has a REST interface for programmatic access and it likely does all the search and analytics related stuff you want.


Containers, Docker and Virtualization

I’ve recently had to do a bunch of research on containers, Docker specifically and virtualization in general. It started with someone who had obviously drunk the kool-aid – “I can use Docker for EVERYTHING!”

Wut?

No, seriously. Someone was actually advocating for using Docker for all their virtualization challenges. I knew next to nothing about Docker and I just couldn’t support that. I wasn’t coming from an informational position of strength though. I doubt I’m coming from that place now, but at least I’m closer. I know I’m going to get in trouble here, but here is what I noted.

So what is a Container anyway?

A Container encapsulates an application. This is as opposed to a Virtual Machine, which encapsulates an application and its operating system.

A hypervisor can run multiple virtual machines, each with its own operating system. Docker runs on Linux, and the containers all run Linux because they share the kernel of the underlying operating system.

Docker is one (and the most famous) of the container technologies. Heck – it feels like it’s the only one right now (although Microsoft is heading in that direction too). However, there are others – such as Virtuozzo. Docker has the corporate support (with such companies as Mesosphere) and the open community you need to work with it.

Containers are good for small repeatable units

Let’s say you have a web site – maybe based on the latest ASP.NET, or maybe based on Node.js. You’ve done an awesome job. Now you want to scale it and one of the first things you want to do is handle more connections. The natural thing to do is to run multiple copies of the server. But you don’t want to run the same number of copies all the time. You want to rapidly spin up copies when the load gets high and spin them down again when the load drops off.

You’ve got a good case for containers.

Creating a new copy of a container is a lightweight task. My tests took a matter of seconds, and I suspect even that was down to the speed and resources of my underlying host. Spinning up a new copy of a virtual machine can take minutes by comparison; replicating the operating system disk takes the bulk of that time.

Containers are not good for stateful applications

This is the bit that I’m probably going to get in trouble for. I think it’s a bad idea to run stateful applications, like a database, in a container. You can (and people do), but that doesn’t mean you should.

That’s because the whole idea of containers is that you can run multiple copies of the same thing. If the thing has state, you are losing a lot of the value from containerizing in the first place. It would be like running VMware and putting one virtual machine on the server. You can – it doesn’t mean you should.

Of course, you can mount an external disk onto a Docker container and use that for the data store. This gives you the ability to transition the container seamlessly to another machine by bringing down one and bringing up the other. But then the state is stored externally – not internal to the container.
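In Docker terms that means a volume mount. A minimal sketch, reusing the Solr image from earlier – the host path and the in-container path are assumptions about where this particular image keeps its index:

docker run -d -p 8983:8983 --name solr1 -v /srv/solr-data:/opt/solr/server/solr makuk66/docker-solr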

You can build containers in a build process

As a sometimes developer, I love this part. I can create a task in my Gulpfile that creates a Docker image (at least, I can if I am developing on Linux). This makes for a great workflow. Developers can be assured of running a golden image – the same as everyone else. If you have the same source files, the same container will result. If you are a developer on a team and have QA, then QA can encapsulate a problem, freeze the container and pass it to the developer for diagnosis. The “works on my machine” problem of the support and QA process reduces significantly.
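Under the hood that gulp task isn’t doing anything magical – it effectively shells out to the Docker CLI against a Dockerfile in the repository, something like this (the image tag is just an example):

docker build -t myteam/mywebapp:latest .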

That’s as opposed to Virtualization. If you are doing the same process in hypervisor-land, you have to set up an operating system as well. This can introduce environment drift that has nothing to do with your application. Technologies like PowerShell DSC, Chef, Puppet, AutomatedLab, Packer, Vagrant and Skytap all try to alleviate this problem of drift – in different ways and with different results. Containers isolate the developer from this problem.

How does this relate to microservices?

One of the enterprise architectures currently being espoused is to split an application into a number of independent pieces that are tied together with simple, network-based, APIs such as REST and JSON. Assuming each independent piece does one thing and handles state appropriately, it can be scaled independently from the rest of the application. Each independent piece could be a REST-based API – a microservice. Then other applications can use several of these microservices to produce a bigger workflow.

Microservices are an ideal fit for containers, but it’s really an orthogonal concern.

What about performance?

There are several reports on both sides of the fence here. My own tests indicate that – given the same hardware and the same number of containers / servers – the performance is pretty much identical. You may find some subtle changes, but you are architecting your application for scale anyway, so a couple of percentage points isn’t really significant. The actual differences I measured were less than a percentage point.

So are containers good?

It depends.

It depends on your application.

It depends on your expectation.

Let’s take two examples.

If you are writing a new web application and intend on using a PaaS database (such as Azure SQL as a Service or Amazon RDS) and a suite of microservices for authentication (like Auth0), email delivery (like SendGrid) and others (maybe mobile notifications, maybe IoT integration – who knows), then yes – you should definitely be investigating containers.

If your application depends on large databases, major integration work with other enterprise pieces, or state stored on disk – either plan to re-architect your application or resign yourself to the fact that you will be using virtualization.

How do I get started with Containers?

Install an Ubuntu 14.04 virtual machine, install Docker and start using it. I’ll be demonstrating a build of my Node.js application, including a docker build, soon.
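A minimal sketch of that first step on Ubuntu 14.04, assuming the Docker convenience install script that was current at the time:

# Install Docker via the convenience script, then let your user run it without sudo
curl -sSL https://get.docker.com/ | sh
sudo usermod -aG docker $USER
# Log out and back in, then verify the install
docker run hello-world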

PowerShell and Profiles

I showed off my PathUtils module in my previous article. Today I’m going to show off my profile. Every time you start a PowerShell prompt or the ISE, up to four profile scripts get run (if they exist).

The four profiles are:

> $Profile.AllUsersAllHosts
C:\Windows\System32\WindowsPowerShell\v1.0\profile.ps1

> $Profile.AllUsersCurrentHost
C:\Windows\System32\WindowsPowerShell\v1.0\Microsoft.PowerShellISE_profile.ps1

> $Profile.CurrentUserAllHosts
H:\Documents\WindowsPowerShell\profile.ps1

> $Profile.CurrentUserCurrentHost
H:\Documents\WindowsPowerShell\Microsoft.PowerShellISE_profile.ps1

When you edit or run $profile you actually edit or run the last one. I put most of my profile in the third one – profile.ps1. This is common to both the PowerShell prompt and the ISE. Then I only need to put the differences in the ISE or PowerShell prompt one.
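If you have never created that shared profile.ps1, a quick sketch of how to create and open it (nothing here is specific to my setup):

# Create the CurrentUserAllHosts profile if it doesn't exist yet, then open it
if (-not (Test-Path $Profile.CurrentUserAllHosts)) {
    New-Item -ItemType File -Path $Profile.CurrentUserAllHosts -Force
}
psedit $Profile.CurrentUserAllHosts    # inside the ISE; use your favorite editor elsewhere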

Let’s take a look at my Microsoft.PowerShellISE_profile.ps1 file first:

Set-Location H:

function edit {
    param($file, $force = $false);

    if ($force) {
        if (-not (Test-Path $file)) {
            New-Item -ItemType File -Path $file
        }
    }

    psedit $file
}

My “home” is on my H: drive – it’s a Synology Diskstation in my basement. I change the location of documents, etc. to it and then sync the contents so that they are available off-line if I am using a laptop. Then I define a function “edit” – this creates the file I want to edit if it doesn’t exist (and I use -force) and then opens it in the ISE editor.

My Microsoft.PowerShell_profile.ps1 is similar:

Set-Location H:

Set-Alias edit atom

Instead of the ISE Editor, I’m using atom. Aside from that, this also doesn’t do much. All the work of my profile is done in profile.ps1. Here is the top of it:

Import-Module PathUtils

Add-Path -Directory "${env:ProgramFiles(x86)}\PuTTY"
Add-Path -Directory "${env:USERPROFILE}\AppData\Roaming\npm"

This just sets up my Path. I’ve added PuTTY and the NPM area to my path. Next comes git setup:

Import-Module posh-git

Add-Path -Directory "${env:ProgramFiles(x86)}\Git\bin"

function global:prompt {
    $realLASTEXITCODE = $LASTEXITCODE
    $Host.UI.RawUI.ForegroundColor = $GitPromptSettings.DefaultForegroundColor
    Write-Host($pwd.ProviderPath) -nonewline
    Write-VcsStatus
    $global:LASTEXITCODE = $realLASTEXITCODE
    return "> "
}

Enable-GitColors
Start-SshAgent -Quiet

posh-git was introduced last time. It’s a module that provides a colorized prompt when you are in a git repository. This is very useful when it comes to development.

I then have a series of functions that have helped me as a developer:

function Edit-HostsFile {
    Start-Process -FilePath atom -ArgumentList "${env:windir}\System32\drivers\etc\hosts"
}

function rdp ($ip) {
    Start-Process -FilePath mstsc -ArgumentList "/admin /w:1024 /h:768 /v:$ip"
}
 
function tail ($file) {
    Get-Content $file -Wait
}
 
function whoami {
    [System.Security.Principal.WindowsIdentity]::GetCurrent().Name
}

function Get-ProcessorArchitecture {
    if ([System.IntPtr]::Size -eq 8) { return "x64" }
    else { return "x86" }
}

function Test-Port {
    [cmdletbinding()]
    param(
        [parameter(mandatory=$true)]
        [string]$Target,

        [parameter(mandatory=$true)]
        [int32]$Port,

        [int32]$Timeout=2000
    )

    $outputobj = New-Object -TypeName PSobject
    $outputobj | Add-Member -MemberType NoteProperty -Name TargetHostName -Value $Target
    if (Test-Connection -ComputerName $Target -Count 2 -Quiet) {
        $outputobj | Add-Member -MemberType NoteProperty -Name TargetHostStatus -Value "ONLINE"
    } else {
        $outputobj | Add-Member -MemberType NoteProperty -Name TargetHostStatus -Value "OFFLINE"
    }
    $outputobj | Add-Member -MemberType NoteProperty -Name PortNumber -Value $Port

    $Socket = New-Object System.Net.Sockets.TcpClient
    $Connection = $Socket.BeginConnect($Target, $Port, $null, $null)
    $Connection.AsyncWaitHandle.WaitOne($Timeout, $false) | Out-Null
    if ($Socket.Connected -eq $true) {
        $outputobj | Add-Member -MemberType NoteProperty -Name ConnectionStatus -Value "Success"
    } else {
        $outputobj | Add-Member -MemberType NoteProperty -Name ConnectionStatus -Value "Failed"
    }
    $Socket.Close()
    $outputobj | 
        Select TargetHostName, TargetHostStatus, PortNumber, Connectionstatus | 
        Format-Table -AutoSize
}

All of these came from someone else. I recommend poshcode.org – it’s got lots of good scripts in there that you can include in your profile.

The other thing to note about PowerShell is that there are lots of modules available. Some of them are really obvious – for example, if you are using Azure then you will want the Azure cmdlets which are contained in – you guessed it – the Azure module. However, I have a couple of modules I have found that are particularly useful.

  1. posh-git improves your git experience
  2. PowerShell Community Extensions (or PSCX) is a collection of useful cmdlets
  3. Carbon is another collection, targeted at devops

Funnily enough, I don’t tend to put PSCX and Carbon on my dev machines. However, I have an active lab build environment, and those two modules end up on every Windows production box as part of the build.

I’m sure I could make my life even easier in PowerShell as a developer. However, this profile and these modules provide an excellent base for my continuing development work.

PowerShell and PATH

If you’ve been using PowerShell for any length of time, you have definitely set up your profile. Mine is fairly straight forward. It’s stored in $profile, which in my case is H:\Documents\WindowsPowerShell\Microsoft.PowerShell_Profile.ps1 – an obnoxiously long name. I’d rather it be profile.ps1, but I digress.

Import-Module PathUtils
Import-Module posh-git

Add-Path -Directory "${env:ProgramFiles(x86)}\Git\bin"
Add-Path -Directory "${env:USERPROFILE}\AppData\Roaming\npm"

function global:prompt {
    $realLASTEXITCODE = $LASTEXITCODE
    $Host.UI.RawUI.ForegroundColor = $GitPromptSettings.DefaultForegroundColor
    Write-Host($pwd.ProviderPath) -nonewline
    Write-VcsStatus
    $global:LASTEXITCODE = $realLASTEXITCODE
    return "> "
}

Enable-GitColors
Start-SshAgent -Quiet

The posh-git module is for colorizing my prompt, and everything after the Add-Path calls is associated with it – actually, it’s boilerplate from the posh-git example. Add-Path is the bit of code that I wrote. It’s in a module called PathUtils and contains just one function:

function Add-Path {
  <#
    .SYNOPSIS
      Adds a Directory to the Current Path
    .DESCRIPTION
      Add a directory to the current path.  This is useful for
      temporary changes to the path or, when run from your
      profile, for adjusting the path within your powershell
      prompt.
    .EXAMPLE
      Add-Path -Directory "C:\Program Files\Notepad++"
    .PARAMETER Directory
      The name of the directory to add to the current path.
  #>

  [CmdletBinding()]
  param (
    [Parameter(
      Mandatory=$True,
      ValueFromPipeline=$True,
      ValueFromPipelineByPropertyName=$True,
      HelpMessage='What directory would you like to add?')]
    [Alias('dir')]
    [string[]]$Directory
  )

  PROCESS {
    $Path = $env:PATH.Split(';')

    foreach ($dir in $Directory) {
      if ($Path -contains $dir) {
        Write-Verbose "$dir is already present in PATH"
      } else {
        if (-not (Test-Path $dir)) {
          Write-Verbose "$dir does not exist in the filesystem"
        } else {
          $Path += $dir
        }
      }
    }

    $env:PATH = [String]::Join(';', $Path)
  }
}

Export-ModuleMember -Function Add-Path

This is fairly straightforward, but I’ve made it as all-encompassing as I could. You just use Add-Path <directory> and it will add that to the path. However, it’s not a permanent change, and that recently caused me a problem.

When Visual Studio (or any other program, for that matter) kicks off a command like bower or npm (or, more appropriately in my case, jspm), it does so from a cmd prompt – not a PowerShell prompt. The Add-Path cmdlet only changes the PATH for the current PowerShell session. So I need to store my PowerShell PATH as the user path, permanently.

To do this, I use the following bit of code:

[Environment]::SetEnvironmentVariable("Path", $env:Path,[System.EnvironmentVariableTarget]::User)

This uses the .NET Framework to store the current path in the user’s environment. You can replace ::User with ::Machine to store the path variable in the system environment, which is useful when you have just installed a new program.

Maybe I should write a Save-Path cmdlet as well?
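If I did, it would be little more than a wrapper around that one call – a minimal sketch (the cmdlet name and its -Target parameter are my invention here):

function Save-Path {
    [CmdletBinding()]
    param (
        # Where to persist the path: the per-user or the machine-wide environment
        [ValidateSet('User', 'Machine')]
        [string]$Target = 'User'
    )

    # Persist the current session PATH to the chosen environment scope
    [Environment]::SetEnvironmentVariable('Path', $env:PATH, [System.EnvironmentVariableTarget]$Target)
}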

Installing Windows 10 Server Technical Preview on Hyper-V

I’ve just downloaded the VHD version of the Windows 10 Server Technical Preview. All I have to do is create a VM using the Hyper-V Manager and attach the VHD and away I go, right?

Not so fast. If you do that, you only have one copy of the server, which doesn’t get you anywhere. You probably want three or four VMs. Do you really want to copy a multi-GB file each time? Also, Hyper-V Manager is for the junior admins who haven’t figured out automation, and PowerShell is where the action is. You want to do this properly, right?

Firstly, you will need to download the VHD version of the Windows 10 Technical Preview from Microsoft. Make sure you get the Windows 10 Server Technical Preview and select VHD during the registration. You will need a Microsoft account, and registration is required. Also, the file is 7.5 GB, so I would set the download off before you go to bed and wake up to it in the morning unless you have a blazing fast connection. Even then, it’s going to take some time.

Warning: There are a couple of sites on the first page of the Google results that purport to give you information on Windows 10 Server but lead to ransomware and viruses. You do have up-to-date virus protection, right? Just make sure your link is going to Microsoft and not somewhere else.

Once you have downloaded the VHD, put it in a place where you will find it again. I routinely clean out my Downloads folder so it’s not safe there. Then, make a directory to hold your virtual machine. I use my second drive – D:\Hyper-V\Windows10-TP.

First off, let’s create a differencing VHD based on the VHD that we downloaded:

[Code screenshot: creating the differencing VHD]
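The original screenshot is gone, but the command is essentially the Hyper-V New-VHD cmdlet pointed at a parent disk – a sketch with my paths substituted in (your download location and file names will differ):

New-VHD -Path "D:\Hyper-V\Windows10-TP\win10-test.vhd" `
    -ParentPath "D:\Hyper-V\Windows10-TP\WindowsServerTP.vhd" `
    -Differencing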

Next we need to create the VM. I’m going to connect it automatically to the external network switch:

[Code screenshot: creating the virtual machine]
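Again, a sketch of what that looked like – New-VM pointed at the differencing disk and attached to my external switch (the switch name is whatever you called yours, and Generation 1 is required for a .vhd file):

New-VM -Name "win10-test" -Generation 1 `
    -VHDPath "D:\Hyper-V\Windows10-TP\win10-test.vhd" `
    -SwitchName "External"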

Let’s take a look at what has been created:

[Code screenshot: inspecting the new virtual machine]
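Something along these lines shows the settings that matter here (the property names come from the standard Hyper-V VM object):

Get-VM -Name "win10-test" |
    Format-List Name, State, MemoryStartup, DynamicMemoryEnabled, ProcessorCount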

The memory is automatically defined as static with 512 MB and there is only one processor. I like dynamic memory on my lab machines – somewhere between 384 MB and 64 GB (the capacity of my machine). I also like to use 4 virtual cores.

[Code screenshot: configuring dynamic memory and processor count]
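Roughly, the reconfiguration was the following – the VM must be off while you change these, and the startup value is just a reasonable choice between the minimum and maximum:

Set-VMMemory -VMName "win10-test" -DynamicMemoryEnabled $true `
    -MinimumBytes 384MB -StartupBytes 512MB -MaximumBytes 64GB
Set-VMProcessor -VMName "win10-test" -Count 4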

Now you can use the following to start your new VM:

Start-VM win10-test

It will take a couple of minutes to start up the first time. When it gets to the product key, just click on Skip. A few minutes later you will be up and running, ready to explore the new features of Windows 10 Server.

Filtering Get-ChildItem Output with Regular Expressions

In my last article I produced a skeleton application that had a bit more than the Empty ASP.NET project but much less than the Starter ASP.NET application – a sort of halfway house. It had the scaffolding for Bower, Gulp and NPM, plus ECMAScript 6 transpiling, ESLint and Less all wired up.

I’m currently writing a short PowerShell cmdlet to clone that project into a new project. Instead of using the Visual Studio New -> Project… workflow, I’m going to go into PowerShell and run Clone-VSProject (my new cmdlet), then import the project into Visual Studio. This will save me time because I don’t have to spend the first half hour of the project wiring up the bits I need.

The first problem I ran into was cloning the project. I had thought to do the following:

Copy-Item -Path $src -Destination $dest `
    -Recurse -Exclude "node_modules","bower_components"

However, there is a bug that prevents -Exclude from working with -Recurse in this manner. Instead, I have to do it the long way around. Firstly, let’s define what I want to copy:

$Source = ".\BaseAspNetApplication"
$Destination = ".\MyNewProject"
$Exclude = @( "node_modules", "bower_components" )

To get a list of files, use Get-ChildItem -Path $Source -Recurse. This has two issues. Firstly, it doesn’t exclude anything (more on that in a minute) and secondly, it doesn’t handle long filenames – printing an error instead. Let’s get rid of the errors first since they are going to be in the node_modules or bower_components areas anyway and thus we don’t want them.

Get-ChildItem -Path $Source -Recurse -ErrorAction SilentlyContinue

Now for the filtering. I first of all need to turn my exclude list into a regular expression. You can do this by hand if you like, but it’s easy enough to handle:

$regexp = "("+(($Exclude | %{ "\\$_\\" }) -join "|") + ")"

This will turn our exclude list into this:

$regexp = "(\\node_modules\\|\\bower_components\\)"

The way to read this is “contains either \node_modules\ or \bower_components\”.

Now, back to getting the child items:

$Source = Resolve-Path -Path $Source
Get-ChildItem -Path $Source.Path -Recurse -ErrorAction SilentlyContinue | Where-Object {
    $_.Fullname.Substring($Source.Path.Length) -notmatch $regexp }

Take a look at the script block a moment. $_ is filled in with each file that Get-ChildItem produces. $_.FullName is the full path to that file (so something like C:\Users\Adrian\Source\GitHub\blog-code\BaseAspNetApplication\node_modules\foo.js). $Source.Path is the full path to the BaseAspNetApplication directory. Using the substring in this way is an effective way of saying “strip off the source path”. What will be left is something like \node_modules\foo.js. I then match that against my regular expression and only pass the object on if it does not match.

This will print out just the files we need – nothing more. I now need to copy them to their new location, so I capture that filtered listing in a variable called $files. This brings me to my next problem – if you just pipe these to Copy-Item then they get placed in the destination and the hierarchy is destroyed. I need to construct the directory structure within the new directory:

$files = Get-ChildItem -Path $Source.Path -Recurse -ErrorAction SilentlyContinue |
    Where-Object { $_.FullName.Substring($Source.Path.Length) -notmatch $regexp }

$files | Foreach-Object {
    Copy-Item -Path $_.FullName -Destination (Join-Path $Destination $_.FullName.Substring($Source.Path.Length))
}

I use the same trick on the copying that I do on the filtering to get the relative path, but then I join it to the Destination – this gives me an absolute path. All the intervening paths are created by Copy-Item so this creates the directory structure as well.

This is only the first part of the Clone-VSProject cmdlet I am producing. In the next part, I have to alter all the references to BaseAspNetApplication to my new project name. But that’s the subject of another article.

Setting up a Test Lab in Azure

I’ve been on a bit of a Web Applications kick for the last few weeks – I thought it was about time to take a step back and get some infrastructure work done as well. One of the tasks on my plate was the configuration of a lab environment within Microsoft Azure. I have a requirement to be running a number of Windows Server 2012 R2 machines that are linked to an Active Directory domain. I can then quickly install whatever software I want on them and start using them. I don’t care where the servers are – they could be on a Hyper-V box (which is what I would normally do). Today I’m going to do them in Azure. This article will kick things off by creating my first machine in Windows Azure plus all the other bits that I need as well.

To start with you need to download and install the Azure PowerShell SDK. This is your basic Web Platform Installer, so read the license, agree to the terms and install as normal. You will also need an Azure Subscription – if you haven’t signed up for a free trial yet, why not?

One thing I did find was that I needed to totally log out and back in again in order for the Azure PowerShell cmdlets to be recognized. I’m not sure if that is normal or not but it’s a reasonable thing to expect.

To create a connection to the Azure environment, use:

Add-AzureAccount

The command will prompt you via a GUI pop-up to log in – use the same credentials that you use to access Azure. Everything else can be done in the context of this authenticated session. I don’t want to embed credentials into my setup script and it’s not something I do on a regular basis, so this is perfectly fine with me.

I also want my machines to communicate over a virtual network on the back-end. Since Virtual Networks are per-subscription, not per-service, it makes sense to set these up ahead of time. This can be done in the portal (and I recommend doing it that way). However, if you insist on doing it on the command line, then you need an XML document describing your subscription’s network topology. You can then apply this configuration with PowerShell:

Set-AzureVNetConfig -ConfigurationFile $tempXmlFile

A good plan is to set up your network in the portal and then export it. You can do the export with the following PowerShell:

Get-AzureVNetConfig -ExportToFile "C:\temp\MyLab-Network.xml"

Creating a Virtual Machine

In order to create a virtual machine with PowerShell I need to go through the following steps:

  1. Create an Affinity Group (if it doesn’t exist)
  2. Create a Storage Account (if it doesn’t exist)
  3. Create a Cloud Service (if it doesn’t exist)
  4. Re-connect to the subscription with the storage account
  5. Create the Virtual Machine

Before I get started I need the following information:

  1. Location: Where am I going to run my environment? You can see the selections when you drop down the Location field while creating a VM in the Portal.
  2. Affinity Group: I get to choose this but it has to be unique within my subscription.
  3. Virtual Network: This comes from my network setup described above.
  4. Storage Account Name: I generate this but it has to be unique within Azure.
  5. Cloud Service Name: I generate this but it has to be unique within Azure.

Since I don’t want to re-create everything every time I’m going to store the storage account name and cloud service name for later. I’m creating a Create-Lab.ps1 script to hold all this. This is how it starts:

$VerbosePreference = "Continue"
$Location = "West US"
$AffinityGroup = "Lab"
$VNet = "Lab-Network"
$Subnet = "Subnet-1"
$IPNetwork = "10.1.1"

$StorageAccountFile = "$(Get-Location)\StorageAccount.txt"
$StorageAccount = "lab" + ([guid]::NewGuid()).ToString().Substring(24)
if (Test-Path $StorageAccountFile) {
    $StorageAccount = Get-Content $StorageAccountFile
}

$CloudServiceFile = "$(Get-Location)\CloudService.txt"
$CloudService = "lab" + ([guid]::NewGuid()).ToString().Substring(24)
if (Test-Path $CloudServiceFile) {
    $CloudService = Get-Content $CloudServiceFile
}

An Affinity Group is a collection of resources (like storage, cloud services, virtual machines in our case) that you want to be close to one another. It allows the Azure Fabric Controller (which is the thing you interact with to create your environment) to more intelligently locate the resources you want. The Affinity Group Name is your choice, but you will get an error if you try to add it twice so I’ve done a little test:

###
### Create the Affinity Group
###
$AffinityGroupExists = Get-AzureAffinityGroup -Name $AffinityGroup -ErrorAction SilentlyContinue
if (!$AffinityGroupExists) {
    Write-Verbose "[Create-Lab]:: Affinity Group $AffinityGroup does not exist... Creating."
    New-AzureAffinityGroup -Name $AffinityGroup -Location $Location
} else {
    Write-Verbose "[Create-Lab]:: Affinity Group $AffinityGroup already exists"
}

Once I have the affinity group created I can move on to the storage account and cloud service. These need to be created once as well. Since they have DNS names within the Azure namespace they need to be unique. What I’ve done above is create a name based on a globally unique ID. There is still a chance of a collision, so I prepended a specific string – in this case “lab”. Choose your own string to make it more unique to you if you bump into problems. (Side note: I wish Azure would allow GUIDs here – it would make it much easier to create unique names).

###
### Create the Storage Account
###
$StorageAccountExists = Get-AzureStorageAccount -StorageAccountName $StorageAccount -ErrorAction SilentlyContinue
if (!$StorageAccountExists) {
    Write-Verbose "[Create-Lab]:: Storage Account $StorageAccount does not exist... Creating."
    New-AzureStorageAccount -StorageAccountName $StorageAccount -AffinityGroup $AffinityGroup
    $StorageAccount | Out-File $StorageAccountFile
} else {
    Write-Verbose "[Create-Lab]:: Storage Account $StorageAccount already exists"
}

###
### Create a Cloud Service
###
$CloudServiceExists = Get-AzureService -ServiceName $CloudService -ErrorAction SilentlyContinue
if (!$CloudServiceExists) {
    Write-Verbose "[Create-Lab]:: Cloud Service $CloudService does not exist... Creating."
    New-AzureService -ServiceName $CloudService -AffinityGroup $AffinityGroup
    $CloudService | Out-File $CloudServiceFile
} else {
    Write-Verbose "[Create-Lab]:: Cloud Service $CloudService already exists"
}

Note I am storing the names I have chosen into text files. This allows me to re-use the names later on but it also allows me to do other things in automation with other scripts.

Now that I have created all my base services it’s time to create a virtual machine. My first machine is going to be special – it will be my domain controller. I need some more information here:

  1. The Image Name that I’m going to base my VM on
  2. The Instance Size or how much resources I want to give it
  3. The Admin Username and password for accessing it afterwards

The Image Name is probably the most complex here. You can use Get-AzureVMImage to get a full list of the images available to you – both on the marketplace and the ones you have uploaded. I’m going to use the latest Windows Server 2012 R2 Datacenter build that they have provided. To do that I need to do some filtering, like this:

$Image = Get-AzureVMImage |
    ? ImageFamily -eq "Windows Server 2012 R2 Datacenter" |
    Sort PublishedDate | Select -Last 1

You can get a list of the instance sizes using the following:

Get-AzureRoleSize | 
    ? SupportedByVirtualMachines -eq $true | 
    Select InstanceSize,Cores,MemoryInMb

For my purposes (a domain controller), a Medium or Basic_A3 instance size seems perfect. I want this to have a static IP address so here is my eventual configuration:

###
### Create the Domain Controller VM
###
$Image = Get-AzureVMImage | ? ImageFamily -eq "Windows Server 2012 R2 Datacenter" | Sort PublishedDate | Select -Last 1
$AdminUsername = "itadmin"
$AdminPassword = "P@ssw0rd"

# Set the VM Size and Image
$DC = New-AzureVMConfig -Name LAB-DC `
    -ImageName $Image.ImageName `
    -InstanceSize Medium -HostCaching ReadWrite
# Add credentials for logging in
Add-AzureProvisioningConfig -VM $DC -Windows `
    -AdminUsername $AdminUsername `
    -Password $Adminpassword
# Set Networking Information
Set-AzureSubnet -VM $DC -SubnetNames $Subnet
Set-AzureStaticVNetIP -VM $DC -IPAddress "$IPNetwork.11"


# Create the VM
New-AzureVM -ServiceName $CloudService -VNetName $VNet -VM $DC

The first few IP addresses in the subnet I constructed are reserved for the platform, so I started my addressing at 10.1.1.11 to be sure to give Azure enough room. Since a domain controller has to have a static IP address, this really matters.

By default you’ll get a PowerShell endpoint and an RDP endpoint created. You can find out where these endpoints are using PowerShell too:

Get-AzureEndpoint -VM (Get-AzureVM | ? Name -eq "LAB-DC")

So now you have the cloud service name (which you created, and which is stored in CloudService.txt) – add .cloudapp.net to that to get the computer name – and the port number (from the endpoint command above). You need to go and import the certificate before using Enter-PSSession to connect with the AdminUsername and password you created. You can find out all about this process from this blog post.
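Once the certificate is trusted, the connection is a standard remote PowerShell session over SSL – a sketch, with a made-up public port standing in for whatever Get-AzureEndpoint reported:

$cred = Get-Credential -UserName "itadmin" -Message "Lab admin credentials"
Enter-PSSession -ComputerName "$CloudService.cloudapp.net" -Port 5986 -Credential $cred -UseSSL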

Note that you can also create additional machines on your lab network in Azure in the same way – they just need a different IP address and name.