Forecast Cloudy – Going Big Data With Google BigQuery

google-cloud-platform

Although I am mainly a Microsoft and Azure centric guy, after reading a bit about Google BigQuery I got interested, so I decided to explore it a bit more here as well.

BigQuery is a RESTful web service that enables interactive analysis of very large datasets, working in conjunction with Google Cloud Storage. It is a fully managed cloud service that can be used to complement MapReduce-style processing. BigQuery (BQ) is reportedly based on Dremel, a scalable, interactive, ad hoc query system for analysis of read-only nested data. To use data in BigQuery, it first must be uploaded to Google Cloud Storage and then, in a second step, imported using the BigQuery HTTP API. BigQuery requires all requests to be authenticated, supporting a number of Google-proprietary mechanisms as well as OAuth.

Let's start with what BigQuery is not. It is neither an RDBMS nor a MapReduce technology. As I stated, BigQuery is based on an internal Google technology called Dremel. Dremel is a query service that allows you to run SQL-like queries against very, very large data sets and get accurate results in mere seconds. You just need a basic knowledge of SQL to query extremely large datasets in an ad hoc manner. At Google, engineers and non-engineers alike, including analysts, tech support staff and technical account managers, use this technology many times a day. BigQuery exposes the core set of Dremel features to third-party developers via a REST API, a command-line interface, a web UI, access control and more, while maintaining Dremel's query performance.

Why is BigQuery so fast? The answer lies in two core technologies that give BQ this performance:

  • Columnar storage. Data is stored in a columnar fashion, which makes it possible to achieve very high compression ratios and scan throughput.
  • Tree architecture. A tree topology is used for dispatching queries and aggregating results across thousands of machines in a few seconds.

BigQuery stores data in columnar storage, which means it separates a record into column values and stores each value on a different storage volume, whereas a traditional RDBMS normally stores the whole record on one volume.

image

Actually, this technology isn't that new for anyone who has dealt with data warehouse technologies for a while; it's a fairly hot topic today with SQL Server columnstore indexes and the SAP HANA in-memory column store, for example. As you may know, columnar storage has the following advantages:

  • Traffic minimization. Only the column values required by a query are scanned and transferred on query execution. For example, a query “SELECT top(title) FROM foo” would access the title column values only.
  • Higher compression ratio. One study reports that columnar storage can achieve a compression ratio of about 1:10, whereas ordinary row-based storage compresses at roughly 1:3. Because each column holds similar values, especially if the cardinality of the column (the variation of possible column values) is low, it is easier to achieve higher compression ratios than with row-based storage.

Columnar storage has the disadvantage of not working efficiently when updating existing records. In the case of BigQuery, it simply doesn't support any update operations, so the technique has mainly been used for read-only OLAP/BI workloads. Although columnar storage has long been popular in data warehouse design, Dremel/BigQuery is one of the first implementations of a columnar-storage-based analytics system that harnesses the computing power of many thousands of servers and is delivered as a cloud service.

One of the challenges Google had in designing Dremel/BigQuery was how to dispatch queries and collect results across tens of thousands of machines in a matter of seconds. It was resolved with the tree architecture: the system forms a massively parallel distributed tree, pushing a query down the tree and then aggregating the results from the leaves at blazing speed.

image    

By leveraging this architecture, Google was able to implement a distributed design for Dremel/BigQuery and realize the vision of a massively parallel columnar database on its cloud platform.

image

     

As mentioned above, BigQuery exposes Dremel to third-party developers via a REST API, a command-line interface, a web UI, access control, data schema management and integration with Google Cloud Storage. BigQuery and Dremel share the same underlying architecture and performance characteristics, so users can fully utilize the power of Dremel through BigQuery and take advantage of Google's massive computational infrastructure. This brings valuable benefits such as replication across multiple regions and high data center scalability. Most importantly, this infrastructure requires no management by the developer.

So why BigQuery over MapReduce? The difference is that MapReduce is a batch-based programming framework for very large datasets, whereas BigQuery is an interactive query tool for large datasets.

image

OK, how do I use it? Assuming you already have a Google Cloud account, you will have to create a new project from the dropdown in the Google Cloud Console.

image 

Once that is done, you can navigate to the project console and enable the BigQuery API for use with your project:

image

Now, in the left-side menu, you can pick BigQuery from the Big Data offerings.

image

 

Before we can run any queries, we need some data! There are a couple of options here:

  • Load your own data
  • Use Google-provided public data

For now I will settle on the second option. I will pick the Shakespeare dataset here. This dataset contains the words in Shakespeare's works, the word_count for each word, the corpus in which the word appears, and the date the corpus was written. First I will issue a simple SELECT. SELECT is the most basic clause and specifies what you want the query to return; FROM specifies which dataset we are using.

SELECT corpus
FROM (publicdata:samples.shakespeare)
GROUP BY corpus

image
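As an aside, the same query can also be issued from the bq command-line tool mentioned earlier. This is only a sketch and assumes the tool is already installed and authorized against your project:

bq query "SELECT corpus FROM publicdata:samples.shakespeare GROUP BY corpus"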

Let's switch gears and count something – say, the number of words in Shakespeare's works. Luckily, we have word_count, which represents how many times a particular word appeared in a particular corpus. We can just sum all of these values, and we are left with the total number of words he wrote.

SELECT SUM(word_count) AS count
FROM (publicdata:samples.shakespeare)

image

945,845 words! Pretty good – but there must surely be some duplicates. How would we query the number of unique words that he used?

SELECT COUNT(word) AS count, word
FROM publicdata:samples.shakespeare
GROUP BY word
ORDER BY count DESC

Here we use the COUNT function and group by word so as not to show duplicates; the result contains 32,786 rows – the number of unique words he used. I also order the results by count in descending order.

image

OK, let's finally add a WHERE clause. For example, I wonder how many times Shakespeare uses the word “Sir”:

SELECT word, SUM(word_count) as count
FROM (publicdata:samples.shakespeare)
WHERE word = "Sir"
GROUP BY word
ORDER BY count DESC

From the results I can state that Shakespeare was a really polite guy:

image

This of course is pretty basic. However, BigQuery does now allow joins: I could take my large table and join it to a smaller lookup table using standard SQL join syntax.

image
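As a rough sketch of what that could look like – the lookup table my_dataset.corpus_dates and its fields are hypothetical, not part of the public samples – a join against the Shakespeare table might be written as:

SELECT s.corpus, d.written_year, SUM(s.word_count) AS total_words
FROM publicdata:samples.shakespeare AS s
JOIN my_dataset.corpus_dates AS d
ON s.corpus = d.corpus
GROUP BY s.corpus, d.written_year

Because the lookup table is small, a plain JOIN is enough here; for joining two large tables the legacy SQL dialect provides JOIN EACH.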

It also supports some more complex functions, such as:

image
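For instance, among those are regular expression functions. A small sketch using REGEXP_MATCH (legacy SQL dialect, which is what the web UI used at the time) to sum up all words starting with “th”:

SELECT word, SUM(word_count) AS count
FROM publicdata:samples.shakespeare
WHERE REGEXP_MATCH(word, '^th')
GROUP BY word
ORDER BY count DESC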

In the next part I am planning to upload a dataset and do some joins and more complex processing.

For more on BigQuery see – https://cloud.google.com/bigquery/web-ui-quickstart, http://martinfowler.com/articles/bigQueryPOC.html, https://support.google.com/analytics/answer/4419694?hl=en, http://googlecode.blogspot.com/2011/11/google-bigquery-service-big-data.html

Forecast Cloudy – Developing Simple Java Servlet Application for Google App Engine

appengine

I realize that this post is pretty basic, but I think it may be useful to someone starting to develop web applications on Google App Engine. In my previous post I showed you how you can use NetBeans to develop for App Engine; this time I will use Eclipse, as with the official Google plug-in it's a lot more slick, despite my preference for NetBeans as a Java IDE. In general I am a server-side rather than a UI guy, so my servlet will be very basic, but I don't think that is the point here, as servlets in general are quite easy to write and extend with more functionality.

So you decided to build a web app on App Engine? First you need to install the Google Plugin for Eclipse for App Engine. Installation instructions are available here. In Eclipse go to Help –> Install New Software and add the proper update site link as per the table in the Google doc. If you are using the latest Eclipse (Luna), full instructions are here – https://developers.google.com/eclipse/docs/install-eclipse-4.4

After you install the plug-in you will see a new icon on your toolbar in Eclipse:

appengine_eclipse2

Next I will create a Dynamic Web Project in Eclipse to house my servlet. Go to File –> New –> Dynamic Web Project. The resulting dialog looks like this:

appengine_eclipse4

Once you complete this step, the project-creation wizard will create a simple servlet-based application featuring a Hello World-type servlet. After that you should see your project structure:

image

Now on to Java servlets. A Java servlet is a Java program that extends the capabilities of a server. Although servlets can respond to any type of request, they most commonly implement applications hosted on web servers. Such web servlets are the Java counterpart to other dynamic web content technologies such as PHP and ASP.NET.

A servlet is a Java class that implements the Servlet interface. This interface has three methods that define the servlet's life cycle:

public void init(ServletConfig config) throws ServletException
This method is called once when the servlet is loaded into the servlet engine, before the servlet is asked to process its first request.

public void service(ServletRequest request, ServletResponse response) throws ServletException, IOException
This method is called to process a request. It can be called zero, one or many times until the servlet is unloaded. Multiple threads (one per request) can execute this method in parallel so it must be thread safe.

public void destroy()
This method is called once just before the servlet is unloaded and taken out of service.

The init method takes a ServletConfig parameter. The servlet can read its initialization arguments through the ServletConfig object. How the initialization arguments are set is servlet-engine dependent, but they are usually defined in a configuration file.
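As a minimal sketch of what that looks like in code – the parameter name "greeting" is just an assumption and would have to be defined as an <init-param> in web.xml – reading an initialization argument inside init might be done like this:

import javax.servlet.ServletConfig;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;

public class ConfigAwareServlet extends HttpServlet {
	private String greeting;

	// Called once when the servlet is loaded; reads an init argument defined in web.xml
	@Override
	public void init(ServletConfig config) throws ServletException {
		super.init(config); // let HttpServlet store the config first
		greeting = config.getInitParameter("greeting"); // hypothetical <init-param> name
	}
}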

In practice you usually extend HttpServlet, which implements service for you and dispatches requests to methods such as doGet and doPost. The doGet method has two interesting parameters: HttpServletRequest and HttpServletResponse. These two objects give you full access to all information about the request and let you control the output sent to the client as the response.

My Servlet as stated is pretty basic:

package com.example.myproject;

import java.io.IOException;
import javax.servlet.http.*;

@SuppressWarnings("serial")
public class HelloAppEngineServletServlet extends HttpServlet {
	// Handles HTTP GET requests and writes a plain-text greeting back to the client
	@Override
	public void doGet(HttpServletRequest req, HttpServletResponse resp)
			throws IOException {
		resp.setContentType("text/plain");
		resp.getWriter().println("Hello, AppEngine from GennadyK");
	}
}

The servlet gets mapped under the URI /simpleservletapp in the web.xml, as you can see below:

image
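In plain text the generated mapping looks roughly like the snippet below; the servlet-name value is assumed here, as the wizard may generate slightly different names:

<web-app>
	<servlet>
		<servlet-name>SimpleServletApp</servlet-name>
		<servlet-class>com.example.myproject.HelloAppEngineServletServlet</servlet-class>
	</servlet>
	<servlet-mapping>
		<servlet-name>SimpleServletApp</servlet-name>
		<url-pattern>/simpleservletapp</url-pattern>
	</servlet-mapping>
	<welcome-file-list>
		<welcome-file>index.html</welcome-file>
	</welcome-file-list>
</web-app>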

The project-creation wizard also provides an index.html file that has a link to the new servlet.

Before I deploy this to App Engine I probably want to debug it locally. Let's find the Debug button on the toolbar.

Hit the dropdown arrow and you can set up a local server for debugging. If your application uses App Engine but not GWT, the only indication that the App Engine development server is running will be output in the Console view; App Engine-only launches will not appear in the development mode view. The console output includes the URL of the server, which by default is http://localhost:8888/. You can change the port number via Eclipse's launch configuration dialog by selecting your Web Application launch and editing the Port value on the Main tab.

image

Simple result here:

image

Now let's upload this “application” to App Engine. Since I already configured the development server, I used Google App Engine WTP Deploy. Right-click on the project and you can pick that option:

Deploy

I can now watch my status in the status bar:

image

And read logged messages in Eclipse Console:

May 29, 2015 7:34:53 PM java.util.prefs.WindowsPreferences 
WARNING: Could not open/create prefs root node Software\JavaSoft\Prefs at root 0x80000002. Windows RegCreateKeyEx(...) returned error code 5.
Reading application configuration data...
May 29, 2015 7:34:54 PM com.google.apphosting.utils.config.AppEngineWebXmlReader readAppEngineWebXml
INFO: Successfully processed C:/Users/gennadyk/workspace/.metadata/.plugins/org.eclipse.wst.server.core/tmp0/helloAppEngineServlet\WEB-INF/appengine-web.xml
May 29, 2015 7:34:54 PM com.google.apphosting.utils.config.AbstractConfigXmlReader readConfigXml
INFO: Successfully processed C:/Users/gennadyk/workspace/.metadata/.plugins/org.eclipse.wst.server.core/tmp0/helloAppEngineServlet\WEB-INF/web.xml


Beginning interaction for module default...
0% Created staging directory at: 'C:\Users\gennadyk\AppData\Local\Temp\appcfg8550091122421213739.tmp'
5% Scanning for jsp files.
20% Scanning files on local disk.
25% Initiating update.
28% Cloning 3 static files.
31% Cloning 13 application files.
40% Uploading 4 files.
52% Uploaded 1 files.
61% Uploaded 2 files.
68% Uploaded 3 files.
73% Uploaded 4 files.
77% Initializing precompilation...
80% Sending batch containing 4 file(s) totaling 5KB.
90% Deploying new version.
95% Closing update: new version is ready to start serving.
98% Uploading index definitions.

Update for module default completed successfully.
Success.
Cleaning up temporary files for module default

Now let me open my browser and go to App Engine Console at https://appengine.google.com/.

image

Something I learned along the way: at this time you cannot run Java 8 on GAE. My IT department is a stickler for having the latest Java installed on my work laptop, so I am running the Java 8 SDK and JRE. However, when I first built this application with Java 8 and deployed it to App Engine, trying to run it gave me an HTTP 500 result. Looking at the logs in the dashboard I can see the following error:

2015-05-29 16:49:22.442  
Uncaught exception from servlet
java.lang.UnsupportedClassVersionError: com/example/myproject/HelloAppEngineServletServlet : Unsupported major.minor version 52.0
	at com.google.appengine.runtime.Request.process-ee32bdc226f79da9(Request.java)
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:817)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:375)
	at org.mortbay.util.Loader.loadClass(Loader.java:91)
	at org.mortbay.util.Loader.loadClass(Loader.java:71)
	at org.mortbay.jetty.servlet.Holder.doStart(Holder.java:73)
	at org.mortbay.jetty.servlet.ServletHolder.doStart(ServletHolder.java:242)
	at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
	at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:685)
	at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
	at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1250)
	at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:517)
	at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:467)
	at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
	at com.google.tracing.TraceContext$TraceContextRunnable.runInContext(TraceContext.java:437)
	at com.google.tracing.TraceContext$TraceContextRunnable$1.run(TraceContext.java:444)
	at com.google.tracing.CurrentContext.runInContext(CurrentContext.java:230)
	at com.google.tracing.TraceContext$AbstractTraceContextCallback.runInInheritedContextNoUnref(TraceContext.java:308)
	at com.google.tracing.TraceContext$AbstractTraceContextCallback.runInInheritedContext(TraceContext.java:300)
	at com.google.tracing.TraceContext$TraceContextRunnable.run(TraceContext.java:441)
	at java.lang.Thread.run(Thread.java:745)

I did a bit of digging and here is what I found out.

I believe these are the relevant class file format versions:

  • Java 8 = 52 (what my JDK was emitting)
  • J2SE 7 = 51 (note this one – what GAE runs)
  • J2SE 6.0 = 50
  • J2SE 5.0 = 49
  • JDK 1.4 = 48
  • JDK 1.3 = 47
  • JDK 1.2 = 46
  • JDK 1.1 = 45

The error complains about version 52.0, which is Java 8, while version 51 is Java 7. So the issue is that I compiled with a higher-version JDK (52 = Java 8) and then executed the classes on a lower JRE version (GAE runs Java 7). Repointing PATH to Java 7 did the trick; moreover, as suggested elsewhere on the web, I just created a batch file for this purpose:

SET JAVA_HOME="C:\Program Files\Java\jdk1.7.0_55"
SET PATH="%JAVA_HOME%\bin"
START eclipse.exe
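An alternative I did not use here – sketched under the assumption that the code uses no Java 8-only APIs – is to keep JDK 8 on the PATH but have the compiler emit Java 7 class files, either by setting the project's compiler compliance level to 1.7 in Eclipse (Project –> Properties –> Java Compiler) or on the command line:

REM Emit Java 7 (version 51) class files from the JDK 8 compiler.
REM javac will warn about the missing -bootclasspath; the servlet API jar path below is a placeholder.
javac -source 1.7 -target 1.7 -cp <path-to-servlet-api.jar> -d war\WEB-INF\classes src\com\example\myproject\HelloAppEngineServletServlet.java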

For more information see – https://cloud.google.com/appengine/docs/java/tools/eclipse, https://developers.google.com/eclipse/docs/appengine_deploy, https://cloud.google.com/appengine/docs/java/tools/maven, https://cloud.google.com/appengine/docs/java/webtoolsplatform

Forecast Cloudy – Developing for Google App Engine in NetBeans IDE

appengine

Although I am fairly Microsoft and Azure centric in my interests, I decided to give a quick try to a competing alternative, using a language I am familiar with – Java – and an IDE I know fairly well for it – NetBeans. I wanted to try something quick and easy, but my first task was making sure that NetBeans can actually deploy code to App Engine.

Good news: there is a plug-in for Google App Engine available from this Kenai project – https://kenai.com/projects/nbappengine/pages/Home.

First things first. Navigate to developers.google.com and sign up to enable your Google account for use with Google App Engine. Create a project, and make sure you write down the project ID and name. Download and unzip the Google App Engine SDK to a folder on your hard drive. Obviously, here I am using the Java SDK.

Now, I will assume that you already have NetBeans installed. If not, it can be installed from https://netbeans.org/. It is my favorite IDE when it comes to non-Microsoft/Java IDEs at this time.

Next, you will need to install the plug-in. Go to https://kenai.com/projects/nbappengine/pages/Home and follow these instructions:

  • In NetBeans click Tools –> Plugins
  • Go To Settings Tab
  • Click Add Button
  • Type “App Engine” (without the quotes) into the Name field
  • If using NetBeans 6.9 and above paste http://kenai.com/downloads/nbappengine/NetBeans69/updates.xml into the URL field
  • Click the OK button
  • Click on Available Plugins
  • Select all Google App Engine plugins
  • If you’re using NetBeans 6.9 you must also install the  Java Web Applications plugin for NetBeans.

After you are done you should see the plugins in your Installed tab like this:

image

To install the Google App Engine service in NetBeans, follow these instructions:

  • Start NetBeans
  • Click on the Services tab next to Projects and Files
  • Right-click on Servers and click Add
  • Select Google App Engine and Click Next
  • Select the location you unzipped the Google App Engine SDK and Click Next
  • Unless you have another service running on port 8080 and port 8765 leave the default port values
  • Finally Click Finish

After you are done you should see Google App Engine in your Servers list:

image

Next I created a really simple small JSP file that I want to deploy. First, let's take a look at the appengine-web.xml settings file. Per the Google docs: “An App Engine Java app must have a file named appengine-web.xml in its WAR, in the directory WEB-INF/. This is an XML file whose root element is <appengine-web-app>.” My config looks like this:

image

image
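For reference, a minimal appengine-web.xml typically looks roughly like this; the application ID and version below are placeholders for the project created earlier:

<?xml version="1.0" encoding="utf-8"?>
<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
	<!-- Must match the project ID created in the App Engine console -->
	<application>my-project-id</application>
	<version>1</version>
	<threadsafe>true</threadsafe>
</appengine-web-app>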

Once you are ready to deploy, just right-click the project and pick the Deploy to Google App Engine option. It will prompt you for a user name and password. That's where I ran into my first issue, an error that my password and email don't match, detailed here – https://groups.google.com/forum/#!topic/opendatakit/5hpDkvrd_WQ. After working through this and another issue with JSPs and App Engine deploys noted here – https://code.google.com/p/googleappengine/issues/detail?id=1226#makechanges – I finally deployed and ran my little application, generating traffic I can see on the console:

image

Frankly, the experience for me wasn't as seamless as I would like (perhaps I am spoiled a bit by the Visual Studio Azure SDK integration for Azure PaaS). I may have had an easier time with the official Google plugin and Eclipse, but as stated I like NetBeans more.

For more see – http://rocky.developerblogs.com/tutorials/getting-started-google-app-engine-netbeans/, https://techtavern.wordpress.com/2014/05/13/google-app-engine-on-netbeans-8/, https://cloud.google.com/appengine/docs/java/gettingstarted/introduction

Well, this was different – and now going “back to regularly scheduled programming.”