Last week, I went to a meetup about streaming platforms, where a great speaker presented Summingbird: a library that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding.
You may already know the RStudio IDE, which is really nice. But if you want to use the RAM and CPU of another server, you can also install RStudio Server and access your R environment through a browser-based interface, and it rocks!
I mean, we have done a lot of amazing studies and innovations in the fields of science, technology and health. But the fact that we have created this garbage continent, the way we run most of our businesses solely to enrich shareholders, or all the political corruption you can discover in daily life... It deserves a very big WTF!
PRPD (Partial Redistribution, Partial Duplication) is a new feature in Teradata since 14.10 that improves joins on skewed tables (it relies on statistics to identify skewed values). This is a smart way to avoid forcing the DBA to create surrogate keys!
I would like to meet people who work as CDOs (Chief Data Officers). It looks like a very interesting job (data quality, data management, ...) and it should be very helpful for the data preparation I need before running an analytics workflow / discovery process.
Teradata and Hadoop interact well together, especially inside the UDA with the InfiniBand interconnect. To know which platform to use when, you should look at your needs, where the largest volume of data sits, and each platform's capabilities.
Teradata’s Seamless Network Analytic Processing (SNAP) Framework is one of the great ideas inside the Aster 6 database. It allows users to query different analytical engines and multiple types of storage using a SQL-like programming interface. It is composed of a query optimizer, a layer that integrates and manages resources, an execution engine and the unified SQL interface. These are the main components and their goals:
SQL-GR & Graph Engine: provides functions to work with edges, vertices, and [un|bi]directed or cyclic graphs
SQL-MR: a library (machine learning, statistics, search behaviour, pattern matching, time series, text analysis, geospatial, parsing) to process data using the MapReduce framework
SQL-H: an easy-to-use connection to HDFS for loading data from Hadoop
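To give an idea of what the unified SQL interface looks like, here is a sketch of an SQL-MR invocation (the nPath pattern-matching function); the clickstream table and its sessionid, clicktime and pagetype columns are hypothetical:

```sql
-- Sketch: find sessions that browsed the home page then reached checkout
SELECT *
FROM nPath(
  ON clickstream
  PARTITION BY sessionid
  ORDER BY clicktime
  MODE (NONOVERLAPPING)
  PATTERN ('H+.C')                         -- one or more home views, then checkout
  SYMBOLS (pagetype = 'home'     AS H,
           pagetype = 'checkout' AS C)
  RESULT (FIRST(sessionid OF H) AS sessionid)
);
```

The interesting part is that this MapReduce-backed function is invoked like any table in the FROM clause, so it composes with plain SQL around it.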
I rented a car through locationdevoiture.fr; they called pretending there was a problem during the website registration and they changed the date, so when I arrived to pick up the car: no voucher, no reservation... #becareful #scam
I took a taxi this morning and, because I am a foreigner, the driver took the wrong route... I like to travel, but not before going to work ;-). Thankfully, with Google Maps and because I remembered the price I paid the first day, everything went well. Here is some advice I would like to share:
Take your phone, use Google Maps and show the driver where you want to go
Don't take a taxi near tourist spots, or your hotel's taxi
In 2013, I tried some software to improve my organisation and found one that is quite smart & useful: Google Keep. You can create tasks or task lists, add colours, pictures and reminders (by date or location), and it synchronises with your Android device!
Thanks to UD[AT]Fs or MapReduce, you can work directly in Java and use your Hadoop resources. Because of the huge number of Java libraries, you can imagine extracting data directly from HTML / XML files, mixing it with reference / parameter data (loaded via JDBC), and transforming it into Excel files in a single job!
To allow recomputation, just copy your data, har / compress it and plug in a partitioned Hive external table. You can then create complex Hive workflows and, why not, push some data (statistics, machine learning results) back to HBase!
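As a sketch of that setup (the table name, columns, location and partition key below are hypothetical), the archived data can be exposed like this:

```sql
-- External table over archived data: dropping the table never deletes the files
CREATE EXTERNAL TABLE events_archive (
  user_id BIGINT,
  action  STRING
)
PARTITIONED BY (load_date STRING)
STORED AS TEXTFILE
LOCATION '/archive/events';

-- Register each archived day as a partition
-- (a har:// path can also be used as the partition location)
ALTER TABLE events_archive ADD PARTITION (load_date = '2013-10-01')
LOCATION '/archive/events/2013-10-01';
```

Because the table is EXTERNAL and partitioned, you can re-run any workflow on any slice of history without Hive ever touching the underlying archive.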
Serving layer: HBase, JEE & JS web application
JEE is convenient because of the HBase Java API, and JDBC if you need to cache some reference data. And you can use a JavaScript chart library.
Apache Hadoop 2.0 was released just a few days ago! Hadoop is no longer only a MapReduce container but a multi-framework data container, and it provides High Availability, HDFS Federation, NFS access and snapshots!
Most of the time, BI means a snowflake or star schema (or a hybrid or complex ER model). But with Hadoop you should rather think about denormalization, a big ODS, powerful ETL, a great place for your fact data and a new way (Hive / Mahout / Pig / Cascading) to tackle your structured / semi-structured / unstructured data, in real time (HBase, Storm, Flume) or not!
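As a sketch of what denormalization can look like in Hive (table and column names are hypothetical), the dimension attributes are simply flattened next to the facts instead of living in separate dimension tables:

```sql
-- One wide, denormalized fact table instead of a star schema:
-- customer and product attributes are copied alongside each sale
CREATE TABLE sales_flat (
  sale_id        BIGINT,
  sale_amount    DOUBLE,
  customer_name  STRING,
  customer_city  STRING,
  product_name   STRING,
  product_family STRING
)
PARTITIONED BY (sale_date STRING)
STORED AS SEQUENCEFILE;
```

Storage is cheap on HDFS, and full scans are the norm, so trading redundancy for join-free queries is usually a good deal here.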
A quite powerful way to handle hierarchical data: recursive SQL!
WITH RECURSIVE tmp_table AS (
  SELECT column1, column2, ...
  FROM src_table
  WHERE src_table.hierarch_column_id IS NULL
  UNION ALL
  SELECT column1, column2, ...
  FROM src_table
  INNER JOIN tmp_table
    ON src_table.hierarch_column_id = tmp_table.column_id
)
SELECT * FROM tmp_table;
You can also add meta-information, like 1 AS n in the first SELECT (the one reading src_table) and n + 1 in the recursive SELECT, which lets you filter by hierarchy level.
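Putting the level counter into the query gives something like this sketch (column names are the same placeholders as in the example):

```sql
WITH RECURSIVE tmp_table AS (
  -- Anchor: root rows (no parent) start at level 1
  SELECT column_id, column1, 1 AS n
  FROM src_table
  WHERE hierarch_column_id IS NULL
  UNION ALL
  -- Recursive step: each child is one level deeper than its parent
  SELECT s.column_id, s.column1, t.n + 1
  FROM src_table s
  INNER JOIN tmp_table t
    ON s.hierarch_column_id = t.column_id
)
SELECT * FROM tmp_table
WHERE n <= 3;  -- keep only the first three levels
```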
> add jar /home/hive/developpement/loadfromjdbc.jar;
> add jar /home/hive/developpement/tdgssconfig.jar;
> add jar /home/hive/developpement/terajdbc4.jar;
> create temporary function loadfromjdbc as 'mlanciau.dev.loadfromjdbc';
> SELECT result['column_name1'], result['column_name2']
  FROM (
    SELECT loadfromjdbc('com.teradata.jdbc.TeraDriver',
                        'jdbc:teradata://ip/CHARSET=UTF16',
                        'db_user', 'db_password',
                        'SELECT * FROM database_name.table_name') AS (result)
    FROM dual
  ) ssreq;
We can now join Hadoop data with database data directly from Hive.
With Perl, Python or Java:
ADD FILE reducer.pl;
FROM (
  FROM t_table
  SELECT c_1, c_2
  WHERE c_2 = '2012-12-07'
  DISTRIBUTE BY c_1
  SORT BY c_2
) ssreq
REDUCE ssreq.c_1, ssreq.c_2 USING 'reducer.pl';
You can find more in the Hive tutorial!
Because of its WORA (write once, run anywhere) aspect, I chose to learn Java. It's a powerful programming language for building software, websites (JEE + HTML + CSS), phone applications (Android) and so many other things (Talend, Hadoop)!