Friday, August 22, 2014

Jazz !

Let's relax with a little Jazz !

Tuesday, August 19, 2014

Unlockyourbrain !

This application is quite nice if you want to improve the way you unlock your phone !

Wednesday, August 6, 2014

How to be more productive !

Take 10 minutes and read the Aaron Swartz post !

Monday, August 4, 2014

RStudio Server

You may already know the RStudio IDE, which is really nice. But if you want to use the RAM and the CPU of another server, you can also install RStudio Server and access your R environment through a browser-based interface, and it rocks !

Saturday, August 2, 2014

Mankind is quite stupid...

I mean we have done a lot of amazing studies and innovations in the field of Science, Technology and Health. But the fact that we have created this garbage continent, the way we run most of our business by enriching shareholders or all the political corruption you can discover in daily life... It deserves a very big WTF !

Have a good week-end ;-)

Thursday, July 31, 2014

Partial Redistribution Partial Duplication

PRPD is a feature available in Teradata since 14.10 that improves joins on skewed tables (it depends on statistics to identify the skewed values). This is a smart way to spare DBAs from creating surrogate keys !
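
To make the idea concrete, here is a rough sketch in Python of how I understand PRPD-style placement. It is only an illustration under my own assumptions, not Teradata's implementation: rows of the big table with a skewed key stay where they are (round-robin stands in for "local"), matching rows of the other table are duplicated to every worker, and everything else is redistributed by hash. The function and variable names are hypothetical.

```python
def prpd_placement(big_rows, small_rows, skewed_keys, n_workers):
    """Return per-worker (big, small) row lists after PRPD-style placement."""
    big = [[] for _ in range(n_workers)]
    small = [[] for _ in range(n_workers)]
    for i, (key, value) in enumerate(big_rows):
        # Skewed keys are not moved (assumption: round-robin models their
        # current location); other keys are redistributed by hash.
        target = i % n_workers if key in skewed_keys else hash(key) % n_workers
        big[target].append((key, value))
    for key, value in small_rows:
        if key in skewed_keys:
            # Partial duplication: skewed keys are copied to every worker.
            for worker in range(n_workers):
                small[worker].append((key, value))
        else:
            # Partial redistribution: the rest goes to the worker owning the key.
            small[hash(key) % n_workers].append((key, value))
    return big, small


# Key 0 is heavily skewed in the big table.
big_rows = [(0, i) for i in range(8)] + [(1, "x"), (2, "y")]
small_rows = [(0, "s0"), (1, "s1"), (2, "s2")]
big, small = prpd_placement(big_rows, small_rows, skewed_keys={0}, n_workers=4)
```

With a plain hash redistribution all eight rows of key 0 would land on one worker; here no worker holds more than three rows of the big table, and each worker can still join locally.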

Wednesday, July 30, 2014

NVD3 : Re-usable charts for d3.js

If you don't want to start from scratch with D3.js, have a look at NVD3.js ;-)

Tuesday, July 29, 2014

Dataiku !

Dataiku is a French startup which provides a great web-based platform to accelerate data-science projects, and there is an open-source version !

Thursday, July 24, 2014

Wednesday, July 23, 2014

My Hadoop is not working, what can I do ?

Keep calm and ;-)
  • First check your logs
  • Is the service running ? (netstat -nat | grep ...)
  • Is it possible to access it ? (telnet ip port)
  • Is there a problem linked with path, java libraries, environment variable or exec ?
  • Am I using the correct user ? 
  • What is the security system in place ?
  • Are nodes well synchronized ?
  • What about memory issues ? (swap should also be disabled)
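
The "is it possible to access it ?" check can be scripted. A minimal sketch in Python, using only the standard library; the host name and port in the usage note are placeholders I made up, not values from a real cluster.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds (like `telnet ip port`)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeout and name resolution failures.
        return False
```

For example, can_connect("namenode.example.com", 8020) tells you whether something is listening on that (hypothetical) host and port from the machine where you run it.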

Monday, July 14, 2014

Virtual Desktop on Windows !

For those who come from Linux or Mac OS and would like virtual desktops on Windows :-)

Wednesday, May 14, 2014

Chief Data Officer

I would like to meet people who work as CDO : Chief Data Officer. It looks like a very interesting job (data quality, data management, data wwwww) and it should be very helpful for the data preparation I need before running analytics workflows / discovery processes.

Monday, May 5, 2014

Hive development !

A lot of improvements in this new release of Hive !
  • [NOT] EXISTS and [NOT] IN are available
  • WITH t_table_name AS ... also known as Common Table Expressions
  • SELECT ... WHERE ... (SELECT c_column1 FROM ...), i.e. correlated subqueries
  • SQL authorization system (GRANT, REVOKE) is now working
  • The Tez engine, which can be enabled with set hive.execution.engine=tez;

Thursday, April 24, 2014

Python !

Python is already almost everywhere and is used in production at Google. It is a very powerful programming language to turn your wishes (from Web to GUI) into a script !

Wednesday, April 23, 2014

HBase coprocessor !

If you need to execute some custom code in your HBase cluster, you can use HBase coprocessor :
  • Observers : like triggers in RDBMS
    • RegionObserver : To pick up every DML statement : get, scan, put, delete, flush, split
    • WALObserver : To intercept WAL writing and reconstruction events
    • MasterObserver : To detect every DDL operation
  • EndPoints : kind of stored procedure

Wednesday, April 2, 2014

Scikit-learn !

Scikit-learn is an open-source machine-learning library written in Python. It is fast, handles memory well and, thanks to Python, is very flexible !

Monday, March 31, 2014

Teradata & Hadoop !

Teradata and Hadoop interact well together, especially inside UDA with the InfiniBand interconnect. To know which platform to use, look at your needs, where the largest volume of data sits and each platform's capabilities.

If you want to transfer data, you can consider :

Friday, March 28, 2014

Machine Learning with Aster !

I am now working with Aster to do Machine Learning and statistics. Here are the functions you can use :
  • Approximate Distinct Count : to quickly estimate the number of distinct values
  • Approximate Percentile : to compute approximate percentiles
  • Correlation : to determine if one variable is useful for predicting another
  • Generalized Linear Regression & Prediction : to perform linear regression analysis
  • Principal Component Analysis : for dimensionality reduction 
  • Simple | Weighted | Exponential Moving Average : compute an average with a specific weighting algorithm
  • K-Nearest Neighbor : classification algorithm based on proximity
  • Support Vector Machines : build an SVM model and make predictions
  • Confusion Matrix [Plot] : visualize ML algorithm performance
  • Kmeans : famous clustering algorithm
  • Minhash : another clustering technique, which depends on the set of products bought by users
  • Naïve Bayes : useful classification method especially for documents
  • Random Forest Functions : predictive modelling approaches broadly used for supervised classification learning
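
To show what the three moving averages in the list compute, here is a small sketch in plain Python. These are the textbook definitions as I know them, not Aster's implementation, and Aster's exact weighting or seeding may differ; the function names are mine.

```python
def simple_moving_average(values, window):
    """Mean of each full window of `window` consecutive values."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


def weighted_moving_average(values, window):
    """Linear weights: the newest value in the window weighs `window`, the oldest 1."""
    weights = list(range(1, window + 1))
    total = sum(weights)
    return [sum(w * v for w, v in zip(weights, values[i - window + 1:i + 1])) / total
            for i in range(window - 1, len(values))]


def exponential_moving_average(values, alpha):
    """Seeded with the first value: ema_t = alpha * x_t + (1 - alpha) * ema_(t-1)."""
    result = []
    for x in values:
        result.append(x if not result else alpha * x + (1 - alpha) * result[-1])
    return result
```

The simple version treats every point in the window equally, while the weighted and exponential versions react faster to recent values.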

Tuesday, March 11, 2014

Teradata’s SNAP Framework !

Teradata’s Seamless Network Analytic Processing Framework is one of the great ideas inside the Aster 6 database. It allows users to query different analytical engines and multiple types of storage using a SQL-like programming interface. It is composed of a query optimizer, a layer that integrates and manages resources, an execution engine and the unified SQL interface. These are the main components and their goals :
  • SQL-GR & Graph Engine : provide functions to work with edge, vertex, [un|bi|]directed or cyclic graph
  • SQL-MR : library (Machine Learning, Statistics, Search behaviour, Pattern matching, Time series, Text analysis, Geo-spatial, Parsing) to process data using MapReduce framework
  • SQL-H : easy to use connection to HDFS for loading data from Hadoop
  • SQL : join, filter, aggregation, OLAP, insert, update, delete, CASE WHEN, table
  • AFS connector : SQL-MR function to map AFS file to table
  • Teradata connector : SQL-MR function to load data from / to Teradata RDBMS
  • Stream API : plug in your Python, Ruby, Perl, C[|++|#] scripts and use the CPU of the Aster worker nodes to run them

Tuesday, February 25, 2014

Sunday, February 23, 2014

Scam #1

I rented a car through locationdevoiture.fr. They called, pretending there was a problem during the website registration, and they changed the date, so when I arrived to pick up the car : no voucher, no reservation... #becareful #scam

Friday, February 21, 2014

Taxi !

I just took a taxi this morning and because I am a foreigner, the guy went the wrong way... I like to travel, but not before going to work ;-). Thanks to Google Maps, and because I remembered the price I paid the first day, everything went well. Here is some advice I would like to share :

  • Take your phone, use Google Maps and show the driver where you want to go
  • Don't take a taxi near tourist spots, and avoid your hotel's taxi
  • Ask for the (approximate) price before leaving

Saturday, February 15, 2014

My Happiness recipe !

  • Travel every 4 months
  • Save money by using smart websites
  • Drink Pret A Manger soup, hot hazelnut chocolate at Starbucks or tea (especially green tea)
  • Ride a motorbike when it is sunny
  • Enjoy your family
  • Share knowledge & keep discovering (not only IT)
  • Try new food or new restaurants
  • Gather with your friends and have a drink ;-)
  • Eat healthy (yes you are what you do but also what you eat !)
  • Take photos using Instagram and share them with your friends !
  • Write down what is important to you
  • Read, read, read especially before going to sleep
  • Wake up at 07h07, go to sleep at 22h22 (at least try)
  • Close your computer now and do some sports ;)

Monday, February 10, 2014

HDP [>] 2.1 natively available applications !

Stack components :
  • MapReduce (API v1 & v2) : software framework for processing vast amounts of data
  • Tez : more powerful framework for executing DAG (directed acyclic graph) of tasks
  • HOYA, HBase on YARN : distributed, column oriented database
  • Accumulo : (Linux only) sorted, distributed key / value store
  • Hue : web application interface for Hadoop ecosystem (Hive, Pig, HDFS, ...)
  • HDFS : Hadoop Distributed File System
  • WebHDFS : interact with HDFS using HTTP (no need for library)
  • WebHCat : interact with HCatalog using HTTP (no need for library)
  • YARN : Yet Another Resource Negotiator, allows more applications to run on Hadoop
  • Oozie : workflow / coordination system
  • Mahout : Machine-Learning libraries which use MapReduce for computing
  • Zookeeper : centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
  • Flume : data ingestion and streaming tool
  • Sqoop : extract and push down data to databases
  • Pig : scripting platform for analyzing large data sets
  • Hive : tool to query the data using a SQL-like language
  • SolR : platform for indexing and search
  • HCatalog : meta-data management service
  • Ambari : set up, monitor and configure your Hadoop cluster
  • Phoenix : SQL layer over HBase
Components being developed / integrated :
  • Spark : in memory engine for large-scale data processing
  • Falcon : data management framework
  • Knox : single point of secure access for Apache Hadoop clusters (use WebHDFS)
  • Storm : distributed realtime computation system
  • Kafka : publish-subscribe messaging system
  • Giraph : iterative graph processing system
  • OpenMPI : high performance message passing library
  • S4 : stream computing platform
  • Samza : distributed stream processing framework
  • R : software programming language for statistical computing and graphics
What else ;-) ?

Tuesday, February 4, 2014

Basic statistics with R !

I am quite sure you already know but it is really useful (especially with na.rm=TRUE) :
And don't forget t.test and prop.test !

Saturday, February 1, 2014

Lego & Chrome !

For now, there are not a lot of pieces but it can give you great moments with your child / nephew ;-)

Monday, January 27, 2014

Main Hadoop 2.0 daemons !

  • NameNode : one per Namespace, stores & handles HDFS metadata
  • Secondary NameNode : (for now) still in use if no HA
  • Checkpoint node : (later) multiple checkpoint nodes are possible, performs periodic checkpoints
  • Backup node / Standby node : allows high availability, keeps an up-to-date copy of the namespace in memory (when a Backup node is used, no Checkpoint node is allowed)
  • DataNode : stores HDFS data

  • ResourceManager : a global pure scheduler
  • ApplicationMaster : one per application, negotiates resources with the RM, monitors tasks and asks the NodeManager to execute them
  • NodeManager : one per slave server, a task application container launcher and reporting agent
  • Application Container : a separate processing unit, it can be a Map, a Reduce or a Storm Bolt, etc

Thursday, January 16, 2014

S.A.R.A.H

I will set up S.A.R.A.H soon, let's have an intelligent house and enjoy IoT ;-)

Monday, January 13, 2014

Google Keep !

In 2013, I tried several pieces of software to improve my organisation, and I found one quite smart & useful : Google Keep. You can create tasks or task lists, add colors, pictures and reminders (date or location), and it is synchronised with your Android device !

Tuesday, January 7, 2014

Hadoop & Java !

Thanks to UD[AT]F or MapReduce you can work directly with Java and use your Hadoop resources. Because of the huge number of Java libraries, you can imagine extracting data directly from HTML / XML files, mixing it with reference / parameter data (JDBC loading), and turning it into Excel files in one job !

Friday, January 3, 2014

Can we eat to starve cancer ?

William Li: http://on.ted.com/tbWi

Never give up !

Diana Nyad : http://on.ted.com/hyPR

Sunday, December 29, 2013

What are your passions ?

I always like to hear about other people's passions.

My passions/interests are :
  • IT
  • Solving problems
  • Discovering & Travelling
  • Human & Sharing & Realisation
  • Health
  • Building things to make this world smarter and safer
  • Running (marathon for 2014) & sports in general !
  • Aviation & aeromodelling & flight simulation
  • English
  • Motorbiking
  • Eating & discovering food
  • Movies
  • Automatic watches
  • Having fun ;-)
So if we meet, tell me yours !

Saturday, December 28, 2013

Thursday, December 26, 2013

Thursday, December 5, 2013

Decision trees & R | Mahout !

Yesterday, I was asked "how can we visualise what leads to problems ?" To me, one of the best ways is to use a decision tree with R or Mahout !

And you can make predictions & draw them nicely !

Sunday, December 1, 2013

Orion

Sometimes I use Eclipse Orion. It's really useful to have a cloud-based IDE !

AngularJS

Try AngularJS, a practical JavaScript framework !

Saturday, November 30, 2013

Lambda architecture

In this post I would like to present one possible software stack for a lambda architecture :

Speed layer :  Storm, HBase

Storm is the real-time ETL and HBase, because of its random, real-time read/write capability, is the storage !

Batch layer : Hadoop, HBase, Hive / Pig / [your datawarehouse]

To allow recomputation, just copy your data, har / compress it and plug a partitioned Hive external table on top. Then you can create complex Hive workflows and, why not, push some data (statistics, machine learning) to HBase again !

Serving layer : HBase, JEE & JS web application

JEE is convenient because of the HBase Java API and JDBC if you need to cache some reference data. And you can use a JavaScript chart library.

Stay KISS ;-)

Photo #11


Nagios script & Hadoop !

A useful link to help you monitor your Hadoop cluster !

LAMP became MEAN !

MEAN is the new, JavaScript-powered way to develop web applications !

Sunday, November 3, 2013

Friday, October 18, 2013

Hadoop 2.0 !

Apache Hadoop 2.0 was released a few days ago ! Hadoop is no longer only a MapReduce container but a multi data-framework container, and it provides High Availability, HDFS Federation, NFS and snapshots !

Big-LambData Architecture !

Nathan Marz proposed to apply the lambda philosophy to big data architecture and so it can help when you have to solve use cases using batch and real time processing systems.

Lambda architecture is based on three main design principles :
  • human fault-tolerance
  • data immutability
  • recomputation
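
Here is a toy sketch in Python of how these three principles fit together. It is my own illustration, not code from Nathan Marz: the master dataset is an immutable tuple of events, the batch view is always recomputed from scratch, and a query merges it with the increments held by the speed layer. All names are hypothetical.

```python
from collections import Counter

# Immutable master dataset: events are only ever appended, never updated in place.
master_dataset = (
    ("page_a", 1), ("page_b", 1), ("page_a", 1),
)


def batch_view(events):
    """Recompute the whole view from scratch from the raw events."""
    view = Counter()
    for key, count in events:
        view[key] += count
    return view


def query(key, batch, realtime):
    """Serving layer: merge the batch view with the real-time increments."""
    return batch[key] + realtime[key]


batch = batch_view(master_dataset)
# Events that arrived after the last batch run, held by the speed layer.
recent_events = (("page_a", 1),)
realtime = batch_view(recent_events)
page_a_views = query("page_a", batch, realtime)
```

If a bug slips into batch_view, you fix the function and recompute from the untouched master dataset : that is where the human fault-tolerance comes from.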

Quality Function Deployment !

I like to use Japanese methods, QFD is one of my favourite for improving / solving complex IT issues !

Sunday, October 13, 2013

Kafka !

Kafka is a good solution for high throughput large scale message processing applications !

Hive development !

Lateral view simplifies the use of UDTF :

SELECT column1, column_udtf1
FROM t_table
LATERAL VIEW explode(array_column) ssreq_lv1 AS column_udtf1
;

And with Hive 0.11, you now have ORC Files and windowing functions :
  • LEAD / LAG
  • FIRST_VALUE / LAST_VALUE
  • RANK / ROW_NUMBER / DENSE_RANK
  • CUME_DIST / PERCENT_RANK / NTILE
which is convenient for our BI needs !
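
To get a feel for what some of these functions return, here is a small sketch in plain Python (not Hive) that mimics LAG and the three ranking functions over one already ordered partition. The function names are mine, and this only illustrates the usual SQL semantics.

```python
def lag(values, offset=1, default=None):
    """LAG: the value found `offset` rows earlier in the ordered partition."""
    return [values[i - offset] if i >= offset else default
            for i in range(len(values))]


def ranks(sorted_values):
    """Return (ROW_NUMBER, RANK, DENSE_RANK) for each row of an ordered partition."""
    out = []
    rank = dense = 0
    previous = object()  # sentinel that is equal to no real value
    for row_number, value in enumerate(sorted_values, start=1):
        if value != previous:
            rank = row_number   # RANK jumps to the row position after ties
            dense += 1          # DENSE_RANK only increases by one
            previous = value
        out.append((row_number, rank, dense))
    return out
```

With the values 10, 10, 20 the third row gets ROW_NUMBER 3, RANK 3 and DENSE_RANK 2, which is the difference that usually matters for top-N reports.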

Sunday, September 29, 2013

Friday, September 27, 2013

Business Intelligence & Hadoop

Most of the time BI means snowflake or star schema (or hybrid or complex ER). But with Hadoop you should rather think about denormalization, a big ODS, powerful ETL, a great place for your fact data and a new way (Hive / Mahout / Pig / Cascading) to tackle your normal / semi / non-structured data using real-time (HBase, Storm, Flume) or not !

Music #2

Stromae - Papaoutai | Rammstein - Ohne Dich | Bruno Mars - Locked Out Of Heaven | Selah Sue - Raggamuffin | c2c - Down The Road | Martin Garrix - Animals | Macklemore & Ryan Lewis - Can't Hold Us | Avicii - Wake Me Up | Kavinsky - Roadgame | Imelda May - Tainted Love

Wednesday, September 25, 2013

Hadoop & compression !

Compression with Hadoop is great ! You can reduce IO, network exchange and store more data, and most of the time your Hive/Pig/MapReduce jobs will be a little faster.

Depending on what your needs are, you should think about Snappy, lzo, lz4, bzip or gzip.
 
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

Flume daemons !

  • Source (consumes events delivered to it by an external source)
  • Channel (temporarily stores the event's data and helps provide end-to-end reliability of the flow)
  • Sink (removes the event from the channel and transfers/writes it)
The source and the sink run asynchronously with the events staged in the channel.
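
The source / channel / sink relationship can be sketched with two threads and a queue in Python. This is only an analogy of the flow, not Flume code : a real Flume channel is transactional, which this toy version is not, and all names here are mine.

```python
import queue
import threading

channel = queue.Queue(maxsize=100)   # staging area between source and sink
delivered = []                        # stands in for HDFS, a file, another agent...
STOP = object()                       # sentinel telling the sink to finish


def source(events):
    """Consume external events and put them on the channel."""
    for event in events:
        channel.put(event)
    channel.put(STOP)


def sink():
    """Take events off the channel and write them to the destination."""
    while True:
        event = channel.get()
        if event is STOP:
            break
        delivered.append(event)


sink_thread = threading.Thread(target=sink)
source_thread = threading.Thread(target=source, args=(["e1", "e2", "e3"],))
sink_thread.start()
source_thread.start()
source_thread.join()
sink_thread.join()
```

The source never waits for the sink to finish writing an event; it only blocks when the channel is full, which is the buffering role the channel plays.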

Thursday, September 12, 2013

Node.js

Node.js is a great way to write web applications in JavaScript ! It is very useful, especially with express and socket.io !

Wednesday, September 11, 2013

CouchSurfing !

CouchSurfing is a super cool & free way to travel and meet new friends around the world. I like it and I joined the community !

Monday, September 9, 2013

VirtualBox !

VirtualBox is a virtualization software package, it is very convenient when you want to test OS, to do web development or build your first Hadoop cluster :-) !

Sunday, September 8, 2013

Rattle !

I was wondering if there is a GUI for doing datamining / machine learning tasks and I found Rattle.

If you want to install and try :
install.packages("rattle", dependencies=TRUE)
library(rattle)
rattle()

And you can take a coffee during the first step ^^ !

How to make stress your friend !


Friday, August 30, 2013

MRUnit !

MRUnit (Blog) is the java library for MapReduce jobs testing !

Thursday, August 29, 2013

Speculative execution && Hadoop !

I usually disable speculative execution for MapReduce tasks when I write to an RDBMS from a Hive user-defined table function, because a speculative duplicate of a task would write the same rows twice.

set mapred.map.tasks.speculative.execution=false;
set mapred.reduce.tasks.speculative.execution=false;
set hive.mapred.reduce.tasks.speculative.execution=false;

And if you tune mapred.reduce.tasks, you can control the number of concurrent RDBMS sessions.

It is also good to use batch mode and control the commit !

Photo #9


Monday, August 26, 2013

Recursive SQL !

A quite powerful way to handle hierarchical model data : recursive SQL !

WITH RECURSIVE tmp_table AS (
 SELECT column1, column2, ...
 FROM src_table
 WHERE src_table.hierarch_column_id IS NULL
 UNION ALL
 SELECT src_table.column1, src_table.column2, ...
 FROM src_table
 INNER JOIN tmp_table
 ON src_table.hierarch_column_id = tmp_table.column_id
)
SELECT *
FROM tmp_table;

You can also add meta-information such as the level : select 1 AS n in the first SELECT (on src_table) and n + 1 in the second one (the join), which lets you filter on the level.
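
Here is a runnable version of this pattern, including the level counter n, using SQLite through Python's standard library. The table content and the name column are invented for the example, and other databases may need small syntax changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE src_table (column_id INTEGER, name TEXT, hierarch_column_id INTEGER)"
)
conn.executemany(
    "INSERT INTO src_table VALUES (?, ?, ?)",
    [(1, "root", None), (2, "child", 1), (3, "grandchild", 2)],
)

rows = conn.execute("""
    WITH RECURSIVE tmp_table (column_id, name, n) AS (
        -- anchor member: the top of the hierarchy, level 1
        SELECT column_id, name, 1
        FROM src_table
        WHERE hierarch_column_id IS NULL
        UNION ALL
        -- recursive member: children of the rows found so far, level + 1
        SELECT s.column_id, s.name, t.n + 1
        FROM src_table s
        INNER JOIN tmp_table t ON s.hierarch_column_id = t.column_id
    )
    SELECT name, n FROM tmp_table WHERE n <= 2 ORDER BY n
""").fetchall()
```

The filter n <= 2 keeps only the first two levels of the hierarchy, so the grandchild row is left out of rows.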

R !

R is free software for doing statistics, analytics, machine learning and data visualization.

If you want to start learning R, watch Google Developers videos, read machine learning or statistical models. You can find an IDE and an easy-way to create web-reporting using R and Shiny.

And don't forget library(rmr2) & library(rhdfs) to plug it with Hadoop !

Sunday, August 25, 2013

Guava !

Guava is an open-source Java multi-library, very useful and time-saving, especially when you want to work with collections !

Wednesday, August 21, 2013

Principal components analysis with R !

If you want to reduce the dimensionality of your n-variable problem and get the main uncorrelated axes, try PCA and start with the generic function princomp !

Sunday, August 4, 2013

Trello !

A great, free, web-based way to organize your projects with your colleagues/friends : Trello !

Thursday, July 18, 2013

Photo #8


Monday, July 15, 2013

Sunday, May 26, 2013

RunKeeper !

I like running ! To record my runs and share them with my friends I use RunKeeper !

Saturday, May 25, 2013

Photo #7


Wednesday, May 8, 2013

SolR & ElasticSearch !

SolR and ElasticSearch are both great ways to add search capability (and more) to your projects. And behind them, there is Lucene !

Thursday, May 2, 2013

Coursera !

A very cool and free way to learn : Coursera !

Monday, April 29, 2013

What can I do with Mahout ?

  • Clustering
    • Canopy
    • K-Means
    • Fuzzy K-Means
    • Dirichlet Process
    • Latent Dirichlet Allocation
    • Mean-shift
    • Expectation Maximization
    • Spectral
    • Minhash
    • Top Down
  • Classification
    • Logistic Regression
    • Bayesian
    • Support Vector Machines  
    • Random Forests
  • Decision forest
  • Machine learning
  • Recommendation
  • Dimension reduction
  • Your own business ! (If you understand how MapReduce and Mahout class work together, you can code your own logic)

Friday, April 26, 2013

Mahout !

Mahout is an incredible library to do machine learning, clustering, classification, recommendation. It works directly on top of Hadoop and MapReduce !

This is how to launch a recommender Job :

hadoop jar /usr/lib/mahout/mahout-core-0.7.0.21-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob --input /apps/hive/warehouse/profile_activity_text_file --output /apps/hive/warehouse/recommenderJob --similarityClassname SIMILARITY_COOCCURRENCE --booleanData true
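
To give an idea of what co-occurrence similarity on boolean data means, here is a tiny in-memory sketch in Python. It is not Mahout's implementation (which runs as a series of MapReduce jobs); it only shows the scoring idea, and the function name and sample data are mine.

```python
from collections import defaultdict
from itertools import permutations


def recommend(user_items, user, top_n=3):
    """Item-based recommendation on boolean data using co-occurrence counts."""
    # cooccurrence[a][b] = number of users who have both item a and item b
    cooccurrence = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        for a, b in permutations(set(items), 2):
            cooccurrence[a][b] += 1

    owned = set(user_items[user])
    scores = defaultdict(int)
    for item in owned:
        for other, count in cooccurrence[item].items():
            if other not in owned:
                scores[other] += count
    # Highest score first; ties broken by item name to keep the output deterministic.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


user_items = {
    "u1": ["a", "b"],
    "u2": ["a", "b", "c"],
    "u3": ["a", "c"],
    "u4": ["b", "d"],
}
suggestions = recommend(user_items, "u1")
```

For u1, item c co-occurs three times with the items already owned and d only once, so c is suggested first.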

Wednesday, April 10, 2013

Storm & Hadoop/Hive partitioning load !

I am currently working on how Storm can load data into a partitioned Hadoop/Hive table.

This is how I do it :
  • put the Hadoop libs into the Storm lib directory
  • add the Hadoop XML conf files and load them using conf.addResource();
  • create a HDFSBolt (implements IRichBolt)
  • add a private HashMap<String, FSDataOutputStream> (partition -> output stream)
  • override the execute function (if the partition already exists, use its current stream, else create a new one)
You can also choose to do the partitioning with a Storm grouping and so limit the number of partitions per worker !
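
The "one open stream per partition" logic of the execute function can be sketched in a few lines of Python. This is not the Java bolt itself : in-memory buffers stand in for FSDataOutputStream, and the class, function and partition names are hypothetical.

```python
import io


class PartitionedWriter:
    """Keeps one open output stream per partition, creating it on first use."""

    def __init__(self, open_stream):
        self._open_stream = open_stream   # callable: partition -> writable stream
        self._streams = {}                # partition -> stream, like the HashMap in the bolt

    def execute(self, partition, line):
        stream = self._streams.get(partition)
        if stream is None:
            # First tuple for this partition: open a new stream and remember it.
            stream = self._open_stream(partition)
            self._streams[partition] = stream
        stream.write(line + "\n")

    def close(self):
        for stream in self._streams.values():
            stream.close()
        self._streams.clear()


buffers = {}


def open_buffer(partition):
    """In-memory stand-in for opening a file under the partition directory."""
    buffers[partition] = io.StringIO()
    return buffers[partition]


writer = PartitionedWriter(open_buffer)
writer.execute("c_date=2013-04-10", "row 1")
writer.execute("c_date=2013-04-10", "row 2")
writer.execute("c_date=2013-04-11", "row 3")
```

Grouping tuples by partition before they reach the bolt keeps the number of open streams per worker small, which is the point of the last remark above.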

Tuesday, April 9, 2013

Hadoop Operations at LinkedIn !

Watch this video ! It is very interesting !

Saturday, April 6, 2013

Photo #6 !


Scala !

I am learning Scala, very powerful and used inside Shark !

Monday, March 25, 2013

Memrise !

Do you want to learn or improve your language skills ? Try Memrise ! It's free, it's cool !

Google Code Jam !

You can register here !

Sunday, March 24, 2013

The R-project for Statistical Computing !

R is a famous language for statistical computing and graphics ! You can make it work with Hadoop too with this library !

Friday, March 22, 2013

Chart.js !

A cool open-source javascript library ! Try Chart.js !

Storm & real-time ETL !

Storm is an amazing scalable, fault-tolerant, open-source, real-time ETL. Let's storm !

Saturday, March 16, 2013

Main Storm daemons !

  • Nimbus (The Storm JobTracker)
  • Supervisor (The supervisor daemon is responsible for starting and stopping worker processes)
  • UI (administration website)

Thursday, March 7, 2013

HIVE-3963 : Hive & RDBMS example !

>add jar /home/hive/developpement/loadfromjdbc.jar;
>add jar /home/hive/developpement/tdgssconfig.jar;
>add jar /home/hive/developpement/terajdbc4.jar;
>create temporary function loadfromjdbc as 'mlanciau.dev.loadfromjdbc';
>SELECT result['column_name1'], result['column_name2']
FROM (
 SELECT loadfromjdbc('com.teradata.jdbc.TeraDriver', 
 'jdbc:teradata://ip/CHARSET=UTF16',
 'db_user', 'db_password',
 'SELECT * FROM database_name.table_name') AS (result) FROM dual
) ssreq 

We can now join Hadoop data with database data directly from Hive.

SQL inside SQL ! Browse HIVE-3963.

Monday, February 25, 2013

D3.js

D3.js is a JavaScript library for data visualization !

Stinger : Apache Hive 100 Times Faster !

It is awesome ! Read more here !

Wednesday, February 13, 2013

Photo #5


Tuesday, February 5, 2013

HIVE-3963 : Hive & RDBMS !

I will be working soon on HIVE-3963. My goal is to allow Hive to read from / write to a database through JDBC. I have already tried and it works well !

Wednesday, January 16, 2013

Monday, January 14, 2013

Photo #4


Thursday, January 3, 2013

Tuesday, January 1, 2013

Thursday, December 20, 2012

Wednesday, December 12, 2012

Photo #3


Friday, December 7, 2012

Hive and custom Map Reduce !

With perl, python or java :
ADD file reducer.pl;
FROM (
 FROM t_table
 SELECT c_1, c_2
 WHERE c_2 = '2012-12-07'
 DISTRIBUTE BY c_1
 SORT BY c_2
) ssreq
REDUCE ssreq.c_1, ssreq.c_2
USING 'reducer.pl'

You can find more in the Hive tutorial !
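
For the Python case, here is what a reducer script could look like. Hive streams the selected columns to the script as tab-separated lines on stdin and reads tab-separated lines back from stdout. The counting logic and the file name reducer.py are my own example, not the content of reducer.pl.

```python
import sys
from collections import Counter


def reduce_lines(lines):
    """Count rows per c_1; each input line is 'c_1<TAB>c_2'."""
    counts = Counter()
    for line in lines:
        key, _c_2 = line.rstrip("\n").split("\t", 1)
        counts[key] += 1
    # DISTRIBUTE BY c_1 sends all rows of a key to the same reducer,
    # so the per-key counts computed here are complete.
    return [f"{key}\t{n}" for key, n in sorted(counts.items())]


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point when the file is used as a streaming script."""
    for out in reduce_lines(stdin):
        stdout.write(out + "\n")

# When saved as reducer.py for Hive, finish the file with a call to main().
```

The script would then be shipped with ADD FILE and typically invoked with something like USING 'python reducer.py' instead of 'reducer.pl'.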

Sunday, December 2, 2012

Codecademy !

Have you heard about Codecademy ?

Monday, October 29, 2012

Impala !

Cloudera has created Impala for real time query on Hadoop (without MapReduce). Be ready !

Monday, October 22, 2012

Saturday, October 20, 2012

Real time Hadoop !

Want to use Hadoop for real time processing ? Then use Flume for collecting, Storm for calculation and HBase for handling client IO !

Monday, October 15, 2012

Hive SerDe !

SerDe means Serializer/Deserializer; it is one of the amazing powers of Hive/Hadoop.
CREATE EXTERNAL TABLE t_access_log_part (
 c_proxy STRING, c_ip STRING, c_timetaken STRING,
 c_jour STRING, c_mois STRING,
 c_annee STRING, c_hour STRING,
 c_reste_timestamp STRING, c_commande STRING,
 c_fichier STRING, c_protocole STRING,
 c_code_retour STRING, c_size STRING,
 c_reste STRING, c_identifiant STRING
)
PARTITIONED BY (c_date string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES  (
"input.regex" = "([a-zA-Z-0-9]*)[^\\t]*\\t(\\d{1,3}[.]\\d{1,3}[.]\\d{1,3}[.]\\d{1,3})[ ][^ ]+[ ](\\d+)[ ]+\\[(\\d+)/([0-9a-zA-Z]+)/([0-9a-zA-Z]+):(\\d+):(.*)\\][ ]\"(\\w+)[ ](.*)[ ]+([A-Za-z0-9/.]+)\"[ ]+([0-9A-Za-z]+)[ ]+([0-9A-Za-z-]+)[ ]+\"(.*)\"[ ]+\"(.*)\"",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s %13$s %14$s %15$s"
)
STORED AS TEXTFILE;


No ETL !
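
To show the principle behind a regex SerDe (one capture group per column), here is a sketch in Python. The pattern is a simplified one I wrote for the example, not the one from the table above, and the sample log line is invented.

```python
import re

# Simplified, hypothetical pattern:
# ip [day/month/year:hour:rest] "command file protocol" code size
LOG_PATTERN = re.compile(
    r'(\d{1,3}(?:\.\d{1,3}){3}) '
    r'\[(\d+)/(\w+)/(\d+):(\d+):([^\]]*)\] '
    r'"(\w+) (\S+) (\S+)" '
    r'(\d+) (\d+|-)'
)
COLUMNS = ("c_ip", "c_jour", "c_mois", "c_annee", "c_hour", "c_reste_timestamp",
           "c_commande", "c_fichier", "c_protocole", "c_code_retour", "c_size")


def parse(line):
    """Return a dict of column -> captured string, or None when the line does not match."""
    match = LOG_PATTERN.match(line)
    return dict(zip(COLUMNS, match.groups())) if match else None


row = parse('10.0.0.1 [15/Oct/2012:10:30:00 +0200] "GET /index.html HTTP/1.1" 200 512')
```

Every captured group becomes a STRING column, which is why all the columns of the table are declared as STRING.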

Friday, October 12, 2012

Project CARS !

Wooooooo look !

Monday, October 1, 2012

Java HBase API !

Want to interact directly with HBase (web, bulk load) ? Java !

Monday, September 17, 2012

Hive development !

With Hive you can
  • do ETL and analysis using QL (a SQL-like query language)
  • create partitioned or bucketed tables !
  • compress your data !
  • use java reflection to instantiate and call methods of objects
  • build custom scalar functions (UDF's), aggregations (UDAF's), and table functions (UDTF's)

Sunday, September 16, 2012

Hive join tips !

You can specify the [biggest] table to be streamed during a join operation :
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1);

And you can specify the [smallest] table to be loaded in memory for a map join if you want to avoid the reducer :
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key;

And if you want to do ... WHERE a.key IN (SELECT ...) use :
SELECT a.key, a.val FROM a LEFT SEMI JOIN b on (a.key = b.key);
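
Here is a sketch in Python of the idea behind the last two hints : a map join keeps the small table in memory and streams the big one, and a left semi join only checks that a key exists on the other side. It is an illustration of the principle, not Hive's implementation, and the function names are mine.

```python
def map_join(big_rows, small_rows):
    """Hash join: the small table is loaded into memory, the big one is streamed."""
    small_by_key = {}
    for key, value in small_rows:
        small_by_key.setdefault(key, []).append(value)
    for key, value in big_rows:              # single pass, no shuffle / reduce step
        for small_value in small_by_key.get(key, []):
            yield key, value, small_value


def left_semi_join(a_rows, b_rows):
    """Keep each row of a whose key exists in b, at most once, like IN (SELECT ...)."""
    b_keys = {key for key, _ in b_rows}
    return [(key, value) for key, value in a_rows if key in b_keys]


a = [(1, "a1"), (2, "a2"), (3, "a3")]
b = [(1, "b1"), (1, "b1bis"), (3, "b3")]
joined = list(map_join(a, b))
filtered = left_semi_join(a, b)
```

Note that the regular join returns the row of key 1 twice because b contains that key twice, while the semi join returns it once.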

Saturday, September 15, 2012

Firebug !

A cool way to [do/debug/monitor] your web development !

Friday, September 14, 2012

Creating Custom Hive UDFs !

I really enjoy using Hive because ETL and analysis become easy, but sometimes you need to create your own function ! Follow the link !

Google Web Fonts !

Want to use open-source fonts on your web site ? Then try Google Web Fonts !

Sunday, September 9, 2012

Google Chart Tools !

Want to add some charts to your website ? Try Google Chart Tools !

Wednesday, August 15, 2012

Photo #2


Saturday, July 28, 2012

Wednesday, July 4, 2012

Muse joke !

Just watch this video !

Hive development !

If you read this website, you will realise what an incredible tool Hive is, especially if you configure your Metastore to work with an RDBMS.

You can :
  • CREATE TABLE (on HDFS or on HBase)
  • SHOW, DESCRIBE, ALTER, DROP TABLE !
  • LOAD DATA (from HDFS or locally)
  • SELECT ... FROM ... WHERE or BETWEEN UNION ALL ... (like a RDBMS !)
  • INSERT ... SELECT ...
  • use well-known functions like DISTINCT, COUNT, SUM, AVG and of course the GROUP BY operation !
  • you can [LEFT OUTER|RIGHT OUTER|FULL OUTER] JOIN multiple tables !
  • use built-in functions
  • manipulate INT, BIGINT, FLOAT, BOOLEAN, DOUBLE, TIMESTAMP, STRING and complex types like structs, maps and arrays !
And look at multitable insert, streaming and stay tuned !

Monday, July 2, 2012

Eclipse 4 !

Eclipse Juno is now ready for download !

Saturday, June 30, 2012

Hadoop best practice !

  • Check if your dfs.name.dir has one local value and another directory like NFS.
  • No swap allowed !
  • Enough memory !!
  • The Secondary Name Node is not on the same machine as the NameNode.
  • Time is synchronized !
  • Fast network !
  • ulimit is set

Main Hadoop daemons

  • NameNode (stores HDFS metadata)
  • DataNode (stores HDFS data)
  • Secondary NameNode (takes snapshots of the HDFS metadata)

  • JobTracker  (determines the execution plan)
  • TaskTracker (executes MapReduce job)

Antoine de Maximy !

I like watching his travels, so amazing !

Monday, June 25, 2012

Hadoop Summit 2012

If you want to see Hadoop Summit 2012 presentations, click here !

Monday, June 18, 2012

Pig !

When you want to tackle your Hadoop data, you can use Pig too ! Start here !

Sunday, June 17, 2012

Clusterssh !

For cluster administration, use clusterssh !

Saturday, June 16, 2012

JDJV !

If you understand French and like video games, watch JDJV. For now I am waiting for Watch Dogs and the new God Of War !

Hadoop platforms !

You can download and set up the components you want or choose to install a platform. These are the most famous for now :

Friday, June 15, 2012

Thursday, June 7, 2012

Hadoop script !

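#!/bin/bash
# Push the local HBase installation to every region server listed in the conf file
for serveur in $(cat "$HBASE_HOME/conf/regionservers")
do
   # Trailing slashes: sync the content of the directory, not the directory into itself
   rsync -avz --delete --exclude='logs/*' "$HBASE_HOME/" "$serveur:$HBASE_HOME/"
done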

Wednesday, June 6, 2012

Monday, May 28, 2012

Sync with firefox !

If you want to keep your bookmarks, history, passwords, add-ons and open tabs synchronized across all your Firefox installations, try Sync !

Tuesday, May 22, 2012

Java + SVN = SVNKit !

Need to access SVN through Java ? Try SVNKit !

Sunday, May 20, 2012

The toolbox !

If you are a web developer, visit the toolbox !

Friday, May 18, 2012

Hive !

Hive provides a simple SQL-like query language. It allows you to perform ETL work and to access files from HDFS or HBase without writing any MapReduce. Start here !

So HBase !

Why : large scale (> 100 GB)
Specificity : versioned cell, column oriented, on top of Hadoop, sparse, open-source
Blog : http://hadoop-hbase.blogspot.fr/

Thursday, May 17, 2012

How to become a parisian in 1 hour !

You must go and see this one-man show ! Really funny !

Wednesday, May 16, 2012

Monday, May 14, 2012

CSS3 generator !

A great tool to add CSS3 to your website : CSS3 generator !

Informatica repository tips !

Do you want to check whether there are any useless workflows ? Connect to your repository and begin your analysis with REP_SESS_TBL_LOG or OPB_SWIDGINST_LOG !

Sunday, May 13, 2012

Modernizr !

Want to check whether the browser supports a feature ? Try Modernizr !

jQuery !

Want to build beautiful and interactive websites ? Try jQuery and jQuery UI !

Saturday, May 12, 2012

Portal !

if (youLikeToSolve("puzzle")) {
    try {
        portal();
    } catch (TimeToSleepException e) {
        computer.stop();
    }
}

Thursday, May 10, 2012

Bref !

An amazing French series : Bref !

Sqoop || Talend !

Sqoop is a great tool to communicate between RDBMS and Hadoop. But you can use the Talend platform for Big Data too (or another ETL) ! Click here !

HCatalog !

HCatalog is a good "meta-way" to tackle your hadoop data !

Informatica && Performance !

Informatica is one of the greatest ETL tools, mainly for its performance ! Do you need to improve your sessions ? Look for bottlenecks, and start with your thread statistics !

Tuesday, May 8, 2012

Dropbox !

Dropbox is useful software to store, share and synchronize your files ! Download it here !

GT Academy 2012 !

If you want to become a real racing driver, turn on your PS3 and drive !

HBase !

HBase is the Hadoop column-oriented database ! Start here !

Sunday, April 29, 2012

Ubuntu 12.04

Ubuntu, the famous Linux operating system, now has a new long-term support version ! You can download it here !

I'm alive !!

If you want to feel like Will Smith in I Am Legend, try I'm alive !

Saturday, April 28, 2012

BBC Radio !

If you want to listen to BBC radio, follow the link !

Saturday, April 21, 2012

Java !

Because of the WORA aspect, I chose to learn Java. It's a powerful programming language for building software, websites (JEE + HTML + CSS), phone applications (Android) and so many other things (Talend, Hadoop) !

Wednesday, April 18, 2012

ETL !

When you want to deal with your [unstructured|semi-structured|structured] data, you need a lot of tools and especially one ! An ETL !

Tuesday, April 17, 2012

BI ?

What is business intelligence ? A "data-better" way to manage !

Me ?

You can find me on twitter, linkedIn, facebook, instagram or viadeo.

Monday, April 16, 2012

Hello !

I am here to share news and advice about Business Intelligence and other stuff !
Hope it will help someone ^^ !