Data Mining Engine Installation

SpagoBIDataMiningEngine is the SpagoBI Data Mining engine. It replaces the previous Weka Engine and integrates R scripting capabilities.
With this engine you can execute multiple R scripts interactively and visualise several kinds of output, including R graphics. It also lets users perform statistical or data mining analysis on different files or SpagoBI datasets.

Installation

To run the SpagoBI Data Mining engine, R must be installed (following the instructions for your OS) on the same machine where SpagoBI is installed.

R_HOME must be set according to the R installation guide, depending on the OS (e.g. "C:\Program Files\R\R-3.0.3").
If you are using Windows, add the R bin directory for your architecture, i386 or x64 (e.g. "C:\Program Files\R\R-3.0.3\bin\x64"), to the Path environment variable.

The Data Mining engine also needs JRI in order to execute R scripts from Java.
JRI is installed together with the rJava package, either by running the R command install.packages("rJava") (from the R console or the R GUI) or by downloading the binary zip (e.g. http://rforge.net/bin/windows/contrib/3.0/rJava_0.9-5.zip for Windows). Remember to pick the version that matches your R version.
If you are using Windows, add the rJava jri directory for your architecture, i386 or x64 (e.g. "D:\progetti\RIntegration\rJava\jri\x64;"), to the Path environment variable.
As stated in the JRI documentation, java.library.path must point to the JRI directory. Set it in the SpagoBI application server:

For example, for Windows 7 and Tomcat, add the following to catalina.bat:
set JAVA_OPTS="-Djava.library.path=C:\Program Files\R\R-3.0.3\library\rJava\jri\x64"
set R_HOME="C:\Program Files\R\R-3.0.3"

Check that the folder <TOMCAT_HOME>/resources/datamining exists and has read/write permission. If it does not exist, create it.
On a Linux server, add similar instructions to the shell script that starts your application server (catalina.sh for Apache Tomcat):
export R_HOME='/usr/lib64/R'
export LD_LIBRARY_PATH='/usr/lib64/R/library/rJava/jri'

The actual paths depend on where R and JRI are installed.
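
As a minimal sketch of the Linux setup above, the function below checks that both directories exist before exporting the variables, so a wrong path is reported before the server starts rather than as a JRI loading error later. The name check_r_env is hypothetical and not part of SpagoBI or Tomcat; the paths passed to it are whatever applies to your installation.

```shell
# Hypothetical helper (not part of SpagoBI or Tomcat): validate the R and JRI
# directories, then export R_HOME and LD_LIBRARY_PATH for the current shell.
check_r_env() {
    r_home="$1"    # R installation directory, e.g. /usr/lib64/R
    jri_dir="$2"   # JRI directory inside the rJava package

    if [ ! -d "$r_home" ]; then
        echo "R_HOME not found: $r_home" >&2
        return 1
    fi
    if [ ! -d "$jri_dir" ]; then
        echo "JRI directory not found: $jri_dir" >&2
        return 1
    fi

    R_HOME="$r_home"
    # Prepend the JRI directory, keeping any existing LD_LIBRARY_PATH entries.
    LD_LIBRARY_PATH="$jri_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export R_HOME LD_LIBRARY_PATH
}
```

With the paths from the example above, the function could be called from the start script as check_r_env /usr/lib64/R /usr/lib64/R/library/rJava/jri before the server is launched.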

Document Template

The template is a simple XML file that lets the developer configure the document's behaviour. Here is an example of the template:

<?xml version="1.0" encoding="ISO-8859-15"?>
<DATA_MINING>
   <DATASETS>
       <DATASET name="fileDS" readType="table" type="file" label="label Data set 1" canUpload="true">  
           <![CDATA[ ...read_options...]]>
       </DATASET>
       <DATASET name="spagobiDS" spagobiLabel="datasetQQQ" type="spagobi_ds"  label="label Data set 2"/>
   </DATASETS>  
   <SCRIPTS>  
       <SCRIPT name="scriptAAA" datasets="fileDS,spagobiDS"  label="label Script1" libraries="a,b,c">
           <![CDATA[....x,y...
    action_to_call<-function(x){
    ...
    }  
   ]]>

       </SCRIPT>
       <SCRIPT name="scriptBBB" datasets="fileDS" label="label Script2">
           <![CDATA[...z,y...
    z1<-'$P{var1}'
    ...
   ]]>

       </SCRIPT>
       <SCRIPT name="scriptCCC" label="label Script3">
           <![CDATA[...z...
    z2<-$P{var2}
   ]]>

       </SCRIPT>
   </SCRIPTS>
   <COMMANDS>
       <COMMAND name="command1" scriptName="scriptAAA" label="label Command 1" mode="auto">
           <OUTPUTS>
               <OUTPUT type="image" name="a" value="x"  function="plot" mode="auto" label="label Output 1"/>
               <OUTPUT type="image" name="c" value="z,k"  function="biplot" mode="manual" label="label Output 2"/>
               <OUTPUT type="text" name="d" value="y"  mode="manual" label="label Output 3"/>
           </OUTPUTS>
       </COMMAND>
       <COMMAND name="command2" scriptName="scriptBBB" label="label Command 2" mode="manual" action="function1(x)">
           <VARIABLES>
               <VARIABLE name="var1" default="valuevar1"/>
           </VARIABLES>
           <OUTPUTS>
               <OUTPUT type="text" name="c" value="y,z" function="function2" mode="manual" label="label Output 1"/>
           </OUTPUTS>
       </COMMAND>
       <COMMAND name="command3" scriptName="scriptCCC" label="label Command 3" mode="manual" action="action_to_call">
           <OUTPUTS>
               <OUTPUT type="text" name="e" value="z2"  mode="manual" label="label Output 1">
                   <VARIABLES>
                       <VARIABLE name="var2" default="valuevar2"/>
                   </VARIABLES>
               </OUTPUT>
               <OUTPUT type="image" name="f" value="" function="rectf(z)" mode="auto" label="label 2"/>
           </OUTPUTS>
       </COMMAND>
   </COMMANDS>
</DATA_MINING>

Tags:

  • <DATASETS> This tag contains all the available datasets (of different types) that will be used by the scripts. Datasets are optional: if the data mining script does not need a dataset, this tag is left empty.
  • <DATASET> Each dataset is configured through this tag and is loaded into the user's R workspace at the beginning of the execution. There are two types of dataset, SpagoBI dataset and file, identified by the "type" attribute as explained below. For a "file" dataset, the CDATA section can hold the read options for the file (e.g. header = TRUE, sep = ",", quote = "\""). If not specified, the default is header = TRUE, sep = ",".

The attributes of the DATASET tag are:


    • type= This is the main attribute, on which some of the others depend. It indicates whether the dataset is loaded from a file (type="file") or from a SpagoBI dataset resultset (type="spagobi_ds"). For a file dataset, data are read from a file uploaded manually by the document's end user at runtime. If the file was uploaded in a previous execution, the user can reuse it (the default) or upload it again and re-execute the document.
    • name= The name of the dataset, used as its ID. It is also the name of the R variable to which the data.frame is assigned, so the dataset content can be referenced by the dataset name inside the script itself.
    • readType= This attribute must be specified only if the type is "file". It is the suffix of the R read function to use (e.g. readType="table" for read.table).
    • label= The label of the dataset displayed by the GUI.
    • spagobiLabel= This attribute must be specified only if the type is "spagobi_ds". It is needed to find the correct SpagoBI dataset, whose resultset is loaded and assigned to the dataset name.
    • canUpload= Must be set to true whenever you want to upload the file from the GUI.
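
As a rough illustration of how the name, readType and CDATA read options fit together, the sketch below prints the R statement that a file dataset would plausibly correspond to. The helper build_read_call and the file path are hypothetical, and the exact call the engine issues is an assumption rather than something taken from its source.

```shell
# Hypothetical helper, not part of SpagoBI: prints the R statement implied by
# a file DATASET's name, readType and CDATA read options.
build_read_call() {
    name="$1"        # DATASET name, used as the R variable name
    read_type="$2"   # DATASET readType, the suffix of R's read.* function
    file_path="$3"   # path of the uploaded file (assumed for illustration)
    options="${4:-}" # content of the CDATA section, if any

    # Fall back to the documented defaults when no read options are given.
    if [ -z "$options" ]; then
        options='header = TRUE, sep = ","'
    fi
    printf '%s <- read.%s("%s", %s)\n' "$name" "$read_type" "$file_path" "$options"
}

build_read_call fileDS table /tmp/data.csv
# prints: fileDS <- read.table("/tmp/data.csv", header = TRUE, sep = ",")
```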
  • <SCRIPTS> This tag contains all the available scripts. 
  • <SCRIPT> Each script is configured through this tag, which holds the R script (indentation is allowed) in its CDATA content.

Attributes:


    • name= The name of the script (used as internal reference in the template)
    • datasets= A comma-separated list of dataset names to be used by the script, which references them by those names. The names are not trimmed, so do not put spaces around the commas.
    • label= Label for the GUI (not implemented)
  • <COMMANDS> This tag contains all the available commands. 
  • <COMMAND> Commands drive the Data Mining GUI. Each command can perform multiple operations and produce several outputs.

Attributes:


    • name= The name of the command (used for internal reference in the template).
    • scriptName= The name of the script executed by this command. 
    • label= Label of the command displayed in document's GUI.
    • mode= This attribute leads the whole document behaviour. Possible values are:
      • "auto": means that this is the first command to be executed. Typically it is assigned to the command that holds the script for preprocessing or for function and object definitions.
      • "manual": all the other commands that will be activated manually by the user interacting with the document.
    • action= A call to the function (belonging to the script named in the command's scriptName attribute) that must be executed to produce the result. If the script already contains such a call, this attribute is optional. The arguments of the function called by the action attribute can be constants or variables already defined in the command's script.
  • <OUTPUTS> List of Outputs for a specific command.
  • <OUTPUT> Each output that can be displayed executing the parent command, is defined as follows:

    • type= Sets whether the output is "image", "text" or "html". To use the html output type, the R2HTML and RCurl packages must be installed in the R environment. This output type can also render data frames: assign the html output type to an R value that is a data frame, or to a function that returns one (e.g. head(iris)).
    • name= The name of the output (template internal reference).
    • mode= Possible values are:
      • "auto": means that this is the first output of the parent command to be displayed. 
      • "manual": all the other outputs that will be activated manually by the user interacting with the document.
    • function= Optional attribute used to call the specific function (embedded in the code of the parent command's script) that renders the result (e.g. a complex barplot, a biplot, etc.). The arguments of the function can be constants or variables already defined in the parent command's script.
      If you use the 'function' attribute without the 'value' attribute, write the function name with parentheses and its parameters inside them (e.g. function="myFuncWithParam(a,b)" or function="myFuncNoParam()"). If you use both 'value' and 'function' and the function takes parameters, put the parameters in 'value' and the function name without parentheses in 'function' (e.g. value="a,b" function="myFuncWithParam").
    • value= Optional. Used to indicate what variable or constant must be displayed. (Its usage could be overridden by function attribute)
    • label= The label used by the front end.
  • <VARIABLES> List of all variables, which can belong to either a COMMAND or an OUTPUT tag. They are used to change factors or, more generally, parameters (strings or numbers) inside the script (referenced by a COMMAND) or inside the OUTPUT functions.
  • <VARIABLE> Each variable can be referenced inside:
    • the function attribute of the OUTPUT tag
    • the value attribute of the OUTPUT tag
    • the SCRIPT tag itself (in the CDATA content)
      with the syntax $P{variable_name}
      The attributes of the VARIABLE tag are:
      • name= the variable_name to be referenced in $P{variable_name}
      • default= the default value used to run the script. Once the document has been executed, this value can be changed through the GUI.
        If the VARIABLE belongs to an OUTPUT, an input field is displayed for each variable in the respective output panel, to change its value.
        If it belongs to a COMMAND, it stays hidden until the user double-clicks the respective COMMAND horizontal tab name, at which point a pop-up
        appears for changing the value.
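
The effect of a $P{variable_name} placeholder can be pictured as a plain text replacement performed before the script runs. The sketch below assumes exactly that; the engine's real mechanism is not described here, and substitute_param is a hypothetical helper written only for illustration.

```shell
# Hypothetical helper, not part of SpagoBI: replaces every literal
# $P{name} placeholder read from standard input with the given value.
# Limitation of this sketch: the value must not contain '|', '&' or '\'.
substitute_param() {
    name="$1"
    value="$2"
    sed "s|\\\$P{$name}|$value|g"
}

# A quoted placeholder yields an R string, an unquoted one a bare R value.
printf '%s\n' "z1<-'\$P{var1}'" | substitute_param var1 valuevar1
# prints: z1<-'valuevar1'
printf '%s\n' 'z2<-$P{var2}' | substitute_param var2 42
# prints: z2<-42
```

Under this reading, whether the placeholder is wrapped in quotes inside the script decides whether R sees the value as a string or as a number or object name.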

Document Detail

To define a Data Mining document, developers should configure the document's detail as usual, as follows:

detail.PNG

where the only peculiar settings are:

  • type: "Data Mining"
  • engine: "Data-Mining Engine"
  • template: upload the template previously defined.

Document GUI

Executing a Data Mining document is very simple. Once the document is configured, start the execution by clicking the document's icon in the SpagoBI Document Browser interface.
The document displays a horizontal tab panel containing the commands and, for each command, a vertical panel containing its possible outputs.
An output is displayed immediately if there is a command with mode set to "auto" that has an output with mode="auto", or if the auto-mode command has just one output.
Here is an example:

pca.PNG

If the command refers to a script that needs one or more datasets, then these datasets are displayed at the top of the document as buttons (for file datasets) or labels (for SpagoBI datasets, which cannot be changed from the GUI). Clicking a button lets you upload a file to replace the existing one. A "Run script" button then appears to re-execute the document.

pca2.PNG

Created by Monica Franceschini on 2014/09/11 15:09
Last modified by Alessio Conese on 2016/05/20 14:40

This wiki is licensed under a Creative Commons 2.0 license