Re: Scala Limitation - Case Class definition with more than 22 arguments

2015-09-27 Thread Dean Wampler
While case classes no longer have the 22-element limitation as of Scala
2.11, tuples are still limited to 22 elements. For various technical
reasons, this limitation probably won't be removed any time soon.

However, you can nest tuples, just as you can nest case classes, in most contexts. So, the
last bit of your example,

(r: ResultSet) => (r.getInt("col1"),r.getInt("col2")...r.getInt("col37")
)

could add nested () to group elements and keep the outer number of elements
<= 22.
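
For example, here is a rough sketch (added for illustration; the grouping and
helper names like Cols1To3 are hypothetical, and only the first few of the 37
columns are shown):

import java.sql.ResultSet

// Nested tuples: group columns into sub-tuples so the outer tuple stays
// within the 22-element limit.
val mapRow: ResultSet => ((Int, Int, Int), (Int, Int, Int)) = r =>
  ((r.getInt("col1"), r.getInt("col2"), r.getInt("col3")),
   (r.getInt("col4"), r.getInt("col5"), r.getInt("col6")))

// The same idea with case classes: group related columns into smaller case
// classes and nest them in an outer case class.
case class Cols1To3(col1: Int, col2: Int, col3: Int)
case class Cols4To6(col4: Int, col5: Int, col6: Int)
case class SqlRow(first: Cols1To3, second: Cols4To6)

val rowOf: ResultSet => SqlRow = r =>
  SqlRow(Cols1To3(r.getInt("col1"), r.getInt("col2"), r.getInt("col3")),
         Cols4To6(r.getInt("col4"), r.getInt("col5"), r.getInt("col6")))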

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition (O'Reilly)
Typesafe
@deanwampler
http://polyglotprogramming.com

On Thu, Sep 24, 2015 at 6:01 AM, satish chandra j 
wrote:

> HI All,
>
> In addition to the case class limitation in Scala, I am finding a tuple
> limitation too; please find the explanation below
>
> //Query to pull data from Source Table
>
> var SQL_RDD= new JdbcRDD( sc, ()=>
> DriverManager.getConnection(url,user,pass),"select col1, col2,
> col3..col37 from schema.Table LIMIT ? OFFSET ?",100,0,1,(r:
> ResultSet) => (r.getInt("col1"),r.getInt("col2")...r.getInt("col37")))
>
>
> //Define Case Class
>
> case class sqlrow(col1:Int,col2:Int..col37:Int)
>
>
> var SchSQL= SQL_RDD.map(p => new sqlrow(p._1,p._2...p._37))
>
>
> Then I apply CreateSchema to the RDD and then registerTempTable to define
> a table for use in the SQL context in Spark.
>
> As per the above SQL query I need to fetch 37 columns from the source
> table, but it seems Scala's tuple restriction applies to the ResultSet
> mapping function (the r variable) in the code above. Please let me know if
> there is any workaround for this.
>
> Regards,
> Satish Chandra
>
> On Thu, Sep 24, 2015 at 3:18 PM, satish chandra j <
> jsatishchan...@gmail.com> wrote:
>
>> Hi All,
>> As this is for SQL purposes, I understand I need to go ahead with the
>> custom case class approach.
>> Could anybody share sample code for creating a custom case class to refer
>> to? That would be really helpful.
>>
>> Regards,
>> Satish Chandra
>>
>> On Thu, Sep 24, 2015 at 2:51 PM, Adrian Tanase  wrote:
>>
>>> +1 on grouping the case classes and creating a hierarchy – as long as
>>> you use the data programmatically. For DataFrames / SQL the other ideas
>>> probably scale better…
>>>
>>> From: Ted Yu
>>> Date: Wednesday, September 23, 2015 at 7:07 AM
>>> To: satish chandra j
>>> Cc: user
>>> Subject: Re: Scala Limitation - Case Class definition with more than 22
>>> arguments
>>>
>>> Can you switch to 2.11 ?
>>>
>>> The following has been fixed in 2.11:
>>> https://issues.scala-lang.org/browse/SI-7296
>>>
>>> Otherwise consider packaging related values into a case class of their
>>> own.
>>>
>>> On Tue, Sep 22, 2015 at 8:48 PM, satish chandra j <
>>> jsatishchan...@gmail.com> wrote:
>>>
 Hi All,
 Are there any alternative solutions in Scala to avoid the limitation on
 defining a case class with more than 22 arguments?

 We are using Scala version 2.10.2. Currently I need to define a case
 class with 37 arguments, but I am getting the error "error: Implementation
 restriction: case classes cannot have more than 22 parameters."

 Any input on this would be a great help.

 Regards,
 Satish Chandra



>>>
>>
>


Re: HDFS small file generation problem

2015-09-27 Thread ayan guha
I would suggest not writing small files to HDFS. Rather, you can hold them
in memory, maybe off heap, and then flush them to HDFS using another job,
similar to https://github.com/ptgoetz/storm-hdfs (not sure if Spark already
has something like it); see the sketch below.
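
A minimal sketch of that idea (the source, window length, and paths here are
hypothetical assumptions, not from this thread): accumulate several
micro-batches in a Spark Streaming window and write one set of larger files
per window instead of one per small batch.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object WindowedHdfsWriter {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("WindowedHdfsWriter"), Seconds(10))
    val events = ssc.socketTextStream("localhost", 9999)  // hypothetical source
    events
      .window(Minutes(10), Minutes(10))  // hold ~10 minutes of events in memory
      .repartition(1)                    // one large file per window
      .saveAsTextFiles("hdfs:///events/batched/part")
    ssc.start()
    ssc.awaitTermination()
  }
}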

On Sun, Sep 27, 2015 at 11:36 PM,  wrote:

> Hello,
> I'm still investigating the small file generation problem created by my
> Spark Streaming jobs.
> Indeed, my Spark Streaming jobs receive a lot of small events (avg 10 KB),
> and I have to store them in HDFS in order to process them with Pig jobs
> on demand.
> The problem is that I generate a lot of small files in HDFS (several
> million), which can be problematic.
> I looked into using HBase or archive files, but decided against them.
> So, what about this solution:
> - Spark Streaming generates several million small files in HDFS on the fly
> - Each night I merge them into one big daily file
> - I launch my Pig jobs on this big file?
>
> Another question I have:
> - Is it possible to append to the big (daily) file by adding my events on
> the fly?
>
> Tks a lot
> Nicolas
>
>
>


-- 
Best Regards,
Ayan Guha


HDFS small file generation problem

2015-09-27 Thread nibiau
Hello,
I'm still investigating the small file generation problem created by my
Spark Streaming jobs.
Indeed, my Spark Streaming jobs receive a lot of small events (avg 10 KB),
and I have to store them in HDFS in order to process them with Pig jobs
on demand.
The problem is that I generate a lot of small files in HDFS (several
million), which can be problematic.
I looked into using HBase or archive files, but decided against them.
So, what about this solution:
- Spark Streaming generates several million small files in HDFS on the fly
- Each night I merge them into one big daily file
- I launch my Pig jobs on this big file?

Another question I have:
- Is it possible to append to the big (daily) file by adding my events on
the fly?

Tks a lot
Nicolas




Re: Spark SQL: Native Support for LATERAL VIEW EXPLODE

2015-09-27 Thread Michael Armbrust
No, you would just have to do another select to pull out the fields you are
interested in.
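
For example, building on the earlier snippet, something like this sketch
(assuming a sqlContext is in scope; product_id and price are the field names
from your example):

import org.apache.spark.sql.functions._

// Explode the array of purchase items, then select the nested fields.
val items = sqlContext.table("purchases")
  .select(explode(col("purchase_items")).as("item"))
val itemFields = items.select(col("item.product_id"), col("item.price"))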

On Sat, Sep 26, 2015 at 11:11 AM, Jerry Lam  wrote:

> Hi Michael,
>
> Thanks for the tip. With dataframes, is it possible to explode some
> selected fields in each purchase_items?
> Since purchase_items is an array of items and each item has a number of
> fields (for example product_id and price), is it possible to just explode
> these two fields directly using dataframes?
>
> Best Regards,
>
>
> Jerry
>
> On Fri, Sep 25, 2015 at 7:53 PM, Michael Armbrust 
> wrote:
>
>> The SQL parser without HiveContext is really simple, which is why I
>> generally recommend users use HiveContext.  However, you can do it with
>> dataframes:
>>
>> import org.apache.spark.sql.functions._
>> table("purchases").select(explode(df("purchase_items")).as("item"))
>>
>>
>>
>> On Fri, Sep 25, 2015 at 4:21 PM, Jerry Lam  wrote:
>>
>>> Hi sparkers,
>>>
>>> Does anyone know how to do LATERAL VIEW EXPLODE without HiveContext?
>>> I don't want to start up a metastore and derby just because I need
>>> LATERAL VIEW EXPLODE.
>>>
>>> I have been trying, but I always get an exception like this:
>>>
>>> Name: java.lang.RuntimeException
>>> Message: [1.68] failure: ``union'' expected but identifier view found
>>>
>>> with a query that looks like:
>>>
>>> "select items from purhcases lateral view explode(purchase_items) tbl as
>>> items"
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>>
>>
>


textFile() and includePackage() not found

2015-09-27 Thread Eugene Cao
I get the error "no methods for 'textFile'" when I run the second of the
following commands after SparkR is initialized:

sc <- sparkR.init(appName = "RwordCount")
lines <- textFile(sc, args[[1]])

But the following command works:
lines2 <- SparkR:::textFile(sc, "C:\\SelfStudy\\SPARK\\sentences2.txt") 

In addition, the official docs say "The includePackage command can be used
to indicate packages...", but
includePackage(sc, Matrix) fails with a "could not find" error. Why is that?

Thanks a lot in advance!

Eugene Cao
Xi'an Jiaotong University








RE: textFile() and includePackage() not found

2015-09-27 Thread Sun, Rui
Eugene,

The SparkR RDD API is private for now
(https://issues.apache.org/jira/browse/SPARK-7230).

You can use the SparkR::: prefix to access those private functions.

-Original Message-
From: Eugene Cao [mailto:eugene...@163.com] 
Sent: Monday, September 28, 2015 8:02 AM
To: user@spark.apache.org
Subject: textFile() and includePackage() not found

I get the error "no methods for 'textFile'" when I run the second of the
following commands after SparkR is initialized:

sc <- sparkR.init(appName = "RwordCount")
lines <- textFile(sc, args[[1]])

But the following command works:
lines2 <- SparkR:::textFile(sc, "C:\\SelfStudy\\SPARK\\sentences2.txt") 

In addition, the official docs say "The includePackage command can be used
to indicate packages...", but includePackage(sc, Matrix) fails with a
"could not find" error. Why is that?

Thanks a lot in advance!

Eugene Cao
Xi'an Jiaotong University










Re: How to properly set conf/spark-env.sh for spark to run on yarn

2015-09-27 Thread Zhiliang Zhu
Hi All,
Would some expert help me with the issue below...
I would appreciate your kind help very much!
Thank you!
Zhiliang

 
 


 On Sunday, September 27, 2015 7:40 PM, Zhiliang Zhu 
 wrote:
   

Hi Alexis, Gavin,

Thanks very much for your kind comments. My spark command is:

spark-submit --class com.zyyx.spark.example.LinearRegression --master
yarn-client LinearRegression.jar

Neither spark-shell nor spark-submit will run; both hang at the stage

15/09/27 19:18:06 INFO yarn.Client: Application report for
application_1440676456544_0727 (state: ACCEPTED)...

The deeper error log under /hdfs/yarn/logs/ shows:

15/09/27 19:10:37 INFO util.Utils: Successfully started service 'sparkYarnAM'
on port 53882.
15/09/27 19:10:37 INFO yarn.ApplicationMaster: Waiting for Spark driver to be
reachable.
15/09/27 19:10:37 ERROR yarn.ApplicationMaster: Failed to connect to driver at
127.0.0.1:39581, retrying ...
15/09/27 19:10:37 ERROR yarn.ApplicationMaster: Failed to connect to driver at
127.0.0.1:39581, retrying ...

On all the machine nodes I installed hadoop and spark with the same paths,
files, and configuration, and copied one of the hadoop & spark directories to
the remote gateway machine, so all nodes have the same paths, file names, and
configuration.

The page "Running Spark on YARN - Spark 1.5.0 Documentation" says: "Ensure
that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains
the (client side) configuration files for the Hadoop cluster. These configs
are used to write to HDFS and connect to the YARN ResourceManager."

I do not exactly understand the first sentence.
The hadoop version is 2.5.2 and the spark version is 1.4.1.
The spark-env.sh settings are:
export SCALA_HOME=/usr/lib/scala
export JAVA_HOME=/usr/java/jdk1.7.0_45
export R_HOME=/usr/lib/r
export HADOOP_HOME=/usr/lib/hadoop
export YARN_CONF_DIR=/usr/lib/hadoop/etc/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

export SPARK_MASTER_IP=master01
#export SPARK_LOCAL_IP=master02
export SPARK_LOCAL_IP=localhost
export SPARK_LOCAL_DIRS=/data/spark_local_dir

Would you help point out where I went wrong... I sincerely appreciate your
help.

Best Regards,
Zhiliang

On Saturday, September 26, 2015 2:27 PM, Gavin Yue  
wrote:
  

 

It is working; we are doing the same thing every day. But the remote server
needs to be able to talk to the ResourceManager.

If you are using spark-submit, you will also specify the hadoop conf
directory in your environment variables. Spark relies on that to locate the
cluster's resource manager.

I think this tutorial is pretty clear: 
http://spark.apache.org/docs/latest/running-on-yarn.html



On Fri, Sep 25, 2015 at 7:11 PM, Zhiliang Zhu  wrote:

Hi Yue,
Thanks very much for your kind reply.
I would like to submit the spark job remotely from another machine outside
the cluster, and the job will run on yarn, similar to how hadoop jobs are
already submitted. Could you confirm this works for spark as well...
Do you mean that I should print those variables on the linux command line?
Best Regards,
Zhiliang

 


 On Saturday, September 26, 2015 10:07 AM, Gavin Yue 
 wrote:
   

 Print out your env variables and check first 

Sent from my iPhone
On Sep 25, 2015, at 18:43, Zhiliang Zhu  wrote:


Hi All,
I would like to submit a spark job from another remote machine outside the
cluster. I also copied the hadoop/spark conf files onto the remote machine;
a hadoop job can then be submitted, but a spark job cannot.
In spark-env.sh, it may be because SPARK_LOCAL_IP is not properly set, or
for some other reason...
This issue is urgent for me; would some expert provide some help with this
problem...
I will sincerely appreciate your help.
Thank you!
Best Regards,
Zhiliang



 On Friday, September 25, 2015 7:53 PM, Zhiliang Zhu 
 wrote:
   

 Hi all,
The spark job will run on yarn. I either do not set SPARK_LOCAL_IP at all,
or set it as

export SPARK_LOCAL_IP=localhost   # or the specific node IP, in that node's
                                  # spark install directory

Submitting a spark job works well on the master node of the cluster;
however, it fails when going through a gateway machine remotely.
The gateway machine is already configured, and it works well for submitting
hadoop jobs. It is set as:
export SCALA_HOME=/usr/lib/scala
export JAVA_HOME=/usr/java/jdk1.7.0_45
export R_HOME=/usr/lib/r
export HADOOP_HOME=/usr/lib/hadoop
export YARN_CONF_DIR=/usr/lib/hadoop/etc/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

export SPARK_MASTER_IP=master01
#export SPARK_LOCAL_IP=master01  # if no SPARK_LOCAL_IP is set, SparkContext will not start
export SPARK_LOCAL_IP=localhost  # if localhost is set, SparkContext starts but fails later
export SPARK_LOCAL_DIRS=/data/spark_local_dir
...

The error messages:
15/09/25 

Re: HDFS small file generation problem

2015-09-27 Thread Deenar Toraskar
You could try a couple of things:

a) Use Kafka for stream processing: store the current incoming events and the
Spark Streaming job output in Kafka rather than on HDFS, and dual-write to
HDFS too (in a micro-batched mode, say every x minutes). Kafka is better
suited to processing lots of small events.
b) Coalesce the small files on HDFS into a big hourly or daily file (a sketch
of such a compaction job follows below). Use HDFS partitioning to ensure that
your Pig job reads the fewest partitions.
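
As a minimal sketch of option b) (the paths, partition layout, and target
file count here are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

// Nightly compaction: read one day's worth of small event files and rewrite
// them as a handful of large files that Pig can read efficiently.
object DailyCompaction {
  def main(args: Array[String]): Unit = {
    val day = args(0)  // e.g. "2015-09-27"
    val sc = new SparkContext(new SparkConf().setAppName("DailyCompaction"))
    sc.textFile(s"hdfs:///events/raw/$day/*")
      .coalesce(8)  // a few large files instead of millions of small ones
      .saveAsTextFile(s"hdfs:///events/daily/$day")
    sc.stop()
  }
}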

Deenar

On 27 September 2015 at 14:47, ayan guha  wrote:

> I would suggest not writing small files to HDFS. Rather, you can hold them
> in memory, maybe off heap, and then flush them to HDFS using another
> job, similar to https://github.com/ptgoetz/storm-hdfs (not sure if Spark
> already has something like it).
>
> On Sun, Sep 27, 2015 at 11:36 PM,  wrote:
>
>> Hello,
>> I'm still investigating the small file generation problem created by my
>> Spark Streaming jobs.
>> Indeed, my Spark Streaming jobs receive a lot of small events (avg 10 KB),
>> and I have to store them in HDFS in order to process them with Pig jobs
>> on demand.
>> The problem is that I generate a lot of small files in HDFS (several
>> million), which can be problematic.
>> I looked into using HBase or archive files, but decided against them.
>> So, what about this solution:
>> - Spark Streaming generates several million small files in HDFS on the fly
>> - Each night I merge them into one big daily file
>> - I launch my Pig jobs on this big file?
>>
>> Another question I have:
>> - Is it possible to append to the big (daily) file by adding my events on
>> the fly?
>>
>> Tks a lot
>> Nicolas
>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


FP-growth on stream data

2015-09-27 Thread masoom alam
Is it possible to run FP-growth on stream data in its current version, or is
there a workaround?

I mean, is it possible to reuse/augment the old tree with the new incoming
data and find the new set of frequent patterns?

Thanks