Spark application fails with numRecords error

2017-10-31 Thread Serkan TAS
Hi,

I first searched for the error on the Kafka side, but I now think it is related to Spark rather than Kafka.

Has anyone faced an exception that terminates the program with the error "numRecords must not be negative" while streaming?

Thanks in advance.

Regards.





Re: share datasets across multiple spark-streaming applications for lookup

2017-10-31 Thread Joseph Pride
Folks:

SnappyData.

I’m fairly new to working with it myself, but it looks pretty promising. It 
marries Spark with a co-located in-memory GemFire (or something gem-related) 
database. So you can access the data with SQL, JDBC, ODBC (if you wanna go 
Enterprise instead of open-source) or natively as mutable RDDs and DataFrames.

You can run it so the storage and Spark compute are co-located in the same JVM 
on each machine, so you get data locality instead of a bottleneck between load, 
save, and compute. The data is supposed to persist across applications and cluster restarts, and to remain available while multiple applications work on it at the same time.

I hope it works for what I’m doing and isn’t too buggy. But it looks pretty 
good.

—Joe Pride

> On Oct 31, 2017, at 11:14 AM, Gene Pang  wrote:
> 
> Hi,
> 
> Alluxio enables sharing dataframes across different applications. This blog 
> post talks about dataframes and Alluxio, and this Spark Summit presentation 
> has additional information.
> 
> Thanks,
> Gene


Re: share datasets across multiple spark-streaming applications for lookup

2017-10-31 Thread Gene Pang
Hi,

Alluxio enables sharing dataframes across different applications. This blog post
talks about dataframes and Alluxio, and this Spark Summit presentation has
additional information.

Thanks,
Gene
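
For illustration, a minimal sketch of this pattern in PySpark (not code from this thread), assuming an Alluxio master at alluxio-master:19998, the Alluxio client jar on the Spark classpath, and hypothetical paths:

# Producer application: persist the shared dataset in Alluxio-managed storage.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("producer-app").getOrCreate()

shared_df = spark.read.parquet("s3a://my-bucket/ds-1/")
shared_df.write.mode("overwrite").parquet("alluxio://alluxio-master:19998/shared/ds-1")

# Consumer application (separate SparkSession / job): read the same data back
# for lookups without re-loading it from the original source.
lookup_df = spark.read.parquet("alluxio://alluxio-master:19998/shared/ds-1")
lookup_df.createOrReplaceTempView("ds_1")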



Re: share datasets across multiple spark-streaming applications for lookup

2017-10-31 Thread Revin Chalil
Any info on the below will be really appreciated.

I read about Alluxio and Ignite. Has anybody used any of them? Do they work 
well with multiple Apps doing lookups simultaneously? Are there better options? 
Thank you.

From: roshan joe 
Date: Monday, October 30, 2017 at 7:53 PM
To: "user@spark.apache.org" 
Subject: share datasets across multiple spark-streaming applications for lookup

Hi,

What is the recommended way to share datasets across multiple spark-streaming 
applications, so that the incoming data can be looked up against this shared 
dataset?

The shared dataset is also incrementally refreshed and stored on S3. Below is 
the scenario.

Streaming App-1 consumes data from Source-1 and writes to DS-1 in S3.
Streaming App-2 consumes data from Source-2 and writes to DS-2 in S3.


Streaming App-3 consumes data from Source-3, needs to look up against DS-1 and 
DS-2, and writes to DS-3 in S3.
Streaming App-4 consumes data from Source-4, needs to look up against DS-1 and 
DS-2, and writes to DS-3 in S3.
Streaming App-n consumes data from Source-n, needs to look up against DS-1 and 
DS-2, and writes to DS-n in S3.

So DS-1 and DS-2 ideally should be shared for lookup across multiple streaming 
apps. Any input is appreciated. Thank you!
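
For concreteness, here is a minimal PySpark Structured Streaming sketch of the lookup scenario above (not code from the thread). It assumes a Kafka source for Source-3, hypothetical S3 paths, and a join column named key in the shared datasets:

# App-3: read the shared datasets from S3 and join the incoming stream against them.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-app-3").getOrCreate()

ds1 = spark.read.parquet("s3a://shared-bucket/ds-1/")   # written by App-1
ds2 = spark.read.parquet("s3a://shared-bucket/ds-2/")   # written by App-2

incoming = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "source-3")
            .load())

# Stream-static joins: each micro-batch is looked up against the static datasets.
enriched = (incoming
            .selectExpr("CAST(value AS STRING) AS key")
            .join(ds1, "key", "left")
            .join(ds2, "key", "left"))

query = (enriched.writeStream
         .format("parquet")
         .option("path", "s3a://shared-bucket/ds-3/")
         .option("checkpointLocation", "s3a://shared-bucket/checkpoints/app-3/")
         .start())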


Fwd: Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-10-31 Thread Aakash Basu
Hey all,

Any help with the below, please?

Thanks,
Aakash.


-- Forwarded message --
From: Aakash Basu 
Date: Tue, Oct 31, 2017 at 9:17 PM
Subject: Regarding column partitioning IDs and names as per hierarchical
level SparkSQL
To: user 


Hi all,

I have to generate a table with Spark-SQL with the following columns -


Level One Id: VARCHAR(20) NULL
Level One Name: VARCHAR(50) NOT NULL
Level Two Id: VARCHAR(20) NULL
Level Two Name: VARCHAR(50) NULL
Level Three Id: VARCHAR(20) NULL
Level Three Name: VARCHAR(50) NULL
Level Four Id: VARCHAR(20) NULL
Level Four Name: VARCHAR(50) NULL
Level Five Id: VARCHAR(20) NULL
Level Five Name: VARCHAR(50) NULL
Level Six Id: VARCHAR(20) NULL
Level Six Name: VARCHAR(50) NULL
Level Seven Id: VARCHAR(20) NULL
Level Seven Name: VARCHAR(50) NULL
Level Eight Id: VARCHAR(20) NULL
Level Eight Name: VARCHAR(50) NULL
Level Nine Id: VARCHAR(20) NULL
Level Nine Name: VARCHAR(50) NULL
Level Ten Id: VARCHAR(20) NULL
Level Ten Name: VARCHAR(50) NULL

My input source has these columns -


ID        Description        ParentID
10        Great-Grandfather
1010      Grandfather        10
101010    1. Father A        1010
101011    2. Father B        1010
101012    4. Father C        1010
101013    5. Father D        1010
101015    3. Father E        1010
101018    Father F           1010
101019    6. Father G        1010
101020    Father H           1010
101021    Father I           1010
101022    2A. Father J       1010
10101010  2. Father K        101010

Like the above, I have IDs up to 20 digits, which means I have 10 levels (two
digits per level; a row's ParentID is its ID with the last two digits dropped).

For any particular row, I want to populate its own ID and name at its level,
along with all of its parents up to the root, but I am unable to come up with
concrete logic for this.

I am currently using the query below to detect each row's level and populate the
respective columns, but it does not fill in the parents -

Present Logic ->

FinalJoin_DF = spark.sql(
    "select "
    + "case when length(a.id)/2 = '1'  then a.id   else ' ' end as level_one_id, "
    + "case when length(a.id)/2 = '1'  then a.desc else ' ' end as level_one_name, "
    + "case when length(a.id)/2 = '2'  then a.id   else ' ' end as level_two_id, "
    + "case when length(a.id)/2 = '2'  then a.desc else ' ' end as level_two_name, "
    + "case when length(a.id)/2 = '3'  then a.id   else ' ' end as level_three_id, "
    + "case when length(a.id)/2 = '3'  then a.desc else ' ' end as level_three_name, "
    + "case when length(a.id)/2 = '4'  then a.id   else ' ' end as level_four_id, "
    + "case when length(a.id)/2 = '4'  then a.desc else ' ' end as level_four_name, "
    + "case when length(a.id)/2 = '5'  then a.id   else ' ' end as level_five_id, "
    + "case when length(a.id)/2 = '5'  then a.desc else ' ' end as level_five_name, "
    + "case when length(a.id)/2 = '6'  then a.id   else ' ' end as level_six_id, "
    + "case when length(a.id)/2 = '6'  then a.desc else ' ' end as level_six_name, "
    + "case when length(a.id)/2 = '7'  then a.id   else ' ' end as level_seven_id, "
    + "case when length(a.id)/2 = '7'  then a.desc else ' ' end as level_seven_name, "
    + "case when length(a.id)/2 = '8'  then a.id   else ' ' end as level_eight_id, "
    + "case when length(a.id)/2 = '8'  then a.desc else ' ' end as level_eight_name, "
    + "case when length(a.id)/2 = '9'  then a.id   else ' ' end as level_nine_id, "
    + "case when length(a.id)/2 = '9'  then a.desc else ' ' end as level_nine_name, "
    + "case when length(a.id)/2 = '10' then a.id   else ' ' end as level_ten_id, "
    + "case when length(a.id)/2 = '10' then a.desc else ' ' end as level_ten_name "
    + "from CategoryTempTable a")


Can someone help me also populate all the parent levels in the respective level
ID and level name columns, please?


Thanks,
Aakash.




Re: Spark job's application tracking URL not accessible from docker container

2017-10-31 Thread Harsh
Hi 

I am facing the same issue while launching the application inside a docker
container.


Kind Regards
Harsh






Bucket vs repartition

2017-10-31 Thread אורן שמון
Hi all,
I have 2 Spark jobs: one is the pre-process and the second is the process.
The process job needs to run a calculation for each user in the data.
I want to avoid a shuffle (e.g. from groupBy), so I am considering saving the
pre-process result either bucketed by user in Parquet, or repartitioned by user,
and then saving the result.

Which is preferable, and why?
Thanks in advance,
Oren
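
A minimal sketch of the two options being weighed, assuming a pre-processed DataFrame pre_df and a user_id column; the bucket count, table name, and path below are illustrative, not from the thread:

# Option A: bucket the pre-processed output by user.
# Bucketing metadata lives in the metastore, so this requires saveAsTable.
(pre_df.write
    .bucketBy(200, "user_id")
    .sortBy("user_id")
    .format("parquet")
    .saveAsTable("preprocessed_bucketed"))

# Option B: repartition by user and write plain Parquet files.
(pre_df.repartition("user_id")
    .write.mode("overwrite")
    .parquet("s3a://my-bucket/preprocessed/"))

As a general note, the bucketing layout in Option A is recorded in the metastore, so a later job that reads the table can benefit from it for per-user work, whereas a plain repartition-and-write records nothing, so a downstream groupBy will generally still shuffle.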


Read parquet files as buckets

2017-10-31 Thread אורן שמון
Hi all,
I have Parquet files as the result of a job that saved them bucketed by userId.
How can I read the files in bucketed mode in another job? I tried to read them,
but the data did not come back bucketed (same user in the same partition).
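
A minimal sketch of the distinction, assuming the data was written with bucketBy(...).saveAsTable(...); the table and path names are illustrative:

# Reading the raw files by path: Spark sees plain Parquet, so the bucketing
# layout is not applied.
df_plain = spark.read.parquet("s3a://my-bucket/preprocessed/")

# Reading through the table written with bucketBy(...).saveAsTable(...): the
# bucketing metadata comes from the metastore, so later joins/aggregations on
# userId can avoid a full shuffle.
df_bucketed = spark.table("preprocessed_bucketed")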




Spark job's application tracking URL not accessible from docker container

2017-10-31 Thread Divya Narayan
We have streaming jobs and batch jobs running inside Docker containers, with the
Spark driver launched within the container.

Now when we open the ResourceManager UI at http://:8080 and try to access the
application tracking URL of any running job, the page times out with the error:

HTTP ERROR 500

Problem accessing /proxy/redirect/application_1509358290085_0011/jobs/.
Reason:

Connection to http://: refused


It works only when the container is launched on the node that is currently the
YARN RM master. On any other node it doesn't work.