question on spark streaming based on event time

2017-01-28 Thread kant kodali
Hi All,

I read through the documentation on Spark Streaming based on event time and
how Spark handles the lag between event time and processing time, but what
if that lag is very long? In other words, what should I do if I am receiving
yesterday's data (the timestamp on the message shows yesterday's date and
time, but the processing time is today)? And say I also have a dashboard I
want to update in real time (as in, whenever I get the data) that shows the
past 5 days' worth of data, and the dashboard just keeps updating.
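
For reference, a rough sketch of the event-time watermark approach from those
docs (Structured Streaming, Spark 2.1+); the streaming DataFrame `events` and
the `eventTime` column below are hypothetical names:

    import org.apache.spark.sql.functions._

    // `events` is a streaming DataFrame with an `eventTime` timestamp column.
    // The watermark tells Spark how late data may be before its state is dropped,
    // so a 5-day watermark lets yesterday's records still land in the right windows.
    val last5Days = events
      .withWatermark("eventTime", "5 days")
      .groupBy(window(col("eventTime"), "1 hour"))
      .count()

    // A dashboard can poll the sink; late rows simply revise the older windows,
    // subject to the output modes supported by your Spark version.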

Thanks,
kant


Re: DAG Visualization option is missing on Spark Web UI

2017-01-28 Thread Mark Hamstra
Try selecting a particular Job instead of looking at the summary page for
all Jobs.

On Sat, Jan 28, 2017 at 4:25 PM, Md. Rezaul Karim <
rezaul.ka...@insight-centre.org> wrote:

> Hi Jacek,
>
> I tried accessing Spark web UI on both Firefox and Google Chrome browsers
> with ad blocker enabled. I do see other options like User, Total Uptime,
> Scheduling Mode, Active Jobs, Completed Jobs and Event Timeline.
> However, I don't see an option for DAG visualization.
>
> Please note that I am experiencing the same issue with Spark 2.x (i.e.
> 2.0.0, 2.0.1, 2.0.2 and 2.1.0). Refer to the attached screenshot of the UI
> that I am seeing on my machine:
>
> [image: Inline images 1]
>
>
> Please suggest.
>
>
>
>
> Regards,
> _
> *Md. Rezaul Karim*, BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> 
>
> On 28 January 2017 at 18:51, Jacek Laskowski  wrote:
>
>> Hi,
>>
>> Wonder if you have any adblocker enabled in your browser? Is this the
>> only version giving you this behavior? All Spark jobs have no
>> visualization?
>>
>> Jacek
>>
>> On 28 Jan 2017 7:03 p.m., "Md. Rezaul Karim" <
>> rezaul.ka...@insight-centre.org> wrote:
>>
>> Hi All,
>>
>> I am running a Spark job on my local machine written in Scala with Spark
>> 2.1.0. However, I am not seeing any option of "*DAG Visualization*" at 
>> http://localhost:4040/jobs/
>>
>>
>> Suggestion, please.
>>
>>
>>
>>
>> Regards,
>> _
>> *Md. Rezaul Karim*, BSc, MSc
>> PhD Researcher, INSIGHT Centre for Data Analytics
>> National University of Ireland, Galway
>> IDA Business Park, Dangan, Galway, Ireland
>> Web: http://www.reza-analytics.eu/index.html
>> 
>>
>>
>>
>


Re: DAG Visualization option is missing on Spark Web UI

2017-01-28 Thread Md. Rezaul Karim
Hi Jacek,

I tried accessing Spark web UI on both Firefox and Google Chrome browsers
with ad blocker enabled. I do see other options like User, Total Uptime,
Scheduling Mode, Active Jobs, Completed Jobs and Event Timeline.
However, I don't see an option for DAG visualization.

Please note that I am experiencing the same issue with Spark 2.x (i.e.
2.0.0, 2.0.1, 2.0.2 and 2.1.0). Refer to the attached screenshot of the UI
that I am seeing on my machine:

[image: Inline images 1]


Please suggest.




Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


On 28 January 2017 at 18:51, Jacek Laskowski  wrote:

> Hi,
>
> Wonder if you have any adblocker enabled in your browser? Is this the only
> version giving you this behavior? All Spark jobs have no visualization?
>
> Jacek
>
> On 28 Jan 2017 7:03 p.m., "Md. Rezaul Karim" <rezaul.ka...@insight-centre.org> wrote:
>
> Hi All,
>
> I am running a Spark job on my local machine written in Scala with Spark
> 2.1.0. However, I am not seeing any option of "*DAG Visualization*" at 
> http://localhost:4040/jobs/
>
>
> Suggestion, please.
>
>
>
>
> Regards,
> _
> *Md. Rezaul Karim*, BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> 
>
>
>


Re: Dynamic resource allocation to Spark on Mesos

2017-01-28 Thread Michael Gummelt
We've talked about that, but it hasn't become a priority because we haven't
had a driving use case.  If anyone has a good argument for "variable"
resource allocation like this, please let me know.

On Sat, Jan 28, 2017 at 9:17 AM, Shuai Lin  wrote:

> An alternative behavior is to launch the job with the best resource offer
>> Mesos is able to give
>
>
> Michael has just made an excellent explanation about dynamic allocation
> support in mesos. But IIUC, what you want to achieve is something like
> (using RAM as an example) : "Launch each executor with at least 1GB RAM,
> but if mesos offers 2GB at some moment, then launch an executor with 2GB
> RAM".
>
> I wonder what's the benefit of that? To reduce the "resource fragmentation"?
>
> Anyway, that is not supported at this moment. In all the supported cluster
> managers of spark (mesos, yarn, standalone, and the up-to-coming spark on
> kubernetes), you have to specify the cores and memory of each executor.
>
> It may not be supported even in the future, because only Mesos has the
> concept of offers, due to its two-level scheduling model.
>
>
> On Sat, Jan 28, 2017 at 1:35 AM, Ji Yan  wrote:
>
>> Dear Spark Users,
>>
>> Currently is there a way to dynamically allocate resources to Spark on
>> Mesos? Within Spark we can specify the CPU cores, memory before running
>> job. The way I understand is that the Spark job will not run if the CPU/Mem
>> requirement is not met. This may lead to a decrease in overall utilization of
>> the cluster. An alternative behavior is to launch the job with the best
>> resource offer Mesos is able to give. Is this possible with the current
>> implementation?
>>
>> Thanks
>> Ji
>>
>> The information in this email is confidential and may be legally
>> privileged. It is intended solely for the addressee. Access to this email
>> by anyone else is unauthorized. If you are not the intended recipient, any
>> disclosure, copying, distribution or any action taken or omitted to be
>> taken in reliance on it, is prohibited and may be unlawful.
>>
>
>


-- 
Michael Gummelt
Software Engineer
Mesosphere


Complex types handling with spark SQL and parquet

2017-01-28 Thread Antoine HOM
Hello everybody,

I have been trying to use complex types (stored in parquet) with spark
SQL and ended up having an issue that I can't seem to be able to solve
cleanly.
I was hoping, through this mail, to get some insights from the
community, maybe I'm just missing something obvious in the way I'm
using spark :)

It seems that Spark only pushes down projections for columns at the root
level of the records.
This is a big I/O issue depending on how much you use complex types:
in my samples I ended up reading 100GB of data when using only a
single field out of a struct (it should most likely have read only
1GB).

I already saw this PR which sounds promising:
https://github.com/apache/spark/pull/16578

However, it seems that it won't be applicable if you have multiple
levels of array nesting; the main reason is that I can't seem to find how
to reference fields deeply nested in arrays in a Column expression.
I can do everything within lambdas, but I think the optimizer won't
drill into them to understand that I'm only accessing a few fields.

If I take the following (simplified) example:

{
  trip_legs: [{
    from: "LHR",
    to: "NYC",
    taxes: [{
      type: "gov",
      amount: 12,
      currency: "USD"
    }]
  }]
}

col(trip_legs.from) will return an Array of all the from fields for
each trip_leg object.
col(trip_legs.taxes.type) will throw an exception.
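
For what it's worth, a rough workaround sketch for addressing the doubly
nested field (Spark 2.1 syntax, reusing the field names from the example
above and assuming `df` holds these records). Note that this only helps with
referencing the fields, not with the projection pushdown itself:

    import org.apache.spark.sql.functions._

    // Explode each array level, then the struct fields can be referenced by name.
    val taxTypes = df
      .select(explode(col("trip_legs")).as("leg"))   // one row per trip leg
      .select(explode(col("leg.taxes")).as("tax"))   // one row per tax entry
      .select(col("tax.type"), col("tax.amount"))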

So my questions are:
  * Is there a way to reference these deeply nested fields without
having to specify an array index in a Column expression?
  * If not, is there an API to force the projection of a given set of
fields so that parquet only reads this specific set of columns?

In addition, regarding the handling of arrays of struct within spark sql:
  * Has there been any discussion of a way to "reshape" an array of
structs without using lambdas? (Like the $map/$filter/etc. operators
of MongoDB, for example.)
  * If not, and given that I'm willing to dedicate time to coding these
features, could someone familiar with the code base tell me how
disruptive this would be? And would this be a welcome change or
not? (Most likely more appropriate for the dev mailing list, though.)

Regards,
Antoine

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: DAG Visualization option is missing on Spark Web UI

2017-01-28 Thread Jacek Laskowski
Hi,

Wonder if you have any adblocker enabled in your browser? Is this the only
version giving you this behavior? All Spark jobs have no visualization?

Jacek

On 28 Jan 2017 7:03 p.m., "Md. Rezaul Karim" <
rezaul.ka...@insight-centre.org> wrote:

Hi All,

I am running a Spark job on my local machine written in Scala with Spark
2.1.0. However, I am not seeing any option of "*DAG Visualization*" at
http://localhost:4040/jobs/


Suggestion, please.




Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html



Re: kafka structured streaming source refuses to read

2017-01-28 Thread Koert Kuipers
there was already an existing Spark ticket for this as well:
SPARK-18779

On Sat, Jan 28, 2017 at 1:13 PM, Koert Kuipers  wrote:

> it seems the bug is:
> https://issues.apache.org/jira/browse/KAFKA-4547
>
> i would advise everyone not to use kafka-clients 0.10.0.2, 0.10.1.0 or
> 0.10.1.1
>
> On Fri, Jan 27, 2017 at 3:56 PM, Koert Kuipers  wrote:
>
>> in case anyone else runs into this:
>>
>> the issue is that i was using kafka-clients 0.10.1.1
>>
>> it works when i use kafka-clients 0.10.0.1 with spark structured streaming
>>
>> my kafka server is 0.10.1.1
>>
>> On Fri, Jan 27, 2017 at 1:24 PM, Koert Kuipers  wrote:
>>
>>> i checked my topic. it has 5 partitions but all the data is written to a
>>> single partition: wikipedia-2
>>> i turned on debug logging and i see this:
>>>
>>> 2017-01-27 13:02:50 DEBUG kafka010.KafkaSource: Partitions assigned to
>>> consumer: [wikipedia-0, wikipedia-4, wikipedia-3, wikipedia-2,
>>> wikipedia-1]. Seeking to the end.
>>> 2017-01-27 13:02:50 DEBUG consumer.KafkaConsumer: Seeking to end of
>>> partition wikipedia-0
>>> 2017-01-27 13:02:50 DEBUG consumer.KafkaConsumer: Seeking to end of
>>> partition wikipedia-4
>>> 2017-01-27 13:02:50 DEBUG consumer.KafkaConsumer: Seeking to end of
>>> partition wikipedia-3
>>> 2017-01-27 13:02:50 DEBUG consumer.KafkaConsumer: Seeking to end of
>>> partition wikipedia-2
>>> 2017-01-27 13:02:50 DEBUG consumer.KafkaConsumer: Seeking to end of
>>> partition wikipedia-1
>>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
>>> partition wikipedia-0 to latest offset.
>>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
>>> offset=0} for partition wikipedia-0
>>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
>>> partition wikipedia-0 to earliest offset.
>>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
>>> offset=0} for partition wikipedia-0
>>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
>>> partition wikipedia-4 to latest offset.
>>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
>>> offset=0} for partition wikipedia-4
>>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
>>> partition wikipedia-4 to earliest offset.
>>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
>>> offset=0} for partition wikipedia-4
>>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
>>> partition wikipedia-3 to latest offset.
>>> 2017-01-27 13:02:50 DEBUG internals.AbstractCoordinator: Received
>>> successful heartbeat response for group spark-kafka-source-fac4f749-fd
>>> 56-4a32-82c7-e687aadf520b-1923704552-driver-0
>>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
>>> offset=0} for partition wikipedia-3
>>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
>>> partition wikipedia-3 to earliest offset.
>>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
>>> offset=0} for partition wikipedia-3
>>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
>>> partition wikipedia-2 to latest offset.
>>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
>>> offset=152908} for partition wikipedia-2
>>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
>>> partition wikipedia-2 to earliest offset.
>>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
>>> offset=0} for partition wikipedia-2
>>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
>>> partition wikipedia-1 to latest offset.
>>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
>>> offset=0} for partition wikipedia-1
>>> 2017-01-27 13:02:50 DEBUG kafka010.KafkaSource: Got latest offsets for
>>> partition : Map(wikipedia-1 -> 0, wikipedia-4 -> 0, wikipedia-2 -> 0,
>>> wikipedia-3 -> 0, wikipedia-0 -> 0)
>>>
>>> what is confusing to me is this:
>>> Resetting offset for partition wikipedia-2 to latest offset.
>>> Fetched {timestamp=-1, offset=152908} for partition wikipedia-2
>>> Got latest offsets for partition : Map(wikipedia-1 -> 0, wikipedia-4 ->
>>> 0, wikipedia-2 -> 0, wikipedia-3 -> 0, wikipedia-0 -> 0)
>>>
>>> why does it find latest offset 152908 for wikipedia-2 but then sets
>>> latest offset to 0 for that partition? or am i misunderstanding?
>>>
>>> On Fri, Jan 27, 2017 at 1:19 PM, Koert Kuipers 
>>> wrote:
>>>
 code:
   val query = spark.readStream
 .format("kafka")
 .option("kafka.bootstrap.servers", "somenode:9092")
 .option("subscribe", "wikipedia")
 .load
 .select(col("value") cast StringType)
 .writeStream
 .format("console")
 .outputMode(OutputMode.Append)
 .start()

   while (true) {
 Thread.sleep(1)
  

Re: kafka structured streaming source refuses to read

2017-01-28 Thread Koert Kuipers
it seems the bug is:
https://issues.apache.org/jira/browse/KAFKA-4547

i would advise everyone not to use kafka-clients 0.10.0.2, 0.10.1.0 or
0.10.1.1
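
if the bad client version is coming in transitively, a minimal build.sbt
sketch for pinning it to the 0.10.0.1 client that works (per the quoted
message below); assumes an sbt build:

    // build.sbt -- force a kafka-clients version not affected by KAFKA-4547
    dependencyOverrides += "org.apache.kafka" % "kafka-clients" % "0.10.0.1"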

On Fri, Jan 27, 2017 at 3:56 PM, Koert Kuipers  wrote:

> in case anyone else runs into this:
>
> the issue is that i was using kafka-clients 0.10.1.1
>
> it works when i use kafka-clients 0.10.0.1 with spark structured streaming
>
> my kafka server is 0.10.1.1
>
> On Fri, Jan 27, 2017 at 1:24 PM, Koert Kuipers  wrote:
>
>> i checked my topic. it has 5 partitions but all the data is written to a
>> single partition: wikipedia-2
>> i turned on debug logging and i see this:
>>
>> 2017-01-27 13:02:50 DEBUG kafka010.KafkaSource: Partitions assigned to
>> consumer: [wikipedia-0, wikipedia-4, wikipedia-3, wikipedia-2,
>> wikipedia-1]. Seeking to the end.
>> 2017-01-27 13:02:50 DEBUG consumer.KafkaConsumer: Seeking to end of
>> partition wikipedia-0
>> 2017-01-27 13:02:50 DEBUG consumer.KafkaConsumer: Seeking to end of
>> partition wikipedia-4
>> 2017-01-27 13:02:50 DEBUG consumer.KafkaConsumer: Seeking to end of
>> partition wikipedia-3
>> 2017-01-27 13:02:50 DEBUG consumer.KafkaConsumer: Seeking to end of
>> partition wikipedia-2
>> 2017-01-27 13:02:50 DEBUG consumer.KafkaConsumer: Seeking to end of
>> partition wikipedia-1
>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
>> partition wikipedia-0 to latest offset.
>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
>> offset=0} for partition wikipedia-0
>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
>> partition wikipedia-0 to earliest offset.
>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
>> offset=0} for partition wikipedia-0
>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
>> partition wikipedia-4 to latest offset.
>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
>> offset=0} for partition wikipedia-4
>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
>> partition wikipedia-4 to earliest offset.
>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
>> offset=0} for partition wikipedia-4
>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
>> partition wikipedia-3 to latest offset.
>> 2017-01-27 13:02:50 DEBUG internals.AbstractCoordinator: Received
>> successful heartbeat response for group spark-kafka-source-fac4f749-fd
>> 56-4a32-82c7-e687aadf520b-1923704552-driver-0
>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
>> offset=0} for partition wikipedia-3
>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
>> partition wikipedia-3 to earliest offset.
>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
>> offset=0} for partition wikipedia-3
>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
>> partition wikipedia-2 to latest offset.
>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
>> offset=152908} for partition wikipedia-2
>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
>> partition wikipedia-2 to earliest offset.
>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
>> offset=0} for partition wikipedia-2
>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Resetting offset for
>> partition wikipedia-1 to latest offset.
>> 2017-01-27 13:02:50 DEBUG internals.Fetcher: Fetched {timestamp=-1,
>> offset=0} for partition wikipedia-1
>> 2017-01-27 13:02:50 DEBUG kafka010.KafkaSource: Got latest offsets for
>> partition : Map(wikipedia-1 -> 0, wikipedia-4 -> 0, wikipedia-2 -> 0,
>> wikipedia-3 -> 0, wikipedia-0 -> 0)
>>
>> what is confusing to me is this:
>> Resetting offset for partition wikipedia-2 to latest offset.
>> Fetched {timestamp=-1, offset=152908} for partition wikipedia-2
>> Got latest offsets for partition : Map(wikipedia-1 -> 0, wikipedia-4 ->
>> 0, wikipedia-2 -> 0, wikipedia-3 -> 0, wikipedia-0 -> 0)
>>
>> why does it find latest offset 152908 for wikipedia-2 but then sets
>> latest offset to 0 for that partition? or am i misunderstanding?
>>
>> On Fri, Jan 27, 2017 at 1:19 PM, Koert Kuipers  wrote:
>>
>>> code:
>>>   val query = spark.readStream
>>> .format("kafka")
>>> .option("kafka.bootstrap.servers", "somenode:9092")
>>> .option("subscribe", "wikipedia")
>>> .load
>>> .select(col("value") cast StringType)
>>> .writeStream
>>> .format("console")
>>> .outputMode(OutputMode.Append)
>>> .start()
>>>
>>>   while (true) {
>>> Thread.sleep(1)
>>> println(query.lastProgress)
>>>   }
>>> }
>>>
>>> On Fri, Jan 27, 2017 at 5:34 AM, Alonso Isidoro Roman <
>>> alons...@gmail.com> wrote:
>>>
 lets see the code...

 Alonso Isidoro Roman
 https://about.me/alonso.isidoro.roman

 

DAG Visualization option is missing on Spark Web UI

2017-01-28 Thread Md. Rezaul Karim
Hi All,

I am running a Spark job on my local machine written in Scala with Spark
2.1.0. However, I am not seeing any option of "*DAG Visualization*" at
http://localhost:4040/jobs/


Suggestion, please.




Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html



Re: Cached table details

2017-01-28 Thread Shuai Lin
+1 for Jacek's suggestion

FWIW: another possible *hacky* way is to write a package
in org.apache.spark.sql namespace so it can access the
sparkSession.sharedState.cacheManager. Then use scala reflection to read
the cache manager's `cachedData` field, which can provide the list of
cached relations.

https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L47

But this makes use of Spark internals, so it would be subject to any changes in them.
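
A rough sketch of that hack (against the Spark 2.1 internals linked above; the
helper object name is made up, and the `cachedData` field's exact type can
change between releases):

    package org.apache.spark.sql

    // Placed in org.apache.spark.sql so that private[sql] members are reachable.
    object CachedTables {
      def cachedPlans(spark: SparkSession): Seq[String] = {
        val cacheManager = spark.sharedState.cacheManager
        // Read the private `cachedData` field reflectively (plain Java reflection here).
        val field = cacheManager.getClass.getDeclaredField("cachedData")
        field.setAccessible(true)
        field.get(cacheManager) match {
          case entries: Seq[_] => entries.map(_.toString)  // each entry describes a cached relation
          case other           => Seq(String.valueOf(other))
        }
      }
    }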

On Fri, Jan 27, 2017 at 7:00 AM, Jacek Laskowski  wrote:

> Hi,
>
> I think that the only way to get the information about a cached RDD is to
> use SparkListener and intercept respective events about cached blocks on
> BlockManagers.
>
> Jacek
>
> On 25 Jan 2017 5:54 a.m., "kumar r"  wrote:
>
> Hi,
>
> I have cached some table in Spark Thrift Server. I want to get all cached
> table information. I can see it in 4040 web ui port.
>
> Is there any command or other way to get the cached table details
> programmatically?
>
> Thanks,
> Kumar
>
>
>
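
A rough sketch of the SparkListener route Jacek mentions above; it reports
block-level cache events at runtime rather than table names, and the listener
API shown is the Spark 2.x one:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockUpdated}

    class CachedBlockListener extends SparkListener {
      override def onBlockUpdated(event: SparkListenerBlockUpdated): Unit = {
        val info = event.blockUpdatedInfo
        // RDD blocks that carry a valid storage level are the cached ones.
        if (info.blockId.isRDD && info.storageLevel.isValid) {
          println(s"cached ${info.blockId} on ${info.blockManagerId.host} " +
            s"(mem=${info.memSize}, disk=${info.diskSize})")
        }
      }
    }

    // register on the SparkContext behind the Thrift Server / SparkSession
    spark.sparkContext.addSparkListener(new CachedBlockListener)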


Re: Dynamic resource allocation to Spark on Mesos

2017-01-28 Thread Shuai Lin
>
> An alternative behavior is to launch the job with the best resource offer
> Mesos is able to give


Michael has just made an excellent explanation about dynamic allocation
support in mesos. But IIUC, what you want to achieve is something like
(using RAM as an example) : "Launch each executor with at least 1GB RAM,
but if mesos offers 2GB at some moment, then launch an executor with 2GB
RAM".

I wonder what's the benefit of that? To reduce the "resource fragmentation"?

Anyway, that is not supported at this moment. In all the supported cluster
managers of spark (mesos, yarn, standalone, and the up-to-coming spark on
kubernetes), you have to specify the cores and memory of each executor.

It may not be supported even in the future, because only Mesos has the
concept of offers, due to its two-level scheduling model.
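
For completeness, a sketch of what "fixed" executor sizing looks like today
(the values are made up); dynamic allocation only varies the executor count,
never the per-executor size:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "2g")   // every executor asks Mesos for exactly this much RAM
      .set("spark.executor.cores", "2")     // and exactly this many cores
      .set("spark.cores.max", "16")         // cap on total cores acquired across the cluster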


On Sat, Jan 28, 2017 at 1:35 AM, Ji Yan  wrote:

> Dear Spark Users,
>
> Currently is there a way to dynamically allocate resources to Spark on
> Mesos? Within Spark we can specify the CPU cores, memory before running
> job. The way I understand is that the Spark job will not run if the CPU/Mem
> requirement is not met. This may lead to a decrease in overall utilization of
> the cluster. An alternative behavior is to launch the job with the best
> resource offer Mesos is able to give. Is this possible with the current
> implementation?
>
> Thanks
> Ji
>
> The information in this email is confidential and may be legally
> privileged. It is intended solely for the addressee. Access to this email
> by anyone else is unauthorized. If you are not the intended recipient, any
> disclosure, copying, distribution or any action taken or omitted to be
> taken in reliance on it, is prohibited and may be unlawful.
>


Re: mapWithState question

2017-01-28 Thread shyla deshpande
Thats a great idea. I will try that. Thanks.

On Sat, Jan 28, 2017 at 2:35 AM, Tathagata Das 
wrote:

> 1 state object for each user.
> union both streams into a single DStream, and apply mapWithState on it to
> update the user state.
>
> On Sat, Jan 28, 2017 at 12:30 AM, shyla deshpande <
> deshpandesh...@gmail.com> wrote:
>
>> Can multiple DStreams manipulate a state? I have a stream that gives me
>> total minutes the user spent on a course material. I have another stream
>> that gives me chapters completed and lessons completed by the user. I
>> want to keep track for each user total_minutes, chapters_completed and
>> lessons_completed. I am not sure if I should have 1 state or 2 states. Can
>> I look up the state for a given key, just like a map, outside the map function?
>>
>> Appreciate your input. Thanks
>>
>
>


Re: spark architecture question -- Pleas Read

2017-01-28 Thread Sachin Naik
I strongly agree with Jorn and Russell. There are different solutions for data
movement depending upon your needs: frequency, bi-directional drivers, workflow,
handling duplicate records. This space is known as "Change Data Capture", or
CDC for short. If you need more information, I would be happy to chat with
you. I built some products in this space that extensively used connection
pooling over ODBC/JDBC.

Happy to chat if you need more information. 

-Sachin Naik

>>Hard to tell. Can you give more insights on what you try to achieve and
>>what the data is about?
>>For example, depending on your use case sqoop can make sense or not.
Sent from my iPhone

> On Jan 27, 2017, at 11:22 PM, Russell Spitzer  
> wrote:
> 
> You can treat Oracle as a JDBC source 
> (http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)
>  and skip Sqoop, HiveTables and go straight to Queries. Then you can skip 
> hive on the way back out (see the same link) and write directly to Oracle. 
> I'll leave the performance questions for someone else. 
> 
>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu  wrote:
>> 
>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu  wrote:
>> Hi Team,
>> 
>> RIght now our existing flow is
>> 
>> Oracle-->Sqoop --> Hive--> Hive Queries on Spark-sql (Hive 
>> Context)-->Destination Hive table -->sqoop export to Oracle
>> 
>> Half of the required Hive UDFs are developed as Java UDFs.
>> 
>> So now I want to know: if I run native Scala UDFs rather than running Hive
>> Java UDFs in spark-sql, will there be any performance difference?
>> 
>> 
>> Can we skip the Sqoop Import and export part and 
>> 
>> Instead directly load data from oracle to spark and code Scala UDF's for 
>> transformations and export output data back to oracle?
>> 
>> RIght now the architecture we are using is
>> 
>> oracle-->Sqoop (Import)-->Hive Tables--> Hive Queries --> Spark-SQL--> Hive 
>> --> Oracle 
>> what would be the optimal architecture to process data from Oracle using
>> Spark? Can I improve this process in any way?
>> 
>> 
>> 
>> 
>> Regards,
>> Sirisha 
>> 
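
A minimal sketch of the JDBC route Russell describes above, reading from
Oracle and writing back without Sqoop. The URL, credentials and table names
are placeholders, and it assumes the Oracle JDBC driver jar is on the
classpath alongside an existing SparkSession `spark`:

    val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
    val props = new java.util.Properties()
    props.setProperty("user", "app_user")
    props.setProperty("password", "***")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    val src = spark.read.jdbc(url, "SOURCE_TABLE", props)

    // apply the Scala UDFs / transformations here instead of the Hive queries
    val transformed = src

    transformed.write.mode("append").jdbc(url, "TARGET_TABLE", props)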


Re: issue with running Spark streaming with spark-shell

2017-01-28 Thread Chetan Khatri
If you are using any other package, pass it as an argument with --packages.
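
For the Kafka 0.8 direct-stream API used below, that would look something like
the following (adjust the Spark/Scala versions to match your build; the package
also pulls in the matching kafka client jars):

    spark-shell --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0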


On Sat, Jan 28, 2017 at 8:14 PM, Jacek Laskowski  wrote:

> Hi,
>
> How did you start spark-shell?
>
> Jacek
>
> On 28 Jan 2017 11:20 a.m., "Mich Talebzadeh" 
> wrote:
>
>>
>> Hi,
>>
>> My spark-streaming application works fine when compiled with Maven into an
>> uber jar file.
>>
>> With spark-shell this program throws an error as follows:
>>
>> scala> val dstream = KafkaUtils.createDirectStream[String, String,
>> StringDecoder, StringDecoder](streamingContext, kafkaParams, topics)
>> java.lang.NoClassDefFoundError: Could not initialize class
>> kafka.consumer.FetchRequestAndResponseStatsRegistry$
>>
>> FYI  KafkaUtils itself exists
>>
>> scala>  KafkaUtils
>> res22: org.apache.spark.streaming.kafka.KafkaUtils.type =
>> org.apache.spark.streaming.kafka.KafkaUtils$@3ae931b6
>>
>>
>> I guess I am missing a jar file or something?
>>
>> thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


Re: issue with running Spark streaming with spark-shell

2017-01-28 Thread Jacek Laskowski
Hi,

How did you start spark-shell?

Jacek

On 28 Jan 2017 11:20 a.m., "Mich Talebzadeh" 
wrote:

>
> Hi,
>
> My spark-streaming application works fine when compiled with Maven into an
> uber jar file.
>
> With spark-shell this program throws an error as follows:
>
> scala> val dstream = KafkaUtils.createDirectStream[String, String,
> StringDecoder, StringDecoder](streamingContext, kafkaParams, topics)
> java.lang.NoClassDefFoundError: Could not initialize class kafka.consumer.
> FetchRequestAndResponseStatsRegistry$
>
> FYI  KafkaUtils itself exists
>
> scala>  KafkaUtils
> res22: org.apache.spark.streaming.kafka.KafkaUtils.type =
> org.apache.spark.streaming.kafka.KafkaUtils$@3ae931b6
>
>
> I guess I am missing a jar file or something?
>
> thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


[ANNOUNCE] Apache Bahir 2.0.2

2017-01-28 Thread Christian Kadner
The Apache Bahir PMC approved the release of Apache Bahir 2.0.2
which provides the following extensions for Apache Spark 2.0.2:

   - Akka Streaming
   - MQTT Streaming
   - MQTT Structured Streaming
   - Twitter Streaming
   - ZeroMQ Streaming

For more information about Apache Bahir and to download the
latest release go to:

http://bahir.apache.org

The Apache Bahir streaming connectors are also available at:

https://spark-packages.org/?q=bahir

---
Best regards,
Christian Kadner

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: spark 2.02 error when writing to s3

2017-01-28 Thread Steve Loughran

On 27 Jan 2017, at 23:17, VND Tremblay, Paul wrote:

Not sure what you mean by "a consistency layer on top." Any explanation would 
be greatly appreciated!

Paul



netflix's s3mper: https://github.com/Netflix/s3mper

EMR consistency: 
http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html

AWS S3: s3guard (Wip) : https://issues.apache.org/jira/browse/HADOOP-13345

All of these do the same thing: use amazon DynamoDB for storing all the 
metadata, guaranteeing that every client gets a consistent view of deletes, 
adds, and that the listings returned match the state of the system. Otherwise, list
commands tend to lag changes, meaning deleted files are still mistakenly
considered as being there, and lists of paths can miss newly created files.
That means the commit-by-rename protocol used in Hadoop FileOutputFormat may
miss files to rename, and so lose results.

S3guard will guarantee that listing is consistent, and will be a precursor to 
the 0-rename committer I'm working on, which needs that consistent list to find 
the .pending files listing outstanding operations to commit.




Re: mapWithState question

2017-01-28 Thread Tathagata Das
1 state object for each user.
union both streams into a single DStream, and apply mapWithState on it to
update the user state.
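
A rough sketch of that approach (Spark Streaming mapWithState; the case
classes and field names are made up, and mapWithState requires checkpointing
to be enabled on the StreamingContext):

    import org.apache.spark.streaming.{State, StateSpec}
    import org.apache.spark.streaming.dstream.DStream

    // a per-user delta coming from either stream (zero for the fields it doesn't carry)
    case class Activity(minutes: Long, chapters: Long, lessons: Long)
    case class UserTotals(totalMinutes: Long, chaptersCompleted: Long, lessonsCompleted: Long)

    def trackUsers(minutesStream: DStream[(String, Activity)],
                   progressStream: DStream[(String, Activity)]): DStream[UserTotals] = {

      def update(userId: String, delta: Option[Activity], state: State[UserTotals]): UserTotals = {
        val prev = state.getOption.getOrElse(UserTotals(0L, 0L, 0L))
        val d = delta.getOrElse(Activity(0L, 0L, 0L))
        val next = UserTotals(prev.totalMinutes + d.minutes,
                              prev.chaptersCompleted + d.chapters,
                              prev.lessonsCompleted + d.lessons)
        state.update(next)   // one state object per user key
        next
      }

      minutesStream.union(progressStream).mapWithState(StateSpec.function(update _))
    }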

On Sat, Jan 28, 2017 at 12:30 AM, shyla deshpande 
wrote:

> Can multiple DStreams manipulate a state? I have a stream that gives me
> total minutes the user spent on a course material. I have another stream
> that gives me chapters completed and lessons completed by the user. I
> want to keep track for each user total_minutes, chapters_completed and
> lessons_completed. I am not sure if I should have 1 state or 2 states. Can
> I look up the state for a given key, just like a map, outside the map function?
>
> Appreciate your input. Thanks
>


issue with running Spark streaming with spark-shell

2017-01-28 Thread Mich Talebzadeh
Hi,

My spark-streaming application works fine when compiled with Maven into an
uber jar file.

With spark-shell this program throws an error as follows:

scala> val dstream = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](streamingContext, kafkaParams, topics)
java.lang.NoClassDefFoundError: Could not initialize class
kafka.consumer.FetchRequestAndResponseStatsRegistry$

FYI  KafkaUtils itself exists

scala>  KafkaUtils
res22: org.apache.spark.streaming.kafka.KafkaUtils.type =
org.apache.spark.streaming.kafka.KafkaUtils$@3ae931b6


I guess I am missing a jar file or something?

thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


mapWithState question

2017-01-28 Thread shyla deshpande
Can multiple DStreams manipulate a state? I have a stream that gives me
total minutes the user spent on a course material. I have another stream
that gives me chapters completed and lessons completed by the user. I want
to keep track for each user total_minutes, chapters_completed and
lessons_completed. I am not sure if I should have 1 state or 2 states. Can
I look up the state for a given key, just like a map, outside the map function?

Appreciate your input. Thanks


Re: spark architecture question -- Pleas Read

2017-01-28 Thread Jörn Franke
Hard to tell. Can you give more insights on what you try to achieve and what 
the data is about?
For example, depending on your use case sqoop can make sense or not.

> On 28 Jan 2017, at 02:14, Sirisha Cheruvu  wrote:
> 
> Hi Team,
> 
> RIght now our existing flow is
> 
> Oracle-->Sqoop --> Hive--> Hive Queries on Spark-sql (Hive 
> Context)-->Destination Hive table -->sqoop export to Oracle
> 
> Half of the required Hive UDFs are developed as Java UDFs.
> 
> So now I want to know: if I run native Scala UDFs rather than running Hive Java
> UDFs in spark-sql, will there be any performance difference?
> 
> 
> Can we skip the Sqoop Import and export part and 
> 
> Instead directly load data from oracle to spark and code Scala UDF's for 
> transformations and export output data back to oracle?
> 
> RIght now the architecture we are using is
> 
> oracle-->Sqoop (Import)-->Hive Tables--> Hive Queries --> Spark-SQL--> Hive 
> --> Oracle 
> what would be the optimal architecture to process data from Oracle using Spark?
> Can I improve this process in any way?
> 
> 
> 
> 
> Regards,
> Sirisha 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org