Re: parsing embedded json in spark

2016-12-23 Thread Tal Grynbaum
Hi Shaw,

Thanks, that works!
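
For reference, a minimal sketch of the get_json_object approach suggested
below; the column and JSON field names are assumptions, not taken from the
thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.get_json_object

val spark = SparkSession.builder().appName("json-extract").getOrCreate()
import spark.implicits._

// Hypothetical data: one column holds an embedded JSON string.
val df = Seq(
  (1, """{"name":"alice","score":42}"""),
  (2, """{"name":"bob","score":17}""")
).toDF("id", "payload")

// get_json_object is evaluated on the executors like any other column
// expression, so the parsing work is distributed across the cluster.
val parsed = df
  .withColumn("name", get_json_object($"payload", "$.name"))
  .withColumn("score", get_json_object($"payload", "$.score").cast("int"))

parsed.show()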



On Thu, Dec 22, 2016 at 6:45 PM, Shaw Liu <ipf...@gmail.com> wrote:

> Hi, I guess you can use the 'get_json_object' function.
>
>
>
>
>
> On Thu, Dec 22, 2016 at 9:52 PM +0800, "Irving Duran" <
> irving.du...@gmail.com> wrote:
>
> Is it an option to parse that field prior to creating the dataframe? If
>> so, that's what I would do.
>>
>> As for only your master node doing the work, you'll have to share more
>> about your setup: are you running Spark standalone, YARN, or Mesos?
>>
>>
>> Thank You,
>>
>> Irving Duran
>>
>> On Thu, Dec 22, 2016 at 1:42 AM, Tal Grynbaum <tal.grynb...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I have a dataframe that contains an embedded JSON string in one of its
>>> fields.
>>> I tried to write a UDF that parses it using lift-json, but it seems to
>>> take a very long time to process, and it looks like only the master node
>>> is doing any work.
>>>
>>> Has anyone dealt with such a scenario before and can give me some hints?
>>>
>>> Thanks
>>> Tal
>>>
>>
>>


-- 
*Tal Grynbaum* / *CTO & co-founder*

m# +972-54-7875797

mobile retention done right


parsing embedded json in spark

2016-12-21 Thread Tal Grynbaum
Hi,

I have a dataframe that contains an embedded JSON string in one of its
fields.
I tried to write a UDF that parses it using lift-json, but it seems to take
a very long time to process, and it looks like only the master node is
doing any work.

Has anyone dealt with such a scenario before and can give me some hints?

Thanks
Tal


Re: difference between package and jar Option in Spark

2016-09-04 Thread Tal Grynbaum
You need to download all the dependencies of that jar as well.

On Mon, Sep 5, 2016, 06:59 Divya Gehlot  wrote:

> Hi,
> I am using spark-csv to parse my input files.
> If I use the --packages option it works fine, but if I download
> the jar and use the --jars option,
> it throws a ClassNotFoundException.
>
>
> Thanks,
> Divya
>
> On 1 September 2016 at 17:26, Sean Owen  wrote:
>
>> --jars includes a local JAR file in the application's classpath.
>> --packages references the Maven coordinates of a dependency, retrieves
>> that artifact along with its transitive dependencies, and includes all
>> of those JAR files in the app classpath.
>>
>> On Thu, Sep 1, 2016 at 10:24 AM, Divya Gehlot 
>> wrote:
>> > Hi,
>> >
>> > I would like to know the difference between the --packages and --jars
>> > options in Spark.
>> >
>> >
>> >
>> > Thanks,
>> > Divya
>>
>
>


Re: any idea what this error could be?

2016-09-03 Thread Tal Grynbaum
My guess is that you're running out of memory somewhere.  Try to increase
the driver memory and/or executor memory.

On Sat, Sep 3, 2016, 11:42 kant kodali  wrote:

> I am running this on aws.
>
>
>
> On Fri, Sep 2, 2016 11:49 PM, kant kodali kanth...@gmail.com wrote:
>
>> I am running Spark in standalone mode. I get this error when I run my
>> driver program. I am using Spark 2.0.0. Any idea what this error could be?
>>
>>
>>
>>   Using Spark's default log4j profile: 
>> org/apache/spark/log4j-defaults.properties
>> 16/09/02 23:44:44 INFO SparkContext: Running Spark version 2.0.0
>> 16/09/02 23:44:44 WARN NativeCodeLoader: Unable to load native-hadoop 
>> library for your platform... using builtin-java classes where applicable
>> 16/09/02 23:44:45 INFO SecurityManager: Changing view acls to: kantkodali
>> 16/09/02 23:44:45 INFO SecurityManager: Changing modify acls to: kantkodali
>> 16/09/02 23:44:45 INFO SecurityManager: Changing view acls groups to:
>> 16/09/02 23:44:45 INFO SecurityManager: Changing modify acls groups to:
>> 16/09/02 23:44:45 INFO SecurityManager: SecurityManager: authentication 
>> disabled; ui acls disabled; users  with view permissions: Set(kantkodali); 
>> groups with view permissions: Set(); users  with modify permissions: 
>> Set(kantkodali); groups with modify permissions: Set()
>> 16/09/02 23:44:45 INFO Utils: Successfully started service 'sparkDriver' on 
>> port 62256.
>> 16/09/02 23:44:45 INFO SparkEnv: Registering MapOutputTracker
>> 16/09/02 23:44:45 INFO SparkEnv: Registering BlockManagerMaster
>> 16/09/02 23:44:45 INFO DiskBlockManager: Created local directory at 
>> /private/var/folders/_6/lfxt933j3bd_xhq0m7dwm8s0gn/T/blockmgr-b56eea49-0102-4570-865a-1d3d230f0ffc
>> 16/09/02 23:44:45 INFO MemoryStore: MemoryStore started with capacity 2004.6 
>> MB
>> 16/09/02 23:44:45 INFO SparkEnv: Registering OutputCommitCoordinator
>> 16/09/02 23:44:45 INFO Utils: Successfully started service 'SparkUI' on port 
>> 4040.
>> 16/09/02 23:44:45 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
>> http://192.168.0.191:4040
>> 16/09/02 23:44:45 INFO StandaloneAppClient$ClientEndpoint: Connecting to 
>> master spark://52.43.37.223:7077...
>> 16/09/02 23:44:46 INFO TransportClientFactory: Successfully created 
>> connection to /52.43.37.223:7077 after 70 ms (0 ms spent in bootstraps)
>> 16/09/02 23:44:46 WARN StandaloneAppClient$ClientEndpoint: Failed to connect 
>> to master 52.43.37.223:7077
>> org.apache.spark.SparkException: Exception thrown in awaitResult
>> at 
>> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
>> at 
>> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
>> at 
>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>> at 
>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>> at 
>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
>> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>> at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
>> at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
>> at 
>> org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$1.run(StandaloneAppClient.scala:109)
>> at 
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.lang.RuntimeException: java.io.InvalidClassException: 
>> org.apache.spark.rpc.netty.RequestMessage; local class incompatible: stream 
>> classdesc serialVersionUID = -2221986757032131007, local class 
>> serialVersionUID = -5447855329526097695
>> at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:
>>
>>


Re: Scala Vs Python

2016-09-02 Thread Tal Grynbaum
On Fri, Sep 2, 2016 at 1:15 AM, darren <dar...@ontrenet.com> wrote:

> This topic is a concern for us as well. In the data science world no one
> uses native Scala or Java by choice. It's R and Python, and Python is
> growing. Yet in Spark, Python is third in line for feature support, if
> supported at all.
>
> This is why we have decoupled from Spark in our project. It's really
> unfortunate the Spark team has invested so heavily in Scala.
>
> As for speed, it comes from horizontal scaling and throughput. When you
> can scale outward, individual VM performance is less of an issue. Basic
> HPC principles.
>

Darren,

My guess is that data scientists who decouple themselves from Spark will
eventually be left with more or less nothing (single-process capabilities,
or purely performing HPC), unless some good Spark competitor emerges, which
is unlikely, simply because there is no need for one.
But guessing aside: the reason Python is third in line for feature support
is not that the Spark developers were busy with Scala; it's that the
missing features are those that rely on strong typing, which is not
relevant to Python. In other words, even if Spark were rewritten in Python
and focused on Python only, you would still not get those features.



-- 
*Tal Grynbaum* / *CTO & co-founder*

m# +972-54-7875797

mobile retention done right


Suggestions for calculating MAU/WAU/DAU

2016-08-28 Thread Tal Grynbaum
Hi

I'm struggling with the following issue.
I need to build a cube with 6 dimensions for app usage, for example:
------+------+------+------+------+------
 user | app  |  d3  |  d4  |  d5  |  d6
------+------+------+------+------+------
  u1  |  a1  |  x   |  y   |  z   |   5
------+------+------+------+------+------
  u2  |  a1  |  a   |  b   |  c   |   6
------+------+------+------+------+------

The dimension combinations generate ~100M rows daily.
For each row, I need to calculate the unique monthly active users, weekly
active users, and daily active users, along with some other data (that can
simply be added up).

Each day I can load the data of the last 30 days and calculate a cube with
countDistinct('userId), but this requires a huge cluster and is quite
expensive.

I tried to use HyperLogLog: store the byte array of the previous day's HLL,
deserialize it, add the users of the current day, calculate the new
distinct count, and serialize the byte array for the next day.
However, to get 5% error accuracy with HLL the byte array has to be 4 KB
long, which makes the 100M rows roughly 4000 times bigger, and I ended up
requiring a lot more resources.
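
For reference, a minimal sketch of the cube-plus-distinct-count approach
described above, using Spark's built-in approxCountDistinct (which is
itself HLL-based and takes a relative-error parameter). The dimension
column names follow the table above; the input path and metric column are
assumptions, and this does not address the day-over-day sketch merging:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{approxCountDistinct, col, sum}

val spark = SparkSession.builder().appName("usage-cube").getOrCreate()

// Hypothetical input: the last 30 days of events, one row per user action.
val events = spark.read.parquet("s3a://some-bucket/events/last-30-days/")

// cube() generates all dimension combinations; rsd = 0.05 targets roughly
// 5% relative error, the same accuracy discussed above.
val monthlyCube = events
  .cube(col("app"), col("d3"), col("d4"), col("d5"), col("d6"))
  .agg(
    approxCountDistinct(col("userId"), 0.05).as("monthly_active_users"),
    sum(col("some_additive_metric")).as("metric_total")
  )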

I wonder if one of you can think of a better solution.

Thanks
Tal





-- 
*Tal Grynbaum* / *CTO & co-founder*

m# +972-54-7875797

mobile retention done right


Re: Please assist: Building Docker image containing spark 2.0

2016-08-26 Thread Tal Grynbaum
Did you specify -Dscala-2.10?
As in:

./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package

If you're building with Scala 2.10.

On Sat, Aug 27, 2016, 00:18 Marco Mistroni  wrote:

> Hello Michael,
> uhm, I celebrated too soon.
> The Spark compilation in the Docker image got near the end and then it
> errored out with this message:
>
> INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time: 01:01 h
> [INFO] Finished at: 2016-08-26T21:12:25+00:00
> [INFO] Final Memory: 69M/324M
> [INFO]
> 
> [ERROR] Failed to execute goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first)
> on project spark-mllib_2.11: Execution scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed
> -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the
> command
> [ERROR]   mvn  -rf :spark-mllib_2.11
> The command '/bin/sh -c ./build/mvn -Pyarn -Phadoop-2.4
> -Dhadoop.version=2.4.0 -DskipTests clean package' returned a non-zero code:
> 1
>
> What am I forgetting?
> Once again, the last command I run in the Dockerfile is
>
>
> RUN ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests
> clean package
>
> kr
>
>
>
> On Fri, Aug 26, 2016 at 6:18 PM, Michael Gummelt 
> wrote:
>
>> :)
>>
>> On Thu, Aug 25, 2016 at 2:29 PM, Marco Mistroni 
>> wrote:
>>
>>> No, I won't accept that :)
>>> I can't believe I have wasted 3 hours on a space!
>>>
>>> Many thanks Michael!
>>>
>>> kr
>>>
>>> On Thu, Aug 25, 2016 at 10:01 PM, Michael Gummelt <
>>> mgumm...@mesosphere.io> wrote:
>>>
 You have a space between "build" and "mvn"

 On Thu, Aug 25, 2016 at 1:31 PM, Marco Mistroni 
 wrote:

> Hi all,
>  sorry for the partially off-topic question; I hope there's someone on the
> list who has tried the same and encountered similar issues.
>
> OK, so I have created a Dockerfile to build an Ubuntu container which
> includes Spark 2.0, but somehow when it gets to the point where it has to
> kick off the ./build/mvn command, it errors out with the following:
>
> ---> Running in 8c2aa6d59842
> /bin/sh: 1: ./build: Permission denied
> The command '/bin/sh -c ./build mvn -Pyarn -Phadoop-2.4
> -Dhadoop.version=2.4.0 -DskipTests clean package' returned a non-zero 
> code:
> 126
>
> I am puzzled, as I am root when I build the container, so I should not
> encounter this issue (btw, if instead of running mvn from the build
> directory I use the mvn which I installed on the container, it works fine,
> but it's painfully slow).
>
> Here are the details of my Spark commands (Scala 2.10, Java 1.7, Maven
> 3.3.9 and git have already been installed):
>
> # Spark
> RUN echo "Installing Apache spark 2.0"
> RUN git clone git://github.com/apache/spark.git
> WORKDIR /spark
> RUN ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests
> clean package
>
>
> Could anyone assist pls?
>
> kindest regards
>  Marco
>
>


 --
 Michael Gummelt
 Software Engineer
 Mesosphere

>>>
>>>
>>
>>
>> --
>> Michael Gummelt
>> Software Engineer
>> Mesosphere
>>
>
>


Re: spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-25 Thread Tal Grynbaum
Is/was there an option similar to DirectParquetOutputCommitter for writing
JSON files to S3?
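
For reference, one commonly used mitigation for the long copy phase
discussed below is the version 2 FileOutputCommitter algorithm, which moves
task output into the final directory at task commit instead of renaming it
all again at job commit. A minimal sketch; the config key is a standard
Hadoop/Spark setting, the output path is hypothetical, and how much this
helps depends on the S3 connector in use:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3-json-writer")
  // v2 commit algorithm: shortens the job-commit "copy" phase, at the cost
  // of weaker cleanup guarantees if a job fails mid-commit.
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

// Hypothetical write, for illustration only:
// someDataFrame.write.json("s3a://some-bucket/output/")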

On Thu, Aug 25, 2016 at 2:56 PM, Takeshi Yamamuro <linguin@gmail.com>
wrote:

> Hi,
>
> Seems this just prevents writers from leaving partial data in a
> destination dir when jobs fail.
> In previous versions of Spark there was a way to write data directly to
> the destination, but Spark v2.0+ has no way to do that because of a
> critical issue on S3 (see SPARK-10063).
>
> // maropu
>
>
> On Thu, Aug 25, 2016 at 2:40 PM, Tal Grynbaum <tal.grynb...@gmail.com>
> wrote:
>
>> I read somewhere that it's because S3 has to know the size of the file
>> upfront.
>> I don't really understand this, i.e. why it is OK not to know it for the
>> temp files but not OK for the final files.
>> The delete permission is the minor disadvantage from my side; the worst
>> thing is that I have a cluster of 100 machines sitting idle for 15
>> minutes waiting for the copy to end.
>>
>> Any suggestions on how to avoid that?
>>
>> On Thu, Aug 25, 2016, 08:21 Aseem Bansal <asmbans...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> When Spark saves anything to S3 it creates temporary files. Why? Asking
>>> because this requires the access credentials to be given delete
>>> permissions along with write permissions.
>>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>



-- 
*Tal Grynbaum* / *CTO & co-founder*

m# +972-54-7875797

mobile retention done right


Re: spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-24 Thread Tal Grynbaum
I read somewhere that it's because S3 has to know the size of the file
upfront.
I don't really understand this, i.e. why it is OK not to know it for the
temp files but not OK for the final files.
The delete permission is the minor disadvantage from my side; the worst
thing is that I have a cluster of 100 machines sitting idle for 15 minutes
waiting for the copy to end.

Any suggestions on how to avoid that?

On Thu, Aug 25, 2016, 08:21 Aseem Bansal  wrote:

> Hi
>
> When Spark saves anything to S3 it creates temporary files. Why? Asking
> because this requires the access credentials to be given delete
> permissions along with write permissions.
>


Re: how to select first 50 value of each group after group by?

2016-07-06 Thread Tal Grynbaum
You can use the rank window function to rank each row within its group, and
then filter the rows with rank <= 50.
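
A minimal sketch of that approach; the column names match the question
below, while the sample data and the use of row_number are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder().appName("top-n-per-group").getOrCreate()
import spark.implicits._

// Hypothetical data with the columns from the question: id, pv, location.
val df = Seq(
  ("id1", 100L, "us"), ("id2", 90L, "us"),
  ("id3", 80L, "uk"), ("id4", 70L, "uk")
).toDF("id", "pv", "location")

// Rank rows within each location by pv (descending) and keep the top 50.
// row_number() returns exactly N rows per group; use rank() instead if
// ties should all be kept (which may return more than 50 rows).
val byLocation = Window.partitionBy("location").orderBy(col("pv").desc)

val top50 = df
  .withColumn("rn", row_number().over(byLocation))
  .filter(col("rn") <= 50)
  .drop("rn")

top50.show()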

On Wed, Jul 6, 2016, 14:07  wrote:

> hi there
> I have a DF with 3 columns: id , pv, location.(the rows are already
> grouped by location and sort by pv in des)  I wanna get the first 50 id
> values grouped by location. I checked the API of
> dataframe,groupeddata,pairRDD, and found no match.
>   is there a way to do this naturally?
>   any info will be appreciated.
>
>
>
> 
>
> Thanks & best regards!
> San.Luo
>