Could you please advise and help me understand this further?

These are the actual sizes:

-sh-4.1$ hadoop fs -count dw_lstg_item
           1          764      2041084436189
/sys/edw/dw_lstg_item/snapshot/2015/06/25/00
*This is not skewed: there is exactly one entry per item, but it is 2 TB.*
So should its replication be set to 1?

The two datasets (RDDs) below are unioned, and their total size is 150 GB.
These can be skewed, and hence we use a block join with Scoobi + MR (see the
sketch after the listings below).
*So should its replication be set to 3?*
-sh-4.1$ hadoop fs -count
/apps/hdmi-prod/b_um/epdatasets/exptsession/2015/06/20
           1          101        73796345977
/apps/hdmi-prod/b_um/epdatasets/exptsession/2015/06/20
-sh-4.1$ hadoop fs -count
/apps/hdmi-prod/b_um/epdatasets/exptsession/2015/06/21
           1          101        85559964549
/apps/hdmi-prod/b_um/epdatasets/exptsession/2015/06/21
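
For reference, a rough sketch of how the two days get combined (assuming the
union is what becomes viEvents in my snippet further down; the loader calls
here are illustrative placeholders, the real input format is in the attached
code):

  // hypothetical loaders -- the actual job reads these paths differently
  val day20 = sc.textFile("/apps/hdmi-prod/b_um/epdatasets/exptsession/2015/06/20")
  val day21 = sc.textFile("/apps/hdmi-prod/b_um/epdatasets/exptsession/2015/06/21")
  val viEvents = day20 union day21  // ~150 GB total, possibly skewed by key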

Also, can you suggest the number of executors to use in this case? Each
executor is started with at most 14 GB of memory.

Is it total data / executor memory, i.e. (2 TB + 150 GB) ≈ 2,150 GB / 14 GB
≈ 154 executors? Is this calculation correct?

Also, please advise on this earlier suggestion:
"(should be memory-and-disk or disk-only), number of partitions (should be
large, multiple of num executors),"

https://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose

When do I choose this setting? (My code is attached for reference; a rough
sketch of my reading follows below.)
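
A minimal sketch of how I read that advice, applied to my RDDs (partition
and executor counts here are placeholders, not measured values):

  import org.apache.spark.storage.StorageLevel

  // MEMORY_AND_DISK spills partitions that do not fit in memory to disk
  // instead of recomputing them; DISK_ONLY keeps every partition on disk.
  val cached = viEvents.persist(StorageLevel.MEMORY_AND_DISK)

  // a large partition count that is a multiple of the number of executors,
  // e.g. 4 partitions per executor if the job runs 150 executors
  val repartitioned = cached.repartition(150 * 4)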



On Sun, Jun 28, 2015 at 2:57 PM, Koert Kuipers <ko...@tresata.com> wrote:

> a blockJoin spreads out one side while replicating the other. i would
> suggest replicating the smaller side. so if lstgItem is smaller try 3,1
> or else 1,3. this should spread the "fat" keys out over multiple (3)
> executors...
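>
> for example, with the names from your snippet below: since lstgItem (2 TB)
> is the bigger side here, that would be
>
>   lstgItem.blockJoin(viEvents, 1, 3)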
>
>
> On Sun, Jun 28, 2015 at 5:35 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com>
> wrote:
>
>> I am able to use the blockJoin API and it does not throw a compilation error:
>>
>> val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary, Long))] =
>>   lstgItem.blockJoin(viEvents, 1, 1).map {
>>     // ... map body elided ...
>>   }
>>
>> Here viEvents is highly skewed and both are on HDFS.
>>
>> What should be the optimal values for replication? I gave 1,1.
>>
>>
>>
>> On Sun, Jun 28, 2015 at 1:47 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com>
>> wrote:
>>
>>> I incremented the version of Spark from 1.4.0 to 1.4.0.1 and ran:
>>>
>>>  ./make-distribution.sh  --tgz -Phadoop-2.4 -Pyarn  -Phive
>>> -Phive-thriftserver
>>>
>>> The build was successful but the script failed. Is there a way to pass
>>> the incremented version?
>>>
>>>
>>> [INFO] BUILD SUCCESS
>>>
>>> [INFO]
>>> ------------------------------------------------------------------------
>>>
>>> [INFO] Total time: 09:56 min
>>>
>>> [INFO] Finished at: 2015-06-28T13:45:29-07:00
>>>
>>> [INFO] Final Memory: 84M/902M
>>>
>>> [INFO]
>>> ------------------------------------------------------------------------
>>>
>>> + rm -rf /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist
>>>
>>> + mkdir -p /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist/lib
>>>
>>> + echo 'Spark 1.4.0.1 built for Hadoop 2.4.0'
>>>
>>> + echo 'Build flags: -Phadoop-2.4' -Pyarn -Phive -Phive-thriftserver
>>>
>>> + cp
>>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/assembly/target/scala-2.10/spark-assembly-1.4.0.1-hadoop2.4.0.jar
>>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist/lib/
>>>
>>> + cp
>>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/examples/target/scala-2.10/spark-examples-1.4.0.1-hadoop2.4.0.jar
>>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist/lib/
>>>
>>> + cp
>>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/network/yarn/target/scala-2.10/spark-1.4.0.1-yarn-shuffle.jar
>>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist/lib/
>>>
>>> + mkdir -p
>>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist/examples/src/main
>>>
>>> + cp -r /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/examples/src/main
>>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist/examples/src/
>>>
>>> + '[' 1 == 1 ']'
>>>
>>> + cp
>>> '/Users/dvasthimal/ebay/projects/ep/spark-1.4.0/lib_managed/jars/datanucleus*.jar'
>>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist/lib/
>>>
>>> cp:
>>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/lib_managed/jars/datanucleus*.jar:
>>> No such file or directory
>>>
>>> LM-SJL-00877532:spark-1.4.0 dvasthimal$ ./make-distribution.sh  --tgz
>>> -Phadoop-2.4 -Pyarn  -Phive -Phive-thriftserver
>>>
>>>
>>>
>>> On Sun, Jun 28, 2015 at 1:41 PM, Koert Kuipers <ko...@tresata.com>
>>> wrote:
>>>
>>>> you need 1) to publish to an in-house maven repo, so your application can
>>>> depend on your version, and 2) to use the spark distribution you compiled
>>>> to launch your job (assuming you run on yarn, so you can launch multiple
>>>> versions of spark on the same cluster)
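>>>>
>>>> for example, the dependency in sbt might look like this (resolver URL and
>>>> version are placeholders for whatever you publish in-house):
>>>>
>>>>   resolvers += "inhouse" at "http://repo.mycompany.com/maven"
>>>>   libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0.1"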
>>>>
>>>> On Sun, Jun 28, 2015 at 4:33 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com>
>>>> wrote:
>>>>
>>>>> How can I import this pre-built Spark into my application via maven, as
>>>>> I want to use the block join API?
>>>>>
>>>>> On Sun, Jun 28, 2015 at 1:31 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I ran this without the maven options:
>>>>>>
>>>>>> ./make-distribution.sh  --tgz -Phadoop-2.4 -Pyarn  -Phive
>>>>>> -Phive-thriftserver
>>>>>>
>>>>>> I got this spark-1.4.0-bin-2.4.0.tgz in the same working directory.
>>>>>>
>>>>>> I hope this is built with Hadoop 2.4.x, as I did specify -Phadoop-2.4.
>>>>>>
>>>>>> On Sun, Jun 28, 2015 at 1:10 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>  ./make-distribution.sh  --tgz --mvn "-Phadoop-2.4 -Pyarn
>>>>>>> -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean
>>>>>>> package"
>>>>>>>
>>>>>>> or
>>>>>>>
>>>>>>>  ./make-distribution.sh  --tgz --mvn -Phadoop-2.4 -Pyarn
>>>>>>> -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean
>>>>>>> package"
>>>>>>>
>>>>>>> Both fail with:
>>>>>>>
>>>>>>> + echo -e 'Specify the Maven command with the --mvn flag'
>>>>>>>
>>>>>>> Specify the Maven command with the --mvn flag
>>>>>>>
>>>>>>> + exit -1
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Deepak
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Deepak
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Deepak
>>>
>>>
>>
>>
>> --
>> Deepak
>>
>>
>


-- 
Deepak

Attachment: VISummaryDataProvider.scala
Description: Binary data
