Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread Gourav Sengupta
Would it not be like appending lines to the same file in that case?

On Tue, Jan 16, 2018 at 4:50 AM, kant kodali  wrote:

> Got it! What about overwriting the same file instead of appending?
>
> On Mon, Jan 15, 2018 at 7:47 PM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> What Gerard means is that if you are adding new files in to the same base
>> path (key) then its fine, but in case you are appending lines to the same
>> file then changes will not be picked up.
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Tue, Jan 16, 2018 at 12:20 AM, kant kodali  wrote:
>>
>>> Hi,
>>>
>>> I am not sure I understand. any examples ?
>>>
>>> On Mon, Jan 15, 2018 at 3:45 PM, Gerard Maas 
>>> wrote:
>>>
 Hi,

 You can monitor a filesystem directory as streaming source as long as
 the files placed there are atomically copied/moved into the directory.
 Updating the files is not supported.

 kr, Gerard.

 On Mon, Jan 15, 2018 at 11:41 PM, kant kodali 
 wrote:

> Hi All,
>
> I am wondering if HDFS can be a streaming source like Kafka in Spark
> 2.2.0? For example can I have stream1 reading from Kafka and writing to
> HDFS and stream2 to read from HDFS and write it back to Kafka ? such that
> stream2 will be pulling the latest updates written by stream1.
>
> Thanks!
>


>>>
>>
>


Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread ayan guha
http://spark.apache.org/docs/1.0.0/streaming-programming-guide.html#input-sources


On Tue, Jan 16, 2018 at 3:50 PM, kant kodali  wrote:

> Got it! What about overwriting the same file instead of appending?
>
> On Mon, Jan 15, 2018 at 7:47 PM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> What Gerard means is that if you are adding new files in to the same base
>> path (key) then its fine, but in case you are appending lines to the same
>> file then changes will not be picked up.
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Tue, Jan 16, 2018 at 12:20 AM, kant kodali  wrote:
>>
>>> Hi,
>>>
>>> I am not sure I understand. any examples ?
>>>
>>> On Mon, Jan 15, 2018 at 3:45 PM, Gerard Maas 
>>> wrote:
>>>
 Hi,

 You can monitor a filesystem directory as streaming source as long as
 the files placed there are atomically copied/moved into the directory.
 Updating the files is not supported.

 kr, Gerard.

 On Mon, Jan 15, 2018 at 11:41 PM, kant kodali 
 wrote:

> Hi All,
>
> I am wondering if HDFS can be a streaming source like Kafka in Spark
> 2.2.0? For example can I have stream1 reading from Kafka and writing to
> HDFS and stream2 to read from HDFS and write it back to Kafka ? such that
> stream2 will be pulling the latest updates written by stream1.
>
> Thanks!
>


>>>
>>
>


-- 
Best Regards,
Ayan Guha


Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread kant kodali
Got it! What about overwriting the same file instead of appending?

On Mon, Jan 15, 2018 at 7:47 PM, Gourav Sengupta 
wrote:

> What Gerard means is that if you are adding new files in to the same base
> path (key) then its fine, but in case you are appending lines to the same
> file then changes will not be picked up.
>
> Regards,
> Gourav Sengupta
>
> On Tue, Jan 16, 2018 at 12:20 AM, kant kodali  wrote:
>
>> Hi,
>>
>> I am not sure I understand. any examples ?
>>
>> On Mon, Jan 15, 2018 at 3:45 PM, Gerard Maas 
>> wrote:
>>
>>> Hi,
>>>
>>> You can monitor a filesystem directory as streaming source as long as
>>> the files placed there are atomically copied/moved into the directory.
>>> Updating the files is not supported.
>>>
>>> kr, Gerard.
>>>
>>> On Mon, Jan 15, 2018 at 11:41 PM, kant kodali 
>>> wrote:
>>>
 Hi All,

 I am wondering if HDFS can be a streaming source like Kafka in Spark
 2.2.0? For example can I have stream1 reading from Kafka and writing to
 HDFS and stream2 to read from HDFS and write it back to Kafka ? such that
 stream2 will be pulling the latest updates written by stream1.

 Thanks!

>>>
>>>
>>
>


Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread Gourav Sengupta
What Gerard means is that if you are adding new files into the same base
path (key) then it's fine, but if you are appending lines to the same
file then the changes will not be picked up.

Regards,
Gourav Sengupta

On Tue, Jan 16, 2018 at 12:20 AM, kant kodali  wrote:

> Hi,
>
> I am not sure I understand. any examples ?
>
> On Mon, Jan 15, 2018 at 3:45 PM, Gerard Maas 
> wrote:
>
>> Hi,
>>
>> You can monitor a filesystem directory as streaming source as long as the
>> files placed there are atomically copied/moved into the directory.
>> Updating the files is not supported.
>>
>> kr, Gerard.
>>
>> On Mon, Jan 15, 2018 at 11:41 PM, kant kodali  wrote:
>>
>>> Hi All,
>>>
>>> I am wondering if HDFS can be a streaming source like Kafka in Spark
>>> 2.2.0? For example can I have stream1 reading from Kafka and writing to
>>> HDFS and stream2 to read from HDFS and write it back to Kafka ? such that
>>> stream2 will be pulling the latest updates written by stream1.
>>>
>>> Thanks!
>>>
>>
>>
>
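
To illustrate the pattern described above (new files under the base path are picked up; in-place appends or overwrites of an existing file are not), here is a minimal sketch of the Kafka -> HDFS -> Kafka pipeline from the original question. It is not from the thread: it assumes Spark 2.2 Structured Streaming with the spark-sql-kafka-0-10 package on the classpath, and the broker address, topic names and paths are placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("hdfs-as-streaming-source").getOrCreate()

// stream1: Kafka -> HDFS; every micro-batch adds new files under the base path,
// which is exactly the pattern the file source supports.
val stream1 = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "in-topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("parquet")
  .option("path", "hdfs:///data/stream1")
  .option("checkpointLocation", "hdfs:///checkpoints/stream1")
  .start()

// stream2: the same HDFS directory used as a streaming source -> Kafka;
// only newly added files are picked up, never edits to files that already exist.
val schema = StructType(Seq(StructField("value", StringType)))
val stream2 = spark.readStream
  .schema(schema)
  .parquet("hdfs:///data/stream1")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "out-topic")
  .option("checkpointLocation", "hdfs:///checkpoints/stream2")
  .start()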


Re: spark-submit can find python?

2018-01-15 Thread Jeff Zhang
Hi Manuel,

Looks like you are using Spark's virtualenv support. Virtualenv will create
a Python environment on each executor.

>>> --conf 
>>> spark.pyspark.virtualenv.bin.path=/home/mansop/hail-test/python-2.7.2/bin/activate
\
Your configuration is not quite right: spark.pyspark.virtualenv.bin.path
should point to the virtualenv executable, which needs to be installed
on all the nodes of the cluster. You can check the following link for more
details on how to use virtualenv in PySpark.

https://community.hortonworks.com/articles/104949/using-virtualenv-with-pyspark-1.html



Manuel Sopena Ballesteros wrote on Tue, 16 Jan 2018 at 8:02 AM:

> Apologies, I copied the wrong spark-submit output from running in a
> cluster. Please find below the right output for the question asked:
>
>
>
> -bash-4.1$ spark-submit --master yarn \
>
> > --deploy-mode cluster \
>
> > --driver-memory 4g \
>
> > --executor-memory 2g \
>
> > --executor-cores 4 \
>
> > --queue default \
>
> > --conf spark.pyspark.virtualenv.enabled=true \
>
> > --conf spark.pyspark.virtualenv.type=native \
>
> > --conf
> spark.pyspark.virtualenv.requirements=/home/mansop/requirements.txt \
>
> > --conf
> spark.pyspark.virtualenv.bin.path=/home/mansop/hail-test/python-2.7.2/bin/activate
> \
>
> > --jars $HAIL_HOME/build/libs/hail-all-spark.jar \
>
> > --py-files $HAIL_HOME/build/distributions/hail-python.zip \
>
> > test.py
>
>
>
> 18/01/16 10:42:49 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> 18/01/16 10:42:50 WARN DomainSocketFactory: The short-circuit local reads
> feature cannot be used because libhadoop cannot be loaded.
>
> 18/01/16 10:42:50 INFO RMProxy: Connecting to ResourceManager at
> wp-hdp-ctrl03-mlx.mlx/10.0.1.206:8050
>
> 18/01/16 10:42:50 INFO Client: Requesting a new application from cluster
> with 4 NodeManagers
>
> 18/01/16 10:42:50 INFO Client: Verifying our application has not requested
> more than the maximum memory capability of the cluster (450560 MB per
> container)
>
> 18/01/16 10:42:50 INFO Client: Will allocate AM container, with 4505 MB
> memory including 409 MB overhead
>
> 18/01/16 10:42:50 INFO Client: Setting up container launch context for our
> AM
>
> 18/01/16 10:42:50 INFO Client: Setting up the launch environment for our
> AM container
>
> 18/01/16 10:42:50 INFO Client: Preparing resources for our AM container
>
> 18/01/16 10:42:51 INFO Client: Use hdfs cache file as spark.yarn.archive
> for HDP,
> hdfsCacheFile:hdfs://wp-hdp-ctrl01-mlx.mlx:8020/hdp/apps/2.6.3.0-235/spark2/spark2-hdp-yarn-archive.tar.gz
>
> 18/01/16 10:42:51 INFO Client: Source and destination file systems are the
> same. Not copying
> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/hdp/apps/2.6.3.0-235/spark2/spark2-hdp-yarn-archive.tar.gz
>
> 18/01/16 10:42:51 INFO Client: Uploading resource
> file:/home/mansop/hail-test2/hail/build/libs/hail-all-spark.jar ->
> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/hail-all-spark.jar
>
> 18/01/16 10:42:51 INFO Client: Uploading resource
> file:/home/mansop/requirements.txt ->
> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/requirements.txt
>
> 18/01/16 10:42:51 INFO Client: Uploading resource
> file:/home/mansop/test.py ->
> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/test.py
>
> 18/01/16 10:42:51 INFO Client: Uploading resource
> file:/usr/hdp/2.6.3.0-235/spark2/python/lib/pyspark.zip ->
> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/pyspark.zip
>
> 18/01/16 10:42:51 INFO Client: Uploading resource
> file:/usr/hdp/2.6.3.0-235/spark2/python/lib/py4j-0.10.4-src.zip ->
> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/py4j-0.10.4-src.zip
>
> 18/01/16 10:42:51 INFO Client: Uploading resource
> file:/home/mansop/hail-test2/hail/build/distributions/hail-python.zip ->
> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/hail-python.zip
>
> 18/01/16 10:42:52 INFO Client: Uploading resource
> file:/tmp/spark-592e7e0f-6faa-4c3c-ab0f-7dd1cff21d17/__spark_conf__8493747840734310444.zip
> ->
> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/__spark_conf__.zip
>
> 18/01/16 10:42:52 INFO SecurityManager: Changing view acls to: mansop
>
> 18/01/16 10:42:52 INFO SecurityManager: Changing modify acls to: mansop
>
> 18/01/16 10:42:52 INFO SecurityManager: Changing view acls groups to:
>
> 18/01/16 10:42:52 INFO SecurityManager: Changing modify acls groups to:
>
> 18/01/16 10:42:52 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users  with view permissions: Set(mansop);
> groups with view permissions: Set(); users  with modify permissions:
> Set(mansop); groups 

Re: Broken SQL Visualization?

2018-01-15 Thread Wenchen Fan
Hi, thanks for reporting, can you include the steps to reproduce this bug?

On Tue, Jan 16, 2018 at 7:07 AM, Ted Yu  wrote:

> Did you include any picture ?
>
> Looks like the picture didn't go thru.
>
> Please use third party site.
>
> Thanks
>
>  Original message 
> From: Tomasz Gawęda 
> Date: 1/15/18 2:07 PM (GMT-08:00)
> To: d...@spark.apache.org, user@spark.apache.org
> Subject: Broken SQL Visualization?
>
> Hi,
>
> today I have updated my test cluster to current Spark master, after that
> my SQL Visualization page started to crash with following error in JS:
>
> Screenshot was cut for readability and to hide internal server names ;)
>
> It may be caused by upgrade or by some code changes, but - to be honest -
> I did not use any new operators nor any new Spark function, so it should
> render correctly, like few days ago. Some Visualizations work fine, some
> crashes, I don't have any doubts why it may not work. Can anyone help me?
> Probably it is a bug in Spark, but it's hard to me to say in which place.
>
> Thanks in advance!
>
> Pozdrawiam / Best regards,
>
> Tomek
>


Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread kant kodali
Hi,

I am not sure I understand. Any examples?

On Mon, Jan 15, 2018 at 3:45 PM, Gerard Maas  wrote:

> Hi,
>
> You can monitor a filesystem directory as streaming source as long as the
> files placed there are atomically copied/moved into the directory.
> Updating the files is not supported.
>
> kr, Gerard.
>
> On Mon, Jan 15, 2018 at 11:41 PM, kant kodali  wrote:
>
>> Hi All,
>>
>> I am wondering if HDFS can be a streaming source like Kafka in Spark
>> 2.2.0? For example can I have stream1 reading from Kafka and writing to
>> HDFS and stream2 to read from HDFS and write it back to Kafka ? such that
>> stream2 will be pulling the latest updates written by stream1.
>>
>> Thanks!
>>
>
>


RE: spark-submit can find python?

2018-01-15 Thread Manuel Sopena Ballesteros
Apologies, I copied the wrong spark-submit output from running in a cluster. 
Please find below the right output for the question asked:

-bash-4.1$ spark-submit --master yarn \
> --deploy-mode cluster \
> --driver-memory 4g \
> --executor-memory 2g \
> --executor-cores 4 \
> --queue default \
> --conf spark.pyspark.virtualenv.enabled=true \
> --conf spark.pyspark.virtualenv.type=native \
> --conf 
> spark.pyspark.virtualenv.requirements=/home/mansop/requirements.txt \
> --conf 
> spark.pyspark.virtualenv.bin.path=/home/mansop/hail-test/python-2.7.2/bin/activate
>  \
> --jars $HAIL_HOME/build/libs/hail-all-spark.jar \
> --py-files $HAIL_HOME/build/distributions/hail-python.zip \
> test.py

18/01/16 10:42:49 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
18/01/16 10:42:50 WARN DomainSocketFactory: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.
18/01/16 10:42:50 INFO RMProxy: Connecting to ResourceManager at 
wp-hdp-ctrl03-mlx.mlx/10.0.1.206:8050
18/01/16 10:42:50 INFO Client: Requesting a new application from cluster with 4 
NodeManagers
18/01/16 10:42:50 INFO Client: Verifying our application has not requested more 
than the maximum memory capability of the cluster (450560 MB per container)
18/01/16 10:42:50 INFO Client: Will allocate AM container, with 4505 MB memory 
including 409 MB overhead
18/01/16 10:42:50 INFO Client: Setting up container launch context for our AM
18/01/16 10:42:50 INFO Client: Setting up the launch environment for our AM 
container
18/01/16 10:42:50 INFO Client: Preparing resources for our AM container
18/01/16 10:42:51 INFO Client: Use hdfs cache file as spark.yarn.archive for 
HDP, 
hdfsCacheFile:hdfs://wp-hdp-ctrl01-mlx.mlx:8020/hdp/apps/2.6.3.0-235/spark2/spark2-hdp-yarn-archive.tar.gz
18/01/16 10:42:51 INFO Client: Source and destination file systems are the 
same. Not copying 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/hdp/apps/2.6.3.0-235/spark2/spark2-hdp-yarn-archive.tar.gz
18/01/16 10:42:51 INFO Client: Uploading resource 
file:/home/mansop/hail-test2/hail/build/libs/hail-all-spark.jar -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/hail-all-spark.jar
18/01/16 10:42:51 INFO Client: Uploading resource 
file:/home/mansop/requirements.txt -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/requirements.txt
18/01/16 10:42:51 INFO Client: Uploading resource file:/home/mansop/test.py -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/test.py
18/01/16 10:42:51 INFO Client: Uploading resource 
file:/usr/hdp/2.6.3.0-235/spark2/python/lib/pyspark.zip -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/pyspark.zip
18/01/16 10:42:51 INFO Client: Uploading resource 
file:/usr/hdp/2.6.3.0-235/spark2/python/lib/py4j-0.10.4-src.zip -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/py4j-0.10.4-src.zip
18/01/16 10:42:51 INFO Client: Uploading resource 
file:/home/mansop/hail-test2/hail/build/distributions/hail-python.zip -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/hail-python.zip
18/01/16 10:42:52 INFO Client: Uploading resource 
file:/tmp/spark-592e7e0f-6faa-4c3c-ab0f-7dd1cff21d17/__spark_conf__8493747840734310444.zip
 -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0045/__spark_conf__.zip
18/01/16 10:42:52 INFO SecurityManager: Changing view acls to: mansop
18/01/16 10:42:52 INFO SecurityManager: Changing modify acls to: mansop
18/01/16 10:42:52 INFO SecurityManager: Changing view acls groups to:
18/01/16 10:42:52 INFO SecurityManager: Changing modify acls groups to:
18/01/16 10:42:52 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(mansop); groups 
with view permissions: Set(); users  with modify permissions: Set(mansop); 
groups with modify permissions: Set()
18/01/16 10:42:52 INFO Client: Submitting application 
application_1512016123441_0045 to ResourceManager
18/01/16 10:42:52 INFO YarnClientImpl: Submitted application 
application_1512016123441_0045
18/01/16 10:42:53 INFO Client: Application report for 
application_1512016123441_0045 (state: ACCEPTED)
18/01/16 10:42:53 INFO Client:
 client token: N/A
 diagnostics: AM container is launched, waiting for AM container to 
Register with RM
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: default
 start time: 1516059772092
 final status: UNDEFINED
 tracking URL: 
http://wp-hdp-ctrl03-mlx.mlx:8088/proxy/application_1512016123441_0045/
 user: mansop
18/01/16 10:42:54 INFO Client: Application report for 

spark-submit can find python?

2018-01-15 Thread Manuel Sopena Ballesteros
Hi all,

I am quite new to spark and need some help troubleshooting the execution of an 
application running on a spark cluster...

My Spark environment is deployed using Ambari (HDP), YARN is the resource 
scheduler and Hadoop (HDFS) is the file system.

The application I am trying to run is a python script (test.py).

The worker nodes have python 2.6 so I am asking spark to spin up a virtual 
environment based on python 2.7.

I can successfully run this test app in a single node (see below):

-bash-4.1$ spark-submit \
> --conf spark.pyspark.virtualenv.type=native \
> --conf spark.pyspark.virtualenv.requirements=/home/mansop/requirements.txt \
> --conf 
> spark.pyspark.virtualenv.bin.path=/home/mansop/hail-test/python-2.7.2/bin/activate
>  \
> --conf spark.pyspark.python=/home/mansop/hail-test/python-2.7.2/bin/python \
> --jars $HAIL_HOME/build/libs/hail-all-spark.jar \
> --py-files $HAIL_HOME/build/distributions/hail-python.zip \
> test.py
hail: info: SparkUI: http://192.168.10.201:4040
Welcome to Hail version 0.1-0320a61
[Stage 2:==> (91 + 4) / 100]
Summary(samples=3, variants=308, call_rate=1.00, contigs=['1'], multiallelics=0, snps=308, mnps=0, insertions=0, deletions=0, complex=0, star=0, max_alleles=2)


However, Spark crashes while trying to run my test script (error below), throwing 
this error message: 
/d0/hadoop/yarn/local/usercache/mansop/appcache/application_1512016123441_0032/container_1512016123441_0032_02_01/tmp/1515989862748-0/bin/python

-bash-4.1$ spark-submit --master yarn \
> --deploy-mode cluster \
> --driver-memory 4g \
> --executor-memory 2g \
> --executor-cores 4 \
> --queue default \
> --conf spark.pyspark.virtualenv.type=native \
> --conf 
> spark.pyspark.virtualenv.requirements=/home/mansop/requirements.txt \
> --conf 
> spark.pyspark.virtualenv.bin.path=/home/mansop/hail-test/python-2.7.2/bin/activate
>  \
> --jars $HAIL_HOME/build/libs/hail-all-spark.jar \
> --py-files $HAIL_HOME/build/distributions/hail-python.zip \
> test.py
18/01/16 09:55:17 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
18/01/16 09:55:18 WARN DomainSocketFactory: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.
18/01/16 09:55:18 INFO RMProxy: Connecting to ResourceManager at 
wp-hdp-ctrl03-mlx.mlx/10.0.1.206:8050
18/01/16 09:55:18 INFO Client: Requesting a new application from cluster with 4 
NodeManagers
18/01/16 09:55:18 INFO Client: Verifying our application has not requested more 
than the maximum memory capability of the cluster (450560 MB per container)
18/01/16 09:55:18 INFO Client: Will allocate AM container, with 4505 MB memory 
including 409 MB overhead
18/01/16 09:55:18 INFO Client: Setting up container launch context for our AM
18/01/16 09:55:18 INFO Client: Setting up the launch environment for our AM 
container
18/01/16 09:55:18 INFO Client: Preparing resources for our AM container
18/01/16 09:55:19 INFO Client: Use hdfs cache file as spark.yarn.archive for 
HDP, 
hdfsCacheFile:hdfs://wp-hdp-ctrl01-mlx.mlx:8020/hdp/apps/2.6.3.0-235/spark2/spark2-hdp-yarn-archive.tar.gz
18/01/16 09:55:19 INFO Client: Source and destination file systems are the 
same. Not copying 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/hdp/apps/2.6.3.0-235/spark2/spark2-hdp-yarn-archive.tar.gz
18/01/16 09:55:19 INFO Client: Uploading resource 
file:/home/mansop/hail-test2/hail/build/libs/hail-all-spark.jar -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/hail-all-spark.jar
18/01/16 09:55:20 INFO Client: Uploading resource 
file:/home/mansop/requirements.txt -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/requirements.txt
18/01/16 09:55:20 INFO Client: Uploading resource file:/home/mansop/test.py -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/test.py
18/01/16 09:55:20 INFO Client: Uploading resource 
file:/usr/hdp/2.6.3.0-235/spark2/python/lib/pyspark.zip -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/pyspark.zip
18/01/16 09:55:20 INFO Client: Uploading resource 
file:/usr/hdp/2.6.3.0-235/spark2/python/lib/py4j-0.10.4-src.zip -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/py4j-0.10.4-src.zip
18/01/16 09:55:20 INFO Client: Uploading resource 
file:/home/mansop/hail-test2/hail/build/distributions/hail-python.zip -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/hail-python.zip
18/01/16 09:55:20 INFO Client: Uploading resource 

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread Gerard Maas
Hi,

You can monitor a filesystem directory as a streaming source as long as the
files placed there are atomically copied/moved into the directory.
Updating the files is not supported.

kr, Gerard.

On Mon, Jan 15, 2018 at 11:41 PM, kant kodali  wrote:

> Hi All,
>
> I am wondering if HDFS can be a streaming source like Kafka in Spark
> 2.2.0? For example can I have stream1 reading from Kafka and writing to
> HDFS and stream2 to read from HDFS and write it back to Kafka ? such that
> stream2 will be pulling the latest updates written by stream1.
>
> Thanks!
>


Re: Broken SQL Visualization?

2018-01-15 Thread Ted Yu
Did you include any picture?
Looks like the picture didn't go through.
Please use a third-party site.
Thanks
 Original message 
From: Tomasz Gawęda
Date: 1/15/18 2:07 PM (GMT-08:00)
To: d...@spark.apache.org, user@spark.apache.org
Subject: Broken SQL Visualization?

Hi,
today I have updated my test cluster to the current Spark master; after that my SQL 
Visualization page started to crash with the following error in JS:

The screenshot was cut for readability and to hide internal server names ;)

It may be caused by the upgrade or by some code changes, but - to be honest - I did 
not use any new operators nor any new Spark functions, so it should render 
correctly, like a few days ago. Some visualizations work fine, some crash, and I 
have no idea why they may not work. Can anyone help me? Probably it is a bug in 
Spark, but it's hard for me to say in which place.



Thanks in advance!
Pozdrawiam / Best regards,
Tomek


can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread kant kodali
Hi All,

I am wondering if HDFS can be a streaming source like Kafka in Spark 2.2.0?
For example, can I have stream1 reading from Kafka and writing to HDFS, and
stream2 reading from HDFS and writing it back to Kafka, such that stream2
will be pulling the latest updates written by stream1.

Thanks!


Re: Timestamp changing while writing

2018-01-15 Thread Bryan Cutler
Spark internally stores timestamps as UTC values, so createDataFrame will
convert from the local time zone to UTC. I think there was a JIRA to correct
the Parquet output. Are the values you are seeing offset from your local time
zone?
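
A quick way to check is to look at the session time zone and round-trip a single timestamp through Parquet. The snippet below is only a sketch (not from the thread), assuming Spark 2.2+ in spark-shell and /tmp/ts_check as a throwaway path:

import java.sql.Timestamp
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Timestamps are normalized to UTC internally; display and conversion use the
// session time zone, which defaults to the JVM time zone.
println(spark.conf.get("spark.sql.session.timeZone", java.util.TimeZone.getDefault.getID))

val schema = StructType(Seq(StructField("ts", TimestampType)))
val rows = spark.sparkContext.parallelize(Seq(Row(Timestamp.valueOf("2018-01-15 10:00:00"))))
val df = spark.createDataFrame(rows, schema)
df.write.mode("overwrite").parquet("/tmp/ts_check")
spark.read.parquet("/tmp/ts_check").show(false)   // compare against the value that went in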

On Jan 11, 2018 4:49 PM, "sk skk"  wrote:

> Hello,
>
> I am using createDataframe and passing java row rdd and schema . But it is
> changing the time value when I write that data frame to a parquet file.
>
> Can any one help .
>
> Thank you,
> Sudhir
>


Re: Inner join with the table itself

2018-01-15 Thread Jacek Laskowski
Hi Michael,

scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> val r1 = spark.range(1)
r1: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> r1.as("left").join(r1.as("right")).filter($"left.id" === $"right.id
").show
+---+---+
| id| id|
+---+---+
|  0|  0|
+---+---+

Am I missing something? When aliasing a table, use the alias identifier in the
column references (as in $"left.id" above).
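
For Michael's POSITION example quoted below, the same pattern would look roughly like this (a sketch only; it assumes import spark.implicits._ and the POSITION_ID / POSITION_ID0 columns from his mail):

val p1 = pos.as("p1")
val p2 = pos.as("p2")
// Qualify the columns through the aliases so the two sides of the self-join can be told apart.
p1.join(p2, $"p1.POSITION_ID0" === $"p2.POSITION_ID")
  .select($"p1.POSITION_ID")
  .show()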


Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, Jan 15, 2018 at 3:26 PM, Michael Shtelma  wrote:

> Hi Jacek & Gengliang,
>
> let's take a look at the following query:
>
> val pos = spark.read.parquet(prefix + "POSITION.parquet")
> pos.createOrReplaceTempView("POSITION")
> spark.sql("SELECT  POSITION.POSITION_ID  FROM POSITION POSITION JOIN
> POSITION POSITION1 ON POSITION.POSITION_ID0 = POSITION1.POSITION_ID
> ").collect()
>
> This query is working for me right now using spark 2.2.
>
> Now we can try implementing the same logic with DataFrame API:
>
> pos.join(pos, pos("POSITION_ID0")===pos("POSITION_ID")).collect()
>
> I am getting the following error:
>
> "Join condition is missing or trivial.
>
> Use the CROSS JOIN syntax to allow cartesian products between these
> relations.;"
>
> I have tried using alias function, but without success:
>
> val pos2 = pos.alias("P2")
> pos.join(pos2, pos("POSITION_ID0")===pos2("POSITION_ID")).collect()
>
> This also leads us to the same error.
> Am  I missing smth about the usage of alias?
>
> Now let's rename the columns:
>
> val pos3 = pos.toDF(pos.columns.map(_ + "_2"): _*)
> pos.join(pos3, pos("POSITION_ID0")===pos3("POSITION_ID_2")).collect()
>
> It works!
>
> There is one more really odd thing about all this: a colleague of mine
> has managed to get the same exception ("Join condition is missing or
> trivial") also using original SQL query, but I think he has been using
> empty tables.
>
> Thanks,
> Michael
>
>
> On Mon, Jan 15, 2018 at 11:27 AM, Gengliang Wang
>  wrote:
> > Hi Michael,
> >
> > You can use `Explain` to see how your query is optimized.
> > https://docs.databricks.com/spark/latest/spark-sql/
> language-manual/explain.html
> > I believe your query is an actual cross join, which is usually very slow
> in
> > execution.
> >
> > To get rid of this, you can set `spark.sql.crossJoin.enabled` as true.
> >
> >
> > On 15 Jan 2018, at 6:09 PM, Jacek Laskowski  wrote:
> >
> > Hi Michael,
> >
> > -dev +user
> >
> > What's the query? How do you "fool spark"?
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > https://about.me/JacekLaskowski
> > Mastering Spark SQL https://bit.ly/mastering-spark-sql
> > Spark Structured Streaming https://bit.ly/spark-structured-streaming
> > Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> > Follow me at https://twitter.com/jaceklaskowski
> >
> > On Mon, Jan 15, 2018 at 10:23 AM, Michael Shtelma 
> > wrote:
> >>
> >> Hi all,
> >>
> >> If I try joining the table with itself using join columns, I am
> >> getting the following error:
> >> "Join condition is missing or trivial. Use the CROSS JOIN syntax to
> >> allow cartesian products between these relations.;"
> >>
> >> This is not true, and my join is not trivial and is not a real cross
> >> join. I am providing join condition and expect to get maybe a couple
> >> of joined rows for each row in the original table.
> >>
> >> There is a workaround for this, which implies renaming all the columns
> >> in source data frame and only afterwards proceed with the join. This
> >> allows us to fool spark.
> >>
> >> Now I am wondering if there is a way to get rid of this problem in a
> >> better way? I do not like the idea of renaming the columns because
> >> this makes it really difficult to keep track of the names in the
> >> columns in result data frames.
> >> Is it possible to deactivate this check?
> >>
> >> Thanks,
> >> Michael
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
> >
>


Re: [Spark ML] Positive-Only Training Classification in Scala

2018-01-15 Thread Georg Heiler
I do not know that module, but in the literature PUL (positive-unlabeled
learning) is the exact term you should look for.

Matt Hicks  wrote on Mon, 15 Jan 2018 at 20:56:

> Is it fair to assume this is what I need?
> https://github.com/ispras/pu4spark
>
>
>
> On Mon, Jan 15, 2018 1:55 PM, Georg Heiler georg.kf.hei...@gmail.com
> wrote:
>
>> As far as I know spark does not implement such algorithms. In case the
>> dataset is small
>> http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html
>>  might
>> be of interest to you.
>>
>> Jörn Franke  wrote on Mon, 15 Jan 2018 at 20:04:
>>
>> I think you look more for algorithms for unsupervised learning, eg
>> clustering.
>>
>> Depending on the characteristics different clusters might be created , eg
>> donor or non-donor. Most likely you may find also more clusters (eg would
>> donate but has a disease preventing it or too old). You can verify which
>> clusters make sense for your approach so I recommend not only try two
>> clusters but multiple and see which number is more statistically
>> significant .
>>
>> On 15. Jan 2018, at 19:21, Matt Hicks  wrote:
>>
>> I'm attempting to create a training classification, but only have
>> positive information.  Specifically in this case it is a donor list of
>> users, but I want to use it as training in order to determine
>> classification for new contacts to give probabilities that they will donate.
>>
>> Any insights or links are appreciated. I've gone through the
>> documentation but have been unable to find any references to how I might do
>> this.
>>
>> Thanks
>>
>> ---*Matt Hicks*
>>
>> *Chief Technology Officer*
>>
>> 405.283.6887 | http://outr.com
>>
>> 
>>
>>


Re: [Spark ML] Positive-Only Training Classification in Scala

2018-01-15 Thread Matt Hicks
Is it fair to assume this is what I need? https://github.com/ispras/pu4spark  





On Mon, Jan 15, 2018 1:55 PM, Georg Heiler georg.kf.hei...@gmail.com  wrote:
As far as I know spark does not implement such algorithms. In case the dataset
is small
http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html
 might be of interest to you.
Jörn Franke  wrote on Mon, 15 Jan 2018 at 20:04:
I think you look more for algorithms for unsupervised learning, eg clustering.
Depending on the characteristics different clusters might be created , eg donor
or non-donor. Most likely you may find also more clusters (eg would donate but
has a disease preventing it or too old). You can verify which clusters make
sense for your approach so I recommend not only try two clusters but multiple
and see which number is more statistically significant .
On 15. Jan 2018, at 19:21, Matt Hicks  wrote:

I'm attempting to create a training classification, but only have positive
information.  Specifically in this case it is a donor list of users, but I want
to use it as training in order to determine classification for new contacts to
give probabilities that they will donate.
Any insights or links are appreciated. I've gone through the documentation but
have been unable to find any references to how I might do this.
Thanks
---
Matt Hicks

Chief Technology Officer

405.283.6887 | http://outr.com




Re: [Spark ML] Positive-Only Training Classification in Scala

2018-01-15 Thread Georg Heiler
As far as I know spark does not implement such algorithms. In case the
dataset is small
http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html
might
be of interest to you.

Jörn Franke  wrote on Mon, 15 Jan 2018 at 20:04:

> I think you look more for algorithms for unsupervised learning, eg
> clustering.
>
> Depending on the characteristics different clusters might be created , eg
> donor or non-donor. Most likely you may find also more clusters (eg would
> donate but has a disease preventing it or too old). You can verify which
> clusters make sense for your approach so I recommend not only try two
> clusters but multiple and see which number is more statistically
> significant .
>
> On 15. Jan 2018, at 19:21, Matt Hicks  wrote:
>
> I'm attempting to create a training classification, but only have positive
> information.  Specifically in this case it is a donor list of users, but I
> want to use it as training in order to determine classification for new
> contacts to give probabilities that they will donate.
>
> Any insights or links are appreciated. I've gone through the documentation
> but have been unable to find any references to how I might do this.
>
> Thanks
>
> ---*Matt Hicks*
>
> *Chief Technology Officer*
>
> 405.283.6887 | http://outr.com
>
> 
>
>


Re: [Spark ML] Positive-Only Training Classification in Scala

2018-01-15 Thread Jörn Franke
I think you are looking more for unsupervised learning algorithms, e.g. clustering.

Depending on the characteristics, different clusters might emerge, e.g. donor 
or non-donor. Most likely you will also find more clusters (e.g. would donate but 
has a disease preventing it, or is too old). You can verify which clusters make 
sense for your approach, so I recommend trying not only two clusters but several, 
and seeing which number is more statistically significant.
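
A rough sketch of that idea with Spark ML k-means (not from the thread; `dataset` stands for a DataFrame that already has an assembled "features" vector column):

import org.apache.spark.ml.clustering.KMeans

// Fit k-means for several cluster counts and compare the within-cluster cost of each.
val costs = Seq(2, 3, 4, 5).map { k =>
  val model = new KMeans().setK(k).setSeed(1L).setFeaturesCol("features").fit(dataset)
  (k, model.computeCost(dataset))  // lower is better; look for the "elbow" across k
}
costs.foreach { case (k, cost) => println(s"k=$k cost=$cost") }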

> On 15. Jan 2018, at 19:21, Matt Hicks  wrote:
> 
> 
> I'm attempting to create a training classification, but only have positive 
> information.  Specifically in this case it is a donor list of users, but I 
> want to use it as training in order to determine classification for new 
> contacts to give probabilities that they will donate.
> 
> Any insights or links are appreciated. I've gone through the documentation 
> but have been unable to find any references to how I might do this.
> 
> Thanks
> 
> ---
> Matt Hicks
> Chief Technology Officer
> 405.283.6887 | http://outr.com
> 


Re: 3rd party hadoop input formats for EDI formats

2018-01-15 Thread Jörn Franke
I do not want to make advertisement for certain third party components.

Hence, just some food for thought:
Python Pandas supports some of those formats (it is not an inputformat though).

Some commercial offerings just provide ETL to convert it into another format 
already supported by Spark.

Then you have a lot of message gateways which receive these messages and can 
also convert them.

Finally, there are third-party libraries supporting these formats, and it 
is rather easy to create your own InputFormat for them based on those.

So it is not only about finding an InputFormat; you may also make an 
architectural decision to convert these formats into ones Spark already supports.

> On 15. Jan 2018, at 19:01, Saravanan Nagarajan  wrote:
> 
> Hello All,
> 
>  Need to research the availability of both open source and commercial 
> libraries to read healthcare EDI formats such as HL7, 835, 837. Each library 
> need to be researched/ranked on several criteria like pricing if commercial, 
> suitability for integration into sagacity, stability of library, maturity 
> /stability of API, support options etc. Any documentation or suggestion would 
> help . Thanks!


[Spark ML] Positive-Only Training Classification in Scala

2018-01-15 Thread Matt Hicks
I'm attempting to create a training classification, but only have positive
information.  Specifically in this case it is a donor list of users, but I want
to use it as training in order to determine classification for new contacts to
give probabilities that they will donate.
Any insights or links are appreciated. I've gone through the documentation but
have been unable to find any references to how I might do this.
Thanks
---
Matt Hicks

Chief Technology Officer

405.283.6887 | http://outr.com

3rd party hadoop input formats for EDI formats

2018-01-15 Thread Saravanan Nagarajan
Hello All,

 Need to research the availability of both open source and commercial
libraries to read healthcare EDI formats such as HL7, 835, and 837. Each
library needs to be researched/ranked on several criteria, like pricing (if
commercial), suitability for integration into sagacity, stability of the
library, maturity/stability of the API, support options, etc. Any documentation
or suggestions would help. Thanks!


Re: End of Stream errors in shuffle

2018-01-15 Thread pratyush04
Hi Fernando,

There is a 2GB limit on shuffle blocks. Since you say the job fails
while shuffling 200GB of data, it might be due to this.
These links give more idea about this:
http://apache-spark-developers-list.1001551.n3.nabble.com/Re-2GB-limit-for-partitions-td10435.html
https://issues.apache.org/jira/browse/SPARK-5928
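
If that is the cause, a common mitigation is to raise the shuffle parallelism so each shuffle block stays well below 2GB. A sketch (not from the thread; the value, `df` and the column name are placeholders):

import org.apache.spark.sql.functions.col

// More shuffle partitions -> smaller blocks per partition during the 200GB shuffle.
spark.conf.set("spark.sql.shuffle.partitions", "2000")   // example value; tune to the data volume
// Or repartition explicitly on the grouping key before the groupBy:
val regrouped = df.repartition(2000, col("groupKey"))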

Thanks,
Pratyush




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Inner join with the table itself

2018-01-15 Thread Michael Shtelma
Hi Jacek & Gengliang,

let's take a look at the following query:

val pos = spark.read.parquet(prefix + "POSITION.parquet")
pos.createOrReplaceTempView("POSITION")
spark.sql("SELECT  POSITION.POSITION_ID  FROM POSITION POSITION JOIN
POSITION POSITION1 ON POSITION.POSITION_ID0 = POSITION1.POSITION_ID
").collect()

This query is working for me right now using spark 2.2.

Now we can try implementing the same logic with DataFrame API:

pos.join(pos, pos("POSITION_ID0")===pos("POSITION_ID")).collect()

I am getting the following error:

"Join condition is missing or trivial.

Use the CROSS JOIN syntax to allow cartesian products between these relations.;"

I have tried using alias function, but without success:

val pos2 = pos.alias("P2")
pos.join(pos2, pos("POSITION_ID0")===pos2("POSITION_ID")).collect()

This also leads us to the same error.
Am I missing something about the usage of alias?

Now let's rename the columns:

val pos3 = pos.toDF(pos.columns.map(_ + "_2"): _*)
pos.join(pos3, pos("POSITION_ID0")===pos3("POSITION_ID_2")).collect()

It works!

There is one more really odd thing about all this: a colleague of mine
has managed to get the same exception ("Join condition is missing or
trivial") also using the original SQL query, but I think he was using
empty tables.

Thanks,
Michael


On Mon, Jan 15, 2018 at 11:27 AM, Gengliang Wang
 wrote:
> Hi Michael,
>
> You can use `Explain` to see how your query is optimized.
> https://docs.databricks.com/spark/latest/spark-sql/language-manual/explain.html
> I believe your query is an actual cross join, which is usually very slow in
> execution.
>
> To get rid of this, you can set `spark.sql.crossJoin.enabled` as true.
>
>
> On 15 Jan 2018, at 6:09 PM, Jacek Laskowski  wrote:
>
> Hi Michael,
>
> -dev +user
>
> What's the query? How do you "fool spark"?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>
> On Mon, Jan 15, 2018 at 10:23 AM, Michael Shtelma 
> wrote:
>>
>> Hi all,
>>
>> If I try joining the table with itself using join columns, I am
>> getting the following error:
>> "Join condition is missing or trivial. Use the CROSS JOIN syntax to
>> allow cartesian products between these relations.;"
>>
>> This is not true, and my join is not trivial and is not a real cross
>> join. I am providing join condition and expect to get maybe a couple
>> of joined rows for each row in the original table.
>>
>> There is a workaround for this, which implies renaming all the columns
>> in source data frame and only afterwards proceed with the join. This
>> allows us to fool spark.
>>
>> Now I am wondering if there is a way to get rid of this problem in a
>> better way? I do not like the idea of renaming the columns because
>> this makes it really difficult to keep track of the names in the
>> columns in result data frames.
>> Is it possible to deactivate this check?
>>
>> Thanks,
>> Michael
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[Spark DataFrame]: Passing DataFrame to custom method results in NullPointerException

2018-01-15 Thread abdul.h.hussain
Hi,

My Spark app is mapping lines from a text file to case classes stored within an 
RDD.

When I run the following code on this rdd:
.collect.map(line => if(validate_hostname(line, data_frame)) 
line).foreach(println)

It correctly calls the method validate_hostname, passing the case class and 
another data_frame defined within the main method. Unfortunately the above map 
only returns a TraversableLike collection, so I can't do transformations and 
joins on this data structure, so I tried to apply a filter on the RDD with the 
following code:
.filter(line => validate_hostname(line, data_frame)).count()

Unfortunately, filtering the RDD this way does not pass the data_frame, so I 
get a NullPointerException, though the case class is correctly passed (I print 
it within the method).

Where am I going wrong?


Regards,
Abdul Haseeb Hussain


End of Stream errors in shuffle

2018-01-15 Thread Fernando Pereira
Hi,

I'm facing a very strange error that occurs halfway through long-running Spark
SQL jobs:

18/01/12 22:14:30 ERROR Utils: Aborting task
java.io.EOFException: reached end of stream after reading 0 bytes; 96 bytes
expected
at org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:735)
at
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:127)
at
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
at
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
(...)

Since I get this in several jobs, I wonder if it might be a problem at the
comm layer.
Did anyone face a similar problem?

It always happens in a job which shuffles 200GB and then reads it back in
partitions of ~64MB for a groupBy. And it is weird that it only fails after
it has processed over 1000 partitions (16 cores on one node).

I even tried changing the spark.shuffle.file.buffer config, but it just
seems to change the point at which it occurs.

Really would appreciate some hints - what it could be, what to try, test,
how to debug - as I feel pretty much blocked here.

Thanks in advance
Fernando


Re: Inner join with the table itself

2018-01-15 Thread Gengliang Wang
Hi Michael,

You can use `Explain` to see how your query is optimized. 
https://docs.databricks.com/spark/latest/spark-sql/language-manual/explain.html 

I believe your query is an actual cross join, which is usually very slow in 
execution.

To get rid of this, you can set `spark.sql.crossJoin.enabled` to true.
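
For reference, it can be set at runtime on an existing SparkSession (`spark` is the usual session variable):

spark.conf.set("spark.sql.crossJoin.enabled", "true")
// or equivalently in SQL: SET spark.sql.crossJoin.enabled=true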


> On 15 Jan 2018, at 6:09 PM, Jacek Laskowski  wrote:
> 
> Hi Michael,
> 
> -dev +user
> 
> What's the query? How do you "fool spark"?
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski 
> Mastering Spark SQL https://bit.ly/mastering-spark-sql 
> 
> Spark Structured Streaming https://bit.ly/spark-structured-streaming 
> 
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams 
> 
> Follow me at https://twitter.com/jaceklaskowski
>  
> On Mon, Jan 15, 2018 at 10:23 AM, Michael Shtelma  > wrote:
> Hi all,
> 
> If I try joining the table with itself using join columns, I am
> getting the following error:
> "Join condition is missing or trivial. Use the CROSS JOIN syntax to
> allow cartesian products between these relations.;"
> 
> This is not true, and my join is not trivial and is not a real cross
> join. I am providing join condition and expect to get maybe a couple
> of joined rows for each row in the original table.
> 
> There is a workaround for this, which implies renaming all the columns
> in source data frame and only afterwards proceed with the join. This
> allows us to fool spark.
> 
> Now I am wondering if there is a way to get rid of this problem in a
> better way? I do not like the idea of renaming the columns because
> this makes it really difficult to keep track of the names in the
> columns in result data frames.
> Is it possible to deactivate this check?
> 
> Thanks,
> Michael
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> 
> 
> 



Re: Inner join with the table itself

2018-01-15 Thread Jacek Laskowski
Hi Michael,

-dev +user

What's the query? How do you "fool spark"?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, Jan 15, 2018 at 10:23 AM, Michael Shtelma 
wrote:

> Hi all,
>
> If I try joining the table with itself using join columns, I am
> getting the following error:
> "Join condition is missing or trivial. Use the CROSS JOIN syntax to
> allow cartesian products between these relations.;"
>
> This is not true, and my join is not trivial and is not a real cross
> join. I am providing join condition and expect to get maybe a couple
> of joined rows for each row in the original table.
>
> There is a workaround for this, which implies renaming all the columns
> in source data frame and only afterwards proceed with the join. This
> allows us to fool spark.
>
> Now I am wondering if there is a way to get rid of this problem in a
> better way? I do not like the idea of renaming the columns because
> this makes it really difficult to keep track of the names in the
> columns in result data frames.
> Is it possible to deactivate this check?
>
> Thanks,
> Michael
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>